--- title: "Per-Item Lambda (Experimental)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Per-Item Lambda (Experimental)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4 ) ``` > **Experimental.** Per-item lambda uses a frozen expected-count approximation, > not the full marginal-MML objective. The IRT marginal likelihood integrates the > *joint* response pattern and does not decompose item-wise, so a theoretically > principled per-item marginal objective does not yet exist in this package. > Results from per-item tuning should be treated as approximate and validated > against the scalar-lambda MML baseline. See the documentation of > `tune_lambda_ability_risk_item()` for details. ## Why per-item lambda? A single global $\lambda$ is fragile: if even one item has poorly-correlated LLM predictions, the ability-risk criterion may force the global optimum to $\lambda = 0$, preventing all items from benefiting. Per-item $\lambda_j$ allows each item to draw on the LLM data at its own optimal level. Consider a test with eight items where: - Items 1–4 are straightforward factual questions — the LLM predicts these well. - Items 5–8 require nuanced reasoning — the LLM is nearly random on these. With a scalar $\lambda$, the four poor items push the optimum toward 0. With per-item $\lambda_j$, items 1–4 get $\lambda_j \approx 0.5$–0.7 while items 5–8 get $\lambda_j \approx 0$. ## Simulate a heterogeneous test ```{r simulate} library(mixedsubjectsirt) library(ggplot2) set.seed(2026) n_human <- 400 n_generated <- 1200 n_items <- 8 n_good <- 4 # items where LLM predicts well true_pars <- data.frame( item = paste0("Item", seq_len(n_items)), a = seq(0.8, 1.6, length.out = n_items), d = seq(-1.1, 1.1, length.out = n_items) ) true_pars$b <- -true_pars$d / true_pars$a theta_human <- rnorm(n_human) observed <- simulate_2pl(theta_human, true_pars) # LLM: good for items 1–4 (same DGP), poor for 5–8 (random noise) llm_pars_good <- true_pars llm_pars_poor <- true_pars llm_pars_poor$a <- pmax(0.05, rnorm(n_items, 0, 0.1)) # near-random llm_pars_poor$d <- rnorm(n_items, 0, 2) llm_pars_poor$b <- -llm_pars_poor$d / llm_pars_poor$a # Build predicted (same subjects as human) predicted <- observed # F = Y for first 4 items predicted[, 5:8] <- simulate_2pl(theta_human, llm_pars_poor)[, 5:8] # Build generated generated_good <- simulate_2pl(rnorm(n_generated), true_pars) generated_poor <- simulate_2pl(rnorm(n_generated), llm_pars_poor) generated <- cbind(generated_good[, 1:4], generated_poor[, 5:8]) colnames(generated) <- true_pars$item ``` The first four items have perfect paired predictions (F = Y); items 5–8 have near-random LLM predictions. ## Step 1: Fit 2PL baseline and get global scalar lambda ```{r baseline} human_pars <- fit_2pl(observed, technical = list(NCYCLES = 500))$pars global_tuned <- tune_lambda_ability_risk( lambda_grid = seq(0, 1, by = 0.1), observed = observed, predicted = predicted, generated = generated, initial_pars = human_pars, fit_fn = fit_mixed_subjects_mml, n_quad = 11, control = list(maxit = 200) ) cat("Global scalar best lambda:", global_tuned$best_lambda, "\n") ``` The global scalar is forced to a compromise — the four poor items constrain it to a value smaller than what items 1–4 could support. ## Step 2: PPI++ score per item (fast diagnostic) `tune_lambda_ppi_score_item()` applies the Proposition 2 formula independently per item using the 2×2 diagonal block of $H^{-1}$ and the item-level sub-vectors of the score matrices. This is fast (no fitting required) and shows which items are well-predicted. ```{r ppi-item} ppi_item <- tune_lambda_ppi_score_item( observed = observed, predicted = predicted, item_pars = human_pars, n_generated = n_generated, n_quad = 11 ) cat("Per-item PPI++ lambda:\n") print(data.frame(item = ppi_item$item, lambda = round(ppi_item$lambda, 3))) cat("N/(n+N) upper bound:", round(n_generated / (n_human + n_generated), 3), "\n") ``` Items 1–4 (F = Y) should show $\lambda_j \approx N/(n+N) = 0.75$; items 5–8 (random LLM) should show $\lambda_j \approx 0$. ## Step 3: Per-item ability-risk tuning `tune_lambda_ability_risk_item()` uses coordinate descent: for each item $j$, it finds the $\lambda_j$ that minimizes ability-score risk while holding all other $\lambda_{j'}$ fixed. By default each coordinate is solved by **direct 1-D optimization** (continuous $\lambda_j$); pass `method = "grid"` to restrict it to the values in `lambda_grid` (which otherwise only bounds the search). Each evaluation fits with the **frozen expected-count Q-function** (not the full marginal-MML objective) because the IRT marginal likelihood does not decompose item-wise. Starting from the global scalar optimum (not from all-zeros) is essential — see the note below. ```{r item-risk, cache = FALSE} item_tuned <- tune_lambda_ability_risk_item( lambda_grid = seq(0, 1, by = 0.25), observed = observed, predicted = predicted, generated = generated, initial_pars = human_pars, init_lambda = global_tuned$best_lambda, # start from global best n_quad = 11, n_pass = 1, control = list(maxit = 200) ) cat("Per-item ability-risk lambda:\n") print(data.frame(item = item_tuned$item, lambda = round(item_tuned$lambda, 3))) ``` Items 1–4 should receive positive $\lambda_j$ (good predictor); items 5–8 should be near zero (poor predictor). ## Step 4: Compare scalar vs. per-item parameter recovery ```{r compare} fit_scalar <- global_tuned$best_fit fit_per_item <- item_tuned$final_fit rmse <- function(x, y) sqrt(mean((x - y)^2)) comparison <- data.frame( item = true_pars$item, true_a = round(true_pars$a, 3), human_a = round(human_pars$a, 3), scalar_a = round(fit_scalar$item_pars$a, 3), item_a = if (is.null(fit_per_item)) NA_real_ else round(fit_per_item$item_pars$a, 3) ) knitr::kable(comparison, row.names = FALSE, caption = "Discrimination recovery: scalar lambda vs. per-item lambda") cat("RMSE(a) human-only: ", round(rmse(human_pars$a, true_pars$a), 4), "\n") cat("RMSE(a) scalar MML: ", round(rmse(fit_scalar$item_pars$a, true_pars$a), 4), "\n") if (!is.null(fit_per_item)) { cat("RMSE(a) per-item MML: ", round(rmse(fit_per_item$item_pars$a, true_pars$a), 4), "\n") } ``` ## Important note on initialization Starting coordinate descent from all-zeros is not recommended. When all other items are at $\lambda_j = 0$, each single-item improvement is diluted across the full ability-risk criterion, making improvements hard to detect. The recommended workflow is: 1. Fit the scalar global optimum with `tune_lambda_ability_risk(..., fit_fn = fit_mixed_subjects_mml)`. 2. Pass that global value as `init_lambda` to `tune_lambda_ability_risk_item()`. 3. The coordinate descent then identifies which items should deviate from the global default — raising $\lambda_j$ for well-predicted items and lowering it for poorly-predicted ones. ## Approximation caveat The per-item lambda coordinate descent uses the **frozen Q-function** (not the full marginal-MML objective) for each candidate evaluation. This is necessary because the IRT marginal likelihood integrates the joint response pattern and does not decompose item-wise. The approximation is good when `initial_pars` is close to the converged parameters. For final reporting, always: - Compare the per-item fit against the scalar MML baseline. - Report per-item results as approximate / experimental. - Use `vcov(item_tuned$final_fit)` for uncertainty (which applies `vcov_mixed_subjects` with the vector-lambda bread and meat scaling).