Paper Example 1: Linked-Test Design with 1PL Estimation

Purpose

This vignette reproduces Example 1 from Schroeders and Gnambs (2025), “Sample Size Planning for Item Response Models: A Tutorial for the Quantitative Researcher”, following its companion R code at https://ulrich-schroeders.github.io/IRT-sample-size/example_1.html. The goal is to let irtsim users compare the package’s Monte Carlo output directly against the published reference.

The paper’s Example 1 asks a classic question: how many examinees are needed to recover item difficulty parameters in a linked, two-form achievement test fit with a Rasch (1PL) model?

Design, from the paper

Decision	Paper value	irtsim mapping
Estimation model	1PL (Rasch)	`estimation_model = "1PL"`
Number of items	30	`n_items = 30`
Item discriminations (generation)	`rnorm(30, 1, 0.1)`	`item_params$a`
Item difficulties	`seq(-2, 2, length.out = 30)`	`item_params$b`
Two forms, common block items 13–18	2 × 30 linking matrix	`missing = "linking"`
Monte Carlo iterations	438	`iterations = 438`
Sample sizes	`seq(100, 600, 50)`	`sample_sizes`
Performance criterion	MSE, threshold 0.05	`summary(res)$item_summary$mse`

Note on the data-generating model: the paper uses a near-constant discrimination (mean 1, sd 0.1) and fits a Rasch model. irtsim’s 1PL generation fixes a = 1 exactly, so to match the paper we generate under a 2PL with a ~ rnorm(30, 1, 0.1) and set estimation_model = "1PL". The estimation model is the one the paper targets; the generation model is a faithful implementation of the paper’s a draws.
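
To make the generation step concrete, here is a minimal base-R sketch of drawing dichotomous responses under a 2PL with near-constant discriminations. It is illustrative only and is not irtsim's internal code: the a(θ − b) parameterization, the sample size of 200, and the object names (`theta`, `prob`, `resp`) are assumptions made for this example.

```r
# Illustrative 2PL data generation in base R (not irtsim internals).
set.seed(1)
n_persons <- 200L   # assumed sample size for the sketch
n_items   <- 30L

a     <- rnorm(n_items, mean = 1, sd = 0.1)   # near-constant discriminations
b     <- seq(-2, 2, length.out = n_items)     # evenly spaced difficulties
theta <- rnorm(n_persons)                     # standard-normal abilities

# Linear predictor a_j * (theta_i - b_j); rows = persons, columns = items
eta  <- sweep(outer(theta, b, "-"), 2, a, "*")
prob <- plogis(eta)                           # P(X_ij = 1)

# Draw one Bernoulli response per person-item pair
resp <- matrix(rbinom(length(prob), size = 1L, prob = prob),
               nrow = n_persons, ncol = n_items)

dim(resp)
range(a)
```

Because the discriminations cluster tightly around 1, fitting a Rasch model to data like `resp` is only a mild misspecification, which is the situation the paper's design targets.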

Reproducing the study

The code below mirrors the paper. It is shown for reference; the actual simulation is precomputed and cached in inst/extdata/vignette_ex1_paper.rds to keep vignette build time low.

library(irtsim)

set.seed(2024)
n_items <- 30L

# Item parameters exactly as in the paper
a_vals <- rnorm(n_items, mean = 1, sd = 0.1)
b_vals <- seq(-2, 2, length.out = n_items)

# Linking matrix: form 1 = odd items + common block 13-18,
#                 form 2 = even items + common block 13-18
linking_matrix <- matrix(0L, nrow = 2L, ncol = n_items)
linking_matrix[1L, sort(unique(c(seq(1L, n_items, 2L), 13:18)))] <- 1L
linking_matrix[2L, sort(unique(c(seq(2L, n_items, 2L), 13:18)))] <- 1L

design <- irt_design(
  model       = "2PL",
  n_items     = n_items,
  item_params = list(a = a_vals, b = b_vals),
  theta_dist  = "normal"
)

study <- irt_study(
  design,
  sample_sizes     = seq(100L, 600L, by = 50L),
  missing          = "linking",
  test_design      = list(linking_matrix = linking_matrix),
  estimation_model = "1PL"
)

res <- irt_simulate(
  study,
  iterations = 438L,
  seed       = 2024L,
  parallel   = TRUE
)
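
Before running a long simulation it can help to confirm that the linking matrix encodes the intended design. The sketch below rebuilds the matrix in base R and checks form lengths and the common block; the helper names (`common`, `items_per_form`, `shared_items`) are introduced here for illustration.

```r
# Rebuild the 2 x 30 linking matrix and sanity-check it (base R only).
n_items <- 30L
common  <- 13:18                               # anchor block shared by both forms

linking_matrix <- matrix(0L, nrow = 2L, ncol = n_items)
linking_matrix[1L, union(seq(1L, n_items, by = 2L), common)] <- 1L  # odd + anchors
linking_matrix[2L, union(seq(2L, n_items, by = 2L), common)] <- 1L  # even + anchors

items_per_form <- rowSums(linking_matrix)                 # items administered per form
shared_items   <- which(colSums(linking_matrix) == 2L)    # items on both forms

stopifnot(
  all(colSums(linking_matrix) >= 1L),   # every item appears on at least one form
  identical(shared_items, common)       # only items 13-18 link the two forms
)

items_per_form
shared_items
```

Each form carries its 15 odd or even items plus the three anchor items of the opposite parity, so the two forms overlap only on the six-item common block.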

Summary: MSE by sample size

We summarize the recovered item-difficulty MSE and pair it with its Monte Carlo standard error (MCSE), following Morris et al. (2019).

s <- summary(res, criterion = c("mse", "mcse_mse"), param = "b")
head(s$item_summary, 10)
#>    sample_size item param true_value       mse    mcse_mse n_converged
#> 1          100    1     b -2.0000000 0.2503685 0.018046678         438
#> 2          100    2     b -1.8620690 0.2138392 0.017983836         438
#> 3          100    3     b -1.7241379 0.1597980 0.011247343         438
#> 4          100    4     b -1.5862069 0.1667535 0.012496766         438
#> 5          100    5     b -1.4482759 0.1862439 0.018703552         438
#> 6          100    6     b -1.3103448 0.1711703 0.012482780         438
#> 7          100    7     b -1.1724138 0.1426810 0.011621785         438
#> 8          100    8     b -1.0344828 0.1378920 0.008631983         438
#> 9          100    9     b -0.8965517 0.1351424 0.009190359         438
#> 10         100   10     b -0.7586207 0.1482162 0.010252653         438
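
For readers who want to see what the two columns measure, the following sketch computes an MSE and its Monte Carlo standard error using the formulas given in Morris et al. (2019). The estimates in `b_hat` are simulated stand-ins, not irtsim output, and whether irtsim computes `mcse_mse` in exactly this way is an assumption.

```r
# MSE and its Monte Carlo SE for one item, per Morris et al. (2019).
set.seed(42)
n_sim  <- 438L
true_b <- -2
b_hat  <- rnorm(n_sim, mean = true_b, sd = 0.5)   # stand-in difficulty estimates

sq_err <- (b_hat - true_b)^2
mse    <- mean(sq_err)

# MCSE(MSE) = sqrt( sum((sq_err - mse)^2) / (n_sim * (n_sim - 1)) ),
# which is the standard error of the mean of the squared errors.
mcse_mse <- sqrt(sum((sq_err - mse)^2) / (n_sim * (n_sim - 1)))

c(mse = mse, mcse_mse = mcse_mse)
```

The MCSE shrinks with the square root of the number of iterations, so halving it requires roughly four times as many replications.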

The paper plots the MSE trajectory for two representative items: item 1 (difficulty ≈ −2, extreme) and item 15 (difficulty ≈ 0, central). We do the same.

library(ggplot2)  # needed for ggplot(), aes(), and the geoms below

item_df <- s$item_summary
focal <- subset(item_df, item %in% c(1L, 15L))
focal$item_label <- factor(
  focal$item,
  levels = c(1L, 15L),
  labels = c("Item 1 (b \u2248 -2)", "Item 15 (b \u2248 0)")
)

ggplot(focal, aes(x = sample_size, y = mse, colour = item_label)) +
  geom_hline(yintercept = 0.05, linetype = "dashed", colour = "grey40") +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  geom_errorbar(
    aes(
      ymin = pmax(mse - 1.96 * mcse_mse, 0),
      ymax = mse + 1.96 * mcse_mse
    ),
    width = 15
  ) +
  scale_x_continuous(breaks = seq(100, 600, 100)) +
  labs(
    title    = "Example 1: MSE of b-parameter vs. sample size",
    subtitle = "Dashed line = paper's 0.05 MSE threshold",
    x        = "Sample size (N)",
    y        = "MSE(b)",
    colour   = NULL
  ) +
  theme_minimal(base_size = 12)

Recommended N

Using irtsim’s built-in recommended_n() helper, we can extract the smallest N that meets the paper’s MSE ≤ 0.05 threshold for each item. The paper reports sample-size requirements in the same sense.

sim_summary <- summary(res, criterion = "mse", param = "b")
rec <- recommended_n(sim_summary, criterion = "mse", threshold = 0.05, param = "b")
head(rec, 10)
#>    item param recommended_n criterion threshold
#> 1     1     b            NA       mse      0.05
#> 2     2     b           450       mse      0.05
#> 3     3     b           300       mse      0.05
#> 4     4     b           350       mse      0.05
#> 5     5     b           450       mse      0.05
#> 6     6     b           400       mse      0.05
#> 7     7     b           300       mse      0.05
#> 8     8     b           250       mse      0.05
#> 9     9     b           300       mse      0.05
#> 10   10     b           300       mse      0.05
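
The selection rule can be sketched in a few lines of base R. The helper `first_n_meeting()` and the toy MSE values are hypothetical, and the sketch assumes the rule is "first tested N at or below the threshold"; whether recommended_n() additionally requires the criterion to hold at every larger N is not stated here.

```r
# Hypothetical helper: smallest tested N whose MSE is at or below the threshold.
first_n_meeting <- function(n, mse, threshold = 0.05) {
  ok <- n[!is.na(mse) & mse <= threshold]
  if (length(ok) == 0L) NA_integer_ else min(ok)
}

# Toy summary table (made-up values, for illustration only)
toy <- data.frame(
  item        = rep(c(1L, 15L), each = 4L),
  sample_size = rep(c(100L, 200L, 300L, 400L), times = 2L),
  mse         = c(0.25, 0.12, 0.08, 0.06,   # item 1 never reaches 0.05
                  0.11, 0.05, 0.03, 0.02)   # item 15 reaches it at N = 200
)

rec_toy <- sapply(split(toy, toy$item),
                  function(d) first_n_meeting(d$sample_size, d$mse))
rec_toy
```

In the toy data the extreme item returns NA because no tested N meets the threshold, mirroring the NA reported for item 1 above.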

Comparison notes — paper vs. irtsim

Should reproduce (within MC noise):

The overall shape of the MSE trajectories, with central items (e.g., item 15) showing the tightest curves.
Extreme items (item 1, item 30) may not reach the 0.05 threshold within the tested sample-size range (N ≤ 600). An NA in the recommended_n column means the criterion was never met; this is informative, not an error.
Required N to meet MSE ≤ 0.05 for the central block (items 13–18) is substantially smaller than for the tails.

Expected small numerical differences:

Form assignment. The paper assigns forms at random per examinee (sample(c(1, 2), n, replace = TRUE)), so per-form counts vary from replication to replication. irtsim’s apply_missing_structured() uses deterministic round-robin assignment, which keeps the two forms within one examinee of each other. At N ≥ 100 the resulting difference in per-form sample sizes is small and does not materially shift the MSE trajectories.
RNG dispatch. irtsim runs in parallel mode with future.seed = TRUE, which uses L’Ecuyer-CMRG substreams. The paper uses the session default Mersenne-Twister. Both are valid Monte Carlo streams; the specific numbers differ. Only trajectory shape is expected to match, not bit-for-bit numerical equality.
Iterations. We use the paper’s 438 iterations as-is. That value reflects a precision target in the sense of Burton et al. (2006), which irtsim exposes via irt_iterations(); users can recompute it if they want a different MC precision.
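
The form-assignment difference can be illustrated with a short base-R sketch. It contrasts the paper's random draw with a simple alternating scheme; treating round-robin as strict alternation by examinee index is an assumption about apply_missing_structured(), and the names `random_forms`, `round_robin_forms`, and `gap_rr` are introduced here for illustration.

```r
# Compare per-form counts under random vs. alternating form assignment.
set.seed(2024)
n <- 101L   # an odd N makes the round-robin gap visible

random_forms      <- sample(1:2, n, replace = TRUE)   # paper-style random assignment
round_robin_forms <- rep_len(1:2, n)                  # 1, 2, 1, 2, ... (assumed scheme)

counts_random <- tabulate(random_forms, nbins = 2L)
counts_rr     <- tabulate(round_robin_forms, nbins = 2L)

gap_rr <- abs(diff(counts_rr))   # at most 1 under strict alternation

counts_random
counts_rr
gap_rr
```

Under alternation the two forms differ by at most one examinee, while the random scheme's split follows a Binomial(n, 0.5) distribution and so fluctuates across replications.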

References

Burton, A., Altman, D. G., Royston, P., & Holder, R. L. (2006). The design of simulation studies in medical statistics. Statistics in Medicine, 25, 4279–4292. https://doi.org/10.1002/sim.2673

Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38, 2074–2102. https://doi.org/10.1002/sim.8086

Schroeders, U., & Gnambs, T. (2025). Sample size planning for item response models: A tutorial for the quantitative researcher. Companion code: https://ulrich-schroeders.github.io/IRT-sample-size/.