When developing a new questionnaire, scale, or test, researchers typically ask a panel of subject-matter experts to rate each candidate item for relevance to the construct being measured. The expert ratings are then summarized into content validity indices that quantify how well the items represent the intended construct.
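The contentValidity package implements the standard set
of content validity indices used in nursing, education, psychology, and
health sciences research: the item-level content validity index (I-CVI),
modified kappa, Aiken’s V, the scale-level indices S-CVI/Ave and S-CVI/UA,
Lawshe’s content validity ratio (CVR), and Gwet’s AC1 and AC2.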
The package ships with cvi_example, a simulated set of
expert ratings for a 10-item depression screening instrument, with 6
expert raters using a 4-point relevance scale (1 = not relevant, 4 =
highly relevant).
The simplest place to start is icvi(), which gives the
proportion of experts rating each item as 3 or 4:
icvi(cvi_example)
#> item1 item2 item3 item4 item5 item6 item7 item8
#> 1.0000000 1.0000000 1.0000000 0.8333333 0.6666667 1.0000000 0.8333333 1.0000000
#> item9 item10
#> 0.5000000 1.0000000
Following Polit and Beck (2006), an I-CVI of at least 0.78 is considered excellent for panels of six or more experts. Items 5 and 9 in our example (0.67 and 0.50) would be flagged for revision.
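To make the definition concrete, the same proportion can be computed by hand in base R. This is a minimal sketch on a small made-up matrix; `icvi_by_hand()` and `ratings` are hypothetical names used only for illustration and are not part of the package.

```r
# Hypothetical ratings: 6 experts (rows) x 3 items (columns), 4-point scale
ratings <- matrix(
  c(4, 4, 3, 4, 3, 4,   # itemA: all six experts rate 3 or 4
    4, 3, 2, 4, 3, 4,   # itemB: five of six rate 3 or 4
    2, 3, 1, 4, 2, 3),  # itemC: three of six rate 3 or 4
  nrow = 6,
  dimnames = list(NULL, c("itemA", "itemB", "itemC"))
)

# I-CVI: share of experts whose rating is at or above the relevance cutoff
icvi_by_hand <- function(x, cutoff = 3) colMeans(x >= cutoff)

icvi_by_hand(ratings)
#>     itemA     itemB     itemC
#> 1.0000000 0.8333333 0.5000000
```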
Plain I-CVI doesn’t correct for chance agreement. With small panels, a high I-CVI can be partly luck. Modified kappa addresses this:
mod_kappa(cvi_example)
#> item1 item2 item3 item4 item5 item6 item7 item8
#> 1.0000000 1.0000000 1.0000000 0.8160920 0.5646259 1.0000000 0.8160920 1.0000000
#> item9 item10
#> 0.2727273 1.0000000
Notice that item 9 drops sharply (0.50 → 0.27): with only six raters, much of its I-CVI can be attributed to chance agreement.
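The adjustment can be checked by hand. The sketch below assumes the Polit, Beck, and Owen (2007) formulation, in which the chance probability that exactly A of N experts rate an item relevant is taken from a binomial with p = 0.5; `mod_kappa_by_hand()` is a hypothetical helper, not the package's code. With six experts it reproduces the values printed above.

```r
# Modified kappa for one item: n_experts raters, n_agree of whom rate it 3 or 4
mod_kappa_by_hand <- function(n_agree, n_experts) {
  icvi <- n_agree / n_experts
  # Probability that exactly n_agree of n_experts say "relevant" by chance (p = 0.5)
  p_chance <- choose(n_experts, n_agree) * 0.5^n_experts
  (icvi - p_chance) / (1 - p_chance)
}

# 6, 5, 4, and 3 of six experts rating the item relevant
round(mod_kappa_by_hand(c(6, 5, 4, 3), n_experts = 6), 4)
#> [1] 1.0000 0.8161 0.5646 0.2727
```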
Aiken’s V uses the full rating scale rather than dichotomizing relevant/not-relevant. A “4” contributes more than a “3”:
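A small hand calculation shows the difference. This sketch assumes the usual form of Aiken's V for a scale running from `lo` to `hi`, namely the sum of (rating - lo) divided by n × (hi - lo); `aiken_v_by_hand()` and the two made-up items are hypothetical and only illustrate the idea.

```r
# Aiken's V: average distance above the lowest category, rescaled to [0, 1]
aiken_v_by_hand <- function(x, lo = 1, hi = 4) {
  colSums(x - lo) / (nrow(x) * (hi - lo))
}

# Two hypothetical items rated by six experts; both have I-CVI = 1
same_icvi <- cbind(all_fours = rep(4, 6), all_threes = rep(3, 6))

aiken_v_by_hand(same_icvi)
#>  all_fours all_threes
#>  1.0000000  0.6666667
```

Both items would receive the same I-CVI, but V separates unanimous "4" ratings from unanimous "3" ratings.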
Two scale-level indices summarize content validity across all items:
scvi_ave(cvi_example) # average of I-CVIs
#> [1] 0.8833333
scvi_ua(cvi_example) # proportion of items with universal agreement
#> [1] 0.6
Polit and Beck (2006) recommend reporting both. S-CVI/Ave ≥ 0.90 indicates excellent overall content validity; S-CVI/UA gives a stricter view of how many items achieved unanimous endorsement.
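Both numbers follow directly from the item-level I-CVIs shown earlier. The sketch below re-enters those ten values as fractions of six experts; the object names are hypothetical.

```r
# I-CVIs for the ten example items, written as fractions of six experts
item_icvi <- c(1, 1, 1, 5/6, 4/6, 1, 5/6, 1, 3/6, 1)

# S-CVI/Ave: mean of the item-level I-CVIs
scvi_ave_by_hand <- mean(item_icvi)

# S-CVI/UA: share of items that every expert rated as relevant (I-CVI of 1)
scvi_ua_by_hand <- mean(item_icvi == 1)

round(c(scvi_ave = scvi_ave_by_hand, scvi_ua = scvi_ua_by_hand), 4)
#> scvi_ave  scvi_ua
#>   0.8833   0.6000
```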
content_validity() is the workhorse function for routine
analysis. It returns the complete set of item-level and scale-level
indices in one tidy structure:
result <- content_validity(cvi_example)
result
#> Content Validity Analysis
#> -------------------------
#> Experts: 6
#> Items: 10
#>
#> Item-level indices:
#> item icvi mod_kappa aiken_v gwet_ac1 gwet_ac2
#> item1 1.0000 1.0000 1.0000 1.0000 1.0000
#> item2 1.0000 1.0000 0.8889 1.0000 0.8964
#> item3 1.0000 1.0000 0.7778 1.0000 0.8964
#> item4 0.8333 0.8161 0.6667 0.5385 0.8286
#> item5 0.6667 0.5646 0.6111 0.0400 0.6940
#> item6 1.0000 1.0000 0.9444 1.0000 0.9494
#> item7 0.8333 0.8161 0.6667 0.5385 0.8286
#> item8 1.0000 1.0000 0.9444 1.0000 0.9494
#> item9 0.5000 0.2727 0.5000 -0.2000 0.8714
#> item10 1.0000 1.0000 0.9444 1.0000 0.9494
#>
#> Scale-level indices (overall):
#> scvi_ave scvi_ua mean_kappa mean_ac1 mean_ac2
#> 0.8833 0.6000 0.8470 0.6917 0.8864
The result is an object you can subset, just like a list:
result$items
#> item icvi mod_kappa aiken_v gwet_ac1 gwet_ac2
#> 1 item1 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
#> 2 item2 1.0000000 1.0000000 0.8888889 1.0000000 0.8964029
#> 3 item3 1.0000000 1.0000000 0.7777778 1.0000000 0.8964029
#> 4 item4 0.8333333 0.8160920 0.6666667 0.5384615 0.8285714
#> 5 item5 0.6666667 0.5646259 0.6111111 0.0400000 0.6940000
#> 6 item6 1.0000000 1.0000000 0.9444444 1.0000000 0.9494382
#> 7 item7 0.8333333 0.8160920 0.6666667 0.5384615 0.8285714
#> 8 item8 1.0000000 1.0000000 0.9444444 1.0000000 0.9494382
#> 9 item9 0.5000000 0.2727273 0.5000000 -0.2000000 0.8714286
#> 10 item10 1.0000000 1.0000000 0.9444444 1.0000000 0.9494382
result$scale
#> scvi_ave scvi_ua mean_kappa mean_ac1 mean_ac2
#> 0.8833333 0.6000000 0.8469537 0.6916923 0.8863692
apa_table() formats the result for journal manuscripts:
apa_table(result)
#> Item I-CVI Modified Kappa Kappa Interpretation Aiken's V Gwet's AC1
#> 1 item1 1.00 1.00 Excellent 1.00 1.00
#> 2 item2 1.00 1.00 Excellent 0.89 1.00
#> 3 item3 1.00 1.00 Excellent 0.78 1.00
#> 4 item4 0.83 0.82 Excellent 0.67 0.54
#> 5 item5 0.67 0.56 Fair 0.61 0.04
#> 6 item6 1.00 1.00 Excellent 0.94 1.00
#> 7 item7 0.83 0.82 Excellent 0.67 0.54
#> 8 item8 1.00 1.00 Excellent 0.94 1.00
#> 9 item9 0.50 0.27 Poor 0.50 -0.20
#> 10 item10 1.00 1.00 Excellent 0.94 1.00
#> Gwet's AC2
#> 1 1.00
#> 2 0.90
#> 3 0.90
#> 4 0.83
#> 5 0.69
#> 6 0.95
#> 7 0.83
#> 8 0.95
#> 9 0.87
#> 10 0.95
For R Markdown output (HTML, PDF, Word), use the appropriate format
argument. The function returns a knitr::kable() object that
renders correctly in your document:
| Item | I-CVI | Modified Kappa | Kappa Interpretation | Aiken’s V | Gwet’s AC1 | Gwet’s AC2 |
|---|---|---|---|---|---|---|
| item1 | 1.00 | 1.00 | Excellent | 1.00 | 1.00 | 1.00 |
| item2 | 1.00 | 1.00 | Excellent | 0.89 | 1.00 | 0.90 |
| item3 | 1.00 | 1.00 | Excellent | 0.78 | 1.00 | 0.90 |
| item4 | 0.83 | 0.82 | Excellent | 0.67 | 0.54 | 0.83 |
| item5 | 0.67 | 0.56 | Fair | 0.61 | 0.04 | 0.69 |
| item6 | 1.00 | 1.00 | Excellent | 0.94 | 1.00 | 0.95 |
| item7 | 0.83 | 0.82 | Excellent | 0.67 | 0.54 | 0.83 |
| item8 | 1.00 | 1.00 | Excellent | 0.94 | 1.00 | 0.95 |
| item9 | 0.50 | 0.27 | Poor | 0.50 | -0.20 | 0.87 |
| item10 | 1.00 | 1.00 | Excellent | 0.94 | 1.00 | 0.95 |
Lawshe’s content validity ratio (CVR; Lawshe, 1975) uses a different
rating convention: each expert classifies items as essential, useful but
not essential, or not necessary. Use Lawshe-style coding
(1 = essential, 2 = useful, 3 = not necessary) and call
cvr() directly:
# 10 experts rating 3 items on Lawshe's scale
lawshe_ratings <- matrix(
  c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2,   # 8 of 10 essential
    1, 1, 1, 2, 2, 2, 2, 3, 3, 3,   # 3 of 10 essential
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1),  # 10 of 10 essential
  nrow = 10,
  dimnames = list(NULL, paste0("item", 1:3))
)
cvr(lawshe_ratings)
#> item1 item2 item3
#> 0.6 -0.4 1.0
Compare each item’s CVR to the critical value for the panel size, using the corrected Wilson, Pan, and Schumsky (2012) thresholds:
cvr_critical(n_experts = 10) # one-tailed alpha = 0.05
#> [1] 0.8
cvr_critical(n_experts = 10, alpha = 0.01)
#> [1] 1
In this example, only item 3 (CVR = 1.0) reaches the critical value of 0.8 at α = 0.05. Item 1 (CVR = 0.6) and item 2 (CVR = -0.4) fall below it and would be revised or dropped.
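The CVR values and the two critical values printed above can be reproduced by hand. The sketch uses Lawshe's formula CVR = (n_e - N/2) / (N/2) and, as an assumption, an exact one-tailed binomial test with p = 0.5 for the critical value. That test happens to match the printed 0.8 and 1, but whether cvr_critical() computes its thresholds this way is not confirmed here. `cvr_by_hand()` and `cvr_critical_exact()` are hypothetical helpers.

```r
# Lawshe's CVR from the number of experts who chose "essential"
cvr_by_hand <- function(n_essential, n_experts) {
  (n_essential - n_experts / 2) / (n_experts / 2)
}

cvr_by_hand(c(8, 3, 10), n_experts = 10)
#> [1]  0.6 -0.4  1.0

# Assumed approach: smallest count n with P(X >= n) <= alpha under
# Binomial(n_experts, 0.5), converted to the CVR scale.
# For very small panels no count may satisfy the condition.
cvr_critical_exact <- function(n_experts, alpha = 0.05) {
  n_needed <- 0:n_experts
  tail_prob <- pbinom(n_needed - 1, n_experts, 0.5, lower.tail = FALSE)
  n_crit <- min(n_needed[tail_prob <= alpha])
  cvr_by_hand(n_crit, n_experts)
}

cvr_critical_exact(10)                # 9 of 10 experts needed
#> [1] 0.8
cvr_critical_exact(10, alpha = 0.01)  # all 10 experts needed
#> [1] 1
```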
All six relevance-scale indices and Lawshe’s CVR now accept an
optional ci = TRUE argument that returns bootstrap
confidence intervals alongside the point estimate. The CI is the
percentile bootstrap by default (Efron & Tibshirani, 1993);
ci_method = "bca" requests the bias-corrected accelerated
interval (DiCiccio & Efron, 1996), which is preferable when the
bootstrap distribution is skewed (common for I-CVI near 1.0). The default is
2000 replicates, configurable via n_boot. The resampling
unit is the expert (row), not the item (column), matching the standard
inferential frame for inter-rater reliability analyses (Gwet, 2014).
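The resampling scheme described above can be sketched in base R. This is a minimal percentile-bootstrap illustration that resamples experts (rows) with replacement; it is not the package's implementation, and `boot_icvi_percentile()` and the small `ratings` matrix are hypothetical.

```r
# Percentile bootstrap CI for I-CVI, resampling experts (rows) with replacement
boot_icvi_percentile <- function(x, n_boot = 1000, conf_level = 0.95, cutoff = 3) {
  n_experts <- nrow(x)
  boot_stats <- vapply(
    seq_len(n_boot),
    function(b) {
      rows <- sample.int(n_experts, replace = TRUE)
      colMeans(x[rows, , drop = FALSE] >= cutoff)
    },
    numeric(ncol(x))
  )
  # Force an items x replicates matrix, also when there is a single item
  boot_stats <- matrix(boot_stats, nrow = ncol(x))
  alpha <- 1 - conf_level
  limits <- apply(boot_stats, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2))
  data.frame(
    item = colnames(x),
    icvi = colMeans(x >= cutoff),
    ci_lower = limits[1, ],
    ci_upper = limits[2, ],
    row.names = NULL
  )
}

# Hypothetical ratings: 6 experts x 2 items
ratings <- matrix(
  c(4, 4, 3, 4, 3, 4,   # every expert rates 3 or 4
    2, 3, 1, 4, 2, 3),  # three of six rate 3 or 4
  nrow = 6,
  dimnames = list(NULL, c("itemA", "itemB"))
)

set.seed(1)
boot_icvi_percentile(ratings, n_boot = 500)
```

An item that every expert rates as relevant has a degenerate interval of [1, 1], because every resample of experts yields the same I-CVI. The same pattern appears for items 1, 2, 3, 6, 8, and 10 in the package output.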
icvi(cvi_example, ci = TRUE, n_boot = 1000, seed = 1)
#> item icvi ci_lower ci_upper ci_method conf_level n_boot
#> 1 item1 1.0000000 1.0000000 1.0000000 percentile 0.95 1000
#> 2 item2 1.0000000 1.0000000 1.0000000 percentile 0.95 1000
#> 3 item3 1.0000000 1.0000000 1.0000000 percentile 0.95 1000
#> 4 item4 0.8333333 0.5000000 1.0000000 percentile 0.95 1000
#> 5 item5 0.6666667 0.3333333 1.0000000 percentile 0.95 1000
#> 6 item6 1.0000000 1.0000000 1.0000000 percentile 0.95 1000
#> 7 item7 0.8333333 0.5000000 1.0000000 percentile 0.95 1000
#> 8 item8 1.0000000 1.0000000 1.0000000 percentile 0.95 1000
#> 9 item9 0.5000000 0.1666667 0.8333333 percentile 0.95 1000
#> 10 item10 1.0000000 1.0000000 1.0000000 percentile 0.95 1000
Two new chance-corrected agreement coefficients are available:
gwet_ac1() for binary classification (dichotomized at the
relevance threshold) and gwet_ac2() for the full ordinal
scale with a weight matrix. Both use Gwet’s marginal-adjusted
chance-correction, which differs from Polit’s modified kappa (fixed p =
0.5 null) and gives substantively different answers when the prevalence
of “relevant” ratings is far from 0.5 — the common case in
content-validity work.
gwet_ac1(cvi_example)
#> item1 item2 item3 item4 item5 item6 item7
#> 1.0000000 1.0000000 1.0000000 0.5384615 0.0400000 1.0000000 0.5384615
#> item8 item9 item10
#> 1.0000000 -0.2000000 1.0000000
gwet_ac2(cvi_example, categories = 1:4)
#> item1 item2 item3 item4 item5 item6 item7 item8
#> 1.0000000 0.8964029 0.8964029 0.8285714 0.6940000 0.9494382 0.8285714 0.9494382
#> item9 item10
#> 0.8714286 0.9494382
For AC2, always pass the full theoretical rating
scale via categories (e.g., 1:4 for a
standard 4-point relevance scale). If omitted, the function infers
categories from the observed ratings, which can silently collapse the
weight matrix and give incorrect results when extreme categories are
unused.
The implementation matches irrCAC::gwet.ac1.raw() (by
Kilem Gwet, the original author of AC1/AC2) bit-for-bit on the same
inputs.
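For a single item with a dichotomized rating, AC1 can be worked out by hand, which makes the contrast with modified kappa easier to see. The sketch below assumes the standard two-category form of Gwet's coefficient: observed agreement is the share of rater pairs giving the same classification, and chance agreement is 2π(1 - π), where π is the observed proportion of "relevant" ratings. `gwet_ac1_item()` is a hypothetical helper written for illustration; it is not the package's or irrCAC's code.

```r
# Gwet's AC1 for one item rated relevant / not relevant by n_experts raters
gwet_ac1_item <- function(n_relevant, n_experts) {
  n_not <- n_experts - n_relevant
  # Observed agreement: share of rater pairs with the same classification
  p_obs <- (choose(n_relevant, 2) + choose(n_not, 2)) / choose(n_experts, 2)
  # Chance agreement based on the observed prevalence of "relevant"
  prev <- n_relevant / n_experts
  p_chance <- 2 * prev * (1 - prev)
  (p_obs - p_chance) / (1 - p_chance)
}

# 6, 5, 4, and 3 of six experts rating the item relevant
round(gwet_ac1_item(c(6, 5, 4, 3), n_experts = 6), 4)
#> [1]  1.0000  0.5385  0.0400 -0.2000
```

These match the gwet_ac1() values printed above for items with I-CVI of 1.00, 0.83, 0.67, and 0.50.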
cv_sample_size_icvi() answers “how many expert raters do
I need to estimate I-CVI within a given confidence-interval half-width?”
— a question that has been answered only by rule-of-thumb in the
content-validity literature (Lynn, 1986; Polit & Beck, 2006).
# Anticipating I-CVI ≈ 0.85 with target half-width ≤ 0.10
cv_sample_size_icvi(expected = 0.85, half_width = 0.10)
#> [1] 49
# Sensitivity table across plausible expected I-CVI values
sapply(seq(0.70, 0.95, by = 0.05), function(p) {
  cv_sample_size_icvi(expected = p, half_width = 0.10)
})
#> [1] 81 73 62 49 35 19
A useful caveat: for realistic targets the function typically recommends about 20 or more experts (19 to 81 in the table above), well above Lynn’s rule-of-thumb minimum of 6. This is worth flagging in study protocols and grant applications.
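The printed sample sizes are consistent with the familiar normal-approximation (Wald) formula n = z² p (1 - p) / w², rounded up. The sketch below uses that formula; whether cv_sample_size_icvi() computes its result this way internally is an assumption, and `n_for_half_width()` is a hypothetical helper.

```r
# Experts needed so that a Wald interval for a proportion has the target half-width
n_for_half_width <- function(expected, half_width, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling(z^2 * expected * (1 - expected) / half_width^2)
}

n_for_half_width(expected = 0.85, half_width = 0.10)
#> [1] 49

sapply(seq(0.70, 0.95, by = 0.05), n_for_half_width, half_width = 0.10)
#> [1] 81 73 62 49 35 19
```

The required panel shrinks as the expected I-CVI approaches 1 because the variance term p (1 - p) shrinks with it.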
For instruments structured into subscales (e.g., a depression scale
with cognitive, somatic, and behavioral domains),
content_validity() now accepts a subscale
argument that maps items to subscales and computes scale-level indices
per subscale in addition to the overall scale.
# Treat items 1-5 as subscale "Cognitive" and 6-10 as "Somatic"
result_multi <- content_validity(
  cvi_example,
  subscale = c(rep("Cognitive", 5), rep("Somatic", 5))
)
result_multi$subscales
#> subscale n_items scvi_ave scvi_ua mean_kappa mean_ac1 mean_ac2
#> 1 Cognitive 5 0.9000000 0.6 0.8761436 0.7156923 0.8630754
#> 2 Somatic 5 0.8666667 0.6 0.8177638 0.6676923 0.9096629
The items data frame also carries the subscale assignment, which makes it easy to filter or facet downstream analyses.
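The per-subscale summaries are group-wise versions of the overall scale indices. As a rough sketch, the scvi_ave and scvi_ua columns above can be reproduced from the item-level I-CVIs with tapply(); the object names here are hypothetical.

```r
# I-CVIs for the ten example items and their subscale labels
item_icvi <- c(1, 1, 1, 5/6, 4/6, 1, 5/6, 1, 3/6, 1)
subscale <- c(rep("Cognitive", 5), rep("Somatic", 5))

# Group-wise S-CVI/Ave and S-CVI/UA
scvi_ave <- tapply(item_icvi, subscale, mean)
scvi_ua <- tapply(item_icvi == 1, subscale, mean)

by_subscale <- data.frame(
  subscale = names(scvi_ave),
  n_items = as.vector(table(subscale)),
  scvi_ave = as.vector(scvi_ave),
  scvi_ua = as.vector(scvi_ua)
)
by_subscale
```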
plot.content_validity() produces a scatter of I-CVI
against an agreement index (modified kappa by default; choose
gwet_ac1, gwet_ac2, or aiken_v
via y_index). Reference lines mark the adequacy region and
items outside it are highlighted in red and labeled.
By default, items are flagged (“Below I-CVI or AC2 threshold”) if they fail either criterion. This is the conservative “needs any review” default. When the plot is presenting one index specifically, you may prefer to flag only items that fail on that axis:
# Flag only items below the AC2 threshold (ignores I-CVI verdict)
plot(result_multi, y_index = "gwet_ac2", flag_logic = "y_index")
# Flag only items below the I-CVI threshold (ignores AC2 verdict)
plot(result_multi, y_index = "gwet_ac2", flag_logic = "icvi")
The legend always names the criterion that drives the flag, so the plot stays unambiguous about why an item is highlighted.
apa_table() accepts interpretation_index to
choose which agreement index drives the verdict column (“Excellent” /
“Good” / etc.). The interpretation column is positioned immediately
adjacent to its source column to avoid confusion when the table contains
multiple indices.
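A verdict column of this kind is essentially a lookup from a coefficient to a labeled band. The sketch below uses the agreement bands from Altman (1991): up to 0.20 poor, up to 0.40 fair, up to 0.60 moderate, up to 0.80 good, and above 0.80 very good. That the package applies exactly these cutoffs to AC2 is an assumption, and `interpret_altman()` is a hypothetical helper.

```r
# Map agreement coefficients to Altman-style verbal labels (assumed cutoffs)
interpret_altman <- function(x) {
  cut(
    x,
    breaks = c(-Inf, 0.20, 0.40, 0.60, 0.80, Inf),
    labels = c("Poor", "Fair", "Moderate", "Good", "Very good")
  )
}

as.character(interpret_altman(c(1.00, 0.90, 0.83, 0.69)))
#> [1] "Very good" "Very good" "Very good" "Good"
```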
apa_table(result_multi, interpretation_index = "gwet_ac2")
#> Item I-CVI Modified Kappa Aiken's V Gwet's AC1 Gwet's AC2
#> 1 item1 1.00 1.00 1.00 1.00 1.00
#> 2 item2 1.00 1.00 0.89 1.00 0.90
#> 3 item3 1.00 1.00 0.78 1.00 0.90
#> 4 item4 0.83 0.82 0.67 0.54 0.83
#> 5 item5 0.67 0.56 0.61 0.04 0.69
#> 6 item6 1.00 1.00 0.94 1.00 0.95
#> 7 item7 0.83 0.82 0.67 0.54 0.83
#> 8 item8 1.00 1.00 0.94 1.00 0.95
#> 9 item9 0.50 0.27 0.50 -0.20 0.87
#> 10 item10 1.00 1.00 0.94 1.00 0.95
#> AC2 Interpretation
#> 1 Very good
#> 2 Very good
#> 3 Very good
#> 4 Very good
#> 5 Good
#> 6 Very good
#> 7 Very good
#> 8 Very good
#> 9 Very good
#> 10 Very good
If you use contentValidity in published research, please
run:
citation("contentValidity")
to get a current citation block in BibTeX or plain-text form.
Aiken, L. R. (1985). Three coefficients for analyzing the reliability and validity of ratings. Educational and Psychological Measurement, 45(1), 131–142. https://doi.org/10.1177/0013164485451012
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563–575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
Lynn, M. R. (1986). Determination and quantification of content validity. Nursing Research, 35(6), 382–385. https://doi.org/10.1097/00006199-198611000-00017
Polit, D. F., & Beck, C. T. (2006). The content validity index: Are you sure you know what’s being reported? Critique and recommendations. Research in Nursing & Health, 29(5), 489–497. https://doi.org/10.1002/nur.20147
Polit, D. F., Beck, C. T., & Owen, S. V. (2007). Is the CVI an acceptable indicator of content validity? Appraisal and recommendations. Research in Nursing & Health, 30(4), 459–467. https://doi.org/10.1002/nur.20199
Wilson, F. R., Pan, W., & Schumsky, D. A. (2012). Recalculation of the critical values for Lawshe’s content validity ratio. Measurement and Evaluation in Counseling and Development, 45(3), 197–210. https://doi.org/10.1177/0748175612440286
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48. https://doi.org/10.1348/000711006X126600
Gwet, K. L. (2014). Handbook of inter-rater reliability (4th ed.). Advanced Analytics, LLC.
Wongpakaran, N., Wongpakaran, T., Wedding, D., & Gwet, K. L. (2013). A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients. BMC Medical Research Methodology, 13(1), 61. https://doi.org/10.1186/1471-2288-13-61
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman and Hall.
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3), 189–228. https://doi.org/10.1214/ss/1032280214
Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion. Statistics in Medicine, 17(8), 857–872.
Altman, D. G. (1991). Practical statistics for medical research. Chapman and Hall.