Copy Number Variation (CNV) is a genetic phenomenon in which the number of copies of a specific DNA segment varies among individuals. CNVs can take the form of duplications, deletions, or other structural changes in DNA sequences and can extend to long genomic segments. CNVs play an essential role in driving cancer development, and their characteristics are critical for disease diagnosis, prognosis, and treatment (Freeman et al., 2006).
CNVs are broadly classified into germline and somatic types. Germline CNVs contribute to hereditary disorders and inherited predispositions to cancer. Somatic CNVs occur due to aberrant cell division and often relate directly to cancer progression. The study of somatic CNVs can uncover driver genes involved in tumorigenesis or tumour suppression and serve as biomarkers for predicting prognosis or treatment outcomes (Xie & Tammi, 2009).
Prostate cancer is one of the most common malignancies in men worldwide, with marked heterogeneity in progression and treatment outcomes. Indolent (low-Gleason) cancers show minor genomic alterations, while aggressive metastatic tumours demonstrate gross CNVs (Hieronymus et al., 2014). Risk stratification — categorising patients by their likelihood of harbouring aggressive disease — is central to clinical decision-making and personalised treatment planning.
The RiskyCNV package addresses a critical gap in
existing CNV tools. While packages such as CNVizard, CINmetrics, GISTIC,
and CopywriteR offer visualisation, significant CNV identification, and
chromosomal instability quantification, none provide stratified class
analysis, particularly risk-class analysis (Krause et al., 2024; Oza et
al., 2023; Mermel et al., 2011; Kuilman et al., 2015).
RiskyCNV was designed specifically for class-stratified
exploration and analysis of CNV omics data, with the potential to
deliver personalised insights regarding disease course and risk
stratification.
This vignette demonstrates the complete RiskyCNV
workflow applied to prostate adenocarcinoma (PRAD) data from The Cancer
Genome Atlas (TCGA), illustrating how CNV analysis integrated with
clinical grading and RNA expression data can identify biologically
meaningful biomarkers.
The data used in this case study are derived from the TCGA PRAD dataset (https://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/PRAD/20160128/). The dataset comprises:
The grade-wise distribution of TCGA-PRAD samples is as follows:
| Grade | Gleason Score | Number of Samples |
|---|---|---|
| Grade 1 | ≤ 6 | 44 |
| Grade 2 | 3+4=7 | 146 |
| Grade 3 | 4+3=7 | 99 |
| Grade 4 | 8 | 63 |
| Grade 5 | 9-10 | 139 |
| Control | — | 438 |
| Total | 929 |
sample_file <- system.file("extdata", "sample_data.csv",
package = "RiskyCNV")
cnv_file <- system.file("extdata", "cnv_data.txt",
package = "RiskyCNV")
gene_file <- system.file("extdata", "gene_annotation.csv",
package = "RiskyCNV")
annotated_file <- system.file("extdata", "annotated_cnv.csv",
package = "RiskyCNV")
cnv_matrix_file <- system.file("extdata", "cnv_matrix.csv",
package = "RiskyCNV")
rna_file <- system.file("extdata", "rna_data.csv",
package = "RiskyCNV")
# Preview the clinical sample data
head(read.csv(sample_file))
#> sample gleason_score pattern1 pattern2 grade
#> 1 TCGA.HC.7748 6 3 3 grade1
#> 2 TCGA.CH.5743 7 3 4 grade2
#> 3 TCGA.CH.5762 7 4 3 grade3
#> 4 TCGA.EJ.A46D 8 3 5 grade4
#> 5 TCGA.CH.5741 9 4 5 grade5
#> 6 TCGA.V1.A9OF 6 3 3 grade1Each sample contains its total Gleason score, primary histological pattern (pattern1), secondary histological pattern (pattern2), and assigned grade label (Balraj & Muthamilselvan, 2024).
Gleason grading is the cornerstone of prostate cancer pathological assessment. The 2014 ISUP Consensus Conference (Epstein et al., 2016) formalised a five-tier Grade Group system:
The extract_metadata() function implements the ISUP
Grade Group system directly when disease_type = "prostate"
is specified:
grade_groups <- extract_metadata(
file_path = sample_file,
column_name = "gleason_score",
disease_type = "prostate",
output_dir = tempdir()
)
#> Note: 'pattern_col' not supplied. All Gleason 7 samples are assigned to Grade Group 2. To distinguish Grade Group 2 (3+4=7) from Grade Group 3 (4+3=7), supply the primary pattern column name via pattern_col (e.g., pattern_col = 'pattern1').
#> Grade groups saved to: /tmp/RtmpX5HEhM/sample_data_with_grade_groups_20260514_140743.csv
print(names(grade_groups))
#> [1] "Grade Group 1" "Grade Group 2" "Grade Group 4" "Grade Group 5"
print(sapply(grade_groups, length))
#> Grade Group 1 Grade Group 2 Grade Group 4 Grade Group 5
#> 3 4 2 2For datasets where clinical thresholds are unknown,
disease_type = "auto" automatically derives a normalised
Risk Score from the data using min-max normalisation and divides samples
into the requested number of groups. The function assesses distribution
skewness and selects equal-width or quantile-based splitting
accordingly:
grade_groups_auto <- extract_metadata(
file_path = sample_file,
column_name = "gleason_score",
disease_type = "auto",
n_groups = 5,
group_type = "grade",
output_dir = tempdir()
)
#> Risk Score computed. Distribution skewness: 0.325. Using equal-width splitting.
#> Grade groups saved to: /tmp/RtmpX5HEhM/sample_data_with_grade_groups_20260514_140743.csv
print(names(grade_groups_auto))
#> [1] "Grade 1" "Grade 2" "Grade 4" "Grade 5"The Risk Score is computed as: Risk Score = (score − min) / (max − min)
This normalises any numeric scoring system to a 0–1 scale, making the package applicable to any disease with a numeric grading or staging score.
Risk stratification translates Gleason grades into actionable clinical risk categories. The D’Amico classification (D’Amico et al., 1998), the most widely used system in prostate cancer, defines:
risk_groups <- classify_risk(
file_path = sample_file,
column_name = "gleason_score",
disease_type = "prostate",
output_dir = tempdir()
)
#> Risk categories saved to: /tmp/RtmpX5HEhM/sample_data_with_risk_categories_20260514_140743.csv
print(names(risk_groups))
#> [1] "low_risk" "intermediate_risk" "high_risk"
print(sapply(risk_groups, length))
#> low_risk intermediate_risk high_risk
#> 3 4 4The auto mode supports any number of risk groups. This
is useful for diseases that use only two risk levels or four levels:
# Two risk groups
risk_2 <- classify_risk(
file_path = sample_file,
column_name = "gleason_score",
disease_type = "auto",
n_groups = 2,
output_dir = tempdir()
)
#> Risk Score computed. Distribution skewness: 0.325. Using equal-width splitting.
#> Risk categories saved to: /tmp/RtmpX5HEhM/sample_data_with_risk_categories_20260514_140743.csv
print(names(risk_2))
#> [1] "low_risk" "high_risk"
# Four risk groups
risk_4 <- classify_risk(
file_path = sample_file,
column_name = "gleason_score",
disease_type = "auto",
n_groups = 4,
output_dir = tempdir()
)
#> Risk Score computed. Distribution skewness: 0.325. Using equal-width splitting.
#> Risk categories saved to: /tmp/RtmpX5HEhM/sample_data_with_risk_categories_20260514_140743.csv
print(names(risk_4))
#> [1] "very_low_risk" "low_risk" "high_risk" "very_high_risk"Aberration detection identifies genomic segments showing significant gains (oncogene amplifications) or losses (tumour suppressor deletions). Grade-wise analysis of TCGA-PRAD data showed progressive increases in aberration frequency:
This escalation in genomic instability with Gleason grade underscores the biological basis of clinical risk stratification.
aberrations <- aberration(
cnv_data_file = cnv_file,
effect_size = 0.3
)
# Aberrant regions per chromosome
print(sapply(aberrations, nrow))
#> 2
#> 1Segments with segment mean > +0.3 are classified as Gains (Aberration_Code = 1); those with segment mean < −0.3 are classified as Losses (Aberration_Code = 0).
Recurrent CNVs — those appearing consistently across multiple samples within a risk group — are more likely to represent functional driver events rather than stochastic noise. Recurrent gains in oncogenes and recurrent losses in tumour suppressor genes carry particular clinical significance.
recurrent_file <- recurrent(
x = risk_groups,
risk_level = "high_risk",
cnv_data_file = cnv_file,
threshold = 2
)
recurrent_data <- read.csv(recurrent_file)
head(recurrent_data)The output CSV contains recurrent CNV regions for the high-risk group:
| Sample | Chromosome | Start | End | Num_Probes | Segment_Mean |
|---|---|---|---|---|---|
| TCGA.2A.A8VL | 1 | 66506701 | 66515081 | 9 | -1.1952 |
| TCGA.2A.A8VL | 1 | 74699099 | 74700802 | 5 | -1.2440 |
| TCGA.2A.A8VL | 2 | 123900537 | 123920905 | 14 | -0.7297 |
The threshold parameter controls stringency — higher
values yield more robustly recurrent regions. Here we focus on high-risk
samples where aggressive CNV patterns are most pronounced.
Annotating CNV regions with gene symbols — by finding genomic overlaps with a reference gene annotation — translates chromosomal coordinates into functionally meaningful gene lists. Key genes identified in TCGA-PRAD CNV analyses include ARHGEF16, CHMP7, ELP3, ASH2L, and RNF157, whose roles span immune modulation, cell cycle regulation, and tumour suppression.
annotated <- annotate(
genes_file = gene_file,
risk_file = recurrent_file,
output_dir = tempdir()
)
head(annotated)Each recurrent CNV region is linked to the overlapping gene(s), sample ID, and segment mean value.
A sample × gene CNV matrix organises annotated data into a format suitable for statistical integration with expression data. Positive values indicate copy number gains; negative values indicate losses. This matrix structure allows efficient examination of gene variation patterns across all samples simultaneously.
old_wd <- getwd()
setwd(tempdir())
cnv_matrix <- create_CNVMatrix(input_file = annotated_file)
#> CNV matrix saved to: /tmp/RtmpX5HEhM/cnv_output_matrix_20260514_140744.csv
setwd(old_wd)
print(dim(cnv_matrix))
#> [1] 8 9
print(cnv_matrix[, 1:min(5, ncol(cnv_matrix))])
#> # A tibble: 8 × 5
#> Sample ARHGEF16 VN1R3 CD163L1 PDE12
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 TCGA.EJ.5514 0.0007 NA NA NA
#> 2 TCGA.G9.7509 NA -0.0033 NA NA
#> 3 TCGA.J4.A67Q NA NA 0.0005 NA
#> 4 TCGA.KK.A8IK NA NA NA -0.0004
#> 5 TCGA.Y6.A8TL NA NA NA NA
#> 6 TCGA.ZG.A8QX NA NA NA NA
#> 7 TCGA.ZG.A8QY NA NA NA NA
#> 8 TCGA.ZG.A9MC NA NA NA NAThe final step integrates CNV data with transcriptomic profiles to identify CNV-driven gene expression changes. When a gene is amplified, its expression typically increases; when deleted, expression typically decreases. Genes showing statistically significant positive Pearson correlations between CNV segment mean and RNA expression are strong biomarker candidates.
In the full TCGA-PRAD analysis, grade-specific biomarkers included:
| Grade | Key CNV-correlated Gene | Role |
|---|---|---|
| Grade 1 | PPP1R3B | Glycogen metabolism regulation |
| Grade 2 | ASH2L | Histone methylation, chromatin remodeling |
| Grade 3 | IWS1 | RNA processing |
| Grade 4 | UBE2Q1 | Ubiquitin-mediated proteolysis |
| Grade 5 | GSK3A | Cell cycle, apoptosis regulation |
Cross-grade consensus genes CHMP7 and ELP3 were common across all five grades, suggesting fundamental roles in prostate cancer biology.
old_wd <- getwd()
setwd(tempdir())
corr_results <- correlate_with_expr(
cnv_file = cnv_matrix_file,
rna_file = rna_file
)
#> Common genes found: 5
#> Correlation results saved to: all_correlations.csv, significant_corr.csv, high_correlation.csv
setwd(old_wd)
cat("All correlations:\n")
#> All correlations:
print(corr_results$all_correlations)
#> gene cor_val p.value
#> 1 ARHGEF16 0.57254735 0.42745265
#> 2 RQCD1 0.05635468 0.94364532
#> 3 PDE12 -0.80909159 0.19090841
#> 4 FGFBP2 -0.98551703 0.01448297
#> 5 CD163L1 0.40120322 0.59879678
cat("\nSignificant correlations (p < 0.05):\n")
#>
#> Significant correlations (p < 0.05):
print(corr_results$significant)
#> gene cor_val p.value
#> 4 FGFBP2 -0.985517 0.01448297
cat("\nHigh-confidence CNV-driven genes (p < 0.05, r > 0.8):\n")
#>
#> High-confidence CNV-driven genes (p < 0.05, r > 0.8):
print(corr_results$high_correlation)
#> [1] gene cor_val p.value
#> <0 rows> (or 0-length row.names)Three output files are produced — all correlations, significant (p < 0.05), and high-confidence (p < 0.05 AND r > 0.8). Genes in the high-confidence output are the strongest candidates for further investigation as CNV-driven expression biomarkers in prostate cancer.
The RiskyCNV workflow applied to TCGA-PRAD data produced
the following key findings:
Risk-grade comparison: Comparing low-risk (Grade 1-2) and high-risk (Grade 4-5) analyses revealed common genes with distinct biological roles. Low-risk/Grade 1 shared genes included HMGA1, BRCA1, TP53, MYC, and EGFR — major oncogenes and tumour suppressors correlated with baseline cancer susceptibility. High-risk/Grade 4-5 shared genes included UBE2C and UBE2Z — ubiquitin pathway members associated with tumour aggressiveness and poor prognosis.
Consensus biomarkers: Integrated analysis of recurrent somatic CNVs, germline CNVs, and RNA-correlated genes identified grade-specific consensus biomarkers, including ME1, CLU, SERPINB5, NECAB1, CTHRC1, and DPYS. RNF157 emerged as a multi-consensus candidate biomarker across multiple analytical approaches, consistent with its known role in M2 macrophage polarisation and tumour immune evasion (Guan et al., 2022).
Clinical implications: By integrating risk
stratification with grade-wise CNV analysis, RiskyCNV
enables a more comprehensive understanding of prostate cancer.
Recognising genes common across both analyses enhances the ability to
distinguish between genes that primarily drive tumour progression and
those that serve as early indicators of cancer susceptibility,
ultimately supporting personalised treatment planning.
Although demonstrated here with prostate cancer Gleason scores,
RiskyCNV is fully generalised:
# Breast cancer with Nottingham scores
breast_grades <- extract_metadata(
file_path = "breast_samples.csv",
column_name = "nottingham_score",
disease_type = "auto",
n_groups = 3,
group_type = "grade",
output_dir = tempdir()
)
# Lymphoma with two risk groups (limited vs. advanced)
lymphoma_risk <- classify_risk(
file_path = "lymphoma_samples.csv",
column_name = "ann_arbor_stage",
disease_type = "auto",
n_groups = 2,
output_dir = tempdir()
)Seven built-in disease presets are available
("prostate", "breast",
"colorectal", "lung", "cervical",
"lymphoma", "melanoma"), and fully custom
thresholds can be supplied via disease_type = "custom".