--- title: "RiskyCNV: A Prostate Cancer Case Study Using TCGA-PRAD Data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{RiskyCNV: A Prostate Cancer Case Study Using TCGA-PRAD Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction Copy Number Variation (CNV) is a genetic phenomenon in which the number of copies of a specific DNA segment varies among individuals. CNVs can take the form of duplications, deletions, or other structural changes in DNA sequences and can extend to long genomic segments. CNVs play an essential role in driving cancer development, and their characteristics are critical for disease diagnosis, prognosis, and treatment (Freeman et al., 2006). CNVs are broadly classified into germline and somatic types. Germline CNVs contribute to hereditary disorders and inherited predispositions to cancer. Somatic CNVs occur due to aberrant cell division and often relate directly to cancer progression. The study of somatic CNVs can uncover driver genes involved in tumorigenesis or tumour suppression and serve as biomarkers for predicting prognosis or treatment outcomes (Xie & Tammi, 2009). Prostate cancer is one of the most common malignancies in men worldwide, with marked heterogeneity in progression and treatment outcomes. Indolent (low-Gleason) cancers show minor genomic alterations, while aggressive metastatic tumours demonstrate gross CNVs (Hieronymus et al., 2014). Risk stratification — categorising patients by their likelihood of harbouring aggressive disease — is central to clinical decision-making and personalised treatment planning. The `RiskyCNV` package addresses a critical gap in existing CNV tools. While packages such as CNVizard, CINmetrics, GISTIC, and CopywriteR offer visualisation, significant CNV identification, and chromosomal instability quantification, none provide stratified class analysis, particularly risk-class analysis (Krause et al., 2024; Oza et al., 2023; Mermel et al., 2011; Kuilman et al., 2015). `RiskyCNV` was designed specifically for class-stratified exploration and analysis of CNV omics data, with the potential to deliver personalised insights regarding disease course and risk stratification. This vignette demonstrates the complete `RiskyCNV` workflow applied to prostate adenocarcinoma (PRAD) data from The Cancer Genome Atlas (TCGA), illustrating how CNV analysis integrated with clinical grading and RNA expression data can identify biologically meaningful biomarkers. ```{r setup} library(RiskyCNV) ``` --- ## Dataset The data used in this case study are derived from the TCGA PRAD dataset (https://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/PRAD/20160128/). The dataset comprises: - **Clinical data**: Sample IDs, Gleason scores, and histopathological pattern information for 929 prostate cancer samples - **Somatic CNV data**: Segment-level copy number profiles including chromosome, start/end coordinates, number of probes, and segment mean - **Gene annotation**: Reference genomic coordinates for gene symbol assignment - **RNA expression data**: Transcriptomic profiles for correlation analysis The grade-wise distribution of TCGA-PRAD samples is as follows: | Grade | Gleason Score | Number of Samples | |---|---|---| | Grade 1 | ≤ 6 | 44 | | Grade 2 | 3+4=7 | 146 | | Grade 3 | 4+3=7 | 99 | | Grade 4 | 8 | 63 | | Grade 5 | 9-10 | 139 | | Control | — | 438 | | **Total** | | **929** | ```{r file_paths} sample_file <- system.file("extdata", "sample_data.csv", package = "RiskyCNV") cnv_file <- system.file("extdata", "cnv_data.txt", package = "RiskyCNV") gene_file <- system.file("extdata", "gene_annotation.csv", package = "RiskyCNV") annotated_file <- system.file("extdata", "annotated_cnv.csv", package = "RiskyCNV") cnv_matrix_file <- system.file("extdata", "cnv_matrix.csv", package = "RiskyCNV") rna_file <- system.file("extdata", "rna_data.csv", package = "RiskyCNV") # Preview the clinical sample data head(read.csv(sample_file)) ``` Each sample contains its total Gleason score, primary histological pattern (pattern1), secondary histological pattern (pattern2), and assigned grade label (Balraj & Muthamilselvan, 2024). --- ## Step 1: Classify Samples into Gleason Grade Groups ### Background Gleason grading is the cornerstone of prostate cancer pathological assessment. The 2014 ISUP Consensus Conference (Epstein et al., 2016) formalised a five-tier Grade Group system: - **Grade Group 1**: Gleason ≤ 6 — well differentiated, very low risk - **Grade Group 2**: Gleason 3+4=7 — favourable intermediate risk - **Grade Group 3**: Gleason 4+3=7 — unfavourable intermediate risk - **Grade Group 4**: Gleason 8 — poorly differentiated, high risk - **Grade Group 5**: Gleason 9-10 — anaplastic, very high risk ### Using the Prostate Cancer Preset The `extract_metadata()` function implements the ISUP Grade Group system directly when `disease_type = "prostate"` is specified: ```{r grade_preset} grade_groups <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", output_dir = tempdir() ) print(names(grade_groups)) print(sapply(grade_groups, length)) ``` ### Using the Auto Mode For datasets where clinical thresholds are unknown, `disease_type = "auto"` automatically derives a normalised Risk Score from the data using min-max normalisation and divides samples into the requested number of groups. The function assesses distribution skewness and selects equal-width or quantile-based splitting accordingly: ```{r grade_auto} grade_groups_auto <- extract_metadata( file_path = sample_file, column_name = "gleason_score", disease_type = "auto", n_groups = 5, group_type = "grade", output_dir = tempdir() ) print(names(grade_groups_auto)) ``` The Risk Score is computed as: **Risk Score = (score − min) / (max − min)** This normalises any numeric scoring system to a 0–1 scale, making the package applicable to any disease with a numeric grading or staging score. --- ## Step 2: Stratify Samples by Clinical Risk Level ### Background Risk stratification translates Gleason grades into actionable clinical risk categories. The D'Amico classification (D'Amico et al., 1998), the most widely used system in prostate cancer, defines: - **Low risk**: Gleason ≤ 6 — active surveillance appropriate - **Intermediate risk**: Gleason 7 — selective treatment indicated - **High risk**: Gleason ≥ 8 — aggressive treatment required ### Using the Prostate Cancer Preset ```{r risk_preset} risk_groups <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "prostate", output_dir = tempdir() ) print(names(risk_groups)) print(sapply(risk_groups, length)) ``` ### Flexible Risk Grouping with Auto Mode The `auto` mode supports any number of risk groups. This is useful for diseases that use only two risk levels or four levels: ```{r risk_auto} # Two risk groups risk_2 <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "auto", n_groups = 2, output_dir = tempdir() ) print(names(risk_2)) # Four risk groups risk_4 <- classify_risk( file_path = sample_file, column_name = "gleason_score", disease_type = "auto", n_groups = 4, output_dir = tempdir() ) print(names(risk_4)) ``` --- ## Step 3: Detect Copy Number Aberrations ### Background Aberration detection identifies genomic segments showing significant gains (oncogene amplifications) or losses (tumour suppressor deletions). Grade-wise analysis of TCGA-PRAD data showed progressive increases in aberration frequency: - **Grade 1**: Amplifications in chromosomes 8, 14, 20; losses in chromosomes 2-21 - **Grade 5**: Amplifications in chromosomes 1, 3, 7, 9, 10, 12, 14, 16, 17, 19, 21; losses in chromosomes 1-14 and 16-22 This escalation in genomic instability with Gleason grade underscores the biological basis of clinical risk stratification. ```{r aberration} aberrations <- aberration( cnv_data_file = cnv_file, effect_size = 0.3 ) # Aberrant regions per chromosome print(sapply(aberrations, nrow)) ``` Segments with segment mean > +0.3 are classified as Gains (Aberration_Code = 1); those with segment mean < −0.3 are classified as Losses (Aberration_Code = 0). --- ## Step 4: Identify Recurrent Copy Number Variations ### Background Recurrent CNVs — those appearing consistently across multiple samples within a risk group — are more likely to represent functional driver events rather than stochastic noise. Recurrent gains in oncogenes and recurrent losses in tumour suppressor genes carry particular clinical significance. ````{r recurrent, eval = FALSE} recurrent_file <- recurrent( x = risk_groups, risk_level = "high_risk", cnv_data_file = cnv_file, threshold = 2 ) recurrent_data <- read.csv(recurrent_file) head(recurrent_data) ```` The output CSV contains recurrent CNV regions for the high-risk group: | Sample | Chromosome | Start | End | Num_Probes | Segment_Mean | |---|---|---|---|---|---| | TCGA.2A.A8VL | 1 | 66506701 | 66515081 | 9 | -1.1952 | | TCGA.2A.A8VL | 1 | 74699099 | 74700802 | 5 | -1.2440 | | TCGA.2A.A8VL | 2 | 123900537 | 123920905 | 14 | -0.7297 | The `threshold` parameter controls stringency — higher values yield more robustly recurrent regions. Here we focus on high-risk samples where aggressive CNV patterns are most pronounced. --- ## Step 5: Annotate CNV Regions with Gene Symbols ### Background Annotating CNV regions with gene symbols — by finding genomic overlaps with a reference gene annotation — translates chromosomal coordinates into functionally meaningful gene lists. Key genes identified in TCGA-PRAD CNV analyses include ARHGEF16, CHMP7, ELP3, ASH2L, and RNF157, whose roles span immune modulation, cell cycle regulation, and tumour suppression. ```{r annotate, eval = FALSE} annotated <- annotate( genes_file = gene_file, risk_file = recurrent_file, output_dir = tempdir() ) head(annotated) ``` Each recurrent CNV region is linked to the overlapping gene(s), sample ID, and segment mean value. --- ## Step 6: Build a CNV Expression Matrix ### Background A sample × gene CNV matrix organises annotated data into a format suitable for statistical integration with expression data. Positive values indicate copy number gains; negative values indicate losses. This matrix structure allows efficient examination of gene variation patterns across all samples simultaneously. ```{r cnv_matrix} old_wd <- getwd() setwd(tempdir()) cnv_matrix <- create_CNVMatrix(input_file = annotated_file) setwd(old_wd) print(dim(cnv_matrix)) print(cnv_matrix[, 1:min(5, ncol(cnv_matrix))]) ``` --- ## Step 7: Correlate CNV with RNA Expression ### Background The final step integrates CNV data with transcriptomic profiles to identify CNV-driven gene expression changes. When a gene is amplified, its expression typically increases; when deleted, expression typically decreases. Genes showing statistically significant positive Pearson correlations between CNV segment mean and RNA expression are strong biomarker candidates. In the full TCGA-PRAD analysis, grade-specific biomarkers included: | Grade | Key CNV-correlated Gene | Role | |---|---|---| | Grade 1 | PPP1R3B | Glycogen metabolism regulation | | Grade 2 | ASH2L | Histone methylation, chromatin remodeling | | Grade 3 | IWS1 | RNA processing | | Grade 4 | UBE2Q1 | Ubiquitin-mediated proteolysis | | Grade 5 | GSK3A | Cell cycle, apoptosis regulation | Cross-grade consensus genes CHMP7 and ELP3 were common across all five grades, suggesting fundamental roles in prostate cancer biology. ```{r correlations} old_wd <- getwd() setwd(tempdir()) corr_results <- correlate_with_expr( cnv_file = cnv_matrix_file, rna_file = rna_file ) setwd(old_wd) cat("All correlations:\n") print(corr_results$all_correlations) cat("\nSignificant correlations (p < 0.05):\n") print(corr_results$significant) cat("\nHigh-confidence CNV-driven genes (p < 0.05, r > 0.8):\n") print(corr_results$high_correlation) ``` Three output files are produced — all correlations, significant (p < 0.05), and high-confidence (p < 0.05 AND r > 0.8). Genes in the high-confidence output are the strongest candidates for further investigation as CNV-driven expression biomarkers in prostate cancer. --- ## Biological Summary The `RiskyCNV` workflow applied to TCGA-PRAD data produced the following key findings: **Risk-grade comparison**: Comparing low-risk (Grade 1-2) and high-risk (Grade 4-5) analyses revealed common genes with distinct biological roles. Low-risk/Grade 1 shared genes included HMGA1, BRCA1, TP53, MYC, and EGFR — major oncogenes and tumour suppressors correlated with baseline cancer susceptibility. High-risk/Grade 4-5 shared genes included UBE2C and UBE2Z — ubiquitin pathway members associated with tumour aggressiveness and poor prognosis. **Consensus biomarkers**: Integrated analysis of recurrent somatic CNVs, germline CNVs, and RNA-correlated genes identified grade-specific consensus biomarkers, including ME1, CLU, SERPINB5, NECAB1, CTHRC1, and DPYS. RNF157 emerged as a multi-consensus candidate biomarker across multiple analytical approaches, consistent with its known role in M2 macrophage polarisation and tumour immune evasion (Guan et al., 2022). **Clinical implications**: By integrating risk stratification with grade-wise CNV analysis, `RiskyCNV` enables a more comprehensive understanding of prostate cancer. Recognising genes common across both analyses enhances the ability to distinguish between genes that primarily drive tumour progression and those that serve as early indicators of cancer susceptibility, ultimately supporting personalised treatment planning. --- ## Generalisability Although demonstrated here with prostate cancer Gleason scores, `RiskyCNV` is fully generalised: ```{r generalised, eval = FALSE} # Breast cancer with Nottingham scores breast_grades <- extract_metadata( file_path = "breast_samples.csv", column_name = "nottingham_score", disease_type = "auto", n_groups = 3, group_type = "grade", output_dir = tempdir() ) # Lymphoma with two risk groups (limited vs. advanced) lymphoma_risk <- classify_risk( file_path = "lymphoma_samples.csv", column_name = "ann_arbor_stage", disease_type = "auto", n_groups = 2, output_dir = tempdir() ) ``` Seven built-in disease presets are available (`"prostate"`, `"breast"`, `"colorectal"`, `"lung"`, `"cervical"`, `"lymphoma"`, `"melanoma"`), and fully custom thresholds can be supplied via `disease_type = "custom"`. --- ## References - Balraj AS & Muthamilselvan S. (2024). PRADclass: Hybrid Gleason grade-informed computational strategy. *Cancer Informatics*. - D'Amico AV, et al. (1998). Biochemical outcome after radical prostatectomy. *JAMA*, 280(11):969-974. - Epstein JI, et al. (2016). The 2014 ISUP Consensus Conference on Gleason Grading. *Am J Surg Pathol*, 40(2):244-252. - Freeman JL, et al. (2006). Copy number variation: new insights in genome diversity. *Genome Research*, 16:949-961. - Guan H, et al. (2022). Exosomal RNF157 mRNA from prostate cancer cells contributes to M2 macrophage polarization. *Front Oncol*, 12:1021270. - Hieronymus H, et al. (2014). Copy number alteration burden predicts prostate cancer relapse. *PNAS*, 111:11139-11144. - Krause J, et al. (2024). CNVizard. *BMC Bioinformatics*, 25:376. - Kuilman T, et al. (2015). CopywriteR. *Genome Biology*, 16:49. - Mermel CH, et al. (2011). GISTIC2.0. *Genome Biology*, 12:R41. - Oza VH, et al. (2023). CINmetrics. *PeerJ*, 11:e15244. - Xie C & Tammi MT. (2009). CNV-seq. *BMC Bioinformatics*, 10:80.