Gene annotations are provided by several databases that integrate data from systematic explorations of various experimental and computational resources. The GENCODE project [1], part of the ENCODE project [2], offers accurate annotations of the human and mouse genomes derived from literature, primary data repositories, computational predictions, manual annotation, and experimental validation of genes and transcripts, including noncoding transcripts that are continually discovered and annotated. GENCODE is one of the most comprehensive and standardized databases for gene annotations, widely used by the scientific community.
The gene sets provided by GENCODE are comprehensive and include protein-coding and non-coding loci, encompassing alternatively spliced isoforms and pseudogenes. GENCODE gene annotations are regularly updated and released as the Ensembl/GENCODE gene sets, accessible via the official website https://www.gencodegenes.org. As of October 2024, the latest human release is GENCODE 47, and the latest mouse release is GENCODE M36 [1]. GENCODE gene sets are released approximately four times a year for mouse and twice a year for human [1].
GencoDymo2 is an R package designed to interrogate different releases of GENCODE annotations from the human and mouse genomes. It provides a streamlined interface for accessing and processing GENCODE annotations, providing easily accessible data regarding annotation statistics, release comparisons, and information on introns and splice sites. Moreover, GencoDymo2 can produce FASTA files of donor and acceptor splice site motifs that can be directly uploaded to the MaxEntScan tool [3] for the calculation of splice site scores.
GencoDymo2 can be used to import and process any gtf/gff3 formatted file. It automatizes and speeds up the following data manipulation steps:
GencoDymo2 runs in the R statistical computing environment. You will need R version 4.1.0 or higher. The most recent version of GencoDymo2 can be installed from GitHub. We recommend using the pak package for fast and reliable installation.
# Install pak if not already installed
if (!require("pak")) install.packages("pak")
# Install from GitHub
pak::pkg_install("github::monahton/GencoDymo2")
# Load the package
library(GencoDymo2)
GENCODE releases annotations approximately every three months,
coinciding with Ensembl releases. The latest release from GENCODE for
human and mouse genomes can be obtained using the get_latest_release()
function.
This function queries the official GENCODE website to determine the most
recent release for a given species (human or mouse).
# Fetch the most recent human and mouse GENCODE release identifiers
human_release <- get_latest_release("human", verbose = T)
mouse_release <- get_latest_release("mouse", verbose = T)
## Latest human GENCODE release: release_47
## Latest human GENCODE release: release_M36
NOTE: Behind the scenes, get_latest_release() parses the Ensembl public FTP site, extracts the release version (e.g., “release_47”), and returns it for programmatic use.
GencoDymo2 allows the download of gtf
and
gff3
files of the latest release or any release specified
by the user. These files can be obtained for human and mouse genome and
can be saved in a path specified by the user. With the release
identifiers in hand, we can download these annotation files:
# Download latest human long noncoding RNAs GTF
lnc_47_gtf <- get_gtf(
species = "human",
release_version = human_release,
annotation_type = "long_noncoding_RNAs.gtf.gz",
dest_folder = tempdir()
)
# Download previous human release (release_46) for comparison
lnc_46_gtf <- get_gtf(
species = "human",
release_version = "release_46",
annotation_type = "long_noncoding_RNAs.gtf.gz",
dest_folder = tempdir()
)
# Download latest mouse primary assembly annotations (GFF3)
mouse_36_gff3 <- get_gff3(
species = "mouse",
release_version = mouse_release,
annotation_type = "primary_assembly.annotation.gff3.gz",
dest_folder = tempdir()
)
The valid annotation types for both the gtf and gff3 format can be one of the following:
## Valid Annotation Types:
## [1] "annotation"
## [2] "basic.annotation"
## [3] "chr_patch_hapl_scaff.annotation"
## [4] "chr_patch_hapl_scaff.basic.annotation"
## [5] "long_noncoding_RNAs"
## [6] "primary_assembly.annotation"
## [7] "primary_assembly.basic.annotation"
## [8] "tRNAs"
## [9] "polyAs"
A gtf
or gff3
file can be imported to R as
a dataframe using the load_file()
function. Once
downloaded, load the files into R as data frames using load_file(). This
function automatically detects GTF vs GFF3 and parses the attributes
into columns.
# Loading using the stored paths from previous steps
lnc_47_df <- load_file(lnc_47_gtf)
head(lnc_47_df)
# Alternatively, specify the file path directly
lnc_46_df <- load_file(file.path(tempdir(), "gencode.v46.long_noncoding_RNAs.gtf.gz"))
head(lnc_46_df)
# Load mouse GFF3
mouse_pri_36 <- load_file(file.path(tempdir(),"gencode.vM36.primary_assembly.annotation.gff3.gz"))
head(mouse_pri_36)
As GENCODE regularly updates its annotations, different releases from
GENCODE contain differing number of annotations. Some genes are added in
new releases while others are classified into different gene classes. In
the majority of cases, the Ensembl gene ID of genes remains constant
while the gene name might vary.
To quantify changes between two releases, use compare_release()
. Specify the
two annotation data frames and the type of feature to compare (“gene”,
“transcript”, “exon”, etc.). Additional filters like the gene_type and
the baseline can allow to focus on subsets (e.g., only TEC or
protein_coding genes) and different reference counts for the calculation
of the difference percentage.
# Compare gene counts between release 47 and 46
gene_comparison <- compare_release(lnc_47_df, lnc_46_df, type = "gene")
# Compare exon counts
exon_comparison <- compare_release(lnc_47_df, lnc_46_df, type = "exon")
# Compare a specific gene biotype (e.g., TEC) using a custom baseline
comparison <- compare_release(
lnc_47_df,
lnc_46_df,
type = "gene",
gene_type = "TEC",
baseline = "count1"
)
The function produces as an output a Delta value, that corresponds to the difference in number between the specified elements of the two inputs, in addition to its Percentage.
NOTE: The output is a data frame summarizing absolute and percentage differences in feature counts, enabling rapid assessment of annotation growth or refinement.
The gtf files provided by GENCODE list genes, transcripts, and exons
while intronic regions must be inferred. Given the loaded data frame
from the gtf files of GENCODE, the extract_introns()
function
computes intron coordinates for each transcript by sorting exon ranges
and identifying gaps. It returns a dataframe containing introns
coordinates and position within each transcript.
# Human lncRNA introns for release 47
introns_lnc_47 <- extract_introns(lnc_47_df, verbose = T)
# Mouse introns (filtering to primary chromosomes first)
mouse_pri_36 <- mouse_pri_36[grepl("^chr", mouse_pri_36$seqnames), ]
mouse_introns_pri_36 <- extract_introns(mouse_pri_36, verbose = T)
Splice sites are defined by dinucleotide motifs at intron boundaries. At the 5’ end, the donor site includes an almost invariant sequence GT (GU in the RNA). At the 3’ end, the intron terminates with the acceptor site, an almost invariant AG sequence.
The majority of introns belong to the U2-type spliceosome and are
flanked by GT– AG splice site dinucleotides. The most frequent exception
to this rule are the U2-type GC–AG splice sites, comprising ∼0.8% of
human splice sites and about ∼0.09% of the human splice sites belong to
the U12-type which are processed by the minor spliceosome and are
flanked by AT–AC dinucleotides [4].
The assign_splice_sites()
function retrieves these flanking sequences from a BSgenome object. It
adds two new columns to the data frame containing the introns
coordinates, assigning both the 5’ and 3’ splice sites. Sequences should
be retrieved from a loaded BSgenome object such as
BSgenome.Hsapiens.UCSC.hg38
for human and
BSgenome.Mmusculus.UCSC.mm39
for mouse.
# Human
library(BSgenome.Hsapiens.UCSC.hg38)
lnc_47_ss <- assign_splice_sites(
introns_lnc_47,
genome = BSgenome.Hsapiens.UCSC.hg38,
verbose = T
)
# Mouse
library(BSgenome.Mmusculus.UCSC.mm39)
mouse_pri_36_ss <- assign_splice_sites(
mouse_introns_pri_36,
genome = BSgenome.Mmusculus.UCSC.mm39,
verbose = T
)
Cryptic splice sites are those that do not match the canonical donor
(GT) or acceptor motifs (AG). The find_cryptic_splice_sites()
function identifies potential cryptic splice sites by comparing sequence
motifs in introns to canonical splice site motifs (donor and acceptor)
that can be also specified by the user. It compares the identified
splice sites with the provided canonical motifs and flags the sites that
differ from the canonical patterns, making it useful for studying
aberrant splicing events.
# Identify cryptic (non-canonical) splice sites
cryptic_ss <- find_cryptic_splice_sites(
lnc_47_ss,
genome = BSgenome.Hsapiens.UCSC.hg38,
canonical_donor = "GT",
canonical_acceptor = "AG",
verbose = TRUE
)
MaxEntScan is a webtool based on the approach for
modeling the sequences of short sequence motifs involved in RNA splicing
which simultaneously accounts for non-adjacent as well as adjacent
dependencies between positions. This method is based on the ‘Maximum
Entropy Principle’ and generalizes most previous probabilistic
models of sequence motifs such as weight matrix models and inhomogeneous
Markov models [3].
The MaxEntScan::score5ss assign scores for donor splice
sites according to four models by using 9-mer SEQUENCES
motif as input in a .fa file
. The 9-mers motif contain 3
bases in the exon and 6 bases in the intron.
The MaxEntScan::score3ss assign scores for donor splice
sites according to three models. It takes as input a
.fa file
of 23-mer SEQUENCES in which 20
bases are in the intron and 3 bases in the exon.
The extract_ss_motif()
function is used along with BSgenome object of the studied species to
retrieve MaxEntScan::score5ss and MaxEntScan::score3ss
motif sequences.
The user can choose to generate a fasta file in the
working directory or any specified path as an output, that contains
either 9-mers or 23-mers, respectively for each 5’ and 3’ splice-sites.
It generates also a dataframe for the coordinates and
IDs of the corresponding motifs.
The generated fasta files can then be directly utilized by
MaxEntScan tools.
The first argument input
is a dataframe containing the
intron coordinates. The second argument genome
is a
BSgenome object (the human genome sequence hg38 is used by default). The
third argument type
specifies whether to extract motifs at
the 5ss or the 3ss. Users can optionally set the argument
save_fasta
to TRUE to and the output_file
argument to save a fasta file.
# Donor motifs (5'ss)
motifs_donor <- extract_ss_motif(
input = lnc_47_ss,
genome = BSgenome.Hsapiens.UCSC.hg38,
type = "5ss",
verbose = T,
save_fasta = T,
output_file = file.path(tempdir(), "lnc_47_5ss_motifs.fa")
)
# Acceptor motifs (3'ss)
motifs_acc <- extract_ss_motif(
input = lnc_47_ss,
genome = BSgenome.Hsapiens.UCSC.hg38,
type = "3ss",
verbose = T,
save_fasta = T,
output_file = file.path(tempdir(), "lnc_47_3ss_motifs.fa")
)
Some genes are unspliced: composed by one single exon and have no
introns.
The extract_single_exon()
function can extract all the single-exon genes
or transcripts and returns in a dataframe with their ensembl gene IDs or
transcript IDs.
## identify single exon genes and transcripts
single_exon_genes <- extract_single_exon(lnc_47_df, level = "gene")
single_exon_trans <- extract_single_exon(lnc_47_df, level = "transcript")
The classify_exons()
adds a new column (EXON_CLASSIFICATION)
to the input
dataframe, describing the position of each exon. It labels each exon as
(first_exons), (inner_exons), (last_exons)
and (single_exons) facilitating position-based analyses. This
classification is informative considering that first and last exons
often have peculiar regulative roles.
# Assign the ordinal position of exons
lnc_47_class_exons <- classify_exons(lnc_47_df, verbose = TRUE)
The spliced_trans_length()
function
computes the mature (post-splicing) transcript length by summing exon
widths per transcript.
# Length of spliced transcript
lnc_47_spliced_length <- spliced_trans_length(lnc_47_df)
head(lnc_47_spliced_length)
GencoDymo2 provides the stat_summary()
function that
gives a brief summary of data annotations in a particular GENCODE
release regarding genes, transcripts, exons and introns.
# Exon length statistics
lnc_47_exon_stats <- stat_summary(lnc_47_class_exons, type = "exon")
# Intron length statistics
lnc_47_intron_stats <- stat_summary(introns_lnc_47, type = "intron")
The calculate_gc_content()
function
computes GC percentage for any feature set using a BSgenome
reference.
# Human
lnc_47_gc <- calculate_gc_content(
lnc_47_df,
genome = BSgenome.Hsapiens.UCSC.hg38,
verbose = TRUE
)
# Mouse
mouse_pri_36_gc <- calculate_gc_content(
mouse_pri_36,
genome = BSgenome.Mmusculus.UCSC.mm39,
verbose = TRUE
)
For coding annotations, extract_cds_sequences()
writes
CDS FASTA files from GRanges objects.
# Convert to GRanges and extract
library(GenomicRanges)
mouse_pri_36_granges <- GRanges(mouse_pri_36)
mouse_cds_seqs <- extract_cds_sequences(
mouse_pri_36_granges,
BSgenome.Mmusculus.UCSC.mm39,
save_fasta = TRUE,
output_file = file.path(tempdir(), "mouse_pri_36_CDS.fa.gz")
verbose = TRUE
)
The following provides the session information used when compiling this document.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.0 (2024-04-24)
## os macOS 15.4.1
## system x86_64, darwin20
## ui X11
## language (EN)
## collate C
## ctype en_US.UTF-8
## tz Europe/Rome
## date 2025-05-13
## pandoc 3.4 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64/ (via rmarkdown)
## quarto 1.6.42 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/quarto
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## bslib 0.9.0 2025-01-30 [3] CRAN (R 4.4.1)
## cachem 1.1.0 2024-05-16 [3] CRAN (R 4.4.0)
## cli 3.6.5 2025-04-23 [3] CRAN (R 4.4.1)
## devtools 2.4.5 2022-10-11 [3] CRAN (R 4.4.0)
## digest 0.6.37 2024-08-19 [3] CRAN (R 4.4.1)
## ellipsis 0.3.2 2021-04-29 [3] CRAN (R 4.4.0)
## evaluate 1.0.3 2025-01-10 [3] CRAN (R 4.4.1)
## fastmap 1.2.0 2024-05-15 [3] CRAN (R 4.4.0)
## fs 1.6.6 2025-04-12 [3] CRAN (R 4.4.1)
## glue 1.8.0 2024-09-30 [3] CRAN (R 4.4.1)
## htmltools 0.5.8.1 2024-04-04 [3] CRAN (R 4.4.0)
## htmlwidgets 1.6.4 2023-12-06 [3] CRAN (R 4.4.0)
## httpuv 1.6.16 2025-04-16 [3] CRAN (R 4.4.1)
## jquerylib 0.1.4 2021-04-26 [3] CRAN (R 4.4.0)
## jsonlite 2.0.0 2025-03-27 [3] CRAN (R 4.4.1)
## knitr 1.50 2025-03-16 [3] CRAN (R 4.4.1)
## later 1.4.2 2025-04-08 [3] CRAN (R 4.4.1)
## lifecycle 1.0.4 2023-11-07 [3] CRAN (R 4.4.0)
## magrittr 2.0.3 2022-03-30 [3] CRAN (R 4.4.0)
## memoise 2.0.1 2021-11-26 [3] CRAN (R 4.4.0)
## mime 0.13 2025-03-17 [3] CRAN (R 4.4.1)
## miniUI 0.1.2 2025-04-17 [3] CRAN (R 4.4.1)
## pkgbuild 1.4.7 2025-03-24 [3] CRAN (R 4.4.1)
## pkgload 1.4.0 2024-06-28 [3] CRAN (R 4.4.0)
## profvis 0.4.0 2024-09-20 [3] CRAN (R 4.4.1)
## promises 1.3.2 2024-11-28 [3] CRAN (R 4.4.1)
## purrr 1.0.4 2025-02-05 [3] CRAN (R 4.4.1)
## R6 2.6.1 2025-02-15 [3] CRAN (R 4.4.1)
## Rcpp 1.0.14 2025-01-12 [3] CRAN (R 4.4.1)
## remotes 2.5.0 2024-03-17 [3] CRAN (R 4.4.0)
## rlang 1.1.6 2025-04-11 [3] CRAN (R 4.4.1)
## rmarkdown 2.29 2024-11-04 [3] CRAN (R 4.4.1)
## rstudioapi 0.17.1 2024-10-22 [3] CRAN (R 4.4.1)
## sass 0.4.10 2025-04-11 [3] CRAN (R 4.4.1)
## sessioninfo 1.2.3 2025-02-05 [3] CRAN (R 4.4.1)
## shiny 1.10.0 2024-12-14 [3] CRAN (R 4.4.1)
## urlchecker 1.0.1 2021-11-30 [3] CRAN (R 4.4.0)
## usethis 3.1.0 2024-11-26 [3] CRAN (R 4.4.1)
## vctrs 0.6.5 2023-12-01 [3] CRAN (R 4.4.0)
## xfun 0.52 2025-04-02 [3] CRAN (R 4.4.1)
## xtable 1.8-4 2019-04-21 [3] CRAN (R 4.4.0)
## yaml 2.3.10 2024-07-26 [3] CRAN (R 4.4.0)
##
## [1] /private/var/folders/f_/ww6gm_z57g9f18_j63c_krn5sblhsy/T/RtmprtJJu2/Rinst1fbe19225344
## [2] /private/var/folders/f_/ww6gm_z57g9f18_j63c_krn5sblhsy/T/Rtmp6YCKyP/temp_libpath1bc038657191
## [3] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────