Reference-guided construction of CDS on GTF object

`buildCDS()` constructs CDS information for the transcripts in a query GTF object, using a reference annotation as a guide.

buildCDS(query, ref, fasta)

Arguments

query: GRanges object containing query GTF data.
ref: GRanges object containing reference GTF data.
fasta: BSgenome or Biostrings object containing the genomic sequence.

Value

GRanges object containing the query exon entries and the newly constructed CDS information.

Details

The `buildCDS()` function first searches `query` for known reference mRNAs and assigns them the CDS information annotated in `ref`. For the remaining transcripts, `buildCDS()` searches for a putative translation start site using a database of annotated ATG codons built from `ref`. Transcripts that contain an open reading frame downstream of the selected start codon are assigned the newly determined CDS information.
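The open-reading-frame step described above can be illustrated with a minimal base R sketch. This is not the package's implementation: `find_orf()` is a hypothetical helper, it works on a plain character string of transcript sequence rather than on GRanges coordinates, and it assumes that the reported ORF includes the stop codon.

```r
# Hypothetical helper, not part of the package: starting from a candidate
# ATG at `start_pos` (1-based, transcript coordinates), scan codon by codon
# until the first in-frame stop codon.
find_orf <- function(tx_seq, start_pos) {
  stops <- c("TAA", "TAG", "TGA")
  # No complete codon fits at or after start_pos
  if (start_pos > nchar(tx_seq) - 2) return(NULL)
  # Start position of every codon that is in frame with start_pos
  codon_starts <- seq(start_pos, nchar(tx_seq) - 2, by = 3)
  codons <- substring(tx_seq, codon_starts, codon_starts + 2)
  stop_idx <- which(codons %in% stops)
  # Require an ATG at start_pos and at least one in-frame stop codon
  if (codons[1] != "ATG" || length(stop_idx) == 0) return(NULL)
  # Assumption: the ORF end includes the stop codon itself
  c(start = start_pos, end = codon_starts[stop_idx[1]] + 2)
}

# Toy sequence: ATG at position 3, in-frame TAG at positions 12-14
find_orf("GGATGAAACCCTAGTT", 3)
```

For this toy sequence the helper returns `start = 3` and `end = 14`; a sequence with no in-frame stop codon returns `NULL`, which mirrors the idea that only transcripts containing an open reading frame receive CDS information.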

Author

Fursham Hamid

Examples

# Load genome and datasets
library(BSgenome.Mmusculus.UCSC.mm10)
data(matched_query_gtf, ref_gtf)

# Build CDS
buildCDS(matched_query_gtf, ref_gtf, Mmusculus)
#>     Searching for reference mRNAs in query
#>     2 reference mRNAs found and its CDS were assigned
#>     Building database of annotated ATG codons
#>     Selecting best ATG start codon for remaining transcripts and determining open-reading frame
#>     2 new CDSs constructed
#> 
#>     Summary: Out of 4 transcripts in `matched_query_gtf`,
#>     4 transcript CDSs were built
#> GRanges object with 105 ranges and 7 metadata columns:
#>         seqnames            ranges strand |       type transcript_id
#>            <Rle>         <IRanges>  <Rle> |   <factor>   <character>
#>     [1]    chr10 79854427-79864432      + | transcript   transcript1
#>     [2]    chr10 79854427-79854721      + | exon         transcript1
#>     [3]    chr10 79856504-79856534      + | exon         transcript1
#>     [4]    chr10 79858752-79858824      + | exon         transcript1
#>     [5]    chr10 79858952-79859271      + | exon         transcript1
#>     ...      ...               ...    ... .        ...           ...
#>   [101]    chr10 79862014-79862047      + |        CDS   transcript4
#>   [102]    chr10 79862449-79862541      + |        CDS   transcript4
#>   [103]    chr10 79862653-79862869      + |        CDS   transcript4
#>   [104]    chr10 79862978-79863055      + |        CDS   transcript4
#>   [105]    chr10 79863145-79863274      + |        CDS   transcript4
#>                       gene_id old_gene_id match_level   gene_name     phase
#>                   <character> <character>   <numeric> <character> <numeric>
#>     [1] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [2] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [3] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [4] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [5] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     ...                   ...         ...         ...         ...       ...
#>   [101] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         0
#>   [102] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         2
#>   [103] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         2
#>   [104] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         1
#>   [105] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths