`buildCDS()` constructs CDS information for the transcripts in a query GTF object.

buildCDS(query, ref, fasta)

Arguments

query

GRanges object containing query GTF data.

ref

GRanges object containing reference GTF data.

fasta

BSgenome or Biostrings object containing the genomic sequence.

Value

GRanges object containing the query exon entries and the newly constructed CDS information.

Details

`buildCDS()` first searches `query` for known reference mRNAs and annotates them with their CDS information. For the remaining transcripts, it searches for a putative translation start site using a database of annotated ATG codons built from `ref`. Transcripts that contain an open reading frame are assigned the newly determined CDS information.
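The open-reading-frame check described above can be illustrated with a small base R sketch. This is not the package's implementation: `find_orf()` is a hypothetical helper, it works on a plain transcript sequence string rather than on GRanges objects, and it assumes the candidate ATG position is already known.

```r
# Hypothetical helper (not part of the package): starting from a candidate
# ATG position in a transcript sequence, look for the first in-frame stop codon.
find_orf <- function(tx_seq, atg_pos) {
  stops <- c("TAA", "TAG", "TGA")
  # No room for even one codon at this position
  if (atg_pos + 2 > nchar(tx_seq)) {
    return(NULL)
  }
  # Start positions of every complete codon in frame with the candidate ATG
  codon_starts <- seq(atg_pos, nchar(tx_seq) - 2, by = 3)
  codons <- substring(tx_seq, codon_starts, codon_starts + 2)
  stop_idx <- which(codons %in% stops)
  # Require a real ATG and at least one in-frame stop codon
  if (codons[1] != "ATG" || length(stop_idx) == 0) {
    return(NULL)
  }
  # The ORF runs from the ATG through the last base of the first stop codon
  c(start = atg_pos, end = codon_starts[stop_idx[1]] + 2)
}

# The out-of-frame "TGA" at positions 4-6 is ignored; the in-frame "TAG" ends the ORF
orf <- find_orf("GGATGAAACCCTAGTT", atg_pos = 3)
# A sequence with no in-frame stop codon yields NULL
no_orf <- find_orf("GGATGAAACCC", atg_pos = 3)
```

In this toy case a transcript with an in-frame stop codon yields transcript-relative ORF coordinates, while one without a stop codon yields `NULL`, mirroring the idea that only transcripts containing an open reading frame receive CDS entries.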

Author

Fursham Hamid

Examples

# Load genome and datasets
library(BSgenome.Mmusculus.UCSC.mm10)
data(matched_query_gtf, ref_gtf)

# Build CDS
buildCDS(matched_query_gtf, ref_gtf, Mmusculus)
#>     Searching for reference mRNAs in query
#>     2 reference mRNAs found and its CDS were assigned
#>     Building database of annotated ATG codons
#>     Selecting best ATG start codon for remaining transcripts and determining open-reading frame
#>     2 new CDSs constructed
#> 
#>     Summary: Out of 4 transcripts in `matched_query_gtf`,
#>     4 transcript CDSs were built
#> GRanges object with 105 ranges and 7 metadata columns:
#>         seqnames            ranges strand |       type transcript_id
#>            <Rle>         <IRanges>  <Rle> |   <factor>   <character>
#>     [1]    chr10 79854427-79864432      + | transcript   transcript1
#>     [2]    chr10 79854427-79854721      + | exon         transcript1
#>     [3]    chr10 79856504-79856534      + | exon         transcript1
#>     [4]    chr10 79858752-79858824      + | exon         transcript1
#>     [5]    chr10 79858952-79859271      + | exon         transcript1
#>     ...      ...               ...    ... .        ...           ...
#>   [101]    chr10 79862014-79862047      + |        CDS   transcript4
#>   [102]    chr10 79862449-79862541      + |        CDS   transcript4
#>   [103]    chr10 79862653-79862869      + |        CDS   transcript4
#>   [104]    chr10 79862978-79863055      + |        CDS   transcript4
#>   [105]    chr10 79863145-79863274      + |        CDS   transcript4
#>                       gene_id old_gene_id match_level   gene_name     phase
#>                   <character> <character>   <numeric> <character> <numeric>
#>     [1] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [2] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [3] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [4] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [5] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     ...                   ...         ...         ...         ...       ...
#>   [101] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         0
#>   [102] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         2
#>   [103] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         2
#>   [104] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         1
#>   [105] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
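
Because the returned object mixes transcript, exon and CDS entries in its `type` column, a common follow-up is to pull out the CDS ranges and summarise them per transcript. The sketch below assumes the GenomicRanges package is installed and uses a small hand-made GRanges (made-up coordinates and IDs) as a stand-in for real `buildCDS()` output.

```r
library(GenomicRanges)

# Toy stand-in for buildCDS() output; coordinates and IDs are made up
out <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(100, 100, 300, 150, 300),
                   end   = c(400, 200, 400, 200, 338)),
  strand = "+",
  type = c("transcript", "exon", "exon", "CDS", "CDS"),
  transcript_id = rep("tx1", 5)
)

# Keep only the CDS entries
cds <- out[out$type == "CDS"]

# Total coding length per transcript, in nucleotides
cds_length <- tapply(width(cds), cds$transcript_id, sum)
```

The same subsetting should apply to the object returned in the example above, for instance by assigning the result of `buildCDS()` to a variable and filtering on `type == "CDS"`.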