buildCDS.Rd
`buildCDS()` constructs CDS information for the transcripts in a query GTF object.
buildCDS(query, ref, fasta)
query: GRanges object containing query GTF data.
ref: GRanges object containing reference GTF data.
fasta: BSgenome or Biostrings object containing genomic sequence.
Returns a GRanges object containing query exon entries and newly constructed CDS information.
`buildCDS()` first searches `query` for known reference mRNAs and assigns their annotated CDS information. For the remaining transcripts, it searches for a putative translation start site using a database of annotated ATG codons built from `ref`. Transcripts that contain an open reading frame are assigned the newly determined CDS information.
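The second step above (take a candidate ATG, then look for an open reading frame) can be illustrated with a small base R sketch. This is not the package's implementation: `find_orf()` is a hypothetical helper, the toy sequence is invented, and the start position is passed in by hand rather than looked up in an ATG database derived from `ref`. It also works on a plain spliced sequence string, not on genomic exon coordinates.

```r
# Hypothetical helper (not part of the package): given a spliced transcript
# sequence and the 1-based position of a candidate ATG, walk codon by codon
# until the first in-frame stop codon.
find_orf <- function(tx_seq, atg_pos) {
  stops <- c("TAA", "TAG", "TGA")
  # The candidate position must actually hold an ATG
  if (substr(tx_seq, atg_pos, atg_pos + 2) != "ATG") return(NULL)
  # Start positions of all complete in-frame codons downstream of the ATG
  codon_starts <- seq(atg_pos, nchar(tx_seq) - 2, by = 3)
  codons <- substring(tx_seq, codon_starts, codon_starts + 2)
  stop_idx <- which(codons %in% stops)
  # No in-frame stop codon: no complete open reading frame
  if (length(stop_idx) == 0) return(NULL)
  # ORF runs from the ATG through the last base of the first stop codon
  c(start = atg_pos, end = codon_starts[stop_idx[1]] + 2)
}

toy_tx <- "GGCATGAAACCCTGATTT"   # made-up sequence, ATG at position 4
orf <- find_orf(toy_tx, atg_pos = 4)
orf_seq <- substr(toy_tx, orf[["start"]], orf[["end"]])
```

In this toy case `orf_seq` is `"ATGAAACCCTGA"`. A transcript for which the helper returns `NULL` corresponds to the case where no CDS would be assigned.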
# Load genome and datasets
library(BSgenome.Mmusculus.UCSC.mm10)
data(matched_query_gtf, ref_gtf)
# Build CDS
buildCDS(matched_query_gtf, ref_gtf, Mmusculus)
#> Searching for reference mRNAs in query
#> 2 reference mRNAs found and its CDS were assigned
#> Building database of annotated ATG codons
#> Selecting best ATG start codon for remaining transcripts and determining open-reading frame
#> 2 new CDSs constructed
#>
#> Summary: Out of 4 transcripts in `matched_query_gtf`,
#> 4 transcript CDSs were built
#> GRanges object with 105 ranges and 7 metadata columns:
#>         seqnames            ranges strand |       type transcript_id
#>            <Rle>         <IRanges>  <Rle> |   <factor>   <character>
#>     [1]    chr10 79854427-79864432      + | transcript   transcript1
#>     [2]    chr10 79854427-79854721      + |       exon   transcript1
#>     [3]    chr10 79856504-79856534      + |       exon   transcript1
#>     [4]    chr10 79858752-79858824      + |       exon   transcript1
#>     [5]    chr10 79858952-79859271      + |       exon   transcript1
#>     ...      ...               ...    ... .        ...           ...
#>   [101]    chr10 79862014-79862047      + |        CDS   transcript4
#>   [102]    chr10 79862449-79862541      + |        CDS   transcript4
#>   [103]    chr10 79862653-79862869      + |        CDS   transcript4
#>   [104]    chr10 79862978-79863055      + |        CDS   transcript4
#>   [105]    chr10 79863145-79863274      + |        CDS   transcript4
#>                       gene_id old_gene_id match_level   gene_name     phase
#>                   <character> <character>   <numeric> <character> <numeric>
#>     [1] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [2] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [3] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [4] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     [5] ENSMUSG00000006498.14       GeneA           4       Ptbp1        NA
#>     ...                   ...         ...         ...         ...       ...
#>   [101] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         0
#>   [102] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         2
#>   [103] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         2
#>   [104] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         1
#>   [105] ENSMUSG00000006498.14        <NA>          NA       Ptbp1         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths