matchGeneInfo.Rd
`matchGeneInfo()` matches gene IDs in a query GTF object to those in a reference GTF and corrects the ones that differ.
matchGeneInfo(query, ref, primary_gene_id = NULL, secondary_gene_id = NULL)
query: Query GTF imported as a GRanges object.
ref: Reference GTF as a GRanges object.
primary_gene_id: Character name of the metadata column holding the primary gene ID in the query GTF. This is typically 'gene_id'.
secondary_gene_id: Character name of the metadata column holding the secondary gene ID in the query GTF. An example input is 'ref_gene_id'.
Gene_id-matched query GRanges
By default, the correction relies on finding overlaps between transcripts in the query and transcripts in the reference. Used alone, this method can produce false-positive matches (19 percent false positives). To improve on this, users can invoke two additional layers of matching.
(1) Matching by Ensembl gene IDs. If both the query and the reference transcript annotations contain Ensembl-style gene IDs, the function tries to match the two sets of IDs in a less stringent manner. To invoke this correction, provide the 'primary_gene_id' argument.
(2) Matching by secondary gene IDs. Depending on the transcript assembly program, GTF/GFF3 annotations may carry additional attributes for each transcript, including a distinct secondary gene ID that potentially matches the reference. To invoke this correction, provide both the 'primary_gene_id' and 'secondary_gene_id' arguments. To determine whether your transcript assembly contains possible secondary gene IDs, import the query GTF file using `importGTF()` and check its metadata columns.
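As an illustration of what "less stringent" matching of Ensembl-style IDs in layer (1) could look like, the base R sketch below compares IDs after dropping their version suffix. This is an assumption made for illustration only and is not confirmed to be how `matchGeneInfo()` works internally. The helper `strip_version()` and the example vectors are hypothetical.

```r
# Hypothetical helper (not part of the package): remove the trailing
# ".<version>" from Ensembl-style gene IDs, e.g. "ENSMUSG00000006498.14".
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Made-up example IDs: the first query ID differs from the reference
# only in its version suffix; the second has no Ensembl-style ID.
query_ids <- c("ENSMUSG00000006498.9", "GeneA")
ref_ids   <- c("ENSMUSG00000006498.14", "ENSMUSG00000025902.13")

# Compare version-less IDs; unmatched query IDs yield NA.
hits <- match(strip_version(query_ids), strip_version(ref_ids))

# Take the reference ID where a match exists, otherwise keep the query ID.
corrected <- ifelse(is.na(hits), query_ids, ref_ids[hits])
corrected
```

In this sketch, IDs with no version-less match are left unchanged, so they would still need another layer of matching, such as coordinate overlap, to be resolved.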
## ---------------------------------------------------------------------
## EXAMPLE USING SAMPLE DATASET
## ---------------------------------------------------------------------
# Load datasets
data(chrom_matched_query_gtf, ref_gtf)
# Run matching function
matchGeneInfo(chrom_matched_query_gtf, ref_gtf)
#> Number of mismatched gene_ids found: 1
#> ---> Attempting to match gene_ids by finding overlapping coordinates...
#> ---> 1 gene_id matched
#> Total gene_ids corrected: 1
#> Remaining number of mismatched gene_ids: 0
#> GRanges object with 56 ranges and 6 metadata columns:
#> seqnames ranges strand | type transcript_id
#> <Rle> <IRanges> <Rle> | <factor> <character>
#> [1] chr10 79854427-79864432 + | transcript transcript1
#> [2] chr10 79854427-79854721 + | exon transcript1
#> [3] chr10 79856504-79856534 + | exon transcript1
#> [4] chr10 79858752-79858824 + | exon transcript1
#> [5] chr10 79858952-79859271 + | exon transcript1
#> ... ... ... ... . ... ...
#> [52] chr10 79862014-79862047 + | exon transcript4
#> [53] chr10 79862449-79862541 + | exon transcript4
#> [54] chr10 79862653-79862869 + | exon transcript4
#> [55] chr10 79862978-79863055 + | exon transcript4
#> [56] chr10 79863145-79864359 + | exon transcript4
#> gene_id old_gene_id match_level gene_name
#> <character> <character> <numeric> <character>
#> [1] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> [2] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> [3] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> [4] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> [5] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> ... ... ... ... ...
#> [52] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> [53] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> [54] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> [55] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> [56] ENSMUSG00000006498.14 GeneA 4 Ptbp1
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths