Match gene metadata from query GTF to a reference GTF

`matchGeneInfo()` matches the gene IDs in a query GTF object to those in a reference GTF and corrects any mismatches.

matchGeneInfo(query, ref, primary_gene_id = NULL, secondary_gene_id = NULL)

Arguments

query: Query GTF imported as a GRanges object
ref: Reference GTF imported as a GRanges object
primary_gene_id: Character name of the metadata column holding the primary gene ID in the query GTF. Typically 'gene_id'
secondary_gene_id: Character name of the metadata column holding the secondary gene ID in the query GTF. For example, 'ref_gene_id'

Value

Gene_id-matched query GRanges

Details

By default, gene IDs are corrected by finding overlaps between query transcripts and reference transcripts. Used alone, this method can produce false-positive matches (19 percent false positives). To reduce these, users can invoke two additional layers of matching. (1) Matching by Ensembl gene IDs. If both the query and reference transcript annotations contain Ensembl-style gene IDs, the function will try to match the two IDs in a less stringent manner. This correction is invoked by providing the 'primary_gene_id' argument.
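The overlap-based idea behind the default approach can be illustrated with a minimal sketch. This is not the internal implementation of `matchGeneInfo()`; it only assumes the GenomicRanges package, and the objects `query_tx` and `ref_tx`, their coordinates, and the second reference ID are toy values made up for illustration.

```r
library(GenomicRanges)

# Toy query and reference transcripts (hypothetical coordinates and IDs)
query_tx <- GRanges(
  seqnames = "chr10",
  ranges = IRanges(start = c(100, 5000), end = c(900, 5800)),
  strand = "+",
  gene_id = c("GeneA", "GeneB")
)
ref_tx <- GRanges(
  seqnames = "chr10",
  ranges = IRanges(start = c(150, 20000), end = c(1000, 21000)),
  strand = "+",
  gene_id = c("ENSMUSG00000006498.14", "ENSMUSG00000000001.1")
)

# Find query transcripts that overlap a reference transcript on the same strand
hits <- findOverlaps(query_tx, ref_tx)

# Copy the reference gene_id onto each overlapping query transcript;
# query transcripts without an overlap keep their original ID
matched_id <- query_tx$gene_id
matched_id[queryHits(hits)] <- ref_tx$gene_id[subjectHits(hits)]
matched_id
```

Because any overlap counts as a match in this sketch, a query transcript that overlaps a neighbouring or antisense-adjacent gene would be relabelled as well, which is the kind of false positive the additional matching layers aim to reduce.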

(2) Matching by secondary gene IDs. Depending on the transcript assembly program, GTF/GFF3 annotations may carry additional attributes for each transcript. These may include a distinct secondary gene ID that potentially matches the reference. To invoke this correction, provide both the 'primary_gene_id' and 'secondary_gene_id' arguments. To determine whether your transcript assembly contains possible secondary gene IDs, import the query GTF file using `importGTF()` and check its metadata columns.
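As a rough illustration of checking for a secondary gene ID, the sketch below builds a toy GRanges object in place of a real `importGTF()` result and lists its metadata columns. The object `query` and its values are hypothetical, and the 'ref_gene_id' column is assumed to be present only for the sake of the example.

```r
library(GenomicRanges)

# Toy query annotation standing in for the output of importGTF();
# 'ref_gene_id' is an assumed secondary ID column
query <- GRanges(
  seqnames = "chr10",
  ranges = IRanges(start = c(100, 100), end = c(900, 400)),
  strand = "+",
  type = c("transcript", "exon"),
  transcript_id = c("transcript1", "transcript1"),
  gene_id = c("GeneA", "GeneA"),
  ref_gene_id = c("ENSMUSG00000006498.14", "ENSMUSG00000006498.14")
)

# List the metadata columns to look for a candidate secondary gene ID
meta_cols <- names(mcols(query))
meta_cols

# If such a column exists, both column names can be passed on
# (needs the package loaded and a reference GRanges `ref`):
# matchGeneInfo(query, ref,
#               primary_gene_id = "gene_id",
#               secondary_gene_id = "ref_gene_id")
```

If no column other than the primary gene ID looks like a gene identifier, only the 'primary_gene_id' argument applies.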

Author

Fursham Hamid

Examples

## ---------------------------------------------------------------------
## EXAMPLE USING SAMPLE DATASET
## ---------------------------------------------------------------------
# Load datasets
data(chrom_matched_query_gtf, ref_gtf)

# Run matching function
matchGeneInfo(chrom_matched_query_gtf, ref_gtf)
#>     Number of mismatched gene_ids found: 1
#>     ---> Attempting to match gene_ids by finding overlapping coordinates...
#>     ---> 1 gene_id matched
#>     Total gene_ids corrected: 1
#>     Remaining number of mismatched gene_ids: 0
#> GRanges object with 56 ranges and 6 metadata columns:
#>        seqnames            ranges strand |       type transcript_id
#>           <Rle>         <IRanges>  <Rle> |   <factor>   <character>
#>    [1]    chr10 79854427-79864432      + | transcript   transcript1
#>    [2]    chr10 79854427-79854721      + |       exon   transcript1
#>    [3]    chr10 79856504-79856534      + |       exon   transcript1
#>    [4]    chr10 79858752-79858824      + |       exon   transcript1
#>    [5]    chr10 79858952-79859271      + |       exon   transcript1
#>    ...      ...               ...    ... .        ...           ...
#>   [52]    chr10 79862014-79862047      + |       exon   transcript4
#>   [53]    chr10 79862449-79862541      + |       exon   transcript4
#>   [54]    chr10 79862653-79862869      + |       exon   transcript4
#>   [55]    chr10 79862978-79863055      + |       exon   transcript4
#>   [56]    chr10 79863145-79864359      + |       exon   transcript4
#>                      gene_id old_gene_id match_level   gene_name
#>                  <character> <character>   <numeric> <character>
#>    [1] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>    [2] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>    [3] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>    [4] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>    [5] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>    ...                   ...         ...         ...         ...
#>   [52] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>   [53] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>   [54] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>   [55] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>   [56] ENSMUSG00000006498.14       GeneA           4       Ptbp1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths