This FlyBase analysis synthesizes the results of 22 publications, of both single-gene and genome-wide scope, that have sought to identify protein coding genes derived by retrotransposition. A total of 685 calls for 420 putative retrotransposed protein coding genes were assessed, of which 142 genes were deemed to represent well supported retrotransposed protein coding genes. These 142 well supported retrogenes were annotated with the SO term 'retrotransposed_protein_coding_gene' (SO:0000569). This term is presented in the 'Gene Model and Products > Sequence Ontology: Class of Gene' section of the gene report. All genes with this annotation can be retrieved using the FlyBase Vocabularies tool (http://flybase.org/cgi-bin/cvreport.pl?rel=is_a&id=SO:0000569).
The publications assessed herein are listed in the "Related Publication(s)" section of this FlyBase Reference report. All published calls and subsequent FlyBase analysis are presented in the associated spreadsheet, "FB_retrogenes.2014.8.5.xlsx". The 420 putative retrogenes analyzed, whether or not they were ultimately deemed to represent well supported retrogenes, are listed in the "Data From Reference > Genes" section of this FlyBase Reference report. All 685 putative retrogene assertions from the 22 publications are represented by standardized comments with the key words 'derived' and 'retroposition' in the 'Other Information > Relationship to Other Genes' section of the gene report, attributed to the original publication.
Methods:
Gene copies may arise by retrotransposition, in which the mRNA of a parental gene is reverse transcribed and re-inserted into the genome. As these retrocopies are mRNA-derived, they characteristically lack the introns and flanking sequence of their parental genes. A number of independent studies, at both single gene and genome-wide levels, have sought to identify protein coding genes derived by retrotransposition; these publications were identified by searching abstracts for the terms 'retrogene', 'retroposition' or 'retrotransposition'. This yielded 685 calls for 420 putative retrotransposed protein coding genes from 22 publications, including seven large lists of putative retrogenes from genome-wide studies, three small lists of putative retrogenes from more focussed genome-wide studies, and 12 single gene papers.
The data were reviewed to identify a set of well supported retrotransposed protein coding genes. In a first pass, genes from small studies, and genes called in at least two different high throughput studies, were selected (n = 202). These first pass genes were then reviewed to confirm a quality alignment between the putative retrogene and a plausible parental gene that demonstrated intron loss, substantiating the inference of retrotransposition. Calls from two genome-wide studies (Langille and Clark, 2007 and Zhang et al., 2011) were accepted because their alignment methods were stringent enough to ensure that the aligned region spanned a lost intron (confirmed by sampling of these calls). The other genome-wide studies used less stringent criteria: they reported pairs of similar genes in which only one of the pair lacked introns, without confirming that the alignment between the two spanned an exon junction in the parental gene, so partial DNA-based duplication could not be ruled out as an alternative to retroposition. First pass genes not reported in the two aforementioned higher quality genome-wide studies were therefore curator-reviewed using manual BLAST alignments. Putative retrogenes with conflicting parental gene assignments, and different retrogenes sharing the same parental gene assignment, were also curator-reviewed. Retrogenes for which no specific parental gene was reported in the study were rejected. A total of 142 genes were accepted in the second pass and annotated with the SO term 'retrotransposed_protein_coding_gene'.
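The selection logic described above can be illustrated with a minimal sketch. It is not the procedure FlyBase actually ran, which relied on curator judgement and manual BLAST alignments. The Call record, its field names, the "large"/"small" study labels and the function names are all hypothetical. The sketch assumes that "small studies" covers every study not labelled "large", and that conflicting or shared parental gene assignments send a gene to curator review even when it was called by one of the two trusted studies.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for one published retrogene call (not a FlyBase schema).
@dataclass(frozen=True)
class Call:
    gene: str               # putative retrogene
    study: str              # publication making the call
    study_type: str         # "large" (genome-wide list) or "small"; assumed labels
    parent: Optional[str]   # reported parental gene, None if not specified

# The two genome-wide studies whose calls were accepted, per the text.
TRUSTED = {"Langille and Clark, 2007", "Zhang et al., 2011"}

def first_pass(calls):
    """Select genes from small studies, or called by >= 2 distinct large studies."""
    large_by_gene = defaultdict(set)
    from_small = set()
    for c in calls:
        if c.study_type == "large":
            large_by_gene[c.gene].add(c.study)
        else:
            from_small.add(c.gene)
    multi_called = {g for g, studies in large_by_gene.items() if len(studies) >= 2}
    return from_small | multi_called

def triage(calls, selected):
    """Route each first pass gene to 'accept', 'reject' or 'curator_review'."""
    parents = defaultdict(set)    # gene -> reported parental genes
    studies = defaultdict(set)    # gene -> studies calling it
    children = defaultdict(set)   # parental gene -> retrogenes assigned to it
    for c in calls:
        if c.gene not in selected:
            continue
        studies[c.gene].add(c.study)
        if c.parent is not None:
            parents[c.gene].add(c.parent)
            children[c.parent].add(c.gene)
    outcome = {}
    for gene in selected:
        if not parents[gene]:
            outcome[gene] = "reject"            # no specific parental gene reported
        elif len(parents[gene]) > 1 or any(len(children[p]) > 1 for p in parents[gene]):
            outcome[gene] = "curator_review"    # conflicting or shared parent
        elif studies[gene] & TRUSTED:
            outcome[gene] = "accept"            # stringent alignment method
        else:
            outcome[gene] = "curator_review"    # needs manual alignment check
    return outcome

# Toy input with made-up gene and study names, for illustration only.
calls = [
    Call("geneA", "Langille and Clark, 2007", "large", "parentA"),
    Call("geneA", "StudyX", "large", "parentA"),
    Call("geneB", "StudyX", "large", "parentB"),
    Call("geneC", "SmallStudy", "small", None),
    Call("geneD", "StudyX", "large", "parentD"),
    Call("geneD", "StudyY", "large", "parentE"),
]
selected = first_pass(calls)
decisions = triage(calls, selected)
```

In this toy input, geneB is dropped in the first pass because only one large study calls it. The "curator_review" outcome stands in for the manual step and does not itself decide acceptance.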