Frontiers in Genetics | 2021

Automatic Prediction and Annotation: There Are Strong Biases for Multigenic Families

 
 

Abstract


In the last few decades, the explosion of genomic projects has produced huge sets of predicted genes and annotated sequences. Predicting a gene structure means determining the start and stop of the gene as well as the positions of any introns. Despite the number of high-performing gene prediction programs that combine ab initio and homology-based approaches (Mathe et al., 2002; Hoff and Stanke, 2015), the rate of mis-predicted genes is not negligible and can be attributed to several factors (Scalzitti et al., 2020). For example, unusually long introns, short exons, or long genes can yield incomplete or partially predicted gene structures; short intergenic regions can lead to gene fusions; DNA sequencing errors (nucleotide deletions or insertions) that introduce frameshifts can affect predictions; and non-canonical splice sites, overlapping genes, and genes located within introns are further sources of erroneous predictions. Because of their high sequence identity and duplication rate, multigenic families are at even greater risk of mis-prediction (Figure 1; Fawal et al., 2014). In addition, protein annotation or function assignment, based on the presence of a hypothetical protein domain or on homology with known proteins, can also lead to inappropriate annotation. The risk of mis-annotation is high for proteins that contain multiple domains or small domains shared by several classes of proteins. For example, the PFAM domain PF07992 (Pyridine nucleotide-disulphide oxidoreductase) is detected in MonoDehydroAscorbate Reductases (MDARs), Glutathione Reductases (GRs), and the Thioredoxin family (Trx), but it does not discriminate between these three families (Table 1). Mis-annotations are also observed for proteins belonging to superfamilies with a conserved domain and a large number of protein families and classes.
As an example, 198 genes of the MYB superfamily have been detected in Arabidopsis thaliana (Yanhui et al., 2006), but the PFAM domain PF00249 (Myb_DNA-binding) does not discriminate between the R2R3-MYB, R1R2R3-MYB, MYB-related, and atypical MYB families. In addition, the PF00249 entry also covers the SANT domain, which is structurally very similar to the Myb domain but functionally divergent. Using this PFAM entry to extract MYB proteins therefore returns many false positives (a total of 326 sequences from A. thaliana).
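
The domain-based selection problem described above can be illustrated with a minimal Python sketch. It is not the authors' method: the protein identifiers, the toy table, and the helper names (`select_by_domain`, `families_hit`, `min_false_positive_fraction`) are hypothetical and introduced only for illustration. The last helper turns the two counts quoted above (198 MYB genes, 326 sequences retrieved with PF00249) into a rough lower bound on the false-positive fraction, under the assumption that genes and retrieved sequences can be compared one to one.

```python
from collections import Counter

# Toy annotation table: (protein_id, curated_family, pfam_accessions).
# Protein IDs are hypothetical placeholders, not real database entries.
TOY_PROTEINS = [
    ("prot_A", "MDAR", {"PF07992"}),
    ("prot_B", "GR", {"PF07992"}),
    ("prot_C", "Trx", {"PF07992"}),
    ("prot_D", "R2R3-MYB", {"PF00249"}),
    ("prot_E", "MYB-related", {"PF00249"}),
    ("prot_F", "SANT", {"PF00249"}),
]


def select_by_domain(proteins, accession):
    """Return the proteins whose domain set contains the given Pfam accession."""
    return [p for p in proteins if accession in p[2]]


def families_hit(proteins, accession):
    """Count the curated families among proteins selected by one accession."""
    selected = select_by_domain(proteins, accession)
    return Counter(family for _, family, _ in selected)


def min_false_positive_fraction(n_retrieved, n_true):
    """Lower bound on the false-positive fraction of a retrieved set.

    At most n_true retrieved sequences can be genuine members, so at
    least (n_retrieved - n_true) of them must be false positives.
    """
    return (n_retrieved - n_true) / n_retrieved


if __name__ == "__main__":
    # One accession, three distinct curated families in the toy table.
    print(families_hit(TOY_PROTEINS, "PF07992"))
    # Counts quoted in the text: 326 retrieved sequences, 198 MYB genes.
    print(round(min_false_positive_fraction(326, 198), 3))
```

In the toy table a single accession selects members of three different families, so the domain hit alone cannot assign a family. With the quoted counts, at least 128 of the 326 retrieved sequences (roughly 39%) would not be MYB proteins, if the one-to-one assumption holds.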

Volume 12
DOI 10.3389/fgene.2021.697477
Language English
Journal Frontiers in Genetics

Full Text