AFDP: An Automated Function Description Prediction Approach to Improve Accuracy of Protein Function Predictions

Abstract

With the rapid growth of high-throughput biological sequencing technologies, and consequently of the amount of omics data produced, it is essential to develop automated methods to annotate the function of unknown genes and proteins. Existing tools such as AHRD use the characterizations of known proteins to annotate unknown ones. Other algorithms, such as eggNOG, use orthologous groups of proteins to detect the most probable function. However, while the available tools focus on detecting the most similar characterization, they are not able to generalize and integrate information from multiple homologs while maintaining accuracy. Here, we devise AFDP, an integrated approach for protein function prediction that combines two available tools, AHRD and eggNOG, to predict the function of novel proteins and to produce more precise human-readable descriptions by applying our stCFExt algorithm. stCFExt creates function descriptions from the manually curated descriptions available in Swiss-Prot. Using a benchmark dataset, we show that the annotations predicted by our approach are more accurate than eggNOG and AHRD annotations.

Samaneh Jozashoori, Amir Jozashoori, and Heiko Schoof

L3S Institute, Leibniz University of Hannover, Germany
TIB Leibniz Information Centre for Science and Technology, Germany
Azad University of Zanjan, Iran
University of Bonn, Germany

Proteins perform a wide variety of important functions within organisms and are necessary for maintaining metabolism and cellular structure. Hence, a major problem in understanding the molecular underpinnings of life lies in knowing the function of proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted scope have led to an increasing role for computational function prediction. Therefore, since the early stages of bioinformatics in the late 1980s, the development of high-performance and accurate computational tools for predicting the functions of newly identified proteins has been a major focus of the field.

S. Jozashoori et al., arXiv [q-bio.GN], Oct.

Fig. 1: Example of Representative Description Extraction. Based on sequence similarity, the unknown protein is predicted to belong to a cluster with three known proteins. The human-readable descriptions of these proteins differ in only one token, which appears in the middle of the descriptions. Extracting the longest common substring (LCS) only from the whole string would keep part S1 and exclude S2, which loses information. This example motivates the stCFExt algorithm to repeat the previous steps on the remaining prefix and suffix, i.e. S2 in this example.

The function of a protein depends on the sequence of amino acids from which it is built and on its structure, which determines the protein's possible interactions with other molecules. Sequence-based protein function prediction rests on the observation that finding a protein whose function has already been characterized experimentally, and whose sequence is significantly similar to that of the query protein, may reveal some functional aspects of the unknown protein [3].

Consequently, based on this similarity principle, methods such as BLAST [1] were developed to compare the sequence and structure of proteins, and databases such as UniProt were developed to organize functional information about proteins and to serve as references to be queried against.

Protein function prediction through similarity searches is based on the evolutionary principles of homology. Generally, orthologous genes preserve the same function as their ancestor, which makes identifying them crucial for reliably annotating the functions of the proteins they encode [9]. EggNOG [7] is one of the best available tools that provide orthologous groups of proteins and a cohesive functional annotation for each.

After proteins are clustered into protein families, one representative human-readable description needs to be assigned, along with the representative sequence, to indicate the function of the proteins within the group. An automated clustering tool follows one of these approaches to functionally annotate the protein families: (a) applying the function annotation that is already assigned to the representative sequence, as in the CD-HIT [8] algorithm; or (b) generating a function description from the available annotations of all protein members included in the same group, similar to the eggNOG approach. Due to the complexities of generating human-readable annotations automatically, there is still room for an improved algorithm. Therefore, we introduce the stCFExt algorithm along with our integrated approach for protein function prediction.

Algorithm 1 stCFExt algorithm

Require: A text file consisting of cluster names and information about all proteins included in each cluster, i.e. accession keys and human-readable descriptions.

Ensure: A text file including cluster names and one representative human-readable description for each cluster.

1: for i = 1 to numberOfClusters do
2:     Remove uninformative tokens from descriptions by applying the defined blacklist.
3:     Filter descriptions whose lengths are less than 80% of maximalLengthOfCluster.
4:     descriptions ← extract list of descriptions of the cluster
5:     st ← MakeSuffixTree(descriptions)
6:     lcs ← find the longest common substring in st whose length is at least 5% of the length of the longest description of the cluster
7:     prefix_descs, suffix_descs ← after removing lcs, extract the preceding and following remaining parts
8:     pre_lcs, suf_lcs ← repeat 4 and 5 for both prefix_descs and suffix_descs lists
9:     pre_prefix, pre_suffix, suf_prefix, suf_suffix ← repeat 6 for both pre_lcs and suf_lcs
10:    representativeDescription ← pre_prefix + pre_lcs + pre_suffix + lcs + suf_prefix + suf_lcs + suf_suffix
11:    Write representativeDescription to the output
12: return output
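Steps 2 and 3 of Algorithm 1 (blacklist cleaning and length-based filtering) might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the token list is the initial blacklist reported in the text, reading "(fragment/ts)" as "(fragment)" or "(fragments)" is our interpretation, and the function names, case-insensitive matching, and measuring length in characters after cleaning are all hypothetical choices.

```python
import re
from typing import List

# Initial blacklist reported in the text. Treating "(fragment/ts)" as
# "(fragment)" or "(fragments)" and ignoring case are assumptions.
BLACKLIST = re.compile(
    r"\bputative\b|\(fragments?\)|\btruncated\b|\bhomolog\b|\bprobable\b|\(predicted\)",
    flags=re.IGNORECASE,
)


def clean(description: str) -> str:
    """Remove blacklisted tokens and collapse the whitespace they leave behind."""
    return " ".join(BLACKLIST.sub(" ", description).split())


def filter_cluster(descriptions: List[str], min_ratio: float = 0.8) -> List[str]:
    """Clean every description, then drop those shorter than
    min_ratio times the longest cleaned description in the cluster."""
    cleaned = [clean(d) for d in descriptions]
    cleaned = [d for d in cleaned if d]  # discard descriptions that became empty
    if not cleaned:
        return []
    longest = max(len(d) for d in cleaned)
    return [d for d in cleaned if len(d) >= min_ratio * longest]
```

With these assumptions, a very short description such as "Kinase" in a cluster of longer descriptions is dropped as a likely outlier, while descriptions that differ only in blacklisted tokens become identical.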

The rest of the paper is structured as follows: Section 2 explains the pseudocode and details of the stCFExt algorithm, while Section 3 presents the AFDP approach. Section 4 reports the experimental evaluation of stCFExt. Finally, Section 5 presents our conclusions.

stCFExt (https://github.com/samiscoding/stCFExt), or suffix tree based Cluster Function Extraction, is an algorithm we propose for creating representative descriptions for protein families using the suffix tree [5], a fundamental data structure in string processing, and the available human-readable descriptions of protein sequences.

As shown in Algorithm 1, stCFExt starts by taking all human-readable descriptions of the protein sequences included in each protein cluster as input. Since the goal is a representative description, the first step of the algorithm is filtering and cleaning. To provide a generic insight into the function of each cluster, a blacklist of uninformative tokens needs to be defined and removed from the descriptions being processed. Experimentally, we generated an initial blacklist including the following tokens: "putative", "(fragment/ts)", "truncated", "homolog", "probable", and "(predicted)". Another case to consider is a cluster in which one sequence has a totally different description from the others. Since such sequences may be outliers, stCFExt differentiates between these clusters and those in which more than one description is dissimilar. Accordingly, the solitary sequence description is removed in two steps: (a) removal based on length, in which any sequence whose description is shorter than 80% of the longest description in the cluster is removed; and (b) removal based on informativeness, which happens in a later step.

After this preliminary filtration, a suffix tree is generated for each cluster, followed by detection of the Longest Common Substring (LCS) [6], on which a secondary filtration is performed to ensure that it is a meaningful and informative substring. Afterwards, LCS detection is performed on each of the remaining parts of the descriptions, i.e. the prefix and suffix substrings. The motivation for this later step is shown in Figure 1: in this example, the descriptions of the three proteins that are similar in sequence to the novel one differ in only one token, which is located in the middle of the whole string and divides it into two common substrings. Subsequently, the remaining prefixes and suffixes are either chosen based on their frequencies in the different descriptions or added to the common strings cumulatively.

Fig. 2: Automated Function Description Prediction Approach
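The recursive extraction described above (find the LCS, then repeat on the prefixes and suffixes that remain) can be sketched in Python as follows. This is a simplified illustration, not the authors' code: it swaps the suffix tree for a brute-force search, works on whitespace-separated tokens rather than characters, applies the 5% minimum length at every level, and simply drops the leftover prefixes and suffixes instead of choosing them by frequency. All function names and the example descriptions are hypothetical.

```python
from math import ceil
from typing import List, Sequence, Tuple

Tokens = Tuple[str, ...]


def find_run(desc: Tokens, run: Tokens) -> int:
    """Index of the first occurrence of `run` inside `desc`, or -1."""
    n = len(run)
    for i in range(len(desc) - n + 1):
        if desc[i:i + n] == run:
            return i
    return -1


def longest_common_run(descs: Sequence[Tokens], min_len: int = 1) -> Tokens:
    """Longest token run present in every description (brute force,
    standing in for the suffix-tree search of Algorithm 1)."""
    shortest = min(descs, key=len)
    for size in range(len(shortest), max(min_len, 1) - 1, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(find_run(d, candidate) >= 0 for d in descs):
                return candidate
    return ()


def representative(descs: List[Tokens], depth: int = 2,
                   min_frac: float = 0.05) -> Tokens:
    """Common run of the cluster, extended by the common runs of the
    parts before and after it (two levels, as in Algorithm 1)."""
    if depth == 0 or not descs:
        return ()
    min_len = ceil(min_frac * max(len(d) for d in descs))
    lcs = longest_common_run(descs, min_len)
    if not lcs:
        return ()
    prefixes, suffixes = [], []
    for d in descs:
        i = find_run(d, lcs)
        prefixes.append(d[:i])
        suffixes.append(d[i + len(lcs):])
    return (representative(prefixes, depth - 1, min_frac) + lcs
            + representative(suffixes, depth - 1, min_frac))


def representative_description(descriptions: List[str]) -> str:
    """String-level wrapper: tokenize on whitespace, extract, rejoin."""
    return " ".join(representative([tuple(d.split()) for d in descriptions]))
```

For three invented descriptions that differ only in a middle token, such as "serine threonine protein kinase alpha catalytic subunit" with "beta" and "gamma" variants, the sketch keeps both the part before and the part after the differing token, which is the behavior Figure 1 motivates.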

AFDP, or Automated Function Description Prediction, is an integrated approach we devise for protein function prediction. The first three steps process the available data about known proteins. As shown in Figure 2, the approach starts with clustering proteins into orthologous groups using eggNOG. Since eggNOG generates new identification numbers and representative descriptions for the clustered proteins and includes neither the UniProt accession keys nor the descriptions of the proteins, an intermediate step is required to retrieve the curated descriptions of the clustered proteins from UniProt. Afterwards, an informative representative description is generated for each cluster using stCFExt. This description is assigned to all unknown proteins in the cluster, thus providing a function prediction for each of them.
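The final annotation-transfer step can be illustrated with a small Python sketch. It assumes the clustering and the retrieval of curated descriptions have already been done and are available as plain dictionaries; the function name, the data layout, and the accession identifiers in the usage note are hypothetical, and `summarize` is only a stand-in for stCFExt.

```python
from typing import Callable, Dict, List


def annotate_unknown(
    clusters: Dict[str, List[str]],         # cluster id -> member accessions
    curated: Dict[str, str],                # accession -> curated description (known proteins only)
    summarize: Callable[[List[str]], str],  # stand-in for stCFExt
) -> Dict[str, str]:
    """Assign each cluster's representative description to its unknown members."""
    predictions: Dict[str, str] = {}
    for members in clusters.values():
        known = [curated[m] for m in members if m in curated]
        if not known:
            continue  # no curated description to generalize from
        description = summarize(known)
        for m in members:
            if m not in curated:
                predictions[m] = description
    return predictions
```

For example, with a cluster containing two curated proteins and one unannotated protein, the unannotated protein receives whatever description `summarize` derives from the two curated ones; clusters without any curated member yield no prediction.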

Fig. 3: Comparison between performance of stCFExt and eggNOG

To evaluate the performance of stCFExt, we compare the human-readable descriptions produced by eggNOG with those generated by stCFExt. In this experiment we address the following question: does stCFExt produce more precise human-readable function descriptions? Accordingly, the experiment is designed as follows:

Benchmark: We use a dataset of protein sequences randomly selected from clusters generated by eggNOG that have manually curated descriptions in Swiss-Prot (the reference descriptions). Their annotations were removed and they were used as input for four predictions: stCFExt, eggNOG, AHRD, and the best BLAST result in the Swiss-Prot database.

Metrics: We evaluate performance based on the number of words shared between the prediction and the reference description, and compute the F-score as the harmonic mean of precision and recall [4].
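A word-overlap F-score of this kind can be sketched in Python as below. The sketch assumes lowercased, whitespace-separated tokens and counts each distinct word once; the exact tokenization and weighting used by the AHRD evaluation may differ, and the function name and example strings are hypothetical.

```python
def f_score(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    predicted = set(prediction.lower().split())
    expected = set(reference.lower().split())
    shared = len(predicted & expected)
    if shared == 0:
        return 0.0  # also covers empty predictions or references
    precision = shared / len(predicted)  # shared words / predicted words
    recall = shared / len(expected)      # shared words / reference words
    return 2 * precision * recall / (precision + recall)
```

For a prediction "ATP synthase subunit beta" against the reference "ATP synthase subunit beta mitochondrial", all four predicted words are shared (precision 1.0) and four of the five reference words are recovered (recall 0.8), giving an F-score of 8/9.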

Experimental Setup: We use a version of AHRD [2] that supports competitor parameters to evaluate different methods simultaneously. Each competitor file includes the accession keys of the sequences of interest along with their human-readable descriptions. In this way, the output of our algorithm can be compared with the results of other algorithms in terms of precision and recall.

Discussion: As shown in Figure 3 and Table 1, the descriptions assigned by stCFExt achieve a higher evaluation score than the other predictions when compared to the Swiss-Prot curated descriptions, and in particular appear more precise than the eggNOG annotations (0.4323 versus 0.2061). The margin over the best BLAST hit (0.4254) is small.

Description creator                       Evaluation score
stCFExt                                   0.4323
eggNOG                                    0.2061
AHRD                                      0.2458
Best BLAST hit against Swiss-Prot DB      0.4254

Table 1: AHRD evaluation score for a query dataset using reference descriptions generated by different algorithms.

We introduced AFDP, an integrated approach for protein function prediction based on stCFExt, an algorithm to produce representative human-readable function descriptions. Our evaluation showed that our AFDP approach generates more precise protein annotations compared to eggNOG and other predictions. The key advantage is that high-quality protein orthologous groups from eggNOG can be utilized for human-readable function annotation of unknown proteins by generalizing from multiple descriptions.

References

1. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990.
2. T. G. Consortium et al. The tomato genome sequence provides insights into fleshy fruit evolution. Nature, 485(7400):635, 2012.
3. I. Friedberg. Automated protein function prediction - the genomic challenge. Briefings in bioinformatics, 7(3):225–242, 2006.
4. C. Goutte and E. Gaussier. A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In European Conference on Information Retrieval, pages 345–359. Springer, 2005.
5. M. Gower. An introduction to the suffix tree and its many applications. 2000.
6. D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, 1997.
7. J. Huerta-Cepas, D. Szklarczyk, K. Forslund, H. Cook, D. Heller, M. C. Walter, T. Rattei, D. R. Mende, S. Sunagawa, M. Kuhn, et al. eggnog 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic acids research, 44(D1):D286–D293, 2015.
8. W. Li and A. Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006.
9. E. M. Schwarz. Genomic classification of protein-coding gene families. In