
Prediction of Influenza A virus infections in humans using an Artificial Neural Network learning approach

Charalambos Chrysostomou, Harris Partaourides, Huseyin Seker

Abstract

The Influenza type A virus can be considered one of the most severe viruses that can infect multiple species, often with fatal consequences for the hosts. The Haemagglutinin (HA) gene of the virus has the potential to be a target for antiviral drug development, realised through accurate identification of its sub-types and possibly the targeted hosts. In this paper, a method to accurately predict whether an Influenza type A virus has the capability to infect human hosts, using only the HA gene, is therefore developed and tested. The predictive model follows three main steps: (i) converting the protein sequences into numerical signals using the EIIP amino acid scale, (ii) analysing these signals using the Discrete Fourier Transform (DFT) and extracting DFT-based features, and (iii) using a predictive model, based on Artificial Neural Networks, with the features generated by DFT. In this analysis, 30724, 18236 and 8157 HA protein sequences were collected from the Influenza Research Database for Human, Avian and Swine hosts, respectively. Given this set of proteins, the proposed method yielded 97.36% (± 0.04%), 97.26% (± 0.26%), 0.978 (± 0.004), 0.963 (± 0.005) and 0.945 (± 0.005) for the training accuracy, validation accuracy, precision, recall and Matthews Correlation Coefficient (MCC), respectively, based on 10-fold cross-validation. The classification model, generated using one of the largest datasets, if not the largest, yields promising results that could lead to early detection of such viruses and help develop precautionary measures for possible human infections.


Index Terms: Artificial Neural Network, Amino Acid Indices, Discrete Fourier Transform (DFT), Hemagglutinin (HA) Protein

Author affiliations: Computation-based Science and Technology Research Center, The Cyprus Institute, 20 Konstantinou Kavafi Street, 2121, Aglantzia, Nicosia, Cyprus; Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology, 30 Archbishop Kyprianou Str., 3036 Limassol, Cyprus; Department of Computer and Information Sciences, Faculty of Engineering and Environment, The University of Northumbria at Newcastle, NE1 8ST, Newcastle-upon-Tyne, United Kingdom. Corresponding author: Charalambos Chrysostomou.

I. INTRODUCTION

The Influenza type A virus can be considered one of the most severe viruses that can infect both mammals and birds. The genome of the Influenza virus is composed of eight segments that can encode more than 11 proteins [1]. One of the most important proteins is Haemagglutinin (HA), an essential glycoprotein and a principal surface antigen that is responsible for attaching the virions to hosts and that determines pathogenicity and virulence [1]. To date, 18 distinct Influenza A HA subtypes have been identified [2], [3].

The Influenza type A virus continually evolves due to its high mutation rate and the constant changes to its genome.

This constant adaptation usually makes any new strain of the virus more pathogenic than the previous one. Furthermore, these mutations also provide the virus with the ability to cross the species barrier and may also affect the binding pattern of a virus, with catastrophic consequences for the species concerned [4].

In the literature, previous efforts have been made to analyse and characterise the phylogenetic diversity, and to discover mechanisms that define the severity and distribution, of the influenza type A virus [5]–[7]. However, as the authors concluded, classification and characterisation of all the sequences with the proposed methods was difficult [5]–[7]; thus a more advanced method is needed. Computational studies exist that try to characterise and analyse Influenza type A, with promising results [8]. In the proposed method, a computational approach based on Artificial Neural Network (ANN) learning is created to predict whether a particular virus has the capability to infect humans, by analysing only the HA protein sequence.

The paper is organised as follows: Section II presents the methods and materials developed and used, while Section III presents the results obtained. Finally, concluding remarks are outlined in Section IV.

II. METHODS AND MATERIAL

A. Influenza A Hemagglutinin Proteins Data Set

For the proposed analysis, 57117 HA Influenza type A protein sequences were collected from the Influenza Research Database [9] for three species: Human, Avian and Swine. More specifically, as Table I shows, 30724, 18236 and 8157 HA protein sequences were collected for Human, Avian and Swine, respectively. Table I also shows the number of sequences for each HA subtype, H1 to H18. Finally, Figure 1 illustrates the percentage of HA proteins per subtype and per species. To classify the HA protein sequences by the ability of the virus to infect human hosts, the data were separated into two groups. The first group contained all the sequences from HA 1-18 for the viruses that have the ability to infect human hosts, and the second group contained the sequences of viruses that have the potential to infect Avian and Swine hosts. The first and second groups comprised 30724 and 26393 HA protein sequences, respectively.
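The two-group split described above can be checked with a few lines of Python. This is only an illustrative sketch: the per-species totals come from Table I, while the names `COUNTS`, `binary_label` and `group_sizes` are hypothetical and not taken from the paper.

```python
# Sequence counts per host species, taken from the totals row of Table I.
COUNTS = {"Human": 30724, "Avian": 18236, "Swine": 8157}


def binary_label(host: str) -> int:
    """Return 1 for human-host sequences and 0 for avian or swine ones."""
    return 1 if host == "Human" else 0


def group_sizes(counts: dict) -> dict:
    """Aggregate per-species counts into the two classification groups."""
    sizes = {0: 0, 1: 0}
    for host, n in counts.items():
        sizes[binary_label(host)] += n
    return sizes


sizes = group_sizes(COUNTS)
total = sum(sizes.values())
# Fraction of the data set that belongs to the human-host group.
human_fraction = sizes[1] / total
print(sizes, total, round(human_fraction, 3))
```

The two groups are of broadly similar size, with the human-host group making up slightly more than half of the 57117 sequences.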

B. Data conversion and Normalisation

In this paper, digital signal processing techniques are used to extract information that can be directly used to characterise the HA proteins. In the literature, various methods have used signal processing in bioinformatics for analysing and characterising protein sequences [10]–[14], such as the complex resonant recognition model for analysing influenza A virus subtype protein sequences [10], CISAPS, the complex informational spectrum for the analysis of protein sequences [13], and the structural classification of protein sequences based on signal processing and support vector machines [14]. Furthermore, a previous study [15] in which signal processing was used to analyse influenza A HA proteins aimed to identify new therapeutic targets for drug development by better understanding the interaction of the influenza virus and its receptors.

Fig. 1. HA Protein Sequences

TABLE I
NUMBER OF HA PROTEIN SEQUENCES

HA Subtype   Human   Avian   Swine
H1           16145     650    5714
H2              96     462       2
H3           14055    1621    2138
H4               0    1478       5
H5             269    4242      32
H6               0    1529       2
H7             104    1809       1
H8               0     121       0
H9              13    3281      20
H10              4     834       1
H11              0     547       0
H12              0     178       0
H13              0     200       0
H14              0      17       0
H15              0      13       0
H16              0     150       0
H17              0       0       0
H18              0       0       0
Mixed           38    1104     242
Total        30724   18236    8157

For the proposed analysis, signal processing methods are used, more specifically the Discrete Fourier Transform (DFT), as shown in equations 1-3. The analysis was performed directly on the absolute spectrum. Before applying DFT to the HA protein sequences, the Electron-ion interaction potential (EIIP) [16], [17] amino acid index was used to convert the alphabetical sequences into numerical sequences. The complete list of the EIIP amino acid index can be found in Table II.

The Discrete Fourier Transform (DFT) is defined as

$$X(n) = \sum_{m=0}^{N-1} x(m)\, e^{-j(2\pi/N)nm}, \quad n = 0, 1, \ldots, N-1 \tag{1}$$

where $X(n)$ are the DFT coefficients, $N$ is the total number of points in the series and $x(m)$ is the $m$th member of the numerical series. As the DFT coefficients contain two mirror parts, only the $N/2$ points of the series are used.

The output of DFT is a complex sequence and can be formulated as

$$X(n) = R(n) + jI(n), \quad n = 0, 1, \ldots, (N-1)/2 \tag{2}$$

where $R(n)$ and $I(n)$ are the real and imaginary parts of the sequence, respectively.
The absolute spectrum $S(n)$ can be formulated as

$$S(n) = X(n)\, X^{*}(n) = |X(n)|^{2}, \quad n = 0, 1, \ldots, (N-1)/2 \tag{3}$$

where $X(n)$ are the DFT coefficients of the series $x(n)$ and $X^{*}(n)$ are their complex conjugates.

The coefficients from the absolute spectrum are used as a feature set to represent the characteristics of different classes of proteins' secondary structure. The HA influenza A virus protein sequences have different lengths, so zero-padding was used to extend all the protein sequences to N = 1024 before applying DFT. After DFT is applied, the absolute spectrum comprises 513 features. These features are used as input to the ANN model.

TABLE II
EIIP VALUES

Amino acid       EIIP Value
Leucine          0.0000
Isoleucine       0.0000
Asparagine       0.0036
Glycine          0.0050
Glutamic acid    0.0057
Valine           0.0058
Proline          0.0198
Histidine        0.0242
Lysine           0.0371
Alanine          0.0373
Tyrosine         0.0516
Tryptophan       0.0548
Glutamine        0.0761
Methionine       0.0823
Serine           0.0829
Cysteine         0.0829
Threonine        0.0941
Phenylalanine    0.0946
Arginine         0.0959
Aspartic acid    0.1263
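The feature extraction steps described in this section (EIIP encoding, zero-padding to N = 1024, DFT, spectrum of 513 values) could be sketched in Python with NumPy roughly as follows. This is a minimal illustration and not the authors' implementation. The one-letter amino acid codes, the function names, the handling of non-standard residues and the example sequence are all assumptions; the spectrum is computed as the product of each coefficient with its complex conjugate, and taking the square root would give the magnitude instead.

```python
import numpy as np

# EIIP values from Table II, keyed by the standard one-letter amino acid codes
# (the one-letter keys are an assumption; the table lists full names).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "E": 0.0057,
    "V": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}

N_POINTS = 1024  # zero-padded signal length stated in the text


def encode_eiip(sequence: str) -> np.ndarray:
    """Map a protein sequence to its EIIP numerical signal."""
    # Assumption: residues outside the 20 standard amino acids map to 0.0.
    return np.array([EIIP.get(aa, 0.0) for aa in sequence.upper()], dtype=float)


def dft_features(sequence: str, n_points: int = N_POINTS) -> np.ndarray:
    """Return the n_points // 2 + 1 spectrum values of a zero-padded sequence."""
    signal = encode_eiip(sequence)
    if len(signal) > n_points:
        raise ValueError("sequence is longer than the padded length")
    # rfft zero-pads the real signal to n_points and returns only the
    # non-mirrored half of the DFT: n_points // 2 + 1 complex coefficients.
    coeffs = np.fft.rfft(signal, n=n_points)
    # Spectrum as the coefficient times its complex conjugate, i.e. |X(n)|^2.
    return (coeffs * np.conj(coeffs)).real


# Arbitrary short peptide used only to exercise the functions.
example_sequence = "MKTIIALSYIFCLVFA"
features = dft_features(example_sequence)
print(features.shape)
```

With N = 1024, `np.fft.rfft` returns 513 coefficients, which matches the number of features stated in the text.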

C. Artificial Neural Network - Experimental Evaluation

Artificial Neural Networks (ANN) [18] are a computational method based on an extensive collection of artificial neurons, which mirrors the way a living brain solves problems. Each neuron connects to multiple other neurons and can reinforce or repress the activation of the neurons it connects to. ANNs are considered self-trained rather than explicitly programmed, and are employed in research fields where feature discovery and classification are challenging for traditional classification systems.

For the proposed work, the ANN receives an input of 513 features derived from the pre-processing of the influenza type A virus sequences and returns the probability of the virus infecting humans. The binary classification data set consists of 57117 samples, of which 30724 can infect humans. The network consists of a single hidden layer of 128 units, with Glorot-style uniform initialisation and rectified linear units as the activation function. To train the ANN, the Adam optimizer [19] was used with a mini-batch size of 128 for 200 epochs. We use 10-fold cross-validation and report the network performance based on accuracy. The model was implemented using the TensorFlow [20] and Keras [21] libraries. A visual representation of the model can be seen in Figure 2.
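To make the architecture concrete, the forward pass of such a network can be sketched in plain NumPy. This is not the TensorFlow/Keras model used in the paper; it only illustrates the stated layer sizes, the Glorot uniform initialisation and the ReLU hidden layer. The single sigmoid output unit, the zero-initialised biases and all function names are assumptions, and training with Adam is not shown.

```python
import numpy as np

N_FEATURES = 513  # DFT-based input features
N_HIDDEN = 128    # units in the single hidden layer


def glorot_uniform(rng, fan_in, fan_out):
    """Sample weights uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def init_params(seed=0):
    """Create untrained parameters for a 513-128-1 network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": glorot_uniform(rng, N_FEATURES, N_HIDDEN),
        "b1": np.zeros(N_HIDDEN),  # assumption: biases start at zero
        "W2": glorot_uniform(rng, N_HIDDEN, 1),
        "b2": np.zeros(1),
    }


def predict_proba(params, features):
    """Forward pass: ReLU hidden layer, then a sigmoid output (assumed)."""
    hidden = np.maximum(0.0, features @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))


params = init_params()
# Random inputs stand in for real feature vectors in this sketch.
batch = np.random.default_rng(1).random((4, N_FEATURES))
probabilities = predict_proba(params, batch)
print(probabilities.shape)
```

Each output value lies strictly between 0 and 1 and would be read as the predicted probability that the virus can infect humans once the weights have been trained.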

Fig. 2. Artificial Neural Network Model
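The evaluation protocol, 10-fold cross-validation scored with accuracy, precision, recall and MCC, can be sketched with the Python standard library alone. The fold construction below (a seeded shuffle followed by striding) is an assumption, since the paper does not state how the folds were drawn, and the helper names are hypothetical.

```python
import math
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # MCC is conventionally set to 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy labels used only to exercise the metric function.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
splits = list(k_fold_indices(25, k=10, seed=1))
print(toy_metrics, len(splits))
```

In a full run, a model would be trained on each `train` split and scored on the matching `validation` split, and the per-fold metrics would then be averaged.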

The high performance in both training and testing shows that, for this type of problem, more advanced Neural Network models (such as Deep Neural Networks) and regularization techniques are not needed.

III. RESULTS AND DISCUSSION

In this paper, a classification model based on Artificial Neural Networks is presented for the analysis and classification of Influenza type A viruses by their ability to infect a human host, using solely the HA protein sequence. To ensure that the proposed classification model is accurate and that the results can be generalised, 10-fold cross-validation was used. The predictive model achieved average training accuracy, testing accuracy, precision, recall and MCC of 97.36% (± 0.04%), 97.26% (± 0.26%), 0.978 (± 0.004), 0.963 (± 0.005) and 0.945 (± 0.005), respectively.

TABLE III
ACCURACY RESULTS FOR THE PREDICTION OF INFLUENZA A VIRUS INFECTIONS

Fold      Training Accuracy   Validation Accuracy   Validation Precision   Validation Recall   Validation MCC
1         0.974               0.970                 0.976                  0.959               0.940
2         0.973               0.971                 0.982                  0.955               0.941
3         0.973               0.970                 0.973                  0.962               0.940
4         0.974               0.971                 0.974                  0.964               0.942
5         0.974               0.970                 0.982                  0.954               0.941
6         0.973               0.975                 0.981                  0.966               0.950
7         0.974               0.972                 0.973                  0.967               0.944
8         0.973               0.978                 0.982                  0.971               0.957
9         0.974               0.975                 0.980                  0.966               0.950
10        0.974               0.972                 0.976                  0.963               0.944
Average   97.36% (± 0.04%)    97.26% (± 0.26%)      0.978 (± 0.004)        0.963 (± 0.005)     0.945 (± 0.005)

The 10-fold cross-validation yielded average training accuracy, testing accuracy, precision, recall and MCC of 97.36% (± 0.04%), 97.26% (± 0.26%), 0.978 (± 0.004), 0.963 (± 0.005) and 0.945 (± 0.005), respectively.

REFERENCES

[1] R. G. Webster, W. J. Bean, O. T. Gorman, T. M. Chambers, and Y. Kawaoka, “Evolution and ecology of influenza A viruses,” Microbiological reviews, vol. 56, no. 1, pp. 152–179, 1992.
[2] R. A. Fouchier, V. Munster, A. Wallensten, T. M. Bestebroer, S. Herfst, D. Smith, G. F. Rimmelzwaan, B. Olsen, and A. D. Osterhaus, “Characterization of a novel influenza A virus hemagglutinin subtype (H16) obtained from black-headed gulls,” Journal of virology, vol. 79, no. 5, pp. 2814–2822, 2005.
[3] Y. Wu, Y. Wu, B. Tefsen, Y. Shi, and G. F. Gao, “Bat-derived influenza-like viruses H17N10 and H18N11,” Trends in microbiology, vol. 22, no. 4, pp. 183–191, 2014.
[4] N. Nunthaboot, T. Rungrotmongkol, M. Malaisree, N. Kaiyawet, P. Decha, P. Sompornpisut, Y. Poovorawan, and S. Hannongbua, “Evolution of human receptor binding affinity of H1N1 hemagglutinins from 1918 to 2009 pandemic influenza A virus,” Journal of chemical information and modeling, vol. 50, no. 8, pp. 1410–1417, 2010.
[5] J.-M. Chen, Y.-X. Sun, J.-W. Chen, S. Liu, J.-M. Yu, C.-J. Shen, X.-D. Sun, and D. Peng, “Panorama phylogenetic diversity and distribution of type A influenza viruses based on their six internal gene sequences,” Virology Journal, vol. 6, no. 1, p. 137, 2009.
[6] S. Liu, K. Ji, J. Chen, D. Tai, W. Jiang, G. Hou, J. Chen, J. Li, and B. Huang, “Panorama phylogenetic diversity and distribution of type A influenza virus,” PLoS One, vol. 4, no. 3, p. e5022, 2009.
[7] W. Shi, F. Lei, C. Zhu, F. Sievers, and D. G. Higgins, “A complete analysis of HA and NA genes of influenza A viruses,” PLoS One, vol. 5, no. 12, p. e14454, 2010.
[8] Z. Rehman, R. Zafar, U. Amir, U. H. Niazi, and A. Fahim, “Characterization of evolutionary changes in hemagglutinin of influenza H1N1 virus: a computational analysis,” VirusDisease, vol. 27, no. 1, pp. 34–40, 2016.
[9] R. B. Squires, J. Noronha, V. Hunt, A. García-Sastre, C. Macken, N. Baumgarth, D. Suarez, B. E. Pickett, Y. Zhang, C. N. Larsen et al., “Influenza research database: an integrated bioinformatics resource for influenza research and surveillance,” Influenza and other respiratory viruses, vol. 6, no. 6, pp. 404–416, 2012.
[10] C. Chrysostomou, H. Seker, N. Aydin, and P. I. Haris, “Complex resonant recognition model in analysing influenza A virus subtype protein sequences,” in Information Technology and Applications in Biomedicine (ITAB), 2010 10th IEEE International Conference on. IEEE, 2010, pp. 1–4.
[11] C. J. Carmona, C. Chrysostomou, H. Seker, and M. del Jesus, “Fuzzy rules for describing subgroups from influenza A virus using a multi-objective evolutionary algorithm,” Applied Soft Computing, vol. 13, no. 8, pp. 3439–3448, 2013.
[12] C. Chrysostomou, H. Seker, and N. Aydin, “Effects of windowing and zero-padding on complex resonant recognition model for protein sequence analysis,” in Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE. IEEE, 2011, pp. 4955–4958.
[13] C. Chrysostomou, H. Seker, and N. Aydin, “CISAPS: Complex informational spectrum for the analysis of protein sequences,” Advances in bioinformatics, vol. 2015, 2015.
[14] C. Chrysostomou and H. Seker, “Structural classification of protein sequences based on signal processing and support vector machines,” in Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the. IEEE, 2016, pp. 3088–3091.
[15] V. Veljkovic, N. Veljkovic, C. Muller, S. Muller, S. Glisic, V. Perovic, and H. Kohler, “Characterization of conserved properties of hemagglutinin of H5N1 and human influenza viruses: possible consequences for therapy and infection control,” BMC Structural Biology, vol. 9, no. 1, p. 21, 2009.
[16] V. Veljkovic, I. Cosic, B. Dimitrijevic, and D. Lalovic, “Is it possible to analyze DNA and protein sequences by the methods of digital signal processing?” IEEE Transaction on Biomedical Engineering, vol. 32, no. 5, pp. 337–341, 1985.
[17] K. Gopalakrishnan, R. Zadeh, K. Najarian, and A. Darvish, “Computational analysis and classification of p53 mutants according to primary structure,” 2004, Proceedings Paper, pp. 694–695.
[18] N. Gupta, “Artificial neural network,” Network and Complex Systems, vol. 3, no. 1, pp. 24–28, 2013.
[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[20] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467