Prediction of Influenza A virus infections in humans using an Artificial Neural Network learning approach
Charalambos Chrysostomou, Harris Partaourides, Huseyin Seker
Abstract — The Influenza type A virus can be considered one of the most severe viruses, able to infect multiple species with often fatal consequences for the hosts. The Haemagglutinin (HA) gene of the virus has the potential to be a target for antiviral drug development, realised through accurate identification of its subtypes and possibly the targeted hosts. In this paper, a method that accurately predicts whether an Influenza type A virus has the capability to infect human hosts, using only the HA gene, is therefore developed and tested. The predictive model follows three main steps: (i) converting the protein sequences into numerical signals using the EIIP amino acid scale, (ii) analysing these signals with the Discrete Fourier Transform (DFT) and extracting DFT-based features, and (iii) applying a predictive model, based on Artificial Neural Networks, to the features generated by the DFT. In this analysis, 30724, 18236 and 8157 HA protein sequences were collected from the Influenza Research Database for Human, Avian and Swine respectively. Given this set of proteins, the proposed method yielded an average training accuracy of 97.36% (± ...), ...

Index Terms — Artificial Neural Network, Amino Acid Indices, Discrete Fourier Transform (DFT), Hemagglutinin (HA) Protein
Computation-based Science and Technology Research Center, The Cyprus Institute, 20 Konstantinou Kavafi Street, 2121, Aglantzia, Nicosia, Cyprus
Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology, 30 Archbishop Kyprianou Str., 3036 Limassol, Cyprus
Department of Computer and Information Sciences, Faculty of Engineering and Environment, The University of Northumbria at Newcastle, NE1 8ST, Newcastle-upon-Tyne, The United Kingdom
*Corresponding author: Charalambos Chrysostomou

I. INTRODUCTION

The Influenza type A virus can be considered one of the most severe viruses that can infect both mammals and birds. The genome of the Influenza virus is composed of eight segments that can encode more than 11 proteins [1]. One of the most important proteins is the Haemagglutinin (HA), an essential glycoprotein and a principal surface antigen that is responsible for attaching the virions to hosts and for determining pathogenicity and virulence [1]. To date, 18 distinct Influenza A HA subtypes have been identified [2], [3].

The Influenza type A virus continually evolves due to its high mutation rate and the constant changes to its genome. This constant adaptation usually makes any new strain of the virus more pathogenic than the previous one. Furthermore, these mutations also provide the virus with the ability to cross the species barrier and may also affect the binding pattern of a virus, with catastrophic consequences for the species concerned [4].

In the literature, previous efforts have been made to analyse and characterise the phylogenetic diversity of the influenza type A virus, and to discover mechanisms that define its severity and distribution [5]–[7]. However, as the authors concluded, classification and characterisation of all the sequences with the proposed methods was difficult [5]–[7]; thus a more advanced method is needed. Computational studies exist that try to characterise and analyse the Influenza type A virus, with promising results [8]. In the proposed method, a computational, Artificial Neural Network (ANN) learning-based approach is created to predict whether a particular virus has the capability to infect humans, by analysing only the HA protein sequence.

The paper is organised as follows: Section II presents the methods and materials developed and used, while Section III presents the results obtained. Finally, concluding remarks are outlined in Section IV.

II. METHODS AND MATERIAL
A. Influenza A Hemagglutinin Proteins Data Set
For the proposed analysis, 57117 HA Influenza type A protein sequences were collected from the Influenza Research Database [9] for three species: Human, Avian and Swine. More specifically, as Table I shows, 30724, 18236 and 8157 HA protein sequences were collected for Human, Avian and Swine respectively. Table I also shows the number of sequences for each HA subtype, H1 to H18. Finally, Figure 1 illustrates the percentage of HA proteins per subtype and per species. To classify the HA protein sequences based on the ability of the virus to infect human hosts, the data were separated into two groups. The first group contained all the sequences, across subtypes H1 to H18, from viruses that have the ability to infect Human hosts, and the second group contained the sequences from viruses that have the potential to infect Avian and Swine hosts. The first and second groups contained 30724 and 26393 HA protein sequences respectively.
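The two-group labelling described above can be sketched as follows. This is only a minimal illustration built from the per-species totals quoted in the text; the names `SPECIES_COUNTS`, `POSITIVE_SPECIES` and `group_sizes` are hypothetical and are not taken from the original work.

```python
# Per-species totals quoted in the text (HA proteins, Influenza Research Database).
SPECIES_COUNTS = {"Human": 30724, "Avian": 18236, "Swine": 8157}

# Assumption for illustration: "Human" is the positive class, all other hosts are negative.
POSITIVE_SPECIES = {"Human"}


def group_sizes(counts, positive):
    """Return (n_positive, n_negative) after merging the species into two groups."""
    n_pos = sum(n for species, n in counts.items() if species in positive)
    n_neg = sum(n for species, n in counts.items() if species not in positive)
    return n_pos, n_neg


n_human, n_non_human = group_sizes(SPECIES_COUNTS, POSITIVE_SPECIES)
total = n_human + n_non_human
print(n_human, n_non_human, total)  # 30724 26393 57117
print(f"share of human-host sequences: {n_human / total:.3f}")
```

The printed share shows that the two groups are of comparable size, which is relevant when accuracy is used as the headline metric.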
Fig. 1. HA Protein Sequences

TABLE I
NUMBER OF HA PROTEIN SEQUENCES

HA Subtype   Human   Avian   Swine
H1           16145     650    5714
H2              96     462       2
H3           14055    1621    2138
H4               0    1478       5
H5             269    4242      32
H6               0    1529       2
H7             104    1809       1
H8               0     121       0
H9              13    3281      20
H10              4     834       1
H11              0     547       0
H12              0     178       0
H13              0     200       0
H14              0      17       0
H15              0      13       0
H16              0     150       0
H17              0       0       0
H18              0       0       0
Mixed           38    1104     242
Total        30724   18236    8157

B. Data Conversion and Normalisation

In this paper, digital signal processing techniques are used to extract information that can be directly used to characterise the HA proteins. In the literature, various methods have used signal processing in bioinformatics to analyse and characterise protein sequences [10]–[14], such as the complex resonant recognition model for analysing influenza A virus subtype protein sequences [10], CISAPS, the complex informational spectrum for the analysis of protein sequences [13], and the structural classification of protein sequences based on signal processing and support vector machines [14]. Furthermore, a previous study [15] that used signal processing to analyse influenza A HA proteins aimed to identify new therapeutic targets for drug development by better understanding the interaction of the influenza virus and its receptors.

For the proposed analysis, signal processing methods are used, and more specifically the Discrete Fourier Transform (DFT), as shown in equations 1-3. The analysis was performed directly on the absolute spectrum. Before applying the DFT to the HA protein sequences, the Electron-ion interaction potential (EIIP) [16], [17] amino acid index was used to convert the alphabetical sequences into numerical sequences. The complete list of the EIIP amino acid index can be found in Table II.

Discrete Fourier Transform (DFT)

$$X(n) = \sum_{m=0}^{N-1} x(m)\, e^{-j(2\pi/N)nm}, \qquad n = 0, 1, \ldots, N-1 \tag{1}$$

where X(n) are the DFT coefficients, N is the total number of points in the series and x(m) is the m-th member of the numerical series. As the DFT coefficients of a real-valued series contain two mirrored parts, only the coefficients n = 0, 1, ..., N/2 are used. The output of the DFT is a complex sequence and can be formulated as

$$X(n) = R(n) + jI(n), \qquad n = 0, 1, \ldots, N/2 \tag{2}$$

where R(n) and I(n) are the real and imaginary parts of the sequence, respectively.
The absolute spectrum S(n) can be formulated as

$$S(n) = X(n)\,X^{*}(n) = |X(n)|^{2}, \qquad n = 0, 1, \ldots, N/2 \tag{3}$$

where X(n) are the DFT coefficients of the series x(m) and X*(n) are their complex conjugates.

The coefficients of the absolute spectrum are used as a feature set to represent the characteristics of different classes of proteins' secondary structure. The HA influenza A virus protein sequences have different lengths, so zero-padding was used to extend all the protein sequences to N = 1024 before applying the DFT. After the DFT is applied, the absolute spectrum contains 513 features. These features are used as input to the ANN model.

TABLE II
EIIP VALUES

Amino acid       EIIP Value
Leucine          0.0000
Isoleucine       0.0000
Asparagine       0.0036
Glycine          0.0050
Glutamic acid    0.0057
Valine           0.0058
Proline          0.0198
Histidine        0.0242
Lysine           0.0371
Alanine          0.0373
Tyrosine         0.0516
Tryptophan       0.0548
Glutamine        0.0761
Methionine       0.0823
Serine           0.0829
Cysteine         0.0829
Threonine        0.0941
Phenylalanine    0.0946
Arginine         0.0959
Aspartic acid    0.1263
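The conversion and feature-extraction steps of this subsection can be sketched as follows. The sketch maps a sequence to the EIIP values of Table II, zero-pads it to N = 1024 and evaluates |X(n)|^2 for n = 0, ..., N/2 by summing the DFT definition directly. The one-letter keys, the function names `encode` and `spectrum_features`, the treatment of non-standard residues as 0.0 and the toy input fragment are assumptions made for illustration; they are not taken from the original implementation.

```python
import cmath

# EIIP values from Table II, keyed by one-letter amino acid code (keys are an assumption).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "E": 0.0057,
    "V": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}

N = 1024  # padded length stated in the text


def encode(sequence):
    """Map a protein sequence to EIIP values and zero-pad it to N points."""
    # Assumption: residues outside the 20 standard amino acids are mapped to 0.0.
    values = [EIIP.get(residue, 0.0) for residue in sequence.upper()]
    if len(values) > N:
        raise ValueError("sequence is longer than the padded length N")
    return values + [0.0] * (N - len(values))


def spectrum_features(sequence):
    """Return S(n) = |X(n)|^2 for n = 0..N/2, i.e. N/2 + 1 = 513 values."""
    x = encode(sequence)
    # Twiddle factors e^{-j 2 pi k / N}; e^{-j 2 pi n m / N} is looked up via (n * m) mod N.
    twiddle = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N)]
    length = len(sequence)  # the padded zeros contribute nothing to the sum
    features = []
    for n in range(N // 2 + 1):
        coeff = sum(x[m] * twiddle[(n * m) % N] for m in range(length))
        features.append(abs(coeff) ** 2)
    return features


# Toy fragment for illustration only, not a full HA sequence.
features = spectrum_features("MKAILVVLLYTFATANA")
print(len(features))  # 513
```

The direct summation is used here only because it mirrors equation (1) term by term; an FFT routine such as `numpy.fft.rfft` should return the same N/2 + 1 coefficients far more efficiently for a full data set.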
C. Artificial Neural Network - Experimental Evaluation
Artificial Neural Networks (ANNs) [18] are a computational method based on a large collection of artificial neurons, loosely mirroring the way a living brain solves problems. Each neuron connects to multiple other neurons and can reinforce or suppress their activation. ANNs are considered self-trained, rather than explicitly programmed, and are employed in research fields where feature discovery and classification are challenging for traditional classification systems.

For the proposed work, the ANN receives an input of 513 features derived from the pre-processing of the influenza type A virus HA protein sequence and returns the probability of the virus infecting humans. The binary classification data set consists of 57117 samples, of which 30724 can infect humans. The network consists of a single hidden layer of 128 units, with Glorot-style uniform initialization and rectified linear units as the activation function. To train the ANN, the Adam optimizer [19] was used with a mini-batch size of 128 for 200 epochs. We use 10-fold cross-validation and report the network performance based on accuracy. The model was implemented using the TensorFlow [20] and Keras [21] libraries. A visual representation of the model can be seen in Figure 2.
Fig. 2. Artificial Neural Network Model
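The architecture described above (513 inputs, one hidden layer of 128 rectified linear units, Glorot uniform initialization) can be sketched in plain Python as a single forward pass. This is not the authors' TensorFlow/Keras implementation and it does not include training; the function names, the zero-initialised biases, the fixed random seed and, in particular, the single sigmoid output unit are assumptions, since the text does not name the output activation.

```python
import math
import random

N_FEATURES, N_HIDDEN = 513, 128  # sizes stated in the text


def glorot_uniform(fan_in, fan_out, rng):
    """Glorot uniform initialisation: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]


def init_network(seed=0):
    """Create untrained parameters; biases start at zero (assumption)."""
    rng = random.Random(seed)
    return {
        "W1": glorot_uniform(N_FEATURES, N_HIDDEN, rng),  # hidden-layer weights, 128 x 513
        "b1": [0.0] * N_HIDDEN,
        "W2": glorot_uniform(N_HIDDEN, 1, rng),  # output-layer weights, 1 x 128
        "b2": [0.0],
    }


def forward(params, features):
    """Return the predicted probability for one 513-dimensional feature vector."""
    hidden = [
        max(0.0, sum(w * f for w, f in zip(row, features)) + b)  # ReLU activation
        for row, b in zip(params["W1"], params["b1"])
    ]
    logit = sum(w * h for w, h in zip(params["W2"][0], hidden)) + params["b2"][0]
    # Assumption: a single sigmoid output unit turns the logit into a probability.
    return 1.0 / (1.0 + math.exp(-logit))


params = init_network()
probability = forward(params, [0.01] * N_FEATURES)
print(0.0 < probability < 1.0)  # True
```

In practice the parameters would be fitted with the Adam optimizer, a mini-batch size of 128 and 200 epochs as stated in the text, typically through a library such as Keras rather than hand-written code.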
The high performance in both training and testing indicates that, for this type of problem, more advanced Neural Network models (such as Deep Neural Networks) and regularization techniques are not needed.

III. RESULTS AND DISCUSSION

In this paper, a classification model based on Artificial Neural Networks is presented for the analysis and classification of Influenza type A viruses according to their ability to infect a human host, using solely the HA protein sequence. To ensure that the proposed classification model is accurate and that the results can be generalised, 10-fold cross-validation was used. The predictive model achieved an average training accuracy, testing accuracy, precision, recall and MCC of 97.36% (± ...), ..., with the per-fold values listed in Table III.

TABLE III
ACCURACY RESULTS FOR THE PREDICTION OF INFLUENZA A VIRUS INFECTIONS

Fold      Training    Validation   Validation   Validation   Validation
          Accuracy    Accuracy     Precision    Recall       MCC
1         0.974       0.970        0.976        0.959        0.940
2         0.973       0.971        0.982        0.955        0.941
3         0.973       0.970        0.973        0.962        0.940
4         0.974       0.971        0.974        0.964        0.942
5         0.974       0.970        0.982        0.954        0.941
6         0.973       0.975        0.981        0.966        0.950
7         0.974       0.972        0.973        0.967        0.944
8         0.973       0.978        0.982        0.971        0.957
9         0.974       0.975        0.980        0.966        0.950
10        0.974       0.972        0.976        0.963        0.944
Average   97.36% (± ...)   ...     ...          ...          ...

... the 10-fold cross-validation yielded an average training accuracy, testing accuracy, precision, recall and MCC of 97.36% (± ...), ...

REFERENCES

[1] R. G. Webster, W. J. Bean, O. T. Gorman, T. M. Chambers, and Y. Kawaoka, “Evolution and ecology of influenza A viruses,”
Microbiological Reviews, vol. 56, no. 1, pp. 152–179, 1992.
[2] R. A. Fouchier, V. Munster, A. Wallensten, T. M. Bestebroer, S. Herfst, D. Smith, G. F. Rimmelzwaan, B. Olsen, and A. D. Osterhaus, “Characterization of a novel influenza A virus hemagglutinin subtype (H16) obtained from black-headed gulls,” Journal of Virology, vol. 79, no. 5, pp. 2814–2822, 2005.
[3] Y. Wu, Y. Wu, B. Tefsen, Y. Shi, and G. F. Gao, “Bat-derived influenza-like viruses H17N10 and H18N11,” Trends in Microbiology, vol. 22, no. 4, pp. 183–191, 2014.
[4] N. Nunthaboot, T. Rungrotmongkol, M. Malaisree, N. Kaiyawet, P. Decha, P. Sompornpisut, Y. Poovorawan, and S. Hannongbua, “Evolution of human receptor binding affinity of H1N1 hemagglutinins from 1918 to 2009 pandemic influenza A virus,” Journal of Chemical Information and Modeling, vol. 50, no. 8, pp. 1410–1417, 2010.
[5] J.-M. Chen, Y.-X. Sun, J.-W. Chen, S. Liu, J.-M. Yu, C.-J. Shen, X.-D. Sun, and D. Peng, “Panorama phylogenetic diversity and distribution of type A influenza viruses based on their six internal gene sequences,” Virology Journal, vol. 6, no. 1, p. 137, 2009.
[6] S. Liu, K. Ji, J. Chen, D. Tai, W. Jiang, G. Hou, J. Chen, J. Li, and B. Huang, “Panorama phylogenetic diversity and distribution of type A influenza virus,” PLoS One, vol. 4, no. 3, p. e5022, 2009.
[7] W. Shi, F. Lei, C. Zhu, F. Sievers, and D. G. Higgins, “A complete analysis of HA and NA genes of influenza A viruses,” PLoS One, vol. 5, no. 12, p. e14454, 2010.
[8] Z. Rehman, R. Zafar, U. Amir, U. H. Niazi, and A. Fahim, “Characterization of evolutionary changes in hemagglutinin of influenza H1N1 virus: a computational analysis,” VirusDisease, vol. 27, no. 1, pp. 34–40, 2016.
[9] R. B. Squires, J. Noronha, V. Hunt, A. García-Sastre, C. Macken, N. Baumgarth, D. Suarez, B. E. Pickett, Y. Zhang, C. N. Larsen et al., “Influenza research database: an integrated bioinformatics resource for influenza research and surveillance,” Influenza and Other Respiratory Viruses, vol. 6, no. 6, pp. 404–416, 2012.
[10] C. Chrysostomou, H. Seker, N. Aydin, and P. I. Haris, “Complex resonant recognition model in analysing influenza A virus subtype protein sequences,” in Information Technology and Applications in Biomedicine (ITAB), 2010 10th IEEE International Conference on. IEEE, 2010, pp. 1–4.
[11] C. J. Carmona, C. Chrysostomou, H. Seker, and M. del Jesus, “Fuzzy rules for describing subgroups from influenza A virus using a multi-objective evolutionary algorithm,” Applied Soft Computing, vol. 13, no. 8, pp. 3439–3448, 2013.
[12] C. Chrysostomou, H. Seker, and N. Aydin, “Effects of windowing and zero-padding on complex resonant recognition model for protein sequence analysis,” in Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE. IEEE, 2011, pp. 4955–4958.
[13] C. Chrysostomou, H. Seker, and N. Aydin, “CISAPS: Complex informational spectrum for the analysis of protein sequences,” Advances in Bioinformatics, vol. 2015, 2015.
[14] C. Chrysostomou and H. Seker, “Structural classification of protein sequences based on signal processing and support vector machines,” in Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the. IEEE, 2016, pp. 3088–3091.
[15] V. Veljkovic, N. Veljkovic, C. Muller, S. Muller, S. Glisic, V. Perovic, and H. Kohler, “Characterization of conserved properties of hemagglutinin of H5N1 and human influenza viruses: possible consequences for therapy and infection control,” BMC Structural Biology, vol. 9, no. 1, p. 21, 2009.
[16] V. Veljkovic, I. Cosic, B. Dimitrijevic, and D. Lalovic, “Is it possible to analyze DNA and protein sequences by the methods of digital signal processing?” IEEE Transactions on Biomedical Engineering, vol. 32, no. 5, pp. 337–341, 1985.
[17] K. Gopalakrishnan, R. Zadeh, K. Najarian, and A. Darvish, “Computational analysis and classification of p53 mutants according to primary structure,” in , 2004, Proceedings Paper, pp. 694–695.
[18] N. Gupta, “Artificial neural network,” Network and Complex Systems, vol. 3, no. 1, pp. 24–28, 2013.
[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[20] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467