SARS-Cov-2 RNA Sequence Classification Based on Territory Information
arXiv preprint [q-bio.QM]
Jingwei Liu ∗
School of Mathematical Sciences, Beihang University, Beijing, 102206, P.R. China
Abstract
Genetic analysis of COVID-19 is critical for determining virus type and virus variant and for evaluating vaccines. In this paper, SARS-Cov-2 RNA sequence analysis relative to region or territory is investigated. A uniform framework of sequence SVM models with various genetic lengths, from short to long, and with mixed bases is developed by projecting SARS-Cov-2 RNA sequences into spaces of different dimensions and then scoring them according to the output probabilities of pre-trained SVM models, in order to explore the territory or origin information of SARS-Cov-2. Different sample size ratios of training set to test set are also discussed in the data analysis. Two SARS-Cov-2 RNA classification tasks are constructed based on the GISAID database: one for mainland, Hongkong and Taiwan of China, and the other a 6-class classification task (Africa, Asia, Europe, North America, South & Central America, Oceania) over 7 continents. For the 3-class classification of China, the Top-1 accuracy rate can reach 82.45% (train 60%, test 40%); for the 2-class classification of China, the Top-1 accuracy rate can reach 97.35% (train 80%, test 20%); for the 6-class classification task of the world, when the ratio of training set to test set is 20% : 80%, the Top-1 accuracy rate can reach 30.30%. Some Top-N results are also given.
Keywords:
COVID-19, SARS-Cov-2, Sequence SVM, Pattern Recognition, Top-N Accuracy Rate, Genetics Analysis, Mixed-base
∗ Corresponding author.
Email address: [email protected] (Jingwei Liu)
January 12, 2021
1. Introduction
The ongoing COVID-19 pandemic has brought disaster to human beings all over the world. From its first identification in Wuhan, China, in late December 2019 to 8 January 2021, more than 88.7 million persons have been infected and more than 1.91 million deaths have been attributed to severe acute respiratory syndrome coronavirus 2 (SARS-Cov-2). The origin of COVID-19 is still an issue of concern, though there is a scientific consensus that it has a natural origin [1,2,3,4,5,6].

The COVID-19 pandemic poses challenges to state-of-the-art techniques of genome analysis combined with artificial intelligence (AI) and machine learning (ML), which are employed in the pandemic to study virus discovery, virus variants, virus evolution, genetic mutation, symptom diagnosis, the conditions that most affect the spread, and the development of conventional drugs and vaccines [7-26].

It is well known that SARS-Cov-2 varies among different hosts. This paper aims to investigate the diversity of SARS-Cov-2 among people of different regions and territories, and reports our tracking research on SARS-Cov-2 genetic sequences during the COVID-19 pandemic by designing two pattern recognition tasks that chronologically follow submission time in GISAID. The first concerns SARS-Cov-2 from three regions of China (mainland, Hongkong and Taiwan) as of 2 June 2020, and attempts to explore the origin of SARS-Cov-2 in China from the viewpoint of region information. Our results show that the SARS-Cov-2 RNA sequences of Hongkong are distinctly discriminated from those of mainland and Taiwan. The second is a 6-class pattern recognition task designed from the 7 continents worldwide as of 11 July 2020, to investigate the origin and diversity of SARS-Cov-2, which is helpful for estimating vaccine effectiveness in the COVID-19 pandemic.

The paper is organized as follows: Section 2 describes our sequence SVM model with Top-N, Section 3 introduces the experimental database, Section 4 presents the experimental results, and Section 5 summarizes the research.
2. Method
Sequence Multi-class SVM with Top-N
Support Vector Machine (SVM) is an efficient supervised statistical learning and machine learning method for classification analysis. The multi-class SVM is defined on the binary SVM with either a one-versus-one max-wins voting strategy or a one-versus-all winner-takes-all strategy. The Top-N method, also called the ranking technique or N-best method, is widely applied in pattern recognition and recommendation systems. A sequence SVM model with Top-N was developed for microbial marker clades gene sequence classification in [27]. Later, the motivation also appears in [28].

The same sequence SVM model with Top-N is employed in this paper to dig out origin and territory information for SARS-Cov-2. The novel treatment here is that mixed bases are also included in the genetic analysis. A mixed base is treated as noise in the genome sequence; however, the site still carries information about the SARS-Cov-2 RNA sequence even though the exact base is not measured.

To strengthen confidence in the experimental results, the standard LibSVM { ∼ cjlin/libsvm/ } is used in the training stage, and the parameter range of C-SVM with RBF kernel is C × γ ∈ { } × { }. The Top-N stage is then implemented on a C++ platform with LibSVM. The best classification accuracy rates on the test set (independent of the training set) over a total of 64 parameter combinations are reported below.
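As an illustration of the Top-N stage described above, the following minimal Python sketch scores one sequence from per-segment class probabilities and computes a Top-N accuracy rate. It is not the C++/LibSVM implementation used in the paper. The paper states only that a sequence is scored from the output probabilities of the pre-trained SVM models; averaging the probabilities over segments is an assumption made here, and the function names `score_sequence`, `top_n_labels` and `top_n_accuracy` are hypothetical.

```python
from typing import List, Sequence


def score_sequence(segment_probs: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-segment class probabilities into one score per class.

    Assumption: scores are the mean probability over all segments of the
    sequence. The paper does not specify the aggregation rule.
    """
    n_classes = len(segment_probs[0])
    totals = [0.0] * n_classes
    for probs in segment_probs:
        for c, p in enumerate(probs):
            totals[c] += p
    return [t / len(segment_probs) for t in totals]


def top_n_labels(scores: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n highest-scoring classes, best first."""
    order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return order[:n]


def top_n_accuracy(all_scores: Sequence[Sequence[float]],
                   true_labels: Sequence[int], n: int) -> float:
    """Fraction of sequences whose true class is among the n best candidates."""
    hits = sum(1 for s, y in zip(all_scores, true_labels)
               if y in top_n_labels(s, n))
    return hits / len(true_labels)
```

With N = 1 this reduces to the ordinary accuracy rate; with N = 2 a sequence counts as correct when its true region is either the best or the second-best candidate, so the Top-N rate can never decrease as N grows.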
3. Dataset
{ A, C, T, G }, but there are mixed or ambiguous bases in the genome sequences of Data-II. All of the base sequences involved in the experiments are downloaded chronologically by GISAID submission time, without discarding any. In the pattern recognition task, the bases { A, C, T, G } are mapped into { } respectively, and a mixed base is mapped to the average value of the pure bases it involves.

The genetic sequences are split into a training set and a test set according to a given split percent { }. Then all the sequences in the training set and test set are mapped into a k-dimensional sample space, where the dimension k is also treated as the k-mer size. To investigate the dimension of the space for discrimination, k-mer sizes of { 10, 20, 30, 40, 50, 60, 70, 80 } are discussed in the experiments, and overlaps of { 0, 25%, 50%, 75% } between adjacent k-mer length vectors in the original sequence are also discussed. The pattern recognition rates on the test set under the above pre-processing are briefly denoted as "S{split percent}" and "O=overlap", meaning that the accuracy rate is obtained by using a model trained on the given split percent of the whole database to predict the remaining test set, which is (1 - split percent) of the whole database.

After pre-processing, all the data sets are used in the training and classification phases separately according to their respective tasks.
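The pre-processing described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's code. The numeric codes in `BASE_VALUE` are hypothetical, since the values used in the paper are not given here. The `IUPAC` table lists the standard nucleotide ambiguity codes. The choice to round the number of shared positions down and to drop a trailing remainder shorter than k is also an assumption.

```python
from typing import Dict, List

# Hypothetical numeric codes for the pure bases (assumption, not the paper's values).
BASE_VALUE: Dict[str, float] = {"A": 1.0, "C": 2.0, "T": 3.0, "G": 4.0}

# Standard IUPAC ambiguity codes and the pure bases each one stands for.
IUPAC: Dict[str, str] = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def encode_base(symbol: str) -> float:
    """Map a pure base to its code, or a mixed base to the mean of its members."""
    s = symbol.upper()
    if s in BASE_VALUE:
        return BASE_VALUE[s]
    members = IUPAC[s]  # raises KeyError for symbols outside the tables
    return sum(BASE_VALUE[b] for b in members) / len(members)


def encode_sequence(seq: str) -> List[float]:
    """Encode a whole sequence position by position."""
    return [encode_base(ch) for ch in seq]


def windows(values: List[float], k: int, overlap: float) -> List[List[float]]:
    """Cut a sequence into k-dimensional vectors.

    Adjacent vectors share int(k * overlap) positions (rounded down).
    A trailing remainder shorter than k is dropped.
    """
    step = max(1, k - int(k * overlap))
    return [values[i:i + k] for i in range(0, len(values) - k + 1, step)]
```

For example, with k = 4 and overlap 0.5, a sequence of 10 positions yields four vectors starting at positions 0, 2, 4 and 6, while overlap 0 yields two vectors starting at positions 0 and 4.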
4. Experimental Results
In this experiment, SARS-Cov-2 genome sequences of China are categorized into a 3-class pattern classification problem over mainland, Hongkong and Taiwan. The sequence SVM accuracy rates of Top-1 and Top-2 are shown in Figure 1 and Figure 2 respectively. The experimental results show that the best accuracy rate can be reached through parameter adjustment even in the case of S0.2, that is, training set 20% and test set 80%. Moreover, the accuracy rate of S0.8 has no absolute advantage over that of S0.2 in the Top-1 case. The overall Top-2 classification rate is higher than that of Top-1, which indicates that some samples are highly similar; after all, the three regions are within China, and their populations are almost of the same race though belonging to different ethnic groups.
[Figure 1: four panels (S0.2, S0.4, S0.6, S0.8), each plotting recognition rate (%) against dimension (20 to 80) for overlaps O=0, O=0.25, O=0.5 and O=0.75.]
Figure 1: 3-class Top-1 accuracy rate of mainland, Hongkong and Taiwan.
[Figure 2: four panels (S0.2, S0.4, S0.6, S0.8), each plotting recognition rate (%) against dimension (20 to 80) for overlaps O=0, O=0.25, O=0.5 and O=0.75.]
Figure 2: 3-class Top-2 accuracy rate of mainland, Hongkong and Taiwan.

4.2. 2-class SARS-Cov-2 classification of China
To explore the sample structure of SARS-Cov-2 of China, three 2-class pattern classification experiments are designed: any two of the above regions are combined into one class, and the remaining region is treated as the other class. The Top-1 classification rates are shown in Figure 3, Figure 4 and Figure 5. The classification of Hongkong versus mainland & Taiwan obtains the best result (Figure 4), which leads to the conclusion that the SARS-Cov-2 virus of Hongkong is different from that of mainland & Taiwan. Comparing the 3-class and 2-class results for SARS-Cov-2 of China also indicates that the accuracy rate decreases as the number of classes increases. In the 2-class case (Figure 3, Figure 4, Figure 5), the accuracy rate becomes higher as the split percent increases, but this advantage no longer holds in 3-class recognition (Figure 1, Figure 3, Figure 4, Figure 5).
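The construction of the three 2-class tasks can be sketched as a relabeling of the 3-class data. This is a minimal illustration, and the function name `one_vs_rest_tasks` and the 0/1 label convention are assumptions. Each region in turn forms one class, and the other two regions are merged into the second class.

```python
from typing import Dict, List

REGIONS = ["mainland", "Hongkong", "Taiwan"]


def one_vs_rest_tasks(labels: List[str]) -> Dict[str, List[int]]:
    """Build one binary label vector per region.

    For the task keyed by region r, samples from r get label 1 and
    samples from the two merged regions get label 0.
    """
    return {r: [1 if y == r else 0 for y in labels] for r in REGIONS}
```

The task keyed by "mainland" then corresponds to the setting of Figure 3, "Hongkong" to Figure 4, and "Taiwan" to Figure 5.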
[Figure 3: four panels (S0.2, S0.4, S0.6, S0.8), each plotting recognition rate (%) against dimension (20 to 80) for overlaps O=0, O=0.25, O=0.5 and O=0.75.]
Figure 3: 2-class classification of mainland vs. Hongkong and Taiwan.
As illustrated above, using a larger training set to predict a smaller test set may lose its advantage as the number of classes increases,
[Figure 4: four panels (S0.2, S0.4, S0.6, S0.8), each plotting recognition rate (%) against dimension (20 to 80) for overlaps O=0, O=0.25, O=0.5 and O=0.75.]
Figure 4: 2-class classification of Hongkong vs. mainland and Taiwan.
[Figure 5: four panels (S0.2, S0.4, S0.6, S0.8), each plotting recognition rate (%) against dimension (20 to 80) for overlaps O=0, O=0.25, O=0.5 and O=0.75.]
Figure 5: 2-class classification of Taiwan vs. mainland and Hongkong.
5. Conclusion
Analysis of SARS-Cov-2 RNA sequences in relation to region and territory will provide new, deep insights into the behavior of COVID-19. Using a smaller training set to predict a larger test set helps us to estimate the trend of the COVID-19 pandemic, the variant capability of SARS-Cov-2, and the application range of vaccines. Our experimental results demonstrate that, for the SARS-Cov-2 data before 6 June 2020, the SARS-Cov-2 genome sequences of Hongkong are highly discriminated from the SARS-Cov-2 genome sequences of mainland & Taiwan. Moreover, among the worldwide SARS-Cov-2 viruses, 30% of the RNA viruses are apparently differentiated in mathematical space.
[Figure 6: recognition rate (%) against dimension for S0.2, O=0, with one curve for each of C=1, C=2, C=3, C=4 and C=5.]
Figure 6: 6-class Top-N classification rates with overlap 0 and split rate 0.2, where C denotes the number of candidates in Top-N.
[Figure 7: recognition rate (%) against dimension for S0.2, O=0.25, with one curve for each of C=1, C=2, C=3, C=4 and C=5.]
Figure 7: 6-class Top-N classification rates with overlap 0.25 and split rate 0.2, where C denotes the number of candidates in Top-N.

References
[1] Jiumeng Sun, Wan-Ting He, Lifang Wang, Alexander Lai, Xiang Ji, Xiaofeng Zhai, Gairu Li, Marc A. Suchard, Jin Tian, Jiyong Zhou, Michael Veit, Shuo Su. COVID-19: Epidemiology, Evolution, and Cross-Disciplinary Perspectives. Trends in Molecular Medicine. 2020 May; 26(5): 483–495. doi: 10.1016/j.molmed.2020.02.008.
[2] Peter Forster, Lucy Forster, Colin Renfrew, Michael Forster. Phylogenetic network analysis of SARS-CoV-2 genomes. Proceedings of the National Academy of Sciences. 2020-04-08, 117 (17): 9241–9243.
[3] Kristian G. Andersen, Andrew Rambaut, W. Ian Lipkin, Edward C. Holmes, Robert F. Garry. The proximal origin of SARS-CoV-2. Nature Medicine. 2020, 26 (4): 450–452.
[4] A. Deslandes, V. Berti, Y. Tandjaoui-Lambotte, Chakib Alloui, E. Carbonnelle, J. R. Zahar, S. Brichler, Yves Cohen. SARS-CoV-2 was already spreading in France in late December 2019. International Journal of Antimicrobial Agents. 2020 Jun, 55 (6): 106006.
[5] Lucy van Dorp, Mislav Acman, Damien Richard, Liam P. Shaw, Charlotte E. Ford, Louise Ormond, Christopher J. Owen, Juanita Pang, Cedric C. S. Tan, Florencia A. T. Boshier, Arturo Torres Ortiz, François Balloux. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection, Genetics and Evolution. 2020-09-01, 83: 104351.
[6] Jacqueline Duhon, Nicola Bragazzi, Jude Dzevela Kong. The impact of non-pharmaceutical interventions, demographic, social, and climatic factors on the initial growth rate of COVID-19: A cross-country study. Science of the Total Environment. 2020 Dec 10; 760: 144325. doi: 10.1016/j.scitotenv.2020.144325.
[7] Dongwan Kim, Joo-Yeon Lee, Jeong-Sun Yang, Jun Won Kim, V. Narry Kim, Hyeshik Chang. The Architecture of SARS-CoV-2 Transcriptome. Cell. 2020, 181(4): 914-921.e10.
[8] Yanni Li, Bing Liu, Zhi Wang, Jiangtao Cui, Kaicheng Yao, Pengfan Lv, Yulong Shen, Yueshen Xu, Yuanfang Guan, Xiaoke Ma. COVID-19 Evolves in Human Hosts (March 20, 2020). Available at SSRN: https://ssrn.com/abstract=3562070 or http://dx.doi.org/10.2139/ssrn.3562070.
[9] Ben Hu, Hua Guo, Peng Zhou, Zheng-Li Shi. Characteristics of SARS-CoV-2 and COVID-19. Nature Reviews Microbiology. (2020). https://doi.org/10.1038/s41579-020-00459-7.
[10] Tuan M. Nguyen, Yang Zhang, Pier Paolo Pandolfi. Virus against virus: a potential treatment for 2019-nCov (SARS-CoV-2) and other RNA viruses. Cell Research. 2020, 30, 189–190.
[11] Lindsey R. Baden, Hana M. El Sahly, Brandon Essink, Karen Kotloff, Sharon Frey, Rick Novak, David Diemert, Stephen A. Spector, Nadine Rouphael, C. Buddy Creech, John McGettigan, Shishir Khetan, et al. Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine. The New England Journal of Medicine. DOI: 10.1056/NEJMoa2035389.
[12] Chansik An, Hyunsun Lim, Dong-Wook Kim, Jung Hyun Chang, Yoon Jung Choi, Seong Woo Kim. Machine learning prediction for mortality of patients diagnosed with COVID-19: a nationwide Korean cohort study. Scientific Reports, 2020, 10, 18716. https://doi.org/10.1038/s41598-020-75767-2.
[13] Tô Tat Dat, Protin Frédéric, Nguyen T.T. Hang, Martel Jules, Nguyen Duc Thang, Charles Piffault, Rodríguez Willy, Figueroa Susely, Hông Vân Lê, Wilderich Tuschmann, Nguyen Tien Zung. Epidemic Dynamics via Wavelet Theory and Machine Learning with Applications to Covid-19. Biology. 2020, 9, 477.
[14] Haochen Yao, Nan Zhang, Ruochi Zhang, Meiyu Duan, Tianqi Xie, Jiahui Pan, Ejun Peng, Juanjuan Huang, Yingli Zhang, Xiaoming Xu, Hong Xu, Fengfeng Zhou, Guoqing Wang. Severity Detection for the Coronavirus Disease 2019 (COVID-19) Patients Using a Machine Learning Model Based on the Blood and Urine Tests. Front Cell Dev Biol. 2020, 8: 683.
[15] Jawad Rasheed, Akhtar Jamil, Alaa Ali Hameed, Usman Aftab, Javaria Aftab, Syed Attique Shah, Dirk Draheim. A survey on artificial intelligence approaches in supporting frontline workers and decision makers for the COVID-19 pandemic. Chaos Solitons Fractals. 2020 Dec; 141: 110337. Published online 2020 Oct 10. doi: 10.1016/j.chaos.2020.110337.
[16] Mohammad-H Tayarani-N. Applications of artificial intelligence in battling against covid-19: A literature review. Chaos Solitons Fractals. 2020 Oct 3: 110338. doi: 10.1016/j.chaos.2020.110338.
[17] H. Swapnarekha, Himansu Sekhar Behera, Janmenjoy Nayak, Bighnaraj Naik. Role of intelligent computing in COVID-19 prognosis: A state-of-the-art review. Chaos Solitons Fractals. 2020 Sep; 138: 109947. Published online 2020 May 29. doi: 10.1016/j.chaos.2020.109947.
[18] Manuela Sironi, Seyed E. Hasnain, Benjamin Rosenthal, Tung Phan, Fabio Luciani, Marie-Anne Shaw, M. Anice Sallum, Marzieh Ezzaty Mirhashemi, Serge Morand, Fernando González-Candelas. SARS-CoV-2 and COVID-19: A genetic, epidemiological, and evolutionary perspective. Infection, Genetics and Evolution. 2020, 84: 104384. https://doi.org/10.1016/j.meegid.2020.104384.
[19] Ricardo Ramírez-Aldana, Juan Carlos Gomez-Verjan, Omar Yaxmehen Bello-Chavolla. Spatial analysis of COVID-19 spread in Iran: Insights into geographical and structural transmission determinants at a province level. PLOS Neglected Tropical Diseases. 2020, 14(11): e0008875. https://doi.org/10.1371/journal.pntd.0008875.
[20] Dingding Wang, Jiaqing Mo, Gang Zhou, Liang Xu, Yajun Liu. An efficient mixture of deep and machine learning models for COVID-19 diagnosis in chest X-ray images. PLOS ONE. Published: November 17, 2020. https://doi.org/10.1371/journal.pone.0242535.
[21] Raju Vaishya, Mohd Javaid, Ibrahim Haleem Khan, Abid Haleem. Artificial Intelligence (AI) applications for COVID-19 pandemic. Diabetes Metab Syndr. 2020 July-August; 14(4): 337–339. Published online 2020 Apr 14. doi: 10.1016/j.dsx.2020.04.012.
[22] Jordi Laguarta, Ferran Hueto, Brian Subirana. COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings. Engineering in Medicine and Biology. 2020, 1, 275-281.
[23] Mohammad Jamshidi, Ali Lalbakhsh, Jakub Talla, Zdeněk Peroutka, Farimah Hadjilooei, Pedram Lalbakhsh, Morteza Jamshidi, Luigi La Spada, Mirhamed Mirmozafari, Mojgan Dehghani, Asal Sabet, Saeed Roshani, Sobhan Roshani, Nima Bayat-Makou, Bahare Mohamadzade, Zahra Malek, Alireza Jamshidi, Sarah Kiani, Hamed Hashemi-Dezaki, Wahab Mohyuddin. Artificial Intelligence and COVID-19: Deep Learning Approaches for Diagnosis and Treatment. IEEE Access, 2020, 8, 109581-109595. doi: 10.1109/ACCESS.2020.3001973.
[24] Yun Chen, Gongfa Jiang, Yue Li, Yutao Tang, Yanfang Xu, Siqi Ding, Yanqi Xin, Yao Lu. A Survey on Artificial Intelligence in Chest Imaging of COVID-19. BIO Integration, 2020. https://doi.org/10.15212/bioi-2020-0015.
[25] Liping Sun, Gang Liu, Fengxiang Song, Nannan Shi, Fengjun Liu, Shenyang Li, Ping Li, Weihan Zhang, Xiao Jiang, Yongbin Zhang, Lining Sun, Xiong Chen, Yuxin Shi. Combination of four clinical indicators predicts the severe/critical symptom of patients infected COVID-19. Journal of Clinical Virology. 2020. doi: 10.1016/j.jcv.2020.104431.
[26] Nathan Peiffer-Smadja, Redwan Maatoug, François-Xavier Lescure, Eric D’Ortenzio, Jo¨ ee