A New Framework For Spatial Modeling And Synthesis of Genome Sequence
AA NEW FRAMEWORK FOR SPATIAL MODELING AND SYNTHESIS OF GENOMESEQUENCE
Salman Mohamadi, Farhang Yeganegi, Hamidreza Amindavar
Department of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran
[email protected], [email protected], [email protected]
ABSTRACT
This paper provides a framework in order to statisticallymodel sequences from human genome, which is allowinga formulation to synthesize gene sequences. We start byconverting the alphabetic sequence of genome to decimalsequence by Huffman coding. Then, this decimal sequenceis decomposed by HP filter into two components, trend andcyclic. Next, a statistical modeling, ARIMA-GARCH, isimplemented on trend component exhibiting heteroskedas-ticity, autoregressive integrated moving average (ARIMA) tocapture the linear characteristics of the sequence and later,generalized autoregressive conditional heteroskedasticity(GARCH) is then appropriated for the statistical nonlinearityof genome sequence. This modeling approach synthesizes agiven genome sequence regarding to its statistical features.Finally, the PDF of a given sequence is estimated usingGaussian mixture model, and based on estimated PDF, wedetermine a new PDF presenting sequences that counteractstatistically the original sequence. Our strategy is performedon several genes as well as HIV nocleotide sequence andcorresponding results is presented.
Index Terms — Statistical modeling, Genome sequencesynthesizing, ARIMA model, GARCH model, Biologicalfunctionality counteraction, Human Immunodeficiency Virus(HIV)
1. INTRODUCTION
Considering the increasingly enormous amount of humanand also animals genomic data and assuming that genomesequences could also be treated as a discrete signal, raisethe idea that signal processing can play an important rolein interpreting the genome information [1, 2]. Here we an-alyze and model genome sequence which is represented bysequence of letters from a four-character alphabet includingof the letters A, T, C, and G; each one representing one offour distinct nucleotides named cytosine (C), thymine (T),adenine (A), guanine (G). Also A, T, G and C are called basepairs of gene sequence. DNA strand is divided into genesand intergenic subsequences noting that genes are responsi-ble for protein synthesis. Therefore in order to process a sub sequence of DNA, here a sequence demonstrating a gene, weneed to convert the sequence of alphabet to a sequence ofbinary or decimal numbers [3]. A variety of methods havebeen used to convert the sequence of alphabet to desirabledecimal sequence which can be found in [2–4]. Sequenceof any particular gene, is a reference to possessing necessaryinformation in order to create a specific protein; so any timethat a new protein is going to come into existence, a trans-lation system start to translate the gene information letter byletter. In this sense, we can suppose that we are facing a mi-croscopic communication system that to transfer information.Therefore in the procedure of translating genome sequence,it can be infered that genome sequence is the source of in-formation transmission during genome sequence biologicaltranslation [5, 6] and based on this fact, we suggest a sourcecoding method for converting the gene sequence to a decimalsequence. After providing a digital signal for any given geneof genome sequence, since frequently have been observedthat genome sequence demonstrate cyclic pattern [7], a pow-erful tools named HP filter will be applied to this decimalsequence to decompose it to trend component and Cycliccomponent. Cyclic present periodic patterns of signal, buttrend provide the main trend of signal. Also it is important tonote that cyclic component shows a time invariant variancebut trend component shows an entirely time variant variance,it can be seen in Fig. 2. Finally based on ARIMA-GARCHmodel, modeling and synthesis of genome sequences, is donefor a number of original genes sequences. This paper is orga-nized as follows, the preprocessing of the genome sequence,ARIMA and ARCH/GARCH modeling of the genome se-quence, PDF estimation and some simulations and at the end,some concluding remarks.
2. PROBLEM STATEMENT AND FORMULATION
In order for statistical modeling of a genome sequence on acomputer, decimal numbers are preferred, on the other hand,Huffman coding is a variable-length coding commonly ac-cepted for lossless data compression providing a short bi-nary sequence; in entropy sense, easily converted to a decimalnumber. By so doing, a discrete signal like a time series is cre-ated of a genome sequence that measures the status of genetic a r X i v : . [ q - b i o . O T ] A ug lphabets over time and space which later would be used forspatial modeling. There is no evidence as to the created se-quence is stationary, in order to convert this non-stationaryseries to stationary, in fact, having a look into the genomestructure, because genome sequence includes too many subsequences such as sequences of genes which are randomly ar-ranged from statistical point of view [8]; ARIMA modeling isutilized that captures the best linear forecast for the genomesequence. In order to model, the recipe of the frequency ofhow the alphabets in a genome sequence converted to eachother a statistical measure of the dispersion can be adopted,volatility is the measure of time-varying variance; which isthe first order non-linearity characteristics of a time series,and it has been modeled through ARCH/GARCH method. Aswe demonstrate that the trend component of genome sequencepossesses volatility clustering reflecting more recent changesand fluctuations in the genome time series. There are two types of coding algorithm in terms of codelength which are fixed length code and variable length code.Along with all the advantageous that variable length codinghas, they may not be uniquely decodable. In fact, in codingtheory, a specific type of code called ”prefix code” can beuniquely decoded. The other aspect of coding algorithm thatis really important is the average code length. In informa-tion theory litretures, it can be seen that the average codelength has a lower bound called ”Entropy”. Also, this boundcan be achieved using a prefix variable length code called”Huffman code” where it has shown a promising results inthe litretures [9, 12]. The output from Huffman’s algorithmcan be viewed as a variable-length code table for encodinga source symbol. The algorithm derives this table from theestimated probability or frequency of occurrence (weight) foreach possible value of the source symbol. In other words,in this method, probability of symbols (here A, T, G and C)are take into account to prepare a code specifically efficientfor the alphabetic sequence we want to encode it [13]. Asin [14] have been demonstrated, Huffman coding would serveas good compression code for genome sequence, virtually be-cause of presenting sequence-specific or here, gene-specificcoding, meaning that with this method, entropy of each ofthe nucleotides, A, T, G and C, individually affect the codingprocedure of that sequence. Therefore by this method, thetask of converting the sequence of A, T, G and C, to a digitalsignal would be performed efficiently [14]. With Huffmanmethod, we map [A T G C] to decimal numbers in vector N= [0 1 2 3], noting it has been seen that different mappings,result in the same ways, so we can map [A T G C] to anyother vector made up from 0, 1, 2 and 3.
Hodrick-Prescott filter (HP filter) captures a smooth trend car-rying the most valuable information of signals rather thanpowerful hidden cyclic pattern [15, 16]. The HP filter is a fineapproximation to an ideal high-pass filter, which outputs twocomponents named cyclic and trend. The genome sequenceconverted to a decimal sequence can be decomposed into twocomponents: a trend and cycle components, and a remaindercomponent that may not be captured denoted as noise. Thisis a problem of interest in the statistical modeling, extractionof trend and hidden periodicity from a time series which mayseem not to carry valuable information. However this ques-tion remains unaddressed that why we should use this decom-position. Cyclic component is about time invarying variancepart of signal, therefore once we aim at modeling the signalwith a model capturing time varying variance, we may de-compose the signal into time variant and time invariant com-ponents and just model the first component. In section 4 weexplain in more detail how we use formulation of our modeland also intact cyclic component of certain gene to synthesizeit. The trend is the component of a genome sequence that pro-vides the low frequency statistical characteristics, whereas thecyclical component that refers to the fluctuations around thetrend. In a genome sequence, the cyclic component does notconvey a significant information, in entropy sense. In Fig. 1Huffman encoded sequences of cyclic and trend are shownfor a limited number of base pairs (letters of alphabetic se-quence) of gene. Furthermore, their time varying variance tosupport the idea that HP filter is effective in this application,additional to above argument, it is helpful to consider thattrend component presents that component of signal with fluc-tuating time varying variance, meaning that trend possessesthe heteroskedasticity while cyclic component presents thoseof almost constant time invarying variance, meaning cyclicis a homoskedastic process; as it is shown in Fig. 2. Het-eroscedasticity characteristics of genome sequence refers tothe circumstance in which the variability of the trend compo-nent cannot be predicted by any predictor if equal ranges ofgenome sequence are observed, on the other hand, a predic-tion of genome sequence is possible of if unequal ranges oftrend values are observed.
A genome sequence possesses some properties that existamong all other genes in a linear sense, from the standpointof forecasting these properties ARIMA modeling allows lin-ear prediction of future values of genome sequence basedon these properties. ARIMA models are applied in casesthat data shows non-stationarity, in order to reduce non-stationarity. Via applying an initial differencing step, corre-sponding to the integrated part of the model, ARIMA reduces ig. 1 . Huffman encoded decimal sequence of 1000 basepairs (letter) of gene CAV1, trend Component and CyclicComponent of this decimal sequence, horizontal axis repre-sents base pair and veritacl axis represents amplitude .the non-stationarity [17]. The ARIMA( p ; d ; q ) is describedby (cid:32) − p (cid:88) k =1 a k L k (cid:33) (1 − L ) d X n = (cid:32) q (cid:88) k =1 b k L k (cid:33) (cid:15) n (1)where X n , here is the decimal sequence corresponding togene sequence, n is the number corresponding to the loca-tion of base pairs in the sequence of gene, for example X refers to the decimal number in the 128th location of gene thatcorresponds to the type of base pair that has occupied this lo-cation of gene sequence. ACCCACAT T CCCCT CT CCA... (cid:48)(cid:48) thbasepair (cid:48)(cid:48) ...
Moreover L is the lag operator, the a k are the parameters ofthe autoregressive part of the model, the b k are the parame-ters of the moving average part and the (cid:15) n are error terms.The error terms (cid:15) n are generally assumed to be independent,identically distributed variables sampled from a normal dis-tribution with zero mean. Here p is the order of AR, and q theorder of MA, and d is the order of differencing [18]. Table 1 . Parameters of the ARIMA models; N is the numberof base pairs (letter) of sequence of gene CAV1 . Model N AIC c ARIMA(1,1,1) 38400 -5.1563e+05ARIMA(2,1,1) 38400 -5.1568e+05ARIMA(2,1,2) 38400 -5.1616e+05ARIMA(1,1,2) 38400 -5.1533e+05ARIMA(2,2,2) 38400 -5.1601e+05ARIMA(2,2,1) 38400 -5.1586e+05ARIMA(1,2,2) 38400 -5.1583e+05ARIMA(1,2,1) 38400 -5.1558e+05
Table 2 . Parameters of the ARCH and GARCH models; Nalso is the length of residual of ARIMA model which is equalto the size of gene CAV1, here we see GARCH(1,1) is thebest of all due to the smallest AIC c . Model N AIC c ARCH(0) 38400 -2.889e+05ARCH(1) 38400 -2.8913e+05ARCH(2) 38400 -2.9221e+05ARCH(3) 38400 -2.9614e+05ARCH(4) 38400 -2.9905e+05ARCH(5) 38400 -3.1421e+05ARCH(6) 38400 -3.2228e+05ARCH(7) 38400 -3.3633e+05ARCH(8) 38400 -3.4689e+05ARCH(9) 38400 -3.5656e+05GARCH(1,1) 38400 -3.9555e+05
Most straight forward of time varying power of random se-quences is ARCH/GARCH; it means that these processesbest model the sequences of time varying power or variance.ARCH/GARCH process mostly and also first time, used ineconomic time series modeling, because its profound mathe-matics allow to cover precisely most of fluctuations in thosesignals. Here because of lack of space, we just present a briefintroduction, for further information and also example of bio-logical application of these two, refer to [17, 18] [15,14]. Fora given digital signal y n , we can model it using GARCH( p , q ), if E (cid:8) y n (cid:9) = 0 and y n = (cid:112) h n (cid:15) n (2) h n = a + p (cid:88) i =1 a i y n − i + q (cid:88) j =1 b j h n − j (3)Where h n is the conditional variance of y n , a > , a i ≥ , b j ≥ and { (cid:15) n } is a sequence of independent identically dis-tributed zero mean with variance one random variables, of-ten assumed to be a standard normal random variable [18].ARCH or GARCH process, with an ability to model volatil-ity, would be used to model the volatility of the digital signalto reflect changes and fluctuations in gene sequence [19].
3. FITTING THE BEST ARIMA-GARCH MODEL
To perform our method, we selected 18 genes that have mostinteraction in human breast cancer; the sequence of each geneis downloaded from United State National Center for Biotech- ig. 2 . Normalized Time Varying Variance of Trend andCyclic components of the two genes, CAV1 (blue) and CD34(orange); horizontal axis represents base pairs (unit*10000)and veritacl axis represents amplitude.nology Information . These genes are as in table 3. Havingthe preprocessed digital signal, corresponding to each genesequence, now different ARIMA(p; d; q) models are fittedand the corresponding AICc are computed as in Table 1. BestARIMA for sample gene, CAV1 is ARIMA(2,1,2). Whenit comes to fit a ARCH/GARCH model, firstly ARCH testshould be used to verify the existence of heteroskedasticityin the data. For all 18 genes this test has been performedon residual of ARIMA model individually and confirmed theissue. After that, best orders are extracted through AIC. Fi-nally a mathematical formulation resulting from modeling isattained.
4. RESULTS4.1. Modeling the sequences of 18 genes of human genome
Order of ARIMA model for each gene is selected basedon best AIC. Best ARCH/GARCH model, is then fitted toresiduals of ARIMA. Results of sample gene, gene CAV1,are presented in tables and figures. For all of genes weachieved mathematical formulation extracted from ARIMA-GARCH modeling; here 15 out of 18 genes are modeled byformulation of ARIMA(2,1,2)-GARCH(1,1) model, only thecoefficients of formula differs from gene to gene. Among 18genes, except for three genes PPARG, NGFR and PLA2G2A,best ARIMA was ARIMA(2,1,2) and for all 18 genes,GARCH(1,1) was selected as the best order of GARCH.For these three, PPARG, NGFR and PLA2G2, best ARI-MAs were respectively ARIMA(2,2,2), ARIMA(2,2,2) andARIMA(1,2,2). Fitting these best models, allow us to synthe-size artificial gene sequence for any given original gene, thatpossess the linear and nonlinear characteristics of the originalgene. According to the formulation provided at 2.2 and 2.3; wecan formulate ARIMA-GARCH model [18]. For all genes,coefficients of the formula of ARIMA-GARCH modelingare extracted; here for gene CAV1 with ARIMA(2,1,2)-GARCH(1,1) we have: h n + X n − X n − = ( − . ) ( X n − − X n − )+ ( . ) ( X n − − X n − )+ ( − . ) (cid:15) n − + ( − . ) (cid:15) n − + ( . ) (cid:15) n − + ( . ) h n − + ( . e − )Where h n is synthesized sequence of Trend component of theartificial sequence, which is going to be synthesized com-pletely, and X n is the sequence of Trend component of theoriginal gene and (cid:15) n the error term. Then we add h n to thecyclic component of the original gene to complete the synthe-sis procedure, which provides us a new decimal sequence forany given original sequence. The latter, using Huffman de-coding, we convert the synthesized sequence, that is decimal,to alphabetic sequence. (cid:8) d , d , d ...d m (cid:9) ⇒ (cid:8) b , b , b , ...b m (cid:9) d , d , ...d m are the decimal numbers resulting from synthesisand b , b , b ...b m are the corresponding base pairs of the syn-thesized gene which are achieved by converting the decimalsequence to alphabetic sequence. Sample of 120 synthesizedbase pairs (from 1332 to 1452), which is randomly selectedfrom whole sequence of synthesized gene base on gene FGF2are as follow :ACGCCCGCGAGGCTGGTTGGCCGGGGCGGGGGCCATGCCGCGGAGCGTGTCGGCGGCCGAGGCCGGCGCCGTAGGACGGCGGCTCAAAGCGCGGCTCCATCGACTCGCAGATCTCGGGGGIt is noteworthy to mention that in the sense of practical ap-plication, the above formulation equips us so that we can syn-thesize new sequences based on a certain genomic sequence,which is useful in some application such as genomic sequencePDF estimation, because for PDF estimation we need to pro-vide enough ensemble of the genomic sequence, which thisformulation can deal with. Suppose we take an original sequence of genome or a virusequivalent to a vector; in order to plausibly estimate its empir-ical PDF, we are required to have access on an infinite ensem-ble of this vector whereas, in our instance, such scenario isimpractical, and even if we provide enough ensemble for a ge-nomic sequence; i.e. random vector, next dilemma is the high-dimensionality issue of its PDF representation. For instanceto estimate PDF of a virus such as HIV with the length 9181 able 3 . List of 18 new sequences, which are synthesizedusing Huffman decoding based on original genes; N is thelength of the new sequence which is the same length as oforiginal gene Original gene N Original gene N IGFBP6 4910 TGFBR3 225993NGFR 19728 EGR1 3826ESR1 472929 FOS 3461DLC1 521250 IGF1 85920CD34 30431 NTRK2 358613CAV1 38401 PLA2G2A 5009PPARG 183646 ERBB4 1163439FGF2 71529 IL6 6561
Fig. 3 . HIV Sequence Autocorrelation, as we see, even atheavy lags, the sequence is highly correlated.can be represented in such dimension. Therefore, theoretical
PDF estimation strategy must be sought. In this regard, wepropose Gaussian Mixture Model (GMM) [21] to model thedecimal sequence provided by Huffman coding for genomeor a virus such as HIV sequence. Since this sequence andgenerally genomic sequences are highly correlated as seen inFig.3, we estimate a two dimensional marginal PDF; as wesee in Fig.4, using GMM of the original sequence of length N . p ( x ) = M (cid:88) i =1 w i N ( x, µ i , Σ i ) , < w i < (4)where p ( x ) = p ( x , x ) is the two dimensional marginalprobability density function of a genomic sequence modeledby a GMM of order M of Gaussian kernels, N ( x , µ i , Σ i ) ,whose mean vectors and covariance matrices are denoted by µ i and Σ i , respectively. We assume the PDF for the vector x (cid:48) = ( x (cid:48) , x (cid:48) ) counteracting the original sequence x is alsomodeled by a GMM as follows p ( x (cid:48) ) = M (cid:88) i =1 w i N ( x (cid:48) , µ i , Σ (cid:48) i ) , < w i < (5) Fig. 4 . Gaussian mixture approximation to PDF of HIV se-quence.
Fig. 5 . Gaussian mixture approximation to PDF of highly un-correlated HIV sequence; this PDF could be considered asthe basis for searching sequences that are statistically nega-tively correlated to the HIV sequence, i.e. the more similar-ity between PDF of given sequence and this PDF, the morecounteraction against HIV statistical treatment and probablyits biological functionality [20]that only differs in covariance matrices Σ (cid:48) which is related to Σ according to Σ ii (cid:48) = Σ ii , Σ ij (cid:48) = − Σ ij , i (cid:54) = j, (6)that is, the diagonal entries of both covariance matrices arethe same but their off-diagonal entries are opposite in signto rationalize the utmost negative statistical correlation tieing x and x (cid:48) . PDF of highly uncorrelated targeted sequence forHIV nocleotide sequence is also shown in Fig.5
5. CONCLUSION
In this paper, a new framework is presented in which, firstlywe propose a preprocessing step to prepare 18 genomic se-quences for further statistical modeling. Then perform sec-ond step, statistical modeling, to capture linear and nonlinearcharacteristics of genomic sequence. Furthermore, we pro-pose an approach to synthesize a new sequences based onour modeling for sequences of several original genes of thegenome sequence and finally using GMM for PDF estimation,at first PDF of a given sequence such as HIV is estimated andbased on that, another PDF is estimated which relates to theequences with highly negative statistical correlation againstthe original sequence PDF, here HIV. This negative statisti-cal correlation would suggest biological functional counter-action.
6. REFERENCES [1] Mai S Mabrouk, Safaa M Naeem, and Mohamed A El-dosoky, “Different genomic signal processing methodsfor eukaryotic gene prediction: A systematic review,”
Biomedical Engineering: Applications, Basis and Com-munications , vol. 29, no. 01, pp. 1730001, 2017.[2] PP Vaidyanathan and Byung-Jun Yoon, “The roleof signal-processing concepts in genomics and pro-teomics,”
Journal of the Franklin Institute , vol. 341,no. 1, pp. 111–135, 2004.[3] Bihter Das and Ibrahim Turkoglu, “A novel numericalmapping method based on entropy for digitizing dna se-quences,”
Neural Computing and Applications , pp. 1–9,2017.[4] Muhammaed Talha Naseem, KR Aravind Britto,Mustafa Musa Jaber, VS Balaji, G Rajkumar,K Narasimhan, V Elamaran, et al., “Preprocessingand signal processing techniques on genomic data se-quences,”
Biomedical Research , pp. 1–1, 2017.[5] Martin Bossert, “Information-and communication the-ory in molecular biology,” 2017.[6] M Thangavel, P Varalakshmi, and R Sindhuja, “A com-parative study on dna-based cryptosystem,” in
Hand-book of Research on Recent Developments in IntelligentCommunication Application , pp. 496–528. IGI Global,2017.[7] Svetlana A Shabalina, Aleksey Y Ogurtsov, and Niko-lay A Spiridonov, “A periodic pattern of mrna secondarystructure created by the genetic code,”
Nucleic Acids Re-search , vol. 34, no. 8, pp. 2428–2437, 2006.[8] Jingli Zhao, Shuling Li, Lijuan Wang, Li Jiang, Run-qing Yang, and Yuehua Cui, “Genome-wide randomregression analysis for parent-of-origin effects of bodycomposition allometries in mouse,”
Scientific Reports ,vol. 7, pp. 45191, 2017.[9] Pothuraju Rajarajeswari and Allam Apparao, “Dnabitcompress–genome compression algorithm,”
Bioinfor-mation , vol. 5, no. 8, pp. 350, 2011.[10] Shanika Kuruppu, Simon J Puglisi, and Justin Zo-bel, “Optimized relative lempel-ziv compression ofgenomes,” in
Proceedings of the Thirty-Fourth Aus-tralasian Computer Science Conference-Volume 113 .Australian Computer Society, Inc., 2011, pp. 91–98. [11] Armando J Pinho, Ant´onio JR Neves, Carlos AC Bas-tos, and Paulo JSG Ferreira, “Dna coding using finite-context models and arithmetic coding,” in . IEEE, 2009, pp. 1693–1696.[12] Farhang Yeganegi, Vahid Hassanzade, and SM Ahadi,“Comparative performance evaluation of svd-based im-age compression,” in
Electrical Engineering (ICEE),Iranian Conference on . IEEE, 2018, pp. 464–469.[13] Jin He, Xuewen Ding, and Li Li, “The practice ofapplication-oriented personnel training in the course ofinformation theory and coding,”
DEStech Transactionson Social Science, Education and Human Science , , no.eemt, 2017.[14] Anas Al-Okaily, Badar Almarri, Sultan Al Yami, andChun-Hsi Huang, “Toward a better compression for dnasequences using huffman encoding,”
Journal of Com-putational Biology , vol. 24, no. 4, pp. 280–288, 2017.[15] Torben Mark Pedersen, “The hodrick–prescott filter,the slutzky effect, and the distortionary effect of filters,”
Journal of Economic Dynamics and Control , vol. 25, no.8, pp. 1081–1101, 2001.[16] Pierre-Alain Bruchez, “A modification of the hp filteraiming at reducing the end-point bias,”
Swiss FinancialAdministration, Working Paper , 2003.[17] Tim Bollerslev, “Generalized autoregressive conditionalheteroskedasticity,”
Journal of econometrics , vol. 31,no. 3, pp. 307–327, 1986.[18] Salman Mohamadi, Hamidreza Amindavar, andSM Ali Tayaranian Hosseini, “Arima-garch model-ing for epileptic seizure prediction,” in
Acoustics,Speech and Signal Processing (ICASSP), 2017 IEEEInternational Conference on . IEEE, 2017, pp. 994–998.[19] Salman Mohamadi, Farhang Yeganegi, and Nasser MNasrabadi, “Detection and statistical modeling of birth-death anomaly,” arXiv preprint arXiv:1906.11788 ,2019.[20] Hulin Wu, “Statistical methods for hiv dynamic studiesin aids clinical trials,”
Statistical methods in medicalresearch , vol. 14, no. 2, pp. 171–192, 2005.[21] Zoran Zivkovic, “Improved adaptive gaussian mixturemodel for background subtraction,” in