Markov chain order estimation with parametric significance tests of conditional mutual information
Maria Papapetrou, Dimitris Kugiumtzis
Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
Abstract
Besides the different approaches suggested in the literature, accurate estimation of the order of a Markov chain from a given symbol sequence is an open issue, especially when the order is moderately large. Here, parametric significance tests of conditional mutual information (CMI) of increasing order m, I_c(m), are conducted on a symbol sequence in order to estimate the true order L of the underlying Markov chain. CMI of order m is the mutual information of two variables in the Markov chain being m time steps apart, conditioning on the intermediate variables of the chain. The null distribution of CMI is approximated with a normal and a gamma distribution, deriving analytic expressions of their parameters, and with a second gamma distribution deriving its parameters from the mean and variance of the normal distribution. The accuracy of order estimation is assessed with the three parametric tests, and the parametric tests are compared to the randomization significance test and other known order estimation criteria using Monte Carlo simulations of Markov chains with different order L, length of symbol sequence N and number of symbols K. The parametric test using the gamma distribution with directly defined parameters is consistently better than the other two parametric tests and matches well the performance of the randomization test. The tests are applied to genes and intergenic regions of DNA sequences, and the estimated orders are interpreted in view of the results from the simulation study. The application shows the usefulness of the parametric gamma test for long symbol sequences, where the randomization test becomes prohibitively slow to compute.

Keywords: symbol sequence, Markov chain order, conditional mutual information, significance test, DNA
1. Introduction
Symbol sequences are directly observed from real-world processes, such as DNA sequences and on-line transaction logs, but can also be derived from the discretization of time series. Sequence analysis, initially developed mostly for biological applications [1], has expanded with regard to both applications and methodologies, and sequence mining techniques are constantly being developed [2]. Here, however, we concentrate on a classical and fundamental problem that regards the memory of the mechanism underlying a symbol sequence. In the presence of association in a symbol sequence, the first step of the analysis is to assume a Markov chain and estimate the order of the Markov chain.

There are many Markov chain order estimators proposed and assessed in the literature. The Bayesian information criterion (BIC) and the Akaike information criterion (AIC) are the two oldest and best known order estimators based on maximum likelihood [3, 4, 5]. Another estimator is given by the maximal fluctuation method proposed by Peres and Shields [6] and modified by Dalevi and Dubhashi [7], who found that the Peres-Shields (PS) estimator is simpler, faster and more robust to noise than other criteria like AIC and BIC [7]. Other order estimation schemes include the method of Menéndez et al. [8], which uses the φ-divergence measures [9], the method of global dependency level (GDL), also called relative entropy [10], and the efficient determination criterion (EDC) [11]. Based on information-related measures, and specifically the conditional mutual information (CMI), we recently proposed order estimation by means of randomization significance tests for CMI at increasing orders [12]. In a somewhat similar way, Pethel et al. [13] propose a randomization test for the examined Markov chain order using the chi-square statistic.

In the approach of [12] we made no assumption on the distribution of CMI. Here we propose order estimation with parametric tests, approximating the null distribution of CMI by normal and gamma distributions.

Email addresses: [email protected] (Maria Papapetrou), [email protected] (Dimitris Kugiumtzis)

Preprint submitted to Elsevier, October 4, 2018
We follow the bias correction and the approximation for the variance in [14] and [15] and approximate the distribution of mutual information with the Gaussian distribution as an obvious possible choice [16, 17]. We also consider the result of Goebel et al. [18] that the statistic of mutual information (MI), and subsequently CMI, follows a gamma distribution. Finally, we consider a second gamma approximation with shape and scale parameters derived from the mean and variance approximations of the normal distribution. We implement the three parametric significance tests for CMI and compare them to the randomization test of [12], as well as to other known Markov chain order estimation methods. Further, we attempt to assess the Markov chain order of DNA sequences and infer short and long range correlation on the basis of the parametric and randomization CMI testing. A systematic investigation of long range correlation of DNA sequences using the CMI approach is reported in [19].

The structure of the paper is as follows. First, in Section 2, CMI is defined and estimated on symbol sequences. Parametric significance tests for CMI of increasing orders are presented, approximating the null distribution of CMI by the normal and gamma distributions. In Section 3, we assess the efficiency of the parametric tests in estimating the Markov chain orders and compare them to other known methods. In Section 4, we apply the parametric and randomization tests to DNA sequences, and in Section 5, the results are discussed and the main conclusions are drawn.

2. Conditional Mutual Information and Markov Chain Order Estimation

We start with the definition of entropy, mutual information (MI) and conditional mutual information (CMI) for Markov chains. Let {x_t} denote a symbol sequence generated by a Markov chain {X_t}, t ≥ 1, of an unknown order L, with K possible states A = {a_1, ..., a_K}, p(x_t) the probability of x_t ∈ A occurring in the chain, X_t^m = [X_t, X_{t-1}, ..., X_{t-m+1}] a vector (word) of m successive variables of the Markov chain and p(x_t^m) the probability of a word x_t^m = {x_t, x_{t-1}, ..., x_{t-m+1}} ∈ A^m occurring in the chain. The entropy of a random variable of the Markov chain X_t is H(X_t) = -Σ_{x_t} p(x_t) ln p(x_t) and the entropy of a word X_t^m is H(X_t^m) = -Σ_{x_t, ..., x_{t-m+1}} p(x_t^m) ln p(x_t^m). The MI of two random variables in the Markov chain being m time steps apart is [20]

I(m) = I(X_t; X_{t-m}) = H(X_t) + H(X_{t-m}) - H(X_t, X_{t-m}) = Σ_{x_t, x_{t-m}} p(x_t, x_{t-m}) ln [p(x_t, x_{t-m}) / (p(x_t) p(x_{t-m}))]

and quantifies the amount of information for the one variable given the other variable. The fundamental property of a Markov chain of order L is

p(X_t | X_{t-1}, X_{t-2}, ..., X_{t-L}, X_{t-L-1}, ...) = p(X_t | X_{t-1}, X_{t-2}, ..., X_{t-L}),   (1)

meaning that the distribution of the variable X_t of the Markov chain at time t is determined only in terms of the preceding L variables of the chain. It is noted that I(m) for m > L may not drop to zero due to the existence of MI between the intermediate variables. Thus for estimating L we consider CMI, which accounts for the intermediate variables. CMI of order m is defined as the mutual information of X_t and X_{t-m} conditioning on X_{t-1}, ..., X_{t-m+1} [20]

I_c(m) = I(X_t; X_{t-m} | X_{t-1}, ..., X_{t-m+1})
       = -H(X_t, ..., X_{t-m}) + H(X_{t-1}, ..., X_{t-m}) + H(X_t, ..., X_{t-m+1}) - H(X_{t-1}, ..., X_{t-m+1})
       = Σ_{x_t, ..., x_{t-m}} p(x_t, ..., x_{t-m}) ln [p(x_t | x_{t-1}, ..., x_{t-m}) / p(x_t | x_{t-1}, ..., x_{t-m+1})].   (2)

CMI coincides with MI for successive random variables in the chain, I_c(1) = I(1). From the Markov chain property in (1), for m > L the logarithmic term in the sum of (2) is zero and thus I_c(m) = 0. On the other hand, for m ≤ L, we expect in general the two variables m time steps apart to be dependent given the m - 1 intermediate variables, i.e. I_c(m) > 0. It is possible that I_c(m) = 0 for some m < L, but not for m = L, as then the Markov chain order would not be L. So, increasing the order m, we expect in general that when I_c(m) > 0 and I_c(m + 1) = 0, then m = L. To account for complicated and rather unusual cases where I_c(m + 1) = 0 for m + 1 < L, we can extend the condition to I_c(m) > 0 and I_c(m + 1) = I_c(m + 2) = 0, and even further up to some maximum order.
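To make the estimation concrete, the plug-in (relative-frequency) estimate of I_c(m) via the four-entropy decomposition of (2) can be sketched as follows; this is a minimal sketch in Python with our own function names, not code from the paper:

```python
from collections import Counter
from math import log

def cmi_estimate(seq, m):
    """Plug-in estimate of I_c(m) = I(X_t; X_{t-m} | X_{t-1}, ..., X_{t-m+1})
    via the four-entropy decomposition of Eq. (2), with natural logarithms."""
    n_m = len(seq) - m                       # number of words of length m + 1
    words = [tuple(seq[t:t + m + 1]) for t in range(n_m)]

    def entropy(counter):
        # maximum likelihood entropy from relative frequencies
        return -sum(c / n_m * log(c / n_m) for c in counter.values())

    # each word is (x_{t-m}, ..., x_t); slicing yields the four entropy terms
    h_xyz = entropy(Counter(words))                   # H(X_t, ..., X_{t-m})
    h_xz = entropy(Counter(w[1:] for w in words))     # H(X_t, ..., X_{t-m+1})
    h_yz = entropy(Counter(w[:-1] for w in words))    # H(X_{t-1}, ..., X_{t-m})
    h_z = entropy(Counter(w[1:-1] for w in words))    # H(X_{t-1}, ..., X_{t-m+1})
    return -h_xyz + h_yz + h_xz - h_z
```

For the alternating sequence 0, 1, 0, 1, ..., which is a deterministic chain of order L = 1, the estimate gives I_c(1) ≈ ln 2 and I_c(2) ≈ 0, consistent with the stopping condition described above.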
The estimates of entropy, MI and CMI from a symbol sequence {x_t}, t = 1, ..., N, of length N are derived directly by the maximum likelihood estimates of the probabilities, given simply by the relative frequencies. As entropy and MI are fundamental quantities of information theory with many applications, there is a rich literature on the statistical properties and distribution of their estimates. Some works have focused on correcting the bias in the estimation of entropy [14, 21, 22, 15, 23, 16, 24, 25], whereas other works give approximations with parametric distributions [26, 27, 15, 18, 17]. For example, Roulston [15] estimates the bias and variance of the observed entropy and gives evidence for a normal distribution of the estimates. Pardo [26] shows that, under different assumptions, the MI estimate is either normal or a linear combination of χ² variables, while Goebel et al. [18], using a second-order Taylor approximation of the MI estimate, derive a gamma distribution, and the same for CMI. Hutter and Zaffalon [17] use a Bayesian framework with a Dirichlet prior distribution to obtain the posterior distribution of the MI estimate and derive expressions for its mean and variance.

For simplicity in the expressions below, we write X for X_t, Y for X_{t-m} and Z for the vector variable X_{t-1}, ..., X_{t-m+1}. The number of observed distinct symbols of X and Y is K, but for Z there may be fewer observed distinct words than K^{m-1}, denoted K_Z. Similarly, K_XZ, K_YZ and K_XYZ denote the numbers of observed distinct words concatenating the respective indexed variables. Note that the words XZ and YZ correspond to X_t, X_{t-1}, ..., X_{t-m+1} and X_{t-1}, ..., X_{t-m+1}, X_{t-m}, respectively, and therefore we have K_XZ = K_YZ (a discrepancy of one may occur due to edge effects). Further, we denote N_m = N - m.

An expression for the mean of the entropy estimate Ĥ(X), ⟨Ĥ(X)⟩, is given by the bias correction of Miller [14]

⟨Ĥ(X)⟩ = H(X) - (K - 1) / (2N).   (3)

The same expression holds for vector variables (words), adjusting accordingly the number of observed distinct words. The mean of the CMI estimate Î_c(m) can thus be derived by substituting the means of the entropy estimates of (3) in the expression of CMI of (2)

⟨Î_c(m)⟩ = I_c(m) + (K_XYZ - K_XZ - K_YZ + K_Z) / (2 N_m).   (4)

For the CMI variance, we follow the variance approximation for MI in [15]. We start with the error propagation formula for Î_c(m)

V[Î_c(m)] = Σ_{u=1}^{K} Σ_{v=1}^{K} Σ_{w=1}^{K_Z} (∂Î_c(m) / ∂n_uvw)² V[n_uvw],   (5)

where V[·] denotes the variance and n_uvw is the frequency of the concatenated word of XYZ that corresponds to the indices uvw. We want to express Î_c(m) in (5) in terms of the observed probabilities (relative frequencies) of joint words of XYZ, q_ijk = n_ijk / N_m, and the marginal probabilities, e.g. q_·jk = Σ_{i=1}^{K} q_ijk and q_··k = Σ_{i=1}^{K} Σ_{j=1}^{K} q_ijk. Substituting these probabilities in the four entropy terms in (2) we get

Î_c(m) = Σ_{i,j,k} q_ijk ln q_ijk - Σ_{i,k} q_i·k ln q_i·k - Σ_{j,k} q_·jk ln q_·jk + Σ_{k} q_··k ln q_··k.   (6)

Differentiation of the observed joint and marginal probabilities in (6) with respect to n_uvw gives the following expressions

∂q_ijk / ∂n_uvw = -n_ijk / N_m² + δ_iu δ_jv δ_kw / N_m,   (7)
∂q_·jk / ∂n_uvw = -(1 / N_m²) Σ_{i=1}^{K} n_ijk + δ_jv δ_kw / N_m,   (8)
∂q_i·k / ∂n_uvw = -(1 / N_m²) Σ_{j=1}^{K} n_ijk + δ_iu δ_kw / N_m,   (9)
∂q_··k / ∂n_uvw = -(1 / N_m²) Σ_{i=1}^{K} Σ_{j=1}^{K} n_ijk + δ_kw / N_m,   (10)

where δ_mn is the Kronecker delta, defined as δ_mn = 1 if m = n and δ_mn = 0 if m ≠ n. Substitution of (7)-(10) into (5) gives

V[Î_c(m)] = Σ_{u,v,w} (1 / N_m²) (-ln q_uvw + ln q_u·w + ln q_·vw - ln q_··w + Î_c(m))² V[n_uvw].
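Since the bias term in (4) and the plug-in expression (6) involve only observed word counts, both can be computed in a single pass over the sequence; a minimal sketch (our own naming, assuming natural logarithms as in the text):

```python
from collections import Counter
from math import log

def cmi_with_bias(seq, m):
    """Plug-in CMI estimate via Eq. (6) together with the bias term of Eq. (4),
    (K_XYZ - K_XZ - K_YZ + K_Z) / (2 N_m), from observed word counts."""
    n_m = len(seq) - m
    xyz = Counter(tuple(seq[t:t + m + 1]) for t in range(n_m))
    xz, yz, z = Counter(), Counter(), Counter()
    for w, c in xyz.items():            # marginal counts by dropping symbols
        xz[w[1:]] += c                  # drop earliest symbol -> XZ
        yz[w[:-1]] += c                 # drop latest symbol   -> YZ
        z[w[1:-1]] += c                 # drop both            -> Z
    ic = sum(c / n_m * log(c / n_m) for c in xyz.values()) \
        - sum(c / n_m * log(c / n_m) for c in xz.values()) \
        - sum(c / n_m * log(c / n_m) for c in yz.values()) \
        + sum(c / n_m * log(c / n_m) for c in z.values())
    bias = (len(xyz) - len(xz) - len(yz) + len(z)) / (2 * n_m)
    return ic, bias
```

Subtracting the bias from the raw estimate gives the corrected value implied by (4); the variance of (11) can be accumulated from the same counters in a second pass.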
The observed frequency n_uvw of the concatenated word of XYZ is itself a binomial random variable, n_uvw ~ B(N_m, q_uvw), considering the occurrence of the word as success with probability q_uvw and as number of trials the number N_m of possible words in the symbol sequence. Thus the variance of n_uvw is V[n_uvw] = N_m q_uvw (1 - q_uvw) and, substituting it in the expression above, we have the final expression for the variance of Î_c

V[Î_c(m)] = Σ_{u,v,w} (1 / N_m) (-ln q_uvw + ln q_u·w + ln q_·vw - ln q_··w + Î_c(m))² q_uvw (1 - q_uvw).   (11)

Thus V[Î_c(m)] is directly derived once the observed probabilities q_uvw are computed on the symbol sequence.

2.1.1. Approximation with normal distribution (ND)

In [15], similar expressions for the mean and variance of I(m) were derived to define the normal approximation of the MI distribution. Similarly, we assume that the distribution of CMI follows approximately the normal distribution (denoted hereafter ND)

Î_c(m) ~ N(⟨Î_c(m)⟩, V[Î_c(m)]),   (12)

where ⟨Î_c(m)⟩ is given by (4) and V[Î_c(m)] by (11).

2.1.2. Approximation with gamma distribution (GD1)

Goebel et al. [18] approximate the distribution of CMI by means of a second-order Taylor series expansion. The second-order Taylor approximation of MI about p(x, y) ≡ p(x) p(y) (assuming independence) is

I(X, Y) ≈ (1 / (2 ln 2)) Σ_{x∈X} Σ_{y∈Y} (p(x, y) - p(x) p(y))² / (p(x) p(y)),

and the estimate Î(X, Y) is defined accordingly, substituting p(x, y) with the observed probability q_ij = n_ij / N, where i, j ∈ {1, ..., K}, and the same for the marginal probabilities. The expression for MI is related to the χ² statistic of the standard chi-square test of independence, which is defined as

χ² = Σ_{i=1}^{K} Σ_{j=1}^{K} (n_ij - n_i· n_·j / N)² / (n_i· n_·j / N),

and follows a chi-square distribution with (K - 1)(K - 1) degrees of freedom under the assumption of independence of X and Y. The above expressions are related by χ² = 2 N ln 2 · Î(X, Y), from which the approximate gamma distribution Γ((K - 1)² / 2, 1 / (N ln 2)) of Î(X, Y) is established [18]. Further, it follows that Î(X, Y | Z) is approximately gamma distributed (denoted hereafter GD1)

Î_c(m) = Î(X, Y | Z) ~ Γ(K_Z (K - 1)² / 2, 1 / (N ln 2)).   (13)

2.1.3. Approximation with gamma distribution from the normal moments (GD2)

It is known that a gamma distribution Γ(α, β), with shape parameter α being a positive integer and scale parameter β, can be approximated by a normal distribution N(αβ, αβ²) if α is sufficiently large [28, 29]. Reversing this result, using the mean and variance of Î(X, Y | Z) in (4) and (11), respectively, we can estimate the parameters of the gamma distribution and obtain approximately the gamma distribution for Î(X, Y | Z) (denoted hereafter GD2)

Î_c(m) = Î(X, Y | Z) ~ Γ(⟨Î_c⟩² / V[Î_c], V[Î_c] / ⟨Î_c⟩).   (14)

Having determined the three parametric approximations for the distribution of Î_c(m), we use them as null distributions for the null hypothesis H0: I_c(m) = 0. Given that it always holds that I_c(m) ≥ 0, all three parametric tests are one-sided. We compute the p-value from the cumulative distribution function of the null distributions ND, GD1 and GD2 at the observed CMI Î_c(m), and we reject H0 if the p-value is less than the nominal significance level α (we set α = 0.05 for all simulations below). We apply the significance test for increasing orders m until we obtain rejection of H0 for m and no rejection of H0 for m + 1, and then the estimate of L is m. The parametric tests are denoted ND, GD1 and GD2, corresponding to the respective null distributions.

2.2. Randomization test for the significance of CMI (RD)

In a recent work [12], we developed a randomization significance test for the same null hypothesis H0: I_c(m) =
0, empirically. For the randomization test, we first generate M randomized symbol sequences {x*(1)_t}, ..., {x*(M)_t}, t = 1, ..., N, by random permutation of the initial sequence {x_t}. Then we compute Î_c(m) on the original symbol sequence, denoted Î_c^(0)(m), and on the M randomized sequences, denoted Î_c^(1)(m), ..., Î_c^(M)(m). Finally, we reject H0 if Î_c^(0)(m) lies at the right tail of the empirical null distribution formed by Î_c^(1)(m), ..., Î_c^(M)(m). To assess this we use rank ordering: if r is the rank of Î_c^(0)(m) in the ordered list of the M + 1 values, the p-value of the one-sided test is 1 - (r - 0.326)/(M + 1 + 0.348), using the plotting-position correction of [30].

Here, we show the differences of the distributions ND, GD1 and GD2 in approximating CMI with an example of two Markov chains of order L = 3 and L = 6, number of symbols K = 2, and appropriate sequence lengths N. The true distribution of Î_c(m), for order m = L, is approximated by 1000 Monte Carlo realizations, as shown in Figure 1, with the broken line displaying the histogram. The three approximating distributions are drawn setting their parameters, as defined in (12), (13) and (14), to the corresponding average values from the 1000 realizations. As shown in Figure 1, all three approximations match quite well the true distribution of CMI for L = 3, whereas for L = 6 the approximations deviate more from the true distribution, so that the actual significance level of the tests is expected to be smaller than the nominal significance level.

The three parametric tests are then compared to the randomization test. For one realization of the same Markov chains with L = 3 and L = 6, Figure 2 shows the three parametric null distributions together with the empirical null distribution of RD. For L = 3, all four tests lead to the same decision on H0: I_c(L) = 0, because all four distributions cover well the observed value of Î_c(L) (shown by a vertical dashed line in Figure 2). On the contrary, for L = 6, Î_c(L) lies towards the right tail of the ND and GD2 distributions, tending to give false rejection, and on the left of the RD and GD1 distributions, giving correctly no rejection (see Figure 2b). Moreover, the null distribution of GD1 lies further to the right of the observed value Î_c(L) than the null distribution of RD, suggesting that the test with GD1 may be more conservative than the test with RD for this setting.
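As a sketch of how the GD1 test can be implemented, the p-value of (13) requires only the regularized incomplete gamma function; the routine below is a standard series/continued-fraction implementation, the scale 1/(N ln 2) assumes the CMI estimate is measured in bits as in (13), and the function names are ours:

```python
from math import exp, lgamma, log

def reg_gamma_p(a, x, eps=1e-12, itmax=500):
    """Regularized lower incomplete gamma function P(a, x)."""
    if x <= 0:
        return 0.0
    if x < a + 1:                       # series representation
        term, total, n = 1.0 / a, 1.0 / a, a
        for _ in range(itmax):
            n += 1
            term *= x / n
            total += term
            if abs(term) < abs(total) * eps:
                break
        return total * exp(-x + a * log(x) - lgamma(a))
    tiny = 1e-300                       # continued fraction for Q(a, x), Lentz method
    b, c, d = x + 1 - a, 1 / tiny, 1 / (x + 1 - a)
    h = d
    for i in range(1, itmax):
        an = -i * (i - a)
        b += 2
        d = 1 / (tiny if abs(an * d + b) < tiny else an * d + b)
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        delta = d * c
        h *= delta
        if abs(delta - 1) < eps:
            break
    return 1 - exp(-x + a * log(x) - lgamma(a)) * h

def gd1_pvalue(ic_hat, K, K_Z, N):
    """One-sided p-value of the GD1 test of Eq. (13):
    under H0, the CMI estimate (in bits) follows Gamma(K_Z (K-1)^2 / 2, 1/(N ln 2))."""
    shape = K_Z * (K - 1) ** 2 / 2
    scale = 1 / (N * log(2))
    return 1 - reg_gamma_p(shape, max(ic_hat, 0.0) / scale)

def estimate_order(cmi_stats, K, N, alpha=0.05):
    """Sequential rule: cmi_stats[m-1] = (CMI estimate at order m, K_Z at order m);
    the order estimate is the smallest m with rejection at m and none at m + 1."""
    reject = [gd1_pvalue(ic, K, kz, N) < alpha for ic, kz in cmi_stats]
    for m in range(1, len(reject)):
        if reject[m - 1] and not reject[m]:
            return m
    # edge cases (our convention): all rejected -> last tested order; none -> 0
    return len(reject) if reject and reject[-1] else 0
```

A large estimate at order m gives a p-value near 0 (rejection), and an estimate near 0 gives a p-value near 1, so the loop stops at the first order whose CMI is compatible with the null distribution.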
Figure 1: The true distribution of Î_c(L) and the three approximations ND, GD1 and GD2, formed from 1000 Monte Carlo realizations of a Markov chain of K = 2, for the orders L = 3 and L = 6 and different lengths N (panels (a)-(c)).
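The RD test of Section 2.2 can be sketched generically, taking the CMI estimator as a plug-in function argument; the rank-based p-value uses the correction 1 - (r - 0.326)/(M + 1 + 0.348) of [30]. This is a sketch with our own naming, not the authors' code:

```python
import random

def randomization_test(seq, m, stat, M=100, seed=0):
    """RD test sketch for H0: I_c(m) = 0. `stat(seq, m)` is any CMI estimator.
    The one-sided p-value comes from the rank r of the original statistic among
    the M + 1 values: p = 1 - (r - 0.326) / (M + 1 + 0.348)."""
    rng = random.Random(seed)
    obs = stat(seq, m)
    null = []
    for _ in range(M):
        shuffled = list(seq)
        rng.shuffle(shuffled)          # destroys temporal order, keeps symbol counts
        null.append(stat(shuffled, m))
    r = 1 + sum(1 for v in null if v < obs)   # rank of obs (ties ranked low)
    return 1 - (r - 0.326) / (M + 1 + 0.348)
```

Any statistic of (sequence, order) can be passed as `stat`; with the plug-in CMI estimate this reproduces the RD test, while simpler dependence statistics can be used to check the permutation machinery.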
3. Monte Carlo Simulations
We evaluate the three parametric tests (ND, GD1 and GD2) and the randomization test (RD) using Monte Carlo simulations for varying Markov chain order L, number of symbols K and symbol sequence length N. We also compare the RD and parametric tests with four known criteria for the estimation of L: the Akaike information criterion (AIC) [3, 5], the Bayesian information criterion (BIC) [5], the criterion of Dalevi and Dubhashi, which is based on the Peres-Shields estimator (PS) [6, 7], and the criterion of Menéndez et al. (Sf) [31, 8]. For each parameter setting, we use 100 realizations and, for the RD test, M randomized sequences; for each realization we apply the significance tests of I_c(m) for increasing orders m = 1, 2, ..., beyond the true order L, as well as the aforementioned criteria. In the first simulation setup, Markov chains are derived by randomly selected transition probability matrices of given order L, while in the second simulation setup, Markov chains are derived by transition matrices of given order L estimated on two DNA sequences of genes and intergenic regions.

3.1. Randomly selected transition probabilities

For each selection of L and K, a symbol sequence of length N is generated from a transition probability matrix of size K^L × K with components randomly selected from the uniform distribution in [0, 1], under the restriction that the rows of the matrix sum to one. In a pilot study we considered both the setting of selecting a different transition probability matrix for each of the 100 realizations and the setting of using the same transition probability matrix for all realizations with different initial conditions. The results were qualitatively the same and we chose to proceed with the first setting.

Figure 2: The three parametric approximations of the null distribution of Î_c(L) and the distribution formed by the M randomized (surrogate) sequences, for a Markov chain of K = 2 and (a) L = 3, (b) L = 6. The observed value of Î_c(L) is shown by a vertical dashed line.

As expected, the simulations suggest that for all methods the success rate in identifying the true order L increases with N and decreases with L and K. As shown in Figure 3 for K =
2, all criteria attain about the same success rate in detecting the correct L for small L and sufficiently large N, with the success rate being close to 100% for the largest N (see Figure 3e and f). We can also notice that the success rate decreases with L for any N and for all methods, but it decreases differently across the methods. For each N, as L increases the success rate of GD1, RD and PS decreases more slowly with L than for the other criteria, with the success rate of PS tending to stay positive even for L = 10 (e.g. see Figure 3c). PS scores at about the same level as GD1 and RD for many combinations of N and L, and in some cases it even scores higher; e.g. for N = 200 (Figure 3b), GD1 and RD have a success rate of about 40% for L = 5, while for larger L the success rate drops further for all criteria. For larger N the three best criteria tend to align, and thus we can safely conclude that these methods perform similarly and distinctly better than the other order estimation criteria. While all criteria improve with N, Sf tends to score low even for small L.

The estimation of L is more data demanding as the number of symbols increases, as shown in Figure 4 for K =
4. The success rate tends to increase with N, but for ND, GD2, AIC and BIC this can be seen only for small L, with these criteria scoring lower as L increases. Here, GD1 and RD have very similar performance, with GD1 scoring more often higher, and they both score highest in most cases, especially for large L and N.

3.2. Transition probabilities estimated on DNA

DNA consists basically of four nucleotides, the two purines, adenine (A) and guanine (G), and the two pyrimidines, cytosine (C) and thymine (T), so a DNA sequence can be considered as a symbol sequence on the symbols A, C, G, T. In our analysis we use a large segment of Chromosome 1 of the plant Arabidopsis thaliana (data were obtained from the database: http://). We use two sequences: one sequence derived by joining together the genes, which contain non-coding regions, called introns, in between the coding regions, called exons, and another sequence joining together the intergenic regions, which have solely non-coding character. The sequences used here are segments of the long sequences used in [32].

In this simulation setup we form the Markov chains from transition probability matrices of given order L estimated on the two DNA sequences of genes and intergenic regions, and we generate 100 symbol sequences from each of these Markov chains for different initial conditions. The purpose here is to consider Markov chains with a distinct structure of the transition probability matrix for each order L that relates to a real-world Markov chain. The results for the success rate of correct estimation of the true order L with all the criteria, for K = 2 and K = 4, order L = 1, 2, 3, and N varying from 100 to 6400, are shown in Figure 5 for the genes and in Figure 6 for the intergenic regions.

For both genes and intergenic regions, all the criteria fail when the order gets large (L > 3) and only PS maintains a positive success rate, but at the same low level of 10% - 20% and rather independently of N. For smaller orders, the success rate increases with N but at low levels differing across the criteria (see Figure 5a and b for genes with K = 2 and K = 4, respectively, and the same in Figure 6a and b for intergenic regions). These results suggest that the task of estimating the true L of a Markov chain with the structure of transition probabilities as in DNA sequences is more difficult than when the transition probabilities are selected at random. Concerning the CMI-based tests, again ND and GD2 fail to estimate the true L for both genes and intergenic regions, while GD1 follows tightly with RD, both being suboptimal but scoring consistently well compared to all other criteria. For example, for genes and L = 2, when K = 2 the success rate of GD1 and RD increases with N (and is higher than that of all others), but when K = 4 PS scores at the same level or higher at large N. AIC scores highest of all criteria only in few settings of K, L and N (Figure 5b), indicating that the data requirement of AIC with the increase of K is disproportionately high compared to the other criteria. On the other hand, PS estimates correctly the order L at the same low rate regardless of N for large L, so that it hits the true order at a percentage of cases depending on the range of the tested m values, whereas the other criteria underestimate the order. GD1 and RD have thus the most consistent behavior, increasing the probability (success rate) to identify the true order with N at a level depending on L and K.

Comparing the results of the criteria for the two types of DNA sequences, they differ in the level of success rate for given K, L and N. Though the relative differences of the criteria are the same, the level of success rate tends to be higher for the intergenic regions, specifically for K =
2, indicating that the Markov chain of the same order L obtained on the basis of the intergenic regions is less complex, i.e. the order is better detectable than for the genes. For example, for K = 2 and L = 3, it can be seen in Figure 5c that GD1 and RD reach a success rate of 40% only at the largest tested N.
4. Application on DNA sequences
Much of the statistical analysis of DNA sequences is focused on the estimation of properties of coding and non-coding regions, as well as on the discrimination of these regions. There has been evidence that there is a different structure in coding and non-coding sequences, and that the non-coding sequences tend to have long range correlation, whereas the correlation in coding sequences exhibits exponential decay [33, 34, 35]. Here we use intergenic and gene sequences. The latter is a mixture of coding regions (exons) and non-coding regions (introns), and therefore we expect it to have also long range correlation due to the non-coding regions in it, but this should be less than the correlation in the intergenic regions, which consist only of non-coding parts. Thus neither DNA sequence can be considered as a Markov chain, at least not of a moderate order, and the estimate of the order L should increase with the available data size.

We estimate the order L of a hypothesized Markov chain underlying Chromosome 1 of the plant Arabidopsis thaliana by the three parametric tests ND, GD1 and GD2, the RD, as well as the criteria AIC, BIC, PS and Sf. The computations are done for both genes and intergenic regions at increasing lengths N of the DNA sequence. The estimated order increases with N, indicating the presence of a Markov chain of a very large order (larger than the maximum order that can be detected for this N) or a chain with long range correlations. The limits of the detectable order increase with N, and the CMI-based tests give a larger estimate of L for larger N and a larger estimate of L for intergenic regions than for genes. On the other hand, the estimated L from the criteria AIC, BIC and PS changes irregularly with N and is not always larger for the intergenic regions, giving inconclusive results. The agreement of the L estimation by GD1 and RD is remarkable, both giving exactly the same estimate for either of the two DNA types at all but the largest tested length N. There, the difference is small for genes, with GD1 estimating L = 11 and RD L = 10, and larger for intergenic regions, with GD1 giving L = 16 and RD L = 11. The other two CMI-based criteria, ND and GD2, give estimates of L close to those of GD1 and RD, and so does Sf, but tending to give a somewhat smaller estimate of L as N increases. The overall results suggest that the symbol sequence of intergenic regions tends to have larger order, being thus more consistent with the hypothesis of long range correlation. This is confirmed by the four CMI-based criteria and Sf, but RD and GD1 in addition turn out to be able to estimate large L, as justified also by the simulation results.
5. Conclusions
In this work we propose and assess parametric tests of significance of the conditional mutual information (CMI) for the estimation of the order of a Markov chain. The null distribution of CMI is approximated by the normal distribution and two different approximations of the gamma distribution. Simulations showed that among the three parametric tests the one based on the gamma distribution with analytically defined parameters (GD1) performed best for any Markov chain order L and number of symbols K, and even for short lengths of symbol sequences. The practical aim of the study was to investigate whether a parametric test can reach the order estimation accuracy of the respective randomization test (RD), recently implemented and found to be compatible with, and often better than, the known order estimation criteria. The simulation study confirmed that GD1 performs similarly to RD, and both compare favorably to other known criteria (AIC, BIC, the Peres-Shields estimator and the criterion of Menéndez et al. [31, 8]).

Having established the equivalence of performance of GD1 and RD, the advantage of GD1 is its computational efficiency, allowing order estimation based on CMI to be possible for very long symbol sequences, such as DNA sequences. Obviously, RD applied with a number M of randomized sequences requires about M times more computation time than GD1, and thus the application of RD is prohibitive for very long symbol sequences. This was the case for the DNA sequences: for increasing lengths N of genes and intergenic regions of Chromosome 1 of the plant Arabidopsis thaliana, we could establish an increase of the estimated order with the length of the DNA sequence, indicating the presence of either a very large Markov chain order, not reached by the tested sequence lengths, or long range correlations (this is further explored in a focused study in [19]). Further, we could also distinguish genes from intergenic regions, as a lower order was estimated in genes, which consist of coding and non-coding parts, than in intergenic regions, which contain non-coding parts exclusively.
References

[1] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, United Kingdom, 1998.
[2] G. Dong, J. Pei, Sequence Data Mining, Springer, 2007.
[3] H. Tong, Determination of the order of a Markov chain by Akaike's Information Criterion, Journal of Applied Probability 12 (3) (1975) 488–497.
[4] R. Katz, On some criteria for estimating the order of a Markov chain, Technometrics 23 (3) (1981) 243–249.
[5] P. Guttorp, Stochastic Modeling of Scientific Data, Stochastic Modeling Series, Chapman and Hall, London, 1995.
[6] Y. Peres, P. Shields, Two new Markov order estimators, arXiv:math/
[10] χ-divergence, Canadian Journal of Statistics 42 (2014) 563–578.
[11] L. Zhao, C. Dorea, C. Gonçalves, On determination of the order of a Markov chain, Statistical Inference for Stochastic Processes 4 (2001) 273–282.
[12] M. Papapetrou, D. Kugiumtzis, Markov chain order estimation with conditional mutual information, Physica A 392 (2013) 1593–1601.
[13] D. Pethel, W. Hahs, Exact significance test for Markov order, Physica D 269 (2014) 42–47.
[14] G. A. Miller, Note on the bias of information estimates, in: Information Theory in Psychology: Problems and Methods, The Free Press, Monticello, IL, 1955, pp. 95–100.
[15] M. Roulston, Estimating the errors on measured entropy and mutual information, Physica D 125 (3–4) (1999) 285–294.
[16] L. Paninski, Estimation of entropy and mutual information, Neural Computation 15 (2003) 1191–1253.
[17] M. Hutter, M. Zaffalon, Distribution of mutual information from complete and incomplete data, Computational Statistics and Data Analysis 48 (2005) 633–657.
[18] B. Goebel, Z. Dawy, J. Hagenauer, J. Mueller, An approximation to the distribution of finite sample size mutual information estimates, in: IEEE International Conference on Communications, Vol. 2, 2005, pp. 1102–1106.
[19] M. Papapetrou, D. Kugiumtzis, Investigating long range correlation in DNA sequences using significance tests of conditional mutual information, Computational Biology and Chemistry 53A (2014) 32–42.
[20] T. M. Cover, J. A. Thomas, Elements of Information Theory, Wiley, London, 1991.
[21] P. Grassberger, Finite sample corrections to entropy and dimension estimates, Physics Letters A 128 (1988) 369–373.
[22] A. O. Schmitt, H. Herzel, W. Ebeling, A new method to calculate higher-order entropies from finite samples, Europhysics Letters 23 (1993) 303–309.
[23] T. Schürmann, Scaling behavior of entropy estimates, Journal of Physics A: Mathematical and General 35 (2002) 1589–1596.
[24] J. A. Bonachela, H. Hinrichsen, M. A. Muñoz, Entropy estimates of small data sets, Journal of Physics A: Mathematical and Theoretical 41 (2008) 202001–202009.
[25] A. Lesne, J. L. Blanc, L. Pezard, Entropy estimation of very short symbolic sequences, Physical Review E 79 (2009) 046208.
[26] J. A. Pardo, Some applications of the useful mutual information, Applied Mathematics and Computation 72 (1995) 33–50.
[27] D. H. Wolpert, D. R. Wolf, Estimating functions of probability distributions from a finite set of samples, Physical Review E 52 (1995) 6841.
[28] N. Johnson, S. Kotz, N. Balakrishnan, Continuous Univariate Distributions, 2nd Edition, Vol. 1, John Wiley and Sons, New York, 1994.
[29] P. Billingsley, Probability and Measure, John Wiley & Sons, 1995.
[30] G.-H. Yu, C.-C. Huang, A distribution free plotting position, Stochastic Environmental Research and Risk Assessment 15 (2001) 462–476.
[31] M. Menéndez, J. Pardo, L. Pardo, K. Zografos, On tests of dependence based on minimum φ-divergence estimator with constraints: an application to modeling DNA, Computational Statistics and Data Analysis 51 (2) (2006) 1100–1118.
[32] D. Kugiumtzis, A. Provata, Statistical analysis of gene and intergenic DNA sequences, Physica A 342 (3–4) (2004) 623–638.
[33] C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, H. E. Stanley, Long-range correlation in nucleotide sequences, Nature 356 (6365) (1992) 168–170.
[34] S. V. Buldyrev, N. V. Dokholyan, A. L. Goldberger, S. Havlin, C.-K. Peng, H. E. Stanley, G. M. Viswanathan, Analysis of DNA sequences using methods of statistical physics, Physica A 249 (1998) 430–438.
[35] Y. Almirantis, A. Provata, Long- and short-range correlations in genome organization, Journal of Statistical Physics 97 (1999) 233–262.
Figure 3: Number of cases out of 100 realizations in which the true order L is estimated by the criteria ND, GD1, GD2, RD, AIC, BIC, PS and Sf, vs the order L, as shown in the legend. The panels (a)–(f) correspond to different sequence lengths N and numbers of symbols K.
Figure 4: Number of cases out of 100 realizations in which the true order L is estimated by the criteria ND, GD1, GD2, RD, AIC, BIC, PS and Sf, vs the sequence length N, as shown in the legend. The symbol sequences are generated by Markov chains with a fixed number of symbols K and order (a) L = 2, (b) L = 3, and higher orders in (c) and (d).
Figure 5: Number of cases out of 100 realizations in which the true order L is estimated by the criteria as shown in the legend, vs the sequence length N. The symbol sequences are generated by Markov chains with transition probability matrices estimated on a DNA sequence of genes. The panels are for purines and pyrimidines (K = 2) and for K = 4 symbols, each for a range of orders L.
Figure 6: Same as Figure 5, but for the DNA sequence of intergenic regions.

Figure 7: The estimated order L by ND, GD1, GD2, RD, AIC, BIC, PS and Sf for sequences of purines and pyrimidines (K = 2) from genes and intergenic regions of Chromosome 1 of the plant Arabidopsis thaliana, as indicated in the legend. The panels correspond to increasing sequence lengths, with (b) N = 16000, (c) N = 32000, (d) N = 64000 and (e) N = 128000.