A Markovian genomic concatenation model guided by persymmetric matrices
aa r X i v : . [ q - b i o . GN ] N ov A Markovian genomic concatenation model guided by persymmetricmatrices
Andrew G. Hart † Marcelo Sobottka ‡∗ † Departamento de Ingenier´ıa Matem´atica and Centro de Modelamiento Matem´atico, Universidad de Chile, Chile. ‡ Departamento de Matem´atica, Universidade Federal de Santa Catarina, Brazil
Abstract
The aim of this work is to provide a rigorous mathematical analysis of a stochastic concatenation modelpresented by Sobottka and Hart (2011) which allows approximation of the first-order stochastic structure inbacterial DNA by means of a stationary Markov chain. Two probabilistic constructions that rigorously formalizethe model are presented. Necessary and sufficient conditions for a Markov chain to be generated by the modelare given, as well as the theoretical background needed for designing new algorithms for statistical analyses ofreal bacterial genomes. It is shown that the model encompasses the Markov chains satisfying intra-strand parity,a property observed in most DNA sequences.Keywords: stochastic matrix, Markov chain, DNA sequence.2010 MSC: 60J10, 15B51, 92D20, 92C40.
One of the fundamental issues in the study of genomes is their primary structure, that is, the distribution ofnucleotides along DNA sequences. The identification of statistical patterns in the primary structure of DNAsequences has revealed several underlying patterns in genomes [2, 10, 11, 18] and has enabled scientists to proposemodels for evolutive pressures and mutational mechanisms that might act on organisms [1, 6, 18] as well as toconstruct bioinformatics tools. For example, in [3], a maximum likelihood approach was used to perform analysesof DNA sequences in order to estimate evolutionary trees, while in [20], a measure of the long-range correlationbetween the nucleotide bases of DNA sequences was used to classify bacteria. In addition, strand compositionalasymmetry (SCA) was used to detect replication origins in bacteria [4], while [17] used interpolated Markov modelsto identify genes in bacteria, [8] proposed a maximization model to describe the organization and distribution ofgenes in bacterial DNA and [12] presented a stationary stochastic process for modeling the placement of coding andnon-coding regions within a genome that incorporates the phenomenon of start codons appearing within codingregions.The aim of this work is to provide a rigorous formalization of a stochastic concatenation model for capturing theprimary structure of bacterial DNA sequences which was presented in [18]. The model, henceforth referred to as theS-H model, allowed novel statistical symmetries in the mononucleotide and dinucleotide distributions of a collectionof bacterial chromosomes to be observed. A key feature of the model is a persymmetric matrix of probabilities whichplays a role in determining the nucleic acids seen along a DNA sequence. The persymmetric matrices constitute aspecial class of matrices which has been employed in models from various fields (see for example [13, 14, 15]) andwhich has been widely studied (see for example [5, 9, 16, 19]). ∗ Corresponding author.E-mail addresses: [email protected] (A. Hart), [email protected] (M. Sobottka). A ), cytosine ( C ), guanine ( G ) and thymine ( T ). Of these types, adenine is complementary tothymine while cytosine is complementary to guanine. Each nucleotide on one DNA strand pairs with its complementon the opposite strand. This chemically induced pairing between the two strands causes the strands to assume aladder-like arrangement which is then twisted to attain the famous helix. The chemical composition of DNAmolecules endows a strand with an intrinsic reading direction: each strand can only be read in one direction by thegenetic machinery of the cell. Furthermore, the way strands combine to form a duplex means that the two strandsare read in opposite directions: they are said to be antiparallel.We shall identify each nucleotide type with a number in N := { , , , } ( A ≡ C ≡ G ≡ T ≡ α : N → N be the involution which maps each nucleotide to its complement, that is, α ( i ) = 5 − i .The S-H model is a concatenation model which has at its core a first-order Markov chain whose one-step transitionmatrix P = (cid:0) P ij (cid:1) i,j ∈ N is derived from a positive parameter m and a positive persymmetric matrix L = (cid:0) L ij (cid:1) i,j ∈ N : P ij = L ij M j P k ∈ N L ik M k , (1)where M = M := m/ (2 m + 2) and M = M := 1 / (2 m + 2). The Markov chain governs how the DNA sequencegrows in both directions from an initial nucleotide called the origin by appending nucleotides in three steps. Step1. a nucleotide of type j is randomly selected with probability M j . Step 2. with probability 1 /
2, the nucleotidetries to join the end (consonant with the DNA reading direction ) or beginning (contrary to the reading direction)of the sequence.
Step 3.
In the first case, the nucleotide is appended to the sequence with probability L ij , where i is the type of the last nucleotide in the sequence; in the latter, the nucleotide is prepended to the sequence withprobability L α ( k ) α ( j ) , where k is the type of the initial nucleotide. This scheme is illustrated in Figure 1.Provided nucleotides accumulate evenly at the ends of the DNA strand, after a long time one would obtain (withprobability 1) a sequence with the initial nucleotide at its midpoint. One half would be generated by the stationaryMarkov chain ( P, π ), where the transition matrix P is given by (1) and the chain’s stationary distribution π is the lefteigenvector of P corresponding to the eigenvalue 1 normalised to sum to 1. The other half would have distributiongiven by the stationary Markov chain ( ˜ P , ˜ π ) where ˜ π i = π α ( i ) and ˜ P ij = π α ( j ) π α ( i ) P α ( j ) α ( i ) , for i, j ∈ N . The model isconsistent with the observation reported by geneticists that bacterial DNA sequences are usually composed of twodistinct segments called chirochores (see [4]). Furthermore, if one estimates the transition matrices ˜ P and P for thesegments prior to and following the origin nucleotide respectively, one usually finds that ˜ P ij ≈ π α ( j ) π α ( i ) P α ( j ) α ( i ) (see[18]).Figure 1: A schematic presentation of the S-H model for constructing bacterial DNA sequences. Assuming thereading sense of the sequence is from left to right, a new nucleotide of type C is selected with probability 1 / (2 m + 2)and is appended to the end of the sequence with probability L , while a nucleotide of type T is selected withprobability m/ (2 m + 2) and will be attached to the beginning of the sequence with probability L α (3) α (4) . Thefinal DNA sequence obtained is the concatenation of two Markovian processes: one starting at position zero andextending to the right, whose estimated transition matrix is P ; and the other terminating at zero, whose estimatedtransition matrix is ˜ P .The paper is organized as follows. Section 2 discusses the probabilistic interpretation of the form (1) of thematrix P in greater depth than [18]. Two different probabilistic constructions are presented, the first of whichprovides the justification for the description of DNA sequence growth given above. Section 3 introduces the set2f ℵ -generated matrices as matrices of the form (1), where ℵ is the set of positive persymmetric matrices, andestablishes several algebraic characterizations of such matrices. The non-uniqueness of the persymmetric matrix L and positive parameter m that define an ℵ -generated matrix is then considered in Section 4, where a coupleof equivalence relations on ℵ are considered. This leads to an examination of various properties of ℵ -generatedmatrices as used in the S-H model in Section 5. Finally, we discuss some measures for determining how closely aDNA sequence conforms to the S-H model and make concluding remarks in Section 6. P In [18], a formal description of the way nucleotides are appended to a DNA sequence using the persymmetric matrix L and the parameter m was presented, but the explicit connection with stochastic matrices of the form (1) was leftfor the reader to deduce. Here, we more rigorously discuss how the form (1) of the stochastic matrix P arises fromthe DNA-sequence growth mechanism described above. In addition, we shall present an alternative probabilisticinterpretation of the growth mechanism. To begin, consider the growth of a DNA sequence whose initial nucleotide is taken to be of type i . Let ( β t , t ≥ N with common distribution M = ( M , M , M , M ) = ( m, , , m ) (cid:14) (2 m + 2), that is, anindependent and identically distributed sequence of random variables on N with β s ∼ M . Consider two coupledstochastic processes ( V t , t ≥ N , and ( W t , t ≥ { , } -process where W t is 1 with probability L V t β t (that is, W t ∼ B ( L V t β t )). By setting V := i as the type of theinitial nucleotide from which the DNA sequence grows, the process ( V t , t ≥
0) evolves as a deterministic functionof ( β t , t ≥
0) and ( W t , t ≥
0) as follows: V t +1 := β t W t + V t (1 − W t ) = ( β t , if W t = 1 V t , if W t = 0 , ∀ t ≥ . Note that, while V t denotes the type of the last nucleotide appended to the sequence, β t corresponds to themechanism responsible for proposing the type, say j , of the next nucleotide to concatenate to the sequence, and W t corresponds to the mechanism responsible for accepting or rejecting the new nucleotide in the sequence. If β t = j ,then j is accepted as the type of the next nucleotide provided that W t = 1, in which case V t +1 is set to j . Otherwise,the nucleotide of type j is discarded and no nucleotide is appended. In that case, V t +1 takes the value of V t . In thisway, t counts the number of nucleotides proposed rather than the length of the DNA sequence while the number ofacceptances, given by P tu =1 W u , is one less than the length of the DNA sequence, since it doesn’t count the initialnucleotide. For all i ∈ N and t ≥
0, we define γ i := Pr( W t = 1 | V t = i ) = X j ∈ N Pr( W t = 1 , β t = j | V t = i )= X j ∈ N Pr( W t = 1 | β t = j, V t = i ) Pr( β t = j | V t = i )= X j ∈ N Pr( W t = 1 | β t = j, V t = i ) Pr( β t = j ) = X j ∈ N L ij M j . Next, define a sequence ( τ s , s ≥
0) of stopping times by τ := 0 and τ s +1 := min { t > τ s : W t − = 1 } . The τ s ’s mark the nucleotide type proposals that were accepted. By construction, they constitute a series ofrenewal times. Note that ( V t , t ≥
0) is a discrete step function which transitions to a new nucleotide whenever t ∈ { τ s , s ≥ } . More precisely, for all s ≥ V t = V τ s for t = τ s , τ s + 1 , . . . , τ s +1 −
1. Let i ∈ N and w ∈ { , } .The random variable β t is independent of W u for u < t and the distribution of W t is completely determined by the3alue of β t and V t . Consequently, the event { W t = w is conditionally independent of { W u = 0 } for all u < t given V t = i . For i ∈ N and t > u ≥
0, we havePr( W t = 1 , W t − = 0 , . . . , W u = 0 | V u = i ) = Pr( W t = 1 | W t − = 0 , . . . , W u = 0 , V u = i ) · P r ( W t − = 0 , . . . , W u = 0 | V u = i )= Pr( W t = 1 | V t = i, W t − = 0 , . . . , W u = 0 , V u = 0) · Pr( W t − = 0 , . . . , W u = 0 | V u = i )= Pr( W t = w | V t = i ) Pr( W t − = 0 , . . . , W u = 0 | V u = i )= γ i Pr( W t − = 0 , . . . , W u = 0 | V u = i )and Pr( W t = 0 , . . . , W u = 0 | V u = i ) = (1 − γ i ) Pr( W t − = 0 , . . . , W u = 0 | V u = i ) . Hence, for s ≥ t ≥ i ∈ N , we obtainPr( τ s +1 − τ s = t | V τ s = i ) = Pr( W τ s + t − = 1 , w τ s + t − = 0 , . . . , W τ s = 0 | V τ s = i )= Pr( W τ s + t − = 0 , . . . , W τ s = 0 | V τ s = i ) γ i = Pr( W τ s + t − = 0 , . . . , W τ s = 0 | V τ s = i )(1 − γ i ) γ i = · · · = (1 − γ i ) t − γ i . Conditional on V τ s = i , τ s +1 − τ s is thus a geometric random variable taking values on the positive integers: τ s +1 − τ s | V τ s = i ∼ geom (cid:0) γ i (cid:1) , s ≥ , i ∈ N. Observe that the distribution of τ s +1 − τ s is completely determined by the value of V τ s and is independent of anyevents prior to τ s if V τ s is given. Furthermore, τ s +1 − τ s | V τ s = i is identically distributed as τ | V = i , for all s > U s , s ≥
0) by U s := V τ s . Suppose that V τ s = i for some fixed s ≥
0. Then V τ s +1 is determined by β τ s +1 − and V τ s = β τ s − , which are independent of all β t , V t and W t for all t prior to τ s − U s , s ≥
0) has the Markov property:Pr( U s +1 = j | U s = i, U s − = i , . . . , U = i s ) = Pr( U s +1 = j | U s = i ) , for all i , i , . . . , i s ∈ N and s ≥
0. Finally, since each τ s essentially marks a point at which the process (cid:0) ( β t , V t , W t ) , t ≥ (cid:1) is restarted, we havePr( U s +1 = j | U s = i ) = Pr( V τ s +1 = j | V τ s = i ) = Pr( V τ = j | V τ = i ) = Pr( U = j | U = i ) =: P ij , for all s ≥
0. Therefore, ( U s , s ≥
0) is a time-homogeneous Markov chain on the finite state space N . The followingtheorem gives the form of the one-step transition matrix P = (cid:0) P ij (cid:1) i,j ∈ N in terms of L and M . Theorem 1.
The one-step transition matrix P = (cid:0) P ij (cid:1) i,j ∈ N of the Markov chain ( U s , s ≥ is given by P ij := L ij M j P k ∈ N L ik M k . Proof.
Let τ := τ . Now, P ij = Pr( U = j | U = i )= Pr( V τ = j | V = i )= ∞ X t =1 Pr( V t = j, τ = t | V = i )= ∞ X t =1 Pr( V t = j, τ = t | V = i )Pr( τ = t | V = i ) Pr( τ = t | V = i )= ∞ X t =1 Pr( V t = j, τ = t | V = i ) P k ∈ N Pr( V t = k, τ = t | V = i ) Pr( τ = t | V = i ) . (2)4owever,Pr( V t = j, τ = t | V = i )= Pr( β t − = j, W t − = 1 , τ = t | V = i )= Pr( β t − = j, W t − = 1 , W u = 0 , u = 1 , . . . , t − | V = i )= Pr( β t − = j, W t − = 1 | V = i, W u = 0 , u = 1 , . . . , t −
2) Pr( W u = 0 , u = 1 , . . . , t − | V = i )= Pr( β t − = j, W t − = 1 | V t − = i ) Pr( W u = 0 , u = 1 , . . . , t − | V = i )= Pr( W t − = 1 | β t − = j, V t − = i ) Pr( β t − = j | V t − = i ) Pr( W u = 0 , u = 1 , . . . , t − | V = i )= L ij M j Pr( W u = 0 , u = 1 , . . . , t − | V = i )and substituting this into (2) yields P ij = ∞ X t =1 Pr( V t = j, τ = t | V = i ) P k ∈ N Pr( V t = k, τ = t | V = i ) Pr( τ = t | V = i )= ∞ X t =1 L ij M j Pr( W u = 0 , u = 1 , . . . , t − | V = i ) P k ∈ N L ik M k Pr( W u = 0 , u = 1 , . . . , t − | V = i ) Pr( τ = t | V = i )= ∞ X t =1 L ij M j P k ∈ N L ik M k Pr( τ = t | V = i )= L ij M j P k ∈ N L ik M k . Clearly, the matrix P is invariant to rescaling L . The only effect of rescaling L by some constant, say h , is tomultiply the mean 1 /γ i of the distribution of τ s +1 − τ s | V τ s = i by a factor of 1 /h . Of course, while such scalingpreserves the persymmetry of L , it only makes sense if 0 < h < min { /γ i : i ∈ N } . There is another way to represent how new nucleotides are added to a DNA sequence which provides an alternativederivation of the Markov chain on N with one-step transition matrix P of the form (1). Let ( Y s , s ≥
0) be a Markovchain on the set of nucleotides N with transition matrix K = ( K ij ) i,j ∈ N given by K ij = L ij (cid:14) P k ∈ N L ik . Thus, theone-step transition matrix of ( Y s , s ≥
0) is obtained by converting the positive persymmetric L into a stochasticmatrix by normalizing its rows to sum to unity. Next, let ( B s , s ≥
0) be a Bernoulli scheme on N with commondistribution M . Since ( Y s , s ≥
0) is a positive recurrent Markov chain on the finite state space N and ( B s , s ≥ N that is independent of ( Y s , s ≥ (cid:0)(cid:0) Y s , B s (cid:1) , s ≥ (cid:1) is a positiverecurrent Markov chain on the state space N × N with one-step transition matrix (cid:0) R ( i,k ) , ( j,l ) (cid:1) ( i,k ) , ( j,l ) ∈ N given by R ( i,k ) , ( j,l ) = K ij M l .We shall assume without loss of generality that Y = B . Define a sequence of stopping times ( T s , s ≥
0) by T := 0 and T s +1 := min { t > T s + 1 : Y t − = B t − = Y T s and Y t = B t } , for s ≥
0. By definition, Y T s = B T s for all s ≥ Y T s − = B T s − for all s ≥
1. Observe that if Y T s and B T s aregiven, for example, Y T s = B T s = i , then T s +1 − T s = min { t > T s + 1 : Y t − = B t − = Y T s and Y t = B t } − T s = min { t > Y t − = B t − = i and Y t = B t } . Thus, T s +1 − T s is independent of T s if Y T s is given. Furthermore, T s +1 − T s | Y T s = i has the same distribution as T | Y = i . Thus, each T s is a renewal time at which the Markov chain (cid:0)(cid:0) Y s , B s (cid:1) , s ≥ (cid:1) is restarted.Next, define the stochastic process ( X s , s ≥
0) by X s := Y T s . Since (cid:0)(cid:0) Y s , B s (cid:1) , s ≥ (cid:1) is a Markov chain and( T s , s ≥
0) is a sequence of stopping times at which it renews, one may employ the strong Markov property todeduce that ( X s , s ≥
0) is also a Markov chain. It only remains to compute its one-step transition matrix.5 heorem 2.
The Markov chain ( X s , s ≥ has one-step transition matrix P = ( P ij ) i,j ∈ N , where P ij := L ij M j P k ∈ N L ik M k . Proof.
Fix X = B = i and let T := T . Then, P ij = Pr( X = j | X = i )= ∞ X t =2 Pr( Y T = j, T = t | X = i )= ∞ X t =2 Pr( Y t = j | T = t, X = i ) Pr( T = t | X = i )= ∞ X t =2 Pr( Y t = j, B t = j | Y t = B t , Y t − = i, B t − = i, Y t − = B t − , . . . , Y = B , Y = B , Y = i, B = i ) · Pr( T = t | X = i )= ∞ X t =2 Pr( Y t = j, B t = j | Y t = B t , Y t − = i ) Pr( T = t | X = i )= ∞ X t =2 Pr( Y t = j, B t = j | Y t − = i )Pr( Y t = B t | Y t − = i ) Pr( T = t | X = i )= ∞ X t =2 Pr( Y t = j, B t = j | Y t − = i ) P k ∈ N Pr( Y t = k, B t = k | Y t − = i ) Pr( T = t | X = i )= ∞ X t =2 K ij M j P k ∈ N K ik M k Pr( T = t | X = i )= L ij M j P k ∈ N L ik M k ∞ X t =2 Pr( T = t | X = i )= L ij M j P k ∈ N L ik M k , since ∞ X t =2 Pr( T = t | X = i ) = 1and Pr { Y t = j, B t = j | Y t − = i ) = Pr { Y t = j | Y t − = i ) Pr( B t = j ) = K ij M j . Thus, the mechanism by which nucleotides are appended to a DNA sequence according to a Markov chain withtransition matrix P may also be described as follows. Suppose that the last nucleotide in the sequence is of type i .Then, one simply waits until both the Markov chain ( Y s ) and the i.i.d. sequence ( B s ) simultaneously return tostate i and both immediately jump to the same state, say j . When such a consecutive pair of concordant eventsoccurs, a nucleotide of type j is appended to the sequence. At this point, this scheme is repeated, but using j asthe initial state, so that one waits for A coincident return of the two processes to state j followed by simultaneoustransitions to a new state, say k , and so on. The Markov chain Y s transitions from i to j with probability K ij while B s selects j with probability M j . In contrast to the original description given in [18] and in Section 1, twonucleotides of types j and k are selected with probabilities M j and K ij respectively and a nucleotide of type j isthen appended to the end of the sequence if and only if they are of the same type. In essence, the mechanismby which nucleotides are appended to the DNA sequence can be thought of as carrying out acceptance rejectionsampling, by repeatedly drawing independent sample nucleotides from the distributions ( K ij , j ∈ N ) and M untilthey agree (assuming i is the type of the nucleotide at the end of the sequence). In this case, the number of drawsneeded in order to obtain a suitable nucleotide is a geometric random variable with mean 1 (cid:14) P j ∈ N K ij M j . Thefirst interpretation also amounts to performing acceptance-rejection sampling, but with a two-step procedure in6hich a nucleotide type j is first proposed by sampling it from the distribution M and then is added to the DNAsequence according to an unfair coin toss with probability L ij .Finally, we note that if the matrix L is rescaled so that P i,j ∈ N L ij = 1, it admits the natural interpretation asthe stationary dinucleotide probability distribution, that is, L ij = Pr( Y t = i, Y t +1 = j ) , i, j ∈ N, t ≥ . As noted above, L remains persymmetric under this kind of rescaling. ℵ -generated matrices Let S be the set of all 4 × ℵ be the cone of positive persymmetric matrices (matrices L = ( L ij ) i,j ∈ N with positive entries such that L ij = L α ( j ) α ( i ) for all i, j ∈ N ). Given P ∈ S we will say that ( P, π )is a stationary Markov chain if the vector π = ( π i ) i ∈ N is such that πP = π .Let ̥ : ℵ × (0 , + ∞ ) → S be the map which takes ( L , m ) to the matrix ̥ ( L , m ), which is given for all i, j ∈ N by ( ̥ ( L , m )) ij := L ij M j P k =1 L ik M k , where M ℓ = (cid:26) m/ (2 m + 2) , if ℓ = 1 , , / (2 m + 2) , if ℓ = 2 , . (3)Since L is a positive matrix and m >
0, the matrix ̥ ( L , m ) is primitive, that is, irreducible and aperiodic. Definition 3.
We say that P ∈ S is ℵ -generated if there exist ( L , m ) ∈ ℵ × (0 , + ∞ ) such that P = ̥ ( L , m ).Let Φ : S × (0 , + ∞ ) × (0 , + ∞ ) → ℵ be the map defined for all stochastic matrices P , ˜ m > s > P, ˜ m, ˜ s ) := ˜ s a P κ P / ˜ m a P κ P / ˜ ma P a P ǫ P ˜ m ǫ P ˜ m a P a P a P ǫ P a P ˜ m a P ǫ P ˜ m a P a P a P κ P / ˜ m a P a P a P a P κ P / ˜ m , where a ij P := P ij /P iα ( j ) ; κ P := P (cid:14) P ; ǫ P := P (cid:14) P . (4)From (3) it follows that if P is an ℵ -generated matrix for some L = (cid:0) L ij (cid:1) i,j ∈ N ∈ ℵ and m ∈ (0 , + ∞ ), then thenine ratios that appear in (4) become: a ij P = L ij /L iα ( j ) , κ P = L m/L and ǫ P = L /L m. (5) Theorem 4.
For any ℵ -generated matrix P , ̥ − ( P ) = (cid:8)(cid:0) Φ( P, ˜ m, ˜ s ) , ˜ m (cid:1) : ˜ m > , ˜ s > (cid:9) . Proof.
Let P = ̥ ( L , m ) for some fixed L ∈ ℵ and m > L := Φ( P, ˜ m, ˜ s ) for any choice of ˜ m, ˜ s >
0, it is straightforward to check that ̥ (˜ L , ˜ m ) = ̥ ( L , m ) = P .Therefore, (cid:8)(cid:0) Φ( P, ˜ m, ˜ s ) , ˜ m (cid:1) : ˜ m > , ˜ s > (cid:9) ⊆ ̥ − ( P ).On the other hand, suppose L ′ = ( L ′ ij ) i,j ∈ N ∈ ℵ and m ′ > L ′ , m ′ ) ∈ ̥ − ( P ). Note that, since P = ̥ ( L , m ) = ̥ ( L ′ , m ′ ), it follows from (5) that a ij P = L ij /L iα ( j ) = L ′ ij /L ′ iα ( j ) , κ P = L m (cid:14) L = L ′ m ′ (cid:14) L ′ , ǫ P = L (cid:14) L m = L ′ (cid:14) L ′ m ′ . Hence, L ′ = Φ( P, m ′ , L ′ ) and so ̥ − ( P ) ⊆ (cid:8)(cid:0) Φ( P, ˜ m, ˜ s ) , ˜ m (cid:1) : ˜ m > , ˜ s > (cid:9) , which completes the proof.Since Φ is linear in ˜ s , instead of working with Φ we can work with the map ϕ : S × (0 , + ∞ ) → ℵ defined by ϕ ( P, ˜ m ) := Φ( P, ˜ m, . (6)Then, ̥ − ( P ) = (cid:8)(cid:0) ˜ sϕ ( P, ˜ m ) , ˜ m (cid:1) : ˜ m > , ˜ s > (cid:9) . The next corollary is a simple consequence of (6) and Theo-rem 4. 7 orollary 5. A stochastic matrix P is ℵ -generated if and only if P is obtained by probability-normalizing the rowsof the matrix L := ϕ ( P, . Given a vector a = ( a , a , a , a ) ∈ R , let D ( a ) be the 4 × a on its diagonal. Corollary 6.
A stochastic matrix P is ℵ -generated if and only if there exists a strictly positive vector x = ( x i ) i ∈ N ∈ R such that D ( x ) P ∈ ℵ .Proof. Suppose that P is ℵ -generated and define x to be the vector with elements given by x i := P k =1 (cid:16) ϕ ( P, (cid:17) ik .Then, from Corollary 5 we have that D ( x ) P = ϕ ( P, ∈ ℵ .Conversely, if D ( x ) P = L ∈ ℵ for some x ∈ R , then P = ̥ ( L ,
1) and x contains the row sums of L .Note that given an ℵ -generated matrix P , there exist infinitely many vectors x that satisfy the stated property,all of which are collinear. Because of this, we can decide whether or not a stochastic matrix is ℵ -generated bysetting x P = (cid:18) P jα ( i ) P iα ( j ) (cid:19) i ∈ N = 1 P k =1 (cid:0) ϕ ( P, (cid:1) jk X k =1 (cid:0) ϕ ( P, (cid:1) ik ! i ∈ N , for a fixed j ∈ { , , , } , and checking whether or not D ( x P ) P belongs to ℵ . Observe that x P is expressed interms of elements of P . In particular, we can choose x P = (cid:18) P P , P P , P P , (cid:19) . ℵ -families and generators From the preceding discussion, it is evident that a given ℵ -generated stochastic matrix can be generated using anyone of a multitude of persymmetric matrices. We proceed to examine this non-uniqueness in greater detail. Definition 7.
The ℵ -family of an ℵ -generated matrix P is the set ℵ ( P ) := { ϕ ( P, ˜ m ) : ˜ m > } . The family of generators of an ℵ -generated matrix P is the set ℵ G ( P ) := (cid:8)(cid:0) ϕ ( P, ˜ m ) , ˜ m (cid:1) : ˜ m > (cid:9) . The import of the next theorem is that ℵ can be partitioned into equivalence classes. Firstly, any persymmetricmatrix can be used to generate a whole host of ℵ -generated matrices simply by varying the value of the parameter m .Thus, there are families of persymmetric matrices that give rise to disjoint collections of ℵ -generated matrices andthese families are mutually exclusive, partitioning the space ℵ into equivalence classes. Secondly, for each ℵ -generated matrix P , there is a set of persymmetric matrices, each of which generates P when combined with theappropriate value of m . This leads to an equivalence relation on the set ℵ × (0 , ∞ ). Theorem 8.
Suppose P and Q are two ℵ -generated matrices. Then:(i) Either ℵ ( P ) ∩ ℵ ( Q ) = ∅ or ℵ ( P ) = ℵ ( Q ) .(ii) Either(a) ℵ G ( P ) ∩ ℵ G ( Q ) = ∅ and P = Q ; or(b) ℵ G ( P ) = ℵ G ( Q ) and P = Q .Proof. ℵ ( P ) ∩ ℵ ( Q ) = ∅ and choose an L = ( L ij ) i,j ∈ N ∈ ℵ ( P ) ∩ ℵ ( Q ). Let m (1) , m (2) ∈ (0 , + ∞ ) be suchthat P = ̥ ( L , m (1) ) and Q = ̥ ( L , m (2) ).We begin by proving that ℵ ( P ) ⊆ ℵ ( Q ). Let B = ( B ij ) i,j ∈ N ∈ ℵ ( P ) and let m (3) > P = ̥ ( B , m (3) ). Since P and Q can be generated by the same L , they share the same ratios a ij · listed in (5),that is, a ij P = a ij Q (7)The other two ratios for P will satisfy the following equalities: κ P := P /P = B m (3) /B = L m (1) /L ,ǫ P := P /P = B /B m (3) = L /L m (1) , which means that B m (3) (cid:14) B m (1) = L (cid:14) L , and B m (1) (cid:14) B m (3) = L (cid:14) L . (8)On the other hand, the last two ratios for Q are: κ Q := Q /Q = L m (2) /L = B m (3) m (2) /B m (1) = m (2) m (1) κ P ,ǫ Q := Q /Q = L /L m (2) = B m (1) /B m (3) m (2) = m (1) m (2) ǫ P , (9)where the last equality in each line follows from (8).Setting ˜ m := m (2) m (3) /m (1) , and taking (7) and (9) together with the last line in the proof of Theorem 4yields ϕ ( Q, ˜ m ) = ϕ ( P, m (3) ) = B . Therefore, B ∈ ℵ ( Q ) and so ℵ ( P ) ⊆ ℵ ( Q ).Next, let B ∈ ℵ ( Q ). By symmetry, another application of the above argument allows us to conclude that B ∈ ℵ ( P ) and hence ℵ ( Q ) ⊆ ℵ ( P ). Therefore, ℵ ( P ) = ℵ ( Q ).(ii) By definition, either ̥ − ( P ) = ̥ − ( Q ), in which case P = Q , or ̥ − ( P ) ∩ ̥ − ( Q ) = ∅ and P = Q . Now, ℵ G ( P ) ⊂ ̥ − ( P ) since ̥ − ( P ) = { ˜ s L : ˜ s > , L ∈ ℵ G ( P ) } , and the result follows. Definition 9.
Given an ℵ -generated matrix P and ˜ m ∈ (0 , + ∞ ), we define the ˜ m -canonical representative of ℵ ( P )to be the matrix L P, ˜ m := ϕ ( P, ˜ m/ǫ P ).Note that (cid:0) L P, ˜ m , ˜ m/ǫ P (cid:1) is a generator of P . Furthermore, from (5) if P and Q are two ℵ -generated matriceswith ℵ ( P ) = ℵ ( Q ), then a ij P = a ij Q , for all i, j and κ P ǫ P = κ Q ǫ Q . This gives Corollary 10.
Two ℵ -generated matrices belong to the same ℵ -family if and only if they have identical canonicalrepresentatives, that is, if P and Q are ℵ -generated, then ℵ ( P ) = ℵ ( Q ) ⇐⇒ L P, = L Q, ⇐⇒ L P, ˜ m = L Q, ˜ m , for all ˜ m > . ℵ -generated matrices Given the stationary Markov chain (
P, π ), consider the following related stationary Markov chains: ( P α , π α ) is thecomplement Markov chain of ( P, π ), where P αij := P α ( i ) α ( j ) and π αi := π α ( i ) ; ( P ∗ , π ∗ ) denotes the reverse Markovchain of ( P, π ), where P ∗ ij := π j P ji (cid:14) π i and π ∗ i := π i ; and ( ˜ P , ˜ π ) is the reverse complement Markov chain of ( P, π ),where ˜ P ij = π α ( j ) P α ( j ) α ( i ) (cid:14) π α ( i ) and ˜ π i = π α ( i ) . Note that ˜ P = ( P α ) ∗ = ( P ∗ ) α and ˜ π = ( π α ) ∗ = ( π ∗ ) α . Thenames complement, reverse and reverse complement come from the genetics and Markov chains literature, referringto several kinds of relationship that can exist between nucleotide sequences (genetics), as well as Markov chains. Theorem 11.
The matrices P , P α , P ∗ and ˜ P are either all ℵ -generated or none of them are. roof. Assume P is ℵ -generated and take L := ϕ ( P,
1) = (cid:0) L ij (cid:1) i,j ∈ N .Define L α = ( L αij ) i,j ∈ N ∈ ℵ , where L αij := L α ( i ) α ( j ) . Then, P αij = P α ( i ) α ( j ) = L α ( i ) α ( j ) P k =1 L α ( i ) k = L αij P k =1 L αik , i, j ∈ N and P α is ℵ -generated with P α = ̥ (cid:0) L α , (cid:1) .To check that P is ℵ -generated implies that P ∗ is also ℵ -generated, it suffices by Corollary 6 to set x P ∗ = (cid:18) P ∗ P ∗ , P ∗ P ∗ , P ∗ P ∗ , (cid:19) and prove that D (cid:0) x P ∗ ) P ∗ ∈ ℵ . In fact, (cid:0) D (cid:0) x P ∗ (cid:1) P ∗ (cid:1) ij = (cid:0) D (cid:0) x P ∗ (cid:1) P ∗ (cid:1) α ( j ) α ( i ) because (cid:0) D (cid:0) x P ∗ (cid:1) P ∗ (cid:1) ij = ( x P ∗ ) i P ∗ ij = P ∗ α ( i ) P ∗ i P ∗ ij = π α ( i ) π P α ( i )4 π π i P i π j π i P ji = π α ( i ) π j π π P α ( i )4 P i P ji and similarly (cid:0) D (cid:0) x P ∗ (cid:1) P ∗ (cid:1) α ( j ) α ( i ) = π j π α ( i ) π π P j P α ( j ) P α ( i ) α ( j ) , while P α ( i )4 P i P ji = L α ( i )4 P k =1 L α ( i ) k L ji L i P k =1 L k P k =1 L jk = L α ( i )4 P k =1 L α ( i ) k L α ( i ) α ( j ) L α ( i )4 P k =1 L k P k =1 L jk = L α ( i ) α ( j ) P k =1 L α ( i ) k P k =1 L k P k =1 L jk = L α ( i ) α ( j ) P k =1 L α ( i ) k L j L α ( j ) P k =1 L k P k =1 L jk = P j P α ( j ) P α ( i ) α ( j ) . Next, to check that ˜ P is ℵ -generated given that P is ℵ -generated, we need only note that ˜ P = ( P α ) ∗ and applythe above two results one after the other.The proof is completed by realising that P = ( P α ) α = ( P ∗ ) ∗ = ˜˜ P and hence being ℵ -generated is a solidarityproperty of the four matrices.Most bacterial DNA sequences can be segmented into two halves called chirochores [4] and the two stationaryMarkov chains that empirically approximate their first-order structure are reverse complements of each other [18].If the DNA sequence conforms to the S-H model then the dinucleotide distribution in one of the chirochores isapproximated by ( P, π ) with P being ℵ -generated. However, it was an open question as to whether or not the otherchirochore would also be approximated by an ℵ -generated Markov chain. Theorem 11 above answers this questionin the positive.Furthermore, it is common to find that the stationary Markov chain ( W, ω ) that approximates the first-orderstructure of an entire DNA sequence satisfies intra-strand parity [1, 7], that is, ω i W ij = ω α ( j ) W α ( j ) α ( i ) = ˜ ω i ˜ W ij for all i, j ∈ N . Intra-strand parity has been observed in the DNA sequences of many organisms such as Bacteria, archaea,plants and animals, but not in other sequences such as those from single-stranded viruses and organelles. The nexttheorem relates intra-strand parity of dinucleotides to the ℵ -generated matrices (cf. the direct characterizationin [7, Proposition 1]) and shows that ℵ -generated matrices satisfy a weaker property than intra-strand parity. Theorem 12.
Let ( W, ω ) be a stationary Markov chain. Then ( W, ω ) satisfies ω i W ij = ω α ( j ) W α ( j ) α ( i ) for all i, j ∈ N if and only if it is ℵ -generated and the matrix L := ϕ ( W,
1) = (cid:0) L ij (cid:1) i,j ∈ N that generates it satisfies S i = S α ( i ) for i ∈ N , where S i := P k =1 L ik . Furthermore, if W complies with intra-strand parity, then itsstationary distribution ω can be explicitly expressed as ω = S + S ) ( S , S , S , S ) .Proof. [(= ⇒ )] It can be seen that W is ℵ -generated by observing that (cid:0) D ( ω ) W (cid:1) ij = ω i W ij = ω α ( j ) W α ( j ) α ( i ) = (cid:0) D ( ω ) W (cid:1) α ( j ) α ( i ) . Next, let L = ϕ ( W, ω i W ij = ω α ( j ) W α ( j ) α ( i ) , for i, j ∈ N , im-plies ω i = ω α ( i ) for all i ∈ N . Therefore, ω i L ii P k =1 L ik = ω α ( i ) L α ( i ) α ( i ) P k =1 L α ( i ) k = ω i L ii P k =1 L α ( i ) k , for all i ∈ N , and hence P k =1 L ik = P k =1 L α ( i ) k . 10( ⇐ =)] Suppose W is obtained by normalizing the rows of a matrix L = ( L ij ) i,j ∈ N ∈ ℵ , that is, W = (cid:16) L ij S i (cid:17) i,j ∈ N ,where S i := P k =1 L ik . Suppose that L satisfies S i = S α ( i ) for i = 1 ,
2. It is easy to check that ω := S + S ) ( S , S , S , S ) is the stationary distribution of W . Hence, it follows that for all i, j ∈ N , ω i W ij = S i S + S ) L ij S i = L ij S + S ) = S α ( j ) S + S ) L α ( j ) α ( i ) S α ( j ) = ω α ( j ) W α ( j ) α ( i ) . This article has given a mathematical analysis of the S-H model and elucidated its properties. We conclude withsome remarks about the application of the results that have been presented here. Corollary 10 provides a way ofdeciding whether or not two or more ℵ -generated matrices can be generated from a single persymmetric matrix L in conjunction with different values of the parameter m . Meanwhile, Theorem 12 shows that intra-strand parityin dinucleotides is a special case of ℵ -generated matrices. Possessing a weaker structure than that encapsulated byintra-strand parity, it is possible that ℵ -generated matrices may be useful for capturing the dinucleotide structurein genomic sequences that do not exhibit intra-strand parity.For the purposes of applications, corollaries 5 and 6 are useful for constructing measures of how close theestimated stationary Markov chain of a bacterial DNA sequence is to being ℵ -generated. Given P ∈ S , we candefine the following two examples of such measures: Measure 1:
Let proj( Q ) be the orthogonal projection of a 4 × Q onto ℵ , and define δ ( P ) := min x =( x ,x ,x , (cid:13)(cid:13)(cid:13) D ( x ) P − proj (cid:16) D ( x ) P (cid:17)(cid:13)(cid:13)(cid:13) . The quantity δ ( P ) = 0 if and only if P is ℵ -generated. Otherwise, δ ( P ) gives the minimal distance betweensome matrix D ( x ) P which generates P according to the model (but which does not belong to ℵ ) and the space ℵ .Note that δ ( P ) can be analytically computed. The minimum in the expression for δ ( P ) is attained at the point x = ( x , x , x , x x x = p + p + p − p p − p p − p p p + p + p − p p − p p − p p p + p + p − p p p p p p . Measure 2:
Let ǫ be a 4 × P ( ǫ ) := P + ǫ , and x = ( x , x , x ,
1) be a positive vector. Define δ ( P ) asthe solution of the following optimization problem:min X i,j ∈ N ǫ i,j subject to ( P ( ǫ ) ∈ S ; D ( x ) P ( ǫ ) − proj (cid:16) D ( x ) P ( ǫ ) (cid:17) = . As was the case with δ ( P ), we have δ ( P ) = 0 if and only if P is ℵ -generated, otherwise, δ ( P ) gives the shortestsquared Frobenius distance between P and some ℵ -generated stochastic matrix. There being no closed-form solutionto the optimization problem, the computation of δ ( P ) would need to be implemented using numerical methods.Finally, the development of statistical hypothesis tests based on these measures together with further statisticalanalyses and their application to real bacterial genomes are planned for future publication. Acknowledgments
This work was supported by the Center for Mathematical Modeling CONICYT Project/Grant PIA AFB 170001,Fondecyt Regular Grant 1070344 and CNPq-Brazil grants 308575/2015-6 and 301445/2018-4. M. Sobotka waspartially supported by CNPq-Brazil grant 54091/2017-6. Part of this work was carried out while M. Sobotka wasvisiting the Center for Mathematical Modeling at the University of Chile.11 eferences [1] G. Albrecht-Buehler. Asymptotically increasing compliance of genomes with chargaff’s second parity rules throughinversions and inverted transpositions.
Proceedings of the National Academy of Sciences USA , 103:17828–17833, 2006.[2] C. Cattani and G. Pierro. On the existence of wavelet symmetries in archaea dna.
Computational and MathematicalMethods in Medicine , 2012:ID 673934, 2012.[3] J. Felsenstein. Evolutionary trees from dna sequences: A maximum likelihood approach.
Journal of Molecular Evolution ,17:368–376, 1981.[4] A. C. Frank and J. R. Lobry. Oriloc: Prediction of replication boundaries in unannotated bacterial chromosomes.
Bioinformatics , 16:560–561, 2000.[5] J. Guti´errez-Guti´errez. Eigenvalue decomposition for persymmetric hankel matrices with at most three non-zero anti-diagonals.
Appl. Math. Comput. , 234:333–338, 2014.[6] A. Hart, S. Mart´ınez, and F. Olmos. A Gibbs approach to Chargaff’s second parity rule.
J. Stat. Phys. , 146(2):408–422,2012.[7] A. G. Hart and S. Mart´ınez. Statistical testing of Chargaff’s second parity rule in bacterial genome sequences.
Stoch.Models , 27(2):1–46, 2011.[8] A. G. Hart, S. Mart´ınez, and L. Videla. A simple maximization model inspired by algorithms for the organization ofgenetic candidates in bacterial dna.
Adv. Appl. Probab. , 38(4):1071–1097, 2006.[9] N. M. Huang and R. E. Cline. Inversion of persymmetric matrices having toeplitz inverses.
J. Assoc. Comput. Mach. ,19(3):437–444, 1972.[10] W. Li and K. Kaneko. Long-range correlation and partial 1 /f α spectrum in a noncoding dna sequence. EPL (EurophysicsLetters) , 17(7):655, 1992.[11] J. R. Lobry. Asymmetric substitution patterns in the two dna strands of bacteria.
Molecular Biology and Evolution ,13:660–665, 1996.[12] S. Mart´ınez. A stationary distribution associated to a set of laws whose initial states are grouped into classes. Anapplication in genomics.
J. Appl. Probab. , 53(2):315–326, 2016.[13] L. Nian. A matrix inverse eigenvalue problem and its application.
Linear Algebra Appl. , 266:143–152, 1997.[14] L. Nian and K.-W. E. Chu. Designing the hopfield neural network via pole assignment.
Int. J. Syst. Sci. , 25:669–681,1994.[15] D. A. Nield. Odd-even factorization results for eigenvalue problems.
SIAM Rev. , 36:649–651, 1994.[16] R. M. Reid. Some eigenvalue properties of persymmetric matrices.
SIAM Rev. , 39:313–316, 1997.[17] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White. Microbial gene identification using interpolated markov models.
Nucleic Acids Research , 26:544–548, 1998.[18] M. Sobottka and A. G. Hart. A model capturing novel strand symmetries in bacterial DNA.
Biochem. Biophys. Res.Commun. , 410(4):823–828, 2011.[19] D. Xie and Y. Sheng. Inverse eigenproblem of anti-symmetric and persymmetric matrices and its approximation.
InverseProblems , 19:217–225, 2003.[20] Z.-G. Yu, V. V. Anh, and B. Wang. Correlation property of length sequences based on global structure of the completegenome.
Phys. Rev. E , 63(1):011903, 2000., 63(1):011903, 2000.