Improved Rank-Modulation Codes for DNA Storage with Shotgun Sequencing
aa r X i v : . [ c s . I T ] J a n Improved Rank-Modulation Codes for DNA Storagewith Shotgun Sequencing
Niv Beeri and Moshe Schwartz,
Senior Member, IEEE
Abstract
We study permutations over the set of ℓ -grams, that are feasible in the sense that there is a sequence whose ℓ -gram frequencyhas the same ranking as the permutation. Codes, which are sets of feasible permutations, protect information stored in DNAmolecules using the rank-modulation scheme, and read using the shotgun sequencing technique. We construct systematic codeswith an efficient encoding algorithm, and show that they are optimal in size. The length of the DNA sequences that correspondto the codewords is shown to be polynomial in the code parameters. Non-systematic with larger size are also constructed. Index Terms
DNA storage, permutation codes, De Bruijn graphs
I. I
NTRODUCTION S TORING information in DNA molecules offers unparalleled information density, and has been proven to be feasible [5],[7], [10], [27]. Long DNA sequences may be read relatively accurately using the shotgun sequencing technique (see [18]and the references therein). In this method, several copies of the same DNA sequence are broken down into fragments. Thesefragments are identified, and an algorithm reconstructs the DNA sequence using the knowledge of the multiset of fragmentsobtained. Other similar variants of this reconstruction method have also been studied [1], [8], [9], [20].It has been suggested by [16] that we may skip the final phase of sequence reconstruction, instead opting to have theinformation encoded in the multiset of fragments. More precisely, if the sequence is over an alphabet Σ , the shotgun sequencingprocedure provides us with a histogram, or a profile vector , counting how many times each ℓ -gram from Σ ℓ appears as a substringof the DNA sequence. Thus, the actual sequence is of no consequence, acting merely a vehicle for its profile vector. As a sidebenefit, this allows us to use ambiguous profile vectors that may describe more than one sequence.The profile vector obtained as part of the shotgun-sequencing procedure is unfortunately noisy. Errors in it are mainly due tosubstitution errors in the sequence-synthesis phase, non-uniform fragmentation causing coverage gaps, and ℓ -gram substitutionsdue to sequencing (see [16] and the references therein). One approach, studied in [16] is to protect the profile vector using anerror-correcting code, where an appropriate metric is formulated to capture the error patterns mentioned.Another suggestion put forth by [16], and later studied by [21], was to employ the rank-modulation scheme over the profilevectors. Rank modulation has a long history, starting with [4], [6], [22] for vector digitization and signal detection, throughcommunication over power lines [25], and more recently, for information storage in non-volatile memories [14]. In our context,instead of storing the information in the profile vector, whose integer entries count the number of occurrences of each ℓ -gramfrom Σ ℓ , the information is stored in the permutation over Σ ℓ which is the ranking (by frequency of appearance) of the entriesof the profile vector. By doing so we immediately gain a layer of protection since perturbations of the profile vector that donot result in a change of ranking, do not corrupt the stored information. Additionally, there are known error-correcting codesfor the rank-modulation scheme, which we may use to gain further protection [2], [12], [13], [15], [17], [23], [28]–[32].Not all permutations on Σ ℓ correspond to a ranking of a profile vector of some sequence, as was observed in [21]. Alinear programming algorithm was derived in [21], which can decide whether a given permutation is feasible. However, anexact characterization of all feasible permutations is still unknown. Thus, [21] provided only upper bounds on the number offeasible permutations, and recursive constructions that may also act as encoders. These constructions produce codes whoserate is asymptotically ℓ when ℓ is constant and the alphabet size q = | Σ | goes to infinity, and when q is fixed and ℓ → ∞ .Additionally, the length of the resulting encoded sequence was bounded and shown to be polynomial in q ℓ . We also note thatwhile [16] suggested the rank-modulation scheme, it did so only for a strict subset of the entries of the profile vector.The goal of this paper is to construct rank-modulation codes that improve upon the best known ones, namely those from [21].Our main contributions are the following: We construct systematic codes for all alphabet sizes q > , and all window sizes ℓ > . We give an efficient encoding algorithm for these codes. The asymptotic rate of these codes is when ℓ is fixed and q → ∞ , and is − q when q is fixed and ℓ → ∞ , improving upon [21]. The length of the encoded sequence is analyzed Niv Beeri is with the School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel (e-mail:[email protected]).Moshe Schwartz is with the School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel (e-mail:[email protected]).This work was supported in part by the Israel Science Foundation (ISF) under grant No. 270/18. and upper bounded by O ( q ℓ ) for ℓ > , and O ( q ) when ℓ = . These improve upon the order of the corresponding boundsfrom [21]. Additionally, our upper bound is numerically lower than that of [21] except for the case of q = and ℓ = .We also prove an upper bound on the size of systematic codes, which shows our construction produces optimal systematiccodes. Finally, we show a construction of non-systematic codes that gives codes which are strictly larger than their systematiccounterparts.The paper is organized as follows. In Section II we give the necessary definitions used throughout the paper. Section III weconstruct systematic codes, provide an encoder, analyze the resulting sequence length, and prove an upper bound on the sizeof such codes. In Section IV we build larger codes that are non-systematic. We conclude in Section V with a summary anddiscussion of the results, as well as some open problems.II. P RELIMINARIES
Throughout the paper we use Σ to denote an alphabet of size q . We assume no further structure on the alphabet. We use Σ ℓ to denote the set of all strings over Σ of length ℓ , also called ℓ -grams, and Σ ∗ to denote the set of all finite strings over Σ . If s , s ′ ∈ Σ are strings, we use ss ′ to denote their concatenation, and | s | to denote the length of s . If the need arises to considerspecific letters in a string s ∈ Σ n , we shall usually denote the i th letter as s i , namely, s = s s . . . s n − , where s i ∈ Σ for all i ∈ [ n ] , {
0, 1, . . . , n − } .If G = ( V , E ) is a directed graph, we denote the edge e ∈ E from v ∈ V to v ′ ∈ V by e = v → v ′ . We shall also say itssource is src ( e ) = v and its destination is dest ( e ) = v ′ . Additionally, for any vertex v ∈ V we denote by E in ( v ) the set ofedges entering v , and similarly, we use E out ( v ) to denote the set of edges leaving v , i.e., E in ( v ) , { e ∈ E | dest ( e ) = v } , E out ( v ) , { e ∈ E | src ( e ) = v } . The in-degree and out-degree of v are similarly defined, d in ( v ) , | E in ( v ) | d out ( v ) , | E out ( v ) | . These definitions are extended to sets of vertices in the natural way. Let V ′ ⊆ V be a subset of vertices. Then we define E in ( V ′ ) , { e ∈ E | dest ( e ) ∈ V ′ , src ( e ) V ′ } , E out ( V ′ ) , { e ∈ E | src ( e ) ∈ V ′ , dest ( e ) V ′ } . A. Strings, profiles, and weighted De Bruijn graphs
A useful tool in the context of string analysis is the De Bruijn graph, which is defined as follows.
Definition 1
The De Bruijn graph of order ℓ > over Σ is the directed graph G q , ℓ whose vertex set is V ( G q , ℓ ) = Σ ℓ , andwhose edge set is E ( G q , ℓ ) = { w w . . . w ℓ − → w w . . . w ℓ | for all w i ∈ Σ } . We observe that each edge w w . . . w ℓ − → w w . . . w ℓ in G q , ℓ is uniquely identified by w . . . w ℓ ∈ Σ ℓ + . Let s = s . . . s n − ∈ Σ n be a string. We say that s i s i + . . . s i + ℓ − is a window of length ℓ into s , where indices are taken modulo n (i.e., we consider the string cyclically). Thus, by scanning s with a sliding window of length ℓ , we obtain a cycle in G q , ℓ whosesequence of vertices corresponds to the windows into s . Alternatively, with the same sliding window of length ℓ we obtaina cycle in G q , ℓ − whose sequence of edges corresponds to the windows into s . This latter correspondence between cycles in G q , ℓ − and strings will be used throughout the paper.Motivated by the process of shotgun sequencing, previous papers [16], [21] suggested that information be encoded in theprofile vector of the DNA sequence, whose definition follows. Definition 2
Let s = s . . . s n − ∈ Σ n be a string. The profile vector of s of order ℓ , denoted by p s , ℓ ∈ ( N ∪ { } ) Σ ℓ , is anon-negative integer vector indexed by Σ ℓ such that for each w ∈ Σ ℓ , p s , ℓ ( w ) , |{ i ∈ [ n ] | s i s i + . . . s i + ℓ − = w }| , where indices are taken modulo n . Namely, p s , ℓ ( w ) counts the number of occurrences of w in s (cyclically). Definition 3
Let x ∈ ( N ∪ { } ) Σ ℓ . We say x is feasible if there exists s ∈ Σ ∗ whose profile vector of order ℓ is x , namely, p s , ℓ = x . Example 4
Let Σ = { A , C , G } , and consider the string s = GGGGAGAGAGGGGAAAAAAAACCCCCCCAGGGGCGCGCGCGCGCGCCCCAGCCGCCG . GA C
Fig. 1. The weighted De Bruijn graph of Example 7.
The profile vector of s of order is p s ,2 = (
7, 1, 5, 2, 11, 8, 4, 9, 10 ) , (1) where the indices of the profile vector are in lexicographic order, i.e., AA , AC , AG , CA , CC , CG , GA , GC , GG . Not every vector x ∈ ( N ∪ { } ) Σ ℓ is feasible. Let us build the following directed graph, G , with vertices V = Σ ℓ − ,and for every w = w . . . w ℓ − ∈ Σ ℓ we place x ( w ) parallel copies of the edge w . . . w ℓ − → w . . . w ℓ − . Then by ourprevious discussion of De Bruijn graphs, it is obvious that x is the profile vector of order ℓ of some string s if and only if G contains an Eulerian cycle (i.e., a cycle passing through each edge exactly once). In turn, an Eulerian cycle exists if andonly if G is strongly connected (excluding isolated vertices) and for every vertex v ∈ V , its in-degree equals its out-degree, d in ( v ) = d out ( v ) .For our convenience, we replace the x ( w ) parallel edges discussed above with a single edge of weight x ( w ) . In general,for a directed graph G = ( V , E ) we use wt G ( e ) ∈ R to denote the weight of an edge e ∈ E . We omit the subscript G if it isclear from context. We also extend this definition to subsets of edges E ′ ⊆ E by defining wt ( E ′ ) , ∑ e ∈ E ′ wt ( e ) . Definition 5
Let G = ( V , E ) be a directed weighted graph. We say G is balanced if wt ( E in ( v )) = wt ( E out ( v )) for all v ∈ V . We therefore have the following corollary, translating our previous observation that uses parallel edges, to one using weights.
Lemma 6
A vector x ∈ N Σ ℓ is feasible if and only if the weighted De Bruijn graph, G q , ℓ − , with weights wt ( e ) = x ( e ) forall e ∈ Σ ℓ , is balanced.Proof: Replace each edge e with wt ( e ) parallel edges. Since the weight of every edge is positive, the resulting graph, G ′ ,is strongly connected, and therefore x is feasible if and only if d in ( v ) = d out ( v ) for every v ∈ G ′ . But that happens if andonly if G q , ℓ − is balanced.Following [21], we shall almost always consider strings s whose profile vectors are all positive integers, i.e., for all w ∈ Σ ℓ , p s , ℓ ( w ) > . Example 7
We continue the setting of Example . In Figure we draw the De Bruijn graph with edge weights in accordancewith the profile vector p s ,2 of (1) . We observe that the resulting graph is balanced, not surprising as the profile vector wastaken from a string, i.e, the profile vector is feasible. We would like to make one more simple observation that will be useful later.
Lemma 8
Let G = ( V , E ) be a finite weighted directed graph. Then G is balanced if and only if for every U ⊆ V , wt ( E in ( U )) = wt ( E out ( U )) .Proof: One direction is trivial. If wt ( E in ( U )) = wt ( E out ( U )) for all U ⊆ V , then it is true in particular for subsets U containing exactly one vertex, making G balanced by definition. In the other direction, for any U ⊆ V we have wt ( E in ( U )) − wt ( E out ( U )) = ∑ e ∈ E in ( U ) wt ( e ) − ∑ e ∈ E out ( U ) wt ( e ) ( a ) = ∑ v ∈ U ( wt ( E in ( v )) − wt ( E out ( v ))) = where ( a ) follows from the fact that edges v ′ → v ′′ , where v ′ , v ′′ ∈ U , that are added to the sum ∑ v ∈ U wt ( E in ( v )) , are alsoadded to the sum ∑ v ∈ U wt ( E out ( v )) , thus, canceling out. B. Permutations and rank modulation
Let A be a finite set. We use S A to denote the set of permutations over A . Each permutation π ∈ S A may be consideredas a bijection A → [ | A | ] , sending each element of A to its unique ranking in the permutation. Encoding information inpermutations of the set A , instead of vectors over A , has a long history under the name rank modulation . The identity of theset A depends on the specifics of the applications. As examples we bring [6] dealing with signal detection with impulsivenoise, [25] for powerline communications, and [14] for coding in flash memories.Recently, [16] suggested applying the rank-modulation scheme to DNA storage, with a follow-up work [21]. There, the setof permutations is S q , ℓ , S Σ ℓ , and the ranking is done by the entries of the profile vector of the DNA sequence. Precisedefinitions follow: Definition 9
Let π ∈ S q , ℓ be a permutation, and let x ∈ N Σ ℓ be some vector. We say that x satisfies π , writing x (cid:15) π , ifthe entries of x are distinct and and for w , w ′ ∈ Σ ℓ , π ( w ) < π ( w ′ ) if and only if x ( w ) < x ( w ′ ) . Additionally, we say π isfeasible if x is feasible. We denote the set of all feasible permutations over Σ ℓ by Φ q , ℓ , and their number by F q , ℓ , (cid:12)(cid:12)(cid:12) Φ q , ℓ (cid:12)(cid:12)(cid:12) . Since we will also beinterested in rates, in the coding-theoretic meaning, for any non-empty subset C ⊆ S A , we define its rate as R ( C ) , log |C| log | S A | . We can then define the feasible rate as R q , ℓ , R ( Φ q , ℓ ) = log (cid:12)(cid:12)(cid:12) Φ q , ℓ (cid:12)(cid:12)(cid:12) log (cid:12)(cid:12)(cid:12) S q , ℓ (cid:12)(cid:12)(cid:12) = log F q , ℓ log ( q ℓ ! ) . Example 10
We continue Example . The profile vector p s ,2 of (1) induces a feasible permutation π ∈ S as follows: π = (cid:18) AA AC AG CA CC CG GA GC GG (cid:19) , (2) presented in the standard two-line notation. We shall need the following projection operator for permutations.
Definition 11
Let B ⊆ A be two finite sets, and let π ∈ S A be a permutation over A . We use π | B to denote the uniquepermutation in S B that keeps the relative order of the elements of B in π , namely, for all b , b ′ ∈ B , π | B ( b ) < π | B ( b ′ ) if andonly if π ( b ) < π ( b ′ ) . We can think of π | B in the previous definition as the projection of π onto the elements in the set B . Example 12
We continue Example . Taking the permutation π of (2) , for the set B = { AC , CC , GA , GC } we have π | B = (cid:18) AC CC GA GC (cid:19) , since π ( AC ) < π ( GA ) < π ( GC ) < π ( CC ) . As usual in rank modulation, we define a code C to be a subset of S A . If | A | = n and |C| = M , we say that C is an ( n , M ) -code. Of particular interest to us are systematic codes, which are analogous to systematic linear codes. Definition 13
Let A be some set, | A | = n . We say C ⊆ S A is an [ n , k ] -systematic code , if there exists a set B ⊆ A , | B | = k , |C| = k ! , and { π | B | π ∈ C} = S B . We call B an information set for the code C . Intuitively, in a systematic code the user may set the ranking of the information set, B ⊆ A , arbitrarily (thereby, storing theuser information). The remaining entries of the permutation, A \ B , are then determined by the code, creating a permutationover A . III. O
PTIMAL S YSTEMATIC C ODES FOR F EASIBLE P ERMUTATIONS
In this section we study systematic codes for feasible permutation, namely, systematic subsets
C ⊆ Φ q , ℓ . We provide aconstruction for such codes for all parameters, and show an efficient encoding algorithm. We further prove these are optimal,i.e., having the largest possible size of all systematic codes. Additionally, we analyze the length of the realizing strings, andshow they are at most polynomial in the trivial lower bound. A. Construction
We start by giving some technical lemmas. The first shows two basic operations that take a balanced directed graph, modifythe weights, but keep it balanced.
Lemma 14
Let G = ( V , E ) be a finite balanced directed graph. Construct G ′ = ( V , E ) . Then: If for all e ∈ E , wt G ′ ( e ) = c · wt G ( e ) , where c ∈ R is some constant, then G ′ is also balanced. Let e , e , . . . , e m − be a sequence of edges in E that form a cycle, and let c ∈ R be some constant. If wt G ′ ( e ) = ( wt G ( e ) + c e = e i for some i , wt G ( e ) otherwise,then G ′ is also balanced.Proof: Multiplying the weights by a constant naturally keeps all vertices balanced. For the second case, we note thatvertices that reside on the cycle have the same number of edges from the cycle entering as there are leaving. Thus, adding aconstant weight to the edges of the cycle keeps the graph balanced.Another simple lemma states that if we know that all but one of the vertices are balanced, then that vertex is also balanced.
Lemma 15
Let G = ( V , E ) be a directed weighted graph, and let v ∈ V be some vertex. If wt ( E in ( v ′ )) = wt ( E out ( v ′ )) forall v ′ ∈ V \ { v } , then also wt ( E in ( v )) = wt ( E out ( v )) .Proof: We observe that { E in ( v ′ ) | v ′ ∈ V } is a partition of E , as is { E out ( v ′ ) | v ′ ∈ V } . Thus, wt ( E in ( v )) = wt ( E ) − ∑ v ′ ∈ V \{ v } wt ( E in ( v ′ )) = wt ( E ) − ∑ v ′ ∈ V \{ v } wt ( E out ( v ′ )) = wt ( E out ( v )) , which proves the claim.We recall that a Hamiltonian cycle/path in a graph visits every vertex exactly once, whereas an Eulerian cycle/path visitsevery edge exactly once. It is well known (see [24]) that a Hamiltonian cycle in the De Bruijn graph G q , ℓ (which is equivalentto a De Bruijn sequence) exists for all q , ℓ > . Such a Hamiltonian cycle is also equivalent to an Eulerian cycle in G q , ℓ − .De Bruijn sequences may be nested, with lower-order sequences being prefixes of higher-order sequences. We cite thefollowing result from [3]. Lemma 16 [3, Th. 1] Let G q , ℓ − be a De Bruijn graph with q > and ℓ > , then every Hamiltonian cycle in G q , ℓ − canbe extended to an Eulerian cycle. We shall further need the following technical lemma, which shows that we can complete cycles while avoiding a givenHamiltonian path.
Lemma 17
Let G q , ℓ − be a De Bruijn graph, q > , ℓ > , and let E H = { e , e , . . . , e q ℓ − − } be the edges of a Hamiltoniancycle in G q , ℓ − . Then for any i ∈ [ q ℓ − ] , there is a cycle passing through e i while not passing through any e j , j = i .Proof: As a consequence of Lemma 16, after removing the edges of E H from G q , ℓ − , there exists an Eulerian cycle, β ,in the remaining graph. Let e i = v i → v i + (where indices are taken modulo q ℓ − ) be the edge that we wish to complete to acycle. Since q > , there exists at least one outgoing edge from v i + and one incoming edge to v i that are not in E H . Thus, atsome point β leaves v i + and at some point it enters v i . Denote by ˜ β a part of β that forms a path v i + ❀ v i . It now followsthat e i , ˜ β is a cycle passing through e i but avoiding all e j , j = i .We are now ready for the main theorem of this section, that construct a large systematic code in the space of feasiblepermutations. Theorem 18
Let G q , ℓ − = ( V , E ) be a De Bruijn graph, q > , ℓ > . Let e , e , . . . , e q ℓ − − be a sequence of edges forminga Hamiltonian cycle in G q , ℓ − , and define E ′ H , { e , . . . , e q ℓ − − } . Then there is an injective mapping S E \ E ′ H → Φ q , ℓ , namely,between the set of permutations over E \ E ′ H and the set of feasible permutations over Σ ℓ . Proof:
Let us index the vertices of G q , ℓ − as v , v , . . . , v q ℓ − − , ordered such that e i = v i → v i + for all i ∈ [ q ℓ − ] (andwhere indices are taken modulo q ℓ − ). We need to show that for every permutation on E \ E ′ H we can build a distinct feasiblepermutation on Σ ℓ ∼ = E .Let π ∈ S E \ E ′ H be any permutation on E \ E ′ H . We assign the edges in E \ E ′ H distinct positive weights while keeping theranking as in π . This is easily achieved by setting wt ( e ) = π ( e ) + for all e ∈ E \ E ′ H . Notice that because the edges in E ′ H form a Hamiltonian path, then every vertex in V \ { v q ℓ − − } is left with exactly one outgoing edge whose weight has notbeen set yet.In the next step, we assign weights to the remaining edges, i.e., the edges in E ′ H . We do so in such a way that all verticesbecome balanced. For i =
0, 1, . . . , q ℓ − − , in that order, we assign the weight of e i to be wt ( e i ) = wt ( E in ( v i )) − wt ( E out ( v i ) \ { e i } ) . Thus, all the vertices in V \ { v q ℓ − − } are balanced. By Lemma 15 we must have that v q ℓ − − is also balanced.At this point we have assigned integer weights to all of the edges. In order for the weights to induce a permutation over E ,we need them to be distinct. This is certainly true, by construction, for the edges in E \ E ′ H . However, following the balancingprocess that set the weights for edges in E ′ H , we are not guaranteed distinctness of weights for edges in E , and we thereforeneed to break ties. For the remainder of the proof we proceed with slightly different sets of edges. Define E H , E ′ H ∪ { e q ℓ − − } to be the set of edges in the Hamiltonian cycle required by the theorem. Since E \ E H ⊆ E \ E ′ H , the weights of edges in E \ E H are distinct. It follows that there are only two cases in which we could be seeing equality between weights of twoedges: the two edges are from E H , or one edge is from E H and the other from E \ E H .Let us start by resolving the first case. For each i ∈ [ q ℓ − − ] , let γ i be a cycle in G q , ℓ − that contains e i ∈ E H but doesnot contain any e ∈ E H , e = e i . The existence of such cycles is guaranteed by Lemma 17. We define ∆ , (cid:18) q ℓ − (cid:19) + = q ℓ − ( q ℓ − − ) + and then add ( i + ) · ∆ − to the weight of each of the edges in γ i , for all i ∈ [ q ℓ − − ] . By Lemma 14, we therefore keepthe graph balanced. We further observe that the maximum total weight added to any single edge is upper bounded by q ℓ − − ∑ i = i + ∆ = ∆ · (cid:18) q ℓ − (cid:19) − ∆ . (3)It follows that if wt ( e ) > wt ( e ′ ) before the weight addition, then this relation remains unchanged after the weight addition.In particular, the ranking of edges by weight in E \ E ′ H remains unchanged. Additionally, since distinct weights in the interval [
0, 1 − ∆ − ] were added to the integer weights of edges of E H , all weights of edges in E H are now distinct.For the second case, we increase the weights of edges in the Hamiltonian cycle, E H , by ∆ − . By Lemma 14, the graphremains balanced. However, now edges in E \ E H have weights that are integer multiples of ∆ − , whereas the weights ofedges in E H are not. Additionally, by (3), the addition of ∆ − to the weight does not change the ranking of edges in E \ E ′ H .Thus, the second case is resolved as well.We are now in possession of a weighted graph that is balanced, while keeping the relative ranking of weights of edgesin E \ E ′ H , and having distinct weights. Using Lemma 14, we now multiply all the edge weights by ∆ , to obtain the sameproperties mentioned above, only with integer weights. Finally, we subtract min e ∈ E wt ( e ) − from all the weights. Since G q , ℓ − is Eulerian, by Lemma 14, the resulting weights are positive integers, and the graph has the properties mentionedabove.Let us denote the permutation induced by the weights of the edges by π ′ ∈ S E . Clearly, by the previous discussion, π ′ | E \ E ′ H = π , hence the mapping described here, S E \ E ′ H → S E is injective. Furthermore, since the resulting weighted graphs are all balanced,all resulting permutations are feasible and this mapping is in fact S E \ E ′ H → Φ q , ℓ .The algorithm described in the proof of Theorem 18 is summarized using pseudocode as Algorithm 1. Example 19
We demonstrate Algorithm in action. Assume Σ = { A , C , G , T } , hence, q = . Additionally, fix the window sizeas ℓ = . Algorithm makes use of a predetermined Hamiltonian cycle, which we arbitrarily fix to be the one that is describedby AGTC , namely, α = A → G , G → T , T → C , C → A . Also, the algorithm requires an Eulerian cycle (whose prefix is α ),which we arbitrarily fix as the cycle described by the string AGTCAACCTTATGGCG (namely, α , β = A → G , G → T , . . . ).Assume the user supplies the following input permutation: π = (cid:18) AA AC AT CA CC CG CT GA GC GG TA TG TT (cid:19) . Algorithm 1:
A systematic encoding algorithm for any q > and ℓ > Parameters:
Alphabet Σ , | Σ | = q > , ℓ > , the De Bruijn graph G q , ℓ − = ( V , E ) A sequence of edges α = e , . . . , e q ℓ − − forming a Hamiltonian cycle in G q , ℓ − ,A sequence of edges β = ˜ e , . . . , ˜ e q ℓ − q ℓ − − such that α , β is an Eulerian cycle in G q , ℓ − . Input :
A permutation π ∈ S E \ E ′ H , where E ′ H , { e , . . . , e q ℓ − − } . Output :
A profile vector x ∈ N Σ ℓ such that x (cid:15) π ′ ∈ S E and π ′ | E \ E ′ H = π . // Initial systematic part values for i ← to q ℓ − q ℓ − − do x ( ˜ e i ) ← π ( ˜ e i ) + end x ( e q ℓ − − ) ← π ( e q ℓ − − ) + // Balance the graph for i ← to q ℓ − − do x ( e i ) ← ∑ e ∈ E in ( src ( e i )) x ( e ) − ∑ e ∈ E out ( src ( e i )) x ( e ) end // Break ties in α ∆ ← ( q ℓ − ) + for i ← to q ℓ − − do Find s , s ′ such that src ( e i ) = dest ( ˜ e s ′ ) and dest ( e i ) = src ( ˜ e s ) for j ← s to s ′ (cyclically) do x ( ˜ e j ) ← x ( ˜ e j ) + ( i + ) · ∆ − end x ( e i ) ← x ( e i ) + ( i + ) · ∆ − end // Break ties between α and β for i ← to q ℓ − − do x ( e i ) ← x ( e i ) + ∆ − end // Make weights integers forall e ∈ E do x ( e ) ← ∆ · x ( e ) end // Make weights start at m ← min e ∈ E x ( e ) forall e ∈ E do x ( e ) ← x ( e ) − m + end The main steps of the algorithms are: Weights taken directly from π are assigned to edges. This is shown in Figure . The dashed edges are the Hamiltonianpath, and at this point, their weight has not been determined yet. Next, the algorithm determines the weights on the Hamiltonian path so that the graph becomes balanced. The resultis shown in Figure . We notice that two ties form: the weight of A → G equals that of G → T , and the weight of T → C equals that of C → G . The algorithm then proceeds to break all ties. First, ties within α are broken. We assume here the algorithm does nooptimization when finding s and s ′ in β , and simply takes the first occurrence satisfying the requirements. Thus, for A → G the algorithm uses G → G → C → G → A , for G → T it uses T → T → A → T → G , and for T → C ituses C → C → T . Only then ties between α and β are broken. The result is shown in Figure . The weights are made integers by multiplying by ∆ = . Finally, the weights are shifted so that the minimal weightis , in this, reducing all weights by . The end result, and algorithm output, is shown in Figure .We can see that at the end of the process we are left with weights that induce a permutation on the edges while preservingthe order induced by the input permutation given by the user. The size, and asymptotic rate of the code described in Theorem 18 is presented in the following corollary.
GA TC
10 9 (a) GA TC
10 9 (b) GA TC + ∆ + + ∆ + ∆ + ∆ + + ∆ + ∆ + ∆ + ∆ + ∆ + ∆ + ∆ + + ∆ + ∆ + ∆ (c) GA TC
127 115
103 159143 17535116
59 45 (d)Fig. 2. A depiction of Algorithm 1 in the setting of Example 19: (a) the user information weights, (b) the initial balancing, (c) the tie breaking, and (d) thealgorithm’s output.
Corollary 20
For all q > and ℓ > , there exists a [ q ℓ , q ℓ − q ℓ − + ] -systematic code C q , ℓ ⊆ Φ q , ℓ . Additionally, the numberof feasible permutations is lower bounded by F q , ℓ > |C q , ℓ | = ( q ℓ − q ℓ − + ) !, and asymptotically the rate of feasible permutations satisfies lim ℓ → ∞ R q , ℓ > lim ℓ → ∞ log |C q , ℓ | log ( q ℓ ! ) = − q (4) lim q → ∞ R q , ℓ > lim q → ∞ log |C q , ℓ | log ( q ℓ ! ) = (5) Proof:
The existence of C q , ℓ with these parameters is immediate from Theorem 18, being the image of the injectivemapping described there. For the asymptotic form, we recall Stirling’s approximation, ln ( n ! ) = n ln ( n ) + O ( n ) (e.g., see [11,p. 452]). With that we have R q , ℓ > log |C q , ℓ | log ( q ℓ ! ) = log (( q ℓ − q ℓ − + ) ! ) log ( q ℓ ! ) = ( q ℓ − q ℓ − + ) log ( q ℓ − q ℓ − + ) + O ( q ℓ ) q ℓ log ( q ℓ ) + O ( q ℓ ) , and the claims follow.At this point we pause to compare our results with the best known, described in [21]. For q > and ℓ > , a non-systematiccode C ′ q , ℓ ⊆ Φ q , ℓ was constructed in [21], for which |C ′ q , ℓ | = · q ∏ j = (cid:18) j ! · (cid:18) j − j + j (cid:19)(cid:19) · ℓ ∏ i = ( q ! ) q i − − q i − + q i − . Apart for the case of q = and ℓ = in which, in which our new code is smaller, |C | = < = |C ′ | , for all other cases, our new code is larger, |C q , ℓ | > |C ′ q , ℓ | . It should be emphasized that no explicit construction was presented in [21] for q = and ℓ = . Since this case was thebasis for a recursive construction, the size was obtained via an exhaustive computer search for all feasible permutations.Asymptotically, as noted in [21], lim q → ∞ log |C ′ q , ℓ | log ( q ℓ ! ) = ℓ , which is out-performed by our results in (5). More importantly, in practical settings q is fixed while ℓ → ∞ . In this asymptoticregime, the code of [21] gives lim ℓ → ∞ log |C ′ q , ℓ | log ( q ℓ ! ) = which is inferior to our results in (4) that show a non-vanishing rate. B. Upper Bound
Having found a construction of systematic codes for feasible permutation, it is natural to ask how large such systematiccodes can be. We provide an answer in the following theorem.
Theorem 21
Let q > and k > be integers, and assume there exists a [ q ℓ , k ] -systematic code C ⊆ Φ q , ℓ . Then, k q ℓ − q ℓ − + Proof:
Assume to the contrary k > q ℓ − q ℓ − + , and let G q , ℓ − = ( V , E ) be the De Bruijn graph and E ′ ⊆ E be aninformation set of size | E ′ | = k . If we look at G ′ = ( V , E \ E ′ ) , and forget the edge directions, then we have a graph with q ℓ − vertices, and strictly less than q ℓ − − edges. This implies there exists an isolated vertex in G ′ , say v ∈ V . It thenfollows that E in ( v ) ∪ E out ( v ) ⊆ E ′ , namely, all of the edges entering or leaving v are part of the information set.By the definition of systematic codes, { π | E ′ | π ∈ C} = S E ′ . In particular, there exists π ∈ C such that π ( e ) π ( e ′ ) for all e ∈ E in ( v ) and e ′ ∈ E out ( v ) . However, π is clearly notfeasible since we cannot balance v when the weights of all incoming edges are smaller than the weights of all outgoing edges.Thus, we have reached a contradiction. Corollary 22
The systematic codes from Theorem are optimal.C. String Length An important figure of merit is the length of the string that the encoder generates. We would like this string to be as shortas possible, to facilitate its synthesis. Thus, in this section we would like to derive an upper bound on the maximal length ofthe string that is generated by our algorithm. For any q > and ℓ > we will show that this length is polynomial in the inputlength. A comparison with [21] will show significant improvement.Recall that Algorithm 1 produces a profile vector, being the weights of the De Bruijn graph G q , ℓ − . The length of associatedstring is simply the total weight of all edges. We now prove an upper bound on this total weight. First, the following lemmabounds the weight of an edge on the Hamiltonian path that is used by the algorithm, before tie breaking. Since this weightmight be negative, we upper bound its absolute value. Lemma 23
Let G q , ℓ − = ( V , E ) be the weighted De Bruijn graph after the balancing done in Algorithm , but before breakingties. Let E ′ H be the set of edges in G q , ℓ − defined in Algorithm , and forming a Hamiltonian path. Then for each e ∈ E ′ H , | wt ( e ) | wt ( E \ E ′ H ) = q ℓ − q ℓ − + ∑ i = i = (cid:18) q ℓ − q ℓ − + (cid:19) q ℓ . Proof:
We use the notation of Algorithm 1. Assume E ′ H , { e , . . . , e q ℓ − − } and e , e , . . . , e q ℓ − − is a Hamiltonian pathin G q , ℓ − , where e i is the edge v i → v i + . For all i ∈ [ q ℓ − − ] , we define U i , { v , v , . . . , v i } , and we observe that E ′ H ∩ E in ( U i ) = ∅ , and E ′ H ∩ E out ( U i ) = { e i } . By Lemma 8 we get that wt ( e i ) = wt ( E in ( U i )) − wt ( E out ( U i ) \ { e i } ) . The claim is now immediate, since both E in ( U i ) ⊆ E \ E ′ H and E out ( U i ) \ { e i } ⊆ E \ E ′ H . Theorem 24
Let G q , ℓ − = ( V , E ) be the weighted De Bruijn graph that is the output of Algorithm . Then wt ( E ) q ℓ . Proof:
We again use the notation of Algorithm 1. Recall that all the edges in E \ E ′ H are initially given distinct weightsfrom {
1, . . . , q ℓ − q ℓ − + } , whose sum is upper bounded by q ℓ /2 , as in Lemma 23. Again, by Lemma 23, the weight ofany e ∈ E ′ H satisfies | wt ( e ) | q ℓ /2 , before breaking ties. After the algorithm breaks all ties, the weight of each edge isincreased by no more than . Then all weights are multiplied by ∆ = (( q ℓ − ) + ) q ℓ − /2 . Finally, the normalizationprocess may decrease or increase the weight of all edges. If an increase occurs, that it is only because some edge in E ′ H hasnegative weight. Thus, a weight of no more than q ℓ /2 is added to all edges. It follows that the total weight of the outputweighted graph satisfies, wt ( E ) ∆ q ℓ + ( q ℓ − − ) · q ℓ + q ℓ · ! + q ℓ · q ℓ q ℓ , as claimed.We first comment that the bound of Theorem 24 may be improved by a constant factor by having a more careful analysisin Lemma 23, taking into account the maximal cut size in G q , ℓ − , as well as finer inequalities in Theorem 24. However,recognizing the fact that we are interested in the asymptotic regime where q is constant and ℓ → ∞ , the resulting upper boundis still O ( q ℓ ) .Putting our results in context, if we denote the length of the input to Algorithm 1 by N , q ℓ − q ℓ − + , then the upperbound of Theorem 24 is O ( N ) . Thus, Algorithm 1 guarantees an output string length that is polynomial in the input length.Additionally, the absolute minimum string length is lower bounded by the case of assigning the weights {
1, 2, . . . , q ℓ } to theedges, giving a lower bound of q ℓ ∑ i = i = (cid:18) q ℓ + (cid:19) = Ω ( q ℓ ) = Ω ( N ) . Finally, we would like to compare our upper bound on the length of the output string from Algorithm 1, to the upper boundon the length of the output string from the encoding algorithms in [21]. For general q > and ℓ > , it was shown in [21]that the upper bound is · q − · q ! · ( q + ) !144 · q · ℓ − · q ℓ ( q + ) = O ( ℓ q ℓ ( q + ) ) , in the asymptotic regime of constant q and ℓ → ∞ . This bound is worse than that of Theorem 24. Remark 25
When ℓ = , the phase of tie-breaking in α in Algorithm takes on a very simple form. This is because for everyedge e i , i q ℓ − − , the reverse edge exists in the graph, and is not part of α . Thus, all the cycles used in this phasemay be chosen to be edge disjoint, and then ∆ may be reduced to ∆ = q ℓ − . In that case, the bound on the output-stringlength of Theorem becomes wt ( E ) q + q . For the specific case of ℓ = , [21] showed an upper bound of q · q − · q !6 · ( q + ) !24 · , which for q = is better than thebound in Remark 25, but is otherwise worse. IV. N ON - SYSTEMATIC C ODES
In the previous section we studied systematic rank-modulation codes, and we attained the maximum possible rate (Corol-lary 22). In this section we drop this constraint, and show that there are significantly larger codes that are non-systematic.Since the number of feasible permutation is still unknown, comparing our results with the optimum is impossible, and insteadwe compare against the systematic codes of the previous section.Our first observation is a trivial increase in the code size, by using the self-loop edges in the De Bruijn graph.
Lemma 26
For all q > and ℓ > , there exists a ( q , M ) -code C ⊆ Φ q , ℓ , with M = ( q ℓ − q ℓ − + − q ) ! q ℓ ! ( q ℓ − q ) ! . Proof:
Remove the q self-loop edges from the De Bruijn graph, and run Algorithm 1. We note that removing these edgesdoes not affect the algorithm in any way. Then, set the weight of the self-loop edges arbitrarily. The weighted graph willremain balanced. The number of permutations obtained this way is the claimed value of M .Next, we explore new sufficient conditions and necessary conditions for the existence of feasible permutations. We beginwith a simple extension of a necessary condition presented in [21]. Since [21] used the vertices of the De Bruijn graph, whereashere we balance edge weights, we require the following new definition. Definition 27
Assume q > , ℓ > , and let G q , ℓ − = ( V , E ) be a weighted balanced De Bruijn graph. Let ∅ ⊂ U ⊂ V bea non-empty proper subset of V , and assume E in ( U ) = { e , . . . , e k − } , E out ( U ) = { e ′ , . . . , e ′ k − } , are indexed such that wt ( e ) < · · · < wt ( e k − ) , wt ( e ′ ) < · · · < wt ( e ′ k − ) . We say U exhibits a Dyck configuation if either wt ( e i ) < wt ( e ′ i ) for all i ∈ [ k ] ,or wt ( e ′ i ) < wt ( e i ) for all i ∈ [ k ] .Additionally, we say a permutation π ∈ S E ′ , E in ( U ) ∪ E out ( U ) ⊆ E ′ ⊆ E , exhibits a Dyck configuration at U , if setting wt ( e ) = π ( e ) for all e ∈ E ′ , creates a Dyck configuration at U . Assume the edges in E in ( U ) ∪ E out ( U ) = { e , . . . , e k − } are indexed such that e < e < · · · < e k − . We can constructthe following binary word b , b , . . . , b k − , where b i is if e i ∈ E in ( U ) , and is otherwise. This word is a Dyck word ifand only if U exhibits a Dyck configuration. We now present a necessary condition for a permutation to be feasible. Lemma 28
Let q > and ℓ > . If π ∈ Φ q , ℓ is a feasible permutation, then for all ∅ ⊂ U ⊂ Σ ℓ − , π does not exhibit aDyck configuration at U .Proof: Let G q , ℓ − = ( V , E ) be the De Bruijn graph, and set wt ( e ) = π ( e ) for all e ∈ E . Assume to the contrary that π is feasible but there exists ∅ ⊂ U ⊂ V = Σ ℓ − that exhibits a Dyck configuration. It follows that either wt ( E in ( U )) < wt ( E out ( U )) , or wt ( E in ( U )) > wt ( E out ( U )) . Since π is feasible, there exists a feasible x ∈ N Σ ℓ such that x (cid:15) π . It thenfollows that, either ∑ e ∈ E in ( U ) x ( e ) < ∑ e ∈ E out ( U ) x ( e ) , or ∑ e ∈ E in ( U ) x ( e ) > ∑ e ∈ E out ( U ) x ( e ) . However, the fact that x is feasible implies, by Lemma 8, that ∑ e ∈ E in ( U ) x ( e ) = ∑ e ∈ E out ( U ) x ( e ) , a contradiction.The necessary condition for a permutation to be feasible, which was presented in Lemma 28, is unfortunately not a sufficientcondition, as the following example shows. Example 29
Take q = with Σ = { A , C , G , T } , and ℓ = . Consider the following permutation: π = (cid:18) AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
12 0 1 5 4 13 11 7 3 10 14 6 2 8 9 15 (cid:19) . By inspection, one can verify that no ∅ ⊂ U ⊂ Σ exhibits a Dyck configuration. However, by computer we find that thispermutation is infeasible (see the linear-programming method for deciding feasibility in [21, Section IV]). Not all is lost though. In the next theorem we show that, compared with the systematic code of Theorem 18, the user mayset another edge, provided that a Dyck configuration does not appear. To prove this claim we require a little preparation.
Definition 30
Let G = ( V , E ) be a finite directed weighted graph. A vertex v ∈ V is said to be in an over (respectively, under ) state, if wt ( E in ( v )) < wt ( E out ( v )) (respectively, wt ( E in ( v )) > wt ( E out ( v )) ). Otherwise, v is said to be balanced . A Dyck word is a binary sequence, exactly half of its bits are ’s, such that, in each of its prefixes, the number of ’s is at least the number of ’s. Definition 31
Let G = ( V , E ) be a finite directed weighted graph. For any e ∈ E , define E > e , (cid:8) e ′ ∈ E (cid:12)(cid:12) wt ( e ′ ) > wt ( e ) (cid:9) . We say e ∗ ∈ E is a step-up edge for v ∈ V if | E > e ∗ ∩ E in ( v ) | < | E > e ∗ ∩ E out ( v ) | . We say e ∗ ∈ E is a step-down edge for v ∈ V if | E > e ∗ ∩ E in ( v ) | > | E > e ∗ ∩ E out ( v ) | . Finally, we say e ∗ ∈ E is a stable edge for v ∈ V if | E > e ∗ ∩ E in ( v ) | = | E > e ∗ ∩ E out ( v ) | . Unlike Lemma 14, we introduce an operation that may change the balanced state of vertices.
Lemma 32
Let G = ( V , E ) be a finite directed weighted graph. Assume that v ∈ V is in an over state (resp., in an understate), and that e ∗ ∈ E is a step-down edge (resp., step-up edge) for v . Construct G ′ = ( V , E ) , and set its edge weights asfollows: wt G ′ ( e ) = ( wt G ( e ) e / ∈ E > e ∗ ,wt G ( e ) + c e ∈ E > e ∗ , for all e ∈ E , and where c = | wt G ( E in ( v )) − wt G ( E out ( v )) ||| E > e ∗ ∩ E in ( v ) | − | E > e ∗ ∩ E out ( v ) || . Then the relative order of edges (by weight) in G and G ′ are the same, and v is balanced in G ′ .Proof: The fact that the relative order of edges does not change between G and G ′ , is trivial. Assume v is in an under state,i.e., wt G ( E in ( v )) > wt G ( E out ( v )) . Since e ∗ is a step-up edge for v , by definition we have | E > e ∗ ∩ E in ( v ) | < | E > e ∗ ∩ E out ( v ) | .We now note that increasing the weights of the edges in E > e ∗ by , adds | E > e ∗ ∩ E in ( v ) | weight to the incoming edges of v ,and | E > e ∗ ∩ E out ( v ) | weight to the outgoing edges of v . Thus, wt G ′ ( E in ( v )) − wt G ′ ( E out ( v )) = wt G ( E in ( v )) + c | E > e ∗ ∩ E in ( v ) | − wt G ( E out ( v )) − c | E > e ∗ ∩ E out ( v ) | = A symmetric argument proves the case when v is in an over state.We are now in a position to show how another edge may be set (compared with systematic codes), provided a Dyckconfiguration is avoided. Theorem 33
Assume the same setting as in Theorem . Then every permutation π on E \ E ′ H ∪ { e } , that does not exhibita Dyck configuration at { v } , can be extended to a feasible permutation π ′ ∈ Φ q , ℓ , namely, π ′ | E \ E ′ H ∪{ e } = π .Proof: Our goal is to show that we can find weights x ( e ) for each e ∈ E , such that the graph is balanced, and the relativeorder (by weight) of the edges in E \ E ′ H ∪ { e } is preserved. We start by setting x ( e ) = π ( e ) + for each e ∈ E \ E ′ H ∪ { e } .We then note v is the only vertex all of whose incident edge weight have already been set.If v is not balanced, then it is either in an over state or an under state. Let us assume that v is in an under state. Theproof for the over state is symmetric. Arrange the edges of E in ( v ) ∪ E out ( v ) in ascending weight order, e ′ , . . . , e ′ k − , wherewe note that by definition, self loops are not included in this union. Create the binary word b , . . . , b k − , b i ∈ {
0, 1 } , where b i = if and only if e ′ i ∈ E in ( v ) . In this word exactly half of the bits are ’s. Since there is no Dyck configuration at { v } ,there exists a proper prefix b , . . . , b t − that contains strictly more ’s than ’s, and therefore, b t , . . . , b k − contains strictlymore ’s than ’s. Thus, e ′ t is a step-up edge for v .Using Lemma 32, we can adjust the weights of edges (that have already been assigned) and ensure that v is balanced, whilekeeping the relative order of edges by weight. We also point out that the resulting weights must all be distinct. If needed, wemultiply all edge weights by the same constant to obtain integer weights. We now continue by running Algorithm 1, startingfrom the balancing part. The resulting weights induce a permutation π ′ ∈ Φ q , ℓ , as desired. Corollary 34
For all q > and ℓ > , there exists a code C ′ q , ℓ ⊆ Φ q , ℓ with F q , ℓ > |C ′ q , ℓ | = ( q ℓ − q ℓ − + − q ) ! · q − q + · q ℓ ! ( q ℓ − q ) ! . Proof:
We choose v in the setting of Theorem 33 to be a vertex with no self loops. It follows that | E in ( v ) | = | E out ( v ) | = q . We first look locally at v . We can arrange the q incoming edges among themselves in q ! ways, and similarlyfor the q outgoing edges. Next, we count the number of ways these two orderings may be merged so as not to exhibit a Dyckconfiguration. A Dyck configuration is equivalent to a Dyck word, and the number of those is known to be the Catalan number C q , q + ( qq ) (e.g., see [11, p. 358]). There are also two ways to choose whether the first edge is from E in ( v ) or E out ( v ) .We obtain that the total ways of ordering E in ( v ) ∪ E out ( v ) is given by ( q ! ) (cid:18)(cid:18) qq (cid:19) − C q (cid:19) = ( q ) ! · q − q + We then extend this to a permutation of E \ E ′ H ∪ { e } , for a total number of permutations equalling ( q ℓ − q ℓ − + ) ! · q − q + By Theorem 33 these may be injectively extended to feasible permutations of E . As a final step, we may employ the samestrategy as Lemma 26: exclude the self loops from the entire process, and set their value only at the end. This results in theclaimed number of permutations in C ′ q , ℓ .We can further improve Theorem 33, by considering permutations on all the edges of the De Bruijn graph. A sufficientcondition is described in the following theorem. Theorem 35
Assume q > , ℓ > , and let G q , ℓ − = ( V , E ) be the De Bruijn graph. Let π ∈ S E , and assign wt ( e ) = π ( e ) + , for all e ∈ E . If we can index the vertices V = { v , . . . , v q ℓ − − } such that for each i ∈ [ q ℓ − − ] : there is no Dyck configuration at { v i } , and v i has a step-up edge e ∈ E , and a step-down edge e ′ ∈ E , such that e and e ′ are stable edges for v j , for all j ∈ [ i ] ,then π is feasible, i.e., π ∈ Φ q , ℓ .Proof: When the conditions in the lemma are satisfied, we can balance each v , . . . , v q ℓ − − , one by one, in this order,using Lemma 32. Note that while we balance v i , we do not harm the balance of v , . . . , v i − . Also note that vertex v q ℓ − − is automatically balanced once all the previous ones are. The result is a balanced graph with rational weights. Multiplyingall the weights by an appropriate constant we achieve a balanced graph with distinct integer positive weights, that realize thepermutation π .Alas, the sufficient condition for a permutation to be feasible, which was presented in Theorem 35, is not necessary, as thefollowing example shows. Example 36
Take q = with Σ = { A , C , G , T } , and ℓ = . Consider the following permutation: π = (cid:18) AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
12 0 1 7 2 13 6 8 3 5 14 10 4 11 9 15 (cid:19) , By inspection one can verify that the requirements of Theorem are not satisfied, yet the permutation is feasible, as Figure shows. GA TC
26 8 5
15 11
Fig. 3. The balanced graph for the permutation from Example 36.
Table I shows a comparison between the size of the codes resulting from the different methods in this paper and in [21].We first note that the last row, the total number of feasible permutations, was obtained using an exhaustive computer search,and hence the limitation to ℓ = and q =
3, 4 . We also observe that the entry for q = , ℓ = , from [21] was obtained inthe same way, i.e., an exhaustive computer search, whose results bootstrapped a recursive construction in [21].V. C ONCLUSION
In this paper we studied rank-modulation codes for DNA storage when used in conjunction with shotgun sequencing. Weconstructed systematic codes for all parameters q > and ℓ > , which we proved are optimal. These improve upon theresults of [21] by obtaining an asymptotic rate of − q when q is fixed and ℓ → ∞ , compared with an asymptotic rate of TABLE IT
HE RATE OF RANK - MODULATION CODES FOR
DNA
STORAGE WITH SHOTGUN SEQUENCING ( IN PARENTHESES , THE CODE RATES )Source ℓ = q = ℓ = q = [21] ( ≈ ) ( ≈ ) Theorem 18 (Algorithm 1) ( ≈ ) ( ≈ ) Theorem 33 ( ≈ ) ( ≈ ) Theorem 35 ( ≈ ) ( ≈ ) Total feasible permutations ( ≈ ) ( ≈ ) in [21]. In the asymptotic regime of ℓ fixed and q → ∞ we obtain asymptotic rate of compared with ℓ in [21]. Finally, wealso showed how larger codes may be obtained by avoiding Dyck configurations.We would like to further discuss additional aspects that may be readily combined into the coding schemes we presented inthis paper: a) Weight Balancing: When considering data storage in synthesized DNA molecules, it has been argued that an overallGC-content of roughly contributes to the stability of the molecule [26]. We can adjust Algorithm 1 to accomplish thisby removing the self loops from the set of information edges in the De Bruijn graph G q , ℓ − . After running the algorithm andobtaining an encoded sequence, we may increase the weights of the relevant self loops to reach the desired GC-content of theencoded sequence. b) Forbidden ℓ -grams: Research suggests that some ℓ -grams are likely to cause sequencing errors [19]. We can ensurethese ℓ -grams never appear as a substring of the encoded output sequence by removing their corresponding edges from the DeBruijn graph G q , ℓ − to obtain a graph G ′ . A careful reading of Theorem 18 and Algorithm 1 reveals that the claims hold for G ′ as well, provided the following hold: • G ′ has an Eulerian cycle. • G ′ has a Hamiltonian path. • For each edge e on the Hamiltonian path, there is a directed cycle passing through e and not through any of the otheredges on the Hamiltonian path.When these requirements hold, the edges not on the Hamiltonian path form an information set, and trivial adjustments toAlgorithm 1 make it work for G ′ as well. c) Error Correction: As mentioned in the introduction, the mere use of the rank-modulation scheme already protectsagainst perturbations of the profile vector that do not change the ranking. If we desire more error-protection capabilities, thenwe may use any of the rank-modulation error-correcting codes known in the literature. These may be trivially combined withthe systematic encoding of Section III. In the notation of Theorem 18, if C ⊆ S E \ E ′ H is a rank-modulation code, then each ofits codewords may be mapped to Φ q , ℓ . The reverse process is easily accomplished by projecting the receiving permutation onto E \ E ′ H , and then decoding using C . We can also use the larger non-systematic codes from Section IV. Assume C out ⊆ S q , ℓ isthe (non-systematic) code from Section IV, and let C in ⊆ S q , ℓ be a rank-modulation error-correcting code. If C in is a subgroupcode (e.g., the codes studied in [23]), then its cosets partition S q , ℓ into error-correcting codes (in the case of [23], due to theright-invariance of the ℓ ∞ -metric on permutations). Thus, one of these cosets intersects C out in a code that is both feasible,has the error-correction capabilities of C in , and whose size is at least | C out | · | C in | / | S q , ℓ | .We would like to mention some open questions. Finding the exact number of feasible permutations is first and foremost. Theupper bound on the asymptotic rate of feasible permutations is still (see [21]), whereas the lower bound has been improvedin this paper to − q , assuming q is constant and ℓ → ∞ . This lower bound is obtained by considering systematic codes, andit is the best possible. Unfortunately, even though the non-systematic codes of Corollary 34 have strictly larger size comparedwith systematic codes, they do not offer any improvement asymptotically.Another interesting open question concerns the length of the encoded sequences. The trivial lower bound is Ω ( q ℓ ) , whereasthe upper bound from the systematic codes of Section III is O ( q ℓ ) . What are the worst-case bound and the average-casebound is still unknown.Yet another open problem is determining the minimum distance of feasible permutations. Several metrics have been studiedin connection with rank-modulation codes, e.g., Kendall’s τ -metric, the ℓ ∞ -metric (also known as Chebyshev’s metric), andUlam’s metric, to name a few. Intrinsically, the set of feasible permutations may possess sufficient minimal distance to allowerror correction. What this distance is, or bounds on it, are as of yet, unknown.Finally, finding a concise sufficient and necessary condition for a permutation to be feasible, remains an open problem.Finding such a condition might pave the way to constructing encoders for feasible permutations. We leave all of these openproblems for future work. The GC-content of a DNA molecule is the percentage of bases that are either G or C . R EFERENCES[1] J. Acharya, H. Das, O. Milenkovic, A. Orlitsky, and S. Pan, “String reconstruction from substring compositions,”
SIAM J. Discrete Math. , vol. 29, no. 3,pp. 1340–1371, 2015.[2] A. Barg and A. Mazumdar, “Codes in permutations and error correction for rank modulation,”
IEEE Trans. Inform. Theory , vol. 56, no. 7, pp. 3158–3165,Jul. 2010.[3] V. Becher and P. A. Heiber, “On extending de Bruijn sequences,”
Information Processing Letters , vol. 111, no. 18, pp. 930–932, 2011.[4] T. Berger, F. Jelinek, and J. K. Wolf, “Permutation codes for sources,”
IEEE Trans. Inform. Theory , vol. IT-18, no. 1, pp. 160–169, Jan. 1972.[5] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G. Seelig, and K. Strauss, “A DNA-based archival storage system,”
ACM SIGOPS Operating SystemsReview , vol. 50, no. 2, pp. 637–649, 2016.[6] H. D. Chadwick and L. Kurz, “Rank permutation group codes based on Kendall’s correlation statistic,”
IEEE Trans. Inform. Theory , vol. IT-15, no. 2,pp. 306–315, Mar. 1969.[7] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,”
Science , vol. 337, p. 1628, 2012.[8] R. Gabrys and O. Milenkovic, “Unique reconstruction of coded strings from multiset substring spectra,”
IEEE Trans. Inform. Theory , vol. 65, no. 12,pp. 7682–7696, Dec. 2019.[9] R. Gabrys, S. Pattabiraman, and O. Milenkovic, “Mass error-correction codes for polymer-based data storage,” in
Proceedings of the 2020 IEEEInternational Symposium on Information Theory (ISIT2020), Los Angeles, CA, USA , Jun. 2020, pp. 25–30.[10] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance informationstorage in synthesized DNA,”
Nature , vol. 494, no. 7435, pp. 77–80, 2013.[11] R. L. Graham, D. E. Knuth, and O. Patashnik,
Concrete Mathematics: A Foundation for Computer Science . Addison-Wesley, 1994.[12] A. E. Holroyd, “Perfect snake-in-the-box codes for rank modulation,”
IEEE Trans. Inform. Theory , vol. 63, no. 1, pp. 104–110, Jan 2017.[13] M. Horovitz and T. Etzion, “Constructions of snake-in-the-box codes for rank modulation,”
IEEE Trans. Inform. Theory , vol. 60, no. 11, pp. 7016–7025,Nov. 2014.[14] A. Jiang, R. Mateescu, M. Schwartz, and J. Bruck, “Rank modulation for flash memories,”
IEEE Trans. Inform. Theory , vol. 55, no. 6, pp. 2659–2673,Jun. 2009.[15] A. Jiang, M. Schwartz, and J. Bruck, “Correcting charge-constrained errors in the rank-modulation scheme,”
IEEE Trans. Inform. Theory , vol. 56, no. 5,pp. 2112–2120, May 2010.[16] H. M. Kiah, G. J. Puleo, and O. Milenkovic, “Codes for DNA sequence profiles,”
IEEE Trans. Inform. Theory , vol. 62, no. 6, pp. 3125–3146, Jun. 2016.[17] A. Mazumdar, A. Barg, and G. Z´emor, “Constructions of rank modulation codes,”
IEEE Trans. Inform. Theory , vol. 59, no. 2, pp. 1018–1029, Feb.2013.[18] S. Motahari, G. Bresler, and D. Tse, “Information theory for DNA sequencing: Part 1: A basic model,” in
Proceedings of the 2012 IEEE InternationalSymposium on Information Theory (ISIT2012), Cambridge, MA, USA , Jul. 2012, pp. 2741–2745.[19] K. Nakamura et al. , “Sequence-specific error profile of Illumina sequencers,”
Nucl. Acids Res. , vol. 39, no. 13, p. e90, 2011.[20] S. Pattabiraman, R. Gabrys, and O. Milenkovic, “Reconstruction and error-correction codes for polymer-based data storage,” in
Proceedings of the 2019Information Theory Workshop (ITW’19), Visby, Sweden , Aug. 2019, pp. 1–5.[21] N. Raviv, M. Schwartz, and E. Yaakobi, “Rank modulation codes for DNA storage with shotgun sequencing,”
IEEE Trans. Inform. Theory , vol. 65,no. 1, pp. 50–64, Jun. 2019.[22] D. Slepian, “Permutation modulation,”
Proc. of the IEEE , vol. 53, no. 3, pp. 228–236, 1965.[23] I. Tamo and M. Schwartz, “Correcting limited-magnitude errors in the rank-modulation scheme,”
IEEE Trans. Inform. Theory , vol. 56, no. 6, pp.2551–2560, Jun. 2010.[24] T. van Aardenne-Ehrenfest and N. G. de Bruijn, “Circuits and trees in oriented linear graphs,”
Simon Stevin: Wis- en Natuurkundig Tijdschrift , vol. 28,pp. 203–217, 1951.[25] H. Vinck, J. Haering, and T. Wadayama, “Coded M-FSK for power line communications,” in
Proceedings of the 2000 IEEE International Symposiumon Information Theory (ISIT2000), Sorrento, Italy , 2000, p. 137.[26] P. Yakovchuk, E. Protozanova, and M. D. Frank-Kamenetskii, “Base-stacking and base-pairing contributions into thermal stability of the DNA doublehelix,”
Nucl. Acids Res. , vol. 34, no. 2, pp. 564–574, 2006.[27] S. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, “A rewritable, random-access DNA-based storage system,”
Sci. Rep. , vol. 5, no. 14138, 2015.[28] Y. Yehezkeally and M. Schwartz, “Snake-in-the-box codes for rank modulation,”
IEEE Trans. Inform. Theory , vol. 58, no. 8, pp. 5471–5483, Aug. 2012.[29] ——, “Limited-magnitude error-correcting Gray codes for rank modulation,”
IEEE Trans. Inform. Theory , vol. 63, no. 9, pp. 5774–5792, Sep. 2017.[30] Y. Zhang and G. Ge, “Snake-in-the-box codes for rank modulation under Kendall’s τ -metric,” IEEE Trans. Inform. Theory , vol. 62, no. 1, pp. 151–158,Jan. 2016.[31] ——, “Snake-in-the-box codes for rank modulation under Kendall’s τ -metric in S n + ,” IEEE Trans. Inform. Theory , vol. 62, no. 9, pp. 4814–4818,Sep. 2016.[32] H. Zhou, M. Schwartz, A. Jiang, and J. Bruck, “Systematic error-correcting codes for rank modulation,”