On Coding for an Abstracted Nanopore Channel for DNA Storage
aa r X i v : . [ c s . I T ] F e b On Coding for an Abstracted Nanopore Channel for DNA Storage
Reyna Hulett ∗ , Shubham Chandak † , and Mary Wootters ‡ January 2021
Abstract
In the emerging field of DNA storage, data is encoded as DNA sequences and stored. The data is readout again by sequencing the stored DNA. Nanopore sequencing is a new sequencing technology that hasmany advantages over other methods; in particular, it is cheap, portable, and can support longer reads.While several practical coding schemes have been developed for DNA storage with nanopore sequencing,the theory is not well understood. Towards that end, we study a highly abstracted (deterministic) versionof the nanopore sequencer, which highlights key features that make its analysis difficult. We developmethods and theory to understand the capacity of our abstracted model, and we propose efficient codingschemes and algorithms.
In the emerging field of
DNA storage, data is encoded as DNA sequences and stored; the data can be readback by sequencing the stored DNA. This technology promises high storage density and stability, as well asefficient duplication of data and random access using PCR-based technologies; we refer the reader to [4] andthe references therein for an excellent overview.Both the synthesis and sequencing processes are noisy, and as a result the data must be encoded beforethe synthesis stage to ensure accurate data recovery. Prior work has studied methods for encoding data inorder to protect it against (aspects of) the noise introduced by these processes, for example [10, 3, 7, 22, 17,14, 15, 5].In this work, we focus on one particular stage of this noisy process, the nanopore sequencer.
Nanoporesequencing—and in particular the MinION sequencer developed by Oxford Nanopore Technologies [12]—isan emerging sequencing technology. While initial works in DNA storage used Illumina sequencing, nanoporesequencing has been attracting interest due to its portability, low cost, and ability to support significantlylonger reads than Illumina.At a high level, the nanopore sequencer works as follows. A single strand of DNA is passed througha pore, leading to variations in a current readout. The pore can hold k nucleotides (for our purposes, anucleotide is just a value in { A, C, G, T } ) at a time; in practice k is about six. The value of the currentreadout depends on which nucleotides are in the pore. For example, if the strand of DNA is AGCTAAT,and the pore sees the sub-strand AGCT, it will output one current reading. As the strand is passed throughthe pore, the contents of the pore will shift, say from AGCT to GCTA, then to CTAA, then to TAAT, andso on. This will result in a change in the current reading, according to some function f that maps k -mersto current levels. The process is depicted in Figure 1. Given the current readout, the goal is to recover theoriginal DNA sequence.This channel is difficult to analyze, for several reasons. First, the output at any given time dependson k > ∗ Computer Science Department, Stanford University. [email protected] . RH’s research supported in part by a NSFGraduate Research Fellowship under grant DGE-1656518. † Electrical Engineering Department, Stanford University. [email protected] . ‡ Computer Science and Electrical Engineering Departments, Stanford University. [email protected] . RH and MW’sresearch supported in part by NSF grants CCF-1844628 and SemiSynBio-1807371. bases in thepore at a time TGACCGTA timecurrentFigure 1: High-level view of the nanopore sequencer.Fourth, the amount of time that each k -mer spends in the pore can vary, and sometimes never occur at all,leading to synchronization errors in the output.Due to this complexity, typical basecallers (that is, methods for recovering the original sequence fromthe current readouts) rely on machine learning techniques [21, 9, 20]. This is effective in practice, butdifficult to get a theoretical handle on. While there are several practical approaches to error correction fornanopore sequencing in DNA storage [22, 17, 14, 15, 5], the theoretical limits of this channel are not wellunderstood. The work [16], discussed more below, proposed a probabilistic model of the nanopore sequencerand developed bounds on the capacity of the resulting channel. However, this problem is quite difficult; inparticular, taking into account the possibility of synchronization errors places this problem as more difficultthan determining the capacity of the deletion channel, a notorious open problem.In this work, we take a different approach to developing a theoretical understanding of the nanoporesequencer. We develop a highly abstracted, deterministic model, which is meant to highlight the first twosources of noise mentioned above: the inter-symbol interference, and the prospect of collisions when multiple k -mers lead to similar current outputs. We develop methods and theory to understand the capacity of ourabstracted channel, and we propose efficient coding schemes and algorithms.Our goal in this work is to open up a research direction towards a theoretical understanding of thenanopore sequencer. We fully acknowledge that our work is not yet practical; in particular our abstractedchannel does not include noise in the current readings or synchronization errors that can arise from variablepore dwelling times. It is our hope that a solid understanding of our abstracted channel can then be combinedwith (much more well-studied) theories of coding for substitution and synchronization errors, in order tomake progress on a more realistic channel model. Contributions.
Our contributions are as follows.1. We propose a novel abstraction of the nanopore sequencer, which highlights the inter-symbol inter-ference and the prospect of collisions. This abstraction is simple enough that it is tractable, yetcomplex enough that it (a) captures some fundamental properties of the nanopore sequencer, and (b)already gives rise to extremely interesting problems from a theoretical perspective. We hope that ourabstraction will lead to future work in this area.2. In Section 3, we develop an algorithm to determine the capacity of this channel. The algorithm isinefficient as pore size k grows, but we can use it to determine the capacity for small k .3. In Section 4, we develop simple bounds on the “best” and “worst” capacity for an arbitrary pore size k (where “best” and “worst” refers to the choice of the map from k -mers to current readouts). Weuse our algorithms mentioned above to compare these bounds to the exact values for k = 2. Thesepreliminary results suggest there may be some intriguing “bumpiness” in the worst-case capacity as thenumber of distinct current levels increases, but that the average-case is closer to best- than worst-case.2. In Section 5, we develop efficient coding schemes for our abstracted channel. In particular, we develop ascheme that achieves rate C f − ǫ , where C f is the capacity, that requires preprocessing time O (cid:0) ǫ · /ǫ (cid:1) ,and has encoding/decoding time O ( n ), where the O ( · ) notation hides a constant that depends on f .We also present an efficient coding scheme that improves over the “naive” one with high probability,where the probability is over a random map from k -mers to current readouts. DNA storage has been around as an idea since the 1960’s, and there has been renewed interest in it in the pastdecade, starting with the works [6, 10]. We refer the reader to [4] for an excellent survey on DNA storage.Most works on DNA storage have focused on other sequencing technologies like Illumina, but starting withthe work of [22], there have been several works which developed practical DNA storage systems for nanoporesequencers, including [22, 17, 14, 15, 5]. All of these works developed practical coding schemes for DNAstorage with a nanopore sequencer, but did not explicitly model the sequencer or analyze the theoreticallimits.Perhaps the work most related to ours is that of [16], who also developed a model of the nanoporesequencer and studied the capacity of their model. In particular, they give a multi-letter capacity formulafor their channel, and derive computable bounds for the capacity, in terms of a Markov transition matrix P that captures the probability of transitioning from one k -mer to the next in the pore. Our work complementsthat work by focusing on different aspects of the problem. In more detail, that work differs from ours inseveral ways. First, the model in [16] is stochastic, while ours is deterministic. As a result, they take aninformation-theoretic approach, while our approach is more combinatorial in nature. Second, their modelincludes the possibility that k -mers might get dropped; in particular, the binary deletion channel appearsas one part of their model. Since understanding the capacity of the binary deletion channel is a difficultopen problem, this makes their problem extremely difficult. In contrast, we ignore this aspect in order tomore cleanly focus in on the effects of the inter-symbol interaction and the potential confusability betweenthe k -mers. Third, that work focuses on the nanopore sequencer for general applications (not necessarilyfor DNA storage), and in particular does not consider efficient coding schemes for their model. Finally, thatwork derives bounds on the channel capacity for a particular choice of (a stochastic analog of) the map f from k -mers to current levels, derived from experimental data. In contrast, we are interested in results forany f , and in particular for the best and worst such functions f . While the former direction is obviously ofimmediate interest for existing technology, it is our hope that understanding how the capacity of the channelchanges with f could perhaps guide how nanopore technology is developed in the future. We note that thisis still an emerging area and the technology is evolving; see [11] for an overview of recent advances. In this section we formalize our model. As mentioned above, our goal is to focus on the challenges of (1)inter-symbol interference, and (2) the possibility of different k -mers producing similar current readouts. Withthat in mind, we propose a very simple model for the nanopore sequencer. As input, we take an encodedstring s s · · · s n − ∈ { A, C, G, T } n . This is transformed into a sequence of k -mers according to a slidingwindow, to obtain ( s · · · s k − ) , ( s · · · s k ) , ( s · · · s k +1 ) , . . . , ( s n − k · · · s n − ). Finally, each of these k -mers ismapped to one of b distinct current levels, according to a mapping f : { A, C, G, T } k → { , , . . . , b − } . Thismapping f defines the channel. Definition 1 (Abstract Nanopore Channel) . Given a mapping f : { A, C, G, T } k → { , , . . . , b − } from k -mers to current levels, let f ∗ : { A, C, G, T } ∗ → { , , . . . , b − } ∗ represent the mapping from DNA strandsto their full current readout, i.e., f ∗ ( s s · · · s n − ) = f ( s · · · s k − ) ◦ f ( s · · · s k ) ◦ · · · ◦ f ( s n − k · · · s n − ) . We call f ∗ the abstract nanopore channel given by f .Given a mapping f , we are interested in the capacity C f (in bits-per-base), which we define as follows.3 efinition 2. Let f : { A, C, G, T } k → { , , . . . , b − } . The capacity C f of the abstract nanopore channeldetermined by f is defined as C f = lim n →∞ log |{ c ∈ { , , . . . , b − } n − k +1 | ∃ s ∈ { A, C, G, T } n s.t. f ∗ ( s ) = c }| n . (1)Observe that if S is a collection of strings s ∈ { A, C, G, T } n so that f ∗ ( s ) are all distinct for s ∈ S , thenby assigning a different message to each s ∈ S , we can communicate perfectly (if not necessarily efficiently)across the abstract nanopore channel.We will focus on balanced mappings f . These are mappings so that | f − ( i ) | is the same for all i ∈{ , , . . . , b − } . Our first contribution is an algorithm (Algorithm 1 below) that computes C f , given f . The basic idea is toconsider a finite automaton on the alphabet { , , . . . , b − } that accepts exactly those current readouts thatcan be generated by some DNA strand; then we use the transfer matrix method [2, 8] for counting acceptingpaths in that finite automaton.Formally, an Nondeterministic Finite Automaton (NFA) is a tuple ( Q, Σ , ∆ , Q , F ) where Q is the setof states, Σ is the alphabet, ∆ : Q × Σ → Q is the transition function, Q ⊆ Q is the set of initial states,and F ⊆ Q is the set of accepting states. Likewise, a Deterministic Finite Automaton (DFA) is a tuple( Q, Σ , δ, q , F ), defined analogously except that δ : Q × Σ → Q is the transition function and there is only asingle initial state q .We consider the NFA M and the DFA M ′ described in Algorithm 1. The NFA M has states indexed bystrings in { A, C, G, T } k − and alphabet Σ = { , , . . . , b − } , the current levels. Given a state ( s · · · s k − )and an input current level i ∈ Σ, the NFA M can transition to any other state of the form ( s · · · s k − ) sothat f ( s s · · · s k − ) = i . All states are accepting states. By construction, a sequence of current readings c ∈ Σ n − k +1 can be an output f ∗ ( s s · · · s n − ) of the nanopore sequencer if and only if c is accepted by M .The DFA M ′ accepts exactly the same strings as M , and is obtained using the classic subset construction[18], such that each state of the DFA corresponds to a subset of the states of the NFA.The transfer matrix method is a method for obtaining a generating function g f ( z ) for the number ofstates accepted by a given DFA. In more detail, given the transfer matrix T for the automaton (so that T i,j is the number of transitions from state i to state j ), the transfer matrix method gives an expression for afunction g f ( z ) (in terms of the matrix T ), so that g f ( z ) = ∞ X m =0 N m z m , so that N m is the number of strings of length m accepted by the finite automaton. We will use the notation[ z m ] g f ( z ) to denote the coefficient N m on z m . Lemma 1 (Transfer Matrix Method [2]) . Given a DFA D = ( Q, Σ , δ, q , F ) , let D ′ = ( Q ∪ q F , Σ ∪ λ, δ ′ , q , { q F } ) be obtained from the DFA D by adding a new state q F and new symbol λ , along with λ -transitions from each of the accepting states of D to q F . Let T be the transfer matrix of D ′ , where T ij isthe number of transitions from state i to state j , with q being state and q F being state | Q | . Then thegenerating function for D ′ is g f ( z ) = ( − | Q | × det( I − zT : | Q | , z det( I − zT ) where ( I − zT : | Q | , is the minor of index | Q | , , i.e., the matrix I − zT with the | Q | th row and th columndeleted. In our case, the number of strings of length m accepted by our finite automaton will be the number ofcurrent readouts of length m = n − k + 1 that can be generated by some DNA strand of length n .4rom g f ( z ), we can derive the asymptotic behavior of the number of possible current readouts requiredto determine C f . As shown in the proof below, it is related to the smallest positive singularity of g f ( z ). Algorithm 1:
Calculate C f Q ← { A, C, G, T } k − Σ ← { , , . . . , b − } NFA M ← ( Q, Σ , ∆ , Q, Q ) where ∆( s · · · s k − , i ) = { s · · · s k − | f ( s s · · · s k − ) = i } DFA M ′ ← (2 Q , Σ , δ, q , F ) obtained from the subset construction on MM ′ ← (2 Q ∪ { q F } , Σ ∪ { λ } , δ ′ , q , { q F } ) where δ ′ ( q, σ ) = δ ( q, σ ) q ∈ Q and σ = λq F q ∈ F and σ = λ ∅ otherwise T ← transfer matrix of M ′ , i.e., T i,j = number of transitions from state i to state jg f ( z ) ← ( − | Q | × det( I − zT :2 | Q | , z det( I − zT ) where ( I − zT : 2 | Q | ,
0) is the minor of index 2 | Q | , r ← smallest positive real root of the denominator of g f ( z ) (simplified)return log r Theorem 2.
Given a mapping f : { A, C, G, T } k → { , , . . . , b − } , Algorithm 1 computes C f .Proof. First, observe that the NFA M accepts exactly those current readouts that can be generated bysome DNA strand under the mapping f . For any current readout accepted by M , consider an acceptingpath P = s · · · s k − , s · · · s k − , . . . , s m · · · s m + k − . Then by construction, the DNA strand s · · · s m + k − generates that current readout. In the other direction, for any DNA strand s · · · s m + k − , the path P is anaccepting path for f ∗ ( s · · · s m + k − ).The DFA obtained from the subset construction accepts the same current readouts as M [18]. Thenwe apply the transfer matrix method described in Lemma 1 to obtain the generating function g f ( z ), whichcounts the number of current readouts accepted by M .Finally, we need to extract the asymptotic behavior of [ z m ] g f ( z ). Since g f ( z ) is a generating function withnon-negative coefficients, the Exponential Growth Formula [8] tells us that lim sup m →∞ ([ z m ] g f ( z )) /m = 1 /r where r is the smallest positive singularity of g f ( z ). Note that the number of possible current readouts ismonotonically non-decreasing in m , so the lim sup is equal to the limit. Therefore, based on equation (1),and the fact that the length of the current readouts for DNA strands of length n is m = n − k + 1, C f = lim n →∞ log (cid:0) [ z n − k +1 ] g f ( z ) (cid:1) n = lim n →∞ n − k + 1 n log (cid:0) [ z n − k +1 ] g f ( z ) (cid:1) / ( n − k +1) = log 1 r Unfortunately, the DFA obtained from the subset construction has 2 k − states, so the runtime of Algorithm 1is exponential in the problem size (i.e., the description length of f ).It is possible that calculating C f may be hard, because computing such a statistic for NFAs in general isPSPACE-complete [19]. Specifically, even determining whether C f = log b for b = 2 or 4 ( C f = log b is thehighest possible capacity over all mapping functions f ; see Lemma 5) is equivalent to determining whetherthe corresponding NFA is universal (i.e., accepts every string in the alphabet). We make this precise in thefollowing lemma. 5 emma 3. For b = 2 or and any f , C f is equal to log b if and only if every current readout in { , , . . . , b − } ∗ can be generated by some DNA strand.Proof. First we verify the “if” direction. Indeed, if every current readout can be generated, then C f =lim n →∞ log b n − k +1 n = log b .To show the “only if” direction we show the contrapositive. Suppose there is some current readout c oflength ℓ which cannot be generated by any DNA strand. Observe that any current readout which contains c as a contiguous substring can also not be generated by any DNA strand (otherwise we could generate c bytruncation). Thus, consider any current readout of length n − k + 1 which can be generated by some DNAstrand. If we divide it up into sections of length ℓ , obviously none of those sections can be equal to c . So wesee that C f ≤ lim n →∞ log (cid:16) ( b ℓ − ⌊ n − k +1 ℓ ⌋ b ( n − k +1) mod ℓ (cid:17) n = lim n →∞ ⌊ n − k +1 ℓ ⌋ log( b ℓ − n = 1 ℓ log( b ℓ − < log b. Since universality is PSPACE-complete for general NFAs, an efficient algorithm would have to in someway leverage the highly structured nature of the NFAs corresponding to some mapping f . Remark 4.
Although we currently don’t know how to calculate (or approximate) C f efficiently, it is possibleto approximate the capacity for strands of fixed length ℓ , C f ( ℓ ). Specifically, there are existing randomizedapproximation algorithms for counting the number of strings of a given length accepted by an NFA, in timepolynomial in ℓ and the size of the NFA [13, 1]. The above approach for computing C f exactly given a mapping f is only practical for small window sizes k .However, we can derive some general bounds that apply to any mapping f . In particular, we are interestedin the worst-case mapping f (e.g., the one that attains min f C f ), as well as the best-case mapping f (e.g.,the one that attains max f C f ). We prove the following lemma. Lemma 5.
For a given window size k and with b distinct current levels, we have the following bounds on C f :1. max f C f = min(log( b ) , min f C f ≥ log( b ) k min f C f ≤ when b ≤ k Proof.
We prove each bound independently:1. Observe that the the size of the set in the numerator of equation (1) is bounded above both bythe number of possible current readouts ( b n − k +1 ) and the number of DNA strands (4 n ). Thereforemax f C f ≤ min(log( b ) , b = 2, it is achieved by, for instance, f ( s · · · s k − ) = ( s ∈ { A, C } c ∈ { , , . . . , b − } n − k +1 , simply replaceall the 0’s with A ’s and the 1’s with G ’s, and add any k − k k +1 . . . 4 k b b i t s - p e r - b a s e Figure 2: Bounds on the value of C f . The dashedblue line is equal to max f C f . The value ofmin f C f must lie somewhere in the red shadedarea. 1 2 4 8 16012 b b i t s - p e r - b a s e Figure 3: For k = 2, the true max f C f (dashedblue), min f C f (solid red), and E f C f (dottedgreen), superimposed over our bounds from Fig-ure 2.For b ≥
4, the optimal rate is achieved by any mapping where f ( s · · · s k − ) = f ( t · · · t k − ) implies s = t , since then every DNA strand ending with k − f , we can always generate at least b ⌊ n/k ⌋ distinct current readouts: anychoice of desired 0 th , k th , 2 k th , etc. current readings can be obtained because they correspond tonon-overlapping length- k windows. Thus min f C f ≥ lim n →∞ log b ⌊ n/k ⌋ n = log( b ) k .3. When b ≤ k , we can effectively make the bases A and C indistinguishable from each other, andlikewise the bases G and T indistinguishable from each other. Consider any balanced mapping f ′ : { A, G } k → { , , . . . , b − } — in this case, balanced means that for each i , (cid:12)(cid:12) { s · · · s k − ∈ { A, G } k | f ′ ( s · · · s k − ) = i } (cid:12)(cid:12) = 2 k /b. Then define f ( s ) = f ′ ( s ′ ) where s ′ is obtain from s by replacing all C ’s with A ’s and all T ’s with G ’s.Observe that f is a balanced mapping, and also the size of the set in the numerator of equation (1) isnow at most 2 n . Thus min f C f ≤ lim n →∞ log(2 n ) n = 1.Combining these bounds, we obtain the plot in Figure 2. One might wonder how tight the bounds shownin Figure 2 are, and also about where C f lies for a “typical” mapping function f . Using Algorithm 1, weare able to exactly calculate the maximum, minimum, and average C f for k = 2, restricted to balancedmappings. These empirical results are shown in Figure 3.The lower bound is on min f C f curious. Our theoretical bounds on min f C f are not tight, but the k = 2 results suggest there may be some “bumpiness” in the true bound. Also based on the k = 2 results, itappears that random (balanced) mappings are closer to the best-case scenario than the worst-case. Section 5.2illustrates one coding scheme that takes advantage of a property shared by most random mappings for b = 2,and we may hope there exist more general coding schemes that perform well for random mappings. In the proof of Lemma 5, we observed that given a mapping f with window size k and b distinct currentlevels, we can achieve a rate of log( b ) k by coding only on the 0 th , k th , k th , etc. current readings. Here, we7ropose two generalizations of this trivial scheme to improve the rate with additional preprocessing basedon the mapping f . Instead of coding on non-overlapping windows of length k , each of which maps to one of b current readings,we may instead choose a block length ℓ ≥ k and code on non-overlapping blocks of length ℓ . This requiresprecomputing an alphabet Σ( ℓ ) ⊆ { A, C, G, T } ℓ such that f ∗ ↾ Σ( ℓ ) is an injective function with the samerange as f ∗ ↾ { A,C,G,T } ℓ . That is, each DNA strand in Σ( ℓ ) generates a unique current readout, and togetherthey generate every possible current readout of length ℓ − k + 1 given the mapping f . Provided our desiredstrand length n is divisible by ℓ or tends to infinity, this coding scheme obtains the rate C f ( ℓ ) = log |{ c ∈ { , , . . . , b − } ℓ − k +1 | ∃ s ∈ { A, C, G, T } ℓ s.t. f ∗ ( s ) = c }| ℓ = log | Σ( ℓ ) | ℓ Since lim ℓ →∞ C f ( ℓ ) = C f , it is possible to get arbitrarily close to the optimal rate by picking a sufficientlylarge ℓ . Theorem 6.
Given a mapping f , for all ǫ > , there is a coding scheme achieving rate C f − ǫ with linear timeencoding and decoding that requires preprocessing time O ( ǫ · /ǫ ) , where the O ( · ) notation hides constantsthat may depend on the mapping f .Proof. We will exhibit such a scheme by choosing an appropriate block length ℓ . Let | Σ( ℓ ) | be the numberof distinct current readouts that can be generated from DNA strands of length ℓ . Equivalently, | Σ( ℓ ) | =[ z ℓ − k +1 ] g f ( z ) is the number of current readouts of length ℓ − k + 1 accepted by the NFA M constructedby Algorithm 1. Because g f ( z ) is a counting function for a regular language, and because [ z ℓ − k +1 ] g f ( z ) isnon-decreasing in ℓ , the asymptotic behavior of [ z ℓ − k +1 ] g f ( z ) has a simple form ([8], Theorem V.3): | Σ( ℓ ) | = [ z ℓ − k +1 ] g f ( z ) = Θ(Π( ℓ − k + 1)(2 C f ) ℓ − k +1 )where Π( x ) is a polynomial.Thus, there must exist some ℓ and some constant C depending only on the mapping f , such that for all ℓ ≥ ℓ , | Σ( ℓ ) | ≥ C (2 C f ) ℓ − k +1 . Therefore, if we choose ℓ ≥ max (cid:16) ℓ , C f · ( k − − log Cǫ (cid:17) , we see that C f ( ℓ ) = log | Σ( ℓ ) | ℓ ≥ log (cid:0) C (2 C f ) ℓ − k +1 (cid:1) ℓ = log C + C f · ( ℓ − k + 1) ℓ = C f − C f · ( k − − log Cℓ ≥ C f − ǫ. Therefore, to achieve rate C f − ǫ , we should choose ℓ proportional to 1 /ǫ , with the constants depending onlyon the mapping f .Given the block length ℓ , we now describe how to compute the alphabet Σ( ℓ ). We will construct an array E containing the alphabet Σ( ℓ ) and a hash table D mapping current readouts of length ℓ − k + 1 to theindex in E of the DNA strand that generates that readout.For each DNA strand s of length ℓ , compute f ∗ ( s ). If f ∗ ( s ) has not yet been added to the hash table D , append s to the end of array E and map f ∗ ( s ) to the appropriate index. This preprocessing takes O ( ℓ )time for each of 4 ℓ DNA strands, for a total of O ( ℓ · ℓ ) = O ( ǫ · /ǫ ).Encoding and decoding are then quite straightforward: Convert the message to base | Σ( ℓ ) | and use lookuptable E to map each digit to a block of length ℓ . Similarly, decode each block of ℓ − k + 1 current readingsusing the hash table D (skipping the k − b = 2 or 4 and f is any mapping with the highest-possible capacity C f = log b . Per Lemma 3, this implies that every current readout can be generated by some DNA strand, so | Σ( ℓ ) | = b ℓ − k +1 and C f ( ℓ ) = ( ℓ − k +1) log( b ) ℓ . In this case, we can calculate the dependence of ℓ on ǫ exactly: C f − ǫ = ( ℓ − k + 1) log( b ) ℓ − ǫ = ( − k + 1) log( b ) ℓℓ = ( k −
1) log( b ) ǫ . For instance, when k = 2 , b = 2 , and ǫ = 0 .
1, we would require ℓ = 10. Alternatively, instead of changing the lengths of the blocks used in the trivial scheme, we may relax the“non-overlapping” requirement.This may not always be possible, depending on the mapping f . Indeed, in the worst case, it is possiblethat once you have fixed the first k bases, the next k − k bases are all A , and all windows starting with A map to the same current level. However, thispessimal case shouldn’t happen for most mappings.Consider the case of b = 2 current levels, and suppose that for some length 1 ≤ ℓ < k , for every “prefix” p ∈ { A, C, G, T } ℓ , at least one window in f − (0) and at least one window in f − (1) starts with that prefix.Then it is possible to have every ( k − ℓ ) th current reading code for one binary symbol independently. Thiswould give us a rate of k − ℓ rather than k . Such an event is not too unlikely with a random mapping f . Lemma 7.
Given a random mapping f with b = 2 distinct current levels and a length ≤ ℓ < k , f admitsthe coding scheme described above with rate k − ℓ with probability at least − ℓ · · (cid:18) (cid:19) k − ℓ . Furthermore,1. we can determine whether such a scheme exists for a given f and ℓ in O (4 k ) .2. we can find the maximum ℓ for which such a scheme exists for a given f in O ( k k ) .3. we can implement such a scheme with O ( k k ) preprocessing and linear encoding and decoding.Proof. Let f be a uniformly random mapping from { A, C, G, T } k → { , } (not necessarily balanced). Fora given prefix p of length ℓ and current level β , the probability that all 4 k − ℓ windows beginning with thatprefix belong to f − ( β ) is (cid:0) (cid:1) k − ℓ . Thus by the union bound, the probability of being able to execute thisscheme for a given length ℓ is at least 1 − ℓ · · (cid:18) (cid:19) k − ℓ . This probability would only be higher if we restrict f to be a random balanced mapping.Given a specific mapping f and desired length ℓ , we can trivially check whether this property is satisfiedby checking, for each prefix of length ℓ , the current level corresponding to each of the 4 k − ℓ windows beginningwith that prefix, in time O (4 k ).Naturally, we could determine the maximum ℓ that works for a given mapping f in O ( k k ) by repeatingthe above procedure for each ℓ ∈ { , , . . . , k − } .Furthermore, such a scheme could be straightforwardly implemented for a given f . For instance, wecould construct tries — trie 0 corresponding to f − (0) and trie 1 corresponding to f − (1) — in O ( k k ). Forease of encoding, we will also equip each leaf node s s · · · s k − of the tries with two pointers, pointing to9he internal nodes s k − ℓ s k − ℓ +1 · · · s k − in both trie 0 and trie 1. To encode from binary to a DNA strand,we then start with any length- k window in the trie corresponding to the first bit. For each subsequentbit, follow the pointer from our current leaf to the internal node of the trie corresponding to that bit, thenappend an arbitrary path from that prefix to a leaf in the appropriate trie. Decoding is simply a matter ofextracting every ( k − ℓ ) th current reading. Thus both encoding and decoding are linear in the length of theDNA strand.For example, with k = 6 , ℓ = 4, we see that we can obtain a rate of 1 / / − . We have initiated a theoretical study of coding for a highly abstracted version of the nanopore sequencer forDNA storage. We have provided algorithms and bounds for understanding the capacity, and we have givenefficient coding schemes. However, we view our work as the tip of an iceberg. First, even for this abstractedmodel, much remains open. Can one derive better bounds on the capacity, or compute it efficiently for,say, k = 6? Second, we hope that our insights will generalize to more practical models, including withsubstitution and synchronization errors. References [1] Marcelo Arenas et al. “Efficient Logspace Classes for Enumeration, Counting, and Uniform Gen-eration”. In:
Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles ofDatabase Systems (2019).[2] Abdulbaki Aydin, Lucas Bang, and Tevfik Bultan. “Automata-Based Model Counting for String Con-straints”. In:
Computer Aided Verification . Ed. by Daniel Kroening and Corina S. P˘as˘areanu. Cham:Springer International Publishing, 2015, pp. 255–272. isbn : 978-3-319-21690-4.[3] Meinolf Blawat et al. “Forward error correction for DNA data storage”. In:
Procedia Computer Science
80 (2016), pp. 1011–1022.[4] Luis Ceze, Jeff Nivala, and Karin Strauss. “Molecular digital data storage using DNA”. In:
NatureReviews Genetics
ICASSP 2020-2020 IEEE InternationalConference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE. 2020, pp. 8822–8826.[6] George M Church, Yuan Gao, and Sriram Kosuri. “Next-generation digital information storage inDNA”. In:
Science
Science
Analytic Combinatorics . Cambridge University Press, Jan.2009. isbn : 978-0-521-89806-5.[9]
Flappie: Flip-flop basecaller for Oxford Nanopore reads . https://github.com/nanoporetech/flappie .Last accessed: January 2021.[10] Nick Goldman et al. “Towards practical, high-capacity, low-maintenance information storage in syn-thesized DNA”. In: Nature
Journal ofhuman genetics
Genome biology
Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algo-rithms . SODA ’95. San Francisco, California, USA: Society for Industrial and Applied Mathematics,1995, 551–557. isbn : 0898713498.[14] Henry H Lee et al. “Terminator-free template-independent enzymatic DNA synthesis for digital infor-mation storage”. In:
Nature communications
Nature communications
IEEE Transactions on Information Theory
Nature biotechnology
IBM Journal ofResearch and Development doi : .[19] Narad Rampersad, Jeffrey Shallit, and Zhi Xu. “The computational complexity of universality problemsfor prefixes, suffixes, factors, and subwords of regular languages”. In: CoRR abs/0907.0159 (2009).arXiv: . url : http://arxiv.org/abs/0907.0159 .[20] Scrappie: a technology demonstrator for the Oxford Nanopore Research Algorithms group . https://github.com/nanoporetech/scrappie .Last accessed: January 2021.[21] Ryan R Wick, Louise M Judd, and Kathryn E Holt. “Performance of neural network basecalling toolsfor Oxford Nanopore sequencing”. In: Genome biology