[PDF] On Coding for an Abstracted Nanopore Channel for DNA Storage

Abstract

In the emerging field of DNA storage, data is encoded as DNA sequences and stored. The data is read out again by sequencing the stored DNA. Nanopore sequencing is a new sequencing technology that has many advantages over other methods; in particular, it is cheap, portable, and can support longer reads. While several practical coding schemes have been developed for DNA storage with nanopore sequencing, the theory is not well understood. Towards that end, we study a highly abstracted (deterministic) version of the nanopore sequencer, which highlights key features that make its analysis difficult. We develop methods and theory to understand the capacity of our abstracted model, and we propose efficient coding schemes and algorithms.

Full PDF

aa r X i v : . [ c s . I T ] F e b On Coding for an Abstracted Nanopore Channel for DNA Storage

Reyna Hulett ∗ , Shubham Chandak † , and Mary Wootters ‡ January 2021

Abstract

In the emerging ﬁeld of DNA storage, data is encoded as DNA sequences and stored. The data is readout again by sequencing the stored DNA. Nanopore sequencing is a new sequencing technology that hasmany advantages over other methods; in particular, it is cheap, portable, and can support longer reads.While several practical coding schemes have been developed for DNA storage with nanopore sequencing,the theory is not well understood. Towards that end, we study a highly abstracted (deterministic) versionof the nanopore sequencer, which highlights key features that make its analysis diﬃcult. We developmethods and theory to understand the capacity of our abstracted model, and we propose eﬃcient codingschemes and algorithms.

In the emerging ﬁeld of

DNA storage, data is encoded as DNA sequences and stored; the data can be readback by sequencing the stored DNA. This technology promises high storage density and stability, as well aseﬃcient duplication of data and random access using PCR-based technologies; we refer the reader to [4] andthe references therein for an excellent overview.Both the synthesis and sequencing processes are noisy, and as a result the data must be encoded beforethe synthesis stage to ensure accurate data recovery. Prior work has studied methods for encoding data inorder to protect it against (aspects of) the noise introduced by these processes, for example [10, 3, 7, 22, 17,14, 15, 5].In this work, we focus on one particular stage of this noisy process, the nanopore sequencer.

Nanoporesequencing—and in particular the MinION sequencer developed by Oxford Nanopore Technologies [12]—isan emerging sequencing technology. While initial works in DNA storage used Illumina sequencing, nanoporesequencing has been attracting interest due to its portability, low cost, and ability to support signiﬁcantlylonger reads than Illumina.At a high level, the nanopore sequencer works as follows. A single strand of DNA is passed througha pore, leading to variations in a current readout. The pore can hold k nucleotides (for our purposes, anucleotide is just a value in { A, C, G, T } ) at a time; in practice k is about six. The value of the currentreadout depends on which nucleotides are in the pore. For example, if the strand of DNA is AGCTAAT,and the pore sees the sub-strand AGCT, it will output one current reading. As the strand is passed throughthe pore, the contents of the pore will shift, say from AGCT to GCTA, then to CTAA, then to TAAT, andso on. This will result in a change in the current reading, according to some function f that maps k -mersto current levels. The process is depicted in Figure 1. Given the current readout, the goal is to recover theoriginal DNA sequence.This channel is diﬃcult to analyze, for several reasons. First, the output at any given time dependson k > ∗ Computer Science Department, Stanford University. [email protected] . RH’s research supported in part by a NSFGraduate Research Fellowship under grant DGE-1656518. † Electrical Engineering Department, Stanford University. [email protected] . ‡ Computer Science and Electrical Engineering Departments, Stanford University. [email protected] . RH and MW’sresearch supported in part by NSF grants CCF-1844628 and SemiSynBio-1807371. bases in thepore at a time TGACCGTA timecurrentFigure 1: High-level view of the nanopore sequencer.Fourth, the amount of time that each k -mer spends in the pore can vary, and sometimes never occur at all,leading to synchronization errors in the output.Due to this complexity, typical basecallers (that is, methods for recovering the original sequence fromthe current readouts) rely on machine learning techniques [21, 9, 20]. This is eﬀective in practice, butdiﬃcult to get a theoretical handle on. While there are several practical approaches to error correction fornanopore sequencing in DNA storage [22, 17, 14, 15, 5], the theoretical limits of this channel are not wellunderstood. The work [16], discussed more below, proposed a probabilistic model of the nanopore sequencerand developed bounds on the capacity of the resulting channel. However, this problem is quite diﬃcult; inparticular, taking into account the possibility of synchronization errors places this problem as more diﬃcultthan determining the capacity of the deletion channel, a notorious open problem.In this work, we take a diﬀerent approach to developing a theoretical understanding of the nanoporesequencer. We develop a highly abstracted, deterministic model, which is meant to highlight the ﬁrst twosources of noise mentioned above: the inter-symbol interference, and the prospect of collisions when multiple k -mers lead to similar current outputs. We develop methods and theory to understand the capacity of ourabstracted channel, and we propose eﬃcient coding schemes and algorithms.Our goal in this work is to open up a research direction towards a theoretical understanding of thenanopore sequencer. We fully acknowledge that our work is not yet practical; in particular our abstractedchannel does not include noise in the current readings or synchronization errors that can arise from variablepore dwelling times. It is our hope that a solid understanding of our abstracted channel can then be combinedwith (much more well-studied) theories of coding for substitution and synchronization errors, in order tomake progress on a more realistic channel model. Contributions.

Our contributions are as follows.1. We propose a novel abstraction of the nanopore sequencer, which highlights the inter-symbol inter-ference and the prospect of collisions. This abstraction is simple enough that it is tractable, yetcomplex enough that it (a) captures some fundamental properties of the nanopore sequencer, and (b)already gives rise to extremely interesting problems from a theoretical perspective. We hope that ourabstraction will lead to future work in this area.2. In Section 3, we develop an algorithm to determine the capacity of this channel. The algorithm isineﬃcient as pore size k grows, but we can use it to determine the capacity for small k .3. In Section 4, we develop simple bounds on the “best” and “worst” capacity for an arbitrary pore size k (where “best” and “worst” refers to the choice of the map from k -mers to current readouts). Weuse our algorithms mentioned above to compare these bounds to the exact values for k = 2. Thesepreliminary results suggest there may be some intriguing “bumpiness” in the worst-case capacity as thenumber of distinct current levels increases, but that the average-case is closer to best- than worst-case.2. In Section 5, we develop eﬃcient coding schemes for our abstracted channel. In particular, we develop ascheme that achieves rate C f − ǫ , where C f is the capacity, that requires preprocessing time O (cid:0) ǫ · /ǫ (cid:1) ,and has encoding/decoding time O ( n ), where the O ( · ) notation hides a constant that depends on f .We also present an eﬃcient coding scheme that improves over the “naive” one with high probability,where the probability is over a random map from k -mers to current readouts. DNA storage has been around as an idea since the 1960’s, and there has been renewed interest in it in the pastdecade, starting with the works [6, 10]. We refer the reader to [4] for an excellent survey on DNA storage.Most works on DNA storage have focused on other sequencing technologies like Illumina, but starting withthe work of [22], there have been several works which developed practical DNA storage systems for nanoporesequencers, including [22, 17, 14, 15, 5]. All of these works developed practical coding schemes for DNAstorage with a nanopore sequencer, but did not explicitly model the sequencer or analyze the theoreticallimits.Perhaps the work most related to ours is that of [16], who also developed a model of the nanoporesequencer and studied the capacity of their model. In particular, they give a multi-letter capacity formulafor their channel, and derive computable bounds for the capacity, in terms of a Markov transition matrix P that captures the probability of transitioning from one k -mer to the next in the pore. Our work complementsthat work by focusing on diﬀerent aspects of the problem. In more detail, that work diﬀers from ours inseveral ways. First, the model in [16] is stochastic, while ours is deterministic. As a result, they take aninformation-theoretic approach, while our approach is more combinatorial in nature. Second, their modelincludes the possibility that k -mers might get dropped; in particular, the binary deletion channel appearsas one part of their model. Since understanding the capacity of the binary deletion channel is a diﬃcultopen problem, this makes their problem extremely diﬃcult. In contrast, we ignore this aspect in order tomore cleanly focus in on the eﬀects of the inter-symbol interaction and the potential confusability betweenthe k -mers. Third, that work focuses on the nanopore sequencer for general applications (not necessarilyfor DNA storage), and in particular does not consider eﬃcient coding schemes for their model. Finally, thatwork derives bounds on the channel capacity for a particular choice of (a stochastic analog of) the map f from k -mers to current levels, derived from experimental data. In contrast, we are interested in results forany f , and in particular for the best and worst such functions f . While the former direction is obviously ofimmediate interest for existing technology, it is our hope that understanding how the capacity of the channelchanges with f could perhaps guide how nanopore technology is developed in the future. We note that thisis still an emerging area and the technology is evolving; see [11] for an overview of recent advances. In this section we formalize our model. As mentioned above, our goal is to focus on the challenges of (1)inter-symbol interference, and (2) the possibility of diﬀerent k -mers producing similar current readouts. Withthat in mind, we propose a very simple model for the nanopore sequencer. As input, we take an encodedstring s s · · · s n − ∈ { A, C, G, T } n . This is transformed into a sequence of k -mers according to a slidingwindow, to obtain ( s · · · s k − ) , ( s · · · s k ) , ( s · · · s k +1 ) , . . . , ( s n − k · · · s n − ). Finally, each of these k -mers ismapped to one of b distinct current levels, according to a mapping f : { A, C, G, T } k → { , , . . . , b − } . Thismapping f deﬁnes the channel. Deﬁnition 1 (Abstract Nanopore Channel) . Given a mapping f : { A, C, G, T } k → { , , . . . , b − } from k -mers to current levels, let f ∗ : { A, C, G, T } ∗ → { , , . . . , b − } ∗ represent the mapping from DNA strandsto their full current readout, i.e., f ∗ ( s s · · · s n − ) = f ( s · · · s k − ) ◦ f ( s · · · s k ) ◦ · · · ◦ f ( s n − k · · · s n − ) . We call f ∗ the abstract nanopore channel given by f .Given a mapping f , we are interested in the capacity C f (in bits-per-base), which we deﬁne as follows.3 eﬁnition 2. Let f : { A, C, G, T } k → { , , . . . , b − } . The capacity C f of the abstract nanopore channeldetermined by f is deﬁned as C f = lim n →∞ log |{ c ∈ { , , . . . , b − } n − k +1 | ∃ s ∈ { A, C, G, T } n s.t. f ∗ ( s ) = c }| n . (1)Observe that if S is a collection of strings s ∈ { A, C, G, T } n so that f ∗ ( s ) are all distinct for s ∈ S , thenby assigning a diﬀerent message to each s ∈ S , we can communicate perfectly (if not necessarily eﬃciently)across the abstract nanopore channel.We will focus on balanced mappings f . These are mappings so that | f − ( i ) | is the same for all i ∈{ , , . . . , b − } . Our ﬁrst contribution is an algorithm (Algorithm 1 below) that computes C f , given f . The basic idea is toconsider a ﬁnite automaton on the alphabet { , , . . . , b − } that accepts exactly those current readouts thatcan be generated by some DNA strand; then we use the transfer matrix method [2, 8] for counting acceptingpaths in that ﬁnite automaton.Formally, an Nondeterministic Finite Automaton (NFA) is a tuple ( Q, Σ , ∆ , Q , F ) where Q is the setof states, Σ is the alphabet, ∆ : Q × Σ → Q is the transition function, Q ⊆ Q is the set of initial states,and F ⊆ Q is the set of accepting states. Likewise, a Deterministic Finite Automaton (DFA) is a tuple( Q, Σ , δ, q , F ), deﬁned analogously except that δ : Q × Σ → Q is the transition function and there is only asingle initial state q .We consider the NFA M and the DFA M ′ described in Algorithm 1. The NFA M has states indexed bystrings in { A, C, G, T } k − and alphabet Σ = { , , . . . , b − } , the current levels. Given a state ( s · · · s k − )and an input current level i ∈ Σ, the NFA M can transition to any other state of the form ( s · · · s k − ) sothat f ( s s · · · s k − ) = i . All states are accepting states. By construction, a sequence of current readings c ∈ Σ n − k +1 can be an output f ∗ ( s s · · · s n − ) of the nanopore sequencer if and only if c is accepted by M .The DFA M ′ accepts exactly the same strings as M , and is obtained using the classic subset construction[18], such that each state of the DFA corresponds to a subset of the states of the NFA.The transfer matrix method is a method for obtaining a generating function g f ( z ) for the number ofstates accepted by a given DFA. In more detail, given the transfer matrix T for the automaton (so that T i,j is the number of transitions from state i to state j ), the transfer matrix method gives an expression for afunction g f ( z ) (in terms of the matrix T ), so that g f ( z ) = ∞ X m =0 N m z m , so that N m is the number of strings of length m accepted by the ﬁnite automaton. We will use the notation[ z m ] g f ( z ) to denote the coeﬃcient N m on z m . Lemma 1 (Transfer Matrix Method [2]) . Given a DFA D = ( Q, Σ , δ, q , F ) , let D ′ = ( Q ∪ q F , Σ ∪ λ, δ ′ , q , { q F } ) be obtained from the DFA D by adding a new state q F and new symbol λ , along with λ -transitions from each of the accepting states of D to q F . Let T be the transfer matrix of D ′ , where T ij isthe number of transitions from state i to state j , with q being state and q F being state | Q | . Then thegenerating function for D ′ is g f ( z ) = ( − | Q | × det( I − zT : | Q | , z det( I − zT ) where ( I − zT : | Q | , is the minor of index | Q | , , i.e., the matrix I − zT with the | Q | th row and th columndeleted. In our case, the number of strings of length m accepted by our ﬁnite automaton will be the number ofcurrent readouts of length m = n − k + 1 that can be generated by some DNA strand of length n .4rom g f ( z ), we can derive the asymptotic behavior of the number of possible current readouts requiredto determine C f . As shown in the proof below, it is related to the smallest positive singularity of g f ( z ). Algorithm 1:

Calculate C f Q ← { A, C, G, T } k − Σ ← { , , . . . , b − } NFA M ← ( Q, Σ , ∆ , Q, Q ) where ∆( s · · · s k − , i ) = { s · · · s k − | f ( s s · · · s k − ) = i } DFA M ′ ← (2 Q , Σ , δ, q , F ) obtained from the subset construction on MM ′ ← (2 Q ∪ { q F } , Σ ∪ { λ } , δ ′ , q , { q F } ) where δ ′ ( q, σ ) =  δ ( q, σ ) q ∈ Q and σ = λq F q ∈ F and σ = λ ∅ otherwise T ← transfer matrix of M ′ , i.e., T i,j = number of transitions from state i to state jg f ( z ) ← ( − | Q | × det( I − zT :2 | Q | , z det( I − zT ) where ( I − zT : 2 | Q | ,

0) is the minor of index 2 | Q | , r ← smallest positive real root of the denominator of g f ( z ) (simpliﬁed)return log r Theorem 2.

Given a mapping f : { A, C, G, T } k → { , , . . . , b − } , Algorithm 1 computes C f .Proof. First, observe that the NFA M accepts exactly those current readouts that can be generated bysome DNA strand under the mapping f . For any current readout accepted by M , consider an acceptingpath P = s · · · s k − , s · · · s k − , . . . , s m · · · s m + k − . Then by construction, the DNA strand s · · · s m + k − generates that current readout. In the other direction, for any DNA strand s · · · s m + k − , the path P is anaccepting path for f ∗ ( s · · · s m + k − ).The DFA obtained from the subset construction accepts the same current readouts as M [18]. Thenwe apply the transfer matrix method described in Lemma 1 to obtain the generating function g f ( z ), whichcounts the number of current readouts accepted by M .Finally, we need to extract the asymptotic behavior of [ z m ] g f ( z ). Since g f ( z ) is a generating function withnon-negative coeﬃcients, the Exponential Growth Formula [8] tells us that lim sup m →∞ ([ z m ] g f ( z )) /m = 1 /r where r is the smallest positive singularity of g f ( z ). Note that the number of possible current readouts ismonotonically non-decreasing in m , so the lim sup is equal to the limit. Therefore, based on equation (1),and the fact that the length of the current readouts for DNA strands of length n is m = n − k + 1, C f = lim n →∞ log (cid:0) [ z n − k +1 ] g f ( z ) (cid:1) n = lim n →∞ n − k + 1 n log (cid:0) [ z n − k +1 ] g f ( z ) (cid:1) / ( n − k +1) = log 1 r Unfortunately, the DFA obtained from the subset construction has 2 k − states, so the runtime of Algorithm 1is exponential in the problem size (i.e., the description length of f ).It is possible that calculating C f may be hard, because computing such a statistic for NFAs in general isPSPACE-complete [19]. Speciﬁcally, even determining whether C f = log b for b = 2 or 4 ( C f = log b is thehighest possible capacity over all mapping functions f ; see Lemma 5) is equivalent to determining whetherthe corresponding NFA is universal (i.e., accepts every string in the alphabet). We make this precise in thefollowing lemma. 5 emma 3. For b = 2 or and any f , C f is equal to log b if and only if every current readout in { , , . . . , b − } ∗ can be generated by some DNA strand.Proof. First we verify the “if” direction. Indeed, if every current readout can be generated, then C f =lim n →∞ log b n − k +1 n = log b .To show the “only if” direction we show the contrapositive. Suppose there is some current readout c oflength ℓ which cannot be generated by any DNA strand. Observe that any current readout which contains c as a contiguous substring can also not be generated by any DNA strand (otherwise we could generate c bytruncation). Thus, consider any current readout of length n − k + 1 which can be generated by some DNAstrand. If we divide it up into sections of length ℓ , obviously none of those sections can be equal to c . So wesee that C f ≤ lim n →∞ log (cid:16) ( b ℓ − ⌊ n − k +1 ℓ ⌋ b ( n − k +1) mod ℓ (cid:17) n = lim n →∞ ⌊ n − k +1 ℓ ⌋ log( b ℓ − n = 1 ℓ log( b ℓ − < log b. Since universality is PSPACE-complete for general NFAs, an eﬃcient algorithm would have to in someway leverage the highly structured nature of the NFAs corresponding to some mapping f . Remark 4.

Although we currently don’t know how to calculate (or approximate) C f eﬃciently, it is possibleto approximate the capacity for strands of ﬁxed length ℓ , C f ( ℓ ). Speciﬁcally, there are existing randomizedapproximation algorithms for counting the number of strings of a given length accepted by an NFA, in timepolynomial in ℓ and the size of the NFA [13, 1]. The above approach for computing C f exactly given a mapping f is only practical for small window sizes k .However, we can derive some general bounds that apply to any mapping f . In particular, we are interestedin the worst-case mapping f (e.g., the one that attains min f C f ), as well as the best-case mapping f (e.g.,the one that attains max f C f ). We prove the following lemma. Lemma 5.

For a given window size k and with b distinct current levels, we have the following bounds on C f :1. max f C f = min(log( b ) , min f C f ≥ log( b ) k min f C f ≤ when b ≤ k Proof.

We prove each bound independently:1. Observe that the the size of the set in the numerator of equation (1) is bounded above both bythe number of possible current readouts ( b n − k +1 ) and the number of DNA strands (4 n ). Thereforemax f C f ≤ min(log( b ) , b = 2, it is achieved by, for instance, f ( s · · · s k − ) = ( s ∈ { A, C } c ∈ { , , . . . , b − } n − k +1 , simply replaceall the 0’s with A ’s and the 1’s with G ’s, and add any k − k k +1 . . . 4 k b b i t s - p e r - b a s e Figure 2: Bounds on the value of C f . The dashedblue line is equal to max f C f . The value ofmin f C f must lie somewhere in the red shadedarea. 1 2 4 8 16012 b b i t s - p e r - b a s e Figure 3: For k = 2, the true max f C f (dashedblue), min f C f (solid red), and E f C f (dottedgreen), superimposed over our bounds from Fig-ure 2.For b ≥

4, the optimal rate is achieved by any mapping where f ( s · · · s k − ) = f ( t · · · t k − ) implies s = t , since then every DNA strand ending with k − f , we can always generate at least b ⌊ n/k ⌋ distinct current readouts: anychoice of desired 0 th , k th , 2 k th , etc. current readings can be obtained because they correspond tonon-overlapping length- k windows. Thus min f C f ≥ lim n →∞ log b ⌊ n/k ⌋ n = log( b ) k .3. When b ≤ k , we can eﬀectively make the bases A and C indistinguishable from each other, andlikewise the bases G and T indistinguishable from each other. Consider any balanced mapping f ′ : { A, G } k → { , , . . . , b − } — in this case, balanced means that for each i , (cid:12)(cid:12) { s · · · s k − ∈ { A, G } k | f ′ ( s · · · s k − ) = i } (cid:12)(cid:12) = 2 k /b. Then deﬁne f ( s ) = f ′ ( s ′ ) where s ′ is obtain from s by replacing all C ’s with A ’s and all T ’s with G ’s.Observe that f is a balanced mapping, and also the size of the set in the numerator of equation (1) isnow at most 2 n . Thus min f C f ≤ lim n →∞ log(2 n ) n = 1.Combining these bounds, we obtain the plot in Figure 2. One might wonder how tight the bounds shownin Figure 2 are, and also about where C f lies for a “typical” mapping function f . Using Algorithm 1, weare able to exactly calculate the maximum, minimum, and average C f for k = 2, restricted to balancedmappings. These empirical results are shown in Figure 3.The lower bound is on min f C f curious. Our theoretical bounds on min f C f are not tight, but the k = 2 results suggest there may be some “bumpiness” in the true bound. Also based on the k = 2 results, itappears that random (balanced) mappings are closer to the best-case scenario than the worst-case. Section 5.2illustrates one coding scheme that takes advantage of a property shared by most random mappings for b = 2,and we may hope there exist more general coding schemes that perform well for random mappings. In the proof of Lemma 5, we observed that given a mapping f with window size k and b distinct currentlevels, we can achieve a rate of log( b ) k by coding only on the 0 th , k th , k th , etc. current readings. Here, we7ropose two generalizations of this trivial scheme to improve the rate with additional preprocessing basedon the mapping f . Instead of coding on non-overlapping windows of length k , each of which maps to one of b current readings,we may instead choose a block length ℓ ≥ k and code on non-overlapping blocks of length ℓ . This requiresprecomputing an alphabet Σ( ℓ ) ⊆ { A, C, G, T } ℓ such that f ∗ ↾ Σ( ℓ ) is an injective function with the samerange as f ∗ ↾ { A,C,G,T } ℓ . That is, each DNA strand in Σ( ℓ ) generates a unique current readout, and togetherthey generate every possible current readout of length ℓ − k + 1 given the mapping f . Provided our desiredstrand length n is divisible by ℓ or tends to inﬁnity, this coding scheme obtains the rate C f ( ℓ ) = log |{ c ∈ { , , . . . , b − } ℓ − k +1 | ∃ s ∈ { A, C, G, T } ℓ s.t. f ∗ ( s ) = c }| ℓ = log | Σ( ℓ ) | ℓ Since lim ℓ →∞ C f ( ℓ ) = C f , it is possible to get arbitrarily close to the optimal rate by picking a suﬃcientlylarge ℓ . Theorem 6.

Given a mapping f , for all ǫ > , there is a coding scheme achieving rate C f − ǫ with linear timeencoding and decoding that requires preprocessing time O ( ǫ · /ǫ ) , where the O ( · ) notation hides constantsthat may depend on the mapping f .Proof. We will exhibit such a scheme by choosing an appropriate block length ℓ . Let | Σ( ℓ ) | be the numberof distinct current readouts that can be generated from DNA strands of length ℓ . Equivalently, | Σ( ℓ ) | =[ z ℓ − k +1 ] g f ( z ) is the number of current readouts of length ℓ − k + 1 accepted by the NFA M constructedby Algorithm 1. Because g f ( z ) is a counting function for a regular language, and because [ z ℓ − k +1 ] g f ( z ) isnon-decreasing in ℓ , the asymptotic behavior of [ z ℓ − k +1 ] g f ( z ) has a simple form ([8], Theorem V.3): | Σ( ℓ ) | = [ z ℓ − k +1 ] g f ( z ) = Θ(Π( ℓ − k + 1)(2 C f ) ℓ − k +1 )where Π( x ) is a polynomial.Thus, there must exist some ℓ and some constant C depending only on the mapping f , such that for all ℓ ≥ ℓ , | Σ( ℓ ) | ≥ C (2 C f ) ℓ − k +1 . Therefore, if we choose ℓ ≥ max (cid:16) ℓ , C f · ( k − − log Cǫ (cid:17) , we see that C f ( ℓ ) = log | Σ( ℓ ) | ℓ ≥ log (cid:0) C (2 C f ) ℓ − k +1 (cid:1) ℓ = log C + C f · ( ℓ − k + 1) ℓ = C f − C f · ( k − − log Cℓ ≥ C f − ǫ. Therefore, to achieve rate C f − ǫ , we should choose ℓ proportional to 1 /ǫ , with the constants depending onlyon the mapping f .Given the block length ℓ , we now describe how to compute the alphabet Σ( ℓ ). We will construct an array E containing the alphabet Σ( ℓ ) and a hash table D mapping current readouts of length ℓ − k + 1 to theindex in E of the DNA strand that generates that readout.For each DNA strand s of length ℓ , compute f ∗ ( s ). If f ∗ ( s ) has not yet been added to the hash table D , append s to the end of array E and map f ∗ ( s ) to the appropriate index. This preprocessing takes O ( ℓ )time for each of 4 ℓ DNA strands, for a total of O ( ℓ · ℓ ) = O ( ǫ · /ǫ ).Encoding and decoding are then quite straightforward: Convert the message to base | Σ( ℓ ) | and use lookuptable E to map each digit to a block of length ℓ . Similarly, decode each block of ℓ − k + 1 current readingsusing the hash table D (skipping the k − b = 2 or 4 and f is any mapping with the highest-possible capacity C f = log b . Per Lemma 3, this implies that every current readout can be generated by some DNA strand, so | Σ( ℓ ) | = b ℓ − k +1 and C f ( ℓ ) = ( ℓ − k +1) log( b ) ℓ . In this case, we can calculate the dependence of ℓ on ǫ exactly: C f − ǫ = ( ℓ − k + 1) log( b ) ℓ − ǫ = ( − k + 1) log( b ) ℓℓ = ( k −

1) log( b ) ǫ . For instance, when k = 2 , b = 2 , and ǫ = 0 .

1, we would require ℓ = 10. Alternatively, instead of changing the lengths of the blocks used in the trivial scheme, we may relax the“non-overlapping” requirement.This may not always be possible, depending on the mapping f . Indeed, in the worst case, it is possiblethat once you have ﬁxed the ﬁrst k bases, the next k − k bases are all A , and all windows starting with A map to the same current level. However, thispessimal case shouldn’t happen for most mappings.Consider the case of b = 2 current levels, and suppose that for some length 1 ≤ ℓ < k , for every “preﬁx” p ∈ { A, C, G, T } ℓ , at least one window in f − (0) and at least one window in f − (1) starts with that preﬁx.Then it is possible to have every ( k − ℓ ) th current reading code for one binary symbol independently. Thiswould give us a rate of k − ℓ rather than k . Such an event is not too unlikely with a random mapping f . Lemma 7.

Given a random mapping f with b = 2 distinct current levels and a length ≤ ℓ < k , f admitsthe coding scheme described above with rate k − ℓ with probability at least − ℓ · · (cid:18) (cid:19) k − ℓ . Furthermore,1. we can determine whether such a scheme exists for a given f and ℓ in O (4 k ) .2. we can ﬁnd the maximum ℓ for which such a scheme exists for a given f in O ( k k ) .3. we can implement such a scheme with O ( k k ) preprocessing and linear encoding and decoding.Proof. Let f be a uniformly random mapping from { A, C, G, T } k → { , } (not necessarily balanced). Fora given preﬁx p of length ℓ and current level β , the probability that all 4 k − ℓ windows beginning with thatpreﬁx belong to f − ( β ) is (cid:0) (cid:1) k − ℓ . Thus by the union bound, the probability of being able to execute thisscheme for a given length ℓ is at least 1 − ℓ · · (cid:18) (cid:19) k − ℓ . This probability would only be higher if we restrict f to be a random balanced mapping.Given a speciﬁc mapping f and desired length ℓ , we can trivially check whether this property is satisﬁedby checking, for each preﬁx of length ℓ , the current level corresponding to each of the 4 k − ℓ windows beginningwith that preﬁx, in time O (4 k ).Naturally, we could determine the maximum ℓ that works for a given mapping f in O ( k k ) by repeatingthe above procedure for each ℓ ∈ { , , . . . , k − } .Furthermore, such a scheme could be straightforwardly implemented for a given f . For instance, wecould construct tries — trie 0 corresponding to f − (0) and trie 1 corresponding to f − (1) — in O ( k k ). Forease of encoding, we will also equip each leaf node s s · · · s k − of the tries with two pointers, pointing to9he internal nodes s k − ℓ s k − ℓ +1 · · · s k − in both trie 0 and trie 1. To encode from binary to a DNA strand,we then start with any length- k window in the trie corresponding to the ﬁrst bit. For each subsequentbit, follow the pointer from our current leaf to the internal node of the trie corresponding to that bit, thenappend an arbitrary path from that preﬁx to a leaf in the appropriate trie. Decoding is simply a matter ofextracting every ( k − ℓ ) th current reading. Thus both encoding and decoding are linear in the length of theDNA strand.For example, with k = 6 , ℓ = 4, we see that we can obtain a rate of 1 / / − . We have initiated a theoretical study of coding for a highly abstracted version of the nanopore sequencer forDNA storage. We have provided algorithms and bounds for understanding the capacity, and we have giveneﬃcient coding schemes. However, we view our work as the tip of an iceberg. First, even for this abstractedmodel, much remains open. Can one derive better bounds on the capacity, or compute it eﬃciently for,say, k = 6? Second, we hope that our insights will generalize to more practical models, including withsubstitution and synchronization errors. References [1] Marcelo Arenas et al. “Eﬃcient Logspace Classes for Enumeration, Counting, and Uniform Gen-eration”. In:

Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles ofDatabase Systems (2019).[2] Abdulbaki Aydin, Lucas Bang, and Tevﬁk Bultan. “Automata-Based Model Counting for String Con-straints”. In:

Computer Aided Veriﬁcation . Ed. by Daniel Kroening and Corina S. P˘as˘areanu. Cham:Springer International Publishing, 2015, pp. 255–272. isbn : 978-3-319-21690-4.[3] Meinolf Blawat et al. “Forward error correction for DNA data storage”. In:

Procedia Computer Science

80 (2016), pp. 1011–1022.[4] Luis Ceze, Jeﬀ Nivala, and Karin Strauss. “Molecular digital data storage using DNA”. In:

NatureReviews Genetics

ICASSP 2020-2020 IEEE InternationalConference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE. 2020, pp. 8822–8826.[6] George M Church, Yuan Gao, and Sriram Kosuri. “Next-generation digital information storage inDNA”. In:

Science

Analytic Combinatorics . Cambridge University Press, Jan.2009. isbn : 978-0-521-89806-5.[9]

Flappie: Flip-ﬂop basecaller for Oxford Nanopore reads . https://github.com/nanoporetech/flappie .Last accessed: January 2021.[10] Nick Goldman et al. “Towards practical, high-capacity, low-maintenance information storage in syn-thesized DNA”. In: Nature

Journal ofhuman genetics

Genome biology

Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algo-rithms . SODA ’95. San Francisco, California, USA: Society for Industrial and Applied Mathematics,1995, 551–557. isbn : 0898713498.[14] Henry H Lee et al. “Terminator-free template-independent enzymatic DNA synthesis for digital infor-mation storage”. In:

Nature communications

IEEE Transactions on Information Theory

Nature biotechnology

IBM Journal ofResearch and Development doi : .[19] Narad Rampersad, Jeﬀrey Shallit, and Zhi Xu. “The computational complexity of universality problemsfor preﬁxes, suﬃxes, factors, and subwords of regular languages”. In: CoRR abs/0907.0159 (2009).arXiv: . url : http://arxiv.org/abs/0907.0159 .[20] Scrappie: a technology demonstrator for the Oxford Nanopore Research Algorithms group . https://github.com/nanoporetech/scrappie .Last accessed: January 2021.[21] Ryan R Wick, Louise M Judd, and Kathryn E Holt. “Performance of neural network basecalling toolsfor Oxford Nanopore sequencing”. In: Genome biology