An Indel-Resistant Error-Correcting Code for DNA-Based Information Storage
W.H. Press and J.A. Hawkins
Institute for Computational Engineering and Sciences
University of Texas at Austin
December 5, 2018
Engineered DNA is an information channel. One can convert an arbitrary message into a string of DNA characters, or bases, {A, C, G, T}; synthesize the string into a physical DNA sample; store or transport the sample through space and time; sequence it back to a string of characters; and then hope to recover exactly the original message. Because errors are introduced during all the stages of synthesis, storage, and sequencing, it is necessary to utilize an error-correcting code (ECC) at the stage of converting message bits to DNA characters (encoding), and then later, when DNA characters are converted back to message bits (decoding). The ECC needs to correct three kinds of errors: substitutions of one base by another, spurious insertions of bases, and deletions of bases from the message. Insertions and deletions are commonly termed "indels".

The correction of substitutions is a standard problem in coding theory, where substitutions are termed "errors". The overarching theoretical framework for coding theory starts with Shannon [1], and there exist hundreds, if not thousands, of well studied error-correcting codes (ECCs) [2, 3, 4, 5]. However, established methods for error correction in the case of silent deletions—termed deletion channels—are few; and there are virtually no established methods for channels with all three of deletions, insertions, and substitutions. (See [6] and [7] for reviews and references.) Indeed, no approaches suggested in the literature are well suited to DNA applications [8].
For example, almost all attention has been on binary channels, while the DNA channel is quaternary.

As the main contribution of this paper, we describe in Section 4 a method for encoding a stream of arbitrary message bits onto a stream of DNA characters with an ECC that simultaneously corrects all three kinds of errors. Our ECC, which we call HEDGES (for "Hash Encoded, Decoded by Greedy Exhaustive Search"), is tuned to recover character-by-character message synchronization, even at the cost of leaving a small number of uncorrected substitution errors. This tuning makes HEDGES useful as the "inner" code (closest to the channel and applied last in encoding) in a concatenated code design, leaving it to a conventional "outer" code (applied first in encoding, last in decoding) to correct any remaining substitution errors. Below, in combination with HEDGES, we will use the standard Reed-Solomon (R-S) code [9] denoted RS(255,223). R-S codes are completely intolerant of indels, i.e., they require perfectly synchronized input.

Because of our tuning of HEDGES for use in a concatenated design, it will be useful to first describe a possible full system design (Section 3), and only then describe HEDGES in detail (Section 4). The system design in Section 3 is illustrative but not unique. HEDGES itself, as a quaternary-alphabet indel-resistant ECC, is general and can readily be utilized in other overall designs.

2 Previous Work

There is a growing body of experimental work on DNA information storage, employing various strategies for dealing with errors. We summarize chronologically some of the previous work as it relates to this paper.

Church et al. [10] synthesized oligomers of length 159, each of which contained both address information (ordering of oligomers in the message) and payload. There was no explicit ECC. The pool of oligomers was sequenced to a depth of 3000x, allowing recovery of a consensus sequence with high probability. High-depth coverage can correct sequencing errors, but not synthesis errors. Indeed, the final results contained 10 bit errors.

Goldman et al. [11] synthesized oligos with 3/4 of each strand overlapping the previous strand, in effect a 4x repetition code. Each strand had parity check bits for error detection (but not correction). A ternary code (ternary message alphabet to quaternary DNA alphabet) was utilized to avoid homopolymers. Error correction was done by sequencing to high depth and filtering for, and aligning, perfectly sequenced fragments. The final results contained two gaps where none of the four overlapping sequences were recovered.

Grass et al. [12] implemented interleaved Reed-Solomon codes for DNA storage. Message bits were converted to characters in the 47-character alphabet GF(47), with interleaved correction in blocks of (713 × 39) characters. An inner code mapped GF(47) characters to 47 DNA trimers chosen to guarantee no homopolymers of length > 3.

Bornholt et al. [13] sent, in addition to message strands A and B, the redundant strand C = A ⊕ B (or other redundant combinations), where ⊕ denotes exclusive-or, and utilized majority decoding. Parity check bits allowed the filtering out of faulty strands. There was no explicit correction of indels.

Erlich and Zielinski [14] utilized quite a different overall architecture, based on fountain codes. Fountain codes send linear combinations of portions of a message in "droplets", such that one can recover the original message by the solution of linear equations. The included redundancy allows the loss of some droplets. R-S coding was used within each oligomer droplet for error detection but not correction. Faulty droplets, including any with indels, were rejected.

Yazdi et al. [15] leveraged the multiple sequence alignment capabilities of several sophisticated packages in conjunction with a custom homopolymer check code to correct the large error rates—especially homopolymer errors—associated with MinION nanopore sequencing. Sequencing depths were in the range of several hundred.

In the largest-scale experiment to date, Organick et al. [16] encoded and recovered, error-free, more than 200 MB of data. R-S coding was used across strands, with no explicit error correction within a strand. Substitutions and indels within a strand were corrected by multiple alignment and consensus calling. Coverage was 5x for high-quality Illumina sequencing, rising to the 36x to 80x required for Nanopore technology.

To summarize, while previous work has adopted increasingly sophisticated system designs, there has been little progress on the fundamental problem of correcting indels within a single strand. The use of sequencing to large depth, followed by multiple alignment, is in effect a use of the oldest, simplest, and arguably least efficient ECC, namely a simple repetition code, sending the same message multiple times and taking the consensus. This manifests itself as sequencing stored DNA to high coverage, finding sets of reads which appear to derive from the same intended sequence, and either merging the reads into a consensus sequence and/or filtering out any reads which fail some quality check. Though this is a central part of previously implemented DNA error-correction schemes, it also tends to be left unaccounted for in claims of code efficiency. The central result in this paper is a technique for correcting substitutions and indels in a single strand, i.e., when sequenced to no more than depth one.

Figure 1: The overall system design is a concatenated code with interleaving across 255 strands of DNA (horizontal lines in the figure). Each strand is protected by HEDGES, this paper's indel-correcting code. HEDGES restores synchronization, so that the packet can then be protected against residual errors and missing strands by a Reed-Solomon code (applied diagonally, as shown, for reasons described in the text).

3 Example System Design
For cost and efficiency, both DNA synthesis and DNA sequencing employ massive parallelism. That is, many short sequences, each of length hundreds to thousands of bases, are written (synthesized) or read (sequenced) simultaneously. While the length of a single synthesis or read will increase as technology improves, it is unlikely that the great advantage of parallelism will ever be superseded. This being the case, the basic units of our system design are individual strands of length 10² to 10³.

To connect with our use of RS(255,223), we define a "DNA packet" as an ordered set of 255 DNA strands. When any one strand in the set is decoded with HEDGES, it produces a message fragment of length L bytes (say), now having high probability of being perfectly synchronized. The 255 correctly ordered message fragments form a "message packet", as illustrated in Figure 1. There can be any number of message packets in a total communication.

The Reed-Solomon code is applied across the strands (interleaved). This enables it to protect against missing strands—"erasures" to coding theorists—as well as correcting any residual substitution errors that were not corrected by HEDGES. Different from previous investigations, we apply the R-S code diagonally across the strands (see Figure 1). This increases the resistance to any failure of synthesis or sequencing to produce full-length strands. It also ameliorates the effect of the observed tendency for error rates to be higher at the ends of strands.

It is an important point that the Reed-Solomon code can only be applied after the strands in a packet are identified as being from one particular packet (out of an assumed pool of many packets, perhaps millions) and are correctly ordered. This implies that a packet's identification number and a strand's serial number within the packet (both shown as shaded green in the Figure) cannot themselves be R-S protected.
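The diagonal application of the outer code can be made concrete with a small index-mapping sketch. This is our own illustration, not the paper's reference code; the assignment rule (byte j of strand s goes to codeword (s + j) mod 255) is one plausible realization of the diagonal interleaving described above.

```python
# Sketch (our own, not the paper's code) of diagonal interleaving: byte j of
# strand s is assigned to Reed-Solomon codeword (s + j) mod 255, so each
# codeword crosses all 255 strands along a slant.
NUM_STRANDS = 255  # one RS(255,223) symbol per strand per codeword

def diagonal_codewords(strand_len):
    """Return codewords[d] = list of (strand, position) pairs on diagonal d."""
    codewords = [[] for _ in range(NUM_STRANDS)]
    for s in range(NUM_STRANDS):
        for j in range(strand_len):
            codewords[(s + j) % NUM_STRANDS].append((s, j))
    return codewords

# With strand_len == 255, each codeword takes exactly one byte from each
# strand, so a lost strand contributes exactly one erasure per codeword.
cw = diagonal_codewords(255)
assert all(len(c) == 255 for c in cw)
assert all(len({s for (s, _) in c}) == 255 for c in cw)
```

With this assignment a missing strand costs every codeword one erasure instead of concentrating 255 erasures in a few codewords, and positions near strand ends (where error rates are observed to be higher) are spread evenly across codewords.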
We will instead protect them by a different technique ("salt protection") that is described in Section 4.4. Salt protection has the effect of turning uncorrectable errors in the identification/serial bytes into erasures in the message bytes—which are correctable by R-S.

Summarizing, here are the main points that affect the design of HEDGES as an inner code: (1) We don't need to decode strands of arbitrary length, but only of some known uncorrupted length L. (2) Recovering synchronization has the highest priority. (3) Known erasures are less harmful than unknown substitutions, because R-S can correct twice as many erasures as substitution errors. (4) Burst errors within a single byte are less harmful than distributed bit errors, because R-S corrects a byte at a time. (5) Within the R-S code's capacity for byte errors and erasures, residual errors will be fully corrected by the outer code, yielding an error-free message.

Given a message stream of bits

    b_i, i = 0, 1, 2, . . . , M,    b_i ∈ {0, 1}    (1)

("the message" or "bits"), we want to emit a stream of DNA characters

    C_i, i = 0, 1, 2, . . . , N,    C_i ∈ {A, C, G, T} ≡ {0, 1, 2, 3}

("the codestream" or "characters"). We first describe the case of a half-rate code, where we emit exactly one C_i (2 bits of output) for each b_i (1 bit of input). In Section 4.5 we generalize to codes at other rates r (message bits per codestream bit), 0 < r < 1, so that the streams b_i and C_i are not then in lockstep, and M ≠ N. One should think of N as being on the order of 10² to 10³, the maximum length of a single DNA strand that can be cheaply synthesized today or in the foreseeable future.

We want to be able to decode without residual errors a received codestream C′ that differs from C by substitutions (errors), insertions, and deletions (collectively "indels"). Indels are silent: their positions in the codestream C′ are not known to the receiver.

The basic plan is a variant of a centuries-old cryptographic technique, "text auto-key encoding" [17]. We generate a keystream of characters K_i ∈ {0, 1, 2, 3}, where each K_i depends pseudorandomly (but deterministically, by a hash function) on some number of previous message bits b_j (with j < i), and also directly on the bit position index i. (We can initialize the previous bits by defining b_j ≡ 0 for j < 0.) The emitted character is C_i = K_i + b_i, the addition performed modulo 4. In the terminology of modern code theory, this scheme would be called a type of "tree code" or, more specifically, an "infinite constraint-length convolutional code".

The redundancy necessary for error correction comes from the fact that b_i takes on only two values, while K_i and C_i can have four values. This generates (only) one bit of redundancy per character; i.e., a wrong character can be acausally valid by chance half the time. However, the dependence of K_i on many previous message bits ties any given message bit to many future bits of redundancy. Similarly, the dependence of K_i on i ties every bit to its position index, so that (as we will see) insertions can be identified and removed, and deleted values can be restored.

What is not obvious is that a codestream thus generated can actually be practically decoded, especially in the presence of errors and indels at significant rates.
We will show by numerical simulation that it can be, remarkably easily, essentially by guessing successive message bits and scoring against the likelihood of the codestream under the guessed hypothesis. Wrong guesses will be rejected by implying exponentially small downstream likelihoods. In coding theory, this general technique is known as "sequential decoding".

Elaborating slightly on the above description, let S_i denote an arbitrary s-bit value ("salt") that can depend on i but is known to both sender and receiver,

    S_i = known ∈ Z_2^⊗s

Denote the low-order q bits of the bit position index i by

    I_i ≡ i (mod 2^q)

Let B_i denote the r previous concatenated bits,

    B_i ≡ [b_{i−r} b_{i−r+1} · · · b_{i−1}] ∈ Z_2^⊗r

Finally, let F(S, I, B) be a deterministic hash function from r + q + s bits to 2 bits,

    F(S, I, B) : Z_2^⊗(r+q+s) → Z_4

Then the formula for encoding is

    C_i = K_i + b_i = F(S_i, I_i, B_i) + b_i (mod 4)    (2)

Figure 2 shows the algorithm graphically.

Typical values that we use are r = 8, q = 10, s = 46, so that r + q + s = 64 bits, a convenient value for input to the hash. For the hash function we use the low-order 2 bits from the Numerical Recipes [18] function Ranhash.int64(), because it is very fast and will occur in the inner loop of the decode algorithm.

Figure 2: The basic HEDGES encoding algorithm is a variant of plaintext auto-key, but with redundancy introduced because (in the case of a half-rate code, e.g.) one bit of input produces two bits of output.
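Equation (2) can be sketched in a few lines. This is our own minimal Python illustration, not the authors' C++ implementation: the hash below is a generic 64-bit integer mixer standing in for Ranhash.int64(), and the names are ours; the parameter values r = 8 and q = 10 follow the text.

```python
# Minimal half-rate HEDGES-style encoder sketch (our own illustration).
# The hash is a generic 64-bit mixer, NOT the paper's Ranhash.int64().
R_PREV_BITS, Q_INDEX_BITS = 8, 10   # r and q from the text

def hash2(salt, index, prev_bits):
    """Two pseudorandom bits (keystream K_i) from salt, index, and history."""
    x = (salt << (R_PREV_BITS + Q_INDEX_BITS)) | (index << R_PREV_BITS) | prev_bits
    x = ((x ^ (x >> 33)) * 0xFF51AFD7ED558CCD) & 0xFFFFFFFFFFFFFFFF
    x = ((x ^ (x >> 33)) * 0xC4CEB9FE1A85EC53) & 0xFFFFFFFFFFFFFFFF
    return (x ^ (x >> 33)) & 3      # low-order 2 bits

def encode(bits, salt=0):
    """Map a bit list to DNA characters 0..3 (<-> A, C, G, T)."""
    chars, prev = [], 0
    for i, b in enumerate(bits):
        K = hash2(salt, i % (1 << Q_INDEX_BITS), prev)
        chars.append((K + b) % 4)   # C_i = K_i + b_i (mod 4), equation (2)
        prev = ((prev << 1) | b) & ((1 << R_PREV_BITS) - 1)  # keep last r bits
    return chars
```

Note how flipping any one message bit changes the keystream for the following r characters, which is exactly the mechanism that lets the decoder detect a wrong guess.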
For simplicity, assume that error rates are "small", so that "most" DNA bases are received as they were intended. (We will see in Section 5 that DNA character error rates up to ∼10% can be handled.) Suppose that we have correctly decoded all bits through b_{i−1} and now want to know bit b_i. Guessing the two possibilities, {0, 1}, we use equation (2) to predict two possibilities for the character C_i. In the absence of an error, only one of these is guaranteed to agree with the observed character C′_i. We assign to a guess that generates disagreement with C′_i a penalty score equal (conceptually) to the negative log probability of observing a substitution error. In other words, a wrong guess might actually be right, but only if a substitution has occurred. If neither guess produces the correct C_i, then both are assigned the substitution penalty.

We have not yet accounted for the possibility of insertions and deletions, however. In fact, there are more than the above two possible guesses. We must guess not just b_i ∈ {0, 1}, but also a "skew" ∆ ∈ {. . . , −2, −1, 0, 1, 2, . . .} that tells us whether in comparing C to C′ we should skip characters (∆ > 0) because of insertions, or posit missing characters (∆ < 0) because of deletions (in which case there is no comparison to be done). As a practical simplification we consider only ∆ ∈ {−1, 0, 1}. (We comment on this simplification in Section 4.6.) Then there are six guesses for (b_i, ∆) ∈ {0, 1} ⊗ {−1, 0, 1}. Each can be scored by an appropriate log probability penalty for any implied substitution, insertion, or deletion.

Log probability penalties accumulate additively along any chain of guesses. In the causal case of a chain of all-correct guesses, we accumulate penalties only in the (relatively rare) case of actual errors. However, because of the way that the key K_i (equation (2)) is constructed, a single wrong guess for either b_i, i, or ∆ throws us into the acausal case where 3/4 of subsequent comparisons of computed C (at some bit position index i) to observed C′ (at some index k) will not agree—thus penalties will accumulate rapidly. The decoding problem, conceptually a maximum likelihood search, thus reduces to a shortest-path search in a tree with branching factor 6, but with the saving grace that the correct path will be much shorter than any deviation from it.

Figure 3: The HEDGES decoding algorithm is a greedy search on an expanding tree of hypotheses. Each hypothesis simultaneously guesses a message bit b_i, its bit position index i, and its corresponding character position index k. While the tree can in principle grow exponentially, a "greediness parameter" P_ok (see text) limits its growth: Most spawned nodes are never revisited.

We can formalize the above discussion as follows. Let

    H := [i, b_i, B_i, k]

denote the joint hypothesis that the values i, b_i, B_i are all correct and synchronize to the observed codestream character C′_k through equation (2).
As a node in the search tree, the hypothesis H := [i, b_i, B_i, k] spawns six child hypotheses, each of which can be scored with additional penalty ∆P (to be added to their common parent's accumulated penalty) as follows:

    H := [i + 1, {0, 1}, B_{i+1}, k]     : ∆P = P_del
    H := [i + 1, {0, 1}, B_{i+1}, k + 1] : ∆P = (P_ok if C = C′ else P_sub)
    H := [i + 1, {0, 1}, B_{i+1}, k + 2] : ∆P = (P_ins + P_ok if C = C′ else P_ins + P_sub)    (3)

Here P_sub, P_ins, P_del can be thought of as respectively the log probability penalties for substitution, insertion, or deletion errors (but see Section 4.6). P_ok is the penalty or, if negative, reward, for an agreement between the computed and received codestream characters C and C′. In the comparison notated above as C = C′, the index of C is the first parameter in the hypothesis H, while the index of C′ is the last parameter in H. Note that a child node's B_{i+1} is always computable from its parent's B_i and b_i.

How can we practically search this huge tree? A conceptual starting point is the famous A* search algorithm [19], a best-first (that is, "greedy") search utilizing a heap data structure. A* assigns a heuristic cost to every node that is the sum of its actual cost plus a quantity less than or equal to the smallest possible additional cost that it can incur in reaching the goal. (For a tree of constant depth, this is equivalent to adding a reward for every step taken closer to the leaf nodes, i.e., a negative constant P_ok above.) Figure 3 shows the logical flow of an A* search, and also the HEDGES decode algorithm. As already remarked, in coding theory this kind of decoding strategy is called "sequential decoding".

Provably, A* always finds the best path.
For our application, unfortunately, it is exponentially slow, because actual errors along the true path cause too many spawned hypotheses to be revisited, and because its termination criterion is too restrictive, again leading to too many spawned hypotheses.

To ameliorate these problems we make two heuristic modifications of A*: First, we allow P_ok to be more negative than that sanctioned by A* and tune its value heuristically. While we thus lose the guarantee of finding exactly the shortest path, we heuristically encourage the search not to revisit earlier hypotheses after a sufficiently lengthy run of successes along one particular chain. Second, we adopt a "first past the post" termination criterion. That is, the first chain of hypotheses to decode the required L bytes of message wins. It is not obvious (or, by us, provable) that these heuristics should result in a workable or efficient algorithm, but we will demonstrate by numerical experiment that it does.

Above, we noted the importance of protecting message bits that determine the ordering or "serial number" of strands for the outer, concatenated Reed-Solomon code. In equation (2) (and Figure 2) we allowed for some number of bits of known salt S_i when message bit b_i is encoded. Here is how this salt enables extra protection: Suppose we want to protect an initial n message bits. Then define the salt recursively by

    S_0 = 0
    S_i = [S_{i−1} b_i],  i = 1, . . . , n − 1
    S_i = S_{i−1},        i ≥ n    (4)

where [S_{i−1} b_i] denotes appending b_i as the new low-order bit of the salt. Most errors in the first n bits will be corrected as usual by the shortest-path heap search. But any residual error that gets through will "poison" the salt for the entire rest of the strand, rendering it undecodable. In effect we convert an error in the protected bits into an erasure of the whole strand. This may seem drastic, but it is just what we want: A strand with incorrect serial number (and hence incorrect ordering among other strands) would look like a strand of errors (with probability 255/256 per byte) to the outer R-S; an erased strand is equivalent to only half as many errors.

A simple modification of the encode and decode algorithms described in Sections 4.2 and 4.3 allows for code rates other than one-half. Take the input bitstream of expression (1) and partition it into a stream of values v_k with variable numbers of bits in the range 0 to 2, according to a repetitive pattern like the ones shown in Table 1. Here are two examples showing how to interpret the entries in Table 1 (with adjacency denoting two-bit values in Z_4):

    Rate 0.750: v_0 = b_0 b_1, v_1 = b_2, v_2 = b_3 b_4, v_3 = b_5, . . .
    Rate 0.250: v_0 = b_0, v_1 = 0, v_2 = b_1, v_3 = 0, . . .

    Code rate | Pattern                            | P_ok (see text)
    0.750     | 2, 1, 2, 1, . . .                  | −0.…
    0.600     | 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, . . .| −0.…
    0.500     | 1, 1, 1, . . .                     | −0.…
    0.333     | 1, 1, 0, 1, 1, 0, . . .            | −0.…
    0.250     | 1, 0, 1, 0, . . .                  | −0.…
    0.166     | 1, 0, 0, 1, 0, 0, . . .            | −0.…

Table 1: Patterns converting message bits b_i to variable bits v_i for various code rates.

Equation (2) for encoding now becomes

    C_i = K_i + v_i = F(S_i, I_i, V_i) + v_i (mod 4)    (5)

where V_i is composed of concatenated previous variable bits. Pattern values of 0 provide one bit of additional redundancy check relative to the base case of code rate one-half, while pattern values of 2, encoding 2 bits per DNA character, provide one bit less. By construction the code rate is one-half the average of the integers in the pattern. The column in the table labeled P_ok will be explained in Section 4.6.

Decoding follows exactly the same pattern. Guessing a two-bit v_i spawns 12 child hypotheses, while guessing a zero-bit v_i spawns only 3.

For encoding, the parameter choices are (i) the choice of code rate and variable bit pattern (as in Table 1), the default case being code rate 0.5; (ii) the number of salt bits s and of salt-protected message bits n (Section 4.4); and (iii) the parameters q and r. One might think that larger values are always better for q and r, but this is not the case. Restricting r to a smaller value better allows the heap search to recover from previous errors, basically by finding an acausal (i.e., "wrong") path that coincidentally puts it back on track. As for q, restricting it to a smaller value could be useful in case one desires the capability of jumping into the middle of an undecoded message: The heap can then be initialized with all possible values of I and B (cf. Figure 2). For our system design, Section 3, this is not a necessary, or useful, capability, however. For the baseline validation experiments in Section 5, we take q = 10, r = 8, n = 16 or 24.

For decoding, we need to know the encoding parameters, and must now also choose values for P_sub, P_del, P_ins, and P_ok. While, conceptually, these are negative log probabilities of the occurrence of the different kinds of errors (which can be known only after the fact), we adopt a more empirical approach. First, we take P_sub = P_del = P_ins to give the HEDGES decoding algorithm equal robustness against all three kinds of errors. Second, we note that the search for shortest path is invariant under applying the same linear (or affine) transformation to all four P's. So, without loss of generality, we may take P_sub = P_del = P_ins = 1, leaving P_ok as the only free parameter. We determine optimal (or at least good) values for P_ok by numerical experiment. We find that the optimal P_ok depends only negligibly on the encoding parameters q and r, and only slightly on the length L of the strand, but it does depend on the code rate. Good values for various code rates are given in the third column of Table 1.

Implicitly, the choice of P_ok reflects a tradeoff between computational workload and decode failure probability.
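The variable-rate bit partition of Section 4.5 can be sketched directly. This is our own helper, not the paper's code; the patterns follow the rule that the code rate is one-half the pattern average, e.g. [2, 1] averages 1.5 bits per character (rate 0.75) and [1, 0] averages 0.5 (rate 0.25).

```python
# Sketch (our own) of the variable-rate bit partition: pattern[k % len] gives
# the number of message bits carried by value v_k (0, 1, or 2 bits).
def partition(bits, pattern):
    """Split a bit list into values v_k carrying pattern[k % len] bits each."""
    vals, pos, k = [], 0, 0
    while pos < len(bits):
        n = pattern[k % len(pattern)]   # pattern must contain a nonzero entry
        v = 0
        for _ in range(n):
            if pos < len(bits):
                v = (v << 1) | bits[pos]
                pos += 1
        vals.append(v)                  # n == 0 emits a pure check value v_k = 0
        k += 1
    return vals

# Rate 0.750 example from the text: v0 = b0 b1, v1 = b2, v2 = b3 b4, ...
assert partition([1, 0, 1, 1, 1, 0], [2, 1]) == [2, 1, 3, 0]
```

During decoding, a two-bit v_k spawns 4 × 3 = 12 children and a zero-bit v_k spawns 1 × 3 = 3, matching the child counts stated above.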
P_ok that is too negative results in too greedy a search, which is fast but can get stuck in a blind alley that requires us to declare the rest of the strand as an erasure (hence its dependence on strand length). On the other hand, P_ok that is insufficiently negative results in a too large, potentially exponential, expansion of the size of the heap. Happily, there is an accessible range of workable values. Modest changes in P_ok matter little, and our values are implicitly tuned for best performance on strand lengths in the range ∼100 to ∼1000.

As noted in Section 4.3, we restrict the skew to ∆ ∈ {−1, 0, 1} so as to limit the expansion of the heap. This results in more than one consecutive insertion or deletion being improperly scored. For example, without the possibility of skew ∆ = −2, the shortest available path through two deletions . . . DD . . . declares a spurious substitution . . . DSD . . . . In practice, this makes little difference, because double deletions are significantly less common than single deletions, and because other, completely incorrect, paths score much worse.

It is an important point that choosing any set of decode parameters is not an irrevocable choice. Given a DNA message, one can make multiple tries, varying the decode parameters adaptively until acceptable performance is achieved. One can evaluate success by running time and by the count of errors needing correction by the outer R-S code. The parameter values that we suggest may be viewed as starting points.
We have implemented HEDGES in C++, with also a Python interface for convenience. (We similarly implemented a compatible Python interface to the published "Schifra" implementation of Reed-Solomon [20].) For tests on individual strands of length L, we encode a random stream of message bits and degrade the resulting codestream by errors with a specified Poisson-random total rate, divided equally among the three error types: substitution, insertion, and deletion. Unless otherwise stated the HEDGES code rate is one-half.

Figure 4: Hypotheses expended (heap size) in decoding a long strand with 5% DNA character errors. Ten examples are shown each of successful decodes (blue) and unsuccessful decodes (red). Unsuccessful decodes are declared as erasures. Although here shown equally, red cases are actually much rarer than the blue (see also Figure 5).

We allow each decode a "hypothesis budget", that is, a maximum size to which the heap is allowed to expand. If, along a strand, a decode exceeds its budget, we declare subsequent message bits in that strand to be erasures. Figure 4 shows examples of how decodes expend their budget along a long strand. There is a sharp bifurcation between decodable strands, which typically expend ≲ 100 hypotheses per decoded bit, and undecodable strands, which go into blind alleys and readily expend whatever budget is allowed. Sufficiently rare decode failures are fully correctable by the concatenated Reed-Solomon code.

Figure 5 shows decode failure rates actually achieved by a half-rate code, as a function of length of strand, input total error rate, and hypothesis budget. One sees that, for strand lengths in the useful range 100–1000, small failure rates are readily achievable for total input error rates up to ∼10%. Such failure rates are easily absorbed as erasures in the error budget of an outer, interleaved Reed-Solomon code (see below, this section).

In the case of a successful decode, there may remain uncorrected substitution errors. Figure 6 shows the uncorrected (output) bit, and byte, error rates along strands of length 240 for the three input codestream character error rates 3%, 5%, and 10%. The uncorrected error rates vary along the strand for two reasons: First, for this experiment, we applied salt protection to the first 24 message bits (that is, 24 characters for the half-rate code). One sees that this worked as advertised: There were no uncorrected errors in the first 24 bits. Second, uncorrected error rates are seen to rise as the end of the strand is approached. Although undesirable, this is an inevitable feature of HEDGES. As the strand end is approached, there are fewer redundancy checks available downstream, making the greedy search algorithm less selective. Our system design (Section 3) allows for specifying some number of "runout" bits at the ends of the strands, encoding zeros and not part of the message packet. How much runout to allow depends on how much one wants to burden the R-S error budget. For this numerical experiment, we assumed 24 runout bits.

It is notable that the remaining output bit error rates in Figure 6 are small for all three input error rates, 3%, 5%, and 10%, respectively.
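The outer code's error budget can be checked numerically. In this sketch (our own; the helper name is hypothetical), RS(255,223) corrects a codeword when the number of byte errors, plus half the number of byte erasures, is at most t = 16, and the per-codeword failure probability is the Poisson tail with mean 255 P_equiv, as in equation (6) below.

```python
import math

# Sketch (ours) of the RS(255,223) failure estimate: the probability that more
# than t = 16 equivalent byte errors land in one 255-byte codeword, with the
# error count taken as Poisson with mean 255 * P_equiv.
def rs_failure_prob(p_equiv, t=16, n=255):
    lam = n * p_equiv
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(t + 1))
    return 1.0 - cdf

# At P_equiv = 1% (lambda ~ 2.55) failure is negligible; at 10% it is not.
assert rs_failure_prob(0.01) < 1e-6
assert rs_failure_prob(0.10) > 0.5
```

The threshold t = 16 follows from the RS(255,223) minimum distance: (255 − 223)/2 = 16 correctable byte errors, with an erasure counting as half an error.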
The corresponding byte error rates are only a few times larger than the bit error rates, reflecting the fact that residual errors tend to cluster within bytes, which is favorable for the byte-oriented R-S outer code.

What level of uncorrected errors may we allow to get through to the R-S outer code, with a high probability that they will there be corrected? RS(255,223) corrects a 255-byte codeword whenever

    N_equiv ≡ (byte errors) + 0.5 × (byte erasures) ≤ 16

so the per-codeword failure probability is

    P_failure = PoissonCDF(k > 16 | λ = 255 P_equiv)    (6)

where P_equiv is the byte error rate (B.e.r.) plus half the erasure rate. As mentioned above, we interleave and apply the R-S code diagonally (see Figure 1) because (i) strands may be missing, (ii) byte errors along a single strand may be bursty, and (iii) it is found experimentally that sequencing and synthesis error rates can be different (often larger) near the ends of strands. The interleaved diagonal pattern ensures that no single 255-length R-S packet gets handicapped by an error rate much different from the average across the whole strand, and that its number of errors will be distributed with (close to) Poisson statistics. Table 2 evaluates equation (6) for relevant values of P_equiv. One sees that a value P_equiv ≲ 1% is adequate to guarantee error-free decoding of messages of gigabyte length or longer.

Table 2: Probability P_failure that the outer code RS(255,223) fails, as a function of P_equiv, the inner-code HEDGES output byte error rate (plus half the erasure rate). One sees that P_equiv < 0.01 is sufficient for error-free decoding of gigabyte-length messages.

Figure 7 now shows the results of a numerical experiment evaluating P_equiv as a function of input DNA error rates for six different code rates. The evaluation was done with strand length L = 300 (as a variant of the value L = 240 in Figure 6), no salt protection, and 2 bytes of runout on each strand (errors in which are not counted). Using Table 2, one sees that a DNA error rate of about 1% is correctable at code rate 3/4 with probability effectively 1, increasing to a correctable error rate of about 15% at code rate 1/6.

Figure 6: Remaining uncorrected substitution error rates after HEDGES decoding, excluding the erasures implied by Figure 5. For this numerical experiment, strands of length 240 bases were encoded with a half-rate code, with 24 initial bits protected by salt (see Section 4.4). Strand averages (exclusive of protected and runout regions) are shown at the left edge of the figure. Error rates rise in the runout region, because fewer downstream bits of redundancy are available. Strand-average byte error rates are readily corrected by the outer Reed-Solomon code.

Figure 7: Effective rate of uncorrected byte errors after HEDGES decoding for 6 code rates. The curves give values of P_equiv that, after correction by the outer Reed-Solomon code, give the error rates labeled on the horizontal dotted lines (values from Table 2). Intersections with the dotted lines correspond to final error rates (post Reed-Solomon) shown on the right. This numerical experiment is done with strand length L = 300, no salt protection, and 2 bytes of runout (see text).

6 Channel Capacity re Shannon Limit
We might wonder how close the results of Figure 7 come to the absolute bound of the Shannon limiting channel capacity [1]. Unfortunately, computing the Shannon limit for even the simplest case of a binary deletion channel, let alone channels that also have insertions and substitutions, remains a difficult unsolved problem [6, 22]. Still, it is possible to get some idea by making an informed estimate as follows.

Figure 8: Comparison of the channel capacity achieved with HEDGES to an estimate of the Shannon limiting capacity. The dots show the greatest of the capacities achieved by the various code rates in Table 1. As expected, the smaller code rates become optimal for the larger input error rates. Codes with r > 0.75 (not tried) would give even better performance at the smallest error rates (left side of the graph).

A remarkable theorem of Shannon [21] proves that the channel capacity of a forward error-corrected channel is identical to that of a "feedback channel," where the sender gets to see (error-free) what was actually received, and then send correcting information post hoc. We can thus estimate channel capacity by reducing the maximum capacity (for DNA, 2 bits per character) by the entropy of the necessary correction messages. The reason that this is an estimate only (strictly, a lower bound) is that we may not be sending the optimally short correction messages, especially as the error rate becomes large.

In our case, suppose p is the character error rate. As before, assume equal probabilities p/3 for substitutions, deletions, and insertions. Then

C_Shannon = 2 − [ H(p) + p log₂ 3 + (p/3) log₂ 4 + (p/3) log₂ 3 ]   (7)

where H is the entropy of a binary choice,

H(p) ≡ −p log₂ p − (1 − p) log₂ (1 − p)   (8)

In equation (7), the first term in brackets is the cost of communicating whether a particular code character marks the position of an error. The second term is the cost of telling which kind of error (substitution, deletion, insertion). The third term is the cost of communicating the missing character in a deletion. The fourth is the similar cost for a substitution. While strictly only a lower bound, analogous results for binary deletion channels [22] suggest that equation (7) is actually a good approximation for small values of p.

We now calculate the channel capacity actually achievable with HEDGES by the relation

C_HEDGES = 2 × (code rate) × [1 − H(p_equiv)]   (9)

Here the factor 2 is the number of bits per DNA character, while the factor in square brackets reflects the loss of channel capacity to an (assumed perfect) concatenated outer code that corrects all of HEDGES' uncorrected bit errors.

Figure 8 shows the results of the comparison. One sees that HEDGES achieves a respectable fraction, ≳ 0.5, of the estimated Shannon limit for DNA character error rates up to 20%.
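The estimate of equations (7)–(9) is straightforward to evaluate numerically. The following sketch is our own illustration, not the authors' code: the function names are ours, and we additionally model the "assumed perfect" outer code as RS(255,223) with independent byte errors, to show why that idealization is a mild one.

```python
from math import log2, comb

def binary_entropy(p):
    """H(p) of equation (8), in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)

def c_shannon(p):
    """Estimated channel capacity of equation (7), in bits per DNA
    character, for total character error rate p split equally among
    substitutions, deletions, and insertions."""
    return 2.0 - (binary_entropy(p) + p * log2(3)
                  + (p / 3.0) * log2(4) + (p / 3.0) * log2(3))

def c_hedges(code_rate, p_equiv):
    """Capacity achieved by HEDGES plus a perfect outer code, equation (9)."""
    return 2.0 * code_rate * (1.0 - binary_entropy(p_equiv))

def rs_block_failure(p_byte, n=255, k=223):
    """Probability that an RS(n, k) block sees more than t = (n - k)/2
    byte errors and therefore fails to decode, assuming independent
    byte errors at rate p_byte."""
    t = (n - k) // 2  # RS(255,223) corrects up to 16 byte errors
    return sum(comb(n, j) * p_byte**j * (1.0 - p_byte)**(n - j)
               for j in range(t + 1, n + 1))
```

An error-free channel gives c_shannon(0.0) = 2 bits per character, and a half-rate HEDGES code with no residual errors gives c_hedges(0.5, 0.0) = 1. The block failure probability of rs_block_failure falls off steeply for small p_byte, which is why treating the outer code as perfect costs the estimate little.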
Previous work on DNA information storage, despite the increasing sophistication of methods, has largely ignored the possibility of directly correcting insertion and deletion errors by an appropriate error-correcting code. Instead, most previous work has relied on multiple sequence alignment after sequencing DNA messages to significant depths. In effect, though not always acknowledged, this method is an inefficient multiple-repetition code.

This paper developed a coding technique, termed HEDGES, for the direct correction of insertions and deletions, along with substitutions, workable with (combined) DNA character error rates up to 20%, and at a respectable fraction of the Shannon information limit. The code, HEDGES, was optimized for use as the inner code in an overall design with an outer concatenated code that will generally be interleaved across DNA strands. Used with HEDGES, the outer code need not be indel-aware and can be a conventional ECC like Reed-Solomon.
Acknowledgments
We have benefitted from communication with Dave Forney, Dan Costello, and Vince Poor, and from our continuing collaboration in related matters with Ilya Finkelstein and Stephen Jones.
References

[1] Shannon CE, Weaver W, The Mathematical Theory of Communication (University of Illinois Press, 1949)
[2] MacWilliams FJ, Sloane NJA, The Theory of Error-Correcting Codes (North Holland, 1983)
[3] Roth RM, Introduction to Coding Theory (Cambridge University Press, 2006)
[4] Moon TK, Error Correction Coding: Mathematical Methods and Algorithms (Wiley, 2005)
[5] Lin S, Costello DJ, Error Control Coding, Second Edition (Pearson, 2004)
[6] Mitzenmacher M, "A survey of results for deletion channels and related synchronization channels", Probability Surveys, vol. 6, pp. 1–33 (2009)
[7] Li R, New developments in coding against insertions and deletions, …
[8] …
[9] …, An Introduction to Reed-Solomon Codes (Wiley-IEEE, 1994)
[10] Church GM, Gao Y, Kosuri S, "Next-Generation Digital Information Storage in DNA", Science, vol. 337, p. 1628 (2012)
[11] Goldman N, Bertone P, Chen S, Dessimoz C, LeProust EM, Sipos B, Birney E, "Towards Practical, High-Capacity, Low-Maintenance Information Storage in Synthesized DNA", Nature, vol. 494, pp. 77–80 (2013)
[12] Grass RN, Heckel R, Pudda M, Paunescu D, Stark WJ, "Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes", Angew. Chem. Int. Ed., vol. 54, pp. 2552–2555 (2015)
[13] Bornholt J, Lopez R, Carmean DM, Ceze L, Seelig G, Strauss K, "A DNA-Based Archival Storage System", ACM SIGARCH Computer Architecture News, vol. 44, pp. 637–649 (2016)
[14] Erlich Y, Zielinski D, "DNA Fountain enables a robust and efficient storage architecture", Science, vol. 355, pp. 950–954 (2017)
[15] Yazdi SMHT, Gabrys R, Milenkovic O, "Portable and Error-Free DNA-Based Data Storage", Nature Scientific Reports, 7:5011 (2017)
[16] Organick L, Ang SD, Chen Y-J, et al., "Random access in large-scale DNA data storage", Nature Biotechnology, vol. 36, no. 3, pp. 242–248 (2018)
[17] Vigenère, Blaise de, Traicté des chiffres ou secrètes manières d'escrire, Abel l'Angelier, Paris, 1586, ff. 46r–49v. See also Wikipedia, "Autokey cipher".
[18] Press WH, Teukolsky SA, Vetterling WT, Flannery BP, Numerical Recipes: The Art of Scientific Computing, Third Edition (Cambridge University Press, 2007), p. 352
[19] Hart PE, Nilsson NJ, Raphael B, "A Formal Basis for the Heuristic Determination of Minimum Cost Paths", IEEE Transactions on Systems Science and Cybernetics, vol. SSC-4, no. 2, pp. 100–107 (1968)
[20] Partow A, Schifra Reed-Solomon Code Library, at …
[21] Shannon CE, "The Zero Error Capacity of a Noisy Channel", IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19 (1956)
[22] Rahmati M, Duman TM, "Upper Bounds on the Capacity of Deletion Channels Using Channel Fragmentation",