A Rewritable, Random-Access DNA-Based Storage System

Abstract

We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh density archival and rewritable storage applications.


S. M. Hossein Tabatabaei Yazdi†, Yongbo Yuan†, Jian Ma, Huimin Zhao, Olgica Milenkovic∗

Affiliations:
Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL 61801
Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, IL 61801
Department of Bioengineering, University of Illinois, Urbana, IL 61801
Institute for Genomic Biology, University of Illinois, Urbana, IL 61801

† These authors contributed equally to the work.
∗ To whom correspondence should be addressed: Olgica Milenkovic, e-mail: [email protected]


Addressing the emerging demands for massive data repositories, and building upon the rapid development of technologies for DNA synthesis and sequencing, a number of laboratories have recently outlined architectures for archival DNA-based storage [1, 2, 3, 4, 5]. The architecture in [3] achieved a storage density of TB/gram, while the system described in [4] raised the density to . PB/gram. The success of the latter method may be largely attributed to three classical coding schemes: Huffman coding, differential coding, and single parity-check coding [4]. Huffman coding was used for data compression, while differential coding was used for eliminating homopolymers (i.e., repeated consecutive bases) in the DNA strings. Parity-checks were used to add controlled redundancy, which in conjunction with four-fold coverage allows for mitigating assembly errors. (Another class of DNA error-correcting schemes based on Reed-Solomon (RS) codes was recently reported in [5].)

Due to dynamic changes in biotechnological systems, none of the three coding schemes represents a suitable solution from the perspective of current DNA sequencer designs: Huffman codes are fixed-to-variable length compressors that can lead to catastrophic error propagation in the presence of sequencing noise; the same is true of differential codes. Homopolymers do not represent a significant source of errors in Illumina sequencing platforms [6], while single parity redundancy or RS codes and differential encoding are inadequate for combating error-inducing sequence patterns such as long substrings with high GC content [6]. As a result, assembly errors are likely, and were observed during the readout process described in [4].

An even more important issue that prohibits the practical wide-spread use of the schemes described in [3, 4] is that accurate partial and random access to data is impossible, as one has to reconstruct the whole text in order to read or retrieve the information encoded even in a few bases. Furthermore, all current designs support read-only storage. The first limitation represents a significant drawback, as one usually needs to accommodate access to specific data sections; the second limitation prevents the use of current DNA storage methods in architectures that call for moderate data editing, for storing frequently updated information and memorizing the history of edits. Moving from a read-only to a rewritable DNA storage system requires a major implementation paradigm shift, as:

• Editing in the compressive domain may require rewriting almost the whole information content;
• Rewriting is complicated by the current data DNA storage format that involves reads of length bps shifted by bps so as to ensure four-fold coverage of the sequence (see Figure 1.1 (a) for an illustration and description of the data format used in [4]). In order to rewrite one base, one needs to selectively access and modify four “consecutive” reads;
• Addressing methods used in [3, 4] only allow for determining the position of a read in a file, but cannot ensure precise selection of reads of interest, as undesired cross-hybridization between the primers and parts of the information blocks may occur.

To overcome the aforementioned issues, we developed a new, random-access and rewritable DNA-based storage architecture based on DNA sequences endowed with specialized address strings that may be used for selective information access and encoding with inherent error-correction capabilities. The addresses are designed to be mutually uncorrelated and to satisfy the error-control running digital sum constraint [7, 8]. Given the address sequences, encoding is performed by stringing together properly terminated prefixes of the addresses as dictated by the information sequence. This encoding method represents a special form of prefix-synchronized coding [9]. Given that the addresses are chosen to be uncorrelated and at large Hamming distance from each other, it is highly unlikely for one address to be confused with another address or with another section of the encoded blocks. Furthermore, selection of the blocks to be rewritten is made possible by the prefix encoding format, while rewriting is performed via two DNA editing techniques, the gBlock and OE-PCR (overlap-extension polymerase chain reaction) methods [10, 11]. With the latter method, rewriting is done in several steps by using short and cheap primers. The first method is more efficient, but requires synthesizing longer and hence more expensive primers. Both methods were tested on DNA encoded Wikipedia entries of size 17 KB, corresponding to six universities, where information in one, two and three blocks was rewritten in the DNA encoded domain. The rewritten blocks were selected, amplified and Sanger sequenced [12] to verify that selection and rewriting are performed with accuracy.

The main feature of our storage architecture that enables highly sensitive random access and accurate rewriting is addressing. The rationale behind the proposed approach is that each block in a random access system must be equipped with an address that will allow for unique selection and amplification via DNA sequence primers.

Instead of storing blocks mimicking the structure and length of reads generated during high-throughput sequencing, we synthesized blocks of length bps tagged at both ends by specially designed address sequences. Adding addresses to short blocks of length bps would incur a large storage overhead, while synthesizing blocks longer than bps using current technologies is prohibitively costly.

More precisely, each data block of length bps was flanked at both ends by two unique, yet different, address blocks of length 20 bps. These addresses are used to provide specificity of access (see Figure 1.1 (b) and the Supplementary Information for details). The remaining bases in a block are divided into sub-blocks of length bps, with each block encoding six words of the text. The “word-encoding” process may be seen as a specialized compaction scheme suitable for rewriting, and it operates as follows. First, different words in the text are counted and tabulated in a dictionary. Each word in the dictionary is converted into a binary sequence of length sufficiently long to allow for encoding of the dictionary. For our current implementation and texts of choice, described in the Supplementary Information section, this length was set to . Encodings of six consecutive words are subsequently grouped into binary sequences of length . The two-bit is appended as a word marker to the left hand side of each binary sequence of length , resulting in sequences of length bits. The binary sequences are subsequently translated into DNA blocks of length bps using a new family of DNA prefix-synchronized codes described in the Methods section.
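The word-encoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary order (order of first appearance), the choice of the smallest bit width that fits the dictionary, and the marker value "11" are all assumptions made here for illustration, and the translation of the resulting bits into DNA is not shown.

```python
import math

def build_dictionary(text):
    """Assign each distinct word an index, in order of first appearance (assumed order)."""
    index = {}
    for word in text.split():
        index.setdefault(word, len(index))
    return index

def encode_words(text, group=6, marker="11"):
    """Group fixed-width word codes into marked binary blocks.

    `marker` is a hypothetical two-bit word marker; `group` is the number of
    jointly encoded words (six in the text). A final group with fewer than
    `group` words is left shorter and is not padded.
    """
    dictionary = build_dictionary(text)
    # Smallest width (in bits) that can index every dictionary entry.
    width = max(1, math.ceil(math.log2(len(dictionary))))
    words = text.split()
    blocks = []
    for start in range(0, len(words), group):
        chunk = words[start:start + group]
        bits = "".join(format(dictionary[w], f"0{width}b") for w in chunk)
        blocks.append(marker + bits)
    return dictionary, width, blocks

if __name__ == "__main__":
    dictionary, width, blocks = encode_words("a b c d e f a b c d e f")
    print(width, blocks)
```

Because every word occupies the same number of bits, replacing one word changes only one fixed-width field of one block, which is the property that makes rewrites local.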
Our choice for the number of jointly encoded words is governed by the goal to make rewrites as straightforward as possible and to avoid error propagation due to variable codelengths. Furthermore, as most rewrites include words, rather than individual symbols, the word encoding method represents an efficient means for content update. Details regarding the counting and grouping procedure may be found in the Supplementary Information.

For three selected access queries, the bps blocks containing the desired information were identified via primers corresponding to their unique addresses, PCR amplified, Sanger sequenced, and subsequently decoded.

Two methods were used for content rewriting. If the region to be rewritten had length exceeding several hundreds, new sequences with unique primers were synthesized as this solution represents a less costly alternative to rewriting. For the case that a relatively short substring of the encoded string had to be modified, the corresponding bps block hosting the string was identified and the changes were generated via DNA editing.

Figure 1.1. (a) The scheme of [4] uses a storage format consisting of DNA strings that cover the encoded compressed text in fragments of length of bps. The fragments overlap in bps, thereby providing 4-fold coverage for all except the flanking end bases. This particular fragmenting procedure prevents efficient file editing: If one were to rewrite the “shaded” block, all four fragments containing this block would need to be selected and rewritten at different positions to record the new “shaded” block. (b) The address sequence construction process using the notions of autocorrelation and cross-correlation of sequences [13]. A sequence is uncorrelated with itself if no proper prefix of the sequence is also a suffix of the same sequence. Alternatively, no shift of the sequence overlaps with the sequence itself. Similarly, two different sequences are uncorrelated if no prefix of one sequence matches a suffix of the other. Addresses are chosen to be mutually uncorrelated, and each bps block is flanked by an address of length 20 on the left and by another address of length 20 on the right (colored ends). (c) Content rewriting via DNA editing: the gBlock method [10] for short rewrites, and the cost efficient OE-PCR (Overlap Extension-PCR) method [11] for sequential rewriting of longer blocks.

Both the random access and rewriting protocols were tested experimentally on two jointly stored text files. One text file, of size KB, contained the history of University of Illinois, Urbana-Champaign (UIUC) based on its Wikipedia entry retrieved on / / . The other text file, of size KB, contained the introductory Wikipedia entries of Berkeley, Harvard, MIT, Princeton, and Stanford, retrieved on / / .

Encoded information was converted into DNA blocks of length bps synthesized by IDT (Integrated DNA Technologies), at a cost of $149 per PCR selection and amplification of one bps sequence and simultaneous selection and amplification of three bps sequences in the pool.

All linear bps fragments were mixed, and the mixture was used as a template for PCR amplification and selection. The results of amplification were verified by confirming sequence lengths of bps bands via gel electrophoresis (Figure 1.2 (a)) and by randomly sampling - sequences from the pools and Sanger sequencing them (Figure 1.2 (b)). Experimental content rewriting via synthesis of edits located at various positions in the bps blocks.

For simplicity of notation, we refer to the blocks in the pool on which we performed selection and editing as B1, B2, and B3. Two primers were synthesized for each rewrite in the blocks, for the forward and reverse direction. In addition, two different editing/mutation techniques were used, gBlock and Overlap-Extension (OE) PCR. gBlocks are double-stranded genomic fragments used as primers or for the purpose of genome editing, while OE-PCR is a variant of PCR used for specific DNA sequence editing via point editing/mutations or splicing. To demonstrate the plausibility of a cost efficient method for editing, OE-PCR was implemented with general primers (≤ bps) only. Note that for edits shorter than bps, the mutation sequences were designed as overhangs in primers. Then, the three PCR products were used as templates for the final PCR reaction involving the entire bps rewrite. Figure 1.1 (c) illustrates the described rewriting process. In addition, a summary of the experiments performed is provided in Table S3.

Sequence identifier - Editing method    # of sequence samples / Length of edits (bps) / Selection accuracy/error percentage
B1-M-gBlock                             / /
B1-M-PCR                                / /
B2-M-gBlock                             / /
B2-M-PCR                                / /
B3-M-gBlock                             / /
B3-M-PCR                                / /

Table 1. Selection, rewriting and sequencing results. Each rewritten bps sequence was ligated to a linearized pCR™-Blunt vector using the Zero Blunt PCR Cloning Kit and was transformed into E. coli. The E. coli strains with correct plasmids were sequenced at ACGT, Inc. Sequencing was performed using two universal primers: M13F_20 (in the reverse direction) and M13R (in the forward direction) to ensure that the entire block of bps is covered.

             Church et al. [3]             Goldman et al. [4]            Our scheme
Density      . × B/g                       . × B/g                       . × B/g
File size    . MB                          KB                            17 KB
Cost         Not available                 $12,600                       $4,
Features     Archival, no random-access    Archival, no random-access    Rewritable, random-access

Table 2. Comparison of storage densities for the DNA encoded information expressed in B/g (bytes per gram), file size, synthesis cost, and random access features of three known DNA storage technologies. Note that the density does not reflect the entropy of the information source, as the text files are encoded in ASCII format, which is a redundant representation system.

Given that each nucleotide has weight roughly equal to daltons ( × . × − grams), and given that 27,000 + 5,000 = 32,000 bps were needed to encode a file of size 13 + 4 = 17 KB in ASCII format, we estimate a potential storage density of . × B/g. This density significantly surpasses the current state-of-the-art storage density of . × bytes/g, as we avoid costly multiple coverage, use larger blocklengths and specialized word encoding schemes.

A performance comparison of the three currently known DNA-based storage media is given in Table S2. We observe that the cost of sequence synthesis in our storage model is significantly higher than the corresponding cost of the prototype in [4], as blocks of length bps are still difficult to synthesize. This trend is likely to change dramatically in the near future, as within the last seven months, the cost of synthesizing bps blocks reduced almost -fold. Despite its high cost, our system offers exceptionally large storage density, and for the first time, enables random access and content rewriting features. Furthermore, although we used Sanger sequencing methods for our small scale experiment, for large scale storage projects Next Generation Sequencing (NGS) technologies will enable significant reductions in readout costs.

To encode information on DNA media, we employed a two-step procedure. First, we designed address sequences of short length which satisfy a number of constraints that make them suitable for highly selective random access [13].

Constrained coding ensures that DNA patterns prone to sequencing errors are avoided and that DNA blocks are accurately accessed, amplified and selected without perturbing or accidentally selecting other blocks in the DNA pool. The coding constraints apply to address primer design, but also indirectly govern the properties of the fully encoded DNA information blocks. The design procedure used is semi-analytical, in so far that it combines combinatorial methods with computer search techniques.

We required the address sequences to satisfy the following constraints:

• (C1) Constant GC content (close to 50%) of all their prefixes of sufficiently long length. DNA strands with 50% GC content are more stable than DNA strands with lower or higher GC content and have better coverage during sequencing. Since encoding user information is accomplished via prefix-synchronization, it is important to impose the GC content constraint on the addresses as well as their prefixes, as the latter requirement also ensures that all fragments of encoded data blocks have balanced GC content.
• (C2) Large mutual Hamming distance, as it reduces the probability of erroneous address selection. Recall that the Hamming distance between two strings of equal length equals the number of positions at which the corresponding symbols disagree. An appropriate choice for the minimum Hamming distance is equal to half of the address sequence length (10 bps in our current implementation which uses length 20 address primers).
• (C3) Uncorrelatedness of the addresses, which imposes the restriction that prefixes of one address do not appear as suffixes of the same or another address and vice versa. The motivation for this new constraint comes from the fact that addresses are used to provide unique identities for the blocks, and that their substrings should therefore not appear in “similar form” within other addresses. Here, “similarity” is assessed in terms of hybridization affinity. Furthermore, long undesired prefix-suffix matches may lead to read assembly errors in blocks during joint informational retrieval and sequencing.
• (C4) Absence of secondary (folding) structures, as such structures may cause errors in the process of PCR amplification and fragment rewriting.

Figure 1.2. (a) Gel electrophoresis results for three blocks, indicating that the length of the three selected and amplified sequences is tightly concentrated around bps. (b) Output of the Sanger sequencer, where all bases shaded in yellow correspond to correct readouts. The sequencing results confirmed that the desired sequences were selected, amplified, and rewritten with 100% accuracy.

Addresses satisfying constraints C1-C2 may be constructed via error-correcting codes with small running digital sum [7] adapted for the new storage system. Properties of these codes are discussed in Section 2.2. The novel notion of mutually uncorrelated sequences is introduced in 2.3. Constructing addresses that simultaneously satisfy the constraints C1-C4 and determining bounds on the largest number of such sequences is prohibitively complex [14, 15]. To mitigate this problem, we resort to a semi-constructive address design approach, in which balanced error-correcting codes are designed independently, and subsequently expurgated so as to identify a large set of mutually uncorrelated sequences. The resulting sequences are subsequently tested for secondary structure using mfold and Vienna [16]. We conjecture that the number of sequences satisfying C1-C4 grows exponentially with their length: proofs towards establishing this claim include results on the exponential size of codes under each constraint individually.

Given two uncorrelated sequences as flanking addresses of one block, one of the sequences is selected to encode user information via a new implementation of prefix-synchronized encoding [17, 16], described in 2.4. The asymptotic rate of an optimal single-sequence prefix-free code is one. Hence, there is no asymptotic coding loss for avoiding prefixes of one sequence; we only observe a minor coding loss for each finite-length block. For multiple sequences of arbitrary structure, the problem of determining the optimal code rate is significantly more complicated and the rates have to be evaluated numerically, by solving systems of linear equations [17] as described in 2.4 and the Supplementary Information. This system of equations leads to a particularly simple form for the generating function of mutually uncorrelated sequences, as explained in the Supplementary Information.

2.2 Balanced Codes and Running Digital Sums

An important criterion for selecting block addresses is to ensure that the corresponding DNA primer sequences have prefixes with a GC content approximately equal to 50%, and that the sequences are at large pairwise Hamming distance. Due to their applications in optical storage, codes that address related issues have been studied in a different form under the name of bounded running digital sum (BRDS) codes [7, 8]. A detailed overview of this coding technique may be found in [7].

Consider a sequence $a = a_0, a_1, a_2, \ldots, a_l, \ldots, a_n$ over the alphabet $\{-1, 1\}$. We refer to $S_l(a) = \sum_{i=0}^{l-1} a_i$ as the running digital sum (RDS) of the sequence $a$ up to length $l$, $l \geq 0$. Let $D_a = \max\{|S_l(a)| : l \geq 0\}$ denote the largest value of the running digital sum of the sequence $a$. For some predetermined value $D > 0$, a set of sequences $\{a(i)\}_{i=1}^{M}$ is termed a BRDS code with parameter $D$ if $D_{a(i)} \leq D$ for all $i = 1, \ldots, M$. Note that one can define non-binary BRDS codes in an equivalent manner, with the alphabet usually assumed to be symmetric, $\{-q, -q+1, \ldots, -1, 1, \ldots, q-1, q\}$, and where $q \geq 1$. A set of DNA sequences over {A, T, G, C} may be constructed in a straightforward manner by mapping each $+1$ symbol into one of the bases {A, T}, and $-1$ into one of the bases {G, C}, or vice versa. Alternatively, one can use BRDS over an alphabet of size four directly.

To address the constraints C1-C2, one needs to construct a large set of BRDS codewords at sufficiently large Hamming distance from each other. Via the mapping described above, these codewords may be subsequently translated to DNA sequences with a GC content approximately equal to 50% for all sequence prefixes, and at the same Hamming distance as the original sequences.

Let $(n, C, d; D)$ be the parameters of a BRDS error-correcting code, where $C$ denotes the number of codewords of length $n$, $d$ denotes the minimum distance of the code, while $\frac{\log C}{n}$ equals the code rate.
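The definitions above translate directly into a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the fixed mapping of +1 to A and −1 to G is just one of the admissible choices (the text allows either base of {A, T} and of {G, C}).

```python
def running_digital_sums(a):
    """Return [S_1(a), ..., S_n(a)] for a sequence a over {-1, +1}."""
    sums, total = [], 0
    for symbol in a:
        total += symbol
        sums.append(total)
    return sums

def max_rds(a):
    """D_a: the largest absolute running digital sum over all prefixes."""
    return max(abs(s) for s in running_digital_sums(a))

def is_brds_code(codewords, D):
    """True if every codeword has running digital sum bounded by D."""
    return all(max_rds(a) <= D for a in codewords)

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(1 for u, v in zip(x, y) if u != v)

def to_dna(a, plus="A", minus="G"):
    """Map +1 to a base from {A, T} and -1 to a base from {G, C} (assumed choice)."""
    return "".join(plus if symbol == 1 else minus for symbol in a)

def gc_imbalance(dna):
    """Largest |#AT - #GC| over all prefixes of a DNA string."""
    return max_rds([1 if base in "AT" else -1 for base in dna])

if __name__ == "__main__":
    a = [1, -1, -1, 1, 1, -1]
    print(running_digital_sums(a), to_dna(a), gc_imbalance(to_dna(a)))
```

A bound D on the running digital sum means that in every prefix the counts of {A, T} and {G, C} differ by at most D, which is how the BRDS constraint enforces balanced GC content on prefixes.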
For D = 1 and d = 2 , the bestknown BRDS-code has parameters (cid:0) n, n ,

2; 1 (cid:1) , while for D = 2 and d = 1 , codes with parameters (cid:0) n, n ,

1; 2 (cid:1) exist. For D = 2 and d = 2 , the best known BRDS code has parameters (cid:16) n, · n ) − ,

2; 2 (cid:17) [8]. Note that each of these codes has an exponentially large number of codewords, among which a (sufficiently) large number of sequences satisfy the required correlation property C3, discussed next, and the folding property C4. Codewords satisfying constraints C3-C4 were found by expurgating the BRDS codes via computer search.

We describe next the notion of autocorrelation of a sequence and introduce the related notion of mutual correlation of sequences.

It was shown in [17] that the autocorrelation function is the crucial mathematical concept for studying sequences avoiding forbidden strings and substrings. In the storage context, forbidden strings correspond to the addresses of the blocks in the pool. In order to accommodate the need for selective retrieval of a DNA block without accidentally selecting any undesirable blocks, we find it necessary to also introduce the notion of mutually uncorrelated sequences.

Let $X$ and $Y$ be two words, possibly of different lengths, over some alphabet of size $q > 1$. The correlation of $X$ and $Y$, denoted by $X \circ Y$, is a binary string of the same length as $X$. The $i$-th bit (from the left) of $X \circ Y$ is determined by placing $Y$ under $X$ so that the leftmost character of $Y$ is under the $i$-th character (from the left) of $X$, and checking whether the characters in the overlapping segments of $X$ and $Y$ are identical. If they are identical, the $i$-th bit of $X \circ Y$ is set to 1, otherwise, it is set to 0. For example, for X = CATCATC and Y = ATCATCGG, $X \circ Y = 0100100$, as depicted below. Note that in general, $X \circ Y \neq Y \circ X$, and that the two correlation vectors may be of different lengths. In the example above, we have $Y \circ X = 00000000$. The autocorrelation of a word $X$ equals $X \circ X$. In the example below, $X \circ X = 1001001$.

    X = C A T C A T C                    X ∘ Y
    Y = A T C A T C G G                    0
          A T C A T C G G                  1
            A T C A T C G G                0
              A T C A T C G G              0
                A T C A T C G G            1
                  A T C A T C G G          0
                    A T C A T C G G        0

Definition 1. A sequence $X$ is self-uncorrelated if $X \circ X = 10\ldots0$. A set of sequences $\{X_1, X_2, \ldots, X_m\}$ is termed mutually uncorrelated if each sequence is self-uncorrelated and if all pairs of distinct sequences satisfy $X_i \circ X_j = 0\ldots0$ and $X_j \circ X_i = 0\ldots0$.

Intuitively, correlation captures the extent to which prefixes of sequences overlap with suffixes of the same or other sequences.
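The correlation operation and Definition 1 can be checked mechanically. The following sketch is a direct transcription of the definitions; the function names are hypothetical, and the pair AGCTG, ACCTG used at the end is chosen only to illustrate the check and is not taken from the experiments.

```python
def correlation(X, Y):
    """Return X ∘ Y as a bit string of length len(X).

    Bit i (0-based) is 1 when Y, placed so that its first character sits under
    X[i], agrees with X on the whole overlapping segment.
    """
    bits = []
    for i in range(len(X)):
        overlap = min(len(X) - i, len(Y))
        bits.append("1" if X[i:i + overlap] == Y[:overlap] else "0")
    return "".join(bits)

def is_self_uncorrelated(X):
    """True if X ∘ X = 10...0, i.e. no proper prefix of X is also a suffix of X."""
    return correlation(X, X) == "1" + "0" * (len(X) - 1)

def are_mutually_uncorrelated(sequences):
    """True if every sequence is self-uncorrelated and all cross-correlations are zero."""
    if not all(is_self_uncorrelated(X) for X in sequences):
        return False
    for i, X in enumerate(sequences):
        for j, Y in enumerate(sequences):
            if i != j and "1" in correlation(X, Y):
                return False
    return True

if __name__ == "__main__":
    X, Y = "CATCATC", "ATCATCGG"
    print(correlation(X, Y))   # 0100100
    print(correlation(Y, X))   # 00000000
    print(correlation(X, X))   # 1001001
    print(are_mutually_uncorrelated(["AGCTG", "ACCTG"]))
```

The three printed correlation vectors reproduce the worked example in the text, so the helper can be used to screen candidate addresses for constraint C3.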
Furthermore, the notion of mutual uncorrelatedness may be relaxed by requiring that only sufficiently long prefixes do not match sufficiently long suffixes of other sequences. Sequences with this property, and at sufficiently large Hamming distance, eliminate undesired address cross-hybridization during selection and cross-sequence assembly errors.

We proved the following bound on the size of the largest mutually uncorrelated set of sequences of length n over an alphabet of size q = 4. The bounds show that there exist exponentially many mutually uncorrelated sequences for any choice of n, and the lower bound is constructive. Furthermore, the construction used in the bound “preserves” the Hamming distance (see the Supplementary Information).

Theorem 2. Suppose that $\{X_1, \ldots, X_m\}$ is a set of m pairwise mutually uncorrelated sequences of length n. Let u(n) denote the largest possible value of m for a given n. Then · n ≤ u(n) ≤ · n − .

As an illustration, for n = 20, the lower bound equals . The proof of the theorem is given in the Supplementary Information.

It remains an open problem to determine the largest number of address sequences that jointly satisfy the constraints C1-C4. We conjecture that the number of such sequences is exponential in n, as the numbers of words that satisfy C1-C2, C3 and C4 [15] are exponential. Exponentially large families of address sequences are important indicators of the scalability of the system and they also influence the rate of information encoding in DNA.

Using a casting of the address sequence design problem in terms of a simple and efficient greedy search procedure, we were able to identify sequences for length n = 20 that satisfy constraints C1-C4, out of which pairs were used for block addressing. Another means to generate large sets of sequences satisfying the constraints is via approximate solvers for the largest independent set problem [18]. Examples of sequences constructed in the aforementioned manner and used in our experiments are listed in the Supplementary Information.

In the previous sections, we described how to construct address sequences that can serve as unique identifiers of the blocks they are associated with. We also pointed out that once such address sequences are identified, user information has to be encoded in order to avoid the appearance of any of the addresses, sufficiently long substrings of the addresses, or substrings similar to the addresses in the resulting DNA codeword blocks. For this purpose, we developed new prefix-synchronized encoding schemes based on [14].

To address the problem at hand, we start by introducing comma free and prefix-synchronized codes which allow for constructing codewords that avoid address patterns.
A block code C comprising a set of codewords of length N over analphabet of size q is called comma free if and only if for any pair of not necessarily distinct codewords a a . . . a N and b b . . . b N in C , the N concatenations a a . . . a N b , a a . . . b b , . . . , a N a . . . b N − b N − are not in C [17]. Comma freecodes enable eﬃcient synchronization protocols, as one is able to determine the starting positions of codewords withoutambiguity. A major drawback of comma free codes is the need to implement an exhaustive search procedure over sequencesets to decide whether or not a given string of length n should be used as a codeword or not. This diﬃculty can beovercome by using a special family of comma free codes, introduced by Gilbert [9] under the name preﬁx-synchronizedcodes . Preﬁx-synchronized codes have the property that every codeword starts with a preﬁx P = p p . . . p n , whichis followed by a constrained sequence c c . . . c s . Moreover, for any codeword p p . . . p n c c . . . c s of length n + s , thepreﬁx P does not appear as a substring of p . . . p n c c . . . c s p p . . . p n − . More precisely, the constrained sequences ofpreﬁx-synchronized codes avoid the pattern P which is used as the address.Due to the choice of mutually uncorrelated addresses at large Hamming distance , we encode each information blockby avoiding only one of the address sequences , used for that particular block.To explain how to perform encoding, assume that P = p p . . . p n ∈ { A , T , G , C } n is a self-uncorrelated sequence. Thisguarantees that p (cid:54) = p n . Without loss of generality, let p = A and p n = G , and deﬁne ¯ P i = { A , C , T } \ { p i } P i = p . . . p i , ≤ i ≤ n . In addition, assume that the elements of ¯ P i are arranged in increasing order, say using the lexicographicalordering A ≺ C ≺ T . We subsequently use ¯ p i,j to denote the j -th smallest element in ¯ P i , for ≤ j ≤ (cid:12)(cid:12) ¯ P i (cid:12)(cid:12) . 
For example, if $\bar{P}_i = \{C, T\}$, then $\bar{p}_{i,1} = C$ and $\bar{p}_{i,2} = T$. Next, we define a sequence of integers $G_{n,1}, G_{n,2}, \ldots$ that satisfies the following recursive formula:

$$G_{n,\ell} = \begin{cases} 3^{\ell}, & 1 \le \ell < n, \\ \sum_{i=1}^{n-1} |\bar{P}_i| \, G_{n,\ell-i}, & \ell \ge n. \end{cases}$$

For an integer $\ell \ge 0$ and $y < 3^{\ell}$, let $\theta_\ell(y) \in \{A, T, C\}^{\ell}$ be a length-$\ell$ ternary representation of $y$. Conversely, for each $W \in \{A, T, C\}^{\ell}$, let $\theta^{-1}(W)$ be the integer $y$ such that $\theta_\ell(y) = W$.

Every integer $0 \le x < G_{n,\ell}$ can be mapped into a sequence of $n + \ell$ symbols from $\{A, T, C, G\}$ via an encoding algorithm that consists of two parts: EncodePSC(P, ℓ, x) and CodePSC(P, ℓ, x). Algorithm EncodePSC(P, ℓ, x) calls CodePSC(P, ℓ, x) and returns the concatenation of P and CodePSC(P, ℓ, x). The steps of the encoding procedure are listed in Algorithm 1, where $C^P_\ell = \{ \text{EncodePSC}(P, \ell, x) \mid 0 \le x < G_{n,\ell} \}$, and where $n$ denotes the length of the sequence P. The decoding steps are described in the same chart.

Algorithm 1. Prefix-synchronized encoding and decoding

X = EncodePSC(P, ℓ, x)
    return P CodePSC(P, ℓ, x);

X = CodePSC(P, ℓ, x)
begin
    n = length(P);
    if (ℓ ≥ n)
        t := 1;
        y := x;
        while (y ≥ |P̄_t| G_{n,ℓ−t})
            y := y − |P̄_t| G_{n,ℓ−t};
            t++;
        end;
        a := ⌊y / G_{n,ℓ−t}⌋;
        b := mod(y, G_{n,ℓ−t});
        return P^{t−1} p̄_{t,a+1} CodePSC(P, ℓ − t, b);
    else
        return θ_ℓ(x);
    end;
end;

x = DecodePSC(P, X)
begin
    n = length(P);
    ℓ = length(X);
    X = X_1 X_2 … X_ℓ;
    if (ℓ < n)
        return θ^{−1}(X);
    else
        find (s, t such that P^{t−1} p̄_{t,s} = X_1 … X_t);
        return (Σ_{i=1}^{t−1} |P̄_i| G_{n,ℓ−i}) + (s − 1) G_{n,ℓ−t} + DecodePSC(P, X_{t+1} … X_ℓ);
    end;
end;
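The recursion for $G_{n,\ell}$ and the three procedures of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that $\bar{P}_i$ is the set $\{A, C, T\}$ with $P_i$ removed and listed alphabetically, and that the ternary digits are A = 0, T = 1, C = 2; both conventions are inferred from the examples in the text. The function names are hypothetical.

```python
from functools import lru_cache

TERNARY = "ATC"  # assumed digit order: A=0, T=1, C=2
ORDER = "ACT"    # assumed alphabetical ordering inside each complement set


def complement_sets(P):
    # Assumption: bar{P}_i = {A, C, T} minus {P_i}; G is never emitted here.
    return [[s for s in ORDER if s != p] for p in P]


def make_G(P):
    """Return a function G(l) computing G_{n,l} for n = len(P)."""
    n, bar = len(P), complement_sets(P)

    @lru_cache(maxsize=None)
    def G(l):
        if l < n:
            return 3 ** l
        return sum(len(bar[i - 1]) * G(l - i) for i in range(1, n))

    return G


def theta(l, y):
    """Length-l ternary representation of y, most significant digit first."""
    digits = []
    for _ in range(l):
        y, r = divmod(y, 3)
        digits.append(TERNARY[r])
    return "".join(reversed(digits))


def theta_inv(W):
    y = 0
    for ch in W:
        y = 3 * y + TERNARY.index(ch)
    return y


def code_psc(P, l, x):
    n, bar, G = len(P), complement_sets(P), make_G(P)
    if not 0 <= x < G(l):
        raise ValueError("x must satisfy 0 <= x < G_{n,l}")
    if l < n:
        return theta(l, x)
    t, y = 1, x
    while y >= len(bar[t - 1]) * G(l - t):
        y -= len(bar[t - 1]) * G(l - t)
        t += 1
    a, b = divmod(y, G(l - t))
    # P[:t-1] is P^{t-1}; bar[t-1][a] is \bar{p}_{t,a+1}.
    return P[: t - 1] + bar[t - 1][a] + code_psc(P, l - t, b)


def encode_psc(P, l, x):
    return P + code_psc(P, l, x)


def decode_psc(P, X):
    n, bar, G = len(P), complement_sets(P), make_G(P)
    l = len(X)
    if l < n:
        return theta_inv(X)
    for t in range(1, n):
        if X[: t - 1] == P[: t - 1] and X[t - 1] in bar[t - 1]:
            s = bar[t - 1].index(X[t - 1]) + 1
            head = sum(len(bar[i - 1]) * G(l - i) for i in range(1, t))
            return head + (s - 1) * G(l - t) + decode_psc(P, X[t:])
    raise ValueError("not a valid CodePSC output for this address")


P = "AGCTG"
print(code_psc(P, 8, 550))        # CCAAATCT
print(decode_psc(P, "CCAAATCT"))  # 550
```

With the address P = AGCTG, the sketch reproduces the codeword CCAAATCT for x = 550 that is worked out by hand in the Supplementary Information.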

The following theorems are proved in the Supplementary Information.

Theorem 3. $C^P_\ell$ is a prefix-synchronized code.

Theorem 4. The algorithm EncodePSC(P, ℓ, x) outputs a uniquely decodable string, for any $0 \le x < G_{n,\ell}$.

A simple example describing the encoding and decoding procedure for the short address string P = AGCTG, which can easily be verified to be self-uncorrelated, is provided in the Supplementary Information.

The previously described EncodePSC(P, ℓ, x) algorithm imposes no limitations on the length of a prefix used for encoding. This feature may lead to unwanted cross-hybridization between address primers used for selection and the prefixes of addresses encoding the information. One approach to mitigate this problem is to “perturb” long prefixes in the encoded information in a controlled manner. For small-scale random access/rewriting experiments, the recommended approach is to first select all prefixes of length greater than some predefined threshold. Afterwards, the first and last quarter of the bases of these long prefixes are used unchanged, while the central portion of the prefix string is cyclically shifted by half of its length. For example, for the address primer ACTAACTGTGCGACTGATGC, the prefix ACTAACTGTGCGACTG produced by EncodePSC(P, ℓ, x) maps to ACTAATGCCTGGACTG. The process of shifting applied to this string is illustrated below:

ACTAA | CTGTGC | GACTG
      ⇓ cyclically shift the central block by 3
ACTAA | TGCCTG | GACTG
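The perturbation illustrated above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `perturb_prefix` and `restore_prefix` are hypothetical, and the number of unchanged flanking bases is passed explicitly because the illustration keeps five bases on each side of a 16-base prefix rather than exactly one quarter.

```python
def _split(prefix, flank):
    """Split a prefix into (head, centre, tail) with `flank` bases per side."""
    if flank < 0 or 2 * flank > len(prefix):
        raise ValueError("flank does not fit this prefix")
    cut = len(prefix) - flank
    return prefix[:flank], prefix[flank:cut], prefix[cut:]


def perturb_prefix(prefix, flank):
    """Rotate the central block left by half of its length."""
    head, centre, tail = _split(prefix, flank)
    k = len(centre) // 2
    return head + centre[k:] + centre[:k] + tail


def restore_prefix(shifted, flank):
    """Undo perturb_prefix by rotating the central block back."""
    head, centre, tail = _split(shifted, flank)
    k = len(centre) // 2
    cut = len(centre) - k
    return head + centre[cut:] + centre[:cut] + tail


original = "ACTAACTGTGCGACTG"
shifted = perturb_prefix(original, flank=5)
print(shifted)                           # ACTAATGCCTGGACTG
print(restore_prefix(shifted, flank=5))  # ACTAACTGTGCGACTG
```

Undoing the rotation is straightforward once the prefix boundaries are known; the harder question, addressed in the text, is whether a shifted prefix can be confused with some other valid substring of the encoded data.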

For an arbitrary choice of addresses, this transform need not be uniquely invertible on the outputs of EncodePSC(P, ℓ, x). However, there exist simple conditions that can be checked to eliminate primers that do not allow this transform to be “unique”. Given the address primers created for our random access/rewriting experiments, we were able to uniquely map each modified prefix to its original prefix and therefore uniquely decode the readouts.

As a final remark, we would like to point out that prefix-synchronized coding also supports error-detection and limited error-correction. Error-correction is achieved by checking if each substring of the sequence represents a prefix or “shifted” prefix of the given address sequence and making proper changes when needed.

We described a new DNA-based storage architecture that enables accurate random access and cost-efficient rewriting. The key component of our implementation is a new collection of coding schemes and the adaptation of random-access enabling codes from classical storage systems. In particular, we encoded information within blocks with unique addresses that are prohibited from appearing anywhere else in the encoded information, thereby removing any undesirable cross-hybridization problems during the process of selection and amplification. We also performed four access and rewriting experiments without readout errors, as confirmed by post-selection and rewriting Sanger sequencing. The current drawback of our scheme is high cost, as synthesizing long DNA blocks is expensive. Cost considerations also limited the scope of our experiments and the size of the prototype, as we aimed to stay within a budget comparable to that used for other existing architectures. Nevertheless, the benefits of random access and other unique features of the proposed system compensate for this high cost, which we predict will decrease rapidly in the very near future.

This work was partially supported by the Strategic Research Initiative of University of Illinois, Urbana-Champaign, and the NSF STC on Science of Information, Purdue University. A provisional patent for rewritable, random-access DNA-based storage was filed with the University of Illinois in November 2014.

References

[1] C. Bancroft, T. Bowler, B. Bloom, and C. T. Clelland, “Long-term storage of information in DNA,” Science (New York, NY), vol. 293, no. 5536, pp. 1763–1765, 2001.
[2] J. Davis, “Microvenus,” Art Journal, vol. 55, no. 1, pp. 70–74, 1996.
[3] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, 2012.
[4] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, 2013.
[5] R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. J. Stark, “Robust chemical preservation of digital information on DNA in silica with error-correcting codes,” Angewandte Chemie International Edition, vol. 54, no. 8, pp. 2552–2555, 2015.
[6] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe, “Characterizing and measuring bias in sequence data,” Genome Biol, vol. 14, no. 5, p. R51, 2013.
[7] G. D. Cohen and S. Litsyn, “DC-constrained error-correcting codes with small running digital sum,” IEEE Transactions on Information Theory, vol. 37, no. 3, pp. 949–955, 1991.
[8] M. Blaum, S. Litsyn, V. Buskens, and H. C. van Tilborg, “Error-correcting codes with bounded running digital sum,” IEEE Transactions on Information Theory, vol. 39, no. 1, pp. 216–227, 1993.
[9] E. Gilbert, “Synchronization of binary messages,” IRE Transactions on Information Theory, vol. 6, no. 4, pp. 470–477, 1960.
[10] H. Packer, “gBlocks® gene fragments, related decoded articles,” 2014.
[11] A. V. Bryksin and I. Matsumura, “Overlap extension PCR cloning: a simple and reliable way to create recombinant plasmids,” Biotechniques, vol. 48, no. 6, p. 463, 2010.
[12] S. C. Schuster, “Next-generation sequencing transforms today’s biology,” Nature Methods, vol. 5, no. 1, pp. 16–18, 2008.
[13] K. A. S. Immink, Codes for mass data storage systems. Shannon Foundation Publisher, 2004.
[14] H. Morita, A. J. van Wijngaarden, and A. Han Vinck, “On the construction of maximal prefix-synchronized codes,” IEEE Transactions on Information Theory, vol. 42, no. 6, pp. 2158–2166, 1996.
[15] O. Milenkovic and N. Kashyap, “On the design of codes for DNA computing,” in Coding and Cryptography. Springer, 2006, pp. 100–119.
[16] J.-M. Rouillard, M. Zuker, and E. Gulari, “OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach,” Nucleic Acids Research, vol. 31, no. 12, pp. 3057–3062, 2003.
[17] L. J. Guibas and A. M. Odlyzko, “Maximal prefix-synchronized codes,” SIAM Journal on Applied Mathematics, vol. 35, no. 2, pp. 401–418, 1978.
[18] P. Berman and M. Fürer, “Approximating maximum independent set in bounded degree graphs,” in SODA, vol. 94, 1994, pp. 365–371.
[19] R. G. Gallager, “Low-density parity-check codes,” IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, 1962.

Supplementary Information

List of sections

1. Encoding Wikipedia Entries – A Working Example (Section 1).
2. Proofs of Theorems (Section 2).
3. Address Sequences (Section 3).
4. Example of Encoding and Decoding Procedure (Section 4).
5. Experimental Synthesis, Access and Rewrite of DNA Storage Sequences (Section 5).
6. Hybrid DNA-Based and Classical Storage (Section 6).

Encoding Wikipedia Entries – A Working Example

In this section we describe the data format used for encoding two files of size KB containing the introductory sections of Wikipedia pages of six universities: Berkeley, Harvard, MIT, Princeton, Stanford, and University of Illinois Urbana-Champaign. There were , words in the text, out of which were distinct. Note that in our context, words are elements of the text separated by a space. For example, “university” and “university.” are counted as two different words, while “Urbana-Champaign” is counted as a single word. These , words were mapped to ⌈ ⌉ = 27 DNA blocks of length 1000 bps, as we grouped six words into fragments, and combined fragments for prefix-synchronized encoding. Table S1 provides the word counts in the files and the encoding lengths (in bits) of the outlined procedure.

Assume that instead of using a prefix-synchronized code, we used classical ASCII encoding without compression to encode the same Wikipedia pages. The total number of characters in the text equals , , and each character is mapped to a binary string of length . Hence, one would need × bits to represent the data, which is equivalent to ⌈ × ⌉ = 47 DNA blocks of length 1000 bps if we set aside two unique address flags for the blocks. As one can see, prefix-synchronized codes offer an almost . -fold improvement in description length compared to ASCII encoding. This comes at the cost of storing a larger dictionary, as one encodes words rather than symbols of the alphabet. For the working example, one would require roughly -times larger dictionaries, as there are words with an average of . symbols per word. This increase in dictionary size is not a significant problem, as only one copy of the dictionary is ever needed.

Table S1. Comparison between character and word based encoding. Note that the number of bits per distinct symbol for the word encoding case is computed as the ceiling of the logarithm of the number of distinct symbols plus one, where the extra bit is used to prevent very small integers from being used in prefix-synchronized coding. Such integers may produce long runs of the first symbol in the address, which should be avoided. Furthermore, to ensure fixed length encoding, and hence avoid catastrophic error propagation, we doubled the number of bits used for encoding to .

Proofs of Theorems

Proof of Theorem 2. The proof consists of two parts. First, we prove the upper bound on $u(n)$ in Lemma 1, and then proceed to prove a lower bound in Lemma 2. Recall that $u(n)$ denotes the largest possible size of a set of mutually uncorrelated words of length $n$.

Lemma 1.

Let $u(n)$ denote the size of the largest set of distinct mutually uncorrelated sequences of length $n$. Then $u(n) \le 9 \cdot 4^{n-2}$.

Proof: To prove the lemma, let us introduce some terminology. Let $d_H(\cdot, \cdot)$ stand for the Hamming distance between two words, and define the Hamming ball of radius $d$ around a point $W$ in $\{A, T, G, C\}^n$ as

$$B(W, d) = \{ W' \in \{A, T, G, C\}^n : d_H(W, W') \le d \}.$$

Furthermore, let

$$C(W, d) = \{ W' \in \{A, T, G, C\}^n : W' \in B(W, d),\ W' \text{ and } W \text{ are correlated} \}$$

denote the set of sequences correlated with $W$ that are also at Hamming distance at most $d$ from $W$. We claim that for $n \ge d + 2 \ge$ […], one has

$$|C(W, d)| \ge 2 \sum_{i=0}^{d-1} \binom{n-1}{i} 3^i - \sum_{i=0}^{d-2} \binom{n-2}{i} 3^i. \qquad (2.1)$$

To prove the result, assume without loss of generality that $W$ starts with the symbol A, i.e., $W = A W_2 \ldots W_n$. Next, consider two scenarios regarding the structure of $W = A W_2 \ldots W_n$:

• $W_n \ne A$: In this case, any word $W'$ in $B(W, d)$ that starts with $W_n$ or ends with A is an element of $C(W, d)$. Let $S = \{ W' : W' \in B(W, d),\ W' \text{ starts with } W_n \}$ and $E = \{ W' : W' \in B(W, d),\ W' \text{ ends with } A \}$. Clearly, $|S| = |E| = \sum_{i=0}^{d-1} \binom{n-1}{i} 3^i$ and $|S \cap E| = \sum_{i=0}^{d-2} \binom{n-2}{i} 3^i$. Therefore, $|C(W, d)| \ge |S \cup E| = 2 \sum_{i=0}^{d-1} \binom{n-1}{i} 3^i - \sum_{i=0}^{d-2} \binom{n-2}{i} 3^i$.

• $W_n = A$: In this case, any word $W'$ in $B(W, d)$ which starts or ends with A is also an element of $C(W, d)$. Using an argument similar to the one described for the previous scenario, one can show that $|C(W, d)| \ge 2 \sum_{i=0}^{d} \binom{n-1}{i} 3^i - \sum_{i=0}^{d} \binom{n-2}{i} 3^i$. Moreover, it is straightforward to see that

$$2 \sum_{i=0}^{d} \binom{n-1}{i} 3^i - \sum_{i=0}^{d} \binom{n-2}{i} 3^i > 2 \sum_{i=0}^{d-1} \binom{n-1}{i} 3^i - \sum_{i=0}^{d-2} \binom{n-2}{i} 3^i.$$

This establishes (2.1). Now, for a set of mutually uncorrelated sequences $\{X_1, \ldots, X_m\}$ of size $m$, we have $X_i \notin C(X_1, n)$, for $1 \le i \le m$. This implies that $\{X_1, \ldots, X_m\} \subseteq \{A, T, C, G\}^n \setminus C(X_1, n)$. At the same time, the previous claim suggests that

$$|C(X_1, n)| \ge 2 \sum_{i=0}^{n-1} \binom{n-1}{i} 3^i - \sum_{i=0}^{n-2} \binom{n-2}{i} 3^i = 2 \cdot 4^{n-1} - 4^{n-2},$$

where the last equality follows from the binomial theorem. Therefore, $m \le 4^n - \left( 2 \cdot 4^{n-1} - 4^{n-2} \right) = 9 \cdot 4^{n-2}$, which completes the proof.

Lemma 2.

Let $u(n)$ denote the size of the largest set of distinct mutually uncorrelated sequences of length $n$. Then $u(n) \ge 4 \cdot 3^{n/4}$.

Proof: For simplicity, assume that $m$ is even. Given a mutually uncorrelated set $\{X_1, \ldots, X_m\}$, with words of length $n$ over the alphabet $\{A, T, G, C\}$, partition $\{X_1, \ldots, X_m\}$ into two arbitrary sets $A$ and $B$ of equal size, say $A = \{X_1, \ldots, X_{m/2}\}$ and $B = \{X_{m/2+1}, \ldots, X_m\}$. We argue that $C = \{ XY \mid X \in A, Y \in B \}$ is a mutually uncorrelated set with words of length $2n$.

• First, we show that the elements in $C$ are self-uncorrelated: For an arbitrary element $Z \in C$, we have $Z = XY$. Since the two sequences $X, Y$ are mutually uncorrelated, one can easily verify that $Z_1^{i} \ne Z_{2n-i+1}^{2n}$, for $i \in \{1, \ldots, 2n-1\} \setminus \{n\}$. Moreover, since $X \ne Y$, it holds that $Z_1^{n} \ne Z_{n+1}^{2n}$. This establishes the claim.

• Next, we argue that any two distinct elements in $C$ are uncorrelated: For any two distinct elements $Z = XY$ and $Z' = X'Y'$ in $C$, one can show that $Z_1^{i} \ne (Z')_{2n-i+1}^{2n}$, for $i \in \{1, \ldots, 2n-1\} \setminus \{n\}$. In addition, $X \ne Y'$ implies that $Z_1^{n} \ne (Z')_{n+1}^{2n}$. This completes the proof.

As a result, given a mutually uncorrelated set $\{X_1, \ldots, X_m\}$, where $X_i \in \{A, T, C, G\}^n$, one can construct another mutually uncorrelated set $\{Z_1, \ldots, Z_{m^2/4}\}$, where $Z_i \in \{A, T, C, G\}^{2n}$. Therefore, $u(2n) \ge u(n)^2 / 4$. Observing that for $n = 4$ it is possible to construct the following set of mutually uncorrelated sequences

{ATGC, ATAC, GTAC, GTGC, ATTC, GTTC, AGGC, AAAC, GAAC, GGGC, ATTT, GTTT}

establishes the base of a recursive procedure: since $u(4) \ge 12$ and $u(2n)/4 \ge (u(n)/4)^2$, repeated doubling gives $u(n) \ge 4 \cdot 3^{n/4} > 4 \cdot (1.31)^n$. Note that this bound is constructive, and the concatenation procedure preserves normalized minimum Hamming distances.

We now turn our attention to prefix-synchronized coding, and describe a number of results relevant for our subsequent discussion.
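The concatenation step in the proof of Lemma 2 can be checked numerically on the base set of length-4 words. The sketch below is illustrative only and assumes the usual definition of correlation, which is not restated in this section: two words are correlated if a proper prefix of one equals a suffix of the other, and a set is mutually uncorrelated if no pair (including a word with itself) is correlated. The function names are hypothetical.

```python
BASE = ["ATGC", "ATAC", "GTAC", "GTGC", "ATTC", "GTTC",
        "AGGC", "AAAC", "GAAC", "GGGC", "ATTT", "GTTT"]


def correlated(x, y):
    """True if some proper prefix of x equals a suffix of y (equal lengths)."""
    return any(x[:i] == y[-i:] for i in range(1, len(x)))


def is_mutually_uncorrelated(words):
    # The pair (x, x) is included, so every word must be self-uncorrelated.
    return not any(correlated(x, y) for x in words for y in words)


def concatenate_halves(words):
    """Split the set into halves A and B and return all words XY, X in A, Y in B."""
    half = len(words) // 2
    A, B = words[:half], words[half:2 * half]
    return [x + y for x in A for y in B]


doubled = concatenate_halves(BASE)
print(len(BASE), is_mutually_uncorrelated(BASE))
print(len(doubled), is_mutually_uncorrelated(doubled))
```

Starting from 12 words of length 4, one doubling step yields 12²/4 = 36 words of length 8, in agreement with $u(2n) \ge u(n)^2/4$.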

Theorem 5 ([17]). Given a positive integer $N$, choose the unique integer $n = n(N)$ so that $\beta = N 2^{-n}$ satisfies $\log 2 \le \beta < 2 \log 2$. Then, the maximal prefix-synchronized code of length $N$ has cardinality $\frac{2^N}{N} \beta e^{-\beta} (1 + o(1))$, as $N \to \infty$, for a prefix of the form $11 \ldots 10$.

Note that the above results indicate that codes avoiding one address sequence represent an exponentially large family of binary sequences. We prove a similar result for the case of 4-ary sequences that avoid a set of $m$ mutually uncorrelated sequences. To establish the claim, we need the following definitions. Let $g(0), g(1), \ldots$ be an integer sequence. Define the generating function of the sequence as

$$G(z) = \sum_{N=0}^{\infty} g(N) z^{-N}.$$

Theorem 6. Suppose that $\{X_1, \ldots, X_m\}$ is a set of mutually uncorrelated sequences of length $n$ over the alphabet $\{A, T, C, G\}$. Let $f(N)$, with $f(0) = 1$, be the number of strings of length $N$ over $\{A, T, C, G\}$ that do not contain substrings in $\{X_1, \ldots, X_m\}$. Then

$$F(z) = \frac{z^n}{m + (z - 4) z^{n-1}},$$

where $F(z)$ is the generating function of the sequence $\{f(N)\}$.

Proof of Theorem 6. The result is a direct consequence of Theorem 4.1 of [17]. For $1 \le i \le m$, let $f_i(N)$ denote the number of strings of length $N$ over $\{A, T, C, G\}$ that contain no element of $\{X_1, \ldots, X_m\}$, except for a single copy of $X_i$ at the right-hand side of the string. Let $F_i(z)$ be the generating function of $f_i(N)$. Then, we have the following system of equations that holds for the two sets of aforementioned functions:

$$\begin{aligned} (z - 4) F(z) + z F_1(z) + \ldots + z F_m(z) &= z \\ F(z) - z (X_1 \circ X_1)_z F_1(z) - z (X_2 \circ X_1)_z F_2(z) - \ldots - z (X_m \circ X_1)_z F_m(z) &= 0 \\ &\ \vdots \\ F(z) - z (X_1 \circ X_m)_z F_1(z) - z (X_2 \circ X_m)_z F_2(z) - \ldots - z (X_m \circ X_m)_z F_m(z) &= 0 \end{aligned} \qquad (2.2)$$

By using the fact that $(X_i \circ X_i)_z = z^{n-1}$, for $1 \le i \le m$, and $(X_i \circ X_j)_z = 0$, for $1 \le i \ne j \le m$, one can show that

$$F(z) = z^n F_1(z) = \ldots = z^n F_m(z). \qquad (2.3)$$

The result follows by substituting (2.3) into the first line of (2.2), which gives $(z - 4) F(z) + m z^{1-n} F(z) = z$.

As the dominant pole of the generating function is close to 4, the number of sequences avoiding a set of mutually uncorrelated sequences grows roughly as $4^N$.

Proof of Theorem 3. Since $P$ is self-uncorrelated, we need to show that this string is not contained in the output of CodePSC(P, ℓ, x), where the output of CodePSC(P, ℓ, x) equals

$$\text{CodePSC}(P, \ell, x) = P^{t_1 - 1} \bar{p}_{t_1, s_1} \ldots P^{t_r - 1} \bar{p}_{t_r, s_r} \, \theta_t(\cdot),$$

for some input $\theta_t(\cdot)$, and $1 \le t_1, t_2, \ldots, t_r < n$. Consequently, if $P$ is a substring of the output of CodePSC(P, ℓ, x), then the last symbol of $P$ (recall that we assumed this symbol to be G) has to appear in one of the following three positions:

• The symbol appears in $P^{t_i - 1}$, for a unique $1 \le i \le r$: In this case, there exists a suffix of $P$ appearing as a prefix of $P^{t_i - 1}$. This contradicts our assumption that $P$ is self-uncorrelated.

• The symbol appears in $\bar{p}_{t_i, s_i}$, for a unique $1 \le i \le r$: This contradicts our assumption that $\bar{p}_{t_i, s_i} \ne G$.

• The symbol appears in $\theta_t(\cdot)$: This contradicts our assumption that G does not appear in $\theta_t(\cdot) \in \{A, T, C\}^t$.

Therefore, the string $P$ does not appear as a substring in the output of CodePSC(P, ℓ, x), which completes the proof.

Proof of Theorem 4. It suffices to show that the output of

CodePSC(P, ℓ, x) is uniquely decodable. We use induction arguments to establish this result. For the basis step, by the definition of the output of CodePSC, it is straightforward to show that CodePSC(P, ℓ, x) returns the encoding $\theta_\ell(x)$, which represents a one-to-one mapping from $0 \le x < 3^{\ell}$ to $\{A, T, C\}^{\ell}$ whenever $\ell < n$. For the inductive step, we assume that the result is true for all $\ell < r$, where $r \ge n$, and show that it is consequently true for $\ell = r$.

For $\ell = r$, CodePSC(P, ℓ, x) returns $P^{t-1} \bar{p}_{t,s}$ CodePSC(P, ℓ − t, b), for some integer values $s, b$ and for some $1 \le t < n$, where

$$x = \left( \sum_{i=1}^{t-1} |\bar{P}_i| \, G_{n,\ell-i} \right) + (s - 1) G_{n,\ell-t} + b.$$

Therefore, $x$ is uniquely decodable if and only if $s$, $t$ and $b$ are unique. Since sequences of the form $P^{t-1} \bar{p}_{t,s}$ are prefix-free, one can uniquely identify both $t$ and $s$. Moreover, $\ell - t < r$, hence by the induction hypothesis it follows that $b$ is also uniquely decodable from CodePSC(P, ℓ − t, b). Hence, $x$ can be uniquely decoded.

Designation of primer | Sequence
B1-forward | 5′-AATTACTAAGCGACCTTCTC-3′
B1-reverse | 5′-ACTTATTGCGACTTCTAAGG-3′
gBlock-B1-reverse | 5′-CTTCATAACAACTAACTGTGAC-3′
B1-SU1-reverse | 5′-CGTGCACTCATAACCCATATTTCAAGAGCTAGCTATTCCTCTCCCTTAAAAGTAAATGAC-3′
B1-SD1-forward | 5′-GGGAGAGGAATAGCTAGCTCTTGAAATATGGGTTATGAGTGCACGATCATCACATAAC-3′
B2-forward | 5′-AACCTAACCATCTTCCTCTC-3′
B2-reverse | 5′-AAACGATCCCCTGACAGAGC-3′
gBlock-B2-forward | 5′-GAAGCACAGTGTTGCTGCGTG-3′
B2-SU1-reverse | 5′-CAGCTTGTATCCCATCTCAACCCTAATTCCATAACCGTCAGCGCAGTTGACTAGTCTC-3′
B2-SD1-forward | 5′-CTGCGCTGACGGTTATGGAATTAGGGTTGAGATGGGATACAAGCTGATATGGGAAC-3′
B3-forward | 5′-ATAATAGGCCTGATGATCTC-3′
B3-reverse | 5′-AAGAAGAACCAGTAAGCAGC-3′
B3-SU1-reverse | 5′-AACATCTACTCACTCTCAATCTAAGCTTGAACTGTGTACACACCATCGCTCTTGTACGCC-3′
B3-SU2-forward | 5′-GTGTACACAGTTCAAGCTTAGATTGAGAGTGAGTAGATGTTGATGCGAGGCGAAAGATGT-3′
B3-SD2-reverse | 5′-GACTTCCCCCCTATAATCCATTAATGCTAGATCAAGCCGCATATACTATGTTGCAAATAC-3′
B3-SD2-forward | 5′-GCGGCTTGATCTAGCATTAATGGATTATAGGGGGGAAGTCGCTGCTGGTACTCTG-3′

Table S2. List of primers for rewriting (editing) the blocks B1, B2 and B3. The primers for the gBlock method are listed separately from those used with the OE-PCR method. In the latter case, the labels of DNA fragments SU and SD stand for sample upstream and sample downstream. In OE-PCR, we linked two DNA fragments or three DNA fragments into the final PCR products; when two fragments were linked, the first fragment was labeled UP (U), while the second fragment was labeled DOWN (D); when three fragments were combined, the second fragment was labeled MIDDLE (M).
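Returning to Theorem 6 of this section: multiplying both sides of the expression for $F(z)$ by its denominator and comparing coefficients gives the recurrence $f(N) = 4 f(N-1) - m f(N-n)$ for $N \ge 1$, with $f(0) = 1$ and $f(k) = 0$ for $k < 0$. The Python sketch below compares this recurrence against exhaustive enumeration for a small example. The word set is a subset of the length-4 base set from the proof of Lemma 2, and the function names are hypothetical.

```python
from itertools import product


def count_avoiding(n, m, N):
    """f(N) from the recurrence implied by F(z) = z^n / (m + (z - 4) z^(n-1))."""
    f = [1]
    for k in range(1, N + 1):
        earlier = f[k - n] if k >= n else 0
        f.append(4 * f[k - 1] - m * earlier)
    return f[N]


def brute_force(words, N):
    """Count length-N strings over {A, T, C, G} with no substring in `words`."""
    total = 0
    for symbols in product("ATCG", repeat=N):
        s = "".join(symbols)
        if not any(w in s for w in words):
            total += 1
    return total


WORDS = ["ATGC", "ATAC", "GTAC", "GTGC"]  # mutually uncorrelated, n = 4, m = 4

for N in range(8):
    print(N, count_avoiding(4, len(WORDS), N), brute_force(WORDS, N))
```

For small $m$ relative to $4^{n}$ the counts stay close to $4^N$, which is consistent with the remark that the dominant pole of $F(z)$ lies near 4.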

Address Sequences

Consider the following set of strings of length 20,

ACTAACTGTGCGACTGATGC
ACACTATCGAGCTGACACGT
AGTCAGCAGTAGTCAGTCAG
ACTGAGCTGAGCGTATATCG
ACTCAGCTACGACTCACATG

with GC content equal to 50%, i.e., 10 GC bases. The sequences are mutually uncorrelated and at Hamming distance exactly from each other. The sequences do not exhibit secondary structures at room temperature, as verified by the mfold and Vienna packages. We used these addresses for a very small-scale, proof-of-concept random access/rewriting experiment of a KB file.

In the large scale random access/rewriting experiment described in Section 5, we used different address sequences for the two flanking ends of the 1000 bps blocks. The sequences we synthesized include:

block 1: (CTCTTCCAGCGAATCATTAA, ACTTATTGCGACTTCTAAGG)
block 2: (CTCTCCTTCTACCAATCCAA, AAACGATCCCCTGACAGAGC)
block 3: (CTCTAGTAGTCCGGATAATA, AAGAAGAACCAGTAAGCAGC)
block 4: (CTCTTTCGCTGTGCACAAAA, AAATCGGAAATTCGTGTCGC)
block 5: (CTCTGCTGGAAATGTGTGAA, AATTCACGGTCCGAAACACC)
block 6: (CTCTGTTCCTCCTTTCTCGT, TGTAGACGATTTGATTGGCG)
block 7: (CTCTAGCAACTTCCGCAAAT, ACGAGATTCATACCGGACCC)
block 8: (CTCTAGCTTCCCTATCCATA, TGCAGAAGAGGAGTGTCAGC)
block 9: (CTCTATAGGCTCTGGTATGT, TTTAACCCGCCCGTACAGCC)
block 10: (CTCTCGCTCATCTCATGTTT, ACAGTACTTGCCCAATTCGC)
block 11: (CTCTGTACTCCGCTGAATCA, TAAACATTACAAGCCCCTCG)
block 12: (CTCTTCTTCCCTGACGATGT, AATACAACTTCTAACCACCC)
block 13: (CTCTTGATCCTACTGAGAAA, TTAATAGTTCCCGGCAGCCC)
block 14: (CTCTAGTGACGTGACAGGTA, TTAGAACGAACCAGTATAGC)
block 15: (CTCTACCTAAGGCCTTTGAA, TTGACCCATGAGCCAGCACC)
block 16: (CTCTACAGTAGTAAACTCGT, TGCTGAACTCTAATCTGTCC)
block 17: (CTCTGGGCGGCTGTACACAA, ATACACTCATAACACCTCGG)
block 18: (CTCTGCGATCACAAAAAGTT, ACAACTATACGTGTCGGACC)
block 19: (CTCTTTAGCACGAGTCCTAT, TGAACCCGTCGTGCTAATCG)
block 20: (CTCTAATACGCACGCCCATT, ATACGGGATACAATTAGGGC)
block 21: (CTCTGAGGCGTGGATATTTT, AATACATCCCTAAAAGCCGG)
block 22: (CTCTGCGTGTTCATTCCATT, TGAGGATAGGATTAGTAAGG)
block 23: (CTCTAAGAATCTGACTGCAT, ATGTTAACACTGAGTAAGGG)
block 24: (CTCTGATCGAACCCATGTCA, ACATGACCTACATAACGTCC)
block 25: (CTCTCTGGTGGCCTAAAAAT, AACAGAGATCAGAGCAGTGG)
block 26: (CTCTAGAGAAACGTTGAAGT, AACCCGTACTCACTATGCCG)
block 27: (CTCTGACGTCTACACAACAT, TTTGTAGATCCCAAGCATCG)

The pairs of sequences were used to flank the two ends of the data blocks. Only the addresses on the left were used for subsequent prefix-synchronized coding. The sequences on the left-hand side of the pairing have “interleaved” {G, C} and {A, T} bases; for example, they all start with CTCT.... This ensures a “GC balancing” property for the prefixes of the addresses.
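The GC content of the five 20-base addresses listed at the start of this section can be verified with a few lines of Python. The sketch below is illustrative; the helper names are hypothetical. It also prints the pairwise Hamming distances so that the spacing between addresses can be inspected directly.

```python
from itertools import combinations

ADDRESSES = [
    "ACTAACTGTGCGACTGATGC",
    "ACACTATCGAGCTGACACGT",
    "AGTCAGCAGTAGTCAGTCAG",
    "ACTGAGCTGAGCGTATATCG",
    "ACTCAGCTACGACTCACATG",
]


def gc_count(seq):
    """Number of G or C bases in the sequence."""
    return sum(base in "GC" for base in seq)


def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


for seq in ADDRESSES:
    print(seq, len(seq), gc_count(seq) / len(seq))

for a, b in combinations(ADDRESSES, 2):
    print(a, b, hamming(a, b))
```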

Example of Encoding and Decoding Procedure

In this section, we illustrate the encoding and decoding procedure for the short address string P = AGCTG, which can easily be verified to be self-uncorrelated. More precisely, we explain how to compute a sequence of integers $G_{n,1}, G_{n,2}, \ldots, G_{n,7}$, described in the main body of the paper. As before, $n$ denotes the length of the address string, which in this case equals five. One has

$$(G_{n,1}, G_{n,2}, \ldots, G_{n,7}) = (3, 9, 27, 81, 267, 849, 2715).$$

The algorithm CodePSC(P, 8, 550) produces:

550 = 0 × G_{5,7} + 550 ⇒ CodePSC(P, 8, 550) = C CodePSC(P, 7, 550),
550 = 0 × G_{5,6} + 550 ⇒ CodePSC(P, 7, 550) = C CodePSC(P, 6, 550),
550 = 2 × G_{5,5} + 0 × G_{5,4} + 16 ⇒ CodePSC(P, 6, 550) = AA CodePSC(P, 4, 16),
16 = 0 × 3^3 + 1 × 3^2 + 2 × 3^1 + 1 × 3^0 ⇒ CodePSC(P, 4, 16) = ATCT,
⇒ CodePSC(P, 8, 550) = CCAAATCT.

When running DecodePSC(P, X) on the encoded output X = CCAAATCT, the following steps are executed:

⇒ DecodePSC(P, CCAAATCT) = 0 × G_{5,7} + DecodePSC(P, CAAATCT),
⇒ DecodePSC(P, CAAATCT) = 0 × G_{5,6} + DecodePSC(P, AAATCT),
⇒ DecodePSC(P, AAATCT) = 2 × G_{5,5} + 0 × G_{5,4} + DecodePSC(P, ATCT),
⇒ DecodePSC(P, ATCT) = 16,
⇒ DecodePSC(P, CCAAATCT) = 2 × G_{5,5} + 16 = 550.

Experimental Synthesis, Access and Rewrite of DNA Storage Sequences

A total of sequences of length 1000 bps each were designed to encode information retrieved from the Berkeley, Harvard, MIT, Princeton, Stanford, and UIUC Wikipedia pages in 2014. Except for sequence … corresponding address primers were synthesized by the same company. The address sequences of the blocks are listed in Section 3.

As a proof of concept, we performed a number of selection and editing experiments. These include selecting individual blocks and rewriting one of their sections, and selecting three blocks and rewriting three sections in each, two close to the flanking ends and one in the middle. The edits involved information about the budget of the institutions at a given year of operation. Detailed information about the original sequences and their rewritten forms is given in the following sections.

Sequence identifier | Number of sequence samples | Length of the edited region (in bps) | Selection accuracy / readout error percentage | Description of editing method
B1-M-gBlock | | | / | gBlock method
B1-M-PCR | | | / | OE-PCR method
B2-M-gBlock | | | / | gBlock method
B2-M-PCR | | | / | OE-PCR method
B3-M-gBlock | | | / | gBlock method
B3-M-PCR | | | / | OE-PCR method

Table S3. Selection, rewriting and sequencing results. Each rewritten 1000 bps sequence was ligated to a linearized pCR™-Blunt vector using the Zero Blunt PCR Cloning Kit and was transformed into E. coli. The E. coli strains with correct plasmids were sequenced at ACGT, Inc. Sequencing was performed using two universal primers: M13F_20 (in the reverse direction) and M13R (in the forward direction) to ensure that the entire blocks of 1000 bps are covered.

Fig. S1. A) Schematic depiction of the editing method using gBlocks. B) Detailed description of the generation of the mutation. Four sequences (ranging in length from to bps) containing the entire edit region were gBlock synthesized from IDT. The remaining parts of the 1000 bps sequences were PCR amplified. A homology of at least 30 bps between the flanking end sequence of the blocks and the corresponding end of the gBlock fragment was created. By one OE-PCR, the desired edits were generated in a one-pot manner.

We denoted the blocks on which we performed selection and editing by B1, B2, and B3. The primers used for performing the edits in the blocks are listed in Table S2. Note that two primers were synthesized for each rewrite, for the forward and reverse direction. In addition, two different editing (mutation) techniques were used, gBlock and Overlap-Extension (OE) PCR; gBlocks are double-stranded genomic fragments that are frequently used as primers, for gene construction or for mediated genome editing. An illustration of editing via gBlocks is shown in Fig. S1. On the other hand, OE-PCR is a variant of PCR used for specific DNA sequence editing via point mutations or splicing. An illustration of the procedure is given in Fig. S1. To demonstrate the plausibility of a cost-efficient method for editing, OE-PCR was used with general primers (≤ 60 bps) only. For edits shorter than bps, the mutation sequences were designed as overhangs in primers. Then, the three PCR products were used as templates for the final PCR reaction involving the entire 1000 bps rewrite.

All linear 1000 bps fragments were mixed, and the mixture was used as a template for PCR amplification and selection of the B1, B2 and B3 sequences. The results of selection are shown in Fig. S2, where three bands of size 1000 bps are depicted. These bands indicate that sequences of the correct length were isolated. Subsequent sequencing confirmed that the sequences were indeed the user requested B1, B2 and B3 strands. A summary of the experiments performed is provided in Table S3.

Fig. S2. PCR of 1000 bps sequences B1, B2, B3 from a mixture of 26 sequences.

The unedited B1_original (B1) sequence is of the form:

AATTACTAAGCGACCTTCTCGGATAGAACGCTTAGTTGGTGCGTTGACATGCTCGAACTGATCATCGGTCACTTGCATTCATTATTGATTGTTGAGTTGAGAAGCGCATTGGTGTCACTCGTTGCTGGGTCATTTTCGGCGAGAGAAACA

GTTCACTGTGGCGTGATGTTTTGAAATGAGGGAGAGTTCTCTTAACTGCA

GTTGGAGTTCAGTATACTCGGGATAGTGTAACAGAGGGAGGCGGATGTGTGTATTGATGTGAAGTCTTTCACGTGCGGGCTAGGTCGTAATGACGGGTCGGGAACTATTCATTGGCGCAATAGTGATTTTGATGAATGATGGATAGAACGCTTAAAGGGAAACTATATAGTTCAAAGCTCGTCGGCGGTGTCGAGGATGTATAGGGGTTAATGAATGGTGGAACTTACTTATACTATAGATTGGACTGGTGGTATGAGAACTTCACTAATTATTGACGTCACAGTTAGTTGTTATGAAGTGATAATATGAATCGAGCGCAACAGGACTAGTCATTTACTTTTAAGGGAGAGGAATAGCTAATCTCAAATTTTTTTTATGTGAGTGCACGATCATCACATAACATAGGAGGCGATGAGACAGCGACTCAATCTGACTAATTCATTATAGGAGTTATATGAAGAGTTCGGAACGAAGCTAGCGCTTTCGCACAATGCGAGGGATAAGAGCGGGTGCAGAGCGAAGGGTGTGAAATTGATGGTGGATAAGAACTTCGCACAGTACTAGCTAGTGGGGAGAGACTTCTATGAATTCGGAGGGATACTTGATATTGATATGGGGGGATGGCGCTATTAAGCGCAGAGCGTAAGTGCGCTTCAAATCGAACATTGTGTAGCTAAGCAATAGAGAAATGTGGGGATTGAGCAGTTCGTATCGGTTCGCATGACATACTTGGGAAAATGGCAGCTTGTTTAAGCTAAACTGGATGAAAGGGAGGAAAAACTTATTGCGACTTCTAAGG where the bases written in red represent the regions we edited.18ig. S3. Illustration of the process of generating the B1 edit/mutation using general primers.The edited B1_mutation (B1_M) sequence reads as:

AATTACTAAGCGACCTTCTCGGATAGAACGCTTAGTTGGTGCGTTGACATGCTCGAACTGATCATCGGTCACTTGCATTCATTATTGATTGTTGAGTTGAGAAGCGCATTGGTGTCACTCGTTGCTGGGTCATTTTCGGCGAGAGAAACAGTTCACTGTGGCGTGATGTTTTGAAATGAGGGAGAGTTCTCTTAACTGCAGTTGGAGTTCAGTATACTCGGGATAGTGTAACAGAGGGAGGCGGATGTGTGTATTGATGTGAAGTCTTTCACGTGCGGGCTAGGTCGTAATGACGGGTCGGGAACTATTCATTGGCGCAATAGTGATTTTGATGAATGATGGATAGAACGCTTAAAGGGAAACTATATAGTTCAAAGCTCGTCGGCGGTGTCGAGGATGTATAGGGGTTAATGAATGGTGGAACTTACTTATACTATAGATTGGACTGGTGGTATGAGAACTTCACTAATTATTGACGTCACAGTTAGTTGTTATGAAGTGATAATATGAATCGAGCGCAACAGGACTAGTCATTTACTTTTAAGGGAGAGGAATAGCTAGCTCTTGAAATATGGGTTATGAGTGCACGATCATCACATAACATAGGAGGCGATGAGACAGCGACTCAATCTGACTAATTCATTATAGGAGTTATATGAAGAGTTCGGAACGAAGCTAGCGCTTTCGCACAATGCGAGGGATAAGAGCGGGTGCAGAGCGAAGGGTGTGAAATTGATGGTGGATAAGAACTTCGCACAGTACTAGCTAGTGGGGAGAGACTTCTATGAATTCGGAGGGATACTTGATATTGATATGGGGGGATGGCGCTATTAAGCGCAGAGCGTAAGTGCGCTTCAAATCGAACATTGTGTAGCTAAGCAATAGAGAAATGTGGGGATTGAGCAGTTCGTATCGGTTCGCATGACATACTTGGGAAAATGGCAGCTTGTTTAAGCTAAACTGGATGAAAGGGAGGAAAAACTTATTGCGACTTCTAAGG with rewrites listed in red.

Since a gBlock of length longer than bps was needed, it was more costly to synthesize the gBlock and perform rewriting than to directly re-synthesize the whole block. Hence, the gBlock method was not used in this case.

The first part of the sequence was PCR amplified using the forward direction primer 5′-AATTACTAAGCGACCTTCTC-3′, while for the reverse direction, the primer was 5′-CGTGCACTCATAACCCATATTTCAAGAGCTAGCTATTCCTCTCCCTTAAAAGTAAATGAC-3′. The second part of the sequence was PCR amplified by using the forward direction primer 5′-GGGAGAGGAATAGCTAGCTCTTGAAATATGGGTTATGAGTGCACGATCATCACATAAC-3′ and reverse direction primer 5′-ACTTATTGCGACTTCTAAGG-3′. Both PCR reactions used the sequence B1 as template. Two such PCR products are shown in Fig. S4, indicating that the correct length products were isolated in each reaction.

OE-PCR was performed in a ul reaction volume containing the two aforementioned PCR products without primers for the first cycles and the products with primers (B1 primers in Table S2) for the later cycles. A single band with the correct size of 1000 bps was obtained (see Fig. S4).

The unedited B2_original (B2) sequence is of the form:

AACCTAACCATCTTCCTCTCGATTTGGAGCAGATTGGTATTATTCTAGTCGTCGAGACTAGTCAACTGCGCTAGTTTGTGTTCATAAAATAAGAGTATGAGATACAAGCTGATATGGGAACTTAATTACGAAGCACAGTGTTGCTGCGTGGACTTGTGAAGTAGGGTGTGAGATAAGAATGATAGCGAACGCAGCGTATGGCTGAAGTGCTGGGCATATTGTGGTGTGGACATCTCAAAGTCTATGAAGATTGGTAATAGGATGGTCTCTCGGGTCTCAAACTTCGTCAGGCAGCATTGTGCATGCGAGTGATTGAAAGGGAGGGTAAGGGTTATTAATAGAAAAGACTTACAGGCGTTGGTATGATTCAAGATCGCAAGAATCGTGTGAGCTTGAGGACTAAATAGTTTAAAGAAATAGGAATAGTTGTAATTTAAGGAGCGTGGCACGGATGGATCAGCGTGTCAACGGAACGCGCATTTGGGAGTTTTATGTTAAGTGAGCAGACTAAGGTGAAATTCAATAGTCTCTATCGTTCGAGGGTTATTGCTAGGGGAGACTTTGAGTGAGTGGTAATTTTGAAGCAGTATACGTAACTTTTTCGATTCTTAGTGGCAGTTACTCTGAATTTTAGTGTGAGCAGAGTGTGATAAATAGAGAGATACGAGGTCGACACGGCTGTTGGGGGCACTTAACAGTAGGGGGTTGATGCTGGCGGACACTAAAGGATTTTTGAAGGGGATTGTTGGCGACTCACATCTAAGTGGTATTGCGGGCTCTATGAGAATCTGCTCGAGTCATCTAGGTTGAGGAAGAGGGGGAGATTCTCGTTAAAGACAGTACATATTTCGCATACTTCTTAACGTGGAGTATGAATGTCAATGGTGGGAGATATGGGTGGAGGGATTTCATTCACTGCATATGTACGCTCAGGAGCGCGAACGAATCATAAAACTATTGTAATATATTGATAGATAAAGAAACGATCCCCTGACAGAGC

The edited sequence B2-M is of the form:

AACCTAACCATCTTCCTCTCGATTTGGAGCAGATTGGTATTATTCTAGTCGTCGAGACTAGTCAACTGCGCTGACGGTTATGGAATTAGGGTTGAGATGGGATACAAGCTGATATGGGAACTTAATTACGAAGCACAGTGTTGCTGCGTGGACTTGTGAAGTAGGGTGTGAGATAAGAATGATAGCGAACGCAGCGTATGGCTGAAGTGCTGGGCATATTGTGGTGTGGACATCTCAAAGTCTATGAAGATTGGTAATAGGATGGTCTCTCGGGTCTCAAACTTCGTCAGGCAGCATTGTGCATGCGAGTGATTGAAAGGGAGGGTAAGGGTTATTAATAGAAAAGACTTACAGGCGTTGGTATGATTCAAGATCGCAAGAATCGTGTGAGCTTGAGGACTAAATAGTTTAAAGAAATAGGAATAGTTGTAATTTAAGGAGCGTGGCACGGATGGATCAGCGTGTCAACGGAACGCGCATTTGGGAGTTTTATGTTAAGTGAGCAGACTAAGGTGAAATTCAATAGTCTCTATCGTTCGAGGGTTATTGCTAGGGGAGACTTTGAGTGAGTGGTAATTTTGAAGCAGTATACGTAACTTTTTCGATTCTTAGTGGCAGTTACTCTGAATTTTAGTGTGAGCAGAGTGTGATAAATAGAGAGATACGAGGTCGACACGGCTGTTGGGGGCACTTAACAGTAGGGGGTTGATGCTGGCGGACACTAAAGGATTTTTGAAGGGGATTGTTGGCGACTCACATCTAAGTGGTATTGCGGGCTCTATGAGAATCTGCTCGAGTCATCTAGGTTGAGGAAGAGGGGGAGATTCTCGTTAAAGACAGTACATATTTCGCATACTTCTTAACGTGGAGTATGAATGTCAATGGTGGGAGATATGGGTGGAGGGATTTCATTCACTGCATATGTACGCTCAGGAGCGCGAACGAATCATAAAACTATTGTAATATATTGATAGATAAAGAAACGATCCCCTGACAGAGC

where, as before, red letters were used to indicate the rewritten region.

A bps sequence, containing the entire edited region and the B2 string, was gBlock synthesized by IDT. Another part of B2 was PCR amplified using the forward primer 'GAAGCACAGTGTTGCTGCGTG' and the reverse primer 'AAACGATCCCCTGACAGAGC'. The B2 sequence served as a template. See Fig. S4 for an illustration.

Fig. S5. PCR products of B1 and B2.

Fig. S6. PCR products of B3.

Overlap-extension PCR (OE-PCR) was performed in a µl reaction volume containing the above bps gBlock product and PCR products without primers for the first cycles and with B2 forward and reverse primers listed in Table S2 for the subsequent cycles. The PCR product was deposited on a gel substrate and the correct bps band was obtained as shown in Fig. S5.

One pair of primers was designed to PCR amplify the first part of the sequence B2-M, with forward primer 'AACCTAACCATCTTCCTCTC' and reverse primer 'CAGCTTGTATCCCATCTCAACCCTAATTCCATAACCGTCAGCGCAGTTGACTAGTCTC'. The second part was PCR amplified by the forward primer 'CTGCGCTGACGGTTATGGAATTAGGGTTGAGATGGGATACAAGCTGATATGGGAAC' and reverse primer 'AAACGATCCCCTGACAGAGC'. Both PCRs used B2 as a template. Two PCR products are shown in Fig. S5.

Fig. S7. Scheme for generating the B3 edits using standard 60 bps primers.
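The overlap design behind the OE-PCR primers quoted above can be checked in silico. The following is a minimal sketch rather than the authors' procedure: the function names (`reverse_complement`, `longest_overlap`, `stitch`) and the `min_overlap` threshold are illustrative assumptions. It takes the reverse primer of the first B2-M fragment, converts it to the top strand, and measures how many bases it shares with the forward primer of the second fragment. That shared stretch is what lets the two PCR products anneal to each other during the primer-free cycles.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def stitch(left: str, right: str, min_overlap: int = 15) -> str:
    """Join two top-strand fragments that share an overlapping end.

    `min_overlap` is an illustrative threshold, not a value from the text.
    """
    k = longest_overlap(left, right)
    if k < min_overlap:
        raise ValueError(f"overlap of {k} bases is shorter than {min_overlap}")
    return left + right[k:]


# Primers quoted in the text for the two B2-M fragments.
B2_PART1_REVERSE = "CAGCTTGTATCCCATCTCAACCCTAATTCCATAACCGTCAGCGCAGTTGACTAGTCTC"
B2_PART2_FORWARD = "CTGCGCTGACGGTTATGGAATTAGGGTTGAGATGGGATACAAGCTGATATGGGAAC"

# The reverse primer anneals to the top strand, so its reverse complement
# gives the top-strand end of the first fragment.
part1_end = reverse_complement(B2_PART1_REVERSE)
shared = longest_overlap(part1_end, B2_PART2_FORWARD)
print("shared bases:", shared)
print("overlap:", B2_PART2_FORWARD[:shared])
print("joined junction:", stitch(part1_end, B2_PART2_FORWARD))
```

The same check applies to any other pair of adjacent fragments: an overlap of zero would indicate that the two products cannot prime each other.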

The unedited original B3 sequence equals:

ATAATAGGCCTGATGATCTCGATGGATGCGCGTCACTCGAGTGCGGTAGGCACGTCTCAGGTGATAAGTGATTGTGATTGTAGGTGAAGGGGGTAGAAATGATTGAGGAAACTTGTGTACTCGTTACACGTGATAGGGTTTGATCGGCGGTGGAAAAATTAGGGATGGGGATAAGATTATGGGATCGTTCTCAATAATTGTTACGATATCGTTGTTACACAGTTGTTACGCTACGACGTCATCGATAAAGGTGGGTATGTGGGGGTACTATACTCTTGGGGGCGTACAAGAGCGATGGTTGGTCGGATTGAAATTAAAAGCATTAAGAGGTTAATTTATAGATGCGAGGCGAAAGATGTGAGCGCAAGTAAAGGAAACGCGAGCAAGTGATTGTTACTAATTATATTAGGAGGTGATGAGGAGCGTGGTTATCTTATTGGGCGAGCTGCAGCGAATTCTAGATTTCTTCGAGTTACAGTCGTAGTGATGTATATAGAGTGGATGCGCACATTATTACATATATCGTCGAATTGGATTAGACGCAAAGAAAATGCGGCATTGTAATGGGTTGTGTAAAATTGAGCGTGGTTATCTTGTCATGACATAGTAAAAGTTGCTCAATTGATTGAAGCTCGATTAGGAGAAGTAATTTGAAAAAAGGATAGACTAGGACTCAACGAGGAACGGGTATTTGCAACATAGTATATGCGGTCTTAATCGGAGGGTAATGTTATTTGTGTGGAAGTCGCTGCTGGTACTCTGGGCGTTTAGGATGAATCTTCGAAACTAGGCTTTGTCAGAGATAGTTTGTTGGTAAGAAGAATCAGGAAACGGTAACAGAGAATAAATGAATTAACGTAGCAAGATTTCGTCTTTCTGGAGATGAGAAGGTGTAGTTGAGGAGTCGACGTTCTTTACGGAGGTGGGAGATTGGTTTTGGCAGTACTTCGTTAAATACACTAAAAAATTTGATAATGTAGAAGAAGAACCAGTAAGCAGC

The edited sequence B3-M, which contains two rewritten regions, equals:

ATAATAGGCCTGATGATCTCGATGGATGCGCGTCACTCGAGTGCGGTAGGCACGTCTCAGGTGATAAGTGATTGTGATTGTAGGTGAAGGGGGTAGAAATGATTGAGGAAACTTGTGTACTCGTTACACGTGATAGGGTTTGATCGGCGGTGGAAAAATTAGGGATGGGGATAAGATTATGGGATCGTTCTCAATAATTGTTACGATATCGTTGTTACACAGTTGTTACGCTACGACGTCATCGATAAAGGTGGGTATGTGGGGGTACTATACTCTTGGGGGCGTACAAGAGCGATGGTGTGTACACAGTTCAAGCTTAGATTGAGAGTGAGTAGATGTTGATGCGAGGCGAAAGATGTGAGCGCAAGTAAAGGAAACGCGAGCAAGTGATTGTTACTAATTATATTAGGAGGTGATGAGGAGCGTGGTTATCTTATTGGGCGAGCTGCAGCGAATTCTAGATTTCTTCGAGTTACAGTCGTAGTGATGTATATAGAGTGGATGCGCACATTATTACATATATCGTCGAATTGGATTAGACGCAAAGAAAATGCGGCATTGTAATGGGTTGTGTAAAATTGAGCGTGGTTATCTTGTCATGACATAGTAAAAGTTGCTCAATTGATTGAAGCTCGATTAGGAGAAGTAATTTGAAAAAAGGATAGACTAGGACTCAACGAGGAACGGGTATTTGCAACATAGTATATGCGGCTTGATCTAGCATTAATGGATTATAGGGGGGAAGTCGCTGCTGGTACTCTGGGCGTTTAGGATGAATCTTCGAAACTAGGCTTTGTCAGAGATAGTTTGTTGGTAAGAAGAATCAGGAAACGGTAACAGAGAATAAATGAATTAACGTAGCAAGATTTCGTCTTTCTGGAGATGAGAAGGTGTAGTTGAGGAGTCGACGTTCTTTACGGAGGTGGGAGATTGGTTTTGGCAGTACTTCGTTAAATACACTAAAAAATTTGATAATGTAGAAGAAGAACCAGTAAGCAGC
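The rewritten regions can also be located programmatically by comparing an unedited block with its edited counterpart. The sketch below is illustrative only: it assumes the two sequences have equal length (so that a position-by-position comparison is meaningful), and the function name `edited_regions` and the `merge_gap` parameter are hypothetical. Mismatching positions that lie within `merge_gap` matching bases of each other are merged into one region, since a rewritten base can coincide with the original base by chance.

```python
from typing import List, Tuple


def edited_regions(original: str, edited: str, merge_gap: int = 3) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the sequences differ.

    Mismatches separated by at most `merge_gap` matching bases are merged
    into a single region. Both sequences must have the same length.
    """
    if len(original) != len(edited):
        raise ValueError("sequences must have equal length for a positional comparison")
    regions: List[List[int]] = []
    for i, (a, b) in enumerate(zip(original, edited)):
        if a == b:
            continue
        if regions and i - regions[-1][1] <= merge_gap:
            regions[-1][1] = i + 1  # extend the current region
        else:
            regions.append([i, i + 1])  # start a new region
    return [(start, end) for start, end in regions]


# Toy example; the full blocks from the text can be passed in the same way.
toy_original = "AAACCCGGGTTT"
toy_edited = "AAACTAGGGTTT"
for start, end in edited_regions(toy_original, toy_edited):
    print(start, end, toy_original[start:end], "->", toy_edited[start:end])
```

The reported boundaries are approximate: bases at the edge of a rewritten region that happen to equal the original bases are not counted as part of the edit.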

Two sequences, the bps sequence containing the first mutation region and the second bps sequence containing the second mutation region, were gBlock synthesized by IDT. There was a bps overlap between the two gBlocks. OE-PCR was performed in a µl reaction volume containing the above two bps gBlock products without primers for the first cycles and additional B3 forward and reverse primers listed in Table S2 for the subsequent cycles. The PCR product was deposited on a gel substrate and the correct bps band was obtained.

One pair of primers was designed to PCR amplify the first part of the sequence B3-M, using 'ATAATAGGCCTGATGATCTC' in the forward direction and 'AACATCTACTCACTCTCAATCTAAGCTTGAACTGTGTACACACCATCGCTCTTGTACGCC' in the reverse direction. The second part was PCR amplified in the forward direction by using the primer 'GTGTACACAGTTCAAGCTTAGATTGAGAGTGAGTAGATGTTGATGCGAGGCGAAAGATGT' and in the reverse direction by using the primer 'GACTTCCCCCCTATAATCCATTAATGCTAGATCAAGCCGCATATACTATGTTGCAAATAC'. The third part was PCR amplified by the forward direction primer 'GCGGCTTGATCTAGCATTAATGGATTATAGGGGGGAAGTCGCTGCTGGTACTCTG' and reverse direction primer 'AAGAAGAACCAGTAAGCAGC'. All three PCRs used the sequence B3 as the template. All three PCR products are shown in Fig. S8.

Fig. S8. The generated PCR products of bps edits from the gBlock method, involving B1-gBlock, B2-gBlock and B3-gBlock.

Fig. S9. The generated PCR products of 1000 bps sequence editing for the OE-PCR based method, and sequences B1-PCR, B2-PCR and B3-PCR.

OE-PCR was performed in a µl reaction volume containing the above three PCR products without primers for the first cycles and with B3 primers listed in Table S2 for the subsequent cycles. A single band of correct size bps was obtained (see Fig. S9).

Correctness of the synthesized edited regions was confirmed via DNA Sanger sequencing as follows.
The PCR products of the gBlock method and the OE-PCR method were named B1-M-gBlock, B2-M-gBlock, B3-M-gBlock and B1-M-PCR, B2-M-PCR, B3-M-PCR, respectively. All final mutations/edits of PCR products were purified using the QiaGen Gel Purification Kit. The purified bps edited sequences were blunt-ligated to the vector named pCR™-Blunt (Fig. S10) using the Zero Blunt PCR Cloning Kit and following the manufacturer's protocol. Five colonies of each PCR-Blunt-mutation were sent to ACTG, Int. Sequencing was performed using two universal primers: M13F_20 (for the reverse direction) and M13R (for the forward direction). Bi-directional sequencing was performed in order to ensure that the entire bps block was completely covered.

Fig. S10. Map and features of PCR-Blunt vector (Life Technologies).

In our small-scale experiments, Sanger sequencing produced two erroneous symbols in one strand, which we were able to correct using prefix matching. One possible problem that may arise in large-scale DNA storage systems involving millions of blocks is erroneous sequencing that cannot be corrected via prefix matching. In current high-throughput sequencing technologies, such as Illumina HiSeq or MiSeq, the dominant sources of errors are substitutions. Due to our word grouping scheme, such substitution errors cannot cause catastrophic error propagation, but may nevertheless accumulate as the number of rewrite cycles increases. In this case, prefix matching may not suffice to correct the errors and more sophisticated coding schemes need to be used. Unfortunately, adding parity-check symbols to the prefix-encoded data stream may cause problems, as the parities may violate the prefix properties and unbalance the GC content. Furthermore, every time rewriting is performed, the parity-checks will need to be updated, which incurs additional cost for maintaining the system.
A simple solution to this problem is a hybrid scheme, in which the bulk of the information is stored in DNA media, while only parity-checks are stored on a classical device, such as flash memory. Given that the current error-rate of short-read sequencing technologies roughly equals 1%