Circular Trace Reconstruction
Shyam Narayanan, Michael Ren
Abstract
Trace Reconstruction is the problem of learning an unknown string $x$ from independent traces of $x$, where traces are generated by independently deleting each bit of $x$ with some deletion probability $q$. In this paper, we initiate the study of Circular Trace Reconstruction, where the unknown string $x$ is circular and traces are now rotated by a random cyclic shift. Trace reconstruction is related to many computational biology problems studying DNA, which is a primary motivation for this problem as well, as many types of DNA are known to be circular.

Our main results are as follows. First, we prove that we can reconstruct arbitrary circular strings of length $n$ using $\exp(\tilde{O}(n^{1/3}))$ traces for any constant deletion probability $q$, as long as $n$ is prime or the product of two primes. For $n$ of this form, this nearly matches the best known bound of $\exp(O(n^{1/3}))$ for standard trace reconstruction. Next, we prove that we can reconstruct random circular strings with high probability using $n^{O(1)}$ traces for any constant deletion probability $q$. Finally, we prove a lower bound of $\tilde{\Omega}(n^3)$ traces for arbitrary circular strings, which is greater than the best known lower bound of $\tilde{\Omega}(n^{3/2})$ in standard trace reconstruction.

The trace reconstruction problem asks one to recover an unknown string $x$ of length $n$ from independent noisy samples of the string. In the original setting, $x$ is a binary string in $\{0,1\}^n$, and a random subsequence $\tilde{x}$ of $x$, called a trace, is generated by sending $x$ through a deletion channel with deletion probability $q$, which removes each bit of $x$ independently with some fixed probability $q$. The main question is to determine how many independent traces are needed to recover the original string with high probability.
This question has become very well studied over the past two decades [Lev01a, Lev01b, BKKM04, KM05, HMPW08, VS08, MPV14, DOS19, NP17, PZ17, HHP18, HL20, HPP18, Cha19, CDL+20]. One can study either the case where we wish to reconstruct $x$ for any arbitrarily chosen $x \in \{0,1\}^n$ (worst-case) or the case where we just wish to reconstruct a randomly chosen string $x$ (average-case). People have also studied the trace reconstruction problem for various values of the deletion probability $q$, such as when $q$ is a fixed constant between 0 and 1 or decays as some function of $n$. People have also studied variants where the traces allow for insertions of random bits, rather than just deletion of bits, and variants where the string is no longer binary but from a larger alphabet.

Finally, various generalizations or variants of the trace reconstruction problem have also been developed. These include error-correcting codes over the deletion channel (i.e., "coded" trace reconstruction) [CGMR19, BLS20], reconstructing matrices [KMMP19] and trees [DRR19] from traces, and reconstructing mixtures of strings from traces [BCF+19, BCSS19, Nar20].

In this paper, we develop and study a new variant of trace reconstruction that we call Circular Trace Reconstruction. In this variant, there is again an unknown string $x \in \{0,1\}^n$ that we can sample traces from, but this time, the string $x$ is a cyclic string, meaning that there is no beginning or end to the string. Equivalently, one can imagine a linear string that undergoes a random cyclic shift before a trace is returned. See Figure 1 for an example. Our goal, like in the normal trace reconstruction, is to reconstruct the original circular string using as few random traces as possible.

Figure 1: An example of a circular trace. We start with an unknown circular string (top left). Each bit of the string is randomly deleted (red bits are deleted, black bits are retained) and the order of the retained bits is preserved, so we are left with the smaller circular string. However, since there is no beginning or end of the circular string, we assume the string is seen in clockwise order starting from a randomly chosen bit.

Perhaps the first natural question about circular trace reconstruction is the following: how does the sample complexity of circular trace reconstruction compare to the sample complexity of standard (linear) trace reconstruction? Intuitively, one should expect circular trace reconstruction to be at least as difficult as standard trace reconstruction, since given any trace of a linear string, we can randomly rotate it to get a trace of the corresponding circular string. This reasoning, however, is slightly flawed. For instance, if we wish to distinguish between two strings $x$ and $y$ which are different as linear strings but equivalent up to a cyclic shift, then one cannot distinguish between traces of random rotations of $x$ and traces of random rotations of $y$. However, by padding the trace with extra bits before randomly rotating, one can show that circular trace reconstruction is at least as hard as linear trace reconstruction in both the worst-case and average-case.
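To make the circular deletion channel concrete, here is a minimal Python sketch of the sampling process described above: the string is rotated by a uniformly random cyclic shift, and each bit is then deleted independently with probability $q$. The function names `circular_trace` and `canonical_rotation` are illustrative choices, not notation from this paper, and the lexicographically smallest rotation is just one convenient representative of a circular string.

```python
import random


def circular_trace(x, q, rng):
    """Sample one trace of the circular string x (a non-empty list of bits).

    Assumption of this sketch: the trace is produced by rotating x by a
    uniformly random cyclic shift and then sending it through a deletion
    channel that drops each bit independently with probability q.
    """
    n = len(x)
    shift = rng.randrange(n)
    rotated = x[shift:] + x[:shift]
    # Keep a bit with probability 1 - q.
    return [b for b in rotated if rng.random() >= q]


def canonical_rotation(x):
    """Return the lexicographically smallest rotation of x.

    Two linear strings represent the same circular string exactly when
    their canonical rotations agree.
    """
    x = list(x)
    if not x:
        return []
    return min(x[i:] + x[:i] for i in range(len(x)))


if __name__ == "__main__":
    rng = random.Random(2020)
    x = [1, 0, 0, 1, 1, 0, 1, 0]
    for _ in range(3):
        print(circular_trace(x, 0.3, rng))
```

For instance, the linear strings 0011 and 0110 have the same canonical rotation, so they are the same circular string and no number of circular traces can tell them apart.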
Indeed, we have the following proposition; as its proof is quite simple, we defer it to Appendix A.
Proposition 1.1.
Suppose that we can solve worst-case circular trace reconstruction over length $m$ strings with deletion probability $q$ using $T(m, q)$ traces. Then, we can solve worst-case linear trace reconstruction over length $n$ strings with deletion probability $q$ using $\min_{m \ge n} T(m, q)$ traces. Likewise, suppose that we can solve average-case circular trace reconstruction over length $m$ strings with deletion probability $q$ using $T(m, q)$ traces. Then, we can solve average-case linear trace reconstruction over length $n$ strings with deletion probability $q$ using $\min_{m \ge n} T(m, q)$ traces.

Given Proposition 1.1, any upper bounds for circular trace reconstruction imply nearly equivalent upper bounds for linear trace reconstruction, and any lower bounds for linear trace reconstruction imply nearly equivalent lower bounds for circular trace reconstruction. This raises two natural questions. First, can we match or nearly match the best linear trace reconstruction upper bounds for circular trace reconstruction? Second, can we beat the best linear trace reconstruction lower bounds for circular trace reconstruction?

The first main result we prove is for worst-case circular strings. The best known upper bound for worst-case linear trace reconstruction with deletion probability $q$, where $q$ is a fixed constant between 0 and 1, is $\exp(O(n^{1/3}))$, where the unknown string has length $n$ [DOS19, NP17]. Our first main result, proven in Section 3, provides a nearly matching upper bound for the circular trace reconstruction problem, but only if the length $n$ has at most 2 prime factors.
Theorem 1.2.
Let $x$ be an unknown, arbitrary circular string of length $n$, let $q$ be the deletion probability of each element in the string, and let $p = 1 - q$ be the retention probability. Then, if $n$ is either a prime or a product of two (possibly equal) primes, using $\exp\left(O\left(n^{1/3} (\log n)^{2/3} p^{-2/3}\right)\right)$ random traces, we can determine $x$ with failure probability at most $2^{-n}$.

The primary reason why our theorem does not cover $n$ with 3 or more prime factors is that our algorithm relies crucially on the following number theoretic result, which holds exactly when $n$ has at most 2 prime factors.
Theorem 1.3.
For any fixed integer $n \ge 2$, the following statement is true if and only if $n$ has at most 2 prime factors, counting multiplicity. Define $\omega := e^{2\pi i/n}$, and suppose that $a_0, \ldots, a_{n-1}, b_0, \ldots, b_{n-1}$ are all integers in $\{0, 1\}$. Also, suppose that for all $0 \le k \le n-1$, there is some integer $c_k$ such that $\sum_i a_i \omega^{i \cdot k} = \omega^{c_k} \cdot \sum_i b_i \omega^{i \cdot k}$. Then, the sequences $\{a_i\}$ and $\{b_i\}$ are cyclic shifts of each other.

The next main result we prove is for average-case circular strings: we show that a random circular string can be recovered using a polynomial number of traces. Formally, we prove the following theorem in Section 4.
Theorem 1.4.
Let $x$ be an unknown but randomly chosen circular string of length $n$ and let $0 < q < 1$ be the deletion probability of each element. Then, there exists a constant $C_q$ depending only on $q$ such that we can determine $x$ with failure probability polynomially small in $n$ using $O(n^{C_q})$ traces.

The main lemma we need to prove Theorem 1.4 is actually a result that is true for worst-case strings. Specifically, we show how to recover the multiset of all consecutive substrings of length $O(\log n)$ using a polynomial number of traces. While this does not guarantee that we can recover an arbitrary circular string, it does allow us to recover what we will call regular strings, which we show comprise the majority of circular strings. The following lemma may be of independent interest for studying worst-case strings as well, as it allows one to gain information about all "consecutive chunks" of the unknown string using only a polynomial number of traces.
Lemma 1.5.
Let $x = x_1 \cdots x_n$ be an arbitrary circular string of length $n$ and let $0 < q < 1$ be the deletion probability of each element. Then, for $k = 100 \log n$, we can recover the multiset of all substrings $\{x_i x_{i+1} \cdots x_{i+k-1}\}_{i=1}^{n}$, where indices are modulo $n$, using $O(n^{C_q})$ traces with failure probability polynomially small in $n$, where $C_q$ is a constant that only depends on $q$.

The best known upper bound for average-case linear trace reconstruction is only $\exp\left(O((\log n)^{1/3})\right)$ [HPP18]. Unfortunately, we were not able to adapt their argument to circular strings. One major reason is that in the argument of [HPP18] (as well as [PZ17], which provides an $\exp\left(O((\log n)^{1/2})\right)$-sample algorithm), the authors recover the $(k+1)$st bit of the string assuming the first $k$ bits are known using a small number of traces, and by reusing traces, they inductively recover the full string. Since we are dealing with circular strings, however, even recovering the "first" bit does not make much sense. We note that even a polynomial-sample algorithm is quite nontrivial. In the linear case, a polynomial-sample algorithm for average-case strings was first proven by [HMPW08], and their algorithm only worked as long as the deletion probability $q$ was at most some small constant, which when optimized is only about 0.07 [PZ17].

Our final main result regards lower bounds for worst-case strings. For linear worst-case strings, the best known lower bound for trace reconstruction is $\tilde{\Omega}(n^{3/2})$ [Cha19]. For circular trace reconstruction, we show an improved lower bound of $\tilde{\Omega}(n^3)$, although the proof of our lower bound is actually much simpler and cleaner than those of the known lower bounds for standard trace reconstruction [Cha19, HL20]. Specifically, we prove the following theorem in Section 5:
Theorem 1.6.
Let $n$ be a positive integer, let $2 \le k \le 4$, and let $x$ be the string
$$x = 1\,0^{n}\,1\,0^{n+1}\,1\,0^{n+k},$$
where $0^{j}$ denotes a run of $j$ zeros. Likewise, let $y$ be the string $y = 1\,0^{n}\,1\,0^{n+k}\,1\,0^{n+1}$. Then, the strings $x, y$ are not equivalent up to cyclic rotations, but for any constant deletion probability $q$, one requires $\Omega(n^3 / \mathrm{polylog}(n))$ random traces to distinguish between the original string being $x$ or $y$. Thus, for all integers $n$, worst-case circular trace reconstruction requires at least $\tilde{\Omega}(n^3)$ random traces.

We note that a very similar statement to Lemma 1.5, but for linear strings, was proven in independent concurrent work by Chen et al. [CDL+20, Theorem 2], which provides a polynomial-sample algorithm for a "smoothed" variant of worst-case linear trace reconstruction. Many ideas in our proof of Lemma 1.5 and their proof appear to overlap, though our proof is substantially shorter.
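The object recovered in Lemma 1.5, the multiset of length-$k$ cyclic windows of $x$, and the way such windows determine a sufficiently non-repetitive string can be illustrated with a small Python sketch. This is not the trace-based algorithm of this paper: it computes the windows directly from a known string and then glues them back together, under the illustrative assumption that no two distinct windows share the same first $k-1$ symbols. The helper names `cyclic_windows` and `glue` are hypothetical.

```python
from collections import Counter


def cyclic_windows(x, k):
    """Multiset of the n length-k windows x_i ... x_{i+k-1}, indices mod n."""
    n = len(x)
    return Counter(tuple(x[(i + j) % n] for j in range(k)) for i in range(n))


def glue(windows, n):
    """Rebuild a circular string of length n (n >= k) from its length-k windows.

    Assumption of this sketch: distinct windows have distinct (k-1)-prefixes,
    so each window has exactly one possible successor. A ValueError is raised
    when two distinct windows share a prefix, since gluing is then ambiguous.
    """
    by_prefix = {}
    for w in windows:
        if w[:-1] in by_prefix:
            raise ValueError("two windows share a (k-1)-prefix; gluing is ambiguous")
        by_prefix[w[:-1]] = w
    current = next(iter(windows))  # start from an arbitrary window
    out = list(current)
    while len(out) < n:
        # The next window must begin with the last k-1 symbols of the current one.
        current = by_prefix[current[1:]]
        out.append(current[-1])
    return out[:n]


if __name__ == "__main__":
    x = [0, 0, 0, 1, 0, 1, 1, 1]
    print(glue(cyclic_windows(x, 4), len(x)))
```

The output is only determined up to rotation, which is all one can ask for with a circular string.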
From a theoretical perspective, circular trace reconstruction can bring many novel insights to the theory of reconstruction algorithms, some of which may be useful even in the standard trace reconstruction problem. For instance, the proof of Theorem 1.2 combines analytic, statistical, and combinatorial approaches as in previous trace reconstruction papers, but now also uses ideas from number theory and results about cyclotomic integers. To the best of our knowledge, this paper is the first paper on trace reconstruction that utilizes number theoretic ideas, though there is work on other problems about cyclic strings that uses ideas from number theory. Also, Lemma 1.5 shows a way to recover all contiguous sequences in the original string of length $O(\log n)$ for arbitrary circular strings, which is a new result even in the linear case (concurrent with [CDL+20]).

[...] 11, pp. 313, 397, 516-517], or [Wik]). Therefore, understanding circular trace reconstruction could prove useful in reconstructing ancestral sequences for mitochondrial or bacterial DNA. Another problem in computational biology that trace reconstruction may be applicable to is the DNA Data Storage problem, where data is stored in DNA and can be recovered through sequencing, though the stored DNA may mutate over time [CGK12, OAC+ [...] 19, BCSS19, Nar20], where the goal is to recover an unknown mixture of $\ell$ strings from random traces. Indeed, receiving traces from a circular string is equivalent to receiving traces from a uniform mixture of a linear string along with all of its cyclic shifts, so circular trace reconstruction can be thought of as an instance of population recovery from the deletion channel with mixture size $\ell = n$.

Unfortunately, the best known algorithm for population recovery over worst-case strings requires $\exp\left(\tilde{O}(n^{1/3}) \cdot \ell^2\right)$ traces [Nar20], which is not useful if $\ell = n$. However, to prove our worst-case upper bound, we will use ideas based on [DOS19, NP17, Nar20] to estimate certain polynomials that depend on the unknown circular string $x$. For the average case problem, i.e., if given a mixture over $\ell$ random strings, population recovery can be done with $\mathrm{poly}\left(\ell, \exp\left((\log n)^{1/3}\right)\right)$ random traces. While this seemingly implies a $\mathrm{poly}(n)$-sample algorithm for average-case circular trace reconstruction, the $n$ cyclic shifts of the circular string are quite similar to each other and thus do not behave like a collection of $n$ independent random strings. Indeed, our techniques for average-case circular trace reconstruction are very different from those developed in [BCSS19].

While circular strings have not been studied before in the context of trace reconstruction, people have studied circular strings and cyclic shifts in the context of edit distance [Mae90, AGMP13], multi-reference alignment [BCSZ19, BNWR19, PWB+ [...]

In this subsection, we highlight some of the ideas used in Theorems 1.2, 1.4, and 1.6.

The proof of Theorem 1.2 is partially based on ideas from [DOS19, NP17, Nar20]. In [DOS19, NP17], the authors consider two strings $x, y \in \{0,1\}^n$ and show how to distinguish between random traces of $x$ and random traces of $y$.
To do so, they construct an unbiased estimator for $P(z; x) := \sum_i x_i z^i$ (or $P(z; y) = \sum_i y_i z^i$) solely based on the random trace of either $x$ or $y$, for some $z \in \mathbb{C}$. By showing that the unbiased estimator is never "too" large and that $P(z; x)$ and $P(z; y)$ differ enough for an appropriate choice of $z$, they can estimate this quantity using many random traces to distinguish between $x$ and $y$. Unfortunately, in our case, applying the same estimator will give us an unbiased estimator for $P'(z; x) := \mathbb{E}_i[P(z; x^{(i)})]$, where $x^{(i)}$ is the $i$th cyclic shift of $x$: it turns out that $P'(z; x) = P'(z; y)$ as polynomials in $z$ whenever $x, y$ have the same number of 1's. Our goal will then be to establish some other polynomial $Q(z; x)$ such that we can construct a good unbiased estimator, but at the same time $Q'(z; x) := \mathbb{E}_i[Q(z; x^{(i)})]$ and $Q'(z; y) := \mathbb{E}_i[Q(z; y^{(i)})]$ are distinct polynomials for any distinct cyclic strings $x, y$. We show that the polynomial $Q(z; x) := z^{kn} P(z; x)^k P(z^{-k}; x)$ will do the job, for some small integer $k$. We provide a (significantly more complicated) unbiased estimator of $Q(z; x)$ using a random trace: the construction is similar to that of [Nar20], which shows how to estimate $P(z; x)^k$ for some integer $k$. To show that $Q'(z; x) \ne Q'(z; y)$ as polynomials, we first show that $P(z; x)^k P(z^{-k}; x)$ has the special property that if $z$ is an $n$th root of unity, this polynomial is in fact invariant under cyclic shifts. Thus, it suffices to show that if $x, y \in \{0,1\}^n$ are not cyclic shifts of each other, there is some $n$th root of unity $\omega$ such that $P(\omega; x)^k P(\omega^{-k}; x) \ne P(\omega; y)^k P(\omega^{-k}; y)$.
This will require significant number theoretic computation, and will be true as long as $n$ is a prime or a product of two primes.

The bulk of the proof of Theorem 1.4 will be proving Lemma 1.5, which reconstructs all consecutive substrings of length $100 \log n$ in the unknown circular string $x$. For a random string $x$, these substrings will all be sufficiently different, so once we know the substrings, we can reconstruct the full string because there is only one way to "glue" together the substrings. Therefore, we focus on explaining the ideas for Lemma 1.5. Our goal will be to determine how many times a string $s$ appears consecutively in $x$ for each string $s$ of length $100 \log n$. For an unknown string $x$ and $i$ between $0$ and $n - 100 \log n$, we let $c_i$ be the number of times $s$ appears in some contiguous block of length $i + 100 \log n$ in $x$. Then, a basic enumerative argument shows that for a random (cyclically shifted) trace $\tilde{x} = \tilde{x}_1 \tilde{x}_2 \cdots \tilde{x}_m$, the probability that $\tilde{x}_1 \cdots \tilde{x}_{100 \log n}$ equals $s$ can be written as $\sum_{i \ge 0} c_i (1-q)^{100 \log n} q^i$, and we wish to recover $c_0$. The $(1-q)^{100 \log n}$ term is a constant that equals $1/\mathrm{poly}(n)$, so it is easy to recover an approximation to $\sum_{i \ge 0} c_i q^i$. We truncate this polynomial at an appropriate degree (approximately $C \log n$ for some large $C$) and show that the truncated polynomial $\sum_{i=0}^{C \log n} c_i x^i$ is very close to the original polynomial, but differs from $\sum_{i=0}^{C \log n} c'_i x^i$ for some $x \in [q, (1-q)/2]$ by a significant amount if $c'_0 \ne c_0$, using ideas based on [BEK99]. We can also simulate a trace with deletion probability $x > q$ by taking a "trace of the trace." This will be sufficient in determining $c_0$, and therefore, the multiset of all consecutive substrings of length $100 \log n$.

The proof of Theorem 1.6 proceeds by showing that the laws of the traces of $x = 1\,0^{n}\,1\,0^{n+1}\,1\,0^{n+k}$ and $y = 1\,0^{n}\,1\,0^{n+k}\,1\,0^{n+1}$ are close to each other in the sense of Hellinger distance, and concluding by a lemma in [HL20] that was used in a similar fashion to show a lower bound for linear trace reconstruction. It is first shown that, conditioned on a 1 being deleted, a trace from $x$ has the same distribution as a trace from $y$. Then explicit expressions for the probabilities that the trace is $1\,0^{a}\,1\,0^{b}\,1\,0^{c}$ are computed and compared, yielding an upper bound on the Hellinger distance. The difference between the probabilities for $x$ and $y$ is proportional to the product of $(a-b)(b-c)(a-c)$ and a symmetric polynomial in $a, b, c$. Both $x$ and $y$ consist of three 1's separated by runs of 0's of approximate length $n$, so with high probability we have that $a, b, c$ are approximately $np$, with square root fluctuations. The contribution of the $(a-b)(b-c)(a-c)$ term allows us to recover a $\tilde{\Omega}(n^3)$ bound.

First, we explain a basic definition we will use involving complex numbers.
Definition 2.1.
For $z \in \mathbb{C}$, let $|z|$ be the magnitude of $z$, and if $z \ne 0$, let $\arg z$ be the argument of $z$, which is the value of $\theta \in (-\pi, \pi]$ such that $\frac{z}{|z|} = e^{i\theta}$.

Next, we state a Littlewood-type result about bounding polynomials on arcs of the unit circle.

Theorem 2.2. [BE97] Let $f(z) = \sum_{j=0}^{n} a_j z^j$ be a nonzero polynomial of degree $n$ with complex coefficients. Suppose there is some positive integer $M$ such that $|a_0| \ge 1$ and $|a_j| \le M$ for all $0 \le j \le n$. Then, if $A$ is an arc of the unit circle $\{z \in \mathbb{C} : |z| = 1\}$ with length $0 < a < 2\pi$, there exists some absolute constant $c_1 > 0$ such that
$$\sup_{z \in A} |f(z)| \ge \exp\left(-\frac{c_1 (1 + \log M)}{a}\right).$$

Next, we state two well known results about roots of unity in cyclotomic fields.

Lemma 2.3. [Mar77] Let $\omega = e^{2\pi i/n}$. Then, the elements of the set $\{\omega^k\}$ for $k \in \mathbb{Z}$, $\gcd(k, n) = 1$, are all Galois conjugates. This means that if $P(x)$ is an integer polynomial, then $P(\omega^k) = 0$ if and only if $P(\omega) = 0$ for any $k \in \mathbb{Z}$ with $\gcd(k, n) = 1$. Moreover, $P(\omega) = 0$ if and only if $P$ is a multiple of the $n$th cyclotomic polynomial.

Lemma 2.4. [Mar77] Let $\omega = e^{2\pi i/n}$ be an $n$th root of unity, and let $\mathbb{Q}[\omega]$ be the $n$th cyclotomic field. Then, if $z \in \mathbb{Q}[\omega]$ is such that $z^r = 1$ for some integer $r \ge 1$, $z$ must equal $\omega^k$ or $-\omega^k$ for some integer $k$.

Finally, we define the Hellinger distance between two probability measures and state a folklore bound on distinguishing between distributions based on samples in terms of the Hellinger distance.
Definition 2.5.
Let $\mu$ and $\nu$ be discrete probability measures over some set $\Omega$. In other words, for $x \in \Omega$, $\mu(x)$ is the probability of selecting $x$ when drawing from the measure $\mu$. Then, the Hellinger distance is defined as
$$d_H(\mu, \nu) = \left(\sum_{x \in \Omega} \left(\sqrt{\mu(x)} - \sqrt{\nu(x)}\right)^2\right)^{1/2}.$$
The following proposition is quite well-known (see for instance, [HL20, Lemma A.5]).
Proposition 2.6. If $\mu, \nu$ are discrete probability measures, then given i.i.d. samples from either $\mu$ or $\nu$, one must see at least $\Omega(d_H(\mu, \nu)^{-2})$ i.i.d. samples to determine whether the distribution is $\mu$ or $\nu$ with a constant success probability bounded away from $1/2$.

In this section, we prove Theorem 1.2, i.e., we provide an $\exp\left(\tilde{O}(n^{1/3})\right)$-sample algorithm for circular trace reconstruction when the length $n$ is a prime or product of two primes.

For a (linear) string $x \in \{0,1\}^n$ and $z \in \mathbb{C}$, we define $P(z; x) := \sum_{i=1}^{n} x_i z^i$. The first lemma we require creates an unbiased estimator for $\prod_{i=1}^{m} P(z_i; x)$ for some complex numbers $z_1, \ldots, z_m$, using only random traces of $x$. The proof of the following lemma greatly resembles the proof of [Nar20, Lemma 4.1], so we defer the proof to Appendix A.
Lemma 3.1.
Let $x$ be a linear string of length $n$. Fix $q$ as the deletion probability and $p = 1 - q$ as the retention probability. Then, for any integer $m \ge 1$ and any $Z = (z_1, \ldots, z_m)$ for $z_1, \ldots, z_m \in \mathbb{C}$, there exists some function $g_m(\tilde{x}, Z)$ such that
$$\mathbb{E}_{\tilde{x}}[g_m(\tilde{x}, Z)] = \prod_{k=1}^{m} \left(\sum_{i=1}^{n} x_i z_k^i\right),$$
where the expectation is over traces drawn from $x$. Moreover, for any $L \ge 1$, and for all $\tilde{x} \in \{0,1\}^n$ and all $Z$ such that $|z_1|, \ldots, |z_m| = 1$ and $|\arg z_i| \le \frac{1}{L}$ for all $1 \le i \le m$,
$$|g_m(\tilde{x}, Z)| \le (p^{-1} m n)^{O(m)} \cdot e^{O(m^2 n/(p^2 L^2))}.$$

For $x \in \{0,1\}^n$ and $z \in \mathbb{C}$, let $P(z; x) := \sum_{i=1}^{n} x_i z^i$. Our main goal will be to determine the value of $f_t(z; x) := P(z; x)^t \cdot P(z^{-t}; x)$ for some integer $t$, where $z$ is an $n$th root of unity. Importantly, we note that $f_t(z; x)$ is invariant under rotations of $x$, since for $z = e^{2\pi i k/n}$,
$$\sum_{i=1}^{n} x_{(i+1) \bmod n}\, z^i = \sum_i x_i z^{i-1} = P(z; x) \cdot z^{-1},$$
whereas
$$\sum_{i=1}^{n} x_{(i+1) \bmod n}\, z^{-t \cdot i} = \sum_i x_i z^{-t(i-1)} = P(z^{-t}; x) \cdot z^{t}.$$
Therefore, if we define $x^{(j)}$ as the string $x$ rotated by $j$ places (so $x^{(j)}_i = x_{(i+j) \bmod n}$), then $f_t(z; x) = f_t(z; x^{(j)})$ for all $z = e^{2\pi i k/n}$ and $0 \le j \le n-1$.

Now, choose some $z$ with $|z| = 1$ and $|\arg z| \le \frac{1}{L}$. Also, fix some integer $t$, let $m = t + 1$, and let $Z = (z, \ldots, z, z^{-t})$, where $z$ is repeated $t$ times. Then, if $j$ is randomly chosen in $\{0, 1, \ldots, n-1\}$ and $\tilde{x}$ is a random trace,
$$\mathbb{E}_{\tilde{x}}[n z^{tn} \cdot g_m(\tilde{x}, Z)] = (n \cdot z^{tn}) \cdot \frac{1}{n} \cdot \sum_{j=0}^{n-1} P(z; x^{(j)})^t \cdot P(z^{-t}; x^{(j)}) = \sum_{j=0}^{n-1} z^{tn} \cdot P(z; x^{(j)})^t \cdot P(z^{-t}; x^{(j)}).$$
Note that $\sum_{j=0}^{n-1} z^{tn} \cdot P(z; x^{(j)})^t \cdot P(z^{-t}; x^{(j)})$ is a polynomial in $z$ of degree at most $(t+1)n$ with all coefficients bounded by $n^{t+1}$. We write this polynomial as $Q_t(z; x)$.

Thus, if we define $h_t(\tilde{x}, z) := n z^{tn} g_m(\tilde{x}, Z)$, we have that $\mathbb{E}_{\tilde{x}}[h_t(\tilde{x}, z)] = Q_t(z; x)$ for $\tilde{x}$ a trace of a randomly shifted $x$, and that
$$|h_t(\tilde{x}, z)| \le (p^{-1} t n)^{O(t)} \cdot e^{O(t^2 n/(p^2 L^2))}$$
whenever $|z| = 1$ and $|\arg z| \le \frac{1}{L}$ for $L \ge 1$, since $m = t + 1$. Now, we will state two important results that will lead to the proof of the main result.
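The rotation invariance of $f_t(z; x) = P(z; x)^t \cdot P(z^{-t}; x)$ at $n$th roots of unity, derived above, can be checked numerically. The sketch below is only a floating-point sanity check of that identity; the helper names `P`, `f`, and `rotate` are chosen here for illustration, and the example string and tolerance are arbitrary assumptions.

```python
import cmath


def P(z, x):
    """P(z; x) = sum_{i=1}^{n} x_i z^i for a list of bits x = [x_1, ..., x_n]."""
    return sum(b * z ** i for i, b in enumerate(x, start=1))


def f(t, z, x):
    """f_t(z; x) = P(z; x)^t * P(z^{-t}; x)."""
    return P(z, x) ** t * P(z ** (-t), x)


def rotate(x, j):
    """The string x rotated by j places, i.e. x^{(j)}_i = x_{(i+j) mod n}."""
    return x[j:] + x[:j]


if __name__ == "__main__":
    x = [1, 0, 1, 1, 0, 0, 1]
    n, t = len(x), 2
    worst = 0.0
    for k in range(n):
        z = cmath.exp(2j * cmath.pi * k / n)  # an n-th root of unity
        for j in range(n):
            worst = max(worst, abs(f(t, z, rotate(x, j)) - f(t, z, x)))
    # Largest deviation over all roots of unity and all rotations; it should be
    # at the level of floating-point rounding error.
    print(worst)
```

By contrast, $P(z; x)$ on its own is not rotation invariant: rotating $x$ by one place multiplies it by $z^{-1}$, which is exactly the factor that the $P(z^{-t}; x)$ term cancels.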
Lemma 3.2.
Let $n \ge 2$, and suppose that $x, x'$ are strings in $\{0,1\}^n$ such that $Q_t(z; x) \ne Q_t(z; x')$ as polynomials in $z$. Then, there is a uniform constant $c_2$ such that for any $L \ge 1$, there exists $z$ such that $|z| = 1$, $|\arg z| \le \frac{1}{L}$, and $|Q_t(z; x) - Q_t(z; x')| \ge n^{-c_2 t L}$.
Proof.
Note that $Q_t(z; x) - Q_t(z; x')$ is a nonzero polynomial in $z$ of degree at most $(t+1)n$ with integer coefficients, all bounded in magnitude by $2n^{t+1}$. Since the coefficients are integers, after dividing by a power of $z$ (which does not change the magnitude on the unit circle) the constant coefficient has magnitude at least 1. Therefore, by Theorem 2.2,
$$\sup_{|z| = 1,\ |\arg z| \le 1/L} |Q_t(z; x) - Q_t(z; x')| \ge \exp\left(-\frac{c_1 (1 + \log(2n^{t+1}))}{2/L}\right) \ge \exp(-c_2 \cdot L \cdot t \cdot \log n) = n^{-c_2 t L},$$
where we note that the arc $\{z : |z| = 1, |\arg z| \le \frac{1}{L}\}$ has length $\frac{2}{L}$.

The next important result we need will be Theorem 1.3. We defer the full proof of Theorem 1.3 to Subsection A.3, but as the proof of the case where $n$ is prime is simpler, we prove this special case here. Using this, we can get an $\exp\left(\tilde{O}(n^{1/3})\right)$ sample upper bound at least for $n$ prime.
Proposition 3.3.
Suppose that $n = p$ is prime, and $a_0, \ldots, a_{n-1}, b_0, \ldots, b_{n-1} \in \{0, 1\}$ are such that for all $0 \le k < p$, there is some integer $c_k$ such that $\sum_{i=0}^{p-1} a_i \omega^{i \cdot k} = \omega^{c_k} \cdot \sum_{i=0}^{p-1} b_i \omega^{i \cdot k}$. Then, the sequences $\{a_0, \ldots, a_{n-1}\}$ and $\{b_0, \ldots, b_{n-1}\}$ are equivalent up to a cyclic permutation.

Proof. First, taking $k = 0$, we have $\sum_{i=0}^{p-1} a_i = \omega^{c_0} \cdot \sum_{i=0}^{p-1} b_i$. Since $\sum_{i=0}^{p-1} a_i$ and $\sum_{i=0}^{p-1} b_i$ are nonnegative and $\omega^{c_0}$ is a root of unity, we must have that $\sum_{i=0}^{p-1} a_i = \sum_{i=0}^{p-1} b_i$. In the case $p = 2$, this alone proves the proposition, so we now assume $p$ is odd.

Now, taking $k = 1$, we have that $\sum_{i=0}^{p-1} a_i \omega^i = \omega^{c_1} \cdot \sum_{i=0}^{p-1} b_i \omega^i$. Letting $b'_i = b_{(i - c_1) \bmod p}$, we have that $b'$ is a cyclic shift of $b$, and $\sum_{i=0}^{p-1} a_i = \sum_{i=0}^{p-1} b'_i$ and $\sum_{i=0}^{p-1} a_i \omega^i = \sum_{i=0}^{p-1} b'_i \omega^i$. Letting $Q(x) = \sum_{i=0}^{p-1} (a_i - b'_i) x^i$, we have that $\omega$ and $1$ are both roots of $Q(x)$. Since $Q(x)$ has integer coefficients, this implies that all Galois conjugates of $\omega$ are roots, so $1, \omega, \omega^2, \ldots, \omega^{p-1}$ are roots of $Q(x)$. Thus, $x^p - 1$ divides $Q(x)$. But since $Q(x)$ has degree at most $p - 1$, $Q(x)$ must equal 0, so $a_i = b'_i$ for all $i$. Since the sequence $b'$ is just a shift of $b$, we are done.

Finally, we are ready to prove Theorem 1.2.
Proof of Theorem 1.2.
Let $L = \Theta(n^{1/3} (\log n)^{-1/3} p^{-2/3})$, and suppose that we are trying to distinguish between the original circular string being $a = a_1 a_2 \cdots a_n$ or $b = b_1 b_2 \cdots b_n$, where $a, b$ are distinct, even up to cyclic shifts. First, we claim that for some $0 \le \ell \le n - 1$, some $2 \le t \le 5$, and $z = \omega^{\ell}$, we have that $P(z; a)^t P(z^{-t}; a) \ne P(z; b)^t P(z^{-t}; b)$, where we recall that $\omega := e^{2\pi i/n}$.

To prove this, first choose $k$ such that $\sum_{i=1}^{n} a_i \omega^{i \cdot k} \ne \omega^{c_k} \cdot \sum_{i=1}^{n} b_i \omega^{i \cdot k}$ for all integers $c_k$, which exists by Theorem 1.3. If $k = 0$, then $P(\omega^k; a) = P(1; a)$ and $P(\omega^k; b) = P(1; b)$ are distinct nonnegative integers, so we trivially have $P(1; a)^t P(1; a) \ne P(1; b)^t P(1; b)$. Otherwise, let $t$ be the smallest prime that does not divide $\frac{n}{\gcd(n, k)}$ (so $t \le 5$ as $n$ has at most 2 prime factors). If $\sum_{i=1}^{n} a_i \omega^{i \cdot k} = 0$, then $\sum_{i=1}^{n} b_i \omega^{i \cdot k} \ne 0$. Now, since $\omega^{-tk}$ is a Galois conjugate of $\omega^k$ (since $t \nmid \frac{n}{\gcd(n,k)}$), we also have that $\sum_{i=1}^{n} b_i \omega^{-t i \cdot k} \ne 0$. This means that $P(\omega^k; a) = 0$, so $P(\omega^k; a)^t P((\omega^k)^{-t}; a) = 0$, but $P(\omega^k; b)^t P((\omega^k)^{-t}; b) \ne 0$. Likewise, if $\sum_{i=1}^{n} b_i \omega^{i \cdot k} = 0$, we will have $P(\omega^k; a)^t P((\omega^k)^{-t}; a) \ne 0$, but $P(\omega^k; b)^t P((\omega^k)^{-t}; b) = 0$.

Otherwise, $P(\omega^k; a) = \sum_{i=1}^{n} a_i \omega^{i \cdot k}$ and $P(\omega^k; b) = \sum_{i=1}^{n} b_i \omega^{i \cdot k}$ are both nonzero. This means that for all $r \ge 0$, $P(\omega^{(-t)^r \cdot k}; a)$ and $P(\omega^{(-t)^r \cdot k}; b)$ are both nonzero, since $\omega^{(-t)^r \cdot k}$ and $\omega^k$ are Galois conjugates. This means that if $P(z; a)^t P(z^{-t}; a) = P(z; b)^t P(z^{-t}; b)$ for all $z = \omega^{(-t)^r \cdot k}$, then
$$\frac{P(\omega^{(-t)^{r+1} \cdot k}; a)}{P(\omega^{(-t)^r \cdot k}; a)^{-t}} = \frac{P(z^{-t}; a)}{P(z; a)^{-t}} = \frac{P(z^{-t}; b)}{P(z; b)^{-t}} = \frac{P(\omega^{(-t)^{r+1} \cdot k}; b)}{P(\omega^{(-t)^r \cdot k}; b)^{-t}}$$
for all $r \ge 0$, so we inductively have that
$$\frac{P(\omega^{(-t)^r \cdot k}; a)}{P(\omega^k; a)^{(-t)^r}} = \frac{P(\omega^{(-t)^r \cdot k}; b)}{P(\omega^k; b)^{(-t)^r}}.$$
Now, letting $r = \varphi\left(\frac{n}{\gcd(n,k)}\right)$, we know that $k \cdot (-t)^r \equiv k \pmod{n}$ by Euler's theorem, which means that $\omega^{(-t)^r \cdot k} = \omega^k$. Thus, $P(\omega^k; a)^{1 - (-t)^r} = P(\omega^k; b)^{1 - (-t)^r}$. Since $k \ne 0$, we have that $\frac{n}{\gcd(n,k)} > 1$, so $r \ge 1$. Thus, since $t \ge 2$, $1 - (-t)^r \ne 0$. Now, since $P(\omega^k; a), P(\omega^k; b)$ are nonzero, we have that $\frac{P(\omega^k; a)}{P(\omega^k; b)}$ is a $|1 - (-t)^r|$th root of unity. Also, $P(\omega^k; a), P(\omega^k; b) \in \mathbb{Q}[\omega]$, which means $\frac{P(\omega^k; a)}{P(\omega^k; b)} \in \mathbb{Q}[\omega]$. However, by Lemma 2.4, all roots of unity in $\mathbb{Q}[\omega]$ are of the form $\pm\omega^i$ for some $i$, and since $(-t)^r - 1$ is odd when $n$ is odd (since then $t = 2$), we must have that $\frac{P(\omega^k; a)}{P(\omega^k; b)} = \omega^{c_k}$ for some integer $c_k$. This contradicts the choice of $k$, so we must have that $P(z; a)^t P(z^{-t}; a) \ne P(z; b)^t P(z^{-t}; b)$ for some $z = \omega^{(-t)^r \cdot k}$, $r \ge 0$.

Recall that for $z$ an $n$th root of unity, $P(z; a)^t P(z^{-t}; a)$ is invariant under rotation of $a$, and $P(z; b)^t P(z^{-t}; b)$ is invariant under rotation of $b$. Thus, by our definition of $Q_t(z; x)$, we have that $Q_t(z; a) \ne Q_t(z; b)$ as polynomials. Thus, by Lemma 3.2, there is some $z$ such that $|z| = 1$, $|\arg z| \le \frac{1}{L}$, and
$$|Q_t(z; a) - Q_t(z; b)| \ge n^{-c_2 t L} \ge n^{-5 c_2 L}.$$
Therefore, for $L = \Theta(n^{1/3} (\log n)^{-1/3} p^{-2/3})$, there exists some $z$ with $|z| = 1$ and $|\arg z| \le \frac{1}{L}$ and some $2 \le t \le 5$ such that
$$|Q_t(z; a) - Q_t(z; b)| \ge n^{-5 c_2 L} \ge \exp\left(-c_3 \cdot n^{1/3} (\log n)^{2/3} p^{-2/3}\right)$$
but
$$|h_t(\tilde{x}, z)| \le (p^{-1} n)^{O(1)} \cdot \exp\left(O\left(\frac{n}{p^2 L^2}\right)\right) \le \exp\left(c_4 \cdot n^{1/3} (\log n)^{2/3} p^{-2/3}\right).$$
Therefore, by choosing $z$ and $t$ appropriately, taking $R = \exp\left(O\left(n^{1/3} (\log n)^{2/3} p^{-2/3}\right)\right)$ traces $\tilde{x}^{(1)}, \ldots, \tilde{x}^{(R)}$, and letting $\bar{h}_t(z)$ denote the average of $h_t(\tilde{x}^{(i)}, z)$ over all $i$, the Chernoff bound tells us that, except with probability exponentially small in $n$, $|\bar{h}_t(z) - Q_t(z; a)| < \frac{1}{2} \exp\left(-c_3 \cdot n^{1/3} (\log n)^{2/3} p^{-2/3}\right)$ if the original string were $a$, and $|\bar{h}_t(z) - Q_t(z; b)| < \frac{1}{2} \exp\left(-c_3 \cdot n^{1/3} (\log n)^{2/3} p^{-2/3}\right)$ if the original string were $b$. Thus, by returning $a$ if $\bar{h}_t(z)$ is closer to $Q_t(z; a)$ and returning $b$ otherwise, we can distinguish between the original string being $a$ or $b$ using $\exp\left(O\left(n^{1/3} (\log n)^{2/3} p^{-2/3}\right)\right)$ traces, with exponentially small failure probability.

Thus, to reconstruct the original string $x$, we simply run the distinguishing algorithm for all pairs $a, b \in \{0,1\}^n$ such that $a \ne b$ as circular strings, using the same $R$ traces $\tilde{x}^{(1)}, \ldots, \tilde{x}^{(R)}$. By a union bound over the at most $4^n$ pairs, with probability at least $1 - 2^{-n}$, the true string $x$ will be the only string such that the distinguishing algorithm will successfully choose $x$ over all other strings. Thus, for $n$ a prime or a product of two primes, the circular trace reconstruction problem can be solved using $\exp\left(O\left(n^{1/3} (\log n)^{2/3} p^{-2/3}\right)\right)$ traces.

We now consider the situation in which the unknown circular string $x$ is random. We will suppose that $x$ is distributed as a random circular string in which each bit is independently 0 or 1 with probability $1/2$. Note that this distribution is not uniform over all possible circular strings. However, our arguments can easily be modified to handle such a situation. We use the randomness to rule out certain problematic strings with high probability, and this can be done for uniform random circular strings as well as other distributions, for example if independently each bit is biased towards 0 or 1.
Theorem 4.1.
Let x be a random (in the sense described above) unknown circular string of length n and let q be the deletion probability of each element. Then there exists a constant C q dependingonly on q such that we can determine x with failure probability at most n − using O ( n C q ) traces. In what follows, we will let x = x · · · x n and take indices of bits in x modulo n . Let k = 100 log n .We first note that with high probability, all of the consecutive substrings of x of length k and k − regular strings. Indeed, the probability that x i · · · x i + k − = x j · · · x j + k − for i = j is 2 − k (where indices are taken modulo n ), and union boundingover all i, j as well as both k and k − O ( n − k ) ≪ n − .If we assume that x is regular, the length k consecutive substrings of x uniquely determine x .Indeed, given x i · · · x i + k − , we can uniquely determine x i + k as there is a unique length k consecutive10ubstring of x that begins with x i +1 · · · x i + k − . Iteratively applying this allows us to recover theentire string x . Thus, to prove Theorem 4.1, it suffices to prove Lemma 1.5, i.e., to determine howmany times each length k substring appears consecutively in x using O ( n C q ) traces, which willallow us to recover x if x is regular.We will show the existence of C q so that for any string s of length k , we can distinguish betweenstrings x and y correctly using O ( n C q ) samples with failure probability 10 − n , if the number ofconsecutive occurrences of s in x and in y differ, from which a union bound over all strings s of length k and all pairs of strings x, y of length n shows the result. Let α denote a sufficientlylarge constant only depending on q that we will determine later. For 0 ≤ i ≤ n − k , let c i denote the number of (not necessarily consecutive) occurrences of s in x contained in a consecutivesubstring of x of length at most i + k . 
Similarly, let $d_i$ denote the number of (not necessarily consecutive) occurrences of $s$ in $y$ contained in a consecutive substring of $y$ of length at most $i + k$. By assumption, we have that $c_0 \neq d_0$. By casework on the last bit of the occurrence of $s$, we have that $c_i, d_i \le n \binom{i+k}{k}$. Let $P(t) = \sum_{i=0}^{\alpha k} c_i t^i$ and $Q(t) = \sum_{i=0}^{\alpha k} d_i t^i$. Moreover, the following is true:

Lemma 4.2.
The probability that a trace of $x$ starts with $s$ (where a random bit in the string is chosen as the beginning before bits are deleted) is $\frac{1}{n}(1-q)^k P(q) + O(q^{\alpha k}(\alpha+1)^k e^k)$. Similarly, the probability that a trace of $y$ starts with $s$ is $\frac{1}{n}(1-q)^k Q(q) + O(q^{\alpha k}(\alpha+1)^k e^k)$.

Proof. To compute the probability that a trace of $x$ starts with $s$, we do casework on how many bits are deleted before the last bit in the occurrence of $s$. If $i$ bits are deleted, then there are $c_i$ ways for it to be done by definition. Each such way occurs with probability $\frac{1}{n}(1-q)^k q^i$. Indeed, for each way there is a $\frac{1}{n}$ probability that the correct starting bit is chosen, and the probability that only the bits corresponding to the specific instance of $s$ are kept is $(1-q)^k q^i$. It follows that the probability is exactly $\frac{1}{n}(1-q)^k \sum_{i=0}^{n-k} c_i q^i$.

It remains to show that $\frac{1}{n}(1-q)^k \sum_{i=\alpha k+1}^{n-k} c_i q^i = O(q^{\alpha k}(\alpha+1)^k e^k)$. As mentioned before, we have that $c_i \le n \binom{i+k}{k}$. Thus, this term is at most
$$\sum_{i > \alpha k} \binom{i+k}{k} q^i \le \binom{\alpha k+k}{k} q^{\alpha k} \sum_{i \ge 0} \left(\frac{q(\alpha+1)}{\alpha}\right)^i.$$
Indeed, the ratio of consecutive terms in the sequence $\binom{i+k}{k} q^i$ is equal to $q \cdot \frac{i+k}{i} \le \frac{q(\alpha+1)}{\alpha}$. For a sufficiently large choice of $\alpha$, $\frac{q(\alpha+1)}{\alpha} < 1$, so $\sum_{i > \alpha k} \binom{i+k}{k} q^i = O\left(\binom{\alpha k+k}{k} q^{\alpha k}\right) = O(q^{\alpha k}(\alpha+1)^k e^k)$ by Stirling's approximation. The argument for $y$ is analogous.

Lemma 4.2 allows us to estimate $P(q)$ and $Q(q)$ up to an $O(n(1-q)^{-k} q^{\alpha k}(\alpha+1)^k e^k)$ error by looking at how often traces of $x$ or $y$ begin with $s$, and then dividing by $\frac{1}{n}(1-q)^k$. So long as $P(q)$ and $Q(q)$ are sufficiently far apart, a Chernoff bound allows us to determine with high probability if the traces came from $x$ or $y$. However, it may be the case that $P(q)$ and $Q(q)$ are quite close. To remedy this, we observe that it is possible to simulate higher deletion probabilities $q' > q$. Indeed, this can be achieved by deleting each bit in traces received independently with probability $\frac{q'-q}{1-q}$. Thus, it suffices to find $q' \in [q, r]$ with $P(q')$ and $Q(q')$ far apart for some $q < r < 1$. The existence of such a $q'$ is proven by the following Littlewood-type result of Borwein, Erdélyi, and Kós.

Theorem 4.3 ([BEK99], Theorem 5.1). There exist absolute constants $c_1 > 0$ and $c_2 > 0$ such that if $f$ is a polynomial with coefficients in $[-1, 1]$ and $a \in (0, 1]$, then
$$|f(0)|^{c_1/a} \le \exp\left(\frac{c_2}{a}\right) \sup_{z \in [1-a, 1]} |f(z)|.$$

Proof of Theorem 4.1. Let $r = \frac{q+1}{2}$. We first apply Theorem 4.3 to $\binom{\alpha k+k}{k}^{-1}(P(rx) - Q(rx))$ and $a = 1 - q/r$. Here, we are using the fact that the coefficients of $P$ and $Q$ are bounded in magnitude by $\binom{\alpha k+k}{k}$ by previous observations, and that $|P(0) - Q(0)| \ge$
1. Theorem 4.3 tells us that (cid:18) αk + kk (cid:19) − c /a ≤ exp (cid:16) c a (cid:17) (cid:18) αk + kk (cid:19) − sup z ∈ [1 − a, | P ( rz ) − Q ( rz ) | = exp (cid:16) c a (cid:17) (cid:18) αk + kk (cid:19) − sup q ′ ∈ [ q,r ] | P ( q ′ ) − Q ( q ′ ) | , or sup q ′ ∈ [ q,r ] | P ( q ′ ) − Q ( q ′ ) | ≥ c (cid:18) αk + kk (cid:19) − c for some constants c and c that only depend on q .In particular, this is much larger than 10 k n (1 − r ) − k r αk ( α + 1) k e k for sufficiently large valuesof α ( α may depend on q ). Indeed, after taking k th roots and using Stirling’s approximation thisreduces to showing that ( e ( α + 1)) − c > n /k (1 − r ) − r α ( α + 1) e for sufficiently large α where c is some constant that only depends on q , which is clear (since 0 < r < n /k < q ′ ∈ [ q, r ] , the error term n (1 − q ′ ) k P n − kαk +1 c i ( q ′ ) i = O (( q ′ ) αk ( α + 1) k e k ) is at most10 − k times n (1 − q ′ ) k · sup q ′ ∈ [ q,r ] | P ( q ′ ) − Q ( q ′ ) | . Hence, for some q ′ ∈ [ q, r ], the probability that a trace begins with s under bit deletion withprobability q ′ differs between x and y by Ω(10 k n (1 − r ) − k r αk ( α +1) k e k ) = Ω( n − c ) for some constant c that only depends on q . By a standard Chernoff bound, for some constant C q only dependingon q , we can distinguish between x and y using O ( n C q ) traces with failure probability at mostexp( − Ω( n )), so the theorem follows. In this section, we prove Theorem 1.6 and demonstrate that worst-case circular trace reconstructionrequires ˜Ω( n ) traces. We first record the following lemma from [HL20] expressing the number ofindependent samples required to distinguish between two probability measures µ and ν in termsof their Hellinger distance d H ( µ, ν ), defined to be (cid:0)P x ∈ X ( µ ( { x } ) − ν ( { x } )) (cid:1) / where the sum isover all events in some discrete sample space X . 
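For concreteness, the Hellinger distance of two discrete distributions can be computed directly. The sketch below assumes the standard form $d_H(\mu, \nu) = \left(\sum_{x \in X} \left(\sqrt{\mu(\{x\})} - \sqrt{\nu(\{x\})}\right)^2\right)^{1/2}$, with no normalizing constant. The dictionary representation and the function name are illustrative assumptions, not part of the paper.

```python
import math


def hellinger_distance(mu, nu):
    """Hellinger distance between two discrete distributions.

    mu and nu map outcomes to probabilities; outcomes missing from a
    dictionary are treated as having probability 0.
    """
    support = set(mu) | set(nu)
    squared = sum(
        (math.sqrt(mu.get(x, 0.0)) - math.sqrt(nu.get(x, 0.0))) ** 2
        for x in support
    )
    return math.sqrt(squared)


# Toy usage: two slightly different biased coins.
coin_a = {"0": 0.5, "1": 0.5}
coin_b = {"0": 0.6, "1": 0.4}
print(hellinger_distance(coin_a, coin_b))
```

Under this normalization the distance ranges from 0 (identical distributions) to $\sqrt{2}$ (disjoint supports).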
Let d T V ( µ, ν ) denote the total variation distancebetween µ and ν and µ n denote the law of n independent samples from µ . Lemma 5.1 ([HL20], Lemma A.5) . If µ and ν are probability measures satisfying d H ( µ, ν ) ≤ / ,then for m ≥ / (4 d H ( µ, ν )) , we have that − d T V ( µ m , ν m ) ≥ ǫ if m ≤ log(1 /ǫ )9 d H ( µ,ν ) . Note that the number of samples m required to distinguish between µ and ν is given by thetotal variation distance between µ m and ν m . Thus, it requires Ω( d − H ( µ, ν )) samples to distinguishbetween two probability measures µ and ν . Proof of Theorem 1.6.
We now specialize to the case of distinguishing between x = 10 n n +1 n + k and y = 10 n n + k n +1 from independent traces. Let µ and ν respectively denote the laws oftraces from x and y . We will show that d H ( µ, ν ) = O (( n log n ) / ), which establishes the result byLemma 5.1. 12irst, we note that conditional on the first 1 in x being deleted, the resulting trace is equidis-tributed as a trace from y conditioned on the second 1 being deleted, as in both cases we obtain atrace from the circular string 10 n +1 n + k . Similar arguments for other cases show that conditionedon any 1 being deleted, traces from x and y are equal in law. Thus, the resulting string must havethree 1’s to contribute to the Hellinger distance. We will henceforth assume that the resulting traceis of the form 10 a b c for some nonnegative integers a, b, c .We now compute the ratio µ ( { a b c } ) ν ( { a b c } ) and show that it is typically 1 + O (( n/ log n ) / ). Wehave that µ ( { a b c } ) q n + k +1 − a − b − c (1 − q ) a + b + c = (cid:18) na (cid:19)(cid:18) n + 1 b (cid:19)(cid:18) n + kc (cid:19) + (cid:18) nb (cid:19)(cid:18) n + 1 c (cid:19)(cid:18) n + ka (cid:19) + (cid:18) nc (cid:19)(cid:18) n + 1 a (cid:19)(cid:18) n + kb (cid:19) ,ν ( { a b c } ) q n + k +1 − a − b − c (1 − q ) a + b + c = (cid:18) na (cid:19)(cid:18) n + kb (cid:19)(cid:18) n + 1 c (cid:19) + (cid:18) nb (cid:19)(cid:18) n + kc (cid:19)(cid:18) n + 1 a (cid:19) + (cid:18) nc (cid:19)(cid:18) n + ka (cid:19)(cid:18) n + 1 b (cid:19) . It follows that µ ( { a b c } ) ν ( { a b c } ) = n +1 − b )( n +1 − c ) ··· ( n + k − c ) + n +1 − c )( n +1 − a ) ··· ( n + k − a ) + n +1 − a )( n +1 − b ) ··· ( n + k − b )1( n +1 − c )( n +1 − b ) ··· ( n + k − b ) + n +1 − a )( n +1 − c ) ··· ( n + k − c ) + n +1 − b )( n +1 − a ) ··· ( n + k − a ) . 
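As a numerical sanity check on the two binomial expressions above, the following sketch evaluates them exactly for the strings $x = 1 0^n 1 0^{n+1} 1 0^{n+k}$ and $y = 1 0^n 1 0^{n+k} 1 0^{n+1}$. The common factor in $q$ and $1 - q$ cancels in the ratio, so only the sums of products of binomial coefficients are needed. The function names are hypothetical, and the code assumes $a, b, c \le n$ so that the denominator is nonzero.

```python
from fractions import Fraction
from math import comb


def mu_weight(n, k, a, b, c):
    """Sum of binomial products for the trace 1 0^a 1 0^b 1 0^c under x."""
    return (comb(n, a) * comb(n + 1, b) * comb(n + k, c)
            + comb(n, b) * comb(n + 1, c) * comb(n + k, a)
            + comb(n, c) * comb(n + 1, a) * comb(n + k, b))


def nu_weight(n, k, a, b, c):
    """Same quantity under y, where the gaps n+1 and n+k are swapped."""
    return (comb(n, a) * comb(n + k, b) * comb(n + 1, c)
            + comb(n, b) * comb(n + k, c) * comb(n + 1, a)
            + comb(n, c) * comb(n + k, a) * comb(n + 1, b))


def likelihood_ratio(n, k, a, b, c):
    """Exact ratio mu/nu for one trace shape, as a Fraction."""
    return Fraction(mu_weight(n, k, a, b, c), nu_weight(n, k, a, b, c))


# Toy usage: gap lengths near n/2, as for deletion probability 1/2.
print(float(likelihood_ratio(1000, 2, 500, 505, 495)))
```

Two structural facts are visible in the code: swapping $a$ and $b$ exchanges the two weights, and the ratio is exactly 1 when $a = b = c$ or when $k = 1$.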
Multiplying the numerator and denominator by Q ki =1 ( n + i − a )( n + i − b )( n + i − c ) results in S = k Y i =1 ( n + i − a ) k Y i =2 ( n + i − b ) + k Y i =1 ( n + i − b ) k Y i =2 ( n + i − c ) + k Y i =1 ( n + i − c ) k Y i =2 ( n + i − a )and S = k Y i =1 ( n + i − b ) k Y i =2 ( n + i − a ) + k Y i =1 ( n + i − c ) k Y i =2 ( n + i − b ) + k Y i =1 ( n + i − a ) k Y i =2 ( n + i − c ) , respectively. We have that S − S = ( a − b ) Q ki =2 ( n + i − a )( n + i − b ) + ( b − c ) Q ki =2 ( n + i − b )( n + i − c ) + ( c − a ) Q ki =2 ( n + i − c )( n + i − a ). This is an alternating polynomial in a, b, c , i.e. applyinga permutation σ to a, b, c changes the sign of the polynomial by the sign of σ . Hence, it can bewritten in the form ( a − b )( b − c )( a − c ) P k ( n, a, b, c ), where P k is a polynomial in n, a, b, c of degree2 k − S and S have degree 2 k − C such that with probability at least1 − n − , a, b, c ∈ [ np − C √ n log n, np + C √ n log n ]. When this occurs, we have that S = Ω( n k − )and | S − S | = O (( n log n ) / n k − ), so µ ( { a b c } ) ν ( { a b c } ) ∈ [1 − ( c log n/n ) / , c log n/n ) / ] forsome constant c . We thus have that d H ( µ, ν ) = X a,b,c ≥ ( µ ( { a b c } ) − ν ( { a b c } )) ≤ n − + X a,b,c ∈ [ np − C √ n log n,np + C √ n log n ] ν ( { a b c } ) (cid:18) − µ ( { a b c } ) ν ( { a b c } ) (cid:19) = O ((log n/n ) ) . It follows by Lemma 5.1 that it requires Ω( n / log n ) samples to distinguish between tracesfrom x and y , as desired. 13 cknowledgments This research was partially supported by the MIT Akamai Fellowship, the NSF Graduate Fellow-ship, and by NSF-DMS grant 1949884 and NSA grant H98230-20-1-0009. The first author thanksPiotr Indyk for many helpful discussions and feedback, as well as Mehtaab Sawhney for pointersto some references. 
The second author thanks Professor Joe Gallian for running the Duluth REU at which part of this research was conducted, as well as program advisors Amanda Burcroff, Colin Defant, and Yelena Mandelshtam for providing a supportive environment.
References

[AGMP13] Alexandr Andoni, Assaf Goldberger, Andrew McGregor, and Ely Porat. Homomorphic fingerprints under misalignments: Sketching edit and shift distances. In Proceedings of the 45th Annual ACM SIGACT Symposium on Theory of Computing, pages 931–940, 2013.

[BCF+19] Frank Ban, Xi Chen, Adam Freilich, Rocco A. Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In , pages 745–768, 2019.

[BCSS19] Frank Ban, Xi Chen, Rocco A. Servedio, and Sandip Sinha. Efficient average-case population recovery in the presence of insertions and deletions. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 44:1–44:18, 2019.

[BCSZ19] Afonso S. Bandeira, Moses Charikar, Amit Singer, and Andy Zhu. Multireference alignment using semidefinite programming. In Fifth Annual ACM Conference on Innovations in Theoretical Computer Science, pages 745–768, 2019.

[BE97] Peter Borwein and Tamás Erdélyi. Littlewood-type polynomials on subarcs of the unit circle. Indiana University Mathematics Journal, 46(4):1323–1346, 1997.

[BEK99] Peter Borwein, Tamás Erdélyi, and Géza Kós. Littlewood-type problems on [0, 1]. Proceedings of the London Mathematical Society, 3(79):22–46, 1999.

[BKKM04] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 910–918, 2004.

[BLS20] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In , 2020.

[BNWR19] Afonso S. Bandeira, Jonathan Niles-Weed, and Philippe Rigollet. Optimal rates of estimation for multi-reference alignment. Mathematical Statistics and Learning, 2(1):25–75, 2019.

[CDL+20] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. CoRR, abs/2008.12386, 2020.

[CGK12] George M. Church, Yuan Gao, and Sriram Kosuri. Next-generation digital information storage in DNA. Science, 337(6102):1628, 2012.

[CGMR19] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and João Ribeiro. Coded trace reconstruction. In Information Theory Workshop, 2019.

[Cha19] Zachary Chase. New lower bounds for trace reconstruction. CoRR, abs/1905.03031, 2019.

[CKP+21] Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. Circular pattern matching with k mismatches. Journal of Computer and System Sciences, 115:73–85, 2021.

[DOS19] Anindya De, Ryan O'Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. Annals of Applied Probability, 29(2):851–874, 2019.

[DRR19] Sami Davies, Miklos Racz, and Cyrus Rashtchian. Reconstructing trees from traces. In Conference On Learning Theory, pages 961–978, 2019.

[HHP18] Lisa Hartung, Nina Holden, and Yuval Peres. Trace reconstruction with varying deletion probabilities. In Proceedings of the Fifteenth Workshop on Analytic Algorithmics and Combinatorics, pages 54–61, 2018.

[HL20] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. Annals of Applied Probability, 30(2):503–525, 2020.

[HMPW08] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 389–398, 2008.

[HPP18] Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Conference On Learning Theory, pages 1799–1840, 2018.

[KM05] Sampath Kannan and Andrew McGregor. More on reconstructing strings from random traces: insertions and deletions. In Proceedings of the 2005 IEEE International Symposium on Information Theory, pages 297–301, 2005.

[KMMP19] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. In Proceedings of the 27th Annual European Symposium on Algorithms, pages 68:1–68:25, 2019.

[Lev01a] Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Trans. Information Theory, 47(1):2–22, 2001.

[Lev01b] Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory, Ser. A, 93(2):310–332, 2001.

[Mae90] Maurice Maes. On a cyclic string-to-string correction problem. Information Processing Letters, 35(2):73–78, 1990.

[Mar77] Daniel A. Marcus. Number Fields. Springer International Publishing, 1977.

[MPV14] Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace reconstruction revisited. In Proceedings of the 22nd Annual European Symposium on Algorithms, pages 689–700, 2014.

[Nar20] Shyam Narayanan. Population recovery from the deletion channel: Nearly matching trace reconstruction bounds. CoRR, abs/2004.06828, 2020.

[NP17] Fedor Nazarov and Yuval Peres. Trace reconstruction with exp(O(n^{1/3})) samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1042–1046, 2017.

[NPP+18] Hoang Hiep Nguyen, Jeho Park, Seon Joo Park, Chang-Soo Lee, Seungwoo Hwang, Yong-Beom Shin, Tai Hwan Ha, and Moonil Kim. Long-term stability and integrity of plasmid-based DNA data storage. Polymers, 10(1):28, 2018.

[OAC+18] Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher N Takahashi, Sharon Newman, Hsing-Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, and Karin Strauss. Random access in large-scale DNA data storage. Nature Biotechnology, 36:242–248, 2018.

[PWB+19] Amelia Perry, Jonathan Weed, Afonso S. Bandeira, Philippe Rigollet, and Amit Singer. The sample complexity of multireference alignment. SIAM Journal on Mathematics of Data Science, 1(3):497–517, 2019.

[PZ17] Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In , pages 228–239, 2017.

[RUC+11] Jane B. Reece, Lisa A. Urry, Michael L. Cain, Steven A. Wasserman, Peter V. Minorsky, and Robert B. Jackson. Campbell Biology. Pearson, 9th edition, 2011.

[VS08] Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 399–408, 2008.

[Wik] Circular DNA. Available at https://en.wikipedia.org/wiki/Circular_DNA.
Omitted Proofs
A.1 Proof of Proposition 1.1
Here, we prove Proposition 1.1, which shows that circular trace reconstruction is at least as hard as linear trace reconstruction in both the worst-case and average-case models for any choice of $q$.

Proof of Proposition 1.1. Let $m \ge n$, and suppose that using $T = T(m, q)$ traces, we can solve worst-case circular trace reconstruction over length $m$ strings with failure probability $\delta$. Then, suppose we are given $T$ traces of some unknown linear string $x$ of length $n$. We will reconstruct $x$ as follows. First, the algorithm creates a random binary string $y$ of length $m - n$. Then, the algorithm lets $x'$ be the circular string $x \circ y$, i.e., $x$ concatenated with $y$, which has length $m$. While we do not know $x'$, given a random trace $\tilde{x}_i$ of $x$, we can create a random trace $\tilde{x}'_i$ of $x'$ by creating a random trace of $y$ (with deletion probability $q$), appending it to $\tilde{x}_i$, and then randomly rotating the result. Doing this for each trace gives us $T$ random traces of the circular string $x'$, which allows us to reconstruct $x'$ with probability $1 - \delta$. Now, the string $y$ appears exactly once (consecutively) in the circular string $x'$ with failure probability exponentially small in $n$ since $m \ge n$, and since we know $y$, we would be able to find the unique copy of $y$ in $x'$ and thus recover the linear string $x$ with failure probability $\delta + e^{-\Omega(n)}$.

The same argument works in the average case. Suppose using $T = T(m, q)$ traces, we can solve average-case circular trace reconstruction with probability $\delta$, where the average string is generated by creating a uniformly random binary (linear) string and making it circular. Then, if given $T$ random traces of a random linear string $x$ of length $n$, our algorithm works the same way: creating a random string $y$ of length $m - n$, appending it to $x$, reconstructing the circular string $x' = x \circ y$, and then recovering $x$ since with $1 - e^{-\Omega(n)}$ probability, there is a unique copy of $y$ in $x'$.

A.2 Proof of Lemma 3.1
Here, we prove Lemma 3.1, which gives us the unbiased estimator of Q mi =1 P ( z i ; x ). To do so, wefirst note a simple proposition about complex numbers. Proposition A.1. [Nar20] Let z be a complex number with | z | = 1 and | arg z | ≤ θ. Then, for any < p < , (cid:12)(cid:12)(cid:12) z − (1 − p ) p (cid:12)(cid:12)(cid:12) ≤ θ p . Proof of Lemma 3.1.
For some 1 ≤ k ≤ m, fix some complex numbers w , . . . , w k and consider therandom variable f (˜ x, w ) := X ≤ i
We let ω k denote e πi/k for k ≥ . When dealing with a string of length n , we write ω := ω n . In the case where n is prime, we already proved it in Proposition 3.3.Next, we prove it in the case that n = p · q for p, q odd primes. We will first need a simplelemma, which is likely folklore, though we give a proof regardless. Lemma A.2.
Let $p, q$ be distinct primes, and suppose that $(1 - \omega_p) \mid \left(\sum_{i=0}^{q-1} b_i \omega_q^i\right)$ in $\mathbb{Q}[\omega_{pq}]$. Then, we in fact have that $p \mid \left(\sum_{i=0}^{q-1} b_i \omega_q^i\right)$ in $\mathbb{Q}[\omega_{pq}]$.

Proof. Note that $(1 - \omega_p) \mid \left(\sum_{i=0}^{q-1} b_i \omega_q^i\right)$ implies that $(1 - \omega_p)^p \mid \left(\sum_{i=0}^{q-1} b_i \omega_q^i\right)^p$. But $p \mid (1 - \omega_p)^p$, so $p \mid \left(\sum_{i=0}^{q-1} b_i \omega_q^i\right)^p$. Now, using the Frobenius endomorphism, we have that
$$\left(\sum_{i=0}^{q-1} b_i \omega_q^i\right)^p \equiv \sum_{i=0}^{q-1} b_i \omega_q^{i \cdot p} \pmod{p},$$
so $p \mid \left(\sum_{i=0}^{q-1} b_i \omega_q^{i \cdot p}\right)$. But since $p \neq q$, we have that $\omega_q^p$ and $\omega_q$ are Galois conjugates, so we therefore have that $p \mid \left(\sum_{i=0}^{q-1} b_i \omega_q^i\right)$, as desired.

Lemma A.3.
Theorem 1.3 is true when n = p · q, where p, q are distinct odd primes.Proof. First, we have that P n − i =0 a i = ω c P n − i =0 b i and since a , . . . , a n , b , . . . , b n ∈ { , } , thismeans that ω c is real and thus equals 1. So, P n − i =0 a i = P n − i =0 b i . Next, we have that P n − i =0 a i ω ip = ω c q · P n − i =0 b i ω ip , since ω p = ω q . Therefore, since the a i ’s are all integers, this implies that either19 n − i =0 a i ω ip = P n − i =0 b i ω ip = 0 or ω c q = P a i ω ip P b i ω ip ∈ Q [ ω p ]. Thus, by Theorem 2.4, ω c q actually equals ω kp for some k . Likewise, ω c p actually equals ω ℓq for some ℓ , so we have that n − X i =0 a i ω ip = ω kp n − X i =0 b i ω ip , n − X i =0 a i ω iq = ω ℓq n − X i =0 b i ω iq . Therefore, by the Chinese Remainder theorem, we can cyclically shift { b i } by something thatis k modulo p and ℓ modulo q to get some sequence { b ′ i } so that n − X i =0 a i = n − X i =0 b ′ i , n − X i =0 a i ω ip = n − X i =0 b ′ i ω ip , and n − X i =0 a i ω iq = n − X i =0 b ′ i ω iq . Without loss of generality, we can therefore pretend that k = ℓ = 0 , so in fact we have b ′ i = b i for all i . Now, suppose that P a i ω i = ω m · P b i ω i . Our goal is to show that p | m and q | m , so that ω m = 1 . Assume the contrary, WLOG that q ∤ m. Then, we can write n − X i =0 ( a i − b i ) ω i = ( ω m − · n − X i =0 b i ω i . (1)Now, choose integers r, s so that r · q + 1 = s · p. Then, we have that ω i · sq − ω i = ω i · s · p − ω i = ω i (cid:0) ω i · r · q − (cid:1) = ω i (cid:0) ω i · rp − (cid:1) , which is a multiple of ω p −
1. Therefore, we have that 1 − ω p divides n − X i =0 ( a i − b i ) · (cid:0) ω i − ω i · sq (cid:1) = n − X i =0 ( a i − b i ) ω i − n − X i =0 ( a i − b i ) ω i · sq = n − X i =0 ( a i − b i ) ω i . The last equality in the above line follows since P n − i =0 a i ω iq = P n − i =0 b i ω iq , and since s is relativelyprime to q , this means ω sq is a Galois conjugate of ω q , so P n − i =0 a i ω i · sq = P n − i =0 b i ω i · sq . Now, since q ∤ m, we have that either ω m − Z [ ω ] (if p ∤ m ) or ω m − | q (if p | m ).Therefore, by Equation (1), we have that(1 − ω p ) (cid:12)(cid:12)(cid:12)(cid:12) q · n − X i =0 b i ω i ⇒ (1 − ω p ) (cid:12)(cid:12)(cid:12)(cid:12) n − X i =0 b i ω i , since (1 − ω p ) , ( q ) are relatively prime as ideals. Now, recalling that ω i ≡ ω i · sq (mod 1 − ω p ) , wehave that (1 − ω p ) | P n − i =0 b i ω i · sq . By Lemma A.2, we have that p | P n − i =0 b i ω i · sq . Since ω sq and ω q are Galois conjugates, thisalso means that p | P n − i =0 b i ω iq . Now, for 0 ≤ j ≤ q − , let d j = b j + b j + q + · · · + b j +( p − q .We have that p | P q − j =0 d j ω jq , so P q − j =0 d j p ω jq is an algebraic integer in Q [ ω q ] . Therefore, d ≡ d ≡· · · ≡ d q − (mod p ). Since 0 ≤ d i ≤ p for all i , we either have that d = d = · · · = d q − , or d , d , . . . , d q − ∈ { , p } . Likewise, we also have that P b i ω i = ω − m · P a i ω i , where q ∤ ( − m ) . Therefore, if c j = a j + a j + q + · · · + a j +( p − q for each 0 ≤ j ≤ q − , we either have that c = c = · · · = c q − , or c , c , . . . , c q − ∈ { , p } . d , d , . . . , d q − ∈ { , p } . This means that for all 0 ≤ j ≤ q − , b j = b j + q = · · · = b j +( p − q , so b j ω j + b j + q ω j + q + · · · + b j +( p − q ω j +( p − q = 0 for all 0 ≤ j ≤ p − . Importantly,this means P b j ω j = 0. But since P a i ω i = ω m · P b i ω i for some m ∈ Z , this also means that P a i ω i = 0 , so in fact we do have that P b i ω i = P a i ω i . Likewise, if a , a , . . . 
, a q − ∈ { , p } wealso have that P b i ω i = P a i ω i = 0 by a symmetric argument.Otherwise, we are dealing with the case where d = d = · · · = d q − and c = c = · · · = c q − . But then, 0 = P q − j =0 d j ω jq = P n − i =0 b i ω iq and 0 = P q − j =0 c j ω jq = P n − i =0 a i ω iq . Recall that P a i ω i = ω m · P b i ω i and that we assumed q ∤ m. If p | m, then if m = p · t , we have P a i ω i = P b i − p · t ω i , where i − p · t is done modulo n . Moreover, we have that P b i − p · t ω ip = P b i ω i + p · tp = P b i ω ip , P b i − o · t ω iq = P b i ω i + p · tq = ω p · tq · P b i ω iq = 0 , and P b i − p · t = P b i . Therefore, by shifting b by p · t, we have that P a i ω k · i = P b i ω k · i for k = 0 , , p, and q , and therefore for all 0 ≤ k ≤ n − . The other case is that p ∤ m. In this case, we can define e j = a j + a j + p + · · · + a j +( q − p and f j = b j + b j + p + · · · + b j +( q − p for 0 ≤ j ≤ p − . By the same argument as before, either e = e = · · · = e p − or e , e , . . . , e p − ∈ { , q } , and either f = f = · · · = f p − or f , f , · · · , f p − ∈ { , p } . Again, either e , e , . . . , e p − ∈ { , q } or f , f , . . . , f q − ∈ { , q } implies that P a i ω i = P b i ω i = 0 . Therefore, the final case to deal with is if c = c = · · · = c q − , d , d = · · · = d q − , e = e = · · · = e q − , and f = f = · · · = f q − . As we have seen, the first two equations implythat 0 = P n − i =0 a i ω iq = P n − i =0 b i ω iq . Thus, the same argument applied to the last two equationsimplies that 0 = P n − i =0 a i ω ip = P n − i =0 b i ω ip . As a result, we can shift the sequence b by m , since P a i ω i = P b i − m ω i , but we will still have that P a i = P b i − m , P a i ω ip = P b i ω ip = P b i − m ω ip = 0 , and P a i ω iq = P b i ω iq = P b i − m ω iq = 0 . We now show that Theorem 1.3 is true when n is the square of an odd prime. Proposition A.4.
Theorem 1.3 is true if n = p is the square of an odd prime.Proof. By shifting, we may without loss of generality assume that P a i ω i = P b i ω i , so P ( x ) = P ( a i − b i ) x i has ω as a root. Thus, 1 + x p + · · · + x n − p | P ( x ), which means that a i − b i = a i + p − b i + p = · · · = a i + n − p − b i + n − p , where indices are taken mod n . Thus, if it is not the casethat a i = a i + p = · · · = a i + n − p , equivalently that b i = b i + p = · · · = b i + n − p , then we must have that a i = b i , a i + p = b i + p , . . . , a i + n − p = b i + n − p since a j , b j ∈ { , } .Let z = ω p . We have that X a i z i = ( a + a p + · · · + a n − p ) + ( a + a p +1 + · · · + a n − p +1 ) z + · · · + ( a p − + a p − + · · · + a n − ) z p − , X b i z i = ( b + b p + · · · + b n − p ) + ( b + b p +1 + · · · + b n − p +1 ) z + · · · + ( b p − + b p − + · · · + b n − ) z p − . Thus, P a i z i P b i z i ∈ Q [ z ], so p | c p and we have that P a i z i = z m P b i z i for some m ∈ Z . Itfollows that { a i + a i + p + · · · + a i + n − p } and { b i + b i + p + · · · + b i + n − p } are cyclic shifts of eachother. Since these sequences are of length p , this means that they are equal. We already knowthat a i + a i + p + · · · + a i + n − p / ∈ { , p } = ⇒ a i + pℓ = b i + pℓ for all ℓ . But it is also the case that a i + a i + p + · · · + a i + n − p = b i + b i + p + · · · + b i + n − p ∈ { , p } = ⇒ a i + pℓ = b i + pℓ for all ℓ since a j , b j ∈ { , } . Thus, we have shown that a i = b i for all i , so we are done. Proposition A.5.
Theorem 1.3 is true if n = 2 p , i.e., n is twice a prime. roof. If p = 2 , i.e. n = 4 , then if P a i = P b i but the sequences { a i } and { b i } are not equal up toa cyclic rotation, then up to cyclic rotations, we either have { a i } = { , , , } and { b i } = { , , , } or vice versa. But then, P a i ( − i = ± P b i ( − i = 0 . If p is an odd prime, then note that the minimal polynomial of ω = ω n is 1 + x + · · · + x p − . Now, suppose that a, b are rotated so that if P ( x ) := P n − i =0 a i x i and Q ( x ) := P n − i =0 b i x i , then P ( ω ) = Q ( ω ) . Therefore, (1 + x + · · · + x p − ) | P p − i =0 ( a i − b i ) x i . Since a i − b i ∈ {− , , } for all i , we must have that P p − i =0 ( a i − b i ) x i = (1 + x + · · · + x p − ) · R ( x ) , where R ( x ) must be either0 , ±
1, and ± x ± . However, since P a i = P b i , we have that P (1) − Q (1) = 0 = ( p − · R (1) , so R (1) = 0 . Thus, R ( x ) must equal either 0, x −
1, or 1 − x. If R ( x ) = x − , then a i − b i = 1 for all odd i and − i , which means that a i = 1 ifand only if i is odd, but b i = 0 if and only if i is even. Since n is even, this means that { a i } and { b i } are the same sequence, up to a rotation by 1. The same is true if R ( x ) = 1 − x by symmetrybetween a and b . Finally, if R ( x ) = 0 , then a i − b i = 0 for all i , so a i = b i for all i , and thus thesequences { a i } and { b i } are the same.Finally, we remark that the statement is false for numbers with 3 or more prime factors, whichconcludes the proof of Theorem 1.3. Suppose that n = abc with a, b, c >
1. Let A = { , a +1 , . . . , ab − a + 1 , a, ab + a, . . . , abc − ab + a } and B = { , a + 1 , . . . , ab − a + 1 , , ab, . . . , abc − ab } .Consider circular strings a and b of length n with 1s in positions given by A and B , respectively.Let P ( x ) = P i ∈ A x i and Q ( x ) = P i ∈ B x i . We have that P ( x ) − Q ( x ) = ( x a − · x abc − x ab − and P ( x ) − x a Q ( x ) = x (1 − x ab ). Thus, for all k , P ( ω k ) Q ( ω k ) is a power of ω , so the conditions of Theorem1.3 hold. However, a and bb