Mean-Based Trace Reconstruction over Practically any Replication-Insertion Channel
Mahdi Cheraghchi, Joseph Downs, João Ribeiro, Alexandra Veliche
arXiv preprint, cs.IT, February

Abstract
Mean-based reconstruction is a fundamental, natural approach to worst-case trace reconstruction over channels with synchronization errors. It is known that $\exp(O(n^{1/3}))$ traces are necessary and sufficient for mean-based worst-case trace reconstruction over the deletion channel, and this result was also extended to certain channels combining deletions and geometric insertions of uniformly random bits. In this work, we use a simple extension of the original complex-analytic approach to show that these results are examples of a much more general phenomenon: $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction over any memoryless channel that maps each input bit to an arbitrarily distributed sequence of replications and insertions of random bits, provided the length of this sequence follows a sub-exponential distribution.

Affiliations: Mahdi Cheraghchi, Joseph Downs, and Alexandra Veliche are with the University of Michigan – Ann Arbor. João Ribeiro is with Imperial College London. This material is based upon work supported by the National Science Foundation under Grant No. CCF-2006455.

Introduction

When any length-$n$ message $x \in \{-1,1\}^n$ is sent through a noisy channel $\mathsf{Ch}$, the channel modifies the input $x$ in some way to produce a distorted copy of $x$, called a trace. The goal of worst-case trace reconstruction over $\mathsf{Ch}$ is to design an algorithm which recovers any input string $x \in \{-1,1\}^n$ with high probability from as few independent and identically distributed (i.i.d.) traces as possible. This problem was first introduced by Levenshtein [1, 2], who studied it over combinatorial channels causing synchronization errors, such as deletions and insertions of symbols, and certain discrete memoryless channels. Trace reconstruction over the deletion channel, which independently deletes each input symbol with some probability, was first considered by Batu, Kannan, Khanna, and McGregor [3]. Some of their results were quickly generalized to what we call the geometric insertion-deletion channel [4, 5], which prepends a geometric number of independent, uniformly random symbols to each input symbol and then deletes it with a given probability. Both the deletion and geometric insertion-deletion channels are examples of discrete memoryless synchronization channels [6, 7].

Holenstein, Mitzenmacher, Panigrahy, and Wieder [8] were the first to obtain non-trivial worst-case trace reconstruction algorithms for the deletion channel with constant deletion probability. They showed that $\exp(\tilde{O}(\sqrt{n}))$ traces suffice for mean-based reconstruction of any input string with high probability. By mean-based reconstruction, we mean that the reconstruction algorithm only requires knowledge of the expected value of each trace coordinate. In general, this procedure works as follows: Let $Y_x = (Y_{x,1}, Y_{x,2}, \dots)$ denote the trace distribution on input $x \in \{-1,1\}^n$ and $Y'_x$ denote the infinite string obtained by padding $Y_x$ with zeros on the right. The mean trace $\mu_x$ is given by

$$\mu_x = \left(E[Y'_{x,1}], E[Y'_{x,2}], \dots\right).$$

As the first step, the algorithm estimates $\mu_x$ from $t$ traces $T^{(1)}, T^{(2)}, \dots, T^{(t)}$ sampled i.i.d. according to $Y'_x$ via the empirical means

$$\hat{\mu}_i = \frac{1}{t} \sum_{j=1}^{t} T^{(j)}_i, \quad i = 1, 2, \dots. \tag{1}$$

Subsequently, it outputs the string $\hat{x} \in \{-1,1\}^n$ that minimizes $\|\mu_{\hat{x}} - \hat{\mu}\|_1$. If $t = t(n)$ is large enough, we have $\hat{x} = x$ with high probability over the randomness of the traces. Because of this structure, pinpointing the number of traces required for mean-based reconstruction over any channel reduces to bounding $\|\mu_x - \mu_{x'}\|_1$ for any pair of distinct strings $x$ and $x'$. Overall, mean-based reconstruction is a natural paradigm, and it is not only useful over channels with synchronization errors. For example, $O(\log n)$ traces suffice for mean-based reconstruction over the binary symmetric channel, which is optimal.

More recently, an elegant complex-analytic approach was employed concurrently by De, O'Donnell, and Servedio [9] and by Nazarov and Peres [10] to show that $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction not only over the deletion channel with constant deletion probability, but also over the more general geometric insertion-deletion channel we described previously. Remarkably, $\exp(\Omega(n^{1/3}))$ traces were shown to also be necessary for mean-based reconstruction over the deletion channel.

Given the fundamental nature of mean-based reconstruction and this state of affairs, the following question arises naturally: Are these results examples of a much more general phenomenon?
In particular, is it true that $\exp(O(n^{1/3}))$ traces suffice for mean-based trace reconstruction over any discrete memoryless synchronization channel? We make significant progress in this direction by showing that a simple extension of the analysis from [9, 10] yields that result for a much broader class of such channels, which map each input symbol to an arbitrarily distributed sequence of noisy symbol replications and insertions of random symbols, under a mild assumption.

Research in this direction has other practical and theoretical implications. First, studying trace reconstruction over channels introducing more complex synchronization errors than simple i.i.d. deletions is fundamental for the design of reliable DNA-based data storage systems with nanopore-based sequencing [11, 12, 13]. Second, understanding the structure of the mean trace of a string is by itself a natural information-theoretic problem which may lead to improved capacity bounds and coding techniques for channels with synchronization errors, both notoriously difficult problems (see the extensive surveys [14, 15, 7, 16]).

Besides the works mentioned above, there has been significant recent interest in various notions of trace reconstruction.

The mean-based approach of [9, 10] has proven useful for some problems incomparable to our general setting: the deletion channel with position- and symbol-dependent deletion probabilities satisfying strong monotonicity and periodicity assumptions [17]; a combination of the geometric [...] average-case trace reconstruction algorithms (which are only required to have low average reconstruction error probability) [18, 19]; trace reconstruction of trees with i.i.d. deletions of vertices [20]; trace reconstruction of matrices with i.i.d. deletions of rows and columns [21]; trace reconstruction of circular strings over the deletion channel [22]. In another direction, Grigorescu, Sudan, and Zhu [23] studied the performance of mean-based reconstruction for distinguishing between strings at low Hamming or edit distance from each other over the deletion channel.

Footnote. Nazarov and Peres [10] consider a slightly modified geometric insertion-deletion channel: First, a geometric number of independent, uniformly random symbols is added independently before each input symbol. Then, the resulting string is sent through a deletion channel. The analysis is similar to that of the geometric insertion-deletion channel.

Different complex-analytic methods have been used to, for example, obtain the current best upper bound of $\exp(\tilde{O}(n^{1/5}))$ traces on the trace complexity of the deletion channel [24], as well as upper bounds for trace reconstruction of "smoothed" worst-case strings over the deletion channel [25]. We note, however, that mean-based reconstruction remains the state-of-the-art approach for the geometric insertion-deletion channel.

Other related problems considered include the already-mentioned average-case trace reconstruction problem over the deletion and geometric insertion-deletion channels [3, 8, 26, 18, 19], trace reconstruction over the deletion and geometric insertion-deletion channels with vanishing deletion probabilities [3, 4, 5, 27], trace complexity lower bounds for the deletion channel [3, 26, 28, 29], trace reconstruction of coded strings over the deletion channel [30, 31], approximate trace reconstruction [32], alternative trace reconstruction models motivated by immunology [33], and population recovery over the deletion and geometric insertion-deletion channels [34, 35, 36].

For convenience, we denote discrete random variables and their corresponding distributions by uppercase letters, such as $X$, $Y$, and $Z$. The expected value of $X$ is denoted by $E[X]$. Sets are denoted by calligraphic uppercase letters such as $\mathcal{S}$ and $\mathcal{T}$, and we write $[n] := \{1, 2, \dots, n\}$.
The open disk of radius $r$ centered at $z \in \mathbb{C}$ is $D_r(z) := \{z' \in \mathbb{C} : |z - z'| < r\}$. The 1-norm of a vector $x$ is denoted by $\|x\|_1$. The concatenation of strings $x$ and $y$ is denoted by $x \| y$.

We consider a general replication-insertion channel model that, in particular, captures the models studied in [4, 5, 9, 10]. A replication-insertion channel $\mathsf{Ch}_{(M,\mathcal{R},p_{\mathrm{flip}})}$ is characterized by a constant $p_{\mathrm{flip}} \in [0, 1/2)$ and a joint probability distribution $(M, \mathcal{R})$ with $M \in \{0, 1, 2, \dots\}$ and $\mathcal{R} \subseteq [M]$. To avoid trivial settings where trace reconstruction is impossible, we require that $\Pr[M > 0, \mathcal{R} \neq \emptyset] > 0$. Given an input string $x \in \{-1,1\}^n$, the channel $\mathsf{Ch}_{(M,\mathcal{R},p_{\mathrm{flip}})}$ behaves independently on each input $x_i$ as follows:

1. Sample a pair $(m_i, R_i)$ with $m_i \in \{0, 1, 2, \dots\}$ and $R_i \subseteq [m_i]$ according to the distribution of $(M, \mathcal{R})$;
2. Construct an output string $Y_{x_i} \in \{-1,1\}^{m_i}$. If $j \in R_i$, then let $Y_{x_i,j} = -x_i$ with probability $p_{\mathrm{flip}}$ and $Y_{x_i,j} = x_i$ with probability $1 - p_{\mathrm{flip}}$. If $j \notin R_i$, let $Y_{x_i,j}$ be a uniformly random bit.

The overall output of $\mathsf{Ch}_{(M,\mathcal{R},p_{\mathrm{flip}})}$ on input $x$ is $Y_x = Y_{x_1} \| Y_{x_2} \| \cdots \| Y_{x_n}$.

For example, the geometric insertion-deletion channel from [4, 5, 9] can be easily instantiated in this general framework by sampling $(M, \mathcal{R})$ as follows: First, sample $G$ following a geometric distribution with support in $\{0, 1, 2, \dots\}$ and success probability $\sigma$, and $B$ following a Bernoulli distribution with success probability $\delta$. Then, set $M = G + 1 - B$ and let $\mathcal{R} = \emptyset$ if $B = 1$ and $\mathcal{R} = \{M\}$ otherwise, so that the input symbol is preceded by $G$ random symbols and is deleted exactly when $B = 1$. Likewise, the alternative geometric insertion-deletion channel from [10, 18, 19] can also be easily instantiated in our framework.

Our main theorem shows that previous results on mean-based trace reconstruction over the deletion and geometric insertion-deletion channels are examples of a much more general phenomenon.
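Before the theorem statement, the channel definition above can be made concrete with a short Python sketch that samples one trace. This is only an illustration under stated assumptions: the function names `sample_geometric`, `geometric_insertion_deletion`, and `send` are hypothetical and do not come from the paper, and the sampler for $(M, \mathcal{R})$ follows the verbal description of the geometric insertion-deletion channel (prepend a geometric number of uniformly random symbols, then delete the input symbol with probability $\delta$).

```python
import random


def sample_geometric(sigma, rng):
    """Number of failures before the first success; support {0, 1, 2, ...}."""
    k = 0
    while rng.random() >= sigma:
        k += 1
    return k


def geometric_insertion_deletion(sigma, delta):
    """Return a sampler for (M, R) of the geometric insertion-deletion channel.

    Assumption: G random symbols are prepended and the input symbol,
    if it survives deletion, sits at position G + 1 of its block.
    """
    def sample_m_r(rng):
        g = sample_geometric(sigma, rng)
        deleted = rng.random() < delta
        m = g if deleted else g + 1
        r = set() if deleted else {g + 1}
        return m, r
    return sample_m_r


def send(x, sample_m_r, p_flip, rng):
    """Send x in {-1, 1}^n through a replication-insertion channel."""
    trace = []
    for bit in x:
        m, r = sample_m_r(rng)  # step 1: sample (m_i, R_i)
        for j in range(1, m + 1):  # step 2: build the block for this bit
            if j in r:
                # Replication of the input bit, flipped with probability p_flip.
                trace.append(-bit if rng.random() < p_flip else bit)
            else:
                # Insertion of a uniformly random bit.
                trace.append(rng.choice((-1, 1)))
    return trace
```

With $\sigma = 1$, $\delta = 0$, and $p_{\mathrm{flip}} = 0$ nothing is inserted, deleted, or flipped, so `send` returns the input unchanged, which gives a quick sanity check of the sketch.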
Theorem 1. Worst-case mean-based trace reconstruction with success probability at least $1 - e^{-\Omega(n)}$ over any channel $\mathsf{Ch}_{(M,\mathcal{R},p_{\mathrm{flip}})}$ with sub-exponential random variable $M$ is achievable with $\exp(O(n^{1/3}))$ traces.

Here, a random variable $M$ is sub-exponential if there exists a constant $\alpha > 0$ such that $\Pr[|M| \geq \tau] \leq 2e^{-\alpha\tau}$ for all $\tau \geq 0$. Note that many common distributions satisfy the requirement that $M$ is sub-exponential, including geometric, Poisson, and all finitely-supported distributions.

Fix a replication-insertion channel $\mathsf{Ch}_{(M,\mathcal{R},p_{\mathrm{flip}})}$, where $M$ is a sub-exponential random variable. To every string $x \in \{-1,1\}^n$, we can associate a polynomial $P_x$ over $\mathbb{C}$ defined as

$$P_x(z) := \sum_{i=1}^{n} x_i z^{i-1}.$$

Then, using the definition of mean trace above, we define the mean trace power series $\overline{P}_x$ as

$$\overline{P}_x(z) := \sum_{i=1}^{\infty} \mu_{x,i} z^{i-1}.$$

Let $N > 0$ be an integer and denote the truncation of $\mu_x$ to its first $N$ coordinates by

$$\mu^N_x := (\mu_{x,1}, \dots, \mu_{x,N}).$$

To prove Theorem 1, we will show that there exists a constant $C > 0$ such that for large enough $n$, appropriate $N$, and any distinct input strings $x, x' \in \{-1,1\}^n$, their truncated mean traces satisfy

$$\left\|\mu^N_x - \mu^N_{x'}\right\|_1 = \sum_{i=1}^{N} \left|\mu_{x,i} - \mu_{x',i}\right| \geq \delta(n) := e^{-Cn^{1/3}}. \tag{2}$$

This implies that $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction as follows: Let $x$ be the true input and suppose that we have access to $t = \mathrm{poly}(n)/\delta(n)^2 = \exp(O(n^{1/3}))$ traces. Then a direct application of the Chernoff bound and a union bound over all coordinates $i = 1, \dots, N$ shows that the empirical mean trace $\hat{\mu}^N = (\hat{\mu}_1, \dots, \hat{\mu}_N)$ defined in (1) satisfies

$$\left\|\hat{\mu}^N - \mu^N_x\right\|_1 \leq \frac{\delta(n)}{4} \tag{3}$$

with probability at least $1 - e^{-\Omega(n)}$ over the randomness of the traces. On the other hand, if (3) holds, we also have that

$$\left\|\hat{\mu}^N - \mu^N_{x'}\right\|_1 \geq \frac{3\delta(n)}{4}$$

for all $x' \neq x$ as a result of (2) and the triangle inequality. This allows us to recover $x$ naively from $\hat{\mu}^N$ by computing $\mu^N_{\hat{x}}$ for every $\hat{x} \in \{-1,1\}^n$ and outputting the $\hat{x}$ that minimizes $\left\|\hat{\mu}^N - \mu^N_{\hat{x}}\right\|_1$.

We prove (2) by relating $\left\|\mu^N_x - \mu^N_{x'}\right\|_1$ to $\left|\overline{P}_x(z) - \overline{P}_{x'}(z)\right|$ for an appropriate choice of $z \in \mathbb{C}$. By the triangle inequality, we have

$$\begin{aligned}
\left|\overline{P}_x(z) - \overline{P}_{x'}(z)\right| &\leq \sum_{i=1}^{\infty} \left|\mu_{x,i} - \mu_{x',i}\right| |z|^{i-1} \\
&= \sum_{i=1}^{N} \left|\mu_{x,i} - \mu_{x',i}\right| |z|^{i-1} + \sum_{i=N+1}^{\infty} \left|\mu_{x,i} - \mu_{x',i}\right| |z|^{i-1} \\
&\leq |z|^{N} \left\|\mu^N_x - \mu^N_{x'}\right\|_1 + \sum_{i=N+1}^{\infty} \left|\mu_{x,i} - \mu_{x',i}\right| |z|^{i-1}
\end{aligned}$$

for every $z \in \mathbb{C}$ such that $|z| \geq 1$. Rearranging, it follows that $\left\|\mu^N_x - \mu^N_{x'}\right\|_1$ is lower-bounded by

$$|z|^{-N} \left( \left|\overline{P}_x(z) - \overline{P}_{x'}(z)\right| - \sum_{i=N+1}^{\infty} \left|\mu_{x,i} - \mu_{x',i}\right| |z|^{i-1} \right) \tag{4}$$

for any such $z$. The lower bound in (2), and thus Theorem 1, follows by combining (4) with the next two lemmas, each bounding a different term in the right-hand side of (4).

Lemma 2. There exist constants $c_1, c_2 > 0$ such that for $n$ large enough and any distinct strings $x, x' \in \{-1,1\}^n$, it holds that

$$\left|\overline{P}_x(z) - \overline{P}_{x'}(z)\right| \geq e^{-c_2 n^{1/3}}$$

for some $z$ satisfying $1 \leq |z| \leq e^{c_1 n^{-2/3}}$.

Lemma 3. If $1 \leq |z| \leq e^{c_1 n^{-2/3}}$ for some constant $c_1 > 0$, there exist constants $c_3, c_4 > 0$ such that if $N = c_3 n$ then

$$\sum_{i=N+1}^{\infty} \left|\mu_{x,i} - \mu_{x',i}\right| |z|^{i-1} \leq e^{-c_4 n}$$

for all distinct $x, x' \in \{-1,1\}^n$ when $n$ is large enough.

Invoking Lemmas 2 and 3, we have that for $n$ large enough and any distinct $x, x' \in \{-1,1\}^n$, there exists an appropriate choice $z^\star \in \mathbb{C}$, possibly depending on $x$ and $x'$, which satisfies $1 \leq |z^\star| \leq e^{c_1 n^{-2/3}}$. Setting $z = z^\star$ and $N = c_3 n$ in (4) yields

$$\left\|\mu^N_x - \mu^N_{x'}\right\|_1 \geq e^{-c_1 c_3 n^{1/3}} \left( e^{-c_2 n^{1/3}} - e^{-c_4 n} \right) \geq e^{-Cn^{1/3}}$$

for some constant $C > 0$, implying (2).

We prove Lemmas 2 and 3 in Sections 3 and 4, respectively, which completes the argument.
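The naive decoder described above (estimate the mean trace as in (1), then output the candidate whose mean trace is closest in 1-norm) can be sketched in a few lines of Python. The sketch is restricted to the deletion channel with deletion probability `delta`, viewed as the special case where $M \in \{0, 1\}$, $\mathcal{R} = \{1\}$ whenever $M = 1$, and $p_{\mathrm{flip}} = 0$. For this channel the mean trace has the closed form $\mu_{x,j} = \sum_{i \geq j} x_i \binom{i-1}{j-1} (1-\delta)^{j} \delta^{i-j}$, since input bit $i$ lands at output position $j$ exactly when it survives and $j - 1$ of the preceding $i - 1$ bits survive. All function names are hypothetical, and the exhaustive search over $2^n$ candidates is only feasible for very small $n$.

```python
import random
from itertools import product
from math import comb


def deletion_trace(x, delta, rng):
    """One trace of the deletion channel: each bit is kept with probability 1 - delta."""
    return [bit for bit in x if rng.random() >= delta]


def deletion_mean_trace(x, delta):
    """Exact mean trace (mu_{x,1}, ..., mu_{x,n}) of the deletion channel."""
    n, q = len(x), 1.0 - delta
    mu = [0.0] * n
    for i in range(1, n + 1):          # input position (1-indexed)
        for j in range(1, i + 1):      # output position (1-indexed)
            mu[j - 1] += x[i - 1] * comb(i - 1, j - 1) * q**j * delta**(i - j)
    return mu


def empirical_mean(traces, length):
    """Coordinate-wise average of zero-padded traces, as in equation (1)."""
    t = len(traces)
    mu_hat = [0.0] * length
    for trace in traces:
        for j, symbol in enumerate(trace[:length]):
            mu_hat[j] += symbol / t
    return mu_hat


def mean_based_decode(mu_hat, n, delta):
    """Return the candidate in {-1, 1}^n whose mean trace is closest to mu_hat in 1-norm."""
    def distance(candidate):
        mu = deletion_mean_trace(candidate, delta)
        return sum(abs(a - b) for a, b in zip(mu, mu_hat))
    return list(min(product((-1, 1), repeat=n), key=distance))
```

In use, one would draw `traces = [deletion_trace(x, delta, rng) for _ in range(t)]`, compute `empirical_mean(traces, len(x))`, and pass the result to `mean_based_decode`. How large `t` must be for the output to equal `x` is exactly what the bound (2) controls.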
Proof of Lemma 2
Our proof of Lemma 2 follows the blueprint of [9, Sections 4 and 5] and [10, Sections 2 and 3]. The key differences lie in Lemmas 4 and 6 below. Lemma 4 generalizes [9, Section 4 and Appendix A.3] and [10, Lemmas 2.1 and 5.2] to arbitrary replication-insertion channels well beyond the deletion and geometric insertion-deletion channels. Lemma 6 requires analyzing the local behavior of the inverse of an arbitrary probability generating function (PGF) in the complex plane around $z = 1$. Remarkably, the desired behavior follows by combining the standard inverse function theorem for analytic functions with basic properties of PGFs. In contrast, the PGFs associated to the deletion and geometric insertion-deletion channels treated in [9, 10, 18, 19] are all Möbius transformations, meaning that their inverses have simple explicit expressions and could be easily analyzed directly.

As a first step, we show that the mean trace power series $\overline{P}_x$ is related to the input polynomial $P_x$ through a change of variable. This allows us to bound $\left|\overline{P}_x(z) - \overline{P}_{x'}(z)\right|$ in terms of $\left|P_x(w) - P_{x'}(w)\right|$ for some $w$ related to $z$.

Lemma 4. Let $\mathsf{Ch}_{(M,\mathcal{R},p_{\mathrm{flip}})}$ be a replication-insertion channel. Suppose $E[|\mathcal{R}|] > 0$ is finite. Let $W$ be the distribution given by

$$W(j) = \frac{\Pr[j + 1 \in \mathcal{R}]}{E[|\mathcal{R}|]}, \quad j = 0, 1, 2, \dots,$$

and $g_M$ and $g_W$ be the probability generating functions of $M$ and $W$, respectively. Then for every $x \in \{-1,1\}^n$ and $z \in \mathbb{C}$ such that $z$ is in the disk of convergence of both $g_M$ and $g_W$, we have

$$\overline{P}_x(z) = (1 - 2p_{\mathrm{flip}}) \cdot E[|\mathcal{R}|] \cdot g_W(z) \cdot P_x(g_M(z)).$$

Let $C_0 := (1 - 2p_{\mathrm{flip}}) \cdot E[|\mathcal{R}|]$. Then, Lemma 4 yields

$$\left|\overline{P}_x(z) - \overline{P}_{x'}(z)\right| = |C_0| \cdot \left|g_W(z)\right| \cdot \left|P_x(g_M(z)) - P_{x'}(g_M(z))\right|. \tag{5}$$

Analogously to [9, 10], we use the following lemma, due to Borwein and Erdélyi [37], to lower bound $\left|P_x(g_M(z)) - P_{x'}(g_M(z))\right|$.

Lemma 5 ([37]). There is a universal constant $c > 0$ for which the following holds: Let $a = (a_0, \dots, a_{\ell-1}) \in \{-1, 0, 1\}^{\ell}$ be non-zero and define $A(w) := \sum_{j=0}^{\ell-1} a_j w^j$. Let $\gamma_L$ denote the arc $\left\{e^{i\varphi} : \varphi \in \left[-\frac{\pi}{L}, \frac{\pi}{L}\right]\right\}$. Then, we have $\max_{w \in \gamma_L} |A(w)| \geq e^{-cL}$ for every $L > 0$.

Since $\frac{1}{2}(P_x - P_{x'})$ is a non-zero polynomial with coefficients in $\{-1, 0, 1\}$, this lemma implies that there is a constant $C_1 > 0$ such that for any $L > 0$ there exists $w_L = e^{i\varphi_L}$ with $|\varphi_L| \leq \frac{\pi}{L}$ satisfying

$$\left|P_x(w_L) - P_{x'}(w_L)\right| \geq e^{-C_1 L}. \tag{6}$$

We can use (6) to lower bound (5), provided there exists $z_L$ such that $g_M(z_L) = w_L$ with good properties. The following lemma ensures this.

Lemma 6. For $L$ large enough there is a constant $c' > 0$ such that for any $\varphi \in \left[-\frac{\pi}{L}, \frac{\pi}{L}\right]$ there exists $z_\varphi$ satisfying $g_M(z_\varphi) = e^{i\varphi}$, $\left|g_W(z_\varphi)\right| \geq 1/2$, and $1 \leq \left|z_\varphi\right| \leq 1 + c'\varphi^2$.
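For the deletion channel, Lemma 6 can be checked by hand, because the PGF is affine and its inverse is explicit. Writing $\delta$ for the deletion probability, $g_M(z) = \delta + (1 - \delta) z$ and $g_W \equiv 1$, so $z_\varphi = (e^{i\varphi} - \delta)/(1 - \delta)$ and $|z_\varphi|^2 = 1 + 2\delta(1 - \cos\varphi)/(1 - \delta)^2$. Using $1 - \cos\varphi \leq \varphi^2/2$ and $\sqrt{1 + u} \leq 1 + u/2$ gives the admissible constant $c' = \delta / (2(1 - \delta)^2)$ for this particular channel. The Python sketch below, with hypothetical helper names, evaluates both sides numerically; it illustrates the statement for one channel only and is not part of the general proof.

```python
import cmath


def deletion_pgf(z, delta):
    """PGF of M for the deletion channel: M = 0 w.p. delta and M = 1 otherwise."""
    return delta + (1 - delta) * z


def inverse_point(phi, delta):
    """The point z_phi with deletion_pgf(z_phi) = exp(i * phi)."""
    return (cmath.exp(1j * phi) - delta) / (1 - delta)


def modulus_upper_bound(phi, delta):
    """1 + c' * phi^2 with c' = delta / (2 * (1 - delta)^2)."""
    return 1 + delta / (2 * (1 - delta) ** 2) * phi**2


def lemma6_holds(phi, delta, tol=1e-12):
    """Check the three conclusions of Lemma 6 for the deletion channel."""
    z = inverse_point(phi, delta)
    maps_back = abs(deletion_pgf(z, delta) - cmath.exp(1j * phi)) < tol
    modulus_ok = 1 - tol <= abs(z) <= modulus_upper_bound(phi, delta) + tol
    # g_W is identically 1 here, so |g_W(z_phi)| >= 1/2 holds trivially.
    return maps_back and modulus_ok
```

For instance, `lemma6_holds(0.5, 0.3)` evaluates the check at $\varphi = 0.5$ and $\delta = 0.3$. For a general channel no such closed form is available, which is why the proof goes through the inverse function theorem.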
As a result of Lemma 6, we can choose a $z_L$ that satisfies $g_M(z_L) = w_L$, $|g_W(z_L)| \geq 1/2$, and $1 \leq |z_L| \leq 1 + \frac{C_2}{L^2} \leq e^{C_2/L^2}$ for some constant $C_2 > 0$ and large enough $L$. Using this together with (6), we obtain

$$\left|P_x(g_M(z_L)) - P_{x'}(g_M(z_L))\right| = \left|P_x(w_L) - P_{x'}(w_L)\right| \geq e^{-C_1 L}, \tag{7}$$

where $C_1 > 0$ is the constant from (6). Set $L = n^{1/3}$. Combining (5), (7), and the fact that $|g_W(z_L)| \geq 1/2$ for this choice of $L$, we obtain

$$\left|\overline{P}_x(z_L) - \overline{P}_{x'}(z_L)\right| \geq |C_0| \cdot \left|g_W(z_L)\right| \cdot e^{-C_1 L} \geq e^{-C_3 n^{1/3}}$$

for some constant $C_3 > 0$ when $n$ is large enough, where $C_0$ is the channel-dependent constant appearing in (5). This concludes the proof of Lemma 2 assuming Lemmas 4 and 6.

In this section, we prove the remaining lemmas.
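Before the proofs, the identity of Lemma 4 can be sanity-checked numerically on a small channel. The Python sketch below makes several assumptions for illustration: the channel has finite support and is described by a hypothetical list of `(probability, m, R)` outcomes, the helper names are not from the paper, and the mean trace is computed by exact enumeration of all outcome sequences. It uses the fact that a replicated bit $x_i$ flipped with probability $p_{\mathrm{flip}}$ has mean $(1 - 2p_{\mathrm{flip}}) x_i$, while inserted uniform bits have mean zero.

```python
from itertools import product


def mean_trace(x, outcomes, p_flip):
    """Exact mean trace by enumerating every sequence of per-bit outcomes."""
    n = len(x)
    max_m = max(m for _, m, _ in outcomes)
    mu = [0.0] * (n * max_m)
    for combo in product(outcomes, repeat=n):
        prob = 1.0
        for p, _, _ in combo:
            prob *= p
        offset = 0  # total length produced by the previous input bits
        for x_i, (_, m, r) in zip(x, combo):
            for k in r:
                # Output position offset + k (1-indexed) replicates x_i.
                mu[offset + k - 1] += prob * (1 - 2 * p_flip) * x_i
            offset += m
    return mu


def lemma4_sides(x, outcomes, p_flip, z):
    """Return both sides of the identity in Lemma 4 evaluated at z."""
    # Left-hand side: the mean trace power series (a finite sum here).
    lhs = sum(mu_j * z**j for j, mu_j in enumerate(mean_trace(x, outcomes, p_flip)))
    # PGF of M.
    g_m = sum(p * z**m for p, m, _ in outcomes)
    # E[|R|] * g_W(z) = sum over k >= 1 of Pr[k in R] * z^(k - 1).
    r_series = sum(p * z ** (k - 1) for p, _, r in outcomes for k in r)
    # Input polynomial P_x evaluated at g_M(z).
    p_x = sum(x_i * g_m**i for i, x_i in enumerate(x))
    rhs = (1 - 2 * p_flip) * r_series * p_x
    return lhs, rhs
```

As a usage example, the outcome list `[(0.2, 0, set()), (0.5, 1, {1}), (0.3, 3, {1, 3})]` describes a channel that deletes a bit, copies it once, or outputs a block of three symbols with replications at positions 1 and 3 and a random bit in between. Evaluating `lemma4_sides` at any complex `z` should then return two values that agree up to floating-point error.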
Proof of Lemma 4. Fix an input string $x \in \{-1,1\}^n$. For each $i \in [n]$, let $\mathcal{R}_i$ denote the indices of $Y_x$ that correspond to replications of $x_i$ and let $\mathcal{I}_i$ denote the indices of $Y_x$ that correspond to insertions of random bits resulting from the channel's action on $x_i$. Then we may write

$$(Y'_x)_j = \sum_{i=1}^{n} \left[ B_{i,j}\, x_i \cdot \mathbf{1}_{\{j \in \mathcal{R}_i\}} + \mathbf{1}_{\{j \in \mathcal{I}_i\}} \cdot U_{i,j} \right],$$

where the $U_{i,j}$ are uniformly distributed over $\{-1,1\}$, the $B_{i,j}$ are random variables over $\{-1,1\}$ that are $-1$ with probability $p_{\mathrm{flip}}$ and $1$ with probability $1 - p_{\mathrm{flip}}$, and all these are independent of each other, $\mathcal{R}_i$, and $\mathcal{I}_i$. Note that if an output bit is in $\mathcal{R}_i$, then it has expected value $(1 - 2p_{\mathrm{flip}}) x_i$. Therefore, we have that

$$E[(Y'_x)_j] = \sum_{i=1}^{n} (1 - 2p_{\mathrm{flip}})\, x_i \cdot \Pr[j \in \mathcal{R}_i].$$

We can use this to show that

$$\begin{aligned}
\overline{P}_x(z) &= \sum_{j=1}^{\infty} E[(Y'_x)_j] \cdot z^{j-1} \\
&= \sum_{j=1}^{\infty} \sum_{i=1}^{n} (1 - 2p_{\mathrm{flip}})\, x_i \cdot \Pr[j \in \mathcal{R}_i] \cdot z^{j-1} \\
&= (1 - 2p_{\mathrm{flip}}) \sum_{i=1}^{n} x_i \cdot \sum_{j=1}^{\infty} \Pr[j \in \mathcal{R}_i] \cdot z^{j-1}.
\end{aligned} \tag{8}$$

We proceed to simplify $\sum_{j=1}^{\infty} \Pr[j \in \mathcal{R}_i] \cdot z^{j-1}$. Let $M^{(\ell)} := \sum_{k=1}^{\ell} M_k$, where the $M_k := |Y_{x_k}|$ denote the lengths of the channel outputs associated to each input bit $x_k$ and are i.i.d. according to $M$. We have

$$\begin{aligned}
\sum_{j=1}^{\infty} \Pr[j \in \mathcal{R}_i] \cdot z^{j-1} &= \sum_{j=1}^{\infty} \Pr[M^{(i-1)} < j,\, j \in \mathcal{R}_i] \cdot z^{j-1} \\
&= \sum_{j=1}^{\infty} \sum_{j'=0}^{j-1} \Pr[M^{(i-1)} = j'] \cdot \Pr[j \in \mathcal{R}_i \mid M^{(i-1)} = j'] \cdot z^{j-1} \\
&= \sum_{j=1}^{\infty} \sum_{j'=0}^{j-1} \Pr[M^{(i-1)} = j'] \cdot \Pr[j - j' \in \mathcal{R}] \cdot z^{j-1} \\
&= \sum_{j'=0}^{\infty} \Pr[M^{(i-1)} = j'] \cdot \sum_{j=j'+1}^{\infty} \Pr[j - j' \in \mathcal{R}] \cdot z^{j-1} \\
&= \sum_{j'=0}^{\infty} \Pr[M^{(i-1)} = j']\, z^{j'} \cdot \sum_{j=1}^{\infty} \Pr[j \in \mathcal{R}] \cdot z^{j-1} \\
&= g_M(z)^{i-1} \cdot \sum_{j=1}^{\infty} \Pr[j \in \mathcal{R}] \cdot z^{j-1}.
\end{aligned} \tag{9}$$

We can interchange the sums above because $z$ is in the disk of convergence of $g_M$ and $g_W$. From the definition of $W$,

$$g_W(z) = \sum_{j=0}^{\infty} W(j) \cdot z^{j} = \frac{1}{E[|\mathcal{R}|]} \cdot \sum_{j=1}^{\infty} \Pr[j \in \mathcal{R}] \cdot z^{j-1}. \tag{10}$$

Combining (9) with (10) yields

$$\sum_{j=1}^{\infty} \Pr[j \in \mathcal{R}_i] \cdot z^{j-1} = E[|\mathcal{R}|] \cdot g_W(z) \cdot g_M(z)^{i-1}.$$

Recalling (8) concludes the proof.

We prove Lemma 6 using the standard inverse function theorem stated below.
Lemma 7 ([38, Section VIII.4], adapted). Let $g : \Omega \to \mathbb{C}$ be a non-constant function analytic on a connected open set $\Omega \subseteq \mathbb{C}$ such that $g'(z) \neq 0$ for a given $z \in \Omega$. Then, there exist radii $\rho, \delta > 0$ such that for every $w \in D_\delta(g(z))$ there exists a unique $z_w \in D_\rho(z)$ satisfying $g(z_w) = w$. Moreover, the inverse function $f : D_\delta(g(z)) \to D_\rho(z)$ defined as $f(w) = z_w$ is analytic on $D_\delta(g(z))$.

Proof of Lemma 6. Because $M$ is sub-exponential and non-trivial, $g_M$ is a non-constant analytic function on some open ball $D_r(0)$ of radius $r > 1$, and $g'_M(1) = E[M] \neq 0$. Hence, Lemma 7 applies with $g = g_M$ and $z = 1$, so there exist $\rho, \delta > 0$ and an analytic function $f : D_\delta(1) \to D_\rho(1)$ such that $g_M(f(w)) = w$. In particular, there exists $\gamma \in (0, \delta)$ such that for every $w \in D_\gamma(1)$ we can write

$$f(w) = 1 + f'(1)(w - 1) + \sum_{i=2}^{\infty} \frac{f^{(i)}(1)}{i!} (w - 1)^i. \tag{11}$$

This is because $f(1) = 1$, since $g_M(1) = 1$. Furthermore,

$$\sum_{i=2}^{\infty} \left|\frac{f^{(i)}(1)}{i!}\right| \cdot |w - 1|^i \leq c''\, |w - 1|^2$$

for some constant $c'' > 0$ that does not depend on $w$.

Assume that $L$ is large enough so that $e^{i\varphi} \in D_\gamma(1)$ for all $\varphi \in \left[-\frac{\pi}{L}, \frac{\pi}{L}\right]$. Then, we set $z_\varphi = f(e^{i\varphi})$. Note that $g_M(z_\varphi) = e^{i\varphi}$ by the definition of $f$, as required.

Combining (11) with $w = e^{i\varphi}$ and the triangle inequality, we have

$$\left|z_\varphi - 1\right| = O\left(\left|e^{i\varphi} - 1\right|\right) \to 0$$

as $L \to \infty$. Since $g_W$ is a continuous function on a neighborhood of 1, and $g_W(1) = 1$, it follows that $\left|g_W(z_\varphi)\right| \geq 1/2$ if $L$ is large enough. On the other hand, combining (11) with the fact that

$$f'(1) = \frac{1}{g'_M(f(1))} = \frac{1}{g'_M(1)} = \frac{1}{E[M]} \in \mathbb{R},$$

by the chain rule, we obtain

$$\begin{aligned}
\left|z_\varphi\right| &\leq \left|1 + \frac{e^{i\varphi} - 1}{E[M]}\right| + c'' \left|e^{i\varphi} - 1\right|^2 \\
&\leq \sqrt{\left(1 - \frac{1 - \cos(\varphi)}{E[M]}\right)^2 + \frac{\sin^2(\varphi)}{E[M]^2}} + c'' \varphi^2 \\
&\leq \sqrt{1 + \frac{2(1 - \cos(\varphi))}{E[M]^2}} + c'' \varphi^2 \\
&\leq 1 + \left(\frac{1}{E[M]^2} + c''\right) \varphi^2.
\end{aligned}$$

The second inequality holds because $\left|e^{i\varphi} - 1\right| \leq |\varphi|$. The last inequality follows by noting that $1 - \cos(\varphi) \leq \varphi^2/2$ and $\sqrt{x} \leq x$ for $x \geq 1$.

Proof of Lemma 3

To conclude the argument, we prove Lemma 3 using an argument analogous to [9, Appendix A.2] and the fact that $M$ is sub-exponential.

Let $M_1, M_2, \dots, M_n$ be i.i.d. according to $M$, and set $M^{(n)} = \sum_{i=1}^{n} M_i$. Since every trace coordinate lies in $[-1, 1]$ and coordinate $i$ is non-zero only if the trace has length at least $i$, we have

$$\left|\mu_{x,i} - \mu_{x',i}\right| \leq 2 \Pr[M^{(n)} \geq i]$$

for every $i$. Since $M$ is sub-exponential, a direct application of Bernstein's inequality [39, Theorem 2.8.1] guarantees the existence of constants $c_3, c_5 > 0$ such that for $N = c_3 n$ and any $j \geq 1$ we have

$$\Pr[M^{(n)} \geq N + j] \leq e^{-c_5 (N + j)}.$$

Combining these observations with the assumption that $|z| \leq e^{c_1 n^{-2/3}}$ yields

$$\sum_{i=N+1}^{\infty} \left|\mu_{x,i} - \mu_{x',i}\right| |z|^{i-1} \leq 2 \sum_{j=1}^{\infty} e^{-c_5 (N + j)} \cdot e^{c_1 n^{-2/3} (N + j)} \leq e^{-c_4 n}$$

for some constant $c_4 > 0$ and $n$ large enough.

Future Work
We have shown that $\exp(O(n^{1/3}))$ traces suffice for mean-based worst-case trace reconstruction over a broad class of replication-insertion channels. However, our channel model does not cover all discrete memoryless synchronization channels as defined by Dobrushin [6, 7]. It would be interesting to extend the result in some form to all such non-trivial channels. On the other hand, to complement the above, it would be interesting to prove trace complexity lower bounds for mean-based reconstruction over all these channels.

Furthermore, it is unclear whether the assumption that $M$ is sub-exponential is necessary for our result. A clear extension of this work would be to either remove this condition or prove that it is necessary for mean-based trace reconstruction from $\exp(O(n^{1/3}))$ traces.

References

[1] V. I. Levenshtein, "Efficient reconstruction of sequences," IEEE Transactions on Information Theory, vol. 47, no. 1, pp. 2–22, Jan 2001.
[2] V. I. Levenshtein, "Efficient reconstruction of sequences from their subsequences or supersequences," Journal of Combinatorial Theory, Series A, vol. 93, no. 2, pp. 310–332, 2001.
[3] T. Batu, S. Kannan, S. Khanna, and A. McGregor, "Reconstructing strings from random traces," in Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2004, pp. 910–918.
[4] S. Kannan and A. McGregor, "More on reconstructing strings from random traces: insertions and deletions," 2005, pp. 297–301.
[5] K. Viswanathan and R. Swaminathan, "Improved string reconstruction over insertion-deletion channels," in Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2008, pp. 399–408.
[6] R. L. Dobrushin, "Shannon's theorems for channels with synchronization errors," Problemy Peredachi Informatsii, vol. 3, no. 4, pp. 18–36, 1967.
[7] M. Cheraghchi and J. Ribeiro, "An overview of capacity results for synchronization channels," IEEE Transactions on Information Theory, 2020, to appear. DOI: 10.1109/TIT.2020.2997329.
[8] T. Holenstein, M. Mitzenmacher, R. Panigrahy, and U. Wieder, "Trace reconstruction with constant deletion probability and related results," in Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2008, pp. 389–398.
[9] A. De, R. O'Donnell, and R. A. Servedio, "Optimal mean-based algorithms for trace reconstruction," Annals of Applied Probability, vol. 29, no. 2, pp. 851–874, Apr 2019.
[10] F. Nazarov and Y. Peres, "Trace reconstruction with $\exp(O(n^{1/3}))$ samples," in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2017, pp. 1042–1046.
[11] S. M. H. T. Yazdi, H. M. Kiah, E. Garcia-Ruiz, J. Ma, H. Zhao, and O. Milenkovic, "DNA-based storage: Trends and methods," IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 1, no. 3, pp. 230–248, 2015.
[12] S. M. H. T. Yazdi, R. Gabrys, and O. Milenkovic, "Portable and error-free DNA-based data storage," Scientific Reports, vol. 7, no. 1, p. 5011, 2017.
[13] L. Organick, S. D. Ang, Y.-J. Chen, R. Lopez, S. Yekhanin, K. Makarychev, M. Z. Racz, G. Kamath, P. Gopalan, B. Nguyen et al., "Random access in large-scale DNA data storage," Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
[14] M. Mitzenmacher, "A survey of results for deletion channels and related synchronization channels," Probability Surveys, vol. 6, pp. 1–33, 2009.
[15] H. Mercier, V. K. Bhargava, and V. Tarokh, "A survey of error-correcting codes for channels with symbol synchronization errors," IEEE Communications Surveys Tutorials, vol. 12, no. 1, pp. 87–96, First Quarter 2010.
[16] B. Haeupler and A. Shahrasbi, "Synchronization strings and codes for insertions and deletions – a survey," IEEE Transactions on Information Theory, 2021, to appear. Available at https://arxiv.org/abs/2101.00711.
[17] L. Hartung, N. Holden, and Y. Peres, "Trace reconstruction with varying deletion probabilities," in Proceedings of the 15th Workshop on Analytic Algorithmics and Combinatorics (ANALCO), 2018, pp. 54–61.
[18] Y. Peres and A. Zhai, "Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice," Oct 2017, pp. 228–239.
[19] N. Holden, R. Pemantle, and Y. Peres, "Subpolynomial trace reconstruction for random strings and arbitrary deletion probability," in Proceedings of the 31st Conference On Learning Theory (COLT), 2018, pp. 1799–1840.
[20] S. Davies, M. Z. Racz, and C. Rashtchian, "Reconstructing trees from traces," in Proceedings of the 32nd Conference on Learning Theory (COLT), 2019, pp. 961–978.
[21] A. Krishnamurthy, A. Mazumdar, A. McGregor, and S. Pal, "Trace reconstruction: Generalized and parameterized," 2019, pp. 68:1–68:25.
[22] S. Narayanan and M. Ren, "Circular trace reconstruction," arXiv e-prints, p. arXiv:2009.01346, Sep. 2020, to appear in ITCS 2021.
[23] E. Grigorescu, M. Sudan, and M. Zhu, "Limitations of mean-based algorithms for trace reconstruction at small distance," arXiv e-prints, p. arXiv:2011.13737, Nov. 2020.
[24] Z. Chase, "New upper bounds for trace reconstruction," arXiv e-prints, p. arXiv:2009.03296, Sep. 2020.
[25] X. Chen, A. De, C. H. Lee, R. A. Servedio, and S. Sinha, "Polynomial-time trace reconstruction in the smoothed complexity model," in Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021, pp. 54–73.
[26] A. McGregor, E. Price, and S. Vorotnikova, "Trace reconstruction revisited," 2014, pp. 689–700.
[27] X. Chen, A. De, C. H. Lee, R. A. Servedio, and S. Sinha, "Polynomial-time trace reconstruction in the low deletion rate regime," arXiv e-prints, p. arXiv:2012.02844, Dec. 2020, to appear in ITCS 2021.
[28] N. Holden and R. Lyons, "Lower bounds for trace reconstruction," Ann. Appl. Probab., vol. 30, no. 2, pp. 503–525, Apr. 2020.
[29] Z. Chase, "New lower bounds for trace reconstruction," arXiv e-prints, p. arXiv:1905.03031, May 2019, to appear in Ann. Inst. Henri Poincaré Probab. Stat.
[30] M. Cheraghchi, R. Gabrys, O. Milenkovic, and J. Ribeiro, "Coded trace reconstruction," IEEE Transactions on Information Theory, vol. 66, no. 10, pp. 6084–6103, 2020.
[31] J. Brakensiek, R. Li, and B. Spang, "Coded trace reconstruction in a constant number of traces," 2020, pp. 482–493.
[32] S. Davies, M. Z. Racz, C. Rashtchian, and B. G. Schiffer, "Approximate trace reconstruction," arXiv e-prints, p. arXiv:2012.06713, Dec. 2020.
[33] V. Bhardwaj, P. A. Pevzner, C. Rashtchian, and Y. Safonova, "Trace reconstruction problems in computational biology," IEEE Transactions on Information Theory, 2020, to appear. DOI: 10.1109/TIT.2020.3030569.
[34] F. Ban, X. Chen, A. Freilich, R. A. Servedio, and S. Sinha, "Beyond trace reconstruction: Population recovery from the deletion channel," Nov 2019.
[35] F. Ban, X. Chen, R. A. Servedio, and S. Sinha, "Efficient average-case population recovery in the presence of insertions and deletions," in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019), 2019, pp. 44:1–44:18.
[36] S. Narayanan, "Improved algorithms for population recovery from the deletion channel," in Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021, pp. 1259–1278.
[37] P. Borwein and T. Erdélyi, "Littlewood-type problems on subarcs of the unit circle," Indiana University Mathematics Journal, pp. 1323–1346, 1997.
[38] T. Gamelin, Complex Analysis, ser. Undergraduate Texts in Mathematics. Springer New York, 2001.
[39] R. Vershynin,