Efficient Document Exchange and Error Correcting Codes with Asymmetric Information
Kuan Cheng ∗ Xin Li † July 20, 2020
Abstract
We study two fundamental problems in communication, Document Exchange (DE) and Error Correcting Codes (ECC). In the first problem, two parties hold two strings, and one party tries to learn the other party's string through communication. In the second problem, one party tries to send a message to another party through a noisy channel, by adding some redundant information to protect the message. Two important goals in both problems are to minimize the communication complexity or redundancy, and to design efficient protocols or codes.

Both problems have been studied extensively. In this paper we study whether asymmetric partial information can help in these two problems. We focus on the case of Hamming distance/errors, and the asymmetric partial information is modeled by one party having a vector of disjoint subsets S = (S_1, ..., S_t) of indices and a vector of integers k = (k_1, ..., k_t), such that in each S_i the Hamming distance/number of errors is at most k_i. To our knowledge, no previous work has studied this problem systematically. We establish both lower bounds and upper bounds in this model, and provide efficient randomized constructions that achieve a min{O(t), O(log log n)} factor within the optimum, with almost linear running time.

We further show a connection between the above document exchange problem and the problem of document exchange under edit distance, and use our techniques to give an efficient randomized protocol with optimal communication complexity and exponentially small error for the latter. This improves the previous result by Haeupler [18] (FOCS'19), which has polynomially large error, and that by Belazzougui and Zhang [8] (FOCS'16), which is only optimal for a limited range of parameters. Our techniques are based on a generalization of the celebrated expander codes of Sipser and Spielman [33], which may be of independent interest.

∗ [email protected]. Department of Computer Science, University of Texas at Austin. Supported by a Simons Investigator Award.
† [email protected]. Department of Computer Science, Johns Hopkins University. Supported by NSF Award CCF-1617713 and NSF CAREER Award CCF-1845349.

1 Introduction
Document exchange, first introduced and studied by Orlitsky [29] and subsequently named by Cormode et al. [13], is a fundamental problem in communication. Here, two parties Alice and Bob each hold a string (document), x and y respectively, and the goal is for one party to learn the other party's string with the least amount of communication possible. For simplicity, let us assume that both x and y have n bits. If x and y can be arbitrary strings, then it is clear that in the worst case the communication needs at least n bits, i.e., sending one party's string to the other party. However, in practice this is often not the case, and x and y can actually be close in some sense. For example, Alice and Bob may be two users holding different versions of some original document, where x and y are obtained after some edits of a string z. If the number of edits is limited, then it is possible for one party to learn the other party's string with significantly less communication. In this paper, we focus on the case where the strings have a binary alphabet.

More generally and formally, the document exchange problem can be described as follows. Alice and Bob each hold an n-bit string, x and y respectively, and the distance between x and y, D(x, y), is upper bounded by some number k. Here the distance D can be any measure of interest. Now, the first goal is to minimize the communication complexity as a function of n and k. In addition, it is also an important goal to keep the protocol efficient, i.e., we would like the communication protocol to run in time polynomial in n.

There has been a lot of work on the document exchange problem [29, 6, 7, 1, 13, 27, 35, 22, 23, 8, 11, 18, 12].
While Orlitsky [29] established some upper and lower bounds on the communication complexity of general "balanced" measures D(x, y), as well as exponential time protocols that can achieve the optimal communication, efficient protocols in subsequent works have mostly focused on the two natural cases where D(x, y) is either the Hamming distance or the edit distance. In the former, the distance is measured by how many bits of x and y differ at the corresponding locations, while in the latter the distance ED(x, y) is measured by the minimum number of insertions, deletions, and substitutions needed to transform one string into the other. Both distances are metrics, and edit distance strictly generalizes Hamming distance.

For both Hamming distance and edit distance, it is known that if D(x, y) ≤ k, then the optimal communication complexity in the document exchange problem is Θ(k log(n/k)), and this can be achieved by a deterministic one-round protocol running in exponential time. The situation of efficient protocols, however, is different for these two measures. For Hamming distance, we have an efficient, deterministic one-round protocol with optimal communication complexity Θ(k log(n/k)), based on Algebraic Geometry codes [21]. For edit distance, except for the exponential time deterministic one-round protocol in [29] which achieves optimal communication complexity, for a long time only efficient randomized one round protocols with sub-optimal communication complexity were known. These include the work of Irmak et al. [22] with communication complexity O(k log(n/k) log n), the work of Jowhari [23] with communication complexity O(k log^2 n log* n), the work of Chakraborty et al. [10] with communication complexity O(k^2 log n), and the work of Belazzougui and Zhang [8] with communication complexity O(k(log^2 k + log n)). In particular, the protocol in [8] has asymptotically optimal communication complexity for k = 2^{O(√(log n))}, with success probability 1 − 1/poly(k log n).

In 2018, Cheng et al. [11] and Haeupler [18] independently gave efficient, deterministic one-round protocols with communication complexity O(k log^2(n/k)). Finally, Haeupler [18] gave the first efficient randomized one-round protocol with optimal communication complexity O(k log(n/k)). However, his protocol only succeeds with probability 1 − 1/poly(n).

Document exchange is closely related to the (even more) fundamental problem of error correcting codes. The goal of an error correcting code is to ensure that one party can successfully send information to another party, despite errors caused by the communication channel. In this setting, the first party (Alice) runs an encoding algorithm that turns a message of m bits into a codeword of n bits, and sends the codeword to the second party (Bob) through a channel. Bob then tries to recover the message by running a decoding algorithm. Similar to document exchange, there are also two important goals here. First, one wants to keep n − m (the redundancy of the codeword) as small as possible, or alternatively, to keep m (the message length) as large as possible. Second, one needs both the encoding and decoding to be efficient, i.e., to run in time polynomial in m.

There has been extensive study of error correcting codes, which we will not be able to completely survey here. Again, the channel errors can follow several different models, and the most studied are Hamming errors and edit errors. For both cases, assuming k is an upper bound on the number of errors, it is known that the optimal message length one can achieve (with possibly exponential time encoding/decoding) is m = n − Θ(k log(n/k)). For Hamming errors, again we have efficient constructions matching this bound, based on Algebraic Geometry codes [21].
For edit errors the constructions are far behind, and for a long time we only had asymptotically optimal constructions for the two extreme cases of k = 1 [26] and k = αn for some small constant α > 0. The works [11, 18] gave efficient codes with message length m = n − O(k log^2(n/k)). Cheng et al. [11] further gave an efficient code with m = n − O(k log n), which is optimal for k ≤ n^{1−α} where α > 0 is any constant.

There is a standard connection between document exchange protocols and systematic error correcting codes, i.e., codes whose codeword contains the original message together with some redundant information called the syndrome. Given such a code, the syndrome can be used as the information sent in a document exchange protocol. Conversely, given a one round document exchange protocol, one can apply a standard error correcting code to the information sent and use the result as the syndrome in a systematic error correcting code.

In all previous works, Alice and Bob have symmetric information: they both know that their string is within distance D(x, y) ≤ k of the other party's string, or that the total number of errors in the received codeword is at most k. However, in many practical situations, each party may have some additional partial information that is not known to the other party. For example, in document exchange, if Bob has made edits in some specific parts of the original document, then even without carefully tracking the edits, Bob has some partial information about where the differences can occur. This information is not necessarily known to Alice. In another situation, suppose Alice sends a long string to Bob by Internet routing; then this string may be broken into several parts and transmitted to Bob through different channels. These channels may have different behavior and introduce different numbers of errors. While it is reasonable to assume that both parties know the parameters of all channels, due to the routing process Alice may not know which channels her parts are sent through. On the other hand, Bob can learn this information by observing the received parts. Thus Bob will have some partial information about the numbers of errors in specific parts of the received string, which is not known to Alice.
The first example applies to document exchange and the second example applies to error correcting codes. One can now ask the following natural question, which is the focus of this paper.

Question: Can we use such asymmetric information to reduce the communication complexity in document exchange or the redundancy in error correcting codes, while still designing efficient protocols or codes?
Towards answering this question, we first formally define our model.

1.1 The Model of Asymmetric Information
In this paper we focus on Hamming distance/Hamming errors in the model of asymmetric information. To model the asymmetric information, we assume that one party has some additional information about where the differences/errors can happen. More formally, we use a vector of disjoint subsets S = (S_1, ..., S_t) to indicate the positions where the differences/errors can happen, and a vector of integers k = (k_1, ..., k_t) to indicate the upper bounds on the numbers of differences/errors in each set S_i. For each S_i, let s_i denote the size of S_i, i.e., s_i = |S_i|. We also use s to denote the vector s = (s_1, ..., s_t). We assume the parameters (s, k, t) are known to both parties, and that (without loss of generality) k_1 ≥ k_2 ≥ ... ≥ k_t.

Definition 1.1 ((s, k, t) Asymmetric Document Exchange). There are two parties Alice and Bob. Alice has a string x ∈ {0,1}^n and Bob has a string y ∈ {0,1}^n. Both parties know (s, k, t). In addition, Bob knows a vector of disjoint subsets S = (S_1, ..., S_t), where ∀i, S_i ⊆ [n] and |S_i| = s_i, such that within each set S_i, the Hamming distance between x and y is at most k_i. One party tries to learn the string of the other party.

Definition 1.2 ((s, k, t) Asymmetric Error Correcting Code). There are two parties Alice and Bob. Both parties know (s, k, t). Alice encodes a message of m bits into a codeword of n bits, using a function Enc : {0,1}^m → {0,1}^n, and sends it to Bob. Bob knows a vector of disjoint subsets S = (S_1, ..., S_t), where ∀i, S_i ⊆ [n] and |S_i| = s_i, such that within each set S_i, there are at most k_i Hamming errors in the received codeword. Bob uses a function Dec : {0,1}^n → {0,1}^m to recover the message.

We require the protocol or code to succeed for every possible vector of disjoint subsets S = (S_1, ..., S_t) with |S_i| = s_i, ∀i, and for every possible distance/error pattern that is consistent with S = (S_1, ..., S_t) and k = (k_1, ..., k_t).

We consider both deterministic and randomized protocols/codes. In the case of randomized solutions, we assume that the two parties have shared randomness, as is standard in all previous works. In the case of error correcting codes, we further assume that the channel errors do not depend on the shared randomness.

Our model is quite general in capturing asymmetric information. A naive solution is to simply ignore the extra information, and apply a document exchange protocol or error correcting code for k = Σ_{i=1}^t k_i Hamming distance or Hamming errors. However, our goal here is to see if the extra information can be used to design better protocols or codes. Another natural strategy for the document exchange problem is for Bob to first send the descriptions of S to Alice, and they can then run a protocol on each set S_i. However, this strategy can result in a significant amount of communication, e.g., Σ_{i=1}^t s_i log n, which can be even larger than n. In some special situations, a set S_i may be a contiguous block in the string, and it suffices to just send the starting and ending indices, using 2 log n bits. If all sets S_i are of this form, then the total number of bits required is 2t log n. Even this number can be large when the number of sets t is large. We also stress that in our model and all results, each set S_i does not need to be a contiguous block. A final simple strategy is to try to form a large contiguous block which includes several S_i's, but this can increase the size of the sets significantly and thus also results in a penalty in the communication complexity.

Remark 1.3.
In the asymmetric document exchange problem, it may seem unreasonable to assume that Alice knows the vectors s, k. However, this is without loss of generality, up to a small loss in communication complexity and communication rounds. Basically, Bob can first send these two vectors to Alice. This only takes one round, and the number of bits sent by Bob is O(Σ_{i=1}^t (log k_i + log s_i)), while the number of bits needed to distinguish all possible error patterns is at least Σ_{i=1}^t log (s_i choose k_i). The former is always within a constant factor of (and in most cases smaller than) the latter.

Related previous works. While document exchange and error correcting codes with asymmetric information are natural questions, to our knowledge they have not been studied systematically. The only previous work we found is that of Belazzougui and Zhang [8], which studies a special case of our model with t = 1, i.e., Bob's extra information consists of a single subset S_1 with |S_1| = s_1. They use entirely different techniques to give a document exchange protocol with sub-optimal communication complexity O(k_1(log s_1 + log(1/ε))), where Bob can learn Alice's string with success probability 1 − ε.

However, there is a large body of work on a related topic [3, 25, 24, 36, 2, 5], which studies the problem of source coding/data compression with asymmetric information. In this setting, the decoder has some prior distribution µ not known to the encoder, and the encoder tries to send a set of items drawn independently from the distribution to the decoder, using as few bits as possible. The problem we study here, on the other hand, focuses on error correction. While there are similarities between these two problems, they are also fundamentally different. For example, all the efficient algorithms in these prior works run in time polynomial in the size of the support of µ.
This is prohibitive for our purposes, since in our setting this number is already exponentially large.

We note that source coding and error correction are the two most important applications of information theory. Thus, given the abundant works on source coding/data compression with asymmetric information, we believe a systematic study of document exchange and error correcting codes with asymmetric information is also an important direction.

We provide both lower bounds and upper bounds for document exchange and error correcting codes with asymmetric information. To simplify the presentation, we first define some quantities. Given two vectors s = (s_1, ..., s_t) and k = (k_1, ..., k_t), we define H(s, k) = log(Π_{i=1}^t Σ_{j=0}^{k_i} (s_i choose j)) = Σ_{i=1}^t log(Σ_{j=0}^{k_i} (s_i choose j)). Similarly, for two integers s and k with s ≥ k, we define H(s, k) = log(Σ_{j=0}^k (s choose j)). Note that if ∀i, s_i ≥ 2k_i and s ≥ 2k, then H(s, k) = Θ(Σ_{i=1}^t k_i log(s_i/k_i)) for the vector quantity and H(s, k) = Θ(k log(s/k)) for the scalar quantity. Recall that k = Σ_{i=1}^t k_i and s = Σ_{i=1}^t s_i ≤ n; hence H(s, k) ≤ H(n, k). We have the following theorem.

Theorem 1.4.
In an (s, k, t) asymmetric DE problem, we have:

• Suppose Alice learns Bob's string. Then any deterministic protocol has communication complexity at least H(n, k), and any randomized protocol with success probability ≥ 2/3 has communication complexity at least H(n, k) − O(1).

• Suppose Bob learns Alice's string. Then any randomized protocol with success probability ≥ 2/3 has communication complexity at least H(s, k) − O(1). Furthermore, if ∀i, s_i ≥ 2k_i, then any one round deterministic protocol has communication complexity at least H(n, k).

This theorem tells us the following important things. First, Bob's extra information is only useful for him to learn Alice's string, not in the other direction. Second, in the case of a one round protocol for Bob to learn Alice's string, for a wide range of parameters (i.e., when ∀i, s_i ≥ 2k_i), Bob's extra information is only useful in randomized protocols.

For upper bounds, we note that there are efficient deterministic protocols meeting the bound H(n, k), based on algebraic geometry codes. To meet the bound H(s, k), there is also a simple one round randomized protocol: Alice hashes her string x using a random hash function, and Bob enumerates all possible strings to find the one with the correct hash value. It is easy to see that this protocol succeeds if there is no hash collision, which happens with high probability if the hash function outputs O(H(s, k)) bits. However, this protocol runs in exponential time, and our main result is an efficient protocol that gets close to this bound.

To state our main theorem, we define another quantity χ(s, k, t) ∈ N: first partition the interval [2, n] into disjoint subintervals {I_j = [2^{2^{j−1}}, 2^{2^j})}, starting from j = 1. Then, for every i ∈ [t], put s_i/k_i into the corresponding subinterval. χ(s, k, t) is defined to be the number of subintervals I_j which contain at least one s_i/k_i. We now have the following theorem.

Theorem 1.5.
In an (s, k, t) asymmetric DE problem, suppose that ∀i, s_i ≥ 2k_i. There is an efficient randomized one round protocol for Bob to learn Alice's string, with communication complexity O(χ(s, k, t) H(s, k)) and error probability 2^{−Ω(k_t)} + 1/poly(s). The protocol runs in time Õ(n).

Note that χ(s, k, t) ≤ t and χ(s, k, t) = O(log log n), so the above theorem immediately gives the following two corollaries.

Corollary 1.6.
In an (s, k, t) asymmetric DE problem, suppose that ∀i, s_i ≥ 2k_i. There is an efficient randomized one round protocol for Bob to learn Alice's string, with communication complexity O(t H(s, k)) and error probability 2^{−Ω(k_t)} + 1/poly(s). The protocol runs in time Õ(n).

Corollary 1.7. In an (s, k, t) asymmetric DE problem, suppose that ∀i, s_i ≥ 2k_i. There is an efficient randomized one round protocol for Bob to learn Alice's string, with communication complexity O((log log n) H(s, k)) and error probability 2^{−Ω(k_t)} + 1/poly(s). The protocol runs in time Õ(n).

In particular, Corollary 1.6 implies that if t is a constant, then we have a one round protocol with asymptotically optimal communication complexity, while Corollary 1.7 gives a one round protocol with communication complexity optimal up to an additional O(log log n) factor. Both protocols run in near linear time. We also note that the simple strategy of ignoring the extra information can result in communication complexity Ω(H(s, k) log n) in the worst case.

Similarly, we have both lower bounds and upper bounds for error correcting codes with asymmetric information. The first theorem shows that such information is only useful for a randomized code.

Theorem 1.8.
In an (s, k, t) asymmetric ECC problem, if ∀i, s_i = |S_i| ≥ 2k_i, then any deterministic code must have distance at least 2k + 1. In particular, this means m ≤ n − H(n, k). Furthermore, any randomized code with success probability ≥ 2/3 must have message length m ≤ n − H(s, k) + 1.

Again, a code with randomized encoding and exponential time deterministic decoding can achieve message length m = n − O(H(s, k)). We design an efficient code that comes close to this.

Theorem 1.9.
In an (s, k, t) asymmetric ECC problem, suppose ∀i, s_i ≥ 2k_i. There is an efficient code with randomized encoding and deterministic decoding, which has message length m = n − O(χ(s, k, t) H(s, k)) and error probability 2^{−Ω(k_t)} + 1/poly(s). In particular, the message length can be max{n − O(t H(s, k)), n − O((log log n) H(s, k))}, and the running time is Õ(n).

Next we show that we can design efficient document exchange protocols with asymptotically optimal communication complexity in a special case, roughly when s, k are geometric progressions.

Theorem 1.10.
There is an efficient randomized one-round protocol for every (s, k, t) asymmetric DE problem where s_i = k_i 2^{Θ(i)} and k_i = max{k/2^{Θ(i)}, Θ(k/(t log(n/k)))} ≤ s_i/2. The communication complexity is O(k) and the error probability is 2^{−Ω(k/log(n/k))}.

We show that the problem of document exchange under edit distance can be reduced to the special case above, and thus we obtain the following theorem.

Theorem 1.11. There is an efficient randomized one-round protocol for the DE problem with edit distance at most k. The communication complexity is O(k log(n/k)) and the error probability is min{2^{−Θ(k/log(n/k))}, 1/poly(n)}.

We also have both lower bounds and upper bounds for document exchange where both parties have some asymmetric partial information, represented as a vector of disjoint subsets. For clarity of presentation we omit the results here, and refer the reader to Section 8 for details.
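The quantities H(s, k) and χ(s, k, t) used in the bounds above are straightforward to compute. The following small Python sketch (illustrative only; the function names and the numeric example are our own choices, not part of the paper) evaluates the vector quantity H(s, k) from its definition, and computes χ(s, k, t) using the doubly exponential intervals I_j = [2^{2^{j−1}}, 2^{2^j}), which is how we read the definition given that there are O(log log n) intervals. It also illustrates, on a t = 2 example of the kind discussed later, that ignoring the structure (i.e., using the scalar H(s, k) with s = Σ s_i, k = Σ k_i) can cost roughly a log n factor.

```python
import math

def H(s, k):
    """Vector quantity: H(s, k) = sum_i log2( sum_{j <= k_i} C(s_i, j) )."""
    total = 0.0
    for si, ki in zip(s, k):
        total += math.log2(sum(math.comb(si, j) for j in range(ki + 1)))
    return total

def chi(s, k):
    """Number of intervals I_j = [2^(2^(j-1)), 2^(2^j)) containing some s_i/k_i."""
    occupied = set()
    for si, ki in zip(s, k):
        ratio = si / ki                       # assumes s_i >= 2*k_i, so ratio >= 2
        # ratio lies in I_j  iff  j - 1 <= log2(log2(ratio)) < j
        j = max(1, math.floor(math.log2(math.log2(ratio))) + 1)
        occupied.add(j)
    return len(occupied)

# A t = 2 example in the spirit of the text: k_1 large with small s_1/k_1,
# k_2 small with large s_2/k_2.  Concrete numbers are our own.
n = 2 ** 20
s_vec = [round(10 * n ** 0.2), round(0.1 * n)]
k_vec = [round(n ** 0.2), 10]
print(H(s_vec, k_vec))                        # vector quantity
print(H([sum(s_vec)], [sum(k_vec)]))          # scalar H(s, k): noticeably larger
print(chi(s_vec, k_vec))
```

Here the scalar quantity exceeds the vector quantity by roughly a log n factor, which is exactly the gap the protocols in this paper are designed to close.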
Our lower bounds follow from relatively simple information theoretic arguments, so here we only provide an informal outline of our protocols. We start with the asymmetric document exchange for Hamming distance. Recall that the asymmetric information is in the form of S = (S_1, ..., S_t) and k = (k_1, ..., k_t), where ∀i, |S_i| = s_i and the Hamming distance within S_i is at most k_i. We assume ∀i, s_i ≥ 2k_i, and without loss of generality that k_1 ≥ k_2 ≥ ... ≥ k_t.

The protocol for one set.
Our starting point is the simplest case where t = 1, i.e., there is only one set S_1 of size s_1 and the Hamming distance in S_1 is at most k_1. In this case our goal is to give an efficient one round protocol with communication complexity O(k_1 log(s_1/k_1)). If s_1 = n then this can be achieved by using a systematic algebraic geometry code or an expander code [32]. We will use the latter, and we briefly review the application of expander codes in document exchange.

To run the protocol, the two parties choose a bipartite expander graph G : [n] × [d] → [m]. Alice associates her string x with the n vertices on the left, and computes a string z of length m as follows: for every i ∈ [m], let z_i = ⊕_{j ∈ Γ^{−1}(i)} x_j, where Γ^{−1}(i) is the set of neighbors of the right vertex i in the expander. The string z consists of a sequence of parity checks of x, and is then sent to Bob.

To recover x, Bob starts out with x̃ = y as his current version of x, and maintains another string z′ ∈ {0,1}^m computed in the same way as above, except replacing the string x by x̃, i.e., z′ consists of a sequence of parity checks of x̃. z and z′ will differ in several coordinates, and Bob will gradually modify x̃ into x by flipping some bits of x̃ according to the parity checks. This process is known as belief propagation, and works as follows. Bob keeps finding a bit of x̃ such that by flipping this bit, the Hamming distance between z′ and z decreases by at least one. Bob flips this bit and updates x̃ and z′ correspondingly. Bob stops when z′ = z, at which point x̃ = x and he has successfully recovered x.

For the analysis, we use the set R ⊆ [n] to denote the coordinates where x and x̃ differ. We say the i'th parity check is satisfied if z_i = z′_i, and unsatisfied otherwise. Let the numbers of satisfied and unsatisfied checks in Γ(R) (the neighbors of R) be s and u. Assume the graph has good expansion, i.e., |Γ(R)| = s + u ≥ 0.9 d|R|, and note that in Γ(R), each satisfied check has at least two neighbors in R. Thus 2s + u ≤ d|R|. From the two inequalities we deduce u ≥ 0.8 d|R|, and thus at least one left vertex has more unsatisfied parity checks than satisfied parity checks as neighbors, and Bob can flip this bit. The analysis holds as long as the expansion of the set R is guaranteed. Note that the number of unsatisfied checks is strictly decreasing in the process, so |R| can never be more than 1.25 k_1, since otherwise this would induce more than d k_1 unsatisfied checks, while at the beginning there are at most d k_1 unsatisfied checks. Therefore, we only need to guarantee the expansion of all R ⊆ [n] with |R| ≤ 1.25 k_1, and a random graph with m = O(k_1 log(n/k_1)) and d = O(log(n/k_1)) satisfies this property with high probability.

Going back to the case where s_1 < n, the first issue is that we cannot afford to use an expander with good expansion for all subsets R as before, since this would make m = Ω(k_1 log(n/k_1)). To fix this, we instead just require the expansion to hold for all subsets R ⊆ S_1 with |R| ≤ 1.25 k_1. Now, a random graph with m = O(k_1 log(s_1/k_1)) and d = O(log(s_1/k_1)) satisfies this property with high probability, and both parties can generate the same expander by using the shared randomness. Similarly, when recovering x Bob will always look for a bit in S_1 to flip. The analysis is now similar to the standard case, and this gives the protocol for the case of t = 1.

The protocol for two sets.
We now consider the case with t > 1. Our goal is to design an efficient one round protocol with communication complexity close to the vector quantity H(s, k).

The first idea may be to take the union of all S_i, i ∈ [t], as one set S, in which the Hamming distance is at most k = Σ_{i∈[t]} k_i. Now we can use the protocol for t = 1 described before. However, in this case the communication complexity will be O(H(s, k)) with the scalar s = Σ_{i∈[t]} s_i, which may not be close to the vector quantity H(s, k). For example, consider the case where t = 2, k_1 = n^{0.2}, s_1 = 10 n^{0.2}, k_2 = 10, s_2 = 0.1 n. A direct computation shows that the scalar quantity H(s, k) is Ω(log n) times larger than the vector quantity H(s, k), i.e., it is ω(H(s, k)). It also appears hard to improve this if we just use a single expander graph, since the decoding requires good expansion for all possible subsets of errors during the belief propagation, which can potentially be all possible subsets of size Ω(k). This forces the size of the right hand side of the graph to be Ω(H(s, k)) (the scalar quantity).

To overcome this difficulty, our idea is to use more than one expander code. Towards this, our main observation is that the issue with the above example is due to the following fact: for some i ∈ [t], k_i is large while s_i is small, but for some other i, k_i is small while s_i is large. Indeed, in the case of t = 2, there are two good situations where the scalar H(s, k) is O(H(s, k)):

1. k_1 and k_2 are roughly the same, i.e., k_1 = Θ(k_2). In this case we have H(s, k) = Θ((k_1 + k_2) log((s_1 + s_2)/(k_1 + k_2))) = Θ(k_1 log(s_1/k_1) + k_2 log(s_2/k_2)) = Θ(H(s, k)).

2. log(s_1/k_1) and log(s_2/k_2) are roughly the same, i.e., log(s_1/k_1) = Θ(log(s_2/k_2)). In this case we also have H(s, k) = Θ(H(s, k)).

Our protocol will exploit both of these good cases. We first illustrate this with a protocol for the case of t = 2. Our idea is to reduce the number k_1 (recall that k_1 ≥ k_2) to be roughly the same as k_2 (which is unnecessary if k_1 and k_2 are already roughly the same at the beginning). In other words, we will first reduce the Hamming distance in S_1 from k_1 to at most c k_2, if k_1 > c k_2 for some constant c > 1. It is not immediately clear why this is feasible, since Alice does not know the subset S_1. Additionally, we need to make sure the communication complexity of this step is not too large.

We achieve this by using an expander code based on a bipartite expander G : [n] × [d] → [m] such that every set R ⊆ S_1 with |R| ∈ [c k_2, 1.6 k_1] has good expansion, i.e., |Γ(R)| ≥ 0.9 d|R|. The expander is again generated by shared randomness, and we show that we can choose d = O(log(s_1/k_1)), m = O(k_1 log(s_1/k_1)) so that the graph satisfies the property with high probability. Alice will again compute the parity checks z and send them to Bob.

Now Bob will apply the same method as before: start with x̃ = y and keep finding a bit in S_1 with more unsatisfied parity checks than satisfied parity checks as neighbors. Bob flips this bit and continues doing this until no such bit can be found. Since the number of unsatisfied parity checks keeps decreasing, the process ends in a finite number of steps. We claim that when it ends, the Hamming distance in S_1 is at most c k_2. This effectively reduces the Hamming distance in S_1.

The main issue in the analysis here is that the differing bits between x and y are not entirely in S_1, and this may cause problems in belief propagation. However, our observation is that when k_1 is much larger than k_2, the effect of k_2 can mostly be ignored. More specifically, let R_1 be the set of left vertices which correspond to the bits in S_1 where x and the current x̃ differ, and R_2 be the set of left vertices which correspond to the differing bits in S_2. Thus |R_2| ≤ k_2. Let the numbers of satisfied and unsatisfied checks in Γ(R_1) be s and u. As long as |R_1| ∈ [c k_2, 1.6 k_1], we have |Γ(R_1)| = s + u ≥ 0.9 d|R_1|, and 2s + u ≤ d|R_1| + d|R_2| ≤ (1 + 1/c) d|R_1|. Combining these inequalities, we can still deduce u ≥ 0.7 d|R_1| by setting c = 10. Hence there must exist a bit in S_1 to flip.
Since the number of unsatisfied checks decreases strictly, the size |R_1| can never exceed 1.6 k_1 during the process. This is because otherwise there would be at least 0.7 · 1.6 d k_1 = 1.12 d k_1 unsatisfied checks, while at the beginning there are only at most (1 + 1/c) d k_1 = 1.1 d k_1 unsatisfied checks. Thus when this process stops, we must have |R_1| ≤ c k_2. At this point, we can use the protocol for one set together with another expander graph to finish the job, by considering the set S = S_1 ∪ S_2, which has Hamming distance at most (c + 1) k_2. The total communication complexity is O(k_1 log(s_1/k_1)) + O(k_2 log((s_1 + s_2)/k_2)) = O(H(s, k)).

The protocol for arbitrary t. We now generalize the above protocol to arbitrary t. Recall that k_1 ≥ k_2 ≥ ... ≥ k_t. Our idea is to use the above protocol of reducing Hamming distance repeatedly, while going through the indices from 1 to t. More formally, we use i′ to denote the current index and k′ to denote an upper bound on the Hamming distance in ∪_{j∈[i′]} S_j after possible steps of reducing distance. We start with i′ = 0, k′ = 0 and repeat the following: find the first index i > i′ such that the current Hamming distance in ∪_{j∈[i]} S_j is much larger than the Hamming distance in ∪_{j=i+1}^t S_j, i.e.,

k′ + Σ_{j=i′+1}^i k_j > c Σ_{j=i+1}^t k_j = k′′.  (1)

Then we reduce the Hamming distance in ∪_{j∈[i]} S_j to at most k′′ by using the two set protocol described before, regarding ∪_{j∈[i]} S_j as one set and ∪_{j=i+1}^t S_j as the other set. We now update k′ = k′′, i′ = i, and continue the process. Finally, the Hamming distance in S = ∪_{j∈[t]} S_j will be reduced to at most (c + 1) k_t, and we apply the one set protocol for S to finish the job.

The correctness follows from the correctness of the one set protocol and the two set protocol. The main thing left is to bound the communication complexity. Note that except for the first iteration, in each subsequent iteration i′ will be updated to at least i′ + 1.
Thus the number of bits Alice sends in this step is m_i = O((k′ + k_i) log(Σ_{j ∈ [i]} s_j / (k′ + k_i))). We show that this is always O(H(s, k)), using the bound on k′, the fact that k_1 ≥ k_2 ≥ ··· ≥ k_t, and the assumption that k_i ≤ s_i/2 for all i ∈ [t]. Thus the total communication complexity is O(t · H(s, k)). Note that this is a one-round protocol, since only Alice sends information.

Finally, we can get a further improvement by grouping some sets together. Specifically, we divide the interval [2, n] into disjoint subintervals I_j = [2^{2^{j−1}}, 2^{2^j}), j = 1, ..., O(log log n), and put each subset S_i into one interval according to the ratio s_i/k_i. Whenever two subsets S_i and S_j are in the same interval, we have log(s_i/k_i) = Θ(log(s_j/k_j)), and thus we can consider S_i ∪ S_j as one set with Hamming distance at most k_i + k_j, without changing the communication complexity much. Now, taking the union of all subsets in the same interval to be one subset reduces the number of subsets to χ(s, k, t), and applying our protocol results in communication complexity O(χ(s, k, t) · H(s, k)).

ECC with asymmetric information. The protocol for document exchange can be used to construct an error correcting code. We do this by first estimating the length of the redundant information. Let m be the communication complexity of the (s, k, t) DE protocol for message length n. We choose an asymptotically good code C with message length m and codeword length n_1, which corrects k errors. The actual message length of our code will be n − n_1. On input message x, we run Alice's DE protocol on x ◦ 0̄, where 0̄ = 0^{n_1}, to get z ∈ {0, 1}^m. Then we encode z by C, and the final codeword is x ◦ C(z). To decode, one first recovers z by running the decoding algorithm of C on the C(z) part. Then we run Bob's DE protocol using z, after replacing the C(z) part with 0^{n_1}. The correctness follows from the properties of the code C and the DE protocol.

We now describe our protocol for document exchange under edit distance, and show a connection to the problem of document exchange under Hamming distance with asymmetric information. On a high level, our protocol follows the leveled structure used in several previous works [22, 11, 18]. The protocol proceeds in L = O(log(n/k)) levels, where in each level Alice sends a sketch of her string x with O(k) bits to Bob. Bob then uses all the sketches and his string y to recover x.

On Alice's side, in the first level she divides her string into Θ(k) blocks where each block has size O(n/k). In each subsequent level, every block from the previous level is divided evenly into two blocks, and this ends when the block size becomes O(log(n/k)), which takes O(log(n/k)) levels. In each level, Alice applies a different random hash function to every block using the shared randomness, and computes a sketch based on the hash values. On Bob's side, his recovery also proceeds in L levels, where in each level Bob maintains a string x̃, which is Bob's current version of Alice's string x.
Specifically, in each level Bob also applies the same hash functions to the blocks of x̃ to get their hash values; then he uses this level's sketch to recover the correct hash values of Alice's blocks. Bob then finds the blocks in x̃ whose hash values are inconsistent with Alice's blocks, and updates these blocks using his string y, by computing a non-overlapping matching between y's blocks and the corresponding hash values. An important property of the protocol is that in each level, the number of differing blocks between x and x̃ is always bounded by O(k) with high probability. This ensures that Alice can send a short sketch to Bob for him to recover the correct hash values of all blocks.

To ensure that Alice's sketch in each level has length O(k), there are several non-trivial issues. First, every hash function needs to have only O(1) bits of output, as in [18]. Second, even so, the general task of recovering s hash values with O(k) errors needs a sketch of size at least log (s choose k) = Ω(k log(s/k)), where s is the number of blocks in the current level. This can be as large as Ω(k log(n/k)) when s becomes n^{Ω(1)}, and thus will be problematic. To fix this issue, [18] uses a more careful analysis called "t-witness" to show that in each level, the total number of possible error patterns is 2^{O(k)} with high probability, instead of (s choose k). Thus, in theory one can simply use another random hash function with O(k) bits of output to distinguish all error patterns, and this brings the sketch size back to O(k). However, doing this directly results in an exponential running time, since it involves exhaustive search. Thus, [18] needs to first randomly partition the blocks into bins, such that with high probability each bin has O(log n) hash errors; the exhaustive search in each bin then takes poly(n) time.

Unfortunately, this also increases the error probability from 2^{−Ω(k)} to 1/poly(n).

In our protocol, we instead replace the approach of random partitioning and exhaustive search in [18] by a direct efficient approach, thus improving the error probability to be exponentially small. We achieve this by establishing a connection to the problem of document exchange under Hamming distance with asymmetric information, as follows.

Intuitively, in Bob's process of recovering the string x, in each level Bob keeps track of the positions of the possible blocks where his version x̃ and x may differ (we call these blocks bad). More specifically, recall that we can show that in each level, with high probability there are at most O(k) bad blocks. In the next level the number of these blocks will at most double due to splitting; however, since we use random hash functions with O(1) output bits, we can show that in the next level, with high probability, Bob will detect O(k) bad blocks and update them. Some of the updated blocks may still be bad, but Bob knows the positions of all updated blocks, and he also knows that there are at most O(k) bad blocks among them after the update. Now, suppose these updates happen in level j, and Bob is now in level i > j. Then the O(k) updated blocks will have split into O(2^{i−j} k) smaller blocks. If any of these smaller blocks is bad and has remained undetected so far, then it must have gone through i − j different hash functions. If we choose all hash functions independently, then the probability that this happens is 2^{−c(i−j)} for some constant c. By choosing the number of output bits of the hash functions to be a large enough constant, we know that the expected number of smaller bad blocks that remain undetected so far is O(k/2^{i−j}).
With a little extra effort, we can show that with high probability the number of these blocks is at most k_{i−j} = max{k/log(n/k), k/2^{i−j}}, and Bob knows that these blocks are inside a subset S_{i−j} of size O(2^{i−j} k_{i−j}), which stems from the O(k) updated blocks in level j. In other words, this gives a forest with the O(k) updated blocks in level j as the roots, and the at most k_{i−j} bad blocks are among the |S_{i−j}| = O(2^{i−j} k_{i−j}) leaves.

Note that the bad blocks in level i can come from the updated blocks in all previous levels, so we get a vector S = (S_1, ··· , S_{i−1}) and a vector k = (k_1, ··· , k_{i−1}). Furthermore, in this process, whenever a bad block stemming from some level j gets detected and updated in a later level j′, this new block in level j′ becomes a new root, and all its descendants are removed from the set S_{i−j} and put into the set S_{i−j′}. This ensures that the final subsets (S_1, ··· , S_{i−1}) are disjoint. Finally, only Bob knows the sets (S_1, ··· , S_{i−1}), but both parties know (s_1 = |S_1|, ··· , s_{i−1} = |S_{i−1}|) and (k_1, ··· , k_{i−1}). Thus, we have reduced the problem of sending the sketch in level i to the problem of document exchange under Hamming distance with asymmetric information.

We now give our protocol for document exchange with asymmetric information, in the special setting described above. Recall that here s_i = O(2^i k_i) and k_i = max{k/2^{i−1}, k/log(n/k)} for i ∈ [t], with t = O(log(n/k)). One can compute H(s, k) = Θ(k) here, so our protocol for the general setting would result in sub-optimal communication complexity.
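For concreteness, the benchmark H(s, k) = Σ_i log₂ Σ_{j≤k_i} C(s_i, j), defined formally in the lower bound section, can be evaluated directly (a small helper; the function name is ours):

```python
from math import comb, log2

def H(s, k):
    """H(s, k) = sum_i log2( sum_{j=0}^{k_i} C(s_i, j) )."""
    return sum(log2(sum(comb(si, j) for j in range(ki + 1)))
               for si, ki in zip(s, k))
```

For instance, H([16], [2]) = log₂(1 + 16 + 120) ≈ 7.1, which indeed dominates the lower estimate k_1 · log₂(s_1/k_1) = 6 mentioned in Section 4.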
We give a different protocol here, which uses just one expander graph instead of a sequence of expander graphs.

The expander graph G : [n] × [d] → [m] is generated by shared randomness, with m = O(k) and the following expansion property: for every R ⊆ ∪_{i=1}^{t} S_i where |R| ∈ [k/log(n/k), O(k)] and |R ∩ S_i| ≤ 20 k_i for all i ∈ [t], we have |Γ(R)| ≥ 0.9 d|R|. Limiting the expansion to such restricted sets, rather than to all sets R with |R| ∈ [k/log(n/k), O(k)], is the key to reducing the number of right vertices from Ω(k log(n/k)) to O(k). Indeed, using a careful analysis of the probabilities, we show that a random bipartite graph with constant d and m = O(k) satisfies this property with high probability. The main intuition is that the sequence {s_i, i ∈ [t]} roughly increases exponentially, while the sequence {k_i, i ∈ [t]} roughly decreases exponentially.

Using this expander, Alice sends her parity checks to Bob, and Bob again runs a belief propagation algorithm. The purpose of this phase is to reduce the total Hamming distance between x and x̃ (Bob's current version of x, starting with x̃ = y) to at most k/log(n/k). However, the belief propagation has tricky issues here, as the standard approach may flip many more than 20 k_i bits in S_i. This can result in a subset R ⊆ [n] which does not have good expansion, thus ruining the whole process. To fix this, we prohibit the algorithm from flipping more than 20 k_i bits in S_i for each i. This is done by keeping track of the number of already flipped bits in each S_i: for any i, once this number reaches 19 k_i, the algorithm subsequently only flips bits in S_i that were previously flipped.

To show that this indeed works, at each step of the belief propagation let R ⊆ ∪_{i=1}^{t} S_i stand for the set of indices where x and x̃ have different bits, and let R′ stand for R restricted to the indices which we can still flip (due to our modification). Thus R′ always has good expansion. Our first observation is that at any time, |R′| ≥ 0.9 |R|. This is because R′ differs from R only if for some S_i, the number of bits already flipped is at least 19 k_i; however, originally there are at most k_i errors in S_i, so we have introduced at least 18 k_i new errors. This means |R′ ∩ S_i| ≥ 0.9 |R ∩ S_i| for all i, and thus |R′| ≥ 0.9 |R|. Now let (s′, u′) and (s, u) be the numbers of satisfied and unsatisfied checks in Γ(R′) and Γ(R), respectively. We know s′ + u′ ≥ 0.9 d|R′|. Also, again by the fact that each satisfied check in Γ(R′) has at least two neighbors in R, we have 2s′ + u′ ≤ d|R| ≤ (10/9) d|R′|. From these two inequalities we can still deduce that u′ ≥ 0.6 d|R′|, thus Bob can find a bit in R′ to flip.

When this process stops, the Hamming distance between x and x̃ is at most k/log(n/k). We can now use a deterministic document exchange protocol for Bob to recover x; the communication complexity is O((k/log(n/k)) · log(n/k)) = O(k). The only error probability here comes from the generation of the expander graph, which is 2^{−Ω(k/log(n/k))}. We also show that the other errors in the protocol for edit distance are 2^{−Θ(k/log(n/k))}; thus the total error of the protocol for edit distance is 2^{−Θ(k/log(n/k))}. When k < log² n, we can switch to the protocol in [18], which has error 1/poly(n).

In this paper we initiated a systematic study of document exchange and error correcting codes with asymmetric information. While we provided both lower bounds and upper bounds, as well as efficient randomized constructions that are close to optimal, there are still many interesting problems left. We list some below.
Question 1:
The most obvious open problem is to achieve the optimal communication complexity (i.e., H(s, k)) with a one-round randomized protocol. Two related questions are to reduce the error probability of the randomized protocol, and to study the case where the condition s_i ≥ k_i for all i does not hold. For example, is there a better deterministic protocol for the latter case?

Question 2: A better understanding of the problem in the case of two-sided asymmetric information. The results in this paper only study the case of two-sided asymmetric information where s_A + s_B ≤ n, i.e., the subsets from both parties can be disjoint in the worst case. What happens when s_A + s_B > n? In this case the subsets from both parties are guaranteed to overlap, and the situation becomes more complicated.

Question 3: Two-round deterministic protocols. We showed that for any one-round deterministic protocol, the asymmetric information is not useful. However, by a result of Orlitsky [28], there exists a two-round exponential-time deterministic protocol with communication complexity O(H(s, k) + log n). The idea is that Bob sends a description of an appropriate hash function to Alice in the first round, and Alice sends the hash value of her string x in the second round. The exponential running time comes from both the selection of the hash function and the recovery of x from the hash value. It is an interesting open problem to see whether we can design efficient protocols matching this bound. Our result suggests a way to approximate it: Bob sends a description of a sequence of appropriate expanders in the first round, and Alice sends the parity checks of her string x in the second round. Using our algorithm, the recovery of x in the second round is already efficient (in fact, nearly linear time); however, the first step of selecting the expanders still requires exponential time.

Question 4: Optimal deterministic document exchange under edit distance. Our results also bring some hope of obtaining an optimal deterministic document exchange protocol under edit distance. In particular, we have replaced the decoding-by-exhaustive-search approach in [18] with an efficient decoding algorithm. However, how to appropriately pick a hash function deterministically remains open. We also note that reducing the error probability is a first step towards a deterministic protocol: if the error probability is small enough, then by a simple union bound there exists a non-uniform deterministic protocol that runs in polynomial time.
Paper Organization
The rest of the paper is organized as follows. In Section 3 we introduce some basic technical tools.In Section 4 we show lower bounds for asymmetric DE in the general setting. In Section 5 wegive our protocol for asymmetric DE in the general setting. In Section 6 we give our protocol forasymmetric DE in a special setting. In Section 7 we give our protocol for DE under edit distanceby using the protocol in the previous section. In Section 8 we generalize our results and give lowerbounds and protocols for asymmetric DE with two sided information.
We will use the following well known parity check computation based on bipartite expander graphs.
Construction 3.1 (Expander Code Encoding [32]) . Let
Γ : [n] × [d] → [m] be a bipartite graph with n left vertices, m right vertices, and left degree d. The encoding of the Γ-expander code, on input message x ∈ {0,1}^n, is computed as x ◦ z, where z ∈ {0,1}^m and z[i] = ⊕_{j ∈ Γ^{−1}(i)} x[j], i ∈ [m].

Definition 3.2 ([17]). A bipartite graph with n left vertices, m right vertices and left degree d is a (k, a) expander if for every set of left vertices S ⊆ [n] of size k, we have |Γ(S)| > ak. It is a (≤ k_max, a) expander if it is a (k, a) expander for every k ≤ k_max.

Here, for every x ∈ [n], Γ(x) denotes the set of all neighbours of x; Γ extends to a set function accordingly. Also, for x ∈ [n] and y ∈ [d], the function Γ : [n] × [d] → [m] is such that Γ(x, y) is the y-th neighbour of x.

Theorem 3.3 ([17]). For every constant α > 0, every n ∈ ℕ, k_max ≤ n, and ǫ > 0, there exists an explicit (≤ k_max, (1 − ǫ)d) expander with n left vertices, m right vertices and left degree d = O((log n)(log k_max)/ǫ)^{1+1/α}, where m ≤ d² · k_max^{1+α}. Here d is a power of 2.

The explicitness here means that, given a left node and an edge label, the corresponding right node can be computed in time O(log n + log d).

Theorem 3.4 (Classic belief propagation for decoding [32]). Let Γ : [n] × [d] → [m] be a (≤ k, (3/4)d) bipartite expander with left degree d_l = d and right degree d_r. Let y be an n-bit string whose distance from a codeword x is at most k/2. Then repeated application of the following decoding algorithm to y returns x in time O(d_l d_r m).

Decoding algorithm: upon receiving the input n-bit string y, as long as there exists a variable such that most of its neighbouring constraints are not satisfied, flip it.

Theorem 3.5 ([21] [14] [31] Systematic Algebraic Geometry Code). There exists an explicit construction of an algebraic geometry linear (n, m, d)_q-code with d + m ≥ n − n/(√q − 1), where q = poly(n/d), with polynomial-time decoding when the number of errors is less than half of the distance. Here n and q should be at least some fixed constants.

Moreover, for every message x ∈ F_q^m, the codeword is x ◦ z for some redundancy z ∈ F_q^{n−m}. In other words, the code is systematic.

A distribution X over Σ^n is κ-wise independent if for any κ variables of X, their marginal distribution is uniform.

Theorem 3.6. There exists an explicit construction of a κ-wise independence generator g : {0,1}^s → {0,1}^n, where s = O(κ log(n/κ)).

Proof. Let C^⊥ be an algebraic geometry linear (n, m, d)_q-code constructed by Theorem 3.5, with d = κ + 1, m ≥ n − O(κ), and q = poly(n/d) = poly(n/κ).

Consider the dual code C = (C^⊥)^⊥. By duality of codes, its message length is n − m = O(κ). Let the generator be g(·) = C(·), i.e., the encoding function of C. Note that the seed length in bits is s = (n − m) log q = O(κ log(n/κ)).

We claim that any κ columns of the generating matrix M ∈ F_q^{(n−m)×n} of C are linearly independent: otherwise there would be a nonzero codeword in C^⊥ with Hamming weight at most κ = d − 1, contradicting the distance of C^⊥.

Next we show that g(u) = uM is κ-wise independent when u is uniform. For any κ symbols of the output, the corresponding κ columns of M are linearly independent, so the matrix M_K, where K is the set of indices of these κ columns, has rank κ. Thus M_K has κ linearly independent rows, and the linear map defined by these rows is a bijection on the space of κ symbols. So (uM)_K is uniform.

To see that this is an explicit construction, note that the encoding of C^⊥ is explicit. So the encoding of each unit vector e_i ∈ F_q^m, i ∈ [m], is explicit, and the encoding matrix M^⊥, whose i-th row is C^⊥(e_i), can be computed explicitly. The corresponding parity check matrix, which is exactly M, the encoding matrix of the dual code C, can then be computed from M^⊥ by standard procedures. So the construction is explicit.

Random variables X_1, X_2, ..., X_n ∈ {0,1} are ε-almost κ-wise independent in max norm if for all i_1, i_2, ..., i_κ ∈ [n] and all x ∈ {0,1}^κ,

|Pr[X_{i_1} ◦ X_{i_2} ◦ ··· ◦ X_{i_κ} = x] − 2^{−κ}| ≤ ε.

A function g : {0,1}^d → {0,1}^n is an ε-almost κ-wise independence generator in max norm if g(U_d) = X = X_1 ◦ ··· ◦ X_n is ε-almost κ-wise independent in max norm.
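The column-independence argument in the proof above can be checked by brute force on a toy instance: take the [7, 3] simplex code over F_2 (the dual of the [7, 4] Hamming code), whose generator matrix has all 7 nonzero vectors of F_2^3 as columns. Any 2 distinct nonzero columns are linearly independent over F_2, so g(u) = uM is 2-wise independent (a sketch with our own helper names, not part of the paper's construction):

```python
from itertools import product

# Generator matrix of the [7,3] simplex code: columns are all nonzero
# vectors of F_2^3, so every 2 columns are linearly independent.
M = [[1, 0, 0, 1, 1, 0, 1],
     [0, 1, 0, 1, 0, 1, 1],
     [0, 0, 1, 0, 1, 1, 1]]

def g(u):
    # g(u) = uM over GF(2)
    return tuple(sum(ui * mij for ui, mij in zip(u, col)) % 2
                 for col in zip(*M))

def is_pairwise_uniform():
    # Check that every pair of output bits is uniform over the 8 seeds.
    outs = [g(u) for u in product((0, 1), repeat=3)]
    n = len(M[0])
    for i in range(n):
        for j in range(i + 1, n):
            for pat in product((0, 1), repeat=2):
                cnt = sum(1 for o in outs if (o[i], o[j]) == pat)
                if cnt * 4 != len(outs):
                    return False
    return True
```

Here 7 pairwise-independent bits are produced from a 3-bit seed, mirroring (at κ = 2) the seed length s = O(κ log(n/κ)) of the theorem.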
Unless stated otherwise, we only consider the max norm in what follows.

Theorem 3.7 (ε-almost κ-wise independence generator [4]). There exists an explicit construction such that for every n, κ ∈ ℕ and ε > 0, it computes an ε-almost κ-wise independence generator g : {0,1}^d → {0,1}^n, where d = O(log κ · log(n/ε)).

The construction is highly explicit in the sense that, for every i ∈ [n], the i-th output bit can be computed in time Õ(log n + log(1/ε)) given the seed and i. (The Õ here hides some log log n and log log(1/ε) factors.)

Theorem 3.8 (General moment inequality for k-wise independence). Let X_i ∈ {0,1}, i = 1, ..., n, be a sequence of k-wise independent random variables, and let X = Σ_{i=1}^{n} X_i. For every ε > 0,

Pr[X ≥ (1 + ε) E[X]] ≤ (1/(1 + ε))^k.

3.3 LCS and Matching

Consider two strings x ∈ {0,1}^{pn}, y ∈ {0,1}^{n′}, and hash functions h_j : {0,1}^p → {0,1}^q, j ∈ [n]. A monotone matching w = ((ρ_1, ρ′_1), ..., (ρ_{|w|}, ρ′_{|w|})) between x and y under h_j, j ∈ [n], is such that for every i ∈ [|w|], h_{ρ_i}(x[ρ_i, ρ_i + p)) = h_{ρ_i}(y[ρ′_i, ρ′_i + p)), where ρ_i ∈ [pn], ρ′_i ∈ [n′]. Here we consider x as being cut into length-p blocks, and each ρ_i has to be the starting position of a block in x.

Lemma 3.9.
For any x ∈ {0,1}^{pn}, y ∈ {0,1}^{n′}, k ∈ ℕ, S ⊆ [n] with |S| = s, and hash functions h_j : {0,1}^p → {0,1}^q, j ∈ [n], the number of matchings w = ((ρ_1, ρ′_1), ..., (ρ_{|w|}, ρ′_{|w|})) between x_S and y under h_j, j ∈ [n], such that

|ρ′_1 − ρ_1| + |(ρ′_2 − ρ′_1) − (ρ_2 − ρ_1)| + ··· + |(ρ′_{|w|} − ρ′_{|w|−1}) − (ρ_{|w|} − ρ_{|w|−1})| ≤ k,

is at most 2^{2s + k(log((k+s)/k) + log e)}.

Here x_S refers to the sequence of blocks of x with indices in S; the j-th block of it is x_S[j] ∈ {0,1}^p, j ∈ [s]. We use pos(j) to refer to the starting position of block x_S[j] in x.

Proof. Let us first consider the number of matchings of length s̃ ∈ {0, 1, 2, ..., s}. The number of possible sequences ρ_1, ..., ρ_{s̃} is at most C(s, s̃).

Assume |(ρ′_j − ρ′_{j−1}) − (ρ_j − ρ_{j−1})| = k_j for j = 1, ..., s̃, where ρ_0 = ρ′_0 = 0. For a fixed sequence ρ_1, ..., ρ_{s̃}, the total number of possible matchings w such that

|ρ′_1 − ρ_1| + |(ρ′_2 − ρ′_1) − (ρ_2 − ρ_1)| + ··· + |(ρ′_{s̃} − ρ′_{s̃−1}) − (ρ_{s̃} − ρ_{s̃−1})| = Σ_{j=1}^{s̃} k_j ≤ k

is at most

2^{s̃} · C(k + s̃ − 1, s̃ − 1) = 2^{s̃} · C(k + s̃ − 1, k) ≤ 2^{s̃ + k(log((k+s̃)/k) + log e)} ≤ 2^{s + k(log((k+s)/k) + log e)},

since each sequence ρ′_j, j ∈ [s̃], corresponds one-to-one to a sequence k_j ∈ ℕ, j ∈ [s̃], together with the signs of (ρ′_j − ρ′_{j−1}) − (ρ_j − ρ_{j−1}), j = 1, ..., s̃.

So the overall number of possibilities is at most

Σ_{s̃=0}^{s} C(s, s̃) · 2^{s + k(log((k+s)/k) + log e)} ≤ 2^{2s + k(log((k+s)/k) + log e)}.

Lemma 3.10 (DP for LCS within k edit operations). There is an algorithm that, on input x ∈ {0,1}^{pn}, y ∈ {0,1}^{n′} with n′ = O(np), S ⊆ [n], k = ED(x, y), and hash functions h_i : {0,1}^p → {0,1}^q, i ∈ [n], outputs a monotone matching w = ((u_1, u′_1), ..., (u_{|w|}, u′_{|w|})) between x_S and y under h_i, i ∈ [n], such that |w| ≥ |S| − k and

|u′_1 − u_1| + |(u′_2 − u′_1) − (u_2 − u_1)| + ··· + |(u′_{|w|} − u′_{|w|−1}) − (u_{|w|} − u_{|w|−1})| ≤ k.

Proof. We present a dynamic program to compute a maximum matching.

For every j ∈ [|S|], j′ ∈ [n′] and l ≤ k, let f(j, j′, l) be a maximum matching w = ((u_1, u′_1), ..., (u_{|w|}, u′_{|w|})) between x_S[1, j] and y[1, j′] under h_i, i ∈ [n], such that

• g(w) = |u′_1 − u_1| + |(u′_2 − u′_1) − (u_2 − u_1)| + ··· + |(u′_{|w|} − u′_{|w|−1}) − (u_{|w|} − u_{|w|−1})| ≤ l;

• the last pair matches x_S[j] to y[j′, j′ + p).

If there is no such matching, then f(j, j′, l) is null, where we set g(null) = ∞; for the empty matching we set g(∅) = −∞.

We compute f(j, j′, l) as follows. To initialize, we let f(0, 0, 0) = ∅. For every j ∈ [|S|], j′ ∈ [n′], l ≤ k:

1. If h_j(x_S[j]) ≠ h_j(y[j′, j′ + p)), then f(j, j′, l) is null;

2. Otherwise, pick a maximum matching w in M = { f(j_0, j′_0, l_0) : j_0 < j, j′_0 < j′, l_0 ≤ l, g(f(j_0, j′_0, l_0)) + |pos(j) − pos(j_0) − (j′ − j′_0)| ≤ l };

3. Let f(j, j′, l) = w ∪ {(pos(j), j′)}.

Finally, we use an exhaustive search to find a maximum matching among f(j, j′, k), j ∈ [|S|], j′ ∈ [n′], and output it.

Next we prove correctness. We first claim that there exists a matching w* of length at least |S| − k between x_S and y with g(w*) ≤ k. This is because we can match each i ∈ S to exactly the same entry after the k edit operations to get w*; here g(w*) ≤ k holds, since otherwise the edit distance between x and y would be larger than k.

Assume the i-th pair in w* matches x_S[j_i] to y[j′_i, j′_i + p), and let w*_i be the first i pairs of w*. We use induction to show that |f(j_{|w*|}, j′_{|w*|}, g(w*))| ≥ |w*|.

For the base case, note that |f(j_1, j′_1, g(w*_1))| ≥ |f(0, 0, 0) ∪ {(pos(j_1), j′_1)}| = 1.

Suppose |f(j_i, j′_i, g(w*_i))| ≥ i. For i + 1, by our rule for computing f(j_{i+1}, j′_{i+1}, g(w*_{i+1})), we know

g(f(j_i, j′_i, g(w*_i))) + |pos(j_{i+1}) − pos(j_i) − (j′_{i+1} − j′_i)| ≤ g(w*_i) + |pos(j_{i+1}) − pos(j_i) − (j′_{i+1} − j′_i)| = g(w*_{i+1}),

so f(j_i, j′_i, g(w*_i)) is in M. Since in the second stage of computing f(j_{i+1}, j′_{i+1}, g(w*_{i+1})) we pick a maximum matching in M and add one more pair to it, we get

|f(j_{i+1}, j′_{i+1}, g(w*_{i+1}))| ≥ |f(j_i, j′_i, g(w*_i))| + 1 ≥ i + 1.

This completes the induction. As a result, the output matching has length at least |w*| ≥ |S| − k.

In this section, we show lower bounds for asymmetric document exchange and error correcting codes.
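In stripped-down form, the dynamic program of Lemma 3.10 is an LCS computation over hash values. The sketch below drops the shift budget l and pretends y is cut into aligned blocks, both simplifications of the lemma, so it is an illustration rather than the actual algorithm:

```python
def max_matching(hx, hy):
    """Longest monotone matching between the hashed blocks of x_S (hx)
    and the hashed blocks of y (hy): a standard LCS-style DP."""
    n, m = len(hx), len(hy)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for jp in range(1, m + 1):
            # skip a block on either side, or match when hashes agree
            f[j][jp] = max(f[j - 1][jp], f[j][jp - 1])
            if hx[j - 1] == hy[jp - 1]:
                f[j][jp] = max(f[j][jp], f[j - 1][jp - 1] + 1)
    return f[n][m]
```

The real DP additionally carries the third coordinate l, which charges each matched pair its positional shift so that the total shift stays within the edit-distance budget k.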
Given the vectors s = (s_1, ··· , s_t) and k = (k_1, ··· , k_t), we define

H(s, k) = log ∏_{i=1}^{t} ( Σ_{j=0}^{k_i} C(s_i, j) ) = Σ_{i=1}^{t} log Σ_{j=0}^{k_i} C(s_i, j).

Similarly, for two integers s and k with s ≥ k, we define

H(s, k) = log Σ_{j=0}^{k} C(s, j).

Note that in particular we have H(s, k) ≥ Σ_{i=1}^{t} k_i log(s_i/k_i) and H(s, k) ≥ k log(s/k).

We now have the following theorems.

Theorem 4.1. In an (s, k, t) asymmetric DE problem where Bob has the vector of subsets S = (S_1, ··· , S_t), let k = Σ_{i=1}^{t} k_i and suppose Alice learns Bob's string. Then any deterministic protocol has communication complexity at least H(n, k), and any randomized protocol with success probability ≥ 2/3 has communication complexity at least H(n, k) − 1. This holds even if Alice knows s and k.

Proof. Assume for the sake of contradiction that there is a deterministic protocol with communication complexity less than H(n, k). Fix Alice's string x; the number of strings y within Hamming distance k of x is exactly 2^{H(n,k)}. For each of these strings, one can define a vector of subsets S = (S_1, ··· , S_t) consistent with s = (s_1, ··· , s_t) such that within each subset S_i the Hamming distance is exactly k_i. Since the transcript of the protocol is a deterministic function of (x, y, S, s, k, t), at least two different y's from Bob's side will produce the same transcript. Now since Alice's final output is a deterministic function of x and the transcript, Alice will not be able to distinguish these two different y's, contradicting the assumption that the protocol always succeeds.

Similarly, assume for the sake of contradiction that there is a randomized protocol with communication complexity less than H(n, k) − 1 that succeeds with probability ≥ 2/3. Fix Alice's string x and consider the 2^{H(n,k)} different strings y as above. By an averaging argument there is a fixing of the random bits used such that the protocol succeeds for at least 2^{H(n,k)−1} of the y's. Since the protocol is now fixed, the same argument gives a contradiction.

We now consider the case where Bob tries to learn Alice's string, and we have the following theorem.

Theorem 4.2. In an (s, k, t) asymmetric DE problem where Bob has the vector of subsets S = (S_1, ··· , S_t), let k = Σ_{i=1}^{t} k_i and suppose Bob learns Alice's string. Then any randomized protocol with success probability ≥ 2/3 has communication complexity at least H(s, k) − 1. Furthermore, if s_i = |S_i| ≥ k_i for all i, then any one-round deterministic protocol has communication complexity at least H(n, k). This holds even if Alice knows s and k.

Proof. Assume for the sake of contradiction that there is a randomized protocol with communication complexity less than H(s, k) − 1 that succeeds with probability ≥ 2/3. Fix Bob's string y; the number of strings x within Hamming distance k_i of y in each subset S_i is exactly 2^{H(s,k)}. By an averaging argument there is a fixing of the random bits used such that the protocol succeeds for at least 2^{H(s,k)−1} of the x's. Thus, again at least two different x's will produce the same transcript, and Bob will not be able to distinguish them. This gives a contradiction.

Similarly, assume for the sake of contradiction that there is a deterministic protocol with communication complexity less than H(n, k). This means two different x's will produce the same transcript in a one-round protocol, where the transcript is a deterministic function of (x, s, k, t). For these two different x's, as long as s_i = |S_i| ≥ k_i for all i, one can define a vector of subsets S = (S_1, ··· , S_t) such that for each x, the Hamming distance between the corresponding substrings of x and y in S_i is exactly k_i. Thus the inputs to Bob are the same for the two x's. Since Bob's final output is a deterministic function of his inputs and the transcript, Bob will not be able to distinguish the two different x's, a contradiction.

We also have the following theorem for asymmetric error correcting codes.

Theorem 4.3. In an (s, k, t) asymmetric ECC problem where Bob has the vector of subsets S = (S_1, ··· , S_t), let k = Σ_{i=1}^{t} k_i. If s_i = |S_i| ≥ k_i for all i, then any deterministic code must have distance at least 2k + 1; in particular, m ≤ n − H(n, k). Furthermore, any randomized code with success probability ≥ 2/3 must have message length m ≤ n − H(s, k) + 1.

Proof. Assume for the sake of contradiction that there is a deterministic code with distance at most 2k. This means there are two different codewords Enc(x_1) and Enc(x_2) with Hamming distance at most 2k. Thus, an adversary can come up with two error strings z_1, z_2, where each z_j has at most k 1's, such that Enc(x_1) ⊕ z_1 = Enc(x_2) ⊕ z_2 = y. As long as s_i = |S_i| ≥ k_i for all i, one can define a vector of subsets S = (S_1, ··· , S_t) such that for each z_j, the number of 1's in the subset S_i is at most k_i. Thus for x_1 and x_2, Bob receives the same string y and his other inputs are also the same. This means that Bob will not be able to distinguish x_1 and x_2, a contradiction.

Now assume for the sake of contradiction that there is a randomized code with success probability ≥ 2/3 and message length m > n − H(s, k) + 1. By an averaging argument there exists a fixing of the random bits used in encoding and decoding that succeeds for 2^{m−1} > 2^{n−H(s,k)} messages. Note that for any codeword, the number of strings which have Hamming distance at most k_i in each subset S_i from the codeword is 2^{H(s,k)}. This implies that there exist two different codewords Enc(x_1) and Enc(x_2) and a string y such that for each Enc(x_j), y has Hamming distance at most k_i in each subset S_i from Enc(x_j). An adversary can thus change both Enc(x_1) and Enc(x_2) into the same string y, and both error patterns are consistent with (S, s, k). Thus Bob will not be able to distinguish x_1 and x_2, a contradiction.

We give a randomized protocol for the general setting such that the communication complexity is close to optimal.
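Before the formal statement, the kind of expansion property we need from a shared-randomness bipartite graph can be sanity-checked by brute force on toy parameters (the helper names and the with-repetition neighbour sampling are our own illustrative choices):

```python
import random
from itertools import combinations

def random_bipartite(n, m, d, rng):
    """Each left vertex picks d random right neighbours (with repetition),
    mirroring the shared-randomness generation used in the protocol."""
    return [[rng.randrange(m) for _ in range(d)] for _ in range(n)]

def expands(graph, S, lo, hi, delta):
    """Brute-force check that |Gamma(R)| > (1 - delta) * d * |R| for every
    R ⊆ S with |R| in [lo, hi].  Only feasible for tiny S."""
    d = len(graph[0])
    for r in range(lo, hi + 1):
        for R in combinations(S, r):
            nbrs = {v for u in R for v in graph[u]}
            if len(nbrs) <= (1 - delta) * d * r:
                return False
    return True
```

With, say, n = 30 left vertices, m = 200 right vertices and degree d = 8, a random graph passes `expands(g, range(6), 1, 3, 0.5)` with overwhelming probability, which is the qualitative behaviour the next lemma quantifies.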
Lemma 5.1.
For every S ⊆ [ n ] , integer k ≤ k ≤ s = | S | , the probability that a random bipartitegraph with n left vertices, m ≥ dk /δ right vertices, left degree d = O (log sk ) , havingfor every R ⊆ S, with | R | ∈ [ k , k ] | Γ( R ) | > (1 − δ ) d | R | , (2) is at least − ε , where ε = 2 − Θ( δ (log sk ) k log kk ) .Note that when k = 1 , we get an ( n, m, d, S, ≤ k, − δ ) expander with probability at least − − Θ( δ log sk log(2 k )) ≤ − / poly ( s ) . We also denote a bipartite graph with the expansion property stated as an ( n, m, d, S, [ k , k ] , − δ ) expander. Proof.
The total number of sets $R$ of size $r$ is at most $(\frac{es}{r})^r$. For a fixed set $R$ and a fixed set $T \subseteq [m]$ with $|T| = (1-\delta)d|R|$,
$$\Pr[\Gamma(R) \subseteq T] = \left(\frac{|T|}{m}\right)^{dr} = \left(\frac{(1-\delta)dr}{m}\right)^{dr}. \quad (3)$$
There are at most
$$\binom{m}{|T|} \le \left(\frac{em}{|T|}\right)^{|T|} = \left(\frac{em}{(1-\delta)dr}\right)^{(1-\delta)dr} \quad (4)$$
such sets $T$. So by a union bound, the probability that some $R$ with $|R| = r$ has $|\Gamma(R)| \le (1-\delta)dr$ is at most
$$\left(\frac{em}{(1-\delta)dr}\right)^{(1-\delta)dr} \times \left(\frac{(1-\delta)dr}{m}\right)^{dr} \times \left(\frac{es}{r}\right)^r = e^{(1-\delta)dr} \left(\frac{(1-\delta)dr}{m}\right)^{\delta dr} \left(\frac{es}{r}\right)^r \le e^{dr} e^{-\delta dr \log \frac{m}{dr}} \left(\frac{es}{r}\right)^r \le 2^{-\Theta(\delta dr \log \frac{k_1}{r})}, \quad (5)$$
by letting $m = 2dk_1/\delta$, $d = O(\log \frac{s}{k_1})$.

By another union bound, the probability that some $R$ with $|R| \in [k_0, k_1]$ fails to have the desired expansion is at most $\sum_{j=k_0}^{k_1} 2^{-\Theta(\delta d j \log \frac{k_1}{j})} \le (k_1 - k_0 + 1) \cdot 2^{-\Theta(\delta d k_0 \log \frac{k_1}{k_0})} \le 2^{-\Theta(\delta d k_0 \log \frac{k_1}{k_0})}$. When $k_0 = 1$, this is at most $2^{-\Theta(\delta \log \frac{s}{k_1} \log(2k_1))} \le 1/\mathrm{poly}(s)$.

Lemma 5.2.
Assume $\Gamma$ is an $(n, m, d, S_1, [k', k], 0.9)$ expander. Let $z$ be the expander-code encoding of $x$ using $\Gamma$. Then there is an explicit decoding which, on input $x'$ that has $k_i$ errors in $S_i$ from $x$ for each $i \in [t]$, with $k \ge k' \ge c \sum_{i=2}^{t} k_i$, $c = 10$, outputs $\tilde{x}$ that has at most $k'$ errors in $S_1$.

Proof. We propose the following algorithm. In every iteration, find the first bit in $S_1$ that has more unsatisfied checks than satisfied ones and flip it. Loop until no such bit can be found.

Now we show this works. Assume there are at least $k'$ errors in $S_1$. Denote by $A$ the set of indices of all current errors and let $A_1 = A \cap S_1$. Let $s$ be the number of satisfied neighbors of $A_1$ and $u$ the number of unsatisfied neighbors of $A_1$. By the expander property, $|\Gamma(A_1)| \ge 0.9 d |A_1|$. So
$$s + u \ge 0.9 d |A_1|. \quad (6)$$
On the other hand, each satisfied check in $\Gamma(A_1)$ is connected to at least one vertex in $A_1$, and it has to be connected to at least 2 vertices in $A$ to be satisfied. Also each unsatisfied check is connected to at least 1 vertex in $A$. Hence
$$2s + u \le d|A| \le d|A_1| + d \sum_{i=2}^{t} k_i \le \left(1 + \frac{1}{c}\right) d |A_1|. \quad (7)$$
By Equation (6) and Equation (7), $u \ge 0.7 d |A_1|$. So by averaging there has to be a bit in $S_1$ having more unsatisfied checks than satisfied ones. As a result, the algorithm can find a bit to flip, and each flip strictly decreases $u$. On the other hand, if at some iteration $|A_1| = 2k$, then $u \ge 1.4 dk$, while initially $u \le dk$, contradicting that $u$ is decreasing. As a result, the iterations continue until there are fewer than $k'$ errors in $S_1$.

Theorem 5.3.
There is an efficient one-round protocol s.t. for every $(s, k)$ DE problem, it has communication complexity $O(k \log \frac{s}{k})$ and success probability $1 - 2^{-\Theta(\log \frac{s}{k} \log k)}$.

Proof. We first generate a random bipartite graph with $n$ left vertices, left degree $d = O(\log \frac{s}{k})$, and $m = O(dk)$ right vertices. By Lemma 5.1, with probability $1 - 2^{-\Theta(\log \frac{s}{k} \log k)}$, it is an $(n, m, d, S, \le k, 0.9)$ expander $\Gamma$. We use $\Gamma$ to compute the sketch $z$ of $x$. To decode, we use $y$, $z$ and $\Gamma$. By Lemma 5.2, we can get $x$ correctly. The running time of both parties is $\tilde{O}(n)$.

Without loss of generality, we assume $k_1 \ge k_2 \ge \cdots \ge k_t$.

Theorem 5.4.
There is a one-way efficient protocol s.t. for every $(\mathbf{s}, \mathbf{k}, t)$ DE with $k_i \le s_i/2, \forall i \in [t]$, it has success probability $1 - 2^{-\Omega(k_t)} - 1/\mathrm{poly}(s)$ and communication complexity $O(t \sum_{i \in [t]} k_i \log \frac{s_i}{k_i})$.

Construction 5.5.
Efficient protocol for $(\mathbf{s}, \mathbf{k}, t)$ DE.

Alice: on input $x$,
1. Let $i' = 0$, $k' = 0$, and let $z$ be the empty string;
1.1. While $i' \le t-1$, find the smallest $i > i'$ s.t. $k' + \sum_{j=i'+1}^{i} k_j > k''$, where $k'' = c \sum_{j=i+1}^{t} k_j$; if no such $i$ can be found, break out of the iterations;
1.2. Generate an $(n, m, d, \cup_{j=1}^{i} S_j, [k'', k' + \sum_{j=i'+1}^{i} k_j], 0.9)$-expander $\Gamma$ by Lemma 5.1, where $d = O\big(\log \frac{\sum_{j=1}^{i} s_j}{k' + \sum_{j=i'+1}^{i} k_j}\big)$;
1.3. Compute $z_i$, the expander-code sketch of $x$ using $\Gamma$, and let $z = z \circ z_i$;
1.4. Let $i' = i$, $k' = k''$.
2. Encode $x$ as $z_{final}$ by using an $(n, m, d = O(\log \frac{s}{k_{final}}), S, \le k_{final}, 0.9)$ expander $\Gamma_{final}$ generated by Lemma 5.1, where $k_{final} = k' + \sum_{j=i'+1}^{t} k_j$;
3. Send $z \circ z_{final}$ to Bob.

Bob: on input $y$, $\mathbf{S}$, $\mathbf{k}$, together with the message $z \circ z_{final}$ from Alice;
1. Let $i' = 0$, $k' = 0$, $y' = y$;
1.1. While $i' \ne t$, find the smallest $i > i'$ s.t. $k' + \sum_{j=i'+1}^{i} k_j > k''$, where $k'' = c \sum_{j=i+1}^{t} k_j$, $c = 10$;
1.2. Generate an $(n, m, d, \cup_{j=1}^{i} S_j, [k'', k' + \sum_{j=i'+1}^{i} k_j], 0.9)$-expander $\Gamma$ by Lemma 5.1, using the same randomness as Alice;
1.3. Use $\Gamma$ and $z_i$ to reduce the number of errors of $y'$ in $\cup_{j=1}^{i} S_j$ to at most $k''$ by Lemma 5.2;
1.4. Let $i' = i$, $k' = k''$.
2. Decode $x$ by Lemma 5.2 for the $(S, k' + k_t)$ setting, using $y'$, $z_{final}$, and the expander generated the same way as Alice's $\Gamma_{final}$;

Lemma 5.6.
The communication complexity is $O\big(t \sum_{j\in[t]} k_j \log \frac{s_j}{k_j}\big)$.

Proof. By Lemma 5.1, the length $m$ of the sketch computed from $\Gamma$ is
$$O\left(\Big(k' + \sum_{j=i'+1}^{i} k_j\Big) \log \frac{\sum_{j=1}^{i} s_j}{k' + \sum_{j=i'+1}^{i} k_j}\right).$$
Note that in the first iteration the algorithm may pick any $i \in [t]$. But in the succeeding iterations it always takes $i = i'+1$, since $k' + k_{i'+1} > k''$ and we always assume $k_{i'+1} > 0$.

For the first iteration,
$$
\begin{aligned}
m &= O\left(\Big(\sum_{j=1}^{i} k_j\Big) \log \frac{\sum_{j=1}^{i} s_j}{\sum_{j=1}^{i} k_j}\right) \\
&\le O\left(\Big(c \sum_{j=i}^{t} k_j + k_i\Big) \log \frac{\sum_{j=1}^{i} s_j}{\sum_{j=1}^{i} k_j}\right) && \text{since } \sum_{j=1}^{i-1} k_j \le c \sum_{j=i}^{t} k_j \\
&\le O\left(\Big(c \sum_{j=i}^{t} k_j + k_i\Big) \log \frac{\sum_{j=1}^{i} s_j}{c \sum_{j=i+1}^{t} k_j + k_i}\right) && \text{since } \sum_{j=1}^{i} k_j > c \sum_{j=i+1}^{t} k_j \\
&\le O\left(\bar{k} \log \frac{\sum_{j=1}^{i} s_j}{\bar{k}}\right) && \text{letting } \bar{k} = c \sum_{j=i}^{t} k_j + k_i \\
&\le O\left(\bar{k} \log \prod_{j=1}^{i} \Big(\frac{s_j}{\bar{k}} + 1\Big)\right) && \log(\cdot) \text{ is an increasing function} \\
&= O\left(\sum_{j=1}^{i} \bar{k} \log\Big(\frac{s_j}{\bar{k}} + 1\Big)\right).
\end{aligned}
$$
For each $j \in [i]$: if $\bar{k} > k_j$, then since $\bar{k} \le (c+1)tk_j$, we get $\bar{k}\log(\frac{s_j}{\bar{k}}+1) \le (c+1)tk_j \log(\frac{s_j}{k_j}+1) \le 2(c+1)tk_j \log\frac{s_j}{k_j}$; otherwise $\bar{k} \le k_j$, and since $2k_j \le s_j$, $\bar{k}\log(\frac{s_j}{\bar{k}}+1) \le \bar{k}\log\frac{2s_j}{\bar{k}} \le O(k_j \log\frac{s_j}{k_j})$. Hence $m = O(t \sum_{j=1}^{i} k_j \log\frac{s_j}{k_j})$.

Next consider the iterations from the second to the last. Here
$$m = O\left((k' + k_i)\log\frac{\sum_{j=1}^{i} s_j}{k'+k_i}\right) = O\left(\bar{k}\log\frac{\sum_{j=1}^{i} s_j}{\bar{k}}\right), \quad \text{with } \bar{k} = k' + k_i = c\sum_{j=i}^{t} k_j + k_i,$$
and the same case analysis for each $j \in [i]$ as above gives $m = O(t\sum_{j=1}^{i} k_j \log\frac{s_j}{k_j})$.

As there are at most $t$ iterations, the total communication complexity is $O\big(t \sum_{j=1}^{t} k_j \log\frac{s_j}{k_j}\big)$.

Next we show the correctness.

Lemma 5.7.
Bob can compute $x$ correctly with probability at least $1 - 2^{-\Omega(k_t)} - 1/\mathrm{poly}(s)$.

Proof. In each iteration, since $\Gamma$ is an $(n, m, d, \cup_{j=1}^{i} S_j, [c\sum_{j=i+1}^{t} k_j, k' + \sum_{j=i'+1}^{i} k_j], 0.9)$ expander, by Lemma 5.2 we can successfully reduce the number of errors in $\cup_{j=1}^{i} S_j$ to at most $k''$. Note that as long as $k_{i'+1} > 0$, the number $i$ found in the iteration will be $i'+1$, so the iterations continue until $i' = t-1$. After the iterations, the number of errors in $S$ is at most $k' + k_t = (c+1)k_t$. Finally, using $z_{final}$ and $\Gamma_{final}$, by Lemma 5.2, Bob can compute $x$ correctly.

The protocol succeeds once all random expander graphs are as desired. For the random expander graph in iteration $i$, the success probability is $1 - 2^{-\Omega(dk''\log\frac{k'}{k''})} \ge 1 - 2^{-\Omega(dk'')}$, by Lemma 5.1. So by a union bound, the probability that all iterations succeed is at least $1 - 2^{-\Omega(k_t)}$. In the final step, the success probability is $1 - 1/\mathrm{poly}(s)$ by Theorem 5.3. Hence the final success probability is as desired.

Proof of Theorem 5.4. The correctness and communication complexity immediately follow from Lemma 5.6 and Lemma 5.7. For the efficiency, note that in Alice's algorithm she just randomly generates bipartite graphs with logarithmic degree and applies the expander encoding to get the sketch, so this takes near linear time. For Bob's algorithm, since the $S_i$, $i \in [t]$, are disjoint, the belief propagation can be done in near linear time, and the other operations also take near linear time. So Bob's algorithm also runs in near linear time.

When $t$ is large, we can group some sets together to reduce $t$ and hence get the following theorem.

Theorem 5.8.
There is a one-way efficient protocol s.t. for every $(\mathbf{s}, \mathbf{k}, t)$ DE with $k_i \le s_i/2, \forall i \in [t]$, it has success probability $1 - 2^{-\Omega(k_t)} - 1/\mathrm{poly}(s)$ and communication complexity $O\big(\chi(\mathbf{s},\mathbf{k},t) \sum_{i\in[t]} k_i \log\frac{s_i}{k_i}\big)$. The running time of both parties is $\tilde{O}(n)$.

Proof. We cut the interval $[2, n+1)$ into $t' = O(\log\log n)$ intervals s.t. the $j$-th interval $I_j$ is $[2^{2^{j-1}}, 2^{2^j})$. Then for all $i$ s.t. $s_i/k_i \in I_j$, we take the union of the corresponding sets to form $S''_j$, and take $k''_j$ to be the sum of the corresponding $k_i$'s. We discard the intervals that do not cover any $s_i/k_i$, getting a new problem, i.e., an $(\mathbf{s''}, \mathbf{k''}, \chi)$ error correction problem.

By Theorem 5.4, the communication complexity is $O\big(\chi \sum_{j\in[\chi]} k''_j \log\frac{s''_j}{k''_j}\big)$. Since for every $j \in [\chi]$ and every $i$ with $s_i/k_i \in I_j$ we have $\log\frac{s_i}{k_i} = O(\log\frac{s''_j}{k''_j})$, the communication complexity is actually $O\big(\chi \sum_{i\in[t]} k_i \log\frac{s_i}{k_i}\big)$. The time complexity and success probability are implied by Theorem 5.4.

Notice that $\chi$ can be at most $O(\log\log n)$. So we have the following corollary.

Corollary 5.9.
There is a one-way efficient protocol s.t. for every $(\mathbf{s}, \mathbf{k}, t)$ DE with $k_i \le s_i/2, \forall i \in [t]$, it has success probability $1 - 2^{-\Omega(k_t)} - 1/\mathrm{poly}(s)$ and communication complexity $O\big(\log\log n \sum_{i\in[t]} k_i \log\frac{s_i}{k_i}\big)$. The running time of both parties is $\tilde{O}(n)$.

We show that our construction for DE can be modified to work in the stochastic coding setting.
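The grouping step behind Theorem 5.8 and Corollary 5.9 merges all sets whose ratio $s_i/k_i$ falls into the same doubly exponential interval $[2^{2^{j-1}}, 2^{2^j})$, so that at most $O(\log\log n)$ groups survive. A small illustrative sketch of this grouping (the function names are ours, not the paper's):

```python
def interval_index(ratio):
    # the j-th interval is [2^(2^(j-1)), 2^(2^j)); since the ratios
    # s_i/k_i lie in [2, n] (because k_i <= s_i/2), at most
    # O(log log n) distinct indices j can ever occur
    j = 1
    while ratio >= 2 ** (2 ** j):
        j += 1
    return j

def group_sets(pairs):
    # merge all (s_i, k_i) whose ratio shares an interval:
    # s''_j is the total size of the merged sets, k''_j the summed k_i
    buckets = {}
    for s, k in pairs:
        j = interval_index(s / k)
        sj, kj = buckets.get(j, (0, 0))
        buckets[j] = (sj + s, kj + k)
    return buckets
```

Within one interval, $\log(s_i/k_i)$ varies by at most a factor of 2, which is why $\log\frac{s_i}{k_i} = O(\log\frac{s''_j}{k''_j})$ and the merged instance loses only the $\chi = O(\log\log n)$ factor in the communication bound.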
Theorem 5.10.
There is an efficient stochastic ECC s.t. for every $(\mathbf{s}, \mathbf{k}, t)$ type of errors with $k_i \le s_i/2, \forall i \in [t]$, it has success probability $1 - 2^{-\Omega(k_t)} - 1/\mathrm{poly}(s)$ and message length $n - O\big(\chi(\mathbf{s},\mathbf{k},t) H(\mathbf{s}, \mathbf{k})\big)$. The running times of both encoding and decoding are $\tilde{O}(n)$.

Proof. For encoding, we first compute the length of the redundancy. By Alice's algorithm of Theorem 5.8, the sketch length for $(\mathbf{s}, \mathbf{k}, t)$ document exchange, on input strings of length $n$, is $\ell = O\big(\chi(\mathbf{s},\mathbf{k},t) \sum_{i\in[t]} k_i \log\frac{s_i}{k_i}\big)$. If we apply an asymptotically good ECC $C$, e.g. expander codes [32, 34], to encode the sketch, then the output has length $r = O(\ell)$. Let the message length be $n - r$.

The encoding of message $x$ has two parts. The first part is the message itself. The second part is the sketch for $(\mathbf{s}, \mathbf{k}, t)$ document exchange on input $x \circ 0^r$, where $0^r$ is the all-0 string of length $r$. We know the sketch length is $\ell$. Next we apply $C$ to the sketch to get $z$, which has length $r$. The final codeword is $x \circ z$.

We claim this code can indeed resist $(\mathbf{s}, \mathbf{k}, t)$ type errors, by describing the decoding along with its analysis. For decoding, assume the input is $x' \circ z'$. Note that even if all errors happen on $z$, we can still recover $z$ from $z'$, since $z$ is a codeword of an ECC correcting $k$ errors. After we get $z$, we apply Bob's algorithm of Theorem 5.8 on $x' \circ 0^r$, using the sketch $z$. The decoding will succeed because the error type is still $(\mathbf{s}, \mathbf{k}, t)$, as we only remove some errors that happened on $z$.

The success probability comes only from Theorem 5.8, since that is the only part using randomness. So the success probability is as desired. The encoding and decoding run in near linear time since the protocol and the asymptotically good code [34] both run in near linear time.

6 Document Exchange with Asymmetric Information in a Special Setting
We first develop a randomized two-party (Alice and Bob) one-way Hamming-error document exchange protocol in which Bob knows that the errors can only happen in some subsets of all positions, where in each subset the number of errors is also bounded. The reason we consider this kind of encoding/decoding for special error patterns is that it can have shorter redundancy than general coding for a bounded number of Hamming errors. The encoding utilizes a randomized bipartite expander graph with a large expansion.
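Concretely, the expander-code sketch is a set of parity checks over the string, and decoding flips any bit whose checks are mostly unsatisfied. The toy sketch below is our own simplified single-stage version of that mechanism (not the two-stage algorithm of Construction 6.3 below), run on a trivially perfect expander:

```python
def parity_sketch(x, graph, m):
    # expander-code sketch: z[j] is the XOR of x[i] over all left
    # vertices i whose neighborhood Gamma(i) contains check j
    z = [0] * m
    for i, nbrs in enumerate(graph):
        for j in set(nbrs):
            z[j] ^= x[i]
    return z

def flip_decode(y, z, graph, m, S):
    # repeatedly flip a bit of S with more unsatisfied than satisfied
    # checks; each flip strictly lowers the number of unsatisfied
    # checks, so the loop terminates
    y = list(y)
    while True:
        zy = parity_sketch(y, graph, m)
        unsat = {j for j in range(m) if zy[j] != z[j]}
        for i in S:
            checks = set(graph[i])
            bad = len(checks & unsat)
            if bad > len(checks) - bad:
                y[i] ^= 1
                break
        else:
            return y

# demo on disjoint neighborhoods, i.e. a trivially perfect expander
g = [[3 * i, 3 * i + 1, 3 * i + 2] for i in range(8)]
x = [1, 0, 1, 1, 0, 0, 1, 0]
z = parity_sketch(x, g, 24)
y = list(x); y[2] ^= 1; y[5] ^= 1   # two Hamming errors inside S
print(flip_decode(y, z, g, 24, S=list(range(8))) == x)   # prints True
```

With a genuine $(\cdot, \cdot, \cdot, S, [k', k], 0.9)$ expander the same argument as in Lemma 5.2 shows the flipping drives the error count below the expander's lower threshold rather than to zero, which is why the constructions below add a second stage.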
Lemma 6.1.
For every $n, k, k', k'', r, d, t \in \mathbb{N}$ with $k' \le r \le k \le n$, $k'' t \log\frac{ekt}{k''} \le k'\log\frac{k}{k'}$, $\delta \in (0,1)$, $d \ge \delta^{-1}$, constant $c > 1$, and disjoint sets $S_i \subseteq [n]$, $i \in [t]$, with $|S_i| = k \cdot 2^{O(i)}$ and $k_i = \max(k/2^{O(i)}, k'') \le |S_i|/2$, the probability that a random bipartite graph with $n$ left vertices, $m \ge 2dk/\delta$ right vertices, and left degree $d$ satisfies

for every $R \subseteq \cup_{i\in[t]} S_i$ with $|R| = r \ge k'$ and $|R \cap S_i| \le k_i, \forall i \in [t]$, it holds that $|\Gamma(R)| > (1-\delta)dr$,

is at least $1 - \varepsilon$, where $\varepsilon = 2^{-\Theta(\delta d k' \log\frac{k}{k'})}$.

We denote the generated expander graph as an $(n, m, d, \mathbf{S}, \mathbf{k}, [k', k], 1-\delta)$-expander, where $\mathbf{k}$ is the sequence of all $k_i$, $i \in [t]$.
We show that a uniformly sampled bipartite graph works. The bipartite graph with $n$ left vertices, $m$ right vertices, and left degree $d$ is generated as follows: each edge, from a vertex on the left, has its endpoint chosen uniformly at random from the right vertices.

For a fixed $R$, if $|\Gamma(R)| \le (1-\delta)dr$, then there exists a set $T \subseteq [m]$ with $|T| = (1-\delta)dr$ and $\Gamma(R) \subseteq T$. There are at most
$$\binom{m}{|T|} \le \left(\frac{em}{|T|}\right)^{|T|} = \left(\frac{em}{(1-\delta)dr}\right)^{(1-\delta)dr} \quad (8)$$
such sets $T$. For each $T$,
$$\Pr[\Gamma(R) \subseteq T] = \left(\frac{|T|}{m}\right)^{dr} = \left(\frac{(1-\delta)dr}{m}\right)^{dr}. \quad (9)$$
Consider a fixed $r$, and assume $r \in [k_{j+1}, k_j]$ for some $j \in [t]$; notice that $j = \Theta(\log\frac{k}{r})$. Let $r_i = |R \cap S_i|$. The total number of different sequences $r_1, \ldots, r_t$ is at most
$$\binom{r+t}{r} \le \left(\frac{e(r+t)}{r}\right)^r \le \left(O\Big(\frac{2k}{r}\Big)\right)^r \le 2^{O(r\log\frac{k}{r})}. \quad (10)$$
Consider a fixed sequence $r_i$, $i \in [t]$, with $r_i \le k_i$. The total number of possibilities of $R \cap S_j, \ldots, R \cap S_t$ is at most
$$\prod_{i=j}^{t}\binom{|S_i|}{r_i} \le \prod_{i=j}^{t}\binom{|S_i|}{k_i} \le \prod_{i=j}^{t}\left(\frac{e|S_i|}{k_i}\right)^{k_i} \le \prod_{i=j}^{t'} 2^{O(ik/2^{O(i)})}\cdot\prod_{i=t'}^{t}\left(\frac{e|S_t|}{k''}\right)^{k''} = 2^{O\big(\sum_{i=j}^{t'} ik/2^{O(i)}\big)}\cdot 2^{O((t-t')k''\log\frac{ekt}{k''})} \le 2^{O(jk/2^{O(j)})}\cdot 2^{O(r\log\frac{k}{r})} = 2^{O(r\log\frac{k}{r})},$$
where $t'$ is the first index s.t. $k_{t'} = k''$, and we used the assumption $k''t\log\frac{ekt}{k''} \le k'\log\frac{k}{k'} \le r\log\frac{k}{r}$. On the other hand, the total number of possibilities of $R \cap S_1, \ldots, R \cap S_{j-1}$ is at most
$$\prod_{i=1}^{j-1}\binom{|S_i|}{r_i} \le \binom{\sum_{i=1}^{j-1}|S_i|}{\sum_{i=1}^{j-1} r_i} \le \binom{k\cdot 2^{O(j)}}{r} \le \left(\frac{ek\cdot 2^{O(j)}}{r}\right)^r \le 2^{O(r\log\frac{k}{r})}. \quad (11)$$
So by a union bound, the probability that some $R$ with $|R| = r$ and $|R \cap S_i| \le k_i$ has $|\Gamma(R)| \le (1-\delta)dr$ is at most
$$\left(\frac{em}{(1-\delta)dr}\right)^{(1-\delta)dr}\times\left(\frac{(1-\delta)dr}{m}\right)^{dr}\times 2^{O(r\log\frac{k}{r})} = e^{(1-\delta)dr}\left(\frac{(1-\delta)dr}{m}\right)^{\delta dr} 2^{O(r\log\frac{k}{r})} \le e^{dr}e^{-\delta dr\log\frac{m}{dr}}2^{O(r\log\frac{k}{r})} \le 2^{-\Theta(\delta dr\log\frac{k}{r})} \quad (12)$$
by letting $m = 2dk/\delta$. Since $k \ge r \ge k'$, it holds that $2^{-\Theta(\delta dr\log\frac{k}{r})} \le 2^{-\Theta(\delta dk'\log\frac{k}{k'})}$.

Remark 6.2.
Note that we can use a $\kappa = O(kd)$-wise independence generator to generate the edges of the graph. Each edge is chosen according to a random variable in a sequence that is $\kappa$-wise independent, where each random variable has support size $m$. Hence inequality (9) still holds, and we can apply the same argument.

The decoding algorithm has two parts, both using belief propagation techniques. In the first part, we reduce the number of errors slightly by using $z_1$. In the second part, we further reduce the number of errors to 0 by using $z_2$.

Construction 6.3 (Protocol for a specific setting of parameters). Let $n, m, d, t \in \mathbb{N}$, $k_i \in \mathbb{N}$, $k_i \le n$, $i \in [t]$, $k' = O(k/\log\frac{n}{k})$, and disjoint sets $S_i \subseteq [n]$, $i \in [t]$. Let $S = \cup_{i\in[t]} S_i$.

Let $\Gamma_1 : [n]\times[d] \to [m]$ be an expander graph s.t. for all $R \subseteq \cup_{i\in[t]} S_i$ with $|R| \in [k', O(k)]$ and $|R \cap S_i| \le 20k_i$ for all $i \in [t]$, it holds that $|\Gamma_1(R)| > 0.9 d|R|$.

Let $C$ be a systematic Algebraic Geometry code from Theorem 3.5, with alphabet $\mathbb{F}_q$, message length $n/\log q$, and redundancy length $O(k')$, correcting $2k'$ errors.

Let $x \in \{0,1\}^n$ be the original message. The decoding takes an input string $y \in \{0,1\}^n$, the parity checks $z_1$ generated by expander encoding of $x$ using $\Gamma_1$, and $z_2$, the redundancy part of $C(x)$.

Stage 1:
1. (Generating the restriction set) Let $V = \emptyset$. For every $i \in [t]$, if the number of flipped bits in $S_i$ is less than $19k_i$, then $V = V \cup S_i$; otherwise $V = V \cup \{j \mid \text{the } j\text{-th bit was flipped previously by this algorithm}\}$. (If a bit is flipped twice, it is regarded as not flipped.)
2. Find $j \in V$ s.t. the number of unsatisfied parity checks in $\Gamma_1(j)$ is larger than $|\Gamma_1(j)|/2 = d/2$; flip the $j$-th bit and restart this stage. If there is no such $j$, go to the next step;
3. Go to the next stage.

Stage 2 (classic decoding using $z_2$):
1. Apply the decoding of $C$ on the current $y$ concatenated with $z_2$.
2. Output the decoded message.

Lemma 6.4.
If $HD(y_{[n]\setminus S}, x_{[n]\setminus S}) = 0$ and $\forall i \in [t]$, $HD(y_{S_i}, x_{S_i}) \le k_i$, then the decoder outputs $x$ correctly.

Proof.

Claim 6.5.
The first stage ends in at most $O(m)$ rounds, and the number of errors in $y$ is reduced to fewer than $2k'$.

Proof. Let $A_\tau$ be the set of indices of tampered bits (compared to $x$) in $y$ at (immediately before) the $\tau$-th round. At the beginning, $|A_1| = HD(y, x)$.

We first show that if $|A_\tau| \ge 2k'$, then we can indeed find an index $j \in V$ s.t. the number of unsatisfied parity checks in $\Gamma_1(j)$ is larger than $|\Gamma_1(j)|/2$. Let $A'_\tau = A_\tau \cap V$. Let $s, s'$ be the numbers of satisfied checks in $\Gamma_1(A_\tau), \Gamma_1(A'_\tau)$, and $u, u'$ the numbers of unsatisfied checks in $\Gamma_1(A_\tau), \Gamma_1(A'_\tau)$.

Consider $i \in [t]$ s.t. the number of flipped bits in $S_i$ is exactly $19k_i$. As $HD(y_{S_i}, x_{S_i}) \le k_i$, the number of tampered bits in $S_i$ is at most $20k_i$, so $|A_\tau \cap S_i| \le 20k_i$. Also, noting that those tampered bits which were flipped by the algorithm at the beginning of this stage remain in $V$, the current number of tampered bits in $A'_\tau \cap S_i$ is at least $18k_i$. So
$$|A'_\tau \cap S_i| \ge 0.9 |A_\tau \cap S_i|. \quad (13)$$
For $i \in [t]$ s.t. the number of flipped bits is less than $19k_i$, since $V \cap S_i = S_i$,
$$|A'_\tau \cap S_i| = |A_\tau \cap S_i|. \quad (14)$$
As a result, noting that the $S_i$, $i \in [t]$, are disjoint,
$$\frac{|A'_\tau|}{|A_\tau|} = \frac{\sum_i |A'_\tau \cap S_i|}{\sum_i |A_\tau \cap S_i|} \ge 0.9. \quad (15)$$
As $|A_\tau| \ge 2k'$, it holds that $|A'_\tau| \ge 1.8k' \ge k'$. By the expansion property of $\Gamma_1$,
$$s' + u' = |\Gamma_1(A'_\tau)| \ge 0.9 d |A'_\tau|. \quad (16)$$
On the other hand, note that $2s + u \le d|A_\tau|$, since each satisfied check in $\Gamma_1(A_\tau)$ must have at least two bits of $x_{A_\tau}$ as addends. As $A'_\tau = A_\tau \cap V$, we have $s' \le s$, $u' \le u$. Thus
$$2s' + u' \le 2s + u \le d|A_\tau|. \quad (17)$$
Combining (16) and (17), we get
$$u' \ge 1.8 d |A'_\tau| - d|A_\tau|. \quad (18)$$
Further, by (15), $|A_\tau| \le |A'_\tau|/0.9$, so
$$u' \ge 0.68 d |A'_\tau| > \frac{d}{2}|A'_\tau|. \quad (19)$$
Hence by an averaging argument, there is an index $j \in V$ s.t. the number of unsatisfied parity checks in $\Gamma_1(j)$ is more than $d/2$.

As a result, after the flipping in this round, the number of unsatisfied parity checks strictly decreases. Also note that, because of the restriction sets in our algorithm, the operations cannot create a set $A_\tau$ that falls outside the regime where the expansion guarantee applies. Hence the first stage continues as long as $|A_\tau| \ge 2k'$.

Next consider the case $|A_\tau| < 2k'$ at the beginning of a round $\tau$. There are two possible cases. The first case is that in step 2 the algorithm finds no $j \in V$ to flip, so it goes to the next stage as desired. The second case is that there is still an index $j \in V$ whose unsatisfied parity checks in $\Gamma_1(j)$ are more than half; after flipping, the number of unsatisfied parity checks again strictly decreases. Note that there are at most $O(m)$ unsatisfied checks, so the procedure ends in at most $O(m)$ rounds. In either case, stage 1 ends with $|A_\tau| < 2k'$. This shows the claim.

As a result, after stage 1, the number of errors is less than $2k'$. As $C$ can correct $2k'$ errors, by Theorem 3.5, the decoding algorithm outputs $x$ correctly.

Theorem 6.6.
There is an efficient one-way protocol for every $(\mathbf{s}, \mathbf{k}, t)$ DE with arbitrary $s_i = k\cdot 2^{\Theta(i)}$, $l = \Omega(\log\frac{n}{k})$, $k_i = \max\{k/2^{\Theta(i)}, \Theta(\frac{k}{l}\log\frac{n}{k})\} \le s_i/2$, and $t \le O(\sqrt{l})$, having communication complexity $O(k)$ and success probability $1 - 2^{-\Theta(\frac{k\log\log\frac{n}{k}}{\log\frac{n}{k}})}$.

Proof. The protocol is given by Construction 6.3, and we use a random $(n, m, d)$ bipartite graph as $\Gamma_1$. By Lemma 6.1, a random bipartite $(n, m, d)$ graph $\Gamma_1$ is an $(n, m, d, \mathbf{S}, \mathbf{k}, [k', k], 0.9)$ expander except with probability $\varepsilon = 2^{-\Theta(k'\log\frac{k}{k'})}$, where we let $m = O(2dk)$, $d = O(1)$, the per-set bounds be $20k_i$, $i \in [t]$, and $k' = O(k/\log\frac{n}{k})$. Also, since $t \le O(\sqrt{l})$, we have $k''t\log\frac{ekt}{k''} \le k'\log\frac{k}{k'}$, where $k'' = O(\frac{k}{l}\log\frac{n}{k})$.

By Lemma 6.4, Bob can compute $x$ using $y, z, \mathbf{S}, \mathbf{k}, k'$ and the common randomness. The communication complexity is $|z| = m = O(k)$. The protocol is efficient since both encoding and decoding are efficient. The failure probability is $\varepsilon = 2^{-\Theta(\frac{k\log\log\frac{n}{k}}{\log\frac{n}{k}})}$, since the construction of $\Gamma_1$ is the only part that uses randomness.

Note that Theorem 1.10 directly follows from Theorem 6.6 by letting $l = \Theta(t^2)$.

In this section we give the one-way document exchange protocol for edit distance. We begin with a randomized protocol where the two parties have shared randomness.
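In Construction 7.1 below, Alice hashes her string at $L = O(\log\frac{n}{k})$ levels, halving the block size at each level. A minimal sketch of just this hashing step, using keyed blake2b as a stand-in for the shared-randomness hash family $h_j$ (the function and parameter names are ours):

```python
import hashlib

def level_hashes(x, num_levels, b0, salt=b"shared-randomness"):
    # level i cuts x into blocks of size b0 / 2^i and hashes each block
    # with its own short (here 1-byte, i.e. constant-size) hash;
    # the key (salt, level, block index) plays the role of the
    # independent random hash functions h_j of the construction
    levels = []
    b = b0
    for i in range(num_levels):
        blocks = [x[j:j + b] for j in range(0, len(x), b)]
        hs = [hashlib.blake2b(blk, digest_size=1,
                              key=salt + bytes([i, j % 256])).digest()
              for j, blk in enumerate(blocks)]
        levels.append(hs)
        b //= 2
    return levels

print([len(lvl) for lvl in level_hashes(b"0" * 32, 3, 16)])  # [2, 4, 8]
```

Bob recomputes the same hashes on his current guess of $x$; a block whose hash disagrees is flagged as wrong, and only the flagged blocks descend to the next, finer level, which is what keeps the total sketch length at $O(k)$ per level.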
Construction 7.1.
The input string for Alice has length $n \in \mathbb{N}$ and there are in total $k \in [\Theta(\log\frac{n}{k}), \Theta(n)]$ edit errors between Alice's string and Bob's string.

Both Alice's and Bob's algorithms have $L = O(\log\frac{n}{k})$ levels. For every $i \in [L]$, in the $i$-th level,
• the block size is $b_i = \frac{n}{6k\cdot 2^{i-1}}$, i.e., in each level we divide a block of the previous level evenly into two blocks (we choose $L$ properly s.t. $b_L = O(\log\frac{n}{k})$);
• the number of blocks is $l_i = n/b_i$.

Alice: on input $x \in \{0,1\}^n$;
1. For the $i$-th level,
1.1. Partition $x$ into consecutive blocks $x[1, b_i], x[1+b_i, 2b_i], \ldots, x[1+(l_i-1)b_i, l_ib_i]$;
1.2. Let $h_j : \{0,1\}^{b_i} \to \{0,1\}^c$, $j \in [l_i]$, be a sequence of random hash functions, with $c$ a large enough constant positive integer;
1.3. Compute $v[i][j] = h_j(x[1+(j-1)b_i, jb_i])$, $j \in [l_i]$;
1.4. $v[i] = (v[i][1], \ldots, v[i][l_i])$;
1.5. By the sketch construction of Theorem 6.6, compute $z[i] \in \{0,1\}^{m = O(k)}$, a sketch of $v[i]$, the expander constructed in this step being $\Gamma_1 : [l_i] \times [d=10] \to [m]$;
2. Compute the redundancy $z_{final} \in (\{0,1\}^{b_L})^{\Theta(k)}$ for the blocks of the $L$-th level by Theorem 3.5, where the code has distance $16k$;
3. Send $z = (z[1], z[2], \ldots, z[L])$, $v[1]$, $z_{final}$.

Bob: on input $y \in \{0,1\}^{O(n)}$ and the received $z$, $v[1]$, $z_{final}$;
1. Create $\tilde{x} \in \{0,1,*\}^n$ (i.e., his current version of Alice's $x$), initializing it to $(*, *, \ldots, *)$;
2. Let $A_1 = [l_1]$, $A_i = \emptyset$, $i = 2, 3, \ldots, L$;
3. For the $i$-th level, where $2 \le i \le L-1$,
3.1. Divide $\tilde{x}$ into length-$b_i$ consecutive blocks $\tilde{x}[1, b_i], \ldots, \tilde{x}[1+(l_i-1)b_i, l_ib_i]$;
3.2. Utilize the common randomness to get the functions $h_j : \{0,1\}^{b_i} \to \{0,1\}^c$, $j \in [l_i]$, that Alice uses in her step 1.2;
3.3. Compute $\tilde{v}[i] = (h_1(\tilde{x}[1, b_i]), \ldots, h_{l_i}(\tilde{x}[1+(l_i-1)b_i, l_ib_i]))$;
3.4. For every $i' \in [i-1]$, let $S_{i'} \subseteq [l_i]$ be the indices of the (descendant) blocks in the current level whose ancestors are the blocks indicated by $A_{i'}$, i.e., $j \in S_{i'}$ iff there is $j' \in A_{i'}$ s.t. $[1+(j-1)b_i, jb_i] \subseteq [1+(j'-1)b_{i'}, j'b_{i'}]$;
3.5. Compute $v[i]$ by using the decoding algorithm from Construction 6.3 on input $\tilde{v}[i]$, the sets $S_{i'}$ with $k_{i'} = \max(k/2^{0.4c(i-i')}, k/\log\frac{n}{k})$, $i' = i-1, i-2, \ldots, 1$, and the received $z[i]$;
3.6. Let $T_i = \emptyset$. For every $j \in [l_i]$, if $v[i][j] \ne \tilde{v}[i][j]$, then put $j$ into $T_i$ and check every $i' = i-1, i-2, \ldots, 1$: if the $j$-th block of the current level is a descendant of the $j'$-th block of the $i'$-th level, then remove $j'$ from $A_{i'}$;
3.7. Let $A_i = T_i$;
3.8. Compute $w \in (A_i \times [|y|])^{|w|}$, the maximum monotone matching between $x$'s blocks indicated by $A_i$ and $y$, under $h_1, \ldots, h_{l_i}$, using $v[i]$, by Lemma 3.10 (we interpret $w$ as a sequence of matches, the $j$-th match being denoted $(w[j][1], w[j][2])$);
3.9. Evaluate $\tilde{x}$ according to $w$, i.e., set the block indicated by $w[j][1]$ to $y[w[j][2], w[j][2]+b_i-1]$, $\forall j \in [|w|]$;
4. In the $L$-th level, apply the decoding of Theorem 3.5 on the blocks of $\tilde{x}$ and $z_{final}$ to get $x$;
5. Return $x$.

Next we show the correctness of our construction. Consider every level $i \in [L]$ and every $i' = i-1, i-2, \ldots, 1$. We denote the set of descendants in the $i$-th level stemming from $A_{i'}$ as $\tilde{A}_{i'}$. The set of indices of undetected wrongly recovered blocks in $\tilde{A}_{i'}$ is denoted $B_{i'}$, $i' = i-1, \ldots, 1$. Let $i^*$ be s.t. $k'' \in [k/2^{0.4c(i-i^*)}, k/2^{0.4c(i-i^*+1)}]$, where $k'' = k/\log\frac{n}{k}$.

Lemma 7.2.
For every $i \in [L]$, if $\forall i' < i$, $|T_{i'}| = O(k)$ and $v[i']$ is computed correctly by Bob, then
• for every $i' \le i^*$, the probability that $|B_{i'}| \ge k''$ is at most $2^{-\Omega(k'')}$;
• for every $i' \in (i^*, i)$, the probability that $|B_{i'}| \ge k_{i'} = k/2^{0.4c(i-i')}$ is at most $2^{-\Omega(ck/2^{0.4c(i-i')})}$.

Proof. Consider the possibilities of $B_{i'}$. Each possibility can be described by a $w$-witness with $w = |B_{i'}|$. The witness is a sequence of $w$ indices, each in the $i$-th level, indicating a wrongly recovered block. This sequence is further partitioned into $i - i' + 1$ groups corresponding to levels $i', i'+1, \ldots, i$; we enumerate these groups as group $i', i'+1, \ldots, i$.

Consider the trees rooted at blocks in $A_{i'}$. Each of them has height $i - i'$, and each node is a block in a certain level between $i'$ and $i$. The $w$-witness describes the level-$i$ bad blocks that are descendants of blocks in $A_{i'}$, uniquely, in the following way. Group $j \in [i', i]$ consists of indices of bad blocks, one for each depth-$(i-j)$ tree whose root is a wrong block in level $j$. Note that one tree may have many bad leaf blocks; in that case we pick only the leftmost wrong one. After each pick, we cut all the edges on the path from the picked block to the root, which splits the tree into sub-trees of depths from $1$ to $i-j$, plus the picked block itself. We update the forest by adding the sub-trees obtained from cutting and deleting the tree that was cut. In this way, every error pattern can be described: every bad leaf block is either picked or still in one of the trees of the forest, and once it is in one of the trees it can be picked at a later stage of the picking procedure.

Let the number of wrong blocks picked for each level $j$ be $w_j$. The total number of error patterns is
$$P = \binom{k}{w_{i'}}(i-i')^{w_{i'}}\binom{w_{i'}}{w_{i'+1}}(i-i'-1)^{w_{i'+1}}\binom{w_{i'}+w_{i'+1}}{w_{i'+2}}(i-i'-2)^{w_{i'+2}}\cdots\binom{\sum_{j=i'}^{i-1}w_j}{w_i} \le \binom{k}{w_{i'}}\binom{\sum_{j=i'}^{i-1}(i-j)w_j}{\sum_{j=i'}^{i}w_j}2^{\sum_{j=i'}^{i-1}(i-j)w_j}. \quad (20)$$
For $i' \le i^*$, suppose $\sum_{j=i'}^{i}(i-j)w_j = k''$. Then
$$P \le \binom{k}{k''/(i-i')}\cdot 2^{O(k'')} \le 2^{\frac{k''}{i-i'}\cdot O(\log\frac{k(i-i')}{k''})}\cdot 2^{O(k'')} \le 2^{O(k'')}, \quad (21)$$
since $i - i' \ge i - i^* = \Omega(\log\log\frac{n}{k})$ while $\log\frac{k}{k''} = O(\log\log\frac{n}{k})$. Note that a specific error pattern happens with probability at most $2^{-c\sum_{j=i'}^{i}(i-j)w_j} = 2^{-ck''}$, because each block in group $j$ is checked $i-j$ times independently. Since $c$ is a large enough constant and $\sum_{j}(i-j)w_j$ is an integer in $[0, \mathrm{poly}(k\log n)]$, by a union bound, $\sum_{j=i'}^{i}(i-j)w_j \ge k''$ happens with probability at most $2^{-ck''}\times 2^{O(k'')}\times\mathrm{poly}(k\log n) \le 2^{-\Omega(k'')}$.

For $i' > i^*$, suppose $\sum_{j=i'}^{i}(i-j)w_j = k/2^{0.4c(i-i')} = k_{i'}$. Then
$$P \le \binom{k}{k_{i'}/(i-i')}\cdot 2^{O(k_{i'})} \le 2^{(0.4c(i-i')+O(1)+\log(i-i'))\cdot k_{i'}/(i-i')}\cdot 2^{O(k_{i'})} \le 2^{0.5ck_{i'}}, \quad (22)$$
when $c$ is a large enough constant. Similarly, a specific error pattern happens with probability at most $2^{-c\sum_{j=i'}^{i}(i-j)w_j} = 2^{-ck_{i'}}$, because each block in group $j$ is checked $i-j$ times independently. Since $c$ is a large enough constant and $\sum_{j}(i-j)w_j$ is an integer in $[0, \mathrm{poly}(n)]$, by a union bound, $\sum_{j=i'}^{i}(i-j)w_j \ge k_{i'}$ happens with probability at most $2^{-ck_{i'}}\times 2^{0.5ck_{i'}}\times\mathrm{poly}(k\log n) \le 2^{-\Omega(ck_{i'})}$.

As a result, $w = \sum_{j=i'}^{i} w_j > k_{i'}$ happens with probability at most $2^{-\Omega(k_{i'})} \le 2^{-\Omega(k'')}$.

Lemma 7.3.
For every $i \in [L]$, if $\forall i' < i$, $|T_{i'}| = O(k)$ and $v[i']$ is computed correctly by Bob, then with probability at least $1 - 2^{-\Omega(k'')}$,
$$\sum_{i'=1}^{i-1}|B_{i'}| < k.$$

Proof. By Lemma 7.2, for $i' \le i^*$, with probability at least $1-2^{-\Omega(k'')}$, $|B_{i'}| < k''$; for $i' > i^*$, with probability at least $1-2^{-\Omega(k'')}$, $|B_{i'}| \le k_{i'} = k/2^{0.4c(i-i')}$. By a union bound, with probability at least $1 - i\cdot 2^{-\Omega(k'')} = 1-2^{-\Omega(k'')}$,
$$\sum_{i'=1}^{i-1}|B_{i'}| = \sum_{i'=1}^{i^*}|B_{i'}| + \sum_{i'=i^*+1}^{i-1}|B_{i'}| \le (i^*-1)k'' + 0.1k < k.$$

Lemma 7.4.
For every $i \in [L]$, at level $i$, if $v[i]$ is computed correctly by Bob and $|T_i| = O(k)$, then with probability $1-2^{-\Theta(k)}$, the number of wrongly recovered blocks introduced by $w$ is at most $k$.

Proof. Assume the number of wrongly recovered blocks introduced by $w$ is larger than $k$. Then more than $k$ pairs in the matching are bad pairs, i.e., more than $k$ hash collisions occur, which for a fixed matching happens with probability at most $2^{-ck}$. Note that by Lemma 3.10, for $w$,
$$|\rho'_1 - \rho_1| + |(\rho'_2-\rho'_1)-(\rho_2-\rho_1)| + \cdots + |(\rho'_{|w|}-\rho'_{|w|-1})-(\rho_{|w|}-\rho_{|w|-1})| \le O(k).$$
By Lemma 3.9, since $|T_i| = O(k)$, there are in total $2^{O(k)}$ possible matchings that can be output by our algorithm. So by a union bound, the conclusion holds with probability $1-2^{-\Theta(k)}$.

Lemma 7.5.
For every $i \in [L]$, in level $i$, if $v[i]$ is computed correctly and $|T_i| = O(k)$, then with probability $1-2^{-\Theta(k)}$, the number of wrongly recovered blocks and uncovered blocks in $T_i$ after step 3.9 is at most $2k$.

Proof. By Lemma 3.10, $|w| \ge |T_i| - k$. Thus the number of uncovered blocks is at most $k$. By Lemma 7.4, with probability $1-2^{-\Theta(k)}$, the number of wrongly recovered blocks introduced by $w$ is at most $k$. So the total number of wrongly recovered blocks is at most $2k$.

Lemma 7.6.
For every $i \in [L]$, with probability $1-2^{-\Theta(k'')}$,
• after the first step of level $i$, the number of wrongly recovered blocks is at most $6k$;
• Bob can compute $v[i]$ correctly;
• the number of wrongly recovered blocks in $T_i$ is at most $2k$ after step 3.9.

Proof. We use induction. In the first level, $\tilde{x} = (*, *, \ldots, *)$, so the number of wrongly recovered blocks at the beginning is $l_1 = n/b_1 = 6k$; thus it is at most $6k$. Also, Bob gets $v[1]$ correctly, since it is sent directly by Alice. By Lemma 7.5, with probability $1-2^{-\Theta(k)}$, the total number of wrongly recovered blocks is at most $2k$, if we regard uncovered blocks as wrongly recovered.

Suppose the conclusion holds for the first $i-1$ levels, and consider level $i$. By Lemma 7.3, with probability $1-2^{-\Omega(k'')}$, the total number of wrongly recovered blocks is $\sum_{i'=1}^{i-1}|B_{i'}| < k$. By Lemma 6.1, with probability $1-\varepsilon = 1-2^{-\Omega(k')}$, $\Gamma_1$ is a bipartite graph having $l_i$ left vertices, $m = O(k)$ right vertices, and left degree $d = O(1)$, s.t. for all $R \subseteq [l_i]$ with $|R| \in [k', O(k)]$ and $|R \cap S_{i'}| \le k'_{i'} = \max(20k/2^{0.4c(i-i')}, k'')$, it holds that $|\Gamma_1(R)| > 0.9d|R|$. Note that $k'_{i'} \ge k_{i'}$. Also note that $i'$ ranges over $[1, i-1]$, so the number of sets $S_{i'}$ is at most $L$, which is small enough for Theorem 6.6 to apply. So by Theorem 6.6, Bob can get the correct $v[i]$.

As a result, by a union bound, with probability $1 - L\cdot 2^{-\Theta(k'')}$, Bob can compute $v[i]$ correctly. Note that $L = O(\log\frac{n}{k})$ and $k = \Omega(\log\frac{n}{k})$, so this probability is at least $1-2^{-\Theta(k'')}$.

By Lemma 7.5, with probability $1-2^{-\Theta(k)}$, the total number of wrongly recovered blocks in $T_i$ is at most $2k$ after step 3.9. So the overall probability is as desired. This shows the inductive step.

Lemma 7.7.
With probability 1 − 2^{−Θ(k″)}, Bob outputs x correctly.

Proof. By Lemma 7.6, with probability 1 − 2^{−Θ(k″)}, at the last level there are at most 6k wrong blocks. Since z_final is the redundancy for a code with distance 16k, all wrong blocks can be corrected. So Bob computes x correctly.

Lemma 7.8.
The communication complexity is O(k log(n/k)).

Proof. Note that since m = O(k), |z[i]| = O(k). Also note that |v[1]| = O(k), as the output length of the hash function is O(1) and l_1 = O(k). Finally, |z_final| = O(k log(n/k)) by Theorem 3.5. So the overall communication complexity is Σ_{i=1}^{L} |z[i]| + |v[1]| + |z_final| = O(k log(n/k)).

Theorem 7.9.
There exists an efficient one-way edit distance document exchange protocol using common randomness, for every n ∈ ℕ and k = Ω(log(n/k)), having sketch length O(k log(n/k)) and success probability 1 − 2^{−Ω(k/ log(n/k))}.

Proof. It immediately follows from Lemmas 7.7 and 7.8. The protocol is efficient since all components and steps are efficient.

By combining Theorem 7.9 and the result of Haeupler [18], we immediately get the following.
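The combination behind the next theorem is a simple case split on the parameter regime. The sketch below only illustrates that dispatch rule; the two branch functions are placeholders of our own (not the actual protocols), and the threshold constant C is a hypothetical stand-in for the constant hidden by the Ω(·).

```python
import math

C = 10  # hypothetical constant hidden by k = Omega(log(n/k))

def sketch_large_k(x, k):
    # placeholder for the Theorem 7.9 protocol (sketch length O(k log(n/k)))
    return ("large-k", len(x), k)

def sketch_small_k(x, k):
    # placeholder for the protocol of [18] (success 1 - 1/poly(n))
    return ("small-k", len(x), k)

def combined_sketch(x, k):
    """Dispatch on whether k = Omega(log(n/k)): large-k regime uses the
    protocol of Theorem 7.9, otherwise fall back to [18]."""
    n = len(x)
    if k >= C * math.log2(max(n // max(k, 1), 2)):
        return sketch_large_k(x, k)
    return sketch_small_k(x, k)
```

Either branch yields sketch length O(k log(n/k)); only the failure probability differs between the two regimes.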
Theorem 7.10.
There exists an efficient one-way edit distance document exchange protocol using common randomness, for every n, k ∈ ℕ, having sketch length O(k log(n/k)) and success probability 1 − min{2^{−Θ(k/ log(n/k))}, 1/poly(n)}.

Proof. When k = Ω(log(n/k)), we use Theorem 7.9. Otherwise we use the randomized protocol from [18], which has success probability 1 − 1/poly(n). Both of them have the desired sketch length.

In Construction 7.1, we use common randomness to generate the hash functions h_j, j ∈ [l_i], for each i ∈ [L]. We also use common randomness to generate the random bipartite graph Γ for the encoding of the hash values. Now we show that we can use almost κ-wise independence generators to reduce the randomness.

Lemma 7.11.
Replace the common randomness used in Construction 7.1,
• for generating hash functions, by an ε-almost ck-wise independent distribution with ε = 2^{−ck};
• for generating Γ, by an O(k)-wise independent distribution over alphabet [m] (recall that m = O(k)).
Then with probability 1 − 2^{−Θ(k′)}, Bob outputs x correctly.

Proof. We need to recompute the following probabilities. In Lemma 7.2, a specific error pattern happens with probability at most 2^{−ck″} ± ε ≤ 2^{−Ω(ck″)}. In Lemma 7.4, if there are k wrongly matched blocks introduced by w, then there are k hash collisions, each for a different h_j in level i. So the probability is at most 2^{−ck} ± 2^{−ck} = 2^{−Θ(k)}. The rest of the analysis of the above two lemmas still goes through. These two lemmas are the only two in the proof of Lemma 7.7 that use the independence of the hash functions. As a result, the proof of Lemma 7.7 still goes through.

Theorem 7.12.
There exists an efficient one-way edit distance document exchange protocol, for every k = Ω(log(n/k)), having sketch length O(k max{log(n/k), log k}) and success probability 1 − 2^{−Ω(k/ log(n/k))}.

Proof. Consider replacing the common randomness used in Construction 7.1 in the way of Lemma 7.11. By Theorem 3.6, we can use a generator g_1 of seed length O(k max{log k, log(n/k)}) to generate the O(k)-wise independent distribution. By Theorem 3.7, we can use a generator g_2 of seed length O(log k · log(n/ε)) to generate the ε-almost 10ck-wise independent distribution. So we only need to let Alice send the seeds for these two generators, which have total length O(k max{log k, log(n/k)}). Adding the communication complexity calculated in Lemma 7.8, the overall communication complexity is as desired. The correctness and success probability follow from Lemma 7.11. The protocol is efficient since all components and steps are efficient.

8 Asymmetric Document Exchange with Two Sided Information
In this section we study document exchange with two sided asymmetric information. We have thefollowing definition.
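Before the formal definition, the promised structure of the inputs, pairwise disjoint index sets with a per-set Hamming-distance bound, can be made concrete in a small validity check. This is a sketch of our own; the function name and the example strings are not from the paper.

```python
def valid_instance(x, y, subsets, bounds):
    """Check the promise of the two-sided model: the index sets are
    pairwise disjoint, and within each set S_i the Hamming distance
    between x and y is at most k_i."""
    seen = set()
    for S, k in zip(subsets, bounds):
        if seen & S:  # sets must be pairwise disjoint
            return False
        seen |= S
        if sum(x[j] != y[j] for j in S) > k:
            return False
    return True

# x and y differ exactly at indices 1 and 5
x, y = "0110100", "0010110"
print(valid_instance(x, y, [{0, 1, 2}, {4, 5, 6}], [1, 1]))  # True
print(valid_instance(x, y, [{0, 1, 2}, {4, 5, 6}], [1, 0]))  # False
```

Each party holds its own such family of sets and bounds; the indices outside all sets carry no promise for that party.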
Definition 8.1.
There are two parties, Alice and Bob. Alice has a string x ∈ {0, 1}^n and Bob has a string y ∈ {0, 1}^n. Alice knows a vector of disjoint subsets S^A = (S^A_1, · · · , S^A_{t_A}) and a vector of integers k^A = (k^A_1, · · · , k^A_{t_A}). Bob knows a vector of disjoint subsets S^B = (S^B_1, · · · , S^B_{t_B}) and a vector of integers k^B = (k^B_1, · · · , k^B_{t_B}). It is guaranteed that within each set S^A_i or S^B_i, the Hamming distance between x and y is at most k^A_i or k^B_i respectively. Now one party tries to learn the string of the other party.

Again, let s^A = (s^A_1, · · · , s^A_{t_A}) where ∀i, s^A_i = |S^A_i|. Similarly, let s^B = (s^B_1, · · · , s^B_{t_B}) where ∀i, s^B_i = |S^B_i|. We call this problem an (s^A, s^B, k^A, k^B, t_A, t_B) asymmetric document exchange (DE) problem, and we require the protocol to succeed for all possible configurations of the subsets S^A = (S^A_1, · · · , S^A_{t_A}), S^B = (S^B_1, · · · , S^B_{t_B}), and all possible strings x, y that are consistent with the parameters. We again have both lower bounds and upper bounds.

Theorem 8.2.
In an (s^A, s^B, k^A, k^B, t_A, t_B) asymmetric DE problem, suppose Bob learns Alice's string. Let s^A = Σ_{i=1}^{t_A} s^A_i and s^B = Σ_{i=1}^{t_B} s^B_i, and assume s^A + s^B ≤ n. Let k^A = Σ_{i=1}^{t_A} k^A_i and k^B = Σ_{i=1}^{t_B} k^B_i. Then any deterministic protocol has communication complexity at least H(n − s^B, k^A) + H(s^B, k^B), and any randomized protocol with success probability ≥ 2/3 has communication complexity at least H(n − s^B, k^A) + H(s^B, k^B) − O(1). In addition, if ∀i, s^B_i ≥ k^B_i, then any one-round deterministic protocol has communication complexity at least H(n, k^A + k^B). This holds even if both parties know (s^A, s^B) and (k^A, k^B).

Proof. The proof is similar to the one-sided case. For a deterministic protocol, assume for the sake of contradiction that there is a protocol with communication complexity less than H(n − s^B, k^A) + H(s^B, k^B). Then fix Bob's string y; there exist two different x's that produce the same transcript, and in addition the inputs to Bob are the same. Thus Bob will not be able to distinguish the two x's, a contradiction. The case of a randomized protocol is essentially the same, up to an averaging argument.

For the case of a one-round deterministic protocol, the argument is again similar. Assume for the sake of contradiction that there is a protocol with communication complexity less than H(n, k^A + k^B). Fix Bob's string y; the number of different x's within Hamming distance k^A + k^B of y is exactly 2^{H(n, k^A + k^B)}. For each such x, one can arrange the first at most k^A differences to happen in S^A, and the rest of the at most k^B differences to happen in S^B, such that the subsets in S^A and S^B are all disjoint (since s^A + s^B ≤ n). Note that each x gives a vector S^A, and since the one-round transcript is a deterministic function of (x, S^A, s^A, s^B, k^A, k^B), two different x's will produce the same transcript.
At this point, one can define a vector S^B consistent with s^B, k^B and both of the x's (since ∀i, s^B_i ≥ k^B_i). This means the inputs to Bob are the same for the two x's. Since Bob's final output is a deterministic function of the transcript and (y, S^B, s^A, s^B, k^A, k^B), Bob will not be able to distinguish the two x's, a contradiction.

The positive result directly follows from the one-sided result, i.e., Theorem 5.8.

Theorem 8.3.
There exists an explicit protocol for all (s^A, s^B, k^A, k^B, t_A, t_B) DE, having communication complexity O((χ(s^B, k^B, t_B) + 1)(H(n − s^B, k^A) + H(s^B, k^B))) and success probability 1 − 2^{−Ω(min(k_t, k^A))} − 1/poly(s^A + s^B), to let Bob learn Alice's string.

Proof. The two parties can simply assume that there are at most k^A errors in the set [n] − S^B. This contributes one more set (and its error bound) to the error pattern, and the problem becomes a one-sided asymmetric information problem. So we can apply Theorem 5.8 and the conclusion follows.

References

[1] Khaled A. S. Abdel-Ghaffar and Amr El Abbadi. An optimal strategy for comparing file copies.
IEEE Transactions on Parallel and Distributed Systems, 5(1):87–93, 1994.
[2] Micah Adler, Erik D. Demaine, Nicholas J. A. Harvey, and Mihai Pǎtraşcu. Lower bounds for asymmetric communication channels and distributed source coding. In
SODA, pages 251–260, 2006.
[3] Micah Adler and Bruce M Maggs. Protocols for asymmetric communication channels.
Journal of Computer and System Sciences, 63(4):573–596, 2001.
[4] Noga Alon, Oded Goldreich, Johan Håstad, and René Peralta. Simple constructions of almost k-wise independent random variables.
Random Structures & Algorithms, 3(3):289–304, 1992.
[5] Alexandr Andoni, Javad Ghaderi, Daniel Hsu, Dan Rubenstein, and Omri Weinstein. Coding sets with asymmetric information.
ArXiv e-prints, 2018.
[6] Daniel Barbara and Hector Garcia-Molina. Exploiting symmetries for low-cost comparison of file copies. In Proceedings of the 8th International Conference on Distributed Computing Systems, pages 471–479. IEEE, 1988.
[7] Daniel Barbara and Richard J. Lipton. A class of randomized strategies for low-cost comparison of file copies.
IEEE Transactions on Parallel and Distributed Systems, 2(2):160–170, 1991.
[8] Djamal Belazzougui and Qin Zhang. Edit distance: Sketching, streaming, and document exchange. In
Proceedings of the 57th IEEE Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE, 2016.
[9] Boris Bukh and Venkatesan Guruswami. An improved bound on the fraction of correctable deletions. In
Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1893–1901. ACM, 2016.
[10] Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. Low distortion embedding from edit to hamming distance using coupling. In
Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing. ACM, 2016.
[11] Kuan Cheng, Zhengzhong Jin, Xin Li, and Ke Wu. Deterministic document exchange protocols, and almost optimal binary codes for edit errors. In Proceedings of the 59th IEEE Annual Symposium on Foundations of Computer Science, pages 200–211. IEEE, 2018.
[12] Kuan Cheng, Zhengzhong Jin, Xin Li, and Ke Wu. Block edit errors with transpositions: Deterministic document exchange protocols and almost optimal binary codes. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
[13] Graham Cormode, Mike Paterson, Suleyman Cenk Sahinalp, and Uzi Vishkin. Communication complexity of document exchange. In
Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 197–206. ACM, 2000.
[14] Arnaldo Garcia and Henning Stichtenoth. On the asymptotic behaviour of some towers of function fields over finite fields.
Journal of Number Theory, 61(2):248–273, 1996.
[15] V. Guruswami and R. Li. Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 620–624, July 2016.
[16] V. Guruswami and C. Wang. Deletion codes in the high-noise and high-rate regimes.
IEEE Transactions on Information Theory, 63(4):1961–1970, April 2017.
[17] Venkatesan Guruswami, Christopher Umans, and Salil Vadhan. Unbalanced expanders and randomness extractors from Parvaresh-Vardy codes.
Journal of the ACM, 56(4), 2009.
[18] Bernhard Haeupler. An optimal document exchange protocol. In Proceedings of the 60th IEEE Annual Symposium on Foundations of Computer Science, 2019.
[19] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings: codes for insertions and deletions approaching the singleton bound. In
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 33–46. ACM, 2017.
[20] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings: Explicit constructions, local decoding, and applications. In
Proceedings of the 50th Annual ACM Symposium on Theory of Computing, 2018.
[21] Tom Høholdt, Jacobus H Van Lint, and Ruud Pellikaan. Algebraic geometry codes.
Handbook of Coding Theory, 1(Part 1):871–961, 1998.
[22] Utku Irmak, Svilen Mihaylov, and Torsten Suel. Improved single-round protocols for remote file synchronization. In
INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE, volume 3, pages 1665–1676. IEEE, 2005.
[23] Hossein Jowhari. Efficient communication protocols for deciding edit distance. In
ESA, 2012.
[24] Eduardo Sany Laber and Leonardo Gomes Holanda. A new protocol for asymmetric communication channels: Reaching the lower bounds.
Scientia Iranica, 8(4):297–302, 2001.
[25] Eduardo Sany Laber and Leonardo Gomes Holanda. Improved bounds for asymmetric communication protocols.
Information Processing Letters, 83(4):205–209, 2002.
[26] V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals.
Soviet Physics Doklady, 10:707, February 1966.
[27] A Orlitsky and K Viswanathan. Practical algorithms for interactive communication. In
IEEE Int. Symp. on Information Theory, 2001.
[28] Alon Orlitsky. Worst-case interactive communication I: Two messages are almost optimal.
IEEE Transactions on Information Theory, 36:1111–1126, 1990.
[29] Alon Orlitsky. Interactive communication: Balanced distributions, correlated files, and average-case complexity. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science, pages 228–238. IEEE, 1991.
[30] L. J. Schulman and D. Zuckerman. Asymptotically good codes correcting insertions, deletions, and transpositions.
IEEE Transactions on Information Theory, 45(7):2552–2557, Nov 1999.
[31] Kenneth W Shum, Ilia Aleshnikov, P Vijay Kumar, Henning Stichtenoth, and Vinay Deolalikar. A low-complexity algorithm for the construction of algebraic-geometric codes better than the Gilbert-Varshamov bound.
IEEE Transactions on Information Theory, 47(6):2225–2241, 2001.
[32] Michael Sipser and Daniel A Spielman. Expander codes. In
Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 566–576. IEEE, 1994.
[33] Michael Sipser and Daniel A Spielman. Expander codes.
IEEE Transactions on Information Theory, 42(6):1710–1722, 1996.
[34] Daniel A Spielman. Linear-time encodable and decodable error-correcting codes.
IEEE Transactions on Information Theory, 42(6):1723–1731, 1996.
[35] Torsten Suel, Patrick Noel, and Dimitre Trendafilov. Improved file synchronization techniques for maintaining large replicated collections over slow networks. In
Proceedings of the 20th International Conference on Data Engineering, pages 153–164. IEEE, 2004.
[36] John Watkinson, Micah Adler, and Faith E Fich. New protocols for asymmetric communication channels. In