Distributed compression through the lens of algorithmic information theory: a primer
Marius Zimand∗

September 23, 2018

∗ Department of Computer and Information Sciences, Towson University, Baltimore, MD. http://triton.towson.edu/~mzimand
Abstract
Distributed compression is the task of compressing correlated data by several parties, each one possessing one piece of data and acting separately. The classical Slepian-Wolf theorem [SW73] shows that if data is generated by independent draws from a joint distribution, that is, by a memoryless stochastic process, then distributed compression can achieve the same compression rates as centralized compression, when the parties act together. Recently, the author [Zim17] has obtained an analogous version of the Slepian-Wolf theorem in the framework of Algorithmic Information Theory (also known as Kolmogorov complexity). The advantage over the classical theorem is that the AIT version works for individual strings, without any assumption regarding the generative process. The only requirement is that the parties know the complexity profile of the input strings, which is a simple quantitative measure of the data correlation. The goal of this paper is to present, in an accessible form that omits some technical details, the main ideas from the reference [Zim17].
Zack has three good friends, Alice, Bob, and Charles, who share with him every piece of information they have. One day, Alice, Bob, and Charles, separately, observe three collinear points A, respectively B and C, in the 2-dimensional affine space over the field with 2^n elements. Thus, each one of Alice, Bob, and Charles possesses 2n bits of information, giving the two coordinates of their respective points. Due to the geometric relation, collectively they have 5n bits of information, because given two points the third one can be described with just one coordinate. They want to email the points to Zack without wasting bandwidth, that is, by sending approximately 5n bits, where "approximately" means that they can afford an overhead of O(log n) bits. Clearly, if they collaborate, they can send exactly 5n bits. The problem is that they have busy schedules, and cannot find a good time to get together, and thus they have to compress their points in isolation. How many bits do they need to send to Zack?

Let us first note some necessary requirements for the compression lengths. Let n_A be the number of bits to which Alice compresses her point A, and let n_B and n_C have the analogous meaning for Bob and Charles. It is necessary that

n_A + n_B + n_C ≥ 5n,

because Zack needs to acquire 5n bits. It is also necessary that

n_A + n_B ≥ 3n,  n_A + n_C ≥ 3n,  n_B + n_C ≥ 3n,

because if Zack somehow gets one of the three points, he still needs 3n bits of information from the other two points. And it is also necessary that

n_A ≥ n,  n_B ≥ n,  n_C ≥ n,

because even if Zack gets two of the points, he still needs n bits of information from the remaining point.

We will see that any numbers n_A, n_B and n_C satisfying the above necessary conditions are also sufficient, up to a small logarithmic overhead, in the sense that there are probabilistic compression algorithms such that if n_A, n_B and n_C satisfy these conditions, then Alice can compress point A to a binary string p_A of length n_A + O(log n), Bob can compress point B to a binary string p_B of length n_B + O(log n), Charles can compress point C to a binary string p_C of length n_C + O(log n), and Zack can with high probability reconstruct the three points from p_A, p_B and p_C. Moreover, the compression does not use the geometric relation between the points, but only the correlation of information in the points, as expressed in the very flexible framework of algorithmic information theory.

Algorithmic Information Theory (AIT), initiated independently by Solomonoff [Sol64], Kolmogorov [Kol65], and Chaitin [Cha66], is a counterpart to the Information Theory (IT) initiated by Shannon. In IT the central object is a random variable X whose realizations are strings over an alphabet Σ. The Shannon entropy of X is defined by

H(X) = Σ_{x ∈ Σ*} P(X = x) log(1 / P(X = x)).

The entropy H(X) is viewed as the amount of information in X, because each string x can be described with ⌈log(1 / P(X = x))⌉ bits (using the Shannon code), and therefore H(X) is the expected number of bits needed to describe the outcome of the random process modeled by X. AIT dispenses with the stochastic generative model, and defines the complexity of an individual string x as the length of its shortest description.
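To make the entropy definition concrete, here is a minimal Python illustration (the distribution is a made-up example, not from the text): it computes H(X) and the expected length of the Shannon code, which describes outcome x with ⌈log(1/P(X = x))⌉ bits.

```python
import math

def entropy(pmf):
    """Shannon entropy H(X) = sum_x P(X=x) * log2(1/P(X=x)), in bits."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

def shannon_code_length(pmf):
    """Expected length of the Shannon code, which describes outcome x with
    ceil(log2(1/P(X=x))) bits; it is always within 1 bit of H(X)."""
    return sum(p * math.ceil(math.log2(1 / p)) for p in pmf.values() if p > 0)

# A made-up distribution on four strings, with probabilities that are
# powers of 2, so the Shannon code meets the entropy exactly.
pmf = {"00": 0.5, "01": 0.25, "10": 0.125, "11": 0.125}
print(entropy(pmf), shannon_code_length(pmf))  # 1.75 1.75
```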
For example, the 32-bit string

x = 00000000000000000000000000000000

has low complexity because it can be succinctly described as "32 zeros." The string

x = 10110000010101110101010011011100

is a 32-bit string obtained using random atmospheric noise (according to random.org), and has high complexity because it does not have a short description.

Formally, given a Turing machine M, a string p is said to be a program (or a description) of a string x, if M on input p prints x. We denote the length of a binary string x by |x|. The Kolmogorov complexity of x relative to the Turing machine M is

C_M(x) = min{ |p| : p is a program for x relative to M }.

If U is a universal Turing machine, then for every other Turing machine M there exists a string m such that U(m, p) = M(p) for all p, and therefore for every string x,

C_U(x) ≤ C_M(x) + |m|.

Thus, if we ignore the additive constant |m|, the Kolmogorov complexity of x relative to U is minimal. We fix a universal Turing machine U, drop the subscript U in C_U(·), and denote the complexity of x by C(x). We list below a few basic facts about Kolmogorov complexity:

1. For every string x, C(x) ≤ |x| + O(1), because a string x is trivially described by itself. (Formally, there is a Turing machine M that, for every x, on input x prints x.)

2. Similarly to the complexity of x, we define the complexity of x conditioned by y as C(x | y) = min{ |p| : U on input p and y prints x }.
3. Using some standard pairing function ⟨·, ·⟩ that maps pairs of strings into single strings, we define C(x, y), the complexity of a pair of strings (and then we can extend to tuples with larger arity), by C(x, y) = C(⟨x, y⟩).

4. We use the convenient shorthand notation a ≤+ b to mean that a ≤ b + O(log(a + b)), where the constant hidden in the O(·) notation only depends on the universal machine U. Similarly, a ≥+ b means a ≥ b − O(log(a + b)), and a =+ b means (a ≤+ b and a ≥+ b).

5. The chain rule in information theory states that H(X, Y) = H(X) + H(Y | X). A similar rule holds true in algorithmic information theory: for all x and y, C(x, y) =+ C(x) + C(y | x).
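As a quick worked instance of the chain rule, consider the collinear points from the introduction. For a collinear triple with no structure beyond collinearity (so that each point is as complex as possible given the others), applying the chain rule twice recovers the 5n-bit count, up to O(log n) terms:

C(A, B, C) =+ C(A) + C(B | A) + C(C | A, B) =+ 2n + 2n + n = 5n,

since A and B are arbitrary points (2n bits each), while C lies on the line determined by A and B, so one coordinate (n bits) suffices to specify it.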
We present the problem confronting Alice, Bob, Charles (the senders) and Zack (the receiver) in an abstract and formal setting. We assume that each one of Alice, Bob, and Charles has n bits of information, which, in concrete terms, means that Alice has an n-bit binary string x_A, Bob has an n-bit binary string x_B, and Charles has an n-bit binary string x_C. We also assume that the 3-tuple (x_A, x_B, x_C) belongs to a set S ⊆ {0,1}^n × {0,1}^n × {0,1}^n, which defines the way in which the information is correlated (for example, S may be the set of all triples of collinear points), and that all parties (i.e., Alice, Bob, Charles, and Zack) know S. Alice is using an encoding function E_A : {0,1}^n → {0,1}^{n_A}, Bob is using an encoding function E_B : {0,1}^n → {0,1}^{n_B}, Charles is using an encoding function E_C : {0,1}^n → {0,1}^{n_C}, and Zack is using a decoding function D : {0,1}^{n_A} × {0,1}^{n_B} × {0,1}^{n_C} → ({0,1}^n)^3. Ideally, the requirement is that for all (x_A, x_B, x_C) in S, D(E_A(x_A), E_B(x_B), E_C(x_C)) = (x_A, x_B, x_C). However, since typically the encoding functions are probabilistic, we allow the above equality to fail with probability bounded by some small ε, where the probability is over the random bits used by the encoding functions. Also, sometimes, we will be content if the encoding/decoding procedures work, not for all, but only for "most" 3-tuples in S (i.e., with probability close to 1, under a given probability distribution on S).

Our focus in this paper is to present distributed compression in the framework of Algorithmic Information Theory, but let us first present the point of view of Information Theory, where the problem was studied early on. The celebrated classical theorem of Slepian and Wolf [SW73] characterizes the possible compression rates n_A, n_B and n_C for the case of memoryless sources. The memoryless assumption means that (x_A, x_B, x_C) are realizations of random variables (X_A, X_B, X_C), which consist of n independent copies of a random variable that has a joint distribution P(b_1, b_2, b_3) on triples of bits. In other words, the generative model for (x_A, x_B, x_C) is a stochastic process that consists of n independent draws from the joint distribution, such that Alice observes x_A, the sequence of first components in the n draws, Bob observes x_B, the second components, and Charles observes x_C, the third components. A stochastic process of this type is called a 3-DMS (Discrete Memoryless Source). By Shannon's Source Coding Theorem, if n′ is a number that is at least H(X_A, X_B, X_C) and if Alice, Bob, and Charles put their data together, then, for every ε > 0, there exists an encoding/decoding pair E and D, where E compresses 3n-bit strings to (n′ + εn)-bit strings and D(E(X_A, X_B, X_C)) = (X_A, X_B, X_C) with probability 1 − ε, provided n is large enough. The second part of Shannon's Source Coding Theorem shows that this is essentially optimal, because if the data is compressed to length smaller than H(X_A, X_B, X_C) − εn (for constant ε), then the probability of correct decoding goes to 0. The Slepian-Wolf Theorem shows that such a compression can also be done if Alice, Bob, and Charles compress separately. Actually, it describes precisely the possible compression lengths. Note that if the three senders compress separately to lengths n_A, n_B and n_C as indicated above, then it is essentially necessary that n_A + n_B + n_C ≥ H(X_A, X_B, X_C) − εn, that n_A + n_B ≥ H(X_A, X_B | X_C) − εn (because even if Zack has X_C, he still needs to receive a number of bits equal to the amount of entropy in X_A and X_B conditioned by X_C), that n_A ≥ H(X_A | X_B, X_C) − εn (similarly, even if Zack has X_B and X_C, he still needs to receive a number of bits equal to the amount of entropy in X_A conditioned by X_B and X_C), and there are the obvious other necessary conditions obtained by permuting A, B and C. The Slepian-Wolf Theorem shows that for a 3-DMS these necessary conditions are, essentially, also sufficient, in the sense that the slight change of −εn into +εn allows encoding/decoding procedures. Thus, in the next theorem we suppose that n_A + n_B + n_C ≥ H(X_A, X_B, X_C) + εn, and similarly for the other relations.

Theorem 3.1 (Slepian-Wolf Theorem [SW73]). Let (X_A, X_B, X_C) be a 3-DMS, let ε > 0, and let n_A, n_B, n_C satisfy the above conditions (with +εn instead of −εn). Then there exist encoding functions E_A : {0,1}^n → {0,1}^{n_A}, E_B : {0,1}^n → {0,1}^{n_B}, E_C : {0,1}^n → {0,1}^{n_C} and a decoding function D : {0,1}^{n_A} × {0,1}^{n_B} × {0,1}^{n_C} → ({0,1}^n)^3 such that Prob[D(E_A(X_A), E_B(X_B), E_C(X_C)) = (X_A, X_B, X_C)] ≥ 1 − O(ε), provided n is large enough.

There is nothing special about three senders, and indeed the Slepian-Wolf theorem holds for any number ℓ of senders, where ℓ is a constant, and for sources which are ℓ-DMS over any alphabet Σ. This means that the senders compress (x_1, . . . , x_ℓ), which is a realization of random variables (X_1, . . . , X_ℓ), obtained from n independent draws from a joint distribution p(a_1, . . . , a_ℓ), with each a_i ranging over the alphabet Σ. The i-th sender observes the realization x_i of X_i, and uses an encoding function E_i : Σ^n → Σ^{n_i}. Suppose that the compression lengths n_i, i = 1, . . . , ℓ, satisfy

Σ_{i ∈ V} n_i ≥ H(X_V | X_V̄) + εn, for every subset V ⊆ {1, . . . , ℓ}

(where, if V = {i_1, . . . , i_t}, X_V denotes the tuple (X_{i_1}, . . . , X_{i_t}), and V̄ denotes the complement {1, . . . , ℓ} − V). Then the Slepian-Wolf theorem for ℓ-DMS states that there are E_1, . . . , E_ℓ of the above type, and D : Σ^{n_1} × . . . × Σ^{n_ℓ} → (Σ^n)^ℓ, such that D(E_1(X_1), . . . , E_ℓ(X_ℓ)) = (X_1, . . . , X_ℓ) with probability 1 − ε.
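The per-draw quantities H(X_V | X_V̄) appearing in these conditions are straightforward to compute from the joint distribution. Below is a minimal Python sketch; the XOR-correlated toy distribution is our own choice (loosely mimicking a linear constraint such as collinearity), and multiplying the printed values by n gives the entropy terms above.

```python
import math
from itertools import combinations

def H(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, coords):
    """Marginal of the joint pmf on the coordinate positions in coords."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def sw_bounds(joint, m=3):
    """H(X_V | X_Vbar) for every nonempty V, computed via the chain rule
    as H(X_1, ..., X_m) - H(X_Vbar)."""
    total, bounds = H(joint), {}
    for size in range(1, m + 1):
        for V in combinations(range(m), size):
            Vbar = tuple(i for i in range(m) if i not in V)
            bounds[V] = total - (H(marginal(joint, Vbar)) if Vbar else 0.0)
    return bounds

# Toy 3-DMS per-draw distribution on bit triples with b3 = b1 XOR b2,
# so each bit is determined by the other two.
joint = {(b1, b2, b1 ^ b2): 0.25 for b1 in (0, 1) for b2 in (0, 1)}
print(sw_bounds(joint))
# Singletons give 0.0 (each bit is a function of the other two),
# pairs give 1.0, and the full set gives 2.0 bits per draw.
```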
As pointed out above, the Slepian-Wolf theorem shows the surprising and remarkable fact that, for memoryless sources, distributed compression can be done at an optimality level that is on a par with centralized compression. On the weak side, the memoryless property means that there is a lot of independence in the generative process: the realization at time i is independent of the realization at time i − 1. Intuitively, independence helps distributed compression. For example, in the limit case in which the senders observe realizations of fully independent random variables, then, clearly, it makes no difference whether compression is distributed or centralized. The Slepian-Wolf theorem has been extended to sources that are stationary and ergodic [Cov75], but these sources are still quite simple, and intuitively realizations which are temporally sufficiently apart are close to being independent.

One may be inclined to believe that the optimal compression rates of distributed compression in the theorem are caused by the independence properties of the sources. However, this is not so, and we shall see that in fact the Slepian-Wolf phenomenon does not require any type of independence. Even more, it works without any generative model. For that we need to work in the framework of Kolmogorov complexity (AIT).

Let us recall the example from Section 1: Alice, Bob, and Charles observe separately, respectively, the collinear points A, B, C. Even without assuming any generative process for the three points, we can still express their correlation using Kolmogorov complexity. More precisely, their correlation is described by the Kolmogorov complexity profile, which consists of 7 numbers, giving the complexities of all non-empty subsets of {A, B, C}:

(C(A), C(B), C(C), C(A, B), C(A, C), C(B, C), C(A, B, C)).
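For the collinear points of the introduction, assuming the triple has no structure beyond collinearity (so that each point is maximally complex given the other two), the profile is, up to O(log n) terms,

(C(A), C(B), C(C), C(A, B), C(A, C), C(B, C), C(A, B, C)) =+ (2n, 2n, 2n, 4n, 4n, 4n, 5n).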
Let us consider the general case, in which the three senders have, respectively, n-bit strings x_A, x_B, x_C having a given complexity profile (C(x_V) | V ⊆ {A, B, C}, V ≠ ∅) (where x_V is the notation convention that we used for the ℓ-senders case of the Slepian-Wolf theorem). What are the possible compression lengths, so that Zack can decompress and obtain (x_A, x_B, x_C) with probability 1 − ε?

To answer this question, for simplicity, let us first consider the case of a single sender, Alice. She wants to use a probabilistic encoding function E such that there exists a decoding function D with the property that for all n, and for all n-bit strings x, D(E(x)) = x, with probability 1 − ε. A lower bound on the length |E(x)| is given in the following lemma.
Lemma 3.2. Let E be a probabilistic encoding function, and D be a decoding function, such that for all strings x, D(E(x)) = x, with probability 1 − ε. Then for every k, there is a string x with C(x) ≤ k, such that |E(x)| ≥ k + log(1 − ε) − O(1).

Proof. Fix k and let S = {x | C(x) ≤ k}. It can be shown that for some constant c, |S| ≥ 2^{k−c}, where |S| is the size of S (the idea is that the first string which is not in S can be described with log|S| + O(1) bits, and, being outside S, it has complexity greater than k). Since for every x ∈ S, D(E(x, ρ)) = x with probability 1 − ε over the randomness ρ, there is some fixed randomness ρ such that D(E(x, ρ)) = x for a fraction 1 − ε of the x's in S. Let S′ ⊆ S be the set of such strings x. Thus, |S′| ≥ (1 − ε)|S| ≥ (1 − ε)2^{k−c}, and the function E(·, ρ) is one-to-one on S′ (otherwise decoding would not be possible). Therefore the function E(·, ρ) cannot map all of S′ into strings of length less than k + log(1 − ε) − (c + 1).

In short, if for every x, D(E(x)) = x with probability 1 − ε, then for infinitely many x it must be the case that |E(x)| ≥ C(x) + log(1 − ε) − O(1). In other words, if we ignore the small terms, it is not possible to compress to length less than C(x).

In the same way, similar lower bounds can be established for the case of more senders. For example, let us consider three senders that use the probabilistic encoding functions E_A, E_B and E_C: if there is a decoding function D such that for every (x_A, x_B, x_C),

D(E_A(x_A), E_B(x_B), E_C(x_C)) = (x_A, x_B, x_C), with probability 1 − ε

(where the probability is over the randomness used by the encoding procedures), then for infinitely many (x_A, x_B, x_C),

|E_A(x_A)| + |E_B(x_B)| + |E_C(x_C)| ≥ C(x_A, x_B, x_C) + log(1 − ε) − O(1),
|E_A(x_A)| + |E_B(x_B)| ≥ C(x_A, x_B | x_C) + log(1 − ε) − O(1),
|E_A(x_A)| ≥ C(x_A | x_B, x_C) + log(1 − ε) − O(1),

and similar relations hold for any permutation of A, B and C. As we did above, it is convenient to use the notation convention that if V is a subset of {A, B, C}, we let x_V denote the tuple of strings with indices in V (for example, if V = {A, C}, then x_V = (x_A, x_C)). Then the above relations can be written concisely as

Σ_{i ∈ V} |E_i(x_i)| ≥ C(x_V | x_{{A,B,C}−V}) + log(1 − ε) − O(1), for all V ⊆ {A, B, C}.

The next theorem is the focal point of this paper. It shows that the above necessary conditions regarding the compression lengths are, essentially, also sufficient.
Theorem 3.3 (Kolmogorov complexity version of Slepian-Wolf coding [Zim17]). There exist probabilistic algorithms E_A, E_B, E_C, a deterministic algorithm D, and a function α(n) = O(log n), such that for every n, for every tuple of integers (n_A, n_B, n_C), and for every tuple of n-bit strings (x_A, x_B, x_C), if

Σ_{i ∈ V} n_i ≥ C(x_V | x_{{A,B,C}−V}), for all V ⊆ {A, B, C},     (1)

then

(a) E_A on input x_A and n_A outputs a string p_A of length at most n_A + α(n), E_B on input x_B and n_B outputs a string p_B of length at most n_B + α(n), E_C on input x_C and n_C outputs a string p_C of length at most n_C + α(n);

(b) D on input (p_A, p_B, p_C) outputs (x_A, x_B, x_C), with probability 1 − O(1/n).

We present the proof of this theorem in the next section, but for now, we make several remarks:

• Compression procedures for individual inputs (i.e., without using any knowledge regarding the generative process) have been previously designed using the celebrated Lempel-Ziv methods [LZ76, Ziv78]. Such methods have been used for distributed compression as well [Ziv84, DW85, Kuz09]. For such procedures two kinds of optimality have been established, both valid for infinite sequences and thus having an asymptotic nature. First, the procedures achieve a compression length that is asymptotically equal to the so-called finite-state complexity, which is the minimum length that can be achieved by finite-state encoding/decoding procedures. Secondly, the compression rates are asymptotically optimal in case the infinite sequences are generated by sources that are stationary and ergodic [WZ94]. In contrast, the compression in Theorem 3.3 applies to finite strings and achieves a compression length close to minimal description length. On the other hand, the Lempel-Ziv approach has led to efficient compression algorithms that are used in practice.

• At the cost of increasing the "overhead" α(n) from O(log n) to O(log^3 n), we can obtain compression procedures E_A, E_B and E_C that run in polynomial time. On the other hand, the decompression procedure D is slower than any computable function. This is unavoidable at this level of optimality (compression at close to minimum description length) because of the existence of deep strings. (Informally, a string x is deep if it has a description p of small length, but the universal machine takes a long time to produce x from p.)

• The theorem is true for any number ℓ of senders, where ℓ is an arbitrary constant. We have singled out ℓ = 3 because this case allows us to present the main ideas of the proof in a relatively simple form.

• Romashchenko [Rom05] (building on an earlier result of Muchnik [Muc02]) has obtained a Kolmogorov complexity version of Slepian-Wolf, in which the encoding and the decoding functions use O(log n) bits of extra information, called help bits. The above theorem eliminates the help bits, and is, therefore, fully effective. The cost is that the encoding procedure is probabilistic and thus there is a small error probability. The proof of Theorem 3.3 is inspired from Romashchenko's approach, but the technical machinery is quite different.

• The classical Slepian-Wolf theorem can be obtained from the Kolmogorov complexity version, because if X is memoryless, then with probability 1 − ε, H(X) − c_ε √n ≤ C(X) ≤ H(X) + c_ε √n, where c_ε is a constant that only depends on ε.
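To make conditions (1) concrete, here is a small Python checker (a sketch of our own, not from [Zim17]): it derives each conditional complexity from the profile via the chain rule C(x_V | x_V̄) =+ C(x_{A,B,C}) − C(x_V̄), ignoring the O(log n) slack as the text does, and verifies all the inequalities. The example profile is the collinear-points one from the introduction, in units of n bits.

```python
from itertools import combinations

def admissible(profile, lengths, senders="ABC"):
    """Check conditions (1): for every nonempty V, the sum of n_i over i in V
    must be >= C(x_V | x_Vbar), computed from the complexity profile as
    C(all) - C(Vbar) (chain rule, O(log n) terms ignored)."""
    everyone = frozenset(senders)
    for size in range(1, len(senders) + 1):
        for V in combinations(senders, size):
            V = frozenset(V)
            Vbar = everyone - V
            conditional = profile[everyone] - (profile[Vbar] if Vbar else 0)
            if sum(lengths[i] for i in V) < conditional:
                return False
    return True

# Collinear points, in units of n bits: C(one point) = 2, C(two points) = 4,
# C(all three) = 5 (see the introduction).
profile = {frozenset(S): {1: 2, 2: 4, 3: 5}[len(S)]
           for size in (1, 2, 3) for S in combinations("ABC", size)}
print(admissible(profile, {"A": 2, "B": 2, "C": 1}))    # True  (2+2+1 = 5)
print(admissible(profile, {"A": 2, "B": 2, "C": 0.5}))  # False (below 5n total)
```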
The central piece in the proof is a certain type of bipartite graph with a low congestion property. We recall that in a bipartite graph, the nodes are partitioned in two sets, L (the left nodes) and R (the right nodes), and all edges connect a left node to a right node. We allow multiple edges between two nodes. In the graphs that we use, all left nodes have the same degree, called the left degree. Specifically, we use bipartite graphs G with L = {0,1}^n, R = {0,1}^m, and with left degree D = 2^d. We label the edges outgoing from x ∈ L with strings y ∈ {0,1}^d. We typically work with a family of graphs indexed on n, and such a family of graphs is computable if there is an algorithm that on input (x, y), where x ∈ L and y ∈ {0,1}^d, outputs the y-th neighbor of x. Some of the graphs also depend on a rational 0 < δ < 1.
A computable family of graphs is explicit if the above algorithm runs in time poly(n, 1/δ).

We now introduce informally the notions of a rich owner and of a graph with the rich owner property. Let B ⊆ L. The B-degree of a right node is the number of its neighbors that are in B. Roughly speaking, a left node is a rich owner with respect to B if most of its right neighbors are "well-behaved," in the sense that their B-degree is not much larger than |B| · D / |R|, the average right degree when the left side is restricted to B. One particularly interesting case, which is used many times in the proof, is when most of the neighbors of a left node x have B-degree 1, i.e., when x "owns" most of its right neighbors. A graph has the rich owner property if, for all B ⊆ L, most of the left nodes in B are rich owners with respect to B. In the formal definition below, we replace the average right degree with a value which may look arbitrary, but since in applications this value is approximately equal to the average right degree, the above intuition should be helpful. The precise definition of rich ownership depends on two parameters k and δ.

Definition 4.1. Let G be a bipartite graph as above and let B be a subset of L. We say that x ∈ B is a (k, δ)-rich owner with respect to B if the following holds:

• small regime case: If |B| ≤ 2^k, then at least a 1 − δ fraction of x's neighbors have B-degree equal to 1, that is, they are not shared with any other nodes in B. We also say that x ∈ B owns y with respect to B if y is a neighbor of x and the B-degree of y is 1.

• large regime case: If |B| > 2^k, then at least a 1 − δ fraction of x's neighbors have B-degree at most (2/δ) |B| · D / 2^k.

If x is not a (k, δ)-rich owner with respect to B, then it is said to be a (k, δ)-poor owner with respect to B.
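Definition 4.1 transcribes directly into code. The following toy Python checker (our own illustration, with made-up node names) computes B-degrees and classifies the members of B:

```python
from collections import Counter

def classify_owners(neighbors, B, k, delta):
    """Classify each x in B as a (k, delta)-rich or poor owner w.r.t. B,
    following Definition 4.1. `neighbors` maps a left node to the list of
    its right neighbors (a multiset, since multi-edges are allowed); every
    left node has the same degree D."""
    B = set(B)
    b_degree = Counter(z for x in B for z in neighbors[x])  # B-degrees
    D = len(next(iter(neighbors.values())))
    if len(B) <= 2 ** k:                    # small regime: owned = unshared
        threshold = 1
    else:                                   # large regime
        threshold = (2 / delta) * len(B) * D / 2 ** k
    return {x: sum(1 for z in neighbors[x] if b_degree[z] <= threshold)
               >= (1 - delta) * D
            for x in B}

# Toy graph with D = 4. Here |B| = 3 <= 2**k, so we are in the small regime.
neighbors = {"x1": ["a", "b", "c", "d"],
             "x2": ["a", "e", "f", "g"],
             "x3": ["h", "i", "j", "k"]}
print(classify_owners(neighbors, B={"x1", "x2", "x3"}, k=2, delta=0.2))
# x3 owns all 4 of its neighbors (rich); x1 and x2 share 'a', so each
# owns only 3 of 4, which is below (1 - delta) * 4 = 3.2 (poor).
```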
Definition 4.2. A bipartite graph G = (L = {0,1}^n, R = {0,1}^m, E ⊆ L × R) has the (k, δ)-rich owner property if for every set B ⊆ L, all nodes in B, except at most δ|B| of them, are (k, δ)-rich owners with respect to B.

The following theorem provides the type of graph that we use.
Theorem 4.3.
For all natural numbers n and k and for every rational number δ ∈ (0, 1), there exists a computable bipartite graph G = (L, R, E ⊆ L × R) that has the (k, δ)-rich owner property with the following parameters: L = {0,1}^n, R = {0,1}^{k + γ(n/δ)}, left degree D = 2^{γ(n/δ)}, where γ(n) = O(log n). There also exists an explicit bipartite graph with the same parameters, except that the overhead is γ(n) = O(log^3 n).

The graphs in Theorem 4.3 are derived from randomness extractors. The computable graph is obtained with the probabilistic method, and we sketch the construction in Section 5. The explicit graph relies on the extractor from [RRV99] and uses a combination of techniques from [RR99], [CRVW02], and [BZ14].

Figure 1: Graph with the (k, δ)-rich owner property. If |B| ≤ 2^k (small regime), a left node x is a rich owner with respect to B if it owns a 1 − δ fraction of its neighbors; if |B| > 2^k (large regime), if a 1 − δ fraction of its neighbors have B-degree close to the average right B-degree. For every B ⊆ L, a 1 − δ fraction of B are rich owners. In the figure, the grey neighbors are owned by x, and the white neighbor is not owned.

Let us proceed now to the proof sketch of Theorem 3.3. We warn the reader that, for the sake of readability, we skip several technical elements. In particular, we ignore the loss of precision in =+, ≤+, ≥+, and we treat these relations as if they were =, ≤, ≥.

E_A, E_B and E_C have as inputs, respectively, the pairs (x_A, n_A), (x_B, n_B), (x_C, n_C), where x_A, x_B, x_C are n-bit strings, and n_A, n_B, n_C are natural numbers. The three encoding procedures use, respectively, the graphs G_A, G_B and G_C, which have, respectively, the (n_A + 1, 1/n), (n_B + 1, 1/n), (n_C + 1, 1/n) rich owner property. Viewing the strings x_A, x_B, x_C as left nodes in the respective graphs, the encoding procedures pick p_A, p_B, p_C as random neighbors of x_A, x_B, x_C (see Figure 2).

Figure 2: The encoding/decoding procedures. The senders use graphs G_A, G_B, G_C with the rich owner property, and then encode x_A, x_B, x_C by random neighbors p_A, p_B, p_C. The receiver uses sets B_A, B_B, B_C in the small regime in the respective graphs, for which x_A, x_B, x_C are rich owners, and reconstructs x_A, x_B, x_C as the unique neighbors of p_A, p_B, p_C in B_A, B_B, B_C.

We need to show that if n_A, n_B, n_C satisfy the inequalities (1), then it is possible to reconstruct (x_A, x_B, x_C) from (p_A, p_B, p_C) with high probability (over the random choice of (p_A, p_B, p_C)). The general idea is to identify computably enumerable subsets B_A, B_B, B_C of left nodes in the three graphs, which are in the "small regime," and which contain respectively x_A, x_B, x_C as rich owners. Then p_A has x_A as its single neighbor in B_A, and therefore x_A can be obtained from p_A by enumerating the elements of B_A till we find one that has p_A as a neighbor (x_B and x_C are obtained similarly).

We shall assume first that the decoding procedure D knows the 7-tuple (C(x_V) | V ⊆ {A, B, C}, V ≠ ∅), i.e., the complexity profile of (x_A, x_B, x_C).

The proof has an inductive character, so let us begin by analyzing the case when there is a single sender, then the case when there are two senders, and finally the case when there are three senders.
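Before the case analysis, here is a minimal Python sketch of the encode/decode pattern just described, for one sender, under loudly stated assumptions: a keyed hash stands in for the neighbor function of the rich-owner graph of Theorem 4.3 (a random hash carries no such guarantee, so the toy compensates by making the right side much larger than the candidate set), and the candidate set B is given explicitly instead of being enumerated from a complexity bound. All names and parameters are illustrative; the toy does not actually compress, it only illustrates the reconstruction pattern.

```python
import hashlib
import random

def neighbor(x: str, y: int, m: int) -> int:
    """The y-th right neighbor of left node x: an m-bit value obtained by
    hashing (x, y). A stand-in for the graph's neighbor function."""
    h = hashlib.blake2b(f"{x}|{y}".encode(), digest_size=8).digest()
    return int.from_bytes(h, "big") % (2 ** m)

def encode(x: str, m: int, d: int) -> int:
    """Compress x: pick a random edge label y, send the resulting neighbor."""
    y = random.randrange(2 ** d)
    return neighbor(x, y, m)

def decode(p: int, candidates, m: int, d: int):
    """Enumerate the candidate set B and return the unique x adjacent to p.
    This succeeds exactly when x owns p, i.e., no other candidate shares
    that right node."""
    hits = [x for x in candidates
            if any(neighbor(x, y, m) == p for y in range(2 ** d))]
    return hits[0] if len(hits) == 1 else None

B = [format(i, "016b") for i in range(200)]   # a small, enumerable B
x = B[123]
p = encode(x, m=26, d=8)
print(decode(p, B, m=26, d=8) == x)           # True, except with small prob.
```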
We show how to reconstruct x_A from p_A, assuming C(x_A) ≤ n_A. Let

B_A = {x ∈ {0,1}^n | C(x) ≤ C(x_A)}.

Since the size of B_A is bounded by 2^{C(x_A)+1} ≤ 2^{n_A + 1}, it follows that B_A is in the small regime in G_A. The number of poor owners with respect to B_A in G_A is at most (1/n) · 2^{C(x_A)+1} ≈ 2^{C(x_A) − log n}, and it can be shown that any poor owner can be described by C(x_A) − Ω(log n) bits (essentially by its rank in some fixed standard enumeration of the set of poor owners). Therefore the complexity of a poor owner is strictly less than C(x_A), and thus x_A is a rich owner with respect to B_A, as needed to enable its reconstruction from p_A.

We show how to reconstruct x_A, x_B from p_A, p_B, assuming C(x_A | x_B) ≤ n_A, C(x_B | x_A) ≤ n_B, C(x_A, x_B) ≤ n_A + n_B. If n_A ≥ C(x_A), then x_A can be reconstructed from p_A as in the one-sender case. Next, since n_B ≥ C(x_B | x_A), x_B can be reconstructed from x_A and p_B, similarly to the one-sender case. So let us assume that C(x_A) > n_A. Let

B_B = {x ∈ {0,1}^n | C(x | p_A) ≤ C(x_B | p_A)}.

We show below that (1) B_B is in the small regime in G_B, and (2) it can be effectively enumerated. Since x_B ∈ B_B, and since it is a rich owner with respect to B_B (by an argument similar to the one used for x_A in the one-sender case), this implies that x_B can be reconstructed from p_A, p_B, and next x_A can be reconstructed from x_B and p_A, as in the one-sender case (because n_A ≥ C(x_A | x_B)).

It remains to prove the assertions (1) and (2) claimed above. We establish the following fact.

Fact 1. (a) C(p_A) =+ n_A.
(b) C(p_A, x_B) =+ C(x_A, x_B).
(c) C(x_B | p_A) =+ C(x_A, x_B) − n_A.
(d) C(x_B | p_A) ≤+ n_B.

Proof. (a) By the same argument used above, x_A is still a rich owner with respect to B_A, but B_A is now in the large regime. This implies that, with probability 1 − 1/n, p_A has, for some constant c, at most 2^{C(x_A) − n_A + c log n} neighbors in B_A, one of them being x_A. The string x_A can be constructed from p_A and its rank among p_A's neighbors in B_A. This implies C(x_A) ≤+ C(p_A) + (C(x_A) − n_A), and thus C(p_A) ≥+ n_A. Since |p_A| ≤+ n_A, it follows that C(p_A) ≤+ n_A, and therefore C(p_A) =+ n_A.

(b) The "≤+" inequality holds because p_A can be obtained from x_A and the O(log n) bits which describe the edge (x_A, p_A) in G_A. For the "≥+" inequality, let

B′ = {x ∈ {0,1}^n | C(x | x_B) ≤ C(x_A | x_B)}.

B′ is in the small regime in G_A (because |B′| ≤ 2^{C(x_A | x_B)+1} ≤ 2^{n_A + 1}), and x_A is a rich owner with respect to B′ (because, as we have argued above, poor owners, being few, have complexity conditioned by x_B less than C(x_A | x_B)). So x_A can be constructed from x_B (which is needed for the enumeration of B′) and p_A, and therefore C(x_A, x_B) ≤+ C(p_A, x_B).

(c) C(x_B | p_A) =+ C(p_A, x_B) − C(p_A) =+ C(x_A, x_B) − n_A, by using the chain rule, and (a) and (b).

(d) C(x_B | p_A) =+ C(x_A, x_B) − n_A ≤+ (n_A + n_B) − n_A = n_B, by using (c) and the hypothesis.

Now the assertions (1) and (2) follow, because by Fact 1(d), B_B is in the small regime, and by Fact 1(c), the decoding procedure can enumerate B_B since it knows C(x_A, x_B) and n_A.
Finally, we move to the case of three senders. This is the case stated in Theorem 3.3. We show how to reconstruct x_A, x_B, x_C from p_A, p_B, p_C, if n_A, n_B, n_C satisfy the inequalities (1). We can actually assume that C(x_A) > n_A, C(x_B) > n_B, C(x_C) > n_C, because otherwise, if for example C(x_A) ≤ n_A, then x_A can be reconstructed from p_A as in the one-sender case, and we have reduced the problem to the case of two senders. As in the two-sender case, it can be shown that C(p_A) =+ n_A, C(p_B) =+ n_B, C(p_C) =+ n_C.

There are two cases to analyze.

Case 1. C(x_B | p_A) ≤ n_B or C(x_C | p_A) ≤ n_C. Suppose the first relation holds. Then x_B can be reconstructed from p_A, p_B, by taking the small regime set for G_B,

B′ = {x ∈ {0,1}^n | C(x | p_A) ≤ n_B},

for which x_B is a rich owner, and, therefore, with high probability owns p_B. In this way, we reduce the problem to the two-sender case.

Case 2. C(x_B | p_A) > n_B and C(x_C | p_A) > n_C. We show the following fact.

Fact 2. (a) C(x_B | x_C, p_A) ≤+ n_B and C(x_C | x_B, p_A) ≤+ n_C.
(b) C(x_B, x_C | p_A) ≤+ n_B + n_C.

Proof. (a) First note that C(x_C, p_A) =+ C(p_A) + C(x_C | p_A) ≥+ n_A + n_C. Then

C(x_B | x_C, p_A) =+ C(x_B, x_C, p_A) − C(x_C, p_A) ≤+ C(x_A, x_B, x_C) − C(x_C, p_A) ≤+ (n_A + n_B + n_C) − (n_A + n_C) = n_B.

The other relation is shown in the obvious similar way.

(b) C(x_B, x_C | p_A) =+ C(x_B, x_C, p_A) − C(p_A) ≤+ C(x_A, x_B, x_C) − C(p_A) ≤+ (n_A + n_B + n_C) − n_A = n_B + n_C.

Fact 2 shows that, given p_A, the complexity profile of x_B, x_C satisfies the requirements for the two-sender case, and therefore these two strings can be reconstructed from p_A, p_B, p_C. Next, since n_A ≥ C(x_A | x_B, x_C), x_A can be reconstructed from p_A, x_B, x_C, as in the one-sender case.

Thus, both in Case 1 (which actually consists of two subcases, Case 1.1 and Case 1.2) and in Case 2, x_A, x_B, x_C can be reconstructed. There is still a problem: the decoding procedure needs to know which of the cases actually holds true. In the reference [Zim17], it is shown how to determine which case holds true, but here we present a solution that avoids this. The decoding procedure launches parallel subroutines according to all three possible cases. The subroutine that works in the scenario that is correct produces (x_A, x_B, x_C). The subroutines that work with incorrect scenarios may produce other strings, or may not even halt. How can this troublesome situation be solved? Answer: by hashing. The senders, in addition to p_A, p_B, p_C, use a hash function h, and send h(x_A), h(x_B), h(x_C). The decoding procedure, whenever one of the parallel subroutines outputs a 3-tuple, checks if the hash values of the tuple match (h(x_A), h(x_B), h(x_C)), and stops and prints that output the first time there is a match. In this way, the decoding procedure will produce (x_A, x_B, x_C) with high probability.
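The "run all scenarios in parallel and keep the hash-consistent output" pattern can be sketched as follows; this is a toy of our own, in which subroutines are modeled as Python generators that yield None on intermediate steps and may never produce an output.

```python
def dovetail(subroutines, target_hashes, h):
    """Run the candidate decoding subroutines round-robin, one step each,
    and return the first output tuple whose hashes match the transmitted
    hash values. Subroutines that never output (or never halt) are harmless."""
    active = list(subroutines)
    while active:
        for sub in list(active):
            try:
                out = next(sub)
            except StopIteration:
                active.remove(sub)
                continue
            if out is not None and tuple(h(s) for s in out) == target_hashes:
                return out
    return None

def wrong():          # models a subroutine that runs forever without output
    while True:
        yield None

def right():          # models the subroutine for the correct scenario
    yield None        # ... some computation steps ...
    yield ("xA", "xB", "xC")

print(dovetail([wrong(), right()], ("xA", "xB", "xC"), h=lambda s: s))
```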
Hashing. For completeness, we present one way of doing the hashing. By the Chinese Remainder Theorem, if u_1 and u_2 are distinct n-bit numbers (in binary notation), then u_1 mod p = u_2 mod p for at most n prime numbers p. Suppose there are s numbers u_1, . . . , u_s, having length n in binary notation, and we want to distinguish the hash value of u from the hash values of u_1, . . . , u_s with probability 1 − ε. Let t = (1/ε)sn and consider the first t prime numbers p_1, . . . , p_t. Pick i randomly in {1, . . . , t} and define h(u) = (p_i, u mod p_i). This isolates u from u_1, . . . , u_s, in the sense that, with probability 1 − ε, h(u) is different from each of h(u_1), . . . , h(u_s). Note that the length of h(u) is O(log n + log s + log(1/ε)). In our application above, s = 3, corresponding to the three parallel subroutines, and ε can be taken to be 1/n, and thus the overhead introduced by hashing is only O(log n) bits.
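A minimal Python rendering of this hash (our own sketch; the trial-division prime generator is adequate for the small parameters involved):

```python
import random

def first_primes(t):
    """The first t primes, by trial division (fine for small t)."""
    primes, candidate = [], 2
    while len(primes) < t:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def crt_hash(u: int, n: int, s: int, eps: float):
    """h(u) = (p_i, u mod p_i) for a random one of the first t = sn/eps
    primes. For any fixed u_1, ..., u_s different from u, with probability
    >= 1 - eps this value differs from every h(u_j), because u and u_j can
    agree modulo at most n of the primes (Chinese Remainder Theorem)."""
    t = int(s * n / eps)
    p = random.choice(first_primes(t))
    return p, u % p

u = 0b1011000001010111                   # a 16-bit number
others = [u ^ 1, u ^ (1 << 15)]          # two numbers that differ from u
p, r = crt_hash(u, n=16, s=len(others), eps=0.1)
print(all(v % p != r for v in others))   # True with probability >= 0.9
```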
Removing the assumption that the decoding procedure knows the inputs' complexity profile. So far, we have assumed that the decoding procedure knows the complexity profile of x_A, x_B, x_C. This assumption is lifted using a hash function h, akin to what we did above to handle the various cases. The complexity profile (C(x_V) | V ⊆ {A, B, C}, V ≠ ∅) is a 7-tuple, with all components bounded by O(n). The decoding procedure launches O(n^7) subroutines performing the decoding operation with known complexity profile, one for each possible value of the complexity profile. The subroutine using the correct value of the complexity profile will output (x_A, x_B, x_C) with high probability, while the other ones may produce different 3-tuples, or may not even halt. Using the hash values h(x_A), h(x_B), h(x_C) (transmitted by the senders together with p_A, p_B, p_C), the decoder can identify the correct subroutine in the same way as presented above. The overhead introduced by hashing is O(log n).

5 Constructing graphs with the rich owner property
We sketch the construction needed for the computable graph in Theorem 4.3. Recall that we use bipartite graphs of the form G = (L = {0,1}^n, R = {0,1}^m, E ⊆ L × R), in which every left node has degree D = 2^d, and for every x ∈ L, the edges outgoing from x are labeled with strings from {0,1}^d. The construction relies on randomness extractors, which have been studied extensively in computational complexity and the theory of pseudorandom objects. A graph of the above type is said to be a (k, ε) extractor if for every B ⊆ L of size |B| ≥ 2^k and for every A ⊆ R,

| |E(B, A)| / (|B| · D) − |A| / |R| | ≤ ε,     (2)

where |E(B, A)| is the number of edges between vertices in B and vertices in A. In a (k, ε) extractor, any subset B of left nodes of size at least 2^k "hits" any set A of right nodes like a random function: the fraction of edges that leave B and land in A is close to the density of A among the set of right nodes. It is not hard to show that this implies the rich owner property in the large regime. To handle the small regime, we need graphs that maintain the extractor property when we consider prefixes of right nodes. Given a bipartite graph G as above and m′ ≤ m, the m′-prefix graph G′ is obtained from G by merging the right nodes that have the same prefix of length m′. More formally, G′ = (L = {0,1}^n, R′ = {0,1}^{m′}, E′ ⊆ L × R′) and (x, z′) ∈ E′ if and only if (x, z) ∈ E for some extension z of z′. Recall that we allow multiple edges between two nodes, and therefore the merging operation does not decrease the degree of left nodes.
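The extractor condition (2) can be tested empirically on small graphs. The following Python sketch (our own; it samples random sets B and A rather than checking all of them, so it gives evidence, not a proof) compares the edge fraction with the density of A:

```python
import random

def looks_like_extractor(neighbors, K, eps, trials=100):
    """Sample sets B (|B| = K) and A, and test condition (2):
    | |E(B,A)| / (|B| * D) - |A| / |R| | <= eps. `neighbors` maps each left
    node to its list of D right neighbors; multi-edges are allowed."""
    L = list(neighbors)
    R = sorted({z for nbrs in neighbors.values() for z in nbrs})
    for _ in range(trials):
        B = random.sample(L, K)
        A = {z for z in R if random.random() < 0.5}
        edges = [z for x in B for z in neighbors[x]]
        if abs(sum(z in A for z in edges) / len(edges) - len(A) / len(R)) > eps:
            return False
    return True

# A random graph with |L| = 2**10, D = 64 and |R| = 64: by the probabilistic
# argument of Lemma 5.1 below, it is very likely a decent extractor for
# left sets of size >= 64.
neighbors = {x: [random.randrange(64) for _ in range(64)]
             for x in range(2 ** 10)}
print(looks_like_extractor(neighbors, K=64, eps=0.15))
```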
Lemma 5.1. For every k ≤ n and every ε > 0, there exists a constant c and a computable graph G = (L = {0,1}^n, R = {0,1}^k, E ⊆ L × R) with left degree D = cn/ε² such that for every k′ ≤ k, the k′-prefix graph G′ = (L = {0,1}^n, R′ = {0,1}^{k′}, E′ ⊆ L × R′) is a (k′, ε) extractor.

Proof. We show the existence of a graph with the claimed properties using the probabilistic method. Once we know that the graph exists, it can be constructed by exhaustive search. For some constant c that will be fixed later, we consider a random function f : {0,1}^n × {0,1}^d → {0,1}^k. This defines the bipartite graph G = (L = {0,1}^n, R = {0,1}^k, E ⊆ L × R) in the following way: (x, z) is an edge labeled by y if f(x, y) = z. For the analysis, let us fix k′ ∈ {1, . . . , k} and let us consider the graph G′ = (L, R′, E′ ⊆ L × R′) that is the k′-prefix of G. Let K′ = 2^{k′} and N = 2^n. Let us consider B ⊆ {0,1}^n of size |B| ≥ K′, and A ⊆ R′. For a fixed x ∈ B and y ∈ {0,1}^d, the probability that the y-labeled edge outgoing from x lands in A is |A| / |R′|. By the Chernoff bounds,

Prob[ | |E′(B, A)| / (|B| · D) − |A| / |R′| | > ε ] ≤ 2^{−Ω(K′ · D · ε²)}.

The probability that relation (2) fails for some B ⊆ {0,1}^n of size |B| ≥ K′ and some A ⊆ R′ is bounded by

2^{K′} · (N choose K′) · 2^{−Ω(K′ · D · ε²)},

because A can be chosen in 2^{K′} ways, and we can consider that B has size exactly K′, so that there are (N choose K′) possible choices of such B's. If D = cn/ε² and c is sufficiently large, the above probability is less than (1/4) · 2^{−k′}. Therefore the probability that relation (2) fails for some k′, some B and some A is less than 1/4.
It follows that there exists a graph satisfying the conclusion of the lemma.

Let G = (L, R, E ⊆ L × R) be the (k, ε)-extractor from Lemma 5.1, and let δ = (2ε)^{1/2}. As hinted in our discussion above, by manipulating relation (2), we can show that for every B ⊆ L of size |B| > 2^k, a 1 − δ fraction of the nodes in B are rich owners with respect to B. This proves the rich owner property for sets in the large regime. If B is in the small regime, then B has size 2^{k′} for some k′ < k (for simplicity, we assume that the size of B is a power of two). Let us consider G′, the k′-prefix of G. As above, in G′, a 1 − δ fraction of the elements x in B are rich owners with respect to B. Recall that this means that if x is a rich owner, then a 1 − δ fraction of its neighbors have B-degree bounded by s, where s = (2/δ)|B| · D / |R′| = poly(n/ε). Using the same hashing technique as before, we can "split" each edge into poly(n/ε) new edges. More precisely, an edge (x, z) in G′ is transformed into ℓ = (1/δ)sn new edges, (p_1, x mod p_1, z), . . . , (p_ℓ, x mod p_ℓ, z), where, as above, p_i is the i-th prime number. If x is a rich owner, then a 1 − δ fraction of its "new" neighbors (obtained after splitting) have B-degree equal to one, as desired, because hashing isolates x from the other neighbors of z. The B-degree of these right nodes continues to be one also in G, because when merging nodes to obtain G′ from G, the right degrees can only increase. Note that the right nodes in G have as labels the k-bit strings, and after splitting we need to add to the labels the hash values (p_i, x mod p_i), which are of length O(log(n/ε)), and this is the cause for the O(log(n/δ)) overhead in Theorem 4.3.

The explicit graph in Theorem 4.3 is obtained in the same way, except that instead of the "prefix" extractor from Lemma 5.1 we use the Raz-Reingold-Vadhan extractor [RRV99].

This paper is dedicated to the memory of Professor Solomon Marcus. In the 1978 freshman Real Analysis class at the University of Bucharest, he asked several questions (on functions having pathological properties regarding finite variation). This was my first contact with him, and, not coincidentally, also the first time I became engaged in a type of activity that resembled mathematical research. Over the years, we had several discussions, on scientific but also on rather mundane issues, and each time, without exception, I was stunned by his encyclopedic knowledge on diverse topics, including of course various fields of mathematics, but also literature, social sciences, philosophy, and whatnot.

    The utilitarian function of mathematics is in most cases a consequence of its cognitive function, but the temporal distance between the cognitive moment and the utilitarian one is usually imprevisible.
    Solomon Marcus
References

[BZ14] Bruno Bauwens and Marius Zimand. Linear list-approximation for short programs (or the power of a few random bits). In IEEE 29th Conference on Computational Complexity, CCC 2014, Vancouver, BC, Canada, June 11-13, 2014, pages 241–247. IEEE, 2014.

[Cha66] G. Chaitin. On the length of programs for computing finite binary sequences. Journal of the ACM, 13:547–569, 1966.

[Cov75] Thomas M. Cover. A proof of the data compression theorem of Slepian and Wolf for ergodic sources (corresp.). IEEE Transactions on Information Theory, 21(2):226–228, 1975.

[CRVW02] M. R. Capalbo, O. Reingold, S. P. Vadhan, and A. Wigderson. Randomness conductors and constant-degree lossless expanders. In John H. Reif, editor, STOC, pages 659–668. ACM, 2002.

[DW85] G. Dueck and L. Wolters. The Slepian-Wolf theorem for individual sequences. Problems of Control and Information Theory, 14:437–450, 1985.

[Kol65] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems Inform. Transmission, 1(1):1–7, 1965.

[Kuz09] S. Kuzuoka. Slepian-Wolf coding of individual sequences based on ensembles of linear functions. IEICE Trans. Fundamentals, E92-A(10):2393–2401, 2009.

[LZ76] A. Lempel and J. Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory, IT-22:75–81, 1976.

[Muc02] Andrei A. Muchnik. Conditional complexity and codes. Theor. Comput. Sci., 271(1-2):97–109, 2002.

[Rom05] A. Romashchenko. Complexity interpretation for the fork network coding. Information Processes, 5(1):20–28, 2005. In Russian. Available in English as [Rom16].

[Rom16] Andrei Romashchenko. Coding in the fork network in the framework of Kolmogorov complexity. CoRR, abs/1602.02648, 2016.

[RR99] Ran Raz and Omer Reingold. On recycling the randomness of states in space bounded computation. In Jeffrey Scott Vitter, Lawrence L. Larmore, and Frank Thomson Leighton, editors, STOC, pages 159–168. ACM, 1999.

[RRV99] R. Raz, O. Reingold, and S. Vadhan. Extracting all the randomness and reducing the error in Trevisan's extractor. In Proceedings of the 31st ACM Symposium on Theory of Computing, pages 149–158. ACM Press, May 1999.

[Sol64] R. Solomonoff. A formal theory of inductive inference. Information and Control, 7:224–254, 1964.

[SW73] D. Slepian and J. K. Wolf. Noiseless coding of correlated information sources. IEEE Transactions on Information Theory, 19(4):471–480, 1973.

[WZ94] A. D. Wyner and J. Ziv. The sliding window Lempel-Ziv algorithm is asymptotically optimal. Proc. IEEE, 82(6):872–877, 1994.

[Zim17] M. Zimand. Kolmogorov complexity version of Slepian-Wolf coding. In STOC 2017, pages 22–32. ACM, June 2017.

[Ziv78] J. Ziv. Coding theorems for individual sequences. IEEE Trans. Inform. Theory, IT-24:405–412, 1978.

[Ziv84] J. Ziv. Fixed-rate encoding of individual sequences with side information. IEEE Trans. Inform. Theory, 1984.