Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties
Rasmus Vestergaard, Qi Zhang, and Daniel E. Lucani
DIGIT and Department of Engineering, Aarhus University, Denmark
{rv, qz, daniel.lucani}@eng.au.dk
Abstract
We study a generalization of deduplication, which enables lossless deduplication of highly similar data, and show that classic deduplication with fixed chunk length is a special case. We provide bounds on the expected length of coded sequences for generalized deduplication and show that the coding has asymptotic near-entropy cost under the proposed source model. More importantly, we show that generalized deduplication allows for multiple orders of magnitude faster convergence than classic deduplication. This means that generalized deduplication can provide compression benefits much earlier than classic deduplication, which is key in practical systems. Numerical examples demonstrate our results, showing that our lower bounds are achievable, and illustrating the potential gain of using the generalization over classic deduplication. In fact, we show that even for a simple case of generalized deduplication, the gain in convergence speed is linear with the size of the data chunks.
I. INTRODUCTION
Deduplication [1] is a common practical compression technique in filesystems and other storage systems. It has been found to achieve significant space savings in several empirical studies for different workloads [2], [3]. Despite its practical importance, it has received little attention in the information theory community, with only Niesen's recent work analyzing its compression potential [4]. As more data is generated every year, a thorough understanding of the fundamental limits of deduplication and similar techniques is of utmost importance.

A significant shortcoming of deduplication is that near-identical files are not identified, and are considered as completely different files. This can discourage the adoption of deduplication in some scenarios. An example is a network of Internet of Things (IoT) devices sensing an underlying process. Their measurements will be highly correlated, but may differ slightly due to spatial distance, measurement noise, and other factors. Deduplication for data of this type can, to some extent, be enabled with a generalized view on deduplication. This generalized deduplication allows near-identical chunks to be deduplicated, while still ensuring lossless reconstruction of the data. The method has practical merits, and has been shown to achieve a compression of modelled sensor data in many cases where deduplication is unable to [5]. Another instance is able to achieve a compression comparable to typical lossless compression methods for ECG data, while maintaining benefits from classic deduplication [6]. This paper is a study of the theoretical properties of the technique, and it is shown how generalized deduplication compares to classic deduplication.
A. Related work
To our knowledge, Niesen presents the only previous information-theoretical analysis of deduplication [4]. Niesen's work introduces a source model, formalizes deduplication approaches with chunks of both fixed length and variable length, and analyzes the performance of the approaches. Our paper uses a similar strategy to analyze generalized deduplication.

The manner in which deduplication is presented will make it clear that it is similar to classic universal source coding techniques such as the LZ algorithms [7], [8]. In practice, the main difference between the methods is the scale at which they operate. Deduplication attempts to identify large matching chunks (KB) on a global scale (GB to TB), whereas classic methods identify smaller amounts of redundancy (B) in a relatively small window (KB to MB).

The problem in deduplication is also similar to the problem of coding for sources with unknown alphabets [9] or multiple alphabets [10]. Such schemes attempt to identify the underlying source alphabet, and use this for universal compression, ideally approaching entropy regardless of the source's distribution. Deduplication can be seen as one such approach, compressing the source output by building a dictionary (alphabet) and replacing elements with a pointer to the dictionary.
B. Contributions
This paper provides a formal analysis of generalized deduplication and comparisons to classic deduplication, a special case. The main contributions are:

a) Bounds: We present a simple model for generalized deduplication as a source coding technique. This model is used to derive upper and lower bounds on the expected length of encoded sequences. The potential gain of the generalization against the classic approach is bounded, quantifying the value of the generalization for data fitting the source structure.

b) Asymptotic behavior: We derive the asymptotic cost of generalized deduplication, showing that the method converges to as little as one bit more than the source entropy per chunk. We analyze how fast this convergence happens, and show that the generalization allows for faster convergence.

c) Numerical results: Concrete examples are used to show that the lower bounds are achievable. The generalization's potential for faster convergence and compression gain is easily visualized.

Theorem proofs are deferred to the appendices.

II. PROBLEM SETTING
A. Generalized deduplication
Generalized deduplication is now presented as a technique for source coding. In this paper, the technique operates on a randomly sampled binary sequence s, which consists of several chunks. The chunks are restricted to have equal length, n bits. The chunks in the sequence are a combination of a base and a deviation. The base is responsible for most of the chunk's information content, whereas the deviation is the (small) difference between the base and the chunk. This property of the data is important for the coding procedure. Formally, the possible bases form a set X′ and the deviations form a set Y. These sets define the set of all potential chunks, Z′ = X′ ⊕ Y, i.e., the set arising from taking the symbol-wise exclusive-or of all bases in X′ with all deviations in Y. The method requires identification of a minimum-distance mapping φ : Z′ → X′, which will be used to identify a chunk's base. The deviation can be found by comparing the chunk to its base. The encoder and decoder must have prior knowledge of X′ and Y, which are used to determine the coded representations. These sets do not need to be stored explicitly. The coders do not have prior knowledge of Z ⊆ Z′, which is the set chunks are generated from. In particular, the coders do not know a priori which bases are the active ones, which is some set X ⊆ X′, forming Z = X ⊕ Y.

The presented algorithm encodes (decodes) a sequence in one pass, encoding (decoding) over a dictionary of previously encountered bases. In practical systems, data is structured in databases, since this enables independent and parallel access and higher speed. However, this paper follows the traditional source coding style of operating on a sequence, since this simplifies analysis.
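For concreteness, the minimum-distance mapping φ and the deviation extraction can be sketched in a few lines of Python. This is an illustrative brute-force sketch, not part of the formal scheme; a real coder would use the decoder of the underlying code, and the two 7-bit bases below are chosen only for illustration.

```python
def hamming_distance(a, b):
    # Number of positions in which two equal-length binary tuples differ.
    return sum(x != y for x, y in zip(a, b))

def make_phi(bases):
    # Minimum-distance mapping phi: Z' -> X', here by exhaustive search
    # over the known base set X'.
    def phi(chunk):
        return min(bases, key=lambda b: hamming_distance(chunk, b))
    return phi

def deviation(chunk, base):
    # Symbol-wise exclusive-or recovers y with chunk = base XOR y.
    return tuple(c ^ b for c, b in zip(chunk, base))

phi = make_phi([(0,) * 7, (1,) * 7])
chunk = (0, 0, 1, 0, 0, 0, 0)
base = phi(chunk)                 # nearest base: the all-zero chunk
dev = deviation(chunk, base)      # small-weight deviation
```

A chunk is thus split into a base (deduplicated against the dictionary) and a deviation (stored compactly, since Y is known).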
Encoding: The encoding procedure is initialized with an empty deduplication dictionary, D. To encode a sequence, it is processed sequentially, one chunk at a time. The mapping φ is applied to the chunk, identifying the base and the deviation. The base is deduplicated against elements in D. If it does not yet exist in the dictionary, it is added to the dictionary, and this is indicated with a 1 in the output sequence, followed by the base itself. If it already exists, this is indicated by a 0 in the coded sequence, followed by a pointer to the base's location in the dictionary, using ⌈log |D|⌉ bits. The deviation is added to the output sequence, following the base. It does not need to be represented in full, since knowing Y allows specification of a representation of q ≤ ⌈log |Y|⌉ bits.

Decoding:
The coded sequence is uniquely decodable. The decoding procedure is also initialized with an empty deduplication dictionary, D. Decoding happens one chunk at a time, parsing the sequence on the fly. If a 1 is the first bit of a coded chunk, a base follows directly and is added to D. On the other hand, if a 0 occurs, the base was deduplicated, so it must already exist in D, and is looked up based on the following pointer. The coded deviation is expanded to its full representation. Finally, the chunk can be reconstructed by combining the base and deviation. The reconstruction is added to the output sequence. This is repeated until the coded sequence has been processed in its entirety.

Remark.
The classic deduplication approach arises as an important special case. It is obtained by considering each chunk as its own base, and thus there is no deviation. Formally, this means Y contains only the all-zero chunk of length n, so X′ = Z′, and φ is the identity function.

B. Source model
A formal source model is now specified. All analysis in this paper uses this source structure. Chunks have a length of n symbols, and are generated by a combination of two sources. Our analysis is restricted to binary symbols, so chunks are in the binary field Z_2^n.

The first source generates the active bases, and is denoted by X ⊆ X′. X′ is a packing of n-dimensional spheres with radius t in Z_2^n. The second source generates the deviations, and is denoted by Y. This source consists of elements with low Hamming weight, i.e., Y = { v_i ∈ Z_2^n : w(v_i) ≤ t } for the same t as the packing. This allows definition of the chunk source, Z = X ⊕ Y, which can be interpreted as all points inside some spheres in Z_2^n, where the spheres are centered at the bases from X and have radius t. The fact that a sphere packing is used for X′ implies that the spheres are non-overlapping and, thus, P[Z = z] = P[X = x] · P[Y = y] and |Z| = |X||Y|. We assume that chunks are drawn uniformly at random from Z.

Example 1 (Source construction). Let X′ be the set of codewords from the (7,4) Hamming code and let Y consist of all binary vectors of Hamming weight at most 1. Spheres of radius 1 cover the entire field, so Z′ = X′ ⊕ Y = Z_2^7. In this example, let the base source have two active elements, e.g., X = {0000000, 1111111}, and Z = X ⊕ Y then becomes

Z = {0000000, 0000001, 0000010, 0000100, 0001000, 0010000, 0100000, 1000000, 0111111, 1011111, 1101111, 1110111, 1111011, 1111101, 1111110, 1111111}

with |Z| = |X||Y| = 16. An optimal coding of this source uses H(Z) = log |Z| = 4 bits per chunk. (All logarithms in this paper are to base 2.) The mapping φ : Z′ → X′ (or Z → X) can be derived from the decoding procedure for the Hamming code.
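The construction of Example 1 can be checked mechanically. The sketch below builds X′ from a (7,4) Hamming parity check whose column for bit position i (counted from the right) is the binary representation of i; this particular column ordering is an assumption for illustration, and any equivalent parity check would do.

```python
from itertools import product

def syndrome(v):
    # Parity-check of the (7,4) Hamming code: the column for bit
    # position i (counted from the right) is i in binary, so the
    # syndrome of a single-bit deviation equals its position.
    s = 0
    for pos, bit in zip(range(7, 0, -1), v):
        if bit:
            s ^= pos
    return s

# X': the 16 codewords of the (7,4) Hamming code (zero syndrome).
X_all = [v for v in product((0, 1), repeat=7) if syndrome(v) == 0]
assert len(X_all) == 16

# Active bases and low-weight deviations, as in Example 1.
X = [(0,) * 7, (1,) * 7]
Y = [v for v in product((0, 1), repeat=7) if sum(v) <= 1]

Z = {tuple(a ^ b for a, b in zip(x, y)) for x in X for y in Y}
assert len(Z) == len(X) * len(Y) == 16   # spheres do not overlap
```

Since |Z| = 16 and chunks are uniform, H(Z) = 4 bits, matching the example.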
This source structure is a stylized model of the practical case where chunks tend to be similar, but not necessarily identical. An example is a surveillance camera, continuously taking pictures of the same location. The bases might then be the location in different lighting, and a change in some of the image's pixels can then be captured by the deviation.
C. Coding a source
Generalized deduplication has greater potential with large data sets and long chunks, yet a small example is useful to understand the method. An example is presented for the source of Example 1. A step-by-step explanation of the encoding and decoding procedures is found in Appendix A. We start with the simpler special case, classic deduplication.
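The classic procedure used in the example below can be sketched in a few lines. This is an illustrative Python sketch under the paper's coding convention (flag 1 plus raw chunk for a new entry, flag 0 plus a ⌈log |D|⌉-bit pointer for a repeat), not the authors' implementation.

```python
from math import ceil, log2

def dedup_encode(chunks):
    # Classic deduplication of a list of equal-length bit strings.
    d, out = [], []
    for c in chunks:
        if c in d:
            idx = c and d.index(c)
            idx = d.index(c)
            ptr_bits = ceil(log2(len(d))) if len(d) > 1 else 0
            ptr = '' if ptr_bits == 0 else format(idx, f'0{ptr_bits}b')
            out.append('0' + ptr)        # repeat: flag 0 + pointer
        else:
            d.append(c)
            out.append('1' + c)          # new chunk: flag 1 + chunk
    return ''.join(out), d

# The five chunks drawn from the source of Example 1 (see Appendix A).
chunks = ['0001000', '0010000', '0010000', '1111110', '0010000']
coded, dictionary = dedup_encode(chunks)
assert len(coded) == 29   # matches Example 2
```

The dictionary ends up holding the three distinct chunks, and the coded length is 29 bits rather than the original 35.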
Example 2 (Deduplication). Let Z be the source from Example 1. Five chunks are chosen uniformly at random, and concatenated. This forms a sequence of ℓ(s) = 35 bits:

s = 0001000 | 0010000 | 0010000 | 1111110 | 0010000.

Applying deduplication to this sequence results in:

s_D = 1.0001000 | 1.0010000 | 0.1 | 1.1111110 | 0.01,

where the final dictionary is {0001000, 0010000, 1111110} and ℓ(s_D) = 29 bits are used in total.

Let us now consider generalized deduplication. Full knowledge of X′ and Y is available, and is used to determine the deviation representation and the minimum-distance mapping.

Example 3 (Generalized deduplication). Consider again the sequence s of Example 2. To apply generalized deduplication, a representation for the deviations is needed. As they are equiprobable, H(Y) = log |Y| = 3 bits, so 3 bits is optimal for their representation. An optimal representation is

{0000000 → 000, 0000001 → 001, . . . , 1000000 → 111},

which in this special case is the syndrome representation of the (7,4) Hamming code. To compress the sequence, the minimum-distance mapping is applied to each chunk, identifying the closest base, which is a codeword of the Hamming code. The base is here represented in full, although it may easily be compressed to four bits since X′ is known to be the set of codewords of the (7,4) Hamming code. The result is:

s_G = 1.0000000.100 | 0..101 | 0..101 | 1.1111111.001 | 0.0.101,

where the final dictionary is {0000000, 1111111} and ℓ(s_G) = 35 bits are used.

Although in this limited example deduplication outperforms the generalization, our results show that this is not the case in general. In fact, the results show that there are significant benefits in convergence speed of using the generalized form.

III. BOUNDS
In this section, the coded length of sequences is studied. Let s be a random binary sequence of C chunks of n bits each, so ℓ(s) = Cn. The interesting metric is the expected coded length, given the length of the original sequence.

A. Bounds for coded sequence length for the generalization
The expected length of the sequence after generalized deduplication is R_G(C) = E[ℓ(s_G) | ℓ(s) = Cn]. This is decomposed as the sum of the expected coded lengths of the chunks in s:

R_G(C) = Σ_{c=1}^{C} E[ I{x_c ∉ D_{c−1}} (k + q) + I{x_c ∈ D_{c−1}} (l(D_{c−1}) + q) ],   (1)

where I{·} is the indicator function, D_{c−1} is the dictionary after chunk c − 1, x_c is the base of chunk c, l(D_{c−1}) is the number of bits needed to point into the dictionary, and finally q is the number of bits used for representing the deviation. (Delimiters are inserted between chunks for ease of reading; the coding and decoding procedures do not require this.) The base itself might be compressed to k bits with H(X′) ≤ k ≤ n, since X′ is known. Since chunks are drawn uniformly at random from Z, this is equivalent to picking a base and a deviation uniformly at random from X and Y. Thus,

P[x_c ∉ D_{c−1}] = (1 − |X|⁻¹)^{c−1} ≜ p_X(c).   (2)

We now state Theorem 1, bounding the expected length after generalized deduplication in the presented source model.

Theorem 1.
The expected length of the generalized-deduplication-encoded sequence from C chunks of length n is bounded as

θ_L(C, X, Y) ≤ R_G(C) ≤ θ_U(C, X, Y),

where

θ_L(C, X, Y) = C (log |Y| + 1) + Σ_{c=1}^{C} [ k p_X(c) + (1 − p_X(c)) log(|X| (1 − p_X(c))) ]   (3)

and

θ_U(C, X, Y) = C (log |Y| + 3) + Σ_{c=1}^{C} [ k p_X(c) + |X|⁻¹ min{ (c − 1) log(c − 1), |X| log |X| } ].   (4)

The proof of the theorem is reported in Appendix B.

B. Bounds for coded sequence length for deduplication
Classic deduplication is a special case which allows for a slightly tighter upper bound, and is therefore treated separately. The expected length of the sequence after deduplication is R_D(C) = E[ℓ(s_D) | ℓ(s) = Cn]. With the previous notation,

R_D(C) = Σ_{c=1}^{C} E[ I{z_c ∉ D_{c−1}} n + I{z_c ∈ D_{c−1}} l(D_{c−1}) ],   (5)

where z_c is chunk c itself, since it is now the base. This base cannot be compressed as before, so it needs n bits.

Theorem 2.
The expected length of the deduplication-encoded sequence from C chunks of length n is bounded as

θ_L(C, Z, {0}) ≤ R_D(C) ≤ θ_U(C, Z, {0}) − C,

where θ_L and θ_U are as in (3) and (4) with k = n, since new chunks are represented with no compression, with X replaced by Z = X ⊕ Y, and with Y replaced by {0}, the set containing only the all-zero chunk of length n.

The proof of the theorem is reported in Appendix C. We illustrate the implications of Theorems 1 and 2 through a numerical example in Section V.
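The two theorems can be evaluated numerically. The sketch below implements the bounds as reconstructed in (3) and (4) above (the |X|⁻¹ factor and the convention that the vanishing log terms are zero for the first chunks are assumptions of this sketch) and compares the two schemes on an Example-4-style source with |X| = 8, |Y| = 32, and bases stored raw (k = n = 31).

```python
from math import log2

def p_new(c, n_bases):
    # Probability that chunk c carries a base not yet in the dictionary, eq. (2).
    return (1 - 1 / n_bases) ** (c - 1)

def theta_L(C, n_bases, n_devs, k):
    # Lower bound (3); the log term vanishes for c = 1, where p = 1.
    total = C * (log2(n_devs) + 1)
    for c in range(1, C + 1):
        p = p_new(c, n_bases)
        total += k * p
        if p < 1:
            total += (1 - p) * log2(n_bases * (1 - p))
    return total

def theta_U(C, n_bases, n_devs, k):
    # Upper bound (4); (c-1)log(c-1) is taken as 0 for c <= 2.
    total = C * (log2(n_devs) + 3)
    for c in range(1, C + 1):
        grow = (c - 1) * log2(c - 1) if c > 2 else 0.0
        total += k * p_new(c, n_bases) + min(grow, n_bases * log2(n_bases)) / n_bases
    return total

C = 200
gen_lo, gen_hi = theta_L(C, 8, 32, 31), theta_U(C, 8, 32, 31)
# Theorem 2: |Z| = 256 bases, deviation set {0}, k = n, upper bound minus C.
ded_lo, ded_hi = theta_L(C, 256, 1, 31), theta_U(C, 256, 1, 31) - C
assert gen_lo <= gen_hi and ded_lo <= ded_hi
assert gen_hi < ded_lo   # the generalization is already ahead at C = 200
```

Even the upper bound of the generalization lies below the lower bound of classic deduplication at C = 200 chunks for this configuration, anticipating the numerical results of Section V.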
C. Bounds for the gain of generalized deduplication
Theorems 1 and 2 can be used to bound the expected gain from using generalized deduplication instead of deduplication.
Definition 1.
The generalization ratio is

G(C) = R_D(C) / R_G(C).

The bounds of generalized deduplication from Theorem 1 and of deduplication from Theorem 2 are used to loosely bound the generalization ratio as:

θ_L(C, Z, {0}) / θ_U(C, X, Y) ≤ G(C) ≤ (θ_U(C, Z, {0}) − C) / θ_L(C, X, Y).   (6)

These bounds allow for a simple assessment of the expected gain in a specific scenario.

IV. CONVERGENCE
A. Asymptotic storage cost
In this section, we provide theorems bounding the asymptotic coded length of a new chunk for generalized deduplication. Let ∆R_G^C be the expected length of chunk C when generalized deduplication is used, i.e.,

∆R_G^C = R_G(C) − R_G(C − 1).   (7)

Then the asymptotic cost of generalized deduplication is bounded by Theorem 3.

Theorem 3.
Generalized deduplication has asymptotic cost

H(Z) + 1 ≤ ∆R_G^∞ ≤ H(Z) + 3,

where Z is the set of potential chunks.

The proof of the theorem is reported in Appendix D. Generalized deduplication is thus asymptotically within one and three bits of the entropy of Z. In practice, the method will operate on larger chunks with high entropy, so this overhead will be negligible. Similarly, let ∆R_D^C be the expected length of chunk C in classic deduplication:

∆R_D^C = R_D(C) − R_D(C − 1).   (8)

For this special case, the tighter upper bound in Theorem 2 translates to a tighter upper bound on the asymptotic cost.

Theorem 4.
Classic deduplication has asymptotic cost

H(Z) + 1 ≤ ∆R_D^∞ ≤ H(Z) + 2,

where Z is the set of potential chunks.

The proof of the theorem is reported in Appendix E.
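The lower bound of Theorem 3 is easy to make concrete. Once every active base is in the dictionary, each coded chunk costs one flag bit, a pointer of ⌈log |X|⌉ bits, and a deviation of ⌈log |Y|⌉ bits; when |X| and |Y| are powers of two and Z is uniform, this equals H(Z) + 1 exactly. The following is an illustrative arithmetic check, not a proof of the theorem.

```python
from math import ceil, log2

def converged_cost(n_bases, n_devs):
    # Per-chunk cost once every active base is in the dictionary:
    # 1 flag bit + a pointer into the dictionary + the coded deviation.
    return 1 + ceil(log2(n_bases)) + ceil(log2(n_devs))

# With |X| and |Y| powers of two and Z uniform, H(Z) = log2(|X||Y|),
# and the converged cost meets the lower bound H(Z) + 1 exactly.
for nx, ny in [(2, 8), (8, 32), (16, 128)]:
    assert converged_cost(nx, ny) == log2(nx * ny) + 1
```

For non-powers of two the ceilings contribute the (at most two) extra bits allowed by the upper bound.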
B. Rate of Convergence
Now that it is established that generalized deduplication schemes converge to slightly more than the entropy of Z, it is also important to quantify the speed of convergence. Generalized deduplication should converge faster than deduplication in general, since the number of potential bases is smaller. The generalization needs to identify |X| bases for convergence, whereas the classic approach requires |X||Y| = |Z| bases. Convergence of the classic approach thus requires identification of an additional factor of |Y| bases. To formally analyze this, the following definition is needed [11, pp. 12–13].

Definition 2.
The rate of convergence of a sequence {a_1, a_2, ...} converging to ξ is

μ = lim_{i→∞} | (a_{i+1} − ξ) / (a_i − ξ) |,

with smaller values implying faster convergence.

For generalized deduplication, convergence happens according to the convergence of lim_{c→∞} P[x_c ∉ D_{c−1}] = 0. This sequence has converged when D_{c−1} = X, and thus the summand in (1) is constant. At this point ∆R_G remains constant, so it is sufficient to analyze the convergence of the sequence of probabilities. Thus,

μ_G = lim_{c→∞} | P[x_{c+1} ∉ D_c] / P[x_c ∉ D_{c−1}] | = lim_{c→∞} (1 − |X|⁻¹)^c / (1 − |X|⁻¹)^{c−1} = 1 − |X|⁻¹.   (9)

Remark.
For the case of classic deduplication,

μ_D = μ_G |_{X=Z} = 1 − |Z|⁻¹.   (10)

Since |Z| ≥ |X|, μ_D ≥ μ_G. Thus, generalized deduplication will be able to converge faster. In fact, |Z| ≫ |X| even in simple cases. Both approaches exhibit linear convergence [11].
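The practical meaning of (9) and (10) can be sketched numerically: from (2), the number of chunks until a new base becomes rare grows with the number of bases to be discovered. The threshold of 1% below is an arbitrary illustration, not a quantity from the paper.

```python
from math import ceil, log

def mu(n_bases):
    # Linear convergence rate, eqs. (9)-(10).
    return 1 - 1 / n_bases

def chunks_until_rare(n_bases, eps=0.01):
    # Smallest c with P[chunk c brings a new base] = (1 - 1/n)^(c-1) < eps.
    return ceil(log(eps) / log(1 - 1 / n_bases)) + 1

gen = chunks_until_rare(8)     # generalized: |X| = 8 bases to find
ded = chunks_until_rare(256)   # classic:     |Z| = |X||Y| = 256 bases
assert mu(8) < mu(256)         # mu_G < mu_D: faster convergence
assert gen < ded               # far fewer chunks until new bases are rare
```

For this configuration, new bases already become rare after a few tens of chunks under the generalization, but only after more than a thousand chunks under classic deduplication, a gap of roughly the factor |Y|.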
Fig. 1. Simulation and bounds for the expected sequence lengths, R_D(C) and R_G(C), and simulation for the DEFLATE algorithm.

Fig. 2. Simulation and bounds for the expected number of bits per additional chunk, ∆R_D^C and ∆R_G^C, and the DEFLATE algorithm.

Fig. 3. Simulation and bounds for the generalization ratio, G(C).

V. NUMERICAL RESULTS
To visualize the results presented in the paper, a concrete example is considered. The compression achieved by our method is compared to the compression achieved by zlib [12], a well-known compression library implementing the popular DEFLATE algorithm [13, Section 6.25], based on LZ77 [7] and Huffman coding [14].
Example 4.
Let X′ be the codewords of the (31,26) Hamming code. A subset X ⊂ X′ with |X| = 8 is chosen at random. Y is the set of binary vectors of length 31 with weight 1 or less. The resulting Z has |Z| = |X||Y| = 8 · 32 = 256 elements. To compare their performances, generalized deduplication, classic deduplication, and the DEFLATE algorithm are applied to C chunks uniformly drawn from this source.

The upper and lower bounds of R_{D,G}(C) from Theorems 1 and 2 are shown as dashed lines in Fig. 1. The solid lines are simulated averages. Our approach clearly outperforms the other approaches. The performance of classic deduplication and the DEFLATE algorithm are, for this source, comparable while the deduplication dictionary is filling up. At the end of the simulation, both classic deduplication and the generalization have a smaller representation than that of the DEFLATE algorithm. It is seen that both classic deduplication and the generalization are converging to the same slope. The asymptotic slope comes from the asymptotic cost, H(Z) + 1. When both schemes have converged, a gap remains between the lines. The gap remains constant, but eventually becomes negligible as C → ∞.

The upper and lower bounds of ∆R_{D,G}^C from Theorems 3 and 4 are shown as dashed lines in Fig. 2 as a function of the number of chunks, C. The assessment of the convergence rate in the previous section is now visualized: the faster convergence of the generalization is easily seen. Further, the solid line shows the average, which is seen to approximate the lower bound. This is because |X|, |Y| and |Z| all are powers of two for this source, and thus no overhead (compared to the lower bounds) is used to represent bases, deviations, or the entire chunks. The DEFLATE algorithm is unable to approach the entropy, while the other approaches are.

The generalization ratio is shown in Fig. 3. For the first few chunks deduplication performs best, but this is quickly outweighed by the faster convergence of the generalization. The gain grows sharply until convergence of ∆R_G, but slows down and then starts declining briefly thereafter. As the number of chunks goes to infinity, the ratio converges to 1.
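The convergence gap can also be estimated with a standard coupon-collector argument; this back-of-the-envelope sketch is an illustration under the uniform source model, not a bound from the paper.

```python
def expected_chunks_to_converge(n_bases):
    # Coupon-collector estimate: expected number of uniform draws until
    # every one of n_bases bases has appeared at least once (n * H_n).
    return n_bases * sum(1 / i for i in range(1, n_bases + 1))

# Example-4-style source: |X| = 8 active bases, |Z| = 256 distinct chunks.
gen = expected_chunks_to_converge(8)     # generalized: dictionary of bases
ded = expected_chunks_to_converge(256)   # classic: dictionary of whole chunks
assert ded / gen > 30   # classic needs well over |Y| = 32 times more data
```

The extra n log n factor of the coupon collector makes the data requirement of classic deduplication grow even faster than the factor |Y| suggested by counting bases alone.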
A general observation is that the maximum gain is achieved in the range where the generalization has converged, and classic deduplication is still far from converging. It is also seen that, for the first few samples, the generalization performs slightly worse. This is caused by the convention to put the uncompressed base in the output. In reality, since X′ is known, it is sufficient to use ⌈log |X′|⌉ ≤ n bits for each base. This will increase the gain slightly.

Fig. 4. Generalization ratio for different simulation configurations: |X| = 4, |Y| = 32; |X| = 8, |Y| = 64; |X| = 16, |Y| = 128.

The vital advantage of generalized deduplication is the smaller number of bases, which causes more matches with fewer chunks.

Example 5.
Let the mapping φ for generalized deduplication be defined through the (1023, 1013) Hamming code. Chunks must be 1023 bits (≈ 128 B), and the potential bases X′ are the codewords. Y is the set of binary vectors of length 1023 with weight 1 or less, so |Y| = 1024. Thus |Z| = 1024 |X|. The number of bases in classic deduplication is three orders of magnitude greater than in the generalization.

By simulating sequences generated with longer chunks, it is clear that this increases the maximum generalization gain. The convergence of deduplication is affected by an increase in |Y|, which is unavoidable when changing the chunk size, unless the packing radius t is also changed. The generalization is oblivious to this, so its convergence will not be affected, and thus the potential gain increases. In practice, where limited amounts of data are available, this enables the generalization to achieve a significant gain in storage costs. Our simulations show that if |X| is fixed and the chunk length, n, is increased, then the maximum ratio, max_C G(C), increases linearly as a function of the chunk length. That is, the potential gain of using the generalization instead of classic deduplication increases linearly with the chunk length. Fig. 4 shows the generalization ratio for three source configurations. These simulations show a clear trend that when the number of unique chunks a source can output grows, the potential advantage of using the generalization instead of classic deduplication becomes greater.

VI. CONCLUSION
The preceding sections present an information-theoretical analysis of generalized deduplication, which allows deduplication of near-identical data, and of classic deduplication as a special case. By analyzing a simple source model, we show that sources exist for which the advantages of the generalization are significant. Indeed, we show that generalized deduplication exhibits linear convergence with the number of data chunks. In the limit, each data chunk can be represented by at most 3 bits more than the entropy of the source, but our numerical results show that generalized deduplication can converge to the lower bound of 1 bit more than the entropy. The advantage of generalizing deduplication manifests itself in the convergence. If the data has characteristics similar to our source model, then the generalization can converge to near-entropy costs with orders of magnitude less data than classic deduplication. With an m-to-1 mapping φ, a factor of m fewer bases must be identified, creating a potential for improving compression in practice, where the amount of data will be limited.

The presented source model is somewhat stylized, and is not accurate for practical data sets. An important next step is to lift the restriction of having data uniformly distributed over the spheres, which will enable a study of the method for general sources. Indeed, our future work will address how to make the method more practical. For instance, it is relatively simple to empirically model a chunk source, Z, given concrete data, but this source must be carefully split into two underlying sources, the base source X and the deviation source Y, in order to approximate the model and realize the potential of generalized deduplication. We have studied some strategies for generalized deduplication from a more practical perspective [5], [6], but this task is not trivial in general. We will continue with this work in the future.

APPENDIX A
A DETAILED EXAMPLE
Assume that X = {0000000, 1111111}, and let Y = { v_i ∈ Z_2^7 : w(v_i) ≤ 1 }. Let Z = X ⊕ Y. Draw 5 elements from Z i.i.d. uniformly. Assume that these elements are:

(0001000, 0010000, 0010000, 1111110, 0010000).

The elements are then concatenated to a sequence:

s = 00010000010000001000011111100010000.

Classic Deduplication

Encoding:
The encoding is initialized with an empty dictionary, D_0. Since we know that chunks have length 7, the sequence is split into chunks of that length:

s = 0001000 | 0010000 | 0010000 | 1111110 | 0010000.

Now, the chunks are handled sequentially. The first is 0001000. This chunk is not in D_0, so it is added to it. The new dictionary then is

D_1 = {0001000},

and the encoded sequence after the first chunk is formed by adding a 1 (since we added the chunk to the dictionary) and then the chunk itself (the dot is only for easier visualization):

s_D = 1.0001000.

We then move to the next chunk, 0010000, which is not in D_1. It is added, and a 1 followed by the chunk is added to the encoded sequence:

D_2 = {0001000, 0010000},
s_D = 1.0001000 | 1.0010000.

The next element is 0010000. This element is already in the dictionary, so it is not added again. For this reason, a 0 is placed in the output sequence, followed by a pointer to the element in the dictionary using ⌈log |D_2|⌉ = ⌈log 2⌉ = 1 bit. Since the element is the second in the dictionary, it is represented by 1:

D_3 = D_2 = {0001000, 0010000},
s_D = 1.0001000 | 1.0010000 | 0.1.

The next element, 1111110, is new. It is added to the dictionary, and to the encoded sequence following a 1:

D_4 = {0001000, 0010000, 1111110},
s_D = 1.0001000 | 1.0010000 | 0.1 | 1.1111110.

The final element is 0010000, which already is in the dictionary. A pointer to the dictionary is therefore added to the encoding, following a 0. The pointer now needs ⌈log |D_4|⌉ = ⌈log 3⌉ = 2 bits. Since the element is the second in the dictionary, it is represented as 01:

D_5 = D_4 = {0001000, 0010000, 1111110},
s_D = 1.0001000 | 1.0010000 | 0.1 | 1.1111110 | 0.01.

All chunks are now encoded, and s_D is output.

Decoding:
The decoding is initialized with an empty dictionary, D_0. The sequence is processed sequentially. We start from

s_D = 10001000100100000111111110001.

The first bit is always a 1, since the dictionary is empty. It is also known that chunks have length 7. At first, the sequence can then be parsed as:

s_D = 1.0001000 | 100100000111111110001.

The first element can now be extracted and added to the dictionary. It is also added to the decoded sequence directly:

D_1 = {0001000},
s = 0001000.

Since the inserted delimiter is followed by a 1, it is known that the next chunk is also new. Therefore, a delimiter can be inserted 8 bits after the first delimiter:

s_D = 1.0001000 | 1.0010000 | 0111111110001.

The chunk is added to the dictionary and the decoded sequence:

D_2 = {0001000, 0010000},
s = 0001000 | 0010000.

The new delimiter is followed by a 0 flag this time. Therefore, the flag is followed by a pointer. Since ⌈log |D_2|⌉ = ⌈log 2⌉ = 1, the flag is followed by a pointer of 1 bit. A new delimiter can then be inserted:

s_D = 1.0001000 | 1.0010000 | 0.1 | 11111110001.

The pointer is 1, which means that the second element in the dictionary should be added to the output sequence:

D_3 = D_2 = {0001000, 0010000},
s = 0001000 | 0010000 | 0010000.

A 1 follows the last delimiter, so a chunk follows directly. A new delimiter is inserted after the chunk:

s_D = 1.0001000 | 1.0010000 | 0.1 | 1.1111110 | 001,

and the chunk is inserted into the dictionary and the output, resulting in

D_4 = {0001000, 0010000, 1111110},
s = 0001000 | 0010000 | 0010000 | 1111110.

Finally, a 0 follows the delimiter. Since ⌈log |D_4|⌉ = ⌈log 3⌉ = 2, the two bits after the flag (which luckily is the rest of the sequence) point to an element in the dictionary. The value is 01, so the second element in the dictionary should be added to the output sequence:

D_5 = D_4 = {0001000, 0010000, 1111110},
s = 0001000 | 0010000 | 0010000 | 1111110 | 0010000.

The decoding is now complete, and s is output as ŝ. Luckily ŝ = s, as expected.

Generalized Deduplication

Encoding:
As deviations are drawn uniformly from Y, H(Y) = log |Y| = 3 bits; 3 bits is thus optimal for their representation. An optimal representation is

{0000000 → 000, 0000001 → 001, 0000010 → 010, 0000100 → 011, 0001000 → 100, 0010000 → 101, 0100000 → 110, 1000000 → 111}.

The encoding is initialized with an empty dictionary, D_0. Since we know that chunks have length 7, the sequence is split into chunks of that length:

s = 0001000 | 0010000 | 0010000 | 1111110 | 0010000.

The chunks are handled sequentially. The first is 0001000. By applying the minimum-distance mapping φ (decode and encode using that X′ is the set of Hamming codewords), the base is found to be 0000000. This base is not in D_0, so it is added to it. In this example, we decide not to compress the base, but leave it in full size. The dictionary is then:

D_1 = {0000000}.

Since the base was not in the dictionary, a 1 is added to the sequence, followed by the base. The deviation is the difference between the chunk and the base, which in this case is 0001000. The deviation is changed to the optimal representation, 100. After the first chunk, the coded sequence is thus:

s_G = 1.0000000.100.

The next chunk is 0010000. It also maps to the base 0000000. A 0 is added to the output sequence, followed by a pointer of ⌈log |D_1|⌉ = ⌈log 1⌉ = 0 bits pointing to the base. Since the base is the only element in the dictionary, no bits are needed to specify which one it is. The deviation is 0010000, which is added in the optimal representation, 101. The dictionary and coded sequence thus become:

D_2 = D_1,
s_G = 1.0000000.100 | 0..101.

The next chunk is also 0010000, and will get the same coded representation. Thus

D_3 = D_2,
s_G = 1.0000000.100 | 0..101 | 0..101.

This chunk, however, is followed by 1111110. The nearest neighbor in X′ (and X) is 1111111. This will thus be the base. The base is not in D_3, so it is added to it, and

D_4 = {0000000, 1111111}.

The deviation is found by comparing the chunk to the base, and is 0000001. Changing this to the optimal representation, 001, it is now possible to form the coded representation of the chunk. It is added to the encoding:

s_G = 1.0000000.100 | 0..101 | 0..101 | 1.1111111.001.
The base is of course still , and the deviation . Although this base hasbeen seen before, the representation in the output will be slightly different, since the dictionary has grown. Now (cid:100) log |D |(cid:101) = 1 bit is needed. The base is the first element in the dictionary, so it will be represented by a : D = D ,s G = 1 . . | .. | .. | . . | . . . The concludes the process, and s G is output as s G . It is worth noting that already D = X , and thus all subsequent chunksfrom Z will be represented with bits, one more than the entropy. This shows how the generalization can converge fasterthan classic deduplication. Decoding:
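The decoding steps traced below can be sketched the same way. Again, this is an illustrative sketch: the inverse deviation table matches the assumed assignment from the encoder sketch (only the three entries actually exercised by the example are forced by the text).

```python
# Sketch of a generalized deduplication decoder (illustration only).
# Inverse deviation table: 3-bit code -> flipped-bit position (0 = none).
DEV_POS = {'000': 0, '010': 1, '011': 2, '101': 3,
           '100': 4, '110': 5, '111': 6, '001': 7}

def gd_decode(s_g, n=7, p=3):
    dictionary, out, i = [], [], 0
    while i < len(s_g):
        flag, i = s_g[i], i + 1
        if flag == '1':                       # new base follows in full
            base = [int(b) for b in s_g[i:i + n]]; i += n
            dictionary.append(base)
        else:                                 # pointer of ceil(log2 |D|) bits
            ptr_len = (len(dictionary) - 1).bit_length()
            idx = int(s_g[i:i + ptr_len], 2) if ptr_len else 0
            i += ptr_len
            base = dictionary[idx]
        pos = DEV_POS[s_g[i:i + p]]; i += p   # expand the deviation code
        chunk = base.copy()
        if pos:
            chunk[pos - 1] ^= 1               # apply the single-bit deviation
        out.append(''.join(map(str, chunk)))
    return ''.join(out)

s_hat = gd_decode('10000000100010101011111111100100101')
```

Decoding the coded sequence from the walkthrough recovers the original 35-bit input sequence, illustrating lossless reconstruction.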
The decoding is initialized with an empty dictionary, D. The sequence is processed sequentially. We start from s_G = 10000000100010101011111111100100101. The sequence starts with a 1. This means that a base will follow the flag directly. The base is not compressed, so it has length n = 7. The base is followed by a deviation represented with 3 bits. This allows us to parse the first chunk: s_G = 1.0000000.100 | ... The base is added to the dictionary, so D = {0000000}, and the deviation is expanded to the full representation: 100 → 0001000. The chunk is then reconstructed by combining the base and the deviation, using bitwise exclusive-or: 0000000 ⊕ 0001000 = 0001000. This is the reconstructed chunk, which is added to the decoded sequence, s = 0001000. The next chunk has a 0 flag, so the base is already in the dictionary. Since the dictionary has a single element only, 0 bits are needed for the pointer. The deviation is, as always, 3 bits. This allows the parsing of the second chunk: s_G = 1.0000000.100 | 0..101 | ... The base is then again 0000000. The deviation is expanded: 101 → 0010000. These two are combined, forming the new chunk: 0000000 ⊕ 0010000 = 0010000, and this chunk is added to the output: s = 0001000 | 0010000. The third chunk starts with a 0 too, so the base is indicated with 0 bits, and is again the one already in the dictionary. The coded chunk is parsed as s_G = 1.0000000.100 | 0..101 | 0..101 | ... and is the same as the previous. The reconstruction is the same, so s = 0001000 | 0010000 | 0010000. Now, the current last delimiter is followed by a 1, so a new base of 7 bits and a 3-bit deviation follow. The parsing is s_G = 1.0000000.100 | 0..101 | 0..101 | 1.1111111.001 | ... The base is 1111111, and needs to be added to the dictionary: D = {0000000, 1111111}. The deviation is then expanded, 001 → 0000001. The base and deviation reconstruct the chunk: 1111111 ⊕ 0000001 = 1111110, which is added to the output: s = 0001000 | 0010000 | 0010000 | 1111110. The delimiter is now followed by a 0, so the base is already in the dictionary. ⌈log |D|⌉ = ⌈log 2⌉ = 1 bit is used for the pointer, so the parsing is s_G = 1.0000000.100 | 0..101 | 0..101 | 1.1111111.001 | 0.0.101. The pointer is 0, so the base is the first element in the dictionary, i.e., 0000000. The deviation is 101 → 0010000, so the chunk can be combined to 0010000. This means s = 0001000 | 0010000 | 0010000 | 1111110 | 0010000. The coded sequence is now fully decoded, and the result is output as ŝ. As expected, ŝ = s.

APPENDIX B
PROOF OF THEOREM

Proof.
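As a numerical companion to the derivation that follows, the final lower bound (16) and upper bound (21) are straightforward to evaluate directly. A sketch, with illustrative parameters k = 4, |X| = 16, |Y| = 8 (logarithms base 2); the function name and parameter choices are ours, not the paper's:

```python
# Numerical evaluation of the bounds (16) and (21) on R_G(C).
from math import log2

def rg_bounds(C, k=4, X=16, Y=8):
    """Lower/upper bounds on the expected coded length of C chunks."""
    q = 1 - 1 / X                    # P[a given base is still absent], cf. (11)
    lo = C * (log2(Y) + 1)
    hi = C * (log2(Y) + 3)
    for c in range(1, C + 1):
        miss = q ** (c - 1)          # P[base of chunk c is new]
        lo += k * miss
        frac = 1 - miss              # E[|D_{c-1}|] / |X|
        if frac > 0:                 # convention: 0 * log 0 = 0
            lo += frac * log2(X * frac)
        hi += k * miss
        seen = 0.0 if c == 1 else (c - 1) * log2(c - 1)
        hi += min(seen, X * log2(X)) / X
    return lo, hi

lo, hi = rg_bounds(1000)
```

For large C, both bounds grow by roughly 1 + H(Z) bits per additional chunk, which is the asymptotic behavior analyzed in Appendix D.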
The structure of the source is such that drawing a chunk uniformly from Z is equivalent to drawing a base uniformly from X and a deviation uniformly from Y. Since bases are drawn uniformly at random, the probability that the base of chunk c is not already in the dictionary is
$$ \mathbb{P}\left[ x_c \notin \mathcal{D}_{c-1} \right] = \left( 1 - |\mathcal{X}|^{-1} \right)^{c-1}. \qquad (11) $$
The expected coded length can be bounded from below as:
$$ R_G(C) = \sum_{c=1}^{C} \mathbb{E}\left[ \mathbb{I}\{x_c \notin \mathcal{D}_{c-1}\}(k+p+1) + \mathbb{I}\{x_c \in \mathcal{D}_{c-1}\}\left(l(\mathcal{D}_{c-1})+p+1\right) \right] $$
$$ \geq \sum_{c=1}^{C} \mathbb{E}\left[ \mathbb{I}\{x_c \notin \mathcal{D}_{c-1}\}(k+p+1) + \mathbb{I}\{x_c \in \mathcal{D}_{c-1}\}\left(\log|\mathcal{D}_{c-1}|+p+1\right) \right] \qquad (12) $$
$$ = C(p+1) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) \qquad (13) $$
$$ \geq C(\log|\mathcal{Y}|+1) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) \qquad (14) $$
$$ \geq C(\log|\mathcal{Y}|+1) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\right]\log\mathbb{E}\left[|\mathcal{D}_{c-1}|\right] \right) \qquad (15) $$
$$ = C(\log|\mathcal{Y}|+1) + \sum_{c=1}^{C} \left[ k\left(1-|\mathcal{X}|^{-1}\right)^{c-1} + \left(1-\left(1-|\mathcal{X}|^{-1}\right)^{c-1}\right)\log\left(|\mathcal{X}|\left(1-\left(1-|\mathcal{X}|^{-1}\right)^{c-1}\right)\right) \right] \qquad (16) $$
where the inequality in (12) follows from $\log|\mathcal{D}_{c-1}| \leq l(\mathcal{D}_{c-1})$, because $l(\mathcal{D}_{c-1}) = \lceil \log|\mathcal{D}_{c-1}| \rceil$. The equality in (13) uses that
$$ \mathbb{E}\left[\mathbb{I}\{x_c \in \mathcal{D}_{c-1}\}\log|\mathcal{D}_{c-1}|\right] = \mathbb{E}\left[\mathbb{E}\left[\mathbb{I}\{x_c \in \mathcal{D}_{c-1}\}\log|\mathcal{D}_{c-1}| \mid |\mathcal{D}_{c-1}|\right]\right] = \mathbb{E}\left[\mathbb{P}\left[x_c \in \mathcal{D}_{c-1} \mid |\mathcal{D}_{c-1}|\right]\log|\mathcal{D}_{c-1}|\right] = |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right]. $$
(14) follows from $\log|\mathcal{Y}| \leq p$, since $p = \lceil \log|\mathcal{Y}| \rceil$. The inequality in (15) follows from Jensen's inequality, since $x \log x$ is a convex function.
Finally, the equality in (16) comes from substituting (11) and the fact that
$$ \mathbb{E}\left[|\mathcal{D}_{c-1}|\right] = \sum_{i=1}^{|\mathcal{X}|} \mathbb{P}\left[x_i \in \mathcal{D}_{c-1}\right] = \sum_{i=1}^{|\mathcal{X}|} \left(1 - \mathbb{P}\left[x_i \notin \mathcal{D}_{c-1}\right]\right) = \sum_{i=1}^{|\mathcal{X}|} \left(1 - \left(1-|\mathcal{X}|^{-1}\right)^{c-1}\right) = |\mathcal{X}|\left(1 - \left(1-|\mathcal{X}|^{-1}\right)^{c-1}\right). $$
Equivalently, the value can be bounded from above:
$$ R_G(C) = \sum_{c=1}^{C} \mathbb{E}\left[ \mathbb{I}\{x_c \notin \mathcal{D}_{c-1}\}(k+p+1) + \mathbb{I}\{x_c \in \mathcal{D}_{c-1}\}\left(l(\mathcal{D}_{c-1})+p+1\right) \right] $$
$$ \leq \sum_{c=1}^{C} \mathbb{E}\left[ \mathbb{I}\{x_c \notin \mathcal{D}_{c-1}\}(k+p+1) + \mathbb{I}\{x_c \in \mathcal{D}_{c-1}\}\left(\log|\mathcal{D}_{c-1}|+1+p+1\right) \right] \qquad (17) $$
$$ \leq C(p+2) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) \qquad (18) $$
$$ \leq C(\log|\mathcal{Y}|+3) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) \qquad (19) $$
$$ \leq C(\log|\mathcal{Y}|+3) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\min\left\{(c-1)\log(c-1),\ |\mathcal{X}|\log|\mathcal{X}|\right\} \right) \qquad (20) $$
$$ = C(\log|\mathcal{Y}|+3) + \sum_{c=1}^{C} \left( k\left(1-|\mathcal{X}|^{-1}\right)^{c-1} + |\mathcal{X}|^{-1}\min\left\{(c-1)\log(c-1),\ |\mathcal{X}|\log|\mathcal{X}|\right\} \right) \qquad (21) $$
where the inequality in (17) follows from $l(\mathcal{D}_{c-1}) \leq \log|\mathcal{D}_{c-1}|+1$ since $l(\mathcal{D}_{c-1}) = \lceil\log|\mathcal{D}_{c-1}|\rceil$, and (18) follows from the fact that $\mathbb{I}\{\cdot\} \leq 1$. The inequality in (19) is due to the encoding of the deviations, $p \leq \log|\mathcal{Y}|+1$, since $p = \lceil\log|\mathcal{Y}|\rceil$. The inequality in (20) follows from $|\mathcal{D}_{c-1}| \leq c-1$ and the fact that the maximum possible size of the dictionary is $|\mathcal{X}|$. Finally, (11) is substituted to get (21).

APPENDIX C
PROOF OF THEOREM

Proof.
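As in the generalized case, the final bounds (26) and (30) derived in this appendix can be evaluated numerically. A sketch, with the illustrative parameter n = 7 (so |Z| = 128, matching the running example); the function name is ours:

```python
# Numerical evaluation of the bounds (26) and (30) on R_D(C).
from math import log2

def rd_bounds(C, n=7):
    """Lower/upper bounds on the expected coded length of C chunks."""
    Z = 2 ** n
    q = 1 - 1 / Z                    # P[a given chunk is still absent], cf. (22)
    lo, hi = 1.0 * C, 2.0 * C
    for c in range(1, C + 1):
        miss = q ** (c - 1)          # P[chunk c is new]
        lo += n * miss
        frac = 1 - miss              # E[|D_{c-1}|] / |Z|
        if frac > 0:                 # convention: 0 * log 0 = 0
            lo += frac * log2(Z * frac)
        hi += n * miss
        seen = 0.0 if c == 1 else (c - 1) * log2(c - 1)
        hi += min(seen, Z * log2(Z)) / Z
    return lo, hi

lo, hi = rd_bounds(1000)
```

Comparing these values with the generalized bounds for the same source illustrates the slower convergence of classic deduplication, since the dictionary must here cover all of Z rather than the much smaller X.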
The proof for the special case of deduplication naturally follows the same steps, but considers Z = X, with Y containing only the all-zero chunk. Because of this, deviations can be represented with exactly 0 bits, so the step bounding their cost can be skipped. For completeness, the full proof is given. Since chunks are drawn from Z uniformly at random, the probability that chunk (= base) c is not already in the dictionary is
$$ \mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] = \left(1-|\mathcal{Z}|^{-1}\right)^{c-1}. \qquad (22) $$
The expected coded length can be bounded from below as:
$$ R_D(C) = \sum_{c=1}^{C} \mathbb{E}\left[ \mathbb{I}\{z_c \notin \mathcal{D}_{c-1}\}(n+1) + \mathbb{I}\{z_c \in \mathcal{D}_{c-1}\}\left(l(\mathcal{D}_{c-1})+1\right) \right] $$
$$ \geq \sum_{c=1}^{C} \mathbb{E}\left[ \mathbb{I}\{z_c \notin \mathcal{D}_{c-1}\}(n+1) + \mathbb{I}\{z_c \in \mathcal{D}_{c-1}\}\left(\log|\mathcal{D}_{c-1}|+1\right) \right] \qquad (23) $$
$$ = C + \sum_{c=1}^{C} \left( n\,\mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) \qquad (24) $$
$$ \geq C + \sum_{c=1}^{C} \left( n\,\mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\right]\log\mathbb{E}\left[|\mathcal{D}_{c-1}|\right] \right) \qquad (25) $$
$$ = C + \sum_{c=1}^{C} \left[ n\left(1-|\mathcal{Z}|^{-1}\right)^{c-1} + \left(1-\left(1-|\mathcal{Z}|^{-1}\right)^{c-1}\right)\log\left(|\mathcal{Z}|\left(1-\left(1-|\mathcal{Z}|^{-1}\right)^{c-1}\right)\right) \right] \qquad (26) $$
where the inequality in (23) follows from $\log|\mathcal{D}_{c-1}| \leq l(\mathcal{D}_{c-1})$ since $l(\mathcal{D}_{c-1}) = \lceil\log|\mathcal{D}_{c-1}|\rceil$. The equality in (24) uses that
$$ \mathbb{E}\left[\mathbb{I}\{z_c \in \mathcal{D}_{c-1}\}\log|\mathcal{D}_{c-1}|\right] = \mathbb{E}\left[\mathbb{E}\left[\mathbb{I}\{z_c \in \mathcal{D}_{c-1}\}\log|\mathcal{D}_{c-1}| \mid |\mathcal{D}_{c-1}|\right]\right] = \mathbb{E}\left[\mathbb{P}\left[z_c \in \mathcal{D}_{c-1} \mid |\mathcal{D}_{c-1}|\right]\log|\mathcal{D}_{c-1}|\right] = |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right]. $$
The inequality in (25) follows from Jensen's inequality, since $x \log x$ is a convex function.
Finally, the equality in (26) comes from substituting (22) and the fact that
$$ \mathbb{E}\left[|\mathcal{D}_{c-1}|\right] = \sum_{i=1}^{|\mathcal{Z}|} \mathbb{P}\left[z_i \in \mathcal{D}_{c-1}\right] = \sum_{i=1}^{|\mathcal{Z}|} \left(1-\mathbb{P}\left[z_i \notin \mathcal{D}_{c-1}\right]\right) = \sum_{i=1}^{|\mathcal{Z}|} \left(1-\left(1-|\mathcal{Z}|^{-1}\right)^{c-1}\right) = |\mathcal{Z}|\left(1-\left(1-|\mathcal{Z}|^{-1}\right)^{c-1}\right). $$
The expected cost can also be bounded from above:
$$ R_D(C) = \sum_{c=1}^{C} \mathbb{E}\left[ \mathbb{I}\{z_c \notin \mathcal{D}_{c-1}\}(n+1) + \mathbb{I}\{z_c \in \mathcal{D}_{c-1}\}\left(l(\mathcal{D}_{c-1})+1\right) \right] $$
$$ \leq \sum_{c=1}^{C} \mathbb{E}\left[ \mathbb{I}\{z_c \notin \mathcal{D}_{c-1}\}(n+1) + \mathbb{I}\{z_c \in \mathcal{D}_{c-1}\}\left(\log|\mathcal{D}_{c-1}|+1+1\right) \right] \qquad (27) $$
$$ \leq 2C + \sum_{c=1}^{C} \left( n\,\mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) \qquad (28) $$
$$ \leq 2C + \sum_{c=1}^{C} \left( n\,\mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{Z}|^{-1}\min\left\{(c-1)\log(c-1),\ |\mathcal{Z}|\log|\mathcal{Z}|\right\} \right) \qquad (29) $$
$$ = 2C + \sum_{c=1}^{C} \left( n\left(1-|\mathcal{Z}|^{-1}\right)^{c-1} + |\mathcal{Z}|^{-1}\min\left\{(c-1)\log(c-1),\ |\mathcal{Z}|\log|\mathcal{Z}|\right\} \right) \qquad (30) $$
where the inequality in (27) follows from $l(\mathcal{D}_{c-1}) \leq \log|\mathcal{D}_{c-1}|+1$ since $l(\mathcal{D}_{c-1}) = \lceil\log|\mathcal{D}_{c-1}|\rceil$, and (28) follows from the fact that $\mathbb{I}\{\cdot\} \leq 1$. The final inequality in (29) follows from $|\mathcal{D}_{c-1}| \leq c-1$ and the fact that the maximum possible size of the dictionary is $|\mathcal{Z}|$. Finally, (22) is substituted to get (30).

APPENDIX D
PROOF OF THEOREM

Proof.
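The limits derived in this appendix (and in Appendix E for classic deduplication) can be checked numerically: under the Jensen-relaxed lower bounds, the per-chunk increment approaches 1 + H(Z) in both cases, but generalized deduplication gets there far sooner. A sketch with the illustrative parameters k = 4, |X| = 16, |Y| = 8 and n = 7, so that |Z| = 128 and 1 + H(Z) = 8 bits for both schemes:

```python
# Per-chunk cost increments Delta R^{C+1} under the Jensen-relaxed
# lower bounds (15)/(25): generalized deduplication approaches the
# asymptote 1 + H(Z) much faster than classic deduplication.
from math import log2

def delta_g(C, k=4, X=16, Y=8):
    """Lower-bound increment for chunk C+1, generalized deduplication."""
    miss = (1 - 1 / X) ** C
    frac = 1 - miss
    return log2(Y) + 1 + k * miss + (frac * log2(X * frac) if frac else 0.0)

def delta_d(C, n=7):
    """Lower-bound increment for chunk C+1, classic deduplication."""
    Z = 2 ** n
    miss = (1 - 1 / Z) ** C
    frac = 1 - miss
    return 1 + n * miss + (frac * log2(Z * frac) if frac else 0.0)

# After 200 chunks the generalized scheme is essentially at 8 bits per
# chunk, while classic deduplication is still noticeably away from it.
g, d = delta_g(200), delta_d(200)
```

The gap arises because the generalized dictionary only has to fill the small base set X, while the classic dictionary must fill all of Z.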
We use the bounds on the coded sequence length to determine the asymptotic cost of each additional chunk. First, the lower bound is proven, by assuming a best-case source that follows the lower bound on the coded sequence length. From the derivation of the lower bound of $R_G(C)$, (14) is restated:
$$ R_G(C) \geq C(\log|\mathcal{Y}|+1) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right). $$
By the definition of $\Delta R_G^{C+1} = R_G(C+1) - R_G(C)$, a lower bound on the expected coded length of chunk $C+1$ can be found from (14):
$$ \Delta R_G^{C+1} \geq (C+1)(\log|\mathcal{Y}|+1) + \sum_{c=1}^{C+1} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) - \left( C(\log|\mathcal{Y}|+1) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) \right) $$
$$ = \log|\mathcal{Y}| + 1 + k\,\mathbb{P}\left[x_{C+1} \notin \mathcal{D}_C\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_C|\log|\mathcal{D}_C|\right], $$
and then the limit is
$$ \Delta R_G^{\infty} \geq \lim_{C\to\infty} \left( \log|\mathcal{Y}| + 1 + k\,\mathbb{P}\left[x_{C+1} \notin \mathcal{D}_C\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_C|\log|\mathcal{D}_C|\right] \right) $$
$$ = \log|\mathcal{Y}| + 1 + \lim_{C\to\infty} k\,\mathbb{P}\left[x_{C+1} \notin \mathcal{D}_C\right] + \lim_{C\to\infty} |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_C|\log|\mathcal{D}_C|\right] $$
$$ = \log|\mathcal{Y}| + 1 + |\mathcal{X}|^{-1}\,|\mathcal{X}|\log|\mathcal{X}| \qquad (31) $$
$$ = \log|\mathcal{Y}| + 1 + \log|\mathcal{X}| = H(Y) + 1 + H(X) \qquad (32) $$
$$ = 1 + H(Z), \qquad (33) $$
where the equality in (31) uses that all $x_c$ have non-zero probability, so the probability of not having any specific base in the dictionary goes to 0, and the dictionary converges to the entire set of possible bases, $\mathcal{X}$. The fact that, by assumption, $\mathcal{Z} = \mathcal{X} \oplus \mathcal{Y}$ with non-overlapping spheres means that drawing chunks uniformly from $\mathcal{Z}$ is equivalent to drawing uniformly distributed elements from $\mathcal{X}$ and $\mathcal{Y}$, so the relations $\log|\mathcal{Y}| = H(Y)$, $\log|\mathcal{X}| = H(X)$, and $H(Z) = H(X) + H(Y)$ hold. This is used for (32) and (33). Finally, a similar argument can be made for the upper bound, by assuming a worst-case source that follows the upper bound on the coded sequence length. (19) is restated from the earlier derivation of the upper bound on $R_G(C)$:
$$ R_G(C) \leq C(\log|\mathcal{Y}|+3) + \sum_{c=1}^{C} \left( k\,\mathbb{P}\left[x_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{X}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right), $$
and, following the exact same steps as for the lower bound, the result is found to be $\Delta R_G^{\infty} \leq 3 + H(Z)$, concluding the proof.

APPENDIX E
PROOF OF THEOREM

Proof.
The proof for the special case of deduplication follows the same structure as the generalized version. First, the lower bound is proven, by assuming a best-case source that follows the lower bound on the coded sequence length. (24) is restated from the derivation of the lower bound of $R_D(C)$:
$$ R_D(C) \geq C + \sum_{c=1}^{C} \left( n\,\mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right). $$
By the definition of $\Delta R_D^{C+1} = R_D(C+1) - R_D(C)$, a lower bound on the expected coded length of chunk $C+1$ can be found from (24):
$$ \Delta R_D^{C+1} \geq C + 1 + \sum_{c=1}^{C+1} \left( n\,\mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) - \left( C + \sum_{c=1}^{C} \left( n\,\mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right) \right) $$
$$ = 1 + n\,\mathbb{P}\left[z_{C+1} \notin \mathcal{D}_C\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_C|\log|\mathcal{D}_C|\right]. $$
The limit can now be evaluated:
$$ \Delta R_D^{\infty} \geq \lim_{C\to\infty} \left( 1 + n\,\mathbb{P}\left[z_{C+1} \notin \mathcal{D}_C\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_C|\log|\mathcal{D}_C|\right] \right) $$
$$ = 1 + \lim_{C\to\infty} n\,\mathbb{P}\left[z_{C+1} \notin \mathcal{D}_C\right] + \lim_{C\to\infty} |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_C|\log|\mathcal{D}_C|\right] $$
$$ = 1 + |\mathcal{Z}|^{-1}\,|\mathcal{Z}|\log|\mathcal{Z}| \qquad (34) $$
$$ = 1 + \log|\mathcal{Z}| = 1 + H(Z), \qquad (35) $$
where (34) uses that all $z_c$ have non-zero probability, so the probability of not having encountered any specific one before goes to 0, and that the maximum size of the dictionary is $|\mathcal{Z}|$. Finally, $\log|\mathcal{Z}| = H(Z)$ in (35) due to the uniform distribution. An equivalent argument can be made for the upper bound, by assuming a worst-case source that follows the upper bound on the coded sequence length. (28) is restated from the earlier derivation of the upper bound on $R_D(C)$:
$$ R_D(C) \leq 2C + \sum_{c=1}^{C} \left( n\,\mathbb{P}\left[z_c \notin \mathcal{D}_{c-1}\right] + |\mathcal{Z}|^{-1}\,\mathbb{E}\left[|\mathcal{D}_{c-1}|\log|\mathcal{D}_{c-1}|\right] \right). $$
By repeating exactly the same steps as for the lower bound, the result is found to be $\Delta R_D^{\infty} \leq 2 + H(Z)$, concluding the proof.

ACKNOWLEDGMENTS
This work was partially financed by the SCALE-IoT project (Grant No. DFF-7026-00042B) granted by the Danish Council for Independent Research, the AUFF Starting Grant AUFF-2017-FLS-7-1, and Aarhus University's DIGIT Centre.

REFERENCES
[1] W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang, and Y. Zhou, "A Comprehensive Study of the Past, Present, and Future of Data Deduplication," Proc. IEEE, vol. 104, no. 9, pp. 1681–1710, 2016.
[2] A. El-Shimi, R. Kalach, A. Kumar, A. Ottean, J. Li, and S. Sengupta, "Primary Data Deduplication—Large Scale Study and System Design," in USENIX ATC, 2012, pp. 285–296.
[3] D. T. Meyer and W. J. Bolosky, "A Study of Practical Deduplication," ACM Trans. Storage, vol. 7, no. 4, pp. 1–20, 2012.
[4] U. Niesen, "An Information-Theoretic Analysis of Deduplication," in IEEE ISIT, 2017, pp. 1738–1742.
[5] R. Vestergaard, D. E. Lucani, and Q. Zhang, "Generalized Deduplication: Lossless Compression for Large Amounts of Small IoT Data," in European Wireless Conf., Aarhus, Denmark, May 2019.
[6] R. Vestergaard, Q. Zhang, and D. E. Lucani, "Lossless Compression of Time Series Data with Generalized Deduplication," in IEEE GLOBECOM, Waikoloa, USA, Dec. 2019.
[7] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337–343, 1977.
[8] ——, "Compression of Individual Sequences via Variable-Rate Coding," IEEE Trans. Inf. Theory, vol. 24, no. 5, pp. 530–536, 1978.
[9] A. Orlitsky, N. Santhanam, and J. Zhang, "Universal Compression of Memoryless Sources Over Unknown Alphabets," IEEE Trans. Inf. Theory, vol. 50, no. 7, pp. 1469–1481, 2004.
[10] J. Aberg, Y. M. Shtarkov, and B. J. M. Smeets, "Multialphabet Coding with Separate Alphabet Description," in IEEE SEQUENCES, 1997, pp. 56–65.
[11] E. Süli and D. F. Mayers, An Introduction to Numerical Analysis. Cambridge University Press, Cambridge, 2003.
[12] J.-L. Gailly and M. Adler, "zlib compression library." [Online]. Available: zlib.net
[13] D. Salomon and G. Motta, Handbook of Data Compression. Springer, London, 2010.
[14] D. Huffman, "A Method for the Construction of Minimum-Redundancy Codes,"