A Normal Sequence Compressed by PPM ∗ but not by Lempel-Ziv 78
Liam Jordon ([email protected]) and Philippe Moser ([email protected])
Dept. of Computer Science, Maynooth University, Maynooth, Co. Kildare, Ireland.

Abstract
In this paper we compare the difference in performance of two of the Prediction by Partial Matching (PPM) family of compressors (PPM∗ and the original Bounded PPM algorithm) and the Lempel-Ziv 78 (LZ) algorithm. We construct an infinite binary sequence whose worst-case compression ratio for PPM∗ is 0, while Bounded PPM's and LZ's best-case compression ratios are at least 1/2 and 1 respectively. The sequence is an enumeration of all binary strings, built from de Bruijn strings of increasing order.

Keywords: compression algorithms, Lempel-Ziv algorithm, Prediction by Partial Matching algorithms, normality

A normal number in base b, as defined by Borel [10], is a real number whose infinite expansion in that base is such that for all block lengths n, every string of digits in base b of length n occurs as a substring of the expansion with limiting frequency b^(−n). In this paper we restrict ourselves to examining normal binary sequences, i.e. normal numbers in base 2.

A common question studied about normal sequences is whether or not they are compressible by certain families of compressors. Results by Schnorr and Stimm [16] and Dai, Lathrop, Lutz and Mayordomo [8] demonstrate that lossless finite-state transducers (FSTs) cannot compress normal sequences. Becher, Carton and Heiber [1] explore what happens to the compressibility of normal sequences in various scenarios, such as when the FST has access to one or more counters or a stack, and when the transducer is not required to run in real-time nor be deterministic. Carton and Heiber [5] show that deterministic and non-deterministic two-way FSTs cannot compress normal sequences. Among other compression algorithms, Lathrop and Strauss [12] have shown that there exists a normal sequence that the Lempel-Ziv 78 (LZ) algorithm can compress.

In this paper we focus on the performance of the Prediction by Partial Matching (PPM) compression algorithm, which was introduced by Cleary and Witten [7]. PPM works by building an adaptive statistical model of its input as it reads each character.
The model keeps track of previously seen substrings of the input, known as contexts, and the characters that follow them. When encoding the next character, the encoder examines the relevant contexts currently in the model. These relevant contexts are the suffixes of the already encoded part of the input that have been stored in the model. The next character is then encoded based on its frequency counts in the relevant contexts. The model is updated after each character is encoded: the frequency counts of the seen character are updated in the relevant contexts and, if needed, new contexts are added to the model. The prediction probabilities produced for each character are used to encode the sequence via arithmetic encoding [17].

(∗ Supported by a postgraduate scholarship from the Irish Research Council.)

In the original PPM (Bounded PPM), prior to encoding the input, a value k ∈ N must be provided to the encoder which sets the maximum length of a context the model can store. Studies have examined which value of k achieves the best compression. One may think the larger the k, the better the compression; however, increasing k above 5 does not generally improve compression [7]. Over a decade later, a new version of PPM was introduced, called PPM∗ [6]. This version of the algorithm sets no upper bound on the length of contexts the model can keep track of.

Inspired by Mayordomo, Moser and Perifel [14], which compares the best-case and worst-case compression ratios of various compression algorithms on certain sequences, in this paper we construct a normal sequence S and compare how it is compressed by PPM∗, Bounded PPM and LZ. PPM∗ can compress S with a worst-case compression ratio of 0. We also show that no matter what upper bound k is chosen, Bounded PPM's best-case compression ratio on S is at least 1/2. Also, LZ has a best-case compression ratio of 1 on S, i.e. S cannot be compressed by LZ.

S is constructed such that it is an enumeration of all binary strings in order of length, i.e. all strings of length 1 followed by all strings of length 2 and so on. For instance, 0100011011 is an enumeration of all strings up to length 2. Such sequences cannot be compressed by LZ, which in turn means they cannot be compressed by any FST [18]. Thus S is normal. This enumeration is achieved via repetitions of de Bruijn strings, which PPM∗ can exploit to compress S.

Some proofs are omitted from the main body of the paper due to space constraints. These are all contained in the appendix provided.

N denotes the set of non-negative integers. A finite binary string is an element of {0, 1}∗. A binary sequence is an element of {0, 1}^ω. The length of a string x is denoted by |x|. λ denotes the empty string, i.e. the string of length 0. For all n ∈ N, {0, 1}^n denotes the set of binary strings of length n. For a string (or sequence) x and i, j ∈ N with i ≤ j, x[i..j] denotes the i-th through j-th bits of x, with the convention that if j < i then x[i..j] = λ. For a string x and string (sequence) y, xy (occasionally denoted by x · y) denotes the string (sequence) of x concatenated with y. For a string x and n ∈ N, x^n denotes x concatenated with itself n times. For strings x, y and string (sequence) z, if w = xyz, we say y is a substring of w, x is a prefix of w (sometimes denoted by x ⊑ w), and if z is a string, then z is a suffix of w. For a sequence S and n ∈ N, S ↾ n denotes the prefix of S of length n, i.e. S ↾ n = S[0..n − 1].
The lexicographic ordering of {0, 1}∗ is defined by saying that for two strings x, y, x is less than y if either |x| < |y|, or else |x| = |y| with x[n] = 0 and y[n] = 1 for the least n such that x[n] ≠ y[n].

Given a sequence S and a function T : {0, 1}∗ → {0, 1}∗, the best-case and worst-case compression ratios of T on S are given by

    ρ_T(S) = liminf_{n→∞} |T(S ↾ n)| / n   and   R_T(S) = limsup_{n→∞} |T(S ↾ n)| / n

respectively.

Given strings x, w, we use the following notation to count the number of times w occurs as a substring of x.

1. The number of occurrences of w as a substring of x is given by

       occ(w, x) = |{u ∈ {0, 1}∗ : uw ⊑ x}|.
2. The block number of occurrences of w as a substring of x is given by

       occ_b(w, x) = |{i : x[i..i + |w| − 1] = w, i ≡ 0 (mod |w|)}|.

A sequence S is said to be normal, as defined by Borel [10], if for all w ∈ {0, 1}∗,

    lim_{n→∞} occ(w, S ↾ n) / n = 2^(−|w|).

When we say that a sequence S is an enumeration of all strings, we mean that S can be broken into substrings S = S_1 S_2 S_3 . . . such that for each n, S_n is a concatenation of all strings of length n, with each string occurring once. That is, for all w ∈ {0, 1}^n, occ_b(w, S_n) = 1. Note that |S_n| = n·2^n.

Before we begin, we note that implementations of the PPM algorithm family implement what is known as the exclusion principle to achieve better compression ratios. We ignore this in our implementation for simplicity, as even without it, the sequence we later build achieves a compression ratio of 0 via PPM∗.

Bounded PPM

In the original presentation of PPM in 1984 [7], a bounded version is introduced. Prior to encoding the input sequence, a value k ∈ N must be provided to the encoder which sets the maximum context length the model keeps track of. As such, we refer to this version as Bounded PPM and denote Bounded PPM with bound k as PPM_k. By context, we mean a previously seen substring of the input stream contained in the model. For each context, the model records which characters have followed the context in the input stream, and the frequency with which each character has occurred. These frequencies are used to build the prediction probabilities that the encoder uses to encode the rest of the input stream. When reading the next bit of the input stream, the encoder examines the longest relevant context each time and encodes the current character based on its current prediction probability in that context. By relevant contexts, we mean the suffixes of the input stream already read by the encoder that are contained in the model. The longest relevant context available is referred to as the current context, as it is the one the model uses to first encode the next character seen.
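The context-tracking just described can be sketched in a few lines. The following is a minimal illustration with hypothetical helper names of our own (not the paper's implementation): it records, for every context of length at most k, the frequency of each character that followed it, and then selects the longest relevant context.

```python
from collections import defaultdict, Counter

def build_model(stream, k):
    """Frequency table for every context of length <= k seen in the stream:
    model[context][c] = number of times character c followed that context."""
    model = defaultdict(Counter)
    for i in range(len(stream)):
        # contexts of length 0..k ending just before position i
        for j in range(max(0, i - k), i + 1):
            model[stream[j:i]][stream[i]] += 1
    return model

def current_context(model, history, k):
    """Longest suffix of the history, of length <= k, stored in the model."""
    for m in range(min(k, len(history)), -1, -1):
        if history[len(history) - m:] in model:
            return history[len(history) - m:]
    return ""

model = build_model("0100110110", k=3)
assert model["01"] == Counter({"1": 2, "0": 1})   # context 01 has seen both bits
assert model["101"] == Counter({"1": 1})          # 101 has seen only a 1
```

On the running example 0100110110 with k = 3, the longest relevant context is 110, matching the hand-worked example in the text.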
Once encoded, the model is updated to include new contexts if necessary, and the prediction probabilities of the relevant contexts are updated to reflect the character that has just been read.

A problem occurs if the character being encoded has never occurred previously in the current context. When this happens, an escape symbol (denoted by $) is transmitted and the next shortest relevant context becomes the new current context. If the character has not been seen before even when the current context is λ, that is, the context where none of the previous bits are used to predict the next character, an escape is outputted and the character is assigned the prediction probability from the order-(−1) table. By convention, this table contains all characters in the alphabet being used and assigns each character equal probability.

A common question is what probabilities are assigned to these escape symbols. This paper uses
Method C proposed by Moffat [15]. Here, the escape symbol is given a frequency equal to the number of distinct characters predicted in the context so far. Hence in our case it will always have a count of 1 or 2. For instance, Table 1 shows the model for the string 0100110110 with bound k = 3. In the context 01, the escape symbol $ has count 2, as both 0 and 1 have been seen, while $ in 101 has count 1, as only 1 has been seen. A context is said to be deterministic if it has an escape count frequency of 1.

For example, suppose 0 is the next character to be encoded after the input stream 0100110110 by PPM_3, whose model is shown in Table 1. The relevant contexts are 110, 10, 0 and λ. The longest relevant context is 110. Since 0 has not been seen in context 110, the encoder escapes to the shorter context 10, the escape being encoded with its prediction probability 1/2. From 10, 0 is encoded with probability 1/4. The frequency counts of 0 are then updated in the 10, 0 and empty contexts. Also, 0 is added as a prediction to context 110. Following this, if the next character to be encoded was another 0, the model would start in context 100, and since 0 is not predicted there, it would transmit an escape symbol with probability 1/2 and then examine the next longest context 00 and proceed as necessary. If there was another bit b in the input stream after this, then, as 000 would be the current suffix of the input stream but no context for 000 exists yet (it has never been seen before), the encoder would begin in context 00 and proceed as before, and a context for 000 would be created predicting the character b when the model updates.

PPM∗

PPM∗ encodes its input very similarly to Bounded PPM in that it builds a model of contexts of the sequence, continuously updates the model, and encodes each character it sees based on its frequency probability in the current context. However, there are some key differences.
As there is no upper bound on the maximum context length stored in the model, instead of building a context for every substring seen in the input, a context is only extended until it is unique. Suppose that the prefix x of the input stream to PPM∗ has been read so far. This means that for a string w ∈ {0, 1}∗, if occ(w, x) ≥ 2, a context wb must be built in the model for each bit b that follows w in x. When examining all relevant contexts to choose the first current context used to encode the next bit, unlike Bounded PPM, which chooses the longest context, PPM∗ chooses the shortest deterministic context. Here, as before, a context is said to be deterministic if it has an escape count frequency of 1. If no such context exists, the longest is chosen as with Bounded PPM. We also use the Method C approach to computing escape probabilities for PPM∗.

Table 1: PPM_3 model for the input 0100110110 (Method C escape counts; p denotes the prediction probability).

  Order k = 3:
    001: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    010: 0 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    011: 0 (cnt 2, p 2/3), $ (cnt 1, p 1/3)
    100: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    101: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    110: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
  Order k = 2:
    00: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    01: 0 (cnt 1, p 1/5), 1 (cnt 2, p 2/5), $ (cnt 2, p 2/5)
    10: 0 (cnt 1, p 1/4), 1 (cnt 1, p 1/4), $ (cnt 2, p 1/2)
    11: 0 (cnt 2, p 2/3), $ (cnt 1, p 1/3)
  Order k = 1:
    0: 0 (cnt 1, p 1/6), 1 (cnt 3, p 1/2), $ (cnt 2, p 1/3)
    1: 0 (cnt 3, p 3/7), 1 (cnt 2, p 2/7), $ (cnt 2, p 2/7)
  Order k = 0:
    λ: 0 (cnt 5, p 5/12), 1 (cnt 5, p 5/12), $ (cnt 2, p 1/6)
  Order k = −1:
    0 (p 1/2), 1 (p 1/2)

The following is a full example of a model being updated. Suppose an input stream of s = 0100110110 has already been read. The model for this is seen in Table 2. Say the next bit read is a 0. The relevant contexts are the empty context, 0, 10, 110 and 0110. The shortest deterministic context is 110. It does not predict a 0, so an escape is transmitted with probability 1/2, and then 0 is transmitted from the context 10 with probability 1/4. The model is then updated as follows. The empty context, 0 and 10 all predict a 0, so their counts are updated. 110 and 0110 do not predict a 0, so it is added as a prediction to each. Furthermore, the substrings 00 and 100 are no longer unique in s·0. That is,

    occ(00, s) = occ(100, s) = 1   while   occ(00, s·0) = occ(100, s·0) = 2.

These contexts must therefore be extended to create the new contexts 001 and 1001, since 1 is what follows 00 and 100 in s. These contexts both predict 1. If another 0 is read after s·0, then, since both a 0 and a 1 have now been seen to follow 110 and 0110, contexts for 1100 and 01100 will be created, both predicting a 0, since a context has to be made for each branching path of 110 and 0110 (1101 and 01101 already exist).
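The unique-extension rule above can be checked mechanically. The sketch below is illustrative only (the function names are ours, and it rebuilds the context set from scratch rather than updating it online as a real PPM∗ encoder would): it computes which contexts PPM∗ stores for a fully read input, following the rule that wb is built for every substring w with occ(w, x) ≥ 2 and every bit b that follows w.

```python
def occ(w, x):
    """Number of occurrences of w as a substring of x."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x[i:i + len(w)] == w)

def ppm_star_contexts(x):
    """Contexts PPM* keeps for input x: the empty context, plus wb for every
    substring w with occ(w, x) >= 2 and every bit b that follows w in x."""
    ctxs = {""}
    for length in range(len(x)):
        for i in range(len(x) - length):
            w = x[i:i + length]
            if occ(w, x) >= 2 and i + length < len(x):
                ctxs.add(w + x[i + length])
    return ctxs

ctxs = ppm_star_contexts("0100110110")
assert "0110" in ctxs and "01101" in ctxs   # non-unique substrings get extended
assert "000" not in ctxs                    # 00 occurs only once, so no extension
```

On s = 0100110110 this reproduces the context set of Table 2: for example 1101 and 01101 are present, while 001 and 1001 only appear once the extra 0 of the worked example is read.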
The final output of the PPM encoder is a real number in the interval [0, 1) found via arithmetic encoding [2, 17]. The arithmetic encoder begins with the interval [0, 1) and narrows it as each character is encoded, in proportion to the character's prediction probability. A number c is transmitted such that c ∈ [a, b), where [a, b) is the final interval; c can be encoded in −⌈log(|b − a|)⌉ bits. At most 1 bit of overhead is required. With c and the length of the original sequence to be encoded, the decoder can recover the original sequence.

Table 2: PPM∗ model for the input 0100110110 (Method C escape counts; p denotes the prediction probability).

  Order k = 5:
    01101: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
  Order k = 4:
    0110: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    1101: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
  Order k = 3:
    010: 0 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    011: 0 (cnt 2, p 2/3), $ (cnt 1, p 1/3)
    100: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    101: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    110: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
  Order k = 2:
    00: 1 (cnt 1, p 1/2), $ (cnt 1, p 1/2)
    01: 0 (cnt 1, p 1/5), 1 (cnt 2, p 2/5), $ (cnt 2, p 2/5)
    10: 0 (cnt 1, p 1/4), 1 (cnt 1, p 1/4), $ (cnt 2, p 1/2)
    11: 0 (cnt 2, p 2/3), $ (cnt 1, p 1/3)
  Order k = 1:
    0: 0 (cnt 1, p 1/6), 1 (cnt 3, p 1/2), $ (cnt 2, p 1/3)
    1: 0 (cnt 3, p 3/7), 1 (cnt 2, p 2/7), $ (cnt 2, p 2/7)
  Order k = 0:
    λ: 0 (cnt 5, p 5/12), 1 (cnt 5, p 5/12), $ (cnt 2, p 1/6)
  Order k = −1:
    0 (p 1/2), 1 (p 1/2)

For simplicity, we assume the encoder can calculate the endpoints of the intervals with infinite precision and waits until the end of the encoding to convert the fraction to its final form. In reality, a fixed finite precision is used by encoders to represent the intervals and their endpoints, and a process known as renormalisation occurs to prevent the intervals from becoming too small for the encoder to handle.
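Under the infinite-precision assumption above, the output length can be estimated directly from the prediction probabilities. The following is a simplified sketch of our own (not the paper's encoder): the final interval's width is the product of the per-character probabilities, and transmitting a point inside it costs roughly the negative log of that width.

```python
from math import ceil, log2

def arithmetic_code_length(probs):
    """Bits needed to transmit a point inside the final interval, whose width
    is the product of the per-character prediction probabilities, plus at
    most 1 bit of overhead."""
    width = 1.0
    for p in probs:  # p = probability the model assigned to the character seen
        width *= p
    return ceil(-log2(width)) + 1
```

For example, ten characters each predicted with probability 1/2 leave an interval of width 2^(−10), costing 11 bits including the overhead bit; this mirrors the −⌈log(|b − a|)⌉ bound in the text.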
PPM∗

In this section we build a normal sequence S such that R_PPM∗(S) = 0. S will be an enumeration of all binary strings, built via concatenations of de Bruijn strings. This ensures it is incompressible by the Lempel-Ziv algorithm.

de Bruijn Strings
Named after Nicolaas de Bruijn for his work from 1946 [4]: for n ∈ N, a de Bruijn string of order n is a string of length 2^n that, when viewed cyclically, contains every binary string of length n exactly once. That is, for a de Bruijn string x of order n, for all w ∈ {0, 1}^n, occ(w, x · x[0..n − 2]) = 1. For example, 00011101 is a de Bruijn string of order 3.

Henceforth, we use db(n) to denote the lexicographically least de Bruijn string of order n. Martin provided the following algorithm to build this string in 1934 [13]:

1. Write the string x = 1^(n−1).
2. While possible, append a bit (with 0 taking priority over 1) to the end of x so that every substring of length n occurs only once in x.
3. When step 2 is no longer possible, remove the prefix 1^(n−1) from x. The resulting string is db(n).

Table 3: S_6 is the concatenation of six rows of 64 bits: three copies of db(6) followed by three copies of db_1(6).

Before we proceed, we make note of the following properties of db(n).

Remark 4.1. For n ≥ 2, db(n)[0..n] = 0^n·1 and db(n)[2^n − n..2^n − 1] = 1^n; that is, db(n) begins with 0^n followed by a 1, and ends with 1^n.

Proof:
From the construction, db(n) must begin with 0^n, as 0 takes priority over 1 in step 2. This must be followed by a 1, since otherwise the string 0^n would occur twice. That db(n) has 1^n as a suffix is proven by Martin when showing when his algorithm terminates [13]. (Martin proves that step 2 is no longer possible exactly when |x| = 2^n + n − 1.)

We use the following notation for cyclic shifts of db(n). For 0 ≤ i < n, let db_i(n) denote a left shift of i bits of db(n). That is, db_i(n) = db(n)[i..2^n − 1] · db(n)[0..i − 1]. We write db(n) instead of db_0(n) when no shift occurs. db_i^j(n) denotes db_i(n) concatenated with itself j times.

The Sequence S

The infinite binary sequence S = S_1 S_2 S_3 . . . is built such that each S_n is a concatenation of all strings of length n which maximises repetitions, in order to exploit PPM∗. Maximising repetitions ensures that deterministic contexts are repeatedly used to predict bits of the sequence, thus resulting in compression.

Every n ∈ N can be written in the form n = 2^s·t, where s, t ∈ N and t is odd. We set S_n = B_{n,0} · B_{n,1} · · · B_{n,2^s − 1}, where B_{n,i} = db_i^t(n). Each B_{n,i} is called the i-th block of S_n. Note that if n is odd then S_n = db(n)^n, and if n is a power of 2 then S_n = db_0(n) · db_1(n) · · · db_{n−1}(n). To help visualise this, Table 3 shows how S_6 is built.

The following lemma states that S is in fact an enumeration of all binary strings, and hence normal. This is the property which later ensures that S is Lempel-Ziv incompressible.

Lemma 4.2. For each n ∈ N and each w ∈ {0, 1}^n, occ_b(w, S_n) = 1.

Proof:
Consider the cyclic group of order 2^n, C_n = ⟨x | x^(2^n) = e⟩, where e = x^0 is the identity element and x is the generator of the group. There exists a bijective mapping f : C_n → {0, 1}^n such that for 0 ≤ a < 2^n, x^a is mapped to the substring of db(n) of length n beginning in position a, viewed cyclically. That is, f(e) = db(n)[0..n − 1], f(x) = db(n)[1..n], . . . , f(x^(2^n − 1)) = db(n)[2^n − 1] · db(n)[0..n − 2].

Let s, t ∈ N be such that t is odd and n = 2^s·t. Consider the subgroup ⟨x^n⟩ of C_n. From group theory it follows that |⟨x^n⟩| = 2^n / gcd(2^n, n) = 2^n / 2^s = 2^(n−s). So

    ⟨x^n⟩ = ⋃_{i=0}^{2^(n−s) − 1} {x^(in mod 2^n)} = {e, x^n, x^(2n), . . . , x^((2^(n−s) − 1)n mod 2^n)}.

Concatenating the result of applying f to each element of ⟨x^n⟩, beginning with e and following the natural order, gives the string

    σ = f(e) · f(x^n) · f(x^(2n)) · · · f(x^((2^(n−s) − 1)n mod 2^n)).

σ can be thought of as beginning with the string 0^n and cycling through db(n) in blocks of size n until the block 1^n is seen. As 2^(n−s)·n / 2^n = t, we have that σ = db^t(n) = B_{n,0}.

As |C_n| / |⟨x^n⟩| = 2^s, there are 2^s cosets of ⟨x^n⟩ in C_n. As the cosets are disjoint, each represents a different set of 2^(n−s) strings of {0, 1}^n. Specifically, each coset represents a block B_{n,i} = db_i^t(n). Therefore, for each y ∈ {0, 1}^n, occ_b(y, B_{n,i}) = 1 for exactly one i ∈ {0, . . . , 2^s − 1}, and occ_b(y, B_{n,j}) = 0 for each j ≠ i. Thus S_n is an enumeration of {0, 1}^n.

We proceed by examining some basic properties of each S_n section of S for n large. Henceforth, we write S ↾ S_n to denote S_1 · · · S_n.

Suppose the encoder has already processed S ↾ S_{n−1}, so the next bit to be processed is the first bit of S_n. While the encoder's model may contain contexts of length n after processing S ↾ S_{n−1}, the following lemma shows that it will contain all possible contexts of length n after reading the first 2^n + n bits of S_n.
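Martin's algorithm and the block-enumeration property of Lemma 4.2 are easy to check computationally. The sketch below is our own code with assumed helper names: it generates db(n) greedily and verifies that S_n, read in aligned blocks of length n, lists every string of {0, 1}^n exactly once.

```python
def db(n):
    """Lexicographically least de Bruijn string of order n via Martin's algorithm."""
    x = "1" * (n - 1)
    while True:
        for b in "01":                    # 0 takes priority over 1
            if (x + b)[-n:] not in x:     # keep every length-n substring unique
                x += b
                break
        else:                             # no bit can be appended: step 2 is done
            break
    return x[n - 1:]                      # remove the prefix 1^(n-1)

def S_section(n):
    """S_n = B_{n,0} ... B_{n,2^s-1} where n = 2^s * t (t odd), B_{n,i} = db_i(n)^t."""
    s, t = 0, n
    while t % 2 == 0:
        s, t = s + 1, t // 2
    d = db(n)
    return "".join((d[i:] + d[:i]) * t for i in range(2 ** s))

# Lemma 4.2: read block-aligned, S_n lists every string of {0,1}^n exactly once.
n = 6
Sn = S_section(n)
blocks = [Sn[i:i + n] for i in range(0, len(Sn), n)]
assert len(Sn) == n * 2 ** n
assert sorted(blocks) == sorted(format(v, f"0{n}b") for v in range(2 ** n))
```

For instance, db(2) = 0011 and db(3) = 00010111, and db(n) visibly begins with 0^n·1 and ends with 1^n as in Remark 4.1.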
The idea is that in the first 2^n + n − 1 bits of S_n, each x ∈ {0, 1}^(n−1) occurs at least twice, so x is not unique and contexts for each of its branching paths must be created, i.e. contexts for x0 and x1.

Lemma 4.3. Let n ≥ 2 and suppose the encoder has already processed S ↾ S_{n−1}. The encoder's model will contain contexts for all w ∈ {0, 1}^n once it has processed S_n[0..(2^n + n − 1)], i.e. the first 2^n + n bits of S_n.

Proof: [Lemma 4.3] Consider x = S_n[0..2^n + (n − 1) − 1] = db(n) · 0^(n−1) (by Remark 4.1). By the definition of de Bruijn strings, for all w ∈ {0, 1}^n, occ(w, x) = 1. Hence, for all v ∈ {0, 1}^(n−1), occ(v0, x) = occ(v1, x) = 1. As occ(v, x) ≥ 2 (and occ(0^(n−1), x) = 3), a context for v will have been created, and as v is not unique in x, contexts for each of its branching paths have to be created, namely v0 and v1. However, one more bit is required to finish building the context given by the last n bits of x, as the model cannot build a context until it can say what it predicts. Hence |x| + 1 = 2^n + n bits are needed in total.

The Bad Zone
For each S_n, its first 2^n + 2n bits are referred to as the bad zone. Here we make no assumption about the contexts being used and assume the worst possible compression. The hope is that after the first 2^n + 2n bits of S_n are encoded, either the contexts used to predict S_n[0..2^n + 2n − 1] will have been deterministic and will continue to correctly predict the remaining bits of S_n, or new deterministic contexts will have been created that correctly predict the remaining bits of S_n. Unfortunately this may not always occur. This commonly happens when the original contexts used straddle two S_i sections.

For instance, consider the second copy of db(7) in S_7 = db(7)^7, and suppose the encoder is predicting the 1 that follows its initial run 0^7. The context 0^7 of length 7 may be used. However, 0^7 is not a deterministic context: since S_6 ends with 10 and S_7 begins with 0^7·1, the straddle between the two sections creates the substring 10^8·1, so 0^7 has been seen followed by a 0 (at the straddle) as well as by a 1 (in the first copy of db(7)). Hence the context 10^7, of length 8, may be used instead. We know it exists, as occ(10^6, S ↾ S_6 · S_7[0..6]) ≥ 2 (10^6 occurs inside S_6 between copies of db(6) and again at the straddle, and 10^6·0 has occurred nowhere other than the straddle), and so it deterministically predicts a 0. However, the bit currently being predicted is a 1, and so an escape is required.

The following lemma puts an upper bound on how many bits are required to encode any single bit occurring in an S_n zone. The proof uses a counting argument examining how many times a context of length n − 1 and n − 2 can occur in the prefix S_1 S_2 . . . S_n of S.

Lemma 4.4. For almost every n, if S ↾ S_{n−1} has already been read, each bit in S_n contributes at most log(n^5) bits to the encoding of S.

Proof: For a fixed n, let b be the current bit of S_n being encoded. Let x be the context used to predict b by the encoder. Then |x| ≥ n − 1 once the first 2^n + n bits of S_n have been read, as seen in Lemma 4.3. In the worst-case scenario, x will be deterministic but will not predict b correctly, and thus an escape will be transmitted. In the worst case, for j ≤ n − 1, occ(x, S_j) ≤ j, i.e. once for every instance of db(j) in S_j, and occ(x, S_n) ≤ 2n. Thus, we can bound the maximum possible number of occurrences of x by

    Σ_{j=1}^{n−1} j + 2n ≤ n²

for n large. This results in an escape being transmitted in at most −log(1/(n² + 1)) = log(n² + 1) bits.

The bit b will then be transmitted by the non-deterministic context x[1..|x| − 1] of length at least n − 2 (recall that x is originally chosen as the shortest deterministic relevant context). Using the same logic, this context will have appeared at most j times in S_j for j ≤ n − 2, at most 2(n − 1) times in S_{n−1}, and at most 4n times in S_n. Thus, we can bound the maximum possible number of occurrences by

    (Σ_{j=1}^{n−2} j) + 2(n − 1) + 4n ≤ n²

for n large. Hence, b will be transmitted in at most −log(1/(n² + 2)) = log(n² + 2) bits. As such, b contributes at most

    log(n² + 2) + log(n² + 1) ≤ log(n^5)

bits to the encoding for n large.

The above Lemma 4.4, together with the size of the bad zone, allows us to bound the number of bits contributed by the bad zone of S_n.

Corollary 4.5.
For almost every n, if S ↾ S_{n−1} has already been read, the bad zone of S_n contributes at most (2^n + 2n) log(n^5) bits to the encoding of S.

In this section we prove our main result that R_PPM∗(S) = 0. This compression is achieved from the repetition of the de Bruijn strings, which leads to the repeated use of deterministic contexts. When a deterministic context is used, a correct prediction is made with probability k/(k + 1) for some k ∈ N. Note that as k increases, the number of bits the prediction contributes to the encoding, −log(k/(k + 1)), approaches 0.

The following shows that for n large, whenever n is odd or n = 2^j for some j, the bits of S_n not in the bad zone will be predicted by deterministic contexts.

Lemma 4.6. For n large, where n is odd or n = 2^j for some j, all bits not in the bad zone of S_n are correctly predicted by deterministic contexts.

Proof: For n odd, the (2^n + 2n + 1)-th bit in S_n will always be a 1 (if n is a power of 2, a similar argument holds but we look at the (2^n + 2n)-th bit). This follows from the form of db(n) · db(n)[0..2n] given by Remark 4.1. The context used to predict this 1 will always be a suffix of the context 010^(n−1). This context exists, as we have that occ(010^(n−1), S_n[0..2^n + 2n]) = 2.

Claim 4.7. The context 010^(n−1) is deterministic.

To see this, note that occ(010^(n−1), S ↾ S_{n−1}) = 0: the only places the string 10^(n−1) occurs are in S_{n−1}, where it is preceded by a 1; in S_{n−2}, where it is preceded by a 1 and not a 0; or along a straddle between two prior S_i's for i ≤ n − 1, where again it is preceded by a 1 and not a 0. Hence 010^(n−1) first occurs in S_n[0..2^n − 1] = db(n), where it is followed by a 1, and so is deterministic. This establishes the claim.

As the context 010^(n−1) is deterministic, the extensions of it (contexts 010^(n−1)·y for the appropriate y ∈ {0, 1}∗) that are built while reading S_n must be deterministic also. They remain deterministic throughout the reading of S_n since, due to the construction of S_n, any substring of S_n of length at least n is always followed by the same bit. Thus, every bit not in the bad zone of S_n is predicted by a deterministic context which is a suffix of an extension of the deterministic context 010^(n−1).

For n even but not of the form 2^j for some j, Lemma 4.6 does not hold: while most bits are predicted by deterministic contexts, the shifts of the de Bruijn strings in the construction of S_n between blocks B_{n,0} and B_{n,1} mean that some contexts which may originally have been deterministic in B_{n,0} soon see the opposite bit, due to these shifts, at the start of B_{n,1}.

For instance, consider the string 1^6·0^5. We have that occ(1^6·0^5, S ↾ S_5) = 0, as 1^6 is not contained in any de Bruijn string of order less than 6. However, it does occur in S_6 multiple times. The first two times it occurs it sees a 0 (as db(6)[2^6 − 6..2^6 − 1] · db(6)[0..5] = 1^6·0^6 by Remark 4.1). However, the third time it sees a 1, due to the shift in B_{6,1} (as db(6)[2^6 − 6..2^6 − 1] · db_1(6)[0..5] = 1^6·0^5·1), and so it is no longer deterministic.

We first prove the following result, which bounds the number of bits each zone S_n contributes to the encoding. In the following, |PPM∗(S_n | S ↾ S_{n−1})| denotes the number of bits contributed to the output by the PPM∗ encoder on S_n if it has already processed S ↾ S_{n−1}.

Theorem 4.8.
For almost every n, |PPM∗(S_n | S ↾ S_{n−1})| ≤ (2^n + 2n + n²) log(n^5) + log((2^n − n) · n^(n+1)).

Proof: By Lemma 4.6, every bit outside the bad zone is predicted correctly by a deterministic context when n is odd or when n is a power of 2. This is not true for the remaining n, as mentioned in the discussion preceding this theorem. As such, the output contributed in the case where n is even but not a power of 2 acts as an upper bound for all n.

In this case, n = 2^s·t for s, t ∈ N, where t is odd. Recall that S_n = B_{n,0} · B_{n,1} · · · B_{n,2^s − 1}, where B_{n,i} = db_i^t(n). Let b_n = 2^s be the number of blocks in S_n. Unlike in the other two cases, where all contexts used to encode S_n remain deterministic throughout the encoding of S_n once its first 2^n + 2n bits are processed (by Lemma 4.6), in this case some contexts do not, due to the shifts that occur between blocks.

After the bad zone, a 1 is deterministically correctly predicted by the context 010^(n−1). This is because occ(010^(n−1), S ↾ S_{n−1}) = 0, since the only places the string 10^(n−1) occurs are in S_{n−1}, where it is preceded by a 1 and not a 0; in S_{n−2}, where it is preceded by a 1 and not a 0; or along a straddle between two prior S_i's for i ≤ n − 1, where again it is preceded by a 1 and not a 0. Hence 010^(n−1) first occurs in S_n[0..2^n − 1] = db(n), where it is followed by a 1 (by Remark 4.1), and so is deterministic.

Following this, the context 010^(n−1) and the deterministic contexts built as extensions of it correctly predict the bits of S_n outside the bad zone. The k-th correct prediction of such a context is made with probability k/(k + 1), and the product of these probabilities telescopes. However, at each of the b_n − 1 boundaries between blocks, the shifted copies of db(n) cause the counts of the contexts in use to fall behind, so that by the final use only 2^n − n + b_n − 1 correct predictions have accumulated. Excluding the bad zone, these deterministic predictions account for all but n² − b_n − 1 of the bits of S_n, and they are encoded in at most

    −log( (1/2)(2/3) · · · ((2^n − n − 1)/(2^n − n)) · ((2^n − n)/(2^n − n + 1))^(b_n − 1) ) ≤ log((2^n − n) · n^(n+1))   (†)

bits via arithmetic encoding (as b_n < n).

Things differ with the other contexts used, as they may be impacted by the shifts that occur between blocks, as discussed previously. For simplicity, it is assumed that all bits not accounted for so far contribute the worst-case number of bits possible to the encoding; there are n² − b_n − 1 such bits. Then by (†), Lemma 4.4 and Corollary 4.5, we have that

    |PPM∗(S_n | S ↾ S_{n−1})| ≤ (2^n + 2n + n²) log(n^5) + log((2^n − n) · n^(n+1)).

We now prove the main theorem.
Theorem 4.9. R_PPM∗(S) = 0.

Proof: Note that the worst compression of S is achieved if the input ends with a complete bad zone, i.e. for a prefix of the form S ↾ m = S_1 . . . S_{n−1} · S_n[0..2^n + 2n − 1] for some n.

Let S ↾ m be such a prefix and let k be such that Theorem 4.8 holds for all zones S_i with i ≥ k. The prefix S_1 . . . S_{k−1} will always be encoded in O(1) bits. This gives

    limsup_{m→∞} |PPM∗(S ↾ m)| / m
      ≤ lim_{n→∞} ( Σ_{j=k}^{n−1} [ (2^j + 2j + j²) log(j^5) + log((2^j − j) · j^(j+1)) ] ) / ( |S_1 . . . S_{n−1}| + 2^n + 2n )   (by Thm 4.8)
        + lim_{n→∞} ( (2^n + 2n) log(n^5) + O(1) ) / ( |S_1 . . . S_{n−1}| + 2^n + 2n )
      = 0.

As the overhead of the arithmetic encoder contributes at most one bit, we have that R_PPM∗(S) = 0.

As S is a normal sequence, we have the following result.

Corollary 4.10.
There exists a normal sequence S such that R PPM ∗ ( S ) = 0 . PPM k and PPM ∗ The following theorem demonstrates that for all k ∈ N , PPM k achieves a best-case compressionratio of at least on S . Suppose you are examining PPM k . The idea is that each context of length k predicts the same number of 0s and 1s in each S n zone for n ≥ k . For x ∈ { , } k , n ≥ k , supposeocc( x, S n ) = t . The least amount any bit can contribute in S n is if the first t times x is seen itsees a 0 and the final t times it is seen it sees a 1 (or vice versa). The t S n , this gives thelower bound of . For each k ∈ N we use PPM k ( x ) to denote the compression of x ∈ { , } ∗ whenthe max context length is bounded to be k. Theorem 4.11.
There exists a sequence S such that R PPM ∗ ( S ) = 0 but for all k ∈ N , ρ PPM k ( S ) ≥ . Proof:
Let S be our sequence from Theorem 4.9 that R PPM ∗ ( S ) = 0. Let k be the maximumcontext length for the bounded PPM compressor PPM k . Recall S = S S . . . where S i is anenumeration of all strings of length i . 12nce PPM k processes S S . . . S k , the model will contain a context for every string of length k .Let x ∈ { , } k . Let n x,b be the number of instances that x has been followed by b in S . . . S k − ,for b ∈ { , } . This means that if x is next followed by 0, it will be predicted with probability n x, n x, + n x, +2 . Let t > k , and consider the substring S ′ t = S t S t +1 [0 . . . k −
1] of S . A context x ∈ { , } k will appear t · t − k times in S ′ t , half the time followed by a 0, and half the time followedby a 1. The maximum compression of S ′ t that could be achieved is when each context predictioncontributes as few bits as possible. The minimum amount that can possibly be contributed by anysingle prediction occurs if the first t · t − k − times x is seen it is always followed by the same bit b ,and the remaining times by ˆ b , that is b flipped. This t · t − k − th instance of x being followed by b contributes the fewest amount of bits possible to the final encoding. Of course, this is a hypotheticalscenario and does not actually occur in our S , it serves as a lower bound.Suppose b = 0. Then in this hypothetical scenario for S ′ t, , the probabilities that a 0 is predictedis given by the sequence S ′ t, = (cid:26) n x, + P t − n = k +1 ( n · n − k − ) + j ( n x, + n x, + 2) + P t − n = k +1 ( n · n − k ) + j (cid:27) ≤ j
For any ǫ > 0, each probability in this sequence is at most 2/3 + ǫ once t is sufficiently large, so each such prediction contributes at least −log(2/3 + ǫ) bits. Hence, taking ǫ ≤ 1/30, for all δ > 0 and almost every m we have

|PPM_k(S ↾ m)| ≥ m(1 − δ)(−log(2/3 + ǫ)) ≥ m(1 − δ)(−log(7/10)) ≥ (1 − δ)m/2.

The bound in the above theorem can of course be made much tighter, but it is sufficient to demonstrate a difference between PPM∗ and PPM_k.

Lempel-Ziv 78

The Lempel-Ziv 78 (LZ) algorithm [18] is a lossless dictionary-based compression algorithm. Given an input x ∈ {0,1}∗, LZ parses x into phrases x = x_1 x_2 … x_n such that each phrase x_i is unique in the parsing, except possibly the last phrase. Furthermore, for each phrase x_i, every prefix of x_i also appears as a phrase in the parsing; that is, if y is a prefix of x_i, then y = x_j for some j < i. Each phrase is stored in LZ's dictionary. LZ encodes x by encoding each phrase as a pointer to the dictionary entry containing the longest proper prefix of the phrase, along with the final bit of the phrase. Specifically, each phrase has the form x_i = x_{l(i)} b_i for some l(i) < i and b_i ∈ {0,1}. Then for x = x_1 x_2 … x_n,

LZ(x) = c_{l(1)} b_1 c_{l(2)} b_2 … c_{l(n)} b_n,

where c_i is a prefix-free encoding of the pointer to the i-th element of LZ's dictionary, and x_0 = λ.

Sequences that are enumerations of all strings are incompressible by the LZ algorithm. As such, taking S from Theorem 4.9, by Corollary 4.10 and since S is an enumeration of all strings, we have the following result (Theorem 5.1).
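The parsing rule described above can be sketched in a few lines of Python; this is an illustration of the phrase structure only, with each phrase emitted as an (l(i), b_i) pair rather than as a concrete prefix-free pointer encoding c_{l(i)}:

```python
def lz78_parse(x: str):
    """Parse binary string x into LZ78 phrases, returning (l(i), b_i) pairs:
    each phrase x_i equals dictionary phrase x_{l(i)} followed by bit b_i,
    with x_0 the empty string."""
    dictionary = {'': 0}                  # phrase -> index; x_0 = lambda
    output = []
    current = ''
    for c in x:
        if current + c in dictionary:
            current += c                  # extend the current phrase
        else:
            output.append((dictionary[current], c))    # pointer + final bit
            dictionary[current + c] = len(dictionary)  # store new phrase
            current = ''
    if current:                           # possibly-repeated final phrase
        output.append((dictionary[current[:-1]], current[-1]))
    return output
```

For example, lz78_parse('010110') returns [(0, '0'), (0, '1'), (1, '1'), (2, '0')], corresponding to the parsing 0, 1, 01, 10. On an input that enumerates all strings of each length, the number of distinct phrases stays so large that the pointer encodings cannot undercut the input length, which is the intuition behind the incompressibility claim above.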
There exists a normal sequence S such that

1. R_PPM∗(S) = 0,
2. ρ_LZ(S) = 1.

Does there exist a sequence which acts as the opposite of Theorem 5.1? Can the construction method for S in Theorem 4.9 be generalised so that an infinite family of sequences satisfies the theorem? The difference between bounded PPM and PPM∗ gives rise to the possibility of developing a notion of Bennett's logical depth [3] based on the PPM algorithms. S is an obvious candidate for a PPM-deep sequence, but how would properties such as the Slow Growth Law be defined in the PPM setting? Depth notions based on compressors and transducers have already been introduced in [9, 11].

References

[1] Verónica Becher, Olivier Carton, and Pablo Ariel Heiber. Normality and automata.
J. Comput. Syst. Sci., 81(8):1592–1613, 2015.

[2] Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1990.

[3] C. H. Bennett. Logical depth and physical complexity. The Universal Turing Machine, A Half-Century Survey, pages 227–257, 1988.

[4] Nicolaas Govert De Bruijn. A combinatorial problem. In Proc. Koninklijke Nederlandse Academie van Wetenschappen, volume 49, pages 758–764, 1946.

[5] Olivier Carton and Pablo Ariel Heiber. Normality and two-way automata. Inf. Comput., 241:264–276, 2015.

[6] John G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. Comput. J., 40(2/3):67–75, 1997.

[7] John G. Cleary and Ian H. Witten. Data compression using adaptive coding and partial string matching. IEEE Trans. Communications, 32(4):396–402, 1984.

[8] J. Dai, J. I. Lathrop, J. H. Lutz, and E. Mayordomo. Finite-state dimension. Theoretical Computer Science, 310:1–33, 2004.

[9] David Doty and Philippe Moser. Feasible depth. In S. Barry Cooper, Benedikt Löwe, and Andrea Sorbi, editors, CiE, volume 4497 of Lecture Notes in Computer Science, pages 228–237. Springer, 2007.

[10] M. Émile Borel. Les probabilités dénombrables et leurs applications arithmétiques. Rendiconti del Circolo Matematico di Palermo, 27(1):247–271, 1909.

[11] Liam Jordon and Philippe Moser. On the difference between finite-state and pushdown depth. In International Conference on Current Trends in Theory and Practice of Informatics, pages 187–198. Springer, 2020.

[12] J. Lathrop and M. Strauss. A universal upper bound on the performance of the Lempel-Ziv algorithm on maliciously-constructed data. In Proceedings of the Compression and Complexity of Sequences 1997, pages 123–135, 1997.

[13] M. H. Martin. A problem in arrangements. Bull. Amer. Math. Soc., 40(12):859–864, 1934.

[14] Elvira Mayordomo, Philippe Moser, and Sylvain Perifel. Polylog space compression, pushdown compression, and Lempel-Ziv are incomparable. Theory Comput. Syst., 48(4):731–766, 2011.

[15] Alistair Moffat. Implementing the PPM data compression scheme. IEEE Trans. Communications, 38(11):1917–1921, 1990.

[16] Claus-Peter Schnorr and H. Stimm. Endliche Automaten und Zufallsfolgen. Acta Inf., 1:345–359, 1972.

[17] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30(6):520–540, June 1987.

[18] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding.