Re-Pair In Small Space
Dominik Köppl, Tomohiro I, Isamu Furuya, Yoshimasa Takabatake, Kensuke Sakai, Keisuke Goto
Abstract
Re-Pair is a grammar compression scheme with favorably good compression rates. The compu-tation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hardto compute Re-Pair on large scale data sets. As a solution for this problem we present, given a text oflength n whose characters are drawn from an integer alphabet, an O ( n ) ∩O ( n lg log τ n lg lg lg n/ log τ n )time algorithm computing Re-Pair in n ⌈ lg max( n, τ ) ⌉ bits of working space including the text space,where τ is the number of terminals and non-terminals. The algorithm works in the restore model,supporting the recovery of the original input in the time for the Re-Pair computation with O (lg n )additional bits of working space. We give variants of our solution working in parallel or in theexternal memory model. Keywords: Grammar Compression, Re-Pair, Computation in Small Space
Introduction

Re-Pair [21] is a grammar deriving a single string. It is computed by replacing the most frequent bigram in this string with a new non-terminal, recursing until no bigram occurs more than once. Despite this simple-looking description, both the merits and the computational complexity of Re-Pair are intriguing. As a matter of fact, Re-Pair is currently one of the most well-understood grammar schemes.

Besides the seminal work of Larsson and Moffat [21], there are a couple of articles devoted to the compression aspects of Re-Pair: Given a text T of length n whose characters are drawn from an integer alphabet of size σ, the output of Re-Pair applied to T is at most 2nH_k(T) + o(n lg σ) bits with k = o(log_σ n) when represented naively as a list of character pairs [25], where H_k denotes the empirical entropy of the k-th order. Using the encoding of Kieffer and Yang [19], Ochoa and Navarro [26] could improve the output size to at most nH_k(T) + o(n lg σ) bits. Other encodings were recently studied by Ganczorz [14]. Since Re-Pair is a so-called irreducible grammar, its grammar size, i.e., the sum of the symbols on the right-hand sides of all rules, is upper bounded by O(n / log_σ n) [19, Lemma 2], which matches the information-theoretic lower bound on the size of a grammar for a string of length n. Comparing this size with the size of the smallest grammar, its approximation ratio has O((n / lg n)^{2/3}) as an upper bound [8] and Ω(lg n / lg lg n) as a lower bound [2].

On the practical side, Yoshida and Kida [33] presented an efficient fixed-length code for compressing the Re-Pair grammar. Although conceived as a grammar for compressing texts, Re-Pair has been successfully applied for compressing trees [23], matrices [30], or images [11].

For different settings or for better compression rates, there is a great interest in modifications to Re-Pair. Charikar et al. [8, Sect. G] give an easy variation to improve the size of the grammar. Sekine et al.
[28] provide an adaptive variant whose algorithm divides the input into blocks, and processes each block based on the rules obtained from the grammars of its preceding blocks. Subsequently, Masaki and Kida [24] gave an online algorithm producing a grammar mimicking Re-Pair. Ganczorz and Jez [15] modified the Re-Pair grammar by disfavoring the replacement of bigrams that cross Lempel-Ziv-77 (LZ77) [34] factorization borders, which allowed the authors to achieve practically smaller grammar sizes. Recently, Furuya et al. [13] presented a variant, called MR-Re-Pair, in which a most frequent maximal repeat is replaced instead of a most frequent bigram.
Although Re-Pair is a well-received grammar, there is not much literature on how to compute Re-Pair efficiently. In this article, we focus on the problem of computing the grammar with an algorithm working in text space, forming a bridge between the domain of in-place string algorithms and the domain of Re-Pair computing algorithms. We briefly review some prominent achievements in both domains:
In-Place String Algorithms.
For the LZ77 factorization, Kärkkäinen et al. [18] present an algorithm computing this factorization with O(n/d) words on top of the input space in O(dn) time for a variable d ≥ 1, achieving O(1) words with O(n²) time. For the suffix sorting problem, Goto [16] gave an algorithm to compute the suffix array with O(lg n) bits on top of the output in O(n) time if each character of the alphabet is present in the text. This condition got improved to alphabet sizes of at most n by Li et al. [22]. Finally, Crochemore et al. [9] showed how to transform a text into its Burrows-Wheeler transform by using O(lg n) additional bits. Due to da Louza et al. [10], this algorithm got extended to compute simultaneously the LCP array with O(lg n) bits of additional working space.

Re-Pair Computation.
Re-Pair is a grammar proposed by Larsson and Moffat [21], who gave an algorithm computing it in expected linear time with 5n + 4σ² + 4σ′ + √n words of working space, where σ′ is the number of non-terminals (produced by Re-Pair). This space requirement got improved by Bille et al. [5], who presented a linear-time algorithm taking (1 + ε)n + √n words on top of the rewriteable text space for a constant ε with 0 < ε ≤ 1. Subsequently, they improved their algorithm in [4] to include the text space within the (1 + ε)n + √n words of working space. However, they assume that the alphabet size σ is constant and ⌈lg σ⌉ ≤ w/2, where w is the machine word size. They also provide a solution for ε = 0 running in expected linear time. Recently, Sakai et al. [27] showed how to convert an arbitrary grammar (representing a text) into the Re-Pair grammar in compressed space, i.e., without decompressing the text. Combined with a grammar compression that can process the text in compressed space in a streaming fashion, this result leads to the first Re-Pair computation in compressed space.

Our Contribution.
In this article, we propose an algorithm that computes the Re-Pair grammar in O(n²) ∩ O(n² lg log_τ n lg lg lg n / log_τ n) time (cf. Thm. 2.3 and Thm. 3.1) with max((n/c) lg n, n⌈lg τ⌉) + O(lg n) bits of working space including the text space, where c ≥ 1 is a constant and τ is the number of terminals and non-terminals. Given that the characters of the text are drawn from a large integer alphabet with size σ = Ω(n), the algorithm works in-place. This is the first non-trivial in-place algorithm, as a trivial approach on a text T of length n would compute the most frequent bigram in Θ(n²) time by computing the frequency of each bigram T[i]T[i+1] for every integer i with 1 ≤ i ≤ n − 1, keeping only the most frequent bigram in memory. This sums up to O(n³) total time, and can be Θ(n³) for some texts since there can be Θ(n) different bigrams considered for replacement by Re-Pair. To achieve our goal of O(n²) total time, we first provide a trade-off algorithm (cf. Lemma 2.2) finding the d most frequent bigrams in O(n² lg d / d) time for a trade-off parameter d. We subsequently run this algorithm for increasing values of d, and show that we need to run it O(lg n) times, which gives us O(n²) time if d is increasing sufficiently fast. Our major tools are appropriate text partitioning, elementary scans, and sorting steps, which we visualize in Sect. 2.4 by an example, and practically evaluate in Sect. 2.5. When τ = o(n), a different approach using word-packing and bit-parallel techniques becomes attractive, leading to an O(n² lg log_τ n lg lg lg n / log_τ n) time algorithm, which we explain in Sect. 3. Our algorithm can be parallelized (Sect. 5), used in external memory (Sect. 6), or adapted to compute the MR-Re-Pair grammar in small space (Sect. 4). Finally, in Sect. 7 we study several heuristics that make the algorithm faster on specific texts.

Preliminaries

We use the word RAM model with a word size of Ω(lg n) for an integer n ≥
1. We work in the restore model [7], in which algorithms are allowed to overwrite the input, as long as they can restore the input to its original form.
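To make the scheme concrete before the formal definitions below, the recursive description of Re-Pair from the introduction can be sketched as follows. This is a deliberately naive reference sketch (it rescans the whole text in every turn and pays no attention to space); all identifiers (repair, nonoverlap_count, the names X1, X2, ...) are our own illustration, not part of the paper's small-space algorithm:

```python
def nonoverlap_count(t, p):
    """Number of non-overlapping occurrences of the bigram p = (a, b) in the
    symbol list t (relevant for runs such as 'aaa', where occurrences overlap)."""
    count, i = 0, 0
    while i + 1 < len(t):
        if (t[i], t[i + 1]) == p:
            count += 1
            i += 2  # skip the occurrence: occurrences must not overlap
        else:
            i += 1
    return count

def repair(text):
    """Naive Re-Pair: repeatedly replace a most frequent bigram by a fresh
    non-terminal until no bigram occurs more than once."""
    t, rules = list(text), []
    while True:
        pairs = sorted({(t[i], t[i + 1]) for i in range(len(t) - 1)})
        if not pairs:
            break
        best = max(pairs, key=lambda p: nonoverlap_count(t, p))
        if nonoverlap_count(t, best) < 2:
            break  # termination: no bigram occurs more than once
        new_sym = "X%d" % (len(rules) + 1)
        rules.append((new_sym, best))
        out, i = [], 0
        while i < len(t):  # greedy left-to-right replacement of best
            if i + 1 < len(t) and (t[i], t[i + 1]) == best:
                out.append(new_sym)
                i += 2
            else:
                out.append(t[i])
                i += 1
        t = out
    return t, rules
```

For instance, repair("abab") performs a single turn, yielding the start string X1 X1 together with the rule X1 → ab.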
Strings.
Let T be a text of length n whose characters are drawn from an integer alphabet Σ of size σ = n^{O(1)}. A bigram is an element of Σ². The frequency of a bigram B in T is the number of non-overlapping occurrences of B in T, which is at most |T|/2.

Re-Pair.
We reformulate the recursive description in the introduction by dividing a Re-Pair construction algorithm into turns. Stipulating that T_i is the text after the i-th turn with i ≥ 1 and T_0 := T ∈ Σ_0^+ with Σ_0 := Σ, Re-Pair replaces one of the most frequent bigrams (ties are broken arbitrarily) in T_{i−1} with a non-terminal in the i-th turn. Given this bigram is bc ∈ Σ²_{i−1}, Re-Pair replaces all occurrences of bc with a new non-terminal X_i in T_{i−1}, and sets Σ_i := Σ_{i−1} ∪ {X_i} with σ_i := |Σ_i| to produce T_i ∈ Σ_i^+. Since |T_i| ≤ |T_{i−1}| − 2, Re-Pair terminates after m < n/2 turns such that T_m ∈ Σ_m^+ contains no bigram occurring more than once.

Frequency Tables.
A major task for producing the Re-Pair grammar is to count the frequencies of the most frequent bigrams. Our work horses for this task are frequency tables. A frequency table in T_i of length f stores pairs of the form (bc, x), where bc is a bigram and x the frequency of bc in T_i. It uses f⌈lg(σ_i² n_i / 2)⌉ bits of space, where n_i := |T_i|, since an entry stores a bigram consisting of two characters from Σ_i and its respective frequency, which can be at most n_i / 2. Throughout this paper, we use an elementary in-place sorting algorithm like
2. Throughout this paper, we use an elementary in-place sorting algorithm likeheapsort:
Lemma 2.1 ([32]) . An array of length n can be sorted in-place in O ( n lg n ) time. By embracing the frequency tables, we present a solution with a trade-off parameter:
Lemma 2.2.
Given an integer d with d ≥
1, we can compute the frequencies of the d most frequent bigrams in a text of length n whose characters are drawn from an alphabet of size σ in O(max(n, d) · n lg d / d) time using 2d⌈lg(σ² n / 2)⌉ + O(lg n) bits.

Proof.
Our idea is to partition the set of all bigrams appearing in T into ⌈n/d⌉ subsets, compute the frequencies for each subset, and finally merge these frequencies. In detail, we partition the text T = S_1 · · · S_{⌈n/d⌉} into ⌈n/d⌉ substrings such that each substring has length d (the last one has a length of at most d). Subsequently, we extend S_j to the left (only if j > 1) and to the right (only if j < ⌈n/d⌉) such that S_j and S_{j+1} overlap by one text position, for 1 ≤ j < ⌈n/d⌉. By doing so, we take the bigram on the border of two adjacent substrings S_j and S_{j+1} for each j < ⌈n/d⌉ into account. Next, we create two frequency tables F and F′, each of length d for storing the frequencies of d bigrams. With F and F′, we process each of the ⌈n/d⌉ substrings S_j as follows: Let us fix an integer j with 1 ≤ j ≤ ⌈n/d⌉. We first put all bigrams of S_j into F′ in lexicographic order. We can perform this within the space of F′ in O(d lg d) time since there are at most d different bigrams in S_j. We compute the frequencies of all these bigrams in the complete text T in O(n lg d) time by scanning the text from left to right while locating a bigram in F′ in O(lg d) time with a binary search. Subsequently, we interpret F and F′ as one large frequency table, and sort it with respect to the frequencies while discarding duplicates such that F stores the d most frequent bigrams in T[1..jd]. This sorting step can be done in O(d lg d) time. Finally, we clear F′ and are done with S_j. After the final merge step, we obtain the d most frequent bigrams of T stored in F. Since each of the O(n/d) merge steps takes O(d lg d + n lg d) time, we need O(max(d, n) · (n lg d)/d) time. For d ≥ n, we can build a large frequency table and perform one scan to count the frequencies of all bigrams in T. This scan and the final sorting with respect to the counted frequencies can be done in O(n lg n) time.

With Lemma 2.2, we can compute T_m in O(mn² lg d / d) time with additional 2d⌈lg(σ_m² n / 2)⌉ bits of working space on top of the text for a parameter d with 1 ≤ d ≤ n.
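The partition-and-merge counting of Lemma 2.2 can be sketched as follows, under simplifying assumptions: we use a hash map in place of the two sorted in-place tables F and F′, and we count each candidate bigram with its own scan instead of the shared O(n lg d)-time scan with binary searches. All identifiers are ours:

```python
def nonoverlap_count(t, p):
    """Non-overlapping occurrences of the bigram p = (a, b) in t."""
    count, i = 0, 0
    while i + 1 < len(t):
        if (t[i], t[i + 1]) == p:
            count += 1
            i += 2
        else:
            i += 1
    return count

def top_d_bigrams(t, d):
    """Frequencies of the d most frequent bigrams of t (Lemma 2.2 sketch)."""
    best = {}  # plays the role of the table F
    for start in range(0, len(t), d):
        # block S_j of length d, extended to the right by one position so that
        # the bigram crossing the border to S_{j+1} is taken into account
        block = t[start:start + d + 1]
        candidates = {(block[i], block[i + 1]) for i in range(len(block) - 1)}
        # count every candidate in the *complete* text, then merge with F,
        # keeping only the d most frequent bigrams seen so far
        merged = dict(best)
        for p in candidates:
            merged[p] = nonoverlap_count(t, p)
        best = dict(sorted(merged.items(), key=lambda e: -e[1])[:d])
    return best
```

On the example string of Sect. 2.4 and d = 3, this returns exactly the table of Fig. 1: ab and ca with frequency 5, and aa with frequency 3.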
In the following, we present an O(n²) time algorithm that needs max((n/c) lg n, n⌈lg σ_m⌉) + O(lg n) bits of working space, where the text space is included as a rewriteable part in the working space and c ≥ 1 is a constant. As a first ingredient, we can enlarge T_i from n_i⌈lg σ_i⌉ bits to n_i⌈lg σ_{i+1}⌉ bits without additional extra memory. Our main idea is to store a growing frequency table using the space freed up by replacing bigrams with non-terminals. In detail, we maintain a frequency table F in T_i of length f_k for a growing variable f_k, which is set to f_0 := O(1) in the beginning. The table F takes f_k⌈lg(σ_i² n / 2)⌉ bits, which is O(lg(σ² n)) = O(lg n) bits for k = 0. When we want to query it for a most frequent bigram, we linearly scan F in O(f_k) = O(n) time, which is not a problem since (a) the number of queries is m ≤ n, and (b) we aim for O(n²) overall running time. A consequence is that there is no need to sort the bigrams in F according to their frequencies, which simplifies the following discussion.

Frequency Table F.
With Lemma 2.2, we can compute F in O(n max(n, f_k) lg f_k / f_k) time. Instead of recomputing F for every turn i, we want to recompute it only when it no longer stores a most frequent bigram. However, it is ad hoc not clear when this happens, as replacing a most frequent bigram during a turn (a) removes this entry in F and (b) can reduce the frequencies of other bigrams in F, making them possibly less frequent than other bigrams not tracked by F. Hence, the variable i for the i-th turn (creating the i-th non-terminal) and the variable k for recomputing the frequency table F the (k + 1)-st time are loosely connected. We group together all turns with the same f_k and call this group the k-th round of the algorithm. At the beginning of each round, we enlarge f_k and create a new F with a capacity for f_k bigrams.
Since a recomputation of F takes much time, we want to end a round only if F is no longer useful, i.e., when we can no longer guarantee that F stores a most frequent bigram. To achieve our claimed time bounds, we want to assign all m turns to O(lg n) different rounds, which can only be done if f_k grows sufficiently fast.

Algorithm Outline.
Given we are at the beginning of the k-th round and the i-th turn, we compute the frequency table F storing f_k bigrams, and additionally keep the lowest frequency of F as a threshold t, which is treated as a constant during this round. During the computation of the i-th turn, we replace the most frequent bigram (say, bc ∈ Σ_i²) in the text T_i with a non-terminal X_{i+1} to produce T_{i+1}. Thereafter, we remove bc from F and update those frequencies in F which got decreased by the replacement of bc with X_{i+1}, and add each bigram containing the new character X_{i+1} into F if its frequency is at least t. Whenever a frequency in F drops below t, we discard it. If F becomes empty, we move to the (k + 1)-st round, and create a new F for storing f_{k+1} frequencies. Otherwise (F still stores an entry), we can be sure that F stores a most frequent bigram. In both cases, we recurse with the (i + 1)-st turn by selecting the bigram with the highest frequency stored in F. We describe in the following how we update F and how large f_{k+1} can be chosen.

Suppose that we are in the k-th round and in the i-th turn. Let t be the lowest frequency in F computed at the beginning of the k-th round. We keep t as a constant threshold for the invariant that all frequencies in F are at least t during the k-th round. With this threshold we can assure in the following that F is either empty or stores a most frequent bigram. Now suppose that the most frequent bigram of T_i is bc ∈ Σ_i², which is stored in F. To produce T_{i+1} (and hence advance to the (i + 1)-st turn), we enlarge the space of T_i from n_i⌈lg σ_i⌉ to n_i⌈lg σ_{i+1}⌉ bits, and replace all occurrences of bc in T_i with a new non-terminal X_{i+1}. Subsequently, we would like to take the next bigram of F. For that, however, we need to update the stored frequencies in F. To see this necessity, suppose that there is an occurrence of abcd with two characters a, d ∈ Σ_i in T_i.
By replacing bc with X_{i+1},
(a) the frequencies of ab and cd decrease by one (for the border case a = b = c (resp. b = c = d), there is no need to decrement the frequency of ab (resp. cd)), and
(b) the frequencies of aX_{i+1} and X_{i+1}d increase by one.

Updating F.
We can take care of the former changes (a) by decreasing the respective bigram in F (in case that it is present). If the frequency of this bigram drops below the threshold t, we remove it from F, as there may be bigrams with a higher frequency that are not present in F. To cope with the latter changes (b), we track the characters adjacent to X_{i+1} after the replacement, count their numbers, and add their respective bigrams to F if their frequencies are sufficiently high. In detail, suppose that we have substituted bc with X_{i+1} exactly h times. Consequently, with the new text T_{i+1} we have additionally h⌈lg σ_{i+1}⌉ bits of free space, which we call D in the following. Subsequently, we scan the text and put the characters of Σ_{i+1} appearing to the left of each of the h occurrences of X_{i+1} into D. After sorting the characters in D lexicographically, we can count the frequency of aX_{i+1} for each character a ∈ Σ_{i+1} preceding an occurrence of X_{i+1} in the text T_{i+1} by scanning D linearly. If the obtained frequency of such a bigram aX_{i+1} is at least as high as the threshold t, we insert aX_{i+1} into F, and subsequently discard a bigram with the currently lowest frequency in F if the size of F has become f_k + 1. In case that we visit a run of X_{i+1}'s during the creation of D, we must take care of not counting the overlapping occurrences of X_{i+1}X_{i+1}. Finally, we can count analogously the occurrences of X_{i+1}d for all characters d ∈ Σ_i succeeding an occurrence of X_{i+1}.

Capacity of F.
After the above procedure we have updated the frequencies of F.
When F becomes empty, we end the k-th round and continue with the (k + 1)-st round by creating a new frequency table F with capacity f_{k+1}. In what follows, we (a) analyze in detail when F becomes empty (as this determines the sizes f_k and f_{k+1}), and (b) show that we can compensate the number of discarded bigrams with an enlargement of F's capacity from f_k bigrams to f_{k+1} bigrams for the sake of our aimed total running time: If the frequency of bc in T_i is x, then we can reduce at most 2x frequencies of other bigrams. Since a bigram must occur at least twice in T_i to be present in F, the frequency of bc has to be at least max(2, (f_k − 1)/2) for discarding all bigrams of F, and each replacement of bc with X_{i+1} frees up ⌈lg σ_{i+1}⌉ bits of the text.

Suppose that we have enough space available for storing the frequencies of αf_k bigrams, where α is a constant (depending on σ_i and n_i) such that F and the working space of Lemma 2.2 with d = f_k can be stored within this space. Let δ := lg(σ_{i+1}² n_i / 2) be the number of bits needed to store one entry in F, and let β := min(δ / lg σ_{i+1}, cδ / lg n) be the minimum number of characters that need to be freed to store one frequency in this space. To understand the value of β, we look at the arguments of the minimum function in the definition of β and simultaneously at the maximum function in our aimed working space of max(n⌈lg σ_m⌉, (n/c) lg n) + O(lg n) bits (cf. Thm. 2.3):

• The first item in this maximum function allows us to spend lg σ_{i+1} bits for each freed character such that we obtain space for one additional entry in F after freeing δ / lg σ_{i+1} characters.

• The second item allows us to use lg n additional bits after freeing up c characters. Hence, after freeing up cδ / lg n characters, we have space to store one additional entry in F.

With β, we have

αf_{k+1} = αf_k + max(2/β, (f_k − 1)/(2β))
         = αf_k · max(1 + 2/(αβf_k), 1 + 1/(2αβ) − 1/(2αβf_k))
         ≥ αf_k (1 + 2/(5αβ)) =: γ_i αf_k    with γ_i := 1 + 2/(5αβ),

where we used the equivalence 1 + 2/(αβf_k) = 1 + 1/(2αβ) − 1/(2αβf_k) ⇔ f_k = 5 to estimate the two arguments of the maximum function. Since we let f_k grow by a factor of at least γ := min_{1 ≤ i ≤ n} γ_i > 1 on each recomputation of F, we have f_k = Ω(γ^k), and therefore f_k = Θ(n) after k = O(lg n) steps. Consequently, after reaching k = O(lg n), we can iterate the above procedure a constant number of times to compute the non-terminals of the remaining bigrams occurring at least twice.

Time Analysis.
On the total picture, we compute F O(lg n) times with Lemma 2.2. For the k-th time, we run the algorithm of Lemma 2.2 with d = f_k on a text of length at most n − f_k in O(n(n − f_k) · lg f_k / f_k) time with f_k ≤ n. Summing this up, we yield

O( Σ_{k=0}^{O(lg n)} ((n − f_k)/f_k) · n lg f_k ) = O( n² Σ_k k γ^{−k} ) = O(n²)

time in total.

(1) The free space is consecutive after shifting all characters to the left. This additional treatment helps us to let f_k grow sufficiently fast in the first steps to save our O(n²) time bound, as for sufficiently small alphabets and large text sizes, lg(σ² n / 2) / lg σ = O(lg n), which means that we might run the first O(lg n) turns with f_k = O(1), and therefore already spend O(n² lg n) time.
In the i-th turn, we update F by decreasing the frequencies of the bigrams affected by the substitution of the most frequent bigram bc with X_{i+1}. For decreasing such a frequency, we look up its respective bigram with a linear scan in F, which takes f_k = O(n) time. However, since this decrease is accompanied by a replacement of an occurrence of bc, we obtain O(n²) total time by charging each text position with O(n) time for a linear search in F. With the same argument, we can bound the total time for sorting the characters in D to O(n²) overall time: Since we spend O(h lg h) time on sorting h characters preceding or succeeding a replaced character, and O(f_k) = O(n) time on swapping a sufficiently large new bigram composed of X_{i+1} and a character of Σ_{i+1} with a bigram with the lowest frequency in F, we charge each text position again with O(n) time. Putting all time bounds together leads to the main result of this article:

Theorem 2.3.
We can compute Re-Pair on a string of length n in O(n²) time with max((n/c) lg n, n⌈lg σ_m⌉) + O(lg n) bits of working space including the text space, where c ≥ 1 is a constant and σ_m is the number of terminal and non-terminal symbols.

Output.
Finally, we show that we can store the computed grammar in the text space. More precisely, we want to store the grammar in an auxiliary array A packed at the end of the working space such that the entry A[i] stores the right-hand side of the non-terminal X_i, which is a bigram. Thus the non-terminals are represented implicitly as indices of the array A. We therefore need to subtract 2 lg σ_i bits of space from our working space αf_k after the i-th turn. By adjusting α in the above equations, we can deal with this additional space requirement as long as the frequencies of the replaced bigrams are at least three (we charge two occurrences for growing the space of A).

When only bigrams with frequencies of at most two remain, we switch to a simpler algorithm, discarding the idea of maintaining the frequency table F: Suppose that we work with the text T_i. Let k be a text position, which is 1 in the beginning, but will be incremented in the following turns while holding the invariant that T_i[1..k] does not contain a bigram of frequency two. We scan T_i[k..n] linearly from left to right and check, for each text position j, whether the bigram T_i[j]T_i[j+1] has another occurrence T_i[j′]T_i[j′+1] = T_i[j]T_i[j+1] with j′ > j + 1, and if so,
(a) append T_i[j]T_i[j+1] to A,
(b) replace T_i[j]T_i[j+1] and T_i[j′]T_i[j′+1] with a new non-terminal X_{i+1} to transform T_i into T_{i+1}, and
(c) recurse on T_{i+1} with k := j until no bigram with frequency two is left.
The position k, which we never decrement, helps us to skip over all text positions starting with bigrams with a frequency of one. Thus, the algorithm spends O(n) time for each such text position, and O(n) time for each bigram with frequency two. Since there are at most n such bigrams, the overall running time of this algorithm is O(n²).

Remark 2.4 (Pointer Machine Model).
Refraining from the usage of complicated algorithms, our algorithm consists only of elementary sorting and scanning steps. This allows us to run our algorithm on a pointer machine, yielding the same time bound of O(n²). For the space bounds, we assume that the text is given in n words, where a word is large enough to store an element of Σ_m or a text position.

Here, we present an exemplary execution of the first turn (of the first round) on the input T = cabaacabcabaacaaabcab. We visualize each step of this turn as a row in Fig. 1. A detailed description of each row follows:

Row 1:
Suppose that we have computed F, which has a constant number of entries (in the later turns, when the size f_k becomes larger, F will be put in the text space). The highest frequency is five, achieved by ab and ca. The lowest frequency represented in F is three, which becomes the threshold for a bigram to be present in F such that bigrams whose frequencies drop below this threshold are removed from F. This threshold is a constant for all later turns until F is rebuilt (in the following round). During Turn 1, the algorithm proceeds now as follows:

Row 1: c a b a a c a b c a b a a c a a a b c a b                  F: ab:5 ca:5 aa:3
Row 2: c X a a c X c X a a c a a X c X                            F: ab:0 ca:1 aa:3
Row 3: c X a a c X c X a a c a a X c X                            F: aa:3
Row 4: c X a a c X c X a a c a a X c X    D: c c c a c            F: aa:3
Row 5: c X a a c X c X a a c a a X c X    D: a c c c c            F: aa:3
Row 6: c X a a c X c X a a c a a X c X                            F: cX:4 aa:3
Row 7: c X a a c X c X a a c a a X c X    D: a c a c              F: cX:4 aa:3
Row 8: c X a a c X c X a a c a a X c X    D: a a c c              F: cX:4 aa:3
Row 9: c X a a c X c X a a c a a X c X                            F: cX:4 aa:3
Figure 1: Step-by-step execution of the first turn of our algorithm on the string T = cabaacabcabaacaaabcab. The turn starts with the memory configuration given in Row 1. Positions 1 to 21 are text positions; positions 22 to 24 belong to F (f_0 = 3, and it is assumed that a frequency fits into a text entry). Subsequent rows depict the memory configuration during Turn 1. A comment on each row is given in Sect. 2.4.

Row 2:
Choose ab as a bigram to replace with a new non-terminal X (break ties arbitrarily). Replace every occurrence of ab with X while decrementing the frequencies in F according to the neighboring characters of each replaced occurrence.

Row 3:
Remove from F every bigram whose frequency falls below the threshold. Obtain space for D byaligning the compressed text T . (The process of Row 2 and Row 3 can be done simultaneously.) Row 4:
Scan the text and copy each character preceding an occurrence of X in T to D . Row 5:
Sort characters in D lexicographically. Row 6:
Insert new bigrams (consisting of a character of D and X ) whose frequencies are at least aslarge as the threshold. Row 7:
Scan the text again and copy each character succeeding an occurrence of X in T to D (sym-metric to Row 4). Row 8:
Sort all characters in D lexicographically (symmetric to Row 5). Row 9:
Insert new bigrams whose frequencies are at least as large as the threshold (symmetric toRow 6).
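The turn visualized in Rows 2 to 9 can be sketched as follows. This is an illustration in plain Python (the buffer D here is a fresh list, whereas the algorithm reuses the ⌈lg σ_{i+1}⌉ bits freed per replacement), and it omits the special handling of runs of the new symbol described in Sect. 2.3; identifiers are ours:

```python
def replace_and_update(t, b, c, new_sym, threshold):
    """One turn: replace the bigram bc by new_sym (Row 2), then count the new
    bigrams formed with new_sym via the sorted buffer D (Rows 4-9), returning
    the new text and the bigrams whose frequency reaches the threshold.
    Overlapping runs of new_sym are not handled in this sketch."""
    out, i = [], 0
    while i < len(t):  # greedy left-to-right replacement
        if i + 1 < len(t) and t[i] == b and t[i + 1] == c:
            out.append(new_sym)
            i += 2
        else:
            out.append(t[i])
            i += 1
    inserted = {}
    # offset -1: characters preceding new_sym (Rows 4-6);
    # offset +1: characters succeeding new_sym (Rows 7-9)
    for offset in (-1, +1):
        d = [out[p + offset]
             for p, s in enumerate(out)
             if s == new_sym and 0 <= p + offset < len(out)]
        d.sort()  # Rows 5 and 8: sort the buffer D lexicographically
        for a in set(d):
            freq = d.count(a)
            pair = (a, new_sym) if offset == -1 else (new_sym, a)
            if freq >= threshold:  # Rows 6 and 9: insert into F
                inserted[pair] = freq
    return out, inserted
```

On the running example with threshold t = 3, the turn produces the text cXaacXcXaacaaXcX and inserts cX with frequency 4 into F, matching Fig. 1.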
We provide a simplified implementation in C++17 at https://github.com/koeppl/repair-inplace. The simplification is that we (a) fix the bit width of the text space to 16 bits, and (b) assume that Σ is the byte alphabet. We further skip the step increasing the bit width from lg σ_i to lg σ_{i+1}. This means that the program works as long as the characters of Σ_m fit into 16 bits. The benchmark, whose results are displayed in Table 1, was conducted on a Mac Pro Server with an Intel Xeon CPU X5670 clocked at 2.93GHz running Arch Linux. The implementation was compiled with gcc-8.2.1 in the highest optimization mode -O3. Looking at Table 1, we can see that the running time is super-linear in the input size on all text instances, which we obtained from the Pizza&Chili corpus (http://pizzachili.dcc.uchile.cl/). Table 2 gives some characteristics about the used data sets. We see that the number of rounds is the number of turns plus one for every unary string aᵏ with an integer k ≥ 2, since after each turn F becomes empty such that the algorithm recomputes F after each turn. Note that the number of rounds can drop while scaling the prefix length based on the choice of the bigrams stored in F.

[Tables 1 and 2, reporting running times, rounds, and turns on the data sets Escherichia Coli, cere, coreutils, einstein.de.txt, einstein.en.txt, influenza, kernel, para, world leaders, and two unary strings aa···a, could not be recovered from the source; their values are in plain units (not divided by thousand).]

In the case that the number of terminals and non-terminals τ := σ_m is o(n), a word-packing approach becomes interesting. We present techniques speeding up previously introduced operations on chunks of O(log_τ n) characters from O(log_τ n) time to O(lg lg lg n) time. In the end, these techniques allow us to speed up the sequential algorithm of Thm. 2.3 from O(n²) time to the following:

Theorem 3.1.
We can compute Re-Pair on a string of length n in O(n² lg log_τ n lg lg lg n / log_τ n) time with max((n/c) lg n, n⌈lg τ⌉) + O(lg n) bits of working space including the text space, where c ≥ 1 is a constant and τ is the number of terminal and non-terminal symbols.

Note that the O(lg lg lg n) factor is due to the popcount function [31, Algo. 1], which has been optimized to a single instruction on modern computer architectures.

Operation: Description
X ≪ j: shift X j positions to the left
X ≫ j: shift X j positions to the right
¬X: bitwise NOT of X
X ⊗ Y: bitwise XOR of X and Y
X & Y: bitwise AND of X and Y
msb(X): returns the position of the most significant set bit of X, i.e., ⌊lg X⌋ + 1; see [12, Sect. 5] for a constant-time algorithm using O(lg n) bits
rmPreRun(X): sets all bits of the maximal prefix of consecutive ones to zero
rmSufRun(X): sets all bits of the maximal suffix of consecutive ones to zero

Figure 2: Operations used in Figs. 4 and 5 for two bit vectors X and Y. All operations can be computed in constant time. See Fig. 3 for an example of rmSufRun and rmPreRun.

rmPreRun(X): X → ¬X → 1 ≪ (1 + msb(¬X)) → (1 ≪ (1 + msb(¬X))) − 1 → ((1 ≪ (1 + msb(¬X))) − 1) & X
rmSufRun(X): X → ¬X → ¬X − 1 → (¬X − 1) & X → ¬((¬X − 1) & X) → ¬((¬X − 1) & X) & X

Figure 3: Step-by-step execution of rmPreRun(X) and rmSufRun(X) introduced in Fig. 2 on a bit vector X. (The example bit patterns of the original figure are not reproduced here.)

First, we deal with accelerating the computation of the frequency of a bigram in T by exploiting broadword search thanks to the word RAM model. We start with the search of single characters and subsequently extend this result to bigrams:

Lemma 3.2.
We can count the occurrences of a character c ∈ Σ in a string of length O (log σ n ) in O (lg lg lg n ) time. Proof.
Let q be the largest multiple of ⌈ lg σ ⌉ fitting into a computer word, divided by ⌈ lg σ ⌉ . Let S ∈ Σ ∗ be a string of length q . Our first task is to compute a bit mask of length q ⌈ lg σ ⌉ marking with a ‘1’ theoccurrences of a character c ∈ Σ in S . For that, we follow the constant time broadword pattern matchingof Knuth [20, Sect. 7.1.3] : Let H and L be two bit vectors of length ⌈ lg σ ⌉ having marked only the mostsignificant or the least significant bit, respectively. Let H q and L q denote the q times concatenation of H and L , respectively. Then the operations in Fig. 4 yield an array X of length q with X [ i ] = ( ⌈ lg σ ⌉ − S [ i ] = c, X has ⌈ lg σ ⌉ bits.To obtain the number of occurrences of c in S , we use the popcount operation returning the numberof zero bits in X , and divide the result by ⌈ lg σ ⌉ . The popcount instruction takes O (lg lg lg n ) time [31,Algo. 1].Having Lemma 3.2, we show that we can compute the frequency of a bigram in T in O ( n lg lg lg n/ log σ n )time. For that, we partition T into strings of length ⌊ log σ n ⌋ fitting into a computer word, and call eachstring of this partition a chunk . For each chunk S , we call find( c, S ) to compute the bit vector X storingthe occurrences of c in S . In case that we want to use Lemma 3.2 when c ∈ Σ is a bigram, we interpret See https://github.com/koeppl/broadwordsearch for a practical implementation. S → SX ← S ⊗ c q match S with c q ; X [ i ] = 0 ⇔ S [ i ] = c = S → XY ← X − L q = X → YX ← Y & ¬ X X [ i ] & 2 ⌈ lg σ ⌉ − ⇔ S [ i ] = c = Y → XX ← X & H q X [ i ] = 0 ⇔ S [ i ] = c = X → XX ← ( X − ( X ≫ ( ⌈ lg σ ⌉ − | X X as in Eq. (2) = X → X Figure 4: Broadword matching all occurrences of a character in a string S fitting into a computer word.For the last step, special care has to be taken when the last character of S is a match, as shifting X ⌈ lg σ ⌉ bits to the right might erase a ‘1’ bit witnessing the rightmost match. 
In the description column, X is treated as an array of integers with bit width ⌈lg σ⌉. In the example of the figure (bit strings not reproduced), lg σ = 3 and q = 3.

That is, we interpret our text T ∈ Σ^n of length n as a text T ∈ (Σ²)^⌈n/2⌉ of length ⌈n/2⌉. The result is, however, not the frequency of the bigram in general. For computing the frequency of a bigram bc ∈ Σ², we distinguish the cases b ≠ c and b = c.

Case b ≠ c. By applying Lemma 3.2 to find the character bc ∈ Σ² in a chunk S (interpreted as a string of length ⌊q/2⌋ over the alphabet Σ²), we obtain the number of occurrences of bc starting at odd positions in S. To obtain this number for all even positions, we apply the procedure to dS with d ∈ Σ \ {b, c}. Additional care has to be taken at the borders of each chunk, matching the last character of the current chunk and the first character of the subsequent chunk with b and c, respectively.

Case b = c. This case is more involved, as overlapping occurrences of bb can occur in S, which we must not count. To this end, we watch out for runs of b's, i.e., substrings of maximal length consisting of the character b (here, we consider also maximal substrings of b with length 1 as a run). We separate these runs into runs ending either at even or at odd positions. We do this because the frequency of bb in a run of b's ending at an even (resp. odd) position is the number of occurrences of bb within this run ending at an even (resp. odd) position. We can compute these positions similarly to the approach for b ≠ c by first (a) hiding runs ending at even (resp. odd) positions, and then (b) counting all bigrams ending at even (resp. odd) positions. Runs of b that are a prefix or a suffix of S are handled individually if S is not the first or not the last chunk of T, respectively. That is because a run passing a chunk border starts and ends in different chunks.
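Both cases build on the broadword search of Fig. 4. The following Python sketch simulates a machine word of q·x bits with an unbounded integer; the function names (`pack`, `find`, `count`) are ours, and instead of the figure's exact step sequence we use the borrow-safe zero-field test H & ¬(X | ((X | H) − L)) from Knuth [20, Sect. 7.1.3], which marks matching fields exactly:

```python
def pack(chars, x):
    """Pack characters (integers < 2^x) into one integer; chars[0]
    occupies the least significant x-bit field."""
    word = 0
    for i, ch in enumerate(chars):
        word |= ch << (i * x)
    return word

def find(word, c, q, x):
    """Return a mask whose i-th x-bit field is all ones iff the i-th
    field of `word` equals c (broadword matching as in Fig. 4)."""
    L = pack([1] * q, x)           # L^q: least significant bit of every field
    H = L << (x - 1)               # H^q: most significant bit of every field
    mask = (1 << (q * x)) - 1      # simulate a (q*x)-bit machine word
    X = word ^ (c * L)             # S XOR c^q: a field is 0 iff it matches c
    T = H & ~(X | (((X | H) - L) & mask))     # MSB set iff the field is zero
    return ((T - (T >> (x - 1))) | T) & mask  # spread each MSB over its field

def count(word, c, q, x):
    """Lemma 3.2: occurrences of c = popcount of the match mask, divided by x."""
    return bin(find(word, c, q, x)).count("1") // x
```

For instance, with x = 3 and q = 4, `count(pack([1, 5, 1, 3], 3), 1, 4, 3)` yields 2. The case b ≠ c of the text then amounts to calling `find` on fields of doubled width 2⌈lg σ⌉.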
To take care of those runs, we remember the number of b's of the longest such suffix of every chunk, and accumulate this number until we find the end of this run, which is a prefix of a subsequent chunk. The procedure for counting the frequency of bb inside S is explained with an example in Fig. 5. With the aforementioned analysis of the runs crossing chunk borders, we can extend this procedure to count the frequency of bb in T. We conclude:

Lemma 3.3.
We can compute the frequency of a bigram in a string T of length n whose characters are drawn from an alphabet of size σ in O(n lg lg lg n / log_σ n) time.

    Operation                     Description
    X ← find(b, S)                search b in S
    X ← rmPreRun(X)               erase a prefix run of b's
    M ← rmSufRun(X)               erase a suffix run of b's
    B ← findBigram(·, M) & M      start of each b run
    E ← findBigram(·, M) & M      end of each b run
    M ← M & ¬B                    trim the head of each run
    X ← B − (E & (01)^(q/2))      bit mask for all runs ending at even positions
    X ← M & X                     occurrences of all b's in runs ending at even positions
    popcount(X & (01)^(q/2))      frequency of all bb's in runs ending at even positions
    X ← B − (E & (10)^(q/2))      bit mask for all runs ending at odd positions
    X ← M & X                     occurrences of all b's in runs ending at odd positions
    popcount(X & (10)^(q/2))      frequency of all bb's in runs ending at odd positions

Figure 5: Finding a bigram bb in a string S (in the example, S = bbdbbdcbbbdbb), where q is the largest multiple of 2⌈lg σ⌉ fitting into a computer word, divided by ⌈lg σ⌉. We represent the strings M, B, E, and X as arrays of integers with bit width x := ⌈lg σ⌉; in the masks above, 1 and 0 stand for the blocks 1^x and 0^x, respectively (the example bit strings of the original figure are not reproduced). Let findBigram(bc, X) := find(bc, X) | find(bc, dX) for d ≠ b be the matching of a bigram bc with b ≠ c as described in Sect. 3.1. Each of the popcount queries gives us one occurrence as a result (after dividing the returned number by ⌈lg σ⌉); thus the frequency of bb in S, without looking at the borders of S, is two. As a side note, modern computer architectures allow us to shrink the 0^x and 1^x blocks to single bits by instructions like pext_u64 taking a single CPU cycle.

Similarly to Lemma 2.2, we present an algorithm computing the d most frequent bigrams, but now with the word-packed search of Lemma 3.3.

Lemma 3.4.
Given an integer d with d ≥ 1, we can compute the frequencies of the d most frequent bigrams in a text of length n whose characters are drawn from an alphabet of size σ in O(n² lg lg lg n / log_σ n) time using d⌈lg(σ²n)⌉ + O(lg n) bits.

Proof.
We allocate a frequency table F of length d. For each text position i with 1 ≤ i ≤ n − 1, we compute the frequency of T[i]T[i+1] in O(n lg lg lg n / log_σ n) time with Lemma 3.3. After computing a frequency, we insert it into F if it is one of the d most frequent bigrams among the bigrams we have already computed. We can perform the insertion in O(lg d) time if we sort the entries of F by their frequencies.

Studying the final time bounds of Eq. (1) for the sequential algorithm of Sect. 2, we see that we spend O(n²) time in the first turn, but less time in later turns. Hence, we want to run the bit-parallel algorithm only in the first few turns, until f_k becomes so large that the benefits of running Lemma 2.2 outweigh the benefits of the bit-parallel approach of Lemma 3.4. In detail, for the k-th round, we set d := f_k and run the algorithm of Lemma 3.4 on the current text if d is sufficiently small, or otherwise the algorithm of Lemma 2.2. In total, we yield

    O( Σ_{k=0}^{O(lg n)} min( ((n − f_k)/f_k) n lg f_k , (n − f_k) n lg lg lg n / log_τ n ) )
      = O( n² Σ_{k=0}^{O(lg n)} min( kγ^k , lg lg lg n / log_τ n ) )
      = O( n² lg(log_τ n) lg lg lg n / log_τ n )    (3)

time in total, where τ = σ + m is the number of terminals and non-terminals, and kγ^k > lg lg lg n / log_τ n ⇔ k = O(lg(lg n / (lg τ lg lg lg n))).

To obtain the claim of Thm. 3.1, it is left to show that the k-th round with the bit-parallel approach uses O(n lg lg lg n / log_τ n) time, as we now want to charge each text position with O(n / log_τ n) time with the same amortized analysis as after Eq. (1).
We target O(n/log_τ n) time for

(1) replacing all occurrences of a bigram,
(2) shifting freed-up text space to the right,
(3) finding the bigram with the highest or lowest frequency in F,
(4) updating or exchanging an entry in F, and
(5) looking up the frequency of a bigram in F.

Let x := ⌈lg σ_{i+1}⌉ and let q be the largest multiple of x fitting into a computer word, divided by x. For Item (1), we partition T into substrings of length q, and apply Item (1) to each such substring S. Here, we combine the two bit vectors of Fig. 5 used for the two popcount calls by a bitwise OR, and call the resulting bit vector Y. Interpreting Y as an array of integers of bit width x, Y has q entries, and it holds that Y[i] = 2^x − 1 if S[i] is the second character of an occurrence of the bigram we want to replace, and Y[i] = 0 otherwise. We can replace this character in all marked positions in S by a non-terminal X_{i+1} using x bits with the instruction (S & ¬Y) | ((Y & L^q) · X_{i+1}), where L with |L| = x is the bit vector having marked only the least significant bit. Subsequently, for Item (2), we erase all characters S[i] with Y[i+1] = (Y ≪ x)[i] = 2^x − 1 sequentially. In the subsequent bit chunks, we can use word-packed shifting. The sequential bit shift costs O(|S|) = O(log_{σ_{i+1}} n) time, but in an amortized view, a deletion of a character is done at most once per original text position.

For the remaining points, our trick is to represent F by a minimum and a maximum heap, both realized as array heaps. For the space increase, we have to lower γ adequately. Each element of an array heap stores a frequency and a pointer to a bigram stored in a separate array B storing all bigrams consecutively. A pointer array P stores, for each bigram of B, pointers to the respective frequencies in both heaps. The total data structure can be constructed at the beginning of the k-th round in O(f_k) time, and hence does not worsen the time bounds.
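The word-packed replacement of Item (1) can be sketched as follows; Python integers stand in for machine words, the function name is ours, and the mask Y (with all-ones fields at the positions to be replaced) is assumed to have been computed as described above:

```python
def replace_marked(S, Y, v, q, x):
    """Item (1): overwrite every x-bit field i of S that is marked in Y
    (i.e., Y's field i holds 2^x - 1) with the non-terminal v, using the
    word-packed instruction (S & ~Y) | ((Y & L^q) * v)."""
    mask = (1 << (q * x)) - 1
    L = sum(1 << (i * x) for i in range(q))  # L^q: LSB of every field
    # (Y & L) leaves one set bit per marked field; multiplying by v
    # (which fits in x bits) deposits v into each marked field.
    return ((S & ~Y) | ((Y & L) * v)) & mask
```

For instance, with x = 3 and q = 4, a word holding the fields (3, 1, 5, 1) and a mask Y marking the two fields with value 1 yields the fields (3, 6, 5, 6) after replacing with v = 6.
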
While B solves Item (5), the two heaps with P solve Items (3) and (4), even in O(lg f_k) time. In case that we want to store the output in working space, we follow the description in the paragraph after Thm. 2.3, where we now use word-packing to find the second occurrence of a bigram in T_i in O(n/log_{σ_i} n) time. (Like in Item (1), the case that the bigram crosses a boundary of the partition of T is handled individually.)

Computing MR-Re-Pair in Small Space
We can adapt our algorithm to compute the MR-Re-Pair grammar scheme proposed by Furuya et al. [13]. The difference to Re-Pair is that MR-Re-Pair replaces the most frequent maximal repeat instead of the most frequent bigram, where a maximal repeat is a reoccurring substring of the text whose frequency decreases when extending it to the left or to the right. Our idea is to exploit the fact that a most frequent bigram corresponds to a most frequent maximal repeat [13, Lemma 2]. This means that we can find a most frequent maximal repeat by extending all occurrences of a most frequent bigram to their left and to their right until the extended occurrences are no longer equal substrings. Although such an extension can be time consuming, this time is amortized by the number of characters that are replaced on creating an MR-Re-Pair rule. Hence, we conclude that we can compute MR-Re-Pair in the same space and time bounds as our algorithm computing the Re-Pair grammar.

Computing Re-Pair in Parallel

Suppose that we have p processors on a CRCW machine, supporting in particular parallel insertions of elements and frequency updates in a frequency table. In the parallel setting, we allow ourselves to spend O(p lg n) bits of additional working space such that each processor has an extra budget of O(lg n) bits. In our computational model, we assume that the text is stored in p parts of equal length such that we can enlarge a text using n lg σ bits to n(lg σ + 1) bits in max(1, n/p) time without extra memory. For our parallel variant computing Re-Pair, our workhorse is a parallel sorting algorithm:

Lemma 5.1 ([3]). We can sort an array of length n in O(max(n/p,
1) lg n) parallel time with O(p lg n) bits of working space. The work is O(n lg n).

The parallel sorting allows us to state Lemma 2.2 in the following way:

Lemma 5.2.
Given an integer d with d ≥ 1, we can compute the frequencies of the d most frequent bigrams in a text of length n whose characters are drawn from an alphabet of size σ in O(max(n, d) max(n/p, 1) lg d / d) time using 2d⌈lg(σ²n)⌉ + O(p lg n) bits. The work is O(max(n, d) n lg d / d).

Proof.
We follow the computational steps of Lemma 2.2, but (a) divide a scan into p parts, (b) conduct a scan in parallel but a binary search sequentially, and (c) use Lemma 5.1 for the sorting. This gives us the following time bounds for each operation:

    Operation                    Lemma 2.2    Parallel
    fill F′ with bigrams         O(d)         O(max(1, d/p))
    sort F′ lexicographically    O(d lg d)    O(max(d/p, 1) lg n)
    compute frequencies of F′    O(n lg d)    O((n/p) lg d)
    merge F′ with F              O(d lg d)    O(max(d/p, 1) lg n)

The O(n/d) merge steps are conducted in the same way, yielding the bounds of this lemma.

In our sequential model, we produce T_{i+1} by performing a left shift after replacing all occurrences of a most frequent bigram with a new non-terminal X_{i+1}, such that we gain free space at the end of the text. As described in our computational model, our text is stored as a partition of p substrings, each assigned to one processor; we pad up the last part with dummy characters to match ⌈n/p⌉ characters. Instead of gathering the entire free space at T's end, we gather free space at the end of each of these substrings. We bookkeep the size and location of each such free space (there are at most p many) such that we can work on the remaining text T_{i+1} as if it were a single contiguous array (and not fragmented into p substrings). This shape allows us to perform the left shift in O(n/p) time, while spending O(p lg n) bits of space for the locations of the free-space fragments. We naturally extend the definition of frequency from bigrams to substrings, meaning the number of non-overlapping occurrences.

For p ≤ n, exchanging Lemma 2.2 with Lemma 5.2 in Eq. (1) yields

    O( Σ_{k=0}^{O(lg n)} ((n − f_k)/f_k) (n/p) lg f_k ) = O( (n²/p) Σ_{k=0}^{lg n} kγ^k ) = O( n²/p )

time in total. It is left to provide an amortized analysis for updating the frequencies in F during the i-th turn. Here, we can charge each text position with O(n/p) time, as we have the following time bounds for each operation:

    Operation                Sequential    Parallel
    linearly scan F          O(f_k)        O(f_k/p)
    linearly scan T_i        O(n_i)        O(n_i/p)
    sort D with h = |D|      O(h lg h)     O(max(1, h/p) lg h)

The first operation in the above table is used, among others, for finding the bigram with the lowest or highest frequency in F. Computing the lowest or highest frequency in F can be done with a single variable pointing to the currently found entry with the lowest or highest frequency during a parallel scan, thanks to the CRCW model.
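The parallel scan just described can be sketched as follows; we simulate the p processors sequentially (the function name is ours), each scanning its assigned range of F for a local maximum before the results are merged — on a CRCW machine the merge is a concurrent write to a single variable, here it is a plain reduction:

```python
def parallel_max_scan(F, p):
    """Simulate p processors scanning the frequency table F: processor j
    scans its range of ceil(len(F)/p) entries for a local maximum; the
    (at most p) local maxima are then merged into the global maximum."""
    chunk = -(-len(F) // p)        # ceil(len(F)/p) entries per processor
    local_maxima = [max(F[j * chunk:(j + 1) * chunk])
                    for j in range(p)
                    if F[j * chunk:(j + 1) * chunk]]  # skip empty ranges
    return max(local_maxima)       # CRCW: all processors write one variable
```

Each simulated processor touches ⌈len(F)/p⌉ entries, matching the O(f_k/p) bound of the table above for a scan of F.
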
Theorem 5.3.
We can compute Re-Pair in O(n²/p) time with p ≤ n processors on a CRCW machine with max((n/c) lg n, n⌈lg σ_m⌉) + O(p lg n) bits of working space including the text space, where σ_m is the number of terminal and non-terminal symbols. The work is O(n²).

Computing Re-Pair in External Memory

The last part of this article is devoted to the first external memory (EM) algorithm computing Re-Pair, which is another way to overcome the memory limitation problem. We start with the definition of the EM model, present an approach using a sophisticated heap data structure, and another approach adapting our in-place techniques.

In the following, we use the EM model of Aggarwal and Vitter [1]. It features fast internal memory (IM) holding up to M data words, and slow EM of unbounded size. The measure of the performance of an algorithm is the number of input and output operations (I/Os) required, where each I/O transfers a block of B consecutive words between memory levels. Reading or writing n contiguous words from or to disk requires scan(n) = Θ(n/B) I/Os. Sorting n contiguous words requires sort(n) = O((n/B) · log_{M/B}(n/B)) I/Os. For realistic values of n, B, and M, we stipulate that scan(n) < sort(n) ≪ n.

A simple approach is based on an EM heap maintaining the frequencies of all bigrams in the text. A state-of-the-art heap is due to Jiang and Larsen [17], providing insertion, deletion, and the retrieval of the maximum element in O(B⁻¹ log_{M/B}(N/B)) I/Os, where N is the size of the heap. Since N ≤ n, inserting all bigrams takes at most sort(n) I/Os. As there are at most n additional insertions, deletions, and maximum element retrievals, this sums to at most 4 sort(n) I/Os.
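As a concrete sanity check of these I/O measures (the numeric values below are arbitrary examples, not taken from the paper):

```python
from math import log

def scan_io(n, B):
    """scan(n) = Theta(n/B) I/Os; here n is assumed to be a multiple of B."""
    return n // B

def sort_io(n, B, M):
    """sort(n) = O((n/B) * log_{M/B}(n/B)) I/Os."""
    return (n // B) * log(n / B, M / B)

# Example: n = 2^30 words, block size B = 2^10, internal memory M = 2^20:
n, B, M = 2**30, 2**10, 2**20
assert scan_io(n, B) == 2**20                    # one I/O per block
assert scan_io(n, B) < sort_io(n, B, M) < n      # scan(n) < sort(n) << n
```

The final assertion mirrors the stipulation scan(n) < sort(n) ≪ n for realistic parameter values.
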
Finally, we need to scan the text m times to replace the occurrences of the retrieved bigrams, triggering Σ_{i=1}^{m} scan(|T_i|) ≤ m scan(n) I/Os.

(A remark on the parallel algorithm of the previous section: in the CREW model, concurrent writes are not possible. A common strategy lets each processor compute the entry of the lowest or highest frequency within its assigned range in F, which is then merged in a tournament-tree fashion, causing O(lg p) additional time.)

In the following, we show an EM Re-Pair algorithm that evades the use of complicated data structures and prioritizes scans over sorting. This algorithm is based on our Re-Pair algorithm. It uses Lemma 2.2 with d := Θ(M) such that F and F′ can be kept in IM. This allows us to perform all sorting steps and binary searches in IM without additional I/O. We only trigger I/O operations for scanning the text, which is done ⌈n/d⌉ times, since we partition T into d substrings. In total, we spend at most mn/M scans for the algorithm of Lemma 2.2. For the actual algorithm, an update of F is done m times, during which we replace all occurrences of a chosen bigram in the text. This gives us m scans in total. Finally, we need to reason about D, which is also created m times. However, D may be larger than M, such that we may need to store it in EM. Given that D_i is D in the i-th turn, we sort D in EM, triggering sort(|D_i|) I/Os. With a converse of Jensen's inequality [29] (applied to f(x) := x lg x), we obtain Σ_{i=1}^{m} sort(|D_i|) ≤ sort(n) + O(n log_{M/B} m). We finally yield:
We can compute Re-Pair with min(4 sort(n), (mn/M) scan(n) + sort(n) + O(n log_{M/B} m)) + m scan(n) I/Os in external memory.

Our approach can be practically favorable to the heap-based approach if m = o(lg n) and mn/M = o(lg n), or if the EM space is also of major concern.

Heuristics

The achieved O(n²) time bound seems to convey the impression that this work is only of purely theoretical interest. However, we provide here some heuristics, which can help us to overcome the practical bottleneck at the beginning of the execution, where only O(lg n) bits of working space are available. In other words, we want to study several heuristics to circumvent the need to call Lemma 2.2 with a small parameter d, as such a case means a considerable time loss. Even a single call of Lemma 2.2 with a small d prevents the computation of Re-Pair of data sets larger than 1 MiB within a reasonable time frame (cf. Sect. 2.5). We present three heuristics depending on whether our space budget on top of the text space is within

1. σ_i² lg n bits,
2. n_i lg(σ_{i+1} + n_i) bits, or
3. O(lg n) bits.

Heuristic 1. If σ_i is small enough such that we can spend σ_i² lg n bits, then we can compute the frequencies of all bigrams in O(n) time. Whenever we reach a σ_j that lets σ_j² lg n grow outside of our budget, we have spent O(n) time in total for reaching T_j from T_i, as the costs for replacements can be amortized by twice the text length.

Heuristic 2.
Suppose that we are allowed to use (n_i − 1) lg(n_i/2) = (n_i − 1) lg n_i − n_i + O(lg n_i) bits in addition to the n_i lg σ_i bits of the text T_i. We create an extra array F of length n_i − 1 such that F[j] stores the frequency of T[j]T[j+1] in T[1..j]. We can fill the array in σ_i² scans over T_i, costing us O(n_i σ_i²) time. The position j of the largest number stored in F yields a most frequent bigram T[j]T[j+1] of T.

Heuristic 3.
Finally, if the distribution of bigrams is skewed, chances are that one bigram outnumbersall others. In such a case we can use the following algorithm to find this bigram:
Lemma 7.1.
Given that there is a bigram in T_i (0 ≤ i ≤ n) whose frequency is higher than the sum of the frequencies of all other bigrams, we can compute T_{i+1} in O(n) time using O(lg n) bits.

Proof.
We use the Boyer-Moore majority vote algorithm [6] for finding the most frequent bigram in O(n) time with O(lg n) bits of working space.

Acknowledgments
This work is funded by the JSPS KAKENHI Grant Numbers JP18F18120 (Dominik Köppl), 19K20213 (Tomohiro I), and 18K18111 (Yoshimasa Takabatake), and the JST CREST Grant Number JPMJCR1402 including the AIP challenge program (Keisuke Goto).
References

[1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116–1127, 1988.
[2] H. Bannai, M. Hirayama, D. Hucke, S. Inenaga, A. Jeż, M. Lohrey, and C. P. Reh. The smallest grammar problem revisited. arXiv 1908.06428, 2019.
[3] K. E. Batcher. Sorting networks and their applications. In Proc. AFIPS, volume 32 of AFIPS Conference Proceedings, pages 307–314, 1968.
[4] P. Bille, I. L. Gørtz, and N. Prezza. Practical and effective Re-Pair compression. arXiv 1704.08558, 2017.
[5] P. Bille, I. L. Gørtz, and N. Prezza. Space-efficient Re-Pair compression. In Proc. DCC, pages 171–180, 2017.
[6] R. S. Boyer and J. S. Moore. MJRTY: A fast majority vote algorithm. In Automated Reasoning: Essays in Honor of Woody Bledsoe, Automated Reasoning Series, pages 105–118, 1991.
[7] T. M. Chan, J. I. Munro, and V. Raman. Selection and sorting in the "restore" model. ACM Trans. Algorithms, 14(2):11:1–11:18, 2018.
[8] M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Trans. Information Theory, 51(7):2554–2576, 2005.
[9] M. Crochemore, R. Grossi, J. Kärkkäinen, and G. M. Landau. Computing the Burrows-Wheeler transform in place and in small space. J. Discrete Algorithms, 32:44–52, 2015.
[10] F. A. da Louza, T. Gagie, and G. P. Telles. Burrows-Wheeler transform and LCP array construction in constant space. J. Discrete Algorithms, 42:14–22, 2017.
[11] P. De Luca, V. M. Russiello, R. Ciro Sannino, and L. Valente. A study for image compression using Re-Pair algorithm. arXiv e-prints, 2019.
[12] M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47(3):424–436, 1993.
[13] I. Furuya, T. Takagi, Y. Nakashima, S. Inenaga, H. Bannai, and T. Kida. MR-RePair: Grammar compression based on maximal repeats. In Proc. DCC, pages 508–517, 2019.
[14] M. Gańczorz. Entropy lower bounds for dictionary compression. In Proc. CPM, volume 128 of LIPIcs, pages 11:1–11:18, 2019.
[15] M. Gańczorz and A. Jeż. Improvements on Re-Pair grammar compressor. In Proc. DCC, pages 181–190, 2017.
[16] K. Goto. Optimal time and space construction of suffix arrays and LCP arrays for integer alphabets. ArXiv e-prints, 2017.
[17] S. Jiang and K. G. Larsen. A faster external memory priority queue with decrease-keys. In Proc. SODA, pages 1331–1343, 2019.
[18] J. Kärkkäinen, D. Kempa, and S. J. Puglisi. Lightweight Lempel-Ziv parsing. In Proc. SEA, volume 7933 of LNCS, pages 139–150, 2013.
[19] J. C. Kieffer and E. Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Information Theory, 46(3):737–754, 2000.
[20] D. E. Knuth. The Art of Computer Programming, Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams. Addison-Wesley, 12th edition, 2009.
[21] N. J. Larsson and A. Moffat. Offline dictionary-based compression. In Proc. DCC, pages 296–305, 1999.
[22] Z. Li, J. Li, and H. Huo. Optimal in-place suffix sorting. In Proc. SPIRE, volume 11147 of LNCS, pages 268–284, 2018.
[23] M. Lohrey, S. Maneth, and R. Mennicke. XML tree structure compression using RePair. Inf. Syst., 38(8):1150–1167, 2013.
[24] T. Masaki and T. Kida. Online grammar transformation based on Re-Pair algorithm. In Proc. DCC, pages 349–358, 2016.
[25] G. Navarro and L. M. S. Russo. Re-Pair achieves high-order entropy. In Proc. DCC, page 537, 2008.
[26] C. Ochoa and G. Navarro. RePair and all irreducible grammars are upper bounded by high-order empirical entropy. IEEE Trans. Information Theory, 65(5):3160–3164, 2019.
[27] K. Sakai, T. Ohno, K. Goto, Y. Takabatake, T. I, and H. Sakamoto. RePair in compressed space and time. In Proc. DCC, pages 518–527, 2019.
[28] K. Sekine, H. Sasakawa, S. Yoshida, and T. Kida. Adaptive dictionary sharing method for Re-Pair algorithm. In Proc. DCC, page 425, 2014.
[29] S. Simic. Jensen's inequality and new entropy bounds. Appl. Math. Lett., 22(8):1262–1265, 2009.
[30] Y. Tabei, H. Saigo, Y. Yamanishi, and S. J. Puglisi. Scalable partial least squares regression on grammar-compressed data matrices. In Proc. SIGKDD, pages 1875–1884, 2016.
[31] S. Vigna. Broadword implementation of rank/select queries. In Proc. WEA, volume 5038 of LNCS, pages 154–168, 2008.
[32] J. W. J. Williams. Algorithm 232 - heapsort. Communications of the ACM, 7(6):347–348, 1964.
[33] S. Yoshida and T. Kida. Effective variable-length-to-fixed-length coding via a Re-Pair algorithm. In Proc. DCC, page 532, 2013.
[34] J. Ziv and A. Lempel. A universal algorithm for sequential data compression.