Efficient and Compact Representations of Some Non-Canonical Prefix-Free Codes
Antonio Fariña, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, Alberto Ordóñez
Database Laboratory, University of A Coruña, Spain; Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Finland; Institute of Computer Science, University of Eastern Piedmont, Italy; IIT-CNR, Pisa, Italy; Department of Computer Science, University of Chile, Chile; Yoop SL, Spain
Abstract.
For many kinds of prefix-free codes there are efficient and compact alternatives to the traditional tree-based representation. Since these put the codes into canonical form, however, they can only be used when we can choose the order in which codewords are assigned to characters. In this paper we first show how, given a probability distribution over an alphabet of σ characters, we can store a nearly optimal alphabetic prefix-free code in o(σ) bits such that we can encode and decode any character in constant time. We then consider a kind of code introduced recently to reduce the space usage of wavelet matrices (Claude, Navarro, and Ordóñez, Information Systems, 2015). They showed how to build an optimal prefix-free code such that the codewords' lengths are non-decreasing when they are arranged such that their reverses are in lexicographic order. We show how to store such a code in O(σ log L + 2^{εL}) bits, where L is the maximum codeword length and ε is any positive constant, such that we can encode and decode any character in constant time under reasonable assumptions. Otherwise, we can always encode and decode a codeword of ℓ bits in time O(ℓ) using O(σ log L) bits of space.

⋆ Funded in part by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941 (project BIRDS). The first author was supported by: MINECO (PGE and FEDER) grants TIN2013-47090-C3-3-P and TIN2015-69951-R; MINECO and CDTI grant ITC-20151305; ICT COST Action IC1302; and Xunta de Galicia (co-funded with FEDER) grant GRC2013/053. The second author was supported by Academy of Finland grants 268324 and 250345 (CoECGR). The fourth author was supported by Millennium Nucleus Information and Coordination in Networks ICM/FIC P10-024F, Chile.

Binary prefix-free codes can be represented as binary trees whose leaves are labelled with the characters of the source alphabet, so that the ancestor at
depth d of the leaf labelled x is a left child if the d-th bit of the codeword for x is a 0, and a right child if it is a 1. To encode a character, we start at the root and descend to the leaf labelled with that character, at each step writing a 0 if we go left and a 1 if we go right. To decode an encoded string, we start at the root and descend according to the bits of the encoding until we reach a leaf, at each step going left if the next bit is a 0 and right if it is a 1. Then we output the character associated with the leaf and return to the root to continue decoding. Therefore, a codeword of length ℓ is encoded or decoded in time O(ℓ). This all generalizes to larger code alphabets, but for simplicity we consider only binary codes in this paper.

There are, however, faster and smaller representations of many kinds of prefix-free codes. If we can choose the order in which codewords are assigned to characters then, by the Kraft Inequality [8], we can put any prefix-free code into canonical form [13], i.e., such that the codewords' lexicographic order is the same as their order by length, with ties broken by the lexicographic order of their characters, without increasing any codeword's length. If we store the first codeword of each length as a binary number then, given a codeword's length and its rank among the codewords of that length, we can compute the codeword via a simple addition. Given a string prefixed by a codeword, we can compute that codeword's length and its rank among codewords of that length via a predecessor search. If the alphabet consists of σ characters and the maximum codeword length is L, then we can build an O(σ log L)-bit data structure with O(log L) query time that, given a character, returns its codeword's length and rank among codewords of that length, or vice versa.
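This classical canonical-code machinery can be sketched as follows. This is our own illustration, not a structure from the paper: the function names are ours, and a binary search stands in for the constant-time predecessor structures of [5].

```python
from bisect import bisect_right

def canonical_codes(lengths):
    """Given codeword lengths sorted non-decreasingly, assign canonical
    codewords as (value, length) pairs: the next codeword is obtained from
    the previous one by an addition, shifting left when the length grows,
    which the Kraft inequality guarantees never causes a prefix clash."""
    codes, code, prev = [], 0, lengths[0]
    for l in lengths:
        code <<= l - prev            # lengthen by appending 0s
        codes.append((code, l))
        code += 1
        prev = l
    return codes

def build_decoder(codes, L):
    """Store the first codeword of each length, left-justified to L bits,
    together with the index of its first symbol: decoding then reduces to
    a predecessor search plus an addition."""
    first = {}
    for i, (v, l) in enumerate(codes):
        first.setdefault(l, (v << (L - l), i))
    return sorted((vj, l, i0) for l, (vj, i0) in first.items())

def decode_one(window, table, L):
    """window holds the next L bits of the stream as an integer.
    Returns (symbol index, codeword length)."""
    vj, l, i0 = table[bisect_right(table, (window, L, len(table))) - 1]
    return i0 + ((window - vj) >> (L - l)), l
```

For instance, the length sequence (1, 2, 3, 3) yields the canonical codewords 0, 10, 110, 111, and feeding the decoder any 3-bit window starting with one of them recovers the symbol and its codeword length.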
If L is at most a constant times the size of a machine word (which it is when we are considering, e.g., Huffman codes for strings in the RAM model) then in theory we can make the predecessor search and the data structure's queries constant-time, meaning we can encode and decode in constant time [5].

There are applications in which there are restrictions on the codewords' order, however. For example, in alphabetic codes the lexicographic order of the codewords must be the same as that of the characters. Such codes are useful when we want to be able to sort encoded strings without decoding them (because the lexicographic order of two encodings is always the same as that of the encoded strings), or when we are using data structures that represent point sets as sequences of coordinates [10], for example. Interestingly, since the mapping between symbols and leaves is fixed, alphabetic codes need only store the tree topology, which can be represented more succinctly than for optimal prefix-free codes, in 2σ + o(σ) bits [9], so that encoding and decoding can still be done in time O(ℓ). There is, however, no equivalent of the faster encoding and decoding methods used on canonical codes [5].

In Section 2 we show how, given a probability distribution over the alphabet, we can store a nearly optimal alphabetic prefix-free code in o(σ) bits such that we can encode and decode any character in constant time. We note that we can still use our construction even if the codewords must be assigned to the characters according to some non-trivial permutation of the alphabet, but then we must store that permutation such that we can evaluate and invert it quickly. In Section 3 we consider another kind of non-canonical prefix-free code, which Claude, Navarro, and Ordóñez [1] introduced recently to reduce the space usage of their wavelet matrices. (Wavelet matrices are alternatives to wavelet trees [6, 10] that are more space efficient when the alphabet is large.)
They showed how to build an optimal prefix-free code such that the codewords' lengths are non-decreasing when they are arranged such that their reverses are in lexicographic order. They represent the code in O(σL) bits, and encode and decode a codeword of length ℓ in time O(ℓ). We show how to store such a code in O(σ log L) bits, and still encode and decode any character in O(ℓ) time. We also show that, by using O(σ log L + 2^{εL}) bits, where ε is any positive constant, we can encode and decode any character in constant time when L is at most a constant times the size of a machine word. Our first variant is simple enough to be implementable. We show experimentally that it uses 23-30 times less space than a classical implementation, at the price of being 10-21 times slower at encoding and 11-30 times slower at decoding.

Our approach to storing an alphabetic prefix code compactly has two parts: first, we show that we can build such a code such that the expected codeword length is at most a factor of (1 + O(1/√lg σ))² = 1 + O(1/√lg σ) greater than optimal, the code-tree has height at most lg σ + √lg σ + 3, and each subtree rooted at depth ⌈lg σ − √lg σ⌉ is completely balanced; then, we show how to store such a code-tree in o(σ) bits such that encoding and decoding take constant time.

Evans and Kirkpatrick [2] showed how, given a binary tree on n leaves, we can build a new binary tree of height at most ⌈lg n⌉ + 1 on the same leaves, in the same left-to-right order, such that the depth of each leaf in the new tree is at most 1 greater than its depth in the original tree.
We can use their result to restrict the maximum codeword length of an optimal alphabetic prefix code, for an alphabet of σ characters, to at most lg σ + √lg σ + 3, while forcing its expected codeword length to increase by at most a factor of 1 + O(1/√lg σ). To do so, we build the tree T_opt of an optimal alphabetic prefix code and then rebuild, according to Evans and Kirkpatrick's construction, each subtree rooted at depth ⌈√lg σ⌉. The resulting tree, T_lim, has height at most ⌈√lg σ⌉ + ⌈lg σ⌉ + 1, and any leaf whose depth increases was already at depth at least ⌈√lg σ⌉.

There are better ways to build a tree T_lim with such a height limit. Itai [7] and Wessner [14] independently showed how, given a probability distribution over an alphabet of σ characters, we can build an alphabetic prefix code T_lim that has maximum codeword length at most lg σ + √lg σ + 3 and is optimal among all such codes. Our construction in the previous paragraph, even if not optimal, shows that the expected codeword length of T_lim is at most 1 + O(1/√lg σ) times that of an optimal code with no length restriction.
Further, let us take T_lim and completely balance each subtree rooted at depth ⌈lg σ − √lg σ⌉. The height remains at most lg σ + √lg σ + 3 and any leaf whose depth increases was already at depth at least ⌈lg σ − √lg σ⌉, so the expected codeword length increases by at most a factor of

(lg σ + √lg σ + 3) / ⌈lg σ − √lg σ⌉ = 1 + O(1/√lg σ).

Let T_bal be the resulting tree. Since the expected codeword length of T_lim is in turn at most a factor of 1 + O(1/√lg σ) larger than that of T_opt, the expected codeword length of T_bal is also at most a factor of (1 + O(1/√lg σ))² = 1 + O(1/√lg σ) larger than the optimal. T_bal then describes our suboptimal code.

To represent T_bal, we store a bitvector B[1..σ] in which B[i] = 1 if and only if the codeword for the i-th character in the alphabet has length at most ⌈lg σ − √lg σ⌉, or the i-th leaf in T_bal is the leftmost leaf in a subtree rooted at depth ⌈lg σ − √lg σ⌉. With Pătraşcu's implementation [12] for B this takes a total of O(2^{lg σ − √lg σ} log σ + σ/log^c σ) = O(σ/log^c σ) bits for any constant c, and allows us to perform in constant time O(c) the following operations on B: (1) access, that is, inspecting any B[i]; (2) rank, where rank(B, i) counts the number of 1s in any prefix B[1..i]; and (3) select, where select(B, j) is the position of the j-th 1 in B, for any j.

Let us for simplicity assume that the alphabet is [1..σ]. For encoding in constant time we store an array S, with one entry per 1 in B, which stores the explicit code assigned to the leaves of T_bal where B[i] = 1, in the same order as B. That is, if B[i] = 1, then the code assigned to the character i is stored at S[rank(B, i)], using lg σ + √lg σ + 3 = O(log σ) bits. Therefore S requires O(2^{lg σ − √lg σ} log σ) = o(σ/log^c σ) bits of space, for any constant c.
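To make B, S, and the decoding table A concrete, the following toy sketch (our own illustration, not the paper's engineered structure) implements the encoding and decoding procedures that the next paragraphs derive, on a hypothetical 4-character code with cut depth D0 = ⌈lg σ − √lg σ⌉ = 1. Naive linear scans stand in for the constant-time rank/select on B, and all names are ours.

```python
import math

def rank1(B, i):
    """Number of 1s in B[0..i], inclusive (naive stand-in)."""
    return sum(B[:i + 1])

def select1(B, j):
    """Position of the j-th 1 (1-based) in B; len(B) if there is none."""
    seen = 0
    for p, bit in enumerate(B):
        seen += bit
        if seen == j:
            return p
    return len(B)                              # sentinel past the end

def encode(i, B, S, lengths):
    """Codeword (value, length) of character i (0-based)."""
    if B[i]:                                   # stored explicitly in S
        k = rank1(B, i) - 1
        return S[k], lengths[k]
    ip = select1(B, rank1(B, i))               # leftmost leaf of i's subtree
    ipp = select1(B, rank1(B, ip) + 1)         # leftmost leaf of next subtree
    r, d = ipp - ip, i - ip                    # subtree size, offset inside it
    h = math.ceil(math.log2(r))                # subtree height
    k = rank1(B, ip) - 1
    base, blen = S[k], lengths[k]              # codeword of the leftmost leaf
    if d < 2 * r - 2 ** h:                     # deepest level: same length
        return base + d, blen
    return base // 2 + d - (r - 2 ** (h - 1)), blen - 1   # one bit shorter

def decode(bits, A, B, S, lengths, D0):
    """First character of bit-string `bits`, via A indexed by D0 bits."""
    j = int(bits[:D0].ljust(D0, '0'), 2)
    if A[j][0] == 'leaf':                      # short codeword: stored directly
        _, i, l = A[j]
        return i, l
    _, ip = A[j]                               # leftmost leaf of a subtree
    ipp = select1(B, rank1(B, ip) + 1)
    r = ipp - ip
    h = math.ceil(math.log2(r))
    k = rank1(B, ip) - 1
    base, blen = S[k], lengths[k]              # blen = D0 + h
    d = int(bits[:blen].ljust(blen, '0'), 2) - base
    if d < 2 * r - 2 ** h:
        return ip + d, blen
    return ip + r - 2 ** (h - 1) + d // 2, blen - 1
```

On the toy code 0, 100, 101, 11 we have B = 1100, S holding the codewords 0 and 100 with lengths 1 and 3, and A mapping the first bit either to the short codeword or to the leftmost leaf of the balanced subtree; encode and decode then invert each other.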
We can also store the length of each code within the same asymptotic space.

To encode the character i, we check whether B[i] = 1 and, if so, we simply look up the codeword in S as explained. If B[i] = 0, we find the preceding 1 at i′ = select(B, rank(B, i)), which marks the leftmost leaf in the subtree rooted at depth ⌈lg σ − √lg σ⌉ that contains the i-th leaf of T_bal. Since the subtree is completely balanced, we can compute the code for the character i in constant time from that of the character i′: the size of the balanced subtree is r = i′′ − i′, where i′′ = select(B, rank(B, i′) + 1), and its height is h = ⌈lg r⌉. Then the first 2r − 2^h codewords of the subtree have the same length as the codeword of i′, and the last 2^h − r have one bit less. Thus, if i − i′ < 2r − 2^h, the codeword for i is S[rank(B, i′)] + (i − i′), of the same length as that of i′; otherwise it is one bit shorter, (S[rank(B, i′)] + 2r − 2^h)/2 + (i − i′) − (2r − 2^h) = S[rank(B, i′)]/2 + (i − i′) − (r − 2^{h−1}).

To be able to decode quickly, we store an array A[1..2^{⌈lg σ − √lg σ⌉}] such that, for 1 ≤ j ≤ 2^{⌈lg σ − √lg σ⌉}, if the ⌈lg σ − √lg σ⌉-bit binary representation of j is the i-th codeword (padded with 0s on the right if necessary), then A[j] stores i and the length of that codeword. If, instead, the ⌈lg σ − √lg σ⌉-bit binary representation of j is the path label to the root of a subtree of T_bal with size more than 1, then A[j] stores the position i′ in B of the leftmost leaf in that subtree (thus B[i′] = 1). Again, A takes O(2^{lg σ − √lg σ} log σ) = o(σ/log^c σ) bits, for any constant c.

Given a string prefixed by the i-th codeword, we take the prefix of length ⌈lg σ − √lg σ⌉ of that string (padding with 0s on the right if necessary), view it as the binary representation of a number j, and check A[j].
This either tells us immediately i and the length of the i-th codeword, or tells us the position i′ in B of the leftmost leaf in the subtree containing the desired leaf. In the latter case, since the subtree is completely balanced, we can compute i in constant time: we find i′′, r, and h as done for encoding. We then take the first ⌈lg σ − √lg σ⌉ + h bits of the string (including the prefix we had already read, and padding with 0s if necessary), and interpret them as a number j′. Then, if d = j′ − S[rank(B, i′)] < 2r − 2^h, it holds that i = i′ + d. Otherwise, the codeword is one bit shorter and i = i′ + 2r − 2^h + ⌊(d − (2r − 2^h))/2⌋ = i′ + r − 2^{h−1} + ⌊d/2⌋.

Theorem 1.
Given a probability distribution over an alphabet of σ characters, we can build an alphabetic prefix code whose expected codeword length is at most 1 + O(1/√lg σ) times that of an optimal alphabetic code, and store it in O(σ/log^c σ) bits, for any constant c, such that we can encode and decode any character in constant time O(c).

As we mentioned in Section 1, in order to reduce the space usage of their wavelet matrices, Claude, Navarro, and Ordóñez [1] recently showed how to build an optimal prefix code such that the codewords' lengths are non-decreasing when they are arranged such that their reverses are in lexicographic order. Specifically, they first build a normal Huffman code and then use the Kraft Inequality to build another code with the same codeword lengths and the desired property. They store an O(σL)-bit mapping between characters and their codewords, where again σ is the alphabet size and L is the maximum length of any codeword, which allows them to encode and decode codewords of length ℓ in time O(ℓ). (In the wavelet matrices, they already spend O(ℓ) time in the operations associated with encoding and decoding.)

Assume we are given a code produced by Claude et al.'s construction. We reassign the codewords of the same length such that the lexicographic order of the reversed codewords of that length is the same as that of their characters. This preserves the property that codeword lengths are non-decreasing with their reverse lexicographic order. The positive aspect of this reassignment is that all the information on the code can be represented in σ lg L bits as a sequence D = d_1, ..., d_σ, where d_i is the depth of the leaf encoding character i in the code-tree T.
We can then represent D using a wavelet tree [6], which uses O(σ log L) bits and supports the following operations on D in time O(log L): (1) access any D[i], which gives the length ℓ of the codeword of character i; (2) compute r = rank_ℓ(D, i), which gives the number of occurrences of ℓ in D[1..i], and which, if D[i] = ℓ, gives the position (in reverse lexicographic order) of the leaf representing character i among those of codeword length ℓ; and (3) compute i = select_ℓ(D, r), which gives the position in D of the r-th occurrence of ℓ, or, what is the same, the character i corresponding to the r-th codeword of length ℓ (in reverse lexicographic order).

If, instead of O(log L) time, we wish to perform the operations in time O(ℓ), where ℓ is the length of the codeword involved in the operation, we can simply give the wavelet tree of D the same shape as the tree T. We can even perform the operations in time O(log ℓ) by using a wavelet tree shaped like the trie of the Elias γ- or δ-codes of the lengths [4, Observation 1]. The size stays O(σ log L) if we use compressed bitmaps at the nodes [6, 10].

We are left with two subproblems. For decoding the first character encoded in a binary string, we need to find the length ℓ of the first codeword and the lexicographic rank r of its reverse among the reversed codewords of that length, since then we can decode i = select_ℓ(D, r). For encoding a character i, we find its length ℓ = D[i] and the lexicographic rank r = rank_ℓ(D, i) of its reverse among the reversed codewords of length ℓ, and then we must find the codeword given ℓ and r. We first present a solution that takes O(L log σ) ⊆ O(σ log L) further bits and works in O(ℓ) time. We then present a solution that takes O(2^{εL})
We then present a solution that takes O (cid:0) ǫL (cid:1) further bits and works in constant time.Let T be the code-tree and, for each depth d between 0 and L , let nodes ( d )be the total number of nodes at depth d in T and let leaves ( d ) be the number ofleaves at depth d . Let v be a node other than the root, let u be v ’s parent, let r v be the lexicographic rank (counting from 1) of v ’s reversed path label among allthe reversed path labels of nodes at v ’s depth, and let r u be defined analogouslyfor u . Notice that since T is optimal it is strictly binary, so half the nodes ateach positive depth are left children and half are right children. Moreover, thereversed path labels of all the left children at any depth are lexicographicallyless than the reversed path labels of all the right children at the same depth (or,indeed, at any depth). Finally, the reversed path labels of all the leaves at anydepth are lexicographically less than the reversed path labels of all the internalnodes at that depth. It follows that – v is u ’s left child if and only if r v ≤ nodes ( depth ( v )) / – if v is u ’s left child then r v = r u − leaves ( depth ( u )), – if v is u ’s right child then r v = r u − leaves ( depth ( u )) + nodes ( depth ( v )) / r u in terms of r v .Suppose we store nodes ( d ) and leaves ( d ) for d between 0 and L . With thethree observations above, given a codeword of length ℓ , we can start at the rootand in O ( ℓ ) time descend in T until we reach the leaf v whose path label is thatcodeword, then return its depth ℓ and the lexicographic rank r = r v of its reversepath label among all the reversed path labels of nodes at that depth. Then wecompute i from ℓ and r as described, in further O (log ℓ ) time. For encoding i , Since the code tree has height L and σ leaves, it follows that L < σ . 
(This descent is conceptual: we do not maintain a concrete node v at each level, but we do know r_v.) For encoding i, we obtain as explained its length ℓ and the rank r = r_v of its reversed codeword among the reversed codewords of that length. Then we use the formulas to walk up towards the root, finding in each step the rank r_u of the parent u of v, and determining whether v is a left or a right child of u. This yields the ℓ bits of the codeword of i in reverse order (0 when v is a left child of u and 1 otherwise), in overall time O(ℓ). This completes our first solution, which we evaluate experimentally in Section 4.

Theorem 2.
Suppose we are given an optimal prefix code in which the codewords' lengths are non-decreasing when they are arranged such that their reverses are in lexicographic order. We can store such a code in O(σ log L) bits (possibly after swapping characters' codewords of the same length), where σ is the alphabet size and L is the maximum codeword length, such that we can encode and decode any character in O(ℓ) time, where ℓ is the corresponding codeword length.

If we want to speed up descents, we can build a table that takes as arguments a depth and several bits, and returns the difference between r_u and r_v for any node u at that depth and its descendant v reached by following edges corresponding to those bits. Notice that this difference depends only on the bits and the numbers of nodes and leaves at the intervening levels. If the table accepts t bits as arguments at once, then it takes O(L 2^t log σ) bits and we can descend in O(L/t) time. Setting t = εL/2, and since L ≥ lg σ, we use O(2^{εL}) space and descend from the root to any leaf in constant time.

Speeding up ascents is slightly more challenging. Consider all the path labels of a particular length that end with a particular suffix of length t: the lexicographic ranks of their reverses form a consecutive interval. Therefore, we can partition the nodes at any level by their r values, such that knowing which part a node's r value falls into tells us the last t bits of that node's path label, and the difference between that node's r value and the r value of its ancestor at depth t less. For each depth, we store the first r value in each interval in a predecessor data structure, implemented as a trie with degree σ^{ε/2}; since there are at most 2^t intervals in the partition for each depth and L ≥ lg σ, setting t = εL/2 we use O(L 2^{εL/2} σ^{ε/2} log σ) ⊆ O(2^{εL}) bits and ascend from any leaf to the root in constant time.

Finally, the operations on the wavelet tree can be made constant-time by using a balanced multiary variant [3].

Theorem 3.
Suppose we are given an optimal prefix code in which the codewords' lengths are non-decreasing when they are arranged such that their reverses are in lexicographic order. Let L be the maximum codeword length, and assume it is at most a constant times the size of the machine word. Then we can store such a code in O(σ log L + 2^{εL}) bits (possibly after swapping characters' codewords of the same length), where ε is any positive constant, such that we can encode and decode any character in constant time.

Table 1. Main statistics of the texts used: for each collection (EsWiki, EsInv, Indo) it lists the length n, the alphabet size σ, the entropy H(P) of the symbol distribution, the maximum code length L, and the zero-order entropy H(D) of the level entries.
We have run experiments to compare the solution of Theorem 2 (referred to as WMM in the sequel, for Wavelet Matrix Model) with the only previous encoding, that is, the one used by Claude et al. [1] (denoted TABLE). Note that our codes are not canonical, so other solutions [5] do not apply.

Claude et al. [1] use for encoding a single table of σL bits storing the code of each symbol, and thus they easily encode in constant time. For decoding, they have tables separated by codeword length ℓ. In each such table, they store the codewords of that length and the associated characters, sorted by codeword. This requires σ(L + lg σ) further bits, and permits decoding by binary searching for the codeword found in the wavelet matrix. Since there are at most 2^ℓ codewords of length ℓ, the binary search takes time O(ℓ).

For the sequence D used in our WMM, we use binary Huffman-shaped wavelet trees with plain bitmaps. The structures supporting rank/select efficiently require 37.5% space overhead, so the total space is 1.375 σ H(D), where H(D) ≤ lg L is the per-symbol zero-order entropy of the sequence D. We also add a small index to speed up select queries [11] (that is, decoding), which can be parameterized with a sampling value, for which we tried four settings. Finally, we store the values leaves and nodes, which add an insignificant O(L log σ) bits in total.

We used a prefix of three datasets from http://lbd.udc.es/research/ECRPC . The first one, EsWiki, contains a sequence of word identifiers generated by using the Snowball algorithm to apply stemming to the Spanish Wikipedia. The second one, EsInv, contains a concatenation of differentially encoded inverted lists extracted from a random sample of the Spanish Wikipedia. The third dataset, Indo, was created with the concatenation of the adjacency lists of the Web graph Indochina-2004, available at http://law.di.unimi.it/datasets.php . In Table 1 we provide some statistics about the datasets. We include the number of symbols in the dataset (n) and the alphabet size (σ). Assuming P is the relative frequency distribution of the alphabet symbols, H(P) indicates (in bits per symbol) the empirical entropy of the sequence; this approximates the average ℓ value of the queries. Finally, we show L, the maximum code length, and the zero-order entropy of the sequence D, H(D), in bits per symbol. The last column is then a good approximation of the size of our Huffman-shaped wavelet tree for D.

Our test machine has an Intel(R) Core(TM) CPU at 3.40GHz (4 cores / 8 siblings) and 64GB of DDR3 RAM. It runs Ubuntu Linux 12.04 (kernel 3.2.0-99-generic). The compiler used was g++ version 4.6.4 and we set the compiler optimization flags to -O9. All our experiments run in a single core and time measures refer to CPU user time.
Fig. 1.
Size of code representations versus either compression time (left) or decompression time (right). Time is measured in nanoseconds per symbol.
Figure 1 compares the space required by both code representations and their compression and decompression times. As expected, the space per character of our new code representation, WMM, is close to 1.375 H(D), whereas that of TABLE is close to 2L + lg σ. This explains the large difference in space between both representations, a factor of 23-30 times. For decoding we show the mild effect of adding the structure that speeds up select queries.

The price of our representation is the encoding and decoding time. While the TABLE approach encodes using a single table access, in 8-18 nanoseconds, our representation needs 130-230, which is 10 to 21 times slower. For decoding, the binary search performed by TABLE takes 20-50 nanoseconds, whereas our WMM representation requires 510-700 in the slowest and smallest variant (i.e., 11-30 times slower). Our faster variants require 300-510 nanoseconds, which is still several times slower.
A classical prefix code representation uses O(σL) bits, where σ is the alphabet size and L the maximum codeword length, and encodes in constant time and decodes a codeword of length ℓ in time O(ℓ). Canonical prefix codes can be represented in O(σ log L) bits, so that one can encode and decode in constant time under reasonable assumptions. In this paper we have considered two families of codes that cannot be put in canonical form. Alphabetic codes can be represented in O(σ) bits, but encoding and decoding take time O(ℓ). We gave an approximation that worsens the average code length by a factor of 1 + O(1/√lg σ), but in exchange requires o(σ) bits and encodes and decodes in constant time. We then considered a family of codes that are canonical when read right to left. For those we obtained a representation using O(σ log L) bits that encodes and decodes in time O(ℓ), or even in O(1) time under reasonable assumptions if we use O(2^{εL}) further bits, for any constant ε > 0.

A pending issue is to implement the wavelet tree of D with a shape that lets it operate in time O(ℓ) or O(log ℓ), as used to prove Theorem 2; currently we give it Huffman shape in order to minimize space. Since there are generally more longer than shorter codewords, the Huffman shape puts them higher in the wavelet tree of D, so the longer codewords perform faster and the shorter codewords perform slower. This is the opposite effect from the one sought in Theorem 2. Therefore, a faithful implementation may lead to a slightly larger but also faster representation.

An interesting challenge is to find optimal alphabetic encodings that can encode and decode faster than in time O(ℓ), even if they use more than O(σ) bits of space. Extending our results to other non-canonical prefix codes is also an interesting line of future work.

Acknowledgements
This research was carried out in part at the University of A Coruña, Spain, while the second author was visiting and the fifth author was a PhD student there. It started at a StringMasters workshop at the Research Center on Information and Communication Technologies (CITIC) of the university. The workshop was partly funded by EU RISE project BIRDS (Bioinformatics and Information Retrieval Data Structures). The authors thank Nieves Brisaboa and Susana Ladra.
References
1. F. Claude, G. Navarro, and A. Ordóñez. The wavelet matrix: An efficient wavelet tree for large alphabets. Inf. Systems, 47:15-32, 2015.
2. W. Evans and D. G. Kirkpatrick. Restructuring ordered binary trees. J. Algorithms, 50:168-193, 2004.
3. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Trans. Alg., 3(2):article 20, 2007.
4. T. Gagie, M. He, J. I. Munro, and P. K. Nicholson. Finding frequent elements in compressed 2D arrays and strings. In Proc. SPIRE, pages 295-300, 2011.
5. T. Gagie, G. Navarro, Y. Nekrich, and A. Ordóñez. Efficient and compact representations of prefix codes. IEEE Trans. Inf. Theory, 61(9):4999-5011, 2015.
6. R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proc. SODA, pages 841-850, 2003.
7. A. Itai. Optimal alphabetic trees. SIAM J. Comp., 5:9-18, 1976.
8. L. G. Kraft. A device for quantizing, grouping, and coding amplitude modulated pulses. M.Sc. thesis, EE Dept., MIT, 1949.
9. J. I. Munro and V. Raman. Succinct representation of balanced parentheses and static trees. SIAM J. Comp., 31(3):762-776, 2001.
10. G. Navarro. Wavelet trees for all. J. Discr. Alg., 25:2-20, 2014.
11. G. Navarro and E. Providel. Fast, small, simple rank/select on bitmaps. In Proc. SEA, LNCS 7276, pages 295-306, 2012.
12. M. Pătraşcu. Succincter. In Proc. FOCS, pages 305-313, 2008.
13. E. S. Schwartz and B. Kallick. Generating a canonical prefix encoding. Comm. of the ACM, 7:166-169, 1964.
14. R. L. Wessner. Optimal alphabetic search trees with restricted maximal height.