Tree structure compression with RePair
TTree structure compression with RePair
Markus Lohrey ,(cid:63) , Sebastian Maneth , and Roy Mennicke Universit¨at Leipzig, Institut f¨ur Informatik, Germany NICTA and University of New South Wales, Australia [email protected], [email protected], [email protected]
Abstract.
In this work we introduce a new linear time compression algorithm, called ”Re-pair for Trees”, whichcompresses ranked ordered trees using linear straight-line context-free tree grammars. Such grammars general-ize straight-line context-free string grammars and allow basic tree operations, like traversal along edges, to beexecuted without prior decompression. Our algorithm can be considered as a generalization of the ”Re-pair” al-gorithm developed by N. Jesper Larsson and Alistair Moffat in 2000. The latter algorithm is a dictionary-basedcompression algorithm for strings. We also introduce a succinct coding which is specialized in further compress-ing the grammars generated by our algorithm. This is accomplished without loosing the ability do directly executequeries on this compressed representation of the input tree. Finally, we compare the grammars and output filesgenerated by a prototype of the Re-pair for Trees algorithm with those of similar compression algorithms. Theobtained results show that that our algorithm outperforms its competitors in terms of compression ratio, runtimeand memory usage.
Trees are nowadays a common data structure used in computer science to represent data hierar-chically. This is, for instance, evidenced by XML documents which are widely used after theirintroduction in 1996. They are sequential representations of ordered unranked trees. When pro-cessing trees it is often convenient to hold the tree structure in memory in order to retain fast andrandom access to its nodes. However, this often leads to a heavy resource consumption in termsof memory usage due to the necessary pointer structure which represents the tree structure. Thespace needed to load an entire XML document into main memory in order to access it through aDOM proxy is usually 3–8 times larger than the size of the document itself [WLH07]. Therefore,it is essential for very large tree structures to use a memory efficient representation.In [FGK03,BGK03] directed acyclic graphs (DAGs) were proposed to overcome this prob-lem. By sharing common subtrees one is able to reduce the size of the in-memory representationby a factor of about 10 [BGK03]. One of the most appealing properties of this representationis that queries like the ones of the XPath language can be directly executed on the compressedrepresentation, i.e. , it is not necessary to completely unfold the DAG.Later, in [BLM08] so called linear straight-line context-free tree grammars were proposedas a more succinct representation of an input tree. These grammars represent exactly one treeand generalize the concept of sharing common subtrees to the sharing of repeating tree patterns.Most important, this new representation is still queryable, i.e. , queries can be evaluated withoutprior decompression. At the same time, the complexity of querying, e.g., using XQuery, stays thesame as for DAGs [LM06]. (cid:63)
The first author is supported by the DFG research project ALKODA. a r X i v : . [ c s . D S ] J u l ource message: a b c d a b c After the replacement of ( a, b ) : A c d A c
Fig. 1:
The pair ( a, b ) is replaced by the new symbol A . However, finding the smallest linear straight-line context-free tree grammar generating agiven tree is NP-hard. Already finding the smallest context-free string grammar for a given stringis NP-complete [CLL + Our main contribution is a compression algorithm, called ”Re-pair for Trees”, which is basedon linear straight-line context-free tree grammars. Our investigations show that, regarding ourtest data, the grammars generated by Re-pair for Trees are always smaller than the grammarsproduced by the BPLEX algorithm. In addition, our algorithmoutperforms BPLEX in terms of runtime and memory usage. Note that especially runtimewas a huge drawback of the BPLEX implementation.The Re-pair for Trees algorithm is a generalization of the ”Re-pair” algorithm which was de-veloped by L
ARSSON and M
OFFAT in [LM00]. The latter algorithm is an offline dictionary-basedcompression method for strings consisting of a simple but powerful phrase derivation methodand a compact dictionary encoding. A dictionary-based compression algorithm is an algorithmwhere the input message is parsed into a sequence of phrases selected from a dictionary. Sincethe reference to a phrase in the dictionary is more compact than the phrase itself often a consid-erable compression can be achieved. Re-pair’s dictionary is inferred offline since it is generatedby considering the whole input message and since it is written out as a part of the compresseddata so that it is available to the decoder. The name Re-pair stands for ”recursive pairing” anddescribes the idea of the algorithm. The latter is to count the frequencies of all pairs formed bytwo adjacent symbols of the source message, replacing the most frequent pair by a new symbol(see Fig. 1), updating the frequency counters of all involved pairs and repeating this process untilthere are no pairs occurring twice in the source message. This compression technique allowssearching the compressed data without prior decompression.
In Sect. 3 we explain in detail the two steps of which the Re-pair for Trees algorithm consists.We also present a complete example of a run of our algorithm and consider the compressibilityof special types of trees depending on the maximal rank allowed for a nonterminal. Sect. 4explains some of the implementation details for the Re-pair for Trees algorithm which is called2reeRePair. In particular, we elaborate on its linear runtime, the internal data structures usedand its efficient in-memory representation of the input tree. Moreover, in Sect. 5 we present asuccinct coding which is specialized in further compressing the grammars generated by the Re-pair for Trees algorithm without loosing the ability to directly execute queries on this compressedrepresentation of the input tree. By using a combination of multiple Huffman codings, a run-length coding and a fixed-length coding the resulting file sizes are always smaller than the sizesof the files generated by competing compression algorithms when executed on our test data.In Sect. 6 we compare the compression results of our implementation of the Re-pair for Treesalgorithm with several other compression algorithms. In particular, we consider BPLEX and”Extended-Repair”. The latter algorithm is also based on the Re-pair for strings algorithm andwas independently developed at the University of Paderborn, Germany [Kri08,BHK10].
In the following, N > = N \ { } denotes the set of non-zero natural numbers. For a set X wedenote by X ∗ the set of all finite words over X . For w = x x . . . x n ∈ X ∗ we define | w | = n .The empty word is denoted by ε .We sometimes surround an element of N by square brackets in order to emphasize that wecurrently consider it a character instead of a number. For instance, for the sequence of integers we shortly write [2] [1] instead of to clarify that we are not dealing with the fifthpower of . A ranked alphabet is a tuple ( F , rank ) , where F is a finite set of function symbols and thefunction rank : F → N assigns to each α ∈ F its rank . Furthermore, we define F i = { a ∈ F | rank ( α ) = i } . We fix a ranked alphabet ( F , rank ) in the following. An F -labeled ordered tree isa pair t = ( dom t , λ t ) , where(1) dom t ⊆ N ∗ > is a finite set of nodes ,(2) λ t : dom t → F ,(3) if w = vv (cid:48) ∈ dom t , then also v ∈ dom t , and(4) if v ∈ dom t and λ t ( v ) ∈ F n , then vi ∈ dom t if and only if ≤ i ≤ n .The node ε ∈ dom t is called the root of t . By index ( w ) , where w = vi ∈ dom t \ { ε } and i ∈ N > ,we denote the index i of the node w , i.e. , w is the i -th child of its parent node. Furthermore, wedefine parent ( w ) = v . The size of t is given by the number of edges of which it consists, i.e. , wehave | t | = | dom t | − . The depth of the tree t is depth ( t ) = max {| u | | u ∈ dom t } . We identify an F -labeled tree t with a term in the usual way: if λ t ( ε ) = α ∈ F i , then this term is α ( t , . . . , t i ) ,where t j is the term associated with the subtree of t rooted at node j , where j ∈ { , . . . , i } . Theset of all F -labeled trees is T ( F ) . Example 1.
In Fig. 2 an F -labeled ordered tree t is shown. We have dom t = { ε, , , , , , , , , , , , , , , , } .3 gia a ia a gia a ia b ha Fig. 2: F -labeled ordered tree t We fix a countable set Y = { y , y , . . . } with Y ∩ F = ∅ of (formal context-) parameters (belowwe also use a distinguished parameter z / ∈ Y ). The set of all F -labeled trees with parametersfrom Y ⊆ Y is denoted by T ( F , Y ) . Formally, we consider parameters as function symbols ofrank and define T ( F , Y ) = T ( F ∪ Y ) . The tree t ∈ T ( F , Y ) is said to be linear if everyparameter y ∈ Y occurs at most once in t . By t [ y /t , . . . , y n /t n ] we denote the tree that isobtained by replacing in t for every i ∈ { , , . . . , n } every y i -labeled leaf with t i , where t ∈ T ( F , { y , . . . , y n } ) and t , . . . , t n ∈ T ( F , Y ) . A context is a tree C ∈ T ( F , Y ∪ { z } ) in whichthe distinguished parameter z appears exactly once. Instead of C [ z/t ] we write briefly C [ t ] . Let t = ( dom t , λ t ) ∈ T ( F , { y , . . . , y n } ) such that for every y i there exists a node v ∈ dom t with λ t ( v ) = y i . We say that t is a tree pattern occurring in t (cid:48) ∈ T ( F , Y ) if there exist a context C ∈ T ( F , Y ∪ { z } ) and trees t , . . . , t n ∈ T ( F , Y ) such that C (cid:2) t [ y /t , y /t , . . . , y n /t n ] (cid:3) = t (cid:48) . For further consideration, let us fix a countable infinite set N i of symbols of rank i ∈ N with F i ∩ N i = ∅ and Y ∩ N = ∅ . Hence, every finite subset N ⊆ (cid:83) i ≥ N i is a ranked alphabet.A context-free tree grammar (over the ranked alphabet F ) or short CF tree grammar is a triple G = ( N, P, S ) , where(1) N ⊆ (cid:83) i ≥ N i is a finite set of nonterminals ,(2) P (the set of productions ) is a finite set of pairs ( A → t ) , where A ∈ N , t ∈ T ( F ∪ N, { y , . . . , y rank ( A ) } ) , t / ∈ Y , each of the parameters y , . . . , y rank ( A ) appears in t , and (3) S ∈ N is the start nonterminal of rank .We assume that every nonterminal B ∈ N \ { S } as well as every terminal symbol from F occursin the right-hand side t of some production ( A → t ) ∈ P .Let us define the derivation relation ⇒ G on T ( F ∪ N, Y ) as follows: s ⇒ G s (cid:48) iff there existsa production ( A → t ) ∈ P with rank ( A ) = n , a context C ∈ T ( F ∪ N, Y ∪ { z } ) , and trees t , . . . , t n ∈ T ( F ∪ N, Y ) such that s = C [ A ( t , . . . , t n )] and s (cid:48) = C [ t [ y /t · · · y n /t n ]] . Let L ( G ) = { t ∈ T ( F ) | S ⇒ ∗G t } ⊆ T ( F ) . In contrast to [LMSS09], our definition of a context-free tree grammar inherits productivity, i.e. , t / ∈ Y and each parameter y , . . . , y rank ( A ) appears in t for every ( A → t ) ∈ P . This is justified by the fact that the grammars generated by the Re-pairfor Trees algorithm are always productive. size |G| of the CF tree grammar G is defined by |G| = (cid:88) ( A → t ) ∈ P | t | .That means that |G| equals the sum of the numbers of edges of the right-hand sides of P ’sproductions. We consider the following restrictions on context-free tree grammars: – G is k -bounded (for k ∈ N ) if rank ( A ) ≤ k for every A ∈ N . – G is monadic if it is -bounded. – G is linear if for every ( A → t ) ∈ P the term t is linear.Let G = ( N, P, S ) be a CF tree grammar. We denote the set of all nodes in the right-hand sidesof G ’s productions which are labeled by the nonterminal A ∈ N by ref G ( A ) , i.e. 
, ref G ( A ) = { ( t, v ) | ∃ ( B → t ) ∈ P : v ∈ dom t ∧ λ t ( v ) = A } .Furthermore, let us define the following relation: (cid:32) G = { ( A, B ) ∈ N × N | ( B → t ) ∈ P ∧ A occurs in t } A straight-line context-free tree grammar (SLCF tree grammar) is a CF tree grammar G =( N, P, S ) , where(1) for every A ∈ N there is exactly one production ( A → t ) ∈ P with left-hand side A , and(2) the relation (cid:32) G is acyclic.The conditions (1) and (2) ensure that L ( G ) contains exactly one tree, which we denote by val ( G ) .Let G be an SLCF tree grammar. We call the reflexive transitive closure of (cid:32) G the hierarchicalorder of G and denote it by (cid:32) ∗G . Example 2.
Consider the (linear and monadic) SLCF tree grammar G = ( N, P, S ) given by thefollowing productions: S → f (cid:0) A ( a ) , A ( b ) , B (cid:1) A ( y ) → g (cid:0) i ( a, a ) , i ( a, y ) (cid:1) B → h ( a ) We have val ( G ) = t , where t ∈ T ( F ) is the tree from Example 1 on page 3.SLCF tree grammars can be considered as a generalization of the well-known DAGs (see, forinstance, [LM06] for a common definition). Whereas the latter is a structure preserving compres-sion of a tree by sharing common subtrees (see Fig. 3.1 for a depiction), SLCF tree grammarsbroaden this concept to the sharing of repeated tree patterns in a tree (see Fig. 3.2). Actually, aDAG can be considered as a -bounded SLCF tree grammar.Let G = ( N, P, S ) be a linear SLCF tree grammar. We define the function sav G ( A ) = | ref G ( A ) | · ( | t | − rank ( A )) − | t | , (1)5 t (cid:48) t (cid:48) Fig. 3.1:
A tree t containing two oc-currences of the very same subtree t (cid:48) . t p p Fig. 3.2:
A tree t containing two oc-currences of the tree pattern p . which computes for every production ( A → t ) ∈ P its contribution to a small representationof the tree val ( G ) by the linear SLCF tree grammar G . The value sav G ( A ) specifies the numberof edges by which the production with left-hand side A reduces the size of the grammar G .However, sav G is not restricted to positive values. In particular, for a production ( A → t ) ∈ P with | ref G ( A ) | = 1 we have sav G ( A ) = − rank ( A ) . Thus, a production which is only referencedonce can be safely removed from the grammar without increasing the size of G .Context-free tree grammars [CDG +
07] and especially SLCF tree grammars have been thor-oughly studied recently. In theory, SLCF tree grammars in theory can be exponentially moresuccinct than DAGs [LM06], which already can achieve exponential compression ratios. Further-more, in [LM06] various membership and complexity problems were considered. It was shownthat in many cases the same complexity bounds hold as for DAGs. In particular, it was pinpointedthat for a given nondeterministic tree automaton A and a linear, k -bounded SLCF tree grammar G it can be checked in polynomial time if val ( G ) is accepted by A – provided that k is a constant.This is a worth mentioning result since in the context of XML, for instance, tree automata areused to type check XML documents against an XML schema ( cf. [MLMK05,Nev02]). More-over, this result was further improved in [LMSS09], where it was shown that every linear SLCFtree grammar can be transformed in polynomial time into a monadic (and linear) one. Togetherwith the above mentioned result from [LM06], a polynomial time algorithm for testing if a givennondeterministic tree automaton accepts a tree given by a linear SLCF tree grammar (of arbitrarymaximal rank for the nonterminals) can be obtained.In [BLM08] the so called BPLEX algorithm was presented. It produces for a given -boundedSLCF tree grammar G , i.e. , G represents a DAG, in time O ( |G | ) an equivalent linear SLCF treegrammar G , where val ( G ) = val ( G ) and G is k -bounded ( k is an input parameter). Experi-ments have shown that |G | is approximately 2–3 times smaller than |G | .Moreover, in [LMSS09] it was proved that the evaluation problem for core XPath (the nav-igational part of XPath) over SLCF tree grammars is PSPACE-complete just as this was provedearlier for DAG-compressed trees by F RICK , G
ROHE and K
OCH in [FGK03]. The evaluationproblem for XPath asks whether a given node in a given tree is selected by a given XPath expres-sion. This result is remarkable since with SLCF tree grammars one achieves better compressionratios than with DAGs. 6 books>
An simplified XML document.
Regarding XML documents, we use the official terminology introduced in [BPSM + elements which are either delimited by start-tags and end-tags or by an empty-element tag . The text between the start-tag and the end-tag of an elementis called the element’s content . An element with no content is said to be empty . There is exactlyone element, called root , which does not appear in the content of any other element. Example 3.
The simplified XML document from Fig. 3 consists of 21 elements of the five types books , book , author , title and isbn . The elements of type books and book are de-limited by start- and end-tags and exhibit element content. The remaining elements are emptyelements delimited by empty-element tags. The root of the XML document is the element of type books .The name in the start- and end-tags of an element give the element’s type . Elements can specify attributes by using name-value pairs. Consider for instance the element
An XML document tree can be considered as an unranked tree, i.e. , nodes with the same labelpossibly have a varying number of children. Figure 4 shows the XML document tree of the XMLdocument from Example 3. In our case, the XML document tree is a ranked tree, i.e. , all nodeswith the same label exhibit the same number of children. However, the XML document mightas well have contained an element of type book exhibiting a second author child element. Inthis case, we would have not obtained a ranked tree.In the next section we will learn that our Re-pair for Trees algorithm operates on ranked treesonly. Therefore, in general, a transformation of an XML document tree becomes necessary. Acommon way of modeling such a tree in a ranked way is to transform it into a binary F -labeledordered tree t by encoding first-child and next-sibling relations. In fact,7 ooksbookauthor title isbn bookauthor title isbn bookauthor title isbn bookauthor title isbn bookauthor title isbn Fig. 4:
XML document tree of the XML document listed in Fig. 3 books book author title isbn · · · book author title isbn times book author title isbn Fig. 5:
Binary tree representation of the XML document tree from Fig. 4. – the first child element of an XML element becomes the left child of the node representing itsparent element and – the right sibling element of another element becomes the right-child of the node representingits left sibling ( cf. Fig. 5).Note that a node representing a leaf (resp. a last sibling) of the XML document has no left(resp. no right) child in the binary tree model representation. Therefore F does not consist of theelement types of the XML document but of special versions of the element types indicating thatthe left, the right, both or no children are missing. In Fig. 5 this is denoted by superscripts at theend of the element types. These superscripts are listed in Table 1 together with their meanings.Let us point out that another way of preserving the rankedness along with circumventing the Superscript Meaning00 no children10 no right child01 no left child11 two children
Table 1:
The superscripts and their meanings. introduction of special labels with a lower rank is the introduction of placeholder nodes. Thesecan be used to indicate missing left or right children. However, our experiments showed that ourimplementation of Re-pair for Trees achieves slightly less competitive compression results inthis setting. 8n [BLM08] it was stated that the binary tree model allows access to the next-in-preorder andprevious-in-preorder node in O ( depth ) , where depth refers to the longest path from the root ofthe XML document to one of its leaves. Furthermore, in [MSV03] it was demonstrated that XMLquery languages can be readily evaluated on the binary tree model. In this section we study the Re-pair for Trees algorithm in detail. It consists of two steps, namely,a replacement step and a pruning step . Furthermore, a detailed example of a run of our algorithmis presented. Finally, we investigate the impact of a possible restriction on the maximal rankallowed for nonterminals.
In order to be able to elaborate on our Re-pair for Trees algorithm we need the following defini-tions. Recall that we have fixed a ranked alphabet F of function symbols, a set N of nonterminalsand a set Y of parameters. We define the set of triples Π = (cid:91) a ∈F∪N { a } × { , , . . . , rank ( a ) } × ( F ∪ N ) .A digram is a triple α = ( a, i, b ) ∈ Π . The symbol a is called the parent symbol of the digram α and b is called the child symbol of the digram α , respectively. We define par ( α ) = rank ( a ) + rank ( b ) − and pat ( α ) = a (cid:0) y , . . . , y i − , b ( y i , . . . , y j − ) , y j , . . . , y par ( α ) (cid:1) ,where j = i + rank ( b ) and y , y , . . . , y par ( α ) ∈ Y . Let m ∈ N ∪ {∞} . We further define the set Π m = { α ∈ Π | par ( α ) ≤ m } .Obviously, it holds that Π ∞ = Π . We can consider pat ( α ) as the tree pattern which is rep-resented by the digram α . We usually denote digrams by possibly indexed lowercase letters α, α , α , . . . , β, . . . of the Greek alphabet. An occurrence of the digram α ∈ Π within the tree t = ( dom t , λ t ) ∈ T ( F ∪ N , Y ) is a node v ∈ dom t at which a subtree pat ( α )[ y /t , y /t , . . . , y par ( α ) /t par ( α ) ] ,with t , t , . . . , t par ( α ) ∈ T ( F ∪ N , Y ) , is rooted. The set of all occurrences of the digram α in t is denoted by OCC t ( α ) ⊆ dom t .Let α = ( a, i, b ) ∈ Π and t ∈ T ( F ∪ N , Y ) . Two occurrences v, w ∈ OCC t ( α ) are overlap-ping if one of the following equations holds: v = w , vi = w or wj = v . Otherwise, i.e. , if v and w are not overlapping, v and w are said to be non-overlapping . A subset σ ⊆ OCC t ( α ) is said tobe overlapping if there exist overlapping v, w ∈ σ , otherwise it is called non-overlapping . It iseasy to see that the set OCC t ( α ) is non-overlapping if a (cid:54) = b . In contrast, if we have a = b , theset OCC t ( α ) potentially contains overlapping occurrences. Consider the following example:9 t ft ft ft t Fig. 6:
Tree t ∈ T ( F ) consisting of nodes labeled by the terminal f ∈ F and the subtrees t , t , . . . , t ∈ T ( F ) . We have todeal with overlapping occurrences of the digram ( f, , f ) . Example 4.
Let t ∈ T ( F ) be the tree depicted in Fig. 6 and let α = ( f, , f ) . Hence, { ε, , } ⊆ OCC t ( α ) , where on the one hand ε and and on the other hand and are overlapping occur-rences of α .Let α ∈ Π and t ∈ T ( F ∪ N , Y ) . Let σ ⊆ OCC t ( α ) be a non-overlapping set. Furthermore, letus assume that σ ∪ { v } is overlapping for all v ∈ OCC t ( α ) \ σ , i.e. , σ is maximal with respectto inclusion among non-overlapping subsets. Then σ is not necessarily maximal with respect tocardinality. Example 5.
Consider the tree t ∈ T ( F ) which is depicted in Fig. 6. Let α = ( f, , f ) ∈ Π . Wehave OCC t ( α ) = { ε, , } . Let σ = { } ⊆ OCC t ( α ) . The set σ is non-overlapping and σ ∪ { v } is overlapping for all v ∈ OCC t ( α ) \ σ . However, σ is not maximal with respect to cardinality.Consider the non-overlapping subset σ (cid:48) = { ε, } ⊆ OCC t ( α ) . We have | σ | < | σ (cid:48) | .Example 5 shows us that we cannot choose an arbitrary subset σ ⊆ OCC t ( α ) which is non-overlapping and maximal with respect to inclusion to obtain a set which is maximal with respectto cardinality. Let us also point out that the set OCC t ( α ) may contain more than one maximal(with respect to cardinality) non-overlapping subset. Example 6.
Consider the tree f ( f ( f ( a ))) over the ranked alphabet F . The sets { ε } and { } areboth maximal with respect to cardinality.The algorithm retrieve-occurrences( t, α ) from Fig. 8 computes one non-overlappingsubset of OCC t ( α ) which we denote by occ t ( α ) . Lemma 1 ascertains that this subset is maximalwith respect to cardinality. Using the function next-in-postorder listed in Fig. 7 we tra-verse the tree t in postorder. We begin by passing the parameters t and ε and obtain the first node u ∈ dom t of t in postorder. The second node in post order is obtained by passing the parameters t and u . This step can be repeated to traverse the whole tree t in postorder. For every node v which is encountered during the postorder traversal it is checked if v is an occurrence of α andif it is non-overlapping with all occurrences already contained in the current set occ t ( α ) . If bothconditions are fulfilled, the node v is added to occ t ( α ) .10 FUNCTION next-in-postorder( t, v ) // let t = ( dom t , λ t ) if ( v = ε ) then v := walk-down( t , v ); else i := index ( v ) + 1 ; v := parent ( v ) ; if ( rank ( λ t ( v )) ≥ i ) then v := vi ; v := walk-down( t , v ); endif endif return v ; ENDFUNC
FUNCTION walk-down( t , v ) // let t = ( dom t , λ t ) while ( true ) do if ( rank ( λ t ( v )) > ) then v := v ; else return v ; endif endwhile ENDFUNC
Fig. 7:
The algorithm which is used to traverse a tree in postorder.1
FUNCTION retrieve-occurrences( t, α ) // let α = ( a, i, b ) occ t ( α ) := ∅ ; v := ε ; while (true) do v := next-in-postorder( t , v ); if ( v ∈ OCC t ( α ) ∧ vi / ∈ occ t ( α ) ) then occ t ( α ) := occ t ( α ) ∪ { v } endif if ( v = ε ) then return occ t ( α ) ; endif endwhile ENDFUNC
Fig. 8:
The function retrieve-occurrences which is used to construct the set occ t ( α ) for a digram α ∈ Π and a tree t ∈ T ( F ∪ N ) . Now, let us assume that we have constructed the set occ t ( α ) ⊆ OCC t ( α ) using the function retrieve-occurrences . If a (cid:54) = b in α = ( a, i, b ) we have occ t ( α ) = OCC t ( α ) . In thefollowing, we show that the subset occ t ( α ) ⊆ OCC t ( α ) is maximal with respect to cardinality. Lemma 1.
Let α ∈ Π and t ∈ T ( F ∪ N , Y ) . Let σ ⊆ OCC t ( α ) be non-overlapping andmaximal with respect to cardinality. Then the equation | occ t ( α ) | = | σ | holds.Proof. In the following we briefly write ”maximal” for ”maximal with respect to cardinality”.Let α = ( a, i, b ) ∈ Π , t ∈ T ( F ∪ N , Y ) and σ as above. The graph ( V, E ) with V = OCC t ( α ) ∪ { vi | v ∈ OCC t ( α ) } and E = { ( v, vi ) | v ∈ OCC t ( α ) } is a disjoint union of paths. Maximal non-overlapping subsets of OCC t ( α ) exactly correspondto maximum matchings in ( V, E ) . Clearly, a path with an odd number of edges has a uniquemaximum matching, whereas a path with an even number of edges has two maximum matchings:one containing the first edge (in direction from the root) and one containing the last edge on the11 fffa a fa a ffa a fa a fffa a fa a ffa a fa a Fig. 9.1:
The tree t ∈ T ( F ) . AfAfa aa a Afa aa a Afa aa a Afa aa a
Fig. 9.2:
The tree t [ α/A ] . path. Intuitively, the algorithm from Fig. 8 finds the maximum matching in ( V, E ) which containsfor every path with an even number of edges the last edge in direction from the root. (cid:117)(cid:116) Let t ∈ T ( F ∪ N , Y ) , α = ( a, i, b ) ∈ Π and A ∈ N par ( α ) . By t [ α/A ] we denote the tree whichis obtained by replacing all occurrences from occ t ( α ) in the tree t by the nonterminal A (inparallel). More precisely, we replace every subtree pat ( α )[ y /t , y /t , . . . , y par ( α ) /t par ( α ) ] ,where t , t , . . . , t par ( α ) ∈ T ( F ∪ N , Y ) , which is rooted at an occurrence v ∈ occ t ( α ) by a newsubtree A ( t , t , . . . , t par ( α ) ) . Example 7.
Consider the tree t ∈ T ( F ) which is depicted in Fig. 9.1. We have occ t ( α ) = { ε, , , , } , where α = ( f, , f ) . By replacing the digram α in t by a nonterminal A ∈ N we obtain the tree t [ α/A ] which is depicted in Fig. 9.2.For t ∈ T ( F ∪ N ) and m ∈ N ∪ {∞} , we define max m ( t ) = (cid:40) α ∈ Π m if occ t ( α ) (cid:54) = ∅ and ∀ β ∈ Π m : | occ t ( β ) | ≤ | occ t ( α ) | undefined if ∀ α ∈ Π m : occ t ( α ) = ∅ The function max m : T ( F ∪N ) → Π associates with every tree t ∈ T ( F ∪N ) a digram α ∈ Π m which occurs in t most frequently (with respect to all digrams from Π m ). If there are multiplemost frequent digrams, we can choose any of them. In contrast, we have max m ( t ) = undefined ifthere is no most frequent digram. If m = ∞ there is no most frequent pair if and only if the tree t consists of exactly one node. Now let us assume that m (cid:54) = ∞ . We have max m ( t ) = undefined if and only if t consists of exactly one node or if for all digrams α occurring in t it holds that α / ∈ Π m .In the sequel, if we do not specify the maximal rank allowed for a nonterminal, we alwaysassume that m = ∞ . For convenience we write max ( t ) instead of max ∞ ( t ) , i.e. , we omit thesymbol ∞ . 12 .2 Replacement of Digrams In this section we introduce the first step of our Re-pair for Trees algorithm, namely, the replace-ment step . Let m ∈ N ∪ {∞} be the maximal rank allowed for a nonterminal and let the tree t = ( dom t , λ t ) ∈ T ( F ) be the input of our algorithm.We describe a run of the Re-pair for Trees algorithm by a sequence of h + 1 linear SLCF treegrammars G , G , . . . , G h , where h ∈ N . For every i ∈ { , , . . . , h } we have G i = ( N i , P i , S i ) , ( S i → t i ) ∈ P i , α i = max m ( t i ) and val ( G i ) = t . The grammar G contains solely the startproduction ( S → t ) , where t = t . We obtain the grammar G i +1 by replacing the digram α i in the right-hand side of G i ’s start production t i by a new nonterminal A i +1 ∈ N par ( α i ) \ N i ( ≤ i ≤ h − ). We set N i +1 = ( N i \ { S i } ) ∪ { S i +1 , A i +1 } and P i +1 = ( P i \ { ( S i → t i ) } ) ∪ { (cid:0) A i +1 → pat ( α i ) (cid:1) , ( S i +1 → t i +1 ) } ,where t i +1 = t i [ α i /A i +1 ] .The computation stops if there is no digram α ∈ Π m occurring at least twice in the start pro-duction of the current grammar, i.e. , either the equation | occ t h ( max m ( t h )) | = 1 or the equation max m ( t h ) = undefined holds. In contrast, for all ≤ i ≤ h − we have | occ t i ( max m ( t i )) | > .Note that the linear SLCF tree grammar G h is almost in Chomsky normal form (CNF) as it isdefined in [LMSS09]. By appropriately transforming the right-hand side of S h (as it is describedin the proof of Proposition 5 of [LMSS09]) and introducing a production with right-hand side a ( y , . . . , y n ) for every terminal a ∈ F n ( n ∈ N ) we would obtain a linear SLCF tree grammarwhich perfectly meets the requirements of the CNF.The linear SLCF tree grammar G h can only be considered an intermediate result, since itpotentially consists of productions which do not contribute to a compact representation of theinput tree t . Therefore, we get rid of unprofitable productions by eliminating them during the so-called pruning step . The latter, which is described in the next section, is executed directly afterthe replacement step. Let G = ( N, P, S ) be a linear SLCF tree grammar. 
We eliminate a production ( A → t ) from P as follows:(1) For every reference ( t (cid:48) , v ) ∈ ref G ( A ) we replace the subtree A ( t , t , . . . , t n ) rooted at v ∈ dom t (cid:48) by the tree t [ y /t , y /t , . . . , y n /t n ] ,where t , . . . , t n ∈ T ( F ∪ N , Y ) and n = rank ( A ) .(2) We update the set of productions by setting P := P \ { ( A → t ) } . Regarding our implementation of the Re-pair for Trees algorithm which is described in Sect. 4, m is a parameter which canbe specified by the user. G = ( N, P, S ) be the linear SLCF tree grammar generated in the replacement step of ouralgorithm, i.e. , we have G = G h . Let n = | N | and let ω = B , B , . . . , B n − , B n be a sequence of all nonterminals of N in hierarchical order, i.e. , the following conditions hold:(i) B n = S (ii) ∀ ≤ i < j ≤ n : B j (cid:54) (cid:32) ∗G B i Let ( B i → t i ) , ( B j → t j ) ∈ P , where ≤ i, j < n and i (cid:54) = j . If we eliminate B i this may havean impact on the value of sav G ( B j ) from (1). We need to differentiate between two cases:(1) B j → B i t j If B i occurs in t j , i.e. , B i (cid:32) G B j , then | t j | is increased because of the elimination of B i . Atthe same time, sav G ( B j ) goes up if we have | ref G ( B j ) | > . The increase of | t j | is due to thefact that we can assume that the inequality |{ v ∈ dom t i | λ t i ( v ) / ∈ Y}| ≥ holds. Everyproduction which was introduced in the replacement step represents a digram and thereforeconsists of at least two nodes labeled by the parent and child symbol, respectively, of thisdigram.(2) B i → B j t i If B j occurs in t i , i.e. , B j (cid:32) G B i , then | ref G ( B j ) | and therefore sav G ( B j ) are possiblyincreased by eliminating B i . In fact, both values go up if | ref G ( B i ) | > . First phase
In the first phase of the pruning step, we eliminate every production ( A → t ) ∈ P with | ref G ( A ) | = 1 . That way we achieve not only a possible reduction of the size of G (becausewe have sav G ( A ) = − rank ( A ) for every A ∈ N referenced only once) but we also decrementthe number of nonterminals | N | each time we eliminate such a production. Second phase
In the second phase of the pruning step we eliminate all remaining inefficientproductions. We consider a production ( A → t ) ∈ P as inefficient if sav G ( A ) ≤ . Unfortunately,this time we have to deal with a rather complex optimization problem. In contrast to the firstphase, the decision whether to eliminate a production ( A → t ) ∈ P or not does now dependon the value sav G ( A ) . However, the latter may be increased by eliminating other nonterminals(see the above case distinction). This forces us to use a heuristic to decide what productions toremove next from the grammar. In fact, after completing the first phase, we cycle through theremaining productions in their reverse hierarchical order. For every ( A → t ) ∈ P we check if sav G ( A ) ≤ . If this proves to be true, we eliminate ( A → t ) . That way |G| and | N | are possiblyfurther reduced.The following example shows that the size of the final grammar generated by the Re-pairfor Trees algorithm may depend on the order in which possible inefficient productions are elim-inated. 14 xample 8. Consider the linear SLCF tree grammar G = ( N, P, S ) , where N = { S, A, B } and P is the following set of productions: S → f ( A ( a, a ) , B ( A ( a, a ))) A ( y , y ) → f ( B ( y ) , y ) B ( y ) → f ( y , a ) Let us assume that the grammar G was generated by the replacement step of our algorithm andthat we now want to remove all inefficient productions. We have sav G ( A ) = − and sav G ( B ) =0 , i.e. , the productions with left-hand sides A and B do not contribute to a small representationof the input tree val ( G ) . Let us consider the following two cases:(1) If we eliminate the production with left-hand side A , we obtain the grammar G = ( N , P , S ) ,where N = { S , B } and P is the following set of productions: S → f ( f ( B ( a ) , a ) , B ( f ( B ( a ) , a ))) B ( y ) → f ( y , a ) We have |G | = 11 and sav G ( B ) = 1 , i.e. , the production with left-hand side B is notconsidered inefficient.(2) In contrast, the elimination of the production with left-hand side B yields the linear SLCFtree grammar G = ( N , P , S ) , where N = { S , A } and P is the following set of pro-ductions: S → f ( A ( a, a ) , f ( A ( a, a ) , a )) A ( y , y ) → f ( f ( y , a ) , y ) We also eliminate the production with left-hand side A since we have sav G ( A ) = 0 . Thisleads to an updated grammar G = ( N , P , S ) , where N = { S } and P contains solelythe production S → f ( f ( f ( a, a ) , a ) , f ( f ( f ( a, a ) , a ) , a )) .We have |G | = 12 .This case distinction shows that the order in which inefficient productions are eliminated hasan influence on the size of the final grammar (since |G | < |G | ). Let us consider the sequence A, B, S which is the only way to enumerate the nonterminals from N in hierarchical order.Due the fact that the above described heuristic cycles through the productions in their reversehierarchical order to eliminate inefficient productions we would obtain the larger grammar G ifwe would execute the pruning step with G as the input grammar.Given the above example one might expect better compression results if the inefficient produc-tions are eliminated in the order of their sav G -values, i.e. , if we would proceed as follows: as longas their is a production whose left-hand side has a sav G -value smaller or equal to we remove aproduction whose left-hand side has the smallest occurring sav G -value. 
However, our investiga-tions showed that this approach leads to unappealing final grammars — at least for our set of test15 igram α | occ t ( α ) | ( title , , isbn ) 5( author , , title ) 5( book , , author )4( book , , book ) 2( book , , book ) 1( book , , author )1( books , , book ) 1 Table 2.1:
All digrams en-countered in the input tree t and their number of non-overlapping occurrences. digram α | occ t ( α ) | ( author , , A ) 5( book , , author )4( book , , book ) 2( book , , book ) 1( book , , author )1( books , , book ) 1 Table 2.2:
All digramsencountered in the tree t and their number of non-overlapping occurrences. digram α | occ t ( α ) | ( book , , A ) 4( book , , book ) 2( book , , book ) 1( book , , A ) 1( books , , book ) 1 Table 2.3:
All digramsencountered in the tree t and their number of non-overlapping occurrences. input trees. The grammars generated by this approach exhibit nearly the same number of edgesbut much more nonterminals (about 50% more) compared to the grammars obtained using theabove heuristic.Note that it is not possible to already detect digrams leading to inefficient productions during thereplacement step. For instance, we would not act wisely if we would ignore digrams occurringonly twice and exhibiting a large number of parameters a priori. Example 9.
Imagine an input tree t ∈ T ( F ) comprising two instances of a large tree pattern t (cid:48) ∈ T ( F , Y ) . Let λ t (cid:48) ( v ) (cid:54) = λ t (cid:48) ( u ) for all v, u ∈ dom t (cid:48) , u (cid:54) = v . Furthermore, let us assume thatall symbols in the tree pattern t (cid:48) are not occurring outside of this pattern. For every digram α occurring in the tree pattern t (cid:48) (whose replacement may firstly lead to a production with a largenumber of parameters) we would have | occ t ( α ) | = 2 . It becomes clear that this great redundancyin the input tree t , which can be represented by a production with right-hand side t (cid:48) , wouldnot be detected if we would not carry out these initially anything but efficient seeming digramreplacements. Let the tree depicted in Fig. 5 be our input tree t and let there be no restrictions on the maximalrank allowed for a nonterminal. We set G = ( N , P , S ) , where N = { S } and P solelycontains the production ( S → t ) . Table 2.1 shows every digram α encountered in t along withits number of non-overlapping occurrences | occ t ( α ) | . Furthermore, this table tells us that the twodigrams ( title , , isbn ) and ( author , , title ) are the most frequent digrams occuring in t .We decide to replace the former digram and therefore have max ( t ) = ( title , , isbn ) =: α .Now, in the first iteration of our computation, we generate a new linear SLCF tree grammar G = ( N , P , S ) as follows. We introduce a new nonterminal A ∈ N and set N = { S , A } .After that, we introduce the new production (cid:0) A → pat ( α ) (cid:1) , where pat ( α ) = title ( isbn ) .Finally, we set P = { (cid:0) S → t (cid:1) , (cid:0) A → pat ( α ) (cid:1) } , where we have t = t [ α /A ] . The tree t is depicted in Fig. 10.In the second iteration, during which we generate the grammar G = ( N , P , S ) , we have max ( t ) = ( author , , A ) =: α as it can be seen in Table 2.2. Again, we introduce a new16 ooks book author A · · · book author A times book author A Fig. 10:
Tree t which evolved from the input tree t in the first iteration of our computation. books book A · · · book A times book A Fig. 11:
Tree t which evolved from the tree t in the second iteration of our computation. nonterminal A ∈ N with right-hand side pat ( α ) , set N = { S , A , A } and set P = { ( S → t ) , (cid:0) A → pat ( α ) (cid:1) , (cid:0) A → pat ( α ) (cid:1) } , where t = t [ α /A ] (see Fig. 11). We have max ( t ) =( book , , A ) =: α ( cf. Table 2.3) in the third iteration of our algorithm. This time, we need tointroduce a new nonterminal A ∈ N , i.e. , a nonterminal with one parameter, with right-handside pat ( α ) = book ( A , y ) . We obtain the grammar G = ( N , P , S ) , where N = { S , A , A , A } , P = ( P \ { ( S → t ) } ) ∪ { ( S → t ) , ( A → pat ( α )) } and t = books ( A ( A ( A ( A ( book ( A )))))) by replacing the occurrences of α .In the fourth and last iteration the digram ( A , , A ) is replaced by a new nonterminal A ∈N . Therefore, we obtain the grammar G = ( N , P , S ) with edges and nonterminals,where we have N = { S , A , A , A , A } and P is the following set of productions: S → books ( A ( A ( book ( A )))) A ( y ) → A ( A ( y )) A ( y ) → book ( A , y ) A → author ( A ) A → title ( isbn ) Finally, in the pruning step, we begin with merging the right-hand side of A with the right-hand side of A since | ref G ( A ) | = 1 , i.e. , it is only referenced once. This yields the updatedproduction (cid:0) A → author ( title ( isbn )) (cid:1) . Furthermore, we roll back the replacement of thedigram ( A , , A ) due to the fact that it does not contribute to the reduction of the total number ofedges. Although the production with left-hand side A is referenced twice in the right-hand sides17f G and removes redundancy this gain is neutralized by the necessary edge to the parameternode. This is indicated by the sav G value of A , see (1): sav G ( A ) = | ref G ( A ) | · ( | A ( A ( y )) | − rank ( A )) − | A ( A ( y )) | = 2 · (2 − − With these adjustments we obtain the linear SLCF tree grammar G = ( N, P, S ) , where N = { S , A , A } and P is the following set of productions: S → books ( A ( A ( A ( A ( book ( A )))))) A ( y ) → book ( A , y ) A → author ( title ( isbn )) Compared to the grammar G it has the same number of edges (namely 10) but nearly half asmuch nonterminals only. It is very unlikely to be confronted with an XML document tree which, in the binary tree model,is represented by a perfect binary tree . Nevertheless we want to investigate the compression per-formance of our algorithm on this kind of trees since it is an interesting aspect from a theoreticalpoint of view. Last but not least our undertaking is justified by the fact that the actual Re-pair forTrees algorithm is not restricted to applications processing XML files but can be used in otherapplications as well. The latter, in turn, may exhibit ranked trees similar to full binary trees.Let t ∈ T ( F ) be a sufficiently large perfect binary tree of which each inner node is la-beled by a terminal f ∈ F and each leaf is labeled by a terminal a ∈ F . A run of Re-pairfor Trees on t consists of · ( d − iterations folding the input tree beginning at its leaves,where d = depth ( t ) . Thus, in the first two iterations, the digrams formed by the leaf nodesand their parents are replaced. We obtain the productions A ( y ) → f ( y , a ) and A → A ( a ) each occurring d − times. Now, we undertake further digram replacements in a bottom up fash-ion. 
In the (2 i − -th and i -th iteration we replace two digrams resulting in the productions A i − ( y ) → f ( y , A i − ) and A i → A i − ( A i − ) , respectively, where ≤ i ≤ d − .The production with left-hand side A k − occurs only once for every ≤ k ≤ d − . There-fore, in the pruning step, for every ≤ k ≤ d − the production with left-hand side A k − iseliminated by merging its right-hand side with the right-hand side of the production with left-hand side A k . In particular, the production with left-hand side A is merged with the productionfor A resulting in a production A → f ( a, a ) .Finally, we obtain a linear SLCF tree grammar with d nonterminals — including the left-handside of the start production S → f ( A d − , A d − ) — and a total of · d edges. Note that eventhough some of the intermediate productions exhibit parameters the final grammar consists onlyof nonterminals of rank . Thus, the generated grammar is a DAG and in this particular case theminimal DAG of the input tree. A perfect binary tree is a binary tree in which every node is either of rank or and all leaves are at the same level ( i.e. , thepaths to the root are of the same length). In contrast, a full binary tree has no restrictions on the level of the leaves, i.e. , theonly requirement is that every node is either of rank or . fffa a fa a ffa a fa a fffa a fa a ffa a fa a Fig. 12.1:
Perfect binary tree t ∈ T ( F ) of height4 A ( y ) → f ( y , a ) A → A ( a ) A ( y ) → f ( y , A ) A → A ( A ) A ( y ) → f ( y , A ) A → A ( A ) Fig. 12.2:
Productions be-fore the pruning step with-out the start production. A → f ( a, a ) A → f ( A , A ) A → f ( A , A ) S → f ( A , A ) Fig. 12.3:
Productions af-ter the pruning step.
Example 10.
Let t ∈ T ( F ) be the perfect binary tree from Fig. 12.1 with 30 edges and depth ( t ) =4 . A run of Re-pair for Trees initially generates the 6 productions listed in Fig. 12.2. Afterthe pruning step we finally obtain the linear SLCF tree grammar G = ( N, P, S ) , where N = { A , A , A , S } and the set of productions P consists of the productions from Fig. 12.3. The sizeof G is |G| = 8 . It seems natural to assume that, in general, trees can be compressed best by the Re-pair for Treesalgorithm if there are no restrictions on the maximal rank of a nonterminal. However, it turnsout that there are (not so uncommon) types of trees for which the opposite is true. Firstly, in thissection, we will construct a set of trees whose compressibility is best if there are no restrictionson the maximal rank of a nonterminal. After that, in the succeeding section, we will present a setof trees whose compressibility is best when restricting the maximal rank to .Let us consider the infinite set M = { t , t , t , . . . } ⊆ T ( F ) of trees, where for all i ∈ N > the tree t i has the following properties: – The tree t i is a perfect binary tree of depth i . – Each inner node of t i is labeled by the terminal f ∈ F . – Each leaf of t i is labeled by a unique terminal from F , i.e. , there do not exist two differentleaves which are labeled by the same symbol. Example 11.
Figure 13.1 shows a simplified depiction of the tree t ∈ M . The inner nodeslabeled by the symbol f ∈ F are represented by a circle filled with paint. In contrast, the leaves,of which each is labeled by a unique symbol from F , are depicted by a circle which is not filledwith paint.The tree t is compressed by a run of our algorithm as follows. The digrams ( f, , f ) and ( f, , f ) occur equally often in t . It makes no difference to the size of the final grammar whetherwe replace the former or the latter. Let us replace the digram ( f, , f ) (whose occurrences arepainted in green in Fig. 13.1) by a nonterminal A ∈ N with right-hand side f ( y , f ( y , y )) . Weobtain the tree of the form shown in Fig. 13.2. After that, the digram ( A , , f ) , which occurs the19 ig. 13.1: The tree t ∈ M . Fig. 13.2:
The tree t ∈ M after replacing the digram ( f, , f ) . Fig. 13.3:
The tree t ∈ M after the second iteration, i.e. , after replacing the digrams ( f, , f ) and ( A , , f ) . Fig. 13.4:
The tree which remains after replacing the digram ( A , , A ) in the tree from Fig. 13.3. Fig. 13.5:
The tree t ∈ M after iterations of our algorithm. We obtained a -ary tree whose inner nodes are labeled bythe nonterminal A . i − B i − y y y r B i − y r +1 y r +2 y · r B i − y ( r − · r +1 y ( r − · r +2 y r Fig. 14:
Right-hand side s i of the nonterminal B i , where r = rank ( B i − ) and i > . same number of times as ( f, , f ) did, is replaced by the nonterminal A ∈ N with right-handside A (cid:0) f ( y , y ) , y , y (cid:1) . The occurrences of ( A , , f ) are marked with green paint in Fig. 13.2.The right-hand side of the nonterminal A is merged with the right-hand side of A during thepruning step since A is only referenced once. This yields the production with left-hand side A and right-hand side f (cid:0) f ( y , y ) , f ( y , y ) (cid:1) .After the replacement of the above two digrams the right-hand side of the start production isa -ary tree of depth whose inner nodes are labeled by A (see Fig. 13.3). Now, the digrams ( A , , A ) , ( A , , A ) , ( A , , A ) , ( A , , A ) occur equally often. Again, the order of the digram replacements makes no difference to the finalgrammar. Assuming that at first we replace the digram ( A , , A ) , which is marked with greenpaint in Fig. 13.3, by a new nonterminal A , we obtain the tree shown in Fig. 13.4. After that, thedigrams ( A , , A ) , ( A , , A ) and ( A , , A ) are replaced in three additional iterations. Theabove four digram replacements result in a new production A ( y , . . . , y ) → A (cid:0) A ( y , . . . , y ) , . . . , A ( y , . . . , y ) (cid:1) after pruning the grammar. The remaining tree is a 16-ary tree of depth (of the form depictedin Fig. 13.5) whose inner nodes are labeled by the nonterminal A . In this tree there is no digramoccurring more than once. Therefore, the execution of our algorithm stops.Now, we want to analyze the behavior of Re-pair for Trees on a tree from M in general. Let x ∈ N > and let it : N > → N > be the following function: it ( x ) = x − (cid:88) i =0 i Let B , B , B , . . . be a sequence of nonterminals where for all i > the following conditionsare fulfilled: – rank ( B i ) = 2 i – s i ∈ T ( F ∪ N , Y ) is the right-hand side of B i – If i = 1 , we have s i = f ( f ( y , y ) , f ( y , y )) and if i > , the tree s i is of the form shown inFig. 14, where r = rank ( B i − ) = 2 i − . 21 n − y y y r − h B n − y r − h +1 y r − h +2 y r − h B n − y h ( r − y h ( r − y r + h ( r − ( r − h ) m any h many r many r many Fig. 15:
Right-hand side of the nonterminal C , where r = rank ( B n − ) . Regarding the nonterminals A and A from Example 11, we have B = A and B = A ,respectively. Let i ∈ { , , . . . , k } . The following two equations hold: rank ( B i ) = 2 i = 2 i − · i − = rank ( B i − ) (2) | s i | = rank ( B i ) + rank ( B i − ) (3)For convenience, we define rank ( B ) = rank ( f ) = 2 .Now assume that we have an unlimited maximal rank allowed for a nonterminal. After it ( n ) iterations on t n +1 ∈ M we have obtained the nonterminals B , B , . . . , B n . The right-hand sideof the start nonterminal is a rank ( B n ) -ary tree of height (see also Example 11, where n = 2 ).At this point, no further replacements are carried out. For each of the generated nonterminals B , . . . , B n we have | ref G ( B i ) | = rank ( B i ) + 1 , (4)where i ∈ { , , . . . , n } ( cf. Fig. 14). Hence, we have sav G ( B i ) (1) = | ref G ( B i ) | · rank ( B i − ) − | s i | (3) = | ref G ( B i ) | · rank ( B i − ) − rank ( B i ) − rank ( B i − ) (4) = rank ( B i ) · rank ( B i − ) − rank ( B i ) (2) = rank ( B i − ) − rank ( B i − ) > since rank ( B i − ) ≥ rank ( B ) = 2 . Therefore, none of the nonterminals B , . . . , B n will beeliminated in the pruning step.Now assume that the maximal rank is m ∈ N , i.e. , we have m < ∞ . Choose the smallest n ∈ N such that n > m . (5)Thus, B n is the first nonterminal in the sequence B , B , . . . with a rank bigger than m . Letus consider a run of Re-pair for Trees on a tree t j ∈ M with j ≥ n + 1 . Then, as above,the nonterminals B , . . . , B n − will be obtained after it ( n − iterations (if we would prunethe corresponding grammar by now). At this point, the right-hand side of the start productionis a n − -ary tree of height j / n − ≥ , where all inner nodes are labeled by the nonterminal22 ig. 16: Right-hand side of the current start production after replacing the digram ( A , , A ) . B n − . Now, we can carry out h additional digram replacements leading to the nonterminals C , C , . . . , C h ∈ N , where h = max { l ∈ N | r + l · ( r − ≤ m } (6)and r = rank ( B n − ) = 2 n − . We claim that r = rank ( B n − ) > h (7)holds. To see this, let us assume that r = rank ( B n − ) ≤ h . We have m (6) ≥ r + h · ( r − ≥ r + r · ( r −
1) = r = 2 n .However, this contradicts (5).In case h > , we can argue as follows: After the pruning step, the nonterminals C , C , . . . , C h form one nonterminal C ∈ N with rank ( C ) = h · r + r − h = r + h · ( r − (see Fig. 15). It occursat least n + 1 = r + 1 many times according to (4) (the nonterminal C occurs as often as B n does after it ( n ) iterations on t j in the unlimited case). Each occurrence of C reduces the size ofthe corresponding grammar by h edges and the right-hand side of C consists of r + h · r edges(see Fig. 15). Now, let us consider the sav -value of C (assuming that G is the current grammarafter it ( n ) + h iterations): sav G ( C ) (1) = | ref G ( C ) | · h − ( r + h · r ) ≥ ( r + 1) · h − h · r − r = ( r − r + 1) · h − r ≥ r − r + 1= ( r − Thus, we have sav G ( C ) > , i.e. , the nonterminal C is not eliminated during the pruning step. Example 12.
Let us assume that the maximal rank for a nonterminal is restricted to in Exam-ple 11. In this case we are able to undertake exactly one additional digram replacement in the treefrom Fig. 13.4 resulting in a new nonterminal A ∈ N . If we replace the digram ( A , , A ) ,we obtain the tree shown in Fig. 16. We have n = 2 , h = 2 and C = A . After the pruning step,the right-hand side of A is of the form A ( y , y , A ( y , y , y , y ) , A ( y , y , y , y )) .23e can further state that the nonterminal B n − is not eliminated since it occurs h + 1 times inthe right-hand side of C (see Fig. 15) and ( r − h ) · | ref G ( C ) | ≥ ( r − h ) · ( r + 1) times inthe right-hand side of the current start production (below each occurrence of C there are r − h occurrences of B n − and C occurs at least r + 1 times). Therefore, we have | ref G ( B n − ) | ≥ h + 1 + ( r − h ) · ( r + 1)= h + 1 + r − hr + r − h = r − hr + r + 1 .Because of (7), the inequality | ref G ( B n − ) | > r +1 holds. As shown before for the unlimited rank,in this case B n − has a sav -value bigger than and therefore the nonterminals B , B , . . . , B n − are not eliminated.Let H m be the grammar which is obtained after it ( n −
1) + h iterations on the tree t j whenrestricting the maximal rank to m and let H ∞ be the current grammar after it ( n ) iterations on t j when an unlimited rank is allowed. We can conclude that |H m | > |H ∞ | holds — no matterwhether we have h > or h = 0 — because of the following two facts:(1) Each occurrence of B n saves rank ( B n − ) edges (see Fig. 14) and therefore according to(7) more than an occurrence of C does. The nonterminals B n and C occur equally often.However, C is only existent if h > .(2) The nonterminals B , B , . . . , B n − (which are existent in both grammars, H m and H ∞ ) andthe nonterminals B n and C are not eliminated during the pruning step.Let G m ( G ∞ ) be the final grammar which is generated by a run of Re-pair for Trees on thetree t j when restricting the maximal rank of a nonterminal to m (not restricting the maximalrank). We have G m = H m and |G ∞ | ≤ |H ∞ | . The latter holds because with every additionaldigram replacement at least one edge is absorbed and because during the pruning step onlynonterminals with a sav -value smaller than or equal to are eliminated. Therefore |G m | > |G ∞ | holds. Thus, we have shown that, in general, the trees from M can be compressed best if thereare no restrictions on the maximal rank allowed for a nonterminal. Example 13.
Table 3 shows a comparison of the grammars generated by different runs of ouralgorithm on the trees t , t and t from M . By G ( G ∞ ) we denote the final grammar which isgenerated when restricting the maximal rank to (not restricting the maximal rank). In the preceding section we investigated a set of trees whose compressibility was best if wedid not restrict the maximal rank of a nonterminal. Now, we want to construct a set of treeswhich behaves contrarily, i.e. , we construct trees which can be compressed best if we limit themaximal rank of a nonterminal to . In order to make it easier to quickly understand the followingdefinition we want to refer the reader to Fig. 17.1 which shows one of the trees we define in thesequel. 24 ree t i depth ( t i ) | t i | |G | |G ∞ | t t t
16 131070 87386 66090
Table 3:
Comparison of the sizes of the final grammars.
First of all, let us define a labeling function l : N → F , where l ( i ) = a if i ≡ b if i ≡ c if i ≡ d if i ≡ e if i ≡ and i ∈ N . Now, we define for all n ∈ N the tree s n = ( dom s n , λ s n ) ∈ T ( F ) , where dom s n = (cid:32) n (cid:91) i =0 [2] i (cid:33) ∪ (cid:32) n − (cid:91) i =0 [2] i [1] (cid:33) and λ s n ( v ) = f ∈ F if v = [2] i , ≤ i < n l ( i ) ∈ F if v = [2] i [1] , ≤ i < n l (2 n ) ∈ F if v = [2] n .Let us define U = { s n | n ∈ N , n ≥ } . In the following we will show that for every run ofRe-pair for Trees on a tree s ∈ U we have |G | < |G ∞ | , where G is the grammar generated whenallowing a maximal rank of for a nonterminal and G ∞ is the resulting grammar when there isno restriction on the maximal rank.Let us consider a run G ∞ , G ∞ , . . . , G ∞ n − of the Re-pair for Trees algorithm on the tree s n with no restrictions on the maximal rank of a nonterminal, where G ∞ i = ( N i , P i , S i ) , ( S i → t i ) is the start production of G ∞ i and i ∈ { , , . . . , n − } . In the first iteration of our computationthe digram ( f, , f ) is the most frequent digram, i.e. , max ( t ) = ( f, , f ) . This is because of | occ s n (cid:0) ( f, , f ) (cid:1) | = 2 n − whereas for every x ∈ { a, b, c, d, e } the inequality | occ s n (cid:0) ( f, , x ) (cid:1) | ≤ (cid:100) n / (cid:101) holds. Therefore, we replace the digram ( f, , f ) by a new nonterminal A and obtain G ∞ . Inevery subsequent iteration i we replace max ( t i − ) = ( A i − , i − + 1 , A i − ) by a new nonterminal A i , where i ∈ { , , . . . , n − } . For every ≤ i ≤ n − the right-hand side of the start productionof the grammar G ∞ i is given by the tree t i = ( dom t i , λ t i ) , where dom t i = n − i (cid:91) j =0 (cid:2) i + 1 (cid:3) j ∪ i (cid:91) k =1 2 n − i − (cid:91) j =0 (cid:2)(cid:0) i + 1 (cid:1)(cid:3) j [ k ] λ t i ( v ) = A i ∈ N i +1 if v = [2 i + 1] j with ≤ j ≤ n − i − , l ( j · i + k − if v = [2 i + 1] j [ k ] with ≤ j ≤ n − i − and ≤ k ≤ i , l (2 n ) if v = [2 i + 1] n − i . Example 14.
The Figs. 17.1, 17.2, 17.3 and 17.4 show the right-hand sides of the start productions of the grammars G_0^∞, G_1^∞, G_2^∞ and G_3^∞ generated by a run of our algorithm on the tree s_4.

In order to argue that we have max(t_i) = (A_i, 2^i + 1, A_i) =: α_i for every 0 < i < n − 1, we investigate the number of occurrences of all digrams occurring in the right-hand side of G_i^∞'s start production. Firstly, it is easy to verify that |occ_{t_i}(α_i)| = 2^{n−i−1}. In contrast, for every 1 ≤ k ≤ 2^i and x ∈ {a, b, c, d, e} the inequality |occ_{t_i}((A_i, k, x))| ≤ ⌊2^{n−i}/5⌋ holds. This is because no power of 2 is divisible by 5, i.e., for every 1 ≤ k ≤ 2^i and every 0 ≤ j ≤ 2^{n−i} − 5 we have

    λ t_i([2^i+1]^j [k]) ≠ λ t_i([2^i+1]^{j+1} [k]) ≠ λ t_i([2^i+1]^{j+2} [k]) ≠ λ t_i([2^i+1]^{j+3} [k]) ≠ λ t_i([2^i+1]^{j+4} [k]).

Due to the fact that we do not replace digrams with child symbols a, b, c, d or e, the right-hand side of G_{n−1}^∞'s start production has to contain at least 2^n nodes labeled by these symbols, i.e., we can conclude that |G_{n−1}^∞| ≥ 2^n. Therefore the compression ratio cannot be better than 50%.

In contrast, a run G_0, G_1, ..., G_k of our algorithm on the tree s_n leads to a significantly better compression ratio when restricting the maximal rank of a nonterminal to 1, where k ∈ N_{>0}, G_i = (N_i, P_i, S_i), (S_i → t_i) is the start production of G_i and i ∈ {0, 1, ..., k}. In the first iteration we have max_1(t_0) ≠ (f, 2, f), since a replacement of (f, 2, f) would result in a nonterminal with a rank greater than 1. Therefore only the digrams (f, 1, a), (f, 1, b), (f, 1, c), (f, 1, d), (f, 1, e) and subsequent digrams can be replaced. It turns out that after the first nine iterations the pattern f(a, f(b, f(c, f(d, f(e, ...))))) is represented by a new nonterminal A_9 with rank(A_9) = 1. The actual order of the replacements within the first nine iterations depends on the method used to choose a most frequent digram when there are multiple most frequent digrams. Refer to Example 15 for one possible proceeding.

The right-hand side of G_9's start production is a degenerated tree mainly consisting of consecutive nonterminals A_9. The corresponding nodes, of which there are roughly 2^n/5, are then boiled down using approximately log_2(2^n/5) digram replacements. Therefore the number of total edges of the resulting grammar is in O(n), i.e., it is of logarithmic size (the size of the input tree s_n is 2^{n+1} + 1). Thus, we were able to construct a set of trees which exhibit a better compressibility when restricting the maximal rank of a nonterminal to 1.

Example 15.
Let us consider a run of Re-pair for Trees on the tree s_4 ∈ U when restricting the maximal rank of a nonterminal to 1 (see Fig. 18.1 for a depiction of s_4). Table 4 shows one of several possible orders of digram replacements, and Fig. 18 shows how the right-hand sides of the start productions evolve.

Fig. 17.1:
The tree s_4 ∈ U which is the right-hand side of G_0^∞'s start production.

Fig. 17.2:
The right-hand side of G_1^∞'s start production.

Fig. 17.3:
The right-hand side of G_2^∞'s start production.

Fig. 17.4:
The right-hand side of G_3^∞'s start production.

Fig. 18:
The right-hand sides for the nonterminals S_0, ..., S_9.

    Iteration   Replaced digram   New nonterminal   cf. Figure
    1           (f, 1, a)         A_1               18.2
    2           (f, 1, b)         A_2               18.3
    3           (A_1, 1, A_2)     A_3               18.4
    4           (f, 1, c)         A_4               18.5
    5           (f, 1, d)         A_5               18.6
    6           (A_4, 1, A_5)     A_6               18.7
    7           (A_3, 1, A_6)     A_7               18.8
    8           (f, 1, e)         A_8               18.9
    9           (A_7, 1, A_8)     A_9               18.10

Table 4:
A run of Re-pair for Trees on the tree s_4 ∈ U with a maximal nonterminal rank of 1.

We implemented a prototype of the Re-pair for Trees algorithm, named TreeRePair, running on XML documents. In the sequel, we demonstrate that it produces for any XML document tree in O(|t|) time a linear k-bounded SLCF tree grammar G, where k ∈ N is a constant, val(G) = t and t ∈ T(F) is the binary representation of the input tree.

There are several reasons to restrict the maximal rank to a constant k. One of them is that only this way we are able to obtain a linear-time implementation. Another reason is that for every k-bounded linear SLCF tree grammar G generated by TreeRePair it can be checked in polynomial time if a given tree automaton accepts val(G) (using a result from [LM06]). Last but not least, Sect. 3.7 on page 24 showed us that for flat XML documents leading to a right-leaning binary tree it is quite promising to restrict the maximal rank. The latter reason is also supported by our experiments with different maximal ranks on our test set of XML documents. On average, a maximal rank of 4 leads to the best compression performance (cf. Sect. 6.7 on page 64). Due to this fact TreeRePair generates 4-bounded linear SLCF tree grammars by default. This can be adjusted by using the -max_rank switch.

The XML document tree of the input file can be directly transformed into a binary F-labeled tree t = (dom_t, λ_t) ∈ T(F). The XML document is parsed by a SAX-like parser calling the functions start-element and end-element (see Figs. 20 and 21) of an object taking care of the tree construction. The latter is called tree constructor in the sequel.

The tree constructor uses three stacks to properly encode the SAX events. Firstly, the stack index_stack keeps track of the index of the current element read (analogously to our definition for ranked trees: if an element is the n-th child of its parent element, then the index of this element is n). The stack name_stack stores the element types of the elements in order to be able to update the labeling function λ_t within the end-element function. Together with the stack hierarchy_stack, which is used to maintain the current sequence of parents within t, enough information stands by to encode the SAX events. Refer to Sect. 2.4 on page 7 for an explanation of the binary tree model.

     1  FUNCTION start-element(name)
     2    if (hierarchy_stack is not empty) then
     3      i := index_stack.top() + 1;
     4      index_stack.pop();
     5      index_stack.push(i);
     6
     7      v := hierarchy_stack.top();
     8      if (i = 1) then
     9        u := v1;
    10      else
    11        u := v2;
    12      endif
    13      name_stack.push(name);
    14    else
    15      u := ε;
    16      λ_t(ε) := name^10;
    17    endif
    18
    19    dom_t := dom_t ∪ {u};
    20    index_stack.push(0);
    21    hierarchy_stack.push(u);
    22  ENDFUNC
Fig. 20:
The start-element function which is called for every start-tag.
To be more precise, if the parser encounters a start-tag, it extracts the element type of the element and passes it to the tree constructor by calling the function start-element. If it is the first call of start-element, we must be dealing with the root of the document. Thus, the stack hierarchy_stack is empty and the else-part beginning in line 15 is processed. First of all, the variable u is identified with ε (and later added to the set dom_t). Afterwards, the labeling function λ_t is updated accordingly. Since, in the binary tree model, the root has no sibling nodes, and since it is assumed that the input tree consists of at least two nodes, it is clear that the terminal symbol labeling the root node will have a left child but no right child (therefore the superscript 10 in line 16).

If we consider a subsequent call of start-element, the hierarchy stack is not empty and therefore the if-part is processed. Firstly, the index stack is updated in lines 3–5 and after that the node v ∈ dom_t is retrieved from the hierarchy stack (line 7). The tree node v will be the parent of the node which is added in the following. We introduce a new node u which is later (but still in the same call of this function) added to dom_t (line 19). The node u becomes the left child of v if it represents the first child element of the element which is represented by v. In contrast, u becomes a right child if the current index i is greater than one, i.e., if the element being processed is a sibling element of the element represented by v. Regarding the node u, we are unable to update the labeling function λ_t at this time since we do not know yet whether the XML element being processed has child or sibling elements.

If an end-tag is encountered by the input parser, the function end-element listed in Fig. 21 is called. Now, the index of the current XML element is consulted in order to bubble up the sequence of parents stored by the hierarchy stack the correct number of times. After processing the repeat loop, the node that was pushed for the element whose end-tag was just read is again on top of the hierarchy stack; it serves as the parent of a possible following sibling element. For every node v ∈ dom_t which is removed from the hierarchy stack within the repeat loop the labeling function λ_t is updated.

    FUNCTION end-element
      i := index_stack.top();
      repeat i times
        v := hierarchy_stack.top();
        name := name_stack.top();
        l := 0; r := 0;
        if (v1 ∈ dom_t) then l := 1; endif
        if (v2 ∈ dom_t) then r := 1; endif
        λ_t(v) := name^lr;
        hierarchy_stack.pop();
        name_stack.pop();
      endrepeat
      index_stack.pop();
    ENDFUNC
Fig. 21:
The end-element function which is called for every end-tag encountered in the input XML document.
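To illustrate how the two functions interact, here is a minimal Python sketch of the three-stack tree constructor. The class name TreeConstructor and the dictionary-based tree representation (node addresses as strings, labels in a dictionary) are our own simplifications; the prototype itself works on pointer structures.

    class TreeConstructor:
        """Builds the binary representation of an XML document tree from
        SAX-like start/end events, following Figs. 20 and 21."""

        def __init__(self):
            self.dom = set()           # node addresses: "" (root), "1", "12", ...
            self.label = {}            # node address -> annotated element type
            self.index_stack = []      # number of children seen so far
            self.hierarchy_stack = []  # current sequence of parents in the tree
            self.name_stack = []       # element types, consumed in end_element

        def start_element(self, name):
            if self.hierarchy_stack:
                i = self.index_stack.pop() + 1
                self.index_stack.append(i)
                v = self.hierarchy_stack[-1]
                # first child becomes the left child v1, later siblings v2
                u = v + ("1" if i == 1 else "2")
                self.name_stack.append(name)
            else:
                u = ""                       # the root of the document;
                self.label[u] = name + "10"  # assumes at least two nodes
            self.dom.add(u)
            self.index_stack.append(0)
            self.hierarchy_stack.append(u)

        def end_element(self):
            # pop as many nodes as the closed element had children; only now
            # do we know whether each has a left child and a right sibling
            for _ in range(self.index_stack[-1]):
                v = self.hierarchy_stack.pop()
                name = self.name_stack.pop()
                l = "1" if v + "1" in self.dom else "0"
                r = "1" if v + "2" in self.dom else "0"
                self.label[v] = name + l + r
            self.index_stack.pop()

    tc = TreeConstructor()
    tc.start_element("books")
    for _ in range(2):
        tc.start_element("book")
        tc.end_element()
    tc.end_element()
    print(sorted(tc.dom))  # ['', '1', '12']
    print(tc.label)        # '' -> 'books10', '1' -> 'book01', '12' -> 'book00'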
Example 16.
Fig. 22 shows the evolution of the data structures after the first eleven calls to the functions start-element() and end-element(), respectively, when parsing the input tree from Fig. 4. It shows the content of the three stacks after the body of the corresponding function has been executed, where is denotes the index stack, hs denotes the hierarchy stack and ns denotes the name stack. Regarding Fig. 22, the element on top of a stack is always the upper element in the depiction of the corresponding stack. If a node has not been assigned a label yet, i.e., the labeling function λ has not been updated accordingly, the node is depicted in brackets.

The binary representation of the input tree can be obtained in linear runtime since the function start-element and the function end-element, respectively, are each called only once for every node of the input tree. Furthermore, the body of the repeat loop of the latter function is executed once for every input node (except for the root node).

Re-pair for Trees on Multiary Trees
Another way of modeling an XML document tree in a ranked way is the multiary tree model. In contrast to the binary tree model (which we described in Sect. 2.4 on page 7), this model does not encode the input tree by a binary tree but turns the input tree into a ranked tree by introducing a terminal symbol for each combination of element type and number of children which occurs in the input tree. Let us assume, for instance, that an element type occurs three times and that three different numbers of children are attached to the corresponding elements. In the multiary tree model, three different terminal symbols are then introduced.

During our investigations we also evaluated a TreeRePair version based on the multiary tree model. However, this modified version of our algorithm was outperformed by the original version in terms of compression ratio. This is due to the nature of typical XML documents. XML elements encountered in real-world XML documents often exhibit a long list of child elements. Therefore, compared to the binary tree model, a multiary tree model representation of an XML document leads to a higher number of different digrams occurring less often. This, in turn, reduces TreeRePair's ability to compress the XML document tree to the same degree as is possible in the binary case.
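As a hypothetical illustration of the multiary terminal alphabet (the arity-suffix naming is our own convention, not the prototype's), consider:

    # One terminal symbol per (element type, number of children) combination.
    nodes = [("books", 5)] + [("book", 3)] * 5 \
          + [("author", 0), ("title", 0), ("isbn", 0)] * 5

    multiary_alphabet = {f"{name}_{arity}" for name, arity in nodes}
    print(sorted(multiary_alphabet))
    # ['author_0', 'book_3', 'books_5', 'isbn_0', 'title_0']
    # If the five book elements had, say, 2, 3 and 4 children, three distinct
    # 'book' terminals would be introduced instead of one.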
Fig. 22: Content of the stacks after each call of the start-element() and end-element() functions, respectively, when parsing the tree from Fig. 4. In addition, at each step there is a depiction of the binary tree constructed so far. The eleven depicted steps are the calls (1) start-element(books), (2) start-element(book), (3) start-element(author), (4) end-element(), (5) start-element(title), (6) end-element(), (7) start-element(isbn), (8) end-element(), (9) end-element(), (10) start-element(book) and (11) start-element(author).

Example 17.

Consider for example the XML document tree from Fig. 4. The element of type books has five child elements of type book, i.e., each of the five digrams (books, 1, book), (books, 2, book), ..., (books, 5, book) occurs only once. None of these digrams is replaced by TreeRePair since a replacement is only reasonable if the corresponding digram occurs at least twice. In contrast, the binary tree model leads to two occurrences of the digram (book, 2, book) which can be replaced by a new nonterminal symbol in a run of TreeRePair (cf. Fig. 5).
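The digram counts behind Example 17 can be checked with a small sketch (again using our own arity-suffix naming for the multiary terminals):

    from collections import Counter

    multiary = Counter(("books_5", i, "book_3") for i in range(1, 6))
    binary = Counter([("books", 1, "book")] + [("book", 2, "book")] * 4)
    print(multiary.most_common(1))  # every multiary digram occurs exactly once
    print(binary.most_common(1))    # [(('book', 2, 'book'), 4)]: 4 parent-child
    # edges, of which 2 form non-overlapping occurrences that TreeRePair replaces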
In this section we show that the ranked input tree of our algorithm can be efficiently stored as a DAG in memory. This DAG representation can be made nearly transparent to the rest of the algorithm (cf.
Sect. 4.5 on page 41). Thus, by default, the tree constructor of our prototype not only directly transforms the XML document tree into a ranked representation but also infers the corresponding minimal 0-bounded SLCF tree grammar G = (N, P, S), i.e., the minimal DAG, of the latter on the fly. Note that the DAG representation can also be circumvented by using the -no dag switch; in this case the whole binary tree with all its possible redundancy is constructed in main memory.

In [BGK03] it has been demonstrated that the representation of XML document trees based on the concept of sharing common subtrees is highly efficient. The experiments there have shown that in several cases the size of the DAG was less than 10% of the uncompressed XML document tree. Therefore, the sharing of common subtrees enables us to load large XML document trees which would otherwise have exceeded the computation resources. In addition to that, it avoids time-consuming swapping and the repetitive re-computation of the same results concerning subtrees that are shared.

Now, let us elaborate on how one can infer the DAG of the ranked representation t = (dom_t, λ_t) ∈ T(F) of the XML document tree. The tree constructor must check, for every node which is removed from the hierarchy stack in the end-element function, if the subtree rooted at this node can be shared. This can be accomplished by calling the function share-subtree listed in Fig. 23. To better understand this function, let us assume that we want to check if the subtree t′ ∈ T(F) rooted at a node v ∈ dom_t can be shared. If we already encountered an exact copy of t′ while reading the input tree, all subtrees of t′ must have been shared before. Thus, the tree t′ must be of depth 1 and all children nodes must be labeled by nonterminals of the DAG grammar G. Therefore, it is only necessary to compare the labels of the root of t′ and its direct children with those of all subtrees encountered until now. This can be done in constant time with the help of a hash table.

Now, let us assume that we have processed an exact copy of t′ earlier, i.e., t′ can be shared. Thus, the condition in line 3 evaluates to true and the subtrees_ht hash table contains t′. Hence, the else-part beginning in line 6 is processed. If there already exists a nonterminal B ∈ N with right-hand side t′ then we set A := B. We can check this in O(1) time because with each entry of the hash table subtrees_ht we can store a pointer to the corresponding

     1  FUNCTION share-subtree(v)
     2    let t′ be the subtree rooted at v;
     3    if (∀ 1 ≤ i ≤ rank(λ_t(v)) : λ_t(vi) ∈ N) then
     4      if (subtrees_ht does not contain t′) then
     5        insert t′ into subtrees_ht;
     6      else
     7        if (∃ B ∈ N : (B → t′) ∈ P) then
     8          A := B;
     9        else
    10          choose a fresh nonterminal A ∉ N;
    11          N := N ∪ {A};
    12          P := P ∪ {(A → t′)};
    13          let u be the node at which the first occurrence of t′ is rooted;
    14          replace the subtree rooted at u by A;
    15          w := parent(u);
    16          if (∀ 1 ≤ i ≤ rank(λ_t(w)) : λ_t(wi) ∈ N) then
    17            let t′′ be the subtree rooted at w;
    18            insert t′′ into subtrees_ht;
    19          endif
    20        endif
    21        replace the subtree rooted at v by A;
    22      endif
    23    endif
    24  ENDFUNC
Fig. 23:
The function share-subtree which checks for the subtree rooted at the node v ∈ dom_t if it can be shared. If this is the case then the sharing is performed.

production. Otherwise, i.e., if there exists no (B → t′′) ∈ P with t′ = t′′, we introduce a new nonterminal A ∉ N with right-hand side t′ and replace the first occurrence u of the subtree t′ by A. There can be only one earlier occurrence of the subtree t′ since otherwise we would already have inserted a corresponding production. Furthermore, we can guarantee constant-time access to u because with each entry in the hash table subtrees_ht we can store a pointer to the corresponding first occurrence. Finally, we add the subtree rooted at the node parent(u) to the hash table if all of its subtrees are shared. We do not need to insert the subtree rooted at the node parent(v) since we will process parent(v) in a later step (we are traversing the input tree in postorder). In contrast, if t′ was not encountered until now, we add it to the hash table subtrees_ht (line 5) in order to be able to share possible later occurrences of it.

Initially, i.e., after reading the input tree, all shared subtrees are of depth 1. In order to reduce the number of nonterminals of the DAG grammar (without increasing the number of total edges) all productions referenced only once are eliminated. All in all, inferring the DAG grammar needs linear time and can be conveniently combined with the step of transforming the input tree into a ranked tree.

The data structures we use in our implementation are similar to those used in [LM00]. In order to be able to focus on the essentials, we do not pay attention to the fact that, internally, the input tree is represented by a DAG. Let us assume that the binary input tree t = (dom_t, λ_t) ∈ T(F) has been generated by our implementation after reading a corresponding XML document tree. Hence, the tree t is the ranked representation of the latter.

Fig. 24: The tree t ∈ T(F) modeled by the node objects from Fig. 25.

In main memory, every node v ∈ dom_t is represented by an object exhibiting several pointers. These allow constant-time access to the parent and all children of the node v and to the possible next and previous occurrences of the digram α = (λ_t(v), i, λ_t(vi)), where i ∈ {1, 2, ..., rank(λ_t(v))}. The pointers to the next and previous occurrences of α form a doubly linked list of all the occurrences in occ_t(α). We call this type of list an occurrences list (of α) in the sequel. The specific order of the occurrences in an occurrences list is not relevant.

Every digram is represented by a special object. It exhibits two pointers which reference the first and the last element of the corresponding occurrences list. Let us consider a digram α ∈ Π with |occ_t(α)| = m, where m < ⌊√n⌋ and n = |t|. Then the corresponding object exhibits two more pointers which point to the next and previous, respectively, digram β ∈ Π with |occ_t(β)| = m. These pointers form a doubly linked list of all digrams occurring m times. We call this type of list the m-th digram list. In contrast, all digrams γ ∈ Π with |occ_t(γ)| ≥ ⌊√n⌋ are organized in one doubly linked list which is called the top digram list.

These doubly linked lists of digrams are again referenced by a digram priority queue. This queue consists of ⌊√n⌋ entries. The i-th entry stores a pointer to the head of the i-th digram list, where 1 ≤ i < ⌊√n⌋. The ⌊√n⌋-th entry references the head of the top digram list. Refer to Sect. 4.4 on page 36 for an explanation of why we designed the digram lists and priority queue as described above. Lastly, there is a digram hash table storing pointers to all occurring digrams. It allows constant-time access to all digrams and therefore constant-time access to the first occurrence of each digram.

Let us consider the following example to see how the utilized data structures work.

Example 18.
Let us assume that the tree t = (dom_t, λ_t) ∈ T(F) shown in Fig. 24 has been generated by our implementation after reading a corresponding XML document tree. Then Fig. 25 shows a simplified depiction of the data structures used to efficiently replace the digrams in the replacement step. All non-null pointers are represented by arrows starting in a filled circle and ending in an empty circle. A filled circle without an outgoing arrow denotes a null pointer.

With respect to Fig. 25, there is a total of eleven node objects representing tree nodes labeled by the two symbols f ∈ F and a ∈ F. An instance of a tree node v ∈ dom_t is represented by a tabular box as it is shown in Fig. 25.1. Unlike depicted, in our implementation a symbol is not directly stored within the node structure; instead, for every unique symbol there is an object which is referenced by the corresponding nodes.

During our investigations we also implemented a TreeRePair version avoiding these doubly linked lists of occurrences. Instead, for every digram, we used a hashed set storing pointers to all occurrences. However, this version had no benefits compared to the doubly linked list approach but led to slightly longer runtimes. Considering the memory usage, in some cases it achieved better results while in others a substantial increase was noticed.

Fig. 25.1:
A graphical representation of an object representing a tree node labeled by f ∈ F.

Fig. 25.2:
A graphical representation of a digram (f, 1, a) ∈ Π.

The upper left empty circle of the box represents the memory address of the tree node instance. Thus, every arrow representing a pointer to the latter will end in this empty circle. The filled circle in the first row of the tabular box represents the pointer to the possible parent node parent(v). The pointer to the i-th child vi of the node v is depicted by an arrow starting at the filled circle in the i-th column of the children row, where i ∈ {1, 2, ..., rank(λ_t(v))}. Analogously, a pointer to a possible next (previous) occurrence of the digram α = (λ_t(v), i, λ_t(vi)) is represented by a filled circle in the i-th column of the row labeled by next (previous, respectively), where i ∈ {1, 2, ..., rank(λ_t(v))}.

Each digram (f, 1, f), (f, 1, a), (f, 2, f) and (f, 2, a) is represented by a tabular box (see Fig. 25.2). Again, unlike depicted, in our implementation a symbol is not directly stored within the digram structure; instead, the latter contains two pointers to the objects representing its parent and child symbols. The first and the last element of the occurrences list of the digram α are referenced by the first and last pointers of the object representing the digram α. The pointers prev (previous) and next are part of the |occ_t(α)|-th digram list if |occ_t(α)| < ⌊√n⌋, where n = |t|. Otherwise they belong to the top digram list.

The digram (f, 1, f) forms a trivial doubly linked list, namely the 1st digram list. The latter is referenced by the first entry of the priority queue. The digram (f, 1, a) forms the (trivial) top digram list which is referenced by the last entry of the priority queue. In contrast, the digrams (f, 2, f) and (f, 2, a) each occur twice and therefore point to each other with their next and previous pointers, respectively. The first element of the resulting 2nd digram list is referenced by the second entry of the priority queue. The digram hash table stores the pointers to all four occurring digrams.

Fig. 25: A simplified depiction of a part of the data structures used by our implementation.

For any given input tree with n edges, TreeRePair produces in time O(|t|) a k-bounded linear SLCF tree grammar G, where k ∈ N is a constant, t ∈ T(F) is the binary representation of the input tree, and val(G) = t.

It is straightforward to come up with a linear-time implementation of the pruning step of the Re-pair for Trees algorithm (cf. Sect. 3.3 on page 13). Therefore, we just want to investigate the complexity of the replacement step which was described in Sect. 3.2 on page 13.

With every replacement of a digram occurrence one edge of the input tree is absorbed. Therefore, a run of TreeRePair can consist of at most n − 1 iterations, where n is the size of the input tree. Each replacement of an occurrence can be accomplished in O(1) time since at most k children need to be reassigned (in our implementation, the reassignment of a child node is just a matter of updating two pointers). For every production which is introduced during a run of our algorithm it holds that its right-hand side is of constant size (bounded in terms of k), i.e., it can be constructed in constant time.

    FUNCTION retrieve-all-occs(t)
      v := ε;
      while (true) do
        v := next-in-postorder(t, v);
        if (v ≠ ε) then
          α := (λ_t(parent(v)), index(v), λ_t(v));
          if (v ∉ occ_t(α)) then
            occ_t(α) := occ_t(α) ∪ {parent(v)};
          endif
        else
          return;
        endif
      endwhile
    ENDFUNC

Fig. 26: The function retrieve-all-occs which is used to construct the set occ_t(α) for every digram α ∈ Π occurring in the tree t ∈ T(F ∪ N). It uses the function next-in-postorder listed in Fig. 7.
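The following is a minimal Python sketch of the data structures depicted in Fig. 25. All class and field names are ours, plain Python lists stand in for the doubly linked digram lists, and the bookkeeping that moves a digram between buckets when its count changes is omitted. The unlink method shows the constant-time occurrence removal discussed below.

    import math

    class Node:
        """A tree node of Fig. 25: parent/children pointers plus, per child
        position i, next/previous pointers for the digram rooted here at i."""
        def __init__(self, symbol, rank):
            self.symbol, self.rank = symbol, rank
            self.parent = None
            self.children = [None] * rank
            self.next = [None] * rank
            self.previous = [None] * rank

    class Digram:
        """A digram object with first/last pointers into its occurrences list."""
        def __init__(self, parent_sym, i, child_sym):
            self.key = (parent_sym, i, child_sym)
            self.first = self.last = None
            self.count = 0

        def unlink(self, v, i):
            # O(1) removal of occurrence v: only v's neighbours and, possibly,
            # the first/last pointers of this digram object are touched.
            before, after = v.previous[i], v.next[i]
            if before is None:
                self.first = after
            else:
                before.next[i] = after
            if after is None:
                self.last = before
            else:
                after.previous[i] = before
            v.next[i] = v.previous[i] = None
            self.count -= 1

    class DigramQueue:
        """Buckets 1..floor(sqrt(n))-1 hold the m-th digram lists; the last
        bucket plays the role of the top digram list."""
        def __init__(self, n):
            self.limit = math.isqrt(n)
            self.buckets = [[] for _ in range(self.limit + 1)]

        def add(self, digram):
            self.buckets[min(digram.count, self.limit)].append(digram)

        def most_frequent(self):
            top = self.buckets[self.limit]
            if top:                      # scan the (unordered) top digram list
                return max(top, key=lambda d: d.count)
            for bucket in reversed(self.buckets[1:self.limit]):
                if bucket:               # first non-empty m-th digram list
                    return bucket[0]
            return None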
However, to show that the replacement step can be performed in linear time, two more aspects need to be considered. Imagine that we are in the i-th iteration of our algorithm (and G_{i−1} is the current grammar). Let t ∈ T(F ∪ N) be the right-hand side of G_{i−1}'s start production.

(1) Updating the sets of non-overlapping occurrences

In every iteration of our algorithm we need to know the number of occurrences of each digram; only then are we able to determine the most frequent digram. In addition, for replacing the digram max_k(t), we need to know occ_t(max_k(t)). How can we compute the set occ_t(α) for every digram α ∈ Π without traversing the whole right-hand side of the current start production in each iteration?

(2) Retrieving the most frequent digram
Let us assume that there is an up-to-date set occ_t(α) available for every α ∈ Π occurring in t (in the form of occurrences lists). How do we determine the most frequent digram in constant time?

In the following we consider each of the above aspects in detail.

Updating the Sets of Non-overlapping Occurrences
Let the binary tree t = (dom_t, λ_t) ∈ T(F) be our input tree. At the beginning of the replacement step the set occ_t(α) is initially constructed for every digram α ∈ Π occurring in t. This is done by parsing the tree t in a similar way as it is done in the function retrieve-occurrences which is listed in Fig. 8. However, during the traversal not only one digram is considered; instead, for every encountered digram α ∈ Π the set occ_t(α) is constructed. Fig. 26 shows a possible function which accomplishes this task. Therefore, in the first iteration of our computation we have up-to-date sets of non-overlapping occurrences at hand. However, we cannot afford to redo this traversal in every subsequent iteration; otherwise we would not be able to achieve a linear runtime of our algorithm. (As already mentioned at the beginning of this section on page 29, the maximal rank of a nonterminal of a grammar generated by TreeRePair is k ∈ N; the constant k can be specified by a command line switch.)

Fig. 27:
The tree t′ ∈ T(F). All occurrences which would be absorbed by the replacement are highlighted.

    FUNCTION remove-absorbed-occs(t, v, j)
      if (v ≠ ε) then
        α := (λ_t(parent(v)), index(v), λ_t(v));
        occ′_t(α) := occ′_t(α) \ {parent(v)};
      endif
      for (l ∈ {1, 2, ..., rank(λ_t(v))}) do
        α := (λ_t(v), l, λ_t(vl));
        occ′_t(α) := occ′_t(α) \ {v};
      endfor
      for (l ∈ {1, 2, ..., rank(λ_t(vj))}) do
        α := (λ_t(vj), l, λ_t(vjl));
        occ′_t(α) := occ′_t(α) \ {vj};
      endfor
    ENDFUNC
Fig. 28:
Listing of the function remove-absorbed-occs which removes all absorbed occurrences from the occ′_t sets.

Fortunately, there is another way of keeping track of the sets of non-overlapping occurrences. It relies on the fact that every replacement of a digram occurrence v only involves those occurrences in the neighborhood of v which overlap with v.

Example 19.
Let us consider the tree t′ = (dom_{t′}, λ_{t′}) ∈ T(F) which is depicted in Fig. 27. The occurrences which would be absorbed by the replacement of an occurrence v ∈ dom_{t′} of the digram (f, 1, g) are highlighted.

For every digram α ∈ Π we set occ′_t(α) := occ_t(α) and base all upcoming computations on the sets occ′_t(α). In particular we use them to determine the most frequent digram in each iteration. Let us consider the i-th iteration of a run G_0, G_1, ..., G_h of Re-pair for Trees on the input tree t ∈ T(F), where h ∈ N and i ∈ {1, 2, ..., h}. Then G_{i−1} = (N_{i−1}, P_{i−1}, S_{i−1}) is the current grammar. Let t_{i−1} ∈ T(F ∪ N_{i−1}) be the right-hand side of S_{i−1}. Let us assume that an up-to-date set occ′_{t_{i−1}}(β) is at hand for every β ∈ Π occurring in t_{i−1}. Further, let us assume that max(t_{i−1}) = (a, j, b) =: α and let v ∈ occ′_{t_{i−1}}(α).

Before the actual replacement of the occurrence v we make use of the function listed in Fig. 28. The function call remove-absorbed-occs(t_{i−1}, v, j) removes all occurrences which will be absorbed by the upcoming replacement from the sets occ′_{t_{i−1}}. After the replacement of v by a new node u with λ_{t_i}(u) = A_i ∈ N_i we call the function add-new-occs (which is listed in Fig. 29) and pass the tree t_{i−1} and the node u. The function add-new-occs adds all new occurrences which arose by the introduction of u to the sets of non-overlapping occurrences. Finally, after all occurrences from occ′_{t_{i−1}}(α) have been replaced, we set occ′_{t_i}(β) := occ′_{t_{i−1}}(β) for all β ∈ Π occurring in t_i.

    FUNCTION add-new-occs(t, u)
      if (u ≠ ε) then
        α := (λ_t(parent(u)), index(u), λ_t(u));
        occ′_t(α) := occ′_t(α) ∪ {parent(u)};
      endif
      for (l ∈ {1, 2, ..., rank(λ_t(u))}) do
        α := (λ_t(u), l, λ_t(ul));
        occ′_t(α) := occ′_t(α) ∪ {u};
      endfor
    ENDFUNC

Fig. 29: Listing of the function add-new-occs which adds all newly created occurrences to the occ′_t sets.

Fig. 30: Tree t′′ ∈ T(F) consisting of nodes labeled by the terminal symbols a, b, c, d, f ∈ F. We have to deal with three overlapping occurrences of the digram (f, 2, f).

Let α ∈ Π be a digram occurring in t_i. The set occ′_{t_i}(α) computed above may not be equal to the actual set occ_{t_i}(α) as it would be constructed by a complete postorder traversal of t_i using the function retrieve-occurrences from Fig. 8.

Example 20.
Consider, for instance, the tree t′′ ∈ T(F) depicted in Fig. 30. Let α = (f, 2, f). In the first iteration of our algorithm, we would obtain occ′_{t′′}(α) := occ_{t′′}(α) = {2}. Now, let us assume that we replace the digram (f, 1, c) (we could easily enlarge t′′ such that (f, 1, c) is the most frequent digram and still show the same). After performing this replacement, and especially after calling the functions remove-absorbed-occs and add-new-occs, we would have occ′_{t′′}(α) = ∅. However, a postorder traversal of the updated tree t′′ would result in occ_{t′′}(α) = {ε}.

Updating the sets of non-overlapping occurrences takes constant time per occurrence replacement. At most k + 1 occurrences need to be removed by the function remove-absorbed-occs and at most k + 1 occurrences need to be added by the function add-new-occs. An occurrence v of a digram α can be removed from the occurrences list of α in constant time by setting the next and previous pointers of the corresponding node object to null. In addition, if v is the first (last) occurrence in the occurrences list of α, the first (last) pointer of the object representing the digram α needs to be updated. This can also be accomplished in constant time by using the digram hash table. Analogously, an occurrence can be added to an occurrences list in O(1) time.

Retrieving the Most Frequent Digram
We now investigate the time needed to obtain the most frequent digram in an iteration of our algorithm. First of all, let us state the following fact: Let m ∈ N ∪ {∞} and let G_0, G_1, ..., G_n be a run of Re-pair for Trees, where n ∈ N_{>0}, G_i = (N_i, P_i, S_i) and (S_i → t_i) ∈ P_i for every i ∈ {0, 1, ..., n}. Then

    |occ_{t_i}(max_m(t_i))| ≥ |occ_{t_{i+1}}(max_m(t_{i+1}))|

holds for every i ∈ {0, 1, ..., n−1}. (Intuitively, we define |occ_{t_n}(max_m(t_n))| = 0 if max_m(t_n) is undefined.) For every digram α ∈ Π occurring in t_i it holds that |occ_{t_i}(α)| ≥ |occ_{t_{i+1}}(α)|, and for every digram β ∈ Π which was introduced in G_{i+1} it holds that |occ_{t_{i+1}}(β)| ≤ |occ_{t_i}(max_m(t_i))|, where i ∈ {0, 1, ..., n−1}.

It is easy to see that, if the top digram list is empty, we can obtain the most frequent digram in constant time. We just need to walk down the remaining ⌊√n⌋ − 1 digram lists and choose the first element of the first non-empty list. In every iteration, after we have determined the most frequent digram, we remember the first non-empty digram list in order to save ourselves the needless and time-consuming rechecking of the empty digram lists.

Now, let us assume that the top digram list, i.e., the doubly linked list of all digrams occurring at least ⌊√n⌋ times, is not empty. We need to scan all elements in it since the digrams contained are not ordered by their frequency. There can be roughly at most √n digrams in the top digram list. Therefore, we need roughly O(√n) time to retrieve the most frequent digram. However, by the replacement of this digram at least ⌊√n⌋ edges are absorbed. It is easy to see that, all in all, obtaining the most frequent digram needs constant time on average.

In a run of TreeRePair we can replace at most n − 1 digram occurrences and, as shown before, the replacement of each occurrence, the update of the sets of non-overlapping occurrences and the determination of the most frequent digram can be accomplished in constant time per occurrence replacement. Thus, the whole replacement step can be completed in linear time.

In the preceding section, dealing with the complexity of our implementation of the Re-pair for Trees algorithm, we did not pay attention to the underlying DAG representation of the input tree. This enabled us to concentrate on the essentials. Nevertheless, we have to clarify the impact of this representation, particularly concerning the compression performance and the runtime of our implementation, since TreeRePair uses it by default. Only by starting TreeRePair with the -no dag switch does it forgo the DAG representation and load the whole input tree into main memory.

Let G = (N, P, S) be a 0-bounded SLCF tree grammar. We assume without loss of generality that every nonterminal B ∈ N occurs in a derivation from S. Let (A → t) ∈ P, t = (dom_t, λ_t) ∈ T(F ∪ N) and v ∈ dom_t. We define the function unfold using the algorithm listed in Fig. 31. It holds that

    FUNCTION unfold(G, t, v)
      let G = (N, P, S) and (A → t) ∈ P;
      if (ref_G(A) ≠ ∅) then
        M := ∅;
        for each (t′, v′) ∈ ref_G(A) do
          M := M ∪ {uv | u ∈ unfold(G, t′, v′)};
        endfor
      else
        M := {v};
      endif
      return M;
    ENDFUNC
Fig. 31:
The algorithm which computes unfold(G, t, v), where we have t ∈ T(F ∪ N) and v ∈ dom_t.

     1  FUNCTION retrieve-all-occs-dag(t)
     2    v := ε;
     3    while (true) do
     4      v := next-in-postorder(t, v);
     5      if (v ≠ ε) then
     6        if (λ_t(v) ∉ N) then
     7          α := (λ_t(parent(v)), index(v), λ_t(v));
     8          if (v ∉ occ′_t(α)) then
     9            occ′_t(α) := occ′_t(α) ∪ {parent(v)};
    10          endif
    11        else
    12          let t′ be the right-hand side of λ_t(v);
    13          if (λ_{t′}(ε) ≠ λ_t(parent(v))) then
    14            α := (λ_t(parent(v)), index(v), λ_{t′}(ε));
    15            occ′_t(α) := occ′_t(α) ∪ {parent(v)};
    16          endif
    17        endif
    18      else
    19        return;
    20      endif
    21    endwhile
    22  ENDFUNC
Fig. 32:
The function retrieve-all-occs listed in Fig. 26 adapted for the DAG case. For every α ∈ Π the set occ′_t(α) is initially set to ∅.

unfold(G, t, v) ⊆ dom val(G), and it also holds that

    ⋃_{(A → t) ∈ P, v ∈ dom_t} unfold(G, t, v) = dom val(G).

Let us consider a run G_0, G_1, ..., G_h of TreeRePair, where G_i = (N_i, P_i, S_i), (S_i → t_i) ∈ P_i, h ∈ N and i ∈ {0, 1, ..., h}. Then, in our implementation, t_i is represented by a 0-bounded (linear) SLCF tree grammar, i.e., by a DAG grammar whose value is t_i, by default.

Constructing the Sets of Non-overlapping Occurrences
In the first iteration of TreeRePair we need to construct the set occ_t(α) for every digram α ∈ Π occurring in t. Our first attempt to accomplish this could be a postorder traversal of all the right-hand sides of P's productions using the function retrieve-all-occs listed in Fig. 26 on page 38. However, when traversing the right-hand sides of the DAG grammar G individually, we do not consider occurrences spanning two productions of the DAG.

Fig. 33:
The tree t ∈ T(F) which can be represented by a DAG grammar with productions (S → f(A, A)) and (A → g(a, b, c)).

Fig. 34:
The tree t′ ∈ T(F) which can be represented by a DAG grammar with productions A_1 → f(A_2, A_2) and A_2 → f(a, f(a, a)).
Consider the DAG grammar G = (N, P, S), where N = {S, A} and P contains the two productions (S → f(A, A)) and (A → g(a, b, c)). It is a compressed representation of the tree t ∈ T(F) depicted in Fig. 33. If we used the function retrieve-all-occs to determine all digram occurrences in the right-hand sides of P's productions, we would not capture the node ε ∈ dom_t which is an occurrence of both the digram (f, 1, g) and the digram (f, 2, g).

As we have seen, it is necessary to modify the retrieve-all-occs function slightly to also take occurrences spanning two productions into account. We use the algorithm listed in Fig. 32 to obtain the set occ′_t(α) for every right-hand side t of G's productions and every digram α ∈ Π occurring in t. After that, we set

    occ′_t(α) := ⋃_{(A → t) ∈ P, v ∈ occ′_t(α)} unfold(G, t, v).

We test in line 13 of the retrieve-all-occs-dag function if α has equal parent and child symbols. If this proves to be true, we do not add the corresponding occurrence to occ′_t(α), i.e., we do not consider occurrences of a digram with equal parent and child symbols spanning two productions of the DAG. If we did, we would possibly register overlapping occurrences and run into problems during a later replacement of α. Consider the following example:
Consider the DAG grammar G = (N, P, A_1) given by the productions (A_i → t_i) ∈ P, where i ∈ {1, 2}, t_1 = f(A_2, A_2) and t_2 = f(a, f(a, a)). It is a compressed representation of the tree t′ ∈ T(F) depicted in Fig. 34. We use the algorithm from Fig. 32 to obtain the sets occ′_{t_i}(α) for i ∈ {1, 2} and every digram α ∈ Π occurring in t′. Let us assume that we omit the check in line 13, i.e., we also consider occurrences of digrams with equal parent and child symbols spanning two productions. The union

    ⋃_{i ∈ {1,2}, v ∈ occ′_{t_i}((f, 2, f))} unfold(G, t_i, v) = {ε, 1, 2}

then contains the overlapping occurrences ε and 2 of the digram (f, 2, f).

Fig. 35:
Tree t′′ ∈ T(F) with seven overlapping occurrences of the digram (f, 2, f).

The precaution from line 13 sometimes leads to situations in which we replace fewer occurrences of a digram with equal parent and child symbols than we would replace when not using the DAG representation.
Consider the tree t′′ ∈ T(F) from Fig. 35 which can be represented by the DAG grammar consisting of the two productions (S → t_1) and (A → t_2) with t_1 = f(f(b, A), f(c, A)) and t_2 = f(a, f(a, f(a, a))). After careful counting one can tell that t′′ exhibits at most four non-overlapping occurrences of the digram α = (f, 2, f). However, if we use the above function retrieve-all-occs-dag we only capture three of them. We obtain occ′_{t_1}(α) = {ε}, occ′_{t_2}(α) = {2} and therefore

    occ′_{t′′}(α) = ⋃_{i ∈ {1,2}, v ∈ occ′_{t_i}(α)} unfold(G, t_i, v) = {ε, 122, 222}.

Even though this approach does not capture all the occurrences which could be captured when not using the DAG representation, it still achieves a competitive compression performance on our set of test files (cf. Sect. 6.6 on page 62). It seems that a more involved method of dealing with digrams with equal parent and child symbols spanning two productions would necessitate a partial unfolding of the DAG. The latter, however, would certainly result in a longer runtime.
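For concreteness, here is a Python transcription of the unfold function from Fig. 31, under the simplifying assumption that the grammar is given by a refs table mapping each nonterminal to the list of places referencing it (this encoding is our own):

    def unfold(refs, lhs, v):
        """Addresses in val(G) at which node v of lhs's right-hand side occurs;
        refs[A] lists all (B, u) where node u of B's right-hand side is
        labelled A, and the start nonterminal has no references."""
        if not refs.get(lhs):
            return {v}
        result = set()
        for ref_lhs, u in refs[lhs]:
            result |= {prefix + v for prefix in unfold(refs, ref_lhs, u)}
        return result

    # DAG grammar of Example 21: S -> f(A, A), A -> g(a, b, c)
    refs = {"S": [], "A": [("S", "1"), ("S", "2")]}
    print(sorted(unfold(refs, "A", "1")))  # ['11', '21']: both copies of g's
                                           # first child in the unfolded tree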
Updating the Sets of Non-overlapping Occurrences
Considering the graph representation of a DAG, a tree node can have multiple parent nodes. In fact, a node has multiple parent nodes if it is the root of the right-hand side of a production of the corresponding DAG grammar and if this production is referenced multiple times.
    FUNCTION remove-absorbed-occs-dag(t, v, j)
      if (v ≠ ε) then
        remove-occ-dag(t, parent(v), index(v));
      else
        let A be the left-hand side of the production with right-hand side t;
        for each (t′, u) ∈ ref_{G_i}(A) do
          remove-occ-dag(t′, parent(u), index(u));
        endfor
      endif
      for (l ∈ {1, 2, ..., rank(λ_t(v))}) do
        remove-occ-dag(t, v, l);
      endfor
      for (l ∈ {1, 2, ..., rank(λ_t(vj))}) do
        remove-occ-dag(t, vj, l);
      endfor
    ENDFUNC
    FUNCTION remove-occ-dag(t, v, j)
      if (λ_t(vj) ∉ N) then
        α := (λ_t(v), j, λ_t(vj));
      else
        let t′ be the right-hand side of λ_t(vj);
        α := (λ_t(v), j, λ_{t′}(ε));
      endif
      occ′_t(α) := occ′_t(α) \ {v};
    ENDFUNC
Fig. 36:
Listing of the function remove-absorbed-occs-dag which removes all absorbed occurrences from the occ′_t sets when using the DAG mode.

To capture all digram occurrences which are absorbed by the replacement of a digram we need to take the above fact into account. The remove-absorbed-occs function listed in Fig. 28 needs to be adapted accordingly. Instead of removing one occurrence formed by the node being replaced and its parent, we need to iterate over possibly multiple parents and remove all corresponding occurrences. In Fig. 36 the function remove-absorbed-occs-dag is listed which incorporates this necessary modification. Analogously, the function add-new-occs listed in Fig. 29 must be modified to work properly in the DAG mode; Fig. 37 shows an adapted version. It is easy to see that our linear runtime is not negatively affected by this loop over all parents. Far from it: as mentioned earlier, the DAG representation saves us time by avoiding repetitive re-calculations.
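A hypothetical Python helper (the names refs and prod, and the string-address convention, are our own assumptions) makes the loop over all parents explicit:

    def parents_in_dag(prod, v, refs):
        """Yield (right-hand side, parent node, child index) for every parent
        of node v; refs[A] lists all (rhs-owner, node) pairs referencing A."""
        if v != "":                              # v has a parent inside its rhs
            yield prod, v[:-1], int(v[-1])
        else:                                    # v is the rhs root of prod:
            for owner, u in refs.get(prod, []):  # one parent per reference
                yield owner, u[:-1], int(u[-1])

    # For the grammar of Example 21 (S -> f(A, A), A -> g(a, b, c)):
    refs = {"A": [("S", "1"), ("S", "2")]}
    print(list(parents_in_dag("A", "", refs)))   # [('S', '', 1), ('S', '', 2)]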
Replacing the Digrams
The third and last scenario in which we have to take special care of the DAG representation is the replacement of an occurrence of a digram α ∈ Π spanning two productions of the DAG grammar. Due to our restriction on digrams with equal parent and child symbols, the digram α must have different parent and child symbols. In the following we use an example to describe what needs to be done when replacing the digram α.

Example 24.
Consider the DAG grammar given by the productions S → f(g(t_1, A), A) and A → h(t_2, t_3) which represents the F-labeled tree t depicted in Fig. 38. Imagine that we want

    FUNCTION add-new-occs-dag(t, v)
      if (v ≠ ε) then
        add-occ-dag(t, parent(v), index(v));
      else
        let A be the left-hand side of the production with right-hand side t;
        for each (t′, u) ∈ ref_{G_i}(A) do
          add-occ-dag(t′, parent(u), index(u));
        endfor
      endif
      for (l ∈ {1, 2, ..., rank(λ_t(v))}) do
        add-occ-dag(t, v, l);
      endfor
    ENDFUNC
    FUNCTION add-occ-dag(t, v, j)
      if (λ_t(vj) ∉ N) then
        α := (λ_t(v), j, λ_t(vj));
      else
        let t′ be the right-hand side of λ_t(vj);
        α := (λ_t(v), j, λ_{t′}(ε));
      endif
      occ′_t(α) := occ′_t(α) ∪ {v};
    ENDFUNC
Fig. 37:
Listing of the function add-new-occs-dag which adds all new occurrences to the occ′_t sets when using the DAG mode.

Fig. 38:
Depiction of the F-labeled tree t. We have t_1, t_2, t_3 ∈ T(F).

to replace the sole occurrence of the digram (f, 2, h), i.e., an occurrence spanning two productions. (For the sake of convenience, our example uses a rather small tree and we decided to replace a digram occurring only once. We could easily enlarge t such that (f, 2, h) occurs multiple times and still show the following.) In order to do that we mainly have to complete the following three steps:

(1) We first have to introduce a new production for every child of the node labeled by h. Thus, we obtain two new productions B → t_2 and C → t_3. We can skip this step for every child node which is already labeled by a nonterminal of the DAG grammar.
(2) We need to update the production with left-hand side A to A → h(B, C).
(3) Finally, we introduce a new nonterminal D representing the digram (f, 2, h) and update the production for S to S → D(g(t_1, A), B, C).

The above steps are only necessary if the production with left-hand side A is referenced more than once. Otherwise we could have directly connected the children of h to the newly introduced node labeled by D and removed the production with left-hand side A from the grammar.

Fig. 39:
Hierarchy of the employed encodings (levels, from bottom to top: linear SLCF tree grammar; 3 base Huffman codings; run-length coding; super Huffman coding; fixed-length coding).
Since at most k new productions need to be introduced, the replacement of a digram occurrence can still be accomplished in constant time. All in all, it has become clear that even when representing the input tree of our algorithm as a DAG, our implementation runs in linear time.

The source code of the TreeRePair prototype and its documentation is available at Google Code™; see also the README.txt file in the root directory of the TreeRePair distribution.
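To recap Example 24, the three steps can be sketched on a toy encoding of the grammar as nested tuples, with t_1, t_2, t_3 shrunk to leaves; this is an illustration under our own encoding, not the prototype's data structure:

    # Replace the occurrence of the digram (f, 2, h) that spans the productions
    # S -> f(g(t1, A), A) and A -> h(t2, t3), where A is referenced twice.
    grammar = {
        "S": ("f", ("g", ("t1",), ("A",)), ("A",)),
        "A": ("h", ("t2",), ("t3",)),
    }

    # (1) introduce a production for every child of h not yet a nonterminal
    grammar["B"] = ("t2",)
    grammar["C"] = ("t3",)
    # (2) rewrite A's right-hand side to refer to the new nonterminals
    grammar["A"] = ("h", ("B",), ("C",))
    # (3) introduce D for the digram, with D(y1, y2, y3) = f(y1, h(y2, y3)),
    #     so h's former children become additional arguments of D
    grammar["S"] = ("D", ("g", ("t1",), ("A",)), ("B",), ("C",))
    print(grammar)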
In order to achieve a compact representation of the input tree of our TreeRePair algorithm we further compress the generated linear SLCF tree grammar by a binary succinct coding. The technique we use is loosely based on the DEFLATE algorithm described in [Deu96]. In fact, we use a combination of a fixed-length coding, multiple Huffman codings and a run-length coding to encode different aspects of the grammar (cf.
Fig. 39). In spite of the fact that we obtain an extremely compact binary representation of the generated SLCF tree grammar, we are still able to directly execute queries on it with little effort. Basically, we only have to reconstruct the Huffman trees to be able to partially decompress the grammar on demand.

In [MMS08] many different variants of succinct codings specialized in SLCF tree grammars were investigated. Among them there was one encoding scheme which turned out to achieve the best compression performance in general, at least with respect to the set of sample SLCF tree grammars used in that work. However, our experiments show that, regarding the SLCF tree grammars generated by TreeRePair, this encoding is outperformed by the succinct coding which we present in this section.
In this section, we want to elaborate on the following topics: How do we need to modify the pruning step of our algorithm to make our succinct coding as efficient as possible? How does TreeRePair efficiently deal with parameter nodes? How can we serialize a Huffman tree in a compact way?
Inefficient Productions
Our experiments showed that, at least for our set of test XML documents, we achieve better compression results in terms of the size of the output file if we slightly modify the pruning step of our algorithm. It turns out that our succinct coding, which we describe in the following sections, is most efficient if we prune all productions with a sav-value smaller than or equal to 1 (instead of pruning all productions with a sav-value smaller than or equal to 0, as described in Sect. 3.3 on page 13). However, we use this modification only if we make the size of the output file a top priority (by using the switch -optimize filesize). Otherwise, when optimizing the number of edges of the final grammar (i.e., when using the switch -optimize edges), we stick to the original version of the pruning step.

Handling of Parameter Nodes
Let G = (N, P, S) be the linear SLCF tree grammar which was generated by a run of TreeRePair. Then, for every production (A → t) ∈ P it holds that y_i ∈ Y labels the i-th parameter node of t in preorder, where i ∈ {1, 2, ..., rank(A)}. Due to this fact it is sufficient to represent the parameter symbols y_1, y_2, ..., y_{rank(A)} ∈ Y by a single parameter symbol y ∈ Y. Let (B → t′) ∈ P be another production and let v ∈ dom_{t′} with λ_{t′}(v) = A. Now, let us assume that we want to eliminate the production (A → t) and that we use only a single parameter symbol labeling all parameter nodes. It is clear that the i-th (in preorder) parameter node of t must be replaced by the subtree which is rooted at the i-th child of v.

Our implementation takes advantage of the above simplification, i.e., it uses only one parameter symbol y for every occurring parameter node.

Serializing Huffman trees
As stated in [Deu96], it is sufficient to write out only the lengths of the generated codes to be able to reconstruct a Huffman tree at a later date. However, this requires the decompressor to be aware of the following:

– What symbols are encoded by the corresponding Huffman tree?
– In what order are their code lengths listed?

In our case only integers need to be encoded by Huffman codings because we will encode all symbols by integers (see Sect. 5.2 on page 49). Hence, it is natural to use the natural order of the integers to list the lengths of the generated codes. Let us assume that n ∈ N is the biggest integer which needs to be encoded and which was assigned a code to, respectively. We just need to loop over all integers m ≤ n in their natural order and print out the corresponding code length for each of them. For every k < n to which no code was assigned we print out a code length of 0.

In order to rely solely on the code lengths there is still something which needs to be considered. We are required to assign new codes to the integers based on the lengths of their original codes. More precisely, the new code assignment has to fulfill the following two requirements:

(1) All codes of the same code length exhibit lexicographically consecutive values when ordering them in the natural order of the integers they represent.
(2) Shorter codes lexicographically precede longer codes.

This reorganization of the Huffman codes does not affect the compression performance of the coding since only codes of the same length are swapped.

    Symbol   Code
    a        11
    b        0
    c        101
    e        100

Table 5.1: Huffman coding before the reorganization of the codes. The letters are listed in their natural order, i.e., in alphabetic order.

    Symbol   Code
    a        10
    b        0
    c        110
    e        111

Table 5.2: Huffman coding from Table 5.1 after the reorganization of the codes.

The following example is based on an example from [Deu96].
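The reorganization demanded by requirements (1) and (2) is exactly the canonical code construction of DEFLATE. Here is a minimal Python sketch, using the code-length sequence of Example 25 below:

    def canonical_codes(lengths):
        """lengths[i] is the code length of symbol i (0 = no code assigned);
        returns the reassigned codes as bit strings."""
        codes, code = {}, 0
        for length in range(1, max(lengths) + 1):
            for symbol, l in enumerate(lengths):
                if l == length:       # same length: consecutive codes, listed
                    codes[symbol] = format(code, f"0{length}b")  # in symbol order
                    code += 1
            code <<= 1                # shorter codes precede longer ones
        return codes

    # Lengths 2, 1, 3, 0, 3 for the letters a..e yield b=0, a=10, c=110, e=111
    print(canonical_codes([2, 1, 3, 0, 3]))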
Example 25.

Imagine that we want to use a Huffman coding to encode the letters a, b, c and e which are each occurring multiple times in a data stream. Let us assume that we obtain the Huffman codes listed in Table 5.1. In order to be able to store the corresponding Huffman tree by only writing out the lengths of the Huffman codes, we need to assign new codes to the letters. Table 5.2 shows the newly assigned codes which fulfill the above two requirements (1) and (2). Now, let us assume that the decompressor expects the code lengths to be the lengths of codes assigned to the letters of the Latin alphabet and that these code lengths are ordered in the natural order of the letters they represent. Then, the corresponding Huffman tree can be unambiguously represented by the following sequence of code lengths: 2, 1, 3, 0, 3. Note that we need to insert a code length of 0 at the position of the letter d since there is no code assigned to the letter d.

In this section we want to elaborate on the information which needs to be stored in the output file of our algorithm in order to be able to reconstruct the generated linear SLCF tree grammar at a later date. We also want to demonstrate how this data can be efficiently represented. However, at this time we do not pay attention to the fixed-length, run-length or Huffman codings which are employed in a subsequent step of the encoding process. For the sake of simplicity we consider these encodings in separate subsections of this section.
Let G = (N, P, S) be the linear SLCF tree grammar which was generated by a run of TreeRePair. Before we are able to compile the information which needs to be written out we need to assign to every symbol from F ∪ (N \ {S}) ∪ {y} a unique integer. In fact, we assign to every symbol from F a unique ID from the set {1, 2, ..., |F|} ⊂ N. We assign the ID |F| + 1 to y, i.e., to the special symbol labeling all parameter nodes in the right-hand sides of P's productions. Finally, we associate with every symbol from the set of nonterminals N \ {S} a unique ID from the set {|F| + 2, |F| + 3, ..., |F| + |N|}. The IDs are assigned to the nonterminals in such a way that the nonterminal A ∈ N \ {S} has a higher ID than the nonterminal B ∈ N \ {S} if B occurs, directly or transitively, in the right-hand side of A.

Writing out the Necessary Information
Now, we are able to write out the information needed to reconstruct G in four steps. Bear in mind that the values mentioned below are not directly written to the output file but that they are additionally encoded by a combination of multiple Huffman codings, a run-length coding and a fixed-length coding later on.

First step
In the first step, we write out the number of terminal symbols |F| and the number of introduced productions |N| − 1, i.e., we are not counting the start production. By handing over this information to the decompressor we avoid the insertion of separators marking, for instance, the end of the enumeration of element types (which are written out in the third step).

Second step
In the second step, we directly append a representation of the children characteristics of the terminal symbols. By children characteristics we mean their rank and, concerning terminal symbols of rank 1, whether we are dealing with a left or a right child (consult Sect. 2.4 on page 7 for an explanation of why this information is necessary to reconstruct the input tree). Due to the fact that all terminal symbols have a rank of at most two, we can encode this information using two bits per symbol. Table 5 lists all the bit strings we use together with a brief description of their meanings.

    Bit string   Description
    00           rank 0
    01           rank 1, right child
    10           rank 1, left child
    11           rank 2

Table 5: The bit strings encoding the children characteristics together with their meaning.

We write out the children characteristics as follows: Firstly, we print out a bit string from Table 5 representing a certain children characteristic. After that we append the number of corresponding terminal symbols and finally we enumerate their IDs. We do this for the characteristics 00, 01 and 10. We omit the enumeration of all terminal symbols with a rank of 2 since their IDs can be reconstructed with the information at hand. In fact, we just need to subtract the set of IDs of all terminal symbols with children characteristics 00, 01 and 10 from the set of IDs of all terminal symbols from F (which is {1, 2, ..., |F|}). We also do not write out the ranks of the nonterminals from N since these can be easily reconstructed by counting the number of parameter nodes in the corresponding right-hand sides. The latter are written to the output file in the fourth step.

Third step
Third step

In this step, we print the element types of the terminal symbols in the ascending order of their IDs to the output file. We do this by writing out the ASCII code of every single letter. The individual names are terminated by the ASCII character ETX, which is assumed not to be used within the element types of the terminal symbols.
Fourth step
In this last step we serialize the productions of G in the ascending order of the IDs of their left-hand sides. For every production (A → t) ∈ P we just write out the IDs of the labels of t's nodes in preorder. We do not need to use special marker symbols to indicate the nesting structure of the symbols and their IDs, respectively. When parsing the output file, this hierarchy can easily be obtained by taking the individual ranks of the symbols into account.

We can also omit the specification of the left-hand side A since both its ID and its rank can be reconstructed with the information at hand. Imagine that we are parsing the output file to reconstruct the productions of G. If we are parsing the i-th production, the ID of its left-hand side must be |F| + 1 + i, where i ∈ {1, 2, …, |N| − 1}. As already mentioned, the rank of the left-hand side can be obtained by counting the parameter nodes in the right-hand side once this has been reconstructed.

Note that it is superfluous to insert separators between the representations of the productions from P since their boundaries can be calculated based on the ranks of the symbols. Again, imagine that we are trying to reconstruct the productions of P by parsing the output file of our algorithm. Let (A → t) ∈ P be the first production we encounter. The tree t can only consist of nodes labeled by terminal symbols, i.e., we must have t ∈ T(F). The ranks of all symbols from F are known since the necessary information was written to the compressed file in the second step. Therefore, we can easily reconstruct t by iteratively parsing the corresponding IDs in the output file. While doing so we are also able to count the number of occurrences of the symbol y ∈ Y in t. Thus, we are aware of the value of rank(A). After that, we proceed with decoding the second production (A′ → t′) ∈ P by iteratively parsing the next IDs. We have t′ ∈ T(F ∪ {A}), i.e., the ranks of all occurring symbols are known. That way all productions from P can be reconstructed.
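A minimal sketch of this parsing step (the nested-tuple tree representation and all names are our own; moreover, which of the two book terminals of Example 26 below carries the ID 5 is an assumption of this sketch):

    def parse_productions(ids, rank, num_f, num_n):
        """Reconstruct all productions from the flat preorder ID stream.

        ids:   the serialized right-hand sides, one production after another
        rank:  maps every known ID to its rank; grows as the ranks of the
               nonterminals are discovered
        num_f: the number of terminal symbols |F| (y carries the ID num_f+1)
        num_n: the number of non-start productions, i.e., |N| - 1
        """
        pos = 0

        def parse_tree():
            nonlocal pos
            node = ids[pos]
            pos += 1
            # a node labeled by a symbol of rank r is followed by r subtrees
            return (node, [parse_tree() for _ in range(rank[node])])

        productions = {}
        for i in range(1, num_n + 1):
            lhs = num_f + 1 + i              # ID of the i-th left-hand side
            start = pos
            rhs = parse_tree()
            # rank of the left-hand side = number of parameter nodes in rhs
            rank[lhs] = ids[start:pos].count(num_f + 1)
            productions[lhs] = rhs
        productions['S'] = parse_tree()      # the start production comes last
        return productions

    # Example 26 below (|F| = 6, y = 7); the terminal ranks are known from
    # the second step:
    ranks = {1: 1, 2: 0, 3: 1, 4: 1, 5: 2, 6: 1, 7: 0}
    stream = [4, 3, 2, 5, 8, 7, 1, 9, 9, 9, 9, 6, 8]
    print(parse_productions(stream, ranks, num_f=6, num_n=2))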
Example 26. In order to get a clear picture of the representation described above, we apply the previous four steps to the linear SLCF tree grammar G = (N, P, S) over the ranked alphabet F from Sect. 3.4 on page 18, i.e., we have N = {S, A₁, A₂} and P is the following set of productions:

S → books(A₂(A₂(A₂(A₂(book(A₁))))))
A₂(y) → book(A₁, y)
A₁ → author(title(isbn))

Note that the element type book occurs with two different children characteristics, so F contains two distinct terminal symbols named book (one of rank 1 and one of rank 2).

Symbol   books   isbn   title   author   book   book   y   A₁   A₂
ID       1       2      3       4        5      6      7   8    9

Fig. 40: All symbols with the IDs assigned to them. The symbol y is the symbol used to label the parameter nodes in the right-hand sides of P's productions.

First of all, we assign to every symbol from F ∪ (N \ {S}) ∪ {y} a unique ID as it is shown in Fig. 40. After that we are able to write out the grammar exactly as described above, resulting in the value sequence depicted in Fig. 41. We accomplish this task in four steps:

(1) We begin by writing out the number of terminals (6), directly followed by the number of nonterminals minus the start nonterminal (2); see the values 0 and 1 in the depiction.
(2) After that, the children characteristics of all terminal symbols are written to the file. We begin by specifying all terminal symbols of rank 0 (values 2–4). This is done by firstly writing out the corresponding bit string from Table 5 and the number of corresponding symbols (1). Finally, the ID of the terminal symbol isbn, which is the sole terminal symbol of rank 0, is listed. Analogously, the terminal symbols with the two rank-1 children characteristics are enumerated (values 5–12).
(3) Now the element types of all terminal symbols are exported to the output file (values 13–46). For each of them the decimal value of every ASCII character is written out. The element type books, for instance, is encoded by the sequence 98, 111, 111, 107, 115.
(4) Finally, the productions from P are written out in the ascending order of the IDs of their left-hand sides. Thus, the production with left-hand side A₁ is serialized as the very first production (values 47–49). It is encoded by the unambiguous sequence of IDs 4, 3, 2 representing the terminal symbols author, title and isbn of the right-hand side of A₁ in preorder. Afterwards, the remaining productions with left-hand sides A₂ (values 50–52) and S (values 53–59) are printed to the output file in this order. (Recall that the IDs were assigned to the nonterminals in such a way that A ∈ N \ {S} has a higher ID than B ∈ N \ {S} if A ⇒⁺_G B holds; therefore the right-hand side of the first production which is written out does not contain any node labeled by a nonterminal from N.)
Fig. 41: Representation of the grammar G from Example 26: the two counts, the children characteristics, the element types ('b' 'o' 'o' 'k' 's' ETX 'i' 's' 'b' 'n' ETX 't' 'i' 't' 'l' 'e' ETX 'a' 'u' 't' 'h' 'o' 'r' ETX 'b' 'o' 'o' 'k' ETX 'b' 'o' 'o' 'k' ETX) and the serialized productions.

Possible Optimizations

Of course, there is still room to further reduce the data which needs to be written to the output file. Consider, for instance, terminal symbols of the same element type but different children characteristics. In the case of our implementation, the element type of these symbols is written to the file two or three times in the third step. However, an optimization with respect to this redundancy only leads to marginally better compression results. This is due to the fact that typically the major part of the output file is the enumeration of the productions.

Still regarding the second step, we could at first determine the most frequent children characteristic and omit the enumeration of all corresponding terminal symbols. This dynamic approach certainly leads to a small reduction of the size of the output file compared to always skipping the children characteristic of rank 2.

Another aspect which offers optimization potential are the possibly long runs of the parameter symbol y which emerge when writing out the right-hand sides of productions of higher rank. In this case, run-length coding can lead to a better compression performance. However, we did not investigate this matter further since we focus on generating grammars with nonterminals of small maximal rank.

Even though a Huffman tree has to be serialized for every Huffman coding used within our output file, we decided in favor of using four distinct Huffman codings. We use three of them for encoding

– the start production,
– the remaining productions, the children characteristics of the terminal symbols, and the numbers of terminals and nonterminals, and finally
– the names of the terminals.

In the sequel, we call these three Huffman codings the base Huffman codings. The fourth Huffman coding, which we call the super Huffman coding, is used to encode the Huffman trees of the above codings. Our tests with different numbers of Huffman codings revealed that, in general, this approach leads to the best compression results, at least for most of the XML test documents we used.

Base Huffman Codings
We serialize the three base Huffman codings by writing out the lengths of the generated codes as described in Sect. 5.1 on page 48. However, we additionally apply a run-length coding and the super Huffman coding to achieve a compact binary representation. In Sect. 5.3 on page 54 we elaborate on how exactly the run-length coding works. In the sequel, we briefly call the length of a code of a base Huffman coding a base code length. Analogously, we denote the lengths of the codes of the super Huffman coding by the term super code lengths.

We output the number of base code lengths in front of every serialized base Huffman coding, i.e., in front of every enumeration of base code lengths. That way the decompressor knows how many bits are part of this binary representation. Let us point out that this number of code lengths is encoded using k bits instead of using the super Huffman coding, where k ∈ ℕ is a constant which is fixed at compile time. We do this due to the following fact: Let n ∈ ℕ be the number of code lengths and let us assume that we encode n, which is usually many times larger than the maximum over all code lengths, using the super Huffman coding. This would result in a big gap of unused integers between the super code lengths and n. This again would lead to a long list of 0's when storing the super Huffman tree by enumerating its code lengths. In general, this leads to a reduced compression performance compared to a fixed-length coding of n using k bits.

Super Huffman Coding
The super Huffman coding is also stored by the sequence of its code lengths. However, this relatively small set of integers is encoded by a fixed-length coding using n ∈ ℕ bits, where n is the smallest possible number of bits which can be used to encode all super code lengths. More precisely, we serialize the super Huffman coding in three steps:

(1) First of all, we print out the binary representation of the number n using k bits, where k ∈ ℕ is a fixed number of bits which is specified at compile time.
(2) Let m ∈ ℕ be the number of super code lengths which need to be enumerated, i.e., the biggest value (base code length or run indicator) which needs to be encoded, plus one. We print out the binary representation of m using k bits. With this information the decompressor knows that the next n · m bits make up the list of super code lengths.
(3) Finally, the binary representations of the m super code lengths are written to the output file using n bits for each code length. The super code lengths are printed in the natural order of the integers which are represented by the corresponding codes.
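A minimal sketch of these three steps (the bit-string output format and the names are our own; k is the compile-time constant mentioned above):

    def serialize_super_coding(super_lengths, k):
        """Serialize the super Huffman coding as fixed-length binary fields.

        super_lengths: the m super code lengths, listed in the natural
        order of the integers they encode.
        """
        n = max(1, max(super_lengths).bit_length())   # bits per code length
        out = format(n, f'0{k}b')                     # step (1): n in k bits
        out += format(len(super_lengths), f'0{k}b')   # step (2): m in k bits
        for length in super_lengths:                  # step (3): m fields of
            out += format(length, f'0{n}b')           #          n bits each
        return out

    # With, say, k = 6 and ten super code lengths of at most 7,
    # n = 3 bits per field suffice:
    print(serialize_super_coding([2, 3, 3, 3, 4, 4, 5, 5, 5, 2], k=6))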
Run-length Coding of the Base Code Lengths

In this section we explain the run-length coding which is applied to the enumerations of code lengths used to write all base Huffman codings to the output file. This additional encoding contributes substantially to the compactness of our representation. The bigger a code length is, the more different codes of that length are possible. At the same time, a sequence of several occurrences of the same code length within the enumeration of all code lengths becomes more likely. In addition, our experience shows that there frequently is a long run of 0's in the list of all code lengths, due to symbols to which no codes were assigned.

Example 27. Consider, for instance, the example from Sect. 5.3 and in particular the base Huffman coding C₃ which is listed in Table 6.3 on page 56. This Huffman coding does not assign codes to a large contiguous range of symbols. This results in a long sequence of zeros within the enumeration of the code lengths of C₃.
Definition 1. Let m, k ∈ ℕ, where k ≥ ⌊log₂(m)⌋ + 1. In the following we denote by bin_k(m) the (0-padded) binary representation b_{k−1} b_{k−2} … b_0 of m, i.e., the following holds:

  m = ∑_{i=0}^{k−1} b_i · 2^i

We encode an enumeration of code lengths using a run-length coding as follows: Let us assume that n ∈ ℕ is the maximum code length. Then we use the three additional integers n + 1, n + 2 and n + 3 to indicate certain types of runs; we call them run indicators in the sequel. Principally, all runs with a length of at most 3 are written directly to the output file. In contrast, a run of a code length m ∈ ℕ exceeding this bound is encoded as follows:

– If we have m > 0, we write out m followed by the run indicator n + 1 and a bit string of length 2, where the unit m (n + 1) bin₂(j) indicates j + 4 (i.e., 4–7) repetitions of the code length m. If k > 7 is the length of the run of m and l = k mod 7 (i.e., l ∈ {0, 1, …, 6}), then this run is encoded as follows:
  • if l > 3:  m (n + 1) bin₂(3), repeated ⌊k/7⌋ times, followed by m (n + 1) bin₂(l − 4)
  • if l ≤ 3:  m (n + 1) bin₂(3), repeated ⌊k/7⌋ times, followed by [m]_l
  Note that [m]_l denotes l many consecutive m's.
– If we have m = 0, we use the run indicator n + 2 with an appended bit string of length 3 to denote 4–11 repetitions of m (no symbol needs to precede the indicator, since n + 2 already implies the code length 0). In contrast, we use the run indicator n + 3 together with a bit string of length 7 to encode 12–139 repeated 0's. If k > 139 is the length of the run of 0's and l = k mod 139 (i.e., l ∈ {0, 1, …, 138}), then this run is encoded as follows:
  • if l > 11:  (n + 3) bin₇(127), repeated ⌊k/139⌋ times, followed by (n + 3) bin₇(l − 12)
  • if 3 < l ≤ 11:  (n + 3) bin₇(127), repeated ⌊k/139⌋ times, followed by (n + 2) bin₃(l − 4)
  • if l ≤ 3:  (n + 3) bin₇(127), repeated ⌊k/139⌋ times, followed by [0]_l
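A minimal sketch of this run-length coding under the reconstruction above (for readability it produces a list of tokens, with the fixed-length bit strings kept as separate string tokens; all names are our own):

    def rle_encode(lengths, n):
        """Run-length encode a list of code lengths with maximum value n."""
        def bits(v, k):                  # bin_k(v), the 0-padded binary string
            return format(v, f'0{k}b')

        out, i = [], 0
        while i < len(lengths):
            m, k = lengths[i], 1
            while i + k < len(lengths) and lengths[i + k] == m:
                k += 1                   # k = length of the run of m
            i += k
            if k <= 3:                   # short runs are written out directly
                out += [m] * k
            elif m > 0:                  # chunks of 7, remainder l = k mod 7
                out += [m, n + 1, bits(3, 2)] * (k // 7)
                l = k % 7
                if l > 3:
                    out += [m, n + 1, bits(l - 4, 2)]
                else:
                    out += [m] * l
            else:                        # chunks of 139 zeros each
                out += [n + 3, bits(127, 7)] * (k // 139)
                l = k % 139
                if l > 11:
                    out += [n + 3, bits(l - 12, 7)]
                elif l > 3:
                    out += [n + 2, bits(l - 4, 3)]
                else:
                    out += [0] * l
        return out

    # Example 28 below: a run of six 1's and a run of nine 0's with n = 5:
    print(rle_encode([1] * 6, 5))   # [1, 6, '10']   (n+1 = 6, bin_2(6-4) = 10)
    print(rle_encode([0] * 9, 5))   # [7, '101']     (n+2 = 7, bin_3(9-4) = 101)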
Table 6.1: Huffman coding C₁ used to encode the start production.

Table 6.2: Huffman coding C₂ used to encode the productions from P \ {S}, the children characteristics, and the numbers of terminals and nonterminals.

Table 6.3: Huffman coding C₃ used to encode the names of the terminal symbols.

Table 6.4: Super Huffman coding used to encode the code lengths of the base Huffman codings.

(Each of these tables lists, for every symbol, its old code and its newly assigned code.)

Example 28. Consider a sequence of integers containing a run of 1's of length 6 and a run of 0's of length 9, and let us assume that we want to encode this sequence using our run-length coding and that the maximum code length is n = 5. The run of 1's of length 6 is represented by the sequence 1, 6, 10 since we have n + 1 = 6 and bin₂(6 − 4) = 10. In contrast, the run of 0's of length 9 leads to the sequence 7, 101 because it holds that n + 2 = 7 and that bin₃(9 − 4) = 101.

Surprisingly, our investigations revealed that an approach which dynamically adjusts the lengths of the bit strings used in the above encoding depending on the size of the input grammar does not lead to significantly better compression results.
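Conversely, a decompressor can expand such a token sequence again; a minimal sketch under the same assumptions as the encoder given after Definition 1:

    def rle_decode(tokens, n):
        """Inverse of rle_encode: expand the run indicators n+1, n+2, n+3."""
        out, i = [], 0
        while i < len(tokens):
            t = tokens[i]
            if t == n + 2:                                # 4-11 zeros
                out += [0] * (int(tokens[i + 1], 2) + 4)
                i += 2
            elif t == n + 3:                              # 12-139 zeros
                out += [0] * (int(tokens[i + 1], 2) + 12)
                i += 2
            elif i + 1 < len(tokens) and tokens[i + 1] == n + 1:
                out += [t] * (int(tokens[i + 2], 2) + 4)  # run of t's
                i += 3
            else:                                         # a literal length
                out.append(t)
                i += 1
        return out

    # round trip for the two runs of Example 28:
    print(rle_decode([1, 6, '10', 7, '101'], 5))   # six 1's, then nine 0's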
Example 29. This example continues the encoding of the linear SLCF tree grammar G from Example 26 on page 51. Tables 6.1, 6.2 and 6.3 list the three base Huffman codings, called C₁, C₂ and C₃ in the sequel, which are calculated by our implementation. The columns labeled Old code show the initial Huffman codes while the columns labeled New code list the newly assigned codes after the necessary reorganization described in Sect. 5.1 on page 48.

While Fig. 41 on page 53 shows the second part of the output file as it is generated by a run of TreeRePair, Fig. 42 shows the first part of it. The latter stores the base Huffman codings C₁, C₂ and C₃ together with the corresponding super Huffman coding. For the sake of clarity, the corresponding values are denoted by their integer representations instead of by their fixed-length or Huffman codes. The Huffman coding C₁ from Table 6.1, for instance, is given by the sequence of code lengths ranging from value 13 to value 22, where value 12 informs us about the length of this sequence. Analogously, the code lengths of the Huffman codings C₂ and C₃ are given by the values 24–32 and 34–60, respectively. The sequence of code lengths of the Huffman coding C₃ exhibits a longer run, namely 94 consecutive occurrences of the code length 0. This run is encoded by the run indicator n + 3 = 9 and the bit string bin₇(94 − 12) = 1010010, where n = 6 is the maximal length of a code from C₃.

The super Huffman coding listed in Table 6.4 is written to the output file (values 2–11) using 3 bits per integer, as stated by value 0 of the output file. There need to be enumerated 10 super code lengths since the values 0–9, that is, the base code lengths 0, 1, …, 6 and the run indicators 7, 8 and 9 which are used by the base Huffman coding C₃, need to be encoded.

In the following, we compare the compression performance of our implementation of the Re-pair for Trees algorithm with existing algorithms. Furthermore, we check the impact of the DAG representation of the input tree on the achieved compression factors, and we investigate the influence of small changes to the maximal rank allowed for a nonterminal.

Fig. 42: Depiction of the part of the output file which contains the four serialized Huffman codings (the super Huffman coding followed by the base Huffman codings C₁, C₂ and C₃).
The set of XML documents we used for investigating the performance of TreeRePair consists of 23 files with different characteristics (cf. Table 6). Most of them were used in past papers evaluating various XML compressors and therefore may be familiar to the reader. The original files can be obtained from the sources listed in Table 7. In all cases character data, attributes, comments and namespace information were removed from the XML files, i.e., the XML documents consist only of start tags, end tags and empty element tags. We do so because, at this time, TreeRePair ignores this information and solely concentrates on the XML document tree.
Basically, we compare our implementation of Re-pair for Trees with two other compression algorithms based on linear SLCF tree grammars, namely BPLEX [BLM08] and Extended-Repair [Kri08,BHK10]. The former is a sliding-window based linear time approximation algorithm. It searches bottom-up in a fixed window for repeating tree patterns. The size of the sliding window, the maximal pattern size and the maximal rank of a nonterminal can be specified as input parameters. One of the main drawbacks of BPLEX is that there exists only a slowly running implementation of it.

Extended-Repair (which we sometimes call E-Repair in the sequel) is an algorithm developed by a group from the University of Paderborn, Germany [Kri08,BHK10]. This algorithm is, just like our Re-pair for Trees algorithm, based on the Re-pair algorithm introduced in [LM00]. However, it was independently developed and exhibits some fundamental differences to our algorithm. One of the main differences is that the Extended-Repair algorithm at first generates a DAG of the input tree and then processes each part of it individually, i.e., it generates multiple grammars which are combined in the end. The individual parts of the input tree are called "re-pair packets". The maximal size of each packet can be specified by an input parameter (default is 20 000 edges).
XML document     File size (kB)   Edges        Depth   Element types   Source
1998statistics   349              28 305       5       46              1
catalog-01
catalog-02       44 656           2 390 230    7       53              9
dblp             117 822          10 802 123   5       35              2
dictionary-01
dictionary-02    17 128           2 731 763    7       24              9
EnWikiNew
EnWikiQuote
EnWikiSource     13 457           1 133 534    4       20              3
EnWikiVersity
EnWikTionary     99 201           8 385 133    4       20              3
EXI-Array
EXI-factbook
EXI-Invoice      266              15 074       6       52              5
EXI-Telecomp
EXI-weblog
JST gene.chr1
JST snp.chr1     13 795           655 945      7       42              8
medline02n0328   51 751           2 866 079    6       78              6
NCBI gene.chr1
NCBI snp.chr1    63 941           3 642 224    3       15              8
sprot39.dat      111 175          10 903 567   5       48              7
treebank         19 551           2 447 726    36      251             4

Table 6: Characteristics of the XML documents used in our tests. The values in the "Source" column match the source IDs in Table 7. The depth of an XML document tree specifies the length (number of edges) of the longest path from the root of the tree to a leaf.

The author of [Kri08] points out that this packet-based behavior may have a negative impact on the compression performance of the Extended-Repair algorithm. Our own investigations concerning a TreeRePair version running on the DAG of the input tree instead of on the whole tree support this point of view. In [Kri08] it is shown that Extended-Repair achieves a much better compression ratio on the XML document
NCBI snp.chr1, when the input tree is not broken down into packets (this can be achieved by choosing the maximum packet size large enough). However, our experiments show that at the same time the memory requirements and the runtime of the Extended-Repair algorithm rise drastically. Note that, regarding our algorithm, the DAG representation is merely used to save memory resources and is almost completely transparent to the overlying digram replacement process (cf.
Sect. 4.5 on page 41).
Our experiments were done on a computer with an Intel Core processor; all algorithms were executed single-threadedly, i.e., no algorithm was able to make use of multiprocessing. TreeRePair and BPLEX were compiled with the gcc compiler using the -O3 (compile time optimizations) and -m32 (i.e., we generated them as 32-bit applications) switches.
Table 7: Sources of the XML documents from Table 6.

             TreeRePair   BPLEX   E-Repair   mDAG   bin. mDAG
Edges (%)    2.9          3.4     4.1        12.8   18.3

Table 8: Average values of the characteristics of the generated grammars and of the corresponding runs of the algorithms.

We were not able to compile the succ tool of the BPLEX distribution with compile time optimizations (i.e., using the -O3 switch). This tool is used to apply a succinct coding to a grammar generated by the BPLEX algorithm. However, this should not have a great influence on the runtime measured for BPLEX since the succ tool usually executes quite fast compared to the actual BPLEX algorithm. In contrast, Extended-Repair is an application written in Java for which we only had the bytecode at hand, i.e., we did not have access to its source code. We executed Extended-Repair using the Java SE Runtime Environment.

During the execution of the algorithms we always measured their memory usage. We accomplished this by constantly polling the VmRSS value which is printed out by executing the command cat /proc/<pid>/status.
Table 9: Average values of the characteristics of the runs of the algorithms when making a small size of the output file a top priority.

For our first comparison, we executed TreeRePair using the -optimize edges input parameter. Regarding Extended-Repair, we used the supplied
ConfEdges.xml configuration file, which is supposed to make Extended-Repair minimize the number of edges. The BPLEX algorithm was executed with its default input parameter values and no changes were made to the generated grammar (besides pruning nonterminals which are referenced only once, by using the supplied gprint tool).

Table 8 shows the average values of the essential characteristics of the final grammars generated by the three competing algorithms. The first row shows the average compression factors in terms of the number of edges in percent. The edge compression factor is computed as follows: if t ∈ T(F) is the binary representation of the input tree and G is the final grammar, we obtain the edge compression factor by computing |G| / |t| · 100. The second row shows the average number of nonterminals of the final grammars. For the sake of completeness, the average runtimes (in seconds), the average memory usages (in megabytes) and the average file size compression factors are also listed. The compression factor in terms of file size specifies the ratio between the file size of the succinct coding of the final grammar and the size of the input file, in percent. We also added two columns to Table 8 showing the average number of edges and the average number of nonterminals of the minimal DAGs of the input trees (mDAG) and of the minimal DAGs of the binary representations of the input trees (bin. mDAG).

As can be seen, on average, TreeRePair generates the smallest linear SLCF tree grammars (in terms of the number of edges) compared to the other two algorithms. At the same time, its grammars exhibit a small number of nonterminals. It outperforms BPLEX and Extended-Repair in terms of runtime and memory usage. The speed and the moderate requirements on main memory are a result of the transparent DAG representation of the input tree and of the many optimizations we made to the source code of TreeRePair during our investigations.

Figure 43.1 on page 63 gives an impression of how each of the three algorithms performs on the individual XML documents in terms of the size of the final grammar in edges. For each file, the algorithm which generates the largest grammar is set to 100%. In Appendix A.1 on page 66 there is a detailed table listing all relevant characteristics of the runs of the algorithms on the set of test XML documents.

In this section, we concentrate on the sizes of the files generated by the runs of the algorithms on our set of test XML documents. In fact, we execute each algorithm in a mode in which the size of the resulting file is made a top priority. For TreeRePair, we achieve this by specifying the input parameter -optimize filesize, and for Extended-Repair, we get such a behavior by using the supplied
ConfSize.xml configuration file and the -s 4 switch. The latter chooses a certain succinct coding of the Extended-Repair distribution which is supposed to generate very small representations of the generated grammar. Regarding BPLEX, we first apply the supplied gprint tool using the parameters --prune and --threshold 14. After that we use the succ tool of the BPLEX distribution together with the parameter --type 68 to generate a Huffman coding-based succinct coding of the corresponding grammar. In [MMS08] it is stated that this approach leads to the best compression performance of BPLEX in general (in terms of file size).

In addition to the above three algorithms, we also consider the compression results produced by gzip, bzip2 and XMill 0.8 [LS00]. We include them in our comparison to make it easier to get a handle on common compression rates and runtimes. The first two algorithms are widely used general purpose file compressors which, of course, produce a non-queryable compressed representation of the input file. In contrast, XMill is a compressor specialized in compressing the structure and, in particular, the character data of XML documents. In fact, it mainly concentrates on how to group the character data of an XML document in such a way that it can be efficiently compressed by general purpose compressors like gzip. Since its implementation does not exhibit a special "only consider the structure of the XML document" mode, it may be unfair to directly compare its compression results with those of TreeRePair, BPLEX or Extended-Repair. However, we included its compression results, which we obtained using its default input parameters, because we were interested in its performance in this setting.

Table 9 shows the average sizes of the output files generated by the six algorithms mentioned above. For the sake of completeness, the average runtime, the average memory usage, the average number of edges and the average number of nonterminals are also listed. Again, TreeRePair outperforms BPLEX and Extended-Repair regarding all considered characteristics. Surprisingly, its queryable output files are even smaller than the non-queryable ones produced by the highly optimized gzip and bzip2 algorithms. However, gzip (but interestingly not bzip2) runs much faster than TreeRePair on our test data.

Figure 43.2 gives an impression of how each of the six algorithms performs on the individual XML documents in terms of the size of the generated output file. For each file, the algorithm which generates the biggest output file is set to 100%. In Appendix A.2 on page 70 there is a detailed table listing all relevant characteristics of the runs of the algorithms on our set of test XML documents.
Fig. 43.1: Comparison of the number of edges of the final grammars.
Fig. 43.2: Comparison of the sizes of the output files.

             With DAG   Without DAG
Edges (%)    2.86       2.84

Table 10: Average values of the characteristics of the runs of TreeRePair with and without the DAG representation of the input tree.

Max. rank    0       1      2      3      4      5      6
Edges (%)    55.02   3.29   2.92   2.89   2.86   2.89   2.89

Table 11: Average values of the characteristics of the runs of TreeRePair with different maximal ranks allowed for a nonterminal.

Table 10 shows a comparison between the compression results of TreeRePair when using and when not using, respectively, the DAG representation described in Sect. 4.2 on page 33. The left column shows the values obtained when executing TreeRePair with its default parameters in edge optimization mode, i.e., we are only using the -optimize edges switch since our algorithm uses the DAG representation by default. In contrast, the right column is the result of running TreeRePair with the -no dag and -optimize edges switches. Again, in Appendix A.3 on page 75, there is a detailed table listing all relevant characteristics of the runs of the two TreeRePair configurations on each test XML document.

Regarding the differences between the compression results of TreeRePair and those of the competing algorithms, it can be said that the DAG representation only has a minor impact on the compression performance of our algorithm. However, it drastically reduces the memory demands of TreeRePair: it slashes the memory consumption by a factor of 4. Interestingly, even without the DAG representation, TreeRePair uses only half as much main memory as Extended-Repair does (cf.
Table 8). Furthermore, the DAG representation leads to a faster compression speed since it saves repetitive recalculations concerning equal subtrees.
We executed TreeRePair using the -optimize edges (i.e., we enabled the edge optimization mode) and the -max rank switches. Each time, we specified a different maximal rank for a nonterminal in order to get information concerning its influence on the compression performance. Table 11 shows that, regarding our set of test XML documents, a maximal rank of 4 leads to the best compression results on average.

At the same time, we can see that even when restricting the maximal rank to 1, TreeRePair performs better than BPLEX and Extended-Repair (cf. Table 8). The fact that large maximal ranks can lead to a worse compression ratio can be explained by the trees from Sect. 3.7 on page 24. Note that the trees from this section are basically long lists. Although this is not the case for our test trees, their shape is nevertheless similar to a list structure. In any case, it is quite distinct from the shape of a full binary tree, where an unlimited maximal rank leads to the best compression ratio (cf.
Sect. 3.6 on page 19).

References

[BGK03] Peter Buneman, Martin Grohe, and Christoph Koch. Path queries on compressed XML. In VLDB 2003: Proceedings of the 29th International Conference on Very Large Data Bases, pages 141–152. VLDB Endowment, 2003.
[BHK10] Stefan Böttcher, Rita Hartel, and Christoph Krislin. CluX: clustering XML sub-trees. In ICEIS 2010: Proceedings of the 12th International Conference on Enterprise Information Systems, 2010.
[BLM08] Giorgio Busatto, Markus Lohrey, and Sebastian Maneth. Efficient memory representation of XML document trees. Information Systems, 33(4–5):456–474, 2008.
[BPSM+08] Tim Bray, Jean Paoli, C. Michael Sperberg-McQueen, Eve Maler, and François Yergeau. Extensible Markup Language (XML) 1.0. W3C recommendation, XML Core Working Group, World Wide Web Consortium, November 2008.
[CLL+05] Moses Charikar, Eric Lehman, April Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005.
[Deu96] P. Deutsch. DEFLATE compressed data format specification version 1.3. http://tools.ietf.org/html/rfc1951, 1996.
[FGK03] Markus Frick, Martin Grohe, and Christoph Koch. Query evaluation on compressed trees (extended abstract). In LICS '03: Proceedings of the 18th Annual IEEE Symposium on Logic in Computer Science, pages 188–197. IEEE Computer Society Press, 2003.
[Kri08] Christoph Krislin. Optimierung grammatik-basierter XML-Kompression. Diploma thesis, Faculty for Electrical Engineering, Computer Science and Mathematics, University of Paderborn, Germany, 2008.
[LM00] N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722–1732, 2000.
[LM06] Markus Lohrey and Sebastian Maneth. The complexity of tree automata and XPath on grammar-compressed trees. Theoretical Computer Science, 363(2):196–210, 2006.
[LMSS09] Markus Lohrey, Sebastian Maneth, and Manfred Schmidt-Schauss. Parameter reduction in grammar-compressed trees. In Proceedings of FOSSACS 2009, number 5504 in Lecture Notes in Computer Science, pages 212–226. Springer, 2009.
[LS00] H. Liefke and D. Suciu. XMill: an efficient compressor for XML data. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, page 164. ACM Press, 2000.
[MLMK05] Makoto Murata, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi. Taxonomy of XML schema languages using formal language theory. ACM Transactions on Internet Technology, 5(4):660–704, 2005.
[MMS08] Sebastian Maneth, Nikolay Mihaylov, and Sherif Sakr. XML tree structure compression. In International Workshop on Database and Expert Systems Applications, pages 243–247, 2008.
[MSV03] Tova Milo, Dan Suciu, and Victor Vianu. Typechecking for XML transformers. Journal of Computer and System Sciences, 66(1):66–97, 2003.
[Nev02] Frank Neven. Automata theory for XML researchers. SIGMOD Record, 31(3):39–46, 2002.
[WLH07] Fangju Wang, Jing Li, and Hooman Homayounfar. A space efficient XML DOM parser. Data & Knowledge Engineering, 60(1):185–207, 2007.

Detailed Test Results
A.1 Optimization of Total Number of Edges

Algorithm     Edges    File size  Nonterminals  Runtime    Memory (MB)

1998statistics
TreeRePair    1.68%    0.20%      54            100ms      1
BPLEX         1.80%    0.34%      168           1.813s     295
E-Repair      1.69%    0.24%      37            7.518s     114
bin. mDAG     8.49%    -          31            -          -
mDAG          4.87%    -          15            -          -

catalog-01
TreeRePair    1.69%    0.10%      400           887ms      2
BPLEX         2.22%    0.22%      1251          6.548s     315
E-Repair      1.63%    0.12%      291           9.975s     279
bin. mDAG     3.10%    -          520           -          -
mDAG          3.80%    -          506           -          -

catalog-02
TreeRePair    1.11%    0.07%      965           9.409s     10
BPLEX         1.38%    0.11%      3045          30s        512
E-Repair      1.52%    0.11%      1499          42s        511
bin. mDAG     2.22%    -          805           -          -
mDAG          1.39%    -          792           -          -

dblp
TreeRePair    3.89%    0.59%      25250         43s        227
BPLEX         4.27%    0.73%      38712         57m 42s    1644
E-Repair      5.65%    0.68%      30430         4m 34s     510
bin. mDAG     19.36%   -          6592          -          -
mDAG          11.11%   -          3378          -          -

dictionary-01
TreeRePair    7.72%    1.54%      1676          1.010s     9
BPLEX         8.43%    2.37%      3994          44s        323
E-Repair      8.71%    1.83%      1248          16s        433
bin. mDAG     27.99%   -          2058          -          -
mDAG          21.07%   -          448           -          -

dictionary-02
TreeRePair    5.92%    1.38%      9757          11s        69
BPLEX         6.58%    1.95%      23209         6m 12s     587
E-Repair      8.52%    1.83%      11672         1m 40s     494
bin. mDAG     24.93%   -          16281         -          -
mDAG          19.96%   -          2414          -          -

EnWikiNew
TreeRePair    2.29%    0.21%      667           1.585s     8
BPLEX         2.40%    0.30%      1369          35s        337
E-Repair      2.42%    0.24%      476           12s        347
bin. mDAG     17.31%   -          23            -          -
mDAG          8.67%    -          29            -          -

EnWikiQuote
TreeRePair    2.42%    0.21%      452           1.158s     7
BPLEX         2.56%    0.31%      985           25s        321
E-Repair      2.58%    0.26%      323           9.924s     290
bin. mDAG     18.14%   -          19            -          -
mDAG          9.09%    -          25            -          -

EnWikiSource
TreeRePair    1.10%    0.10%      861           4.927s     26
BPLEX         1.28%    0.16%      1895          1m 9s      418
E-Repair      1.82%    0.18%      1106          23s        500
bin. mDAG     17.52%   -          19            -          -
mDAG          8.77%    -          24            -          -

EnWikiVersity
TreeRePair    1.44%    0.13%      525           2.107s     12
BPLEX         1.53%    0.18%      1043          34s        347
E-Repair      1.61%    0.15%      423           12s        437
bin. mDAG     17.60%   -          19            -          -
mDAG          8.81%    -          24            -          -

EnWikTionary
TreeRePair    0.97%    0.11%      4535          36s        183
BPLEX         1.09%    0.14%      6402          8m 58s     1287
E-Repair      1.48%    0.15%      6315          1m 33s     540
bin. mDAG     17.32%   -          26            -          -
mDAG          8.66%    -          30            -          -

EXI-Array
TreeRePair    0.41%    0.03%      123           1.281s     14
BPLEX         0.65%    0.06%      383           42s        322
E-Repair      0.53%    0.05%      142           8.017s     320
bin. mDAG     56.51%   -          8             -          -
mDAG          42.20%   -          13            -          -

EXI-factbook
TreeRePair    2.35%    0.31%      145           271ms      2
BPLEX         4.11%    0.77%      1423          5.138s     298
E-Repair      2.58%    0.31%      146           11s        408
bin. mDAG     9.16%    -          236           -          -
mDAG          8.07%    -          293           -          -

EXI-Invoice
TreeRePair    0.68%    0.21%      14            74ms       1
BPLEX         0.62%    0.30%      40            1.483s     293
E-Repair      0.93%    0.24%      20            4.689s     119
bin. mDAG     13.74%   -          6             -          -
mDAG          7.12%    -          15            -          -

EXI-Telecomp
TreeRePair    0.07%    0.01%      21            780ms      3
BPLEX         0.06%    0.02%      47            9.684s     310
E-Repair      0.08%    0.02%      21            11s        452
bin. mDAG     11.15%   -          10            -          -
mDAG          5.59%    -          15            -          -

EXI-weblog
TreeRePair    0.06%    0.01%      13            324ms      3
BPLEX         0.04%    0.01%      24            9.097s     303
E-Repair      0.05%    0.02%      11            7.868s     279
bin. mDAG     18.19%   -          2             -          -
mDAG          9.10%    -          2             -          -

JST gene.chr1
TreeRePair    1.84%    0.10%      354           874ms      3
BPLEX         2.19%    0.19%      1113          11s        315
E-Repair      2.99%    0.17%      126           8.006s     233
bin. mDAG     6.75%    -          114           -          -
mDAG          4.24%    -          76            -          -

JST snp.chr1
TreeRePair    1.51%    0.09%      856           3.150s     8
BPLEX         2.15%    0.21%      4193          31s        360
E-Repair      1.54%    0.10%      634           15s        445
bin. mDAG     6.20%    -          282           -          -
mDAG          3.59%    -          242           -          -

medline02n0328
TreeRePair    4.13%    0.35%      9064          16s        79
BPLEX         5.17%    0.62%      33976         5m 52s     574
E-Repair      6.73%    0.54%      13010         1m 32s     479
bin. mDAG     25.84%   -          20013         -          -
mDAG          22.80%   -          3960          -          -

NCBI gene.chr1
TreeRePair    1.37%    0.09%      504           1.374s     4
BPLEX         2.38%    0.28%      3631          14s        327
E-Repair      1.68%    0.11%      328           10s        308
bin. mDAG     3.98%    -          605           -          -
mDAG          4.45%    -          436           -          -

NCBI snp.chr1
TreeRePair    <        <          <             <

sprot39.dat
TreeRePair    2.30%    0.38%      20224         43s        178
BPLEX         3.16%    0.79%      111167        14m 41s    1446
E-Repair      4.27%    0.59%      33102         3m 48s     499
bin. mDAG     13.18%   -          31116         -          -
mDAG          16.07%   -          10243         -          -

treebank
TreeRePair    20.72%   4.41%      32857         22s        164
BPLEX         23.29%   6.16%      76109         21m 27s    645
E-Repair      34.85%   6.03%      48358         6m 50s     526
bin. mDAG     59.42%   -          43586         -          -
mDAG          53.75%   -          24746         -          -

A.2 Optimization of File Size

Algorithm     Edges    File size  Nonterminals  Runtime    Memory (MB)
1998statistics
TreeRePair    1.77%    0.20%      35            109ms      1
BPLEX         2.19%    0.25%      27            2.018s     295
E-Repair      1.68%    0.24%      37            4.578s     108
bzip2         -        0.29%      -             229ms      4
gzip          -        0.81%      -             8ms        -
XMill         -        0.24%      -             2.728s     2

catalog-01
TreeRePair    1.76%    0.10%      279           898ms      2
BPLEX         2.23%    0.14%      342           6.834s     315
E-Repair      2.77%    0.19%      236           11s        349
bzip2         -        0.24%      -             2.701s     8
gzip          -        0.85%      -             51ms       -
XMill         -        0.11%      -             12s        2

catalog-02
TreeRePair    1.12%    0.07%      770           10s        10
BPLEX         1.27%    0.08%      948           32s        512
E-Repair      1.49%    0.12%      1692          47s        521
bzip2         -        0.23%      -             28s        8
gzip          -        0.81%      -             450ms      -
XMill         -        0.09%      -             1m 58s     12

dblp
TreeRePair    4.03%    0.58%      14533         43s        227
BPLEX         4.52%    0.65%      11693         61m 15s    1644
E-Repair      5.52%    0.68%      35125         42m 48s    516
bzip2         -        0.56%      -             1m 11s     8
gzip          -        1.30%      -             1.230s     -
XMill         -        0.53%      -             11m 36s    15

dictionary-01
TreeRePair    8.08%    1.47%      930           1.117s     9
BPLEX         9.67%    1.85%      1044          46s        323
E-Repair      8.51%    1.81%      1428          19s        462
bzip2         -        1.52%      -             1.313s     7
gzip          -        3.07%      -             39ms       -
XMill         -        1.49%      -             17s        2

dictionary-02
TreeRePair    6.15%    1.32%      5024          11s        69
BPLEX         7.56%    1.63%      5424          6m 12s     587
E-Repair      8.30%    1.81%      13698         1m 57s     475
bzip2         -        1.52%      -             15s        7
gzip          -        3.05%      -             279ms      -
XMill         -        1.49%      -             2m 41s     13

EnWikiNew
TreeRePair    2.38%    0.20%      390           1.721s     8
BPLEX         2.63%    0.23%      335           35s        337
E-Repair      2.42%    0.24%      476           12s        369
bzip2         -        0.26%      -             2.999s     8
gzip          -        0.90%      -             57ms       -
XMill         -        0.23%      -             23s        2

EnWikiQuote
TreeRePair    2.51%    0.20%      274           1.195s     7
BPLEX         2.81%    0.23%      236           25s        321
E-Repair      2.58%    0.26%      323           10s        268
bzip2         -        0.28%      -             2.013s     8
gzip          -        0.93%      -             36ms       -
XMill         -        0.24%      -             15s        2

EnWikiSource
TreeRePair    1.14%    0.10%      515           5.025s     26
BPLEX         1.40%    0.13%      535           1m 10s     418
E-Repair      1.82%    0.18%      1127          23s        488
bzip2         -        0.16%      -             8.742s     8
gzip          -        0.63%      -             131ms      -
XMill         -        0.12%      -             1m 4s      9

EnWikiVersity
TreeRePair    1.50%    0.12%      303           2.244s     12
BPLEX         1.70%    0.15%      287           36s        347
E-Repair      1.61%    0.15%      423           13s        415
bzip2         -        0.19%      -             3.698s     8
gzip          -        0.69%      -             59ms       -
XMill         -        0.15%      -             28s        2

EnWikTionary
TreeRePair    1.00%    0.11%      2575          37s        183
BPLEX         1.15%    0.13%      2062          9m 13s     1287
E-Repair      1.48%    0.15%      6314          1m 40s     526
bzip2         -        0.17%      -             57s        8
gzip          -        0.68%      -             938ms      -
XMill         -        0.13%      -             7m 25s     15

EXI-Array
TreeRePair    0.44%    0.03%      75            1.393s     14
BPLEX         0.77%    0.05%      124           43s        322
E-Repair      0.51%    0.05%      155           7.833s     312
bzip2         -        0.05%      -             3.250s     8
gzip          -        0.37%      -             67ms       -
XMill         -        0.03%      -             10s        6

EXI-factbook
TreeRePair    2.51%    0.31%      99            356ms      2
BPLEX         6.44%    0.58%      170           5.333s     298
E-Repair      2.59%    0.31%      151           12s        438
bzip2         -        0.78%      -             854ms      8
gzip          -        1.10%      -             17ms       -
XMill         -        0.29%      -             5.248s     1

EXI-Invoice
TreeRePair    0.72%    0.21%      11            147ms      2
BPLEX         0.78%    0.28%      8             1.406s     293
E-Repair      0.91%    0.24%      21            4.320s     113
bzip2         -        0.30%      -             191ms      3
gzip          -        0.64%      -             7ms        -
XMill         -        0.26%      -             1.256s     2

EXI-Telecomp
TreeRePair    0.08%    0.01%      12            829ms      3
BPLEX         0.07%    0.02%      15            9.548s     310
E-Repair      0.08%    0.02%      24            13s        450
bzip2         -        0.09%      -             2.363s     8
gzip          -        0.45%      -             36ms       -
XMill         -        0.02%      -             11s        2

EXI-weblog
TreeRePair    0.06%    0.01%      9             400ms      3
BPLEX         0.05%    0.01%      12            9.004s     303
E-Repair      0.05%    0.02%      12            7.942s     288
bzip2         -        0.06%      -             720ms      8
gzip          -        0.40%      -             14ms       -
XMill         -        0.02%      -             8.342s     2

JST gene.chr1
TreeRePair    1.91%    0.10%      227           906ms      3
BPLEX         2.42%    0.13%      211           11s        315
E-Repair      2.99%    0.17%      128           9.947s     211
bzip2         -        0.14%      -             2.599s     8
gzip          -        0.67%      -             43ms       -
XMill         -        0.10%      -             14s        2

JST snp.chr1
TreeRePair    1.58%    0.08%      537           3.213s     8
BPLEX         2.45%    0.14%      569           32s        360
E-Repair      1.51%    0.10%      673           15s        453
bzip2         -        0.18%      -             9.251s     8
gzip          -        0.79%      -             149ms      -
XMill         -        0.09%      -             40s        8

medline02n0328
TreeRePair    4.32%    0.34%      4923          16s        79
BPLEX         6.47%    0.46%      6717          5m 45s     574
E-Repair      6.71%    0.54%      13243         1m 38s     477
bzip2         -        0.49%      -             31s        7
gzip          -        1.26%      -             544ms      -
XMill         -        0.34%      -             2m 13s     13

NCBI gene.chr1
TreeRePair    1.43%    0.09%      354           1.442s     4
BPLEX         3.00%    0.16%      464           14s        327
E-Repair      1.66%    0.11%      342           10s        265
bzip2         -        0.15%      -             4.110s     8
gzip          -        0.71%      -             65ms       -
XMill         -        0.08%      -             21s        8

NCBI snp.chr1
TreeRePair    <        <          <             <

sprot39.dat
TreeRePair    2.41%    0.37%      11699         43s        178
BPLEX         4.33%    0.53%      11783         13m 43s    1446
E-Repair      4.25%    0.59%      33700         3m 59s     497
bzip2         -        0.45%      -             1m 11s     8
gzip          -        1.20%      -             1.122s     -
XMill         -        0.36%      -             9m 52s     15

treebank
TreeRePair    21.59%   4.28%      17186         22s        164
BPLEX         26.21%   5.37%      21302         21m 36s    646
E-Repair      34.53%   6.01%      51470         7m 44s     514
bzip2         -        5.26%      -             6.407s     7
gzip          -        9.65%      -             843ms      -
XMill         -        4.51%      -             1m 36s     12

A.3 Without Using DAG Representation

Algorithm     Edges    File size  Nonterminals  Runtime    Memory (MB)
1998statistics
Without DAG   1.62%    0.20%      53            121ms      4
With DAG      1.68%    0.20%      54            214ms      1

catalog-01
Without DAG   1.69%    0.10%      400           1.381s     20
With DAG      1.69%    0.10%      400           1.022s     3

catalog-02
Without DAG   1.11%    0.07%      967           15s        199
With DAG      1.11%    0.07%      965           9.584s     10

dblp
Without DAG   3.89%    0.59%      25039         55s        1015
With DAG      3.89%    0.59%      25250         44s        227

dictionary-01
Without DAG   7.63%    1.51%      1622          1.238s     25
With DAG      7.72%    1.54%      1676          1.044s     9

dictionary-02
Without DAG   5.88%    1.36%      9390          12s        238
With DAG      5.92%    1.38%      9757          11s        69

EnWikiNew
Without DAG   2.28%    0.21%      656           2.042s     37
With DAG      2.29%    0.21%      667           1.732s     8

EnWikiQuote
Without DAG   2.41%    0.21%      458           1.320s     24
With DAG      2.42%    0.21%      452           1.223s     7

EnWikiSource
Without DAG   1.09%    0.10%      863           5.652s     101
With DAG      1.10%    0.10%      861           5.087s     26

EnWikiVersity
Without DAG   1.43%    0.13%      522           2.472s     45
With DAG      1.44%    0.13%      525           2.229s     12

EnWikTionary
Without DAG   0.97%    0.11%      4539          42s        743
With DAG      0.97%    0.11%      4535          38s        183

EXI-Array
Without DAG   0.40%    0.03%      122           1.378s     21
With DAG      0.41%    0.03%      123           1.394s     14

EXI-factbook
Without DAG   2.34%    0.31%      144           331ms      6
With DAG      2.35%    0.31%      145           330ms      2

EXI-Invoice
Without DAG   0.61%    0.21%      12            85ms       3
With DAG      0.68%    0.21%      14            124ms      1

EXI-Telecomp
Without DAG   0.06%    0.01%      17            1.132s     17
With DAG      0.07%    0.01%      21            850ms      3

EXI-weblog
Without DAG   0.05%    0.01%      10            607ms      10
With DAG      0.06%    0.01%      13            400ms      3

JST gene.chr1
Without DAG   1.73%    0.09%      299           1.365s     21
With DAG      1.84%    0.10%      354           910ms      3

JST snp.chr1
Without DAG   1.50%    0.09%      841           4.187s     59
With DAG      1.51%    0.09%      856           3.287s     8

medline02n0328
Without DAG   4.11%    0.34%      8524          17s        235
With DAG      4.13%    0.35%      9064          17s        79

NCBI gene.chr1
Without DAG   1.37%    0.09%      486           1.959s     32
With DAG      1.37%    0.09%      504           1.498s     4

NCBI snp.chr1
Without DAG   <        <          <             <

sprot39.dat
Without DAG   2.31%    0.37%      18516         55s        936
With DAG      2.30%    0.38%      20224         44s        178

treebank