XQuery Streaming by Forest Transducers


Shizuya Hakuta
DOCOMO Datacom, Inc., Japan

Sebastian Maneth †
University of Oxford, UK

Keisuke Nakano and Hideya Iwasaki
The University of Electro-Communications, Japan

November 5, 2018

Abstract

Streaming of XML transformations is a challenging task and only a few existing systems support streaming. Research approaches generally define custom fragments of XQuery and XPath that are amenable to streaming, and then design custom algorithms for each fragment. These languages have several shortcomings. Here we take a more principled approach to the problem of streaming XQuery-based transformations. We start with an elegant transducer model for which many static analysis problems are well-understood: the Macro Forest Transducer (MFT). We show that a large fragment of XQuery can be translated into MFTs; indeed, a fragment of XQuery that can express important features that are missing from other XQuery stream engines, such as GCX: our fragment of XQuery supports XPath predicates and let-statements. We then use an existing streaming engine for MFTs and apply a well-founded set of optimizations from functional programming such as strictness analysis and deforestation. Our prototype achieves time and memory efficiency comparable to the fastest known engine for XQuery streaming, GCX. This is surprising because our engine relies on the OCaml built-in garbage collector and does not use any specialized buffer management, while GCX's efficiency is due to clever and explicit buffer management.

This is the full version of the paper published in the proceedings of ICDE 2014.
† The present affiliation of this author (Sebastian Maneth) is University of Edinburgh (UK).

Data is often transmitted as a continuous stream; e.g., sensor readings such as weather data, or text messages such as news feeds. Streams are also used to transmit large data that does not fit into memory. Imagine now that we want to query or transform streamed data, and that the data is tree structured (e.g., in XML or JSON). Doing this within limited memory is a challenging task. Only a few systems support streaming of XML transformations. These systems work on a "best effort" basis and try to use as little memory as possible, but do not give guarantees on the amount of memory used. XML transformations are conveniently expressed in XQuery and XSLT. An example of a best-effort streaming engine for XSLT is Kay's Saxon [17].

Research approaches generally proceed by defining custom fragments of XQuery and XPath that are amenable to streaming, and then design custom algorithms for each fragment. Examples are the forward XPath of Olteanu's Spex [31] and the XQuery fragment of Koch, Scherzinger, and Schmidt's GCX [18]. GCX is the fastest XQuery streaming engine that we know. These languages have several shortcomings: they lack the expressiveness needed for important transformations (for instance, they cannot express XPath predicates). Since these languages have been designed specifically for streamability, they are difficult to analyze and less amenable to more general optimizations, other than those optimizations specifically engineered for the streaming problems. For instance, their properties with respect to sequential composition are not well-understood, hence they cannot easily be used in conjunction with other transformations and filters outside of the fragment.

In this paper, we take a different and more principled approach to the problem of streaming XQuery-based transformations. We start with an elegant transducer model that has been studied in the literature extensively, and for which many static analysis problems are well-understood, namely the Macro Forest Transducer of Perst and Seidl [32]. Known properties for this model include: effective exact type checking, composition closure with many known classes of transformations, decidable strictness analysis, decidable equivalence check for transformations of linear size increase, and the possibility to represent outputs succinctly using grammar-based compression. We show that a large fragment of XQuery can be translated into MFTs; indeed, a fragment of XQuery that can express the important features missing from the GCX language, such as XPath predicates and let-expressions. We rely on a streaming execution engine for MFTs implemented in OCaml by Nakano and Mu [30]. We use a well-founded set of MFT optimizations that are common in functional programming. In particular, we apply deforestation [39], which has been heavily studied and applied in functional programming languages like Haskell. Thus, we have moved XQuery streaming from a "one off" problem to a well-understood problem within transducer programming. Further, we can exploit features of MFTs not present in our XQuery fragment. For instance, MFTs naturally support recursive function definitions. It is thus possible to translate a given XQuery program into an MFT, and to then change the MFT by adding recursive definitions. Alternatively, it would be possible to write a recursive MFT program which uses small XQuery programs in the right-hand sides of its rules. Another convenient feature of MFTs is their ability to validate the input during transformation. This allows checking an XML Schema or Relax NG schema in one pass during the streaming transformation.

Our contributions are summarized as:
• We formalize a translation from a fragment of XQuery into macro forest transducers. We implemented a prototype system of our translation.
• We demonstrate the efficiency of our system and compare it experimentally with GCX. In a nutshell, our system performs on par with GCX.
• We present three different optimizations: two forms of parameter removal, and the removal of stay moves. Together, they often induce a speedup of one order of magnitude.
• We prove big-O complexity bounds for composition constructions for MFTs. In the literature these constructions have exponential time complexity. Using a particular feature of our MFTs ("stay moves") we are able to prove a quadratic time complexity. Roughly speaking, stay moves allow intermediate rule trees to be compressed.

Let us now explain in more detail some of the useful known properties of macro forest transducers that we use. They can be applied to XQuery programs thanks to our translation of XQuery into MFTs.

(1) Streaming: MFTs traverse an input forest by applying rules based on structural recursion. Nakano and Mu [30] show that any MFT-style program can be streamed thanks to the structural recursion restriction. Their streaming is based on the composition of an MFT and an XML parsing transducer in a way similar to that of a macro tree transducer and a top-down tree transducer. They illustrate that the obtained transducer can be naturally implemented as a pushdown machine which directly processes an input XML stream and produces the output XML stream. Their streaming approach has, however, the disadvantage that it is hard for programmers to write an MFT-style program for an XML transformation. Our XQuery-to-MFT translation makes up for this shortcoming. Programmers can now write transformations in the user-friendly query language XQuery instead of MFT-style programs to obtain XML stream processors. Due to the expressiveness of MFTs, our XQuery streaming supports even complex queries which contain nested loops and multiple variable accesses.

(2) Composition: MFTs themselves are not closed under composition. Several important subclasses, however, are closed under composition. For instance, MFTs of linear size increase are MSO definable and are therefore closed under composition; this follows from Engelfriet and Maneth's result that macro tree transducers of linear size increase are MSO definable [6] and Maneth's result that the composition hierarchy of macro tree transducers collapses for linear size increase [23]. Note also that the linear size increase property is decidable for MFTs. If an MFT does not use context parameters then it is called a "top-down forest transducer" (FT). FTs are also not closed under composition; however, we show that the composition of two FTs \(M_1\) and \(M_2\) can be realized by one MFT. We give an explicit construction for this result, and show that its time complexity is \(O(|\Sigma|\,|M_1|\,|M_2|)\), where \(|\Sigma|\) is the size of the alphabet of element/attribute and text constants used by the transducers. This is possible in the presence of stay moves (or by using DAG-compressed right-hand sides); the classical constructions for composition of top-down tree transducers by Rounds [33] and Baker [1] are in fact exponential in the size of the first transducer \(M_1\).

(3) Static Analyses: A powerful feature of MFTs is inverse type inference: regular tree languages are effectively preserved by inverses of MFT translations. This makes exact type checking possible [24, 27]. It also makes it possible to check the parameter strictness of the states of a transducer. We use a simple version of strictness analysis in this paper in order to reduce the number of context parameters of an MFT; this is similar to deaccumulation, see [15]. Parameter reduction greatly improves the efficiency of the MFTs that are obtained via our translation from XQuery. Another important static analysis (not used here) is equivalence checking: if two MFTs are of linear size increase then their equivalence is decidable [7].

Related Work.

Streaming of XPath has been studied extensively. The fundamental article of Green et al. [16] shows how to translate filter-less descendant/child-XPath queries into finite state word automata (DFAs). The DFAs are executed top-down through the tree, using memory proportional to the depth of the XML document tree. Several systems are based on finite automata, such as Xfilter, XTrie, YFilter, PrefixFilter, AFilter, and the XPush machine. Streaming of XPath queries that include filters is difficult, because candidate nodes that depend on a filter need to be stored in memory. Recent work of Niehren et al. studies this in detail, see, e.g., [14, 13, 3]. Bar-Yossef, Fontoura, and Josifovski [2] study theoretical bounds as well as practical streaming algorithms for XPath. Shalem and Bar-Yossef [38] study twig-join algorithms over XML streams.

Early systems for XQuery streaming include the BEA processor [11], FluxQuery [19], and the Raindrop system [41]. Streaming of XQuery based on physical algebra operators is presented in [10]. There has been a transducer-based approach for streaming of XQuery [22], but the transducers are more restrictive than our macro forest transducers. Streaming of XSLT has been considered by Dvorakova [4] using tree transducers. The transducers are similar to tree-walking tree transducers (see, e.g., [28]) as they can use XPath expressions with forward and backward axes in the right-hand sides of their rules. For a given XSLT program, they present an analysis that attempts to find a smallest class of transducers that captures the given transformation. The classes are distinguished by the number of passes and the memory needed. No experimental evaluation of Dvorakova's work is available. A mature commercial XSLT engine is SAXON by Kay [17]. Its performance is lower than that of GCX and our engine, but SAXON is not comparable to these systems because it implements the full W3C standard of XSLT while GCX and our engine are proof-of-concept prototypes supporting restricted subsets of XQuery. Note that the main purpose of the new XSLT 3.0 specification is to support streaming. For this, new primitives and modes are introduced for indicating the desire to stream. A set of rules determines if the given program can indeed be streamed. Their memory requirement is that "not all input and output nodes are held in memory".

There are several works on streaming XML transformations based on programming language theoretic approaches. In a direction similar to streaming MFTs [30], Frisch and Nakano proposed stream processing for term rewriting systems (TRS) [12], which is more powerful than MFTs because of their Turing completeness. However, it is still hard to write an XML transformation in TRS and it requires careful programming for efficient stream processing. Kodama, Suenaga, and Kobayashi applied type theories to obtain stream processing [20]. They employ an ordered linear type system for guaranteeing the possibility of streaming. In both approaches, the transformation must be written in their own programming languages instead of existing XML processing languages like XQuery and XSLT.

An XML document represents an ordered unranked tree. The nodes of this tree are of different types: element nodes, text nodes, attribute nodes, processing instructions, etc. Here we distinguish three types: element nodes, attribute nodes, and text nodes. Our techniques easily extend to other types of nodes. Element nodes have an arbitrary number of children and are written in XML as

    <elementname a_1="v_1" ... a_n="v_n"> ... </elementname>

where elementname is the name of the element node, \(a_i\) are attribute names, \(v_i\) are text values (given as strings between double quotes), and "..." (possibly) contains further descendant nodes of this element node. In our unranked tree model, the first \(n\) children of an element node are attribute nodes labeled \(a_1, \dots, a_n\); each attribute node \(a_i\) has exactly one child, which is a text node labeled \(v_i\). Text nodes have no children, i.e., they are leaves of the tree; they are labeled by a "text content" which is a character sequence appearing in the XML document. For instance, this XML snippet

    <book isbn="123" price="$99"><author>Knuth</author><title>Art of Programming</title></book>

represents the unranked tree in Figure 1. It consists of a root node labeled "book" which has four children, labeled "isbn", "price", "author", and "title", respectively. Each of these children nodes has exactly one child that is a text node (labeled "123", "$99", etc).

[Figure 1: An XML forest; text nodes are circled and attribute nodes are boxed.]

    query     ::= element | clause
    element   ::= <elementname> { element | string | { clause } }* </elementname>
    clause    ::= for $var in ordpath return query
                | let $var := query return query
                | ordpath
                | ( query { , query }+ )
    ordpath   ::= $var { pathstep }*
    pathstep  ::= / axis :: nodetest { [ predicate ] }*
    axis      ::= child | descendant | following-sibling
    nodetest  ::= elementname | * | text() | node()
    predicate ::= predpath | empty( predpath ) | predpath = "string" | predpath != "string"
    predpath  ::= . { pathstep }*

Figure 2: Syntax of MinXQuery

Formally, each node has a type and a name. For us, both are part of the label, i.e., labels are of the form (type, name). We abstract from such pairs and assume that every node is labeled by a word in \(U^+\), where \(U\) is a fixed universal (finite) alphabet of characters. For simplicity, we represent text nodes as special element nodes with an empty child forest. We define XML forests as sequences of unranked trees.
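The tree model described here can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `to_tree` and `term` are hypothetical, the `book` snippet is a reconstruction that matches the description of Figure 1, and the sketch relies on `xml.etree.ElementTree` keeping attributes in document order (true for current CPython versions). Attribute nodes become the first children of an element, each with a single text child.

```python
import xml.etree.ElementTree as ET

def to_tree(element):
    """Return (label, children); attribute nodes come first, each with one text child."""
    children = [(name, [(value, [])]) for name, value in element.attrib.items()]
    if element.text and element.text.strip():
        children.append((element.text.strip(), []))      # leading text node
    for child in element:
        children.append(to_tree(child))
        if child.tail and child.tail.strip():
            children.append((child.tail.strip(), []))    # text after a child element
    return (element.tag, children)

def term(tree):
    """Render a tree in term notation: label(child child ...)."""
    label, children = tree
    return label + "(" + " ".join(term(c) for c in children) + ")"

# Assumed snippet, consistent with the four children described for Figure 1.
SNIPPET = ('<book isbn="123" price="$99"><author>Knuth</author>'
           '<title>Art of Programming</title></book>')
book = to_tree(ET.fromstring(SNIPPET))
print(term(book))
```

In this encoding text nodes and empty elements both have an empty child list; a real implementation would also record the node type, as the labels of the form (type, name) suggest.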

Definition 1. An XML forest is a sequence \(t_1 \cdots t_n\) where \(n \ge 0\) and \(t_1, \dots, t_n\) are unranked trees. An unranked tree consists of a root node labeled by a word in \(U^+\) and a (possibly empty) sequence of subtrees. The set of all XML forests is denoted by \(F\). We often write a forest in term notation, i.e., generated by this EBNF:

    forest ::= ε | tree forest
    tree   ::= label ( forest )
    label  ::= U+

We consider a downward navigational fragment of XQuery, called MinXQuery. XPath expressions in our queries may use the child, descendant, and following-sibling axes. Filters may test the existence of a path, or may compare a text node or attribute value against a constant string value. We do not allow where-clauses, "order by"-statements, recursive function definitions, and joins. Thus, MinXQuery expressions consist of nested for-loops and let-statements. We do not discuss text predicates such as starts-with and contains; they are easy to support and are part of our prototype.

Figure 2 shows an EBNF of the MinXQuery fragment. In addition to the syntax defined in that figure we impose the following restrictions on an XQuery program:
• The input document is bound to a special variable named $input.
• Every XPath expression starts with a variable which has been introduced in the nearest enclosing for clause, or, if no such for clause exists, with the variable $input.

Although the second restriction rules out join queries, we can still write many practical queries by utilizing XPath predicates and nested loops. We do not define the semantics of XQuery programs here; see, e.g., [40]. As an example consider the following MinXQuery program:

    for $v1 in $input/descendant::a return
      for $v2 in $v1/descendant::b return
        let $v3 := $v2/descendant::c return
          let $v4 := $v2/descendant::d return
            ($v1,$v2,$v3,$v4)

Consider the evaluation of this program on the following input document:

Let us refer to the first b-node in the document by \(b_1\), to the second one (in preorder) by \(b_2\), etc., and similarly for the other node labels. The first sequence of subtrees that is output consists of the subtrees rooted at an a-node, at a b-node below it, and at the c-nodes and d-nodes below that b-node (two of each). Another sequence of subtrees is also output, whose roots are an a-node, a b-node, and a d-node.

A forest transducer is a finite state machine that takes as input an XML forest and produces as output an XML forest. Recall that each node of an XML forest is labeled by a non-empty word over \(U\). In our transducers we abstract from the \(U\)-characters forming the label of a node by fixing a finite set \(\Sigma\) of words in \(U^+\) that are of interest to the transducer. We refer to elements of \(\Sigma\) as "symbols". A rule for a state of a transducer tests if the current input node is labeled by a given symbol \(\sigma \in \Sigma\). A state also has a "default rule" which is applicable if no other rule of that state applies; the default rule applies to any \(U^+\)-labeled node. The parsing of an input forest according to the EBNF given in Definition 1 also provides us with the information whether the empty forest \(\varepsilon\) is reached. For instance, the forest \(a(b())\) is parsed as \(a(b(\varepsilon)\,\varepsilon)\,\varepsilon\). Our transducers may use this information on occurrences of \(\varepsilon\): for each state we require that the transducer has a rule for the input \(\varepsilon\).

We define the fixed set \(Y = \{y_1, y_2, \dots\}\) of context parameters, also called accumulating parameters, or simply parameters. A ranked set is a set together with a mapping that associates to each element of the set a non-negative number, called the rank of that element. For a ranked set \(Q\) we denote by \(Q^{(k)}\) the subset of symbols which have rank \(k\).

Definition 2

Let \(\Sigma \subseteq U^+\) be a finite set of symbols and let \(\%t\) be a special symbol not in \(\Sigma\). A macro forest transducer \(M\) over \(\Sigma\) is a tuple \((Q, \Sigma, q_0, R)\) where \(Q\) is a finite ranked set of states, each of rank \(\ge 1\). The initial state \(q_0\) is in \(Q^{(1)}\). Let \(q \in Q^{(m+1)}\) with \(m \ge 0\). For every input symbol \(\sigma \in \Sigma\) the set \(R\) contains at most one \((q, \sigma)\)-rule of the form
\[ q(\sigma(x_1)\,x_2, y_1, \dots, y_m) \to r, \]
where \(r\) is a forest over \(\Sigma \cup Q \cup X \cup Y_m\) with \(X = \{x_0, x_1, x_2\}\), \(Y_m = \{y_1, \dots, y_m\}\), and the properties that a leaf has a label in \(X\) if and only if it is the first child of a \(Q\)-node, and parameters in \(Y_m\) may only appear at leaves. Variables \(x_0\), \(x_1\), and \(x_2\) bind the current node with the rest of the stream, the children of the current node, and the rest of the stream, respectively. Additionally, \(R\) contains exactly one rule of each of the following two kinds: (1) an \(\varepsilon\)-rule of the form
\[ q(\varepsilon, y_1, \dots, y_m) \to r, \]
where \(r\) is as for \((q, \sigma)\)-rules, but with \(X = \{x_0\}\); (2) a default rule of the form
\[ q(\%t(x_1)\,x_2, y_1, \dots, y_m) \to r, \]
where \(r\) is as for \((q, \sigma)\)-rules but where binary nodes may be labeled \(\%t\).

Note that our transducers are deterministic by definition, i.e., for a state at a given input node, at most one rule is applicable. Note also that they are total and define an output for any arbitrary given input forest, due to the presence of the default rules.

Let \(M = (Q, \Sigma, q_0, R)\) be an MFT. A call of the form \(q'(x_0)\) in the right-hand side of a rule of \(M\) is called a "stay move". Note that stay moves can give rise to non-terminating computation. For instance, the rule \(q(\varepsilon) \to q(x_0)\), if called on the leaf \(\varepsilon\), does not terminate. We do not further formalize termination but refer the reader to Section 5.2 of [5] where this is discussed for a formalism similar to MFTs, namely for deterministic pebble macro tree transducers. We only deal with terminating MFTs; all our constructions operate on terminating MFTs, and are guaranteed to construct terminating MFTs. Therefore, we always mean "terminating MFT" from now on when we speak about MFTs.

We define the semantics of the MFT \(M\). For a given input XML forest \(f\), \(M\)'s output, denoted \([\![M]\!](f)\), is defined as \([\![q_0]\!](f)\).
Every state \(q \in Q^{(m+1)}\), \(m \ge 0\), induces a function \([\![q]\!] : F^{m+1} \to F\) defined recursively for forests \(g_0, f_1, \dots, f_m \in F\) as
\[ [\![q]\!](g_0, f_1, \dots, f_m) = [\![r]\!] \]
where \(r\) is the right-hand side of the unique rule that is applicable: (i) if \(g_0 = \varepsilon\) then \(r\) is the right-hand side of \(q\)'s \(\varepsilon\)-rule. (ii) If \(g_0 = \sigma(g_1)\,g_2\) for forests \(g_1, g_2\), then \(r\) is the right-hand side of the \((q, \sigma)\)-rule of \(M\), if it exists, and otherwise is the right-hand side of the default rule of \(q\). The tree \([\![r]\!]\) is defined inductively as: \([\![\varepsilon]\!] = \varepsilon\), \([\![y_j]\!] = f_j\) if \(j \in \{1, \dots, m\}\), and \([\![q'(x_i, u_1, \dots, u_n)]\!] = [\![q']\!](g_i, [\![u_1]\!], \dots, [\![u_n]\!])\) if \(q' \in Q^{(n+1)}\), \(i \in \{0, 1, 2\}\), and \(u_1, \dots, u_n\) are subtrees in \(r\).

Let us consider this example query \(P_{person}\):

    <out>{
      for $b in $input/person[./p_id/text() = "person0"]
      return let $r := $b/name/text()
             return $r
    }</out>

The query selects all text-node children of any name-node that is a child of a person-node, where that person-node has a p_id-child with a text-node child of content "person0". Here are the rules of the MFT \(M_{person}\) of this example; it outputs a root-node "out" whose children are the results of the above query.

\(q_0(\%t(x_1)\,x_2) \to \mathrm{out}(q_1(x_0))\)
\(q_1(\mathrm{person}(x_1)\,x_2) \to q_2(x_1, q_6(x_1))\; q_1(x_2)\)
\(q_1(\%t(x_1)\,x_2) \to q_1(x_1)\; q_1(x_2)\)
\(q_2(\mathrm{p\_id}(x_1)\,x_2, y_1) \to q_3(x_1, y_1, q_2(x_2, y_1))\)
\(q_2(\%t(x_1)\,x_2, y_1) \to q_2(x_2, y_1)\)
\(q_3(\mathrm{person0}(x_1)\,x_2, y_1, y_2) \to y_1\)
\(q_3(\%t(x_1)\,x_2, y_1, y_2) \to q_3(x_2, y_1, y_2)\)
\(q_3(\varepsilon, y_1, y_2) \to y_2\)
\(q_6(\mathrm{name}(x_1)\,x_2) \to q_8(x_1)\; q_6(x_2)\)
\(q_6(\%t(x_1)\,x_2) \to q_6(x_2)\)
\(q_8(\%t_{text}(x_1)\,x_2) \to \%t(\varepsilon)\; q_8(x_2)\)
\(q_8(\%t(x_1)\,x_2) \to q_8(x_2)\)
\(q_i(\varepsilon) \to \varepsilon\) for \(i \in \{1, 2, 6, 8\}\)

where the pattern \(\%t_{text}(x_1)\,x_2\) matches only text nodes, so that \(x_1\) binds \(\varepsilon\) for 'normal' inputs. Let us run the MFT \(M_{person}\) on the XML tree for this document:

    <person><p_id><a/>person0</p_id><name>Jim</name><c/><name>Li</name></person>

We want to compute \([\![q_0]\!](\mathrm{person}(t_1 t_2 t_3 t_4\,\varepsilon)\,\varepsilon)\) where \(t_1\) is a tree with root-node labeled p_id, \(t_2, t_4\) are name-nodes, and \(t_3\) is a c-leaf. According to the first rule we obtain \(\mathrm{out}([\![q_1]\!](\mathrm{person}(f)\,\varepsilon))\), where \(f\) is the forest \(t_1 t_2 t_3 t_4\,\varepsilon\). We apply the \((q_1, \mathrm{person})\)-rule to obtain
\[ [\![q_2]\!](f, [\![q_6]\!](f))\; [\![q_1]\!](\varepsilon). \]
We remove the \(q_1\)-call because it produces \(\varepsilon\). Let \(t_1 = \mathrm{p\_id}(f_1)\) and \(f_2 = t_2 t_3 t_4\,\varepsilon\). According to the \((q_2, \mathrm{p\_id})\)-rule, we obtain
\[ [\![q_3]\!](f_1, [\![q_6]\!](f), [\![q_2]\!](f_2, [\![q_6]\!](f))). \]
Let \(f_1 = \mathrm{a}(\varepsilon)\,\mathrm{person0}(\varepsilon)\,\varepsilon\); we apply the default rule for \(q_3\) to obtain
\[ [\![q_3]\!](\mathrm{person0}(\varepsilon)\,\varepsilon, [\![q_6]\!](f), [\![q_2]\!](f_2, [\![q_6]\!](f))). \]
Then we apply the \((q_3, \mathrm{person0})\)-rule to obtain \([\![q_6]\!](f)\). Let \(t_2 = \mathrm{name}(f_3)\) and \(t_4 = \mathrm{name}(f_4)\). Since \(f = \mathrm{p\_id}(f_1)\,\mathrm{name}(f_3)\,\mathrm{c}(\varepsilon)\,\mathrm{name}(f_4)\,\varepsilon\), we apply the \(q_6\)-rules to obtain \([\![q_8]\!](f_3)\; [\![q_8]\!](f_4)\; \varepsilon\). The forest \(f_3\) only consists of the "Jim" text node, which is thus output by \(q_8\). Similarly, the \(q_8\)-call is applied to the other text in \(f_4\) and outputs it. Thereby our final output is out(Jim Li), where the empty forest \(\varepsilon\) is omitted. Note that in XML documents it is not possible to have two text nodes that are direct siblings of each other. Thus, our MFT processor for this example outputs <out>JimLi</out>.

It is interesting to observe why the state \(q_3\) of the transducer \(M_{person}\) uses two parameters \(y_1\) and \(y_2\). This is done to simulate the existential semantics of XPath filters (the two parameters are used as the two branches of an if-then-else statement). To see this, consider the input

    <person><p_id>perso7</p_id><name>Jim</name><p_id>person0</p_id></person>

After some initial steps (as before) we obtain
\[ [\![q_3]\!](\mathrm{perso7}(\varepsilon)\,\varepsilon, [\![q_6]\!](f'), [\![q_2]\!](f'_2, [\![q_6]\!](f'))), \]
where \(f'\) is the child forest of the person-node and \(f'_2\) consists of the siblings following its first p_id-child. Since the filter is not true at this point (because "perso7" is not equal to "person0"), we need to check if further p_id-children of the person-node satisfy the filter. We carefully prepared for this event by supplying \(q_3\)'s initial call with a second parameter that contains a \(q_2\)-call to the p_id-siblings.
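The run of the example transducer can be replayed with a small interpreter. The following Python sketch evaluates MFT rules by structural recursion on the input forest; it is only an illustration under stated assumptions. The encoding of rules as nested tuples, the helper names (`Node`, `elem`, `text`, `call`, `run`, `build`, `show`), the state names `q0` to `q8`, the reading of the rule indices, and the \(\varepsilon\)-rule given to the initial state are choices of this sketch, not part of the streaming engine of Nakano and Mu [30]. Parameters are evaluated eagerly here, whereas a streaming engine would not materialize them.

```python
from collections import namedtuple

# A forest is a tuple of nodes; each node has a label and a child forest.
Node = namedtuple("Node", "label children is_text")

def elem(label, *children):
    return Node(label, tuple(children), False)

def text(content):
    return Node(content, (), True)

# Rule keys that cannot clash with labels (sentinels chosen for this sketch).
EPS, DEFAULT, TEXT = object(), object(), object()

# A right-hand side is a list of items:
#   ("param", j)           the parameter y_j
#   ("call", q, i, args)   the call q(x_i, args...); each arg is a right-hand side
#   ("node", label, rhs)   an output node; label DEFAULT copies the input node
def call(q, i, *args):
    return ("call", q, i, list(args))

Y1, Y2 = [("param", 1)], [("param", 2)]

def run(rules, q, g0, params=()):
    """Compute [[q]](g0, params...) for the forest g0."""
    table = rules[q]
    if not g0:
        return build(rules, table[EPS], (g0,), params, None)
    head = g0[0]
    xs = (g0, head.children, g0[1:])            # x0, x1, x2
    if head.label in table:
        rhs = table[head.label]                 # (q, sigma)-rule
    elif head.is_text and TEXT in table:
        rhs = table[TEXT]                       # text-node pattern
    else:
        rhs = table[DEFAULT]                    # default rule
    return build(rules, rhs, xs, params, head)

def build(rules, rhs, xs, params, cur):
    """Evaluate a right-hand side to an output forest."""
    out = []
    for item in rhs:
        if item[0] == "param":
            out.extend(params[item[1] - 1])
        elif item[0] == "call":
            _, q, i, args = item
            values = tuple(build(rules, a, xs, params, cur) for a in args)
            out.extend(run(rules, q, xs[i], values))
        else:
            _, label, kids = item
            children = build(rules, kids, xs, params, cur)
            if label is DEFAULT:
                out.append(Node(cur.label, children, cur.is_text))
            else:
                out.append(Node(label, children, False))
    return tuple(out)

OUT = [("node", "out", [call("q1", 0)])]
M_PERSON = {
    "q0": {DEFAULT: OUT, EPS: OUT},             # EPS rule of q0 is an assumption
    "q1": {"person": [call("q2", 1, [call("q6", 1)]), call("q1", 2)],
           DEFAULT: [call("q1", 1), call("q1", 2)], EPS: []},
    "q2": {"p_id": [call("q3", 1, Y1, [call("q2", 2, Y1)])],
           DEFAULT: [call("q2", 2, Y1)], EPS: []},
    "q3": {"person0": Y1, DEFAULT: [call("q3", 2, Y1, Y2)], EPS: Y2},
    "q6": {"name": [call("q8", 1), call("q6", 2)],
           DEFAULT: [call("q6", 2)], EPS: []},
    "q8": {TEXT: [("node", DEFAULT, []), call("q8", 2)],
           DEFAULT: [call("q8", 2)], EPS: []},
}

def show(forest):
    """Render a forest in term notation, writing text nodes without parentheses."""
    return " ".join(n.label if n.is_text else n.label + "(" + show(n.children) + ")"
                    for n in forest)

doc1 = (elem("person", elem("p_id", elem("a"), text("person0")),
             elem("name", text("Jim")), elem("c"), elem("name", text("Li"))),)
doc2 = (elem("person", elem("p_id", text("perso7")),
             elem("name", text("Jim")), elem("p_id", text("person0"))),)
print(show(run(M_PERSON, "q0", doc1)))
print(show(run(M_PERSON, "q0", doc2)))
```

The second document exercises the else-branch: the first p_id-child fails the test, so the state checking the text content falls through to its second parameter, which holds the search over the remaining siblings.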
Thus, when \(q_3\) meets the \(\varepsilon\)-leaf of perso7 (viz. we know that the filter is false here), it selects its second parameter; we therefore obtain \([\![q_2]\!](f'_2, [\![q_6]\!](f'))\) and proceed correctly.

Size of a Transducer.

We define the size \(|M|\) of the MFT \(M\) as \(|\Sigma|\) plus the sum of sizes of all left-hand sides and right-hand sides of \(M\)'s rules. The size of a forest is defined as the number of its nodes.

We now discuss the compilation of a MinXQuery program \(P\) to the MFT \(M_P\). First, let us define a shorthand notation. For a forest \(f\) that is restricted as the right-hand side of a default rule of an MFT, but with the additional restriction that \(x_1, x_2\) do not appear, we denote by
\[ q(\%, y_1, \dots, y_m) \to f \]
the two rules
\[ q(\%t(x_1)\,x_2, y_1, \dots, y_m) \to f \]
\[ q(\varepsilon, y_1, \dots, y_m) \to f \]

We define \(M_P = (Q, \Sigma, q_0, R)\) where \(\Sigma\) consists of all element labels and string constants that appear in \(P\). For instance, for our example program \(P_{person}\), \(\Sigma\) consists of person, p_id, person0, and name. The initial state \(q_0\) of the MFT has the two rules induced by:
\[ q_0(\%) \to q'(x_0, q_{copy}(x_0)) \]
where \(q', q_{copy}\) are states in \(Q\) of ranks 2 and 1, respectively. The state \(q_{copy}\) realizes the identity mapping on forests, via the rules
\[ q_{copy}(\%t(x_1)\,x_2) \to \%t(q_{copy}(x_1))\; q_{copy}(x_2) \]
\[ q_{copy}(\varepsilon) \to \varepsilon. \]

Our compilation functions are defined recursively on the structure of \(P\), and return sets of rules. The compilation of any (sub)-expression of \(P\) is done in the context of a mapping \(\rho\) and a state \(q \in Q\). The mapping \(\rho\) is of the form \(\rho = \{(v_1, 1), \dots, (v_n, n)\}\), where \(v_i\) are variable names appearing in the MinXQuery program. The state \(q \in Q\) is the current state for which rules are defined by the compilation. We define \(\rho_0 = \{(\texttt{\$input}, 1)\}\) and issue the call \(\mathcal{T}(P, \rho_0, q')\) as initial call to the compilation function \(\mathcal{T}\).

We define the compilation function \(\mathcal{T}\) recursively. Let \(e, e', e_1, \dots, e_n\) be arbitrary MinXQuery expressions, \(\rho\) a mapping, \(q \in Q\) of rank \(m + 1\), \(m \ge 0\), and \(p\) an XPath expression (as defined by the nonterminal ordpath in the EBNF of Figure 2).

If \(e = e_1 \cdots e_n\) then let \(q_1, \dots, q_n\) be new states in \(Q\) of rank \(m + 1\) and define \(\mathcal{T}(e, \rho, q) = \{r\} \cup \mathcal{T}(e_1, \rho, q_1) \cup \cdots \cup \mathcal{T}(e_n, \rho, q_n)\) where \(r\) is the rule \(q(\%, y_1, \dots, y_m) \to q_1(x_0, y_1, \dots, y_m) \cdots q_n(x_0, y_1, \dots, y_m)\).

If \(e = \texttt{<}\sigma\texttt{>}\, e'\, \texttt{</}\sigma\texttt{>}\) with \(\sigma \in \Sigma\) then let \(q'\) be a new state in \(Q\) of rank \(m + 1\) and define \(\mathcal{T}(e, \rho, q) = \{r\} \cup \mathcal{T}(e', \rho, q')\) where \(r\) is the rule \(q(\%, y_1, \dots, y_m) \to \sigma(q'(x_0, y_1, \dots, y_m))\).

If \(e = \sigma\) (i.e., \(\sigma\) is a string constant) then define \(\mathcal{T}(e, \rho, q) = \{q(\%, y_1, \dots, y_m) \to \sigma(\varepsilon)\}\).

If \(e = \texttt{\$}v\) where \(\texttt{\$}v\) is a variable name, then define \(\mathcal{T}(e, \rho, q) = \{q(\%, y_1, \dots, y_m) \to y_{\rho(\texttt{\$}v)}\}\).

If \(e = \texttt{for \$}v\ \texttt{in}\ p\ \texttt{return}\ e'\) then let \(q'\) be a new state in \(Q\) of rank \(m + 2\) and define \(\mathcal{T}(e, \rho, q) = \mathcal{T}(e', \rho', q') \cup \mathcal{F}(p, q, q')\) where \(\rho' = \rho \cup \{(\texttt{\$}v, m + 1)\}\) and \(\mathcal{F}(p, q, q')\) is defined below.

If \(e = \texttt{let \$}v\ \texttt{:=}\ e_v\ \texttt{return}\ e'\) then let \(q_v, q'\) be new states in \(Q\) of rank \(m + 1\) and \(m + 2\), respectively. Define \(\mathcal{T}(e, \rho, q) = \{r\} \cup \mathcal{T}(e_v, \rho, q_v) \cup \mathcal{T}(e', \rho', q')\) where \(\rho' = \rho \cup \{(\texttt{\$}v, m + 1)\}\) and \(r\) is the rule \(q(\%, y_1, \dots, y_m) \to q'(x_0, y_1, \dots, y_m, q_v(x_0, y_1, \dots, y_m))\).

If \(e = p\) with an XPath expression \(p\), then let \(q'\) be a new state in \(Q\) of rank \(m + 2\) and define \(\mathcal{T}(e, \rho, q) = \{r\} \cup \mathcal{F}(p, q, q')\) where \(r\) is the rule \(q'(\%, y_1, \dots, y_{m+1}) \to y_{m+1}\).

The rules in \(\mathcal{F}(p, q, q')\) are defined so that
\[ [\![q]\!](t\,s, u_1, \dots, u_m) = [\![q']\!](t_1 s_1, u_1, \dots, u_m, t_1) \cdots [\![q']\!](t_n s_n, u_1, \dots, u_m, t_n) \qquad (1) \]
where \(t_1, \dots, t_n\) are all subtrees of \(t\), in pre-order, that satisfy the XPath \(p\) relative to the root of \(t\), and \(s_1, \dots, s_n\) are the sequences of their following siblings.

Let us give a definition of \(\mathcal{F}(p, q, q')\) for an XPath \(p\) and two states \(q\) and \(q'\) with rank \(m + 1\) and \(m + 2\), respectively, so that equation (1) holds. We first show the case where \(p\) contains no predicate. We obtain a total deterministic finite automaton (DFA) from the XPath \(p\) in the usual way. We only discuss child and descendant axes. This translation is described by Green et al. [16]. The case of sequences of following-sibling axes is similar. Without loss of generality, the initial state of the DFA has no incoming transition. The set \(\mathcal{F}(p, q, q')\) consists of rules each of which corresponds to a transition \(q_1 \xrightarrow{a} q_2\) of the DFA. When \(q_1\) is not initial and \(q_2\) is not final, \(\mathcal{F}(p, q, q')\) contains a rule \(q_1(a(x_1)\,x_2, y_1, \dots, y_m) \to q_2(x_1, y_1, \dots, y_m)\; q_1(x_2, y_1, \dots, y_m)\). When \(q_1\) is initial and \(q_2\) is not final, the set has a rule \(q_1(a(x_1)\,x_2, y_1, \dots, y_m) \to q_2(x_1, y_1, \dots, y_m)\). When \(q_1\) is not initial and \(q_2\) is final, the set has a rule \(q_1(a(x_1)\,x_2, y_1, \dots, y_m) \to q'(x_0, y_1, \dots, y_m, a(q_{copy}(x_1)))\). When \(q_1\) is initial and \(q_2\) is final, the set has a rule \(q_1(a(x_1)\,x_2, y_1, \dots, y_m) \to q'(x_0, y_1, \dots, y_m, a(q_{copy}(x_1)))\).

Next we show the case where the XPath \(p\) contains predicates. We first construct a set of rules in a way similar to the above, ignoring all predicates. If a step in the XPath \(p\) has a predicate \(p'\), we modify the rules corresponding to the transition for the step in the DFA. For example, when \(p\) = $v//a[\(p'\)]/b/c, we modify the rule for the transition \(q_1 \xrightarrow{a} q_2\) of the DFA using another DFA obtained from the predicate XPath \(p'\). Before the modification, we introduce a state \(q_{p'}\) in the translated MFT so that \([\![q_{p'}]\!](t\,s, u_1, u_2) = u_1\) if the predicate XPath \(p'\) is true for \(t\) relative to the root of \(t\), and \([\![q_{p'}]\!](t\,s, u_1, u_2) = u_2\) otherwise. The set of rules for \(q_{p'}\) is obtained in a way similar to regular lookahead removal in macro tree transducers [9]. We use this state \(q_{p'}\) for the modification of rules. For example, suppose that we obtain the following rules for \(q_1\) by ignoring predicates:
\[ q_1(a(x_1)\,x_2, y_1, \dots, y_m) \to q_2(x_1, y_1, \dots, y_m)\; q_1(x_2, y_1, \dots, y_m) \]
\[ q_1(\%t(x_1)\,x_2, y_1, \dots, y_m) \to q_1(x_1, y_1, \dots, y_m)\; q_1(x_2, y_1, \dots, y_m) \]
Then we modify the first rule as follows.
\[ q_1(a(x_1)\,x_2, y_1, \dots, y_m) \to q_{p'}(x_0, q_2(x_1, y_1, \dots, y_m), q_1(x_1, y_1, \dots, y_m))\; q_1(x_2, y_1, \dots, y_m) \]

Let us summarize our translation using the example program \(P_{person}\). First, an MFT rule \(q_0(\%) \to q_1(x_0, q_{copy}(x_0))\) is generated for the initial state \(q_0\), where \(q_1\) corresponds to the expression \(e\) = <out>{\(e_{for}\)}</out>. The accumulating parameter of \(q_1\) is introduced for the $input variable, which is not used as an output; hence it will be eliminated in the further optimization. For the \(q_1\) state, an MFT rule \(q_1(\%, y_1) \to \mathrm{out}(q_2(x_0, y_1))\) is generated by \(\mathcal{T}(e, \rho_0, q_1)\) with \(\rho_0 = \{(\texttt{\$input}, 1)\}\). For more MFT rules, we compute \(\mathcal{T}(e_{for}, \rho_0, q_2) = \mathcal{T}(e_{let}, \rho', q_3) \cup \mathcal{F}(p, q_2, q_3)\) where \(e_{for}\) = for $b in \(p\) return \(e_{let}\) and \(\rho' = \rho_0 \cup \{(\texttt{\$b}, 2)\}\). MFT rules for the \(q_3\) state are obtained by the further computation so that the output of \(q_3\) is the result of the \(e_{let}\) expression for the current node. The computation of \(\mathcal{F}(p, q_2, q_3)\) generates MFT rules for the \(q_2\) state which collect a sequence of results of \(q_3\) at the path \(p\). After the whole translation, an MFT with 14 states is finally generated. By the parameter reduction discussed in Section 4, the MFT \(M_{person}\) will be obtained.

The size \(|P|\) of a MinXQuery program \(P\) is defined as the number of nodes in its parse tree (according to our EBNF in Figure 2).

Theorem 1

Given a MinXQuery program P, the MFT M_P is constructed in time O(|P|). For every XML forest f it holds that [[M_P]](f) = [[P]](f).

Our translation generates a transducer which in general includes many redundant parameters. They should be eliminated as much as possible because the number of parameters has a serious effect on the efficiency of streaming with the obtained transducers. In an extreme case, we can eliminate all parameters. This also helps us to apply the composition laws discussed in Section 4.2. In this section, we suppose that the index positions of arguments (or parameters) of states start at zero, e.g., the parameter y_2 in q(x, y_1, y_2) is called the second parameter. In our implementation, we eliminate two kinds of parameters: unused parameters and constant parameters. Additionally, we eliminate parameters by removing stay moves and unreachable states. Since the optimizations may interact, we apply them repeatedly.

Unused parameter reduction.

An unused parameter is one that does not appear in the output, for any given input. For example, if we have five rules for the states q and q′

  q(σ(x) x, y, y) → δ(q′(x, y, y))
  q(%t(x) x, y, y) → %t(q′(x, δ(y), σ(y)))
  q(ε, y, y) → σ(y)
  q′(%t(x) x, y, y) → q(x, ε, y)
  q′(ε, y, y) → ε,

then the parameters y of q and y of q′ are unused because they never contribute to outputs of the transducer. The second parameter y of q is obviously used for output because of the third rule. From this fact, the parameter y of q′ may also be used because it is passed to q as the second argument in the fourth rule. The set of unused parameters is obtained by finding all necessary parameters with the following algorithm. We say that y_i has a bare occurrence in e when y_i occurs in e but not in an argument of a state call in e. The algorithm collects all necessary parameters as a set S ⊆ U with U = {(q, i) | q ∈ Q, 1 ≤ i ≤ rank(q) − 1} so that (q, i) ∈ S implies that the i-th parameter of state q appears in outputs.

  S := {(q, i) | y_i has a bare occurrence in the right-hand side of a q-rule}
  until S is no longer updated do
    S := S ∪ {(q, i) | (q′, i′) ∈ S,
                       e is the i′-th argument of a q′-call in the right-hand side of a q-rule,
                       y_i has a bare occurrence in e}
  end

This procedure always terminates because U is finite. Obviously, U \ S is a set of unused parameters. For each (q, i) ∈ U \ S, the parameter y_i can be eliminated from the left-hand side of the q-rules. We also remove the i-th argument of the q-calls in the right-hand sides of all rules.

Constant parameter reduction.
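Before constant parameters are treated, the fixpoint computation for unused parameters described above can be sketched in Python. This is only an illustration, not the OCaml implementation used in the paper: the tuple encoding of right-hand sides, the helper names, and the concrete parameter indices in the example rules are our own assumptions, chosen so that exactly one parameter of each state turns out to be necessary.

```python
# Hypothetical encoding (not from the paper). A forest is a list of items:
#   ("param", i)              the parameter y_i
#   ("node", label, forest)   an output node with a child forest
#   ("call", state, x, args)  a state call; args[0] is the forest passed for y_1


def bare_params(forest):
    """Indices i such that y_i occurs in `forest` outside every state-call argument."""
    found = set()
    for item in forest:
        if item[0] == "param":
            found.add(item[1])
        elif item[0] == "node":
            found |= bare_params(item[2])
    return found


def state_calls(forest):
    """Yield all state calls in `forest`, including calls nested inside arguments."""
    for item in forest:
        if item[0] == "node":
            yield from state_calls(item[2])
        elif item[0] == "call":
            yield item
            for arg in item[3]:
                yield from state_calls(arg)


def necessary_params(rules):
    """`rules` maps each state to the list of right-hand sides of its rules."""
    needed = {(q, i) for q, rhss in rules.items() for rhs in rhss for i in bare_params(rhs)}
    changed = True
    while changed:  # iterate until S is no longer updated
        changed = False
        for q, rhss in rules.items():
            for rhs in rhss:
                for _, callee, _, args in state_calls(rhs):
                    for pos, arg in enumerate(args, start=1):
                        if (callee, pos) in needed:
                            new = {(q, i) for i in bare_params(arg)} - needed
                            if new:
                                needed |= new
                                changed = True
    return needed


def y(i):
    return ("param", i)


# Assumed instantiation of the five example rules (indices are our choice).
rules = {
    "q": [
        [("node", "delta", [("call", "q'", "x1", [[y(1)], [y(2)]])])],
        [("node", "%t", [("call", "q'", "x1",
                          [[("node", "delta", [y(1)])], [("node", "sigma", [y(2)])]])])],
        [("node", "sigma", [y(2)])],
    ],
    "q'": [
        [("call", "q", "x2", [[], [y(2)]])],
        [],
    ],
}
ranks = {"q": 3, "q'": 3}

needed = necessary_params(rules)
universe = {(q, i) for q, r in ranks.items() for i in range(1, r)}
unused = universe - needed
print(sorted(unused))
```

Under these assumed indices, only the second parameter of each state is necessary, so the first parameter of both q and q′ could be dropped from all rules and calls.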

A constant parameter is a parameter which is always instantiated by the same constant forest. This can be found by checking whether the corresponding argument of every call of the state in the right-hand sides of all rules is either the specific constant forest or the parameter of the same state in the left-hand side. For example, let f be an XML forest and consider the rules

  q(σ(x) x, y, y) → q(x, ε, y) δ(q′(x, y))
  q(%t(x) x, y, y) → q(x, y, y) %t(q′(x, δ(y)))
  q(ε, y, y) → y
  q′(%t(x) x, y) → δ(q(x, ε, x))

and no other rule contains q in its right-hand side. The parameter y can be eliminated from the q-rules. We replace all occurrences of y with the constant ε in the right-hand sides of the q-rules, i.e., the third rule in the example above becomes q(ε, y) → ε.

Stay move removal.

Removing stay moves also contributes to parameter reduction because it may remove states with parameters by inlining. For example, if we have a rule q(%, y, y) → q′(x) y, then all occurrences of q(x_i, e, e) in the right-hand sides of the rules can be replaced by q′(x_i) e. Since the state q is discarded, the number of parameters is consequently reduced. Note that our translation only introduces stay rules of the form q(%, ...) → f, which are particularly easy to inline. A general procedure for stay-move removal for similar transducers is given in Theorem 31 of [5].

Unreachable state removal.

Removing unreachable states can reduce the number of parameters for the same reason as stay move removal. When we construct the state-call dependency graph from all rules, the states that are unreachable from the initial state are clearly unused.

In an extreme case, the translation of a given MinXQuery program introduces only redundant parameters, which can be removed by the four procedures above. It is possible to detect whether this case occurs from the MinXQuery program itself, without translation. Let us classify occurrences of variables in the MinXQuery program into three kinds: bound variables, path variables, and output variables. A bound variable occurs at the left-hand side of a let or for clause; a path variable occurs at the beginning of an XPath expression; an output variable occurs anywhere else. Our translation introduces parameters for two purposes: XPath predicates and output variables. Many parameters introduced for variable bindings can be removed because most variables in MinXQuery programs occur as path variables. These parameters are removed as unused parameters in the aforementioned way. Additionally, if the only output occurrence of a variable is where it is introduced by the nearest enclosing for clause, the corresponding parameter can be removed by stay move removal. In summary, we easily obtain the following theorem from these observations.

An MFT in which each state is of rank 1, i.e., in which no context parameters y_1, ... are used, is called a top-down forest transducer, abbreviated FT.

Theorem 2

Let P be a MinXQuery program. When P satisfies (1) no XPath expression contains predicates, and (2) no output variable occurrence is inside a for clause other than the for clause that binds that variable, there effectively exists an FT equivalent to [[P]].

Proof.

Suppose that a MinXQuery program P satisfies the conditions above. It suffices to show that all accumulating parameters of the translated MFT can be removed. By the first condition on XPath expressions, all parameters of the translated MFT are introduced in the following four cases of the translation: the initial state, for clauses, let clauses, and XPath queries. For the initial state, our translation introduces a parameter for the $input variable. By the second condition, it does not occur inside any for clause, hence the parameter can be eliminated in all states introduced for translating the inner expressions. For the other states which have the parameter, we can remove them by stay move removal. As for the for and let clauses, the statement can be shown in a similar way. The state introduced for an XPath query translation can be immediately eliminated by stay move removal. □

Composing two XML transformations removes intermediate XML tree constructions, like deforestation [39], which has been heavily studied in the context of functional programming. This can be a powerful optimization for XML processing. Koch chose for GCX a compositional fragment of XQuery. This means that two XQuery programs (where the second reads the output of the first) can be composed into one program. It can be shown that our fragment of XQuery can be composed as well. What is known on the forest/tree transducer side with respect to composition? It is easy to see that neither MFTs nor MFTs without parameters (FTs) are closed under composition. However, two FTs can be composed into one MFT. This can be obtained through known results; we give a direct construction and determine its worst-case time complexity. We consider further composition results, when one of the involved transducers is a tree transducer, and state the complexity in big-O notation.
Our transducers are slightly different from those in the literature, and no complexity statements are known; therefore we dwell on the theory and establish these results here. We denote by f ., g the composition of two functions and by F ., G the composition of two classes, that is, F ., G = { f ., g | f ∈ F, g ∈ G }.

Expressive Power. An XML forest can naturally be seen as a binary tree, using the well-known first-child next-sibling encoding (see, e.g., [37]). In this encoding, the first child of an unranked node becomes the left child in the binary tree, and the next sibling in the unranked tree becomes the right child in the binary tree. If an element node has no first child or no next sibling, then in the binary tree it has the empty tree ε as left (resp. right) child. A binary XML tree is a binary tree with internal nodes of rank 2 labeled by elements in U∗ and leaves labeled ε. The set of all binary XML trees is denoted by B. For an XML forest f ∈ F we denote by fcns(f) its first-child next-sibling encoded binary XML tree in B; i.e., fcns(ε) = ε and, for forests f_1, f_2 and σ ∈ U+,

  fcns(σ(f_1) f_2) = σ(fcns(f_1), fcns(f_2)).

Given an MFT M, its binary tree translation [[M]]_B is the function over B defined as

  [[M]]_B = { (fcns(f), fcns(g)) | (f, g) ∈ [[M]] }.

We denote by mft the class of all binary tree translations realized by MFTs.

We can now compare the expressive power of MFTs to other well-known classes of tree translations. A macro tree transducer (top-down tree transducer), for short MTT (TT), is an MFT (FT) M such that the right-hand side of each rule is a tree in which (Σ ∪ {%t})-labeled nodes are binary. In this case the output is always a binary tree, and therefore we define the tree translation of M as [[M]]_B = { (fcns(f), g) | (f, g) ∈ [[M]] }. The classes of translations are denoted mtt and tt.
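The first-child next-sibling encoding defined above is easy to state as code. The following Python sketch is an illustration only; the representation of forests as lists of (label, children) pairs, of binary trees as (label, left, right) triples with None standing for ε, and the function names are our own assumptions.

```python
def fcns(forest):
    """First-child next-sibling encoding of a forest as a binary tree."""
    if not forest:
        return None  # fcns(eps) = eps
    (label, children), *siblings = forest
    # fcns(sigma(f1) f2) = sigma(fcns(f1), fcns(f2))
    return (label, fcns(children), fcns(siblings))


def unfcns(tree):
    """Inverse of fcns: decode a binary tree back into a forest."""
    if tree is None:
        return []
    label, left, right = tree
    return [(label, unfcns(left))] + unfcns(right)


# The forest a(b c) d: the children of a go left, the sibling d goes right.
forest = [("a", [("b", []), ("c", [])]), ("d", [])]
encoded = fcns(forest)
print(encoded)
```

Decoding with unfcns gives back the original forest, which reflects that the encoding is a bijection between forests and binary trees of this shape.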
Macro and top-down tree transducers are conventionally defined for ranked input and output alphabets (not necessarily binary), and do not have stay moves or default rules. These inclusions hold: tt ⊊ ft ⊊ mtt ⊊ mft.

It was shown in [32], for transducers without stay and default rules, that every macro forest transducer can be decomposed into a macro tree transducer, followed by an "evaluation mapping" eval. The macro tree transducer is obtained from the macro forest transducer by replacing each occurrence of concatenation in the right-hand sides of the rules by a special binary symbol @. For instance, the MFT right-hand side q(x) y b(ε, ε) is replaced by the tree @(q(x), @(y, b(ε, ε))). The evaluation mapping interprets @-symbols by concatenation, i.e., eval(@(t_1, t_2)) = eval(t_1) eval(t_2), and for all other labels realizes the identity. It should be clear that this result also holds in the presence of stay and default rules. Thus, we have mft ⊆ mtt ., eval. It is not difficult to show that the converse inclusion also holds: given an MTT M and an evaluation mapping eval_Σ, we can construct an MFT N such that [[N]] = [[M]] ., eval_Σ (we simply remove all @-symbols from the right-hand sides of M's rules by interpreting them according to eval_Σ). The constructions do not affect the presence of parameters, and thus the inclusions also hold for forest transducers (without context parameters). It is shown in [32] that eval_Σ can be realized by a macro tree transducer.

Lemma 1

The following relations hold (and one representation can be obtained from the other in linear time): (1) mft = mtt ., eval, (2) ft = tt ., eval, (3) eval ⊊ mtt.

We want to derive new composition results for MFTs, using existing results about tree transducers. We are interested in complexity, and therefore must look carefully at how stay rules and default rules behave under composition.

As it turns out, stay rules are quite useful for transducer composition: they allow us to "compress" new right-hand sides (using the compression power of transducer rules). Without them, composing two top-down tree transducers takes exponential time; with them, quadratic time. This is easy to see: consider a transducer M_1 that translates every a-node into 4 b-nodes:

  q(a(x_1)) → b(b(b(b(q(x_1)))))
  q(ε) → ε

The next transducer M_2 spawns two new copies for each b-node, via rules of the form

  p(b(x_1)) → c(p(x_1), p(x_1))
  p(ε) → ε

If we follow the natural product construction of translating via M_2 the right-hand sides of M_1's rules, then we obtain these deterministic top-down tree transducer (DT for short) rules:

  ⟨q, p⟩(a(x_1)) → c(c(c(c(⟨q, p⟩(x_1), . . . ))))
  ⟨q, p⟩(ε) → ε

With stay rules we can avoid such blow-ups. A stay transducer for the example does not have a right-hand side of exponential size for the (⟨q, p⟩, a)-rule, but instead breaks up that tree into many separate rules of the node-by-node M_2-translation of M_1's (q, a)-rule:

  ⟨q, p⟩(a(x_1)) → c(⟨q, p, 1⟩(x_0), ⟨q, p, 1⟩(x_0))
  ⟨q, p, 1⟩(a(x_1)) → c(⟨q, p, 2⟩(x_0), ⟨q, p, 2⟩(x_0))
  ⟨q, p, 2⟩(a(x_1)) → c(⟨q, p, 3⟩(x_0), ⟨q, p, 3⟩(x_0))
  ⟨q, p, 3⟩(a(x_1)) → c(⟨q, p⟩(x_1), ⟨q, p⟩(x_1))

Using stay rules we can construct in quadratic time a DT realizing the composition of two given DTs. In fact, this also works for two TTs, i.e., if the given transducers have stay rules and default rules. Recall that the size |M| of a transducer M is defined as |Σ| plus the sum of the sizes of M's rules.

Lemma 2

Let M_1, M_2 be TTs over Σ. A TT M can be constructed in time O(|Σ||M_1||M_2|) such that [[M]] = [[M_1]] ., [[M_2]].

Proof.

Let M_i = (Q_i, Σ, q_i, R_i). We first add some rules to M_1: for every a ∈ Σ for which there is a (p, a)-rule in R_2 but no (q, a)-rule in R_1, we add the rule r_a to R_1; the rule r_a is obtained from M_1's binary default rule for state q by replacing every occurrence of %t (in the left-hand and right-hand side) by a. We define M = (Q, Σ, ⟨q_1, q_2⟩, R). For all states q ∈ Q_1 and p ∈ Q_2 let ⟨q, p⟩ be a state in Q. For every rule r ∈ R_1, node u of the right-hand side of r, and state p ∈ Q_2, let ⟨r, u, p⟩ be a state in Q. Let r be the rule q(b(x_1, ..., x_k)) → t with k ∈ { , , } and b ∈ Σ ∪ {%t} ∪ {ε}, and let p be a state in Q_2. We let the rule ⟨q, p⟩(b(x_1, ..., x_k)) → ⟨r, λ, p⟩(x_0) be in R. Recall that λ denotes the root node of a tree. For every node u of t we let the rule ⟨r, u, p⟩(b(x_1, ..., x_k)) → t′ be in R. If u is labeled by q′(x_i) for q′ ∈ Q_1 and 0 ≤ i ≤ k, then define t′ = ⟨q, p⟩(x_i). Otherwise, t′ is obtained from the right-hand side of the unique p-rule that is applicable to node u of t. We finally replace every p′(x_i) by ⟨r, u.i, p′⟩(x_0), where u. u .

The correctness of the construction follows from the fact that for every input forest t: [[⟨r, u, p⟩]](t) = [[p]](s), where s is the subtree at u of [[q]](t) and r is the unique q-rule that is applicable to the root of t. The statement can be proved by induction on the structure of t. Let maxrhs(M) be the size of a largest right-hand side of M's rules. In the first step we add at most |Σ|-many rules to R_1. We thus obtain a transducer of size O(|Σ||M_1|). For each rule r of M_1 we construct in M at most O(|Σ||M|)-many versions of that rule (of the same size as r). Thus M is constructed in time O(|Σ||M_1||M_2|).
□

Note that the effective composition closure of total deterministic top-down tree transducers (i.e., TTs without stay moves and default rules) was proved in Theorem 2 of [33]; it is also shown there that non-total such transducers are not closed under composition. Baker shows how to restrict nondeterministic top-down tree transducers so that they can be composed into one transducer [1]. We are not aware of statements in the literature about the time complexity of tree transducer composition. Before we give results about composition of forest transducers, we lift two existing results about macro tree transducers to the presence of stay moves and default rules.
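The blow-up in the example preceding Lemma 2 can be checked with a small calculation. The following Python sketch is only an illustration under our own encoding (right-hand sides as nested tuples, hypothetical helper and state names). It builds the right-hand side of the a-rule of the first transducer with k b-nodes, translates it with the rule p(b(x1)) → c(p(x1), p(x1)), and compares the size of the resulting single right-hand side with the total size of k constant-size stay rules.

```python
def m1_rhs(k):
    """b(b(...b(q(x1))...)) with k b-nodes, as nested tuples."""
    t = ("call", "q", "x1")
    for _ in range(k):
        t = ("b", t)
    return t


def naive_product_rhs(t):
    """Translate a right-hand side of the first transducer with p(b(x1)) -> c(p(x1), p(x1))."""
    if t[0] == "call":
        return ("call", "<q,p>", "x1")
    sub = naive_product_rhs(t[1])
    return ("c", sub, sub)


def size(t):
    """Number of output nodes plus state calls in a right-hand side."""
    if t[0] == "call":
        return 1
    return 1 + sum(size(s) for s in t[1:])


def stay_rules(k):
    """One rule per b-node; x0 denotes a stay move, x1 a move to the child."""
    names = ["<q,p>"] + [f"<q,p,{i}>" for i in range(1, k)]
    rules = {}
    for i, name in enumerate(names):
        last = i == k - 1
        target = "<q,p>" if last else names[i + 1]
        var = "x1" if last else "x0"
        rules[name] = ("c", ("call", target, var), ("call", target, var))
    return rules


k = 4
naive = size(naive_product_rhs(m1_rhs(k)))                 # 2**(k + 1) - 1
with_stay = sum(size(r) for r in stay_rules(k).values())   # 3 * k
print(naive, with_stay)
```

For k = 4 the single product right-hand side has 31 symbols, while the four stay rules together have 12; the first number doubles with every additional b-node, the second grows by 3.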

Lemma 3

Let M_1 be an MTT and M_2 a TT. Then MTTs M, M′ can be constructed in time O(|Σ||M_1||M_2|) such that [[M]] = [[M_1]] ., [[M_2]] and [[M′]] = [[M_2]] ., [[M_1]].

Proof.

The construction of M′ is similar to the one in the proof of Lemma 2, so we omit the details. The construction of M is more complicated. Let p_1, ..., p_n be an ordering of the states of M_2. Let q be a state of M_1 of rank m + 1, m ≥ 0, and let p_i be a state of M_2. Then define ⟨q, p_i⟩ to be a state of M of rank m + 1. For every q-rule r of M_1 and node u in the right-hand side of r define ⟨r, u, p_i⟩ to be a state of M of rank 1 + mn with n = |Q_2|. The idea is as before: state ⟨r, u, p_i⟩ is obtained by translating the node u of the right-hand side t of r in state p_i of M_2. The difference now is how to translate parameters y_j: we must output the p_i-translation of the current parameter tree in y_j. For this, we provide state ⟨q, p⟩ with n-many copies of each parameter y_j (one for each state p_i). Details are omitted due to lack of space; they can be found in the full version of the present paper [?]. □

Analogous results (without complexity statements) about transducers without stay moves and default rules are stated in Corollary 4.10 and Theorem 4.12 of [9], respectively.

Composition of Forest Transducers.

We now consider the composition of two forest transducers (FTs), i.e., MFTs without accumulating parameters. It is easy to see that FTs are not closed under composition: (1) the output forests of any FT (seen as binary trees via the first-child/next-sibling encoding) have height at most exponential in the height of the input tree; (2) the composition of the following FT with itself has double exponential height increase. It translates a forest of n many a-nodes into a forest of 2^n many a-nodes:

  q(a(x_1, x_2)) → q(x_2) q(x_2)
  q(ε) → a.

We now show that two FTs can be composed into one MFT. In fact, we show a stronger result: the composition of an MTT and an FT can be realized by one MFT. Any FT can be turned in linear time into an equivalent MTT by turning each right-hand side into its binary tree encoding.
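The size argument behind the non-closure claim at the start of this paragraph can be replayed on small inputs. The sketch below is an illustration in Python under our own assumptions: forests are lists of (label, children) pairs, and the doubling rule is read as recursing twice on the sibling forest x2, which is what makes a flat forest of n a-nodes grow to 2^n a-nodes.

```python
def a_forest(n):
    """A flat forest of n a-nodes."""
    return [("a", []) for _ in range(n)]


def double(forest):
    """FT sketch: q(a(x1, x2)) -> q(x2) q(x2) and q(eps) -> a, with x2 the sibling forest."""
    if not forest:
        return [("a", [])]
    rest = double(forest[1:])
    return rest + rest  # the two calls q(x2) q(x2) produce the same forest


once = double(a_forest(3))   # 2**3 a-nodes
twice = double(once)         # 2**(2**3) a-nodes
print(len(once), len(twice))
```

One application turns 3 nodes into 8, and applying the transducer to its own output yields 256 nodes, i.e., 2^(2^n) for n = 3. Since a flat forest of length k has height k in the first-child/next-sibling encoding, this is the double exponential height increase that a single FT cannot realize.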

Theorem 3

Let M_1 be an MTT and M_2 an FT. An MFT M can be constructed in time O(|Σ||M_1||M_2|) such that [[M]] = [[M_1]] ., [[M_2]].

Proof.

By Lemma 1(2), M_2 can be decomposed into a TT M_2′ and an eval mapping eval_Σ. This takes time O(|M_2|). According to Lemma 3 we construct in time O(|Σ||M_1||M_2′|) an MTT M′ with [[M′]] = [[M_1]] ., [[M_2′]]. Finally, we compose M′ and eval_Σ in time O(|M′|) into the MFT M. □

Theorem 4

Let Σ be an alphabet, M_1 a TT over Σ, and M_2 an FT over Σ. An FT M can be constructed in time O(|Σ||M_1||M_2|) such that [[M]] = [[M_1]] ., [[M_2]].

Proof.

We decompose M_2 into tt ., eval in time O(|M_2|) according to Lemma 1(2). We compose M_1 with the obtained TT in time O(|Σ||M_1||M_2|). From the obtained mapping in tt ., eval we construct an FT, again according to Lemma 1(2), in linear time. □

Theorem 5

Let M_1 be an FT and M_2 a TT. An MTT M can be constructed in time O(|Σ||M_1||M_2|) such that [[M]] = [[M_1]] ., [[M_2]].

Proof.

We decompose M_1 into tt ., eval in time O(|M_1|) according to Lemma 1(2). The eval mapping can be turned into an MTT in linear time. We compose the first TT and this MTT into one MTT, according to Lemma 3. In this case we do not need to add extra rules to the first transducer, and therefore do not pay the Σ-factor. The reason is that the MTT for eval has no a-rules for a ∈ Σ whatsoever: it only consists of default and ε-rules. Finally, we compose the obtained MTT with M_2 to obtain the desired MTT in time O(|Σ||M_1||M_2|). □

We have implemented in OCaml both our translation from MinXQuery programs into MFTs and the MFT optimizations. In this section, we present experimental results for our implementation, which we connect with the stream processor generator for MFTs by Nakano and Mu [30]. All experiments were conducted on an Apple XServe with a 2.93 GHz 8-core Intel Xeon and 48 GB of main memory. In the experiments, we compare our implementation with GCX [18, 36] and Saxon [34], both of which are stream processors for XQuery. GCX supports only a subset of XQuery, like ours, while Saxon covers the full feature set of XQuery. The experiments show that our MFT-based approach achieves performance on a par with GCX. We only show the numbers for the comparison between ours and GCX because Saxon is much slower than the two. The reason is that Saxon's streaming is optimized for memory, not for speed. It is implemented in Java, which incurs considerable overhead for loading the Java VM and warming up the HotSpot compiler when run from the command line. A comparison with Saxon is not fair anyhow, because it supports the full standards, while GCX and our engine only implement a small subset of XQuery.

We run over XMark documents [35, 42] with sizes ranging from 100 MB to 100 GB for benchmarking. The GCX distribution comes with example queries for XMark which are adapted versions of the standard XMark XQuery benchmark queries.
These 20 queries are of the following three different types:

(1) queries with small output (at most 1% of the input size) realizing simple XPath selection (downward axes only)
(2) queries producing huge output of quadratic size
(3) queries that use data value joins

Out of the 20 queries, 16 are of type (1). From these we have picked the six most diverse queries: Q1, Q2, Q4, Q13, Q16, and Q17. Queries of type (2) might not be used in practice very often (they are not quadratic in their original form in the XMark benchmark). Since we do not support joins yet, we do not test queries of type (3). It is not difficult to add standard join procedures to our tool, and we do not expect major time differences in comparison to GCX; this is left for future work.

Benchmark Queries

  XMark Q1:  {for $person in $input/site/people/person[./person_id/text()="person0"] return $person/name/text()}
  XMark Q2:  {for $open_auction in /site/open_auctions/open_auction return { for $increase in $open_auction/bidder/increase return {$increase/text()} }}
  XMark Q4:  {for $b in $input/site/open_auctions/open_auction[./bidder[./personref/personref_person/text()="personXX"]/following-sibling::bidder/personref/personref_person/text()="personYY"] return {$b/reserve/text()}}
  XMark Q13: {for $item in $input/site/regions/australia/item return {$item/name/text()}{$item/description}}
  XMark Q16: {for $closed_auction in $input/site/closed_auctions/closed_auction[./annotation/description/parlist/listitem/parlist/listitem/text/emph/keyword/text()] return {$closed_auction/seller/seller_person}}
  XMark Q17: {for $person in $input/site/people/person[empty(./homepage/text())] return {$person/name/text()}}
  Double:    {$input/*}{$input/*}
  4star:     {$input//*//*//*//*}
  Deepdup:   { for $x in $input/* return { for $y in $x/* return {$y}{$y} } }

Figure 3: XQuery Benchmark Programs

Figure 4 shows the results of the comparison between streaming MFT and GCX on total elapsed time and maximum memory consumption. We ran some queries in XMark and simple examples of XQuery with large-sized input XML data. Missing data in the graphs indicates a failure of execution because of out-of-memory, except for Figure 4(c), in which GCX fails because of its lack of expressiveness. We also investigate the performance of streaming MFTs translated from XQuery before and after applying the optimizations discussed in Section 4, which are referred to as “MFT (no-opt)” and “MFT (opt)” in the graphs, respectively. As can be seen, the optimized MFT and GCX consume constant-sized memory, independent of the size of inputs. On the other hand, the unoptimized MFT consumes much more space and often fails to process large-sized inputs. This is because the MFTs translated from XQuery contain many redundant parameters for every variable, which may not be used for any part of the output. Since an unoptimized MFT necessarily stores the entire input for the $input variable, it cannot benefit from streaming-style evaluation. Therefore, the optimization phases are indispensable for our MFT translation and the MFT stream processor used.
[Figure 4 plots. Each panel shows elapsed time (sec) and maximum memory usage (MB). Panels (a) XMark Q1, (b) XMark Q2, (c) XMark Q4, (d) XMark Q13, (e) XMark Q16, and (f) XMark Q17 plot MFT (no opt), MFT (opt), and GCX against the size of inputs. Panels (g) Double query, (h) 4star query, and (i) Deepdup query plot MFT (opt) and GCX for XMark, TreeBank, Medline DB, and Protein DB inputs. Some data points are marked N/A.]

Figure 4: Benchmark results

Table 1: Input XML files for benchmark

                         size      depth
  XMark                  any       13
  TreeBank DB            86 MB     37
  Medline DB             174 MB    8
  Protein Sequence DB    684 MB    8

All attribute nodes are encoded as element nodes.

The memory consumptions differ by a factor of about three between the optimized MFT and GCX. This is because of differences in their implementation languages and the XML parsers employed. Our streaming engine is written in OCaml, while GCX is implemented in C++. The measured OCaml memory footprint (including the Expat XML parser) was about 4.5 MB, which was the major factor in the difference. Additionally, GCX may have an advantage in memory usage since it employs its own XML parser, which does not handle attributes, namespaces, and character codes. This difference also affects the comparison results on elapsed time.

The XQuery programs we used are listed in Figure 3. Although the // step is not shown in the syntax of MinXQuery, it is supported in our implementation in the usual way. To benchmark a program on GCX, we modify it by replacing XPath predicates with where-clauses.

Let us discuss the results for each query. Figure 4(a) shows the results of running XMark Q1, which requires a simple lookahead because of XPath predicates. Our implementation is about 18% slower than GCX. Figure 4(b) shows the results of running XMark Q2, which contains nested for-loops. Since this query contains neither XPath predicates nor let-clauses, all accumulating parameters can be removed, i.e., the optimized MFT is an FT. For this query, the elapsed times of our MFT-based approach are very close to those of GCX. Figure 4(c) shows the results of running XMark Q4, which requires selecting according to the sibling order. It is an interesting example in the sense that the query requires a nested XPath predicate like /site/open_auctions/open_auction[./bidder[./personref//text()="person111"]/following-sibling::bidder/personref//text()="person222"]. GCX fails to run because the following-sibling axis is not supported. Figure 4(d) shows the results of running XMark Q13, which requires reconstruction of XML data in the result.
Since this query satisfies the condition of Theorem 2, all accumulating parameters can be removed, i.e., the optimized MFT is an FT. Our implementation is about 20% slower than GCX. Our implementation is about 23% slower than GCX. Figure 4(e) shows the results of running XMark Q16, which involves an XPath expression with a very long XPath predicate. In this query, we need very deep lookahead in order to select nodes, which may be harmful for stream processing. Although our MFT translation introduces many states with rank 3, the performance of our MFTs is still acceptable. Figure 4(f) shows the results of running XMark Q17, which involves a negative XPath predicate [empty(./homepage/text())]. The experiments show results similar to the others. Our implementation is about 21% slower than GCX.

The last three queries test corner cases: input doubling, selection via the XPath query //*//*//*//*, and deep duplication with a nested for-loop. Here our MFT-based implementation shows better results than GCX. Figure 4(g) shows the results of the input doubling query that outputs the input XML twice, which requires the entire input to be stored in memory for the second output. This example shows that our implementation can run in streaming style even in such an extreme case. GCX seems to be buggy on the input doubling query (and even on an identity mapping like {$input/*}): it fails when the input is larger than 200 MB. Figure 4(h) shows the results of running the node selection with the XPath //*//*//*//*. Our implementation is 26% faster on average and 41% faster at the maximum than GCX. Figure 4(i) shows the results of the query which requires copying a variable many times. Our implementation is 15% faster on average and 48% faster at the maximum than GCX. We tried the various kinds of XML data shown in Table 1. TreeBank DB has very deep tree structures even though its size is not so large.
According to the results, the depth affects the performance of stream processing even when the total size is small. It is difficult to give a general statement on when our streaming has an advantage over GCX. We leave further investigation to future work.

Conclusions

We present a translation of XQuery fragments to MFTs. This fragment is larger than the one supported by GCX, the fastest XQuery streaming tool we know. The main difference from GCX is that we support XPath predicates and let-statements. Moreover, MFTs are more general and make it easy to program recursive function definitions. Many useful static analyses are known for tree transducers and can be applied to our MFTs. We applied three of them: stay-move removal, useless parameter removal, and constant parameter removal. The optimized MFTs are often faster by one order of magnitude than the unoptimized ones. We present several efficient composition constructions for subclasses of MFTs. These are useful for avoiding intermediate stream results. We believe that MFTs are a robust and appropriate intermediate compilation framework for streaming of XQuery.

In the future we plan to apply further static analyses known for MFTs. We want to experiment with partial transducers that are obtained by composing with a domain check that corresponds to a DTD or XML Schema. We would like to minimize MFTs by using (a relaxed version of) the “earliest normal form”, similar to the one known for deterministic top-down tree transducers of Engelfriet, Maneth, and Seidl [8]. An earliest transducer produces its output as early as possible during translation. We would like to compress the output trees produced by a transducer: macro forest transducers can have doubly exponential size increase. This means that the size of an output tree is O(2^(2^n)), where n is the size of the corresponding input tree. Their outputs can, however, be represented using grammar-based compression in linear space with respect to the input size [25]. Thus, the output stream is guaranteed to be of linear size. It is a challenging open question how to execute an MFT over an input stream that is grammar-compressed.
It was recently shown by Maneth, Ordonez, and Seidl [26] that top-down tree transducers can be executed in constant memory over DAG-compressed tree streams. Their DAG streams contain forward references to definitions that appear later in the stream. Alternatively, a tree can be shredded into small parallel streams, as considered by Labath and Niehren [21]. Last, we want to study parallel processing of streams by MFTs; parallel execution of MFTs has been considered by Morihata [29].

Acknowledgment

We thank Michael Benedikt and anonymous reviewers for their valuable comments. This work was partially supported by JSPS KAKENHI Grant Number 25730002. Sebastian Maneth was supported by the Engineering and Physical Sciences Research Council project "Enforcement of Constraints on XML streams" (EPSRC EP/G004021/1).

References

[1] B. S. Baker. Composition of top-down and bottom-up tree transductions. Information and Control, 41(2):186–213, 1979.
[2] Z. Bar-Yossef, M. Fontoura, and V. Josifovski. On the memory requirements of XPath evaluation over XML streams. J. Comput. Syst. Sci., 73(3):391–441, 2007.
[3] D. Debarbieux, O. Gauwin, J. Niehren, T. Sebastian, and M. Zergaoui. Early nested word automata for XPath query answering on XML streams. In CIAA, pages 292–305, 2013.
[4] J. Dvoraková. Automatic streaming processing of XSLT transformations based on tree transducers. Informatica (Slovenia), 32(4):373–382, 2008.
[5] J. Engelfriet and S. Maneth. A comparison of pebble tree transducers with macro tree transducers. Acta Inf., 39(9):613–698, 2003.
[6] J. Engelfriet and S. Maneth. Macro tree translations of linear size increase are MSO definable. SIAM J. Comput., 32(4):950–1006, 2003.
[7] J. Engelfriet and S. Maneth. The equivalence problem for deterministic MSO tree transducers is decidable. Inf. Process. Lett., 100(5):206–212, 2006.
[8] J. Engelfriet, S. Maneth, and H. Seidl. Deciding equivalence of top-down XML transformations in polynomial time. J. Comput. Syst. Sci., 75(5):271–286, 2009.
[9] J. Engelfriet and H. Vogler. Macro tree transducers. J. Comput. Syst. Sci., 31(1):71–146, 1985.
[10] M. F. Fernández, P. Michiels, J. Siméon, and M. Stark. XQuery streaming à la carte. In ICDE, pages 256–265, 2007.
[11] D. Florescu, C. Hillery, D. Kossmann, P. Lucas, F. Riccardi, T. Westmann, M. J. Carey, and A. Sundararajan. The BEA streaming XQuery processor. VLDB J., 13(3):294–315, 2004.
[12] A. Frisch and K. Nakano. Streaming XML transformation using term rewriting. In PLAN-X, pages 2–13, 2007.
[13] O. Gauwin and J. Niehren. Streamable fragments of forward XPath. In CIAA, pages 3–15, 2011.
[14] O. Gauwin, J. Niehren, and S. Tison. Queries on XML streams with bounded delay and concurrency. Inf. Comput., 209(3):409–442, 2011.
[15] J. Giesl, A. Kühnemann, and J. Voigtländer. Deaccumulation techniques for improving provability. J. Log. Algebr. Program., 71(2):79–113, 2007.
[16] T. J. Green, A. Gupta, G. Miklau, M. Onizuka, and D. Suciu. Processing XML streams with deterministic automata and stream indexes. ACM Trans. Database Syst., 29(4):752–788, 2004.
[17] M. Kay. Ten reasons why Saxon XQuery is fast. IEEE Data Eng. Bull., 31(4):65–74, 2008.
[18] C. Koch, S. Scherzinger, and M. Schmidt. The GCX system: Dynamic buffer minimization in streaming XQuery evaluation. In VLDB, pages 1378–1381, 2007.
[19] C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. FluXQuery: An optimizing XQuery processor for streaming XML data. In VLDB, pages 1309–1312, 2004.
[20] K. Kodama, K. Suenaga, and N. Kobayashi. Translation of tree-processing programs into stream-processing programs based on ordered linear type. J. Funct. Program., 18(3):333–371, 2008.
[21] P. Labath and J. Niehren. A functional language for hyperstreaming XSLT. Unpublished manuscript available at http://researchers.lille.inria.fr/ niehren/Papers/X-Fun/0.pdf, 2013.
[22] B. Ludäscher, P. Mukhopadhyay, and Y. Papakonstantinou. A transducer-based XML query processor. In VLDB, pages 227–238. Morgan Kaufmann, 2002.
[23] S. Maneth. The macro tree transducer hierarchy collapses for functions of linear size increase. In FSTTCS, pages 326–337, 2003.
[24] S. Maneth, A. Berlea, T. Perst, and H. Seidl. XML type checking with macro tree transducers. In PODS, pages 283–294, 2005.
[25] S. Maneth and G. Busatto. Tree transducers and tree compressions. In FoSSaCS, pages 363–377, 2004.
[26] S. Maneth, A. Ordonez, and H. Seidl. Constant-memory streaming of XML transformations. In preparation, 2013.
[27] S. Maneth, T. Perst, and H. Seidl. Exact XML type checking in polynomial time. In ICDT, pages 254–268, 2007.
[28] S. Maneth, S. Pott, and H. Seidl. Type checking of tree walking transducers. In Modern Applications of Automata Theory, pages 325–372. World Scientific, 2012.
[29] A. Morihata. Macro tree transformations of linear size increase achieve cost-optimal parallelism. In APLAS, pages 204–219, 2011.
[30] K. Nakano and S.-C. Mu. A pushdown machine for recursive XML processing. In APLAS, pages 340–356, 2006.
[31] D. Olteanu. SPEX: Streamed and progressive evaluation of XPath. IEEE Trans. Knowl. Data Eng., 19(7):934–949, 2007.
[32] T. Perst and H. Seidl. Macro forest transducers. Inf. Process. Lett., 89(3):141–149, 2004.
[33] W. C. Rounds. Mappings and grammars on trees. Mathematical Systems Theory, 4(3):257–287, 1970.
[34] Saxon: The XSLT and XQuery processor. http://saxon.sourceforge.net/.
[35] A. Schmidt, F. Waas, M. L. Kersten, M. J. Carey, I. Manolescu, and R. Busse. XMark: A benchmark for XML data management. In VLDB, pages 974–985, 2002.
[36] M. Schmidt, S. Scherzinger, and C. Koch. Combined static and dynamic analysis for effective buffer minimization in streaming XQuery evaluation. In ICDE, pages 236–245, 2007.
[37] T. Schwentick. Automata for XML - a survey. J. Comput. Syst. Sci., 73(3):289–315, 2007.
[38] M. Shalem and Z. Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In ICDE, pages 824–832, 2008.
[39] P. Wadler. Deforestation: Transforming programs to eliminate trees. Theor. Comput. Sci., 73(2):231–248, 1990.
[40] P. Wadler. XQuery: A typed functional language for querying XML. In Advanced Functional Programming, pages 188–212, 2002.
[41] M. Wei, E. A. Rundensteiner, M. Mani, and M. Li. Processing recursive XQuery over XML streams: The raindrop approach. Data Knowl. Eng., 65(2):243–265, 2008.
[42] XMark: an XML benchmark project.