Scalable Continual Top-k Keyword Search in Relational Databases
Yanwei Xu
Department of Computer Science and Technology, Tongji University, Shanghai, China
Abstract.
Keyword search in relational databases has been widely studied in recent years because it requires users neither to master a structured query language nor to know the complex underlying database schemas. Most existing methods focus on answering snapshot keyword queries in static databases. In practice, however, databases are updated frequently, and users may have long-term interests in specific topics. To deal with such situations, it is necessary to build an effective and efficient facility in a database system to support continual keyword queries. In this paper, we propose an efficient method for answering continual top-k keyword queries over relational databases. The proposed method is built on an existing scheme of keyword search over relational data streams, but incorporates ranking mechanisms into the query processing methods and makes two improvements to support efficient top-k keyword search in relational databases. Compared to the existing methods, our method is more efficient both in computing the top-k results in a static database and in maintaining the top-k results while the database is continually being updated. Experimental results validate the effectiveness and efficiency of the proposed method.

Key words:
Relational databases, keyword search, continual queries, results maintenance.
1 Introduction

With the proliferation of text data available in relational databases, simple ways of exploring such information effectively are of increasing importance. Keyword search in relational databases, with which a user specifies his/her information need by a set of keywords, is a popular information retrieval method because the user needs to know neither a complex query language nor the underlying database schemas. It has attracted substantial research effort in recent years, and a number of methods have been developed [1,2,3,4,5,6,7,8,9,10].

Example 1.
Consider the sample publication database shown in Fig. 1. Fig. 1(a) shows the three relations Papers, Authors, and Writes. In the following, we use the initial of each relation name (P, A, and W) as its shorthand. There are two foreign key references: W → A and W → P. Fig. 1(b) illustrates the tuple connections based on the foreign key references. For the keyword query "James P2P", consisting of the two keywords "James" and "P2P", there are six tuples in the database that contain at least one of the two keywords (underlined in Fig. 1(a)). They can be regarded as results of the query. However, they can be joined with other tuples according to the foreign key references to form more meaningful results, several of which are shown in Fig. 1(c). The arrows represent the foreign key references between the corresponding pairs of tuples. Finding such results, which are formed by the tuples containing the keywords, is the task of keyword search in relational databases. As described later, results are often ranked by relevance scores evaluated by a certain ranking strategy. □
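The kind of joined result shown in Fig. 1(c) can be reproduced with plain SQL. The snippet below builds a toy version of the sample schema in SQLite (the data values are illustrative stand-ins, not the exact tuples of Fig. 1) and joins Papers to Authors through Writes, keeping only trees whose endpoints match the two keywords:

```python
import sqlite3

# Build a minimal version of the sample schema (illustrative data).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Papers  (pid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Authors (aid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Writes  (wid INTEGER PRIMARY KEY,
                      aid REFERENCES Authors(aid),
                      pid REFERENCES Papers(pid));
INSERT INTO Papers  VALUES (1, 'P2P or Not P2P?'), (2, 'Logical Queries over Views');
INSERT INTO Authors VALUES (1, 'James Chen'), (2, 'Saikat Guha');
INSERT INTO Writes  VALUES (1, 1, 1), (2, 2, 2);
""")

# One class of joined results: Paper <- Writes -> Author where both
# endpoint tuples contain a query keyword.
rows = db.execute("""
SELECT a.name, p.title
FROM Writes w
JOIN Papers  p ON w.pid = p.pid
JOIN Authors a ON w.aid = a.aid
WHERE p.title LIKE '%P2P%' AND a.name LIKE '%James%'
""").fetchall()
print(rows)  # [('James Chen', 'P2P or Not P2P?')]
```

The single surviving row corresponds to one joint tuple tree of the form author ← write → paper; the non-matching author/paper pair is filtered out.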
Papers (pid, title): "Leveraging Identity-Based Cryptography for Node ID Assignment in Structured P2P Systems.", "P2P or Not P2P?: In P2P 2003", "A System for Predicting Subcellular Localization.", "Logical Queries over Views: Decidability.", "A conservative strategy to protect P2P file sharing systems from pollution attacks.", ...

Authors (aid, name): "James Chen", "Saikat Guha", "James Bassingthwaighte", "Sabu T.", "James S. W. Walkerdines", ...

Writes (wid, aid, pid): ...

(a) Database (matched keywords are underlined) (b) Tuple connections (matched tuples are solid circles) (c) Examples of query results

Fig. 1.
A sample database with a keyword query "James P2P".

Most of the existing keyword search methods assume that the databases are static and focus on answering snapshot keyword queries. In practice, however, a database is often updated frequently, and the result of a snapshot query becomes invalid once the related data in the database is updated. For the database in Fig. 1, if publication data arrives continually, new publication records are inserted into the three tables. Such new records may be more relevant to "James" and "P2P". Hence, after getting the initial top-k results, the user may demand that the top-k results reflect the latest database updates.
Such demands are common in real applications. Suppose a user wants to do a top-k keyword search in a micro-blogging database, which is being updated continually: not only are the weblogs and comments continually being inserted or deleted by bloggers, but the follow relationships between bloggers are also being updated continually. Thus, a continual evaluation facility for keyword queries is essential in such databases.

For continual keyword query evaluation, when the database is updated, two situations must be considered:

1. Database updates may change the existing top-k results: some top-k results may be replaced by new ones that are related to the new tuples, and some top-k results may become invalid due to deletions.
2. Database updates may change the relevance scores of existing results because the underlying statistics (e.g., word frequencies) are changed.

In this paper, we describe a system which can efficiently report the top-k results of every monitoring query while the database is being updated continually. The outline of the system is as follows:

– When a continual query is issued, it is evaluated in a pipelined way to find the set of results whose upper bounds of relevance scores are higher than a threshold θ, by calculating the upper bound of the future relevance score for every query result.
– When the database is updated, we first update the relevance scores of the computed results, then find the new results whose upper bounds of relevance scores are larger than θ, and delete the results containing the deleted tuples.
– The pipelined evaluation of the keyword query is resumed if the number of computed results whose relevance scores are larger than θ falls below k, or is reversed if that number is much larger than k.
– At any time, the k computed results whose relevance scores are the largest and are larger than θ are reported as the top-k results.

In Section 2, some basic concepts are introduced and the problem is defined. Section 3 discusses related work.
Section 4 presents the details of the proposed method. Section 5 gives the experimental results. The conclusion is drawn in Section 6.

2 Preliminaries

In this section, we introduce some important concepts for top-k keyword query evaluation in relational databases.

We consider a relational database schema as a directed graph G_S(V, E), called a schema graph, where V represents the set of relation schemas {R_1, R_2, ...} and E represents the foreign key references between pairs of relation schemas. Given two relation schemas R_i and R_j, there exists an edge in the schema graph from R_j to R_i, denoted R_i ← R_j, if the primary key of R_i is referenced by the foreign key defined on R_j. For example, the schema graph of the publication database in Fig. 1 is Papers ← Writes → Authors. A relation on relation schema R_i is an instance of R_i (a set of tuples) conforming to the schema, denoted r(R_i). A tuple can be inserted into a relation. Below, we use R_i to denote r(R_i) if the context is obvious.

The results of keyword queries in relational databases are a set of connected trees of tuples, each of which is called a joint-tuple-tree (JTT for short). A JTT represents how the matched tuples, which contain the specified keywords in their text attributes, are interconnected through foreign key references. Two adjacent tuples of a JTT, t_i ∈ r(R_i) and t_j ∈ r(R_j), are interconnected if they can be joined based on a foreign key reference defined on relation schemas R_i and R_j in G_S (either R_i → R_j or R_i ← R_j). The foreign key references between tuples in a JTT can be denoted using arrows or the join notation ⋈. For example, the second JTT in Fig. 1(c) can be denoted as a ← w → p or a ⋈ w ⋈ p. To be a valid result of a keyword query Q, each leaf of a JTT is required to contain at least one keyword of Q. In Fig. 1(c), the underlined Papers and Authors tuples are matched tuples of the keyword query as they contain the keywords. Hence, the four JTTs are valid results of the query. In contrast, a JTT p ← w → a whose author tuple a does not contain any required keyword is not a valid result. The number of tuples in a JTT T is called the size of T, denoted by size(T).

Given a keyword query Q, the query tuple set R_i^Q of relation R_i is defined as R_i^Q = {t ∈ r(R_i) | t contains some keywords of Q}. For example, the two query tuple sets in Example 1 are P^Q and A^Q, each consisting of the three matched tuples of Papers and Authors, respectively. The free tuple set R_i^F of a relation R_i with respect to Q is defined as the set of tuples that do not contain any keywords of Q. In Example 1, P^F and A^F consist of the remaining tuples of Papers and Authors. If a relation R_i does not contain text attributes (e.g., relation W in Fig. 1), R_i is used to denote R_i^F for any keyword query. We use R_i^{QorF} to denote a tuple set, which may be either R_i^Q or R_i^F.

Each JTT belongs to the result of a relational algebra expression, which is called a candidate network (CN) [4,9,11]. A CN is obtained by replacing each tuple in a JTT with the corresponding tuple set that it belongs to. Hence, a CN corresponds to a join expression on tuple sets that produces JTTs as results, where each join clause R_i^{QorF} ⋈ R_j^{QorF} corresponds to an edge ⟨R_i, R_j⟩ in the schema graph G_S, and ⋈ represents an equi-join between relations. For example, the CNs that correspond to the two JTTs p and p ← w → a in Example 1 are P^Q and P^Q ⋈ W ⋈ A^Q, respectively. In the following, we also denote P^Q ⋈ W ⋈ A^Q as P^Q ← W → A^Q. As the leaf nodes of JTTs must be matched tuples, the leaf nodes of CNs must be query tuple sets. Due to the existence of m:n relationships (for example, an article may be written by multiple authors), a CN may have multiple occurrences of the same tuple set. The size of a CN C, denoted size(C), is defined as the number of tuple sets that it contains. Obviously, the size of a CN is the same as that of the JTTs it produces. Fig. 2 shows the CNs corresponding to the four JTTs shown in Fig. 1(c). A CN can be easily transformed into an equivalent SQL statement and executed by an RDBMS.

Fig. 2. Examples of Candidate Networks

When a continual keyword query Q = {w_1, w_2, ..., w_l} is specified, the non-empty query tuple sets R_i^Q for each relation R_i in the target database are first computed using full-text indices. Then all the non-empty query tuple sets and the database schema are used to generate the set of valid CNs; the basic idea is to expand each partial CN by adding an R_i^F or R_i^Q at each step (where R_i is adjacent to one relation of the partial CN in G_S), beginning from the set of non-empty query tuple sets. The set of CNs must be sound/complete and duplicate-free. There is always a constraint CN_max (the maximum size of CNs) to avoid generating complicated but less meaningful CNs. In the implementation, we adopt the state-of-the-art CN generation algorithm proposed in [12].

Example 2.
In Example 1, there are two non-empty query tuple sets, P^Q and A^Q. Using them and the database schema graph, if CN_max = 5, the generated CNs are: P^Q; A^Q; P^Q ← W → A^Q; P^Q ← W → A^Q ← W → P^Q; P^Q ← W → A^F ← W → P^Q; A^Q ← W → P^Q ← W → A^Q; and A^Q ← W → P^F ← W → A^Q. For example, the CN P^Q ← W → A^Q can be transformed into the SQL statement: SELECT * FROM W w, P p, A a WHERE w.pid = p.pid AND w.aid = a.aid AND p.pid IN (the pids of P^Q) AND a.aid IN (the aids of A^Q).

2.4 Problem Definition

The problem of continual top-k keyword search we study in this paper is to continually report the top-k JTTs based on a certain scoring function, described below. We adopt the scoring method employed in [4], which is an ordinary ranking strategy in the information retrieval area. The following function score(T, Q) is used to score a JTT T for query Q, based on the TF-IDF weighting scheme:

score(T, Q) = (Σ_{t∈T} tscore(t, Q)) / size(T),   (1)

where t ∈ T is a tuple (a node) contained in T, and tscore(t, Q) is the tuple score of t with regard to Q, defined as follows:

tscore(t, Q) = Σ_{w ∈ t∩Q} [ (1 + ln(1 + ln(tf_{t,w}))) / ((1 − s) + s · dl_t / avdl) ] · ln(N / df_w + 1),   (2)

where tf_{t,w} is the term frequency of keyword w in tuple t, and df_w is the number of tuples in relation r(t) (the relation corresponding to tuple t) that contain w; df_w is interpreted as the document frequency of w. dl_t represents the size of tuple t, i.e., the number of letters in t, and is interpreted as the document length of t. N is the total number of tuples in r(t), avdl is the average tuple size (average document length) in r(t), and s (0 ≤ s ≤ 1) is a constant which is usually set to 0.2.

Table 1 shows the tuple scores of the six matched tuples in Example 1. We suppose all the matched tuples are those shown in Fig. 1, and that the numbers of tuples of the two relations are 150 and 180, respectively. The top-3 results are then T_1 = p (a single Papers tuple), T_2 = a (a single Authors tuple), and T_3 = p ← w → a, ordered by their scores.

Table 1.
Statistics and tuple scores of the tuples of P^Q and A^Q

                 P^Q                 A^Q
N                150                 170
df               3 (df_P2P)          3 (df_James)
avdl             57.8                14.6
dl per tuple     88, 28, 83          10, 22, 23
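To make Eq. (2) concrete, the sketch below evaluates the tuple score of a hypothetical Papers tuple containing "P2P" once (tf = 1, dl = 88), using the P^Q statistics from Table 1. The function follows Eq. (2) directly; the sample inputs are illustrative:

```python
import math

def tscore(tf_counts, dl, N, df, avdl, s=0.2):
    """Tuple score of Eq. (2), summed over the query keywords the tuple contains.

    tf_counts maps each matched keyword to its term frequency in the tuple;
    df maps each keyword to its document (tuple) frequency in the relation."""
    score = 0.0
    for w, tf in tf_counts.items():
        tf_part = 1.0 + math.log(1.0 + math.log(tf))     # dampened term frequency
        norm = (1.0 - s) + s * dl / avdl                 # pivoted length normalization
        idf_part = math.log(N / df[w] + 1.0)             # inverse document frequency
        score += tf_part / norm * idf_part
    return score

# Statistics of relation P from Table 1: N = 150, df("P2P") = 3, avdl = 57.8.
s = tscore({"P2P": 1}, dl=88, N=150, df={"P2P": 3}, avdl=57.8)
print(round(s, 2))  # 3.56
```

Longer-than-average tuples (dl > avdl) are normalized downward, while rare keywords (small df_w) contribute a larger idf factor, exactly as in standard pivoted-normalization TF-IDF.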
The score function in Eq. (1) has the property of tuple monotonicity, defined as follows. For any two JTTs T = t_1 ⋈ t_2 ⋈ ... ⋈ t_l and T′ = t′_1 ⋈ t′_2 ⋈ ... ⋈ t′_l generated from the same CN C, if tscore(t_i, Q) ≤ tscore(t′_i, Q) for every 1 ≤ i ≤ l, then score(T, Q) ≤ score(T′, Q). As shown in the following discussion, this property is relied upon by the existing top-k query evaluation algorithms.

3 Related Work

Given an l-keyword query Q = {w_1, w_2, ..., w_l}, the task of keyword search in a relational database is to find structural information constructed from tuples in the database [13]. There are two approaches. The schema-based approaches [1,2,4,7,9,14,15] utilize the database schema to generate SQL queries which are evaluated to find the structures for a keyword query. They process a keyword query in two steps. They first utilize the database schema to generate a set of relation join templates (i.e., the CNs), which can be interpreted as select-project-join views. Then, these join templates are evaluated by sending the corresponding SQL statements to the DBMS to find the query results. [2] showed how to generate a complete set of CNs for a user-given value of CN_max and discussed several query processing strategies that consider the common sub-expressions among the CNs. [1,2,14,15] all focused on finding all JTTs whose sizes are ≤ CN_max and which contain all l keywords; no ranking is involved. In [4] and [9], several algorithms are proposed to get the top-k JTTs. We will introduce them in detail in Section 3.2.

The graph-based methods [3,8,5,6,10,16] model and materialize the entire database as a directed graph where the nodes are relational tuples and the directed edges are foreign key references between tuples. Fig. 1(b) shows such a database graph for the example database. Then, for each keyword query, they find a set of structures (either Steiner trees [3], distinct rooted trees [5], r-radius Steiner graphs [10], or multi-center subgraphs [16]) from the database graph which contain all the query keywords and are connected by paths in the database graph. Such results are found by graph traversals that start from the nodes that contain the keywords. For the details, please refer to the survey papers [13,17]. The materialized data graph must be updated on any database change; hence this model is not appropriate for databases that change frequently [17]. Therefore, this paper adopts the schema-based framework and can be regarded as an extension for dealing with continual keyword search.

3.2 Top-k Keyword Search in Relational Databases
DISCOVER2 [4] proposed the Global-Pipelined (GP) algorithm to get the top-k results, ranked by the IR-style ranking strategy shown in Section 2.4. The aim of the algorithm is to find a proper order of generating JTTs so that it can stop early, before all the JTTs are generated. It employs the priority preemptive, round robin protocol [18] to find results from each query tuple set prefix in a pipelined way; thus each CN can avoid being fully evaluated.

For a keyword query Q and a CN C, let the set of query tuple sets of C be {R_1^Q, R_2^Q, ..., R_m^Q}. Tuples in each R_i^Q are sorted in non-increasing order of their scores computed by Eq. (2). Let R_i^Q.t_j be the j-th tuple in R_i^Q. In each R_i^Q, we use R_i^Q.cur to denote the current tuple, such that all tuples before its position have been processed, and we use R_i^Q.cur ← R_i^Q.cur + 1 to move R_i^Q.cur to the next position. q(t_1, t_2, ..., t_m) (where each t_i is a tuple and t_i ∈ R_i^Q) denotes the parameterized query which checks whether the m tuples can form a valid JTT. For each tuple R_i^Q.t_j, we use score(C.R_i^Q.t_j, Q) to denote the upper bound score of all the JTTs of C that contain the tuple R_i^Q.t_j, defined as follows:

score(C.R_i^Q.t_j, Q) = (t_j.tscore + Σ_{1≤i′≤m, i′≠i} C.R_{i′}^Q.t_1.tscore) / size(C).   (3)

According to the tuple monotonicity property of Eq. (1) and the sorting order of tuples, among the unprocessed tuples of C.R_i^Q, score(C.R_i^Q.cur, Q) has the maximum value.

Algorithm GP initially marks all tuples in each C.R_i^Q (1 ≤ i ≤ m) of each CN C as unprocessed except for the top-most ones. Then, in each while iteration (one round), the unprocessed tuple which maximizes the score value is selected for processing. Suppose tuple C.R_s^Q.cur maximizes score; processing C.R_s^Q.cur is done by joining it with the processed tuples in the other query tuple sets of C to find valid JTTs: all the combinations (t_1, t_2, ..., t_{s−1}, R_s^Q.cur, t_{s+1}, ..., t_m) are tested, where t_i is a processed tuple of C.R_i^Q (1 ≤ i ≤ m, i ≠ s). If the k-th relevance score of the found results is larger than the score values of all the unprocessed tuples in all the CNs, the algorithm can stop and output the k found results with the largest relevance scores, because no results with higher scores can be found in the further evaluation.

One drawback of the GP algorithm is that when a new tuple C.R_s^Q.cur is processed, it tries all the combinations of processed tuples (t_1, t_2, ..., t_{s−1}, t_{s+1}, ..., t_m) to test whether each combination can be joined with C.R_s^Q.cur. This operation is costly due to the extremely large number of combinations when the number of processed tuples becomes large [19]. SPARK [9] proposes the Skyline-Sweeping algorithm to reduce the number of combination tests. SPARK uses a priority queue Q to keep the set of seen but not yet tested combinations, ordered by a priority defined as the score of the hypothetical JTT corresponding to each combination. In each round, the combination in Q with the maximum priority is tested, and then all its adjacent combinations are inserted into Q, so that only the combinations with high priorities are tested. SPARK still cannot avoid testing a huge number of combinations which produce no results, although the number of combination tests is greatly reduced compared to DISCOVER2.

This paper evaluates the CNs in a pipelined way like [4] and [9], but also employs the following two optimization strategies, whose high efficiency is shown in [2,14,15]: (1) sharing the computational cost among CNs; and (2) adopting tuple reduction.

The most related projects to our paper are
S-KWS [14] and KDynamic [20,15], which try to find the new results or expired results of a given keyword query over an open-ended, high-speed, large relational data stream [13]. They adopt the schema-based framework since the database is not static. This paper deals with a different problem from S-KWS and KDynamic, though all need to respond to continual queries in a dynamic environment. S-KWS and KDynamic focus on finding all query results. In contrast, our methods maintain the top-k results, which is less sensitive to updates of the underlying databases because not every new or expired result changes the top-k results.

S-KWS maps each CN to a left-deep operator tree, where leaf operators (nodes) are tuple sets and interior operators are joins. Then the operator trees of all the CNs are compacted into an operator mesh by collapsing their common subtrees. Joins in the operator mesh are evaluated in a bottom-to-top manner. A join operator has two inputs and is associated with an output buffer which saves its results (partial JTTs). The output buffer of a join operator becomes input to the other join operators that share it; a result newly output by a join operator becomes a new arrival input to those sharing joins. The operator mesh has two main shortcomings [19]: (1) only the left part of the operator trees can be shared; and (2) a large number of intermediate tuples, computed by many join operators in the mesh at high processing cost, are never output in the end.

To overcome these shortcomings of S-KWS, KDynamic formalizes each CN as a rooted tree, whose root is defined to be the node r such that the maximum path length from r to the leaf nodes of the CN is minimized, and then compresses all the rooted trees into an L-Lattice by collapsing the common subtrees. Fig. 3(a) shows the lattice of two hypothetical CNs. Each node V in the lattice is also associated with an output buffer, which contains the tuples of V that can join at least one tuple in the output buffer of each of its child nodes. Thus, each tuple in the output buffer of a top-most node V (i.e., the root of a CN) can form JTTs with tuples in the output buffers of its descendants. The new JTTs involving a new tuple are found in a two-phase approach. In the filter phase, as illustrated in Fig. 3(b), when a new tuple t_new is inserted into a node R, KDynamic uses selections and semi-joins to check whether (1) t_new can join at least one tuple in the output buffer of each child node of R; and (2) t_new can join at least one tuple in the output buffers of the ancestors of R. New tuples that cannot pass the checks are pruned; otherwise, in the join phase (shown in Fig. 3(c)), a joining process is initiated, in a top-down manner, from each tuple in the output buffer of each root node that can join t_new, to find the JTTs involving t_new.

(a) L-Lattice of two CNs (b) Filter phase (c) Join phase

Fig. 3. Query processing in KDynamic
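The child-side check of the filter phase can be sketched as follows; the `Node` class, the key-equality join predicate, and the toy lattice fragment are illustrative assumptions, not the authors' implementation. A new tuple is admitted to a node only if it semi-joins the output buffer of every child, so non-contributing tuples are pruned before any join work:

```python
class Node:
    """A lattice node with child links and an output buffer."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.output = []  # tuples that passed the check at this node

def can_join(t1, t2):
    # Illustrative join predicate: tuples join when they share a key value.
    return t1["key"] == t2["key"]

def filter_phase(node, t_new):
    """t_new survives only if it joins >= 1 tuple in EVERY child's output buffer."""
    return all(any(can_join(t_new, t) for t in c.output) for c in node.children)

# Toy lattice fragment: a Writes node whose children are P^Q and A^Q.
p = Node("P^Q"); p.output = [{"key": 1, "val": "p"}]
a = Node("A^Q"); a.output = [{"key": 1, "val": "a"}]
w = Node("W", children=[p, a])

print(filter_phase(w, {"key": 1}))  # True: joins both child buffers
print(filter_phase(w, {"key": 2}))  # False: pruned before any join work
```

The ancestor-side semi-join check (condition (2) above) would be layered on top of this in the same style.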
In this paper, we incorporate the ranking mechanisms and the pipelined evaluation into the query processing method of KDynamic to support efficient top-k keyword search in relational databases.

4 Continual Top-k Keyword Search in Relational Databases
Database updates have two orthogonal effects on the current top-k results:

1. They change the values of df_w, N, and avdl in Eq. (2) and hence change the relevance scores of existing results.
2. New JTTs may be generated due to insertions, and existing top-k results may expire due to deletions.

Although the second effect is more drastic, the first effect is not negligible under long-term database modifications. Thus, we cannot neglect the JTTs that are not among the current top-k results, because some of them have the potential of becoming top-k results in the future. This paper solves this problem by bounding the future relevance score of each result. We use score^u to denote the upper bound of the relevance score of each result. Then, the results whose score^u values are not larger than the relevance score of the top-k-th result can be safely ignored.

The second challenge is a shortage of top-k results, which can expire due to deletions. Since the value k is rather small compared to the huge number of all valid JTTs, the possibility of deleting a top-k result is rather small. In addition, new top-k results can also be formed by new tuples. Thus, if the insertion rate is not much smaller than the deletion rate, a shortage of top-k results is unlikely. However, the possibility becomes high if the deletion rate is much larger, which can result in frequent refilling operations for the top-k results. It is worth noting that a shortage of top-k results can also be caused by changes in the relevance scores of results. Our solution to this problem is to compute the top-(k + ∆k) (∆k > 0) results instead of only the necessary k; ∆k is a margin value. Then, we can withstand ∆k deletions of top results while maintaining the top-k results. The setting of ∆k is important: if ∆k is too small, refilling is likely to be frequent; if ∆k is too large, the efficiency of handling database modifications decreases. Instead of analyzing the update behavior of the underlying database to estimate an appropriate ∆k value, we enlarge ∆k on each occurrence of a top-k result shortage, until it reaches a value such that the frequency of top-k result shortages falls below a threshold.

Conversely, after maintaining the top-k results for a long time, the number of computed top results may become larger than (k + ∆k), especially when the insertion rate is high. In such cases, the efficiency of maintaining the top-k results decreases, because we need to update the relevance scores of more results and to join the new tuples with more tuples than necessary. As shown in the experimental results, such extra cost is not negligible for long-term database modifications. Therefore, we need to reverse the pipelined query evaluation if there are too many computed top results.

In brief, when a continual keyword query is registered, we first generate the set of CNs and compact them into a lattice L. Then, the initial top-k results are found by processing tuples in L in a pipelined way until the score^u values of the unseen JTTs are not larger than the relevance score of the top-(k + ∆k)-th result (which is denoted by L.θ). When maintaining the top-k results, we only find the new results with score^u > L.θ. The pipelined evaluation of L is resumed if the number of found results with score^u > L.θ falls below k, or is reversed if that number is larger than (k + ∆k).
The method of computing score^u for results is introduced in Section 4.2. Section 4.3 and Section 4.4 describe our methods of computing the initial top-k results and of maintaining the top-k results, respectively. Then, two techniques which can greatly improve the query processing efficiency are presented in Section 4.5 and Section 4.6.

4.2 Computing the Score Upper Bounds

Let us recall the function for computing tuple scores given in Eq. (2):

tscore(t, Q) = Σ_{w ∈ t∩Q} [ (1 + ln(1 + ln(tf_{t,w}))) / ((1 − s) + s · dl_t / avdl) ] · ln(N / df_w + 1).

We assume that the future values of each ln(N/df_w + 1) and of avdl have upper bounds, ln^u(N/df_w + 1) and avdl^u, respectively. Then, we can derive the upper bound of the future tuple score of each tuple t as:

t.tscore^u = Σ_{w ∈ t∩Q} [ (1 + ln(1 + ln(tf_{t,w}))) / ((1 − s) + s · dl_t / avdl^u) ] · ln^u(N / df_w + 1).   (4)

Hence, the upper bound of the future relevance score of a JTT T is:

T.score^u = (Σ_{t∈T} t.tscore^u) / size(T).   (5)

Note that the function in Eq. (5) also has the tuple monotonicity property with respect to tscore^u. On query registration, each ln^u(N/df_w + 1) is computed as ln(N / (df_w(1 − ∆df_w)) + 1), and each avdl^u is computed as avdl(1 + ∆avdl), where ∆df_w and ∆avdl are both initially set to small values. When maintaining the top-k results, we continually monitor the change of the statistics to determine whether all the ln(N/df_w + 1) and avdl values remain below their upper bounds. Each time a ln(N/df_w + 1) or avdl value exceeds its upper bound, the corresponding ∆df_w or ∆avdl is enlarged, until the frequency of exceeding the upper bounds falls below a small number.

Example 3.
Table 2 shows the tscore^u values of the six matched tuples in Example 1, obtained by setting ∆df_w = 20% together with a small ∆avdl; the score^u values of the top-3 results of Example 1 then follow from Eq. (5).

Table 2.
Upper bounds of the tuple scores (tscore^u) of the six matched tuples

4.3 Computing the Initial Top-k Results
Fig. 4 shows the L-lattice of the seven CNs in Example 2. We use V_i to denote a node in L. In particular, V_i^Q denotes a lattice node of a query tuple set, and V_i^Q.R^Q denotes the query tuple set of V_i^Q. Dual edges between two nodes, for instance between a node V and a node V^Q, indicate that V is a dual child of V^Q. A node V_i in L can belong to multiple CNs. We use V_i.CN to denote the set of CNs that node V_i belongs to; for example, the V^Q.CN of one node in Fig. 4 contains four of the seven CNs. Tuples in each query tuple set V_i^Q.R^Q are sorted in non-increasing order of tscore^u. We use V_i^Q.cur to denote the current tuple, such that all tuples before its position have been processed, and we use V_i^Q.cur ← V_i^Q.cur + 1 to move V_i^Q.cur to the next position. Initially, for each node V_i^Q in L, V_i^Q.cur is set to the top tuple in V_i^Q.R^Q. In Fig. 4, the V_i^Q.cur of the four such nodes are denoted by arrows. For a node V_i of a free tuple set R_i^F, we regard all the tuples of R_i^F as its processed tuples at all times. We use V_i.output to denote the output buffer of V_i, which contains its processed tuples that can join at least one tuple in the output buffer of each child node of V_i. Tuples in V_i.output are also referred to as the outputted tuples of V_i.

Fig. 4. The constructed lattice of the seven CNs in Example 2

In order to find the top-k results in a pipelined way, we need to bound the score^u values of the not-yet-found results. For each tuple t_j of V_i^Q.R^Q, the maximal score^u value of the JTTs that t_j can form is defined as follows:

score^u(V_i^Q, t_j, Q) = 0, if a child node of V_i^Q has an empty output buffer;
score^u(V_i^Q, t_j, Q) = max_{C ∈ V_i^Q.CN} score^u(C.R^Q.t_j, Q), otherwise,   (6)

where score^u(C.R^Q.t_j, Q) denotes the maximal score^u of all the JTTs of C that contain tuple t_j, and is obtained by replacing tscore with tscore^u in Eq. (3). If a child of V_i^Q has an empty output buffer, processing any tuple at V_i^Q cannot produce JTTs; hence score^u(V_i^Q, t_j, Q) = 0, and we need not process tuples at V_i^Q until all its child nodes have non-empty output buffers. According to Eq. (6) and the tuple sorting order, among the unprocessed tuples of V_i^Q.R^Q, score^u(V_i^Q, V_i^Q.cur, Q) has the maximum value. We use score^u(V_i^Q, Q) to denote score^u(V_i^Q, V_i^Q.cur, Q). In Fig. 4, the score^u(V_i^Q, Q) values of the four V_i^Q nodes are shown next to the arrows; for example, the score^u(V^Q, Q) of the A^Q node is the maximum of score^u(C.A^Q.a, Q) over the four CNs that contain it, where a is its current tuple.

Algorithm 1 processes the tuples in L to find the initial top-k results; it is similar to the GP algorithm. Lines 1-3 are the initialization step, which sorts the tuples in each query tuple set and initializes each V_i^Q.cur. Then, in each while iteration (lines 4-8), the unprocessed tuple among all the V^Q nodes that maximizes score^u is selected to be processed. Processing the selected tuple is done by calling the procedure Insert. Algorithm 1 stops when max_{V_i^Q ∈ L} score^u(V_i^Q, Q) is not larger than the relevance score of the top-(k + ∆k)-th found result.
The procedure Insert(V_i, t) is provided by KDynamic; it updates the output buffers of V_i (line 13) and of all its ancestors (lines 17-18), and finds all the JTTs containing tuple t by calling the procedure EvalPath (line 16). We will explain procedure Insert with examples later. The recursive procedure EvalPath(V_i, t, path), also provided by KDynamic, constructs JTTs using the outputted tuples of V_i's descendants that can join t. The stack path, which records where the join sequence comes from, is used to reduce the join cost. Example 4.
In the first round, tuple V_Q.p is processed by calling Insert(V_Q, p). Since V_Q is the root node of its CN, EvalPath is called and the JTT T = p is found. Then, for the two father nodes of V_Q, V and V: V.output is not updated because V_Q.output = ∅, while V.output is updated to {w, w} because p can join w and w. And then, for the two father nodes of V, V_Q and V, V_Q.output is not updated since V_Q has no processed
Algorithm 1: EvalStatic-Pipelined (lattice L, the top-k value k, ∆k)
1   topk ← ∅   // the priority queue storing found JTTs ordered by score
2   sort the tuples of each V_Qi.R_Q in non-increasing order of tscore_u
3   foreach node V_Qi in L do V_Qi.cur ← V_Qi.R_Q.t_1
4   while max_{V_Qi∈L} score_u(V_Qi, Q) > topk[k + ∆k].score do
5       suppose score_u(V_Q, Q) = max_{V_Qi∈L} score_u(V_Qi, Q)
6       path ← ∅   // a stack which records the join sequence
7       Insert(V_Q, V_Q.cur)   // process tuple V_Q.cur at V_Q
8       V_Q.cur ← V_Q.cur + 1
9   output the first k results in topk
10  L.θ ← topk[k + ∆k].score

Procedure Insert (lattice node V_i, tuple t)
12  if t ∉ V_i.output and t can join at least one outputted tuple of every child of V_i then
13      insert t into V_i.output
14  if t ∈ V_i.output then
15      push (V_i, t) onto path
16      if V_i is a root node then topk ← topk ∪ EvalPath(V_i, t, path)
17      foreach father node V_i' of V_i in L do
18          foreach tuple t' of V_i' that can join t do Insert(V_i', t')
19      pop (V_i, t) from path

Procedure EvalPath (lattice node V_i, tuple t, stack path)
20  T ← {t}   // the set of found JTTs
21  foreach child node V_i' of V_i in L do
22      T' ← ∅   // the set of JTTs rooted at tuples of node V_i'
23      if V_i' ∈ path then
24          let t' be the tuple of node V_i' stored in path; T' ← EvalPath(V_i', t', path)
25      else
26          foreach tuple t' ∈ V_i'.output that can join t do T' ← T' ∪ EvalPath(V_i', t', path)   // union the JTTs rooted at different tuples of V_i'
27      T ← T × T'   // compute the Cartesian product
28  return T

tuples, and V.output is set to {a} because a is the only tuple in A_F that can join w and w. Since this node is the root node of its CN, EvalPath(V, a, path) is called, but no results are found because the only JTT found, p ← w → a ← w → p, is not a valid result. After processing tuple V_Q.p, score_u(V_Q, Q) = 0.82 and score_u(V_Q, Q) = 0.57. In the second round, tuple V_Q.a is processed, which finds the results T = a and T = p ← w → a. Then V.output = {p}, V.output = {w, w}, V.output = {w}, score_u(V_Q, Q) = 0.18, and score_u(V_Q, Q) = score_u(CN.A_Q.a, Q) = 0.69. In the third to fifth rounds, tuples V_Q.a, V_Q.a and V_Q.a are processed, which insert a into V_Q.output, and no results are found. In the sixth round, tuple V_Q.a is processed, which finds the results a and a ← w → p ← w → a. Then Algorithm 1 stops, because the relevance score of the third result in the queue topk (supposing ∆k = 0) is larger than all the score_u(V_Qi, Q) values. Fig. 5 shows the snapshot of L after finding the top-3 results. Thus, θ = 0.68 after the evaluation.
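The control flow of Algorithm 1 can be sketched on toy data. This is our own simplification: each node is just a sorted score list, scores are assumed positive, and "processing" a tuple directly emits its score instead of running Insert/EvalPath:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Pipelined top-k sketch: repeatedly process the node whose current
// tuple has the largest bound, until that bound no longer exceeds the
// (k+dk)-th found score.
std::vector<double> eval_pipelined(const std::vector<std::vector<double>>& nodes,
                                   std::size_t k, std::size_t dk) {
    std::vector<std::size_t> cur(nodes.size(), 0);
    std::vector<double> topk;   // kept sorted in non-increasing order
    for (;;) {
        // pick the node whose current tuple has the maximum bound
        double best = -1.0; std::size_t bi = nodes.size();
        for (std::size_t i = 0; i < nodes.size(); ++i)
            if (cur[i] < nodes[i].size() && nodes[i][cur[i]] > best) {
                best = nodes[i][cur[i]]; bi = i;
            }
        // stop once no unseen result can beat the (k+dk)-th found score
        double threshold = (k + dk > 0 && topk.size() >= k + dk)
                         ? topk[k + dk - 1] : -1.0;
        if (bi == nodes.size() || best <= threshold) break;
        topk.push_back(nodes[bi][cur[bi]]);
        std::sort(topk.rbegin(), topk.rend());
        ++cur[bi];
    }
    if (topk.size() > k) topk.resize(k);
    return topk;
}
```

Because every node's candidates are sorted in non-increasing order, the maximum current bound over all nodes is a valid bound on every unseen result, which is what justifies the early stop.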
Fig. 5.
After finding the top-3 results (tuples in the output buffers are shown in bold)

After the execution of Algorithm 1, the score_u values of all the not-yet-found results are not larger than L.θ. The results in the queue topk can be categorized into three kinds. The first kind are the (k + ∆k) results with score ≥ L.θ, which are the initial top-(k + ∆k) results. The second kind are those with score < L.θ and score_u ≥ L.θ, which are called the potential top-(k + ∆k) results because they still have the potential to become top-(k + ∆k) results. The third kind are those with score_u < L.θ. As shown in the experiments, the results of the last kind may be large in number. However, we cannot discard them, because some of them may turn into the first two kinds while the top-k results are being maintained.

Maintaining Top-k Results
Algorithm 2 shows our algorithm for maintaining the top-k results. A database update operation is denoted by OP(t, R_t), which represents that a tuple t of relation R_t is inserted (if OP is an insertion) or deleted (if OP is a deletion). Note that database updates are modeled as deletions followed by insertions. For a newly arrived OP(t, R_t), Algorithm 2 first checks whether the ln(N/(df_w + 1)) and avdl values of relation R_t exceed their upper bounds. If some ln(N/(df_w + 1)) value(s) or avdl exceeds its upper bound, we enlarge the corresponding ∆df_w value(s) or ∆avdl (line 3), and then update the score and score_u values of all the tuples in R_t^Q and of all the results in the queue topk using the enlarged ln(N/(df_w + 1)) value(s) or avdl (line 4); otherwise, we update the relevance scores of the results in topk that are with score_u ≥ L.θ (line 6). Then, we insert t into L to find the new results if OP is an insertion (lines 7-13), or delete the expired JTTs and t from L if OP is a deletion (lines 14-17). Lines 7-17 are explained in detail later. After that, the score_u(V_Qi, Q) of some nodes may be larger than L.θ, which can be caused by three reasons: (1) the upper-bound scores of the tuples of relation R_t are increased; (2) the score_u(V_Qi, Q) of some nodes are increased from 0 after inserting the new tuple into L; and (3) new CNs are added into L. Therefore, in lines 18-19, we process tuples using the procedure Insert until all the score_u(V_Qi, Q) values are not larger than L.θ. (The methods of enlarging ∆df_w, ∆avdl and ∆k are introduced in detail in the experiments.) Algorithm 2:
Maintain (the evaluated lattice L, the top-k value k, ∆k)
1   while a new database modification OP(t, R_t) arrives do
2       if some ln(N/(df_w + 1)) (or avdl) exceeds its upper bound after applying OP then
3           enlarge the corresponding ∆df_w (or ∆avdl) value(s)
4           update the relevance scores of the tuples in R_t^Q and of the results in topk
5       else
6           update score for the results in topk that are with score_u ≥ L.θ
7       if OP is an insertion then   // insert t into L
8           if t is an un-matched tuple then
9               foreach node V_i in L that is of R_t^F do Insert(V_i, t)
10          else
11              if R_t^Q is new then add the new CNs into L
12              insert t into R_t^Q in descending order of tscore_u
13              foreach V_Qi that is of R_t^Q and has score_u(V_Qi, t, Q) > L.θ do Insert(V_Qi, t)
14      else if OP is a deletion then   // delete t from L
15          delete from topk the results that contain t and are with score_u ≥ L.θ
16          if t is a matched tuple then remove t from R_t^Q
17          foreach node V_i in L such that t ∈ V_i.output do Delete(V_i, t)
18      while max_{V_Qi∈L} score_u(V_Qi, Q) > L.θ do
19          foreach node V_Qi that is with score_u(V_Qi, Q) > L.θ do Insert(V_Qi, V_Qi.cur)
20      if |{T | T ∈ topk, T.score ≥ L.θ}| < k then   // resume the evaluation of L
21          enlarge ∆k and then resume the execution of EvalStatic-Pipelined
22      else if |{T | T ∈ topk, T.score ≥ L.θ}| > (k + ∆k) then
23          RollBack(L, k, ∆k)   // reverse the evaluation of L
24      report the new first k results in topk if they are changed

Procedure Delete (lattice node V_i, tuple t)
26  delete t from V_i.output
27  foreach father node V_i' of V_i in L do
28      foreach tuple t' in V_i'.output that can join t only do
29          Delete(V_i', t')   // call Delete recursively
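The recursive cascade in procedure Delete can be sketched as follows. This is our own toy model: joinability is abstracted into static per-tuple join sets and each node has a single child, whereas the real procedure re-checks joinability against the current output buffers:

```cpp
#include <map>
#include <set>
#include <vector>

// Sketch of the cascade in procedure Delete: each node stores, per own
// tuple id, the set of child tuples it can join. Removing t from a node
// removes every parent tuple whose only join partner was t, recursively.
struct Node {
    std::set<int> output;                 // output buffer
    std::vector<Node*> parents;
    std::map<int, std::set<int>> joins;   // own tuple id -> joinable child tuple ids
};

void cascade_delete(Node& v, int t) {
    if (v.output.erase(t) == 0) return;   // t was not an outputted tuple
    for (Node* p : v.parents) {
        std::vector<int> doomed;
        for (int pt : p->output) {
            auto it = p->joins.find(pt);
            if (it != p->joins.end() && it->second.count(t) && it->second.size() == 1)
                doomed.push_back(pt);     // pt could join t only
        }
        for (int pt : doomed) cascade_delete(*p, pt);
    }
}
```

A parent tuple survives as long as it still joins some other outputted child tuple, so a single deletion only propagates along tuples that depended exclusively on t.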
Finally, in lines 20-23, we count the number of results that are with score ≥ L.θ. If the number is smaller than k, ∆k is enlarged, and then the EvalStatic-Pipelined algorithm (without the initialization step) is called to further evaluate L. If the number is larger than k + ∆k, the algorithm RollBack, which is described at the end of this subsection, is called to roll back the evaluation of L. In any case, at the end of handling OP, we have max_{V_Qi∈L} score_u(V_Qi.cur, Q) ≤ topk[k].score. Therefore, the k results in topk that have the largest relevance scores are the top-k results. We do not process the results in topk that are with score_u < L.θ in line 6 and line 15, because they can be large in number and do not have the potential to become top-k results. However, after the execution of lines 4 and 21, the score_u of some of them may become larger than L.θ, because their score_u values may be enlarged in line 4 and L.θ may be decreased in line 21. Therefore, all the results in topk need to be considered in lines 4 and 21. Note that we first have to check whether some of them have expired due to deletions.

In lines 7-13, the new tuple t is processed differently according to whether it contains the keywords. If t is an un-matched tuple, it is inserted into each node of R_t^F using the procedure Insert (line 9). If t is a matched tuple, inserting it into L is more complicated. First, if t introduces a new non-empty query tuple set R_t^Q, we add the new CNs involving R_t^Q into the lattice. Fig. 6 illustrates the process of inserting a new CN into the lattice shown in Fig. 5. Assuming that W-P^Q is the largest common subtree of the new CN and L, and that V_f is the father node of W-P^Q in the new CN, the new CN is added by setting V as the child of V_f. If V_f is a free tuple set and has no other child nodes, as shown in Fig. 6, Insert(V_f, t') is called for each tuple t' of V_f that can join tuples in V.output. Further evaluation at the nodes of the new CN, if necessary, will be done in lines 18-19. Second, t is added into the query tuple set R_t^Q (line 12), and then for each node V_Qi of R_t^Q, Insert(V_Qi, t) is called when score_u(V_Qi.R_Q.t, Q) > L.θ (line 13), i.e., when t has the potential to form JTTs that are with score_u > L.θ.

Fig. 6. Inserting a new CN into the lattice

If OP is a deletion, then for each node V_i in L such that t ∈ V_i.output, we delete t from V_i.output using the procedure Delete, which is provided by KDynamic. Procedure Delete first removes t from V_i.output, and then checks whether some outputted tuples of the ancestors of V_i need to be removed (lines 27-29). For instance, if the tuple a is deleted from the lattice node V_Q shown in Fig. 5, the tuples w and w are deleted from V.output too, because among the tuples in V_Q.output they can join a only.

Algorithm 3 outlines our algorithm to reverse the execution of the pipelined evaluation of the lattice. In the beginning, L.θ is set to the relevance score of the (k + ∆k)-th result in the queue topk (line 1). Then, the processing of each processed tuple t ∈ R_Qi that is of score_u(V_Qi.R_Q.t, Q) ≤ L.θ is reversed (lines 4-6). We use V_Qi.cur ← V_Qi.cur − 1 to move the cursor back. If t ∈ V_Qi.output, the results involving t are first deleted from topk, and then t is deleted from V_Qi.output by calling the procedure Delete. Algorithm 3:
RollBack (a lattice L, the top-k value k, ∆k)
1   L.θ ← topk[k + ∆k].score
2   foreach node V_Qi in L do
3       while score_u(V_Qi, V_Qi.cur − 1, Q) ≤ L.θ do
4           if V_Qi.cur − 1 ∈ V_Qi.output then
5               remove from topk the results that are of CNs in V_Qi.CN and contain the tuple V_Qi.cur − 1
6               Delete(V_Qi, V_Qi.cur − 1)   // delete it from the output buffer
7           V_Qi.cur ← V_Qi.cur − 1

In Algorithm 1 and Algorithm 2, the procedures
Insert and
Delete may be called multiple times upon multiple nodes for the same tuple. The core of the two procedures is the select operations (or semi-joins [15]). For example, in line 12 and line 18 of procedure
Insert, we need to select the tuples that can join t from the output buffer of each child node of V_i and from the set of processed tuples of each father node of V_i, respectively. Although such select operations can be done efficiently by the DBMS using indexes, the cost of handling t is high due to the large number of database accesses. For example, in our experiments, the maximal number of database accesses for a new tuple t can be up to several hundred.

These select operations done for the same tuple t can be evaluated efficiently by sharing the computational cost among them. Assume a new tuple w is inserted into the lattice shown in Fig. 5; then the procedure Insert is called three times (
Insert(V, w) at three different lattice nodes V) and at most eight selections are done. All eight select operations can be expressed using the following two relational algebra expressions: π_aid(σ_{wid = w.wid}(W) ⋈ σ_{aid ∈ A_i}(A)) and π_pid(σ_{wid = w.wid}(W) ⋈ σ_{pid ∈ P_j}(P)), where A_i and P_j represent the set of tuples in the output buffer of a node or the set of processed tuples of a node. Since the A_i and P_j can differ from each other, the eight select operations need to be evaluated individually. However, if we rewrite the above expressions as π_aid(σ_{aid ∈ A_i}(σ_{wid = w.wid}(W) ⋈ A)) and π_pid(σ_{pid ∈ P_j}(σ_{wid = w.wid}(W) ⋈ P)), the eight select operations have two common sub-operations: σ_{wid = w.wid}(W) ⋈ A and σ_{wid = w.wid}(W) ⋈ P. If the results of the two common sub-operations are shared and the selections σ_{aid ∈ A_i} and σ_{pid ∈ P_j} are done in main memory, the eight select operations can be evaluated with only two database accesses. Algorithm 4:
CanJoinOneOutputTuple (lattice node V_i, tuple t)
1   let R_i be the relation corresponding to the tuple set of V_i
2   if the tuples of relation R_i that can join t have not been stored then
3       query the tuples of relation R_i that can join t and store them
4   foreach tuple t' of the stored tuples of relation R_i that can join t do
5       if t' can be found in V_i.output then return true
6   return false

Algorithm 4 shows our procedure for checking whether a tuple t can join at least one tuple in the output buffer of a lattice node V_i, which is called in line 12 of procedure Insert. In line 3, all the tuples in relation R_i that can join t are queried and cached in main memory. This set of cached joined tuples can be reused every time it is requested. The procedures for the select operations in line 18 of Insert and line 28 of
Delete are also designed in this pattern; they are omitted due to the space limitation. Note that when the two procedures
Insert and
Delete are called recursively, the select operations done in the above lines are also evaluated by these procedures. Therefore, for each tuple t, a tree of tuples, which is rooted at t and consists of all the tuples that can join t, is created. The tree of tuples can be seen as the cached localization information of t. It is created on-the-fly, i.e., along with the execution of the procedures Insert
Delete, and its depth is determined by the recursion depth of the two procedures. The maximum recursion depth of the procedures
Insert and
Delete is CN_max + 1, where CN_max indicates the maximum size of the generated CNs. Hence, the height of this tree of tuples is bounded by CN_max + 1. Assume a tuple p of P^Q is inserted into the two nodes of P^Q in the lattice shown in Fig. 5. Fig. 7 illustrates the select operations done in the procedure Insert (denoted as arrows in the left part) and the cached joined tuples of p (shown in the right part). For instance, the arrows from V_Q to V select the tuples in relation W that can join p. Three select operations are denoted by dashed arrows because they would not be done if the results of the two select operations, from V_Q to V and from V_Q to V, were empty. For the same reason, the stored tuples of relation A that can join p are denoted using dashed rectangles. Fig. 7.
Selections done in
Insert and the cached joined tuples for a tuple p of P^Q

When computing the initial top-k results, the database is static; hence the cached joined tuples of each tuple remain unchanged and can be reused until the database is updated.
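A minimal sketch of the caching idea follows. The JoinCache type and its layout are our own illustration; the DBMS access is simulated by an in-memory map and a counter:

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache of joined tuples: one database access per
// (relation, tuple) pair, shared by every later selection on that pair.
struct JoinCache {
    int db_accesses = 0;
    std::map<std::pair<std::string, int>, std::vector<int>> cache;

    // db: relation name -> (tuple id -> ids of joinable tuples)
    const std::vector<int>& joined(
            const std::map<std::string, std::map<int, std::vector<int>>>& db,
            const std::string& rel, int t) {
        auto key = std::make_pair(rel, t);
        auto it = cache.find(key);
        if (it == cache.end()) {                       // first request: query the DBMS
            ++db_accesses;
            it = cache.emplace(key, db.at(rel).at(t)).first;
        }
        return it->second;
    }
};

// CanJoinOneOutputTuple, restated over the cache: the in-memory
// selection over the output buffer is done on the cached list.
bool can_join_output(JoinCache& jc,
                     const std::map<std::string, std::map<int, std::vector<int>>>& db,
                     const std::string& rel, int t, const std::set<int>& output) {
    for (int u : jc.joined(db, rel, t))
        if (output.count(u)) return true;
    return false;
}
```

Repeated checks against different output buffers of the same relation then reuse a single simulated database access, which mirrors the sharing argument above.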
When maintaining the top-k results, although the database is continually updated, we can assume the database unchanged before t is handled. However, the cached joined tuples of t expire after t is handled by Algorithm 2. As shown in the experimental results, caching the joined tuples can greatly improve the efficiency of computing the initial top-k results and of maintaining the top-k results.

According to Eq. (3), the score_u values of tuples in different CNs can differ greatly. For example, the score_u values of tuples in a large CN are smaller than those of tuples in a small CN, due to the CN size. In algorithm GP, no tuples or only a small portion of them are joined in the CNs whose tuples have small score_u values. If the CNs in Example 2 were evaluated by algorithm GP, the tuple set A^Q of one CN and the tuple set P^Q of another would have no processed tuples. However, in the lattice, a node R_Qi can be shared by multiple CNs; thus, when a tuple t is inserted into R_Qi, t is processed in all the CNs in R_Qi.CN. As shown in Fig. 5, since the node V_Q is shared by four CNs in the lattice, the tuples a and a are processed in all four of these CNs when they are processed at V_Q, which results in un-needed operations at the nodes V and V and in two un-needed results, a and a ← w → p ← w → a. We call these operations and JTTs un-needed because they would not occur or be found if the CNs were evaluated separately. These un-needed operations can cause further un-needed operations when maintaining the top-k results. For example, we have to join a new unmatched tuple of relation P with four tuples in V.output.

The essence of the above problem is that CNs have different potentials in producing top-k results, and hence the same tuple set can have different numbers of processed tuples in different CNs if they are evaluated separately. In order to avoid finding the un-needed results, the optimal method is to share only the tuple sets that have the same number of processed tuples among CNs when they are evaluated separately.
However, we cannot obtain these numbers without evaluating the CNs. As an alternative, we attempt to estimate this number for the tuple sets of each CN C according to the following heuristic rules:

– If Max(C) = (Σ_{1≤i≤m} C.R_Q^i.t_1.tscore_u) / size(C), which indicates the maximum score_u of the JTTs that C can produce, is high, the tuple sets of C have more processed tuples.
– If two CNs have the same
Max(C) values, the tuple sets of the CN with the larger size have more processed tuples.

Therefore, we can cluster the CNs using their Max(C) · ln(size(C)) values, where ln(size(C)) is used to normalize the effect of the CN sizes. Then, when constructing the lattice, only the subtrees of CNs in the same cluster are collapsed. For example, the Max(C) · ln(size(C)) values of the seven CNs of Example 2 are 5.15, 2.93, 5.39, 6.84, 5.32, 5.70 and 3.03; hence they can be clustered into two clusters, one containing the two CNs with the smallest values and the other containing the remaining five. Fig. 8 shows the lattice after finding the top-3 results if the CNs are clustered, where the three un-needed JTTs in Fig. 5 can be avoided. As shown in the experimental section, clustering the CNs can greatly improve the efficiency of computing the initial top-k results and of handling the database updates.

We cluster the CNs using the K-means clustering algorithm [21], which needs an input parameter indicating the number of expected clusters. We use Kmean to denote the ratio of this input parameter to the number of CNs. The value of
Kmean represents the trade-off between sharing the computation cost among CNs and considering their different potentials in producing top-k results. When Kmean =
0, the CNs are not clustered, and they share the computation cost to the maximum extent. When
Kmean = 1, every CN forms its own cluster. An intermediate Kmean value balances the two effects in computing the initial top-k results and handling the database updates. Fig. 8.
After finding the top-3 results if the CNs are clustered into two clusters
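The clustering step can be sketched with a tiny one-dimensional K-means over the Max(C) · ln(size(C)) values. This is our own sketch: it seeds the two centroids deterministically with the minimum and maximum value, whereas the paper uses random starting conditions, and it assumes a non-empty input:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Cluster CNs by their Max(C) * ln(size(C)) values into two groups.
// Returns a cluster index (0 or 1) per CN; centroid 0 starts at the
// minimum value and centroid 1 at the maximum, so the result is
// deterministic.
std::vector<int> kmeans2(const std::vector<double>& vals) {
    double c0 = *std::min_element(vals.begin(), vals.end());
    double c1 = *std::max_element(vals.begin(), vals.end());
    std::vector<int> label(vals.size(), 0);
    for (int iter = 0; iter < 100; ++iter) {
        bool changed = false;
        double s0 = 0, s1 = 0; int n0 = 0, n1 = 0;
        for (std::size_t i = 0; i < vals.size(); ++i) {
            // assign each value to the nearer centroid
            int l = std::fabs(vals[i] - c0) <= std::fabs(vals[i] - c1) ? 0 : 1;
            if (l != label[i]) { label[i] = l; changed = true; }
            (l == 0 ? s0 : s1) += vals[i];
            (l == 0 ? n0 : n1) += 1;
        }
        if (n0) c0 = s0 / n0;                 // recompute the centroids
        if (n1) c1 = s1 / n1;
        if (!changed && iter > 0) break;      // converged
    }
    return label;
}
```

On the seven example values above, this separates the two smallest values (2.93 and 3.03) from the other five, matching the two clusters described in the text.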
We conducted extensive experiments to test the efficiency of our methods. We use the DBLP dataset. Note that DBLP is not continuously growing and is updated on a monthly basis. The reason we use DBLP to simulate a continuously growing relational dataset is that there are no real growing relational datasets in public, and many studies [4,9] on top-k keyword queries over relational databases use DBLP. The downloaded XML file is decomposed into relations according to the schema shown in Fig. 9. The two arrows from PaperCite to Papers denote the foreign-key references from paperID to paperID and from citedPaperID to paperID, respectively. The DBMS used is MySQL (v5.1.44) with the default "Dedicated MySQL Server Machine" configuration. All the relations use the MyISAM storage engine. Indexes are built for all primary-key and foreign-key attributes, and full-text indexes are built for all text attributes. All the algorithms are implemented in C++. We conducted all the experiments on a PC with a 2.53 GHz CPU and 4 GB memory running Windows 7.

We use the following five parameters in the experiments: (1) k: the top-k value; (2) l: the number of keywords in a query; (3) IDF: the ratio of the number of matched tuples to the number of total tuples, i.e., df_w / N; (4) CN_max: the maximum size of the generated CNs; and (5) Kmean: the ratio of the number of clusters of CNs to the number of CNs. (The DBLP data was downloaded from http://dblp.mpi-inf.mpg.de/dblp-mirror/index.php.)
Fig. 9.
The DBLP schema (PK stands for primary key, FK for foreign key)

The parameters with their default values are shown in Table 3. The selected keywords are listed in Table 4 with their
IDF values, where the keywords in bold are keywords popular in author names. Ten queries are constructed for every IDF value, each of which contains three selected keywords. For each l value, ten queries are constructed by selecting l keywords from the row of IDF = 0.013 in Table 4. To avoid generating a small number of CNs for each query, one author-name keyword of each IDF value is always selected for each query.

When k grows, the cost of computing the initial top-k results increases since we need to compute more results, and the cost of maintaining the top-k results also increases since there are more tuples in the output buffers of the lattice nodes. The parameter CN_max has a great impact on keyword query processing because the number of generated CNs increases exponentially as CN_max increases. The number of matched tuples increases as IDF and l increase. Hence, the first four parameters k, l, IDF and CN_max affect the scalability of our method. Table 3.
Parameters (default values: k = 100, l = 3, IDF = 0.013, CN_max = 6, Kmean = 0.60)

Name      Values
k         100, 150, 200
l         3, 4, 5
IDF       0.013, 0.03
CN_max    4, 5, 6, 7
Kmean     0, 0.20, 0.40, 0.60, 0.80, 1

Table 4. Keywords and their IDF values

Keywords: ATM, embedded, navigation, privacy, scalable, Spatial, XML, Charles, Eric
          James, Zhang
          John, Wang
          David, Michael

Exp-1: Initial Top-k Results Computation
In this experiment, we study the effects of the five parameters on computing the initial top-k results. We retrieve the data in the XML file sequentially until the numbers of tuples in the relations reach the numbers shown in Table 5. Then we run the algorithm EvalStatic-Pipelined on different values of each parameter while keeping the other four parameters at their default values. We use two measures to evaluate the effects of the parameters. The first is R, the number of found results in the queue topk. The second is T, the time cost of running the algorithm. Ten top-k queries are selected for each combination of parameters, and the average values of the two measures are reported in the following. In this experiment, ∆df_w, ∆avdl and ∆k are all set to very small values when computing the initial top-k results.

The main results of this experiment are given in Fig. 10. Note that the units of the y-axis differ among the measures. Fig. 10(a), (b) and (c) show that both measures increase as k, IDF and CN_max grow. However, they do not increase rapidly in Fig. 10(a), (b) and (c), which implies the good scalability of our method. On the contrary, we can find a rapid increase as CN_max grows in the time cost of the method of [9] in finding the top-k results, which is shown in Fig. 10(c) and denoted by T[SPARK]. Fig. 10(c) shows that, compared to the existing method, algorithm
EvalStatic-Pipelined is very efficient in finding the top-k results. The reason is that evaluating the CNs using the lattice achieves full reduction, because all the tuples in the output buffers of the root nodes can form JTTs, and it saves computation cost by sharing the common sub-expressions [15]. Fig. 10(d) shows that the effect of l is more complicated: both measures may decrease when l increases. As shown in Fig. 10(d), R and T even both achieve their minimum values when l = 5. This is because when the number of keywords is large, the probability is high that the keywords co-appear in a tuple and that the matched tuples can join. Therefore, there are more JTTs that have high relevance scores, which results in a larger θ and small values of the two measures. Table 5.
Tuple numbers of relations
Papers: 157,300   PaperCite: 9,155   Write: 400,706   Authors: 190,615   Proceedings: 2,886   ProcEditors: 1,936   ProcEditor: 1,411
Fig. 10(e) presents the change of the two measures when Kmean varies. Since the results of the K-means clustering may be affected by the starting condition [21], for each Kmean value we run Algorithm 1 five times with different starting conditions for each keyword query and report the average experimental results. Note that the algorithm EvalStatic in KDynamic corresponds to Kmean = 0; hence the results at Kmean = 0 are those of KDynamic. From Fig. 10(e), we can see that clustering the CNs can greatly improve the efficiency of computing the top-k results, and that the time cost decreases as Kmean increases. However, when Kmean = 1, which indicates that all the CNs are evaluated separately, the time cost grows to a higher value than when Kmean is 0.6 or 0.8. Therefore, it is important to select a proper Kmean value. The minimum T in this experiment is achieved at Kmean = 0.6; hence the default value of Kmean is 0.6 in our experiments. As can be seen in the next section, Kmean = 0.6 also performs well in maintaining the top-k results.

Fig. 10(f) compares the time cost of our method in computing the top-k results with that of KDynamic, while varying CN_max. The time cost of KDynamic is denoted by "!Cache" because it does not cache the joined tuples for each tuple. We can see that caching the joined tuples for each tuple greatly improves the efficiency of computing the top-k results. More importantly, the improvement increases as CN_max grows. This is because when CN_max grows, the number of calls of the procedure Insert on each tuple

(a) Varying k (b) Varying IDF (c) Varying CN_max (d) Varying l (e) Varying Kmean (f) Effect of storing joined tuples

Fig. 10.
Experimental results of calculating the initial top-k results

increases fast, since the number of lattice nodes increases exponentially; hence the cost saved by storing the joined tuples of each tuple grows as CN_max grows.

From the curves of R in Fig. 10, we can see that the R values are large in all the settings: about several thousand. Recall that topk contains three kinds of results. The number of results of the first kind is k + ∆k, which is small compared to the R values. Since ∆df_w, ∆avdl and ∆k all have very small values, the number of potential top-(k + ∆k) results in topk is also very small. The third kind, the results with score_u < L.θ, is in the majority and is large in number.

Exp-2: Top-k Result Maintenance
In this experiment, we study the efficiency of Algorithm 2 in maintaining the top-k results. We use the same keyword queries as in Exp-1. After calculating the initial top-k results for them, we sequentially insert additional tuples into the database by retrieving data from the DBLP XML file. At the same time, we delete randomly selected tuples from the database. Algorithm 2 is used to maintain the top-k results for the queries while the database is being updated. The database update records are read from the database log file; hence the database updating rate has no direct impact on the efficiency of top-k result maintenance, because the database is updated by another process.

We first add 713,084 new tuples into the database and delete 250,000 tuples from the database. The new data is roughly 90 percent of the data used in Exp-1. The composition of the additional tuples is shown in Table 6. Fig. 11(a) and (b) show the average execution times of Algorithm 2 in handling the above database updates when varying the five parameters, which demonstrates the efficiency of Algorithm 2. Note that the units of the x-axis differ among the five parameters; their minimum and maximum values are labeled in Fig. 11(a) and (b), and their other values can be found in Table 3. We can see that the time cost of handling database updates for the default queries is smaller than 1.5 ms. Comparing Fig. 11(a) and (b) with the curves of measure T in Fig. 10 (especially the curves in Fig. 10(d) and Fig. 10(e)), we can see that the time cost of handling database updates and the time cost of computing the initial top-k results have the same trends. This is because when more time is needed to compute the initial top-k results, there are more outputted tuples in the lattice; hence more time is required for the selections in the procedures Insert and Delete, and their recursion depths are larger. Fig. 11(c) compares the time cost of our method in handling database updates with that of KDynamic, while varying CN_max. The time cost of KDynamic is again denoted by "!Cache". We can see that caching the joined tuples for each tuple also improves the efficiency of handling database updates, and the larger CN_max is, the higher the improvement. Table 6.
Composition of the additional tuples
Papers: 156,965   PaperCite: 20,010   Write: 411,109   Authors: 111,094   Proceedings: 3,033   ProcEditors: 3,886   ProcEditor: 6,987

(Since it is hard to read in one figure, we split the data of the five parameters into two figures.)

(a) Time for handling database updates while varying
Kmean and CN_max (b) Time for handling database updates while varying l, k and IDF (c) Effect of storing joined tuples in handling database updates (d) Changes of the times of enlarging ∆df_w (e) Changes of the times of calling procedure RollBack (f) Changes of the times of enlarging ∆k

Fig. 11. Efficiency of top-k result maintenance

Secondly, we insert only the 713,084 additional tuples into the database while maintaining the top-100 results for the default ten keyword queries. We adopt two different growing rates of ∆df_w: ∆df_w+ =
2% and ∆df_w+ = 5%, which mean that when some ln(N/(df_w + 1)) exceeds its upper bound, the corresponding ∆df_w value is increased by 2% and 5%, respectively. After inserting every 100,000 additional tuples, we record the average frequency of enlarging ∆df_w and of calling the procedure RollBack for the ten queries; their changes are shown in Fig. 11(d) and (e), respectively, whose x-axes (in units of 100,000 tuples) indicate the number of additional tuples. Note that we do not report the frequency of enlarging avdl because it is very small in the experiment. Although the frequency of enlarging ∆df_w is higher when the growing rate of ∆df_w is lower, after inserting 300,000 additional tuples the number of times of enlarging ∆df_w, i.e., the number of times the upper bound of ln(N/(df_w + 1)) is exceeded, falls below 5 for both growing rates of ∆df_w. After inserting 300,000 additional tuples, the maximum ∆df_w value over all the relations is 15; hence it is reasonable to set 15 as the maximum value for ∆df_w. There is only one curve in Fig. 11(e) because the growing rate of ∆df_w has no great impact on the number of calls of the procedure RollBack, which is mainly affected by the frequency of finding new results that are with score_u > L.θ. Note that L.θ is increased after each call of the procedure RollBack. Therefore, the number of calls of the procedure
RollBack is decreasing since it is more and more harder to find new results that are with score u > L .θ . In order to study the impact of reversing the pipelined evaluation on thee ffi ciency of handling database updates, we also redo the experiment without callingthe procedure RollBack . Then, the average time cost of handling database updates isincreased by 45 . ff erent ∆ k growing rates are adopted: ∆ k + = ∆ k + =
5, which mean that when the number of results that are with score u > L .θ fallsbelow k , the corresponding ∆ k value is increased by 2 and 5, respectively. We recordthe average times of enlarging ∆ k of the ten queries after deleting each 100,000 tuples,whose changes are shown in Fig. 11(f). Fig. 11(f) shows that the frequency of shortageof top- k results falls below a very small number after deleting 200,000 tuples, i.e., after ∆ k being enlarged to about 20. As indicated by the curve of k in Fig. 11(b), a large ∆ k value can highly decrease the e ffi ciency of handling database updates. Therefore, it isreasonable to set the maximum value of ∆ d f w as 20%. In this paper, we have studied the problem of finding the top- k results in relationaldatabases for a continual keyword query. We proposed an approach that finds the an-swers whose upper bounds of future relevance scores are larger than a threshold. Weadopt an existing scheme of finding all the results in a relational database stream, butincorporate the ranking mechanisms in the query processing methods and make twoimprovements that can facilitate e ffi cient top- k keyword search in relational databases.The proposed method can e ffi ciently maintain top- k results of a keyword query without calable Continual Top- k Keyword Search in Relational Databases 27 re-evaluation. Therefore, it can be used to solve the problem of answering continualkeyword search in databases that are updated frequently.
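The two adaptation rules evaluated in this section (enlarging Δdfw on insertions when ln(N/(dfw + 1)) exceeds its cached upper bound, and enlarging Δk on deletions when fewer than k buffered results have upper-bound score above L.θ) can be sketched as follows. This is a minimal illustration only: the class names, the caching scheme, and the default values are our assumptions, not the implementation evaluated above.

```python
import math


def idf(n_tuples: int, df_w: int) -> float:
    # IDF-like term ln(N / (df_w + 1)) appearing in the relevance scores.
    return math.log(n_tuples / (df_w + 1))


class DfwSlack:
    """Insertion rule (hypothetical sketch): cache an upper bound of the
    IDF-like term with slack delta; when the live value exceeds the bound,
    enlarge the slack by the growing rate (capped) and recompute the bound."""

    def __init__(self, n_tuples, df_w, delta=0.05, growth=0.02, cap=0.15):
        self.delta, self.growth, self.cap = delta, growth, cap
        self.bound = idf(n_tuples, df_w) * (1.0 + delta)
        self.enlargements = 0  # the quantity plotted in Fig. 11(d)

    def observe(self, n_tuples, df_w):
        # Called after insertions change N or df_w for a keyword w.
        if idf(n_tuples, df_w) > self.bound:
            self.delta = min(self.delta + self.growth, self.cap)
            self.bound = idf(n_tuples, df_w) * (1.0 + self.delta)
            self.enlargements += 1


class KSlack:
    """Deletion rule (hypothetical sketch): buffer top-(k + delta) candidates;
    when the number of candidates with upper-bound score above L.theta falls
    below k, enlarge delta by the growing rate (capped)."""

    def __init__(self, k, delta=0, growth=2, cap=20):
        self.k, self.delta, self.growth, self.cap = k, delta, growth, cap
        self.enlargements = 0  # the quantity plotted in Fig. 11(f)

    def observe(self, num_above_threshold):
        # Called after deletions shrink the set of buffered results.
        if num_above_threshold < self.k:
            self.delta = min(self.delta + self.growth, self.cap)
            self.enlargements += 1
            return True  # caller should refill the buffer up to k + delta
        return False
```

Under this sketch, a lower growing rate triggers more enlargements before the slack stabilizes, matching the trend reported for Fig. 11(d).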
Acknowledgments
This research was partly supported by the National Natural Science Foundation of China (NSFC) under grant No. 60873040 and the 863 Program under grant No. 2009AA01Z135. Jihong Guan was also supported by the "Shu Guang" Program of the Shanghai Municipal Education Commission and the Shanghai Education Development Foundation.
References
1. S. Agrawal, S. Chaudhuri, and G. Das, "DBXplorer: Enabling keyword search over relational databases," ACM SIGMOD, p. 627, 2002.
2. V. Hristidis and Y. Papakonstantinou, "DISCOVER: Keyword search in relational databases," VLDB, pp. 670-681, 2002.
3. B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, and Parag, "BANKS: Browsing and keyword searching in relational databases," VLDB, pp. 1083-1086, 2002.
4. V. Hristidis, L. Gravano, and Y. Papakonstantinou, "Efficient IR-style keyword search over relational databases," VLDB, pp. 850-861, 2003.
5. V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar, "Bidirectional expansion for keyword search on graph databases," VLDB, pp. 505-516, 2005.
6. H. He, H. Wang, J. Yang, and P. S. Yu, "BLINKS: Ranked keyword searches on graphs," ACM SIGMOD, New York, NY, USA, pp. 305-316, ACM, 2007.
7. F. Liu, C. Yu, W. Meng, and A. Chowdhury, "Effective keyword search in relational databases," ACM SIGMOD, pp. 563-574, 2006.
8. G. Li, X. Zhou, J. Feng, and J. Wang, "Progressive keyword search in relational databases," ICDE, pp. 1183-1186, 2009.
9. Y. Luo, X. Lin, W. Wang, and X. Zhou, "SPARK: Top-k keyword query in relational databases," ACM SIGMOD, pp. 115-126, 2007.
10. G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou, "EASE: An effective 3-in-1 keyword search method for unstructured, semi-structured and structured data," ACM SIGMOD, pp. 903-914, 2008.
11. Y. Xu, Y. Ishikawa, and J. Guan, "Effective top-k keyword search in relational databases considering query semantics," APWeb.