Scalable Continual Top-k Keyword Search in Relational Databases
Yanwei Xu
Department of Computer Science and Technology, Tongji University, Shanghai, China
Abstract.
Keyword search in relational databases has been widely studied in recent years because it requires users neither to master a structured query language nor to know the complex underlying database schemas. Most existing methods focus on answering snapshot keyword queries in static databases. In practice, however, databases are updated frequently, and users may have long-term interests in specific topics. To deal with such situations, it is necessary to build an effective and efficient facility in a database system to support continual keyword queries. In this paper, we propose an efficient method for answering continual top-k keyword queries over relational databases. The proposed method is built on an existing scheme of keyword search over relational data streams, but incorporates ranking mechanisms into the query processing methods and makes two improvements to support efficient top-k keyword search in relational databases. Compared to the existing methods, our method is more efficient both in computing the top-k results in a static database and in maintaining the top-k results while the database is continually being updated. Experimental results validate the effectiveness and efficiency of the proposed method.

Key words:
Relational databases, keyword search, continual queries, results maintenance.
1 Introduction

With the proliferation of text data available in relational databases, simple ways of exploring such information effectively are of increasing importance. Keyword search in relational databases, with which a user specifies his/her information need by a set of keywords, is a popular information retrieval method because the user needs to know neither a complex query language nor the underlying database schemas. It has attracted substantial research effort in recent years, and a number of methods have been developed [1,2,3,4,5,6,7,8,9,10].

Example 1.
Consider the sample publication database shown in Fig. 1. Fig. 1(a) shows the three relations Papers, Authors, and Writes. In the following, we use the initial of each relation name (P, A, and W) as its shorthand. There are two foreign key references: W → A and W → P. Fig. 1(b) illustrates the tuple connections based on the foreign key references. For the keyword query "James P2P", consisting of the two keywords "James" and "P2P", there are six tuples in the database that contain at least one of the two keywords (underlined in Fig. 1(a)). They can be regarded as results of the query. However, they can be joined with other tuples according to the foreign key references to form more meaningful results, several of which are shown in Fig. 1(c). The arrows represent the foreign key references between the corresponding pairs of tuples. Finding such results, which are formed by the tuples containing the keywords, is the task of keyword search in relational databases. As described later, results are often ranked by relevance scores evaluated by a certain ranking strategy. □
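The kind of joined result shown in Fig. 1(c) can be reproduced with plain SQL. The snippet below builds a toy version of the sample schema in SQLite (the data values are illustrative stand-ins, not the exact tuples of Fig. 1) and joins Papers to Authors through Writes, keeping only trees whose endpoints match the two keywords:

```python
import sqlite3

# Build a minimal version of the sample schema (illustrative data).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Papers  (pid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Authors (aid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Writes  (wid INTEGER PRIMARY KEY,
                      aid REFERENCES Authors(aid),
                      pid REFERENCES Papers(pid));
INSERT INTO Papers  VALUES (1, 'P2P or Not P2P?'), (2, 'Logical Queries over Views');
INSERT INTO Authors VALUES (1, 'James Chen'), (2, 'Saikat Guha');
INSERT INTO Writes  VALUES (1, 1, 1), (2, 2, 2);
""")

# One class of joined results: Paper <- Writes -> Author where both
# endpoint tuples contain a query keyword.
rows = db.execute("""
SELECT a.name, p.title
FROM Writes w
JOIN Papers  p ON w.pid = p.pid
JOIN Authors a ON w.aid = a.aid
WHERE p.title LIKE '%P2P%' AND a.name LIKE '%James%'
""").fetchall()
print(rows)  # [('James Chen', 'P2P or Not P2P?')]
```

The single surviving row corresponds to one joint tuple tree of the form author ← write → paper; the non-matching author/paper pair is filtered out.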
Papers (pid, title): "Leveraging Identity-Based Cryptography for Node ID Assignment in Structured P2P Systems.", "P2P or Not P2P?: In P2P 2003", "A System for Predicting Subcellular Localization.", "Logical Queries over Views: Decidability.", "A conservative strategy to protect P2P file sharing systems from pollution attacks.", ...

Authors (aid, name): "James Chen", "Saikat Guha", "James Bassingthwaighte", "Sabu T.", "James S. W. Walkerdines", ...

Writes (wid, aid, pid): ...

(a) Database (matched keywords are underlined) (b) Tuple connections (matched tuples are solid circles) (c) Examples of query results

Fig. 1.
A sample database with a keyword query "James P2P".

Most of the existing keyword search methods assume that the databases are static and focus on answering snapshot keyword queries. In practice, however, a database is often updated frequently, and the result of a snapshot query becomes invalid once the related data in the database is updated. For the database in Fig. 1, if publication data arrives continually, new publication records are inserted into the three tables. Such new records may be more relevant to "James" and "P2P". Hence, after getting the initial top-k results, the user may demand that the top-k results reflect the latest database updates.
Such demands are common in real applications. Suppose a user wants to do a top-k keyword search in a micro-blogging database, which is being updated continually: not only are the weblogs and comments continually being inserted or deleted by bloggers, but the follow relationships between bloggers are also being updated continually. Thus, a continual evaluation facility for keyword queries is essential in such databases.

For continual keyword query evaluation, when the database is updated, two situations must be considered:

1. Database updates may change the existing top-k results: some top-k results may be replaced by new ones that are related to the new tuples, and some top-k results may become invalid due to deletions.
2. Database updates may change the relevance scores of existing results because the underlying statistics (e.g., word frequencies) are changed.

In this paper, we describe a system which can efficiently report the top-k results of every monitoring query while the database is being updated continually. The outline of the system is as follows:

– When a continual query is issued, it is evaluated in a pipelined way to find the set of results whose upper bounds of relevance scores are higher than a threshold θ, by calculating the upper bound of the future relevance score for every query result.
– When the database is updated, we first update the relevance scores of the computed results, then find the new results whose upper bounds of relevance scores are larger than θ, and delete the results containing the deleted tuples.
– The pipelined evaluation of the keyword query is resumed if the number of computed results whose relevance scores are larger than θ falls below k, or is reversed if that number is much larger than k.
– At any time, the k computed results whose relevance scores are the largest and are larger than θ are reported as the top-k results.

In Section 2, some basic concepts are introduced and the problem is defined. Section 3 discusses related work.
Section 4 presents the details of the proposed method. Section 5 gives the experimental results. The conclusion is drawn in Section 6.

2 Preliminaries

In this section, we introduce some important concepts for top-k keyword query evaluation in relational databases.

We consider a relational database schema as a directed graph G_S(V, E), called a schema graph, where V represents the set of relation schemas {R_1, R_2, ...} and E represents the foreign key references between pairs of relation schemas. Given two relation schemas R_i and R_j, there exists an edge in the schema graph from R_j to R_i, denoted R_i ← R_j, if the primary key of R_i is referenced by the foreign key defined on R_j. For example, the schema graph of the publication database in Fig. 1 is Papers ← Writes → Authors. A relation on relation schema R_i is an instance of R_i (a set of tuples) conforming to the schema, denoted r(R_i). A tuple can be inserted into a relation. Below, we use R_i to denote r(R_i) if the context is obvious.

The results of keyword queries in relational databases are a set of connected trees of tuples, each of which is called a joint-tuple-tree (JTT for short). A JTT represents how the matched tuples, which contain the specified keywords in their text attributes, are interconnected through foreign key references. Two adjacent tuples of a JTT, t_i ∈ r(R_i) and t_j ∈ r(R_j), are interconnected if they can be joined based on a foreign key reference defined on relation schemas R_i and R_j in G_S (either R_i → R_j or R_i ← R_j). The foreign key references between tuples in a JTT can be denoted using arrows or the join notation ⋈. For example, the second JTT in Fig. 1(c) can be denoted as a ← w → p or a ⋈ w ⋈ p. To be a valid result of a keyword query Q, each leaf of a JTT is required to contain at least one keyword of Q. In Fig. 1(c), the underlined Papers and Authors tuples are matched tuples of the keyword query as they contain the keywords. Hence, the four JTTs are valid results of the query. In contrast, a JTT p ← w → a whose author tuple a does not contain any required keyword is not a valid result. The number of tuples in a JTT T is called the size of T, denoted by size(T).

Given a keyword query Q, the query tuple set R_i^Q of relation R_i is defined as R_i^Q = {t ∈ r(R_i) | t contains some keywords of Q}. For example, the two query tuple sets in Example 1 are P^Q and A^Q, each consisting of the three matched tuples of Papers and Authors, respectively. The free tuple set R_i^F of a relation R_i with respect to Q is defined as the set of tuples that do not contain any keywords of Q. In Example 1, P^F and A^F consist of the remaining tuples of Papers and Authors. If a relation R_i does not contain text attributes (e.g., relation W in Fig. 1), R_i is used to denote R_i^F for any keyword query. We use R_i^{QorF} to denote a tuple set, which may be either R_i^Q or R_i^F.

Each JTT belongs to the result of a relational algebra expression, which is called a candidate network (CN) [4,9,11]. A CN is obtained by replacing each tuple in a JTT with the corresponding tuple set that it belongs to. Hence, a CN corresponds to a join expression on tuple sets that produces JTTs as results, where each join clause R_i^{QorF} ⋈ R_j^{QorF} corresponds to an edge ⟨R_i, R_j⟩ in the schema graph G_S, and ⋈ represents an equi-join between relations. For example, the CNs that correspond to the two JTTs p and p ← w → a in Example 1 are P^Q and P^Q ⋈ W ⋈ A^Q, respectively. In the following, we also denote P^Q ⋈ W ⋈ A^Q as P^Q ← W → A^Q. As the leaf nodes of JTTs must be matched tuples, the leaf nodes of CNs must be query tuple sets. Due to the existence of m:n relationships (for example, an article may be written by multiple authors), a CN may have multiple occurrences of the same tuple set. The size of a CN C, denoted size(C), is defined as the number of tuple sets that it contains. Obviously, the size of a CN is the same as that of the JTTs it produces. Fig. 2 shows the CNs corresponding to the four JTTs shown in Fig. 1(c). A CN can be easily transformed into an equivalent SQL statement and executed by an RDBMS.

Fig. 2. Examples of Candidate Networks

When a continual keyword query Q = {w_1, w_2, ..., w_l} is specified, the non-empty query tuple sets R_i^Q for each relation R_i in the target database are first computed using full-text indices. Then all the non-empty query tuple sets and the database schema are used to generate the set of valid CNs; the basic idea is to expand each partial CN by adding an R_i^F or R_i^Q at each step (where R_i is adjacent to one relation of the partial CN in G_S), beginning from the set of non-empty query tuple sets. The set of CNs must be sound/complete and duplicate-free. There is always a constraint CN_max (the maximum size of CNs) to avoid generating complicated but less meaningful CNs. In the implementation, we adopt the state-of-the-art CN generation algorithm proposed in [12].

Example 2.
In Example 1, there are two non-empty query tuple sets, P^Q and A^Q. Using them and the database schema graph, if CN_max = 5, the generated CNs are: P^Q; A^Q; P^Q ← W → A^Q; P^Q ← W → A^Q ← W → P^Q; P^Q ← W → A^F ← W → P^Q; A^Q ← W → P^Q ← W → A^Q; and A^Q ← W → P^F ← W → A^Q. For example, the CN P^Q ← W → A^Q can be transformed into the SQL statement: SELECT * FROM W w, P p, A a WHERE w.pid = p.pid AND w.aid = a.aid AND p.pid IN (the pids of P^Q) AND a.aid IN (the aids of A^Q).

2.4 Problem Definition

The problem of continual top-k keyword search we study in this paper is to continually report the top-k JTTs based on a certain scoring function, described below. We adopt the scoring method employed in [4], which is an ordinary ranking strategy in the information retrieval area. The following function score(T, Q) is used to score a JTT T for query Q, based on the TF-IDF weighting scheme:

score(T, Q) = (Σ_{t∈T} tscore(t, Q)) / size(T),   (1)

where t ∈ T is a tuple (a node) contained in T, and tscore(t, Q) is the tuple score of t with regard to Q, defined as follows:

tscore(t, Q) = Σ_{w ∈ t∩Q} [ (1 + ln(1 + ln(tf_{t,w}))) / ((1 − s) + s · dl_t / avdl) ] · ln(N / df_w + 1),   (2)

where tf_{t,w} is the term frequency of keyword w in tuple t, and df_w is the number of tuples in relation r(t) (the relation corresponding to tuple t) that contain w; df_w is interpreted as the document frequency of w. dl_t represents the size of tuple t, i.e., the number of letters in t, and is interpreted as the document length of t. N is the total number of tuples in r(t), avdl is the average tuple size (average document length) in r(t), and s (0 ≤ s ≤ 1) is a constant which is usually set to 0.2.

Table 1 shows the tuple scores of the six matched tuples in Example 1. We suppose all the matched tuples are those shown in Fig. 1, and that the numbers of tuples of the two relations are 150 and 180, respectively. The top-3 results are then T_1 = p (a single Papers tuple), T_2 = a (a single Authors tuple), and T_3 = p ← w → a, ordered by their scores.

Table 1.
Statistics and tuple scores of the tuples of P^Q and A^Q

                 P^Q                 A^Q
N                150                 170
df               3 (df_P2P)          3 (df_James)
avdl             57.8                14.6
dl per tuple     88, 28, 83          10, 22, 23
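To make Eq. (2) concrete, the sketch below evaluates the tuple score of a hypothetical Papers tuple containing "P2P" once (tf = 1, dl = 88), using the P^Q statistics from Table 1. The function follows Eq. (2) directly; the sample inputs are illustrative:

```python
import math

def tscore(tf_counts, dl, N, df, avdl, s=0.2):
    """Tuple score of Eq. (2), summed over the query keywords the tuple contains.

    tf_counts maps each matched keyword to its term frequency in the tuple;
    df maps each keyword to its document (tuple) frequency in the relation."""
    score = 0.0
    for w, tf in tf_counts.items():
        tf_part = 1.0 + math.log(1.0 + math.log(tf))     # dampened term frequency
        norm = (1.0 - s) + s * dl / avdl                 # pivoted length normalization
        idf_part = math.log(N / df[w] + 1.0)             # inverse document frequency
        score += tf_part / norm * idf_part
    return score

# Statistics of relation P from Table 1: N = 150, df("P2P") = 3, avdl = 57.8.
s = tscore({"P2P": 1}, dl=88, N=150, df={"P2P": 3}, avdl=57.8)
print(round(s, 2))  # 3.56
```

Longer-than-average tuples (dl > avdl) are normalized downward, while rare keywords (small df_w) contribute a larger idf factor, exactly as in standard pivoted-normalization TF-IDF.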
The score function in Eq. (1) has the property of tuple monotonicity, defined as follows. For any two JTTs T = t_1 ⋈ t_2 ⋈ ... ⋈ t_l and T′ = t′_1 ⋈ t′_2 ⋈ ... ⋈ t′_l generated from the same CN C, if tscore(t_i, Q) ≤ tscore(t′_i, Q) for every 1 ≤ i ≤ l, then score(T, Q) ≤ score(T′, Q). As shown in the following discussion, this property is relied upon by the existing top-k query evaluation algorithms.

3 Related Work

Given an l-keyword query Q = {w_1, w_2, ..., w_l}, the task of keyword search in a relational database is to find structural information constructed from tuples in the database [13]. There are two approaches. The schema-based approaches [1,2,4,7,9,14,15] utilize the database schema to generate SQL queries which are evaluated to find the structures for a keyword query. They process a keyword query in two steps. They first utilize the database schema to generate a set of relation join templates (i.e., the CNs), which can be interpreted as select-project-join views. Then, these join templates are evaluated by sending the corresponding SQL statements to the DBMS to find the query results. [2] showed how to generate a complete set of CNs for a user-given value of CN_max and discussed several query processing strategies that consider the common sub-expressions among the CNs. [1,2,14,15] all focused on finding all JTTs whose sizes are ≤ CN_max and which contain all l keywords; no ranking is involved. In [4] and [9], several algorithms are proposed to get the top-k JTTs. We will introduce them in detail in Section 3.2.

The graph-based methods [3,8,5,6,10,16] model and materialize the entire database as a directed graph where the nodes are relational tuples and the directed edges are foreign key references between tuples. Fig. 1(b) shows such a database graph for the example database. Then, for each keyword query, they find a set of structures (either Steiner trees [3], distinct rooted trees [5], r-radius Steiner graphs [10], or multi-center subgraphs [16]) from the database graph which contain all the query keywords and are connected by paths in the database graph. Such results are found by graph traversals that start from the nodes that contain the keywords. For the details, please refer to the survey papers [13,17]. The materialized data graph must be updated on any database change; hence this model is not appropriate for databases that change frequently [17]. Therefore, this paper adopts the schema-based framework and can be regarded as an extension for dealing with continual keyword search.

3.2 Top-k Keyword Search in Relational Databases
DISCOVER2 [4] proposed the Global-Pipelined (GP) algorithm to get the top-k results, ranked by the IR-style ranking strategy shown in Section 2.4. The aim of the algorithm is to find a proper order of generating JTTs so that it can stop early, before all the JTTs are generated. It employs the priority preemptive, round robin protocol [18] to find results from each query tuple set prefix in a pipelined way; thus each CN can avoid being fully evaluated.

For a keyword query Q and a CN C, let the set of query tuple sets of C be {R_1^Q, R_2^Q, ..., R_m^Q}. Tuples in each R_i^Q are sorted in non-increasing order of their scores computed by Eq. (2). Let R_i^Q.t_j be the j-th tuple in R_i^Q. In each R_i^Q, we use R_i^Q.cur to denote the current tuple, such that all tuples before its position have been processed, and we use R_i^Q.cur ← R_i^Q.cur + 1 to move R_i^Q.cur to the next position. q(t_1, t_2, ..., t_m) (where each t_i is a tuple and t_i ∈ R_i^Q) denotes the parameterized query which checks whether the m tuples can form a valid JTT. For each tuple R_i^Q.t_j, we use score(C.R_i^Q.t_j, Q) to denote the upper bound score of all the JTTs of C that contain the tuple R_i^Q.t_j, defined as follows:

score(C.R_i^Q.t_j, Q) = (t_j.tscore + Σ_{1≤i′≤m, i′≠i} C.R_{i′}^Q.t_1.tscore) / size(C).   (3)

According to the tuple monotonicity property of Eq. (1) and the sorting order of tuples, among the unprocessed tuples of C.R_i^Q, score(C.R_i^Q.cur, Q) has the maximum value.

Algorithm GP initially marks all tuples in each C.R_i^Q (1 ≤ i ≤ m) of each CN C as unprocessed except for the top-most ones. Then, in each while iteration (one round), the unprocessed tuple which maximizes the score value is selected for processing. Suppose tuple C.R_s^Q.cur maximizes score; processing C.R_s^Q.cur is done by joining it with the processed tuples in the other query tuple sets of C to find valid JTTs: all the combinations (t_1, t_2, ..., t_{s−1}, R_s^Q.cur, t_{s+1}, ..., t_m) are tested, where t_i is a processed tuple of C.R_i^Q (1 ≤ i ≤ m, i ≠ s). If the k-th relevance score of the found results is larger than the score values of all the unprocessed tuples in all the CNs, the algorithm can stop and output the k found results with the largest relevance scores, because no results with higher scores can be found in the further evaluation.

One drawback of the GP algorithm is that when a new tuple C.R_s^Q.cur is processed, it tries all the combinations of processed tuples (t_1, t_2, ..., t_{s−1}, t_{s+1}, ..., t_m) to test whether each combination can be joined with C.R_s^Q.cur. This operation is costly due to the extremely large number of combinations when the number of processed tuples becomes large [19]. SPARK [9] proposes the Skyline-Sweeping algorithm to reduce the number of combination tests. SPARK uses a priority queue Q to keep the set of seen but not yet tested combinations, ordered by a priority defined as the score of the hypothetical JTT corresponding to each combination. In each round, the combination in Q with the maximum priority is tested, and then all its adjacent combinations are inserted into Q, so that only the combinations with high priorities are tested. SPARK still cannot avoid testing a huge number of combinations which produce no results, although the number of combination tests is greatly reduced compared to DISCOVER2.

This paper evaluates the CNs in a pipelined way like [4] and [9], but also employs the following two optimization strategies, whose high efficiency is shown in [2,14,15]: (1) sharing the computational cost among CNs; and (2) adopting tuple reduction.

The most related projects to our paper are
S-KWS [14] and KDynamic [20,15], which try to find the new results or expired results of a given keyword query over an open-ended, high-speed, large relational data stream [13]. They adopt the schema-based framework since the database is not static. This paper deals with a different problem from S-KWS and KDynamic, though all need to respond to continual queries in a dynamic environment. S-KWS and KDynamic focus on finding all query results. In contrast, our methods maintain the top-k results, which is less sensitive to updates of the underlying databases because not every new or expired result changes the top-k results.

S-KWS maps each CN to a left-deep operator tree, where leaf operators (nodes) are tuple sets and interior operators are joins. Then the operator trees of all the CNs are compacted into an operator mesh by collapsing their common subtrees. Joins in the operator mesh are evaluated in a bottom-to-top manner. A join operator has two inputs and is associated with an output buffer which saves its results (partial JTTs). The output buffer of a join operator becomes input to the other join operators that share it; a result newly output by a join operator becomes a new arrival input to those sharing joins. The operator mesh has two main shortcomings [19]: (1) only the left part of the operator trees can be shared; and (2) a large number of intermediate tuples, computed by many join operators in the mesh at high processing cost, are never output in the end.

To overcome these shortcomings of S-KWS, KDynamic formalizes each CN as a rooted tree, whose root is defined to be the node r such that the maximum path length from r to the leaf nodes of the CN is minimized, and then compresses all the rooted trees into an L-Lattice by collapsing the common subtrees. Fig. 3(a) shows the lattice of two hypothetical CNs. Each node V in the lattice is also associated with an output buffer, which contains the tuples of V that can join at least one tuple in the output buffer of each of its child nodes. Thus, each tuple in the output buffer of a top-most node V (i.e., the root of a CN) can form JTTs with tuples in the output buffers of its descendants. The new JTTs involving a new tuple are found in a two-phase approach. In the filter phase, as illustrated in Fig. 3(b), when a new tuple t_new is inserted into a node R, KDynamic uses selections and semi-joins to check whether (1) t_new can join at least one tuple in the output buffer of each child node of R; and (2) t_new can join at least one tuple in the output buffers of the ancestors of R. New tuples that cannot pass the checks are pruned; otherwise, in the join phase (shown in Fig. 3(c)), a joining process is initiated, in a top-down manner, from each tuple in the output buffer of each root node that can join t_new, to find the JTTs involving t_new.

(a) L-Lattice of two CNs (b) Filter phase (c) Join phase

Fig. 3. Query processing in KDynamic
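The child-side check of the filter phase can be sketched as follows; the `Node` class, the key-equality join predicate, and the toy lattice fragment are illustrative assumptions, not the authors' implementation. A new tuple is admitted to a node only if it semi-joins the output buffer of every child, so non-contributing tuples are pruned before any join work:

```python
class Node:
    """A lattice node with child links and an output buffer."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.output = []  # tuples that passed the check at this node

def can_join(t1, t2):
    # Illustrative join predicate: tuples join when they share a key value.
    return t1["key"] == t2["key"]

def filter_phase(node, t_new):
    """t_new survives only if it joins >= 1 tuple in EVERY child's output buffer."""
    return all(any(can_join(t_new, t) for t in c.output) for c in node.children)

# Toy lattice fragment: a Writes node whose children are P^Q and A^Q.
p = Node("P^Q"); p.output = [{"key": 1, "val": "p"}]
a = Node("A^Q"); a.output = [{"key": 1, "val": "a"}]
w = Node("W", children=[p, a])

print(filter_phase(w, {"key": 1}))  # True: joins both child buffers
print(filter_phase(w, {"key": 2}))  # False: pruned before any join work
```

The ancestor-side semi-join check (condition (2) above) would be layered on top of this in the same style.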
In this paper, we incorporate the ranking mechanisms and the pipelined evaluation into the query processing method of KDynamic to support efficient top-k keyword search in relational databases.

4 Continual Top-k Keyword Search in Relational Databases
Database updates have two orthogonal effects on the current top-k results:

1. They change the values of df_w, N, and avdl in Eq. (2) and hence change the relevance scores of existing results.
2. New JTTs may be generated due to insertions, and existing top-k results may expire due to deletions.

Although the second effect is more drastic, the first effect is not negligible under long-term database modifications. Thus, we cannot neglect the JTTs that are not among the current top-k results, because some of them have the potential of becoming top-k results in the future. This paper solves this problem by bounding the future relevance score of each result. We use score^u to denote the upper bound of the relevance score of each result. Then, the results whose score^u values are not larger than the relevance score of the top-k-th result can be safely ignored.

The second challenge is a shortage of top-k results, which can expire due to deletions. Since the value k is rather small compared to the huge number of all valid JTTs, the possibility of deleting a top-k result is rather small. In addition, new top-k results can also be formed by new tuples. Thus, if the insertion rate is not much smaller than the deletion rate, a shortage of top-k results is unlikely. However, the possibility becomes high if the deletion rate is much larger, which can result in frequent refilling operations for the top-k results. It is worth noting that a shortage of top-k results can also be caused by changes in the relevance scores of results. Our solution to this problem is to compute the top-(k + ∆k) (∆k > 0) results instead of only the necessary k; ∆k is a margin value. Then, we can withstand ∆k deletions of top results while maintaining the top-k results. The setting of ∆k is important: if ∆k is too small, refilling is likely to be frequent; if ∆k is too large, the efficiency of handling database modifications decreases. Instead of analyzing the update behavior of the underlying database to estimate an appropriate ∆k value, we enlarge ∆k on each occurrence of a top-k result shortage, until it reaches a value such that the frequency of top-k result shortages falls below a threshold.

Conversely, after maintaining the top-k results for a long time, the number of computed top results may become larger than (k + ∆k), especially when the insertion rate is high. In such cases, the efficiency of maintaining the top-k results decreases, because we need to update the relevance scores of more results and to join the new tuples with more tuples than necessary. As shown in the experimental results, such extra cost is not negligible for long-term database modifications. Therefore, we need to reverse the pipelined query evaluation if there are too many computed top results.

In brief, when a continual keyword query is registered, we first generate the set of CNs and compact them into a lattice L. Then, the initial top-k results are found by processing tuples in L in a pipelined way until the score^u values of the unseen JTTs are not larger than the relevance score of the top-(k + ∆k)-th result (which is denoted by L.θ). When maintaining the top-k results, we only find the new results with score^u > L.θ. The pipelined evaluation of L is resumed if the number of found results with score^u > L.θ falls below k, or is reversed if that number is larger than (k + ∆k).
The method of computing score^u for results is introduced in Section 4.2. Section 4.3 and Section 4.4 describe our methods of computing the initial top-k results and of maintaining the top-k results, respectively. Then, two techniques which can greatly improve the query processing efficiency are presented in Section 4.5 and Section 4.6.

4.2 Computing the Score Upper Bounds

Let us recall the function for computing tuple scores given in Eq. (2):

tscore(t, Q) = Σ_{w ∈ t∩Q} [ (1 + ln(1 + ln(tf_{t,w}))) / ((1 − s) + s · dl_t / avdl) ] · ln(N / df_w + 1).

We assume that the future values of each ln(N/df_w + 1) and of avdl have upper bounds, ln^u(N/df_w + 1) and avdl^u, respectively. Then, we can derive the upper bound of the future tuple score of each tuple t as:

t.tscore^u = Σ_{w ∈ t∩Q} [ (1 + ln(1 + ln(tf_{t,w}))) / ((1 − s) + s · dl_t / avdl^u) ] · ln^u(N / df_w + 1).   (4)

Hence, the upper bound of the future relevance score of a JTT T is:

T.score^u = (Σ_{t∈T} t.tscore^u) / size(T).   (5)

Note that the function in Eq. (5) also has the tuple monotonicity property with respect to tscore^u. On query registration, each ln^u(N/df_w + 1) is computed as ln(N / (df_w(1 − ∆df_w)) + 1), and each avdl^u is computed as avdl(1 + ∆avdl), where ∆df_w and ∆avdl are both initially set to small values. When maintaining the top-k results, we continually monitor the change of the statistics to determine whether all the ln(N/df_w + 1) and avdl values remain below their upper bounds. Each time a ln(N/df_w + 1) or avdl value exceeds its upper bound, the corresponding ∆df_w or ∆avdl is enlarged, until the frequency of exceeding the upper bounds falls below a small number.

Example 3.
Table 2 shows the tscore^u values of the six matched tuples in Example 1, obtained by setting ∆df_w = 20% together with a small ∆avdl; the score^u values of the top-3 results of Example 1 then follow from Eq. (5).

Table 2.
Upper bounds of the tuple scores (tscore^u) of the six matched tuples

4.3 Computing the Initial Top-k Results
Fig. 4 shows the L-lattice of the seven CNs in Example 2. We use V_i to denote a node in L. In particular, V_i^Q denotes a lattice node of a query tuple set, and V_i^Q.R^Q denotes the query tuple set of V_i^Q. Dual edges between two nodes, for instance between a node V and a node V^Q, indicate that V is a dual child of V^Q. A node V_i in L can belong to multiple CNs. We use V_i.CN to denote the set of CNs that node V_i belongs to; for example, the V^Q.CN of one node in Fig. 4 contains four of the seven CNs. Tuples in each query tuple set V_i^Q.R^Q are sorted in non-increasing order of tscore^u. We use V_i^Q.cur to denote the current tuple, such that all tuples before its position have been processed, and we use V_i^Q.cur ← V_i^Q.cur + 1 to move V_i^Q.cur to the next position. Initially, for each node V_i^Q in L, V_i^Q.cur is set to the top tuple in V_i^Q.R^Q. In Fig. 4, the V_i^Q.cur of the four such nodes are denoted by arrows. For a node V_i of a free tuple set R_i^F, we regard all the tuples of R_i^F as its processed tuples at all times. We use V_i.output to denote the output buffer of V_i, which contains its processed tuples that can join at least one tuple in the output buffer of each child node of V_i. Tuples in V_i.output are also referred to as the outputted tuples of V_i.

Fig. 4. The constructed lattice of the seven CNs in Example 2

In order to find the top-k results in a pipelined way, we need to bound the score^u values of the not-yet-found results. For each tuple t_j of V_i^Q.R^Q, the maximal score^u value of the JTTs that t_j can form is defined as follows:

score^u(V_i^Q, t_j, Q) = 0, if a child node of V_i^Q has an empty output buffer;
score^u(V_i^Q, t_j, Q) = max_{C ∈ V_i^Q.CN} score^u(C.R^Q.t_j, Q), otherwise,   (6)

where score^u(C.R^Q.t_j, Q) denotes the maximal score^u of all the JTTs of C that contain tuple t_j, and is obtained by replacing tscore with tscore^u in Eq. (3). If a child of V_i^Q has an empty output buffer, processing any tuple at V_i^Q cannot produce JTTs; hence score^u(V_i^Q, t_j, Q) = 0, and we need not process tuples at V_i^Q until all its child nodes have non-empty output buffers. According to Eq. (6) and the tuple sorting order, among the unprocessed tuples of V_i^Q.R^Q, score^u(V_i^Q, V_i^Q.cur, Q) has the maximum value. We use score^u(V_i^Q, Q) to denote score^u(V_i^Q, V_i^Q.cur, Q). In Fig. 4, the score^u(V_i^Q, Q) values of the four V_i^Q nodes are shown next to the arrows; for example, the score^u(V^Q, Q) of the A^Q node is the maximum of score^u(C.A^Q.a, Q) over the four CNs that contain it, where a is its current tuple.

Algorithm 1 processes the tuples in L to find the initial top-k results; it is similar to the GP algorithm. Lines 1-3 are the initialization step, which sorts the tuples in each query tuple set and initializes each V_i^Q.cur. Then, in each while iteration (lines 4-8), the unprocessed tuple among all the V^Q nodes that maximizes score^u is selected to be processed. Processing the selected tuple is done by calling the procedure Insert. Algorithm 1 stops when max_{V_i^Q ∈ L} score^u(V_i^Q, Q) is not larger than the relevance score of the top-(k + ∆k)-th found result.
The procedure Insert(V_i, t) is provided by KDynamic; it updates the output buffers of V_i (line 13) and of all its ancestors (lines 17-18), and finds all the JTTs containing tuple t by calling the procedure EvalPath (line 16). We will explain procedure Insert with examples later. The recursive procedure EvalPath(V_i, t, path), also provided by KDynamic, constructs JTTs using the outputted tuples of V_i's descendants that can join t. The stack path, which records where the join sequence comes from, is used to reduce the join cost. Example 4.
In the first round, tuple V_Q.p is processed by calling Insert(V_Q, p). Since V_Q is the root node of its CN, EvalPath is called and the JTT T = p is found. Then, for the two father nodes of V_Q, V and V: V.output is not updated because V_Q.output = ∅, while V.output is updated to {w, w} because p can join w and w. And then, for the two father nodes of V, V_Q and V, V_Q.output is not updated since V_Q has no processed
Algorithm 1: EvalStatic-Pipelined (lattice L, the top-k value k, ∆k)
1   topk ← ∅   // the priority queue storing found JTTs ordered by score
2   sort the tuples of each V_Qi.R_Q in non-increasing order of tscore_u
3   foreach node V_Qi in L do V_Qi.cur ← V_Qi.R_Q.t_1
4   while max_{V_Qi∈L} score_u(V_Qi, Q) > topk[k + ∆k].score do
5       suppose score_u(V_Q, Q) = max_{V_Qi∈L} score_u(V_Qi, Q)
6       path ← ∅   // a stack which records the join sequence
7       Insert(V_Q, V_Q.cur)   // process tuple V_Q.cur at V_Q
8       V_Q.cur ← V_Q.cur + 1
9   output the first k results in topk
10  L.θ ← topk[k + ∆k].score

Procedure Insert (lattice node V_i, tuple t)
12  if t ∉ V_i.output and t can join at least one outputted tuple of every child of V_i then
13      insert t into V_i.output
14  if t ∈ V_i.output then
15      push (V_i, t) onto path
16      if V_i is a root node then topk ← topk ∪ EvalPath(V_i, t, path)
17      foreach father node V_i' of V_i in L do
18          foreach tuple t' of V_i' that can join t do Insert(V_i', t')
19      pop (V_i, t) from path

Procedure EvalPath (lattice node V_i, tuple t, stack path)
20  T ← {t}   // the set of found JTTs
21  foreach child node V_i' of V_i in L do
22      T' ← ∅   // the set of JTTs rooted at tuples of node V_i'
23      if V_i' ∈ path then
24          let t' be the tuple of node V_i' stored in path; T' ← EvalPath(V_i', t', path)
25      else
26          foreach tuple t' ∈ V_i'.output that can join t do T' ← T' ∪ EvalPath(V_i', t', path)   // union the JTTs rooted at different tuples of V_i'
27      T ← T × T'   // compute the Cartesian product
28  return T

tuples, and V.output is set to {a} because a is the only tuple in A_F that can join w and w. Since this node is the root node of its CN, EvalPath(V, a, path) is called, but no results are found because the only JTT found, p ← w → a ← w → p, is not a valid result. After processing tuple V_Q.p, score_u(V_Q, Q) = 0.82 and score_u(V_Q, Q) = 0.57. In the second round, tuple V_Q.a is processed, which finds the results T = a and T = p ← w → a. Then V.output = {p}, V.output = {w, w}, V.output = {w}, score_u(V_Q, Q) = 0.18, and score_u(V_Q, Q) = score_u(CN.A_Q.a, Q) = 0.69. In the third to fifth rounds, tuples V_Q.a, V_Q.a and V_Q.a are processed, which insert a into V_Q.output, and no results are found. In the sixth round, tuple V_Q.a is processed, which finds the results a and a ← w → p ← w → a. Then Algorithm 1 stops, because the relevance score of the third result in the queue topk (supposing ∆k = 0) is larger than all the score_u(V_Qi, Q) values. Fig. 5 shows the snapshot of L after finding the top-3 results. Thus, θ = 0.68 after the evaluation.
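The control flow of Algorithm 1 can be sketched on toy data. This is our own simplification: each node is just a sorted score list, scores are assumed positive, and "processing" a tuple directly emits its score instead of running Insert/EvalPath:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Pipelined top-k sketch: repeatedly process the node whose current
// tuple has the largest bound, until that bound no longer exceeds the
// (k+dk)-th found score.
std::vector<double> eval_pipelined(const std::vector<std::vector<double>>& nodes,
                                   std::size_t k, std::size_t dk) {
    std::vector<std::size_t> cur(nodes.size(), 0);
    std::vector<double> topk;   // kept sorted in non-increasing order
    for (;;) {
        // pick the node whose current tuple has the maximum bound
        double best = -1.0; std::size_t bi = nodes.size();
        for (std::size_t i = 0; i < nodes.size(); ++i)
            if (cur[i] < nodes[i].size() && nodes[i][cur[i]] > best) {
                best = nodes[i][cur[i]]; bi = i;
            }
        // stop once no unseen result can beat the (k+dk)-th found score
        double threshold = (k + dk > 0 && topk.size() >= k + dk)
                         ? topk[k + dk - 1] : -1.0;
        if (bi == nodes.size() || best <= threshold) break;
        topk.push_back(nodes[bi][cur[bi]]);
        std::sort(topk.rbegin(), topk.rend());
        ++cur[bi];
    }
    if (topk.size() > k) topk.resize(k);
    return topk;
}
```

Because every node's candidates are sorted in non-increasing order, the maximum current bound over all nodes is a valid bound on every unseen result, which is what justifies the early stop.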
Fig. 5.
After finding the top-3 results (tuples in the output buffers are shown in bold)

After the execution of Algorithm 1, the score_u values of all the not-yet-found results are not larger than L.θ. The results in the queue topk can be categorized into three kinds. The first kind are the (k + ∆k) results with score ≥ L.θ, which are the initial top-(k + ∆k) results. The second kind are those with score < L.θ and score_u ≥ L.θ, which are called the potential top-(k + ∆k) results because they still have the potential to become top-(k + ∆k) results. The third kind are those with score_u < L.θ. As shown in the experiments, the results of the last kind may be large in number. However, we cannot discard them, because some of them may turn into the first two kinds while the top-k results are being maintained.

Maintaining Top-k Results
Algorithm 2 shows our algorithm for maintaining the top-k results. A database update operation is denoted by OP(t, R_t), which represents that a tuple t of relation R_t is inserted (if OP is an insertion) or deleted (if OP is a deletion). Note that database updates are modeled as deletions followed by insertions. For a newly arrived OP(t, R_t), Algorithm 2 first checks whether the ln(N/(df_w + 1)) and avdl values of relation R_t exceed their upper bounds. If some ln(N/(df_w + 1)) value(s) or avdl exceeds its upper bound, we enlarge the corresponding ∆df_w value(s) or ∆avdl (line 3), and then update the score and score_u values of all the tuples in R_t^Q and of all the results in the queue topk using the enlarged ln(N/(df_w + 1)) value(s) or avdl (line 4); otherwise, we update the relevance scores of the results in topk that are with score_u ≥ L.θ (line 6). Then, we insert t into L to find the new results if OP is an insertion (lines 7-13), or delete the expired JTTs and t from L if OP is a deletion (lines 14-17). Lines 7-17 are explained in detail later. After that, the score_u(V_Qi, Q) of some nodes may be larger than L.θ, which can be caused by three reasons: (1) the upper-bound scores of the tuples of relation R_t are increased; (2) the score_u(V_Qi, Q) of some nodes are increased from 0 after inserting the new tuple into L; and (3) new CNs are added into L. Therefore, in lines 18-19, we process tuples using the procedure Insert until all the score_u(V_Qi, Q) values are not larger than L.θ. (The methods of enlarging ∆df_w, ∆avdl and ∆k are introduced in detail in the experiments.) Algorithm 2:
Maintain (the evaluated lattice L, the top-k value k, ∆k)
1   while a new database modification OP(t, R_t) arrives do
2       if some ln(N/(df_w + 1)) (or avdl) exceeds its upper bound after applying OP then
3           enlarge the corresponding ∆df_w (or ∆avdl) value(s)
4           update the relevance scores of the tuples in R_t^Q and of the results in topk
5       else
6           update score for the results in topk that are with score_u ≥ L.θ
7       if OP is an insertion then   // insert t into L
8           if t is an un-matched tuple then
9               foreach node V_i in L that is of R_t^F do Insert(V_i, t)
10          else
11              if R_t^Q is new then add the new CNs into L
12              insert t into R_t^Q in descending order of tscore_u
13              foreach V_Qi that is of R_t^Q and has score_u(V_Qi, t, Q) > L.θ do Insert(V_Qi, t)
14      else if OP is a deletion then   // delete t from L
15          delete from topk the results that contain t and are with score_u ≥ L.θ
16          if t is a matched tuple then remove t from R_t^Q
17          foreach node V_i in L such that t ∈ V_i.output do Delete(V_i, t)
18      while max_{V_Qi∈L} score_u(V_Qi, Q) > L.θ do
19          foreach node V_Qi that is with score_u(V_Qi, Q) > L.θ do Insert(V_Qi, V_Qi.cur)
20      if |{T | T ∈ topk, T.score ≥ L.θ}| < k then   // resume the evaluation of L
21          enlarge ∆k and then resume the execution of EvalStatic-Pipelined
22      else if |{T | T ∈ topk, T.score ≥ L.θ}| > (k + ∆k) then
23          RollBack(L, k, ∆k)   // reverse the evaluation of L
24      report the new first k results in topk if they are changed

Procedure Delete (lattice node V_i, tuple t)
26  delete t from V_i.output
27  foreach father node V_i' of V_i in L do
28      foreach tuple t' in V_i'.output that can join t only do
29          Delete(V_i', t')   // call Delete recursively
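The recursive cascade in procedure Delete can be sketched as follows. This is our own toy model: joinability is abstracted into static per-tuple join sets and each node has a single child, whereas the real procedure re-checks joinability against the current output buffers:

```cpp
#include <map>
#include <set>
#include <vector>

// Sketch of the cascade in procedure Delete: each node stores, per own
// tuple id, the set of child tuples it can join. Removing t from a node
// removes every parent tuple whose only join partner was t, recursively.
struct Node {
    std::set<int> output;                 // output buffer
    std::vector<Node*> parents;
    std::map<int, std::set<int>> joins;   // own tuple id -> joinable child tuple ids
};

void cascade_delete(Node& v, int t) {
    if (v.output.erase(t) == 0) return;   // t was not an outputted tuple
    for (Node* p : v.parents) {
        std::vector<int> doomed;
        for (int pt : p->output) {
            auto it = p->joins.find(pt);
            if (it != p->joins.end() && it->second.count(t) && it->second.size() == 1)
                doomed.push_back(pt);     // pt could join t only
        }
        for (int pt : doomed) cascade_delete(*p, pt);
    }
}
```

A parent tuple survives as long as it still joins some other outputted child tuple, so a single deletion only propagates along tuples that depended exclusively on t.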
Finally, in lines 20-23, we count the number of results that are with score ≥ L.θ. If the number is smaller than k, ∆k is enlarged, and then the EvalStatic-Pipelined algorithm (without the initialization step) is called to further evaluate L. If the number is larger than k + ∆k, the algorithm RollBack, which is described at the end of this subsection, is called to roll back the evaluation of L. In any case, at the end of handling OP, we have max_{V_Qi∈L} score_u(V_Qi.cur, Q) ≤ topk[k].score. Therefore, the k results in topk that have the largest relevance scores are the top-k results. We do not process the results in topk that are with score_u < L.θ in line 6 and line 15, because they can be large in number and do not have the potential to become top-k results. However, after the execution of lines 4 and 21, the score_u of some of them may become larger than L.θ, because their score_u values may be enlarged in line 4 and L.θ may be decreased in line 21. Therefore, all the results in topk need to be considered in lines 4 and 21. Note that we first have to check whether some of them have expired due to deletions.

In lines 7-13, the new tuple t is processed differently according to whether it contains the keywords. If t is an un-matched tuple, it is inserted into each node of R_t^F using the procedure Insert (line 9). If t is a matched tuple, inserting it into L is more complicated. First, if t introduces a new non-empty query tuple set R_t^Q, we add the new CNs involving R_t^Q into the lattice. Fig. 6 illustrates the process of inserting a new CN into the lattice shown in Fig. 5. Assuming that W-P^Q is the largest common subtree of the new CN and L, and that V_f is the father node of W-P^Q in the new CN, the new CN is added by setting V as the child of V_f. If V_f is a free tuple set and has no other child nodes, as shown in Fig. 6, Insert(V_f, t') is called for each tuple t' of V_f that can join tuples in V.output. Further evaluation at the nodes of the new CN, if necessary, will be done in lines 18-19. Second, t is added into the query tuple set R_t^Q (line 12), and then for each node V_Qi of R_t^Q, Insert(V_Qi, t) is called when score_u(V_Qi.R_Q.t, Q) > L.θ (line 13), i.e., when t has the potential to form JTTs that are with score_u > L.θ.

Fig. 6. Inserting a new CN into the lattice

If OP is a deletion, then for each node V_i in L such that t ∈ V_i.output, we delete t from V_i.output using the procedure Delete, which is provided by KDynamic. Procedure Delete first removes t from V_i.output, and then checks whether some outputted tuples of the ancestors of V_i need to be removed (lines 27-29). For instance, if the tuple a is deleted from the lattice node V_Q shown in Fig. 5, the tuples w and w are deleted from V.output too, because among the tuples in V_Q.output they can join a only.

Algorithm 3 outlines our algorithm to reverse the execution of the pipelined evaluation of the lattice. In the beginning, L.θ is set to the relevance score of the (k + ∆k)-th result in the queue topk (line 1). Then, the processing of each processed tuple t ∈ R_Qi that is of score_u(V_Qi.R_Q.t, Q) ≤ L.θ is reversed (lines 4-6). We use V_Qi.cur ← V_Qi.cur − 1 to move the cursor back. If t ∈ V_Qi.output, the results involving t are first deleted from topk, and then t is deleted from V_Qi.output by calling the procedure Delete. Algorithm 3:
RollBack (a lattice L, the top-k value k, ∆k)
1   L.θ ← topk[k + ∆k].score
2   foreach node V_Qi in L do
3       while score_u(V_Qi, V_Qi.cur − 1, Q) ≤ L.θ do
4           if V_Qi.cur − 1 ∈ V_Qi.output then
5               remove from topk the results that are of CNs in V_Qi.CN and contain the tuple V_Qi.cur − 1
6               Delete(V_Qi, V_Qi.cur − 1)   // delete it from the output buffer
7           V_Qi.cur ← V_Qi.cur − 1

In Algorithm 1 and Algorithm 2, the procedures
Insert and
Delete may be called multiple times upon multiple nodes for the same tuple. The core of the two procedures is the select operations (or semi-joins [15]). For example, in line 12 and line 18 of procedure
Insert, we need to select the tuples that can join t from the output buffer of each child node of V_i and from the set of processed tuples of each father node of V_i, respectively. Although such select operations can be done efficiently by the DBMS using indexes, the cost of handling t is high due to the large number of database accesses. For example, in our experiments, the maximal number of database accesses for a new tuple t can be up to several hundred.

These select operations done for the same tuple t can be evaluated efficiently by sharing the computational cost among them. Assume a new tuple w is inserted into the lattice shown in Fig. 5; then the procedure Insert is called three times (
Insert(V, w) at three different lattice nodes V) and at most eight selections are done. All eight select operations can be expressed using the following two relational algebra expressions: π_aid(σ_{wid = w.wid}(W) ⋈ σ_{aid ∈ A_i}(A)) and π_pid(σ_{wid = w.wid}(W) ⋈ σ_{pid ∈ P_j}(P)), where A_i and P_j represent the set of tuples in the output buffer of a node or the set of processed tuples of a node. Since the A_i and P_j can differ from each other, the eight select operations need to be evaluated individually. However, if we rewrite the above expressions as π_aid(σ_{aid ∈ A_i}(σ_{wid = w.wid}(W) ⋈ A)) and π_pid(σ_{pid ∈ P_j}(σ_{wid = w.wid}(W) ⋈ P)), the eight select operations have two common sub-operations: σ_{wid = w.wid}(W) ⋈ A and σ_{wid = w.wid}(W) ⋈ P. If the results of the two common sub-operations are shared and the selections σ_{aid ∈ A_i} and σ_{pid ∈ P_j} are done in main memory, the eight select operations can be evaluated with only two database accesses. Algorithm 4:
CanJoinOneOutputTuple (lattice node V_i, tuple t)
1   let R_i be the relation corresponding to the tuple set of V_i
2   if the tuples of relation R_i that can join t have not been stored then
3       query the tuples of relation R_i that can join t and store them
4   foreach tuple t' of the stored tuples of relation R_i that can join t do
5       if t' can be found in V_i.output then return true
6   return false

Algorithm 4 shows our procedure for checking whether a tuple t can join at least one tuple in the output buffer of a lattice node V_i, which is called in line 12 of procedure Insert. In line 3, all the tuples in relation R_i that can join t are queried and cached in main memory. This set of cached joined tuples can be reused every time it is requested. The procedures for the select operations in line 18 of Insert and line 28 of
Delete are also designed in this pattern; they are omitted due to the space limitation. Note that when the two procedures
Insert and
Delete are called recursively, the select operations done in the above lines are also evaluated by these procedures. Therefore, for each tuple t, a tree of tuples, which is rooted at t and consists of all the tuples that can join t, is created. The tree of tuples can be seen as the cached localization information of t. It is created on-the-fly, i.e., along with the execution of the procedures Insert
Delete, and its depth is determined by the recursion depth of the two procedures. The maximum recursion depth of the procedures
Insert and
Delete is CN_max + 1, where CN_max indicates the maximum size of the generated CNs. Hence, the height of this tree of tuples is bounded by CN_max + 1. Assume a tuple p of P^Q is inserted into the two nodes of P^Q in the lattice shown in Fig. 5. Fig. 7 illustrates the select operations done in the procedure Insert (denoted as arrows in the left part) and the cached joined tuples of p (shown in the right part). For instance, the arrows from V_Q to V select the tuples in relation W that can join p. Three select operations are denoted by dashed arrows because they would not be done if the results of the two select operations, from V_Q to V and from V_Q to V, were empty. For the same reason, the stored tuples of relation A that can join p are denoted using dashed rectangles. Fig. 7.
Selections done in
Insert and the cached joined tuples for a tuple p of P^Q

When computing the initial top-k results, the database is static; hence the cached joined tuples of each tuple remain unchanged and can be reused until the database is updated.
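A minimal sketch of the caching idea follows. The JoinCache type and its layout are our own illustration; the DBMS access is simulated by an in-memory map and a counter:

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache of joined tuples: one database access per
// (relation, tuple) pair, shared by every later selection on that pair.
struct JoinCache {
    int db_accesses = 0;
    std::map<std::pair<std::string, int>, std::vector<int>> cache;

    // db: relation name -> (tuple id -> ids of joinable tuples)
    const std::vector<int>& joined(
            const std::map<std::string, std::map<int, std::vector<int>>>& db,
            const std::string& rel, int t) {
        auto key = std::make_pair(rel, t);
        auto it = cache.find(key);
        if (it == cache.end()) {                       // first request: query the DBMS
            ++db_accesses;
            it = cache.emplace(key, db.at(rel).at(t)).first;
        }
        return it->second;
    }
};

// CanJoinOneOutputTuple, restated over the cache: the in-memory
// selection over the output buffer is done on the cached list.
bool can_join_output(JoinCache& jc,
                     const std::map<std::string, std::map<int, std::vector<int>>>& db,
                     const std::string& rel, int t, const std::set<int>& output) {
    for (int u : jc.joined(db, rel, t))
        if (output.count(u)) return true;
    return false;
}
```

Repeated checks against different output buffers of the same relation then reuse a single simulated database access, which mirrors the sharing argument above.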
When maintaining the top-k results, although the database is continually updated, we can assume the database unchanged before t is handled. However, the cached joined tuples of t expire after t is handled by Algorithm 2. As shown in the experimental results, caching the joined tuples can greatly improve the efficiency of computing the initial top-k results and of maintaining the top-k results.

According to Eq. (3), the score_u values of tuples in different CNs can differ greatly. For example, the score_u values of tuples in a large CN are smaller than those of tuples in a small CN, due to the CN size. In algorithm GP, no tuples or only a small portion of them are joined in the CNs whose tuples have small score_u values. If the CNs in Example 2 were evaluated by algorithm GP, the tuple set A^Q of one CN and the tuple set P^Q of another would have no processed tuples. However, in the lattice, a node R_Qi can be shared by multiple CNs; thus, when a tuple t is inserted into R_Qi, t is processed in all the CNs in R_Qi.CN. As shown in Fig. 5, since the node V_Q is shared by four CNs in the lattice, the tuples a and a are processed in all four of these CNs when they are processed at V_Q, which results in un-needed operations at the nodes V and V and in two un-needed results, a and a ← w → p ← w → a. We call these operations and JTTs un-needed because they would not occur or be found if the CNs were evaluated separately. These un-needed operations can cause further un-needed operations when maintaining the top-k results. For example, we have to join a new unmatched tuple of relation P with four tuples in V.output.

The essence of the above problem is that CNs have different potentials in producing top-k results, and hence the same tuple set can have different numbers of processed tuples in different CNs if they are evaluated separately. In order to avoid finding the un-needed results, the optimal method is to share only the tuple sets that have the same number of processed tuples among CNs when they are evaluated separately.
However, we cannot obtain these numbers without evaluating the CNs. As an alternative, we attempt to estimate this number for the tuple sets of each CN C according to the following heuristic rules:

– If Max(C) = (Σ_{1≤i≤m} C.R_Q^i.t_1.tscore_u) / size(C), which indicates the maximum score_u of the JTTs that C can produce, is high, the tuple sets of C have more processed tuples.
– If two CNs have the same
Max(C) values, the tuple sets of the CN with the larger size have more processed tuples.

Therefore, we can cluster the CNs using their Max(C) · ln(size(C)) values, where ln(size(C)) is used to normalize the effect of the CN sizes. Then, when constructing the lattice, only the subtrees of CNs in the same cluster are collapsed. For example, the Max(C) · ln(size(C)) values of the seven CNs of Example 2 are 5.15, 2.93, 5.39, 6.84, 5.32, 5.70 and 3.03; hence they can be clustered into two clusters, one containing the two CNs with the smallest values and the other containing the remaining five. Fig. 8 shows the lattice after finding the top-3 results if the CNs are clustered, where the three un-needed JTTs in Fig. 5 can be avoided. As shown in the experimental section, clustering the CNs can greatly improve the efficiency of computing the initial top-k results and of handling the database updates.

We cluster the CNs using the K-means clustering algorithm [21], which needs an input parameter indicating the number of expected clusters. We use Kmean to denote the ratio of this input parameter to the number of CNs. The value of
Kmean represents the trade-off between sharing the computation cost among CNs and considering their different potentials in producing top-k results. When Kmean =
0, the CNs are not clustered, and they share the computation cost to the maximum extent. When
Kmean = 1, every CN forms its own cluster. An intermediate Kmean value balances the two effects in computing the initial top-k results and handling the database updates. Fig. 8.
After finding the top-3 results if the CNs are clustered into two clusters
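The clustering step can be sketched with a tiny one-dimensional K-means over the Max(C) · ln(size(C)) values. This is our own sketch: it seeds the two centroids deterministically with the minimum and maximum value, whereas the paper uses random starting conditions, and it assumes a non-empty input:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Cluster CNs by their Max(C) * ln(size(C)) values into two groups.
// Returns a cluster index (0 or 1) per CN; centroid 0 starts at the
// minimum value and centroid 1 at the maximum, so the result is
// deterministic.
std::vector<int> kmeans2(const std::vector<double>& vals) {
    double c0 = *std::min_element(vals.begin(), vals.end());
    double c1 = *std::max_element(vals.begin(), vals.end());
    std::vector<int> label(vals.size(), 0);
    for (int iter = 0; iter < 100; ++iter) {
        bool changed = false;
        double s0 = 0, s1 = 0; int n0 = 0, n1 = 0;
        for (std::size_t i = 0; i < vals.size(); ++i) {
            // assign each value to the nearer centroid
            int l = std::fabs(vals[i] - c0) <= std::fabs(vals[i] - c1) ? 0 : 1;
            if (l != label[i]) { label[i] = l; changed = true; }
            (l == 0 ? s0 : s1) += vals[i];
            (l == 0 ? n0 : n1) += 1;
        }
        if (n0) c0 = s0 / n0;                 // recompute the centroids
        if (n1) c1 = s1 / n1;
        if (!changed && iter > 0) break;      // converged
    }
    return label;
}
```

On the seven example values above, this separates the two smallest values (2.93 and 3.03) from the other five, matching the two clusters described in the text.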
We conducted extensive experiments to test the efficiency of our methods. We use the DBLP dataset. Note that DBLP is not continuously growing and is updated on a monthly basis. The reason we use DBLP to simulate a continuously growing relational dataset is that there are no real growing relational datasets in public, and many studies [4,9] on top-k keyword queries over relational databases use DBLP. The downloaded XML file is decomposed into relations according to the schema shown in Fig. 9. The two arrows from PaperCite to Papers denote the foreign-key references from paperID to paperID and from citedPaperID to paperID, respectively. The DBMS used is MySQL (v5.1.44) with the default "Dedicated MySQL Server Machine" configuration. All the relations use the MyISAM storage engine. Indexes are built for all primary-key and foreign-key attributes, and full-text indexes are built for all text attributes. All the algorithms are implemented in C++. We conducted all the experiments on a PC with a 2.53 GHz CPU and 4 GB memory running Windows 7.

We use the following five parameters in the experiments: (1) k: the top-k value; (2) l: the number of keywords in a query; (3) IDF: the ratio of the number of matched tuples to the number of total tuples, i.e., df_w / N; (4) CN_max: the maximum size of the generated CNs; and (5) Kmean: the ratio of the number of clusters of CNs to the number of CNs. (The DBLP data was downloaded from http://dblp.mpi-inf.mpg.de/dblp-mirror/index.php.)
Fig. 9.
The DBLP schema (PK stands for primary key, FK for foreign key)

The parameters with their default values are shown in Table 3. The selected keywords are listed in Table 4 with their
IDF values, where the keywords in bold are keywords popular in author names. Ten queries are constructed for every IDF value, each of which contains three selected keywords. For each l value, ten queries are constructed by selecting l keywords from the row of IDF = 0.013 in Table 4. To avoid generating a small number of CNs for each query, one author-name keyword of each IDF value is always selected for each query.

When k grows, the cost of computing the initial top-k results increases since we need to compute more results, and the cost of maintaining the top-k results also increases since there are more tuples in the output buffers of the lattice nodes. The parameter CN_max has a great impact on keyword query processing because the number of generated CNs increases exponentially as CN_max increases. The number of matched tuples increases as IDF and l increase. Hence, the first four parameters k, l, IDF and CN_max affect the scalability of our method. Table 3.
Parameters (default values: k = 100, l = 3, IDF = 0.013, CN_max = 6, Kmean = 0.60)

Name      Values
k         100, 150, 200
l         3, 4, 5
IDF       0.013, 0.03
CN_max    4, 5, 6, 7
Kmean     0, 0.20, 0.40, 0.60, 0.80, 1

Table 4. Keywords and their IDF values

Keywords: ATM, embedded, navigation, privacy, scalable, Spatial, XML, Charles, Eric
          James, Zhang
          John, Wang
          David, Michael

Exp-1: Initial Top-k Results Computation
In this experiment, we study the effects of the five parameters on computing the initial top-k results. We retrieve the data in the XML file sequentially until the numbers of tuples in the relations reach the numbers shown in Table 5. Then we run the algorithm EvalStatic-Pipelined on different values of each parameter while keeping the other four parameters at their default values. We use two measures to evaluate the effects of the parameters. The first is R, the number of found results in the queue topk. The second is T, the time cost of running the algorithm. Ten top-k queries are selected for each combination of parameters, and the average values of the two measures are reported in the following. In this experiment, ∆df_w, ∆avdl and ∆k are all set to very small values when computing the initial top-k results.

The main results of this experiment are given in Fig. 10. Note that the units of the y-axis differ among the measures. Fig. 10(a), (b) and (c) show that both measures increase as k, IDF and CN_max grow. However, they do not increase rapidly in Fig. 10(a), (b) and (c), which implies the good scalability of our method. On the contrary, we can find a rapid increase as CN_max grows in the time cost of the method of [9] in finding the top-k results, which is shown in Fig. 10(c) and denoted by T[SPARK]. Fig. 10(c) shows that, compared to the existing method, algorithm
EvalStatic-Pipelined is very efficient in finding the top-k results. The reason is that evaluating the CNs using the lattice achieves full reduction, because all the tuples in the output buffers of the root nodes can form JTTs, and it saves computation cost by sharing the common sub-expressions [15]. Fig. 10(d) shows that the effect of l is more complicated: both measures may decrease when l increases. As shown in Fig. 10(d), R and T even both achieve their minimum values when l = 5. This is because when the number of keywords is large, the probability is high that the keywords co-appear in a tuple and that the matched tuples can join. Therefore, there are more JTTs that have high relevance scores, which results in a larger θ and small values of the two measures. Table 5.
Tuple numbers of relations
Papers: 157,300   PaperCite: 9,155   Write: 400,706   Authors: 190,615   Proceedings: 2,886   ProcEditors: 1,936   ProcEditor: 1,411
Fig. 10(e) presents the change of the two measures when Kmean varies. Since the results of the K-means clustering may be affected by the starting condition [21], for each Kmean value we run Algorithm 1 five times with different starting conditions for each keyword query and report the average experimental results. Note that the algorithm EvalStatic in KDynamic corresponds to Kmean = 0; hence the results at Kmean = 0 are those of KDynamic. From Fig. 10(e), we can see that clustering the CNs can greatly improve the efficiency of computing the top-k results, and that the time cost decreases as Kmean increases. However, when Kmean = 1, which indicates that all the CNs are evaluated separately, the time cost grows to a higher value than when Kmean is 0.6 or 0.8. Therefore, it is important to select a proper Kmean value. The minimum T in this experiment is achieved at Kmean = 0.6; hence the default value of Kmean is 0.6 in our experiments. As can be seen in the next section, Kmean = 0.6 also performs well in maintaining the top-k results.

Fig. 10(f) compares the time cost of our method in computing the top-k results with that of KDynamic, while varying CN_max. The time cost of KDynamic is denoted by "!Cache" because it does not cache the joined tuples for each tuple. We can see that caching the joined tuples for each tuple greatly improves the efficiency of computing the top-k results. More importantly, the improvement increases as CN_max grows. This is because when CN_max grows, the number of calls of the procedure Insert on each tuple

(a) Varying k (b) Varying IDF (c) Varying CN_max (d) Varying l (e) Varying Kmean (f) Effect of storing joined tuples

Fig. 10.
Experimental results of calculating the initial top-k results

increases fast, since the number of lattice nodes increases exponentially; hence the cost saved by storing the joined tuples of each tuple grows as CN_max grows.

From the curves of R in Fig. 10, we can see that the R values are large in all the settings: about several thousand. Recall that topk contains three kinds of results. The number of results of the first kind is k + ∆k, which is small compared to the R values. Since ∆df_w, ∆avdl and ∆k all have very small values, the number of potential top-(k + ∆k) results in topk is also very small. The third kind, the results with score_u < L.θ, is in the majority and is large in number.

Exp-2: Top-k Result Maintenance
In this experiment, we study the efficiency of Algorithm 2 in maintaining the top-k results. We use the same keyword queries as in Exp-1. After calculating the initial top-k results for them, we sequentially insert additional tuples into the database by retrieving data from the DBLP XML file. At the same time, we delete randomly selected tuples from the database. Algorithm 2 is used to maintain the top-k results for the queries while the database is being updated. The database update records are read from the database log file; hence the database updating rate has no direct impact on the efficiency of top-k result maintenance, because the database is updated by another process.

We first add 713,084 new tuples into the database and delete 250,000 tuples from the database. The new data is roughly 90 percent of the data used in Exp-1. The composition of the additional tuples is shown in Table 6. Fig. 11(a) and (b) show the average execution times of Algorithm 2 in handling the above database updates when varying the five parameters, which demonstrates the efficiency of Algorithm 2. Note that the units of the x-axis differ among the five parameters; their minimum and maximum values are labeled in Fig. 11(a) and (b), and their other values can be found in Table 3. We can see that the time cost of handling database updates for the default queries is smaller than 1.5 ms. Comparing Fig. 11(a) and (b) with the curves of measure T in Fig. 10 (especially the curves in Fig. 10(d) and Fig. 10(e)), we can see that the time cost of handling database updates and the time cost of computing the initial top-k results have the same trends. This is because when more time is needed to compute the initial top-k results, there are more outputted tuples in the lattice; hence more time is required for the selections in the procedures Insert and Delete, and their recursion depths are larger. Fig. 11(c) compares the time cost of our method in handling database updates with that of KDynamic, while varying CN_max. The time cost of KDynamic is again denoted by "!Cache". We can see that caching the joined tuples for each tuple also improves the efficiency of handling database updates, and the larger CN_max is, the higher the improvement. Table 6.
Composition of the additional tuples
Papers: 156,965   PaperCite: 20,010   Write: 411,109   Authors: 111,094   Proceedings: 3,033   ProcEditors: 3,886   ProcEditor: 6,987

(Since it is hard to read in one figure, we split the data of the five parameters into two figures.)

(a) Time for handling database updates while varying
Kmean and CN_max (b) Time for handling database updates while varying l, k and IDF (c) Effect of storing joined tuples in handling database updates (d) Changes of the times of enlarging ∆df_w (e) Changes of the times of calling procedure RollBack (f) Changes of the times of enlarging ∆k

Fig. 11. Efficiency of top-k result maintenance

Secondly, we insert only the 713,084 additional tuples into the database while maintaining the top-100 results for the default ten keyword queries. We adopt two different growing rates of ∆df_w: ∆df_w+ =
2% and ∆df_w+ = 5%, which mean that when some ln(N/(df_w + 1)) exceeds its upper bound, the corresponding ∆df_w value is increased by 2% and 5%, respectively. After inserting every 100,000 additional tuples, we record the average frequency of enlarging ∆df_w and of calling the procedure RollBack for the ten queries; their changes are shown in Fig. 11(d) and (e), respectively, whose x-axes (in units of 100,000 tuples) indicate the number of additional tuples. Note that we do not report the frequency of enlarging avdl because it is very small in the experiment. Although the frequency of enlarging ∆df_w is higher when the growing rate of ∆df_w is lower, after inserting 300,000 additional tuples the number of times of enlarging ∆df_w, i.e., the number of times the upper bound of ln(N/(df_w + 1)) is exceeded, falls below 5 for both growing rates of ∆df_w. After inserting 300,000 additional tuples, the maximum ∆df_w value over all the relations is 15; hence it is reasonable to set 15 as the maximum value for ∆df_w. There is only one curve in Fig. 11(e) because the growing rate of ∆df_w has no great impact on the number of calls of the procedure RollBack, which is mainly affected by the frequency of finding new results that are with score_u > L.θ. Note that L.θ is increased after each call of the procedure RollBack. Therefore, the number of calls of the procedure
RollBack is decreasing since it is more and more harder to find new results that are with score u > L .θ . In order to study the impact of reversing the pipelined evaluation on thee ffi ciency of handling database updates, we also redo the experiment without callingthe procedure RollBack . Then, the average time cost of handling database updates isincreased by 45 . ff erent ∆ k growing rates are adopted: ∆ k + = ∆ k + =
5, which mean that when the number of results that are with score u > L .θ fallsbelow k , the corresponding ∆ k value is increased by 2 and 5, respectively. We recordthe average times of enlarging ∆ k of the ten queries after deleting each 100,000 tuples,whose changes are shown in Fig. 11(f). Fig. 11(f) shows that the frequency of shortageof top- k results falls below a very small number after deleting 200,000 tuples, i.e., after ∆ k being enlarged to about 20. As indicated by the curve of k in Fig. 11(b), a large ∆ k value can highly decrease the e ffi ciency of handling database updates. Therefore, it isreasonable to set the maximum value of ∆ d f w as 20%. In this paper, we have studied the problem of finding the top- k results in relationaldatabases for a continual keyword query. We proposed an approach that finds the an-swers whose upper bounds of future relevance scores are larger than a threshold. Weadopt an existing scheme of finding all the results in a relational database stream, butincorporate the ranking mechanisms in the query processing methods and make twoimprovements that can facilitate e ffi cient top- k keyword search in relational databases.The proposed method can e ffi ciently maintain top- k results of a keyword query without calable Continual Top- k Keyword Search in Relational Databases 27 re-evaluation. Therefore, it can be used to solve the problem of answering continualkeyword search in databases that are updated frequently.
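The two adaptation rules evaluated in this section (enlarging Δdfw on insertions when ln(N/(dfw + 1)) exceeds its cached upper bound, and enlarging Δk on deletions when fewer than k buffered results have upper-bound score above L.θ) can be sketched as follows. This is a minimal illustration only: the class names, the caching scheme, and the default values are our assumptions, not the implementation evaluated above.

```python
import math


def idf(n_tuples: int, df_w: int) -> float:
    # IDF-like term ln(N / (df_w + 1)) appearing in the relevance scores.
    return math.log(n_tuples / (df_w + 1))


class DfwSlack:
    """Insertion rule (hypothetical sketch): cache an upper bound of the
    IDF-like term with slack delta; when the live value exceeds the bound,
    enlarge the slack by the growing rate (capped) and recompute the bound."""

    def __init__(self, n_tuples, df_w, delta=0.05, growth=0.02, cap=0.15):
        self.delta, self.growth, self.cap = delta, growth, cap
        self.bound = idf(n_tuples, df_w) * (1.0 + delta)
        self.enlargements = 0  # the quantity plotted in Fig. 11(d)

    def observe(self, n_tuples, df_w):
        # Called after insertions change N or df_w for a keyword w.
        if idf(n_tuples, df_w) > self.bound:
            self.delta = min(self.delta + self.growth, self.cap)
            self.bound = idf(n_tuples, df_w) * (1.0 + self.delta)
            self.enlargements += 1


class KSlack:
    """Deletion rule (hypothetical sketch): buffer top-(k + delta) candidates;
    when the number of candidates with upper-bound score above L.theta falls
    below k, enlarge delta by the growing rate (capped)."""

    def __init__(self, k, delta=0, growth=2, cap=20):
        self.k, self.delta, self.growth, self.cap = k, delta, growth, cap
        self.enlargements = 0  # the quantity plotted in Fig. 11(f)

    def observe(self, num_above_threshold):
        # Called after deletions shrink the set of buffered results.
        if num_above_threshold < self.k:
            self.delta = min(self.delta + self.growth, self.cap)
            self.enlargements += 1
            return True  # caller should refill the buffer up to k + delta
        return False
```

Under this sketch, a lower growing rate triggers more enlargements before the slack stabilizes, matching the trend reported for Fig. 11(d).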
Acknowledgments
This research was partly supported by the National Natural Science Foundation of China (NSFC) under grant No. 60873040 and the 863 Program under grant No. 2009AA01Z135. Jihong Guan was also supported by the "Shu Guang" Program of the Shanghai Municipal Education Commission and the Shanghai Education Development Foundation.
References
1. S. Agrawal, S. Chaudhuri, and G. Das, "DBXplorer: Enabling keyword search over relational databases," ACM SIGMOD, p. 627, 2002.
2. V. Hristidis and Y. Papakonstantinou, "DISCOVER: Keyword search in relational databases," VLDB, pp. 670-681, 2002.
3. B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, and Parag, "BANKS: Browsing and keyword searching in relational databases," VLDB, pp. 1083-1086, 2002.
4. V. Hristidis, L. Gravano, and Y. Papakonstantinou, "Efficient IR-style keyword search over relational databases," VLDB, pp. 850-861, 2003.
5. V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar, "Bidirectional expansion for keyword search on graph databases," VLDB, pp. 505-516, 2005.
6. H. He, H. Wang, J. Yang, and P. S. Yu, "BLINKS: Ranked keyword searches on graphs," ACM SIGMOD, New York, NY, USA, pp. 305-316, ACM, 2007.
7. F. Liu, C. Yu, W. Meng, and A. Chowdhury, "Effective keyword search in relational databases," ACM SIGMOD, pp. 563-574, 2006.
8. G. Li, X. Zhou, J. Feng, and J. Wang, "Progressive keyword search in relational databases," ICDE, pp. 1183-1186, 2009.
9. Y. Luo, X. Lin, W. Wang, and X. Zhou, "SPARK: Top-k keyword query in relational databases," ACM SIGMOD, pp. 115-126, 2007.
10. G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou, "EASE: An effective 3-in-1 keyword search method for unstructured, semi-structured and structured data," ACM SIGMOD, pp. 903-914, 2008.
11. Y. Xu, Y. Ishikawa, and J. Guan, "Effective top-k keyword search in relational databases considering query semantics," APWeb.