Graph-based keyword search in heterogeneous data sources

Angelos Christos Anadiotis, Mhd Yamen Haddad, Ioana Manolescu
Ecole Polytechnique and Institut Polytechnique de Paris; Inria and Institut Polytechnique de Paris
{name.surname}@polytechnique.edu, inria.fr

ABSTRACT
Data journalism is the field of investigative journalism which focuses on digital data by treating them as first-class citizens. Following the trends in human activity, which leaves strong digital traces, data journalism becomes increasingly important. However, as the number and the diversity of data sources increase, heterogeneous data models with different structure, or even no structure at all, need to be considered in query answering.

Inspired by our collaboration with Le Monde, a leading French newspaper, we designed a novel query algorithm for exploiting such heterogeneous corpora through keyword search. We model our underlying data as graphs and, given a set of search terms, our algorithm finds links between them within and across the heterogeneous datasets included in the graph. We draw inspiration from prior work on keyword search in structured and unstructured data, which we extend with the data heterogeneity dimension; this extension makes the keyword search problem computationally harder. We implement our algorithm and evaluate its performance using synthetic and real-world datasets.
Data analysis is increasingly important for several organizations today, as it creates value by drawing meaningful insights from the data. As we are moving towards large data lake installations where huge amounts of data are stored, the opportunities for important discoveries are growing; unfortunately, so is the amount of useless information. Moreover, the data to be processed is often stored in different formats, ranging from fully and semi-structured, to completely unstructured, like free text. Accordingly, the challenges in processing all the data that is available today reside both in expressing and in answering queries.

Research in heterogeneous data processing has proposed several approaches to address the above challenges. On one side, massively parallel processing systems like Spark [33], Hive [28] and Pig [26] provide connectors for heterogeneous data sources and allow the execution of data analysis tasks on top of them, using either a platform-specific API or a query language like SQL. Polystore-based approaches [3, 9, 12] focus more on the data model and on query planning and optimization on top of heterogeneous data stores. Finally, the so-called just-in-time (JIT) data virtualization approach generates the query engine at runtime based on the data format [20, 21]. All these works consider that users, typically data scientists, already know what they are looking for, and express it either using a powerful query language or a rich API.

However, today the data analysis paradigm has shifted, and a central point is to find parts of the data which feature interesting patterns. The patterns may not be known at query time; instead, users may have to discover them through a process of trial and error. A popular query paradigm in such a context is keyword search.
A staple of Information Retrieval in data with little or no structure, keyword search has also been applied to relational, XML or graph data, when users are unsure of the structure and would like the system to identify possible connections. In this work, we model a set of heterogeneous data sources as a graph, and focus on answering queries asking for connections among the nodes of the graph which are of interest to the users. This work is inspired by our collaboration with Les Décodeurs, Le Monde's fact-checking team, within the ContentCheck collaborative research project (https://team.inria.fr/cedar/contentcheck/). Our study is novel with respect to the state of the art (Section 6), as we are the first to consider that an answer may span multiple datasets of different data models, with very different or even absent internal structure, e.g., text data. For instance, a national company registry is typically relational, contracts or political speeches are text, social media content typically comes as JSON documents, and open data is often encoded in RDF graphs.

Integrated graph preserving all original nodes.
In the data journalism context mentioned above, it is important to be able to show where each piece of information in an answer came from, and how the connections were created. This is a form of provenance, and can also be seen as result explanation. Therefore, the queried graph needs to preserve the identity of each node from the original sources. At the same time, to enable interesting connections, we: (i) extract several kinds of meaningful entities from all the data sources of all kinds; (ii) interconnect data sources that comprise the same entity, or very similar ones, through so-called sameAs edges. Both extraction and similarity produce results with some confidence, a value between 0 and 1; thus, some edges in our graph can be seen as uncertain (but quite likely).

No help from a score function.
An important dimension of keyword search problems is scoring, i.e., how we evaluate the interestingness of a given connection (or query result). This is important for two reasons. First, in many scenarios, the number of results is extremely large, and users can only look at a small number of results, say, k. Second, some answer score measures have properties that can help limit the search, by allowing to determine that some of the answers not explored yet would not make it into the top k. Unfortunately, while desirable from an algorithmic perspective (since they simplify the problem), such assumptions on the cost model are not always realistic from a user perspective, as we learned by exchanging with journalists; we detail this in Section 3.

Bidirectional search.
All edges in our graph are directed, e.g., from the subject to the object in an RDF graph, from the parent to the child in a hierarchical document, etc., and, in keeping with our goal of integral source preservation, we store the edge direction in the graph. However, we allow answer trees to traverse edges in any direction, since heterogeneous data sources may model the same information either, say, as Alice −wrote→ Paper, or as Paper −writtenBy→ Alice; since users are unfamiliar with the data, they should not be penalized for not having "guessed" the edge directions correctly. This is in contrast with many prior works (see Section 6) which define answers as a tree where, from the root, a node matching each keyword is reached by traversing edges in their original direction only. For instance, assume the graph comprises a1 → p and a2 → p. With a restricted notion of answers, the query {a1, a2} has no answer; in contrast, in our approach, the answer connecting them through p is easily found. Bidirectional search gives a functional advantage, but makes the search more challenging: in a graph of |E| edges, the search space is multiplied by 2^|E|.

The contributions made in this work are as follows:

• We formalize the problem of bidirectional keyword search on graphs as described above, built from a combination of data sources.

• With respect to scoring, we introduce a general score function that can be extended and customized to reflect all interesting properties of a given answer. We show that this generality, together with the possibility of edge confidences lower than 1, rules out score properties (such as monotonicity) on which prior pruning techniques rely.

• We propose a complete (if exhaustive) algorithm for solving the keyword search problem in this context, as well as some original pruning criteria arising specifically in the context of our graphs. Given the usually huge search space size, a practical use of this algorithm is to run it until a time-out and retain the best answers found.

• We have implemented our algorithm and present a set of experiments validating its practical interest.

A previous version of our system was demonstrated in [5]. Since then, we have completely re-engineered the graph construction (this is described in the companion paper [4]), deepened our analysis of the query problem, and proposed a new algorithm, described in the present work; this also differs from (and improves over) our previous technical report [7].
In this section, we formalize our keyword search problem over a graph that we build by integrating data from various datasets, organized in different data models.
We consider a set M of data models: relational (including SQL databases, CSV files, etc.), RDF, JSON, HTML, XML, and text. A dataset D is an instance of one of these data models. (Our graph can also integrate other kinds of files, in particular PDF documents and spreadsheet files, by converting them to one or several instances of the above data models; as this is orthogonal to this paper, we delegate those details to [4].)

From a set D = {D1, D2, ..., Dn} of datasets, we create an integrated graph G = (N, E), where N is the set of nodes and E the set of edges. For instance, consider the dataset collection shown in Figure 1. Starting from the top left, in clockwise order, it shows: a table with assets of public officials, a JSON listing of French elected officials, an article from the newspaper Libération with entities highlighted, and a subset of the DBPedia RDF knowledge base. Figure 2 shows the graph produced from the datasets in Figure 1. There are several observations to be made on this graph:

(i) The graph comprises four dataset nodes (the ones filled with yellow), one for each data source.

(ii) All the internal structure present in the input datasets is preserved in the graph: each RDF node became a node in the integrated graph, and each triple became an edge. A node is created for each map, array, and value in the JSON document. A node is created from each tuple, and from each attribute, in the relational databases. Finally, a single node is created from the whole text document, which has no internal structure. When a text consists of more than one phrase, we segment it into a sequence of phrases, each of which is a node (child of the dataset node), to avoid overly large nodes that are hard for users to interpret.

(iii) Entity nodes (rounded-corner blue boxes) are extracted using Information Extraction (IE) techniques. Thus, in the example, the nodes labeled "P. Balkany" and "I. Balkany" are recognized as People, "Levallois-Perret" and "Centrafrique" are recognized as Locations, while "Areva" is an Organization. An extracted entity is added to the graph as a child of the node (leaf in an XML, HTML, JSON or text document; attribute value from a relational dataset; or RDF literal) from which it has been extracted.

(iv) Equivalence edges (solid red edges in Figure 1) connect nodes found in different datasets which are considered to refer to the same real-world entity. For instance, the three occurrences of "P. Balkany" are pairwise connected by edges with a confidence of 1.0. The confidence of the edges derived directly from the datasets, as explained above, is 1.0; we do not show it in the figure to avoid clutter. We say nodes connected by equivalence edges are equivalent.

(v) Similarity edges (dotted, curved red edge between "Central African Republic" and "Centrafrique" in Figure 1) connect nodes which are considered strongly similar but not equivalent. In our example, the two nodes have a similarity of 0.85, which is attached to the edge as its confidence.

For efficiency, when k nodes are equivalent, we do not materialize all the k(k−1)/2 equivalence edges; instead, one of the nodes (the first to be added to the graph; any other choice could be made) is designated the representative of all of them, and we store, associated with each node, the ID of its representative.

The purpose of the equivalence and similarity edges is to interconnect nodes within and across the datasets; entity extraction prepares the ground for this, since it creates nodes that may co-occur across data sources, e.g., entities mentioned in separate texts, such as "P. Balkany" in the figure. This increases the value and usefulness of the graph, since it allows finding connections which cannot be established based on any dataset taken separately. For instance, consider the question: «What connections exist between "I. Balkany", "Africa", and "real estate"?» This can be asked as a three-keyword query {"I. Balkany", "Africa", "Estate"}, for which an answer (a tree composed of graph edges) is shown as a light green highlight in Figure 1; the three nodes matching the respective keywords are shown in bold. This answer interconnects all four data sources.
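To make the representative mechanism concrete, here is a minimal Python sketch of storing one representative per node instead of materializing all pairwise equivalence edges. All names here (`IntegratedGraph`, `add_equivalence`, the node IDs) are our own illustration, not identifiers from the actual system:

```python
class IntegratedGraph:
    """Toy integrated graph: each node records its dataset of origin and
    the ID of its equivalence-class representative (the first equivalent
    node added to the graph)."""

    def __init__(self):
        self.nodes = {}           # node id -> {"label": ..., "dataset": ...}
        self.edges = []           # (source, label, target, confidence)
        self.representative = {}  # node id -> representative node id

    def add_node(self, nid, label, dataset):
        self.nodes[nid] = {"label": label, "dataset": dataset}
        self.representative[nid] = nid   # its own representative at first

    def add_edge(self, src, label, tgt, confidence=1.0):
        self.edges.append((src, label, tgt, confidence))

    def add_equivalence(self, earlier, later):
        # The earlier node's representative becomes the representative of
        # the whole class; no pairwise edge clique is materialized.
        self.representative[later] = self.representative[earlier]

g = IntegratedGraph()
g.add_node("r1", "P. Balkany", dataset="registry")  # relational value
g.add_node("t1", "P. Balkany", dataset="article")   # entity from text
g.add_node("j1", "P. Balkany", dataset="json")      # JSON value
g.add_equivalence("r1", "t1")
g.add_equivalence("r1", "j1")
# All three occurrences now share a single representative:
reps = {g.representative[n] for n in ("r1", "t1", "j1")}
```

With this layout, checking whether two nodes are equivalent is a constant-time comparison of their representative IDs.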
Figure 1: Sample dataset collection D.

Figure 2: Integrated graph corresponding to the datasets of Figure 1.

We formalize this keyword search query problem below.
Given our graph G = (N, E), we denote by L the set of all the labels of G nodes, plus the special constant ϵ denoting the empty label. We denote by λ(·) a function assigning to each node and edge a label, which may be empty. As illustrated in Figure 2, internal nodes, which correspond, e.g., to a relational tuple, or to a JSON map or array, have an empty label.

Let W be the set of keywords, obtained by stemming the label set L; a search query is a set of keywords Q = {w1, ..., wm}, where wi ∈ W. We define an answer tree (AT, in short) as a set t of G edges which (i) together, form a tree (each node is reachable from any other through exactly one path), and (ii) for each wi, contain at least one node whose label matches wi. Here, the edges are considered undirected, that is: n1 → n2 ← n3 → n4 is a sample AT, such that for each wi ∈ Q, there is a node nj ∈ t such that wi ∈ λ(nj).

We treat the edges of G as undirected when defining the AT in order to allow more query results, on a graph built out of heterogeneous content whose structure is not well known to users. For instance, consider a query consisting of the keywords k1, k2 such that k1 ∈ λ(n1) and k2 ∈ λ(n4), on the four-node sample AT introduced above. If our ATs were restricted to the original direction of G edges, the query would have no answer; ignoring the edge directions, it has one.
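The AT definition (an undirected tree whose nodes cover all query keywords) can be checked mechanically. The following Python sketch, with hypothetical helper names, tests exactly conditions (i) and (ii) above; minimality is not checked here:

```python
from collections import defaultdict

def is_answer_tree(tree_edges, labels, query):
    """Check the AT definition: the edge set forms a tree when edges are
    taken as undirected, and every query keyword matches the label of at
    least one node in it. `labels` maps node -> label (possibly empty)."""
    nodes = set()
    adj = defaultdict(set)
    for a, b in tree_edges:
        nodes.update((a, b))
        adj[a].add(b)
        adj[b].add(a)                  # edge direction is ignored
    if not nodes or len(tree_edges) != len(nodes) - 1:
        return False                   # a tree has exactly |V| - 1 edges
    # connectivity (with the edge count above, this also excludes cycles)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n])
    if seen != nodes:
        return False
    return all(any(w in labels[n] for n in nodes) for w in query)

# The sample AT from the text: n1 -> n2 <- n3 -> n4,
# with k1 matching n1 and k2 matching n4.
edges = [("n1", "n2"), ("n3", "n2"), ("n3", "n4")]
labels = {"n1": "k1", "n2": "", "n3": "", "n4": "k2"}
print(is_answer_tree(edges, labels, {"k1", "k2"}))  # True
```

Keyword matching is reduced here to a substring test; the actual system stems labels and keywords first, as described above.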
One could easily extend the definition and the whole discussion to allow matches to also occur on edges (just enlarge L to also include the stemmed edge labels).

Further, we are interested in minimal answer trees, that is:

(1) Removing an edge from the tree should make it lack one or more of the query keywords wi.

(2) If a query keyword wi matches the label of more than one node in the answer tree, then all these matching nodes must be equivalent.

Condition (2) is specific to the graph we consider, originating in several data sources connected by equivalence or similarity edges. In classical graph keyword search problems, each query keyword is matched exactly once in an answer (otherwise, the tree is considered non-minimal). In contrast, our answer trees may need to traverse equivalence edges, and if wi is matched by one node connected by such an edge, it is also matched by the other. For instance, consider the three-keyword query "Gyucy Balkany Levallois" in Figure 2: the keyword Balkany is matched by the two nodes labeled "P. Balkany" which are part of the answer.

As a counter-example to condition (2), consider the query "Balkany Centrafrique" in Figure 2, assuming the keyword Centrafrique is also matched in the label "Central African Republic". Consider the tree that connects a "P. Balkany" node with "Centrafrique", and also traverses the edge between "Centrafrique" and "Central African Republic": this tree is not minimal, thus it is not an answer. The intuition for rejecting it is that "Centrafrique" and "Central African Republic" may or may not be the same thing (we have a similarity, not an equivalence edge); therefore, the query keyword "Centrafrique" is matched by two potentially different things in this answer, making it hard to interpret.

A direct consequence of minimality is that in an answer, each and every leaf matches a query keyword.

Several minimal answer trees may exist in G for a given query. We consider available a scoring function which assigns a higher value to more interesting answer trees (see Section 3). Thus, our problem can be stated as follows:

Problem statement. Given the graph G built out of the datasets D and a query Q, return the k highest-score minimal answer trees.

An AT may potentially span the whole graph, (also) because it can traverse G edges in any direction; this makes the problem challenging.

Discussion: degraded answers.
In some cases, a query may have no answer (as defined above) on a given graph, yet if one is willing to drop the second condition concerning nodes matching the same query keyword, an answer tree could be found. For instance, consider a graph of the form a −l→ b1 −m→ b2 −n→ c, such that b1 is not equivalent to b2, and the query {a, b, c}, such that the keyword a matches the node a, b matches b1 and b2, and c matches c. Given our definition of answers above, this query has no answer, because b matches the two nodes b1 and b2. (Such matches may arise when using a more advanced indexing system that includes some natural language understanding, term dictionaries, etc.)

If we removed condition (2), we could accept such an answer, which we call degraded, since it is harder to interpret for users (lacking one clearly identified node for each keyword). One could then generalize our problem statement into: (i) solve the problem stated above, and (ii) only if there are no answers, find the top-k degraded answers (if they exist). We do not pursue degraded answer search further in this paper, and focus instead on finding the answers defined above.

The problem that we study is related to the (Group) Steiner Tree Problem, which we recall below. Given a graph G with weights (costs) on edges, and a set of m nodes n1, ..., nm, the Steiner Tree Problem (STP) [14] consists of finding the smallest-cost tree in G that connects all these nodes together. We could answer our queries by solving one STP problem for each combination of nodes matching the keywords w1, ..., wm. However, several obstacles remain: (a) STP is a known NP-hard problem in the size of G, denoted |G|; (b) as we consider that each edge can be taken in the direct or the reverse direction, this amounts to "doubling" every edge in G; thus, our search space is 2^|E| times larger than that of the STP, or than those considered in similar works, discussed in Section 6. This is daunting even for small graphs of a few hundred edges; (c) we need the k smallest-cost trees, not just one; (d) each keyword may match several nodes, not just one.

The closely related Group STP (GSTP, in short) [14] is: given m sets of nodes from G, find the minimum-cost subtree connecting one node from each of these sets. GSTP does not raise problem (d), but still has all the others.

In conclusion, the complexity of the problem we consider is extremely high. Therefore, solving it fully is unfeasible for large and/or high-connectivity graphs. Instead, our approach is:

• Attempt to find all answers from the smallest (fewest edges) to the largest.
Enumerating small trees first is both a practical decision (we use them to build larger ones) and fits the intuition that we should not miss small answers that a human could have found manually. However, as we will explain, we still "opportunistically" build some trees before exhausting the enumeration of smaller ones, whenever this is likely to lead faster to answers. The strategy for choosing to move towards bigger instead of smaller trees leaves room for optimizations of the search order.

• Stop at a given time-out or when m answers have been found, for some m ≥ k;

• Return the k top-scoring answers found.

We now discuss how to evaluate the quality of an answer. Section 3.1 introduces the general notion of score on which we base our approach. Section 3.2 describes one particular metric we attach to edges in order to instantiate this score; finally, Section 3.3 details the actual score function we used.
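The smallest-first, anytime strategy above can be illustrated with a deliberately naive brute-force enumerator on a toy graph. This is not the Grow/Merge algorithm of Section 4, and minimality condition (2) is omitted for brevity; all names are our own:

```python
from itertools import combinations

def answers_smallest_first(edges, labels, query, m):
    """Enumerate answer trees from fewest edges to most, treating edges
    as undirected, and stop once m answers are found. Exponential in the
    graph size: for illustration only."""
    def is_tree_covering(sub):
        nodes, parent = set(), {}
        def find(x):                         # tiny union-find
            while parent.get(x, x) != x:
                x = parent[x]
            return x
        for a, _lbl, b in sub:
            nodes.update((a, b))
            ra, rb = find(a), find(b)
            if ra == rb:
                return False                 # this edge closes a cycle
            parent[ra] = rb
        if len({find(n) for n in nodes}) != 1:
            return False                     # not connected
        return all(any(w in labels[n] for n in nodes) for w in query)

    found = []
    for size in range(1, len(edges) + 1):    # smallest trees first
        for sub in combinations(edges, size):
            if is_tree_covering(sub):
                found.append(sub)
                if len(found) >= m:          # anytime stop, as with a time-out
                    return found
    return found

edges = [("a", "in", "p"), ("b", "in", "p"), ("p", "cites", "q")]
labels = {"a": "alice", "b": "bob", "p": "", "q": ""}
res = answers_smallest_first(edges, labels, {"alice", "bob"}, m=1)
```

Because subsets are examined in increasing size, the first answer returned uses the fewest edges; the real algorithm achieves a similar order without enumerating all subsets.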
We have configured our problem setting to allow any scoring function, which enables the use of different scoring schemes fitting the requirements of different users. As a consequence, this approach allows us to study the interaction of the scoring function with different properties of the graph. For instance, we are currently investigating the possibility to learn what makes an answer interesting for a user, so that we may return customized answers to each user.

Given an answer tree t to a query Q, we consider a score function consisting of (at least) the following two components:

• The matching score ms(t), which reflects the quality of the answer tree, that is, how well its leaves match the query terms.

• The connection score cs(t), which reflects the quality of the tree connecting these leaves. Any formula can be used here, considering the number of edges, the confidence or any other property attached to edges, or a query-independent property of the nodes, such as their PageRank or betweenness centrality score, etc.

The score of t for Q, denoted s(t), is computed as a combination of the two independent components ms(t) and cs(t). Popular combination functions (a weighted sum, a product, etc.) are monotonic in both components; however, our framework does not require this. Finally, both ms(t) and cs(t) can be tuned based on a given user's preferences, to personalize the score, or they can evolve over time through user feedback, etc.

We now describe a metric on edges, which we use (through the connection score cs(t)) to favor edges that are "rare" for both nodes they connect.
This metric was inspired by our experiments with real-world data sources, and it helped return interesting answer trees in our experience.

For a given node n and label l, let N^l_{→n} be the number of l-labeled edges entering n, and N^l_{n→} the number of l-labeled edges exiting n. The specificity of an edge e = n1 −l→ n2 is defined as:

s(e) = 2 / (N^l_{n1→} + N^l_{→n2})

For instance, if n1 has exactly one outgoing l-labeled edge and n2 exactly one incoming l-labeled edge, then s(e) = 2/(1+1) = 1.0, the maximum. In contrast, there are 54 countries in Africa (we show only two), and each country is in exactly one continent; thus, the specificity of the dbo:partOf edges in the DBPedia fragment, going from the node named Morocco (or the one named Central African Republic) to the node named Africa, is 2/(1+54) ≈ 0.036.

Specificity computation.
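A minimal sketch of computing edge specificity over a batch of edges follows, assuming edges are (source, label, target) triples; the country example uses three countries instead of fifty-four to stay small, so the resulting value differs from the paper's figure:

```python
from collections import Counter

def edge_specificity(edges):
    """s(e) = 2 / (N^l_{n1->} + N^l_{->n2}): an edge labeled l is specific
    when its source has few outgoing l-edges and its target has few
    incoming l-edges. `edges` is a list of (source, label, target)."""
    out_count = Counter((src, lbl) for src, lbl, _t in edges)
    in_count = Counter((tgt, lbl) for _s, lbl, tgt in edges)
    return {
        e: 2.0 / (out_count[(e[0], e[1])] + in_count[(e[2], e[1])])
        for e in edges
    }

# Three countries, each "partOf" one continent (a small stand-in for
# the 54-country Africa example):
edges = [("Morocco", "partOf", "Africa"),
         ("CAR", "partOf", "Africa"),
         ("Egypt", "partOf", "Africa")]
spec = edge_specificity(edges)
print(spec[("Morocco", "partOf", "Africa")])  # 2 / (1 + 3) = 0.5
```

This batch version recomputes everything from scratch; the incremental algorithm described next avoids that when datasets are registered one by one.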
When registering the first dataset D1, computing the specificity of its edges is trivial. However, when registering subsequent datasets D2, D3, etc., if some node, say n2 ∈ D2, is found to be equivalent to a node n1 ∈ D1, all the D1 edges adjacent to n1 and the D2 edges adjacent to n2 should be reflected in the specificity of each of these edges. Thus, in particular, the specificity of D1 edges needs to be recomputed when a node in a source added after D1 is equivalent to one of its nodes.

Figure 3: Illustration for specificity (re)computation. The specificity of the edge x −l→ n1, s(e), is initially computed out of the blue edges; when n2 joins the equivalence set es, it is recomputed to also reflect the violet edges.

A naïve approach would be: when the edges of D2 are traversed (when we add this dataset to the graph), re-traverse the edges of n1 in D1 in order to (re)compute their specificity. However, that would be quite inefficient. Instead, below, we describe an efficient incremental algorithm to compute specificity.

We introduce two notations. For any edge e, we denote by N_{e→•}, respectively N_{e◦→}, the two numbers out of which the specificity of e has been most recently computed (either during the first specificity computation of e, or during a recomputation, as discussed below). Specifically, N_{e→•} counts the l-labeled edges incoming to the target of e, while N_{e◦→} counts the l-labeled edges outgoing from the source of e. In Figure 3, if e is the edge x −l→ n1, then N_{e→•} = 3 and N_{e◦→} = 1, thus s(e) = 2/4 = 0.5.

Let n1 ∈ D1 be a node, es be the set of all nodes equivalent to n1, and n2 ∈ D2 be a node in a dataset we currently register, which has just been found to be equivalent to n1, also. Further, let l be a label of an edge incoming or outgoing (any) node from es, and/or n2. We denote by N^l_{→es} the sum Σ_{n∈es} N^l_{→n}, and similarly by N^l_{es→} the sum Σ_{n∈es} N^l_{n→}; they are the numbers of l-labeled incoming (resp., outgoing) edges of any node in es. When n2 joins the equivalence set es of n1 (see Figure 3):

(1) If N^l_{→es} ≠ 0 and N^l_{→n2} ≠ 0, the specificity of every l-labeled edge e incoming either a node in es or the node n2 must be recomputed. Let e be such an incoming edge labeled l. When n2 is added to the set es, the specificity of e becomes 2/((N_{e→•} + N^l_{→n2}) + N_{e◦→}), to reflect that n2 brings more incoming l-labeled edges. This amounts to 2/((3 + 2) + 1) ≈ 0.33 in Figure 3: the violet edges have joined the blue ones. Following this adjustment, the numbers out of which e's specificity has been most recently computed are modified as follows: N_{e→•} becomes N_{e→•} + N^l_{→n2}, thus 3 + 2 = 5, while N_{e◦→} remains unchanged.

(2) If N^l_{→es} = 0 and N^l_{→n2} ≠ 0, the specificity of every l-labeled edge e incoming n2 does not change when n2 joins the equivalence set es.

(3) If N^l_{→es} ≠ 0 and N^l_{→n2} = 0, the newly added node n2 does not change the edges adjacent to the nodes of es, nor their specificity values.

The last two cases, on outgoing edges (involving N^l_{es→} and N^l_{n2→}), are handled in a similar manner.

The above method only needs, for a given node n2 newly added to the graph, and a label l, the number of l edges adjacent to n2 in its dataset, and the number of l edges adjacent to a node equivalent to n2. Unlike the naïve specificity computation method, it does not need to actually traverse the previously registered edges, making it more efficient. Concretely, for each edge e ∈ E, we store three attributes: N_{e→•}, N_{e◦→} and s, the last-computed specificity, and we update N_{e→•} and N_{e◦→} as explained above.

In our experiments, we used the following score function. For an answer t to the query Q, we compute the matching score ms(t) as the average, over all query keywords wi, of the similarity between the t node matching wi and the keyword wi itself; we used the edit distance. We compute the connection score cs(t) based on edge confidence, on one hand, and edge specificity, on the other. We multiply the confidence values, since we consider that uncertainty (confidence < 1) multiplies; and we also multiply the specificities of all edges in t, to discourage many low-specificity edges. Specifically, our score is computed as:

score(t, Q) = α · ms(t, Q) + β · Π_{e∈t} c(e) + (1 − α − β) · Π_{e∈t} s(e)

where α, β are parameters of the system such that 0 ≤ α, β < 1 and α + β ≤ 1.

Before we describe the search algorithm, we make a few more remarks on the connection between the score function and the search algorithm. We start by considering the classical Steiner Tree and Group Steiner Tree Problems (Section 2.3). These assume that the tree cost is monotonic, that is: for any query Q and all trees T, T̂ where T is a subtree of T̂, the cost of T is higher (in our terminology, its score is lower) than the cost of T̂. In contrast, the score, in its general form (Section 3.1), and in particular our concrete one (Section 3.3), is not monotonic, as illustrated in Figure 4, where on each edge, c is the confidence and s is the specificity. Denoting by T the four-edge tree rooted in n1, the connection score cs(T) = β + (1 − α − β) · (0.5)^4, while cs(T′) = β · 0.5 + (1 − α − β) · (0.5)^4 · 0.5. If α = β = 1/3, then cs(T) = (1/3) · (1 + (0.5)^4) ≈ 0.35 while cs(T′) = (1/3) · (0.5 + (0.5)^5) ≈ 0.17, which is clearly smaller. Assuming T′ has the same matching score as T, the global score of T′ is smaller than that of T, contradicting the monotonicity assumption.

Another property sometimes assumed by score functions is the so-called optimal substructure, that is: the best solution for a problem of size p1 is part of the best solution for a problem of size p1 + p2, for some problem sizes p1, p2. When this holds, the problem can be efficiently solved in a dynamic programming fashion. However, STP does not enjoy this property: the smallest-cost tree connecting two nodes n1, n2 is not necessarily part of the smallest-cost tree that connects n1, n2, n3 (and the same holds for GSTP). Some existing algorithms also assume a variant of the optimal substructure property (see Section 6). In contrast, our score function (both in its general and in its concrete form) does not ensure such favorable properties. This is why the search algorithm we describe next has to find as many answers as possible, as quickly as possible.

We now present our approach for computing query answers, based on the integrated graph.
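The concrete score of Section 3.3 can be sketched in a few lines of Python. The edge values below mirror our reading of Figure 4 (four certain edges of specificity 0.5, plus one edge of confidence and specificity 0.5 for the larger tree) and the choice α = β = 1/3; both are illustrative assumptions:

```python
from math import prod  # Python 3.8+

def score(ms, edges, alpha=1/3, beta=1/3):
    """score(t, Q) = alpha * ms(t, Q)
                   + beta * (product of edge confidences)
                   + (1 - alpha - beta) * (product of edge specificities).
    `edges` lists (confidence, specificity) pairs for the tree's edges;
    alpha = beta = 1/3 is an arbitrary illustrative choice."""
    return (alpha * ms
            + beta * prod(c for c, _ in edges)
            + (1 - alpha - beta) * prod(s for _, s in edges))

# Assumed Figure 4 values: T has four certain edges of specificity 0.5;
# T' extends T with one edge of confidence 0.5 and specificity 0.5.
T = [(1.0, 0.5)] * 4
T_prime = T + [(0.5, 0.5)]
# With ms = 0, the remaining terms are the connection score: the larger
# tree scores lower, so no monotone bound relates a tree to its subtrees.
print(score(0.0, T), score(0.0, T_prime))
```

Because the score mixes a weighted sum with two products, neither STP-style monotonic bounds nor dynamic-programming substructure arguments transfer to it directly, as discussed above.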
Our first algorithm uses some concepts from the prior literature [10, 18], while exploring many more trees. Specifically, it starts from the sets of nodes N1, ..., Nm where the nodes in Ni all match the query keyword wi; each node ni,j ∈ Ni forms a one-node partial tree. For instance, in Figure 2, one-node trees are built from the nodes with boldface text, labeled "Africa", "Real Estate" and "I. Balkany". Two transformations can be applied to form increasingly larger trees, working toward query answers:

• Grow(t, e), where t is a tree, e is an edge adjacent to the root of t, and e does not close a loop with a node in t, creates a new tree t′ having all the edges of t plus e; the root of the new tree is the other end of the edge e. For instance, starting from the node labeled "Africa", a Grow step can add the edge labeled dbo:name.

• Merge(t1, t2), where t1, t2 are trees with the same root, whose other nodes are disjoint, and which match disjoint sets of keywords, creates a tree t′′ with the same root and with all the edges from t1 and t2. Intuitively, Grow moves away from the keywords, to explore the graph; Merge fuses two trees into one that matches more keywords than both t1 and t2.

In a single-dataset context, Grow and Merge have the following properties. (gm1) Grow alone is complete (guaranteed to find all answers) for one-keyword queries; for more keywords, Grow and Merge together are complete. (gm2) Using Merge steps helps find answers faster than using just Grow [18]: partial trees, each starting from a leaf that matches a keyword, are merged into an answer as soon as they have reached the same root. (gm3) An answer can be found through multiple combinations of Grow and Merge. For instance, consider a linear graph n1 → n2 → ... → np and the two-keyword query {a1, ap} where ai matches the label of ni. The answer is obviously the full graph. It can be found: starting from n1 and applying p − 1 Grow steps; starting from np and applying p − 1 Grow steps; or through a Merge of partial trees that have reached the same root among n2, ..., np−1. These are all the same according to our definition of an answer (Section 2.2), which does not distinguish a root in an answer tree; this follows users' need to know how things are connected, for which the tree root is irrelevant.

The changes we brought for our harder problem (bidirectional edges and multiple interconnected datasets) are as follows.
1. Bidirectional growth.
We allow Grow to traverse an edge both going from the source to the target, and going from the target to the source. For instance, the type edge from "Real Estate" to <tuple1> is traversed target-to-source, whereas the location edge from <tuple1> to "Real Estate" is traversed source-to-target.

Figure 4: Example (non-monotonicity of the tree score). T is the four-edge tree rooted in n1; each edge is annotated with its specificity s and confidence c.
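The bidirectional Grow step described above amounts to indexing every edge under both of its endpoints. A minimal Python sketch follows; the function names and the mini-graph are our own illustration:

```python
from collections import defaultdict

def undirected_adjacency(edges):
    """Index each (source, label, target) edge under both endpoints, so
    a Grow step can traverse it source-to-target or target-to-source,
    while the stored direction is preserved in the edge itself."""
    adj = defaultdict(list)
    for e in edges:
        src, _label, tgt = e
        adj[src].append((tgt, e))   # forward traversal
        adj[tgt].append((src, e))   # reverse traversal
    return adj

def grow(tree_edges, root, adj):
    """One Grow step: all trees obtained by adding one edge adjacent to
    the root, skipping edges that would close a loop; the new root is
    the other end of the added edge."""
    in_tree = {n for a, _l, b in tree_edges for n in (a, b)} or {root}
    return [(tree_edges + [e], new_root)
            for new_root, e in adj[root] if new_root not in in_tree]

# Hypothetical mini-graph where both stored directions occur:
edges = [("Alice", "wrote", "Paper1"), ("Paper2", "author", "Alice")]
adj = undirected_adjacency(edges)
step1 = grow([], "Paper1", adj)   # reverse-traverses the "wrote" edge
tree1, root1 = step1[0]           # now rooted in "Alice"
step2 = grow(tree1, root1, adj)   # reverse-traverses "author" to "Paper2"
```

Both edges are reached regardless of their stored direction, which is exactly what lets answers connect nodes that only point "toward" each other.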
2. Many-dataset answers.
As defined in a single-dataset scenario, Grow and Merge do not allow connecting multiple datasets. To make that possible, we need to enable one, the other, or both to also traverse similarity and equivalence edges (shown in solid or dotted red lines in Figure 2). We decide to simply extend Grow to allow it to traverse not just data edges, but also similarity edges between nodes of the same or different datasets. We handle equivalence edges as follows:
The simplest idea is to allow Grow to also add an equivalence edge to the root of a tree. However, this can be very inefficient. Consider three equivalent nodes m, m′ and m″, e.g., the three “P. Balkany” nodes in Figure 2: a Grow step could add one equivalence edge, the next Grow could add another on top of it, etc. More generally, for a group of p equivalent nodes, starting from a tree rooted in one of these nodes, 2^p trees would be created just by Grow. In our French journalistic datasets, some entities, e.g., “France”, are very frequent, leading to high p; exploring 2^p subtrees every time we reach a “France” node is extremely expensive. To avoid this, we devise a third algorithmic step, called
Grow-to-representative (Grow2Rep), as follows. Let t be a partial tree developed during the search, rooted in a node n, such that the representative of n (recall Section 2.1) is a node nrep ≠ n. Grow2Rep creates a new tree by adding to t the equivalence edge n ≡→ nrep; this new tree is rooted in nrep. If n is part of a group of p equivalent nodes, only one Grow2Rep step is possible from t, to the unique representative of n; Grow2Rep does not apply again on Grow2Rep(t), because the root of this tree is nrep, which is its own representative.

Together, Grow, Grow2Rep and Merge enable finding answers that span multiple data sources, as follows:

• Grow allows exploring data edges within a dataset, and similarity edges within or across datasets;
• Grow2Rep goes from a node to its representative when they differ; the representative may be in a different dataset;
• Merge merges trees with a same root: when that root is the representative of a group of p equivalent nodes, this allows connecting partial trees, including Grow2Rep results, containing nodes from different datasets. Thus, Merge can build trees spanning multiple datasets.

Note that similarity edges do not raise the same problem, because in our graph we only have such edges if the similarity between two nodes is above a certain threshold τ. Thus, if a node n1 is at least τ-similar to n2, and n2 is at least τ-similar to n3, n1 may be at least τ-similar to n3, or not. This leads to much smaller groups of similar nodes than the groups of equivalent nodes we encountered.

One potential performance problem remains. Consider again p equivalent nodes n1, . . . , np; assume without loss of generality that their representative is n1. Assume that during the search, a tree ti is created rooted in each of these p nodes. Grow2Rep applies to all but the first of these trees, creating the trees t′2, t′3, . . . , t′p, all rooted in n1.
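As a sketch, Grow2Rep amounts to a representative lookup per node (the node names and edge encoding below are illustrative, not the system's):

```python
# Hypothetical sketch of Grow2Rep. Every node has a representative: one
# canonical member per group of equivalent nodes.

REP = {
    "P. Balkany (rdf)":  "P. Balkany (text)",
    "P. Balkany (json)": "P. Balkany (text)",
    "P. Balkany (text)": "P. Balkany (text)",   # its own representative
}

def grow2rep(tree):
    """Add the equivalence edge root -> rep(root); the result is rooted
    in the representative. Returns None when the root already is its own
    representative, so the step never applies twice in a row."""
    root, edges, kws = tree
    rep = REP.get(root, root)
    if rep == root:
        return None
    return (rep, edges | {(root, rep, "equiv")}, kws)

t = ("P. Balkany (rdf)", frozenset(), frozenset({"balkany"}))
t2 = grow2rep(t)
assert t2[0] == "P. Balkany (text)"
assert grow2rep(t2) is None   # only one Grow2Rep per equivalence group
```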
Now, Merge can merge any pair of them, and can then repeatedly apply to merge three, then four such trees, etc., as they all have the same root n1. The exponential explosion of Grow trees, avoided by introducing Grow2Rep, is still present due to Merge!

We solve this problem as follows. Observe that in an answer, a path of two or more equivalence edges of the form n1 ≡→ n2 ≡→ n3 such that a node internal to the path, e.g., n2, has no other adjacent edge, even if allowed by our definition, is redundant. Intuitively, such a node brings nothing to the answer, since its neighbors, e.g., n1 and n3, could have been connected directly by a single equivalence edge, thanks to the transitivity of equivalence. We call non-redundant an answer that does not feature any such path, and decide to search for non-redundant answers only.

The following properties hold on non-redundant answers:

Property 1. There exists a graph G and a k-keyword query Q such that a non-redundant answer contains k − 1 adjacent equivalence edges (edges that, together, form a single connected subtree).

Figure 5: Sample answer trees for algorithm discussion.
We prove this by exhibiting such an instance. Let G be the graph of 2k nodes shown in Figure 5(a), such that all the xi are equivalent, and consider the k-keyword query Q = {a1, . . . , ak} (each keyword matches exactly the respective ai node). An answer needs to traverse all the k edges from ai to xi, and then connect the nodes x1, . . . , xk; we need k − 1 adjacent equivalence edges for this.

Property 2. Let t be a non-redundant answer to a query Q of k keywords. A group of adjacent equivalence edges contained in t has at most k − 1 edges.

We prove this by induction over k. For k = 1, each answer has 1 node and 0 edges (trivial case). Now, consider this true for k and let us prove it for k + 1. Assume by contradiction that a non-redundant answer tQ to a query Q of k + 1 keywords contains a group of at least k + 1 adjacent equivalence edges. Let Q′ be the query having only the first k keywords of Q, and t′ be a subtree of t that is a non-redundant answer to Q′:

• t′ exists, because t connects all Q keywords, thus also the Q′ keywords;
• t′ is non-redundant, because its edges are also in the (non-redundant) t.

By the induction hypothesis, t′ has at most k − 1 adjacent equivalence edges; thus, there are at least two adjacent equivalence edges in t \ t′.

(1) If these edges, together, lead to two distinct leaves of t, then t has two leaves not in t′. This is not possible, because by definition of an answer, t has k + 1 leaves while t′ has k leaves.
(2) It follows, then, that the two edges lead to a single leaf of t, therefore the edges form a redundant path. This contradicts the non-redundancy of t, and concludes our proof.

Property 2 gives us an important way to control the exponential development of trees due to p equivalent nodes. Grow, Grow2Rep and Merge, together, can generate trees with up to k (instead of k − 1) adjacent equivalence edges, when a query keyword matches one of k equivalent nodes (see Figure 5(b), assuming x is the representative of all the equivalent xi's, and the query {a1, . . . , ak}). The resulting answer may be redundant, if the representative has no adjacent edges in the answer other than equivalence edges. In such cases, in a post-processing step, we remove from the answer the representative and its equivalence edges, then reconnect the respective equivalent nodes using k − 1 equivalence edges.

We now have the basic exploration steps we need: Grow, Grow2Rep and Merge. In this section, we explain how we use them in our integrated keyword search algorithm. We decide to apply in sequence: one Grow or Grow2Rep (see below), leading to a new tree t, immediately followed by all the Merge operations possible on t. Thus, we call our algorithm Grow and Aggressive Merge (GAM, in short).
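The redundancy post-processing (dropping a representative attached only through equivalence edges, then re-chaining its group) can be sketched as follows, with an illustrative (node, node, label) edge encoding:

```python
def remove_redundant_rep(answer_edges, rep):
    """Post-processing sketch: if rep is attached to the answer only
    through equivalence edges, drop it and chain its p equivalent
    neighbours directly with p - 1 equivalence edges, which is valid
    because equivalence is transitive. Edge encoding is ours."""
    adj = [e for e in answer_edges if rep in (e[0], e[1])]
    if not adj or any(lbl != "equiv" for _, _, lbl in adj):
        return set(answer_edges)          # rep is needed, keep answer as is
    neighbours = [a if b == rep else b for a, b, _ in adj]
    chain = {(a, b, "equiv") for a, b in zip(neighbours, neighbours[1:])}
    return (set(answer_edges) - set(adj)) | chain

# Three equivalent nodes x1, x2, x3 hang off representative "x", which
# carries nothing else: the cleaned answer keeps 2 equivalence edges.
answer = {("a1", "x1", "data"), ("a2", "x2", "data"), ("a3", "x3", "data"),
          ("x1", "x", "equiv"), ("x2", "x", "equiv"), ("x3", "x", "equiv")}
assert len(remove_redundant_rep(answer, "x")) == 5
```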
We merge aggressively in order to detect as quickly as possible when some of our trees, merged at the root, form an answer.

Given that every node of a currently explored answer tree can be connected with several edges, we need to decide which Grow (or Grow2Rep) to apply at a certain point. For that, we use a priority queue U in which we add (tree, edge) entries: for Grow, with the notation above, we add the (t, e) pair, while for Grow2Rep, we add t together with the equivalence edge leading to the representative of t's root. In both cases, when a (t, e) pair is extracted from U, we just extend t with the edge e (adjacent to its root), leading to a new tree tG, whose root is the other end of the edge e. Then we aggressively merge tG with all compatible trees explored so far; finally, we read from the graph the (data, similarity or equivalence) edges adjacent to tG's root and add to U more (tree, edge) pairs to be considered further during the search. The algorithm then picks the highest-priority pair in U and reiterates; it stops when U is empty, at a timeout, or when a maximum number of answers are found (whichever comes first).

The last parameter impacting the exploration order is the priority used in U: at any point, U gives the highest-priority (t, e) pair, which determines the operations performed next.

(1) Trees matching many query keywords are preferable, to go toward complete query answers;
(2) At the same number of matched keywords, smaller trees are preferable, in order not to miss small answers;
(3) Finally, among (t1, e1), (t2, e2) with the same number of nodes and matched keywords, we prefer the pair with the higher specificity edge.

Algorithm details
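The three criteria translate naturally into a lexicographic priority tuple; below is a sketch with Python's heapq (our own encoding, not the system's code):

```python
import heapq

# Illustrative priority for (tree, edge) pairs in U; a tree is encoded as
# (root, edges, matched_keywords). Python's heapq pops the smallest
# tuple, so "better" components are negated.

def priority(tree, edge_specificity):
    root, edges, kws = tree
    n_nodes = len({n for e in edges for n in e} | {root})
    return (-len(kws),           # (1) more matched keywords first
            n_nodes,             # (2) then smaller trees
            -edge_specificity)   # (3) then higher-specificity edges

U, tie = [], 0

def push(tree, edge, spec):
    global tie
    heapq.heappush(U, (priority(tree, spec), tie, tree, edge))
    tie += 1                     # tie-breaker: trees are never compared

small = ("n1", frozenset(), frozenset({"a"}))
big   = ("n2", frozenset({("n2", "n3")}), frozenset({"a", "b"}))
push(small, ("n1", "n2"), 1.0)
push(big, ("n2", "n4"), 0.5)
assert heapq.heappop(U)[2] == big   # two matched keywords win first
```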
Beyond the priority queue U described above, the algorithm also uses a memory of all the trees explored, called E. It also organizes all the (non-answer) trees into a map K in which they can be accessed by the subset of query keywords that they match. The algorithm is shown in pseudocode in Figure 6, following the notations introduced in the above discussion.

While not shown in Figure 6 to avoid clutter, the algorithm only develops minimal trees (thus, it only finds minimal answers). This is guaranteed:

• When creating Grow and Grow2Rep opportunities (steps 3 and 4d): we check not only that the newly added edge does not close a cycle, but also that the matches present in the new tree satisfy our minimality condition (Section 2.2).
• Similarly, when selecting potential Merge candidates (step 4(c)iiiA).
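The map K can be sketched as a dictionary keyed by frozensets of matched keywords; Merge candidates for a tree are then the trees stored under disjoint keys (an illustrative encoding):

```python
from collections import defaultdict

# Sketch of the map K: non-answer trees are indexed by the frozenset of
# query keywords they match, so Merge candidates for a new tree are the
# trees whose keyword subset is disjoint from its own.

K = defaultdict(list)

def register(tree):
    _, _, kws = tree
    K[kws].append(tree)

def merge_candidates(tree):
    _, _, kws = tree
    return [t for subset, trees in K.items()
            if not (subset & kws) for t in trees]

t_a  = ("n1", frozenset(), frozenset({"a"}))
t_b  = ("n2", frozenset(), frozenset({"b"}))
t_ab = ("n3", frozenset(), frozenset({"a", "b"}))
for t in (t_a, t_b, t_ab):
    register(t)
assert merge_candidates(t_a) == [t_b]   # t_ab shares the keyword "a"
```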
We implemented our approach in the ConnectionLens prototype, available online at https://gitlab.inria.fr/cedar/connectionlens, which we used to experimentally evaluate the performance of our algorithms. This section presents the results that we obtained by using synthetic graphs, which are similar to the real-world datasets that we have obtained. First, we describe the hardware and software setup that we used to run our experiments, and then we give our findings for various combinations of numbers of keywords and graph sizes.
We conducted our experiments on a server equipped with 2x10-core Intel Xeon E5-2640 CPUs clocked at 2.40GHz, and 128GB of DRAM. The graph is constructed following the approach described in [4], and we used Postgres 9.6.5 to store and query the graph for nodes, edges and labels. The search algorithms are implemented in a Java application which communicates with the database over JDBC, and which also maintains an in-memory cache. Every time the search algorithm needs information about a node, it first looks into the cache; if the requested information is not there, it is retrieved from the database and then stored in the cache. To avoid any effects of the cache replacement algorithm, in our experiments we set the cache to be large enough to include all the information retrieved from the database.
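The read-through cache just described can be sketched as follows (the fetch function below is a stand-in for the actual JDBC/Postgres lookup):

```python
# Minimal read-through cache in the spirit of the setup above.

class NodeCache:
    def __init__(self, fetch_from_db):
        self._fetch = fetch_from_db   # called only on a cache miss
        self._cache = {}
        self.misses = 0

    def get(self, node_id):
        if node_id not in self._cache:          # miss: go to the database
            self.misses += 1
            self._cache[node_id] = self._fetch(node_id)
        return self._cache[node_id]             # hit: served from memory

db_calls = []
def fake_db(node_id):
    db_calls.append(node_id)
    return {"id": node_id, "label": f"node-{node_id}"}

cache = NodeCache(fake_db)
cache.get(7)
cache.get(7)                    # second lookup never touches the "DB"
assert db_calls == [7] and cache.misses == 1
```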
Synthetic datasets
For controlled experiments, we generated different types of (RDF) graphs. The first type is a line graph, which is the simplest model that we can use. In the line graph, every node is connected with two others, through one edge each, except for the two end nodes, which are connected with only one. By using the line graph,
Procedure process(tree t)
• if t is not already in E then
  – add t to E
  – if t has matches for all the query keywords
    then post-process t if needed; output the result as an answer
    else insert t into K

Algorithm GAMSearch(query Q = {w1, w2, . . . , wk})
(1) For each wi, 1 ≤ i ≤ k:
  • For each node nji matching wi, let tji be the 1-node tree consisting of nji; process(tji).
(2) Initial merge: try to merge every pair of trees from E, and process any resulting answer tree.
(3) Initialize U (empty so far):
  (a) Create Grow opportunities: insert into U the pair (t, e), for each t ∈ E and e a data or similarity edge adjacent to t's root.
  (b) Create Grow2Rep opportunities: insert into U the pair (t, n → nrep) for each t ∈ E whose root is n, such that the representative of n is nrep ≠ n.
(4) While U is not empty:
  (a) Pop out of U the highest-priority pair (t, e).
  (b) Apply the corresponding Grow or Grow2Rep, resulting in a new tree t″; process(t″).
  (c) If t″ was not already in E, aggressively Merge:
    (i) Let NT be the set of new trees obtained from the Merge (initially ∅).
    (ii) Let p be the keyword set of t″.
    (iii) For each keyword subset p1 that is a key within K, and such that p1 ∩ p = ∅:
      (A) For each tree ti that corresponds to p1, try to merge t″ with ti. Process any possible result; if it is new (not in E previously), add it to NT.
  (d) Re-plenish U (add more entries in it). This is performed as in step 3, but based on NT (not on E).

Figure 6: Outline of the GAM algorithm

we clearly show the performance of Grow and Merge operations with respect to the size of the graph. The second type is a chain graph, which is the same as the line graph, but instead of one edge connecting every pair of nodes, we have two. We use this type to show the performance of the algorithm as we double the number of edges of the line graph and give more options to the Grow and the Merge operations. The third type is the star graph, where we have several line graphs connected through a strongly connected cluster of nodes with a representative.
We use this type to show the performance of Grow2Rep, by placing the query keywords on different line graphs. The fourth type is a random graph based on the
Barabasi-Albert (BA, in short) model [1], which generates scale-free networks where only a few nodes of the graph (referred to as hubs) have a much higher degree than the rest. The graph in this model is created in a two-stage process. During the first stage, a network of some nodes is created. Then, during the second stage, new nodes are inserted in the graph and connected to nodes created during the first stage. At the second stage, we can control how many connections every node will have to the ones created at the first stage. By setting every node created at the second stage to be connected with exactly one node created at the first stage, we have observed that we can construct graphs which are similar to the real-world ones, and we therefore tune our model accordingly.
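The four synthetic topologies can be sketched as simple edge-list generators (pure-Python illustrations under our own encoding; the actual generated graphs are RDF):

```python
import random

def line_graph(n):
    """n nodes in a line: one edge between consecutive nodes."""
    return [(i, i + 1) for i in range(n - 1)]

def chain_graph(n):
    """Like the line graph, but two parallel edges per consecutive pair."""
    return [(i, i + 1, j) for i in range(n - 1) for j in (0, 1)]

def star_graph(branches, branch_len):
    """Several line graphs attached to a central hub node."""
    edges = []
    for b in range(branches):
        branch = [f"b{b}_{i}" for i in range(branch_len)]
        edges.append(("hub", branch[0]))
        edges += list(zip(branch, branch[1:]))
    return edges

def ba_graph(n_init, n_new, seed=0):
    """Two-stage Barabasi-Albert-style sketch: a small seed network, then
    each new node attaches to exactly one earlier node, chosen
    preferentially by degree (the tuning described above)."""
    rng = random.Random(seed)
    edges = [(i, i + 1) for i in range(n_init - 1)]   # stage 1: seed line
    degree = {i: 2 for i in range(n_init)}
    degree[0] = degree[n_init - 1] = 1
    for v in range(n_init, n_init + n_new):           # stage 2: arrivals
        target = rng.choices(list(degree), weights=list(degree.values()))[0]
        edges.append((target, v))
        degree[target] += 1
        degree[v] = 1
    return edges

assert len(line_graph(1000)) == 999
assert len(ba_graph(5, 20)) == 24
```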
Real-world dataset
The real-world dataset that we used is based on data that we have obtained from journalists with whom we collaborate. Our dataset combines information on French politics, which we obtained by crawling the web pages of a French newspaper, as we explain in the corresponding section.

Figure 7: Line graph execution time

For the microbenchmarks, we report the time needed for our system to return the first answer, as well as the time for all answers. For the macrobenchmarks, we only report the time to return all the answers, as we do not have full control over the graphs and, hence, it is hard to draw meaningful conclusions and explain them relying on the whole graphs. Finally, we set an upper bound on the overall execution time at 120 seconds, which is applied to all the experiments that we performed.
Figure 7 shows the execution time of our algorithm when executing a query with two keywords on a line graph, as we vary the number of nodes of the graph. We place the keywords on the two “ends” of the graph to show the impact of the distance on the execution time. The performance of our algorithm is naturally affected by the size of the graph, as it generates 2 ∗ N answer trees, where N is the number of nodes. Given that this is a line graph, there is only one answer, which is the whole graph, and, therefore, the time to find the first answer is also the overall execution time.

Figure 8 shows the execution of our algorithm on a chain graph. Specifically, Figure 8a shows the time elapsed until the first answer is found, whereas Figure 8b shows the overall execution time. The execution times reported in Figure 8a are almost the same, as the size of the graph increases slowly. On the other hand, the overall execution times increase at a much higher (exponential) rate, as shown in Figure 8b, where the y axis has a logarithmic scale. The reason is that every pair of nodes is connected with two edges, which increases the number of answers exponentially with the number of nodes in the graph.

Similar to the chain graph, in Figure 9 we report the execution time until our algorithm finds the first and all answers (left and right hand side, respectively). Given that we use keywords which are placed on two different lines connected through the center of the graph, the algorithm has to use Grow2Rep, whereas in the previous cases it only had to use Grow and Merge. The number of branches, depicted on the x axis of Figure 9, corresponds to the number of line graphs connected in the star. Each line graph has 10 nodes and we place the query keywords at the extremities of two different line graphs. Given that our algorithm will have to check all possible answers, it follows that the number of merges is exponential in the number of branches, that is, O(2^K), where K is the number of branches.
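For intuition on the chain graph's exponential growth: each of the N − 1 consecutive node pairs offers an independent choice between its two parallel edges, so, under this reading (our own back-of-the-envelope estimate, not a figure from the experiments), a query with keywords at the two ends has 2^(N−1) distinct answers:

```python
# Each of the N-1 gaps in the chain contributes a binary choice of edge,
# so the number of distinct answer trees doubles with every node added.
def chain_answer_count(n_nodes):
    return 2 ** (n_nodes - 1)

assert chain_answer_count(2) == 2       # one gap, two parallel edges
assert chain_answer_count(11) == 1024
```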
This behaviour is clearly shown in both parts of Figure 9, where on the y axis (in logarithmic scale) we show the times to find the first and, respectively, all answers. Above 12 branches, the timeout of 120 seconds that we have set is hit and, thus, the search is terminated, as shown in Figure 9b.

Figure 10 depicts the performance of our algorithm on the Barabasi-Albert graph model. In this experiment, we keep the graph fixed at 2000 nodes and we vary the position of two keywords, by choosing nodes at the distance given on the x axis; note the logarithmic y axis. Due to the fact that the graph is randomly generated within the BA model, we note some irregularity in the time to the first solution, which however grows at a moderate pace as the distance between the keyword nodes grows. The overall relation between the time to the first solution and the total time confirms that the search space is very large, but that most of the exploration is not needed, since the first solution is found quite fast.

This Section includes the results that we obtained by running our algorithm on real-world data. Our dataset is a corpus of 462 HTML articles (about 6MB) crawled from the French online newspaper Mediapart with the search keywords “gilets jaunes” (yellow vests, a protest movement in France over the last year). We built a graph using these articles which consists of 90626 edges and 65868 nodes, out of which 1525 correspond to people, 1240 to locations and 1050 to organizations. We query the graph using queries of one, two and three different keywords.

We report our findings in Table 1. The results are not given as a basis for comparison, but rather as a proof of concept. Nevertheless, there are several interesting observations to be made. First, the number of answers for every query is generally larger than 1. We allow several results, as the end users (in our case, the investigative journalists) need to see different connections to reach potentially interesting conclusions.
Second, there are queries where several answers are found, and the execution is interrupted due to the threshold. We allow the user to set the threshold, based on the results returned every time. Third, the answers returned to the user are significantly fewer than the answer trees discovered, showing the impact of minimality as a requirement for returning an answer.
Keyword search (KS, in short) is the method of choice for searching in unstructured (typically text) data, and it is also the best search method for novice users, as witnessed by the enormous success of keyword-based search engines. As databases grew larger and more complex, KS has been proposed as a method for searching also in structured data [31], when users are not perfectly familiar with the data, or to get answers enabled by different tuple connection networks. For relational data, in [19] and subsequent works, tuples are represented as nodes, and two tuples are interconnected only through primary key-foreign key pairs. The resulting graphs are thus quite uniform, e.g., they consist of “Company nodes”, “Employee nodes” etc. The same model was considered in [8, 27, 29, 30, 32]; [27] also establishes links based on similarity (or equality) of constants appearing in different relational attributes. As explained in Section 2.3, our problem is (much) harder since our trees can traverse edges in both directions, and paths can be (much) longer than those based on PK-FK pairs alone. [30] proposes to incorporate user feedback through active learning to improve the quality of answers in a relational data integration setting. We are working to devise such a learning-to-rank approach for our graphs, also. KS has also been studied in
XML documents [17, 25]. Here, an answer is defined as a subtree of the original document, whose leaves match the query keywords. This problem is much easier than the one we face, since: (i) an XML document is a tree, guaranteeing just one connection between any two nodes; in contrast, there can be any number of such connections in our graphs; (ii) the maximum size of an answer to a k-keyword query is k · h where h, the height of an XML tree, is almost always quite small, e.g., 20 is considered “quite high”; in contrast, with our bi-directional search, the bound is k · D where D is the diameter of our graph, which can be enormously larger.

Our Grow and Merge steps are borrowed from [10, 18], which address KS for graphs, assuming an optimal-substructure property which does not hold for us, and single-direction edge traversal. For RDF graphs, [13, 22] traverse edges in their direction only; moreover, [22] also makes strong assumptions on the graph, e.g., that all non-leaf nodes have types, and that there is a small number of types (regular graph).

Figure 8: Chain graph execution time: (a) time to find the first answer; (b) time to find all answers.
Figure 9: Star graph execution time: (a) time to find the first answer; (b) time to find all answers.
Query keyword(s)               | Answers | Answer trees | Time to 1st (ms) | Total time (ms)
Macron                         |     118 |            0 |              179 |             390
Trump                          |      10 |            0 |               26 |              36
Melenchon                      |       8 |            0 |               31 |              39
Christophe, Dettinger          |    1105 |       319611 |              136 |          123932
Etienne, Chouard, Rodrigues    |       1 |          194 |              144 |             146
Thierry-Paul, Valette, Drouet  |       0 |       300813 |              N/A |          120001
Melenchon, Aubry               |       9 |          284 |               38 |             929
Castaner, flashball            |      17 |         1724 |               61 |             545
Drouet, Levavasseur            |      18 |          518 |              145 |             309
Dupont-Aignan, Chalencon       |      21 |         1850 |               53 |             393
Estrosi, Castaner              |      16 |         2203 |              205 |             529
Alexis, Corbiere, Ruffin       |      11 |         3782 |               57 |            1022
Macron, Nunez                  |      13 |         4107 |             1511 |            1561
Hamon, Drouet                  |       5 |          421 |               71 |             145
Drouet, Ludosky                |      27 |          486 |               43 |             145
Salvini, Ludosky               |      17 |         1156 |              111 |             375
Salvini, Chouard               |      16 |         3205 |               76 |             710
Corbiere, Drouet               |      13 |         2341 |              129 |             673
Cauchy, Drouet                 |      22 |          516 |               96 |             260
Benalla, Nunez                 |      15 |         1027 |              199 |             347

Table 1: Results with real-world dataset
In [6], the authors investigate a different kind of answer to keyword search, the so-called r-clique graphs, which they compute with the help of specific indexes.

Keyword search across heterogeneous datasets has been previously studied in [11, 23]. However, in these works, each answer comes from a single dataset, that is, they never consider answers spanning over and combining multiple datasets, such as the one shown in Figure 2.

Figure 10: Barabasi-Albert graph execution time (first solution vs. all solutions).

In the literature, (G)STP has been addressed under various simplifications that do not hold in our context. For instance: the quality of a solution exponentially decreases with the tree size, thus search can stop when all trees are under a certain threshold [2]; edges are considered in a single direction [13, 22, 32]; the cost function has the optimal-substructure property [10, 24], etc. These assumptions reduce the computational cost; in contrast, to leave our options open as to the best score function, we worked to build a feasible solution for the general problem we study. Some works have focused on finding bounded (G)STP approximations, i.e., (G)STP tree solutions whose cost is at most f times higher than the optimal cost, e.g., [15, 16]. Beyond the differences between our problem and (G)STP, due notably to the fact that our score is much more general (Section 3), non-expert users find it hard to set f.

Beyond the differences we mentioned above, most of which concern our bidirectional search and the lack of a favorable cost hypothesis, our work is the first to study the querying of graphs originating from integrating several data sources, while at the same time preserving the identity of each node from the original document; this is a requirement for integrating, and simultaneously preserving, datasets of journalistic interest. In a companion paper [4] we present our latest algorithms for creating such graphs, relying also on information extraction, data matching, and named entity disambiguation; earlier versions were outlined in [5, 7].

Acknowledgements
The authors would like to thank Helena Galhardas and Julien Leblay, who contributed to previous versions of this work [5, 7], and Tayeb Merabti for his support in the development and maintenance of the ConnectionLens system [4]. This work was partially supported by the H2020 research program under grant agreement nr. 800192, and by the ANR AI Chair SourcesSay.
REFERENCES
[1] Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks. Science 286, 5439 (1999). DOI: http://dx.doi.org/10.1126/science.286.5439.509
[2] Raphaël Bonaque, Bogdan Cautis, François Goasdoué, and Ioana Manolescu. 2016. Social, Structured and Semantic Search. In EDBT.
[3] Francesca Bugiotti, Damian Bursztyn, Alin Deutsch, Ioana Ileana, and Ioana Manolescu. 2015. Invisible Glue: Scalable Self-Tuning Multi-Stores. In CIDR.
[4] Oana Bălălău, Catarina Conceição, Helena Galhardas, Ioana Manolescu, Tayeb Merabti, Jingmao You, and Youssr Youssef. 2020. Graph integration of structured, semistructured and unstructured data for data journalism. BDA (2020). https://hal.inria.fr/hal-02904797
[5] Camille Chanial, Rédouane Dziri, Helena Galhardas, Julien Leblay, Minh Huong Le Nguyen, and Ioana Manolescu. 2018. ConnectionLens: Finding Connections Across Heterogeneous Data Sources (demonstration). VLDB (2018).
[6] Yu-Rong Cheng, Ye Yuan, Jia-Yu Li, Lei Chen, and Guo-Ren Wang. 2016. Keyword Query over Error-Tolerant Knowledge Bases. Journal of Computer Science and Technology 31, 4 (2016).
[7] Felipe Cordeiro, Helena Galhardas, Julien Leblay, Ioana Manolescu, and Tayeb Merabti. 2020. Keyword Search in Heterogeneous Data Sources. Technical report. https://hal.inria.fr/hal-02559688
[8] Pericles de Oliveira, Altigran Soares da Silva, and Edleno Silva de Moura. 2015. Ranking Candidate Networks of relations to improve keyword search over relational databases. In ICDE.
[9] David J. DeWitt, Alan Halverson, Rimma Nehme, Srinath Shankar, Josep Aguilar-Saborit, Artin Avanes, Miro Flasza, and Jim Gramling. 2013. Split Query Processing in Polybase. In SIGMOD. DOI: http://dx.doi.org/10.1145/2463676.2463709
[10] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. 2007. Finding top-k min-cost connected trees in databases. In ICDE.
[11] Xin Dong and Alon Halevy. 2007. Indexing Dataspaces. In SIGMOD.
[12] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Rec. 44, 2 (2015). DOI: http://dx.doi.org/10.1145/2814710.2814713
[13] Shady Elbassuoni and Roi Blanco. 2011. Keyword Search over RDF Graphs. In CIKM.
[14] Michael R. Garey and David S. Johnson. 1990. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York.
[15] N. Garg, G. Konjevod, and R. Ravi. 1998. A polylogarithmic approximation algorithm for the group Steiner tree problem. In SODA.
[16] Andrey Gubichev and Thomas Neumann. 2012. Fast approximation of Steiner trees in large graphs. In CIKM.
[17] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. 2003. XRANK: Ranked keyword search over XML documents. In SIGMOD. 16–27.
[18] Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. 2007. BLINKS: ranked keyword searches on graphs. In SIGMOD.
[19] Vagelis Hristidis and Yannis Papakonstantinou. 2002. DISCOVER: Keyword Search in Relational Databases. In VLDB.
[20] Manos Karpathiotakis, Ioannis Alagiannis, and Anastasia Ailamaki. 2016. Fast Queries over Heterogeneous Data through Engine Customization. PVLDB 9, 12 (2016). DOI: http://dx.doi.org/10.14778/2994509.2994516
[21] Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, and Anastasia Ailamaki. 2015. Just-In-Time Data Virtualization: Lightweight Data Management with ViDa. In CIDR.
[22] Wangchao Le, Feifei Li, Anastasios Kementsietsidis, and Songyun Duan. 2014. Scalable Keyword Search on Large RDF Data. IEEE Trans. Knowl. Data Eng.
[23] In SIGMOD.
[24] Rong-Hua Li, Lu Qin, Jeffrey Xu Yu, and Rui Mao. 2016. Efficient and Progressive Group Steiner Tree Search. In SIGMOD.
[25] Ziyang Liu and Yi Chen. 2007. Identifying meaningful return information for XML keyword search. In SIGMOD. 329–340.
[26] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: A Not-so-Foreign Language for Data Processing. In SIGMOD. DOI: http://dx.doi.org/10.1145/1376616.1376726
[27] Mayssam Sayyadian, Hieu LeKhac, AnHai Doan, and Luis Gravano. 2007. Efficient Keyword Search Across Heterogeneous Relational Databases. In ICDE.
[28] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: A Warehousing Solution over a Map-Reduce Framework. PVLDB 2, 2 (2009).
[29] Quang Hieu Vu, Beng Chin Ooi, Dimitris Papadias, and Anthony K. H. Tung. 2008. A graph method for keyword-based selection of the top-K databases. In SIGMOD.
[30] Zhepeng Yan, Nan Zheng, Zachary G. Ives, Partha Pratim Talukdar, and Cong Yu. 2015. Active learning in keyword search-based data integration. VLDB J.
[31] Jeffrey Xu Yu, Lu Qin, and Lijun Chang. 2009. Keyword Search in Databases. Morgan & Claypool. DOI: http://dx.doi.org/10.2200/S00231ED1V01Y200912DTM001
[32] Jeffrey Xu Yu, Lu Qin, and Lijun Chang. 2010. Keyword Search in Relational Databases: A Survey. IEEE Data Eng. Bull. 33, 1 (2010).
[33] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 59, 11 (2016).