Empowering Investigative Journalism with Graph-based Heterogeneous Data Management
Angelos-Christos Anadiotis, Oana Balalau, Theo Bouganim, Francesco Chimienti, Helena Galhardas, Mhd Yamen Haddad, Stephane Horel, Ioana Manolescu, Youssr Youssef
Angelos-Christos Anadiotis
École Polytechnique, IPP & [email protected]
Oana Balalau
Inria & [email protected]
Théo Bouganim
Inria & [email protected]
Francesco Chimienti
Inria & [email protected]
Helena Galhardas
INESC-ID & IST, Univ. [email protected]
Mhd Yamen Haddad
Inria & [email protected]
Stéphane Horel
Ioana Manolescu
Inria & [email protected]
Youssr Youssef
Inria & [email protected]
ABSTRACT
Investigative Journalism (IJ, in short) is a staple of modern, democratic societies. IJ often necessitates working with large, dynamic sets of heterogeneous, schema-less data sources, which can be structured, semi-structured, or textual, limiting the applicability of classical data integration approaches. In prior work, we have developed ConnectionLens, a system capable of integrating such sources into a single heterogeneous graph, leveraging Information Extraction (IE) techniques; users can then query the graph by means of keywords, and explore query results and their neighborhood using an interactive GUI. Our keyword search problem is complicated by the graph heterogeneity, and by the lack of a result score function that would allow pruning some of the search space.

In this work, we describe an actual IJ application studying conflicts of interest in the biomedical domain, and we show how ConnectionLens supports it. Then, we present novel techniques addressing the scalability challenges raised by this application: one reduces the significant IE costs incurred while building the graph, while the other is a novel, parallel, in-memory keyword search engine, which achieves orders-of-magnitude speed-ups over our previous engine. Our experimental study on the real-world IJ application data confirms the benefits of our contributions.
1 INTRODUCTION
Journalism and the press are a critical ingredient of any modern society. Like many other industries, such as trade or entertainment, journalism has benefited from the explosion of Web technologies, which enabled instant sharing of content with the audience. However, unlike trade, where databases and data warehouses had taken over daily operations decades before the Web age, many newsrooms discovered the Web and social media long before building strong information systems where journalists could store their information and/or ingest data of interest to them.
As a matter of fact, journalists' desire to protect their confidential information may also have played a role in delaying the adoption of data management infrastructures in newsrooms. At the same time, highly appreciated journalism work often requires acquiring, curating, and exploiting large amounts of digital data. Among the authors, S. Horel co-authored the "Monsanto Papers" series, which obtained the European Press Prize Investigative Reporting Award in 2018 [3]; a similar project is the "Panama Papers" (later known as "Offshore Leaks") series of the International Consortium of Investigative Journalists [2]. In such works, journalists are forced to work with heterogeneous data, potentially in different data models (structured, such as relations; semistructured, such as JSON or XML documents, or graphs, including but not limited to RDF; as well as unstructured text). We, the authors, are currently collaborating on such an Investigative Journalism (IJ, in short) application, focused on the study of situations potentially leading to conflicts of interest (CoIs, in short) between biomedical experts and various organizations: corporations, industry associations, lobbying organizations or front groups. Information of interest in this setting comes from: scientific publications (in PDF) where authors declare, e.g., "Dr. X. Y. has received consulting fees from ABC"; semi-structured metadata (typically XML, used for instance in PubMed), where authors may also specify such connections; a medical association, say, French cardiology, may build its own disclosure database, which may be relational, while a company may disclose its ties to specialists in a spreadsheet.

This paper builds upon our recent work [6], where we have identified a set of requirements (R) and constraints (C) that need to be addressed to efficiently support IJ applications. We recall them here for clarity and completeness: R1.
Integral source preservation and provenance: in journalistic work, it is crucial to be able to trace each information item back to the data source from which it came. This enables adequately sourcing information, an important tenet of quality journalism.
R2. Little to no effort required from users: journalists often lack the time and resources to set up IT tools or data processing pipelines. Even when they are able to use a tool supporting one or two data models (e.g., most relational databases provide some support for JSON data), handling other data models remains challenging. Thus, the data analysis pipeline needs to be as automatic as possible.
C1. Little-known entities: interesting journalistic datasets feature some extremely well-known entities (e.g., world leaders in the pharmaceutical industry) next to others of much smaller notoriety (e.g., an expert consulted by EU institutions, or a little-known trade association). From a journalistic perspective, such lesser-known entities may play a crucial role in making interesting connections among data sources, e.g., the association may be created by the industry leader, and it may pay the expert honoraria.

(According to the 2011 French transparency law, "A conflict of interest is any situation where a public interest may interfere with a public or private interest, in such a way that the public interest may be, or appear to be, unduly influenced.")

C2. Controlled dataset ingestion: the level of confidence in the data required for journalistic use excludes massive ingestion from uncontrolled data sources, e.g., through large-scale Web crawls.
R3. Performance on "off-the-shelf" hardware: the efficiency of our data processing pipeline is important; also, the tool should run on general-purpose hardware, available to users like the ones we consider, without expertise or access to special hardware. Further, IJ applications' data analysis needs entail:
R4. Finding connections across heterogeneous datasets is a core need. In particular, it is important for our approach to be tolerant of inevitable differences in the organization of data across sources. Heterogeneous data integration works, such as [10, 11, 16], and recent heterogeneous polystores, e.g., [5, 17, 36], assume that sources have well-understood schemas; other recent works, e.g., [13, 33, 35], focus on analyzing large sets of Open Data sources, all of which are tabular. IJ data sources do not fit these hypotheses: data can be semi-structured, structured, or simply text. Therefore, we opt for integrating all data sources in a heterogeneous graph (with no integrated schema), and for keyword-based querying, where users specify some terms, and the system returns subtrees of the graph that connect nodes matching these terms.
C4. Lack of a single, well-behaved answer score: after discussing several journalistic scenarios, no unique method (score) for deciding which are the best answers to a query has been identified. Instead: (i) it appears that "very large" answers (say, of more than 20 edges) are of limited interest; (ii) connections that "state the obvious", e.g., that a French scientist is connected to a French company through their nationality, are not of interest. Therefore, unlike prior keyword search algorithms, which fix a score function and exploit it to prune the search, our algorithm must be orthogonal to the score function, and work with any of them.

Building upon our previous work, and years-long discussions of IJ scenarios, this paper makes the following contributions:
• We describe the CoI IJ application proposed by S. Horel (Section 2), we extract its technical requirements, and we devise an end-to-end data analysis pipeline addressing these requirements (Section 3).
• We provide application-driven optimizations, inspired by the CoI scenario but reusable in other contexts, which speed up the graph construction process (Section 4).
• We introduce a parallel, in-memory version of the keyword search algorithm previously introduced in [6, 7], and we explain our design in both the physical database layout and the parallel query execution (Section 5).
• We evaluate the performance of our system using both synthetic and real-world PubMed data, we demonstrate its scalability, and we show that we have improved performance compared to our prior work by several orders of magnitude, thereby enabling journalists to perform interactive exploration of their data (Section 6).
Figure 1: Graph data integration in ConnectionLens.
The topic.
Biomedical experts such as health scientists and researchers in life sciences play an important role in society, advising governments and the public on health issues. They also routinely interact with industry (pharmaceutical, agrifood etc.), consulting, collaborating on research, or otherwise sharing work and interests. To trust advice coming from these experts, it is important to ensure the advice is not unduly influenced by vested interests. Yet, IJ work, e.g. [26, 27, 34], has shown that disclosure information is often scattered across multiple data sources, hindering access to this information. We now illustrate the data processing required to gather and collectively exploit such information.
Sample data.
Figure 1 shows a tiny fragment of data that can be used to find connections between scientists and companies.
For now, consider only the nodes shown as a black dot or as a text label, and the solid, black edges connecting them; these model directly the data. The others are added by ConnectionLens, as we discuss in Section 3.1. (i) Hundreds of millions of bibliographic notices (in XML) are published on the PubMed web site; the site also links to research (in PDF). In recent years, PubMed has included an optional CoIStatement element where authors can declare (in free text) their possible links with industrial players; less than 20% of recent papers have this element, and some of those present are empty ("The authors declare no conflict of interest"). (ii) Within the PDF papers themselves, paragraphs titled, e.g., "Acknowledgments", "Disclosure statement" etc. may contain such information, even if the CoIStatement is absent or empty. This information is accessible if one converts the PDF into a format such as JSON. In Figure 1, Alice declares her consulting for ABCPharma in XML, yet the "Acknowledgments" paragraph in her PDF paper mentions HealthStar. (iii) A (subset of a) knowledge base (in RDF) such as WikiData describes well-known entities, e.g., ABCPharma; however, less-known entities of interest in an IJ scenario are often missing from such KGs, e.g., HealthStar in our example. (iv) Specialized data sources, such as a trade catalog or a Wiki Web site built by other investigative journalists, may provide information on some such actors: in our example, the PharmaLeaks Web site shows that HealthStar is also funded by the industry. Such a site, established by a trusted source (or colleague), even if it has little or no structure, is a gold mine to be reused, since it saves days or weeks of tedious IJ work.

(This example is inspired by prior work of S. Horel, where she identified, manually inspecting thousands of documents, an expert supposedly with no industrial ties, yet who authored papers for which companies had supplied and prepared data.)
In this and many IJ scenarios, sources are highly heterogeneous, while time, skills, and resources to curate, clean, or structure the data are not available.

Sample query. Our application requires the connections of specialists in lung diseases, working in France, with pharmaceutical companies. In Figure 1, the edges with green highlight and those with yellow highlight, together, form an answer connecting Alice to ABCPharma (spanning over the XML and RDF sources); similarly, the edges highlighted in green together with those in blue, spanning over XML, JSON and HTML, connect her to HealthStar.
The potential impact of a CoI database. A database of known relationships between experts and interested companies, built by integrating heterogeneous data sources, would be a very valuable asset. In Europe, such a database could be used, e.g., to select, for a committee advising EU officials on industrial pollutants, experts with few or no such relationships. In the US, the Sunshine Act [1], like the French 2011 law, requires manufacturers of drugs and medical devices to declare such information, but this does not extend to companies from other sectors.
Figure 2: Investigative Journalism data analysis pipeline.
The pipeline we have built for IJ is outlined in Figure 2. First, we recall ConnectionLens graph construction (Section 3.1), which integrates heterogeneous data into a graph, stored and indexed in PostgreSQL. On this graph, the GAM keyword search algorithm (recalled in Section 3.2) answers queries such as our motivating example; these are both detailed in [6]. The modules on yellow background in Figure 2 are the novelties of this work, and will be introduced below: scenario-driven performance optimizations to the graph construction (Section 4), and an in-memory, parallel keyword search algorithm, called P-GAM (Section 5).
ConnectionLens integrates JSON, XML, RDF, HTML, relational or text data into a graph, as illustrated in Figure 1. Each source is mapped to the graph as closely to its data model as possible; e.g., XML edges have no labels while internal nodes all have names, while in JSON the conventions are different, etc. Next, ConnectionLens extracts named entities from all text nodes, regardless of the data source they come from, using trained language models. In the figure, blue, green, and orange nodes denote Organization, Location, and Person entities, respectively. Each such entity node is connected to the text node it has been extracted from, by an extraction edge recording also the confidence of the extraction (dashed in the figure).
Entity nodes are shared across the graph, e.g., Person:Alice has been found in three data sources, Org:BestPharma in two sources, etc. ConnectionLens includes a disambiguation module which avoids mistakenly unifying entities with the same labels but different meanings. Finally, nodes with similar labels are compared, and if their similarity is above a threshold, a sameAs (red) edge is introduced connecting them, labeled with the similarity value. A sameAs edge with similarity 1.0 is called an equivalence edge. Then, 𝑝 equivalent nodes, e.g., the ABCPharma entity and the identical-label RDF literal, would lead to 𝑝(𝑝−1)/2 equivalence edges; instead, one of the 𝑝 nodes is declared the representative of all 𝑝 nodes, and we only store the 𝑝−1 equivalence edges connecting the others to it. The result of graph construction is 𝐺 = (𝑁, 𝐸), where nodes can be of different types (URIs, XML elements, JSON nodes etc., but also extracted entities) and edges encode: data source structure, entities extracted from text, and node label similarity.

We view our motivating query, on highly heterogeneous content with no a-priori known structure, as a keyword search query over a graph. Formally, a query 𝑄 = {𝑤1, 𝑤2, . . . , 𝑤𝑚} is a set of 𝑚 keywords, and an answer tree (AT, in short) is a set 𝑡 of 𝐺 edges which (𝑖) together, form a tree, and (𝑖𝑖) for each 𝑤𝑖, contain at least one node whose label matches 𝑤𝑖. We are interested in minimal answer trees, that is, answer trees which satisfy the following properties: (𝑖) removing an edge from the tree will make it lack at least one keyword match, and (𝑖𝑖) if more than one node matches a query keyword, then all matching nodes are related through sameAs links with similarity 1.0. In the literature (see Section 7), a score function is used to compute the quality of an answer, and only the best 𝑘 ATs are returned, for a small integer 𝑘.
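To make the definitions concrete, the two answer-tree conditions and minimality condition (𝑖) can be sketched as follows. This is a hypothetical Python illustration, not the system's implementation: node IDs, label strings, and substring keyword matching are illustrative assumptions.

```python
def _reach(start, edges):
    """Collect the nodes reachable from start via undirected edges."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack += [b for a, b in edges if a == n]
        stack += [a for a, b in edges if b == n]
    return seen

def covers(nodes, labels, query):
    """True if every query keyword matches the label of some node."""
    return all(any(w in labels[n] for n in nodes) for w in query)

def is_answer_tree(edges, labels, query):
    """Condition (i): the edges form a tree; (ii): all keywords matched."""
    nodes = {n for e in edges for n in e}
    is_tree = (len(edges) == len(nodes) - 1
               and _reach(next(iter(nodes)), edges) == nodes)
    return is_tree and covers(nodes, labels, query)

def is_minimal(edges, labels, query):
    """Removing any edge must lose at least one keyword match: neither
    of the two components created by the removal may still cover Q."""
    for e in edges:
        rest = edges - {e}
        if any(covers(_reach(end, rest), labels, query) for end in e):
            return False
    return True
```

For instance, on a three-node path Alice - paper - ABCPharma, the two-edge tree is a minimal answer for {Alice, ABCPharma}, while adding a dangling third edge keeps it an answer tree but breaks minimality.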
Our problem is harder since: (𝑖) our ATs may span over different data sources, even of different data models; (𝑖𝑖) they may traverse an edge in its original or in the opposite direction, e.g., to go from JSON to XML through Alice; this brings the search space size in 𝑂(2^|𝐸|), where |𝐸| is the number of edges; and (𝑖𝑖𝑖) no single score function serves all IJ needs since, depending on the scenario, journalists may favor different (incompatible) properties of an AT, such as "being characteristic of the dataset" or, on the contrary, "being surprising". Thus, we cannot rely on special properties of the score function to help us prune unpromising parts of the search space, as done in prior work (see Section 7). Intuitively, tree size could be used to limit the search: very large answer trees (say, of more than 100 edges) generally do not represent meaningful connections. However, in heterogeneous, complex graphs, users find it hard to set a size limit for the exploration. Nor is a smaller solution always better than a larger one. For instance, an expert and a company may both have "nationality" edges leading to "French" (a solution of 2 edges), but that may be less interesting than finding that the expert has written an article specifying in its CoIStatement funding from the company (which could span over 5 edges or more).

Our Grow-and-Aggressive-Merge (GAM) algorithm [6, 7] enumerates trees exhaustively, until a number of answers are found, or a time-out. First, it builds 1-node trees from the nodes of 𝐺 which match one or more keywords, e.g., 𝑡1, 𝑡2, 𝑡3 in Figure 3, showing some partial trees built when answering our sample query.

Figure 3: Trees built by GAM for our sample query.

The keyword match in each node label appears in bold. Then, GAM relies on two steps.
Grow adds to the root of a tree one of its adjacent edges in the graph, leading to a new tree: thus 𝑡 is obtained by Grow on 𝑡, 𝑡 by Grow on 𝑡, and successive Grow steps lead from 𝑡 to 𝑡. Similarly, from 𝑡, successive Grows go from the HTML to the JSON data source (the HealthStar entity occurs in both), and then to the XML one, building 𝑡. Second, as soon as a tree is built by Grow, it is Merged with all the trees already found, rooted in the same node, matching different keywords and having disjoint edges wrt the given tree. For instance, assuming 𝑡 is built after 𝑡, they are immediately merged into the tree 𝑡, having the union of their edges. Each Merge result is then merged again with all qualifying trees (thus the "aggressive" in the algorithm name). For instance, when Grow on 𝑡 builds a tree rooted in the PubMedArticle node (not shown; call it 𝑡𝐴), Merge(𝑡, 𝑡𝐴) is immediately built, and is exactly the answer highlighted with green and blue in Figure 1. Together, Grow and Merge are guaranteed to generate all solutions. If 𝑚 ≤
2, Grow alone is sufficient, while 𝑚 ≥ 3 also requires Merge. GAM may build a tree in several ways, e.g., the answer above could also be obtained as Merge(Merge(𝑡, Grow(𝑡)), 𝑡); GAM keeps a history of the trees already explored, to avoid repeating work on them. Importantly, GAM can be used with any score function. Its details are described in [6, 7].

In this section, we present an optimization we brought to the graph construction process, guided by our target application. In the experiments we ran, Named Entity Recognition (NER) took up to 90% of the time ConnectionLens needs to integrate data sources into a graph. The more textual the sources are, the more time is spent on NER. Our application data lead us to observe that:
• Some text nodes, e.g., those found on the path PubMedArticle.Authors.Author.Name, always correspond to entities of a certain type, in our example, Person. If this information is given to ConnectionLens, it can create a Person entity node, like the Alice node in Figure 1, without calling the expensive NER procedure.
• Other text nodes may be deemed uninteresting for the extraction: journalists think no interesting entities appear there. If ConnectionLens is aware of this, it can skip the NER call on such text nodes. Observe that the input data, including all its text nodes, is always preserved; we only avoid extraction effort deemed useless (but which can still be applied later if application requirements evolve).

To exploit this insight, we introduced a notion of context, and allow users to specify (optional) extraction policies. A context is an expression designating a set of text nodes in one or several data sources. For instance, a context specified by the rooted path PubMedArticle.Authors.Author.Name designates all the text values of nodes found on that path in an XML data source; the same mechanism applies to an HTML or JSON data source.
In a relational data source containing a table 𝑅 with attribute 𝑎, a context of the form 𝑅.𝑎 designates all text nodes in the ConnectionLens graph obtained from a value of the attribute 𝑎 in relation 𝑅. Finally, an RDF property 𝑝 used as a context designates all the values 𝑜 such that a triple (𝑠, 𝑝, 𝑜) is ingested in a ConnectionLens graph. Based on contexts, an extraction policy takes one of the following forms: (𝑖) 𝑐 force 𝑇𝑒, where 𝑐 is a context and 𝑇𝑒 is an entity type, e.g., Person, states that each node designated by the context is exactly one instance of 𝑇𝑒; (𝑖𝑖) 𝑐 skip, to indicate that NER should not be performed on the text nodes designated by 𝑐; (𝑖𝑖𝑖) as syntactic sugar, for hierarchical data models (e.g., XML, JSON etc.), 𝑐 skipAll allows stating that NER should not be performed on the text nodes designated by 𝑐, nor on any descendant of their parent. This allows larger-granularity control of NER on different portions of the data. Observe that our contexts (thus, our policies) are specified within a data model; this is because the regularity that allows defining them can only be hoped for within data sources with identical structure. Policies allow journalists to state what is obvious to them, and/or what is not interesting, in the interest of graph construction speed. Force policies may also improve graph quality, by making sure NER does not miss any entity designated by the context.
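As an illustration, a per-text-node policy lookup could proceed roughly as follows. This is a hedged Python sketch: the policy encoding and the path strings are hypothetical, and skipAll handling is omitted for brevity.

```python
# Hypothetical policy table: rooted path (context) -> (action, entity type).
POLICIES = {
    "PubMedArticle.Authors.Author.Name": ("force", "Person"),
    "PubMedArticle.Journal.ISSN": ("skip", None),
}

def plan_extraction(path):
    """Decide how to handle one text node, given its rooted path:
    - ('entity', T): create one entity node of type T, with no NER call;
    - ('skip', None): bypass NER entirely;
    - ('ner', None): fall back to the (expensive) NER procedure."""
    kind, etype = POLICIES.get(path, ("ner", None))
    if kind == "force":
        return ("entity", etype)
    if kind == "skip":
        return ("skip", None)
    return ("ner", None)
```

With these sample policies, author names become Person entities directly, ISSNs are skipped, and every other text node still goes through NER.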
We now describe the novel keyword search module that is the main technical contribution of this work. An in-memory graph storage model, specifically designed for our graphs with keyword search in mind (Section 5.1), is leveraged by a multi-threaded, parallel algorithm, called P-GAM (Section 5.2), which is a parallel extension of our original GAM algorithm, outlined in Section 3.2.
The size of the main memory in modern servers has grown significantly over the past decade. For instance, AWS EC2 offers nodes providing up to 24TB of main memory and 448 hardware threads [39]. Data management research has by now led to several mature products (DB engines) running entirely in main memory, such as Oracle Database In-Memory, SAP HANA, and Microsoft SQL Server with Hekaton. Moving the data from the hard disk to the main memory significantly boosts performance, avoiding disk I/O costs. However, it introduces new challenges on the optimization of the data structures and the execution model for a different bottleneck: the memory access [9]. We have integrated P-GAM inside a novel in-memory graph database, which we have built and optimized for P-GAM operations. The physical layout of a graph database is important, given that graph processing is known to suffer from random memory accesses [4, 19, 24, 37]. Our design (𝑖) includes all the data needed by applications as described in Section 2, while also (𝑖𝑖) aiming at high-performance, parallel query execution in modern scale-up servers, in order to tackle huge search spaces (Section 3.2). We start with the scalability requirements. Like GAM, P-GAM also performs Grow and Merge operations (recall Figure 3).

Figure 4: Physical graph layout in memory.
To enumerate possible Grow steps, P-GAM needs to access all edges adjacent to the root of a tree, as well as the representative (Section 3.1) of the root, to enable growing with an equivalence edge. Further, as we will see, P-GAM (as well as GAM) relies on a simple edge metric, called specificity, derived from the number of edges with the same label adjacent to a given node, to decide the best neighbor to Grow to. For instance, if a node has 1 spouse and 10 friend edges, the edge going to the spouse is more specific than one going to a friend. A Merge does not need more information than available in its input trees; instead, it requires specific run-time data structures, as we describe below.

In our memory layout, we split the data required for search from the rest, as the former are critical for performance; we refer to the latter as metadata. Figure 4 depicts the memory tables that we use. The Node table includes the ID of the data source where the node comes from, and references to each node's: (𝑖) representative, (𝑖𝑖) 𝐾 neighbors, if they exist (for a fixed 𝐾 - static allocation), (𝑖𝑖𝑖) metadata, and (𝑖𝑣) other neighbors, if they exist (dynamic allocation). We separate the allocation of neighbors into static and dynamic, to keep 𝐾 neighbors in the main Node structure, while the rest are placed in a separate heap area, stored in the Node connections table. This way, we can allocate a fixed size to each Node, efficiently supporting the memory accesses of P-GAM. In our implementation, we set 𝐾 = 5; in general, it can be set based on the average degree of the graph vertices. The Node metadata table includes information about the type of each node (e.g., JSON, HTML, etc.) and its label, comprising the keywords that we use for searching the graph. The Edge table includes a reference to the source and the target node of every edge, the edge specificity, and a reference to the edge metadata. The Edge metadata table includes the type and the label of each edge. Finally, we use a keywordIndex, which is a hash-based map associating every node with its labels. P-GAM probes the keywordIndex when a query arrives to find the references to the Node table that match the query keywords and start the search from there. Among all the structures, only Node connections (singled out by a dark background in Figure 4) is in a dynamically allocated area; all the others are statically allocated. The above storage is row (node) oriented, even though column storage often greatly speeds up analytical processing; this is due to
the nature of the keyword search problem, which requires traversing the graph from the nodes matching the keywords, in BFS style. Since we consider fully ad-hoc queries (any keyword combinations), there are no guarantees about the order of the nodes P-GAM visits. Therefore, in our setting, the vertically selective access patterns, which are optimally exploited by column-stores, do not apply. Instead, the crucial optimization here is to find the neighbors of every node fast. This is leveraged by our algorithm, as we explain below.

Algorithm 1: P-GAM
Input: 𝐺 = (𝑁, 𝐸), query 𝑄 = {w1, . . . , w𝑚}, maximum number of solutions 𝑀, maximum time limit
Output: answer trees for 𝑄 on 𝐺
1. pQueue𝑖 ← new priority queue of (tree, edge) pairs, 1 ≤ 𝑖 ≤ 𝑛𝑡
2. 𝑁𝑄 ← ∪𝑤𝑖∈𝑄 keywordIndex.lookup(w𝑖)
3. for 𝑛 ∈ 𝑁𝑄, 𝑒 edge adjacent to 𝑛 do
4.   push (𝑛, 𝑒) on some pQueue𝑗 (distribute equally)
5. end
6. launch 𝑛𝑡 P-GAM Worker (Algorithm 2) threads
7. return solutions
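Algorithm 1's initialization can be sketched sequentially as follows; this is a rough Python illustration (the structure names are simplified stand-ins, not the actual C++ engine).

```python
from itertools import cycle

def seed_queues(keyword_index, adjacency, query, nt):
    """Algorithm 1's setup: look up the nodes matching query keywords,
    form a 1-node tree per match, and distribute (tree, edge) pairs
    round-robin over nt worker queues."""
    matches = set()
    for w in query:
        matches |= keyword_index.get(w, set())
    queues = [[] for _ in range(nt)]
    rr = cycle(range(nt))
    for n in sorted(matches):
        for e in adjacency.get(n, []):
            # a 1-node tree is represented here as a 1-tuple (its root)
            queues[next(rr)].append(((n,), e))
    return queues
```

Round-robin distribution gives every worker roughly the same number of starting pairs, regardless of how unevenly the keyword matches are spread over the graph.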
Our P-GAM (Parallel GAM) query algorithm builds a set of data structures, which are exploited by concurrent workers (threads) to produce query answers. We split these data structures into shared and private to the workers. We start with the shared ones. The history data structure holds all trees built during the exploration, while treesByRoot gives access to all trees rooted in a certain node. As the search space is huge, the history and treesByRoot data structures grow very much. Specifically, for history, P-GAM first has to make sure that an intermediate AT has not been considered before (i.e., browse the history) before writing a new entry. Similarly, treesByRoot is updated only when a tree changes its root or if there is a Merge of two trees; however, it is probed several times for Merge candidates. Therefore, we have implemented these data structures as lock-free hash-based maps to ensure high concurrency and prioritize read accesses. Observe that, given the high degree of data sharing, keeping these data structures thread-private would not yield any benefit.

Moving to the thread-private data structures, each thread, say number 𝑖, has a priority queue pQueue𝑖, in which are pushed (tree, edge) pairs, such that the edge is adjacent to the root of the tree. Priority in this queue is determined as follows: we prefer the pairs whose nodes match most query keywords; to break a tie, we prefer smaller trees; and to break a possible tie among these, we prefer the pair where the edge has the highest specificity. This is a simple priority order we chose empirically; any other priority could be used, with no change to the algorithm. P-GAM keyword search is outlined in Algorithm 1. It creates the shared structures, and 𝑛𝑡 threads (as many as available, based on the computing hardware resources).
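The three-level tie-breaking can be captured as a single comparison key; a minimal sketch, where the two callables are hypothetical stand-ins for P-GAM's bookkeeping and "better" pairs compare smaller, as in a min-ordered priority queue:

```python
def priority_key(pair, num_matched, specificity):
    """Order (tree, edge) pairs: most query keywords matched first,
    then smaller trees, then the highest-specificity edge."""
    tree, edge = pair
    return (-num_matched(tree),   # more keywords matched -> smaller key
            len(tree),            # fewer nodes -> smaller key
            -specificity(edge))   # higher specificity -> smaller key
```

Sorting or heap-ordering pairs by this key realizes the priority described above, and swapping in any other key leaves the rest of the algorithm unchanged.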
The search starts by looking up the nodes 𝑁𝑄 matching at least one query keyword (line 2); we create a 1-node tree from each such node, and push it together with an adjacent edge (line 4), in one of the pQueue's (distributing them in round-robin).

Algorithm 2:
P-GAM Worker (thread number 𝑖 out of 𝑛𝑡)
1. repeat
2.   pop (𝑡, 𝑒), the highest-priority pair in pQueue𝑖 (or, if empty, from the pQueue𝑗 having the most entries)
3.   𝑡𝐺 ← Grow(𝑡, 𝑒)
4.   if 𝑡𝐺 ∉ history then
5.     for all edges 𝑒′ adjacent to the root of 𝑡𝐺, push (𝑡𝐺, 𝑒′) in pQueue𝑖
6.     build all 𝑡𝑀 ← Merge(𝑡𝐺, 𝑡′) where 𝑡′ ∈ treesByRoot.get(𝑡𝐺.root) and 𝑡′ matches 𝑄 keywords disjoint from those of 𝑡𝐺
7.     if 𝑡𝑀 ∉ history then
8.       recursively merge 𝑡𝑀 with all suitable partners
9.       add all the (new) Merge trees to history
10.      for each new Merge tree 𝑡′′, and edge 𝑒′′ adjacent to the root of 𝑡′′, push (𝑡′′, 𝑒′′) in pQueue𝑖
11.    end
12.  end
13. until time-out, or 𝑀 solutions are found, or all pQueue𝑗 empty, for 1 ≤ 𝑗 ≤ 𝑛𝑡

Next, 𝑛𝑡 worker threads run Algorithm 2 in parallel, until a global stop condition: time-out, or until the maximum number of solutions has been reached, or all the queues are empty. Each worker repeatedly picks the highest-priority (tree, edge) pair on its queue (line 2), and applies Grow on it (line 3), leading to a 1-edge larger tree (e.g., 𝑡 obtained from 𝑡 in Figure 3). Thus, the queue priority orders the possible Grow steps at a certain point during the search; it tends to lead to small solutions being found first, so that users are not surprised by the lack of a connection they expected (and which usually involves few links). If the Grow result tree had not been found before (this is determined from the history), the worker tries to Merge it with all compatible trees, found within treesByRoot (line 6). The Merge partners (e.g., 𝑡 and 𝑡 in Figure 3) should match different (disjoint) keywords; this condition ensures minimality of the solution. Merge results are repeatedly Merged again; the thread switches back to Grow only when no new Merge on the same root is possible. Any newly created tree is checked and, if it matches all query keywords, added to the solution set (and not pushed in any queue).
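Abstracting away parallelism, lock-free structures, and the time-out, the worker's Grow/Merge loop can be sketched sequentially as follows. This is a simplified, single-threaded Python illustration with hypothetical structures; priority is reduced to tree size, and keyword matching is by substring.

```python
import heapq

def pgam_sketch(adjacency, labels, query, max_solutions=10):
    """Trees are (root, frozenset_of_edges); history prevents
    re-exploration; a tree matching all keywords is a solution."""
    def kw(t):
        root, edges = t
        nodes = {root} | {n for e in edges for n in e}
        return frozenset(w for w in query if any(w in labels[n] for n in nodes))
    history, solutions, by_root, queue = set(), [], {}, []
    counter = 0
    def consider(t):
        nonlocal counter
        if t in history:
            return
        history.add(t)
        if kw(t) == frozenset(query):   # full match: a solution,
            solutions.append(t)         # not pushed in any queue
            return
        by_root.setdefault(t[0], []).append(t)
        heapq.heappush(queue, (len(t[1]), counter, t))  # smaller trees first
        counter += 1
    for n in labels:                    # 1-node trees from keyword matches
        if kw((n, frozenset())):
            consider((n, frozenset()))
    while queue and len(solutions) < max_solutions:
        _, _, t = heapq.heappop(queue)
        root, edges = t
        for a, b in adjacency.get(root, []):            # Grow step
            e = (min(a, b), max(a, b))
            if e not in edges:
                consider((b if a == root else a, edges | {e}))
        for t2 in list(by_root.get(root, [])):          # Merge step
            if t2[1].isdisjoint(edges) and kw(t2).isdisjoint(kw(t)):
                consider((root, edges | t2[1]))
    return solutions
```

On a tiny Alice - paper - ABCPharma graph, the sketch finds the two-edge connection between the two keyword matches; the real engine additionally exploits specificity, per-thread queues, and work stealing.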
Finally, to balance the load among the workers, if one has exhausted its queue, it retrieves the highest-priority (tree, edge) pair from the queue with the most entries, pushing the possible results in its own queue. As seen above, the threads intensely compete for access to history and treesByRoot. As we demonstrate in Section 6.3, our design allows excellent scalability as the number of threads increases.

We now present the results of our experimental evaluation. Section 6.1 presents the hardware and data used in our application. Then, Section 6.2 studies the impact of extraction policies (Section 4). Section 6.3 analyzes the scalability of the P-GAM algorithm, focusing on its interaction with the hardware, and demonstrates its significant gains with respect to GAM. Section 6.4 demonstrates P-GAM scalability on a large, real-world graph built for our CoI IJ application.
              Total (s)   Extraction (s)   Storage (s)
No policy
Using policy     929           716             131

Table 1: Sample impact of an extraction policy.

Figure 5: Synthetic graphs: chain_k and star_{p,k}.

We used a server with two 10-core Intel Xeon E5-2640 v4 (Broadwell) CPUs clocked at 2.4GHz, and 128GB of DRAM. We do not use Hyper-Threads, and we bind every CPU core to a single worker thread. As shown in Figure 2, ConnectionLens (90% Java, 10% Python) is used (Section 6.2) to construct a graph out of a set of data sources and store it in PostgreSQL. Next in the processing pipeline, we migrate the graph to the novel in-memory graph engine previously described, which queries it using the P-GAM algorithm. The query engine is a NUMA-aware, multi-threaded C++ application.
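The extraction policies evaluated in Section 6.2 (Table 1) can be pictured as follows. This is a hypothetical sketch under our own simplification: we assume a policy is a set of path suffixes selecting the nodes worth extracting from; all names are illustrative.

```python
# Hypothetical sketch: run the costly entity extractor only on nodes whose
# path matches the policy (a set of path suffixes); everything else is skipped.
def extract_entities(nodes, policy, ner):
    """nodes: iterable of (path, text); ner: text -> list of entities."""
    results = {}
    for path, text in nodes:
        if any(path.endswith(suffix) for suffix in policy):
            results[path] = ner(text)   # extraction runs here only
    return results
```

With a policy selecting, say, only author elements, the entity-extraction calls (the dominant cost in Table 1) are skipped on the rest of each bibliographic notice.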
In this experiment, we loaded a set of 20,000 PubMed XML bibliographic notices (38 …).

The scalability analysis is performed on synthetic graphs, whose size and topology we can fully control. We focus on two aspects that impact scalability: (i) contention in concurrent access to data structures, and (ii) the size of the graph (which impacts the search space). To analyze the behavior of P-GAM's concurrent data structures, we use chain_k graphs, because they yield a large number of intermediate results, shared across threads, even for a small graph. This way, we can isolate the size of the graph from the size of the intermediate results.

We use two shapes of graphs (each with 1 associated query), leading to very different search space sizes (Figure 5). In both graphs, all the kwd_i for 0 ≤ i are distinct keywords, as are the labels of the node(s) where each keyword appears; no other node label matches these keywords. Chain_k has 2k edges; on it, the two-keyword query shown in Figure 5 has 2^k solutions, since any two neighbor nodes can be connected by an a_i or by a b_i edge; further, on the order of 2^k partial trees grow from one keyword toward (but not reaching) the other.

Table 2: Single-thread P-GAM vs. GAM performance.

Figure 6: P-GAM scaling on chain graphs.
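The two synthetic topologies (chain_k above, star_{p,k} defined next) can be generated as follows. This is an illustrative sketch: edges are (source, target, label) triples, and the node-naming scheme is our own.

```python
# chain_k: nodes 0..k; nodes i-1 and i are linked by two parallel edges a_i, b_i.
# A 2-keyword query on the extremities thus has 2^k answers: each of the k steps
# independently picks the a_i or the b_i edge.
def chain_graph(k):
    edges = []
    for i in range(1, k + 1):
        edges += [(i - 1, i, f"a{i}"), (i - 1, i, f"b{i}")]
    return edges  # 2k edges in total

# star_{p,k}: p branches of k edges each; node "b:0" carries kwd_b, and the
# branch ends "b:k" (all carrying kwd_0) are linked to a representative node
# ("1:k" here) through equivalence edges, p-1 of them.
def star_graph(p, k):
    edges = []
    for b in range(1, p + 1):                       # branch b: nodes "b:0".."b:k"
        for j in range(1, k + 1):
            edges.append((f"{b}:{j-1}", f"{b}:{j}", f"e{b}_{j}"))
        if b > 1:                                   # equivalence to representative
            edges.append((f"{b}:{k}", f"1:{k}", "equiv"))
    return edges
```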
Star_{p,k} has p branches, each of which is a line of length k; at one extremity each line has a keyword kwd_i, 1 ≤ i ≤ p, while at the other extremity, all lines have kwd_0. As explained in Section 3.1, these nodes are equivalent; one is designated their representative (in Figure 5, the topmost one), and the others are connected to it through equivalence edges, shown in red. On this graph, the query {kwd_1, kwd_2, ..., kwd_p} has exactly one solution, which is the complete graph; there are O((k+1)^p) partial trees.

Single-thread P-GAM vs. GAM
We start by comparing P-GAM, using only 1 thread, with the (single-threaded) Java-based GAM, accessing graph edges from a PostgreSQL database. We ran the two algorithms on the synthetic graphs and queries, with a time-out of 15 minutes; both could stop earlier if they exhausted the search space. Table 2 shows: the number of solutions S, the time T_PGAM (ms) until the first solution is found by P-GAM and its total running time T_PGAM (s), as well as the corresponding times for GAM (Java on Postgres). On these tiny graphs, both algorithms found all the expected solutions; however, even without parallelism, P-GAM is 10× to more than 100× faster. In particular, on all but the 3 smallest graphs, GAM did not exhaust its search space in 15 minutes. This experiment validates the expected orders-of-magnitude speed-up of a carefully designed in-memory implementation, even without parallelism (since we restricted P-GAM to 1 thread).

Parallel P-GAM
Next, on the graphs chain_k for 12 ≤ k ≤ …, we ran the two-keyword query as we increase the number of worker threads from 1 to 20 (Figure 6). We see a clear speedup as the number of threads increases, which is on average 13x for the graph sizes that we report. The speedup is not linear because, as the size of the intermediate results grows, it exceeds the size of the CPU caches, while threads need to access them at every iteration. Our profiling revealed that, as several threads access the shared data structures, they evict content from the CPU cache that would be useful to other threads. In contrast, we did not notice overheads from our synchronization mechanisms.

Figure 7: P-GAM scaling on star graphs.
To study the scalability of the algorithm with the graph size, we use star_{4,k} for k ∈ {…} and the query {kwd_1, kwd_2, kwd_3, kwd_4}. Figure 7 shows the exhaustive search time of P-GAM on these graphs of up to 20,000 nodes, using 1 to 4 threads. We obtain an average speed-up of over 3× with 4 threads, regardless of the size of the graph, which shows that P-GAM scales well across graph models and graph sizes. After profiling, we observed that the size of the intermediate results impacts the performance, similarly to the previous case of the chain graph.

In the above star_{4,k} experiments, we used up to 4 threads since the graph has a symmetry of 4 (however, threads share the work with no knowledge of the graph structure). When keyword matches are poorly connected, e.g., at the end of simple paths, as in our star graphs, P-GAM search starts by exploring these paths, moving farther away from each keyword; if N nodes match query keywords, up to N threads can share this work. In contrast, as soon as these explored paths intersect, Grow and Merge create many opportunities that can be exploited by one thread or another. On chain_k, the presence of 2 edges between any adjacent nodes multiplies the Grow and Merge opportunities, work which can be shared by many threads. This is why on chain_k, we see scalability up to 32 worker threads, which is the maximum that our server supports. We now describe experiments on actual application data.
The graph.
We selected sources based on S. Horel's expertise and suggestions, as follows. (i) We loaded PubMed bibliographic notices (XML), corresponding to articles from 2019 and 2020; they occupy 803 MB on disk. We used the same extraction policy as in Table 1 to perform only the necessary extraction. (ii) We have downloaded the PDF articles corresponding to these notices (those that were available in Open Access), transformed them into JSON using an extraction script we developed, and preserved only those paragraphs starting with a set of keywords ("Disclosure", "Competing Interest", "Acknowledgments", etc.) which have been shown [3] to encode potentially interesting participations of people (other than authors) and organizations in an article. Together, these JSON fragments occupy 173 MB on disk.
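The paragraph filter just described can be sketched as follows; the marker list below is a sample of the keyword set mentioned above, and the function is our own illustration, not the actual extraction script.

```python
# Keep only the paragraphs that open a section of interest (disclosure,
# competing-interest, acknowledgment sections); everything else is dropped.
MARKERS = ("Disclosure", "Competing Interest", "Acknowledgments")

def keep_paragraphs(paragraphs):
    """Return the paragraphs starting with one of the section markers."""
    return [p for p in paragraphs if p.lstrip().startswith(MARKERS)]
```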
The JSON and the XML content from the same paper are connected (at least) through the URI of that paper, as shown in Figure 1. (iii) We have crawled 375 HTML pages. Table 3 shows the numbers of nodes |N|, of edges |E|, and, respectively, of Person, Organization and Location entities (|N_P|, |N_O|, |N_L|), split by data model, and overall.

          |N|          |E|          |N_P|       |N_O|     |N_L|
XML       32,028,429   19,851,904   1,483,631   584,734   126,629
JSON      1,025,307    432,303      75,297      7,320     4,139
HTML      246,636      185,479      3,726       7,227     320
Total     33,300,372   20,469,686   1,562,654   665,167   131,088
Table 3: Statistics on the Conflict of Interest application graph.

Query   Keywords            T_1st   T_last   T       S      DS
10      A10, U1, I3         7577    17383    17383   1000   4-6,
11      A11, I4, I5         10396   32320    60000   6      3,
12      A12, I4, I6         7320    7467     60000   24     4,
13      A3, A13, U2, P4     15759   35025    60000   5      5-6, 8,
14      A3, A14, U3, G1     10711   10711    60000   1      7,
15      A3, A15, U4, P4     8560    9942     60000   16     9,

Table 4: P-GAM performance on the CoI real-world graph.

Querying the graph.
Table 4 shows the results of executing 15 queries using P-GAM, until 1,000 solutions are found or for at most 60 seconds. From left to right, the columns show: the query number, the query keywords, the time T_1st until the first solution is found, the time T_last until the last solution is found, the total running time T, the number of solutions found S, and some statistics on the number of data sources participating in the solutions found (DS, see below). All times are in milliseconds. We have anonymized the keywords that we use, so as not to single out individuals or corporations, and because the queries were selected aiming not at them, but at a large variety of P-GAM behaviors. We use the following codes: A for author, G for government service, H for hospital, P for country, U for university, and I for industry (company). A DS value of the form "2-10, 6" means that P-GAM found solutions spanning at least 2 and at most 10 data sources, while most solutions spanned 6 sources.

We make several observations based on the results. The stop conditions were set here based on what we consider an interactive query response time, and a number of solutions which allows further exploration by the users (e.g., through an interactive GUI we developed). Further, solutions span several datasets, demonstrating the interest of the multi-dataset search our graph integration enables, and showing that P-GAM exploits this possibility. Finally, queries with different numbers of keywords all remain within the same time bounds, showing that the system stays responsive despite the increasing query complexity.

In this paper, we presented a complete pipeline for managing heterogeneous data for IJ applications.
This innovates upon recent work [6], where we addressed the problems of integrating such data in a graph and querying it, as follows: (i) we present a complete data science application with clear societal impact, (ii) we show how extraction policies improve graph construction performance, and (iii) we introduce a parallel search algorithm which scales across different graph models and sizes. Below, we discuss the prior work most relevant to the contributions we made here; more elements of comparison can be found in [6]. Our work falls into the data integration area [16]; our IJ pipeline starts by ingesting data into an integrated data repository, deployed in PostgreSQL. The first platform we proposed to Le Monde journalists was a mediator [8], resembling polystores, e.g., [17, 28]. However, we found that: (i) their datasets are changing, text-rich and schema-less, (ii) running a set of data stores (plus a mediator) was not feasible for them, and (iii) knowledge of a schema, or the capacity to devise integration plans, was lacking. ConnectionLens' first iteration [12] lifted (iii) by introducing keyword search, but it still kept part of the graph virtual, and split keyword queries into subqueries sent to the sources. Consolidating the graph in a single store, and the centralized GAM algorithm [6], greatly sped up and simplified the tool, whose performance we again improve here. We share the goal of exploring and connecting data with data discovery methods [20, 21, 35, 38], which have mostly focused on tabular data. While our data is heterogeneous, focusing on an IJ application partially eliminates risks of ambiguity, since in our context, one person or organization name typically denotes a single concept. Keyword search has been studied in XML [22, 31], in graphs (from which we borrowed the Grow and Merge operations of GAM) [15, 23], and in particular in RDF graphs [18, 29].
However, our keyword search problem is harder in several respects: (i) we make no assumption on the shape and regularity of the graph; (ii) we allow answer trees to traverse edges in both directions; (iii) we make no assumption on the score function, invalidating Dynamic Programming (DP) methods such as [31] and other similar prunings. In particular, we show in [7] that edges with a confidence lower than 1, such as similarity and extraction edges in our graphs, compromise, for any "reasonable" score function reflecting these confidences, the optimal substructure property at the core of DP. Works on parallel keyword search in graphs either consider a different setting, returning a certain class of subgraphs instead of trees [40], or rely on standard graph traversal algorithms like BFS [14, 25, 30]. To the best of our knowledge, GAM is the first keyword search algorithm for the specific problem that we consider in this paper. Accordingly, in this paper we have parallelized GAM into P-GAM, drawing inspiration from, and addressing challenges commonly raised in, the graph processing systems literature, in particular concerning CPU efficiency when interacting with main memory [4, 19, 24, 32, 37]. Our future work includes: building a unified CoI repository based on more biomedical sources, enhancing our in-memory query processor, and querying the graph using natural language.

Acknowledgments.
The authors thank M. Ferrer and the Décodeurs team (Le Monde) for introducing us, and for many insightful discussions.
REFERENCES
[4] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In ISCA 2015. ACM, 105–117. https://doi.org/10.1145/2749469.2750386
[5] Rana Alotaibi, Damian Bursztyn, Alin Deutsch, Ioana Manolescu, and Stamatis Zampetakis. 2019. Towards Scalable Hybrid Stores: Constraint-Based Rewriting to the Rescue. In SIGMOD 2019. ACM, 1660–1677. https://doi.org/10.1145/3299869.3319895
[6] Angelos-Christos G. Anadiotis, Oana Balalau, Catarina Conceição, Helena Galhardas, Mhd Yamen Haddad, Ioana Manolescu, Tayeb Merabti, and Jingmao You. 2020. Graph integration of structured, semistructured and unstructured data for data journalism. CoRR abs/2012.08830. https://arxiv.org/abs/2012.08830 Currently under evaluation as an invited journal paper.
[7] Angelos-Christos G. Anadiotis, Mhd Yamen Haddad, and Ioana Manolescu. 2020. Graph-based keyword search in heterogeneous data sources. In Bases de Données Avancées (informal publication). https://arxiv.org/abs/2009.04283
[8] Raphaël Bonaque, Tien Duc Cao, Bogdan Cautis, François Goasdoué, J. Letelier, Ioana Manolescu, O. Mendoza, S. Ribeiro, Xavier Tannier, and Michaël Thomazo. 2016. Mixed-instance querying: a lightweight integration architecture for data journalism. Proc. VLDB Endow. 9, 13 (2016), 1513–1516. https://doi.org/10.14778/3007263.3007297
[9] Peter A. Boncz, Stefan Manegold, and Martin L. Kersten. 1999. Database Architecture Optimized for the New Bottleneck: Memory Access. In VLDB '99. Morgan Kaufmann, 54–65.
[10] Maxime Buron, François Goasdoué, Ioana Manolescu, and Marie-Laure Mugnier. 2020. Obi-Wan: Ontology-Based RDF Integration of Heterogeneous Data. Proc. VLDB Endow.
[11] … In DL Workshop (CEUR Workshop Proceedings), Vol. 250. CEUR-WS.org. http://ceur-ws.org/Vol-250/paper_76.pdf
[12] Camille Chanial, Rédouane Dziri, Helena Galhardas, Julien Leblay, Minh-Huong Le Nguyen, and Ioana Manolescu. 2018. ConnectionLens: Finding Connections Across Heterogeneous Data Sources. Proc. VLDB Endow. 11, 12 (2018), 2030–2033. https://doi.org/10.14778/3229863.3236252
[13] Christina Christodoulakis, Eric Munson, Moshe Gabel, Angela Demke Brown, and Renée J. Miller. 2020. Pytheas: Pattern-based Table Discovery in CSV Files. Proc. VLDB Endow.
[14] … In SPAA 2017. ACM, 293–304. https://doi.org/10.1145/3087556.3087580
[15] Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin, Xiao Zhang, and Xuemin Lin. 2007. Finding Top-k Min-Cost Connected Trees in Databases. In ICDE 2007. IEEE Computer Society, 836–845. https://doi.org/10.1109/ICDE.2007.367929
[16] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data Integration. Morgan Kaufmann. http://research.cs.wisc.edu/dibook/
[17] J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD (2015).
[18] Shady Elbassuoni and Roi Blanco. 2011. Keyword search over RDF graphs. In CIKM 2011. ACM, 237–242. https://doi.org/10.1145/2063576.2063615
[19] Nima Elyasi, Changho Choi, and Anand Sivasubramaniam. 2019. Large-Scale Graph Processing on Emerging Storage Devices. In …
[20] … In ICDE 2018. IEEE Computer Society, 1001–1012. https://doi.org/10.1109/ICDE.2018.00094
[21] Raul Castro Fernandez, Essam Mansour, Abdulhakim Ali Qahtan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2018. Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery. In ICDE 2018. IEEE Computer Society, 989–1000. https://doi.org/10.1109/ICDE.2018.00093
[22] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. 2003. XRANK: Ranked Keyword Search over XML Documents. In SIGMOD 2003. ACM, 16–27. https://doi.org/10.1145/872757.872762
[23] Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. 2007. BLINKS: ranked keyword searches on graphs. In SIGMOD 2007. ACM, 305–316. https://doi.org/10.1145/1247480.1247516
[24] Sungpack Hong, Siegfried Depner, Thomas Manhardt, Jan Van Der Lugt, Merijn Verstraaten, and Hassan Chafi. 2015. PGX.D: a fast distributed graph processing engine. In SC 2015. ACM, 58:1–58:12. https://doi.org/10.1145/2807591.2807620
[25] Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. 2011. Efficient Parallel Graph Exploration on Multi-Core CPU and GPU. In PACT 2011. IEEE Computer Society, 78–88. https://doi.org/10.1109/PACT.2011.14
[26] Stéphane Horel. 2018. Lobbytomie.
[28] … Distributed Parallel Databases 34, 4 (2016), 463–503. https://doi.org/10.1007/s10619-015-7185-y
[29] Wangchao Le, Feifei Li, Anastasios Kementsietsidis, and Songyun Duan. 2014. Scalable Keyword Search on Large RDF Data. IEEE Trans. Knowl. Data Eng.
[30] … In SPAA '10. ACM, 303–314. https://doi.org/10.1145/1810479.1810534
[31] Ziyang Liu and Yi Chen. 2007. Identifying meaningful return information for XML keyword search. In SIGMOD 2007. ACM, 329–340. https://doi.org/10.1145/1247480.1247518
[32] Jasmina Malicevic, Baptiste Lepers, and Willy Zwaenepoel. 2017. Everything you always wanted to know about multicore graph processing but were afraid to ask. In …
[33] … In SIGMOD 2020. ACM, 1939–1950. https://doi.org/10.1145/3318464.3380605
[34] Naomi Oreskes and Erik Conway. 2012. Merchants of Doubt.
[35] … Proc. VLDB Endow.
[36] … In CIDR 2020, Online Proceedings.
[37] … In SOSP '13. ACM, 472–488. https://doi.org/10.1145/2517349.2522740
[38] Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Y. Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu. 2012. Finding related tables. In SIGMOD 2012. ACM, 817–828. https://doi.org/10.1145/2213836.2213962
[39] Amazon Web Services. [n.d.]. Memory optimized instances. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/memory-optimized-instances.html. Last accessed: 2021-01-25.
[40] Yueji Yang, Divyakant Agrawal, H. V. Jagadish, Anthony K. H. Tung, and Shuang Wu. 2019. An Efficient Parallel Keyword Search Engine on Knowledge Graphs. In ICDE 2019.