Sapphire: Querying RDF Data Made Simple
Ahmed El-Roby
Carleton University [email protected]
Khaled Ammar
University of Waterloo [email protected]
Ashraf Aboulnaga
Qatar Computing Research Institute - HBKU [email protected]
Jimmy Lin
University of Waterloo [email protected]
ABSTRACT
RDF data in the linked open data (LOD) cloud is very valuable for many different applications. In order to unlock the full value of this data, users should be able to issue complex queries on the RDF datasets in the LOD cloud. SPARQL can express such complex queries, but constructing SPARQL queries can be a challenge to users since it requires knowing the structure and vocabulary of the datasets being queried. In this paper, we introduce Sapphire, a tool that helps users write syntactically and semantically correct SPARQL queries without prior knowledge of the queried datasets. Sapphire interactively helps the user while typing the query by providing auto-complete suggestions based on the queried data. After a query is issued, Sapphire provides suggestions on ways to change the query to better match the needs of the user. We evaluated Sapphire based on performance experiments and a user study and showed it to be superior to competing approaches.
1. INTRODUCTION
In recent years, advances in the field of information extraction have helped in automating the construction of large Resource Description Framework (RDF) datasets that are published on the web. These datasets can be general-purpose, such as DBpedia [5] (http://dbpedia.org), a dataset of structured information extracted from Wikipedia, or they can be specific to particular domains such as movies, geographic information, and city data. These datasets are graph-structured, and are interlinked via edges that point from one dataset to another, forming a massive graph known as the Linked Open Data (LOD) cloud (http://lod-cloud.net).

The LOD cloud contains a wealth of structured information that can be extremely useful to users and applications in diverse domains. However, utilizing this information requires an effective way to find the answers to questions in the datasets that make up this cloud. Answering questions over RDF data generally follows one of two approaches: (a) natural language queries, and (b) structured querying using SPARQL [3], the standard query language for RDF. Natural language approaches rely on keyword search or more complex question answering techniques. These approaches are convenient and easy to use, and they find accurate answers for simple questions such as "How many people live in New York?". Questions like this one that ask about a specific property of an entity are termed factoid questions. Such questions can be answered by a simple structured search that can be constructed effectively by natural language approaches.

However, the RDF data that makes up the LOD cloud is not limited to answering simple questions. This data can be used to answer complex questions that require complex structured searches. Natural language approaches are not effective at constructing such complex structured searches. Instead, complex structured searches are better expressed using SPARQL queries. It is common practice for data sources in the LOD cloud to provide SPARQL endpoints (e.g., http://dbpedia.org/sparql for DBpedia) that allow users to issue SPARQL queries on the RDF data that they contain.

To illustrate the need for SPARQL, consider the question "How many scientists graduated from an Ivy League university?" This question was used in the QALD-5 competition [25]. QALD is an annual competition for Question Answering over Linked Data, and this question was not answered by any of the natural language systems that participated in QALD-5. This is not surprising since the question involves concepts such as "scientist", "graduated", and "Ivy League university" that are not easy to map to a structured search over the queried dataset (DBpedia), in addition to requiring a count of the results. On the other hand, a SPARQL query over the endpoint of DBpedia will find the required answer.
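One plausible formulation of such a query, sketched here with illustrative DBpedia vocabulary (the terms dbo:Scientist, dbo:almaMater, and dbo:affiliation are assumptions and may differ from the exact query used in the original paper), is the following:

PREFIX res: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Count scientists whose alma mater is affiliated with the Ivy League
# (illustrative sketch; the predicate and class choices are assumptions).
SELECT (COUNT(DISTINCT ?scientist) AS ?count)
WHERE {
  ?scientist  rdf:type        dbo:Scientist .
  ?scientist  dbo:almaMater   ?university .
  ?university dbo:affiliation res:Ivy_League .
}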
To be able to compose a query such as this one, the user needs to know the structure of the dataset, the vocabulary used to represent different concepts, and the literals used in the dataset, including their data types and format. For example, the user needs to know that "scientist" is an rdf:type and "Ivy League" is an affiliation of a university. Achieving this level of knowledge about a dataset can be difficult even for experienced users given the massive scale and diverse vocabulary of the LOD cloud. By one recent count (http://stats.lod2.eu), the LOD cloud has almost 3000 data sources that contain over 14 billion RDF triples from various domains. To illustrate the diversity of the vocabulary in the LOD cloud, consider that DBpedia alone has over 3K distinct predicates at the time of writing this paper. Thus, it is quite likely that a user would need to construct SPARQL queries on data whose structure and vocabulary she does not know in full, for example when querying a new dataset. Our goal in this paper is to help users with this challenging task.

We present Sapphire, an interactive tool aimed at helping users write syntactically and semantically correct SPARQL queries on RDF datasets they do not have prior knowledge about. Sapphire is aimed at users who have a technical background but are not necessarily SPARQL experts, e.g., data scientists or application developers. Thus, Sapphire makes no attempt to "shield" its users from the syntax of SPARQL, but rather helps them construct valid SPARQL queries with ease. Sapphire achieves this in two ways that both rely on a predictive user model that is built for the endpoints to be queried in an initialization phase. First, while a user is typing a query, Sapphire interactively provides her with data-driven suggestions to complete the predicates and literals in the query, similar to the auto-complete capability in many user interfaces. Second, when a user completes the query and submits it for execution, Sapphire suggests ways to modify the query into one that may be better suited to the needs of the user. For example, if the user query returns no answers, Sapphire would attempt to modify it into a query that does return answers.

Sapphire's query completion and query suggestion modules rely on natural language techniques. Thus, in the spectrum of approaches for querying RDF, Sapphire bridges the gap between the simple but ambiguous natural language approaches on the one hand, and the powerful but cumbersome SPARQL on the other. The novelty of Sapphire comes from the need to balance multiple conflicting goals: Sapphire must provide high-quality recommendations that actually help the user find the information that she needs, it must have fast response time since it is interactive, it must run on a reasonably sized machine without placing excessive demands on the machine's resources, and it must not overload the SPARQL endpoints it queries. These design goals require judicious design choices, which we present in the rest of this paper. We have built Sapphire as a web-based querying tool (http://github.com/aelroby/Sapphire), and we demonstrated its user interface and query composition workflow in [13]. In this paper, we present the internals of Sapphire and demonstrate through experiments and a user study that it is significantly more effective than competing approaches in finding answers to user queries, and it achieves interactive performance.

We review related work in Section 2 and present the architecture of Sapphire in Section 3.
We present the Sapphire user interface from [13] in Section 4. We then present the contributions of the paper, which are as follows:
• Summarizing the queried endpoints to collect concise, important data that is utilized by Sapphire (Section 5).
• The predictive user model, which is at the heart of Sapphire and includes two modules: query completion and query suggestion (Section 6).
• An extensive evaluation of Sapphire based on performance experiments and a user study (Section 7).
2. RELATED WORK
Prior work on helping users construct structured queries on RDF data falls into three categories: (1) natural language approaches, (2) approximate queries, and (3) query by example.
Natural Language Approaches: Several prior works create structured queries based on natural language approaches [12, 20, 24, 29, 14]. Each of these works focuses on one or more specific query templates, and uses keyword search or natural language questions to construct these templates and fill in the placeholders they contain. All of these approaches suffer from two limitations compared to Sapphire: (1) their expressiveness is limited to specific query templates, and (2) inferring query structure, predicates, and literals based only on natural language is inherently ambiguous. In contrast, Sapphire can construct any SPARQL query, and it removes ambiguity by involving the user directly in query composition.

In this paper, we compare Sapphire to QAKiS [7] and KBQA [10] as representatives of the state of the art in natural language approaches, and we show that Sapphire outperforms these two systems. QAKiS [7] is a question answering system over RDF that automatically extracts from Wikipedia different ways of expressing relations in natural language (e.g., "a bridge spans a river" and "a bridge crosses a river" express the same relation). These equivalent expressions are used to match fragments of a natural language question and construct the equivalent SPARQL query. KBQA [10] is a more recent question answering system that focuses on factoid questions. KBQA learns question templates from a large Q&A corpus (e.g., Yahoo! Answers), and learns mappings from these templates to RDF predicates in the queried dataset. The templates and mappings are then used to answer user questions.
Approximate Queries: This line of work goes beyond the fixed query templates used by natural language approaches, enabling the user to express approximate structured queries. That is, the query posed by the user does not have to be exactly matched with the queried RDF data [19, 18, 30]. These approaches are still limited in the query structure that they support, and they require the user to know the vocabulary and the approximate schema of the queried datasets. In contrast, Sapphire enables the user to compose any SPARQL query without prior knowledge of the queried datasets.

We compare Sapphire to S [31], a recent system that was shown to outperform other approximate query approaches. S summarizes the queried dataset by maintaining a graph of the relationships between RDF entity types based on the relationships between instances of these types. Queries are rewritten based on this graph. S assumes that the user can issue queries using correct predicates and instance URIs in the dataset, but possibly not with the correct query structure.

Query by Example: SPARQLByE [4, 11] infers the SPARQL query that best suits the user's needs based on a set of example answers she provides. A key limitation of this approach is that the user needs to know a set of examples that satisfy her query, which is often not practical. For example, to answer the query "How many people live in New York?", the user should know the precise population of some cities to provide as examples, which can be impractical. In contrast, Sapphire helps the user directly construct a SPARQL query rather than inferring it. We compare Sapphire to SPARQLByE and show that Sapphire is more expressive.
Figure 1: Architecture of Sapphire.
3. Sapphire ARCHITECTURE AND CHALLENGES
In this section we present the overall architecture of Sapphire, and an overview of the different design choices and challenges that must be addressed in order to implement a useful and efficient system.

Figure 1 shows the architecture of Sapphire, which runs as a server that sits between the user and the SPARQL endpoints for one or more RDF datasets on the web. Sapphire accesses the endpoints through a federated query processor. Sapphire uses FedX [22], a widely used federated query processor, but any other federated query processor can be used. The core of Sapphire is the Predictive User Model (PUM), which helps the user express her information needs using SPARQL queries. The PUM relies on information about the datasets being queried. Before querying an endpoint, the user must register this endpoint with the Sapphire server, and the server goes through an initialization step in which it caches important data from this endpoint. One challenge that must be addressed by Sapphire is which data from an endpoint to cache, and how to retrieve this data without overloading the endpoint.

While the user is composing a query, the query terms are forwarded to the Query Completion Module (QCM) as they are typed by the user. The QCM interactively provides the user with suggestions to complete the terms in her query based on the data cached during initialization. A question that must be answered when designing the QCM is how to provide interactive response time even for the large-scale data in the LOD cloud.

After composing a syntactically correct query, the federated query processor executes the query and returns answers. Simultaneously, the Query Suggestion Module (QSM) suggests changes to the query to help the user find the answers she is looking for. The goal of the QSM is to suggest queries that are similar to the one issued by the user, but different enough to present her with useful alternatives that may help her satisfy her information needs. These suggestions span two directions: (1) finding alternative literals and predicates to the ones used in the query, and (2) relaxing the structure of the issued query to approximately match it with candidate patterns in the dataset. Query suggestions are provided for all queries, and it is up to the user to accept these suggestions if the returned answers do not satisfy her information needs. The QSM poses several interesting research questions, such as which literals and predicates to replace in the query and how to find replacement terms, as well as what it means to relax the structure of a query and how to find the relaxed structure efficiently. The way we address the different requirements and challenges in Sapphire is described in the next three sections. We start by discussing the user interface in Section 4, then we present how initialization happens for a new endpoint in Section 5, and the PUM in Section 6.
4. USER INTERFACE
Sapphire has a web-based user interface that was demonstrated in [13]. This interface is shown in Figure 2. The interface presents a text box for each part of a SPARQL query. While the user is typing query terms, the QCM provides suggestions to complete these terms, as shown in Figure 3. After the user inputs a query, the query is validated and executed. Whenever a query is executed, the QSM tries to find alternatives to the query that was constructed by the user. Figure 2 shows an example of how the QSM suggests changes to the executed query. In this example, the user wants to find all people with the surname "Kennedys" (in plural form). However, no answers were found using this surname. The QSM suggests a modification that will result in finding 1,051 answers, by changing "Kennedys" to "Kennedy". If the user accepts this suggestion and updates the query, the new query is executed and the answers are displayed in the answer table (Figure 4). New suggestions are now displayed to the user in case these answers still do not satisfy her information needs. The query alternatives are shown to the user in the form of suggestions to change one term at a time. For example, one suggestion could be "In the triple (subject, predicate, object), did you mean predicate1 instead of predicate2? There are N answers available." This approach avoids showing the user a completely rewritten SPARQL query in one step, which would make the suggestions difficult to understand, especially for large and complex queries. The only exception is when the QSM suggests queries that are different in structure than the issued query. We will elaborate on this specific type of query suggestion in Section 6.2.

Figure 2: User interface showing a suggestion to modify the current query, which returned no answers.

Figure 3: Auto-complete suggestions using the QCM.

The suggested queries are executed in the background using the Federated Query Processor, and answers are prefetched so that when the user decides to choose one of the alternatives, the query is not re-executed and the answers are displayed almost instantaneously. When the answers to a query are displayed to the user, she has the ability to manipulate them in the answer table, as shown in Figure 4. Supported operations include the following: the user can search the answer table using a keyword search box, order the answers by any column, show and hide columns, and drag and drop answers from the answer table to the query text boxes for additional queries. Next, we turn our attention to the technical details of Sapphire initialization and the PUM.
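Before doing so, and purely as an illustration of the Figure 2 example, the query composed in the interface might look roughly like the following sketch (the class dbo:Person and the predicate foaf:surname are assumptions; the query actually built in the interface may use different vocabulary):

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Initial attempt: returns no answers because of the plural literal "Kennedys".
SELECT ?person
WHERE {
  ?person a dbo:Person .
  ?person foaf:surname "Kennedys"@en .
}
# The QSM suggests replacing "Kennedys" with "Kennedy"; per the example above,
# the revised query finds 1,051 answers.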
5. INITIALIZATION FOR A NEW ENDPOINT
This section describes the initialization step in which Sapphire retrieves data from a newly registered endpoint. Three questions must be answered: (1) Which data from the endpoint should be cached? (2) How is this data retrieved? (3) How is it indexed for efficient access by the PUM?
The data cached by Sapphire from the endpoints plays a significant role in helping the user write a query that describes her information needs. In designing Sapphire, we assume that it is simpler and more intuitive for users to express their information needs using keywords rather than using URIs. Therefore, the focus of the Sapphire PUM is on mapping keywords entered by the user in her query to RDF predicates and literals in the dataset.

Thus, Sapphire needs to cache RDF predicates and literals from a dataset so that these predicates and literals can subsequently be matched to keywords in the user query. Which predicates and literals to cache is a challenging question. The choice of data to cache cannot rely on statistical knowledge of the queried datasets or the query logs, since such knowledge is not available. Therefore, we develop heuristics based on common characteristics of RDF datasets and SPARQL queries.

Our first heuristic relies on the observation that the number of distinct predicates in a dataset is typically much smaller than the number of distinct literals. For example, at the time of this writing, DBpedia has approximately 3K distinct predicates compared to 70M distinct literals. Therefore, Sapphire caches all the predicates in a dataset.

Given the typically large number of literals in a dataset, Sapphire uses heuristics to limit the number of literals that it caches. First, we assume that very long literals are not likely to be used in queries. Thus, Sapphire only caches literals below a certain length (in this paper we use 80 characters as the limit). Second, we assume that the user is interested only in a certain language and allow the user to restrict the language of the cached literals (in this paper we cache only English literals).

Following the aforementioned heuristics reduces the number of cached literals. However, the number of literals that satisfy these heuristics will likely be too large to retrieve from the endpoint using a single SPARQL query. Such a query would be a long-running query, and most endpoints impose a timeout limit on queries to avoid overloading their computing resources, or reject queries from the start if their estimated execution time is above a threshold. Thus, we need to decompose this query into multiple queries that are each within the timeout limit. Furthermore, we need to ensure that the entire initialization process finishes within a reasonable amount of time. Our goal is for initialization time to be on the order of hours, which we believe is reasonable since the initialization process happens only once for each endpoint. Next, we describe the queries that we use to retrieve literals from an endpoint for caching.

Figure 4: The answer table after applying the query suggestion in Figure 2. In this example, the 1,051 answers to the query are filtered via a keyword search on "john", and the filtered answers are ordered by the "person" column.

Sapphire divides the dataset based on the predicates and the class hierarchy defined by RDF Schema (RDFS) [1]. RDFS defines classes that serve as data types for different entities, and organizes the classes into a hierarchy based on the subClassOf relation. For example, MovieDirector and Politician are two classes that are both subclasses of Person. Sapphire issues a SPARQL query to retrieve all classes and their subclasses from the endpoint. It also issues a query to retrieve all RDF predicates associated with literals, ordered by the numbers of literals associated with each predicate. These are short queries that are not expected to time out. Sapphire then iterates through the predicates associated with literals, from most frequent to least frequent. For each predicate, Sapphire navigates through the class hierarchy from root to leaves. At each class of the hierarchy, Sapphire creates a query to retrieve literals associated with the current predicate and current class, and that are below the threshold length and in the target language. To increase the likelihood that this query will succeed, it is decomposed into multiple queries using SPARQL pagination techniques (OFFSET and LIMIT). If this query succeeds, Sapphire moves to the next sibling in the class hierarchy. If this query times out, Sapphire navigates down to the next level of the class hierarchy, which contains smaller classes, and issues the query. This process continues until all the literals are retrieved. Sapphire allows the user to set a limit on the number of queries to issue to an endpoint and stops when this limit is reached. Since Sapphire orders predicates by frequency, it prioritizes caching the literals associated with frequent predicates.

For the uncommon case of datasets that do not use the class hierarchy of RDFS (about 75% of the datasets in the LOD cloud use RDFS, according to http://stats.lod2.eu), Sapphire issues a query to retrieve the entity types that occur frequently in the dataset. Sapphire then issues queries to retrieve the literals associated with each predicate and each of these entity types, iterating through the predicates and types from most frequent to least frequent. If there is a limit on the number of queries, Sapphire stops if this limit is reached. The complete list of queries that are sent to an endpoint during initialization is presented in Appendix A.
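As a rough illustration of these initialization queries (the exact queries appear in Appendix A; the vocabulary, the 80-character limit, the English filter, and the page size below follow the heuristics described above but are assumptions rather than the actual queries), the retrieval pattern might look like:

# Retrieve the class hierarchy (classes and their subclasses).
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sub ?super
WHERE { ?sub rdfs:subClassOf ?super . }

# Retrieve the predicates associated with literals, ordered by literal count.
SELECT ?p (COUNT(?o) AS ?numLiterals)
WHERE { ?s ?p ?o . FILTER(isLiteral(?o)) }
GROUP BY ?p
ORDER BY DESC(?numLiterals)

# Retrieve short English literals for one predicate and one class, paginated
# with OFFSET/LIMIT so that each request stays within the endpoint's timeout.
# dbo:Person and foaf:name stand in for the current class and predicate.
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?literal
WHERE {
  ?s rdf:type dbo:Person .
  ?s foaf:name ?literal .
  FILTER(STRLEN(STR(?literal)) < 80 && langMatches(LANG(?literal), "en"))
}
LIMIT 10000 OFFSET 0

In this sketch, successive pages would be fetched by increasing OFFSET until no more results are returned or the query budget is exhausted.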
As discussed earlier, one of the key challenges facing Sapphire is providing suggestions to the user interactively. These suggestions come from the cached data, so this data must be indexed in a way that supports fast lookup.

The basic lookup operation for suggesting completions to the user is as follows: given a string t entered by the user, what strings in the data contain t? We observe that a suffix tree [27] is ideally suited for this type of lookup, so we use it as an index in Sapphire. The advantage of a suffix tree is that lookup operations depend only on the size of the lookup string t and the number of times z that this string occurs in the input, with a time complexity O(|t| + z). The disadvantage of a suffix tree is that it can grow very large, sometimes over an order of magnitude larger than the size of the input.

Given the space consumption of suffix trees, only a subset of the cached data can be indexed in this tree. Since the number of RDF predicates is relatively small, all predicates are indexed. The more challenging question is which subset of the literals to index. To answer this question, we introduce the notion of most significant literals, and index only these literals in the suffix tree. A literal is considered significant when the entity it is associated with occurs frequently in the dataset. That is, there are many incoming edges in the RDF graph pointing to this entity, indicating the entity's importance.

Definition: The significance score of a literal l is S(l) = |{ s | (s, p_i, o) ∧ (o, p_j, l) }|, where (s, p, o) denotes an RDF triple.

For example, the literal "New York" is associated with the entity representing this city. Since this entity is pointed to by many other entities (i.e., occurs as an RDF object), the literal "New York" is significant. This definition of significance captures important classes in the RDF class hierarchy, and also captures important instances (people, locations, etc.). To identify the significant literals, Sapphire issues queries along the class hierarchy as it did for retrieving the literals.

The final issue related to initialization is how to look up cached literals that are not in the suffix tree. We call these the residual literals. Lookup on the residual literals requires a sequential search, and we have found that this may be too slow for interactive response. To speed up this sequential search, Sapphire organizes the literals into bins of residual literals, or residual bins for short, where each bin has all the literals of a given length (i.e., bin(literal) = |literal|). As discussed in the next section, the PUM always searches for strings within a certain range of lengths, so its sequential search will be limited to a few bins. In addition, the search can be parallelized, with multiple threads simultaneously scanning the bins. We show in Section 7 that this simple organization is effective at guaranteeing interactive performance.

Figure 5: Completing a query term in the QCM.

To illustrate the cost of initialization, we note that initialization for DBpedia, one of the largest datasets in the LOD cloud, took 17 hours. In the process, Sapphire issued approximately 800 SPARQL queries to retrieve literals and 3000 to identify significant literals, in addition to the few queries that retrieve predicates and the class hierarchy. Approximately 200 queries timed out. The suffix tree for DBpedia contains 43K strings (3K predicates and 40K literals) and is 400MB in size. There are around 21M literals not in the suffix tree, divided among 80 bins. We show in Section 7 that having even a small fraction of the literals in the suffix tree benefits performance.
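For illustration only, a query of the kind Sapphire might issue to estimate literal significance for one class and predicate (the vocabulary and the result limit are assumptions) could be:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# For each literal attached to an entity of the class, count the subjects that
# point to that entity; literals with high counts are the "significant" ones.
SELECT ?literal (COUNT(DISTINCT ?s) AS ?significance)
WHERE {
  ?o rdf:type dbo:Person .
  ?o foaf:name ?literal .
  ?s ?p ?o .
}
GROUP BY ?literal
ORDER BY DESC(?significance)
LIMIT 10000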
6. PREDICTIVE USER MODEL
The Predictive User Model (PUM) uses the data cached during initialization to help the user compose her SPARQL query. The user inputs a query to Sapphire by entering the triple patterns that describe the structure of the query. As the user types a subject, predicate, or object in a triple pattern, the PUM invokes the Query Completion Module (QCM) to provide suggestions for the user to complete the term being typed. When the user composes a full query and clicks "Run" in the Sapphire user interface, the PUM passes the query to the federated query processor for execution and also invokes the Query Suggestion Module (QSM) to suggest changes to the query. The QSM suggests changes to the query based on the structure of potential candidate answers in the dataset, in order to bring the user closer to the query that finds the answer she is looking for. The user can choose one of the suggestions of the QSM and update the query, and possibly continue editing it. Editing the query would invoke the QCM again, and the process repeats as many times as needed by the user. We present the QCM next, followed by the QSM.
Algorithm 1: Assign Tasks to Processes
  input:  Bins to search bins′, number of processes P
  output: Assigned task for each process
  n = Σ_{i=1..|bins′|} |bins′_i|        // number of literals to search
  d = n / P                             // capacity of each process
  pid = 0
  for i = 1 to |bins′| do
      j = |bins′_i|                     // literals remaining in bin i
      while j > 0 do
          if j < Process_pid.d then
              // Process pid is assigned all remaining literals in the bin
              Assign(Process_pid, [bins′_i[|bins′_i| − j], bins′_i[|bins′_i| − 1]])
              Process_pid.d = Process_pid.d − j
              j = 0
          else
              // Process pid is assigned its remaining capacity
              Assign(Process_pid, [bins′_i[|bins′_i| − j], bins′_i[|bins′_i| − j + Process_pid.d]])
              j = j − Process_pid.d
              Process_pid.d = 0
              pid = pid + 1
          end
      end
  end

The Sapphire user interface is organized so that the user inputs each subject, predicate, or object of a triple pattern in a separate text box, as shown in Figure 2. As the user types a string in one of these text boxes, the QCM is invoked every time the user types a character in order to provide auto-complete suggestions for the string being typed. The only exception is if the user enters a variable (i.e., a string starting with '?'), in which case Sapphire makes no suggestions.

Specifically, the problem solved by the QCM is as follows: given the string t entered thus far by the user, find k strings in the data that contain t to suggest to the user. In this paper, we use k = 10. Figure 5 shows how the QCM finds the required k strings. The term t is looked up in both the suffix tree and the residual bins. Matches in the suffix tree are returned to the user as soon as they are found. If the search in the suffix tree returns fewer than k matches, the remaining matches come from the residual bins. We assume that auto-complete suggestions are most useful if they are not much longer than the current input string t. Therefore, the QCM only searches bins containing literals of length |t| to |t| + γ, which reduces the cost of the sequential search. In this paper, we use γ = 10. When the search in residual bins completes, the shortest result literals are returned as part of the k auto-complete suggestions.

To ensure interactive response time, we parallelize the QCM's sequential search in the residual bins, utilizing P parallel processes (threads). Typically, P would equal the number of available cores on the Sapphire server. Each process searches one or more bins, and the QCM assigns work to processes in a way that balances load, with each process scanning an equal number of literals. Algorithm 1 shows the details of task assignment.
Algorithm 2: Suggesting Alternative Query Terms
  input:  Query q, predicate set PR, literal bins to search bins′, number of processes P
  output: Alternative queries Q′
  for each triple tr in q do
      for each non-variable element e in tr do
          if e is a predicate then
              S = Lemon.getLexica(e)                          // lexica for the term
              for each element s in S do
                  pa.add(FindPredicateAlternatives(s, PR, P)) // predicate alternatives
              end
          else
              la.add(FindLiteralAlternatives(e, bins′, P))    // literal alternatives
          end
      end
  end
  SortBySimilarityScore(pa)
  SortBySimilarityScore(la)
  for each alternative a in pa do
      Construct a new query q′
      PQ.add(q′)                                              // alternative queries for predicates
  end
  for each alternative a in la do
      Construct a new query q′
      LQ.add(q′)                                              // alternative queries for literals
  end
  Q′.add(TopQueriesWithAnswer(PQ, k/2))
  Q′.add(TopQueriesWithAnswer(LQ, k/2))
  return Q′

The QSM suggests alternative queries that are semantically close to the query issued by the user. The suggestions of the QSM are particularly important if the query issued by the user returns no answers, but they can be useful even if the query returns answers. Defining semantic closeness is an interesting question. In Sapphire, the QSM suggests changes to the query in two directions: (1) suggesting alternatives to the terms (predicates and literals) used in the query, and (2) relaxing the structure of the query.
Algorithm 2 shows how the QSM finds alternatives for predicates and literals in the user query. The basic idea is to find predicates and literals in the dataset that are similar to the ones in the query or to their lexica. The lexica provide knowledge about how properties, classes, and individuals are verbalized in natural language. For example, "wife" or "husband" can be verbalized by using "spouse" instead. The QSM examines the predicates and literals used in the triple patterns of the query one at a time. For each predicate p, the QSM first finds the lexica for the predicate (line 4). We use the DBpedia Lemon Lexicon [8, 26] to provide such lexica for the terms typed in by the user. The QSM then finds alternative predicates in the dataset whose similarity score with the original predicate p or its lexica exceeds a similarity threshold θ. In Sapphire, we use Jaro-Winkler (JW) similarity [9] to calculate the similarity between strings. JW similarity is based on the minimum number of single-character transpositions required to change one string into the other, while giving a more favorable score to strings that match from the beginning. This similarity measure outperforms other similarity measures in our context. In this paper, we use θ = 0.7.

For each literal l, the QSM considers the bins containing literals of length in the range [|l| − α, |l| + β] (termed bins′ in Algorithm 2). A search operation over these bins is conducted, similar to the search over bins in the QCM. The difference is that the search to find alternative literals is based on the JW similarity. All literals that have a similarity score ≥ θ are considered to be matches. We use the values α = 2 and β = 3. The lists of alternative predicates and literals are sorted based on the JW similarity score. Similar to the QCM, the QSM can parallelize finding alternative terms among P processes. The alternative terms are sorted based on their similarity scores, and a new SPARQL query is constructed for each of the alternative predicates and literals found by the QSM. Sapphire uses the federated query processor to execute the alternative queries and suggests the top queries that return answers.

Algorithm 3: Relaxing Query Structure
  input:  Query q
  output: Matching graphs G_suggested
  L = q.extractLiterals()                              // literals in the query
  for each literal l in L do
      seeds(l) = l ∪ Top k-la(l)                       // seed group: l and its top-k literal alternatives
  end
  Start with an empty graph g
  while g does not span terminals from all seed groups do
      Scan vertices using Dijkstra's bi-directional shortest-path algorithm
      Select a terminal x not in g that is closest to a vertex in g (initially any literal from the query)
      Add to g the shortest path that connects x with g
  end
  // There can be several subgraphs g spanning the terminals if multiple paths
  // with the same weight exist
  for each g found while connecting seeds do
      Construct the subgraph g′ induced by g in G
      Construct minimum spanning tree(s) of g′
      while there exist non-terminals of degree 1 in the spanning tree(s) do
          Remove non-terminals of degree 1 from the spanning tree
      end
      Add the minimum spanning tree(s) to G_suggested
  end
  return G_suggested

If the structure of the graph pattern specified by the user in the query is different from the structure of the queried dataset, the user will not find the desired answer, even if the predicates and literals in the query match the desired answer in the dataset.
Viking Press ?book writerpublisher
Jack Kerouacname
Big Surnamewriter type movielabel author
Viking Presslabelpublisherpublisherauthor namenameDoor Wide Open On The Road authornameDoctor Sax publisher labelGrove Press
Figure 6: Example query and the subgraph from the dataset that can be used to answer this query.
Jack KerouacnameViking Presslabel Jack Kerouacname xwriterauthorauthorauthor Viking Presslabelpublisherpublisher xBig Surnametype a.1 a.2 a.3b.1 b.2
No more expansions b.3
Figure 7: The expansion steps in the relaxation process.

Therefore, the QSM suggests changes to relax the structure of the query (i.e., make it less constrained) based on the structure of the dataset.

Figure 6 shows a motivating example. The query in this example is syntactically correct (top left box), and it aims to find books by "Jack Kerouac" that were published by "Viking Press". The figure shows part of the graph of the queried dataset. The predicates and literals of the query can be found in the dataset, and the matches are shown in the figure as dotted lines and rectangles. The figure also shows two answers that satisfy the query requirements, and the path that connects them in bold ("Door Wide Open" and "On the Road"). These answers will not be found by the query as posed by the user since the query structure does not match the structure of the data (the dotted matches are not connected). Relaxing the query structure can solve this problem by bringing the structure of the query closer to the structure of the dataset.

In Sapphire, we assume that it is easier for the user to identify correct literals than to identify correct query structure. Thus, we define the goal of query relaxation to be connecting literals in the query (or similar literals found by the JW similarity search) through valid paths in the graph of the dataset. Ideally, the paths should be short and the algorithm should prefer paths that include the predicates entered by the user as part of the query. We observe that connecting the literals in the query can be formulated as a
Steiner tree problem [17], and that favoring paths that include certain predicates can be achieved by modifying the weights on the edges of the graph.

The Steiner tree problem is defined as follows. In an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, and each edge e_ij connecting vertices (i, j) has a weight w_ij, the Steiner tree problem is finding a minimum-weight tree that spans a subset of terminal vertices (literals in our case) T ⊂ V. If T = V, the problem is reduced to a minimum spanning tree problem. If |T| = 2, the problem is reduced to a shortest path problem. However, when 2 < |T| < |V|, finding a minimum-weight tree is NP-hard.

We associate a weight with each edge in the graph of the dataset. These weights can be inferred by the algorithm and do not need to be materialized. For an edge representing a predicate that matches one of the predicates in the query, or one of the predicates identified by the process in Section 6.2.1 as an alternative query term for a predicate in the query, this weight is w_q. For any other edge, the weight is w_default > w_q. Since the Steiner tree algorithm aims to find the tree with the minimum overall weight, assigning weights in this manner favors matching the predicates in the query (or alternatives to these predicates) over simply finding a tree with a small number of edges.

Since finding the Steiner tree is an NP-hard problem, we need an efficient approximate algorithm. Moreover, traditional Steiner tree algorithms, whether exact or approximate, require fast access to any vertex or edge in the graph, whereas in our case the graph exists on remote endpoints and can be accessed only through SPARQL queries. Our algorithm must minimize the number of such queries. We describe next (1) the literals to be connected via the Steiner tree algorithm, and (2) the algorithm that we use to connect these literals.

In the previous section, we described how we generate alternative query terms for the literals in the query (line 9 in Algorithm 2). Each literal in the query and the alternative terms generated for it form a group, and we refer to the vertices representing these literals in the RDF graph as the seeds of the QSM for exploring the graph. For example, "Viking Press", "The Viking Press", and "The Viking" are all seeds in the same group. The goal of our algorithm is to create a Steiner tree that connects one literal from each group. It is not useful to connect multiple literals from the same group since these literals are alternatives to each other and not meant to be used together in the same query.

To connect the literals efficiently, our algorithm expands the graph starting from the seeds until the groups are all connected, and it attempts to minimize the number of vertices visited in this expansion. We use a known Steiner tree approximation algorithm and we adapt it for our use case [16]. The details are presented in Algorithm 3, and consist of the following two steps:
1. Connecting seeds:
The goal of this step is to find a tree, not necessarily minimal, that connects all groups. Initially, each seed is a candidate subgraph of the RDF graph, and the candidate subgraphs are expanded using the bi-directional Dijkstra shortest path algorithm [15]. In this algorithm, seeds from different groups take turns in expansion rather than choosing a single source seed from which to start the expansion. In practice, this approach visits (expands) fewer vertices than the regular Dijkstra shortest path algorithm, which means fewer SPARQL queries. The expansion continues until paths are found that connect seeds from all groups.

In the expansion, each vertex v in a candidate subgraph is expanded into a subgraph subG defined as follows: (1) subG = { (?s, ?p, ?o) | ?o = v } if v is a literal, and (2) subG = { (?s, ?p, ?o) | ?s = v ∨ ?o = v } if v is a URI. That is, if the vertex is a literal (initially, all vertices are literals), the subgraph is expanded by finding all triples that have this literal as an object, since literals can only be objects. Each of these triples introduces a new edge (the predicate) and vertex (the subject) to the candidate subgraph. If a vertex is a URI, the subgraph is expanded by finding all triples that have this vertex as a subject or an object. As in the case of literal vertices, each of these triples introduces a new vertex to the candidate subgraph (the subject of the triple if the expanded vertex is the object, and the object if the expanded vertex is the subject). The edge connecting the new vertex to the expanded vertex is the predicate of the triple. These expansion steps are expressed as SPARQL queries executed on the endpoint of the dataset.

The algorithm expands candidate subgraphs according to the bi-directional Dijkstra algorithm until it finds a shortest path that connects two seeds from different groups. This path becomes the graph g that will be used to find the tree connecting all the groups. The expansion of other candidate subgraphs continues according to the bi-directional Dijkstra algorithm, and whenever the expansion of a candidate subgraph results in connecting to g a seed from a group that is not yet part of g, the path that connects this seed to g is added to g. The expansion stops when there is a set of connected seeds, one from each group. Recall that we assign lower weights to the edges in the data matching predicates in the query or similar predicates. This guides the bi-directional Dijkstra algorithm towards expanding paths that match query predicates first, and consequently reduces the number of queries required to find a tree that matches the query predicates.

We provide the expansion algorithm with a budget for the number of queries that can be used. In order to remain within the budget, the expansion of sibling vertices that are chosen for expansion does not start if the number of siblings is larger than the remaining query budget. This restriction discourages the expansion of vertices with a high branching factor, with the hope that this candidate subgraph's seed can be reached by another seed from a different group. We use a budget of 100 SPARQL queries for graph expansion, and we found that this gives us good response time for query suggestion.
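The expansion queries themselves are small; the following sketches show the two cases, using the "Jack Kerouac" literal from the Figure 6 example and a placeholder URI (the language tag and the placeholder URI are assumptions):

# Expanding a literal vertex: literals can only appear as objects.
SELECT ?s ?p
WHERE { ?s ?p "Jack Kerouac"@en . }

# Expanding a URI vertex (<http://example.org/e> stands in for the vertex being expanded).
SELECT ?x ?p
WHERE {
  { <http://example.org/e> ?p ?x . }
  UNION
  { ?x ?p <http://example.org/e> . }
}

Each result row contributes one new edge and one new vertex to the candidate subgraph, as described above.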
While expanding the candidate subgraphs, the results of the expansion are memoized so that if a vertex is encountered more than once during expansion, the results will be obtained from the memoized data structure without issuing a new SPARQL query.

Figure 7 shows how the vertices in the example are expanded starting from the seeds in the query. Common vertices between candidate subgraphs are lightly shaded. All the edges have a cost of w_default except for "writer" and "publisher", which have a cost of w_q. Therefore, "writer" is chosen to be expanded in step a.3. However, this vertex will not be further expanded because the expansion did not result in any common vertices with the subgraph of the other literal in the query. Therefore, it is not possible that further expansion will find a shorter path than the one already found.
2. Constructing the minimum tree:
After the expansion step, we construct a graph G consisting of the union of all expansions. For each g found during expansion, we construct a subgraph g′ which is the graph induced by g in G. That is, g′ is a graph whose vertices are the same as g and whose edges are the edges in G such that both ends of the edge are vertices in g. Next, a minimum spanning tree is constructed for subgraph g′. Multiple minimum spanning trees may exist and be generated in this step. Finally, all non-terminal vertices that have a degree of 1 are repeatedly deleted from the minimum spanning tree(s) since they cannot be part of the Steiner tree. There could be several Steiner trees with the same total edge weight. Each tree is an alternative query suggested to the user. The approximation ratio of this algorithm is known to be 2 − 2/s [16], where s is the number of seeds in the query.

Performance: Unlike the QCM, which should have sub-second latency to provide suggestions while the user types, the QSM can have a latency of a few seconds. That is, after the user submits a query, she will see alternative, complete, and syntactically correct suggested SPARQL queries after waiting a few seconds. In querying LOD data using SPARQL, a query will likely have a small number of literals (in our user study, the maximum number of literals in a query was 3). Our algorithm is fast enough for such problem sizes to guarantee a QSM response time of less than 10 seconds on average.
7. EVALUATION
We evaluate Sapphire along the following dimensions: (1) a user study in which participants answer questions using a natural language QA system and Sapphire (Section 7.1); (2) a quantitative comparison with recent natural language, approximate query, and query-by-example systems (Section 7.2); and (3) an analysis of the response time of the QCM and QSM modules (Section 7.3).

Sapphire is implemented in Java. It runs as a web application over a web server. The user interacts with Sapphire through a web browser as described in [13]. We use a publicly available implementation of the suffix tree construction algorithm [23], FedX [22] as the federated query processor, and the lemon lexicon for DBpedia (http://github.com/ag-sc/lemon.dbpedia), which can also be used for other datasets. We use DBpedia in all our experiments and we interact with it via its SPARQL endpoint (http://dbpedia.org/sparql). DBpedia is a good evaluation dataset because it is large and it is the central and most connected multi-domain dataset in the LOD cloud (http://lod-cloud.net). We run our experiments on a machine with an 8-core Intel i7 CPU at 2.6 GHz and 8GB of memory. The memory usage of Sapphire to query DBpedia never exceeds 4GB.

The most important question related to Sapphire is whether it actually helps users find answers in RDF datasets. To answer this question, we conducted a user study in which users are presented with a set of questions they need to answer using both Sapphire and QAKiS [7], a natural language question answering system that performs well compared to the other natural language systems (see Section 7.2).

The questions in our study are a subset of the query set from the Schema-agnostic Queries Semantic Web Challenge [2]. These queries are questions over DBpedia derived from the Question Answering over Linked Data (QALD) competition (http://qald.sebastianwalter.org). We started with 35 questions and divided them into three difficulty categories (easy, medium, and difficult). Each of the authors of this paper independently labeled each question as easy, medium, or difficult. Out of the 35 questions, the authors all agreed on the difficulty level of 27 questions, and we used these questions in the user study. These queries are available in Appendix B.
Figure 8: Success rate of answering questions.
Figure 9: Percentage of questions answered by at least one participant.

We recruited 16 participants who have a computer science background but are not familiar with RDF or SPARQL. Each participant was given 10 questions (4 easy, 3 medium, and 3 difficult). The questions were randomly assigned to participants per category. We asked the participants to find answers to all the questions using both Sapphire and QAKiS. Since the two systems are fundamentally different in the way they are used, using one system to find an answer should have minimal effect on how the other system is used. However, we alternated the system the user used first for every question. For example, if the participant answers one question using Sapphire first then QAKiS, the next question is answered using QAKiS first then Sapphire. One question from the easy category was used in a tutorial prior to the study to demonstrate the two systems to the users (the same question for all participants). During the study, the first question a participant tried (from the easy category) was used as a warm-up question to familiarize the user with the two systems. The data we collected for this first question is dropped from the results. We used screen recording to capture the sessions of all participants.
We first investigate whether Sapphire helped users find answers to their assigned questions, and how it compares to QAKiS. A total of 48 questions in each category was given to the participants in this study (16 participants × 3 questions per category).
Figure 10: Average number of attempts before finding an answer.
Figure 11: Average time spent on answered queries.

Figure 8 shows the success rate of answering questions in each category using the two systems. We also report the 95% confidence interval to demonstrate that the findings are consistent among all participants. Whenever we observed a noticeable difference between Sapphire and QAKiS (in all experiments), we calculated the p-value and found it to be less than the significance level (0.05), which indicates that these differences are statistically significant. The figure shows that Sapphire is superior to QAKiS in the medium and difficult categories, while both systems perform the same for the easy category. Participants found answers for over 80% of the medium difficulty questions using Sapphire, compared to around 50% using QAKiS. The gap widens for the difficult category, where participants answered almost 80% of the questions using Sapphire and only 35% using QAKiS.

The success rate does not tell the full story since some questions are easier than others and some users are better at answering questions than others, regardless of the difficulty category or the system used. Another way to compare the two systems is to see, for every question, whether that question was answered by any participant. Figure 9 shows the percentage of questions answered by at least one participant using both systems. The figure shows that every question was answered by at least one participant using Sapphire, while QAKiS could find answers for only 63% of the questions in the medium category and 30% in the difficult category.

Figure 10 shows the average number of attempts the participants went through before finding an answer in each category. An attempt is counted when a participant clicks "Run" to issue a query. Sapphire requires slightly more attempts than QAKiS, but the numbers are comparable and not statistically significant (p-value > 0.05).

After each session, we surveyed the participants about their experience using Sapphire and how it compares to QAKiS. The comments we received are consistent across participants: at first, they find it difficult to express the question using triple patterns (due to the lack of experience in RDF) but are still able to answer the questions. However, when they get used to this style of querying, Sapphire becomes much easier to use. They also agree that Sapphire is much more helpful than QAKiS in answering more difficult questions.

Another observation we make from viewing the recorded sessions is that different participants answering the same question sometimes take different approaches and use different terms, but end up with the same SPARQL query. In other cases, different participants end up with different queries to find the same answer. For example, some participants rank results by a condition and select the correct answers while others include the condition in the triple patterns of the query. This demonstrates the flexibility and effectiveness of Sapphire.

For another qualitative perspective on Sapphire, we recruited two SPARQL experts, one with no experience in querying DBpedia and the other with three years of experience. The two participants were asked to write SPARQL queries to find answers to the 48 questions used in the user study, with and without Sapphire. Without Sapphire, i.e., interacting directly with the SPARQL endpoint of DBpedia, the first participant was unable to answer any of the questions because he does not know how the DBpedia URIs are represented and what kind of vocabulary is used in it. When using Sapphire, he was able to find answers to most questions.
The participant with three years of experience with DBpedia answered most questions. Sapphire did help him answer the questions for which he failed to find answers using DBpedia's SPARQL endpoint. Both experts agreed on Sapphire's value in helping users write SPARQL queries against data sources they are less familiar with, and expressed interest in using Sapphire for their future projects.
In this section, we compare Sapphire to other state-of-the-art systems for querying RDF data. We compare to the systems participating in the QALD-5 competition [25].

  System            pro    %     ri    par   R      R*     P      P*     F      F*
  Xser [28]         42     84%   26    7     0.52   0.66   0.62   0.79   0.57   0.72
  APEQ [25]         26     52%   8     5     0.16   0.26   0.31   0.50   0.21   0.34
  QAnswer [21]      37     74%   9     4     0.18   0.26   0.24   0.35   0.21   0.30
  SemGraphQA [6]    31     62%   7     3     0.14   0.20   0.23   0.32   0.17   0.25
  YodaQA [25]       33     40%   8     2     0.16   0.20   0.24   0.30   0.19   0.24
  QAKiS [7]         40     80%   14    9     0.28   0.46   0.35   0.58   0.31   0.51
  KBQA [10]         8      16%   8     0     0.16   0.16
  S [31]            26     52%   16    5     0.32   0.42   0.62   0.81   0.42   0.55
  SPARQLByE [11]    7      14%   4     0     0.08   0.08   0.57   0.57   0.14   0.14
  Sapphire          43     86%   43    0
Table 1: Comparing systems using questions from QALD-5.

In addition, we also compare to (a) QAKiS, which is used in our user study, (b) the more recent natural language QA system KBQA [10], (c) the recent approximate query matching system S [31], and (d) the recent query-by-example system SPARQLByE [11]. We use the questions from QALD-5 (50 questions) and the performance measures used in this competition. We copy the performance numbers of the systems that participated in QALD-5 and of KBQA from [10]. We obtain performance numbers ourselves for QAKiS and SPARQLByE, both of which are publicly available. We also obtained performance numbers ourselves for S, which we implemented.

QAKiS is a natural language QA system, and we allow up to 3 attempts for each question. In these attempts, we do not change the query terms using our knowledge of the vocabulary. For example, the question "What is the revenue of IBM?" can be paraphrased in a different attempt as "IBM's revenue", but we would not change it to "IBM's income".

S constructs a summary graph of the data in an offline step, and accepts SPARQL queries that it rewrites to match the structure of the data based on the summary graph. S expects the predicates and literals to be correct, so we use Sapphire to help us find predicates and literals that exist in DBpedia when constructing the SPARQL query for S. We compose the SPARQL query for S based on the question in QALD-5, restricting ourselves to the terms in the question. S rewrites the query and we execute the rewritten query using FedX.

SPARQLByE requires the user to provide example answers. The system attempts to learn the commonalities between these answers and capture them in a SPARQL query. The answers of this SPARQL query are presented to the user as additional candidate answers, and the user can mark them as correct or incorrect. SPARQLByE requires at least two sample answers, so we use it for questions that have three answers or more in their gold standard result. We present two answers from the gold standard result as inputs to SPARQLByE, and we provide feedback to the system until it finds the correct query or cannot learn any more (i.e., cannot modify the query).

When using Sapphire, we only use terms from the question to enter the query, as we did with other systems. We then use Sapphire's suggestions to complete and modify the query until an answer is found. We do not use our knowledge of the vocabulary to change the terms or query structure.

The systems are evaluated using the following performance measures [25, 10]: (1) the number of questions that are processed and for which answers are found (pro); (2) the number of questions whose answers are correct (ri); and (3) the number of questions whose answers are partially correct (par). In addition, the following recall and precision measures are computed, where total is the total number of questions in the question set: recall R = ri / total, partial recall R* = (ri + par) / total, precision P = ri / pro, partial precision P* = (ri + par) / pro, F = 2·P·R / (P + R), and F* = 2·P*·R* / (P* + R*).

Table 1 shows the performance of the different systems. The table shows that Sapphire outperforms all other systems on all measures. Natural language QA systems suffer from low precision due to the challenge of inferring the structure and terms of a SPARQL query from the natural language formulation of the question. This challenge is not faced by Sapphire, which helps the user to directly construct SPARQL queries.
Therefore, Sapphire has a precision of 1.0 for the questions it is able to answer. Among the natural language systems, KBQA has a precision of 1.0 like Sapphire, but it has much lower recall. This is because KBQA focuses only on factoid questions. If only the factoid questions are considered, KBQA achieves a recall of 0.67, still lower than Sapphire's. S, while lower in performance than Sapphire, performs better than the other systems. SPARQLByE has much lower recall than the other systems because it cannot answer most of the questions.

The table justifies our choice of QAKiS as a representative QA system in our user study. Other than Xser and S, QAKiS is the best performing system after Sapphire in terms of recall and F-measure. Xser is not publicly available. S requires exact knowledge of the literals and URIs in the queried dataset, which we deem too difficult for a user study.

It is important for the QCM to provide auto-complete suggestions with very low response time in order to guarantee an interactive experience for users. We measure the response time of the QCM in the user study. Two components contribute to the response time of the QCM: the lookup in the suffix tree, and the sequential search in the bins of literals. We have found that the total response time of these two components is, on average, 0.16 seconds when including 40K significant literals in the suffix tree and using 8 cores for the sequential search in the residual bins. This response time is low enough to provide a good interactive experience.

We now study the two components of this response time in more detail. We have found that a lookup operation in the suffix tree takes approximately 0.25 milliseconds, regardless of the number of literals that are indexed. This response time is certainly low enough for an interactive user experience. Recall that matches in the suffix tree are returned immediately to the user before the search in the bins of literals begins. Thus, having a hit (match) in the suffix tree greatly enhances the interactive experience, since the user sees auto-complete suggestions very quickly. Even if these suggestions are not chosen by the user, they still give an impression of a responsive system. Therefore, a higher hit ratio in the suffix tree is better for the interactive response of the QCM. The hit ratio (the fraction of query terms for which a match is found in the suffix tree) depends on the number of literals included in the suffix tree. Our experiments show that even with only 40K literals in the suffix tree, we achieve a hit ratio of 50%.

The second component of the QCM response time is the sequential search in the literal bins. Recall that the bins to be searched are filtered based on the length of the term entered by the user. We have found that, on average, this filtering eliminates 46% of the literals to be searched. The search in the residual bins takes 0.6 seconds when using 1 core, and 0.16 seconds when using 8 cores. The takeaway of this experiment is that the QCM can provide interactive response time by utilizing more cores.
The logs of our user study indicate that participants used the suggestions of the QSM in over 90% of the questions. Users utilized alternative predicates in 28% of the questions, alternative literals in 17% of the questions, and relaxed query structure in 67% of the questions. This demonstrates the crucial role the QSM plays in guiding the user towards correctly describing her questions. The QSM spends around 10 seconds on average before returning suggestions to the user. This is acceptable since the QSM does not interact with the user while she is typing. Instead, the user waits for suggestions from the QSM, and a 10-second wait is reasonable.
8. CONCLUSION
In this paper, we introduced Sapphire, a tool that helps users construct SPARQL queries that find the answers they need in RDF datasets. Sapphire caches data from the datasets to be queried and uses this cached data to suggest completions for SPARQL queries as the user is entering them, and modifications to these queries after they are executed. We have shown Sapphire to be effective at helping users with no prior knowledge of the queried datasets answer complex questions that other systems fail to answer. As such, Sapphire is a valuable tool for querying the LOD cloud.
9. REFERENCES
[1] RDF Schema 1.1.
[2] Schema-agnostic queries over large-schema databases. http://sites.google.com/site/eswcsaq2015/.
[3] SPARQL 1.1 query language.
[4] M. Arenas, G. I. Diaz, and E. V. Kostylev. Reverse engineering SPARQL queries. In Proceedings of the International World Wide Web Conference (WWW), 2016.
[5] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the International Semantic Web Conference (ISWC), 2007.
[6] R. Beaumont, B. Grau, and A.-L. Ligozat. SemGraphQA@QALD5: LIMSI participation at QALD5@CLEF. In CLEF Working Notes Papers, 2015.
[7] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In Proceedings of the International Semantic Web Conference (ISWC), 2012.
[8] P. Cimiano, C. Unger, and J. McCrae. Ontology-based interpretation of natural language. Synthesis Lectures on Human Language Technologies, 7(2), 2014.
[9] W. W. Cohen, P. Ravikumar, S. E. Fienberg, et al. A comparison of string distance metrics for name-matching tasks. In IJCAI Workshop on Information Integration on the Web (IIWeb-03), 2003.
[10] W. Cui, Y. Xiao, H. Wang, Y. Song, S.-w. Hwang, and W. Wang. KBQA: Learning question answering over QA corpora and knowledge bases. Proceedings of the VLDB Endowment (PVLDB), 10(5), 2017.
[11] G. Diaz, M. Arenas, and M. Benedikt. SPARQLByE: Querying RDF data by example (demo). Proceedings of the VLDB Endowment (PVLDB), 9(13), 2016.
[12] X. Dong and A. Halevy. Indexing dataspaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007.
[13] A. El-Roby, K. Ammar, A. Aboulnaga, and J. Lin. Sapphire: Querying RDF data made simple (demo). Proceedings of the VLDB Endowment (PVLDB), 2016.
[14] A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
[15] A. V. Goldberg and R. F. F. Werneck. Computing point-to-point shortest paths from external memory. In ALENEX/ANALCO, 2005.
[16] F. K. Hwang, D. S. Richards, and P. Winter. The Steiner Tree Problem, volume 53. 1992.
[17] R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations. 1972.
[18] C. Kiefer, A. Bernstein, and M. Stocker. The fundamentals of iSPARQL: A virtual triple approach for similarity-based semantic web tasks. In Proceedings of the International Semantic Web Conference (ISWC), 2007.
[19] K. J. Kochut and M. Janik. SPARQLeR: Extended SPARQL for semantic association discovery. In Proceedings of the European Semantic Web Conference (ESWC), 2007.
[20] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: supporting users in querying and exploring the semantic web. Semantic Web, 3(3), 2011.
[21] S. Ruseti, A. Mirea, T. Rebedea, and S. Trausan-Matu. QAnswer: enhanced entity matching for question answering over linked data. In CLEF Working Notes Papers, 2015.
[22] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: Optimization techniques for federated query processing on linked data. In Proceedings of the International Semantic Web Conference (ISWC), 2011.
[23] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 1995.
[24] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the International World Wide Web Conference (WWW), 2012.
[25] C. Unger, C. Forascu, V. Lopez, A.-C. N. Ngomo, E. Cabrio, P. Cimiano, and S. Walter. Question answering over linked data (QALD-5). In CLEF Working Notes Papers, 2015.
[26] C. Unger, J. McCrae, S. Walter, S. Winter, and P. Cimiano. A lemon lexicon for DBpedia. In Proceedings of the International Conference on NLP & DBpedia, 2013.
[27] P. Weiner. Linear pattern matching algorithms. In Annual Symposium on Switching and Automata Theory, 1973.
[28] K. Xu, S. Zhang, Y. Feng, and D. Zhao. Answering natural language questions via phrasal semantic parsing. In CLEF Working Notes Papers, 2014.
[29] M. Yahya, K. Berberich, S. Elbassuoni, and G. Weikum. Robust question answering over the web of linked data. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), 2013.
[30] S. Yang, Y. Wu, H. Sun, and X. Yan. Schemaless and structureless graph querying. Proceedings of the VLDB Endowment (PVLDB), 7(7), 2014.
[31] W. Zheng, L. Zou, W. Peng, X. Yan, S. Song, and D. Zhao. Semantic SPARQL similarity search over RDF knowledge graphs. Proceedings of the VLDB Endowment (PVLDB), 9(11), 2016.
APPENDIX
A. INITIALIZATION QUERIES
This appendix presents the SPARQL queries used in initializing Sapphire. Remote SPARQL endpoints typically allocate limited resources to the queries they serve. A long-running query that is expected to consume a lot of resources may be rejected by the remote endpoint, and even if it is accepted, it is likely to time out. Therefore, the initialization queries of Sapphire are broken down into multiple queries that are less resource-intensive and therefore less likely to time out. These queries are as follows.

1. Finding predicates sorted by their frequency (not a resource-intensive query):
Q1) SELECT DISTINCT ?p (COUNT(*) AS ?frequency)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?frequency)
2. Finding literals and most significant literals: The queries used to find literals need to be carefully structured to minimize their execution time and the chances of timing out. The key to achieving this goal is increasing the selectivity of the query. We focus on two common characteristics of RDF data that are relevant to Sapphire: (1) entities are associated with RDF types or schema classes, and (2) literals of interest to Sapphire are associated with a limited set of predicates.

Some datasets are well-structured and have a hierarchy of RDF schema classes, with each entity in the dataset belonging to a class. This is the case for most of the datasets that we encountered on the LOD cloud. We can exploit this characteristic by restricting the retrieval of literals to part of the class hierarchy. The following query finds all classes and their subclasses in a dataset:
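A minimal sketch of this class-hierarchy query (Q2), assuming the hierarchy is expressed with the standard rdfs:subClassOf predicate and that only the (class, subclass) pairs are needed to build Sapphire's class tree, could be:

Q2) PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?class ?subclass
WHERE {
  # Sketch: assumes the dataset expresses its hierarchy via rdfs:subClassOf;
  # each such triple links a subclass to its parent class.
  ?subclass rdfs:subClassOf ?class.
}

The resulting pairs can then be assembled into the class tree that the later queries traverse from the root downwards.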
For datasets that do not have an RDF schema class hierarchy, we can instead exploit RDF types, the most used property in the LOD cloud (see http://stats.lod2.eu/properties). The following query is used to find all types in the dataset sorted by their frequency:

Q3) SELECT DISTINCT ?o (COUNT(?s) AS ?frequency)
WHERE { ?s a ?o. }
GROUP BY ?o
ORDER BY DESC(?frequency)
In both cases, the following query is used to find predicates sorted by the number of associations to literals:
Q4) SELECT DISTINCT ?p (COUNT(?o) AS ?frequency)
WHERE {
  ?s ?p ?o.
  FILTER (isLiteral(?o))
}
GROUP BY ?p
ORDER BY DESC(?frequency)
The top k of these predicates are filtered based on whether they satisfy the filtering conditions on the language of the literals they are associated with and on the length of these literals. This filtering is done by issuing the following query multiple times, once for each predicate. The placeholder $PREDICATE$ is replaced with the current predicate being queried:

Q5) SELECT DISTINCT ?o
WHERE {
  ?s $PREDICATE$ ?o.
  FILTER (isLiteral(?o) && lang(?o) = 'en' && strlen(str(?o)) < 80)
}
LIMIT 1

After issuing these queries to retrieve and filter predicates, if the dataset uses RDF schema classes, Sapphire constructs the tree representing the class hierarchy. Starting from the root of this tree, the following query is issued to find whether literals associated with entities of a certain class (type) $TYPE$ via a predicate $PREDICATE$ can be found. This query is issued iteratively, iterating over all classes and predicates:

Q6) SELECT DISTINCT ?o
WHERE {
  ?s a $TYPE$.
  ?s $PREDICATE$ ?o.
  FILTER (isLiteral(?o) && lang(?o) = 'en' && strlen(str(?o)) < 80).
}
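To make the template substitution concrete, the following is a hypothetical instantiation of Q6 against DBpedia; the class dbo:Scientist and the predicate foaf:name are illustrative choices only, not necessarily terms that Sapphire would select:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?o
WHERE {
  ?s a dbo:Scientist.    # $TYPE$ instantiated with an illustrative class
  ?s foaf:name ?o.       # $PREDICATE$ instantiated with an illustrative predicate
  FILTER (isLiteral(?o) && lang(?o) = 'en' && strlen(str(?o)) < 80).
}

This instance retrieves the English names of scientists, capped at 80 characters, as candidate literals.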
If a query on the class $TYPE$ times out, queries over the subclasses of this class are issued. If the query succeeds and returns an answer, then issuing the same queries over the subclasses is redundant.

In the case of datasets that do not use an RDF schema class hierarchy, we need a different way to reduce the query result size. For this, we use LIMIT and OFFSET. Specifically, we issue the following query multiple times, iterating over $TYPE$ and $PREDICATE$, and using LIMIT and OFFSET to paginate the answers so that the query does not time out:
Q7) SELECT DISTINCT ?o
WHERE {
  ?s a $TYPE$.
  ?s $PREDICATE$ ?o.
  FILTER (isLiteral(?o) && lang(?o) = 'en' && strlen(str(?o)) < 80).
}
LIMIT $LIMIT$
OFFSET $OFFSET$
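To illustrate the pagination, the first page of Q7 would be fetched as follows, with $TYPE$ and $PREDICATE$ still standing for the current class and predicate; the page size of 10000 is a hypothetical value, and the exact stopping rule is an implementation choice not specified by the query itself:

# First page, using a hypothetical page size of 10000
SELECT DISTINCT ?o
WHERE {
  ?s a $TYPE$.
  ?s $PREDICATE$ ?o.
  FILTER (isLiteral(?o) && lang(?o) = 'en' && strlen(str(?o)) < 80).
}
LIMIT 10000
OFFSET 0
# Later pages reuse the same query with OFFSET 10000, 20000, and so on
# for the current ($TYPE$, $PREDICATE$) pair.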
Finally, we need to find the most significant literals. The following query template is used for this, and it is issued iteratively, similar to Q7:
Q8) SELECT DISTINCT ?o (COUNT(?subject) AS ?frequency)
WHERE {
  ?s a $TYPE$.
  ?subject ?p ?s.
  ?s $PREDICATE$ ?o.
  FILTER (lang(?o) = 'en' && strlen(str(?o)) < 80)
}
GROUP BY ?o
ORDER BY DESC(?frequency)
LIMIT $LIMIT$
OFFSET $OFFSET$
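Note what Q8 counts: for each literal ?o attached via $PREDICATE$ to an entity ?s of type $TYPE$, COUNT(?subject) counts the incoming triples ?subject ?p ?s, so literals of heavily referenced entities rank highest. As a hypothetical instantiation on DBpedia (dbo:City, rdfs:label, and the limit and offset values are illustrative choices only):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?o (COUNT(?subject) AS ?frequency)
WHERE {
  ?s a dbo:City.           # $TYPE$ instantiated with an illustrative class
  ?subject ?p ?s.          # triples that reference the city
  ?s rdfs:label ?o.        # $PREDICATE$ instantiated with an illustrative predicate
  FILTER (lang(?o) = 'en' && strlen(str(?o)) < 80)
}
GROUP BY ?o
ORDER BY DESC(?frequency)
LIMIT 10000
OFFSET 0

Under this instantiation, the English labels of the most frequently referenced cities would appear at the top of the result.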
Recall that $PREDICATE$ is associated with literals. Therefore, the isLiteral filter is not added, and only the filters on language and length are used.

Much of the complexity of the above queries is there to avoid timeouts at the remote endpoints. This is important when using Sapphire in a federated architecture. Recall that Sapphire can also be used in a warehousing architecture, where all the datasets are stored locally on the same server as Sapphire. In the warehousing architecture, no limitations are placed on querying the dataset, e.g., there are no resource constraints and no timeouts. This makes finding literals much simpler, since we can issue long-running SPARQL queries without worrying about timeouts.

Specifically, the following query can be used to find literals filtered by length and language in the warehousing architecture ($LIMIT$ and $OFFSET$ can still be used to restrict the number of results returned, if needed):
Q9) SELECT DISTINCT ?o
WHERE {
  ?s ?p ?o.
  FILTER (isLiteral(?o) && lang(?o) = 'en' && strlen(str(?o)) < 80)
}
GROUP BY ?o
LIMIT $LIMIT$
OFFSET $OFFSET$
The following query finds the most significant literals in the warehousing architecture if there are no timeout constraints (again, with $LIMIT$ and $OFFSET$ if needed):
Q10) SELECT DISTINCT ?o (COUNT(?s1) AS ?frequency)
WHERE {
  ?s1 ?p ?s2.
  ?s2 ?p2 ?o.
  FILTER (isLiteral(?o) && lang(?o) = 'en' && strlen(str(?o)) < 80)
}
GROUP BY ?o
ORDER BY DESC(?frequency)
LIMIT $LIMIT$
OFFSET $OFFSET$
B. QUERIES USED FOR USER STUDY
B.1 Easy Queries
1. Country in which the Ganges starts
2. John F. Kennedy's vice president
3. Time zone of Salt Lake City
4. Tom Hanks's wife
5. Children of Margaret Thatcher
6. Currency of the Czech Republic
7. Designer of the Brooklyn Bridge
8. Wife of U.S. president Abraham Lincoln
9. Creator of Wikipedia
10. Depth of Lake Placid
B.2 Medium Queries
1. Instruments played by Cat Stevens
2. Parents of the wife of Juan Carlos I
3. U.S. state in which Fort Knox is located
4. Person who is called Frank The Tank
5. Birthdays of all actors of the television show Charmed
6. Country in which the Limerick Lake is located
7. Person to which Robert F. Kennedy's daughter is married
8. Number of people living in the capital of Australia

B.3 Difficult Queries