Compacting Frequent Star Patterns in RDF Graphs

Abstract

Knowledge graphs have become a popular formalism for representing entities and their properties using a graph data model, e.g., the Resource Description Framework (RDF). An RDF graph comprises entities of the same type connected to objects or other entities using labeled edges annotated with properties. RDF graphs usually contain entities that share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. When the number of these entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns. We address the problem of identifying frequent star patterns in RDF graphs and devise the concept of factorized RDF graphs, which denote compact representations of RDF graphs where the number of frequent star patterns is minimized. We also develop computational methods to identify frequent star patterns and generate a factorized RDF graph, where compact RDF molecules replace frequent star patterns. A compact RDF molecule of a frequent star pattern denotes an RDF subgraph that instantiates the corresponding star pattern. Instead of having all the entities matching the original frequent star pattern, a surrogate entity is added and related to the properties of the frequent star pattern; it is linked to the entities that originally match the frequent star pattern. Since the edges between the entities and the objects in the frequent star pattern are replaced by edges between these entities and the surrogate entity of the compact RDF molecule, the size of the RDF graph is reduced. We evaluate the performance of our factorization techniques on several RDF graph benchmarks and compare with a baseline built on top of gSpan, a state-of-the-art algorithm to detect frequent patterns. The results show the efficiency of the proposed approach: our techniques reduce the execution time of the baseline approach by at least three orders of magnitude. Additionally, the RDF graph size can be reduced by up to 66.56% while the data represented in the original RDF graph is preserved.


Farah Karim

Leibniz University of Hannover, Germany

Mirpur University of Science and Technology (MUST), Mirpur-10250 (AJK), Pakistan


Maria-Esther Vidal

TIB Leibniz Information Centre for Science and Technology, Hannover

Leibniz University of Hannover, Germany


Sören Auer

TIB Leibniz Information Centre for Science and Technology, Hannover

Leibniz University of Hannover, Germany


March 12, 2020

Keywords: Semantic Web, RDF Compaction, Linked Data, Knowledge Graph.

Knowledge graphs have gained momentum as flexible and expressive structures for representing not only data and knowledge but also actionable insights [28]; they provide the basis for effective and intelligent applications. Currently, knowledge graphs are utilized in diverse domains, e.g., DBpedia [19], Google Knowledge Graph [26], and KnowLife [12]. The Resource Description Framework (RDF) [18] has been adopted as a formalism to represent knowledge graphs; in fact, as of 2019 more than 1,200 RDF knowledge graphs are available in the Linked Open Data cloud [6] (https://lod-cloud.net/). RDF models knowledge in the form of graphs where nodes represent entities; connections between entity nodes represent RDF triples composed of subject, property, and object. The subjects and objects are represented by nodes, and an edge represents a property that relates a subject with an object. Diverse applications have been developed on top of knowledge graphs [5, 15, 28]. However, the adoption of knowledge graphs as the de facto data structure of real-world applications demands efficient representations and scalable techniques for creating, managing, and answering queries over knowledge graphs. Thus, efficient graph representations of real-world scenarios are still needed to enhance and facilitate the development of applications over knowledge graphs.

In real-world applications, a group of entities can share the same values in a set of features. For example, several sensor observations can sense the same temperature at a given timestamp and city. This situation can be represented in an RDF graph with four triples per sensor observation $o_i$, i.e., ($o_i$ temperature $t$), ($o_i$ unit $u$), ($o_i$ timestamp $ts$), and ($o_i$ gps_coordinates $gc$). All the resources representing these sensor observations match the variable ?o in the star pattern (SGP) composed of the conjunction of the following triple patterns: (?o temperature $t$), (?o unit $u$), (?o timestamp $ts$), and (?o gps_coordinates $gc$) [24]. If a star pattern is instantiated with many entities, a large number of RDF triples will have the same properties and objects, and the corresponding star pattern will be repeatedly instantiated; we name these star patterns frequent star patterns. Although RDF triples that instantiate a frequent star pattern correctly model the real world, the size of the knowledge graph, as well as the efficiency of management and processing tasks, can be negatively affected whenever a large number of triples of frequent star patterns populate the knowledge graph. Since frequent star patterns are very common in real-world knowledge graphs, techniques are required to enable both the efficient representation of the knowledge encoded in these star patterns and the processing and traversal of the represented knowledge.

The Database and Semantic Web communities have addressed the problem of representing relational and graph data models; they have proposed a variety of representation methods and data structures that take into account the main features of a relational or graph model with the aim of speeding up relational and graph-based analytics [1, 2, 3, 14, 16, 17, 20, 23, 32]. Compression techniques [1, 32] over column-oriented databases [7, 27] use the decomposition storage model [10] to maintain data, where each attribute value and a surrogate key from the conceptual schema are stored in a binary relation. However, a relation stored using the decomposition storage model cannot easily exploit compression unless surrogate keys are repeated [10]. Further, the decomposition model stores two copies of a binary relation, and the surrogate keys must be stored repeatedly for each attribute, which increases the storage space requirements. In the context of RDF graphs, the scientific community has also actively contributed; approaches like [3, 14, 21, 31] generate compact binary representations for RDF knowledge graphs. RDF binary compression techniques do not take into account the semantics encoded in knowledge graphs; they require customized engines to perform query processing. Moreover, compression approaches for RDF graphs have been defined that are able to exploit the semantics encoded in RDF triples. Approaches [20, 23] are application dependent and require a user to input the compression rules and constraints. Alternatively, compression approaches tailored for ontology properties [17] have been shown to be effective, but they require prior knowledge of the classes and properties involved in repeated graph patterns to generate compact representations. Lastly, the techniques proposed by Joshi et al. [16] require decompression to access and process the original data, as well as extra processing over the data. Albeit effective in reducing the storage space, existing compression methods add overhead to the process of data management, and particularly, query execution time can be negatively impacted. gSpan [30] and GRAMI [11] are state-of-the-art algorithms that aim to identify frequent patterns. However, only patterns with constants are considered, and they are neither able to identify star patterns nor decide frequentness. We have built an exhaustive algorithm that resorts to the gSpan enumeration of frequent patterns to identify the frequent star patterns in an RDF knowledge graph; this approach corresponds to the baseline of our empirical evaluation.

Our Research Goal: We address the problem of identifying frequent star patterns in RDF knowledge graphs, where certain properties and their corresponding objects are repeatedly shared by several entities of a type, causing unnecessary growth of the knowledge graphs. Our research goal is to minimize the number of frequent star patterns in RDF knowledge graphs to generate compact representations without losing any information. We investigate the following research questions:
• What are the criteria that characterize frequent star patterns?
• Do compact graph representations impact the size of knowledge graphs?

Approach: We devise the concept of factorized RDF graphs, which correspond to compact graphs with a minimized number of frequent star patterns. Further, we develop computational methods to detect frequent star patterns in RDF graphs and to generate a factorized RDF graph. These methods are able to identify entities and properties in frequent star patterns in RDF graphs, and generate factorized RDF graphs by representing frequent star patterns with compact RDF molecules. A compact RDF molecule of a frequent star pattern is an RDF subgraph that instantiates the star pattern; a surrogate entity stands for the entities that satisfy the corresponding frequent star pattern. The surrogate entity is linked to the properties and the corresponding objects in the frequent star pattern (see Figure 4c). The entities initially matching the frequent star pattern are also linked to the surrogate entity of the compact RDF molecule. Compact RDF molecules significantly reduce the size of the RDF graph by replacing the labeled edges that connect entities to the objects in the frequent star pattern with edges linking the entities to the surrogate entity of a compact RDF molecule. We study the effectiveness of our factorization techniques over the LinkedSensorData benchmark [22]; it describes more than 34,000,000 weather observations collected by around 20,000 weather stations in the United States since 2002. Experiments are conducted against three LinkedSensorData RDF graphs by gradually increasing the graph size. The observed results show that frequent star patterns characterize the best set of properties relating several entities of a class to the same objects in an RDF graph. Moreover, our techniques reduce RDF graph size by up to 66.56% using properties and classes recommended by the frequent star patterns detection approach.

Contributions: We devise computational methods for factorizing RDF graphs. The specific contributions are as follows:
i) Criteria for detecting frequent star patterns;
ii) Factorization techniques compacting frequent star patterns in RDF graphs. We have presented two algorithms: an exhaustive approach (named E.FSP) searches the space of frequent patterns produced by an algorithm like gSpan to identify frequent star patterns. Further, G.FSP implements a greedy meta-heuristic that is able to traverse the space of star patterns and identify the ones that are frequent. Star patterns are traversed in iterations, starting with the star patterns with the largest number of properties. The criteria of frequent star patterns correspond to the stop criteria of the algorithm.
iii) An empirical study of both the frequent star patterns detection and factorization techniques using existing benchmarks. Experimental results show that both E.FSP and G.FSP identify frequent star patterns. Moreover, G.FSP outperforms E.FSP by reducing execution time by at least three orders of magnitude. More importantly, the experiments indicate that factorizing frequent star patterns by using surrogate keys enables the creation of compact RDF graphs that reduce size while preserving the information in the original RDF graph.

Figure 1: Motivating Example. Frequent star pattern. (a) An RDF graph G with classes, entities, and properties; (b) entities $c_1$, $c_2$, $c_3$, and $c_4$ are related to $e_1$, $e_2$, and $e_3$ with properties $p_1$, $p_2$, and $p_3$, respectively; (c) a star pattern whose subject variable ?x is related to $e_1$, $e_2$, and $e_3$ with properties $p_1$, $p_2$, and $p_3$, respectively.

The article is structured as follows: We motivate our research in Section 2, and present an analysis of the state of the art in Section 3. Our approach is defined in Section 4, while Section 5 reports on the results of the experimental study. Finally, we conclude with an outlook on future work in Section 6.
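The factorization idea summarized in the approach paragraph above can be illustrated with a minimal Python sketch. It is not the authors' implementation: triples are plain tuples, the surrogate name `cM` follows the figures referenced in the text, and the linking property `hasMolecule` is a hypothetical name, because the text does not name the property that connects entities to the surrogate. The toy graph contains only the shared edges described in the caption of Figure 1.

```python
def factorize(triples, star, surrogate, link="hasMolecule"):
    """Replace the edges of one star pattern by a compact molecule on `surrogate`.

    `star` is a set of (property, object) pairs. `link` is a hypothetical
    property name that connects each matching entity to the surrogate.
    """
    star = set(star)
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, set()).add((p, o))
    # Entities that instantiate every triple pattern of the star.
    matches = sorted(s for s, po in outgoing.items() if star <= po)
    if not matches:
        return list(triples)
    # Keep every triple that is not one of the repeated star edges.
    kept = [(s, p, o) for s, p, o in triples
            if not (s in matches and (p, o) in star)]
    # The star edges are stored once, on the surrogate entity.
    molecule = [(surrogate, p, o) for p, o in sorted(star)]
    links = [(s, link, surrogate) for s in matches]
    return kept + molecule + links


entities = ["c1", "c2", "c3", "c4"]
graph = [(c, "type", "C") for c in entities]
graph += [(c, p, o) for c in entities
          for p, o in [("p1", "e1"), ("p2", "e2"), ("p3", "e3")]]

star = {("p1", "e1"), ("p2", "e2"), ("p3", "e3")}
compact = factorize(graph, star, surrogate="cM")
print(len(graph), "->", len(compact))  # 16 -> 11
```

In this toy case the twelve repeated edges are replaced by three molecule edges plus four links, and the saving grows with the number of entities that match the star pattern.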

We motivate the problem addressed by this work with an RDF graph where entities (or resources) of the same type match the same star pattern. In an RDF graph, matching the same star pattern means that the properties and objects are the same, whereas the entities are different. When the number of entities matching a star pattern is very high, the size of the RDF graph increases and query processing over the RDF graph is negatively affected. A star pattern with a high number of matching entities is a frequent star pattern. Figure 1a depicts an RDF graph composed of a class $C$, the entities $c_1$, $c_2$, $c_3$, $c_4$, $e_1$, $e_2$, $e_3$, $e_4$, $e_5$, and $e_6$, and the properties $p_1$, $p_2$, $p_3$, and $p_4$. A directed edge ($s$ $p$ $o$) in the RDF graph stands for an RDF triple where $p$ is a label that represents an RDF predicate, while $s$ and $o$ are subject and object nodes, respectively. Edges labeled with the predicate type (the property type refers to rdf:type) indicate that $c_1$, $c_2$, $c_3$, and $c_4$ are of the same type, i.e., the class $C$. The directed edge ($c_1$ $p_1$ $e_1$) expresses that the entity $c_1$ is related to object $e_1$ with the property $p_1$. Similarly, entities $c_2$, $c_3$, and $c_4$ are related to object $e_1$ with the property $p_1$, i.e., the indegree of $e_1$ is four. Similarly, entities $c_1$, $c_2$, $c_3$, and $c_4$ are related to $e_2$ and $e_3$ with the properties $p_2$ and $p_3$, respectively. Note that entities $c_1$, $c_2$, $c_3$, and $c_4$ are associated with the same objects, i.e., $e_1$, $e_2$, and $e_3$, through the edges annotated with the same properties $p_1$, $p_2$, and $p_3$. Albeit sound, these redundant labeled edges generate frequent star patterns because entities of the same type are described using the same properties and objects. Figure 1b illustrates the RDF subgraphs that map to the same star pattern, shown in Figure 1c, extracted from the RDF graph in Figure 1a; note that ?x is a variable whose instantiations correspond to constants in the RDF graph. In these RDF subgraphs, the properties $p_1$, $p_2$, and $p_3$, and the corresponding objects $e_1$, $e_2$, and $e_3$, respectively, are the same, whereas the entities $c_1$, $c_2$, $c_3$, and $c_4$ are different. This indicates that the star pattern is a frequent star pattern, i.e., several entities $c_1$, $c_2$, $c_3$, and $c_4$ instantiate the star pattern. Thus, several entities are related to the same objects, even though not all the properties of the class are involved in frequent star patterns. A frequent star pattern comprising the entities $c_1$, $c_2$, $c_3$, and $c_4$ is illustrated in Figure 1c, where the node ?x represents the entities $c_1$, $c_2$, $c_3$, and $c_4$ of class $C$ in the RDF graph in Figure 1a.

Figure 2: Graph Patterns Identified by gSpan. Subgraphs, involving two entities of class $C$, extracted by gSpan from the RDF graph in Figure 1a. (a) Subgraphs per set $\{p_1, p_2, p_3, p_4\}$ of properties; (b) subgraphs involving three properties from $p_1$, $p_2$, $p_3$, and $p_4$; (c) subgraphs involving two properties from $p_1$, $p_2$, $p_3$, and $p_4$.

gSpan [30] solves the problem of identifying the frequent subgraphs that involve the same subject entities related to the same object values using a set of properties. However, our approach requires the identification of frequent star patterns, where each star pattern, with a subject variable, involves different subject entities related to the same object values using a set of properties. Figures 2a, 2b, and 2c show some of the subgraphs extracted by gSpan involving two entities of class $C$, and the sets of properties containing four, three, and two properties, respectively, from the RDF graph in Figure 1a. gSpan exhaustively enumerates the frequent subgraphs; thus, finding frequent star patterns requires an exhaustive search over the generated frequent subgraphs.
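To make "matching a star pattern" concrete, the following Python sketch rebuilds the graph of Figure 1a as tuples and lists the entities that instantiate the star pattern of Figure 1c. It only illustrates the notion of a match and is neither E.FSP nor G.FSP. The objects reached through $p_4$ are an assumed assignment, since the text does not list those edges individually.

```python
def match_star(triples, star):
    """Return the subjects that instantiate ?x in every triple pattern (?x p o)."""
    edges = set(triples)
    subjects = {s for s, _, _ in triples}
    return sorted(s for s in subjects
                  if all((s, p, o) in edges for p, o in star))


entities = ["c1", "c2", "c3", "c4"]
graph = [(c, "type", "C") for c in entities]
# Shared edges described in the text: every c_i reaches e1, e2, e3 via p1, p2, p3.
graph += [(c, p, o) for c in entities
          for p, o in [("p1", "e1"), ("p2", "e2"), ("p3", "e3")]]
# Assumed p4 edges (not spelled out in the text).
graph += [("c1", "p4", "e4"), ("c2", "p4", "e5"),
          ("c3", "p4", "e5"), ("c4", "p4", "e6")]

star = [("p1", "e1"), ("p2", "e2"), ("p3", "e3")]
print(match_star(graph, star))                   # ['c1', 'c2', 'c3', 'c4']
print(match_star(graph, star + [("p4", "e5")]))  # ['c2', 'c3']
```

Adding a fourth triple pattern shrinks the set of matching entities, which is why the choice of properties determines how frequent a star pattern is.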
In this work, we exploit the RDF model and propose a technique that allows for transforming an RDF graph $G$ into another RDF graph $G'$ where the number of frequent star patterns is minimized. The graph $G'$ includes all the nodes from $G$ but, additionally, $G'$ comprises nodes that represent factorized entities, like the one in Figure 4c.

Database and Semantic Web communities have proposed several representations to speed up processing over the large amounts of data represented using relational and RDF data models [1, 2, 3, 14, 16, 17, 20, 23, 32]. These compression approaches can be categorized into compression techniques for relational and RDF graph data models. Relational data model approaches [1, 32] efficiently store very large datasets in column-oriented stores. Approaches [3, 14, 16, 17, 20, 21, 23, 31] target the efficient storage of RDF graph data. Furthermore, several frequent pattern mining algorithms [11, 30] extract frequent isomorphic graph patterns from a graph.

3.1 Data Compression for Relational Data Models

Column-oriented databases [27, 32] store each attribute in a separate column such that successive values of the attribute are stored consecutively on disk. This improves query processing when only the values of some of the columns are required to process the query. Column-oriented data storage opens a number of opportunities to apply compression techniques more naturally over multiple values of the same type. The compression approach proposed by Abadi et al. [1] compresses each column in C-store [27] using one of the methods Null Suppression, Dictionary Encoding, Run-length Encoding, Bit-Vector Encoding, or Lempel-Ziv [25, 29]. Zukowski et al. [32] focus on improving the poor CPU/cache performance caused by compression techniques that involve if-then-else statements in the code, e.g., Null Suppression and Run-length Encoding, and that do not take advantage of the super-scalar properties of modern CPUs, e.g., pipelining. Zukowski et al. propose three compression methods, i.e., PFOR, PFOR-DELTA, and PDICT. These compression solutions are exploited by column-oriented stores using the decomposition storage model [10], where n-ary relations are decomposed into n binary relations. Each binary relation consists of the values of one attribute and the corresponding surrogate keys. In this model, two copies of the data are stored, increasing the data storage requirements. Further, for each attribute a copy of the corresponding duplicated surrogate key is required, resulting in an increase of the storage by a factor of two. Moreover, various compression techniques are hard to implement for a large number of unique values, i.e., subject entities. Our approach generates a factorized graph where entities matching a frequent star pattern are represented by a surrogate entity of the corresponding compact RDF molecule. These compact graph representations replace repeated properties and corresponding objects with properties and objects in the compact RDF molecules and hence reduce the storage space requirements for the decomposition storage model [10].

Meier et al. [20] propose a user-specific minimization technique based on Datalog rules to remove RDF triples from a given RDF graph. Similarly, Pichler et al. [23] study RDF redundancy elimination in the presence of rules, constraints, and queries specified by users. These two approaches are user specific and require human input for compressing the ever-growing RDF graphs. A scalable lossless RDF compression technique, proposed by Joshi et al. [16], automatically generates decompression rules. The rules are used to split the RDF datasets into an active dataset containing compressed triples and a dormant dataset consisting of uncompressed RDF triples. This technique incurs the overhead of decompressing the compressed data to access the information initially represented in the datasets. A factorized representation of RDF graphs is presented by Karim et al. [17], where repeated observation values are represented only once. This approach reduces the number of RDF triples in observational data that is semantically described using the Semantic Sensor Network (SSN) Ontology [9]. We propose an approach to automatically identify frequent star patterns in RDF graphs described using any ontology. Further, we devise factorized graphical representations of RDF graphs which do not require data decompression to perform data management tasks. Fernández et al. [14] present a binary RDF representation format consisting of a Header, a Dictionary, and a Triples component containing RDF metadata, the RDF terms catalog, and compactly encoded RDF triples, respectively. Pan et al. [21] propose RDF compression based on graph patterns, which reduces the number of RDF triples and then generates compact binary representations of the reduced triples. The compression technique $k^2$-triples presented by Álvarez-García et al. [3] exploits the two-dimensional $k^2$-tree structure, proposed by Brisaboa et al. [8], to distribute the compact triples obtained by Header-Dictionary-Triples partitioning [14]. These approaches are able to effectively reduce redundancies in RDF graphs and provide effective techniques for RDF graph compression. However, customized engines are required to perform query processing over the compressed RDF graphs, and decompression techniques are needed during data management. We devise factorization techniques that use the semantics encoded in RDF data and compactly represent RDF triples, reduce redundancy, and facilitate data management tasks without requiring any decompression or a customized engine.

The problem of frequent pattern mining involves finding subgraphs of a graph whose frequency is above a given threshold. gSpan [30] exploits depth-first search (DFS) to mine frequent patterns. gSpan maps a graph to a DFS code representing the edge sequence. Several DFS codes can be generated for a single graph. These DFS codes are ordered lexicographically based on the edge labels and the order in which nodes are visited. From these ordered DFS codes, the minimum DFS codes are selected to build the DFS tree. DFS over a code tree discovers all the minimum DFS codes of frequent patterns. GRAMI [11] mines frequent patterns and finds only the minimal set of instances that satisfy the given frequency threshold. GRAMI stores the templates of frequent patterns instead of storing their appearances. This avoids the creation and storage of all appearances of patterns. For frequency evaluation, GRAMI maps the frequent pattern mining problem to a constraint satisfaction problem (CSP), which is represented by a tuple of (a) an ordered set of variables representing nodes, (b) a set of domains of the variables in (a), and (c) a set of constraints between these variables. Two subgraph patterns are isomorphic if the variables in the corresponding CSP tuples have different values from the domains while the node and edge labels are the same. Although these frequent pattern mining approaches are able to identify frequent isomorphic graph patterns, extracting frequent star patterns, which involve different subject nodes related to the same object nodes using the same set of edge labels, requires an exhaustive search over the identified frequent patterns. It is important to highlight that although these approaches effectively mine subgraph patterns, they are not able to identify patterns where one node is a variable. In contrast, our approach searches for star patterns and is able to detect the ones with the highest number of instantiations.

We introduce important preliminary definitions, and then formally define the problem of detecting frequent star patterns and compacting them in an RDF graph.

Our approach is based on the RDF data model, which builds on RDF triples.

Definition 4.1 (RDF triple [4]). Let $I$, $B$, $L$ be disjoint infinite sets of URIs, blank nodes, and literals, respectively. A tuple $(s\ p\ o) \in (I \cup B) \times I \times (I \cup B \cup L)$ is an RDF triple, where $s$ is the subject, $p$ is the property, and $o$ is the object.

A set of RDF triples is called an RDF dataset (or knowledge graph) and can also be viewed as a graph. Thus, in Figure 1a, the edge ($c_1$ type $C$) represents an RDF triple, where entity $c_1$ corresponds to the subject, and type and $C$ represent a property and an object, respectively; there are nineteen more RDF triples.

Definition 4.2 (RDF Graph). An RDF graph $G = (V, E, L)$ is a labeled directed graph where nodes represent entities or objects, while labels stand for properties:
• An RDF triple $(s\ p\ o) \in E$ corresponds to an edge in $E$ from node $s$ to node $o$; $p$ is the label of the edge and denotes the property that relates both nodes;
• $s, o \in V$, where $s$ corresponds to a subject and $o$ corresponds to an object; and
• $p \in L$ is an edge label corresponding to a property.

Definition 4.3 (RDF Molecule [13]). An RDF molecule $RM$ is a set of RDF triples that share the same subject, i.e., $RM = \{(s\ p_1\ o_1), (s\ p_2\ o_2), \ldots, (s\ p_n\ o_n)\}$.

Figure 1b presents four RDF molecules around the subjects $c_1$, $c_2$, $c_3$, and $c_4$ of class $C$. In the RDF molecule around subject $c_1$, all the RDF triples describe $c_1$ using properties $p_1$, $p_2$, and $p_3$. Similarly, the RDF triples in each of the other RDF molecules describe the subjects $c_2$, $c_3$, and $c_4$ using properties $p_1$, $p_2$, and $p_3$.

Star patterns denote graph patterns covering RDF molecules:

Definition 4.4 (Star Pattern). Given an RDF graph $G = (V, E, L)$, a class $C$ in $E$, and a set of properties $SP = \{p_1, p_2, \ldots, p_n\}$ such that $C$ is the domain of all the properties in $SP$. Let entities $o_1, o_2, \ldots, o_n$ be the objects of the properties $p_1, p_2, \ldots, p_n$, respectively. Let $?s$ be a variable. A star pattern of $C$ over the properties $p_1, p_2, \ldots, p_n$ and objects $o_1, o_2, \ldots, o_n$ corresponds to a graph pattern composed of the conjunction of the triple patterns $(?s\ p_1\ o_1), (?s\ p_2\ o_2), \ldots, (?s\ p_n\ o_n)$.

Figure 1c shows a star pattern composed of three triple patterns containing properties $p_1$, $p_2$, and $p_3$ and the corresponding objects $e_1$, $e_2$, and $e_3$, respectively. The entities $c_1$, $c_2$, $c_3$, and $c_4$ of class $C$ in the RDF graph in Figure 1a match the star pattern. The variable ?x is the subject of the triple patterns and refers to the entities matching the star pattern.

Definition 4.5 (Class Multiplicity). Given an RDF graph $G = (V, E, L)$, a class $C$ in $E$, and a set of properties $SP = \{p_1, p_2, \ldots, p_n\}$ such that $C$ is the domain of all the properties in $SP$. Let entities $o_1, o_2, \ldots, o_n$ be objects of the properties $p_1, p_2, \ldots, p_n$, respectively. The multiplicity of $o_1, o_2, \ldots, o_n$ in $G$, $M(o_1, o_2, \ldots, o_n \mid G)$, is defined as the number of different entities in $C$ that match a star pattern having the same objects $o_1, o_2, \ldots, o_n$ in the properties $p_1, p_2, \ldots, p_n$. Entities $s$ correspond to instantiations of the subject variable in the star pattern.

$$M(o_1, o_2, \ldots, o_n \mid G) = |\{ s \mid (s\ \text{type}\ C) \in G, (s\ p_1\ o_1) \in G, (s\ p_2\ o_2) \in G, \ldots, (s\ p_n\ o_n) \in G \}|$$

In the RDF graph in Figure 1a, the multiplicity of the objects $e_1$, $e_2$, and $e_3$, given the set $\{p_1, p_2, p_3\}$ of properties, is 4, because there are four instantiations of the subject variable. Similarly, the multiplicity of each of the objects $e_4$, $e_5$, and $e_6$ in the set $\{p_4\}$ of properties is 1 or 2.

Definition 4.6 (Class Multiplicity Inverse). Given a class $C$, a set $SP = \{p_1, p_2, \ldots, p_n\}$ of properties, and corresponding objects $o_1, o_2, \ldots, o_n$, the multiplicity inverse of $o_1, o_2, \ldots, o_n$ in $G$, denoted $MI(o_1, o_2, \ldots, o_n \mid G)$, is:

$$MI(o_1, o_2, \ldots, o_n \mid G) = \frac{1}{M(o_1, o_2, \ldots, o_n \mid G)}$$

In the RDF graph in Figure 1a, the class multiplicity inverse of the objects $e_1$, $e_2$, and $e_3$, given the set $\{p_1, p_2, p_3\}$ of properties, is $\frac{1}{4}$. The multiplicity inverse of each of the objects $e_4$, $e_5$, and $e_6$ in the set $\{p_4\}$ of properties is 1 or $\frac{1}{2}$.

Definition 4.7 (Multiplicity of Star Patterns). Given a class $C$ in an RDF graph $G$ with properties $SP = \{p_1, p_2, \ldots, p_n\}$, the multiplicity of the star patterns in $C$ over $SP$, $AMI_G(p_1, p_2, \ldots, p_n \mid C)$, is defined as follows:

$$AMI_G(p_1, p_2, \ldots, p_n \mid C) = \left\lceil f'_{\forall s \in C}\left(\{ MI(o_1, o_2, \ldots, o_n \mid G) \mid (s\ \text{type}\ C) \in G, (s\ p_1\ o_1) \in G, (s\ p_2\ o_2) \in G, \ldots, (s\ p_n\ o_n) \in G \}\right) \right\rceil$$

where $f'(\cdot)$ is an aggregation (e.g., summation) function.

In the RDF graph in Figure 1a, the multiplicity of the star patterns of $C$ over the set $\{p_1, p_2, p_3\}$ of properties is $\frac{1}{4} + \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = 1$, which is obtained by summing up the class multiplicity inverse of the objects $e_1$, $e_2$, and $e_3$ given the set $\{p_1, p_2, p_3\}$ of properties, for each entity $c_1$, $c_2$, $c_3$, and $c_4$ of class $C$ matching the star pattern. Similarly, the multiplicity of the star patterns of class $C$ over the set $\{p_4\}$ is 3 in the RDF graph, and is obtained by summing up the individual class multiplicity inverses of objects $e_4$, $e_5$, and $e_6$ given the set $\{p_4\}$, for each of the entities $c_1$, $c_2$, $c_3$, and $c_4$ of class $C$ that match the corresponding star patterns. The multiplicity of the star patterns over a set of properties corresponds to the number of star patterns composed of the set of properties and the corresponding objects. The problem of frequent star pattern detection is defined below; its solutions correspond to frequent star patterns.
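Definitions 4.5 to 4.7 can be checked on the running example with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: summation is used for the aggregation function $f'$, each entity is assumed to have at most one object per property, and the $p_4$ edges are an assumed assignment chosen to agree with the totals reported in the text.

```python
from collections import Counter
from fractions import Fraction
from math import ceil


def star_signatures(triples, cls, props):
    """Map each entity of class `cls` to its tuple of objects for `props`."""
    members = {s for s, p, o in triples if p == "type" and o == cls}
    value = {(s, p): o for s, p, o in triples}  # one object per (s, p) assumed
    return {s: tuple(value[(s, p)] for p in props)
            for s in members if all((s, p) in value for p in props)}


def multiplicity(triples, cls, props):
    """M(o_1..o_n | G) for every object tuple that occurs in the graph."""
    return Counter(star_signatures(triples, cls, props).values())


def ami(triples, cls, props):
    """AMI_G(props | C): ceiling of the summed multiplicity inverses."""
    sigs = star_signatures(triples, cls, props)
    m = Counter(sigs.values())
    return ceil(sum(Fraction(1, m[sig]) for sig in sigs.values()))


entities = ["c1", "c2", "c3", "c4"]
graph = [(c, "type", "C") for c in entities]
graph += [(c, p, o) for c in entities
          for p, o in [("p1", "e1"), ("p2", "e2"), ("p3", "e3")]]
# Assumed p4 edges (not spelled out in the text).
graph += [("c1", "p4", "e4"), ("c2", "p4", "e5"),
          ("c3", "p4", "e5"), ("c4", "p4", "e6")]

print(multiplicity(graph, "C", ["p1", "p2", "p3"])[("e1", "e2", "e3")])  # 4
print(ami(graph, "C", ["p1", "p2", "p3"]))  # 1
print(ami(graph, "C", ["p4"]))              # 3
```

Exact fractions are used so that the sum of inverses is not affected by floating-point rounding before the ceiling is taken.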
We define the frequent star patterns detection problem as the minimization of the connections between the instances of a class and the values linked through its properties. To find the minimum number of edges over the properties in a class, we compute the sum of the number of edges in the star patterns over a set of properties and the number of edges between the class entities and the properties that are not involved in the star patterns.

Definition 4.8 (FSP Detection Problem). Given an RDF graph $G = (V, E, L)$ and a class $C$ in $G$ with a set of properties $S$ and number of instances $AM_G(C)$. The problem of Frequent Star Patterns Detection (FSP Detection) is to find a subset $SP$ of $S$ such that the star patterns SGP of $C$ over $SP$ correspond to frequent star patterns, i.e., $Edges(SP, C, G)$ is minimized:

$$\underset{SP \subseteq S}{\arg\min}\ \underbrace{AMI_G(SP \mid C) \cdot (|SP| + 1) + AM_G(C) \cdot |S - SP|}_{Edges(SP, C, G)} \qquad (1)$$

Figure 3 illustrates the problem of detecting frequent star patterns from the RDF graph in Figure 1a. Figure 3a presents three star patterns, i.e., $AMI_G(SS \mid C) = 3$, over the set of properties $p_1$, $p_2$, $p_3$, and $p_4$, and 15 edges in $Edges(SS, C, G)$. However, only one star pattern, i.e., $AMI_G(SS' \mid C) = 1$, over the set of properties $p_1$, $p_2$, and $p_3$ exists in Figure 3b. Indeed, with $AM_G(C) = 4$, Equation (1) yields $3 \cdot (4 + 1) + 4 \cdot 0 = 15$ for $SS$ and $1 \cdot (3 + 1) + 4 \cdot 1 = 8$ for $SS'$. A small value of $Edges(SS', C, G)$, i.e., eight, shows a subgraph over $SS'$ that is represented by only one star pattern with more instantiations than the star patterns for $SS$, i.e., it is a frequent star pattern. Thus, the set of properties $SP$ where $Edges(SP, C, G)$ is minimal encloses a subgraph with the minimal number of star patterns, which have the maximal number of instantiations; additionally, these star patterns are the ones with the greater number of properties. Figure 3c depicts the factorized RDF graph where this frequent star pattern has been replaced with a compact RDF molecule on a surrogate entity $cM$; this factorization reduces the size of the original RDF graph.

Theorem 4.1.

Given an RDF graph $G$, a class $C$ in $G$, and non-empty sets of properties $S$, $SP$, and $SP'$ of $C$ such that $SP' \subset SP \subset S$: if $Edges(SP', C, G) > Edges(SP, C, G)$, then $\forall SP'' \subset SP'$, $Edges(SP'', C, G) \geq Edges(SP, C, G)$.

Proof. By contradiction. Suppose $Edges(SP'', C, G) < Edges(SP, C, G)$. From $Edges(SP', C, G) > Edges(SP, C, G)$ and $SP' \subset SP \subset S$, it can be inferred that $AMI_G(SP \mid C) < AM_G(C)$, $AMI_G(SP' \mid C)$ […]

[…] $SGP$ of a class $C$ over the properties $p_1, p_2, \ldots, p_n$ and objects $o_1, o_2, \ldots, o_n$. Given a surrogate entity $sg$ of type $C$, a compact RDF molecule for $SGP$ is an RDF molecule composed of the RDF triples $(sg\ p_1\ o_1), (sg\ p_2\ o_2), \ldots, (sg\ p_n\ o_n)$.

Figure 4c shows a compact RDF molecule that instantiates the star pattern presented in Figure 1c, which is composed of the properties $p_1$, $p_2$, and $p_3$ and the corresponding objects $e_1$, $e_2$, and $e_3$, respectively. The surrogate entity $cM$ in the compact RDF molecule represents the entities $c_1$, $c_2$, $c_3$, and $c_4$ of type $C$ matching the star pattern, as shown in Figure 1b.

Figure 3: The Frequent Star Patterns Detection Problem. Properties involved in frequent star patterns. (a) Star patterns over the set $SS = \{p_1, p_2, p_3, p_4\}$ of properties in class $C$ require three surrogate entities and $Edges(SS, C, G)$ is 15; (b) star patterns over the set $SS' = \{p_1, p_2, p_3\}$ of properties in class $C$ require one surrogate entity and $Edges(SS', C, G)$ is eight; (c) a factorized RDF graph $G'$ of $G$ composed of a compact RDF molecule with a surrogate entity $cM$.

Figure 4: The RDF Graph Factorization Problem. Factorization of RDF graph $G$ into $G'$. (a) Entity mappings $\mu_N$ from the RDF graph $G$ in Figure 1a to the surrogate entity $cM$ in $G'$; (b) transformation of the property type from $G$ into instanceOf in $G'$; (c) a compact RDF molecule for the frequent star pattern over the properties $p_1$, $p_2$, and $p_3$.

Definition 4.10 (The RDF-F Problem). Given an RDF graph $G = (V, E, L)$ and a set of properties $SP$, the problem of RDF factorization (RDF-F) corresponds to finding a factorized RDF graph of $G$, $G' = (V', E', L')$, where the following hold:
• Entities in $G$ are preserved in $G'$, i.e., $V \subseteq V'$.
• For each entity $s_i$ in $V$ that corresponds to an instantiation of the variable of a frequent star pattern $SGP$ of a class $C$ over the set $SP$ in $G$, there is an entity $s_{SGP}$ in $V'$ that corresponds to the surrogate entity of the compact RDF molecule of $SGP$. Formally, there is a partial mapping $\mu_N : V \rightarrow V'$:
  – Instances of the frequent star pattern $SGP$ are mapped to the surrogate entity of the star pattern, i.e., $\mu_N(s_i) = s_{SGP}$.
  – The mapping $\mu_N$ is not defined for the rest of the entities, which do not instantiate a frequent star pattern in $G$.
• For each RDF triple $t = (s\ p\ o)$ in $E$:
  – If $\mu_N(s)$ is defined, $C_s$ is the type of $s$, and $p$ is type, then the triples $(s\ \mathit{instanceOf}\ \mu_N(s))$ and $(\mu_N(s)\ \mathit{type}\ C_s)$ belong to $E'$.
  – If $\mu_N(s)$ is defined, $C_s$ is the type of $s$, and $p \in SP$, then the triple $(\mu_N(s)\ p\ o)$ belongs to $E'$.
  – Otherwise, the RDF triple $t$ is preserved in $E'$.

Consider the RDF graphs $G$ and $G'$ shown in Figures 1a and 3c, respectively. Figure 4a depicts a map $\mu_N$ that assigns the entities $c_1$, $c_2$, $c_3$, and $c_4$ of class $C$ in $G$ to the surrogate entity $cM$ in $G'$. Further, the entities $c_1$, $c_2$, $c_3$, $c_4$, $C$, $e_1$, $e_2$, $e_3$, $e_4$, $e_5$, and $e_6$ are preserved in $G'$. Moreover, the edge labeled with property $p_1$ in $G$, i.e., $(c_1\ p_1\ e_1)$, is represented with the edges $(c_1\ \mathit{instanceOf}\ cM)$, $(cM\ p_1\ e_1)$, and $(cM\ \mathit{type}\ C)$ in $G'$; similarly, edges labeled with properties $p_2$ and $p_3$ in $G$ are represented in $G'$. Figure 4b shows how the connections between the entities $c_1$, $c_2$, $c_3$, and $c_4$ and the class $C$, expressed with labeled edges annotated with the property type, are transformed into connections relating the entities $c_1$, $c_2$, $c_3$, and $c_4$ to the corresponding surrogate entity $cM$ using the property instanceOf.

Definition 4.11 (Axioms for InstanceOf). The property instanceOf is a functional property defined as follows:
• If $(s_i\ \mathit{instanceOf}\ sg)$ and $(sg\ \mathit{type}\ C)$ then $(s_i\ \mathit{type}\ C)$.
• If $(s_i\ \mathit{instanceOf}\ sg)$ and $(sg\ p_j\ o_k)$ then $(s_i\ p_j\ o_k)$.

These two axioms make it possible to represent implicitly all the knowledge encoded in the edges of an original RDF graph that are removed during the factorization process. They are utilized during query processing to rewrite queries over the original RDF graph into queries against the factorized RDF graph.

Algorithm 1

E.FSP Algorithm

Input: A dictionary subgraphsDict of subgraphs over the subsets of properties in S; a set S of properties of class C.
Output: Frequent star patterns fsp; a set SP of properties.

 1: fsp ← [], SP ← ∅, minEdges ← 0, subsetCard ← |S|
 2: while subsetCard ≥ 2 do
 3:   propSets ← getSubsetsOf(S, subsetCard)
 4:   for SP ∈ propSets do
 5:     subgraphs ← subgraphsDict[SP]
 6:     totalEdges ← countEdges(subgraphs)
 7:     if minEdges == 0 then
 8:       minEdges ← totalEdges
 9:       fsp ← subgraphs
10:       bestSP ← SP
11:     else if totalEdges < minEdges then
12:       minEdges ← totalEdges
13:       fsp ← subgraphs
14:       bestSP ← SP
15:     end if
16:   end for
17:   subsetCard ← subsetCard − 1
18: end while
19: SP ← bestSP
20: return fsp, SP

To solve the FSP detection problem, we propose two algorithms that iterate over frequent patterns involving different sets of properties of a class C in an RDF graph G, and the class entities. E.FSP, presented in Algorithm 1, resorts to a frequent pattern mining algorithm like gSpan.

E.FSP exploits a breadth-first search technique to exhaustively traverse the search space of frequent patterns generated by the frequent pattern mining algorithm, and always finds the best frequent star patterns. Figure 5a illustrates the iterations performed by E.FSP to find the frequent star patterns in the RDF graph in Figure 1a.

Figure 5: Frequent Star Patterns Detection. E.FSP and G.FSP iterate over the star patterns in the RDF graph in Figure 1a to detect the frequent star patterns. (a) Exhaustive FSP approach: E.FSP exhaustively iterates over the whole search space of frequent patterns; (b) Greedy FSP approach: G.FSP iterates over the search space without generating all the star patterns.

E.FSP receives a dictionary subgraphsDict of all the subgraphs over the subsets of the set S of properties in the class C in an RDF graph G. The keys of the dictionary subgraphsDict are the combinations of properties in the subsets of S, and the dictionary values are the subgraphs involving the properties from the corresponding keys. E.FSP generates frequent star patterns and a set of properties involved in the frequent star patterns.

E.FSP initializes the variables fsp, SP, minEdges, and subsetCard in line 1. The variables minEdges and subsetCard are initialized with zero and the cardinality of S, respectively. In lines 2-18, E.FSP iterates over all the subgraphs involving two or more properties to find the frequent star patterns. In Figure 5a, E.FSP starts the iterations with the set of properties $SP = \{p_1, p_2, p_3, p_4\}$ and the subgraphs involving the properties in the subsets of SP whose cardinality is equal to the cardinality of S, i.e., four (line 3). The generated subset contains all the properties in SP, i.e., $\{p_1, p_2, p_3, p_4\}$, and the total number of edges totalEdges in SP is computed, i.e., 16 (lines 5-6). Since minEdges is zero, the value 16 of totalEdges is assigned to minEdges, and the subgraphs over $SP = \{p_1, p_2, p_3, p_4\}$ and SP itself are assigned to fsp and bestSP, respectively (lines 7-10). At line 17, subsetCard is reduced by one in order to generate the subsets of properties of S with cardinality one less than the cardinality of S, i.e., three. The subsets $\{p_1, p_2, p_4\}$, $\{p_1, p_3, p_4\}$, and $\{p_2, p_3, p_4\}$, of cardinality three, generate a larger number of edges, i.e., the value of totalEdges is 17, than the minimum number of edges minEdges, i.e., 16, and are not selected as the best sets of properties. However, the subgraphs over the subset $\{p_1, p_2, p_3\}$ contain 11 triples, which is less than 16, the value of minEdges. Therefore, E.FSP selects $\{p_1, p_2, p_3\}$ as the best set of properties and the corresponding subgraphs as the frequent star patterns (lines 11-15). Once all the subsets SP of S with cardinality three are evaluated, the value of subsetCard is reduced by one, i.e., to two, and the subsets of cardinality two are evaluated in the next iteration. Figure 5a shows that all the subsets of cardinality two generate larger values for totalEdges, i.e., 14 and 18, than the value 11 of minEdges. Therefore, none of the subsets of properties of cardinality two contains the frequent star patterns. Since all the subsets of cardinality greater than or equal to two have then been evaluated, E.FSP stops and returns $\{p_1, p_2, p_3\}$ as the best set of properties and the corresponding subgraphs as the frequent star patterns (lines 19-20).
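The exhaustive strategy can be sketched as follows. This is a simplified illustration under stated assumptions rather than the E.FSP implementation: instead of counting edges in subgraphs mined by gSpan (`countEdges` over `subgraphsDict`), the sketch scores each subset directly with Formula 1, so its scores are 15 and 8 rather than the totalEdges values 16 and 11 of the walkthrough. The toy data and the function names are hypothetical.

```python
from itertools import combinations

# Assumed toy data: entity of class C -> {property: object}, functional properties.
ENTITIES = {
    "c1": {"p1": "e1", "p2": "e2", "p3": "e3", "p4": "e4"},
    "c2": {"p1": "e1", "p2": "e2", "p3": "e3", "p4": "e4"},
    "c3": {"p1": "e1", "p2": "e2", "p3": "e3", "p4": "e5"},
    "c4": {"p1": "e1", "p2": "e2", "p3": "e3", "p4": "e6"},
}
S = {"p1", "p2", "p3", "p4"}


def num_star_patterns(entities, props):
    """AMI_G(SP | C) with summation as f': the number of distinct object tuples."""
    props = sorted(props)
    return len({tuple(row[p] for p in props) for row in entities.values()})


def edges(entities, all_props, props):
    """Formula 1: AMI_G(SP | C) * (|SP| + 1) + AM_G(C) * |S - SP|."""
    return (num_star_patterns(entities, props) * (len(props) + 1)
            + len(entities) * len(set(all_props) - set(props)))


def exhaustive_fsp(entities, all_props):
    """Evaluate every subset of cardinality |S| down to 2; keep the cheapest one."""
    best_sp, min_edges = None, None
    for card in range(len(all_props), 1, -1):
        for sp in combinations(sorted(all_props), card):
            total = edges(entities, all_props, sp)
            if min_edges is None or total < min_edges:
                best_sp, min_edges = set(sp), total
    return best_sp, min_edges


print(exhaustive_fsp(ENTITIES, S))
```

The nested loops visit every subset with at least two properties, which is what makes the exhaustive variant exponential in the number of properties.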
G.FSP, presented in Algorithm 2, adopts a greedy algorithm to traverse the search space without generating all the frequent patterns. G.FSP starts the iterations using a set SP of properties containing all the properties in S of a class C in an RDF graph G. G.FSP computes the value of Formula 1 for SP, iterates over the subsets SP′ of cardinality one less than the cardinality of SP, and computes Formula 1 for each of the subsets SP′. A property subset SP′ with a smaller formula value than the formula value of SP is selected as the best set of properties in that iteration, and is used in the next iteration to check the subsets of cardinality one less than the cardinality of the selected set of properties. The iterations are performed until the cardinality of the selected subset of properties is less than two. Based on the property presented in Theorem 4.1, G.FSP stops if none of the subsets SP′ generates a smaller formula value than the formula value of SP. In addition, G.FSP stops whenever the cardinality of the set of properties is less than two, or the multiplicity of star patterns $AMI_G(SP \mid C)$ is one. G.FSP receives a set S of properties in class C in an RDF graph G, and a list starList of star patterns involving properties in S. G.FSP returns the frequent star patterns fsp and a set of properties SP involved in the frequent star patterns. Figure 5b shows the iterations performed by G.FSP to detect the frequent star patterns in the RDF graph in Figure 1a.

G.FSP initializes all the variables at line 1, where SP is assigned the set S of properties for the first iteration, i.e., $SP = \{p_1, p_2, p_3, p_4\}$. In lines 2-29, G.FSP iterates over the subsets of SP to find the frequent star patterns based on the criteria in Formula 1. The cardinality of SP, four, is greater than two (line 3), and $AMI_G(SP \mid C)$ is not equal to one (lines 4-7); therefore, G.FSP computes the value of $Edges(SP, C, G)$ for SP, i.e., 15 (line 8). In lines 9-25, G.FSP iterates over the subsets of SP of cardinality one less than the cardinality of SP to find the best set of properties for the next iteration. At line 10, a property p is removed from SP to generate a subset SP′, e.g., by removing $p_1$ the subset $SP' = \{p_2, p_3, p_4\}$ is generated. Since the cardinality of SP′ is at least two, a star list starList′, representing the star patterns over SP′, is created using starList (lines 12-13). The value of $Edges(SP', C, G)$ for SP′ is computed, i.e., 16 (line 13). For SP′, $AMI_G(SP' \mid C)$ is not one, and the value 16 of $Edges(SP', C, G)$ is not less than the value 15 of $Edges(SP, C, G)$; therefore, the star patterns over $SP' = \{p_2, p_3, p_4\}$ do not involve frequent star patterns and SP′ is not a good candidate for the next iteration. Similarly, the property subsets $\{p_1, p_3, p_4\}$ and $\{p_1, p_2, p_4\}$, generated from SP by removing $p_2$ and $p_3$, respectively, give the higher value 16 for $Edges(SP', C, G)$, and the star patterns over these sets of properties are not better than the star patterns over SP. However, $SP' = \{p_1, p_2, p_3\}$, generated from SP by removing $p_4$, gives one star pattern; therefore, the star pattern involving the properties in SP′ is returned as the frequent star pattern without performing more iterations (lines 14-18). In case the set SP′ of properties is involved in more than one star pattern and the formula value of SP′ is smaller than the value of SP, SP′ is selected for the next iteration (lines 19-23). G.FSP stops and no further iterations are performed if none of the subsets SP′ of SP generates a smaller value for $Edges(SP', C, G)$ than $Edges(SP, C, G)$. G.FSP returns the star patterns involving SP, with a minimum value for $Edges(SP, C, G)$, as the frequent star patterns, and SP as the best set of properties.

Algorithm 2 G.FSP Algorithm

Input: A set S of properties of class C in G, and a list starList of star patterns over properties in S.
Output: Frequent star patterns fsp; a set SP of properties.

 1: fsp ← [], starList′ ← [], SP ← S, SP′ ← ∅, fValue ← fValue′ ← […]
 2: repeat
 3:   if |SP| ≥ 2 then
 4:     if AMI_G(SP | C) == 1 then
 5:       fsp ← starList
 6:       return fsp, SP
 7:     else
 8:       fValue ← Edges(SP, C, G)
 9:       for p ∈ SP do
10:         SP′ ← SP − {p}
11:         if |SP′| ≥ 2 then
12:           Create starList′ over SP′ using starList
13:           value ← Edges(SP′, C, G)
14:           if AMI_G(SP′ | C) == 1 then
15:             fValue′ ← value
16:             bestSP ← SP′
17:             bestSList ← starList′
18:             break
19:           else if value < fValue′ then
20:             fValue′ ← value
21:             bestSP ← SP′
22:             bestSList ← starList′
23:           end if
24:         end if
25:       end for
26:     end if
27:   end if
28:   starList ← bestSList, SP ← bestSP
29: until fValue′ > fValue
30: fsp ← starList
31: return fsp, SP

E.FSP and G.FSP work under the following assumptions: (a) all RDF molecules are complete, i.e., all class entities have values for all the properties; (b) all the properties are functional. In addition to these assumptions, G.FSP has one more assumption: (c) if there are ties while deciding between the sets of properties, only one will be selected. The complexity of E.FSP is exponential, i.e., $2^n$. G.FSP adopts a greedy approach and prunes the search space by selecting only the best set of properties during each iteration until the stop condition is met, i.e., no better set of properties with a minimum formula value can be found. In the worst case, the computational complexity of G.FSP is $\sum_{i=0}^{n} (n - i) = \frac{n(n+1)}{2}$, where $n$ is the cardinality of the input set of properties. The complexity of G.FSP thus grows quadratically, rather than exponentially, with the size of the input set of properties.
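The greedy descent can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not Algorithm 2 itself: it picks the cheapest subset of each level instead of breaking at the first subset with a single star pattern, it works on an assumed toy class rather than on a starList, and all names are hypothetical. It also counts how many subsets are scored, which is at most $n$ per level and therefore quadratic overall.

```python
# Assumed toy data: entity of class C -> {property: object}, functional properties.
ENTITIES = {
    "c1": {"p1": "e1", "p2": "e2", "p3": "e3", "p4": "e4"},
    "c2": {"p1": "e1", "p2": "e2", "p3": "e3", "p4": "e4"},
    "c3": {"p1": "e1", "p2": "e2", "p3": "e3", "p4": "e5"},
    "c4": {"p1": "e1", "p2": "e2", "p3": "e3", "p4": "e6"},
}
S = {"p1", "p2", "p3", "p4"}


def num_star_patterns(entities, props):
    """AMI_G(SP | C) with summation as f': the number of distinct object tuples."""
    props = sorted(props)
    return len({tuple(row[p] for p in props) for row in entities.values()})


def edges(entities, all_props, props):
    """Formula 1: AMI_G(SP | C) * (|SP| + 1) + AM_G(C) * |S - SP|."""
    return (num_star_patterns(entities, props) * (len(props) + 1)
            + len(entities) * len(set(all_props) - set(props)))


def greedy_fsp(entities, all_props):
    """Drop one property per level while Formula 1 improves.

    Returns the selected property set, its Formula 1 value, and the number
    of candidate subsets that were scored.
    """
    sp = set(all_props)
    evaluated = 0
    # Stop when subsets would fall below two properties or one star pattern remains.
    while len(sp) > 2 and num_star_patterns(entities, sp) > 1:
        current = edges(entities, all_props, sp)
        candidates = [sp - {p} for p in sorted(sp)]
        evaluated += len(candidates)
        best = min(candidates, key=lambda c: edges(entities, all_props, c))
        if edges(entities, all_props, best) >= current:
            break  # no subset improves on SP (cf. Theorem 4.1)
        sp = best
    return sp, edges(entities, all_props, sp), evaluated


print(greedy_fsp(ENTITIES, S))
```

On the toy class the descent scores only four subsets before stopping, whereas an exhaustive scan would score all eleven subsets with two or more properties.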

We present a solution to the problem of factorizing RDF graphs describing data using ontologies. A sketch of the proposed method is presented in Algorithm 3. The algorithm receives an RDF graph $G = (V, E, L)$, a class $C$, and a set $SP$ of properties from E.FSP or G.FSP, and generates a factorized RDF graph $G' = (V', E', L')$ and the entity mappings $\mu_N$ from the entities of class $C$ in $V$ in the RDF graph $G$ to the surrogate entities in $V'$ in the RDF graph $G'$. The algorithm initializes the set of mappings $\mu_N$, the set of nodes $V'$, the set of labeled edges $E'$, and the set of edge labels (properties) $L'$ of the factorized RDF graph $G'$ (line 1). For all the entities of $C$ related to the same objects $o_1, o_2, \ldots, o_n$ using edges annotated with the properties $p_1, p_2, \ldots, p_n$ in $SP$, the algorithm creates a surrogate entity $sg$ for the corresponding compact RDF molecule in $G'$ (lines 2-3). In lines 4-6, the algorithm maps all the entities that are related to $o_1, o_2, \ldots, o_n$ using the properties $p_1, p_2, \ldots, p_n$ in $G$ to the surrogate entity in $\mu_N$. Once all the mappings of the entities of $C$ in $G$ to the corresponding surrogate entities in $G'$ are in $\mu_N$, the factorized RDF graph $G'$ is created using $\mu_N$ (lines 8-29). For each RDF triple $(s\ p\ o)$ in $E$, if the entity mapping $\mu_N(s)$ is defined, then a compact RDF molecule is created. If $p$ is type, then the new edges $(s\ \mathit{instanceOf}\ \mu_N(s))$ and $(\mu_N(s)\ p\ o)$ are added to $G'$ along with the entities $s$, $o$, and the mapped surrogate entity of $s$, and the edge labels $p$ and instanceOf (lines 11-14). If $p$ is in $SP$, the new edge $(\mu_N(s)\ p\ o)$, the entities $\mu_N(s)$ and $o$, and the edge label $p$ are added to $G'$ (lines 15-18).
If the entity mapping $\mu_N(s)$ is defined but $p$ is neither in $SP$ nor type, then the edge $(s\ p\ o)$ is added to $G'$ along with the corresponding nodes and the edge label (lines 19-23). If the entity mapping $\mu_N(s)$ is not defined, then the edge $(s\ p\ o)$, the nodes $s$ and $o$, and the edge label $p$ are added to $G'$ (lines 24-28).

Figure 6 depicts the transformations for the set $\{p_1, p_2, \ldots, p_n\}$ of properties performed in an RDF graph whenever a corresponding factorized RDF graph is created. Figure 6a presents transformation rules, one rule for each property in $\{p_1, p_2, \ldots, p_n\}$ of class $C$. Each rule states how the labeled edges associated with an entity of $C$ in an original RDF graph are transformed into the edges in the factorized graph. Rule 1 asserts that the relation between an entity $s$ of $C$ and an object $o_1$ is not explicitly represented by one property in the factorized RDF graph. In order to retrieve $o_1$, a path across the labeled edges between the entity $s$ and the corresponding surrogate entity $sg$ has to be traversed. Similarly, the rest of the transformation rules establish how explicit associations between entities of $C$ and the objects using properties $p_2, \ldots, p_n$ in the original RDF graphs are represented by paths of labeled edges annotated with properties in the factorized RDF graphs. Algorithm 3 adds the corresponding labeled edges of these paths in lines 11-18. Furthermore, Figure 6b presents a portion of the RDF graph in Figure 1a and the corresponding transformation in the factorized RDF graph in Figure 3c. The surrogate entity and the new labeled edges are highlighted in bold in the factorized RDF graph. Algorithm 3 creates the surrogate entity in line 3; new edges are created in line 12. Additionally, assumptions about the characteristics of the entity associations in the graph are presented. Some edges existing between the entities in the RDF graph in Figure 1a are not present in the factorized RDF graph in Figure 3c; these entity associations can be obtained by traversing the factorized RDF graph as indicated by the corresponding transformation rules in Figure 6a.

Figure 6: Transformations in RDF Graph. Transformation rules preserved between original and factorized RDF graphs. (a) Transformation rules over the properties $p_1, p_2, \ldots, p_n$ of class $C$: Rule $i$ (property $p_i$) transforms the pattern "?s type C . ?s $p_i$ ?$o_i$ ." of the original RDF graph into "?sg type C . ?sg $p_i$ ?$o_i$ . ?s instanceOf ?sg ." in the factorized RDF graph; (b) portions of RDF graphs (original and factorized), under the assumptions that $p_1$, $p_2$, $p_3$, and instanceOf are functional properties. Nodes and edges added to create the factorized RDF graph are highlighted in bold.

Figure 7: RDF Graph Factorization Overhead. Factorization of RDF graphs is not worthwhile in all cases. (a) Percentage decrease in edges after factorization: entities of class $C$ in the original RDF graph match the frequent star pattern over the properties $p_1$, $p_2$, and $p_3$, and the number of labeled edges drops from 20 to 12; (b) percentage increase in edges after factorization: few entities match each star pattern, causing factorization overhead, and the number of labeled edges grows from 18 to 22 (percentage savings of −22.0%).

Algorithm 3 The Factorization Algorithm

Input: An RDF graph G(V, E, L), a class C, a set SP of properties from E.FSP (Algorithm 1) or G.FSP (Algorithm 2).
Output: Factorized RDF graph G′(V′, E′, L′) and entity mappings µN.

 1: µN ← ∅, V′ ← ∅, E′ ← ∅, L′ ← ∅
 2: for all o1, o2, ..., on ∈ V such that SS = { s | p1, p2, ..., pn ∈ SP AND (s type C) ∈ G, (s p1 o1) ∈ G, (s p2 o2) ∈ G, ..., (s pn on) ∈ G } do
 3:   sg ← SurrogateEntity()
 4:   for ss ∈ SS do
 5:     µN ← µN ∪ {(ss, sg)}
 6:   end for
 7: end for
 8: for (s p o) ∈ E ∧ s, o ∈ V do
 9:   if µN(s) ≠ ∅ then
10:     {Create compact RDF molecule}
11:     if p == type then
12:       E′ ← E′ ∪ {(s instanceOf µN(s)), (µN(s) p o)}
13:       V′ ← V′ ∪ {s, µN(s), o}
14:       L′ ← L′ ∪ {p, instanceOf}
15:     else if p ∈ SP then
16:       E′ ← E′ ∪ {(µN(s) p o)}
17:       V′ ← V′ ∪ {µN(s), o}
18:       L′ ← L′ ∪ {p}
19:     else
20:       E′ ← E′ ∪ {(s p o)}
21:       V′ ← V′ ∪ {s, o}
22:       L′ ← L′ ∪ {p}
23:     end if
24:   else
25:     E′ ← E′ ∪ {(s p o)}
26:     V′ ← V′ ∪ {s, o}
27:     L′ ← L′ ∪ {p}
28:   end if
29: end for
30: return G′(V′, E′, L′), µN

Figure 7 illustrates an example of factorization overhead, i.e., a case in which it is not worthwhile to factorize a class given a set of properties in an RDF graph. Figure 7a presents an example where savings are observed in the number of edges after factorization.
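The case analysis of the factorization algorithm, together with the two instanceOf axioms, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: triples are plain tuples, the toy graph and the surrogate names `sg1`, `sg2`, ... are hypothetical, every entity of the class is assumed to have exactly one value for each property in the chosen set, and node and label sets are left implicit.

```python
from collections import defaultdict

# Assumed toy graph: four entities of class C, 4 type edges + 16 property edges.
ROWS = {
    "c1": ("e1", "e2", "e3", "e4"),
    "c2": ("e1", "e2", "e3", "e4"),
    "c3": ("e1", "e2", "e3", "e5"),
    "c4": ("e1", "e2", "e3", "e6"),
}
TRIPLES = set()
for entity, row in ROWS.items():
    TRIPLES.add((entity, "type", "C"))
    TRIPLES |= {(entity, f"p{i}", o) for i, o in enumerate(row, start=1)}


def factorize(triples, cls, sp):
    """Return (factorized triples, entity mapping) for class `cls` over properties `sp`."""
    sp = sorted(sp)
    members = {s for s, p, o in triples if p == "type" and o == cls}
    value = {(s, p): o for s, p, o in triples if s in members}  # functional properties
    groups = defaultdict(list)
    for s in sorted(members):
        groups[tuple(value[(s, p)] for p in sp)].append(s)
    mu = {}
    for i, key in enumerate(sorted(groups), start=1):
        for s in groups[key]:
            mu[s] = f"sg{i}"  # one hypothetical surrogate per star pattern
    out = set()
    for s, p, o in triples:
        if s in mu and p == "type":
            out |= {(s, "instanceOf", mu[s]), (mu[s], p, o)}
        elif s in mu and p in sp:
            out.add((mu[s], p, o))
        else:
            out.add((s, p, o))
    return out, mu


def expand(factorized):
    """Apply the two instanceOf axioms to recover the triples of the original graph."""
    sg_of = {s: o for s, p, o in factorized if p == "instanceOf"}
    surrogates = set(sg_of.values())
    out = {(s, p, o) for s, p, o in factorized
           if p != "instanceOf" and s not in surrogates}
    for s, sg in sg_of.items():
        out |= {(s, p, o) for x, p, o in factorized if x == sg}
    return out


compact, mapping = factorize(TRIPLES, "C", {"p1", "p2", "p3"})
print(len(TRIPLES), len(compact))
```

Factorizing the same toy graph over `{"p4"}` alone creates three surrogates and yields more edges than the original graph, which is the kind of overhead discussed for Figure 7.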
The factorization of the RDF graph in Figure 7a for the class $C$ using the properties $p_1$, $p_2$, and $p_3$ reduces the number of edges from 20 to 12, and the positive value of 40.0% for the percentage savings indicates a percentage decrease in the number of edges. Furthermore, the edge savings gained after factorization are high enough to compensate for the addition of the surrogate entity $cM$ in the factorized RDF graph. In contrast, factorization of the RDF graph in Figure 7b over the class $C$ using two properties introduces an overhead by increasing the number of nodes and edges in the factorized RDF graph. The number of edges increases from 18 to 22 after factorization, and the negative value of −22.0% for the percentage savings indicates an increase in the number of edges. The star patterns detected in the original RDF graph in Figure 7b are replaced by the corresponding compact RDF molecules with the corresponding surrogate entities and new labeled edges (as presented in Algorithm 3). Due to the high number of star patterns, the addition of the surrogate entities and new labeled edges increases the size of the factorized RDF graph.

We study the effectiveness of the proposed techniques for detecting frequent star patterns. Moreover, given a class, we evaluate the impact of the factorization techniques on the size of the RDF graphs by selecting several combinations of the properties in the class. We empirically assessed the following research questions:

RQ1) Are the proposed frequent star patterns detection techniques able to efficiently detect the frequent star patterns in RDF graphs?
RQ2) Are the proposed frequent star patterns detection techniques able to detect the frequent star patterns in RDF graphs?
RQ3) What is the impact of different combinations of properties of a class on the size of factorized RDF graphs?
RQ4) Are the proposed factorization techniques able to reduce the number of labeled edges in RDF graphs?

Our experimental configuration is as follows:

Datasets. Evaluation is conducted on three LinkedSensorData datasets [22] (available at http://wiki.knoesis.org/index.php/LinkedSensorData) semantically described using the Semantic Sensor Network (SSN) Ontology. These RDF datasets comprise observations and measurements of several climate phenomena, e.g., temperature, visibility, precipitation, rainfall, and humidity, collected during the hurricane and blizzard seasons in the United States in the years 2003, 2004, and 2005. Table 1a describes the main characteristics of these RDF datasets. Moreover, Figure 8 shows the percentage of repeated RDF triples with wind speed, temperature, and relative humidity values in dataset D1D2D3. The unit of measurement is the same for each type of observation. These plots show that a few of the large number of observed values are highly repeated in the datasets. Further, values are not discretized, in order to produce the same query answers.

Table 1: Datasets. (a) Statistics of the datasets with observations about several weather phenomena, collected from around 20,000 weather stations in the United States; (b) the number of labeled edges NLE(G) in the datasets obtained after gradually integrating the RDF datasets D1, D2, and D3 describing observations.

(a) Statistics of datasets collected from around 20,000 weather stations in the US.

Dataset   Climate             Date           # RDF Triples   # Observations
D1        Blizzard            April, 2003    38,054,493      4,092,492
D2        Hurricane Charley   August, 2004   108,644,568     11,648,607
D3        Hurricane Katrina   August, 2005   179,128,407     19,233,458

(b) Number of labeled edges NLE(G) for the Observation and Measurement classes in the datasets D1, D1D2, and D1D2D3.

Figure 8: Percentage of Repeated RDF Triples with Observation Values. A few of the large number of values are highly repeated. (a) Percentage of repeated RDF triples with wind speed values in D1D2D3; (b) percentage of repeated triples with temperature values in D1D2D3; (c) percentage of repeated triples with relative humidity values in D1D2D3.

Metrics. We measure the results of our empirical evaluation in terms of the number of nodes and edges in an RDF graph. The size of an RDF graph is presented as the sum of nodes and edges in the graph, where the nodes correspond to the entities and objects, whereas the edges are labeled edges annotated with the properties of a class in an RDF graph. In our empirical evaluation, we report on the following metrics: a) Execution Time (Exec.Time(ms)) is the time required to find the frequent star patterns in RDF graphs. b) Number of Nodes (NN) is the number of Observation and Measurement entities and objects in RDF graphs. c) Number of Labeled Edges (NLE) represents the number of labeled edges annotated with the properties in the Observation and Measurement classes in RDF graphs. d) Percentage Savings in the Number of Labeled Edges (%Savings) stands for the percentage decrease or increase in the number of labeled edges, expressed as a positive or a negative value, respectively. For the metric %Savings, higher is better.
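The size metrics can be made concrete with a small Python sketch. It is an illustration under assumptions, not the evaluation code: nodes are approximated as the distinct subjects and objects of the triples, the function names are hypothetical, and the edge counts used in the example calls are the ones shown for the two panels of Figure 7.

```python
def graph_metrics(triples):
    """NN, NLE, and size (NN + NLE) of a graph given as (s, p, o) tuples."""
    triples = set(triples)
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    return {"NN": len(nodes), "NLE": len(triples), "size": len(nodes) + len(triples)}


def pct_savings(nle_original, nle_factorized):
    """%Savings: positive when factorization removes labeled edges, negative otherwise."""
    return 100.0 * (nle_original - nle_factorized) / nle_original


print(pct_savings(20, 12))  # edges decrease, savings are positive
print(pct_savings(18, 22))  # edges increase, savings are negative
print(graph_metrics([("c1", "type", "C"), ("c1", "p1", "e1"), ("c2", "p1", "e1")]))
```

A negative result flags a combination of class and properties for which factorization adds more surrogate and instanceOf edges than it removes.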

Implementation.

The experiments were performedon a Linux Debian 8 machine with a CPU In-tel Xeon(R) Platinum 8160 2.10GHz and 754GBRAM. The datasets are factorized for

Observation and

Measurement classes using all possible combi-nations of the properties in each class. Table 2shows the set of properties for

Observation and

Mea-surement (Meas.), respectively, in the SSN ontol- ogy. Each set of properties is assigned a set iden-tiﬁcation string

SID , and are referred with the cor-responding identiﬁcation string in the paper.

Ob-servation contains property , procedure , generatedBy ,and time property. procedure and generatedBy aresymmetric properties and are considered together inthe sets. Similarly, in Measurement , sets of proper-ties contain the properties value and unit . Further,for experiments, datasets are gradually merged to in-crease datasets size. The source code is available athttps://github.com/SDM-TIB/Graph-Factorization.Table 2:

Observation and Measurement Classes . Sets of properties containing diﬀerent properties ofthe

Observation and

Measurement (Meas.) classes in the SSN ontology, each set of properties is assigned aunique ID, e.g., A1 and A8 . Class Set of Properties SID O b s e r v a t i o n {property} A1{time} A2{procedure, generatedBy} A3{property, procedure, generatedBy,time} A4{property, procedure, generatedBy} A5{property, time} A6{procedure, time, generatedBy} A7 M e a s . {value, unit} A8{value} A9{unit} A1018 .1 Eﬃciency of Frequent Star Pat-terns Detection Approach For evaluating the eﬃciency of the proposed frequentstar patterns techniques and to answer the researchquestion

RQ1, we execute E.FSP and G.FSP over five percent of the RDF triples from dataset D. The selected RDF triples describe the Measurement and Observation classes, and several different types of observations from the Observation class are included. gSpan [30] is used to generate the frequent patterns space for E.FSP, which iterates over all the generated frequent patterns. To evaluate the efficiency of the two approaches, we selected five percent of the RDF triples from the dataset D; this fraction was chosen because gSpan was able to generate the frequent patterns for it within a thirty-minute timeout. The execution times of E.FSP and G.FSP are reported in Table 3.
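To make the idea of iterating over sets of properties concrete, the following minimal sketch enumerates every non-empty subset of properties on a toy graph and groups entities that share the same objects. It is a simplified stand-in, neither gSpan nor the authors' E.FSP or G.FSP. The triples, the function names, and the `min_entities` threshold used to call a pattern "frequent" are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy triples (subject, property, object); all names are hypothetical.
TRIPLES = [
    ("obs1", "property", "Temperature"), ("obs1", "procedure", "sensorA"), ("obs1", "time", "t1"),
    ("obs2", "property", "Temperature"), ("obs2", "procedure", "sensorA"), ("obs2", "time", "t2"),
    ("obs3", "property", "Temperature"), ("obs3", "procedure", "sensorB"), ("obs3", "time", "t3"),
]


def star_patterns(triples, props):
    """Group entities by the objects they share over all properties in `props`."""
    objects = defaultdict(dict)
    for s, p, o in triples:
        if p in props:
            objects[s][p] = o  # assumes one object per (entity, property)
    groups = defaultdict(set)
    for s, po in objects.items():
        if len(po) == len(props):  # the entity has a value for every property
            key = tuple((p, po[p]) for p in sorted(props))
            groups[key].add(s)
    return groups


def exhaustive_search(triples, all_props, min_entities=2):
    """Try every non-empty subset of properties; keep patterns matched by enough entities."""
    frequent = {}
    for size in range(1, len(all_props) + 1):
        for props in combinations(sorted(all_props), size):
            for pattern, entities in star_patterns(triples, props).items():
                if len(entities) >= min_entities:
                    frequent[pattern] = entities
    return frequent


frequent = exhaustive_search(TRIPLES, {"property", "procedure", "time"})
print(len(frequent))  # 3
```

With three properties the exhaustive loop already visits seven subsets, which hints at why an approach that avoids generating all subsets can be much faster on real data.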

G.FSP finds the frequent star patterns without generating all the star patterns involving all the possible subsets of properties. Table 3 shows, for E.FSP and G.FSP, the number of iterations over sets of properties (PSIterations), the number of frequent star patterns detected, and the execution time in milliseconds (Exec.Time(ms)) required to detect the frequent star patterns. The results indicate that E.FSP and G.FSP detect the same frequent star patterns. The frequent star patterns detected by E.FSP and G.FSP are over the sets of properties A5 and A8 for all the different observations in the Observation class and for the Measurement class, respectively. The execution time of G.FSP to detect frequent star patterns is lower by at least three orders of magnitude than the execution time of E.FSP, e.g., when detecting the frequent star patterns in the Measurement class (Table 3).

To answer the research questions
RQ2 and RQ3, we compute the values of Formula 1 for all the sets of properties given in Table 2 for the Observation and Measurement classes, respectively, in the three RDF datasets. The computed formula values for the Observation and Measurement classes are shown in Table 4. Moreover, we compute the size of the original and factorized RDF graphs in terms of nodes and edges. The formula values are also computed for the sets that contain only one property, and the factorization is performed using these sets as well, to illustrate the association between the formula values and the savings obtained in the factorized graphs.

Table 4 shows that the set A5 of properties in the Observation class generates the smallest formula values, 4,142,727 in D1, 15,756,888 in D1D2, and 34,898,603 in D1D2D3, compared with all the other sets A1, A2, A3, A4, A6, and A7. A smaller formula value for A5 indicates that the RDF graphs encapsulate a minimum number of star patterns over the properties in A5, such that a large number of entities of the Observation class match these star patterns. Therefore, replacing these star patterns with compact RDF molecules during the factorization reduces the size of the RDF graphs. Figure 9a presents the number of Observation nodes NN and labeled edges NLE in the original and factorized RDF datasets D1, D1D2, and D1D2D3. The results show that factorizing the Observation class over the set A5 of properties reduces the sum of the number of observation nodes and labeled edges in the factorized RDF graphs. In contrast, the large formula values for A4 (20,118,595 in D1, 78,698,580 in D1D2, and 174,865,870 in D1D2D3), compared with the other sets A1, A2, A3, A5, A6, and A7, indicate that a large number of star patterns over the properties in A4 exist in the RDF graphs and that only a small number of entities of the Observation class match these star patterns. Figure 9a depicts an increase in the number of Observation nodes NN and labeled edges NLE in the factorized RDF datasets D1, D1D2, and D1D2D3 after factorizing over the properties in A4. Similarly, the results for A1, A2, A3, A6, and A7 in Table 4 and Figure 9a show that the higher the formula value for a set of properties, the larger the number of nodes and edges in the RDF graphs factorized over the properties in that set.

In the case of the Measurement class, Table 4 shows smaller formula values for the set A8 of properties, i.e., 28,491 in D1, 34,554 in D1D2, and 40,302 in D1D2D3, than for the other sets A9 and A10. Figure 9b reports the sum of the nodes and labeled edges representing measurements in the original and factorized RDF datasets D1, D1D2, and D1D2D3. The sum of the nodes and labeled edges of the measurements is reduced in all the factorized RDF graphs by factorizing over the properties in A8. Furthermore, the higher formula values for the sets A9 and A10 indicate lower savings after factorization compared to the set A8. The numbers of nodes and edges in the RDF graphs factorized over the properties in sets A9 and A10 in Figure 9b are higher than for A8. These results show that different combinations of class properties impact the factorization of RDF graphs and that the proposed frequent star patterns detection techniques are able to detect the set of properties involved in the generation of frequent star patterns. Moreover, our techniques are able to anticipate the best set of properties, thus answering research questions RQ2 and
RQ3.

We factorize the gradually increasing RDF datasets D1, D1D2, and D1D2D3 over the Observation and Measurement classes using the sets of properties given in Table 2. The percentage savings are computed in terms of labeled edges for the observations and measurements in the RDF datasets after factorization. Table 1b presents the number of edges NLE(G) in the Observation and Measurement classes in the original RDF datasets D1, D1D2, and D1D2D3. Table 5 presents the number of labeled edges NLE(G′) and the percentage savings %Savings after factorization of the Observation and Measurement classes. The highest savings in NLE(G′) for observations, 49.14%, 49.43%, and 49.53% after factorizing D1, D1D2, and D1D2D3 over the properties in A5, show that the number of frequent star patterns over the properties in A5 is reduced by replacing them with the corresponding compact RDF molecules. On the other hand, the set A4 of properties gives negative values of percentage savings %Savings, −16.68 for the RDF dataset D1 and −16.67 for the RDF datasets D1D2 and D1D2D3, indicating an increase in the number of labeled edges after the factorization of the RDF datasets. Similarly, for measurements, the positive values of percentage savings, 66.37 after factorizing D1 and 66.56 for D1D2 and D1D2D3 over A8, indicate a decrease in the number of labeled edges after factorization. Furthermore, the percentage savings for the set A8 of properties are higher than for A9 and A10. These results allow us to positively answer research question RQ4.

Table 3: Efficiency of Frequent Star Patterns Detection. E.FSP and G.FSP are used to detect the frequent star patterns for the Observation and Measurement classes in the five percent of RDF triples from the dataset D. E.FSP and G.FSP detect the same frequent star patterns involving the sets A5 and A8 of properties from the Observation and Measurement classes, respectively. G.FSP takes less time than E.FSP to identify the same frequent star patterns.

| Class | Type | PSIterations (E.FSP) | PSIterations (G.FSP) | Frequent star patterns (E.FSP) | Frequent star patterns (G.FSP) | Exec.Time(ms) (E.FSP) | Exec.Time(ms) (G.FSP) |
|---|---|---|---|---|---|---|---|
| Observation | Precipitation | 8 | 4 | 23 | 23 | … | … |
| Observation | Pressure | 5 | 4 | 183 | 183 | … | … |
| Observation | Rainfall | 5 | 4 | 533 | 533 | … | … |
| Observation | RelativeHumidity | 5 | 4 | 341 | 341 | … | … |
| Observation | Snowfall | 8 | 4 | 382 | 382 | … | … |
| Observation | Temperature | 5 | 4 | 395 | 395 | … | … |
| Observation | Visibility | 5 | 4 | 395 | 395 | … | … |
| Observation | WindDirection | 5 | 4 | 350 | 350 | … | … |
| Observation | WindSpeed | 5 | 4 | 410 | 410 | … | … |
| Measurement | | 1 | 1 | 1,907 | 1,907 | … | … |

Figure 9: Nodes and Labeled Edges. The number of nodes NN and labeled edges NLE before and after factorization of the RDF datasets. (a) The number of nodes NN and labeled edges NLE representing observations in the RDF datasets; (b) The number of nodes NN and labeled edges NLE representing measurements.

Table 4: Values Computed for Formula 1. The sets of properties in Table 2 for the Observation and Measurement (Meas.) classes, respectively, are used to compute the Formula 1 values over the RDF datasets D1, D1D2, and D1D2D3. The minimum formula values for the Observation and Measurement classes and the corresponding sets of properties, A5 and A8, respectively, are highlighted in bold. The smaller formula values for A5 and A8 in the Observation and Measurement classes, respectively, indicate the maximum savings after factorizing the RDF graphs over the properties in A5 and A8, as shown in Figure 9 and Table 5.

| Class | SID | Edges(SP, C, G) in D1 | Edges(SP, C, G) in D1D2 | Edges(SP, C, G) in D1D2D3 |
|---|---|---|---|---|
| Observation | A1 | 12,071,185 | 46,643,440 | 103,815,183 |
| Observation | A2 | 12,090,195 | 46,687,690 | 103,891,717 |
| Observation | A3 | 8,111,623 | 31,205,888 | 69,358,875 |
| Observation | A4 | 20,118,595 | 78,698,580 | 174,865,870 |
| Observation | **A5** | **4,142,727** | **15,756,888** | **34,898,603** |
| Observation | A6 | 8,097,964 | 31,245,605 | 69,474,786 |
| Observation | A7 | 15,784,707 | 61,406,644 | 135,902,747 |
| Meas. | **A8** | **28,491** | **34,554** | **40,302** |
| Meas. | A9 | 4,037,067 | 15,563,838 | 34,623,579 |
| Meas. | A10 | 4,023,731 | 15,547,816 | 34,605,063 |

This article presents computational methods to identify frequent star patterns and to generate a factorized RDF graph with a minimized number of frequent star patterns. A frequent star pattern contains class entities linked to objects or other resources using labeled edges annotated with properties in the class. These frequent star patterns introduce redundancy in terms of edges and nodes. Our proposed computational methods implement the frequent star pattern detection algorithm, which is based on search space pruning techniques, to identify the classes and properties involved in frequent star patterns. Furthermore, the proposed factorization techniques generate a compact representation of RDF graphs, the factorized RDF graph, by replacing a frequent star pattern with a compact RDF molecule, composed of a surrogate entity connected to the objects in the frequent star pattern using labeled edges annotated with the relevant properties. We empirically study the effectiveness of the frequent star pattern detection algorithm in identifying the class and properties involved in the frequent star patterns. Furthermore, we evaluate the impact of the factorization techniques over gradually increasing RDF graph sizes and different combinations of class properties.
Experimental results suggest that the proposed computational methods successfully identify the class properties involved in the frequent star patterns and remove the redundancy caused by these frequent star patterns. For the best set of properties, identified by the frequent star pattern detection algorithm, the RDF graph size is reduced by up to 66.56%. Our work broadens the repertoire of techniques for representing and storing knowledge graphs by providing RDF graph compression techniques which exploit the semantics encoded in the data; these techniques generate compact representations.

Table 5: Percentage Savings in Labeled Edges after Factorization. Savings %Savings in the number of labeled edges NLE(G′) after factorization of the RDF datasets using the sets of properties in the Observation and Measurement classes.

| Class | SID | NLE(G′) in D1 | %Savings in D1 | NLE(G′) in D1D2 | %Savings in D1D2 | NLE(G′) in D1D2D3 | %Savings in D1D2D3 |
|---|---|---|---|---|---|---|---|
| Observation | A1 | 20,125,493 | 16.64 | 77,745,918 | 16.66 | 173,032,155 | 16.66 |
| Observation | A2 | 20,144,503 | 16.56 | 77,790,168 | 16.61 | 173,108,689 | 16.63 |
| Observation | A3 | 16,226,021 | 32.79 | 62,546,938 | 32.95 | 139,064,503 | 33.02 |
| Observation | A4 | 28,170,155 | -16.68 | 108,838,750 | -16.67 | 242,239,479 | -16.67 |
| Observation | A5 | 12,277,576 | 49.14 | 47,175,356 | 49.43 | 104,786,128 | 49.53 |
| Observation | A6 | 16,150,898 | 33.10 | 62,317,489 | 33.20 | 138,639,234 | 33.23 |
| Observation | A7 | 23,837,352 | 1.26 | 92,088,523 | 1.28 | 204,304,156 | 1.60 |
| Meas. | A8 | 4,059,738 | 66.37 | 15,599,469 | 66.56 | 34,716,176 | 66.56 |
| Meas. | A9 | 8,069,688 | 33.15 | 31,130,127 | 33.26 | 69,300,827 | 33.25 |
| Meas. | A10 | 8,056,352 | 33.26 | 31,114,105 | 33.29 | 69,282,311 | 33.26 |
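The replacement of a frequent star pattern by a compact RDF molecule can be illustrated on a toy graph. The sketch below is a simplified reading of the description in this article, not the authors' implementation: every distinct combination of objects over the chosen properties gets one surrogate entity, and each matching entity keeps a single link to its surrogate. The link predicate `hasSurrogate`, the surrogate naming scheme, and the toy measurements are hypothetical, and the edge counts it produces are not meant to reproduce Table 5.

```python
from collections import defaultdict


def factorize(triples, props, link="hasSurrogate"):
    """Replace shared object combinations over `props` by surrogate entities."""
    props = set(props)
    values = defaultdict(dict)
    for s, p, o in triples:
        if p in props:
            values[s][p] = o  # assumes one object per (entity, property)

    # Only entities that have every property in `props` are factorized.
    matched = {s: tuple(sorted(po.items()))
               for s, po in values.items() if len(po) == len(props)}

    # Keep all edges that are not part of a factorized star pattern.
    out = [(s, p, o) for s, p, o in triples if not (s in matched and p in props)]

    surrogate_of = {}
    for s, pattern in matched.items():
        if pattern not in surrogate_of:
            surrogate_of[pattern] = f"surrogate{len(surrogate_of) + 1}"
            # The surrogate carries the shared properties once.
            out.extend((surrogate_of[pattern], p, o) for p, o in pattern)
        out.append((s, link, surrogate_of[pattern]))
    return out


# Six hypothetical measurements sharing two value/unit combinations.
TRIPLES = []
for i, value in enumerate(["20", "20", "20", "20", "21", "21"], start=1):
    TRIPLES.append((f"m{i}", "value", value))
    TRIPLES.append((f"m{i}", "unit", "Celsius"))

factorized = factorize(TRIPLES, {"value", "unit"})
print(len(TRIPLES), len(factorized))  # 12 10
```

In this toy model the saving grows with the number of entities that share the same objects, which is the intuition behind preferring sets of properties whose star patterns are matched by many entities.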

Acknowledgments

Farah Karim is supported by the German Academic Exchange Service (DAAD); this work is partially funded by the EU H2020 project IASiS (GA No. 727658).

References

[1] D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 671–682. ACM, 2006.
[2] D. Allen, A. Hodler, M. Hunger, M. Knobloch, W. Lyon, M. Needham, and H. Voigt. Understanding trolls with efficient analytics of large graphs in Neo4j. BTW 2019, 2019.
[3] S. Álvarez-García, N. R. Brisaboa, J. D. Fernández, and M. A. Martínez-Prieto. Compressed k2-triples for full-in-memory RDF engines. arXiv preprint arXiv:1105.4004, 2011.
[4] M. Arenas, C. Gutierrez, and J. Pérez. Foundations of RDF databases. In Reasoning Web. Semantic Technologies for Information Systems, pages 158–204. Springer, 2009.
[5] S. Auer, V. Kovtun, M. Prinz, A. Kasprzik, M. Stocker, and M. Vidal. Towards a knowledge graph for science. In Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, WIMS 2018, 2018.
[6] C. Bizer, T. Heath, and T. Berners-Lee. Linked data: The story so far. In Semantic Services, Interoperability and Web Applications: Emerging Concepts, pages 205–227. IGI Global, 2011.
[7] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, volume 5, pages 225–237, 2005.
[8] N. R. Brisaboa, S. Ladra, and G. Navarro. k2-trees for compact web graph representation. In International Symposium on String Processing and Information Retrieval, pages 18–30. Springer, 2009.
[9] M. Compton, P. Barnaghi, L. Bermudez, R. García-Castro, O. Corcho, S. Cox, J. Graybeal, M. Hauswirth, C. Henson, A. Herzog, et al. The SSN ontology of the W3C semantic sensor network incubator group. Web Semantics: Science, Services and Agents on the World Wide Web, 17:25–32, 2012.
[10] G. P. Copeland and S. N. Khoshafian. A decomposition storage model. In ACM SIGMOD Record, volume 14, pages 268–279. ACM, 1985.
[11] M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis. GraMi: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment, 7(7):517–528, 2014.
[12] P. Ernst, A. Siu, and G. Weikum. KnowLife: a versatile approach for constructing a large knowledge graph for biomedical sciences. BMC Bioinformatics, 16(1):157, 2015.
[13] J. D. Fernández, A. Llaves, and Ó. Corcho. Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II, pages 244–259, 2014.
[14] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias. Binary RDF representation for publication and exchange (HDT). Web Semantics: Science, Services and Agents on the World Wide Web, 19:22–41, 2013.
[15] I. Grangel-González, L. Halilaj, M. Vidal, O. Rana, S. Lohmann, S. Auer, and A. W. Müller. Knowledge graphs for semantically integrating cyber-physical systems. In Database and Expert Systems Applications - 29th International Conference, 2018.
[16] A. K. Joshi, P. Hitzler, and G. Dong. Logical linked data compression. In Extended Semantic Web Conference, pages 170–184. Springer, 2013.
[17] F. Karim, M. N. Mami, M.-E. Vidal, and S. Auer. Large-scale storage and query processing for semantic sensor data. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, page 8. ACM, 2017.
[18] O. Lassila, R. R. Swick, et al. Resource description framework (RDF) model and syntax specification. 1998.
[19] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
[20] M. Meier. Towards rule-based minimization of RDF graphs under constraints. In International Conference on Web Reasoning and Rule Systems, pages 89–103. Springer, 2008.
[21] J. Z. Pan, J. M. G. Pérez, Y. Ren, H. Wu, H. Wang, and M. Zhu. Graph pattern based RDF data compression. In Joint International Semantic Technology Conference, pages 239–256. Springer, 2014.
[22] H. K. Patni, C. A. Henson, and A. P. Sheth. Linked sensor data. 2010.
[23] R. Pichler, A. Polleres, S. Skritek, and S. Woltran. Redundancy elimination on RDF graphs in the presence of rules, constraints, and queries. In International Conference on Web Reasoning and Rule Systems, pages 133–148. Springer, 2010.
[24] E. Prud'hommeaux and A. Seaborne. SPARQL query language for RDF. W3C recommendation (January 15, 2008), 2011.
[25] M. A. Roth and S. J. Van Horn. Database compression. ACM SIGMOD Record, 22(3):31–39, 1993.
[26] A. Singhal. Introducing the knowledge graph: things, not strings. Official Google Blog, 5, 2012.
[27] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al. C-Store: a column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 553–564. VLDB Endowment, 2005.
[28] M.-E. Vidal, K. M. Endris, S. Jazashoori, A. Sakor, and A. Rivas. Transforming heterogeneous data into knowledge for personalized treatments a use case. Datenbank-Spektrum, pages 1–12.
[29] T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. The implementation and performance of compressed databases. ACM SIGMOD Record, 29(3):55–67, 2000.
[30] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. Pages 721–724. IEEE, 2002.
[31] M. Zhu, W. Wu, J. Z. Pan, J. Han, P. Huang, and Q. Liu. Predicate invention based RDF data compression. In Joint International Semantic Technology Conference, pages 153–161. Springer, 2018.
[32] M. Zukowski, S. Heman, N. Nes, and P. A. Boncz. Super-scalar RAM-CPU cache compression. In