GGDs: Graph Generating Dependencies
Larissa C. Shimomura [email protected] University of Technology, Eindhoven, Netherlands
George Fletcher [email protected] University of Technology, Eindhoven, Netherlands
Nikolay Yakovets [email protected] University of Technology, Eindhoven, Netherlands
ABSTRACT
We propose Graph Generating Dependencies (GGDs), a new class of dependencies for property graphs. Extending the expressivity of state-of-the-art constraint languages, GGDs can express both tuple- and equality-generating dependencies on property graphs, both of which find broad application in graph data management. We provide the formal definition of GGDs, analyze the validation problem for GGDs, and demonstrate the practical utility of GGDs.
ACM Reference Format:
Larissa C. Shimomura, George Fletcher, and Nikolay Yakovets. 2020. GGDs: Graph Generating Dependencies. In Proceedings of -. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Constraints play a key role in data management research, e.g., in the study of data quality, data integration and exchange, and query optimization [1, 2, 7–9, 11–13]. As graph-structured data sets proliferate in domains such as social networks, biological networks and knowledge graphs, the study of graph dependencies is also of increasing practical interest [3, 7]. This raises new challenges as graphs are typically schemaless, unlike relational data.

Recently, different classes of dependencies for graphs have been proposed, such as Graph Functional Dependencies (GFDs [11]), Graph Entity Dependencies (GEDs [9]) and Graph Differential Dependencies (GDDs [14]). However, these dependencies focus on generalizing functional dependencies (i.e., variations of equality-generating dependencies) and cannot capture tuple-generating dependencies (TGDs) for graph data [7]. As an example, we might want to enforce the constraint on a human resources graph that “if two people vertices have the same name and address property-values and they both have a works-at edge to the same company vertex, then there should be a same-as edge between the two people.” This is an example of a TGD on graph data, as satisfaction of the constraint requires the existence of an edge (i.e., the same-as edge), and when not satisfied, we repair the graph by generating same-as edges where necessary. TGDs are important for many applications, e.g., for entity resolution during data cleaning and integration [8, 13].

Indeed, TGDs arise naturally in graph data management applications. Given the lack of TGDs for graphs in the current study of graph dependencies, we propose a new class of graph dependencies called Graph Generating Dependencies (GGDs), which fully supports TGDs for property graphs (i.e., TGDs for graphs where vertices and edges can have associated property values, such as names and addresses in our example above; this is the most common data model in practical graph data management systems) and generalizes earlier graph dependencies. Informally, a GGD expresses a constraint between two (possibly) different graph patterns, enforcing relationships between property values and topological structure.

In this short paper, we formally define GGDs, analyze the validation problem for GGDs, and illustrate the utility of GGDs for the entity resolution problem. We conclude the paper with indications for further study of GGDs.
We place GGDs in the context of relational and graph dependencies.
Relational data dependencies.
The classical Functional Dependencies (FDs) have been widely studied and extended for contemporary applications in data management. The dependencies in the state of the art most closely related to GGDs are the Conditional Functional Dependencies (CFDs [2, 8]) and the Differential Dependencies (DDs [17]). CFDs were proposed for data cleaning tasks, where the main idea is to enforce an FD only for a set of tuples specified by a condition, unlike the original FDs, in which the dependency holds for the whole relation. DDs extend FDs by specifying looser constraints according to user-defined distance functions between attribute values.
Graph dependencies.
Previous work in the literature focused on defining FDs for RDF data, TGDs for graph data exchange, and eliminating redundancy in RDF [1, 5, 12, 15]. Most closely related to GGDs are the graph functional dependencies (GFDs), graph entity dependencies (GEDs), and graph differential dependencies (GDDs) [9, 11, 14]. GFDs are formally defined as a pair $(Q[\bar{x}], X \rightarrow Y)$ in which $Q[\bar{x}]$ is a graph pattern that defines a topological constraint, while $X$ and $Y$ are two sets of literals that define the property-value functional dependencies of the GFD. Since graph data is usually schemaless, the property-value dependency is defined for the vertex attributes present in the graph pattern. GEDs subsume GFDs and can express FDs, GFDs, and equality-generating dependencies (EGDs). Besides the property-value dependencies present in GFDs, GEDs also carry special id literals to enable identification of vertices in the graph pattern. GDDs extend GEDs by introducing distance functions instead of equality functions, similar to DDs for relational data but defined over a topological constraint expressed by a graph pattern. Similar to the definition of our proposed GGDs, the Graph Repairing Rules (GRRs [6]) were proposed as an automatic repairing semantics for graphs. The semantics of a GRR is: given a source graph pattern, it should be repaired to a given target graph pattern. The graph-pattern association rules (GPARs [10]) are, according to [7], a specific case of TGDs and have been applied to social media marketing.
A GPAR is a constraint of the form $Q(x, y) \Rightarrow q(x, y)$, which states that if there exists an isomorphism from the graph pattern $Q(x, y)$ to a subgraph of the data graph, then an edge labeled $q$ between the vertices $x$ and $y$ is likely to hold.

The main differences of our proposed GGDs compared to previous works are the use of differential constraints (on both source and target side), the treatment of edges as first-class citizens in the graph patterns (in alignment with the property graph model), and the ability to entail the generation of new vertices and edges (see Section 4 for details). With these new features, GGDs can encode relations between two graph patterns as well as the (dis)similarity between the property values of their vertices and edges. In general, GGDs are the first constraint formalism for property graphs supporting both EGDs and TGDs, as well as DDs for property values.

We first summarize standard notation and concepts [3, 9, 17]. Let $O$ be a set of objects, $L$ be a finite set of labels, $K$ be a set of property keys, and $N$ be a set of values. We assume these sets to be pairwise disjoint.

A property graph is a structure $(V, E, \eta, \lambda, \nu)$ where
• $V \subseteq O$ is a finite set of objects, called vertices;
• $E \subseteq O$ is a finite set of objects, called edges;
• $\eta : E \rightarrow V \times V$ is a function assigning to each edge an ordered pair of vertices;
• $\lambda : V \cup E \rightarrow P(L)$ is a function assigning to each object a finite set of labels (i.e., $P(S)$ denotes the set of finite subsets of set $S$). Abusing the notation, we will use $\lambda_v$ for the function assigning labels to vertices and $\lambda_e$ for the function that assigns labels to the edges; and
• $\nu : (V \cup E) \times K \rightarrow N$ is a partial function assigning values for properties/attributes to objects,
such that the object sets $V$ and $E$ are disjoint (i.e., $V \cap E = \emptyset$) and the set of domain values where $\nu$ is defined is finite.

A graph pattern is a directed graph $Q[\bar{x}] = (V_Q, E_Q, \lambda_Q)$ where $V_Q$ and $E_Q$ are finite sets of pattern vertices and edges, respectively, and $\lambda_Q$ is a function that assigns a label $\lambda_Q(u)$ to each vertex $u \in V_Q$ and a label $\lambda_Q(e)$ to each edge $e \in E_Q$. Abusing notation, we use $\lambda_{vQ}$ as a function to assign labels to vertices and $\lambda_{eQ}$ to assign labels to edges. Additionally, $\bar{x}$ is a list of variables that includes all the vertices in $V_Q$ and edges in $E_Q$.

We say a label $l$ matches a label $l' \in L$, denoted as $l \asymp l'$, if $l \in L$ and $l = l'$, or if $l$ = ‘-’ (wildcard). A match, denoted as $h[\bar{x}]$, of a graph pattern $Q[\bar{x}]$ in a graph $G$ is a homomorphism of $Q[\bar{x}]$ to $G$ such that for each vertex $u \in V_Q$, $\lambda_{vQ}(u) \asymp \lambda_v(h(u))$; and for each edge $e = (u, u') \in E_Q$, there exists an edge $e' = (h(u), h(u'))$ and $\lambda_{eQ}(e) \asymp \lambda_e(e')$.

A differential function $\phi[A]$ on attribute $A$ is a constraint of difference over $A$ according to a distance metric [17]. Given two tuples $t_1, t_2$ in an instance $I$ of relation $R$, $\phi[A]$ is true if the difference between $t_1.A$ and $t_2.A$ agrees with the constraint specified by $\phi[A]$, where $t_1.A$ and $t_2.A$ refer to the value of attribute $A$ in tuples $t_1$ and $t_2$, respectively. We use the differential function idea to define constraints in GGDs.
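The definitions above can be made concrete with a small sketch. The following Python code is only an illustration under simplifying assumptions: the class `PropertyGraph`, the function names, and the dictionary-based pattern encoding are hypothetical and not part of the paper, and matches are enumerated by brute force rather than by an efficient matching algorithm.

```python
from dataclasses import dataclass, field
from itertools import product

WILDCARD = "-"  # the wildcard label used in graph patterns


@dataclass
class PropertyGraph:
    """Toy encoding of a property graph (V, E, eta, lambda, nu)."""
    vertices: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    eta: dict = field(default_factory=dict)     # edge id -> (source, target)
    labels: dict = field(default_factory=dict)  # object id -> set of labels
    props: dict = field(default_factory=dict)   # (object id, key) -> value


def label_matches(pattern_label, object_labels):
    """A pattern label matches if it is the wildcard or one of the labels."""
    return pattern_label == WILDCARD or pattern_label in object_labels


def matches(graph, pattern_vertices, pattern_edges):
    """Enumerate homomorphisms of a pattern into the graph by brute force.

    pattern_vertices: {vertex variable: label}
    pattern_edges: {edge variable: (source variable, label, target variable)}
    Yields dicts that bind every vertex and edge variable to a graph object.
    """
    v_vars = list(pattern_vertices)
    e_vars = list(pattern_edges)
    for image in product(sorted(graph.vertices), repeat=len(v_vars)):
        h = dict(zip(v_vars, image))
        if not all(label_matches(pattern_vertices[u],
                                 graph.labels.get(h[u], set()))
                   for u in v_vars):
            continue
        # For every pattern edge, collect the graph edges it may map to.
        candidates = []
        for e in e_vars:
            src, lab, dst = pattern_edges[e]
            candidates.append([
                g for g in sorted(graph.edges)
                if graph.eta[g] == (h[src], h[dst])
                and label_matches(lab, graph.labels.get(g, set()))
            ])
        for e_image in product(*candidates):
            yield {**h, **dict(zip(e_vars, e_image))}


# A teacher who mentors a high school student.
g = PropertyGraph(
    vertices={"v1", "v2"},
    edges={"e1"},
    eta={"e1": ("v1", "v2")},
    labels={"v1": {"teacher"}, "v2": {"student"}, "e1": {"mentors"}},
    props={("v2", "type"): "high school"},
)
found = list(matches(g, {"t": "teacher", "s": "student"},
                     {"m": ("t", "mentors", "s")}))
print(found)
```

Because a match is a homomorphism, two pattern variables may be bound to the same graph object; the sketch does not enforce injectivity.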
Figure 1: Example GGDs.

A Graph Generating Dependency (GGD) is a dependency of the form $Q_s[\bar{x}], \phi_s \rightarrow Q_t[\bar{x}, \bar{y}], \phi_t$ where:
• $Q_s[\bar{x}]$ and $Q_t[\bar{x}, \bar{y}]$ are graph patterns, called source graph pattern and target graph pattern, respectively;
• $\phi_s$ is a set of differential constraints defined over the variables $\bar{x}$ (variables of the graph pattern $Q_s$); and
• $\phi_t$ is a set of differential constraints defined over the variables $\bar{x} \cup \bar{y}$, in which $\bar{x}$ are the variables of the source graph pattern $Q_s$ and $\bar{y}$ are any additional variables of the target graph pattern $Q_t$.

A differential constraint in $\phi_s$ on $[\bar{x}]$ (resp., in $\phi_t$ on $[\bar{x}, \bar{y}]$) is a constraint of one of the following forms [14, 17]:
(1) $\delta_A(x.A, c) \leq t_A$
(2) $\delta_{A_1 A_2}(x.A_1, x'.A_2) \leq t_{A_1 A_2}$
(3) $x = x'$ or $x \neq x'$
where $x, x' \in \bar{x}$ (resp. $\in \bar{x} \cup \bar{y}$) for $Q_s[\bar{x}]$ (resp. for $Q_t[\bar{x}, \bar{y}]$), $\delta_A$ is a user-defined similarity function for the property $A$, $x.A$ is the property value of variable $x$ on $A$, $c$ is a constant of the domain of property $A$, and $t_A$ is a pre-defined threshold. The differential constraints defined by (1) and (2) can use the operators $(=, <, >, \leq, \geq, \neq)$. The user-defined distance function $\delta_A$ can be, for example, an edit distance when $A$ is a string or the difference between two numerical values.

The constraint (3) $x = x'$ states that $x$ and $x'$ are the same entity (vertex/edge); it can also use the inequality operator, stating that $x \neq x'$. Since the pattern variables $\bar{x}$ in $Q_s$ (resp. $\bar{x}, \bar{y}$ in $Q_t$) include both vertices and edges, this allows matching vertex-vertex, edge-edge and vertex-edge variables.

Example 1 (GGD $\sigma_1$ in Figure 1). Here, $\sigma_1$ implies that for the matches of the source graph pattern $Q_s$, if the student type is “high school” then there exists a match of the target graph pattern $Q_t$ in which the same matched vertex for teacher has an edge labelled ‘works’ to a ‘high school’ vertex, and in which the difference/(dis)similarity between the high school name and the student school name is less than or equal to 1.

Figure 2: Example GGDs.
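As an illustration of the three constraint forms, the sketch below encodes them as Python predicates over a variable binding and evaluates the constraints described in Example 1. All function names, the property encoding `(object id, key) -> value`, the choice of edit distance for both properties, and the sample data are assumptions made for this sketch; they are not prescribed by the paper. The sketch treats a constraint over a missing attribute as unsatisfied.

```python
import operator

# Comparison operators allowed in constraints of forms (1) and (2).
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def const_constraint(var, key, const, dist, op, threshold):
    """Form (1): dist(var.key, const) op threshold."""
    def check(binding, props):
        value = props.get((binding[var], key))
        if value is None:  # the attribute must exist on the matched object
            return False
        return OPS[op](dist(value, const), threshold)
    return check


def pair_constraint(var1, key1, var2, key2, dist, op, threshold):
    """Form (2): dist(var1.key1, var2.key2) op threshold."""
    def check(binding, props):
        left = props.get((binding[var1], key1))
        right = props.get((binding[var2], key2))
        if left is None or right is None:
            return False
        return OPS[op](dist(left, right), threshold)
    return check


def same_entity(var1, var2, equal=True):
    """Form (3): var1 = var2 (or var1 != var2 when equal is False)."""
    def check(binding, props):
        return (binding[var1] == binding[var2]) == equal
    return check


def satisfies(binding, props, constraints):
    """A binding satisfies a constraint set if it satisfies every member."""
    return all(c(binding, props) for c in constraints)


# Constraints of Example 1, with made-up property values.
props = {("v2", "type"): "high school",
         ("v2", "school"): "Lincoln High",
         ("v3", "name"): "Lincoln Hgh"}
binding = {"t": "v1", "s": "v2", "h": "v3"}
phi_s = [const_constraint("s", "type", "high school", edit_distance, "=", 0)]
phi_t = [pair_constraint("h", "name", "s", "school", edit_distance, "<=", 1)]
print(satisfies(binding, props, phi_s), satisfies(binding, props, phi_t))
```

Representing each constraint as a closure keeps the user-defined distance function and threshold together, so a different metric (for example, a numeric difference) can be swapped in per property.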
Example 2 (GGD $\sigma_2$ in Figure 1). According to $\sigma_2$, for the matches of $Q_s$, if the project department and the department name are (dis)similar according to the threshold “2”, then there exists an edge labelled “manages” linking the department and the project (graph pattern $Q_t$).

In order to interpret a GGD $Q_s[\bar{x}], \phi_s \rightarrow Q_t[\bar{x}, \bar{y}], \phi_t$, we first specify what it means for a graph pattern match to satisfy a set of differential constraints. Consider a graph pattern $Q[\bar{z}]$, a set of differential constraints $\phi_z$ and a match of this pattern represented by $h[\bar{z}]$ in a graph $G$. The match $h[\bar{z}]$ satisfies ($\models$) a differential constraint $k \in \phi_z$ if:
(1) When $k$ is $\delta_A(z.A, c) \leq t_A$, then attribute $A$ exists at the matched vertex/edge $h(z)$ and $\delta_A(z.A, c) \leq t_A$, meaning that the user-defined distance $\delta_A$ (for property $A$) between a constant $c$ and the value of attribute $A$ at that vertex/edge is less than or equal to the defined threshold $t_A$.
(2) When $k$ is $\delta_{A_1 A_2}(z.A_1, z'.A_2) \leq t_{A_1 A_2}$, then attributes $A_1$, $A_2$ exist at the matched vertices/edges $h(z)$ and $h(z')$, respectively, and $\delta_{A_1 A_2}(z.A_1, z'.A_2) \leq t_{A_1 A_2}$.
(3) When $k$ is $z = z'$, then $h(z)$ and $h(z')$ refer to the same vertex/edge.
The match $h[\bar{z}]$ satisfies $\phi_z$, denoted as $h[\bar{z}] \models \phi_z$, if the match $h[\bar{z}]$ satisfies every differential constraint in $\phi_z$.
If $\phi_z = \emptyset$ then $h[\bar{z}] \models \phi_z$ for any match of the graph pattern $Q[\bar{z}]$ in $G$.

Given a GGD $Q_s[\bar{x}], \phi_s \rightarrow Q_t[\bar{x}, \bar{y}], \phi_t$, we denote the matches of the source graph pattern $Q_s[\bar{x}]$ as $h_s[\bar{x}]$, while the matches of the target graph pattern $Q_t[\bar{x}, \bar{y}]$ are denoted by $h_t[\bar{x}, \bar{y}]$, which can include the variables $\bar{x}$ from the source graph pattern and additional variables $\bar{y}$ particular to the target graph pattern $Q_t[\bar{x}, \bar{y}]$.

A GGD $\sigma = Q_s[\bar{x}], \phi_s \rightarrow Q_t[\bar{x}, \bar{y}], \phi_t$ holds in a graph $G$, denoted as $G \models \sigma$, if and only if for every match $h_s[\bar{x}]$ of the source graph pattern $Q_s[\bar{x}]$ in $G$ satisfying the set of constraints $\phi_s$, there exists a match $h_t[\bar{x}, \bar{y}]$ of the graph pattern $Q_t[\bar{x}, \bar{y}]$ in $G$ satisfying $\phi_t$ such that for each $x$ in $\bar{x}$ it holds that $h_s(x) = h_t(x)$. In case a GGD is not satisfied, we typically fix this by generating new vertices/edges in $G$.

Example 3 (GGD $\sigma_3$ in Figure 2). Following the semantics of the GGDs, for every match of $Q_s$ in which the number of times an article mentions a person is greater than 10, there exists a match of $Q_t$ such that the theme type is “human”. Observe that, in this example, we use the property value of the edge variable $m$ in the differential constraint, which is possible in GGDs as edges are also considered variables in the graph patterns.

Example 4 (GGD $\sigma_4$ in Figure 2). This GGD enforces that if the latitude and longitude coordinates of the city $c$ in which a person works and of the city $i$ in which a person lives are the same, then $c$ and $i$ should refer to the same city. Observe that in this case the target graph pattern is empty.

GGDs can express other graph constraints previously proposed in the literature. Figure 3 shows the relationship between the graph dependencies in terms of expressiveness. GEDs [7] subsume GFDs [11], while GDDs [14] extend GEDs by including differential constraints, represented in the figure by the dashed line. GGDs can express GFDs, GEDs and GDDs by considering an empty target graph pattern ($Q_t[\bar{x}, \bar{y}]$). Since GEDs and GFDs only enforce equality between attributes, we can express the equality in GGD differential constraints by using an equality operator and a threshold value of 0.

Figure 3: Expressiveness of GGDs to other graph constraints (GFDs, GEDs, GDDs, GGDs).

We next discuss the validation problem for GGDs, defined as: given a finite set $\Sigma$ of GGDs and a graph $G$, does $G \models \Sigma$ (i.e., $G \models \sigma$ for each $\sigma \in \Sigma$)? We propose an algorithm to validate a GGD $\sigma = Q_s[\bar{x}], \phi_s \rightarrow Q_t[\bar{x}, \bar{y}], \phi_t$. This algorithm returns true if $\sigma$ is validated and returns false if $\sigma$ is violated.

We proceed as follows. For each match $h_s(\bar{x})$ of the graph pattern $Q_s[\bar{x}]$ in $G$:
(1) Check if $h_s(\bar{x})$ satisfies the source constraints (i.e., $h_s(\bar{x}) \models \phi_s$). If yes then continue.
(2) Retrieve all matches $h_t(\bar{x}, \bar{y})$ of the target graph pattern $Q_t[\bar{x}, \bar{y}]$ where $h_s(x) = h_t(x)$ for all $x \in \bar{x}$. If there are no such matches of the target graph pattern, return false.
(3) Verify if $h_t(\bar{x}, \bar{y}) \models \phi_t$. If there exists at least one match of the target graph pattern such that $h_t(\bar{x}, \bar{y}) \models \phi_t$, then return true, else return false.
Here, true and false are the outcome for the source match at hand; in line with the semantics above, $\sigma$ is validated on $G$ only if no source match yields false. This process is repeated for each $\sigma \in \Sigma$. For each match on which $\sigma$ is violated, new vertices/edges can be generated in order to repair it (i.e., in order to make the GGD $\sigma$ valid on $G$).

We next analyse the complexity of each of the “operations” presented in the algorithm separately, in order to analyse the complexity of the validation of a GGD. Graph pattern matching queries can be expressed as conjunctive queries (CQs) [3], which are well known to have NP-complete evaluation complexity [16]. The graph pattern matching problem can be solved in PTIME when the tree-width of the graph pattern is bounded by $k$ [11, 16]. To analyze the complexity of constraint checking, let $|h_s[\bar{x}]|$ be the number of matches found of the query pattern $Q_s$, $|\phi_s|$ the number of differential functions in $\phi_s$, and $f_i$ the cost to check the differential function $i$ defined by the user, in which $0 \leq i \leq |\phi_s|$. The total cost for checking the differential constraints in $\phi_s$ is:

$|h_s[\bar{x}]| \left( |\phi_s| \sum_{i=0}^{|\phi_s|} f_i \right)$.

For each of the matches that satisfies the differential functions in $\phi_s$, we verify the target side of the GGD, $Q_t[\bar{x}, \bar{y}], \phi_t$. Assuming that the cost for checking the differential functions is tractable, we can show that the complexity of the validation problem of GGDs follows from the evaluation problem for classical relational tuple-generating dependencies, i.e., it is $\Pi_2^P$-complete [16]. Pichler and Skritek have established polynomial time validation complexity for a large subclass of TGDs [16], which corresponds to graph patterns covering over 99% of graph patterns observed in practice [4].

Figure 4: GGD for Entity Resolution.
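The validation steps for a single GGD can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the matches of both patterns have already been computed by some pattern matcher and are given as dictionaries, that target matches also bind the source variables, and that constraints are plain Python predicates. Unlike the textual description, the hypothetical function `validate_ggd` also collects the violating source matches, since those are the ones a repair would act on. The sample data loosely follows Example 3.

```python
def validate_ggd(source_matches, target_matches, source_vars, phi_s, phi_t):
    """Check one GGD and return (holds, violating source matches).

    source_matches, target_matches: lists of dicts (variable -> graph object).
    source_vars: variables shared between source and target pattern.
    phi_s, phi_t: lists of predicates that take a match and return a bool.
    """
    violations = []
    for hs in source_matches:
        # Step 1: a source match that fails phi_s imposes no requirement.
        if not all(c(hs) for c in phi_s):
            continue
        # Step 2: keep the target matches that agree with hs on source_vars.
        agreeing = [ht for ht in target_matches
                    if all(ht.get(x) == hs[x] for x in source_vars)]
        # Step 3: at least one agreeing target match must satisfy phi_t.
        if not any(all(c(ht) for c in phi_t) for ht in agreeing):
            violations.append(hs)
    return (not violations, violations)


# Made-up data in the spirit of Example 3: articles mentioning persons.
props = {"m1": {"times": 12}, "m2": {"times": 3}, "m3": {"times": 20},
         "t1": {"type": "human"}}
source_matches = [
    {"a": "a1", "p": "p1", "m": "m1"},
    {"a": "a2", "p": "p1", "m": "m2"},
    {"a": "a3", "p": "p2", "m": "m3"},
]
target_matches = [
    {"a": "a1", "p": "p1", "m": "m1", "t": "t1", "w": "w1"},
]
phi_s = [lambda h: props[h["m"]]["times"] > 10]
phi_t = [lambda h: props[h["t"]]["type"] == "human"]

holds, violations = validate_ggd(source_matches, target_matches,
                                 ["a", "p", "m"], phi_s, phi_t)
print(holds, violations)
```

In this data, the second source match is skipped because its edge property does not exceed 10, and the third one is reported as a violation because no target match extends it.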
The main novelty of the GGDs is in the generation of new vertices or edges in case a GGD is violated. Given this feature, GGDs can be applied in different scenarios. In this section, we show how GGDs can be used in solutions for entity resolution (ER).

ER is the task of identifying and linking entities across (possibly) different data sources that refer to the same real-world entity [8, 13]. The generation of new vertices and/or edges in case a GGD is violated gives the possibility to rewrite ER matching rules or conditions as GGDs. For entity resolution, we can define the source graph pattern as several disjoint patterns from (possibly) different graph sources and use the target graph pattern specification as the representation of the deduplicated graph. Thus, using this approach, we can also encode more information than just vertex-to-vertex matching (or row-to-row matching in relational databases), as we consider all the information in a defined graph pattern.
Example 5 (Figure 4). As discussed before, the source graph pattern encodes the rules to perform entity resolution over (possibly) different graph sources. To perform ER, we can add links of type ‘sameAs’ between the matched entities in the target graph pattern. These links will be generated to validate the defined GGD.

A second interesting case in which GGDs can be used to solve entity resolution is when two graph patterns that refer to the same real-world entity have different structures in (possibly) different sources. In this case, we can generate with the GGDs a vertex or a graph pattern that summarizes all the information of these two graph patterns (see Example 6). An advantage of using GGDs for ER is the use of edges as variables, which allows the information of edge properties to be used in the matching rules as well, as can be observed in the next example.
Example 6 (Figure 5). In this case, we have two graph sources that model the same act of purchasing a product differently. In order to deduplicate this data, it is useful to create a vertex in the integrated graph that is able to aggregate the information that matches in both sources.

Figure 5: GGD for Entity Resolution.
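To illustrate how a violated GGD such as the one in Example 5 could be repaired, the sketch below generates the missing ‘sameAs’ edges for a list of violating source matches. It is a simplified illustration under stated assumptions: edges are encoded as `(source, label, target)` triples without identifiers or properties, the variable names `x`, `a`, `y`, `e` follow Figure 4, and the function `repair_with_same_as` as well as the sample data are hypothetical.

```python
def repair_with_same_as(edges, violating_matches,
                        pairs=(("x", "a"), ("y", "e"))):
    """Add a 'sameAs' edge for each variable pair of each violating match.

    edges: set of (source vertex, label, target vertex) triples; it is
           modified in place.
    violating_matches: source matches for which no target match exists.
    Returns the list of newly generated edges, in generation order.
    """
    generated = []
    for h in violating_matches:
        for left, right in pairs:
            edge = (h[left], "sameAs", h[right])
            if edge not in edges:  # do not generate the same link twice
                edges.add(edge)
                generated.append(edge)
    return generated


# Two teachers from source 1 matched with teachers from source 2; both
# work at the same university, which is also matched across the sources.
edges = {("t1", "works", "u1"), ("t2", "works", "u1")}
violating = [
    {"x": "t1", "a": "t9", "y": "u1", "e": "u7"},
    {"x": "t2", "a": "t8", "y": "u1", "e": "u7"},
]
generated = repair_with_same_as(edges, violating)
print(generated)
```

The university link is generated only once, although both violating matches require it. Creating an aggregating vertex as in Example 6 would follow the same idea, with a new vertex and copied property values instead of a new edge.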
Motivated by practical applications in graph data management, we proposed a new class of graph dependencies called Graph Generating Dependencies (GGDs). The GGDs are inspired by the tuple- and equality-generating dependencies from relational data, where constraint satisfaction can generate new vertices and edges. A GGD defines a graph dependency between two (possibly) different graph patterns, and the constraints over the property values are differential constraints. We also presented the complexity of the validation problem as well as how GGDs can be applied to the problem of ER.

As future work, we plan to study the satisfiability and implication problems for the GGDs, inference rules, tractable cases, the discovery of GGDs, repair of GGDs, and also further apply the GGDs to other tasks in graph data management.
ACKNOWLEDGMENTS
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825041.
REFERENCES
[1] Pablo Barceló, Jorge Pérez, and Juan L. Reutter. 2013. Schema mappings and data exchange for graph databases. In ICDT. 189–200.
[2] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. 2007. Conditional functional dependencies for data cleaning. In ICDE. 746–755.
[3] Angela Bonifati, George Fletcher, Hannes Voigt, and Nikolay Yakovets. 2018. Querying Graphs. Morgan & Claypool Publishers.
[4] Angela Bonifati, Wim Martens, and Thomas Timm. 2020. An analytical study of large SPARQL query logs. VLDB J. 29, 2 (2020), 655–679.
[5] Diego Calvanese, Wolfgang Fischl, Reinhard Pichler, Emanuel Sallinger, and Mantas Šimkus. 2014. Capturing relational schemas and functional dependencies in RDFS. In AAAI. 1003–1011.
[6] Y. Cheng, L. Chen, Y. Yuan, and G. Wang. 2018. Rule-based graph repairing: semantic and efficient repairing methods. In ICDE. 773–784.
[7] Wenfei Fan. 2019. Dependencies for Graphs: Challenges and Opportunities. J. Data and Information Quality 11, 2 (2019), 5:1–5:12.
[8] Wenfei Fan and Floris Geerts. 2012. Foundations of data quality management. Morgan & Claypool Publishers.
[9] Wenfei Fan and Ping Lu. 2019. Dependencies for Graphs. ACM Trans. Database Syst. 44, 2, Article 5 (Feb. 2019), 40 pages.
[10] Wenfei Fan, Xin Wang, Yinghui Wu, and Jingbo Xu. 2015. Association rules with graph patterns. Proc. VLDB Endow. 8, 12 (2015), 1502–1513.
[11] Wenfei Fan, Yinghui Wu, and Jingbo Xu. 2016. Functional Dependencies for Graphs. In SIGMOD. 1843–1857.
[12] Nadime Francis and Leonid Libkin. 2017. Schema Mappings for Data Graphs. In PODS. 389–401.
[13] Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM.
[14] Selasi Kwashie, Lin Liu, Jixue Liu, Markus Stumptner, Jiuyong Li, and Lujing Yang. 2019. Certus: An Effective Entity Resolution Approach with Graph Differential Dependencies (GDDs). Proc. VLDB Endow. 12, 6 (Feb. 2019), 653–666.
[15] Reinhard Pichler, Axel Polleres, Sebastian Skritek, and Stefan Woltran. 2010. Redundancy Elimination on RDF Graphs in the Presence of Rules, Constraints, and Queries. In Web Reasoning and Rule Systems. 133–148.
[16] Reinhard Pichler and Sebastian Skritek. 2011. The Complexity of Evaluating Tuple Generating Dependencies. In ICDT. 244–255.
[17] Shaoxu Song and Lei Chen. 2011. Differential Dependencies: Reasoning and Discovery.