LMKG: Learned Models for Cardinality Estimation in Knowledge Graphs
Angjela Davitkova§
TU Kaiserslautern (TUK)
Kaiserslautern, [email protected]

Damjan Gjurovski§
TU Kaiserslautern (TUK)
Kaiserslautern, [email protected]

Sebastian Michel
TU Kaiserslautern (TUK)
Kaiserslautern, [email protected]

§ Equal contribution
Abstract—Accurate cardinality estimates are a key ingredient to achieve optimal query plans. For RDF engines, specifically under common knowledge graph processing workloads, the lack of schema, correlated predicates, and various types of queries involving multiple joins render cardinality estimation a particularly challenging task. In this paper, we develop a framework, termed LMKG, that adopts deep learning approaches for effectively estimating the cardinality of queries over RDF graphs. We employ both supervised (i.e., deep neural networks) and unsupervised (i.e., autoregressive models) approaches that adapt to the subgraph patterns and produce more accurate cardinality estimates. To feed the underlying data to the models, we put forward a novel encoding that represents the queries as subgraph patterns. Through extensive experiments on both real-world and synthetic datasets, we evaluate our models and show that they overall outperform the state-of-the-art approaches in terms of accuracy and execution time.
I. INTRODUCTION
The capability of knowledge graphs (KG) to model structured data in a machine-readable way, while being easily extended by interlinking data from various sources, has contributed to their steadily increasing popularity. Ranging from information retrieval applications to content-based recommendation systems, knowledge graphs are present in various domains, inspiring the perpetual initiation of new ideas and solutions. In recent years, techniques used for mining knowledge graphs have been widely researched and hugely impacted by deep learning models. Efforts for the improvement of RDF graph representation and embeddings by deep learning models have led to a promising performance in the widely studied tasks of question answering and link prediction [16]. Deep learning has also performed exceptionally well when considering the tasks of graph generation and processing [33], [46]. However, one area remains vaguely explored, and that is the usage of deep learning models for query optimization in knowledge graphs.

Intuitively, producing efficient query plans heavily relies on accurate cardinality estimates [19]. Although an RDF database can be seen as a single table composed of three columns (subject, predicate, object), traditional techniques used in relational databases have shown to perform poorly for SPARQL queries [8], [14], [27], [35]. The challenges in cardinality estimation come directly from the nature of RDF data and the lack of a rigid schema.
First, the correlation between individual predicates renders the use of traditional cardinality estimation techniques, like histograms, inapt [27]. In other words, although two predicates may each be quite selective independently, their co-occurrence can be quite common compared to other combinations, leading to an inaccurate estimate if independence is assumed.
Moreover, SPARQL queries typically include many (self-) joins between RDF triples [8]. Hence, to accurately estimate the cardinality, we need to go beyond the join uniformity assumption [14], [35].
Finally, it is not uncommon for SPARQL queries to contain more than one type of query pattern, e.g., a query that exhibits both a star and a chain pattern. Such queries that contain a compound of different patterns further contribute to the cardinality estimation challenges [14], [35].

In this paper, we introduce
LMKG, a learned model framework for cardinality estimation in knowledge graphs. Given a knowledge graph, LMKG learns to estimate the cardinality of the most used types of queries (i.e., star and chain queries [1]). Motivated by the recent research advancements for cardinality estimation in relational databases [22], [45] and the ability of neural networks to detect interconnections between variables, with LMKG, we establish the problem of cardinality estimation of knowledge graph patterns as a deep learning problem. LMKG offers the creation of an unsupervised cardinality estimator (LMKG-U) by employing autoregressive models with graph pattern encodings. By encoding the queries as graph patterns, LMKG also provides the possibility for creating a supervised cardinality estimator (LMKG-S). LMKG efficiently learns the correlations between separate subgraph patterns and, as a result, provides distinctively accurate cardinality estimates. To handle the challenge of high correlation between terms and the large number of (self-) joins, LMKG learns over relevant subgraph patterns and not over independent terms or triples. To deal with the high number of patterns, LMKG provides an efficient sampling approach for generating relevant training data. In addition to the most typically used encodings, a novel subgraph encoding coined SG-Encoding is introduced. The SG-Encoding can incorporate various subgraph patterns while maintaining a compact representation. Although SG-Encoding makes it possible to handle the challenge of composite patterns, the proof of concept and detailed evaluation is left for our future work.

Contributions and Outline

The main contributions of this paper are:
1) We formulate the problem of cardinality estimation in knowledge graphs through the lenses of supervised and unsupervised deep learned models.
2) To tackle the problem of cardinality estimation in knowledge graphs, we develop a framework called LMKG that includes models of different types that can be tailored to a specific dataset or a sample workload.
3) To featurize subgraph patterns and provide them as input to the models, we explore different encodings, including our newly introduced
SG-Encoding.
4) We report on a comprehensive experimental study, evaluating the LMKG framework against the state-of-the-art approaches, and finally, objectively discuss the challenges of learned knowledge graph cardinality estimation.

The remainder of the paper is organized as follows. Section II discusses related work. Section III provides the necessary notation and formulates cardinality estimation as a supervised and unsupervised problem. Section IV sketches the high-level outlook of LMKG and explains the comprising phases. The considered query types and the proposed encoding strategies are explained in Section V. Section VI details the deep learning models used in LMKG. In Section VII, we elaborate on practical insights gained from the development of LMKG. Section VIII reports on the results of the experimental evaluation, while, eventually, Section IX concludes the paper.
II. RELATED WORK
Cardinality Estimation in Knowledge Graphs:
Early work on cardinality estimation in knowledge graphs focuses on introducing new estimation algorithms that either use statistics gathered for the properties of the ontology [34] or statistics for the summary of the graph patterns [23]. The Jena ARQ optimizer [37] uses pre-computed statistics and single-attribute synopses for estimating the join selectivity. However, the introduced estimation functions assume independence between the attributes, which leads to underestimations. Similarly to the Jena ARQ optimizer, RDF-3X [28] does not consider correlation between predicates. Vidal et al. [39] suggest that basic graph patterns can be partitioned into groups that share one common variable, so-called star-shaped groups. They use a sampling technique and introduce a cost model to estimate the cardinality of the star-shaped groups. Neumann and Moerkotte [27] introduce a synopsis called characteristic sets, which they use as a basis for estimating the cardinality of SPARQL queries, again focusing on star-shaped queries. Gubichev et al. [8] extend the statistics captured by the characteristic sets and use them for performing join ordering in SPARQL queries. Jachiet et al. [15] use statistics mainly focused on the predicates. Huang and Liu [14] propose combining two methods: Bayesian networks capturing the joint probability distribution over correlated properties for star query patterns, and a chain histogram for chain query patterns. Presto [40] stores statistics of common subgraph patterns, relying on the presence of bound variables in the query. Stefanoni et al. [35] represent the RDF graph in a more compact manner and use the created graph summaries for cardinality estimation. G-CARE [31], a recent benchmarking framework, compares existing approaches for cardinality estimation in graphs. They consider the state-of-the-art approaches in the area of knowledge graphs and additionally adapt existing cardinality estimators used in relational databases for graphs.
Learned Approaches for Cardinality Estimation in Relational DBMS:
Recently, the usage of machine learning models for the improvement of traditional database components has expanded, ranging from applications in query optimization and approximate query processing to complete system enhancement. An early work, LEO [36], computes adjustments to the optimizer's statistics and cardinality estimates, later used during query optimization, by monitoring previously executed queries. Liu et al. [22] provide an effective neural network selectivity estimation for all relational operators by taking a bounded range of each column as input. Similarly, MSCN [17], a multi-set convolutional network, represents relational query plans with set semantics to capture query features and true cardinalities. Dutt et al. [5] provide an application of neural networks and tree-based ensembles for selectivity estimation of multi-dimensional range predicates. For overcoming the issue of misestimates, Woltmann et al. [42] suggest a local-oriented approach. Hayek and Shmueli [12] propose the usage of estimated containment rates for improving query estimates. Exploration of simple deep learning models [30] and pure data-driven models [13] has also proven efficient for cardinality estimation and beyond. Unlike the previous work, Naru [45] is an unsupervised data-driven synopsis that achieves highly accurate cardinality estimation in relational databases with adapted deep autoregressive models, introducing a new Monte Carlo integration technique called progressive sampling. The recently proposed NeuroCard [44] approach, developed independently from our work, extends the idea of autoregressive models to a join cardinality estimator over an entire database and has the potential to be applied to KGs. A deeper investigation of its applicability to cardinality estimation for KG queries is part of our future work. Similarly, Hasan et al. [11] suggest the usage of both autoregressive models and supervised deep learning models for accurate cardinality estimation. Others [18], [24], [25], [29] shift the focus from explicit modeling of cardinalities to optimal plan generation by applying reinforcement learning.
Learning in Graphs:
Although not for cardinality estimation over knowledge graph queries, deep learning for KGs has been widely researched [16]. Since we need to represent the subgraphs efficiently, the work on KG embeddings is of particular importance for us. For instance, Hamilton et al. [10] propose an inductive approach that uses node features to learn an embedding function. TransE [2] and TransH [41] create term embeddings by following the translation principle. However, generating the node embeddings in the presence of unbound terms has not been discussed. Additionally, this work mainly focuses on term or triple representation and not on a subgraph representation. In more closely related research, GraphAF [33] and MolecularRNN [32] focus on generating molecular graphs with the use of deep learning. Although efficient for generating molecules, this work does not allow the estimation of the individual term densities. To create our encoding, we build on their idea for subgraph representation.
III. NOTATION & PROBLEM STATEMENT
An RDF knowledge graph KG is a finite set of triples. Each triple t is constructed out of three terms (s, p, o), corresponding to a subject, a predicate, and an object. Every term is uniquely identified with a URI, where the objects can also be literals (e.g., strings, integers). More specifically, the subject is a resource or a node in the graph, via which a predicate forms a relationship to another node or a literal value, called an object. A standard query language for accessing RDF stores is SPARQL. SPARQL is based on matching graph patterns, and there exist various types of patterns that a query can have (e.g., star, chain). A SPARQL query can have variables that are not bound to a specific term (e.g., ?x).

Let us consider a knowledge graph KG with triple patterns t_i ∈ KG of the form (s_i, p_i, o_i) with domains s_i ∈ S, p_i ∈ P, and o_i ∈ O. We consider star and chain queries with an arbitrary number of joins and unbound variables. Let qp be a SPARQL query consisting of a graph pattern {t_1, ..., t_j, ..., t_k}, where k > 0 ∧ t_j ∈ KG, and every triple t_j can have an arbitrary number of unbound variables. The cardinality card(qp) represents the number of graph patterns from the knowledge graph KG that match the query graph pattern qp, i.e., the result size of qp. We aim at developing an estimator est(qp) whose predicted value is as close as possible to the real cardinality card(qp). As an estimator, we explore the usage of supervised and unsupervised models.
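To make the estimation target concrete, the following minimal sketch computes the exact card(qp) by brute-force pattern matching over a toy triple set; the helper names (matches, card) and the toy triples are illustrative only and not part of LMKG.

# Toy knowledge graph: a set of (subject, predicate, object) triples.
KG = {
    ("TheShining", "hasAuthor", "StephenKing"),
    ("TheShining", "genre", "Horror"),
    ("IT", "hasAuthor", "StephenKing"),
    ("IT", "genre", "Horror"),
    ("StephenKing", "bornIn", "USA"),
}

def matches(triple, pattern, binding):
    """Try to extend a variable binding so that `pattern` matches `triple`."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                 # unbound variable
            if new.setdefault(term, value) != value:
                return None
        elif term != value:                      # bound term must match exactly
            return None
    return new

def card(query):
    """Exact cardinality: number of bindings that match all triple patterns."""
    bindings = [{}]
    for pattern in query:
        bindings = [b2 for b in bindings for t in KG
                    if (b2 := matches(t, pattern, b)) is not None]
    return len(bindings)

# Star query: ?x :hasAuthor :StephenKing . ?x :genre :Horror .
print(card([("?x", "hasAuthor", "StephenKing"),
            ("?x", "genre", "Horror")]))         # 2

The learned estimator est(qp) aims to approximate exactly this quantity without touching the data at query time.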
Supervised estimator: As a supervised estimator, we investigate the usage of a deep neural network. The neural network receives as input the query pattern qp and produces as output the estimated cardinality est(qp).
Unsupervised estimator: As an unsupervised estimator, we investigate the usage of an autoregressive model. These models decompose the joint density into n conditional probabilities, where n is the number of terms used to train the model.
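The decomposition can be illustrated with a short sketch; the conditional probabilities below are hand-picked toy values, not model outputs, and the final scaling step into a cardinality estimate follows the usual practice of data-driven estimators, stated here as an assumption.

# Chain-rule factorization used by autoregressive models:
# P(x1, ..., xn) = P(x1) * P(x2 | x1) * ... * P(xn | x1, ..., x{n-1}).
conditionals = [0.30,   # P(subject = s)                      (toy value)
                0.50,   # P(predicate = p | subject = s)      (toy value)
                0.10]   # P(object = o | subject, predicate)  (toy value)

point_density = 1.0
for c in conditionals:
    point_density *= c

# Multiplying the point density by the total number of triples yields a
# cardinality estimate for a fully bound pattern (assumed scaling step).
total_triples = 1_000
print(point_density * total_triples)  # ~15.0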
IV. FRAMEWORK OVERVIEW

As the name suggests, the LMKG framework represents a compound of several models. It comprises two phases that are depicted in Fig. 1. In the first phase, labeled the creation phase, the initial step is to decide on the models that will be created, then generate the adequate training data if there is no sample workload available, and, ultimately, train and tune the chosen learned models. The second phase, called the execution phase, encompasses the user-system interaction, indicating the start of the querying process. Below, we delineate the tasks that are performed in the two phases.
Fig. 1: LMKG framework overview
Model choice
LMKG allows the creation of different models that can be tailored to a specific type of query. Under the assumption that there is no existing query workload, LMKG initially tries to create optimal models considering both the memory and the accuracy. However, if a sample query workload and a defined budget are provided, LMKG can decide, based on the workload statistics, which models have a higher priority. Additionally, the number of models can also be tuned by the user without a query workload. With our newly proposed encoding strategy (Subsection V-A1), the user also has the possibility to create only one model capable of handling different types of queries (e.g., both star and chain queries). Having a single model that captures all query types and sizes may lead to larger errors. Therefore, we propose various model grouping strategies. If a change in the workload of queries is detected during the execution phase, a new model may be created, or an existing model may be dropped.
Training data creation
Once the learned models have been decided upon and there is no sample workload available, the next step is to create the training data. LMKG creates the training data for the cardinality estimation models from the provided knowledge graphs, as described in Section VII-A. When LMKG trains in a supervised manner, the training data consists of different graph patterns tailored for specific knowledge graph queries and their assigned cardinalities. It is important to note that the graph patterns can include unbound variables. For the unsupervised approach, the training data includes only the bound graph patterns, since the model can estimate the conditional probabilities of the individual terms and use them for queries involving patterns with unbound variables.
Training
The training step uses the training data, either generated in the previous step or provided as a sample workload, to create one general model for multiple query types or several grouped models, each tailored for answering specific types and/or sizes of queries. The training step involves transforming the different graph patterns into features suitable for a deep learning model. This process depends on the chosen model.
Querying
Once the model creation phase is finished, the models can be used by the user, indicating the start of the execution phase. Given a user-specified query, the task of LMKG is to provide an estimate of the query cardinality. As depicted in Fig. 1, if a query is of a specific type and size that is already learned by one of the models, we can directly use the model to estimate its cardinality. However, if this is not the case and the query contains multiple patterns, it is forwarded to the Query Decomposition step, where it is decomposed into query patterns that are suitable for the existing models. The query subgraphs are then forwarded to the encoding process, where each pattern is encoded according to the model. The featurizers then forward the inputs through the models and receive a prediction for the cardinality.

Since LMKG represents the starting point and the first attempt at learning cardinalities of knowledge graph patterns, the focus is limited only to equality, i.e., the presence or absence of terms. For cardinality estimation of range queries, one could modify the input encoding with histogram selectivity values, which we will address in future work.
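A hypothetical sketch of this execution-phase flow is given below; the decompose and encode stubs, the StubModel class, and, in particular, the multiplicative combination of per-pattern estimates are placeholders of our own, since the final combination rule is not spelled out here.

from dataclasses import dataclass

@dataclass
class Pattern:
    topology: str   # e.g., "star" or "chain"
    size: int       # number of triples in the pattern
    triples: list

class StubModel:
    """Stands in for a trained LMKG model; returns a fixed estimate."""
    def __init__(self, value): self.value = value
    def predict(self, encoded): return self.value

def decompose(query):
    # Placeholder: a real decomposer would split a composite query into
    # the star/chain patterns supported by the existing models.
    return query

def encode(pattern):
    # Placeholder for the SG or pattern-bound encoding of Section V.
    return pattern.triples

def estimate_query(query, models):
    """Combine per-pattern estimates; plain multiplication is only a
    stand-in here, not the combination rule used by LMKG."""
    est = 1.0
    for p in decompose(query):
        est *= models[(p.topology, p.size)].predict(encode(p))
    return est

models = {("star", 2): StubModel(120.0), ("chain", 2): StubModel(3.5)}
q = [Pattern("star", 2, [...]), Pattern("chain", 2, [...])]
print(estimate_query(q, models))  # 420.0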
V. QUERIES AND FEATURIZATION
A SPARQL query consists of triple patterns that can have unbound variables instead of one or more terms. Queries are distinguished based on their topologies into different classes: chain, star, tree, cycle, clique, petal, flower, and graph. Although LMKG can be easily adapted to each of the query types, we focus on the two most common types: star and chain [1]. Consider the following example:
SELECT ?x WHERE {
  ?x :hasAuthor :StephenKing ;
     :genre :Horror .
}
The query pattern asks for all subjects (i.e., books) that are of genre Horror and have as author StephenKing. This pattern corresponds to the class of star-shaped queries, in which the triples are centered around a single entity, i.e., the same subject or object. Formally, a subject star-shaped query consists of several triples [(s, p_1, o_1), ..., (s, p_k, o_k)] such that all triples are centered around the same subject s. A chain-shaped query joins triples such that the object of the preceding triple is the subject of the next one. More specifically, a chain query consists of k triples [(s_1, p_1, o_1), (s_2, p_2, o_2), ..., (s_k, p_k, o_k)], such that s_i = o_{i-1}, where i ∈ [2, k]. As an example, consider:

SELECT ?x, ?y WHERE {
  ?x :hasAuthor ?y .
  ?y :bornIn :USA .
}
The triples share a common unbound variable ?y, which in the first triple has the role of an object and in the second of a subject. In this case, the solution consists of all authors that have written a piece and are born in the USA.

We next delineate the single triple pattern encodings that constitute the more involved join query encodings. To model a triple pattern, we convert the triple terms into numerical values, each having an identifier in the range of 1 to the maximal number of nodes or predicates. Each of the terms in the triple is separately encoded. An encoding of a triple pattern is a concatenation of the encodings of the terms. To capture term correlation, we always include all terms present in a subgraph and focus on a more compact representation of them. Since a simple integer encoding requires an ordinal relationship, which is not present in categorical variables such as the ones used in our case, LMKG currently supports two types of encodings for the triple patterns:

• One-hot encoding: an encoding in which the position of a bound term involved in the triple is set to 1. For example, if we assume that the total number of subjects in our knowledge graph is 3, then the one-hot encoding for the subject with id 2 will be [010]. An unbound term is treated as an absent one by setting its position in the encoding to 0. A single triple encoding constructed out of one-hot encoded terms will occupy O(|S| + |P| + |O|) space.

• Binary encoding: an encoding in which the value of the term is represented as a binary number. As an example, for a KG with 3 unique subjects, the binary encoding of the subject with id 2 will be [10]. As in the one-hot encoding, an absent term is simply encoded with a value of 0. The space of the binary triple encoding is now O(log|S| + log|P| + log|O|). Although the binary encoding loses some of the expressiveness attained by the one-hot encoding and introduces larger complexity in the input, it is the preferred choice for encoding triple patterns. This is mainly because knowledge graphs usually include data from various domains and consequently have a large number of unique subjects, predicates, and objects. Hence, a binary encoding results in a smaller input dimensionality, making LMKG capable of dealing with heterogeneous knowledge graphs.
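The two term encodings can be sketched as follows; the helper names and the assumed domain sizes are illustrative, with id 0 standing for an unbound term as described above.

import math

def one_hot(term_id, domain_size):
    """One-hot encoding; id 0 denotes an unbound/absent term (all zeros).
    Bound ids start at 1, matching the 1..max identifier range above."""
    v = [0] * domain_size
    if term_id > 0:
        v[term_id - 1] = 1
    return v

def binary(term_id, domain_size):
    """Binary encoding of the integer id; 0 encodes an unbound term."""
    bits = max(1, math.ceil(math.log2(domain_size + 1)))
    return [(term_id >> i) & 1 for i in reversed(range(bits))]

# A triple pattern is the concatenation of its three term encodings,
# e.g., (?x, hasAuthor, StephenKing) with predicate id 1 and object id 2.
S, P, O = 5, 3, 5          # assumed domain sizes, illustrative only
triple = binary(0, S) + binary(1, P) + binary(2, O)
print(triple)              # [0, 0, 0, 0, 1, 0, 1, 0]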
A. Query Encodings

The most popular graph representations are the adjacency list, a list of nodes each pointing to a list of edges originating from the node, and the adjacency matrix A, where A_{i,j,l} = 1 if node i has an edge l to node j. Considering space consumption, adjacency lists are preferred when graphs are sparse, as they do not represent absent edges. However, as graphs get denser, the benefit of adjacency lists vanishes and can even lead to a higher space complexity compared to adjacency matrices. Having these baselines in mind, we next present two possible directions for representing KG subpatterns, as well as outline the differences with the existing approaches:
1) Graph-based encoding:
The graph-based encoding should have the ability to represent every possible subgraph pattern of a specific size. Since the input of a neural network is of fixed size, the encoding should be fixed to the maximal number of features used to represent a dense subgraph of the KG. Knowing that the number of graph nodes (i.e., subjects and objects) is d and the number of predicates is b, a KG encoding can be an adjacency matrix A of space O(d * d * b). For the representation of a small subgraph pattern composed of several nodes and predicates, the complete matrix will be required, leading to huge matrices for real-life KGs. Incontrovertibly, this severely impacts the training time as well as the accuracy of the model. Even for adjacency lists, we would need to encode the complete space of possibilities, leading to a dense graph representation, which, as already discussed, is not preferred over an adjacency matrix.
Fig. 2: Encoding example (left: pattern-bound encoding with the one-hot and binary term encodings and the Term-ID mapping; right: SG-Encoding with the adjacency tensor A, X = [000 001 100], and E = [01 10])
If we consider subgraph patterns that represent a subset of the complete graph, space consumption can be drastically reduced. Following existing work on molecular graphs [32], [33], we can represent a KG subgraph with n nodes as G = (A, X), where A ∈ {0,1}^{n×n×b} and X ∈ {0,1}^{n×d}. Although suitable for molecular graphs, where the number of distinct relations is small, in most KGs this encoding creates a large sparse matrix A, due to the high number of relations.

Following this line of thought, to circumvent the large matrices created by the existing representations and adapt them to knowledge subgraphs, LMKG employs a novel subgraph encoding, termed SG-Encoding. We define a subgraph pattern to have n nodes and e predicates, where n < d and e < b. A subgraph pattern is represented as SG = (A, X, E), where A is the adjacency tensor, X is the node feature matrix, and E is the predicate feature matrix. Given an ordering of the nodes and predicates from the subgraph pattern, we define A ∈ {0,1}^{n×n×e}, X ∈ {0,1}^{n×d}, and E ∈ {0,1}^{e×b}, where A_{ijl} = 1 if there exists an edge l between the i-th and j-th node. We set X_{im} to 1 if the i-th node is of type m, and E_{lk} to 1 if the l-th predicate is of type k. Intuitively, if b is smaller than a threshold value, we can eliminate the matrix E, as shown in previous work [33].

Initially, each row in X and E is a one-hot encoding of size d and b, respectively, where the rows in X represent the node type and the rows in E the edge type. For a more compact representation, we provide a further modification of the matrices X and E, where instead of one-hot encoding, we employ a binary encoding for each of the nodes and edges in the subgraph. More specifically, the encoding is modified such that X ∈ {0,1}^{n×⌈log|d|+1⌉} and E ∈ {0,1}^{e×⌈log|b|+1⌉}, thus drastically reducing the space for the subgraph representation. Although not all subjects are objects and vice versa, when encoding SPARQL queries there is still a need to express that they can be equal. This is especially important for chain queries, where the subject of one triple pattern is the object of another. Consequently, there is only a single node matrix and not two separate ones.

The SG-Encoding can handle composite queries as it can simultaneously represent more than one type of query. To initiate the SG-Encoding, an ordering of the nodes and edges is required. The subgraph encoding for an example star query is illustrated in Fig. 2 (right). We show the encoding of a query by breaking down the process into three main steps. In Step 1, executed in the creation phase (Section IV), a mapping of the nodes and predicates to an integer id is created. Next, SG-Encoding creates an ordering of the nodes and predicates for the given query, which is illustrated in Step 2. Finally, for a predefined n = 3 and e = 2, the parts of the encoding are created. For the subpattern ?Book :hasAuthor :StephenKing., we set A_{121} = 1, indicating that :hasAuthor is the first predicate in the edge order, the unbound variable is the first, and :StephenKing is the second in the respective node order. The mapping of nodes and edges in the query to their ids is represented by X and E, respectively.
From the example, the first two bits in E indicate that the predicate :hasAuthor has id 1.
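A small sketch of how the SG-Encoding triple (A, X, E) could be materialized for the example star query follows; the function name sg_encode is ours, and the node and predicate ids (StephenKing → 1, Horror → 4, hasAuthor → 1, genre → 2) are taken from the running example so that X and E reproduce the values [000 001 100] and [01 10] quoted above.

import numpy as np

def sg_encode(n, e, node_bits, edge_bits, nodes, edges):
    """Build SG = (A, X, E) for a subgraph pattern with n nodes and e
    predicates. `nodes` maps node position -> integer id (0 = unbound);
    `edges` is a list of (i, j, l, predicate_id) entries."""
    A = np.zeros((n, n, e), dtype=np.uint8)
    X = np.zeros((n, node_bits), dtype=np.uint8)
    E = np.zeros((e, edge_bits), dtype=np.uint8)

    def to_bits(value, width):
        return [(value >> b) & 1 for b in reversed(range(width))]

    for pos, node_id in nodes.items():
        X[pos] = to_bits(node_id, node_bits)
    for i, j, l, pred_id in edges:
        A[i, j, l] = 1                       # edge l between nodes i and j
        E[l] = to_bits(pred_id, edge_bits)
    return A, X, E

# Star query from Fig. 2: ?Book :hasAuthor :StephenKing ; :genre :Horror
A, X, E = sg_encode(
    n=3, e=2, node_bits=3, edge_bits=2,
    nodes={0: 0, 1: 1, 2: 4},                # ?Book, StephenKing, Horror
    edges=[(0, 1, 0, 1), (0, 2, 1, 2)])      # hasAuthor (id 1), genre (id 2)
print(X.tolist(), E.tolist())
# [[0, 0, 0], [0, 0, 1], [1, 0, 0]] [[0, 1], [1, 0]]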
2) Pattern-bound encoding: represents an encoding that works only for a model tailored to a certain type of query. For subject-oriented star patterns, the pattern-bound encoding requires an ordering of the predicate-object pairs connected to the subject. More specifically, if we have a triple set of size t that is centered around a single unbound subject, the star encoding is a concatenation of all of the t pair encodings and the subject encoding. The terms can be represented either by a one-hot encoding or a binary encoding. An example encoding for a star query is shown in Fig. 2 (left). In the first step, a mapping of the nodes and predicates to an integer id is created. Then every term in the query is encoded using either the one-hot or the binary encoding, resulting in the encodings shown in Step 2.1 and Step 2.2, respectively (a code sketch of this flat-vector construction follows at the end of this subsection).

Unlike in star-shaped queries, in chain-shaped queries the ordering of the nodes and edges is evident. Intuitively, a predicate connected to a subject is its descendant, and its position in the order follows that of the subject. Similarly, the same holds for the object, which has the predicate as its antecedent. Given an ordering of the nodes and edges, we can encode the chain query as a concatenation of the encodings of the separate terms, encoded either with a one-hot encoding or, preferably for larger patterns, a binary encoding.

As previously discussed, an adjacency list representation is preferable when the subgraph query pattern is sparse. This is the case when the subgraph query pattern is fixed, since we limit the capability to encode only the specific subgraph pattern and not all possible variants, i.e., a sparse representation. As one can observe, a serialization of an adjacency list resembles the pattern-bound encoding. More specifically, a flattened adjacency list for a star graph pattern directly corresponds to the star encoding. However, for a chain query, an adjacency list will be larger than the pattern-bound encoding, since by knowing that an object in a triple will be the subject in the next one, we further remove redundant nodes.

LMKG can apply both encodings, depending on the selected query types that are to be learned by the models. If a model is chosen for learning only a specific type of query, then a pattern-bound approach is sufficient. Although using the matrices X and E without A from the SG-Encoding leads to the pattern-bound approach, the matrix A is crucial for representing different types of query topologies, or combinations of them, in a single model. For instance, the same model may later be trained on tree or clique queries of a predefined size.
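As referenced above, a minimal sketch of the pattern-bound star encoding as a flat vector, assuming binary term encodings, id 0 for the unbound subject, and illustrative term ids:

def star_encoding(subject_id, pairs, bits):
    """Pattern-bound encoding for a subject star: the subject encoding
    followed by the ordered (predicate, object) pair encodings, as a
    single flat vector. Binary term encoding assumed; 0 = unbound."""
    def to_bits(v, w):
        return [(v >> b) & 1 for b in reversed(range(w))]
    vec = to_bits(subject_id, bits)
    for pred_id, obj_id in pairs:
        vec += to_bits(pred_id, bits) + to_bits(obj_id, bits)
    return vec

# ?Book :hasAuthor :StephenKing ; :genre :Horror  (ids illustrative)
print(star_encoding(0, [(1, 1), (2, 4)], bits=3))
# [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

Such a vector can be fed directly to the supervised model described next, without the adjacency tensor A.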
VI. NEURAL NETWORK SPECIFICATION

Neural networks are capable of detecting patterns in high-dimensional data, which is highly useful when it comes to the high number of terms co-occurring in a KG. Following previous work on learned cardinality estimation in relational databases, e.g., [17], [30], [42], [45], LMKG utilizes two models, a supervised and an unsupervised one, i.e., LMKG-S and LMKG-U.
A. Supervised Model (LMKG-S)
Supervised deep learning models efficiently approximate non-linear functions, thereby providing an estimated output for a given input. By tuning the number of layers and neurons, that is, increasing the set of learned parameters, these models can fit different levels of non-linearity and learn different patterns from the data. To benefit from the expressiveness of deep learning models, in LMKG we first address cardinality estimation as a supervised learning problem. A standard way of modeling the supervised regression task is with the usage of multi-layer perceptrons. The network consists of an input layer, an arbitrary number of hidden layers, and an output layer. The number of neurons and layers can vary and is set according to the complexity of the input. In the case of LMKG, the structure of the neural network changes according to the encoding of the input, tailored to the considered query types, i.e., star and chain queries. During the training process, the neural network optimizes itself based on the provided example pattern queries with precalculated cardinalities. The architecture of LMKG-S is shown in Fig. 3.
Fig. 3: LMKG-S architecture
Initially, the cardinalities are log-scaled, followed by min-max scaling. Depending on the size of the input, some of the layers are optional, displayed with a dashed border in Fig. 3. Nevertheless, X, A, and E are always concatenated once flattened and propagated through one or multiple layers. Every fully connected layer except the output layer uses ReLU, a non-linear activation function defined as f(x) = max(0, x). The output layer uses a sigmoid function f(x) = 1 / (1 + e^{-x}), having an output between 0 and 1 that is suitable for the already scaled values. After the exploration of different objective functions, we have concluded that an adequate loss function for cardinality estimation is the mean q-error. The q-error is defined as the relation between the true cardinality and the estimate, i.e., q-error(y, ŷ) = max(ŷ/y, y/ŷ). When training for one specific type of query, LMKG-S can also employ the pattern-bound encoding. This encoding is then simply a flat vector composed of its terms and can be directly provided as input to the neural network. Although different approaches exist for handling graph data, the usage of lightweight models results in faster execution and lower memory consumption.
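Since the LMKG models are implemented in TensorFlow, the mean q-error loss and a small fully connected regressor can be sketched as below; the layer sizes, input width, and epsilon guard are assumptions for illustration, not the paper's exact configuration, and for brevity the loss is applied directly to the scaled targets (in practice one may unscale predictions first).

import tensorflow as tf

def q_error(y_true, y_pred):
    """Mean q-error: max(y_hat / y, y / y_hat); eps guards division by
    zero on the (0, 1)-scaled targets (assumed safeguard)."""
    eps = 1e-6
    return tf.reduce_mean(tf.maximum((y_pred + eps) / (y_true + eps),
                                     (y_true + eps) / (y_pred + eps)))

# A small fully connected regressor in the spirit of LMKG-S.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # scaled cardinality
])
model.compile(optimizer="adam", loss=q_error)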
B. Unsupervised Model (LMKG-U)

A deep autoregressive model is an unsupervised deep learning model that efficiently estimates a joint distribution P(x) from a set of samples. The autoregressive property dictates that, for a predefined variable ordering, the output of the model contains the density for each variable conditioned on the values of all preceding variables [7]. Therefore, given as input x = [x_1, x_2, ..., x_d], the autoregressive model produces d conditionals which, when multiplied, result in the point density P(x). Formally, for x = [x_1, x_2, ..., x_d]:

P(x) = P(x_1) P(x_2 | x_1) ... P(x_d | x_1, ..., x_{d-1}) = ∏_{i=1}^{d} P(x_i | x_1, ..., x_{i-1})

A. Training Data Creation

The endless combinations of queries emerging from the directed graph structure directly impact the training data creation as well as the accuracy of the cardinality estimation model. For smaller queries and homogeneous KGs, the complete set of patterns of a specific size can be created. However, as the size increases, the creation of all possible patterns of size k becomes a combinatorial problem. Therefore, to generate the training data, proper sampling has to be conducted. The sampling has to satisfy the scaled-down property of the KG, meaning that the samples should possess similar properties as the original KG [20]. As Leskovec and Faloutsos [20] show, the best performance for scaled-down sampling is achieved by random walk (RW) sampling, since it is biased towards highly connected nodes. Furthermore, RW preserves the property even when the sample size gets smaller. To simulate RW for star patterns of size k, we randomly pick a starting node and then simulate a random step k times from the starting node. Similarly, for chain patterns, we start a random walk from a randomly selected node and stop once the required size is reached. Although we use RW sampling for generating training data, efficient sampling in KGs is still a challenging area and, as further shown, the main cause of inaccurate model estimation is the quality of the samples, especially visible in KGs with many unique terms.
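A sketch of the two RW-based samplers under the stated scheme; the adjacency-map format and function names are our own, and for brevity the star sampler draws k distinct emitting edges in one step rather than simulating k individual random steps.

import random

def sample_star(adj, k):
    """Star sample of size k: pick a node with at least k outgoing
    (predicate, object) edges, then draw k of them. `adj` maps a
    node to its list of (predicate, object) pairs."""
    subject = random.choice([n for n, out in adj.items() if len(out) >= k])
    legs = random.sample(adj[subject], k)    # k distinct emitting edges
    return [(subject, p, o) for p, o in legs]

def sample_chain(adj, k):
    """Random walk of length k from a random node; restarts if a node
    without outgoing edges is reached early (assumes a chain of the
    required length exists somewhere in the graph)."""
    while True:
        node, chain = random.choice(list(adj)), []
        for _ in range(k):
            if not adj.get(node):
                break
            p, o = random.choice(adj[node])
            chain.append((node, p, o))
            node = o
        if len(chain) == k:
            return chain

adj = {"TheShining": [("hasAuthor", "StephenKing"), ("genre", "Horror")],
       "StephenKing": [("bornIn", "USA")]}
print(sample_star(adj, 2))   # a 2-triple star around TheShining
print(sample_chain(adj, 2))  # e.g., TheShining -> StephenKing -> USA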
B. Grouping of Models and Models' Analysis

To deal with the virtually unlimited combinations of queries, LMKG can create compounds of the models grouped by different criteria, allowing the choice between:

Single learned model that can be used for the complete knowledge graph and all types of queries, including star-shaped and chain-shaped queries with k joins.

Query type grouping that creates separate models, each specialized for a different type of query. Each model is tailored to a query type, and it can use both encodings.

Query size grouping that creates a single model for a range of queries grouped by their size, e.g., a single model can be created for patterns up to size 4 and another for larger ones.

LMKG offers the creation of different estimators characterized by the specific encoding, the type of learning, and the grouping. We next delineate the benefits and the downsides of the models along these dimensions and how they address the challenges of cardinality estimation in KGs:

Encoding: Both encodings are capable of addressing two of the challenges. They capture the term intercorrelations for a specific query by not leaving out any of the terms in the encoding. Additionally, both encodings can express many self-joins, assuming that one-hot term encoding is not used.

Pattern-bound encoding: simple to implement and of small dimensionality, since it is tuned to a specific query. However, when the patterns consist of reoccurring nodes or predicates, these are in some cases repeated in the encoding, which may lead to higher dimensionality. The encoding is not generalizable to different query topologies; thus, it requires the creation of multiple models and needs higher maintenance.

SG-Encoding: unlike the pattern-bound encoding, it addresses the third challenge, since it can simultaneously represent different query topologies. On the other hand, it can have a larger dimensionality, which is especially noticeable when all the terms in the query are distinct.

Supervision: Both models capture term correlations but are highly impacted by the quality of the training data sample. LMKG-S needs to capture enough representative queries that describe the common workload. LMKG-U samples have to have the same ratio as the original data, which can be challenging for larger dimensions.

LMKG-S has a faster training and prediction time, as well as a smaller memory footprint. However, the training data creation is more time-consuming, since unbound variables need to be included. Additionally, generalizing to queries that are far from the training data is challenging and somewhat impacted by the slight overfit for better results, leading also to worse estimates for outliers.

LMKG-U can create training data faster, since the model learns only from bound terms. It also captures the term intercorrelations better than LMKG-S, producing highly accurate cardinality estimates that are especially suitable for skewed datasets. However, LMKG-U has a larger memory footprint and higher training and prediction time than LMKG-S. This is especially noticeable when working with heterogeneous datasets that have a high number of unique terms.

Grouping of Models: The grouping mainly affects the accuracy and the creation time of the models.

Single learned model: suitable for small memory budgets and for homogeneous and smaller KGs. It requires less tuning and maintenance during the run-time phase. However, it may produce lower accuracy for heterogeneous KGs, due to the high number of patterns that affect the quality of the samples.

Query type or query size grouping: enables parallel creation of the training data and the models, leading to an excessive time reduction. However, having multiple models may lead to increased memory consumption and maintenance.

Combining the models depends on the overall model creation budget as well as the dataset characteristics. If a smaller model creation time is needed, LMKG-S is preferred over LMKG-U. If the training data is smaller, LMKG-U can still be considered, even in combination with LMKG-S. If there are no training time constraints, then the data characteristics should be examined. For instance, for star queries over datasets having only several nodes with a huge in- or out-degree, i.e., a skewed distribution, LMKG-U is preferred. However, when many rare terms appear and we are working with chain queries, the training data sampling may be worse for LMKG-U. Thus, in a situation where these two cases appear, a combination of both LMKG-U and LMKG-S may be the preferred approach. A single compound incorporating a supervised and an unsupervised model, as one model, for estimating a single query cardinality is currently out of the scope of this paper and left for future work.
VIII. EXPERIMENTS

We next present the results from the experiments, conducted on both synthetic and real-world datasets, organized into two main blocks. The first block incorporates a thorough investigation of the LMKG models. The second block comprises a comparison of the LMKG models with state-of-the-art cardinality estimation approaches for KGs.

Setup: We have implemented the LMKG models in TensorFlow and carried out the experiments for the learned models on an NVIDIA GeForce RTX 2080 Ti GPU. We have evaluated the competing approaches on a server with an Intel Xeon E7-4830 v3 CPU @ 2.10GHz and 1 TB RAM. For all the competing approaches except for the Characteristic Sets (CSET) approach, we have used the publicly available implementation provided by the authors of the G-CARE framework [31]. For the CSET approach, we followed the reference paper and implemented the presented algorithm to the best of our capabilities. We decided to reimplement CSET ourselves, as the G-CARE adaptation for chain queries returned unrealistically high estimates that negatively affect the final results. As in the G-CARE framework, the sampling approaches are run 30 times, and the resulting cardinality estimate and execution time are an average over the 30 samples. The timeout of the methods is set to 5 minutes.

In Table I we show the specifications for the experiments.

TABLE I: Experiment and dataset specifications

  Dataset      SWDF, LUBM20, YAGO
  Topology     Chain, Star
  Result Size  [5^0, 5^1), [5^1, 5^2), ...
  Query Size   2, 3, 5, 8

  Dataset      SWDF   LUBM20   YAGO
  Triples      ~      ~        ~
  Nodes        76K    663K     12M
  Predicates   171    19       91

As mentioned, LMKG currently supports chain and star queries as the most typical query types [1]. To assess the models' performance, we group the queries into buckets depending on the query result size, with boundaries defined by the logarithm with base 5 (a small sketch of this bucket assignment appears at the end of this subsection). For observing the performance for different query complexities, we choose query sizes of 2, 3, 5, and 8, for both star and chain queries, including at least one unbound variable.

Datasets: We use one synthetic and two real-world datasets with different characteristics (shown in Table I). We use the Semantic Web Dog Food (SWDF) [26], which, although it contains a smaller number of triples, has a high number of interconnections between the terms. We use LUBM [9], a widely used RDF benchmark, with a scaling factor of 20. We also use YAGO [38] as a larger knowledge base, chosen for its large number of distinct term values. The data analysis shows that the query cardinality follows a skewed distribution. As depicted in Fig. 4, where we show the query cardinality for the datasets by averaging over the different query sizes, the vast majority of queries have a small cardinality. Moreover, there is a presence of extremely large query cardinalities, i.e., outliers, which highly impact the accuracy of the models. For training the models, we create training data according to the explanation in Section VII-A, where the sample size depends on the dataset characteristics.
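As referenced above, for illustration, a one-line bucket assignment matching the log-base-5 boundaries (the epsilon guards against floating-point error; the function name is ours):

import math

def result_size_bucket(cardinality):
    """Bucket index for a query result size, with boundaries at powers
    of 5 as in Table I: [5^0, 5^1), [5^1, 5^2), ..."""
    return int(math.log(max(cardinality, 1), 5) + 1e-12)

print([result_size_bucket(c) for c in (1, 4, 5, 24, 25, 3000)])
# [0, 0, 1, 1, 2, 4]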
Competitors: As competitors, we use the approaches in the recently developed benchmarking framework G-CARE [31] and an adaptation of the supervised estimator MSCN [17]. G-CARE includes sampling-based and summary-based techniques, some native to knowledge graphs and others adapted from relational databases. The competitors are as follows. Summary-based approaches: Characteristic Sets [27] summarizes entities based on their emitting edges and is specifically tailored for star-shaped queries. SUMRDF [35] is a graph summarization approach that estimates the cardinality by relying on the possible-world semantics. Sampling-based approaches: WanderJoin (WJ) [21] performs random walks directly on top of the KG by considering each triple as a vertex and a join as an edge. Impr [3] uses random walks for estimating graphlet counts. Join Sampling with Upper Bounds (JSUB) [47] is a random walk sampling approach for sampling over joins, adapted for producing estimates of the upper bound of the cardinality. MSCN-n is a multi-set convolutional network using query features as sets and n materialized samples. In MSCN, we perform self-joins over a single table to allow KG queries and always train on the same queries as LMKG-S. We use two variants, MSCN-0 and MSCN-1k, with 0 and 1,000 samples, respectively, with the same hyperparameters and trained until convergence.

Fig. 4: Datasets distribution
Fig. 5: Impact of outliers
Fig. 6: Training time vs. accuracy; bars show max q-error, dots show avg. q-error ((a) LMKG-U, (b) LMKG-S; samples from LUBM)

Generation of Test Queries: For generating the test queries, we vary the query topology, the query size, and the query result size. For a specific query size and a fixed topology, we select 600 queries, where each query is drawn from a bucket for a specific result size. Although we try to select the same number of queries from each bucket, the buckets including queries with a larger result size, i.e., cardinality, are usually smaller. It is also important to note that we limit the graph patterns to include only bound predicates, due to the competitors' limitations in answering queries with unbound ones. It is important to point out that, although LMKG-S can easily handle heterogeneous datasets with many distinct terms, LMKG-U cannot. LMKG-U is an autoregressive model where the size of the model parameters is directly affected by the number of terms and their unique values. Hence, as the number of unique term values increases, so will the model. This directly affects our experiments for the YAGO knowledge base. As previously depicted, YAGO contains a huge number of distinct term values (Table I). With the current setting, LMKG-U is not able to learn the complete set of queries of size 3 and beyond. Since addressing this limitation is part of our ongoing work, we exclude LMKG-U from the comparison on YAGO.

A. Analysis of LMKG Models

We next discuss some of the factors impacting LMKG:

Hyperparameter Tuning: We conducted experiments varying hyperparameters such as epochs, hidden units, and layers. Fig. 6 shows how two of the accuracy metrics change through the training process. The metrics are calculated after several epochs. After a reasonable number of epochs, the approaches reach satisfactory results for both the average and the maximal q-error.
For our experiments, we fix the number of epochs for LMKG-S and LMKG-U so as to balance the training time and accuracy. For LMKG-S, on average, an epoch takes on the order of seconds for SWDF, LUBM, and YAGO. The mentioned training times are directly affected by the sample size. Due to the higher complexity of the model in the presence of numerous terms, LMKG-U requires a longer training time. For the sample size that we have considered, for SWDF and LUBM, an epoch takes on the order of minutes. However, when we want to create a larger sample size, because of the many unique terms and possible query patterns, an epoch in LMKG-U can take considerably longer for queries of size 5 in LUBM and SWDF. We varied the number of hidden units and the number of hidden layers (2-4) depending on the dataset characteristics. We found that a few layers of neurons are often acceptable for both models. If not specified otherwise, we always report the combination producing the best results.

Fig. 7: Avg. q-error comparison between specialized and combined models ((a) star queries; (b) chain queries)

Impact of Grouping: In Fig. 7 we show the accuracy of the following LMKG-S models: a specialized model for queries of a specific type and size, a model grouped by size, a model grouped by type, and a model for every query type and size (SingleModel). We stop after 50 epochs, where every model consists of two layers and the same configuration. For almost every case, the specialized model overfits the queries and produces the best estimates. The single model, as expected, has the lowest estimation accuracy. Knowing that the model trains on a much larger dataset, this accuracy can be acceptable, especially in environments with a small memory budget. The models grouped by size and type produce good estimates, although worse than the specialized model. Having this accuracy comparison in mind, and since the number of specialized models needed for the various combinations is significantly higher compared to the grouped models, for the experimental analysis we chose the query size grouping.

Fig. 8: Accuracy for query size ((a) SWDF; (b) LUBM)
Fig. 9: Accuracy for query result size ((a) SWDF; (b) LUBM; (c) YAGO)
Fig. 10: Accuracy for query type ((a) SWDF; (b) LUBM; (c) YAGO)

Impact of Outliers: Upon measuring the models' accuracy, it is evident that, unlike LMKG-U, whose accuracy is impacted by the data dimensionality and the range domains of the terms, LMKG-S is extremely affected by outliers. Therefore, in Fig. 5 we measure the influence of outliers in star queries, where the impact of outlier removal is most evident. We can see that even if we remove only the top-10 outliers from the query data, we achieve a higher accuracy of the model. This trend continues when a larger fraction of the outliers is removed. Although we apply normalization and scaling of the data, the impact of the skewness is still evident.
Therefore, given a larger space budget, a possible improvement can be to store the cardinalities of the outliers on the side. For a fair comparison with the competitors, we proceed without this improvement.

B. Comparison with Competitors

For the comparison with the competitors, for LMKG-S we used SG-Encoding and query size grouping. For LMKG-U, we used the pattern-bound encoding with 32-dimensional embeddings and, hence, query size and type grouping. We depict the accuracy over the different query types, query sizes, and query result sizes for all approaches.

1) Accuracy Analysis:

Varying Query Size: Fig. 8 depicts the accuracy of the approaches when varying the number of joins in the queries, represented through the average q-error. Unlike the others, whose accuracy declines for a larger number of joins, LMKG-S is not impacted by this factor. LMKG-U shows a slight degradation of performance when the number of joins increases, due to two factors. First, for a larger number of joins, hence terms, LMKG-U needs to learn more term correlations. Second, the accuracy is also impacted by the quality of the training sample, which for larger queries can still be a challenging task. When looking at the different query sizes, LMKG-S performs better than the other approaches, whereas LMKG-U is more accurate for SWDF than for LUBM.

Varying Query Result Size: Fig. 9 shows the accuracy for different query result sizes. Each range contains the same number of queries, except for the last ones, where the patterns are sparse. To depict the downsides of LMKG, we include the outliers in the query distribution. The last buckets are grouped into larger ranges involving the outliers. When varying the query result size, we can most clearly see where the LMKG-S approach fails. LMKG-S is highly prone to outliers, visible for the higher ranges. Thus, LMKG-S is mainly impacted by the skewness. LMKG-U produces more constant results throughout the ranges. However, for larger datasets, due to sampling reasons, LMKG-U fails to capture the interdependencies between the rarely occurring terms. This is more evident in the smaller ranges. Hence, for YAGO, or datasets with similar properties, where the number of term values is especially large, we advise the usage of LMKG-S. Suitable for relational data, MSCN represents the predicate values with a single feature. However, this is not adequate for large domain values, especially for larger ranges. MSCN-1k performs better; however, a sample is still not able to capture the KG's query diversity. On the contrary, our approaches give emphasis to the term values and patterns and provide better estimates. When comparing with the existing KG approaches, overall, LMKG-S is always better for smaller ranges, followed by LMKG-U, WJ, and MSCN-1k. CSET and WJ perform better for larger query result sizes; however, they are inferior for smaller ranges. Regarding the overall performance when considering the query result sizes, LMKG-U produces the best accuracy out of all the methods.

Varying Query Topology: Fig. 10 reports the accuracy specific to star and chain queries of different sizes. LMKG-S and LMKG-U almost always perform best for both query types. WJ and MSCN-1k perform well and in some cases even outperform LMKG-U.
As depicted, LMKG-U is affected both by the type of the query and by the datasets and their domain values for the terms. For instance, SWDF has a higher number of unique values in the terms of the star queries, and thus the overall accuracy is slightly worse than for chain queries.

2) Memory: For the memory consumption, we compare with the summary-based approaches and MSCN. We measure the complete model for LMKG-U and LMKG-S. Although for LMKG-U and LMKG-S a model for a specific size k can answer smaller queries, we present the following table for completeness. Intuitively, sample-based approaches have an advantage, since they use the KG for drawing samples. In Table II we show the results for the different KGs.

TABLE II: Memory consumption of different approaches

            LMKG-U               LMKG-S             SUMRDF    CSET      MSCN
  Dataset   k=2   k=3   k=5      k=2   k=3   k=5    Complete  Complete  0/1K
  SWDF      19MB  43MB  46MB     4MB   4MB   8MB    1.2MB     816.7KB   5/8MB
  LUBM      80MB  78MB  27MB     2MB   2MB   5MB    8.8MB     8.6KB     5/8MB
  YAGO      X     X     X        4MB   7MB   7MB    342MB     5MB       5/8MB

LMKG-S requires less memory than LMKG-U, as well as less than some of the competitors. MSCN-0 has a smaller footprint due to the much smaller input, at the cost of worse accuracy. CSET is better for SWDF and LUBM; however, for YAGO it has a larger size. On the contrary, the memory consumption of LMKG-U increases with the dataset size and the number of unique terms involved. In some scenarios, LMKG-U requires less memory because, for large query sizes, the scarcity of patterns results in fewer unique terms, contributing to a smaller sample size and a model with fewer layers and neurons.

3) Estimation Time: In Fig. 11 we show the estimation time for the different types of queries, varied by their size. For the sampling approaches, we measure the time of generating 30 samples, since G-CARE needs 30 samples for producing an accurate final estimate. A smaller sample size produced much worse accuracy but, intuitively, a faster approach. Intuitively, MSCN has a prediction time similar to LMKG-S. As shown, LMKG-S performs better than all the approaches except for the CSET approach. LMKG-U is as good as or sometimes better than the other approaches. Other than CSET, the remaining approaches are majorly affected by the query size.

Fig. 11: Estimation time ((a) varying query size, SWDF; (b) varying query type, SWDF; (c) varying query size, LUBM; (d) varying query type, LUBM)

C. Lessons Learned

Challenges and Future Directions: The main performance degradation in LMKG-S is not a result of the complexity of the queries but of the large outliers. As shown by the experiments, LMKG-S overall outperforms every approach; however, it fails to accurately estimate outliers. To overcome this problem, we suggest the usage of a buffer list for storing the outliers. Differently, LMKG-U provides a more constant performance. However, the main difficulty for this model comes from a large number of unique term values, causing a high number of correlations for which the current RW sampling has not proved efficient in all of the cases. Future work involves exploring a more optimal sampling approach and a reduction of the per-term dimensionality for LMKG-U. In parallel to LMKG, the use of a deep learning model, such as GraphAF [33], may be beneficial for sampling from the knowledge graph.

Optimal Use Cases: Evidently, a generalization for different sizes or types of queries requires a higher training time than creating the other approaches.
Optimal Use-cases: Evidently, a generalization over different sizes or types of queries requires a higher training time than creating the other approaches. Considering all the aspects of query cardinality estimation in knowledge graphs, LMKG is optimal for scenarios where a workload is given or where cardinalities for a specified range of queries (e.g., up to k joins) are wanted. Additionally, the learned models in the LMKG framework are practically useful for query optimization, where a reordering of different patterns of smaller sizes is needed. Although the models are more accurate and can be useful for query cardinality estimation of larger or combined queries, their scaling depends on the training data, and thus a sampling-based approach or a combination of a sampling and a learned approach may be more efficient.

IX. CONCLUSION

We addressed the problem of applying deep learning methods for cardinality estimation in KGs by utilizing both supervised and unsupervised deep learning models. To efficiently feed knowledge subgraphs to our models, we investigated various encodings and introduced a novel encoding that is especially useful for our supervised model. By focusing on the subgraph patterns that constitute the KG, our encoding drastically reduces the input size and enables us to train the models on more than one query type. We additionally explained possible improvements to our framework and future directions. Through the experimental evaluation, we showed that LMKG-S and LMKG-U exceed the state-of-the-art approaches in terms of accuracy while keeping a small memory footprint and requiring less time for generating the estimates.

REFERENCES

[1] Angela Bonifati, Wim Martens, and Thomas Timm. An analytical study of large SPARQL query logs. Proc. VLDB Endow., 2017.
[2] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. NIPS, 2013.
[3] Xiaowei Chen and John C. S. Lui. Mining graphlet counts in online social networks. ICDM, 2016.
[4] Conor Durkan and Charlie Nash. Autoregressive energy machines. ICML, 2019.
[5] Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek R. Narasayya, and Surajit Chaudhuri. Selectivity estimation for range predicates using lightweight models. Proc. VLDB Endow., 2019.
[6] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. ICML, 2015.
[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[8] Andrey Gubichev and Thomas Neumann. Exploiting the query structure for efficient join ordering in SPARQL queries. EDBT, 2014.
[9] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. LUBM: A benchmark for OWL knowledge base systems. J. Web Semant., 2005.
[10] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. NIPS, 2017.
[11] Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, and Gautam Das. Deep learning models for selectivity estimation of multi-attribute queries. SIGMOD Conference, 2020.
[12] Rojeh Hayek and Oded Shmueli. Improved cardinality estimation by learning queries containment rates. EDBT, 2020.
[13] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. DeepDB: Learn from data, not from queries! Proc. VLDB Endow., 2020.
[14] Hai Huang and Chengfei Liu. Estimating selectivity for joined RDF triple patterns. CIKM, 2011.
[15] Louis Jachiet, Pierre Genevès, and Nabil Layaïda. Optimizing SPARQL query evaluation with a worst-case cardinality estimation based on statistics on the data. Working paper or preprint, May 2017.
[16] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S. Yu. A survey on knowledge graphs: Representation, acquisition and applications. CoRR, 2020.
[17] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter A. Boncz, and Alfons Kemper. Learned cardinalities: Estimating correlated joins with deep learning. CIDR, 2019.
[18] Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph M. Hellerstein, and Ion Stoica. Learning to optimize join queries with deep reinforcement learning. CoRR, 2018.
[19] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. How good are query optimizers, really? Proc. VLDB Endow., 2015.
[20] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. KDD, 2006.
[21] Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander join: Online aggregation via random walks. SIGMOD Conference, 2016.
[22] Henry Liu, Mingbin Xu, Ziting Yu, Vincent Corvinelli, and Calisto Zuzarte. Cardinality estimation using neural networks. CASCON, 2015.
[23] Angela Maduko, Kemafor Anyanwu, Amit P. Sheth, and Paul Schliekelman. Estimating the cardinality of RDF graph patterns. WWW, 2007.
[24] Ryan Marcus and Olga Papaemmanouil. Towards a hands-free query optimizer through deep learning. CIDR, 2019.
[25] Ryan C. Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. Neo: A learned query optimizer. Proc. VLDB Endow., 2019.
[26] Knud Möller, Tom Heath, Siegfried Handschuh, and John Domingue. Recipes for semantic web dog food - the ESWC and ISWC metadata projects. ISWC/ASWC, 2007.
[27] Thomas Neumann and Guido Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. ICDE, 2011.
[28] Thomas Neumann and Gerhard Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 2010.
[29] Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, and S. Sathiya Keerthi. Learning state representations for query optimization with deep reinforcement learning. DEEM@SIGMOD, 2018.
[30] Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, and S. Sathiya Keerthi. An empirical analysis of deep learning for cardinality estimation. CoRR, 2019.
[31] Yeonsu Park, Seongyun Ko, Sourav S. Bhowmick, Kyoungmin Kim, Kijae Hong, and Wook-Shin Han. G-CARE: A framework for performance benchmarking of cardinality estimation techniques for subgraph matching. SIGMOD Conference, 2020.
[32] Mariya Popova, Mykhailo Shvets, Junier Oliva, and Olexandr Isayev. MolecularRNN: Generating realistic molecular graphs with optimized properties. CoRR, 2019.
[33] Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. GraphAF: A flow-based autoregressive model for molecular graph generation. ICLR, 2020.
[34] E. Patrick Shironoshita, Michael T. Ryan, and Mansur R. Kabuka. Cardinality estimation for the optimization of queries on ontologies. SIGMOD Rec., 2007.
[35] Giorgio Stefanoni, Boris Motik, and Egor V. Kostylev. Estimating the cardinality of conjunctive queries over RDF data using graph summarisation. WWW, 2018.
[36] Michael Stillger, Guy M. Lohman, Volker Markl, and Mokhtar Kandil. LEO - DB2's learning optimizer. VLDB, 2001.
[37] Markus Stocker, Andy Seaborne, Abraham Bernstein, Christoph Kiefer, and Dave Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. WWW, 2008.
[38] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: A large ontology from Wikipedia and WordNet. J. Web Semant., 2008.
[39] Maria-Esther Vidal, Edna Ruckhaus, Tomas Lampo, Amadís Martínez, Javier Sierra, and Axel Polleres. Efficiently joining group patterns in SPARQL queries. ESWC, 2010.
[40] Xin Wang, Eugene Siow, Aastha Madaan, and Thanassis Tiropanis. PRESTO: Probabilistic cardinality estimation for RDF queries based on subgraph overlapping. CoRR, 2018.
[41] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. AAAI, 2014.
[42] Lucas Woltmann, Claudio Hartmann, Maik Thiele, Dirk Habich, and Wolfgang Lehner. Cardinality estimation with local deep learning models. aiDM@SIGMOD, 2019.
[43] Daphne Koller and Nir Friedman. Probabilistic Graphical Models - Principles and Techniques. MIT Press, 2009.
[44] Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. NeuroCard: One cardinality estimator for all tables. CoRR, 2020.
[45] Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Peter Chen, Pieter Abbeel, Joseph M. Hellerstein, Sanjay Krishnan, and Ion Stoica. Deep unsupervised cardinality estimation. Proc. VLDB Endow., 2019.
[46] Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep autoregressive models. ICML, 2018.
[47] Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. Random sampling over joins revisited. SIGMOD Conference, 2018.