CoCoE: A Methodology for Empirical Analysis of LOD Datasets
Vít Nováček
Insight @ NUI Galway (formerly known as DERI), IDA Business Park, Lower Dangan, Galway, Ireland. E-mail: [email protected]
Abstract.
CoCoE stands for Complexity, Coherence and Entropy, and presents an extensible methodology for empirical analysis of Linked Open Data (i.e., RDF graphs). CoCoE can offer answers to questions like: Is dataset A better than B for knowledge discovery since it is more complex and informative?, Is dataset X better than Y for simple value lookups due to its flatter structure?, etc. In order to address such questions, we introduce a set of well-founded measures based on complementary notions from distributional semantics, network analysis and information theory. These measures are part of a specific implementation of the CoCoE methodology that is available for download. Last but not least, we illustrate CoCoE by its application to selected biomedical RDF datasets.
As the LOD cloud is growing, people increasingly often face the problem of choosing the right dataset(s) for their purposes. Data publishers usually provide descriptions that can indicate possible uses of their datasets, however, such asserted descriptions may often be too shallow, subjective or vague. To complement the dataset descriptions authored by their creators, maintainers or users, we introduce a comprehensive set of empirical quantitative measures that are based on the actual content of the datasets. Our main goal is to provide means for comparison of RDF datasets along several well-founded criteria, and thus determine the most appropriate datasets to utilise in specific use cases.

To motivate and illustrate our contribution from a practical point of view, imagine a researcher Rob working on a novel method for discovering drug side effects. Rob knows that the most successful methods typically define and train a model in order to discover unknown side effects of drugs using their known features [1]. Rob also knows there are datasets in the LOD cloud that can be used for defining features that may not be captured by the state of the art approaches. Moreover, due to the common RDF format, the datasets can relatively easily be combined to generate completely new sets of features. Therefore using relevant LOD data can lead Rob to a breakthrough in adverse drug effect discovery.
This work has been supported by the ‘KI2NA’ project funded by Fujitsu Laboratories Limited in collaboration with Insight @ NUI Galway. We also want to thank Václav Belák for valuable discussions regarding applicable clustering algorithms.

Examples of such data are DrugBank, SIDER and Diseasome (c.f., http://datahub.io/dataset/fu-berlin-[drugbank|sider|diseasome]). They describe drugs, medical conditions, genes, etc. The question is how to use the datasets efficiently. Rob may wonder how much information he can typically gain from the datasets and which one is the best in this respect. Which of them is better for extracting flat features based on predicate-object pairs, and which is better for features based on more complex structural patterns? Last but not least, it may be useful to know what happens if one combines the datasets. Maybe it will bring more interesting features, and maybe nothing will change much, only the data will become larger and more difficult to process.

CoCoE provides a well-founded methodology for empirical analysis of RDF data which can be used to determine the applicability of the data to particular use cases (like Rob’s in the motivating example above). The methodology is based on sampling the datasets with quasi-random heuristic walks that simulate various exploratory strategies of real agents. For each sample (i.e., walk), we compute a set of measures that are then averaged across all the samples to approximate the overall characteristics of the dataset. We define three types of measures: complexity, coherence and entropy. The purpose of the measures is to assess datasets along complementary perspectives that can be quantified in a well-founded and easy-to-interpret manner. The perspectives chosen and their possible combinations cover a broad area of use cases in which RDF datasets can possibly be applied, ranging from simple value look-ups through semantic annotations to complex knowledge discovery tasks.

The CoCoE methodology can obviously be implemented in many different ways, but here we describe only one specific realisation. For the complexity measures, we use network analysis algorithms [2]. For the coherence measures, two auxiliary structures are required. Firstly, we need a distributional representation of the RDF data [3], which describes each entity (subject or object) by a vector that represents its meaning based on the entities linked to it. Secondly, we need a taxonomy of nodes in the RDF (multi)graph, which is computed from the data itself by means of nonparametric hierarchical clustering. These two structures allow for representing coherence using various types of semantic similarities based on the vector space representation and the taxonomy structure, such as cosine or Wu-Palmer [4]. The taxonomy structure also serves as a basis for the entropies computed using cluster annotations of the nodes in the walks.

The rest of the paper is organised as follows. Section 2 summarises the related work. Details on the CoCoE methodology and its implementation are given in Section 3. Section 4 presents an experimental illustration of the CoCoE approach. We conclude the paper in Section 5.

The distributional representation of RDF data we use builds on our previous work [3]. We have recently introduced the notion of heuristic quasi-random walks and their empirical analysis in [5], which, however, deals with different types of data, manually curated taxonomies and predefined gold standards.
The presented paper extends that work into a generally applicable methodology for analysing RDF datasets using only the data itself.

The clustering method introduced here builds on principles similar to k-hop clustering [6]. Another related approach is nonparametric hierarchical link clustering [7], which is more general and sophisticated than our simple method, yet the Python implementation of it that we experimented with proved intractable in our experiments. A comprehensive overview of semantic similarity measures applicable in CoCoE is provided in [4]. The similarities used in our experiments were the cosine and Wu-Palmer ones, chosen as representatives of the vector space-based and taxonomy-based similarity types.

The most relevant tools and approaches for RDF data analysis are [8,9,10,11]. Perhaps closest to our work is [9], which computes a set of statistics and histograms for a given RDF dataset. The statistics are, however, concerned mostly with distributions of statements, instances and explicit statement patterns. This may be useful for tasks like SPARQL query optimisation, but cannot directly answer the questions that motivate our work. Graph summaries [8] propose high-level abstractions of RDF data intended to facilitate the formulation of SPARQL queries, which is orthogonal to our approach aimed at quantitative characteristics of the data itself. Usage-based RDF data analysis [10] provides insights into common patterns of utilising RDF data by agents, but does not offer means for actually analysing the data. Finally, the recent approach [11] is useful for knowledge discovery in RDF data based on user-defined query patterns and analytical perspectives. Our approach complements [11] by characterising application-independent features of RDF datasets taken as a whole.
In this section, we first introduce the various RDF data representations that underlie the CoCoE methodology. Then we describe the clustering method used for computing the taxonomies that are needed for certain CoCoE measures. The concept of heuristic quasi-random walks is then described, followed by details on the CoCoE measures. Finally, we explain how to interpret the measure values.
Let us assume an RDF dataset consisting of triples (s, p, o) that range over a set of URIs U and literals L such that s, p ∈ U, o ∈ U ∪ L. A direct graph representation of the dataset is a directed labelled multigraph G_d = (V, E_d, L_d), where V = U ∪ L is a set of nodes corresponding to the subjects and objects, E_d is a set of ordered pairs (u, v) ∈ V × V, and L_d : E_d → U is a function that assigns a predicate label p to every edge (s, o) such that (s, p, o) exists in the dataset. Note that we do not distinguish between URI and literal objects in the current implementation of CoCoE, as we are interested in the most generic, schema-independent features of the datasets. An example of a direct graph representation is given in Figure 1.

Fig. 1. Example of a direct graph representation of RDF data

The left hand side of the figure contains RDF statements coming from DrugBank and Diseasome. For better readability, we represent the statements as simple tuples, with the DB and DS abbreviations referring to the corresponding namespaces. We also use symbolic names instead of the alphanumeric IDs present in the original data. In the graph representation on the right hand side of the figure, arginine, Alzheimer disease, APOE and urokinase correspond to the A, B, C and D codes, respectively. Similarly, the predicates possibleDiseaseTarget, possibleDrug and associatedGene correspond to the r, s and t codes, respectively. Drug entities are displayed in white, while disease and gene entities are in dark and light grey, respectively.

Next we define a distributional representation of an RDF dataset as a matrix M. The row indices of M correspond to the set V of nodes in the dataset's direct graph representation G_d. The column indices represent the context of the nodes by means of their connections in the G_d graph, and are defined as a union of two sets of pairs corresponding to all possible outgoing and incoming edges: {(L_d((x, y)), y) | x ∈ V ∧ (x, y) ∈ E_d} ∪ {(x, L_d((x, y))) | y ∈ V ∧ (x, y) ∈ E_d}. The values of the M matrix indexed by a row a and column (b, c) are 1 if there is an edge (a, c) with predicate label b or an edge (b, a) with predicate label c in G_d, and 0 otherwise. The dataset from the previous example corresponds to the following distributional representation:

      (r,B)  (s,D)  (t,C)  (A,r)  (B,s)  (B,t)  (D,r)
  A     1      0      0      0      0      0      0
  B     0      1      1      1      0      0      1
  C     0      0      0      0      0      1      0
  D     1      0      0      0      1      0      0

The rows of the matrix can be used for computing similarities between particular data items using measures like the cosine similarity. Let us use the notation x to refer to the row vector in M corresponding to the entity x (i.e., a subject or object in the original dataset). Then the cosine similarity between two entities x, y is sim_cos(x, y) = (x · y) / (|x| |y|).
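To make the construction concrete, the following is a minimal Python sketch of the distributional representation and the cosine similarity for the running example. The sparse dict-of-dicts layout, the function names and the hard-coded toy triples are illustrative assumptions, not part of the released CoCoE implementation.

```python
from collections import defaultdict
import math

# Toy triples from the DrugBank/Diseasome illustration
# (A = arginine, B = Alzheimer disease, C = APOE, D = urokinase;
#  r = possibleDiseaseTarget, s = possibleDrug, t = associatedGene).
triples = [("A", "r", "B"), ("D", "r", "B"), ("B", "s", "D"), ("B", "t", "C")]

def distributional_matrix(triples):
    """Sparse distributional representation M: row = entity, column =
    (predicate, object) for outgoing edges or (subject, predicate) for
    incoming edges, value = 1 if the connection exists."""
    M = defaultdict(dict)
    for s, p, o in triples:
        M[s][(p, o)] = 1   # outgoing context of the subject
        M[o][(s, p)] = 1   # incoming context of the object
    return M

def sim_cos(M, x, y):
    """Cosine similarity between the row vectors of entities x and y."""
    vx, vy = M[x], M[y]
    dot = sum(vx[c] * vy.get(c, 0) for c in vx)
    nx = math.sqrt(sum(v * v for v in vx.values()))
    ny = math.sqrt(sum(v * v for v in vy.values()))
    return dot / (nx * ny) if nx and ny else 0.0

M = distributional_matrix(triples)
print(sim_cos(M, "A", "D"))  # ~0.71 for the example data
```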
For instance, the similarity between A, D (i.e., arginine and urokinase) is sim_cos(A, D) = 1 / (√1 · √2) ≈ 0.71.

To facilitate computations utilising the distributional representation, we apply dimensionality reduction. In the experiments presented here, we used a simple method for ranking and filtering out the columns – the χ² statistic [12], which can be used for computing the divergence of specific observations from expected values. In our case, observations are the columns in a distributional representation, and their values are the column frequencies in the data. More formally, let us assume an m × n distributional representation M with row index set V = {v_1, v_2, ..., v_m} and column index set C = {c_1, c_2, ..., c_n}. Then the expected (i.e., mean) and observed values for the χ² statistic are E(M) = (1/|C|) Σ_{r ∈ V, c ∈ C} M_{r,c} and O(c_i, M) = Σ_{r ∈ V} M_{r,c_i} (for a column c_i). Using these formulae, the χ² statistic of a column c_i is χ²(c_i, M) = (O(c_i, M) − E(M))² / E(M). The χ² values for the columns in our example distributional representation are as follows. The expected value is ≈ 1.14 (the sum of all values in the matrix divided by the number of columns). All the columns but (r,B) have a χ² value of ≈ 0.02. The (r,B) column has a χ² value of ≈ 0.64, leaving (r,B) as the only significant column. The similarity between the A and D entities then increases to 1 as their corresponding vectors, reduced to the only significant dimension, are equal.

In addition to reducing the dimensionality, we use the χ² scores to construct weighted indirect representations of RDF datasets, G_w = (V, E_w, L_w). G_w is an undirected graph with node set V and edge set E_w that consists of 2-multisets of elements from V. The E_w set is constructed from the corresponding direct graph representation G_d as {{u, v} | (u, v) ∈ E_d ∨ (v, u) ∈ E_d}. The labelling function L_w : E_w → R associates each edge with a weight that is computed as the maximum of the values {χ²((p, u), M) | p ∈ P_I} ∪ {χ²((p, v), M) | p ∈ P_O} ∪ {χ²((u, p), M) | p ∈ P_O} ∪ {χ²((v, p), M) | p ∈ P_I}, where P_I, P_O are the sets of RDF predicates linking v to u and u to v, respectively. Figure 2 shows how the direct graph representation can be turned into the indirect weighted one.
Fig. 2. Example of a weighted indirect graph representation of RDF data
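Continuing the sketch above, the χ² column scoring and the derivation of edge weights for G_w could look as follows. Taking the maximum over the two contexts induced by each triple is a simplification of the full definition, and the names are again only illustrative.

```python
from collections import defaultdict

def chi2_scores(M):
    """Chi-square score of every column of the distributional matrix M
    (dict-of-dicts as built above): divergence of the column frequency
    from the mean column frequency."""
    col_freq = defaultdict(int)
    for row in M.values():
        for col, val in row.items():
            col_freq[col] += val
    expected = sum(col_freq.values()) / len(col_freq)
    return {c: (obs - expected) ** 2 / expected for c, obs in col_freq.items()}

def weighted_edges(triples, scores):
    """Edge weights of the undirected graph G_w: for every subject-object
    pair, take the maximum chi-square score over the contexts induced by
    the predicates connecting the two nodes."""
    weights = {}
    for s, p, o in triples:
        edge = frozenset((s, o))
        candidates = [scores.get((p, o), 0.0), scores.get((s, p), 0.0)]
        weights[edge] = max(weights.get(edge, 0.0), *candidates)
    return weights

scores = chi2_scores(M)
top_columns = sorted(scores, key=scores.get, reverse=True)[:250]  # keep the 250 highest-scoring dimensions, as in Sect. 4
weights = weighted_edges(triples, scores)
```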
The last structure we need for computing the CoCoE measures is a similarity representation G_s = (V, E_s, L_s). Similarly to G_w, G_s is a weighted undirected graph. It captures the similarities between the entities in the corresponding RDF dataset. The edge set E_s is defined as {{u, v} | sim_cos(u, v) > ε}, where u, v are the vectors in the dataset's distributional representation M and ε ∈ [0, 1) is a threshold. The edge labelling function L_s : E_s → (0, 1] then assigns the actual similarities to the particular edges. Our example dataset has a sparse similarity representation, as most of the similarities are 0, except for sim_cos(A, D) = sim_cos(D, A) = 1 (or ≈ 0.71 when using all dimensions).

To compute many of the CoCoE measures, a taxonomy of the nodes in the RDF data representations is required. In some domains, standard, manually curated taxonomies exist (such as MeSH in life sciences, c.f. http://download.bio2rdf.org/current/mesh/mesh.html). Unfortunately, such authoritative resources are not available for most domains, or they do not cover many RDF datasets sufficiently. Therefore we devised a simple algorithm that computes a hierarchical cluster structure (i.e., taxonomy) based on traversing the graph representations of the data. We compute two taxonomies T_w, T_s based on the G_w, G_s representations, respectively. T_w is based on the data representation directly, while T_s captures the taxonomy induced by the entity similarities.

The most specific (i.e., leaf-level) clusters are computed as follows, using the corresponding G_? = (V, E_?, L_?) representation where ? is one of w, s (a code sketch of this step follows the list):

1. Compute a list L of nodes v ∈ V ranked according to their clustering coefficients 2λ_{G_?}(v) / (|a(v)|(|a(v)| − 1)), where λ_{G_?}(v) is the number of complete subgraphs of G_? containing v, and a(v) is the set of neighbours of v in G_? (we use the clustering coefficient as a simple quantification of node complexity and of the nodes' potential for spawning clusters).
2. Set a cluster identifier i to 0 and initialise a mapping ν : N → 2^V between cluster identifiers and corresponding node sets.
3. While L is not empty, do:
   (a) Pick the node x with the highest rank from L.
   (b) Set cluster ν(i) to the set of nodes {u | Π_{e ∈ p(x,u)} L_?(e) > ε}, where p(x, u) is the set of edges on a path between the nodes x, u in G_?, and ε is a predefined threshold (in our experiments, we set the ε threshold dynamically to ca. the 75th percentile of the actual edge weights in the given graph).
   (c) Remove all ν(i) nodes from L and increment i by 1.
4. Return the cluster-to-nodes mapping ν.
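One possible reading of steps 1-3 in code, again only a sketch: leaf clusters are grown greedily from high-ranking seed nodes while the product of edge weights along the expansion path stays above the threshold (a concrete interpretation of the path condition in step 3(b)); weights is the frozenset-keyed edge-weight map from the earlier sketch.

```python
import itertools
from collections import defaultdict

def clustering_coefficient(adj, v):
    """2 * (#triangles through v) / (deg(v) * (deg(v) - 1)); 0 for degree < 2."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    triangles = sum(1 for a, b in itertools.combinations(nbrs, 2) if b in adj[a])
    return 2.0 * triangles / (k * (k - 1))

def leaf_clusters(weights, epsilon):
    """Leaf-level clusters: expand from the highest-ranked unclustered node,
    adding every node reachable while the product of the edge weights along
    the expansion path stays above epsilon (clusters may overlap)."""
    adj = defaultdict(set)
    for edge in weights:
        u, v = tuple(edge)
        adj[u].add(v)
        adj[v].add(u)
    ranked = sorted(adj, key=lambda v: clustering_coefficient(adj, v), reverse=True)
    clusters, seeds_used = {}, set()
    i = 0
    for x in ranked:
        if x in seeds_used:
            continue
        cluster, frontier = {x}, [(x, 1.0)]
        while frontier:
            node, prod = frontier.pop()
            for nbr in adj[node]:
                p = prod * weights[frozenset((node, nbr))]
                if nbr not in cluster and p > epsilon:
                    cluster.add(nbr)
                    frontier.append((nbr, p))
        clusters[i] = cluster
        seeds_used |= cluster
        i += 1
    return clusters

# e.g. clusters = leaf_clusters(weights, epsilon=0.02) for the running example
```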
Clusters of level k are computed using the above algorithm from the clusters at level k − 1, continually adding new cluster identifiers. The algorithm is, however, applied on an undirected weighted cluster graph G_c = (V_c, E_c, L_c) and generates the higher-level clusters by unions of the nodes associated with the lower-level ones. The G_c graph for a level k is defined as follows. Let us assume that the level k − 1 clusters c_1, c_2, ..., c_n correspond to the sets of nodes ν(c_1), ν(c_2), ..., ν(c_n). Then the node set V_c for level k equals {c_1, c_2, ..., c_n} and the edge set E_c is computed as {{x, y} | ∃u. x, y ∈ V_c ∧ u ∈ ν(x) ∩ ν(y)}. The weight labelling L_c assigns a weight to each edge in E_c according to the following formula: L_c({x, y}) = (2 Σ_{e ∈ E*} L_?(e) + Σ_{e ∈ E+} L_?(e)) / (2|E*| + |E+|), where L_? is the weight labelling function of the corresponding G_? graph representation, and E*, E+ are the sets of edges in G_? that are fully and partially covered by the nodes in the ν(x) ∩ ν(y) intersection (full coverage means that both nodes of an edge are in the intersection, while partial coverage requires exactly one edge node to be present there). It is easy to see that the G_c graphs connect clusters that have a non-empty node overlap. The weights of the connections are computed as a weighted arithmetic mean of the weights of the edges with nodes in the cluster intersections, where the edges with both nodes in the intersection contribute twice as much as the edges with only one node there.

The final product of the clustering algorithm is a mapping between cluster identifiers and corresponding sets of nodes. As the more specific (i.e., lower-level) clusters are incrementally merged into more abstract ones, each node can be assigned a set of so called tree codes that reflect its membership in the particular clusters. The tree codes have the form L_1.L_2. ... .L_{n−1}.L_n, where L_i are identifiers of clusters of increasing specificity (i.e., L_1, L_n are the most general and the most specific, respectively). For the CoCoE measures, we sometimes consider only the top-level cluster identifiers, which we denote by C_T^X for an entity X. The notation C_S^X refers to the set of all specific cluster identifiers associated with an entity X.
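The level-k cluster graph could be assembled along the following lines; note that the normalising denominator reflects the weighted-mean reading given above and is an interpretation rather than a formula taken verbatim from the paper.

```python
from itertools import combinations

def cluster_graph(clusters, weights):
    """Weighted cluster graph G_c for the next level: clusters sharing nodes
    are connected, and the connection weight is a weighted mean of the G_?
    edge weights touching the shared nodes (edges with both endpoints in the
    overlap count twice)."""
    edges = {}
    for x, y in combinations(clusters, 2):
        overlap = clusters[x] & clusters[y]
        if not overlap:
            continue
        full, partial = [], []
        for edge, w in weights.items():
            u, v = tuple(edge)
            covered = (u in overlap) + (v in overlap)
            if covered == 2:
                full.append(w)
            elif covered == 1:
                partial.append(w)
        if full or partial:
            edges[frozenset((x, y))] = (2 * sum(full) + sum(partial)) / (2 * len(full) + len(partial))
    return edges
```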
To give an example of how the clustering works, let us assume the ε threshold is set to the minimum of the graph weights at each level of the clustering. Considering the dataset from the previous examples, the computation of the initial clusters according to the G_w representation can start from any node, as their clustering coefficient is always zero. Let us start from the node A then. The corresponding hierarchical clustering process is depicted in Figure 3, together with the resulting cluster structure (i.e., dendrogram).

Fig. 3. Example of a clustering

The value of ε in the first step is 0.02 and thus the clustering puts the nodes A, B, D into a cluster C1 first, proceeding from C then and creating a cluster C2 consisting of C, B. No other traversals are possible, as the multiplied edge weights already fall below the threshold. The next level uses the {B,C} edge weight 0.02 as the ε threshold again, as it is the only edge connecting the C1, C2 clusters. The top-most cluster C3 is a union of the C1, C2 ones. The resulting sets of tree codes are: {C3.C1} for nodes A, D; {C3.C2} for node C; {C3.C1, C3.C2} for node B.

For some of the measures defined later on, we need a notion of the number and size of clusters. Let us assume a set of entities Z ⊆ V. The number of clusters associated with the entities from Z, cn(Z), is then cn(Z) = |⋃_{x ∈ Z} C_?^x|, where ? is one of T, S (depending on whether we are interested in the top or specific clusters, respectively). The size of a cluster C_i ∈ C_?^x, cs(C_i), is the absolute frequency of the mentions of C_i among the clusters associated with the entities in Z. More formally, cs(C_i) = |{x | x ∈ Z ∧ C_i ∈ C_?^x}|.

The taxonomies can be used for defining a taxonomy-based similarity that reflects the closeness of entities depending on which clusters they belong to. The similarity is sim_tax(x, y) = max({2 · dpt(lcs(u, v)) / (dpt(u) + dpt(v)) | u ∈ C_S^x, v ∈ C_S^y}), where the specific tree codes in C_S^x, C_S^y are interpreted as nodes in the taxonomy induced by the dataset's hierarchical clustering. The lcs function computes the least common subsumer of two nodes in the taxonomy and dpt is the depth of a node in the taxonomy (defined as zero if no node is supplied as an argument, i.e., if lcs has no result). The formula we use is essentially based on the popular Wu-Palmer similarity measure [4]. We only maximise it across all possible cluster annotations to find the best match (as the data are supposed to be unambiguous, such a strategy is safe). To illustrate the taxonomy-based similarity, let us assume the hierarchical clusters from the previous example: {C3.C1} for nodes A, D; {C3.C2} for node C; {C3.C1, C3.C2} for node B. The taxonomy-based similarities between the nodes are then as follows: sim_tax(A, C) = sim_tax(D, C) = 0.5, and sim_tax(B, C) = sim_tax(B, A) = sim_tax(B, D) = sim_tax(A, D) = 1 (the nodes are siblings).
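With tree codes represented as dot-separated strings, the taxonomy-based similarity reduces to a few lines; the representation and names are assumptions made for illustration.

```python
def dpt(code):
    """Depth of a tree code such as 'C3.C1' in the induced taxonomy."""
    return len(code.split(".")) if code else 0

def lcs(u, v):
    """Least common subsumer of two tree codes: their longest shared prefix."""
    common = []
    for a, b in zip(u.split("."), v.split(".")):
        if a != b:
            break
        common.append(a)
    return ".".join(common)

def sim_tax(codes_x, codes_y):
    """Wu-Palmer-style similarity, maximised over all pairs of specific tree
    codes assigned to the two entities."""
    return max(2.0 * dpt(lcs(u, v)) / (dpt(u) + dpt(v))
               for u in codes_x for v in codes_y)

tree_codes = {"A": {"C3.C1"}, "D": {"C3.C1"}, "C": {"C3.C2"}, "B": {"C3.C1", "C3.C2"}}
print(sim_tax(tree_codes["A"], tree_codes["C"]))  # 0.5
print(sim_tax(tree_codes["B"], tree_codes["A"]))  # 1.0
```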
The CoCoE measures of a dataset are computed using its G_w representation, on which we execute multiple heuristic quasi-random walks defined as follows. Let l be a natural number and h : V → V a heuristic function that selects a node to follow for any given node in G_w. Then a heuristic quasi-random walk on G_w of length l according to heuristic h is an ordered tuple W = (v, h(v), h(h(v)), ..., h^{l−1}(v), h^l(v)), where v is a random initial node in G_w. The walks simulate exploration of RDF datasets, either by a human user browsing the corresponding graph, or by an automated traversal and/or query agent. We use the indirect representation to cater for a broader range of possible traversal strategies (agents can easily explore the subject-predicate-object links in both directions, for instance by means of DESCRIBE queries). By running a high number of walks, one can examine characteristic patterns of the dataset much earlier than by an exhaustive exploration of all possible connections (which is generally in the O(n!) range w.r.t. the number of entities). Formal bounds on the representativeness implied by a specific number of random walks are currently an open problem. However, our experiments suggest that a number ensuring representative enough sampling can easily be determined empirically.

To simulate different types of exploration, we can define various heuristics. For a given input node v, all heuristics compute a ranked list of the neighbours of v. The list is then iteratively processed (starting with the highest-ranking neighbour), attempting to select the next node with a probability that is inversely proportional to its rank. If no node has been selected after processing the whole list, a random neighbour is picked. The distinguishing factor of the heuristics is the set of criteria for ranking the neighbour list. We employed the following selection preferences in our experiments: (1) nodes that have not been visited before (H1); (2) unvisited nodes connected by edges with higher weight (H2); (3) unvisited nodes that are more similar to the current one, using the sim_tax similarity introduced before (H3); (4) unvisited nodes that are less similar (H4). H1 simulates more or less random exploration that, however, prefers unvisited nodes. H2 follows more significant relations. Finally, H3 and H4 are dual heuristics, with H3 simulating exploration of topics related to the current node and H4 attempting to cover as many topics as possible.

Each walk W can be associated with an envelope e(W, r) with a radius r, which is a sub-graph of G_w limited to a set of nodes V_W^r. V_W^r represents a neighbourhood of the walk and is defined as ⋃_{u ∈ W} {v | v ∈ V ∧ |p_{G_w}(u, v)| ≤ r}, where p_{G_w}(u, v) is a shortest path between the nodes u, v in G_w. The envelope is used for computing the complexity and entropy measures later on, as it corresponds to the contextual information available to agents along a walk.

Having introduced all the preliminaries, we can finally define the measures used in our sample implementation of CoCoE. The first type of measures is based on the complexity of the graph representations. We distinguish between local and global complexities. The global ones are associated with the graphs as a whole, and we compute specifically graph diameters, average shortest paths and node distributions along walks. The local measures associated with the walk envelopes are: (A) envelope size in nodes; (B) envelope size in biconnected components; (C) average component size in nodes; (D) average clustering coefficient of the walk nodes w.r.t. the envelope graph.

The coherences of walks are based on similarities. Let us assume a sequence of walk nodes v_1, v_2, ..., v_n. Then the particular coherences are: (E) taxonomy-based start/end coherence sim_tax(v_1, v_n); (F) taxonomy-based product coherence Π_{i ∈ {1,...,n−1}} sim_tax(v_i, v_{i+1}); (G) average taxonomy-based coherence (1/(n−1)) Σ_{i ∈ {1,...,n−1}} sim_tax(v_i, v_{i+1}); (H) distributional start/end coherence sim_cos(v_1, v_n); (I) distributional product coherence Π_{i ∈ {1,...,n−1}} sim_cos(v_i, v_{i+1}); (J) average distributional coherence (1/(n−1)) Σ_{i ∈ {1,...,n−1}} sim_cos(v_i, v_{i+1}). This family of measures helps us to assess how topically convergent (or divergent) the walks are.
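A rough sketch of the walk machinery under the same assumptions as before (an adjacency map adj and the frozenset-keyed weights): the ranking heuristic is passed in as a function, H2 is shown as an example, and the envelope and average coherence follow the definitions above. The per-rank acceptance probability is one concrete choice; the paper only requires it to decrease with the rank.

```python
import random

def quasi_random_walk(adj, length, rank_key):
    """One heuristic quasi-random walk on G_w: neighbours of the current node
    are ranked by rank_key and accepted with a probability decreasing with
    their rank; a random neighbour is the fallback."""
    current = random.choice(list(adj))
    walk, visited = [current], {current}
    for _ in range(length):
        ranked = sorted(adj[current],
                        key=lambda n: rank_key(current, n, visited),
                        reverse=True)
        nxt = next((cand for rank, cand in enumerate(ranked, start=1)
                    if random.random() < 1.0 / (rank + 1)), None)
        if nxt is None:
            nxt = random.choice(ranked)
        walk.append(nxt)
        visited.add(nxt)
        current = nxt
    return walk

def make_h2(weights):
    """Heuristic H2: prefer unvisited neighbours reachable over heavier edges."""
    def rank_key(current, candidate, visited):
        return (candidate not in visited, weights[frozenset((current, candidate))])
    return rank_key

def envelope(adj, walk, radius):
    """Envelope e(W, r): all nodes within the given number of hops of the walk."""
    nodes, frontier = set(walk), set(walk)
    for _ in range(radius):
        frontier = {n for u in frontier for n in adj[u]} - nodes
        nodes |= frontier
    return nodes

def average_coherence(walk, sim):
    """Measures (G)/(J): mean similarity of consecutive walk nodes,
    with sim being sim_tax or sim_cos."""
    return sum(sim(a, b) for a, b in zip(walk, walk[1:])) / (len(walk) - 1)
```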
To compute walk entropies, we use the T_w, T_s taxonomies. By definition, the higher the entropy of a variable, the more information the variable contains. In our context, a high entropy value associated with a walk means that there is a lot of information available for agents to possibly utilise when processing the graph. The entropy measures we use relate to the following sets of nodes and types of clusters representing the context of the walks: (K) walk nodes only, top clusters; (L) walk nodes only, specific clusters; (M) walk and envelope nodes, top clusters; (N) walk and envelope nodes, specific clusters. The entropies of the sets (K-N) are defined using the notion of cluster size (cs(·)) introduced before. Given a set Z of nodes of interest, the entropy H(Z) is computed as H(Z) = − Σ_{C_i ∈ C_?(Z)} (cs(C_i) / Σ_{C_j ∈ C_?(Z)} cs(C_j)) · log(cs(C_i) / Σ_{C_j ∈ C_?(Z)} cs(C_j)), where ? is one of T, S, for top or specific clusters, respectively.
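The entropy of a node set then only requires counting cluster mentions; the base-2 logarithm below is an arbitrary choice, as the paper does not fix the base.

```python
import math
from collections import Counter

def entropy(nodes, cluster_annotations):
    """Entropy H(Z) of a set of nodes Z, based on the cluster identifiers
    assigned to them (top-level or specific, depending on the annotation map)."""
    sizes = Counter(c for n in nodes for c in cluster_annotations.get(n, ()))
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    return -sum((s / total) * math.log(s / total, 2) for s in sizes.values())

# Annotations for the running example:
top_clusters = {"A": {"C3"}, "B": {"C3"}, "C": {"C3"}, "D": {"C3"}}
specific_clusters = {"A": {"C3.C1"}, "D": {"C3.C1"}, "C": {"C3.C2"}, "B": {"C3.C1", "C3.C2"}}
print(entropy(["A", "B", "D"], specific_clusters))  # ~0.81: two overlapping specific clusters
print(entropy(["A", "B", "D"], top_clusters))       # 0.0: a single top cluster
```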
Generally speaking, high complexity means a lot of potentially useful structural information, but also more expensive search (e.g., by means of queries) due to high branching factors among the nodes, and the other way around. High coherence means that, in general, any exploratory walk through the dataset tends to be focused in terms of the topics covered, while low coherence indicates a rather serendipitous nature of a dataset where exploration tends to lead through many different topics. Finally, high entropy means more information and also less predictable topic distributions along the nodes in the walks and envelopes, with balanced cluster cardinalities. Low entropy means high predictability of the node topics (in other words, strongly skewed cluster cardinalities).

Possible combinations of measures can be enumerated as follows. Let us refer to comparatively higher and lower measures by the ↑ and ↓ symbols. Then the combinations of relative complexity, coherence and entropy measures, respectively, are: 1, ↑↑↑: Complex patterns and informative topic annotations about focused subject domains. 2, ↑↑↓: Focused around unevenly distributed sets of topics with complex structural information context. 3, ↑↓↑: Serendipitous, with a lot of equally significant complex contextual information. 4, ↓↑↑: Focused, with balanced and simple contextual information. 5, ↑↓↓: Serendipitous with complex contextual topics of uneven cardinality. 6, ↓↑↓: Focused with simple uneven contexts. 7, ↓↓↑: Serendipitous with simple balanced contexts. 8, ↓↓↓: Serendipitous with simple uneven contexts.

Some of the specific measure combinations may be particularly (un)suitable for certain use cases. To give a few non-exhaustive examples, the combination 1, ↑↑↑ is suitable for knowledge discovery about focused subject domains, but also challenging for querying. Combination 3, ↑↓↑ is good for serendipitous browsing. Combination 4, ↓↑↑ may be useful for semantic annotations of a set of core domain entities, as it provides for simple lookups of focused and balanced contextual information. Similarly, combination 7, ↓↓↑ may be more applicable for annotations of varied domain entities.

In this section, we first present the settings of experiments with CoCoE applied to sample RDF datasets. Then we report on the results of the experiments and discuss their interpretation. Note that the implementation of the CoCoE methodology used in the experiments, including the corresponding data and scripts, is available at http://goo.gl/Wxnb3B . The datasets we used were: 1.
DrugBank – information on marketed drugs, including indications, chemical and molecular features, manufacturers, protein bindings, etc.; 2.
SIDER – information on drug side effects; 3.
Diseasome – a network of disorders and associated genes; 4. all – an aggregate of the
DrugBank , SIDER and
Diseasome datasets using the DrugBank URIs as a core vocabulary to which the other datasets are mapped. The dataset selection was motivated by our recent work in adverse drug effect discovery, for which we have been compiling a knowledge base from relevant biomedical Linked Open Data [13]. One of the main purposes of the knowledge base is to extract features applicable to training adverse effect discovery models. In this context, we were interested in the characteristics of the knowledge bases corresponding to the isolated and merged datasets, yet we lacked the means for measuring this. Therefore we decided to use the knowledge bases being created in [13] as a test case for CoCoE.

For each dataset, we generated: (1) The direct graph and distributional representations G_d, M, with M reduced to the 250 most significant dimensions according to their χ² scores. (2) The weighted indirect and similarity representations G_w, G_s, taking into account only similarity values above
0.5. (3) Taxonomies T_w, T_s based on the G_w, G_s graph clustering, respectively.

The quasi-random heuristic walks were run using all combinations of the following parameters for each dataset: (1) Walk lengths l (three settings). (2) Envelope diameters r (two settings). (3) Heuristics h ∈ {H1, H2, H3, H4} (i.e., random, weight, similarity and dissimilarity preference). The number of samples (i.e., walk executions per parameter combination) was |V| / (k(l + 1)), where |V|, l are the number of graph nodes and the walk length in the given experimental batch, respectively, and k is a constant equal to the average shortest path length in the graphs, truncated to an integer value. In our experiments, the observed relative trends were stable after reaching this number of repetitions and therefore we took it as a sufficient ‘sampling rate.’

Figure 4 gives an overview of how the specific heuristics perform for each dataset regarding the node visit frequency. The x-axis reflects the ranking of nodes according to the number of visits to them. The y-axis represents the visit frequencies. Both axes are log-scale, since all the distributions have very steep long tails.

Fig. 4. Node distributions along the walks (one node-rank vs. node-frequency panel per dataset – SIDER, Diseasome, DrugBank and all – with one curve per heuristic and taxonomy)

The prevalent trends in the plots are: 1. The heuristics H2 and H3 (edge weight and similarity preference), especially when using the T_w taxonomies, tend to have generally more long-tail distributions than the others (the pattern is most obvious in the Diseasome dataset). 2. The H4 heuristic, using the T_w taxonomy, has the most even distribution. 3. The heuristics using the T_s taxonomies tend to have very similar node visit frequency distributions, close to H1, which exhibits the most ‘average’ behaviour (presumably due to its highest randomness). 4. The heuristics seem to follow similar patterns in the DrugBank and all datasets. 5. In
SIDER, the behaviour of the heuristics appears to be most irregular (for instance, the random heuristic H1 behaves differently for the T_w and T_s taxonomies although the taxonomy used should not have any influence on that heuristic).

Table 1 summarises the global characteristics of the datasets and the corresponding G_w graph representations. |V|, |E| are the numbers of nodes and edges in G_w, respectively, D is the graph density (defined as D = 2|E| / (|V|(|V| − 1))), d is the graph diameter, l_G is the average shortest path length and |C| is the number of connected components.

Table 1. Global graph statistics

  Dataset    |V|      |E|      |E|/|V|   D      d       l_G     |C|
  SIDER      27,924   96,427   3.453     0.…    ….998   4.385   2
  Diseasome  28,102   64,172   2.284     0.…    ….999   3.914   3
  DrugBank   219,513  361,389  1.646     0.…    ….999   4.352   2
  All        265,548  513,326  1.933     0.…    ….998   4.667   3

All graphs have the so called small world property [2], as their densities are rather small and yet there is very little separation between any two nodes in the graph in general. This typically happens in highly complex graphs with a lot of interesting patterns in them.
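For graphs of moderate size, the statistics in Table 1 can be reproduced with networkx along these lines (for the larger datasets, exact diameters and average shortest paths are expensive, so the released implementation may well compute them differently); weights is again the frozenset-keyed edge-weight map from the earlier sketches.

```python
import networkx as nx

def global_statistics(weights):
    """Global statistics of the undirected weighted representation G_w,
    as reported in Table 1."""
    G = nx.Graph()
    for edge, w in weights.items():
        u, v = tuple(edge)
        G.add_edge(u, v, weight=w)
    # Diameter and average shortest path are only defined on connected graphs,
    # so compute them on the largest connected component.
    largest = G.subgraph(max(nx.connected_components(G), key=len))
    return {
        "|V|": G.number_of_nodes(),
        "|E|": G.number_of_edges(),
        "|E|/|V|": G.number_of_edges() / G.number_of_nodes(),
        "D": nx.density(G),
        "d": nx.diameter(largest),
        "l_G": nx.average_shortest_path_length(largest),
        "|C|": nx.number_connected_components(G),
    }
```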
Figure 5 presents plots of the complexity measures based on the walk sampling. The x-axis represents the combinations of experimental parameters, grouped by the type of heuristic – the first, second, third and fourth horizontal quarters of the plot correspond to H1, H2, H3 and H4, respectively. For each heuristic, there are six different combinations of the path length and envelope diameter, progressively increasing from left to right. The y-axis represents the actual value of the measure plotted, rendered in an appropriate log-scale if there are too big relative differences between the plotted values. Each plot represents one type of measure and different colours correspond to specific datasets (red for Diseasome, green for DrugBank, blue for SIDER and black for all). The full and dashed lines are for experiments using the T_w and T_s taxonomies, respectively. All the walk-sampling results reported below are plotted in this fashion.

Fig. 5. Complexity plots (envelope size in nodes, envelope size in components, average component size, and envelope clustering coefficient)

The results of the complexity measures can be summarised as follows: 1. The size and number of components increase with longer walks and larger envelopes. 2. The SIDER dataset generally has the lowest number of components, of the smallest size, while
Diseasome is dominating in these measures. 3. The all dataset has relatively large components on average, but there are fewer of them than in the case of Diseasome. 4. The all dataset has the largest complexity in terms of clustering coefficients, with Diseasome a close second and DrugBank comparatively much smaller. SIDER has zero complexity according to the clustering coefficient.

The results of the coherence analysis are in Figure 6.
Fig. 6. Coherence plots (start/end, product and average path coherence, each for the taxonomy-based and the cosine similarity)

The general observations are: 1. The start/end coherences tend to be higher for shorter path lengths. 2. The coherences in the samples using the T_w taxonomies are generally higher than the ones using the T_s taxonomy. 3. SIDER has the lowest coherence in most cases. 4. The product and average coherences tend to be relatively lower for the H4 (dissimilarity) heuristic. 5. The
Diseasome dataset is generally the second best for most coherence types.
DrugBank is generally third, except for the start/end coherence where it is mostly the best. For the average and product coherences, the all dataset usually performs best. The trend is clearer for the coherences based on the taxonomical similarity.

The entropy results using the T_w, T_s taxonomies for the topic annotations are in Figure 7.
Fig. 7. Entropy plots (top-cluster and all-cluster entropy, for walk paths and walk envelopes)

The observations can be summarised as follows: 1. The entropies computed using the T_s taxonomy are always higher than the ones based on T_w when taking into account only the most general identifiers of the cluster annotations (the left hand side plots). The trend is opposite, though not as clear, for the full (i.e., specific) cluster annotations. 2. The entropies tend to be higher for the H2, H3 heuristics (weight and similarity preferences). 3. Generally, the entropies increase with the length of the walks; however, the all dataset tends to exhibit such behaviour more often than the others (which do so basically only in the case of the H2, H3 heuristics for top clusters). 4. The isolated datasets tend to have higher entropies than the all one for specific clusters (right hand side plots), with
Diseasome or SIDER being the most entropic ones and
DrugBank usually being the second-highest. 5. On the other hand, the all dataset generally has the highest entropy for the abstract clusters (left hand side plots) based on the T_w taxonomy. 6. The results based on the T_s taxonomies are mostly close to each other; however, the Diseasome and
SIDER datasets tend to have higher entropiesthan the others for the H2 and H3, H4 heuristics, respectively.
The classification of the
SIDER, Diseasome, DrugBank and all datasets is ↓↓↑, ↑↑↓, ↓−−, ↑↑−, respectively. We assigned the ↑ or ↓ symbols to datasets that have the corresponding measures distinctly higher or lower than at least two other datasets in more than half of all possible settings. We used a new − symbol if there is no clearly prevalent higher-than or lower-than trend. According to the classification, SIDER is more serendipitous with simple balanced contexts and
Diseasome is more focused around uneven sets of topics with a complex structural information context. The general classification of the
DrugBank and all datasets is trickier due to less significant trends observed. However,
DrugBank is definitely simpler (even more so than the much smaller
Diseasome), and the all dataset is more focused and complex. Another general observation is that the parameter settings typically do not influence the relative differences between the dataset performances. The only exceptions are the sim_cos start/end coherences and the specific cluster entropies, but the differences do not seem to be too dramatic even there.
DrugBank and all datasets are related to the slightly different semantics of the particular measureswithin those classes. In case of coherence, the start/end one can be interpreted asan approximation of dataset’s “attractiveness,” i.e. , the likelihood of ending up ina similar topic no matter where and how one goes. The other coherences take intoaccount consequent steps on the walks and thus are more related to the measureof average or cumulative topical “dispersal” across single steps. For entropies, thetop-cluster and all-cluster entropies measure the information content regardingabstract and specific topics, respectively. Therefore the measures can exhibitdifferent trends for datasets that have uneven granularity of the taxonomy levels.To compare the results of the empirical analysis of the datasets with theintentions of their creators, let us start with
SIDER that has been designedas simple-structured dataset where one can easily retrieve associations betweendrugs and their side effects. Our observations indeed confirm this –
SIDER isclassified as relatively simple, with balanced contexts and without any signifi-cant “attractor” topics.
Diseasome focuses on capturing complex disease-generelationships, which again corresponds to our analysis – the dataset is relativelyfocused and complex with rather low entropy in the contexts. Finally,
Drug-Bank is supposed to link drugs with comprehensive information spanning acrossmultiple domains like pharmacology, chemistry or genetics, with the informationusually defined in external vocabularies. The high start/end and low cumulativecoherences indicate a strong attractiveness despite of frequent context switching( i.e. , no matter where you start, it is likely that you will be in a drug-relatedcontext and you will end up there again even if you switch between other topicson the way). The low complexity measured by CoCoE indicates relatively simplestructure of the links. This is consistent with a manually assessed structure ofDrugBank – it contains many relations fanning out from drug entities while theother nodes are seldom linked to anything else than other drugs.One of the most interesting dataset-specific observations, though, is related tothe aggregate all dataset. It is clearly most complex. It has rather low start/endcoherence, but generally quite high cumulative coherences. In addition, the ab-stract and specific entropies are relatively high and low, respectively. This meansthat a traversing agent explores increasingly more distant topics, but shiftingonly a little at a time. The specific contextual topics are quite unpredictable,but the abstract topics tend to be more regular, meaning that one can learn alot of details about few general domains using the dataset. These characteristicsmake the all dataset most suitable for tasks like knowledge discovery and/or ex-traction of complex features associated with drugs or diseases. This is very usefulinformation in the scope of our original motivations for picking the experimentaldatasets ( i.e. , feature selection for adverse drug effect discovery models).
We have presented CoCoE, a well-founded methodology for empirical analysis of LOD datasets. We have also described a publicly available implementation of the methodology. The experimental results demonstrated the utility of CoCoE, as it provided a meaningful automated assessment of biomedical datasets that is consistent with the intentions of the dataset authors and maintainers.

Our future work involves more scalable clustering and graph traversal algorithms that would make CoCoE readily applicable even to the largest LOD datasets like DBpedia or UniProt. We also want to experiment with other implementations of the methodology, using and formally analysing especially different similarities and clusterings. Another interesting research topic is studying the correlation between the performance of specific SPARQL query types and particular CoCoE measure value ranges, which could provide valuable insights for maintainers and users of SPARQL endpoints. We also want to work together with dataset providers in order to establish a more systematic and thorough mapping between the CoCoE assessment of datasets and their suitability to particular use cases. Last but not least, we intend to investigate other possible applications of the CoCoE measures, such as machine-aided modelling or vocabulary debugging.
References
1. Harpaz, R., DuMouchel, W., Shah, N.H., Madigan, D., Ryan, P., Friedman, C.: Novel data-mining methodologies for adverse drug event discovery and analysis. Clinical Pharmacology & Therapeutics (6) (2012) 1010–1021
2. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press (1994)
3. Nováček, V., Handschuh, S., Decker, S.: Getting the meaning right: A complementary distributional layer for the web semantics. In: Proceedings of ISWC'11, Springer (2011)
4. Pesquita, C., Faria, D., Falcão, A.O., Lord, P., Couto, F.M.: Semantic similarity in biomedical ontologies. PLoS Computational Biology (7) (2009)
5. Nováček, V., Burns, G.A.: SKIMMR: Facilitating knowledge discovery in life sciences by machine-aided skim reading. PeerJ (2014) In press, see https://peerj.com/preprints/352/ for a preprint.
6. Nocetti, F.G., Gonzalez, J.S., Stojmenovic, I.: Connectivity based k-hop clustering in wireless networks. Telecommunication Systems (1-4) (2003) 205–220
7. Ahn, Y.Y., Bagrow, J.P., Lehmann, S.: Link communities reveal multiscale complexity in networks. Nature (7307) (2010) 761–764
8. Campinas, S., Perry, T.E., Ceccarelli, D., Delbru, R., Tummarello, G.: Introducing RDF graph summary with application to assisted SPARQL formulation. In: Proceedings of DEXA'12, IEEE (2012) 261–266
9. Langegger, A., Wöß, W.: RDFStats – an extensible RDF statistics generator and library. In: DEXA Workshops (2009) 79–83
10. Möller, K., Hausenblas, M., Cyganiak, R., Handschuh, S.: Learning from linked open data usage: Patterns & metrics. In: Proceedings of WebSci'10, Web Science Trust (2010)
11. Colazzo, D., Goasdoué, F., Manolescu, I., Roatis, A.: RDF Analytics: Lenses over Semantic Graphs. In: Proceedings of WWW'14, ACM (2014)
12. Dowdy, S., Weardon, S., Chilko, D.: Statistics for Research. Wiley (2005)
13. Abdelrahman, A.T., Munoz, E., Nováček, V., Vandenbussche, P.Y.: Supporting knowledge discovery by linking diverse biomedical data. In: AMIA'14 Abstracts, AMIA (2014) Submitted to AMIA'14, preprint at: http://goo.gl/3jl8XL