Formalising Hypothesis Virtues in Knowledge Graphs: A General Theoretical Framework and its Validation in Literature-Based Discovery Experiments
Vít Nováček
Insight @ NUI Galway (formerly known as DERI), IDA Business Park, Lower Dangan, Galway, Ireland
Abstract
We introduce an approach to discovery informatics that uses so-called knowledge graphs as the essential representation structure. Knowledge graph is an umbrella term that subsumes various approaches to tractable representation of large volumes of loosely structured knowledge in a graph form. It has been used primarily in the Web and Linked Open Data contexts, but is applicable to any other area dealing with knowledge representation. In the perspective of our approach motivated by the challenges of discovery informatics, knowledge graphs correspond to hypotheses. We present a framework for formalising so-called hypothesis virtues within knowledge graphs. The framework is based on a classic work in philosophy of science, and naturally progresses from mostly informative foundational notions to actionable specifications of measures corresponding to particular virtues. These measures can consequently be used to determine refined sub-sets of knowledge graphs that have large relative potential for making discoveries. We validate the proposed framework by experiments in literature-based discovery. The experiments have demonstrated the utility of our work and its superiority w.r.t. related approaches.
Keywords: discovery informatics, hypotheses as knowledge graphs, hypothesis virtue formalisation, automated knowledge graph construction, evolutionary refinement, literature-based discovery
1. Introduction
Ever since the dawn of the computer age, researchers have been intrigued by the possibility of automating the process of discovery [27]. Today, the field of discovery informatics is getting more relevant than ever before. The large
This work has been supported by the "KI2NA" project funded by Fujitsu Laboratories, Limited in collaboration with Insight, NUI Galway. We also greatly appreciate comments of Pierre-Yves Vandenbussche who helped us to refine the presentation of the article.
Email address: [email protected] (Vít Nováček)
amounts of data that are being made openly available for anyone to explore have an immense potential for making new discoveries, and solutions that would enable this are highly sought after [15].

Knowledge graphs are one of the most universal ways of representing actionable, data-driven knowledge at large scale [8]. They represent knowledge as relationships (edges) between items of interest (vertices), with the possibility of adding additional annotations representing for instance multiple relationship types (i.e., predicates). Such a representation has many advantages like universal applicability and a wealth of well-founded methods for analysing graph structures. Yet the full potential of knowledge graphs for practical applications in knowledge discovery is still largely to be explored [8].

The motivation of the presented work is two-fold. Firstly, we want to propose a general framework for defining features of knowledge graphs that can determine which parts of the graphs have the highest potential for making discoveries. We believe that this can facilitate the process of semi-automated knowledge discovery in domains that have a lot of data available in graph-like format, but suffer from high redundancy and noise (e.g., World Wide Web, social networks or biological pathway databases).

The second motivation is more practical. In our previous work [30], we addressed the problem of extracting simple knowledge graphs from biomedical texts. The graphs were then used for so-called machine-aided skim reading – high-level navigation of a specific domain represented by a textual corpus which was assumed to facilitate the discovery process. Indeed, even highly experienced domain experts were able to discover new and relevant facts using the prototype system. However, the results also contained some noise and connections that were correct, but rather obvious and/or uninteresting. This motivated the validation experiments presented here, which demonstrate that our framework for formalising hypothesis virtues can tackle the problems of noise, redundancy and obviousness in knowledge graphs automatically extracted from texts.

Our approach consists of formalising features applicable to ranking knowledge graphs (or their partitions) based on their potential for making discoveries. This can be used for instance for decomposing knowledge graphs into atomic subgraphs and consequent construction of a graph that has a higher "discovery potential" than the original one. The formalisation is based on widely accepted hypothesis virtues studied in philosophy of science [36]. Examples of virtues are refutability or generality – a good scientific hypothesis has to be falsifiable and should also provide explanations of phenomena outside of its original scope. We present general conditions for each of the virtues and proceed with defining specific measures that conform to these conditions and can be efficiently implemented.

The validation of the approach was performed in the context of literature-based discovery [41]. We extracted knowledge graphs from two de facto standard biomedical corpora traditionally used in the evaluation of literature-based discovery tools. For that we used a very simple and domain-agnostic method that extracts statistically significant co-occurrence relationships. We opted for such a solution to demonstrate the universal applicability of our approach.
From these basic graphs, we constructed refined ones using a genetic algorithm that utilises the hypothesis virtue measures in the fitness function. The refined graphs were analysed according to the evaluation measures used in the literature-based discovery field and compared to related works. The results of the validation were positive, as we outperformed the state of the art in most respects. Moreover, we discovered relevant relationships that have not been covered by any related automated system or manual study. This demonstrates the practical utility of our approach.

Our main contributions are as follows. We have proposed a novel theoretical framework for extensible definition of measures that can be used to analyse the discovery potential of knowledge graphs. We have defined specific measures applicable especially to the refinement of knowledge graphs automatically extracted from texts. We have implemented an evolutionary method for refinement of the automatically extracted knowledge graphs that is applicable out-of-the-box to any domain where English texts are available. We have demonstrated the practical relevance of the presented research by a successful experimental validation in the field of literature-based discovery. Last but not least, we have provided a data package containing a prototype implementation of our approach, results and other data necessary for the replication of our experiments.

The rest of the article is organised as follows. Section 2 presents the general framework for formalising the hypothesis virtues in the context of knowledge graphs. Section 3 then introduces actual measures that follow the general requirements of the hypothesis virtue formalisations. Our approach is experimentally validated in Section 4. The section describes the evolutionary refinement of knowledge graphs extracted from texts and elaborates on the experiments in literature-based discovery. Related approaches are discussed in Section 5. Finally, we conclude the article and outline our future work in Section 6.
2. Formalising Hypothesis Virtues
The foundations of the presented work are built on [36], a classic work in philosophy of science. The work introduces five virtues of hypothesis: conservatism, modesty, simplicity, generality and refutability. These virtues present a comprehensive compilation of the philosophical treatments of discovery ranging from antiquity to modern analytical philosophy, and have been frequently used as a reference for determining the quality of hypotheses in science.

According to [36], the virtue of conservatism reflects the fact that a good hypothesis usually makes rather conservative claims. This is to minimise the risk of error by reaching too far from the state of the art in one step (even though the combination of the particular conservative claims may go very far after all, indeed).
Modesty is related to conservatism – a hypothesis A is more modest than A and B (since A and B entails A), and a more modest hypothesis is considered better as it minimises the risk of wrong and/or redundant claims. The simplicity virtue posits that a good hypothesis should simplify our view of the world by making new claims about it, even though the claims themselves may actually be quite complex. The generality virtue is related to the predictive power of hypothesis – the more phenomena (that have perhaps not even been considered originally) it can explain, the better it is. Finally, refutability means that a hypothesis should be falsifiable in as obvious a manner as possible. This is a factor of utmost importance, as discussed in arguably the most influential work on this topic [33].

In the following, we first define the notions of hypotheses and their claims in the context of knowledge graphs (Section 2.1) and then continue with formalising the five virtues (Section 2.2).
2.1. Hypotheses and Claims in Knowledge Graphs

First we define a universe – a general knowledge graph within which particular hypotheses may be defined.
Definition 1.
A universe graph $U$ is a tuple $(V_U, E_U, \Lambda_V, \Lambda_E)$ where $V_U$ is a set of vertices, $E_U \subseteq V_U \times V_U$ is a set of edges and $\Lambda_V, \Lambda_E$ are sets of labeling maps (i.e., morphisms) that associate values with the universe vertices and edges, respectively.

The labeling maps can, for instance, assign predicate types to edges in semantic networks, assert vertex types like class or individual in ontology knowledge graphs, or associate confidence weights with edges of automatically extracted knowledge graphs. Such a definition can accommodate a broad range of knowledge graphs with varying levels of semantic complexity, while keeping the basic structure still compatible with the analysis methods introduced here. The universe can be either directed or undirected. The experiments presented in this article deal with an undirected universe and therefore we assume undirected graphs in the following unless explicitly stated otherwise.
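As an aside, such a labeled universe maps directly onto common graph libraries. The following minimal sketch (our illustration, not the authors' prototype; it assumes the networkx library, and the vertex names and the "type" label are purely hypothetical) realises the labeling maps as attribute dictionaries:

import networkx as nx

# Undirected universe, as assumed throughout the article.
U = nx.Graph()

# Vertices with an illustrative vertex labeling map ("type").
U.add_node("fish oil", type="term")
U.add_node("platelet aggregation", type="term")

# An edge with a confidence-weight labeling map, as in the running examples.
U.add_edge("fish oil", "platelet aggregation", weight=0.5)

# One member of Lambda_E, viewed as a map from edges to values:
weight_map = {(u, v): d["weight"] for u, v, d in U.edges(data=True)}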
A hypothesis in a universe is defined as follows.

Definition 2.

A hypothesis $H = (V_H, E_H, \Lambda_{HV}, \Lambda_{HE})$ is a subgraph of the universe $U$ such that $V_H \subseteq V_U$, $E_H \subseteq E_U$ and

$\forall \lambda_{HV} \in \Lambda_{HV}\ \exists \lambda_V \in \Lambda_V . \lambda_{HV} \subseteq \lambda_V$,  $\forall \lambda_{HE} \in \Lambda_{HE}\ \exists \lambda_E \in \Lambda_E . \lambda_{HE} \subseteq \lambda_E$.

The second defining condition of the hypothesis subgraph means that any specific labeling map employed by a hypothesis has to be subsumed by a map defined in the universe. This ensures that the universe is closed w.r.t. possible interpretations of the hypotheses existing within it.

Most of the hypothesis virtues critically depend on what a claim of a hypothesis is, and therefore we need to define that as well.
Definition 3.
A claim of a hypothesis $H$ is a simple (i.e., acyclic) path in the graph $H$.

Such a definition presents arguably the most universal view on what a particular knowledge graph may express. No matter what the actual semantics of the relationships in a hypothesis graph are, one can always study what they claim at least in terms of connections of vertices by means of edges, i.e., paths (we will use the terms path and claim interchangeably in the rest of the article). This makes our approach applicable to any type of knowledge graph.

Note that one practical implication of the last definition is that we can consider only connected graphs as hypotheses – if there is no path between two vertices, no claim is being made about them and they should thus be parts of different hypotheses. This is partly related to the open/closed world assumption dichotomy. The fact that there is no connection between vertices does not mean no such connection can exist, it only means nothing is known about it in the context of the given knowledge graph.

The final preliminary definition concerns all claims possibly made by a hypothesis.
Definition 4.
A claim set of a hypothesis $H$ is the set $\Pi(H)$ of all simple paths in the corresponding graph. A claim volume of $H$ is the size of its claim set, i.e., $|\Pi(H)|$.

The claim volume can be very large and is hard to compute even for relatively small graphs [47]. Also, it is not realistic to expect every possible path in a knowledge graph to convey a meaningful claim. Therefore in practice, it is convenient to restrict the claim set to a more manageable size based on case-specific heuristics. However, the maximal possible number of claims is apt as a theoretical notion for describing general knowledge graphs without further information about their domain and more complex semantics.
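To make the notion concrete, the claim set of a small hypothesis graph can be enumerated directly. The sketch below (an illustration only, again assuming networkx) also shows an optional path-length cutoff corresponding to the case-specific restriction of the claim set mentioned above:

from itertools import combinations
import networkx as nx

def claim_set(H, cutoff=None):
    # All simple paths between all vertex pairs, i.e. Pi(H); the cutoff
    # optionally bounds the path length to keep the set manageable.
    claims = []
    for s, t in combinations(H.nodes, 2):
        claims.extend(nx.all_simple_paths(H, s, t, cutoff=cutoff))
    return claims

H = nx.path_graph(4)      # toy hypothesis: 0 - 1 - 2 - 3
print(len(claim_set(H)))  # claim volume |Pi(H)| = 6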
2.2. The Five Virtues

The following five sections present formalisations of the particular hypothesis virtues using the preliminary notions introduced above. Note that we provide general guidelines for measuring the virtues first, giving a minimalistic set of conditions the measures should satisfy. Detailed examples of specific measures facilitating literature-based discovery are discussed in Sections 3 and 4.
2.2.1. Conservatism

Conservative claims should make small steps in a particular direction, however, the combination of the steps can potentially be quite radical (i.e., far-reaching). The conservatism of a path in a hypothesis $H$ can be measured by a function $f : \Pi(H) \rightarrow \mathbb{R}$ that satisfies the following conditions:

1. Assuming a metric $\delta : V_U \times V_U \rightarrow \mathbb{R}$ on the vertices in the universe graph, the function $f$ applied to a path $p = (v_1, v_2, \ldots, v_{|p|})$ is negatively correlated with the $g(\{\delta(v_1, v_2), \delta(v_2, v_3), \ldots, \delta(v_{|p|-1}, v_{|p|})\})$ value, where $g : 2^{\mathbb{R}} \rightarrow \mathbb{R}$ is an aggregation function (e.g., sum, minimum, maximum or arithmetic mean). (Here and in the following, we use broad notions of positive and negative correlation. They are meant to generalise the respective notions of proportionality and inverse proportionality to possibly non-linear, non-algebraic or statistical relationships that may be specific to particular applications.)
2. If radical claims are preferred, then there is an additional requirement for $f$ being positively correlated with the $\delta(v_1, v_{|p|})$ value.

The conservatism of the whole hypothesis $H$ is computed by aggregating all path conservatism measures across the $\Pi(H)$ set. The higher the aggregate value, the larger the conservatism. Due to the complexity of enumerating the $\Pi(H)$ set, practical conservatism measures can target only a subset of all possible paths. For instance, a set of shortest paths between all vertices in $H$ w.r.t. the $\delta$ edge labeling is a viable option as it is comparatively easier to compute and already satisfies condition 1. if sum is used as an aggregation function.

2.2.2. Modesty

Let us refer by $H_\omega$ to the complete graph corresponding to a hypothesis $H$ (i.e., a graph with an edge between any two vertices in $V_H$). Then the modesty of $H$ can be defined as

$\frac{|\Pi(H_\omega)|}{|\Pi(H)|}$.

This number reflects the ratio between all possible claims about the entities covered by $H$ and the actual number of claims being made. The higher the ratio, the larger the modesty (a modest hypothesis minimises the number of claims made in relation to the number of claims that can possibly be made). As mentioned before, computing the number of all simple paths in a graph is extremely difficult in general. Therefore in practice, approximations of the modesty measure are necessary. The approximations, however, should be monotonic w.r.t. the ideal modesty measure: assuming $f, g$ as the ideal and approximate modesty measures, respectively, then $g(H) > g(I)$ if and only if $f(H) > f(I)$ for any two hypotheses $H, I$.

2.2.3. Simplicity

For this virtue, we use the dual notion of complexity which has been extensively studied in the context of graphs [25]. A good hypothesis should simplify our view of the world despite possibly being locally complex [36]. In order to formalise this intuition, let us assume the simplicity of a graph is measured by a function $f : G_U \rightarrow \mathbb{R}$, where $G_U$ is a set of all graphs conceivable in the universe $U$. The function $f$ should satisfy these conditions:

1. Given a hypothesis graph $H$ and a graph complexity measure $c : G_U \rightarrow \mathbb{R}$, $f$ is positively correlated with the expression

$\frac{c(U \setminus H)}{c(U)}$

which reflects the universe simplification rate w.r.t. the hypothesis. (From here on, we use the set-theoretic operators for graphs as a convenience notation for the operations applied on the corresponding vertex and edge sets in the actual tuple representations of the graphs. The labeling sets of the result are assumed to be $\Lambda_V, \Lambda_E$, i.e., the universe ones, unless specified otherwise.)
2. If locally complex hypotheses are preferred, then the function $f$ is also required to be positively correlated with the value $c(H)$.

Strictly speaking, the rate in the first condition should also be higher than 1 in order for the hypothesis to make the universe actually simpler, but practical applications may relax that requirement and just rank the hypotheses based on the measure.

2.2.4. Generality

Generality can be quantified as a number of explanations (i.e., claims) the hypothesis $H$ can provide for 'out-of-scope' phenomena (i.e., vertices) in the $U \setminus H$ graph. This can be expressed as

$g(\{f(u) \mid u \in V_U \setminus V_H\})$,

where the function $g : 2^{\mathbb{R}} \rightarrow \mathbb{R}$ is an aggregation (like sum or arithmetic mean) over all vertices that are out of the $H$ scope. The function $f : V_U \rightarrow \mathbb{R}$ is required to be positively correlated with the $h(\{|\{v \mid v \in p \wedge v \in V_H\}| \mid p \in \Pi_u(U)\})$ value, where $h : 2^{\mathbb{R}} \rightarrow \mathbb{R}$ is another aggregation function and $\Pi_u(U)$ is a set of all simple paths in the universe $U$ that start in the vertex $u$.

The generality definition reflects the basic intuition that the higher the number of $H$ vertices on paths explaining phenomena outside of $H$, the higher the generality of $H$. As the numbers of simple paths can be difficult to compute even if limited to paths starting in single nodes, approximations of this measure are needed for implementations again. Similarly to the modesty condition, we require the approximations to be monotonic w.r.t. the ideal generality measure.

2.2.5. Refutability

Refutability can be seen as a quantification of: 1) the easiness with which the claim volume $|\Pi(H)|$ of a particular hypothesis graph $H$ can be reduced; 2) the rate of the reduction. The atomic part of the process of refutation in the context of knowledge graphs is an invalidation, i.e., removal, of a vertex. Let us assume a decreasing ranking $R : \mathbb{N} \rightarrow V_U$ of the vertices in $H$ based on the number of simple paths that no longer exist in the graph after the vertex removal. Then we can define a top-k refutability as

$\frac{|\Pi(H)|}{|\Pi(H)| + \sum_{i=1}^{k} |\Pi(H/R(i))|}$,

where $H/R(i)$ is a graph resulting from the removal of the first vertex in the ranking $R$ from the graph $H/R(i-1)$; $H/R(0) = H$ by definition. The lower the number of paths still existing after removing the top vertex according to $R$, the higher the refutability. The $|\Pi(H)|$ expression is added to the denominator to avoid potential division by zero, and also to normalise the measure value.

Note that for growing $k$ values, the top-k refutability generally converges to similar values for any given set of hypotheses as the measure is relative to the total number of paths in the graph. Therefore it is practical to use the measure with rather low $k$ values, perhaps even as low as 1 which measures the rate of refutability in a single vertex removal step. Additionally, the ideal measure is difficult to compute and approximations are required in practice again. In particular, one can approximate the $\Pi$ function in the vertex ranking and refutability definition with one that is monotonic w.r.t. it.
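For a rough feel of how such an approximation might look in code, the following sketch (our illustration, not the authors' implementation; networkx assumed) replaces the intractable $\Pi$ function with a shortest-path-based proxy of the kind suggested above:

import networkx as nx

def path_count(G):
    # Practical stand-in for |Pi(G)|: one (shortest) path per connected
    # vertex pair; a monotonic proxy of the kind discussed above.
    return sum(len(d) - 1 for _, d in nx.shortest_path_length(G)) // 2

def top_k_refutability(H, k=1):
    # Rank vertices by the number of (proxy) paths lost after removal.
    def damage(v):
        G = H.copy()
        G.remove_node(v)
        return path_count(H) - path_count(G)
    ranking = sorted(H.nodes, key=damage, reverse=True)
    G, removed = H.copy(), []
    for v in ranking[:k]:
        G.remove_node(v)
        removed.append(path_count(G))
    return path_count(H) / (path_count(H) + sum(removed))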
3. Specific Virtue Measures
In this part, we introduce specific instances of hypothesis virtue measures following the general formalisation presented before. First we give an example of a universe and a couple of associated structures in Section 3.1. These will be used for running examples illustrating the measure details in Section 3.2. Finally, Section 3.3 describes how to use the measures in concert.
3.1. Sample Universe and Auxiliary Structures

The examples throughout this section are all based on an illustrative universe graph $U$ depicted in Figure 1.

[Figure 1: Sample universe graph $U$]
The graph features real-valued edge labels in the $(0, 1]$ interval that represent confidence weights of the edges (the higher the label, the higher the expected degree of association between the corresponding vertices). These edge labels are used when constructing several auxiliary resources from the graph. There are no specific types of edges (i.e., predicates) in the examples since in the experiments reported in this article, we focus only on one type of relationship based on automatically extracted co-occurrence statements.

First of all, we need to define a metric on the vertices. The most straightforward option without any background knowledge on the graph is to use its (weighted) adjacency matrix for constructing characteristic context vectors for every vertex. The vectors can then be used for computing the actual metric. The adjacency matrix $A_U$ of $U$ is presented in Table 1. The context vector $x_{A_U}$ for a vertex $x$ is the row (or column, as the graph is undirected) corresponding to $x$ in the adjacency matrix $A_U$. Using the context vectors, we can define the Euclidean distance (i.e., a metric) on the vertices as $\delta(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ where $x_i, y_i$ correspond to the $i$-th elements of the $x$, $y$ context vectors, respectively. The specific distances (up to the 4th decimal place) between the universe vertices are given in Table 2.

The last auxiliary structure we will need in the following sections (namely for defining complexity measures) is a clustering of the vertices in $U$. An example of a possible clustering is given in Figure 2.
[Figure 2: Cluster structure of $U$]

It is an overlapping clustering that groups vertices with mutual distances below a fixed threshold: $A = \{1, 2\}$, $B = \{1, 4, 5, 6\}$, $C = \{3, 4, 6\}$. Note that the clustering can either be computed from the universe graph itself or provided externally (e.g., in the form of an ontology that defines a taxonomy upon the graph vertices).
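The auxiliary structures above are straightforward to compute. A possible sketch follows (assuming numpy, scipy, scikit-learn and networkx; the edge weights are hypothetical stand-ins, not the values of Figure 1, and note that K-means yields a hard clustering whereas the example clustering is overlapping):

import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

U = nx.Graph()
U.add_weighted_edges_from([(1, 2, 0.5), (2, 3, 1.0), (1, 3, 0.5),
                           (2, 4, 0.5), (2, 6, 0.5), (4, 5, 1.0),
                           (4, 6, 0.5)])  # hypothetical weights

A = nx.to_numpy_array(U, weight="weight")     # adjacency matrix (cf. Table 1)
D = squareform(pdist(A, metric="euclidean"))  # pairwise delta (cf. Table 2)

labels = KMeans(n_clusters=3, n_init=10).fit_predict(A)  # vertex clusters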
Having introduced the sample universe, we can continue with the specific measure definitions which we use later on in the literature-based discovery experiments.

3.2. The Measures

3.2.1. Conservatism

Following the conditions provided in Section 2.2.1, we define a specific instance of the hypothesis graph conservatism measure

$C(H) = \frac{1}{|\pi_s(H, \delta)|} \sum_{p \in \pi_s(H, \delta)} \frac{\delta(v_1, v_{|p|})}{\sum_{i=1}^{|p|-1} \delta(v_i, v_{i+1})}$,

where $\pi_s(H, \delta)$ is a set of all shortest paths in $H$ w.r.t. the Euclidean distance $\delta$ and $p = (v_1, v_2, \ldots, v_{|p|})$ is a specific shortest path of length $|p|$. In other words, the $C$ measure is an arithmetic mean of the shortest path conservatism values where the path conservatism is computed as a fraction of the distance between the extreme vertices of the path and the path length. (Note that if there is only one shortest path guaranteed to exist between any pair of vertices in the $H$ graph, then $|\pi_s(H, \delta)| = \binom{|V_H|}{2}$ as $H$ is expected to be connected.)

The measure satisfies the condition 1. from Section 2.2.1 as it already focuses only on paths with minimal aggregate distance between the consecutive vertices (assuming the sum aggregation). The condition 2. is satisfied as well. For any path $p$, $\delta(v_1, v_{|p|}) \leq \sum_{i=1}^{|p|-1} \delta(v_i, v_{i+1})$. The equality is achieved if and only if the context vectors of the consecutive vertices represent points that lie in a straight line, i.e., maximise the distance between the extreme vertices of the path. Therefore the maximum value 1 of the path conservatism measure is achieved exactly when the extreme distance is maximal.

Example 1.
In Figure 3 there are three hypothesis graphs
$E, F, G$ that exist in the universe $U$ described in Section 3.1. The edges are annotated with the Euclidean distance $\delta$ based on the vertex context vectors (see the examples in Section 3.1 for details).

The numbers of all shortest paths for the hypothesis graphs $E, F, G$ w.r.t. the distance $\delta$ are 3, 3, 6, respectively. The conservatism measure of each hypothesis is the arithmetic mean of its shortest paths' conservatism values. All shortest paths in the triangle $F$ are single edges, so $C(F) = \frac{1}{3}(1 + 1 + 1) = 1$, whereas the two-edge shortest paths $(5, 4, 6)$ in $E$ and $(2, 4, 5)$, $(5, 4, 6)$ in $G$ contribute ratios below 1; evaluating the measure gives $C(F) > C(G) > C(E)$,
[Figure 3: Sample hypothesis graphs
$E, F, G$, with edges annotated by the distance $\delta$]

therefore the hypotheses can be ranked in the $F \succ_C G \succ_C E$ order from the most to the least conservative one.

3.2.2. Modesty

As an approximation of the ideal modesty measure presented in Section 2.2.2, we use the inverse density of the hypothesis graph

$M(H) = \frac{|V_H| (|V_H| - 1)}{2 |E_H|}$.

This function is much easier to compute than the ideal one and is monotonic w.r.t. it. Since the numerators of both functions are fixed, we only need to show that the number of edges is monotonic w.r.t. the number of all simple paths in a hypothesis graph. This is quite easy – an increase in $|E_H|$ (i.e., adding an edge) will cause $|\Pi(H)|$ to grow as well since adding an edge will result in at least one new simple path in $H$, the edge itself. Conversely, if the set $\Pi(H)$ grows, it means that edges had to be added to the $H$ graph as it is the only way how the overall number of paths can be increased.

Example 2.
The number of edges in the
$E, F, G$ graphs from Example 1 is 2, 3, 4, respectively, while the maximum possible number of edges in the corresponding complete graphs is 3, 3, 6. Therefore the modesty values are

$M(E) = \frac{3}{2} = 1.5$, $M(F) = \frac{3}{3} = 1$, $M(G) = \frac{6}{4} = 1.5$

and the modesty ranking of the hypotheses is $E \succ_M F$, $G \succ_M F$. (From here on, we use convenience ordering relations $\succ_X$ for ranking the hypotheses in a decreasing order according to a specific measure $X$: $E \succ_X F$ if and only if $X(E) > X(F)$.)

3.2.3. Simplicity

As stated in Section 2.2.3, we use the dual notion of complexity for measuring hypothesis simplicity. For the specific instance of the measure, we employ Shannon's entropy that has been frequently used for graph complexity [25]. To define the entropy, we utilise the clustering of the hypothesis graph vertices based on their context vectors. Let us assume a vertex labeling $\gamma : V_U \rightarrow 2^L$ where $L$ is a set of cluster identifiers. Then we can define a cluster association probability $p(l, H)$ for a specific cluster $l \in L$ within a hypothesis $H$ as

$p(l, H) = \frac{|\{v \mid v \in V_H \wedge l \in \gamma(v)\}|}{|V_H|}$.

It is a probability that a randomly selected vertex from $H$ belongs to a cluster $l$. If we conceive clusters as higher-level topics the hypothesis graph deals with, then the probability reflects the distribution of the topics across the graph. The $p(l, H)$ values can be used for computing the cluster association entropy for a hypothesis $H$ as

$E(H) = -\sum_{l \in L} p(l, H) \log p(l, H)$.

It reflects the information value of the hypothesis' cluster structure – the more "unpredictably" distributed the clusters, the higher the complexity and also the information value. This conforms to an intuitive assumption that hypotheses dealing with more topics representatively are more informative, i.e., complex. We define two simplicity measures that employ the cluster association entropy and satisfy the respective conditions introduced in Section 2.2.3:

$S_1(H) = E(H)$,  $S_2(H) = \frac{E(U \setminus H)}{E(U)}$.

We use both measures in the following to capture different aspects of simplicity simultaneously.
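A direct transcription of the entropy-based measures could look as follows (our sketch; the logarithm base is not fixed by the text, so base 2 is assumed here):

from math import log2

def cluster_entropy(vertices, gamma):
    # Cluster association entropy E(H); gamma maps a vertex to the set of
    # clusters it belongs to, so the p(l, H) values need not sum to one.
    vertices = list(vertices)
    clusters = set().union(*(gamma[v] for v in vertices))
    e = 0.0
    for l in clusters:
        p = sum(1 for v in vertices if l in gamma[v]) / len(vertices)
        if p > 0:
            e -= p * log2(p)
    return e

# The clustering of the running example (Figures 2 and 4).
gamma = {1: {"A", "B"}, 2: {"A"}, 3: {"C"},
         4: {"B", "C"}, 5: {"B"}, 6: {"B", "C"}}
universe = set(gamma)
S1 = cluster_entropy({4, 5, 6}, gamma)             # S1(E) = E(E)
S2 = (cluster_entropy(universe - {4, 5, 6}, gamma)
      / cluster_entropy(universe, gamma))          # S2(E) = E(U \ E) / E(U)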
Example 3.
In Figure 4 there are the three hypothesis graphs
$E, F, G$ and the universe graph $U$ depicted again, but this time with cluster annotations provided as vertex labels. The cluster association probabilities for each graph are

$p(A, U) = \frac{1}{3}$, $p(B, U) = \frac{2}{3}$, $p(C, U) = \frac{1}{2}$,
$p(A, E) = 0$, $p(B, E) = 1$, $p(C, E) = \frac{2}{3}$,
$p(A, F) = \frac{2}{3}$, $p(B, F) = \frac{1}{3}$, $p(C, F) = \frac{1}{3}$,
[Figure 4: Sample hypothesis graphs
$E, F, G$ and the universe $U$ with cluster annotations: 1: {A,B}, 2: {A}, 3: {C}, 4: {B,C}, 5: {B}, 6: {B,C}]

$p(A, G) = \frac{1}{4}$, $p(B, G) = \frac{3}{4}$, $p(C, G) = \frac{2}{4}$.

The entropies corresponding to these probabilities can be computed for $U$, $E$, $F$, $G$ and the complement graphs $U \setminus E$, $U \setminus F$, $U \setminus G$. The hypothesis $F$ is the lowest-ranking no matter which function we use – it has the lowest entropy and $E(U \setminus F) < E(U)$, therefore it makes the universe more complex. On the other hand, both $E, G$ increase the simplicity of the universe. If only the local complexity of the acceptable hypotheses is relevant (measure $S_1$), then the final ranking is

$G \succ_{S_1} E \succ_{S_1} F$

since $E(G) > E(E)$. However, if the rate of simplifying the universe is more important (measure $S_2$), the ranking is

$E \succ_{S_2} G \succ_{S_2} F$

as $\frac{E(U \setminus E)}{E(U)} > \frac{E(U \setminus G)}{E(U)}$.

3.2.4. Generality

To limit the potentially intractable number of paths in the ideal generality formula introduced in Section 2.2.4, we apply two approximations in its specification. Firstly, we focus only on explanations for the universe vertices $V_{HA}$ that are immediately adjacent to the measured hypothesis $H$. The set of edges that connect these vertices to $H$ can then be defined as $E_{HA} = \{(u, v) \mid (u, v) \in E_U \wedge (u \in V_{HA} \wedge v \in V_H \vee u \in V_H \wedge v \in V_{HA})\}$. The second approximation consists of focusing only on shortest paths w.r.t. the $\delta$ distance. The specific generality measure is then defined as

$G(H) = |\{p \mid p \in \pi_s(U, \delta) \wedge p_1 \in V_{HA} \wedge p_2, p_3, \ldots, p_{|p|} \in V_H\}|$.

The measure corresponds to the number of shortest paths that start in an adjacent vertex and connect it with vertices in the hypothesis graph $H$, thus providing an explanation for it using only $H$. As the graphs $H$ are assumed to be connected, the measure can further be simplified as $G(H) = |E_{HA}| (1 + (|V_H| - 1)) = |E_{HA}| |V_H|$ for graphs where only one shortest path exists between any two vertices.

The $G(H)$ measure uses sum aggregation as the $g$ function present in the general definition. The $f$ function that leads to the presented definition of $G(H)$ returns zero for any vertex from the $V_U \setminus V_H$ set that is not immediately adjacent to $H$. For other vertices, it returns the number of paths that provide an explanation for them in $H$. This number is positively correlated with the number of vertices in $V_H$ as required in the general definition, since the number of paths leading from a vertex to other vertices in a connected graph $H$ is $|V_H| - 1$. The measure is not strictly monotonic w.r.t. the ideal generality measure, though. If the number of shortest paths increases, then the number of all paths naturally has to be higher as well. The other direction is less obvious, and conditional. Assuming the number of all paths in a graph has increased, we have to show that there also has to be more shortest paths. This is not true in general – if edges between distant vertices are added, they may not contribute to increasing the number of shortest paths. However, since the measure intuitively captures the notion of generality in the context of knowledge graphs and is easy to compute, we decided to relax the absolute monotonicity requirement for the sake of practicality.

Example 4.
The sets of vertices adjacent to the
$E, F, G$ hypotheses are

$V_{EA} = \{2\}$, $V_{FA} = \{4, 6\}$, $V_{GA} = \{1, 3\}$

and the corresponding sets of connecting edges are

$E_{EA} = E_{FA} = \{(4, 2), (6, 2)\}$, $E_{GA} = \{(2, 1), (2, 3)\}$.

Since there is only one shortest path between any pair of vertices in our example, the generality measures are

$G(E) = |E_{EA}| \cdot |V_E| = 2 \cdot 3 = 6$, $G(F) = |E_{FA}| \cdot |V_F| = 2 \cdot 3 = 6$, $G(G) = |E_{GA}| \cdot |V_G| = 2 \cdot 4 = 8$

and the resulting ranking is $G \succ_G E$, $G \succ_G F$.

3.2.5. Refutability

Using the shortest paths approximation again, we define a specific refutability measure as

$R_k(H) = \frac{|\pi_s(H, \delta)|}{|\pi_s(H, \delta)| + \sum_{i=1}^{k} |\pi_s(H/R(i), \delta)|}$.

Similarly to Section 3.2.4, we consider only the shortest paths instead of all simple ones, which makes the computation of the measure comparatively easier. Such an approximation is unfortunately not strictly monotonic as shown before, however, we believe that the practicality and intuitiveness of the measure outweighs the partial monotonicity violation.

For the ranking $R$ of the vertices in the $R_k(H)$ measure computation, we use the betweenness centrality which is defined as

$c_B(v, G) = \frac{|\{p \mid p \in \pi_s(G, \delta) \wedge v \in p\}|}{|\pi_s(G, \delta)|}$,

where $v$ is a vertex and $G$ is a graph. In other words, the betweenness centrality of a vertex is the number of shortest paths passing through it divided by the total number of shortest paths. The ranking $R$ ranks the vertices in a decreasing order based on their betweenness centrality. Such a ranking generally does not mean that removal of a high-ranking vertex results in a higher number of shortest paths disappearing when compared to a removal of a lower-ranking vertex – if the graph remains connected in both cases, the number of shortest paths in it will be the same after removal of either node. However, removing a vertex with higher betweenness centrality will result in a relative increase of the remaining paths' lengths. This can lead to a decrease of the graph conservatism and thus also to a decrease of its overall value w.r.t. the hypothesis virtues. Consequently, making a hypothesis weaker more quickly can be seen as refuting it more efficiently. We believe that this justifies the chosen ranking even though it means yet another relaxation of the general requirements. (An alternative option that fully conforms to the requirements would employ simple paths instead of shortest ones and vertex degree instead of betweenness centrality, however, such a solution can easily become intractable.)

Example 5. The sets of shortest paths w.r.t. the $\delta$ distance for the particular hypothesis graphs are

$\pi_s(E, \delta) = \{(5, 4), (5, 4, 6), (4, 6)\}$,
$\pi_s(F, \delta) = \{(2, 1), (2, 3), (1, 3)\}$,
$\pi_s(G, \delta) = \{(2, 4), (2, 4, 5), (2, 6), (4, 5), (4, 6), (5, 4, 6)\}$.

The corresponding vertex betweenness centralities are then

$c_B(4, E) = 1$, $c_B(5, E) = c_B(6, E) = \frac{2}{3}$,
$c_B(1, F) = c_B(2, F) = c_B(3, F) = \frac{2}{3}$,
$c_B(2, G) = c_B(5, G) = c_B(6, G) = \frac{1}{2}$, $c_B(4, G) = \frac{5}{6}$.

The top-1 refutability measure for the hypothesis $E$ can be computed as follows. The centrality-based ranking of the vertices places 4 on the top, therefore we remove it. The result is a disconnected graph consisting of the isolated vertices 5, 6, where no path exists anymore. The top-1 refutability measure of $E$ is thus
$R(E, 1) = \frac{3}{3 + 0} = 1$.

Similarly, the top-1 refutability measures for the remaining two hypotheses (with arbitrary removal vertex selection for $F$ due to the uniform centrality ranking) are

$R(F, 1) = \frac{3}{3 + 1} = 0.75$, $R(G, 1) = \frac{6}{6 + 1} \doteq 0.86$.

The resulting refutability ranking of
$E, F, G$ is $E \succ_R G \succ_R F$.

3.3. Using the Measures in Concert

The specific measures defined in the previous section can be used to rank the hypothesis graphs independently of each other as shown in the examples. However, practical applications will very often imply the necessity to compare hypotheses along all the measures. Lacking any a priori information on which measures may be more relevant for a particular application, we propose the following way of ordering the hypothesis graphs.

Let $\mathcal{H} = \{H_1, H_2, \ldots, H_n\}$ be the set of hypothesis graphs we wish to compare according to a set of measures $\mathcal{X} = \{X_1, X_2, \ldots, X_m\}$ of equal importance. Then we can construct an edge-labeled directed ranking multigraph $R = (\mathcal{H}, E \subseteq \mathcal{H} \times \mathcal{H}, \lambda : E \rightarrow \mathcal{X})$. The multigraph's vertices are the hypotheses in $\mathcal{H}$. The edge set and the labeling function are constructed from the specific measure rankings so that $(H_i, H_j) \in E$, $\lambda(H_i, H_j) = X_k$ if and only if there is a measure $X_k$ such that $H_i \succ_{X_k} H_j$. Using the ranking multigraph $R$, we can define a combined ranking relation $\succ$ on the set $\mathcal{H} \times \mathcal{H}$ as $H_i \succ H_j$ if and only if

$\frac{d_o(H_i, R)}{d_o(H_i, R) + d_i(H_i, R)} > \frac{d_o(H_j, R)}{d_o(H_j, R) + d_i(H_j, R)}$,

where $d_i(H_x, R)$, $d_o(H_x, R)$ is the in-degree and out-degree of the vertex $H_x$ in the multigraph $R$, respectively. In plain words, the combined ranking relation $\succ$ orders the hypotheses based on the relative magnitude of their superiority (out-degree) w.r.t. the specific ranking relations given by the measures.

Example 6.
Figure 5 shows the ranking multigraph corresponding to Examples 1-5. A directed edge from vertex $X$ to $Y$ with a label $Z$ means that $X \succ_Z Y$.
[Figure 5: Ranking multigraph for $E, F, G$]
The in-degrees and out-degrees of
$E, F, G$ in the ranking graph are

$d_i(E) = 3$, $d_o(E) = 4$, $d_i(F) = 6$, $d_o(F) = 1$, $d_i(G) = 3$, $d_o(G) = 7$,

therefore $G \succ E \succ F$ since $\frac{7}{10} > \frac{4}{7} > \frac{1}{7}$.
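The combined ranking can be automated along the following lines (a sketch under the reading suggested by the worked example, where the multigraph edges connect successive elements of each per-measure ranking; ties are broken arbitrarily):

def combined_ranking(hypotheses, measures):
    # Build the ranking multigraph implicitly: for every measure, add an
    # edge from each hypothesis to its successor in that measure's ranking,
    # then order by out-degree / (out-degree + in-degree).
    out_deg = {h: 0 for h in hypotheses}
    in_deg = {h: 0 for h in hypotheses}
    for m in measures:
        ranked = sorted(hypotheses, key=m, reverse=True)
        for better, worse in zip(ranked, ranked[1:]):
            out_deg[better] += 1
            in_deg[worse] += 1
    def score(h):
        total = out_deg[h] + in_deg[h]
        return out_deg[h] / total if total else 0.0
    return sorted(hypotheses, key=score, reverse=True)

# Toy usage with hypothesis names and two made-up measures:
print(combined_ranking(["E", "F", "G"],
                       [lambda h: {"E": 1, "F": 3, "G": 2}[h],
                        lambda h: {"E": 3, "F": 1, "G": 2}[h]]))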
4. Experimental Validation
In order to validate the proposed formalisation of hypothesis virtues in the context of knowledge graphs, we chose to follow up on our work presented in [30] where we addressed automated extraction of conceptual networks from biomedical literature. The work deals with extraction of co-occurrence and similarity relationships from abstracts available on PubMed (c.f., http://www.ncbi.nlm.nih.gov/pubmed) and consequent indexing, querying and navigation of the networks in a knowledge discovery scenario.

As we have shown in [30], the automatically extracted networks can already provide useful insights even for experts in the field, however, they still contain some noise and irrelevant and/or obvious information. Tackling this challenge has been the main practical motivation for the research presented in this article. We believe we can use our approach to identify portions of the automatically extracted graphs that can not only provide a general overview of the domain with less noise, but also isolate valid relationships that are surprising for experts. This can ultimately lead to more efficient machine-aided discovery applications.

In our validation experiments, we utilise the scenarios, data sets and evaluation methodologies elaborated within the field of literature-based discovery which we introduce in Section 4.1 below. Section 4.2 is the methodological core of this part. It presents an evolutionary approach to the refinement of automatically extracted knowledge graphs using the hypothesis virtue measures. Section 4.3 describes the data sets and methods we use for the experimental evaluation. Finally, Section 4.4 discusses the results of the experiments.

Note that we have implemented our approach and the experiments reported in this section using a Python prototype available under the GPL free software license. The corresponding code, experimental data and results are available at http://skimmr.org/hyperkraph/. Detailed README documentation on the implementation and data is provided as a part of the respective archives hosted at the referenced URL. (HYPERKRAPH is a general name we use for the ongoing implementation of prototypes based on the presented research. It stands for HYPothEsis viRtues in Knowledge gRAPHs.)

4.1. Literature-Based Discovery

The field of literature-based discovery is widely considered to stem from the work [44]. Based on [44] and a follow-up article [45], the work [43] introduced the notion of Swanson linking – connecting two pieces of knowledge in isolated documents A and B using concepts from intermediate documents (C) that are directly or indirectly related to A and B. Surveys of recent works addressing this problem are provided in [7, 34, 41].

The application of our framework to refining knowledge graphs automatically extracted from literature is closely related to literature-based discovery. Our goal is to generate a set of graphs that reflect relationships between terms in literature and are optimised w.r.t. hypothesis virtues. Such a structure can very straightforwardly facilitate the process of finding "interesting" links between isolated concepts via intermediates, which is the key problem of literature-based discovery. Therefore we can use the standard approaches and man-made "gold standard" discoveries from that field to experimentally validate our approach in an established application scenario.
4.2. Evolutionary Refinement of Automatically Extracted Knowledge Graphs

The basic assumption we use for validating our framework is that applying the hypothesis virtue measures to refining graphs extracted from literature will facilitate literature-based discovery tasks better than the unrefined graphs. To verify this, we have to tackle the graph refinement first. The key question is:
Given a knowledge graph based on statements automatically extracted from text, how can we refine it so that only the parts of the graph that have comparatively high hypothesis virtue measures remain?
This is essentially an optimisation problem in which we know how to tell whether a solution X is better than Y, but we do not know much about what the actual solutions are and how the main knowledge graph is (or should be) composed of them. Such problems can quite efficiently be tackled by evolutionary computing [9]. In the rest of this section, we describe a specific algorithm for evolutionary refinement of knowledge graphs.
4.2.1. Graph Extraction and Refinement Workflow

Figure 6 presents the high-level overview of the graph extraction and refinement process. First we use our SKIMMR tool [30] to extract basic co-occurrence statements from the input texts. The statements are in the shape of tuples $(t_1, t_2, w_d, T)$, where $t_1, t_2$ are two terms that co-occur in an input text $T$ and $w_d$ is the weight of the co-occurrence based on the sentence distances of the terms within $T$.

In the next step (M2 in Figure 6), we:

1. Use the basic statements to compute corpus-wide co-occurrence weights using normalised point-wise mutual information (illustrated in the sketch below).
2. Encode the terms in the statements using integer identifiers (to optimise the memory usage in the consequent steps).
3. Build a fulltext index upon the lexical vertex labels for accessing them during the evaluation (this mitigates the impact of spelling alternatives and other irregularities in the automatically extracted names).
4. Initialise an undirected edge-labeled universe graph $U$ with edges constructed from the corpus-wide statements. The graph can possibly be limited to edges with normalised point-wise mutual information weights above a pre-defined threshold.
5. Construct a context vector space for the $U$ vertices based on their neighbors and corresponding edge weights.
6. Use the vector space to compute the Euclidean distances between the vertices.

Steps M3 and M4 in Figure 6 perform the K-means clustering of the universe graph $U$ in order to provide a vertex labeling $\gamma$ that associates each vertex with the cluster(s) it belongs to (see Section 4.3.3 for details on the K-means settings in the particular experiments we conducted).
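The weighting in item 1 follows the standard normalised PMI formula; a minimal sketch (the argument names and counting scheme are our assumptions, not necessarily the prototype's):

from math import log

def npmi(pair_count, count_x, count_y, total):
    # Standard normalised PMI in [-1, 1]: PMI(x, y) / -log p(x, y).
    px, py, pxy = count_x / total, count_y / total, pair_count / total
    return log(pxy / (px * py)) / -log(pxy)

w = npmi(12, 40, 30, 1000)   # approx. 0.52 for these toy counts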
[Figure 6: High-level workflow of the graph construction and refinement – D1: texts → M1: SKIMMR → D2: co-occurrence statements → M2: graph generator → D3: edge-labeled graph → M3: K-means module → M4: vertex annotator → D4: clustering → D5: edge- and vertex-labeled graph → M5: evolutionary module → D6: optimised graphs]
[Figure 7: Detailed workflow of the evolutionary refinement – D1: universe graph → M1: initialisation → D2: population → M2: mutation, crossover and validation → D3: expanded population → M3: ranking and trimming → D4: trimmed population, looping back to M2 if not terminating]

At this moment, everything is ready for optimising $U$ according to the hypothesis virtue measures of its sub-graphs. The optimisation step in Figure 6 is performed using a genetic algorithm [9]. Its detailed workflow is presented in Figure 7. The genetic algorithm has the following configurable parameters:

1. mutation and mating probabilities $p_m, p_c$ defining how likely it is for an individual in a population to mutate and mate (i.e., engage in a crossover with another individual);
2. a number $k_m$ defining how many times an individual can attempt to mate in a generation;
3. a maximum number of generations $N_G$;
4. a rate $\rho_p$ of the standard deviation of the population size – it sets the size of the population $P_i$ to $|P_i| = gauss(|P_{i-1}|, \rho_p |P_{i-1}|)$ where $gauss(\mu, \sigma)$ returns a random number from the normal distribution with mean $\mu$ and standard deviation $\sigma$, truncated to an integer;
5. the mean and standard deviation $\mu_i, \sigma_i$ for determining the sizes of the individuals in the initial population.

For specific values of the parameters and a discussion of their influence on the evolution process in our experiments, see Section 4.3.3.

The population is initialised (step M1 in Figure 7) by a repetitive random selection of possibly overlapping stars of size $gauss(\mu_i, \sigma_i)$ from the graph $U$. Stars consist of one "hub" vertex and a set of vertices "fanning out" of the hub via immediate edges. They are a specific type of sub-graphs that can be used as atomic graph construction blocks [25] and thus they are fitting for the purpose of population initialisation.

Step M2 in Figure 7 consists of applying the evolutionary operators on the population and consequent validation of the newly added individuals which discards disconnected ones. The mutation deletes or adds an edge from/to the individual graph with equal probabilities. The crossover combines two parents by randomly selecting half of the edges from each parent and combining them in a new individual. All existing edge labels are copied in the process of creating new individuals.

Step M3 in Figure 7 is essential for the optimisation – it computes the hypothesis virtue measures of each individual in the expanded population and then ranks the population according to the combined ranking $\succ$ introduced in Section 3.3. The population is then trimmed to a random size based on the previous population size (computed using the $\rho_p$ parameter).

Steps M2 and M3 are repeated until a termination condition is met. This can either be reaching a pre-defined number of generations $N_G$, or achieving some sort of population convergence.
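The following skeleton summarises the refinement loop of Figure 7 (a sketch, assuming networkx; the operator details and default parameter values are placeholders rather than the prototype's exact settings, and `rank` stands for the combined hypothesis-virtue ranking of Section 3.3):

import random
import networkx as nx

def mutate(g, universe):
    # Delete or add an edge with equal probabilities (step M2).
    h = g.copy()
    if random.random() < 0.5 and h.number_of_edges() > 1:
        h.remove_edge(*random.choice(list(h.edges)))
    else:
        u, v = random.sample(list(universe.nodes), 2)
        if universe.has_edge(u, v):  # stay a subgraph of the universe
            h.add_edge(u, v, **universe.edges[u, v])
    return h

def crossover(a, b):
    # Combine half of the edges of each parent, copying edge labels.
    child = nx.Graph()
    for parent in (a, b):
        edges = list(parent.edges(data=True))
        for u, v, d in random.sample(edges, len(edges) // 2):
            child.add_edge(u, v, **d)
    return child

def evolve(universe, population, rank, n_gen=50, p_m=0.1, p_c=0.5, rho_p=0.1):
    for _ in range(n_gen):
        offspring = []
        for g in population:
            if random.random() < p_m:
                offspring.append(mutate(g, universe))
            if random.random() < p_c:
                offspring.append(crossover(g, random.choice(population)))
        # Validation: discard disconnected (or empty) individuals.
        valid = [g for g in offspring
                 if g.number_of_nodes() > 0 and nx.is_connected(g)]
        # Ranking and trimming to a randomly perturbed population size.
        size = max(1, int(random.gauss(len(population),
                                       rho_p * len(population))))
        population = rank(population + valid)[:size]
    return population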
4.3. Experimental Set-up

For the evaluation of our approach we chose two standard scenarios in literature-based discovery based on the works [44, 45]. Details on the corresponding data sets and experiments we performed using them are described in the following sections.

4.3.1. Data Sets

We used two data sets in the experimental evaluation, both of which address discovery of connections between previously isolated concepts (and corresponding bodies of literature). One data set is based on [44] that explores the relationship between fish oil and Raynaud's syndrome. The other data set is based on a similar study of previously neglected connections between migraine and magnesium [45]. We refer to these two data sets and the corresponding experiments as $T_R$, $T_M$, respectively.

The initial corpora of texts for the $T_R$, $T_M$ experiments were obtained from PubMed via queries compiled according to the specifications given in [44, 45]. Each of these works defines source and target terms $t_s$, $t_t$ together with a set $I_c$ of intermediate terms that connect them. A query for the PubMed abstracts corresponding to specific $t_s$, $t_t$, $I_c$ is compiled as a disjunction of atomic conjunctions

$\bigvee_{t \in \{t_s, t_t\},\ t_c \in I_c} (t \wedge t_c)$.

The particular queries we used for obtaining the $T_R$, $T_M$ corpora were

("raynaud" AND "blood") OR ("raynaud" AND "viscosity") OR ("raynaud" AND "platelet") OR ("raynaud" AND "vascular") OR ("raynaud" AND "reactivity") OR ("fish oil" AND "blood") OR ("fish oil" AND "viscosity") OR ("fish oil" AND "platelet") OR ("fish oil" AND "vascular") OR ("fish oil" AND "reactivity")

and

("migraine" AND "vasospasm") OR ("migraine" AND "spreading depression") OR ("migraine" AND "vascular reactivity") OR ("migraine" AND "depolarization") OR ("migraine" AND "epilepsy") OR ("migraine" AND "inflammation") OR ("migraine" AND "prostaglandins") OR ("migraine" AND "platelet aggregation") OR ("migraine" AND "serotonin") OR ("migraine" AND "brain anoxia") OR ("migraine" AND "calcium channel blockers") OR ("magnesium" AND "vasospasm") OR ("magnesium" AND "spreading depression") OR ("magnesium" AND "vascular reactivity") OR ("magnesium" AND "depolarization") OR ("magnesium" AND "epilepsy") OR ("magnesium" AND "inflammation") OR ("magnesium" AND "prostaglandins") OR ("magnesium" AND "platelet aggregation") OR ("magnesium" AND "serotonin") OR ("magnesium" AND "brain anoxia") OR ("magnesium" AND "calcium channel blockers"),

respectively. Note that while the $T_M$ query exactly corresponds to the terms given in [45], the $T_R$ query is relaxed to sub-terms as the exact query only yields very few abstracts. The PubMed search was limited to articles indexed until November 1985 and August 1987 for $T_R$, $T_M$, respectively, so that we can compare ourselves to the findings of the original works which have served as a de facto gold standard in the literature-based discovery field [4].

The characteristics of the $T_R$, $T_M$ corpora are summarised in Table 3, where the number of tokens is a sum of the word-length of the documents in the corpus and the number of base statements is the number of the base co-occurrence statements the SKIMMR tool extracted from the corpus.

4.3.2. Knowledge Graph Extraction

To generate knowledge graphs from the text corpora, we use the approach introduced in Section 4.2. We construct the experimental graphs using only co-occurrence statements with above-average positive normalised point-wise mutual information scores. This filters out statements with comparatively low co-occurrence weight. We use the general SKIMMR version that extracts entities based on shallow parsing rather than domain-specific models (see https://github.com/vitnov/SKIMMR for details). This is to demonstrate the generality of our work – if we show that our approach can deliver good results even in quite a specific domain using basic and universally applicable initial text mining, it indicates that it is likely to perform similarly well in any other domain.
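As an aside, the query compilation described in Section 4.3.1 is mechanical; a minimal sketch that reproduces the $T_R$ query above:

def pubmed_query(endpoints, intermediates):
    # Disjunction of atomic conjunctions over {t_s, t_t} x I_c.
    return " OR ".join(f'("{t}" AND "{c}")'
                       for t in endpoints for c in intermediates)

print(pubmed_query(["raynaud", "fish oil"],
                   ["blood", "viscosity", "platelet",
                    "vascular", "reactivity"]))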
The characteristics of the extracted graphs are provided in Tables 4 and 5. The basic characteristics $|V_G|$, $|E_G|$, $dn_G$, $|C_G|$, $|c_G^{max}|$, $|c_G^{avg}|$, $|c_G^{med}|$ in Table 4 are the number of vertices, number of edges, graph density, number of connected components, and the maximum, average and median component size in vertices, respectively. The component-wise characteristics in Table 5 are computed as a weighted arithmetic mean across all the components where the weight is the component size in vertices. The characteristics $rd_G$, $dm_G$ are the graph radius and diameter (minimum and maximum eccentricity, respectively, where the eccentricity of a vertex is its maximum distance to other vertices). The $tr_G$ characteristic is transitivity – the fraction of all possible triangles reflecting the tendency of vertices in the graph to cluster together [39]. The characteristics $asp_G$, $asp_G^\delta$ are average shortest path lengths in terms of edges and the distance labeling, respectively. An additional characteristic of the graphs is the degree distribution depicted in Figure 8 (the plot is log-scaled in both the x- and y-axis).

The extracted graphs both have one large connected component comprising most of the vertices, complemented by other trivial components mostly consisting of one edge. The largest components exhibit the so-called "small-world" property [48] – despite being quite large and having small density, they have relatively small diameters and average shortest paths. This observation is supported by two additional facts. The graphs have relatively high transitivity, i.e., a high tendency of vertices to cluster together which is typical for complex small-world networks [39]. Also, the vertex degree distribution approximately follows the power law as shown in Figure 8, which is characteristic for scale-free networks [31]. This means that the extracted graphs have a relatively densely connected structure with many claims involving frequently repeated concepts, which is largely caused by highly (co)occurring terms. This is perhaps not ideal for making discoveries about as many previously disconnected phenomena as possible, and we later show how our approach can remedy this problem.

4.3.3. Clustering and Refinement Settings

For the clustering, we use the K-means module of the scikit-learn package [32].
[Figure 8: Degree distributions (degree rank plots) for the experimental graphs]

As the algorithm's scalability to large numbers of samples and features is limited by available memory, we partition the set of context vectors corresponding to the universe graph vertices to buckets of size 2,
000 and then run the K-means algorithm on them with the parameter K set to 40. The partitioning is done by incremental random selection of 50 seed vectors from the unpartitioned set, computing their centroid and then filling the partition with the seeds plus up to 1,
950 unpartitioned vectors closest to the centroid. We have experimented with different settings of every parameter, however, we found out that the resulting distributions of vectors into clusters are practically invariant to the settings, with mean and median cluster sizes converging to the same values no matter what the settings were.

The parameters of the evolutionary refinement were the mutation and mating probabilities $p_m$, $p_c$ and the population size deviation rate $\rho_p$ (all set to values in the $(0, 1)$ interval), together with $k_m = 5$, $N_G = 50$, $\mu_i = 100$ and $\sigma_i = 80$. The initial individual size parameters are only reflected in rare extreme cases as the size of the random stars is much more dependent on the data set structure in practice. For the other parameters, we applied values typically used by model approaches presented in the evolutionary computing literature [9]. The number of generations has been set well above a threshold after which the performance of the corresponding populations starts to oscillate around similar evaluation scores (see Section 4.4.1 for details).

The evolutionary refinement with these parameters took 56m and 6h36m for the $T_R$, $T_M$ experiments, respectively, using a laptop made in 2010 with a 4-core CPU, 8GB RAM and the Ubuntu Linux 14.04 OS. The virtue measures (the most demanding part) were computed using six parallel processes. The number of processes can be easily adjusted to the computing power available, which facilitates vertical scalability of the refinement. Horizontal scalability is planned for future versions of the prototype and consists of using a distributed processing library instead of the native Python multiprocessing module.

4.3.4. Evaluation Methods

We use several evaluation methods. Part of them is based on a recent work [4] which defines evidence-based and literature frequency-based evaluation measures within the de facto standard literature-based discovery scenarios elaborated in [44, 45]. The additional benefit of using [4] as a primary reference for the evaluation is that the authors compared results of several representative approaches to literature-based discovery. Thus we can interpret our results within a broader context of the whole field. In addition to the measures defined in [4], we perform qualitative evaluation of the actual contents of the results and compare ourselves to related state of the art where applicable.

The evidence-based evaluation measures the capability of an approach to re-discover the intermediate concepts linking the source and target in the corpus as per discoveries made by human experts. It also measures the importance the approach associates with the re-discovery. For an intermediate $t_c$, the absolute evidence-based evaluation measure directly corresponding to [4] is defined as

$evd(t_c) = \min_{G \in \mathcal{G}_c}(rnk(G))$,

where $\mathcal{G}_c = \{G \mid t_s, t_t, t_c \in V_G \wedge \exists p \in \Pi(G) . p = (t_s, \ldots, t_c, \ldots, t_t)\}$ is a set of solution graphs that contain the source and target terms $t_s, t_t$ linked by the intermediate $t_c$. The function $rnk : \mathcal{G} \rightarrow \mathbb{N}$ is a ranking of all solution graphs $\mathcal{G} = \{G \mid t_s, t_t \in V_G\}$ from the most to the least relevant where the relevance is determined by the specific approach being evaluated.

We construct the sets of ranked solution graphs from the set of individuals in a selected refined generation by: 1. creating a union graph from all population individuals; 2. generating a set of paths between the source and target term vertices that also contain an intermediate vertex; 3. ranking the paths using their hypothesis virtue measures, i.e., the $\succ$ relation, with the population union graph as a universe. The step 2.
can either compute all simple paths or all shortest paths. In our experiments, we use the latter option due to tractability issues. The conception of paths as solution graphs represents another design choice consistent with the previous definitions – a path linking certain concepts is the simplest way of claiming (and potentially also explaining) something about them. (Note that for mapping terms to vertices in the resulting knowledge graphs, we use the fulltext index computed upon the lexical expressions corresponding to the graph vertices. This is done when generating the universe graph, see Section 4.2.1 for details. To get all term manifestations in our automatically extracted knowledge graphs, we look up the term of interest in the index and then manually prune the results to get all alternatives that refer to the corresponding concept.)
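Under this paths-as-solutions design, the evidence-based measures are simple to compute; a sketch (our illustration, with solutions represented as term lists ordered from the most to the least relevant):

def evd(ranked_paths, t_c, t_s, t_t):
    # Absolute measure: best (minimal) rank of a solution linking
    # t_s and t_t via the intermediate t_c; ranks start at 1.
    ranks = [i + 1 for i, p in enumerate(ranked_paths)
             if p[0] == t_s and p[-1] == t_t and t_c in p[1:-1]]
    return min(ranks) if ranks else None

def evd_r(ranked_paths, t_c, t_s, t_t):
    # Mean relative inverse rank of the solutions containing t_c.
    n = len(ranked_paths)
    ranks = [i + 1 for i, p in enumerate(ranked_paths)
             if p[0] == t_s and p[-1] == t_t and t_c in p[1:-1]]
    return sum((n - r + 1) / n for r in ranks) / len(ranks) if ranks else 0.0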
In addition to the absolute evd score, we compute the overall relative importance of an intermediate term $t_c$. This measure is defined as the mean relative inverse rank of the graphs that contain $t_c$ among all solutions, i.e.,

$$evd_r(t_c) = \frac{1}{|\mathcal{G}_c|} \sum_{G \in \mathcal{G}_c} \frac{|\mathcal{G}| - rnk(G) + 1}{|\mathcal{G}|}.$$

It effectively measures the average relative relevance of the hypotheses linking the source and target terms via $t_c$ – the more often the link is discovered in high-ranking graphs, the higher the measure.

The second evaluation method proposed in [4] measures the frequency of the discovered claims in the scientific literature. Similarly to our definition, a path in the result graph is considered a claim in [4]. The literature frequency can be used to define a measure of solution rarity as

$$rar(\mathcal{G}) = \frac{1}{|\pi_s(\mathcal{G}_I)|} \sum_{p \in \pi_s(\mathcal{G}_I)} f_{pm}(Q_A(p)),$$

where $\mathcal{G}_I = \{G \mid G \in \mathcal{G} \wedge \exists t_c \in I_c, p \in \Pi(G)\,.\, p = (t_s, \ldots, t_c, \ldots, t_t)\}$ is the set of solutions that contain an intermediate term, $\pi_s(\mathcal{G}_I) = \bigcup_{G \in \mathcal{G}_I} \pi_s(G)$ is the union of the shortest paths taken across $\mathcal{G}_I$, and $f_{pm}$ is the number of results returned by PubMed for an association query $Q_A(p)$. The query for a path $(p_1, p_2, \ldots, p_{|p|})$ corresponds to the conjunction $\bigwedge_{t \in p} t$ of all terms in the path (with a publication time window limited according to the corresponding experimental corpus). For instance, the path (fish oil, platelet aggregation, Raynaud's syndrome) corresponds to the PubMed query "fish oil" AND "platelet aggregation" AND "Raynaud's syndrome" AND ("0001/01/01"[PDAT] : "1985/11/30"[PDAT]) in the $T_R$ experiment. Finally, the rarity measure can be straightforwardly used for defining an interestingness measure [4] as a normalised inverse of the rarity:

$$int(\mathcal{G}) = \frac{1}{1 + rar(\mathcal{G})}.$$

The qualitative evaluation of the results is based on the sets of topics covered by the particular solutions. A topic is informally defined by potentially relevant terms that lie on a path between the source and target concepts in a solution. Potentially relevant terms are those that refer to non-trivial concepts that may elucidate the meaning of the particular path. Using the notion of topics, we define the measures of topical density, relative topical relevance and relative topical novelty, respectively, as

$$top_d(\mathcal{G}) = \frac{|T_{unq}(\mathcal{G}_I)|}{|T_{all}(\mathcal{G}_I)|}, \quad top_r(\mathcal{G}) = \frac{|T_{rel}(\mathcal{G}_I)|}{|T_{unq}(\mathcal{G}_I)|}, \quad top_n(\mathcal{G}) = \frac{|T_{nvl}(\mathcal{G}_I)|}{|T_{rel}(\mathcal{G}_I)|}$$

for a set $\mathcal{G}$ of all solution graphs. The sets $T_{unq}(\mathcal{G}_I)$, $T_{all}(\mathcal{G}_I)$, $T_{rel}(\mathcal{G}_I)$, $T_{nvl}(\mathcal{G}_I)$ are the sets of unique, all, relevant and novel topics covered by the solution graphs in $\mathcal{G}$ that contain an intermediate term.

The relevance of topics is determined by a review of the available scientific literature. This tells us whether or not a given set of terms can provide a meaningful and non-trivial explanation of the connection between the source and target terms. More specifically, a topic is considered relevant if and only if the following conditions are met simultaneously: 1. the terms in the topic refer to features of a biomedically relevant relationship that can be traced in the literature; 2. the relationship is associated with the corresponding target, source and intermediate terms; 3. the relationship is not trivial – it has to be supported by genuine discoveries presented in the literature, not by obvious statements merely occurring in articles.

A novel topic is one that is relevant and not covered in its entirety by any single published work.
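As an illustration of the literature frequency $f_{pm}$ behind the rarity, interestingness and novelty checks, the association queries can be issued against the public NCBI E-utilities interface; the following is a minimal sketch (the helper names and the hard-coded $T_R$ date window are ours, and a real run should respect the NCBI rate limits):

import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(terms, until="1985/11/30"):
    # f_pm(Q_A(p)): number of PubMed hits for the conjunction of all terms
    # in a path, within the time window of the experimental corpus.
    query = " AND ".join('"%s"' % t for t in terms)
    query += ' AND ("0001/01/01"[PDAT] : "%s"[PDAT])' % until
    response = requests.get(EUTILS, params={"db": "pubmed", "term": query,
                                            "retmode": "json"})
    return int(response.json()["esearchresult"]["count"])

def rarity(shortest_paths):
    # rar: mean literature frequency over the union of shortest claim paths.
    counts = [pubmed_count(path) for path in shortest_paths]
    return sum(counts) / len(counts)

def interestingness(shortest_paths):
    # int: normalised inverse rarity; 1.0 for claims absent from the literature.
    return 1.0 / (1.0 + rarity(shortest_paths))

def is_novel(topic_terms):
    # A relevant topic is novel if no single publication covers all its terms.
    return pubmed_count(topic_terms) == 0

For the example path above, pubmed_count(("fish oil", "platelet aggregation", "Raynaud's syndrome")) would issue exactly the query shown in the text.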
Whether a topic is novel can be determined using a publication search engine such as PubMed, where we check the number of results of a conjunctive query involving all terms in the corresponding claim path. If the number of results is zero, the topic is unique.

We compute the $top_d$, $top_r$, $top_n$ scores for the initially extracted and refined graphs in both experiments, focusing on solutions involving the corresponding source, target and intermediate terms. Whenever applicable, we compare the relevant topics we generated with the topics (re)discovered by related approaches.

4.4. Evaluation Results

We split this section into three parts – first we explain the process of selecting the refined graphs to be evaluated, then we analyse the properties of the selected graphs, and finally we discuss the results of the evaluation.
4.4.1. Selection of the Evaluated Generations

Before analysing the actual results of the evolutionary refinement, we have to select the generation to focus on. A natural criterion for that is the performance of the generations in terms of the evaluation measures. The relative ranking of intermediate concepts (i.e., the $evd_r$ measure) is best suited for this task, as it tells us to what extent the generations tend to “consider” the intermediate connections important. Figure 9 shows how the mean $evd_r$ values for all intermediates evolve throughout the generations for each experiment. The blue and green lines represent the $T_R$, $T_M$ experiments, respectively. The full lines correspond to mean values taken across all intermediate terms (also marked by the “star” character in the plot legend). The dashed lines are mean values omitting the intermediates that are not present in the given generation (marked by the “plus” character in the legend).

Figure 9: Mean relative intermediate ranks through generations

For the $T_R$ experiment, generation 40 clearly performs best, as it contains solution graphs for each intermediate term and their mean relative ranking is very high (within the top 20% of solutions). For the $T_M$ experiment, the situation is less clear. The best generation in terms of the mean across all intermediate terms is number 39; however, if one takes only the present intermediates into account, generations 35-38 all perform better. Yet we decided to further
analyse generation 39, as it covers three intermediates, while generations 35-38 only cover two. From here on, we refer to the selected generations by the $T'_R$, $T'_M$ expressions, respectively.

Further support for the selection of the generations to be analysed can be drawn from the numbers of claims containing the source and target terms, and from the numbers of such claims that also contain an intermediate term. The evolution of these values is depicted in Figure 10. Note that the figure's y-axis is log-scaled due to the different orders of magnitude of the displayed values. Similarly to the previous figure, the blue and green lines represent the $T_R$, $T_M$ experiments, respectively. The full lines correspond to the total number of claims containing the source and target term in a given generation (also marked by the “t” character in the plot legend). The dashed lines represent the fraction of those claims that also contain an intermediate term (marked by the “r” character).

Figure 10: Claim numbers through generations

The total number of relevant claims steadily decreases up until approximately the 20-25th generation and then starts to oscillate. For the relative number of solutions with intermediates, a similar trend can be seen after the 40th generation. This can be interpreted as an indication that the generations are structurally stabilised by then, at least for the evaluation data we work with.
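The generation selection criterion applied above can be sketched as follows (a simplified sketch; evd_r_by_generation is a hypothetical mapping from a generation number to the $evd_r$ value of each intermediate, with None for intermediates absent from that generation):

def select_generation(evd_r_by_generation):
    # Prefer generations covering more intermediates; break ties by the
    # mean evd_r over the intermediates that are actually present.
    def key(generation):
        scores = evd_r_by_generation[generation]
        present = [v for v in scores.values() if v is not None]
        mean = sum(present) / len(present) if present else 0.0
        return (len(present), mean)
    return max(evd_r_by_generation, key=key)

Under this rule, generation 40 is selected for $T_R$, and generation 39 for $T_M$ (coverage of three intermediates outweighing the slightly better means of generations 35-38).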
4.4.2. Properties of the Refined Graphs

Before we proceed with discussing the results, let us have a look at the characteristics of the knowledge graphs corresponding to the generations we selected for evaluation. Tables 6 and 7 present the same type of data as the tables in Section 4.3.2. The extra rows with the ∆ prefixes show the relative difference between the refined and initial graphs. The columns represent exactly
the same measures as in the tables in Section 4.3.2 – the number of vertices, number of edges, graph density, number of connected components, and the maximum, average and median component sizes in nodes ($|V_G|$, $|E_G|$, $dn_G$, $|C_G|$, $|c^{max}_G|$, $|c^{avg}_G|$, $|c^{med}_G|$), and the graph radius, diameter, transitivity and average shortest path lengths in terms of edges and of the distance labelling ($rd_G$, $dm_G$, $tr_G$, $asp_G$, $asp^{\delta}_G$). Figure 11 contains plots of the degree distributions in the refined graphs.

Table 6: Basic characteristics of the evolved experimental graphs (rows $T_R$, $T'_R$, $T_M$, $T'_M$ and the corresponding ∆ rows; columns $|V_G|$, $|E_G|$, $dn_G$, $|C_G|$, $|c^{max}_G|$, $|c^{avg}_G|$, $|c^{med}_G|$)

Table 7: Component-wise characteristics of the evolved experimental graphs (rows $T_R$, $T'_R$, $T_M$, $T'_M$ and the corresponding ∆ rows; columns $rd_G$, $dm_G$, $tr_G$, $asp_G$, $asp^{\delta}_G$)

Figure 11: Degree distributions for the evolved experimental graphs
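The structural measures reported in Tables 6 and 7 are standard graph statistics; a minimal sketch of computing them for an undirected networkx graph follows (the distance labelling variant $asp^{\delta}_G$ is omitted, as it depends on our specific edge labelling):

import statistics
import networkx as nx

def graph_characteristics(G):
    # Component structure (Table 6).
    components = [G.subgraph(c) for c in nx.connected_components(G)]
    sizes = sorted((c.number_of_nodes() for c in components), reverse=True)
    # Radius, diameter and average shortest path length are only defined
    # for connected graphs, so they are taken on the largest component.
    largest = max(components, key=len)
    return {
        "|V_G|": G.number_of_nodes(), "|E_G|": G.number_of_edges(),
        "dn_G": nx.density(G), "|C_G|": len(components),
        "|c_max_G|": sizes[0], "|c_avg_G|": sum(sizes) / len(sizes),
        "|c_med_G|": statistics.median(sizes),
        "rd_G": nx.radius(largest), "dm_G": nx.diameter(largest),
        "tr_G": nx.transitivity(G),
        "asp_G": nx.average_shortest_path_length(largest),
    }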
The refined graphs still contain about 65% and 72% of the original vertices for the $T_R$, $T_M$ experiments, respectively; the edges, however, are much more pruned, to about 9% and 10%, respectively. The graph density is thus lower too (at about 20% of the original value). The numbers of connected components do not change much, which is only to be expected given the nature of the population preparation and the tendency of the evolution process to preserve connectedness. The sizes of the components are more or less proportional to the reduction of the vertex number.

More interesting are the component-wise characteristics of the refined graphs summarised in Table 7. The radius, diameter and average shortest path lengths are all increased, by up to 78% and no less than 32%, despite the graphs getting smaller. The clustering coefficient decreases quite radically – to about 3.8% and 5.4% of the original value for the $T_R$, $T_M$ experiments, respectively. The vertex degree distribution still approximately follows a power law; however, the curve is not as steep as for the original graphs. These combined characteristics indicate that the refined graphs exhibit the small-world property to a much lower extent than the original ones. This means that they are structurally more evenly organised and tend to have fewer vertices or vertex groups that connect large portions of the graph through very few edges. A possible consequence of this fact is lower redundancy and a higher rate of non-obvious connections in the refined graphs. Indeed, the analysis of the data w.r.t. the standard literature-based discovery application scenarios confirms this, as we show in the next section.

4.4.3. Performance of the Refined Graphs

In this section, we first discuss the performance of our experiments w.r.t. the quantitative measures used by related approaches. This is then followed by a qualitative analysis of the knowledge graphs we generated.

Table 8 lists the values of the evd measure for the $T_R$ data set. Our approach (the N column) is compared to the works [4, 42, 49, 11, 16] in the columns C, S, W, G and H, respectively. For our approach, we list both the evd and $evd_r$ values, while for the others only evd is present, as they do not consider $evd_r$. We also provide $|\mathcal{G}_c|$, i.e., the number of solution graphs with intermediates.

Table 8: Evidence-based evaluation results for the $T_R$ data (rows Blood Viscosity, Platelet Aggregation, Vascular Reactivity; for N the columns evd, $evd_r$, $|\mathcal{G}_c|$; for C [4], S [42], W [49], G [11], H [16] the evd values; e.g., Blood Viscosity: evd 5, $evd_r$ 0.98, $|\mathcal{G}_c|$ 1, C 15*)

The evd numbers correspond to the best rank of a result that contains the given intermediate term. The “-” character means that the intermediate cannot be found in any result of that approach. If there is a “Y”, the intermediate can be found in the results but no ranking is provided. Finally, the results with “*” in the C column indicate that the intermediate can only be found indirectly, by manually exploring a broader context of the result [4].

Our approach finds all the intermediate terms, which makes its performance equivalent to or better than the related approaches in this respect. Blood viscosity and platelet aggregation are placed among the top 16% of the results (out of 205 in the $T_R$ experiment), while vascular reactivity is considered to be a relatively less important intermediate.

Table 9 lists the same type of results as Table 8, only for the $T_M$ experiment and a slightly different set of related works. Note that the related works are sometimes inconsistent in the exact wording of the intermediate terms; therefore, we only focused on nine out of the eleven intermediates, namely those for which we were able to clearly match up the different wording alternatives.

The quantitative results of our approach are sparser than in the case of the $T_R$ experiment.
This has been caused mainly by the minimalistic, domain-agnostic approach we chose, which resulted in a relatively low coverage of the intermediate synonyms appearing in the data (the fulltext mapping could only discover terms rather similar to the canonical intermediate form used as a query, while many synonyms are quite dissimilar strings). All related approaches but one [11] use term expansion and mapping based on biomedical vocabularies like MeSH, and some even use quite extensive manual interventions (see Section 5.4 for details). Despite these limitations, we re-discovered five out of nine intermediates. Out of these, only three were discovered using a mature-enough generation of the refined knowledge graph, though.

For the intermediates we managed to find, we achieved results comparable to or better than the other approaches. For instance, three out of five related approaches were not able to re-discover the cortical depression intermediate, which is considered very important in [45].

Intermediate                      evd    evd_r   |G_c|   C [4]   S [42]   W [49]   B [2]   G [11]
Calcium Channel Blockers           -       -       -      22       3        Y       10       1
Epilepsy                          23     0.628     7       9*      -        Y        8       3
Brain Anoxia / Hypoxia             -       -       -       -       5        -        6      77
Inflammation                       -       -       -       3*
                                 335     0.333     2       1*
                                 352     0.274     4       4       1        Y       42      27
Serotonin                          -       -       -       1       1        Y        5       1
Cortical / Spreading Depression   58     0.468     3       -       6        -       45       -
Vascular Mechanism / Reactivity

Table 9: Evidence-based evaluation results for the $T_M$, $T'_M$ data

The overall results of the evidence-based evaluation are encouraging. In the $T_R$ experiment, our approach performed better than [4, 16], worse than [42, 11] and equally to [49]. In the $T_M$ experiment, we bettered [4, 49, 11], while [42, 2] outperformed us. In total, we did better than more than half of the related approaches in terms of the intermediate ranking.

Note that a direct comparison of the ranking results is conceptually difficult, since the approaches generate rather varied forms of results, e.g., mere terms in [11] or oriented multigraphs in [4]. However, we can at least give this basic summary, which we corroborate by analysing the actual contents of the results later on. We also further discuss the major comparative benefits of our approach in Section 5.4.

The rarity and interestingness measures for the two experiments are given in Table 10. We can only compare ourselves to [4], as the measures were defined and used there for the first time.

Table 10: Rarity and interestingness results (columns $rar(\mathcal{G})$ and $int(\mathcal{G})$ for N and C [4]; rows $T_R$, $T_M$)

The average results of our approach are lower than those in [4] for the $T_R$ experiment. However, the median rarity and interestingness of the paths generated in our experiment are 0 and 1, respectively – only about one third of the $T_R$ path associations have a non-zero frequency on PubMed. This means that two thirds of the claims generated by our approach have the same performance in terms of rarity and interestingness as in [4]. The average results of the $T_M$ experiment are better in our case. More than 98% of the $T_M$ claims have zero association frequency on PubMed.
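The relative importance rankings reported in Table 11 below are normalised rank positions of the re-discovery terms under vertex degree and betweenness centrality; a minimal sketch (the exact normalisation shown is our illustrative reading of the measure):

import networkx as nx

def relative_rank(scores, term):
    # Fraction of vertices the given term outranks under the measure;
    # values close to 1.0 mean the term is among the most central vertices.
    ordered = sorted(scores, key=scores.get)
    return ordered.index(term) / (len(ordered) - 1)

def degree_rank(G, term):
    return relative_rank(dict(G.degree()), term)

def betweenness_rank(G, term):
    return relative_rank(nx.betweenness_centrality(G), term)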
Terms            degree ranking                         betweenness centrality ranking
                 $T_R$    $T'_R$   $T_M$    $T'_M$      $T_R$    $T'_R$   $T_M$    $T'_M$
Source, target   0.522    0.609    0.541    0.708       0.537    0.592    0.656    0.736
Intermediates    0.563    0.729    0.557    0.57        0.678    0.699    0.584    0.575

Table 11: Degree-based ranking of the re-discovery terms

The importance of the sources, targets and intermediates in terms of degree is increased by the refinement in both experiments. The increase is largest for the source and target terms in the $T_M$ experiment and for the $T_R$ intermediates. The importance in terms of betweenness centrality increases relatively less, with the $T_M$ intermediates actually becoming slightly less important. These observations are consistent with the evidence-based evaluation in the sense of the “sensitivity” of the experimental data sets towards the source, target and intermediate terms. The refinement of the $T_R$ graph clearly raises the importance of all the vertices, especially the intermediates. Indeed, all the terms are present in relatively highly ranking claims of the resulting $T'_R$ graph. For the $T_M$ data set, where only the importance of the source and target vertices is markedly rising, the results are much sparser – although the $T'_M$ graph contains many claims connecting the source and target, there are relatively few intermediates from [45] present in these claims.

The qualitative analysis of the solution contents further elaborates on the above observations about the initial and resulting graph structures. As specified in Section 4.3.4, the analysis is based on the topics covered by the solution graphs. These are terms that provide additional context for the intermediates in the solutions. We provide comprehensive lists of the unique context topics in Appendix A, together with references to supporting literature.

The contents of Appendix A are summarised in Table 12, which contains the $top_d$, $top_r$, $top_n$ score values for the initial and refined knowledge graphs in both experiments.

Table 12: Topical quality scores ($top_d$, $top_r$, $top_n$ for $T_R$, $T'_R$, $T_M$, $T'_M$)

Note that, as we are not experts in the domains involved, we adopted a very conservative strategy for determining topic relevance: if we could not directly verify a particular relationship between the biomedical concepts present in the solution graphs through a review of the published literature via PubMed, we asserted the corresponding solution irrelevant. We encourage more knowledgeable readers to suggest possible updates of the detailed tables in Appendix A.

The table shows that our approach improves the quality of the extracted knowledge graphs. The topical density $top_d$ (i.e., the ratio of unique topics among the paths connecting the source and target terms) increases by about 27% and 122% for the $T_R$, $T_M$ experiments, respectively. The relevance $top_r$ increases by about 31% and 46% for $T_R$, $T_M$, respectively. Finally, the relative topical novelty $top_n$ increases by about 86% for the $T_M$ experiment. In the case of the $T_R$ experiment, the measure is slightly lower for $T'_R$ than for $T_R$; however, there is only one non-novel solution in both knowledge graphs, and the decrease of the relative $top_n$ value is caused by the lower total number of solutions in the refined graph.

These results confirm our assumption that the refinement improves the quality of the statements extracted from the literature, at least in the context of the two standard literature-based discovery scenarios. The improvement in quality is three-fold. Firstly, the refined knowledge graphs are less redundant (the topical density is higher).
Secondly, there are markedly more relevant solutions in the results. And thirdly, the refined solutions are largely non-obvious (a high $top_n$ measure).

A direct and exact comparison of our qualitative results to the related state of the art is unfortunately impossible due to the afore-mentioned differences in the solution representations. However, we can at least discuss the commonalities and differences informally. Figure 12 displays the hierarchy of topics covered by the $T'_R$ results. Each vertex in the hierarchy graph represents a part of a topic. The roots of the presented hierarchies are the intermediate concepts. The vertices shared across multiple topics have normal outlines, while the vertices that complete the topics on the way from the root have bold outlines.

The impact of glyceryl trinitrate on vasodilation, and consequently also on blood flow, has been studied in the context of a possible treatment of Raynaud's syndrome [19]. Our method reflects these findings by constructing a corresponding connection between Raynaud's syndrome and platelet aggregation, which is quite closely related to blood flow [17]. Phosphatidylcholine, also a relatively common vertex in the generated claims, refers to a class of phospholipids that is closely related to the metabolism of fatty acids, including those found in fish oils [1]. The vertices connected to phosphatidylcholine mediate the relationship between fish oil and platelet aggregation in the solutions. The topic with ADP-induced platelet aggregation [35] specifies the type of platelet aggregation that fish oils can influence. The solutions concerned with the anti-thrombotic effect put this vertex in connection with fish oils, possibly with an intermediate vertex referring to myocardial infarction. This corresponds to the anti-thrombotic effect of fish oils demonstrated for instance in [52]. Finally, one of our solutions identified a link between platelet aggregation and fish oils via their influence on the levels of plasma beta-thromboglobulin, a marker in ischemic heart disease [13].

The solution involving the vascular reactivity intermediate puts it in the context of the influence of fish oils on lower vascular resistance, as discussed for instance in [26].

Figure 12: Hierarchy of relevant topics in $T'_R$

Comparing the contents of our $T'_R$ solutions with the related state of the art approaches, we can only refer to [4] and [44], as the other works generate mere lists of possible intermediates without further context. Many contexts associated with the intermediates as possible explanations of the connections are missing in the related works. Examples are blood flow, glyceryl trinitrate, ADP-induced platelet aggregation, phosphatidylcholine or plasma beta-thromboglobulin within ischemic heart disease.
However, most of these connections are rather explanatory and not essential in the scope of Raynaud's syndrome, despite being valid. In [4], many of the graphs involve epoprostenol (essentially a prostaglandin) as a mediator of the influence of fish oils on platelet aggregation. This is consistent with [44], which establishes the connection between fish oil and platelet aggregation as a result of an increased level of prostaglandins. This context is missing in our results that involve the intermediates; however, it is present twice among the top-ten solutions (at ranks 4 and 8). Once it appears in relation to the action of the drug indomethacin, and then also in relation to luteolytic activity in women with Raynaud's disease. These are potentially interesting findings that extend the results produced by comparable state of the art approaches.

Figure 13 displays the hierarchy of topics covered by the $T'_M$ results. One solution involving the epilepsy intermediate puts it in the context of magnesium being used as a mechanism for the management of reverberating brain waves [40]. These are associated with epilepsy and vertigo attacks, and the solution suggests that treatments for these conditions may be used for migraine as well. The other claims related to epilepsy all share multifocal EEG abnormalities, which are characteristic of epilepsy [29]. Two different types of claims complemented these findings – two solutions dealing with magnesium concentrations in cerebrospinal fluid in relation to migraine [37], and one solution related to transmitter release and nerve stimulation. The solutions involving the cortical spreading depression intermediate were all related to similar concepts as the epilepsy ones. This is not surprising, since cortical spreading depression is quite closely related to seizures [10].

Figure 13: Hierarchy of relevant topics in $T'_M$

Similarly to the $T_R$ experiment, we can only compare the contents of our $T'_M$ solutions to [4] and [45], which are the only works that provide context in addition to the intermediates. The graphs presented in [4] for migraine and magnesium are generally much sparser than those for Raynaud and fish oil (typically containing only the source, target and intermediate node). Moreover, none of the results discussed in the article in detail concern epilepsy or spreading depression. The work [45] confirms the close relationship between epilepsy and cortical spreading depression, which is consistent with a straightforward interpretation
of our results. Our solutions also managed to bring up the relationship between magnesium and cerebrospinal fluid in the context of epileptic attacks. In addition to that, our results appear to strongly associate migraine with multifocal EEG abnormalities. This is consistent with the relationship between the abnormalities and headaches demonstrated for instance in [12]. Other potentially interesting findings not covered by the related works are vertigo, reverberation and the relationship between migraine, cortical spreading depression and rolandic epilepsy [50].
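For completeness, the topical quality scores discussed above reduce to simple set ratios over the manually labelled topic sets (a minimal sketch; the relevance and novelty labels come from the literature review described in Section 4.3.4):

def topical_scores(all_topics, relevant_topics, novel_topics):
    # all_topics: one entry per solution path; topics are hashable term tuples.
    unique_topics = set(all_topics)
    top_d = len(unique_topics) / len(all_topics)       # topical density
    top_r = len(relevant_topics) / len(unique_topics)  # relative topical relevance
    top_n = len(novel_topics) / len(relevant_topics)   # relative topical novelty
    return top_d, top_r, top_n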
5. Related Work
We split this section into four thematic blocks that correspond to the main theoretical and application-specific facets of our work. In particular, we review the areas of: 1. automated discovery; 2. ontology learning; 3. discovery supported by knowledge graphs; 4. literature-based discovery.

5.1. Automated Discovery
Research into ways in which discoveries can be automated or facilitated by machines dates back to the dawn of the digital computer era. The work [27] provides a comprehensive analysis of the discovery process operationalised as creative problem solving. It reviews several classic machine discovery systems and the heuristics used by them, and also mentions several properties of worthy discoveries, such as novelty and value. A more recent related work [18] reviews the major approaches to studying the process of scientific discovery, provides another survey of automated discovery systems and analyses additional features of relevant discoveries, such as surprise. The works [21, 20] review still more machine discovery systems and heuristics, and identify features like refutability and simplicity as essential to discoveries. One of the most recent and relevant works from this area is [23]. It builds on [27, 18] and introduces formalisations of several discovery features. In particular, it models novelty and value using metric spaces, and surprise using Bayesian probabilities.

The discovery features discussed in the referenced works conform to our virtue definitions, although most of the works do not provide a systematic formalisation, only rather application-specific implementations. For instance, refutability and simplicity as reviewed in [21] directly correspond to our virtues. Surprise and novelty discussed in the other works can be modelled by putting emphasis on radical claims, as addressed by the conservatism virtue, only using different distance metrics for each of the respective features. We believe that our approach presents a new way of formalising discovery features that is consistent with the related state of the art, but is more systematic, comprehensive and extensible. In addition, we provide an actionable set of measures implemented in the context of knowledge graphs. This enables the universal applicability of our research, which is not the case for most of the rather specific afore-mentioned approaches.

5.2. Ontology Learning
In the last fifteen years, there has been growing interest in exploring the potential of automatically extracted graph structures for knowledge discovery. Many such approaches can be clustered under the umbrella of ontology learning [22], which aims at extracting complex statements from unstructured textual resources. This is done using specifically tailored methods from AI disciplines like natural language processing and machine learning.

As a recent survey [51] shows, the applicability of existing ontology learning approaches to (semi-)automated knowledge discovery is still quite limited. Many of the techniques depend on manually curated resources. They also introduce a lot of assumptions during the extraction process (based on, for instance, linguistic facts valid only in the context of a particular language or discourse), which limits their universal applicability. Another problem is that the more complex the knowledge representation the learned ontologies use, the more restrictive they are about its meaning. This typically leads to brittleness w.r.t. the often inherently vague and contextual nature of the knowledge they represent, which can easily cause problems in machine-aided knowledge discovery scenarios where we typically want to represent the knowledge implied by the input data in as unbiased a way as possible. Another practical limitation is that most ontology learning systems do not scale very well, as reported in [51].

5.3. Discovery Supported by Knowledge Graphs
More recent works related to machine discovery using knowledge graphs include [6, 8, 28], which also contain comprehensive reviews of prior similar approaches. The approach elaborated in [6] presents methods for knowledge discovery in RDF [24] data based on user-defined query patterns and analytical perspectives. Our approach complements [6] by offering means for the automated analysis and refinement of knowledge graphs using application-independent, well-founded features.

Google's Knowledge Vault [8] presents a web-scale approach to probabilistic knowledge fusion that uses graphs represented in the RDF format. It tackles the scalability vs. accuracy trade-off of the manual and automatic approaches to the construction of knowledge graphs. This is done by refining statements extracted from web content using models learned from pre-existing, highly accurate knowledge bases like YAGO or Freebase. Additional details and the broader theoretical context of the approach introduced in [8] are given in [28], which offers a comprehensive review of relational machine learning approaches in the context of RDF-compatible knowledge graphs. The main advantage of our approach w.r.t. the works [8, 28] is that we are not critically dependent on a background knowledge model. In addition, we present a complementary, well-founded approach to determining which relationships in automatically extracted knowledge graphs are worth preserving. Having said that, the techniques reviewed in [28] can certainly provide valuable hints for future extensions of our approach to graphs with oriented edges representing more than one type of relationship (i.e., RDF graphs).

5.4. Literature-Based Discovery
As our approach has been validated by experiments in literature-based discovery, we need to position ourselves within that field as well. Surveys of related works (focusing mostly on the domain of life sciences) are provided in [7, 34, 41]. The specific approaches we compare ourselves to are described in [4, 2, 16, 49, 42, 11]. In most cases where we were able to directly compare our results with the related works, our approach was at least as good as, and often better than, the state of the art. In addition, we managed to hint at several relevant insights that were not even discussed by the human expert in the original studies [44, 45].

The most significant advantages of our approach are, however, as follows. 1. It is fully automatic – the only manual action we performed was pruning the fulltext search results when mapping terms to the corresponding vertices, and this is only required for the evaluation, not for the method itself. 2. There are no domain-specific dependencies, and thus our work is readily applicable to any field, not just biomedical literature-based discovery. 3. We produce extensive contextual information that can facilitate the interpretation of the results and thus make the machine-aided discovery process more efficient. 4. Our approach is based on theoretical foundations motivated by the state of the art philosophical study of the key features of scientific discoveries.

The works [42, 16, 49] all depend on rather extensive manual effort (definition of semantic types and discovery patterns, result pruning, etc.). The approaches [4, 2, 11] are automated; however, only [4] provides broader context in order to elucidate the connections. Moreover, all works but [11] substantially rely on an external domain-specific source of background knowledge and/or domain-specific NLP tools, namely the MeSH and UMLS vocabularies [3] and the tools SemRep [38] and BioMedLEE [5]. It is quite plausible to assume that without these resources, the related approaches dependent on them would perform much less favourably when compared to our implementation. Last but not least, all the related approaches lack the universally applicable theoretical foundations presented as the core contribution of this article.
6. Conclusions and Future Work
We have presented a novel approach to discovery informatics that is based on a formalisation of hypothesis virtues in the context of knowledge graphs. We have shown that the approach is naturally motivated, well-founded, extensible and universally applicable. It can be used as a broader theoretical frame for other approaches to machine discovery, as briefly outlined in Section 5. We have delivered an implementation of the presented research and performed its experimental validation using standard scenarios in literature-based discovery. A successful comparison with related state of the art tools demonstrates the practical relevance of our work.

In the near future, we will extend the theoretical framework in order to address directed multi-graphs with predicate edge labels and more complex semantics associated with particular edge and vertex types. This will allow for a straightforward application of our approach to more expressive knowledge graphs, such as RDF [24] knowledge bases and ontologies in the Linked Open Data cloud [14]. Furthermore, we intend to continue demonstrating the universality of our framework by using it in other experimental scenarios targeted by related works in machine discovery. We also plan to explore the complex relationships between specific measures and their influence on the properties of the evolutionary refinement process (e.g., convergence, optimality and completeness bounds). This will lead to a deeper understanding of the refinement, and therefore also to more efficient implementations. Finally, and perhaps most importantly, we would like to use our approach in scenarios involving actual new discoveries, in direct collaboration with the corresponding domain experts.
References

[1] Abe, E., Ikeda, K., Nutahara, E., Hayashi, M., Yamashita, A., Taguchi, R., Doi, K., Honda, D., Okino, N., Ito, M., 2014. Novel lysophospholipid acyltransferase PLAT1 of Aurantiochytrium limacinum F26-b responsible for generation of palmitate-docosahexaenoate-phosphatidylcholine and phosphatidylethanolamine. PloS One 9 (8), e102377.
[2] Blake, C., Pratt, W., 2002. Automatically identifying candidate treatments from existing medical literature. In: AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases. pp. 9–13.
[3] Bodenreider, O., 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (suppl 1), D267–D270.
[4] Cameron, D., Kavuluru, R., Rindflesch, T. C., Sheth, A. P., Thirunarayan, K., Bodenreider, O., 2015. Context-driven automatic subgraph creation for literature-based discovery. Journal of Biomedical Informatics. In press.
[5] Chen, L., Friedman, C., 2004. Extracting phenotypic information from the literature via natural language processing. Medinfo 11 (Pt 2), 758–62.
[6] Colazzo, D., Goasdoué, F., Manolescu, I., Roatis, A., 2014. RDF Analytics: Lenses over Semantic Graphs. In: Proceedings of WWW'14. ACM.
[7] de Bruijn, B., Martin, J., 2002. Getting to the (c)ore of knowledge: mining biomedical literature. International Journal of Medical Informatics 67 (1-3), 7–18.
[8] Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W., 2014. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 601–610.
[9] Eiben, A. E., Smith, J., 2007. Introduction to Evolutionary Computing. Springer.
[10] Fabricius, M., Fuhr, S., Willumsen, L., Dreier, J. P., Bhatia, R., Boutelle, M. G., Hartings, J. A., Bullock, R., Strong, A. J., Lauritzen, M., 2008. Association of seizures with cortical spreading depression and peri-infarct depolarisations in the acutely injured human brain. Clinical Neurophysiology 119 (9), 1973–1984.
[11] Gordon, M. D., Lindsay, R. K., 1996. Toward discovery support systems: A replication, re-examination, and extension of Swanson's work on literature-based discovery of a connection between Raynaud's and fish oil. Journal of the American Society for Information Science 47 (2), 116–128.
[12] Guidetti, V., Fornara, R., Marchini, R., Moschetta, A., Pagliarini, M., Ottaviano, S., Seri, S., 1986. Headache and epilepsy in childhood: analysis of a series of 620 children. Functional Neurology 2 (3), 323–341.
[13] Hay, C., Durber, A., Saynor, R., 1982. Effect of fish oil on platelet kinetics in patients with ischaemic heart disease. The Lancet 319 (8284), 1269–1272.
[14] Heath, T., Bizer, C., 2011. Linked Data: Evolving the Web Into a Global Data Space. Morgan & Claypool.
[15] Honavar, V. G., 2014. The promise and potential of big data: A case for discovery informatics. Review of Policy Research 31 (4), 326–330.
[16] Hristovski, D., Friedman, C., Rindflesch, T. C., Peterlin, B., 2006. Exploiting semantic relations for literature-based discovery. In: AMIA Annual Symposium Proceedings. Vol. 2006. American Medical Informatics Association, p. 349.
[17] Jackson, S. P., 2007. The growing complexity of platelet aggregation. Blood 109 (12), 5087–5095.
[18] Klahr, D., Simon, H. A., 1999. Studies of scientific discovery: Complementary approaches and convergent findings. Psychological Bulletin 125 (5), 524.
[19] Kleckner, M. S., Allen, E. V., Wakim, K. G., 1951. The effect of local application of glyceryl trinitrate (nitroglycerine) on Raynaud's disease and Raynaud's phenomenon: studies on blood flow and clinical manifestations. Circulation 3 (5), 681–689.
[20] Langley, P., 2000. The computational support of scientific discovery. International Journal of Human-Computer Studies 53 (3), 393–410.
[21] Langley, P., Zytkow, J. M., 1989. Data-driven approaches to empirical discovery. Artificial Intelligence 40 (1-3), 283–312.
[22] Maedche, A., Staab, S., 2004. Ontology learning. In: Staab, S., Studer, R. (Eds.), Handbook on Ontologies. Springer, Ch. 9, pp. 173–190.
[23] Maher, M. L., Fisher, D. H., 2012. Using AI to evaluate creative designs. In: 2nd International Conference on Design Creativity, Glasgow, UK.
[24] Manola, F., Miller, E., 2004. RDF Primer. Available at (November 2008): .
[25] Mowshowitz, A., Dehmer, M., 2012. Entropy and the complexity of graphs revisited. Entropy 14 (3), 559–570.
[26] Mozaffarian, D., 2007. Fish, n-3 fatty acids, and cardiovascular haemodynamics. Journal of Cardiovascular Medicine 8, S23–S26.
[27] Newell, A., Shaw, J. C., Simon, H. A., 1959. The processes of creative thinking. Rand Corporation, Santa Monica, CA.
[28] Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E., 2015. A review of relational machine learning for knowledge graphs: From multi-relational link prediction to automated knowledge graph construction. arXiv preprint arXiv:1503.00759.
[29] Noriega-Sanchez, A., Markand, O. N., 1976. Clinical and electroencephalographic correlation of independent multifocal spike discharges. Neurology 26 (7), 667–667.
[30] Nováček, V., Burns, G. A., 2014. SKIMMR: Facilitating knowledge discovery in life sciences by machine-aided skim reading. PeerJ. Available at https://peerj.com/articles/483/.
[31] Onnela, J.-P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., Kertész, J., Barabási, A.-L., 2007. Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences 104 (18), 7332–7336.
[32] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
[33] Popper, K., 2005. The Logic of Scientific Discovery. Routledge.
[34] Preiss, J., Stevenson, M., McClure, M. H., 2012. Towards semantic literature based discovery. In: 2012 AAAI Fall Symposium Series: Information Retrieval and Knowledge Discovery in Biomedical Text. Vol. 30. AAAI, pp. 7–18.
[35] Puri, R. N., 1999. ADP-induced platelet aggregation and inhibition of adenylyl cyclase activity stimulated by prostaglandins: signal transduction mechanisms. Biochemical Pharmacology 57 (8), 851–859.
[36] Quine, W. V., Ullian, J. S., 1978. The Web of Belief. McGraw-Hill.
[37] Ramadan, N., Halvorson, H., Vande-Linde, A., Levine, S. R., Helpern, J., Welch, K., 1989. Low brain magnesium in migraine. Headache: The Journal of Head and Face Pain 29 (9), 590–593.
[38] Rindflesch, T. C., Fiszman, M., Libbus, B., 2005. Semantic interpretation for the biomedical research literature. In: Medical Informatics. Springer, pp. 399–422.
[39] Scott, J., 2012. Social Network Analysis. Sage.
[40] Shibata, M., Bures, J., 1975. Techniques for termination of reverberating spreading depression in rats. Journal of Neurophysiology 38 (1), 158–166.
[41] Smalheiser, N. R., 2012. Literature-based discovery: Beyond the ABCs. Journal of the American Society for Information Science and Technology 63 (2), 218–224.
[42] Srinivasan, P., 2004. Text mining: generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology 55 (5), 396–413.
[43] Stegmann, J., Grohmann, G., 2003. Hypothesis generation guided by co-word clustering. Scientometrics 56 (1), 111–135.
[44] Swanson, D. R., 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 30 (1), 7–18.
[45] Swanson, D. R., 1987. Migraine and magnesium: eleven neglected connections. Perspectives in Biology and Medicine 31 (4), 526–557.
[46] Tietjen, G. W., Chien, S., Leroy, E. C., Gavras, I., Gavras, H., Gump, F. E., 1975. Blood viscosity, plasma proteins, and Raynaud syndrome. Archives of Surgery 110 (11), 1343–1346.
[47] Valiant, L. G., 1979. The complexity of enumeration and reliability problems. SIAM Journal on Computing 8 (3), 410–421.
[48] Watts, D. J., Strogatz, S. H., 1998. Collective dynamics of 'small-world' networks. Nature 393 (6684).
[49] Weeber, M., Klein, H., de Jong-van den Berg, L., Vos, R., et al., 2001. Using concepts in literature-based discovery: Simulating Swanson's Raynaud–fish oil and migraine–magnesium discoveries. Journal of the American Society for Information Science and Technology 52 (7), 548–557.
[50] Wirrell, E. C., Hamiwka, L. D., 2006. Do children with benign rolandic epilepsy have a higher prevalence of migraine than those with other partial epilepsies or nonepilepsy controls? Epilepsia 47 (10), 1674–1681.
[51] Wong, W., Liu, W., Bennamoun, M., 2012. Ontology learning from text: A look back and into the future. ACM Computing Surveys 44 (4), 20:1–20:36.
[52] Zhu, B.-Q., Parmley, W. W., 1990. Modification of experimental and clinical atherosclerosis by dietary fish oil. American Heart Journal 119 (1), 168–178.
Appendix A: Context Topics for the Intermediate Terms