Formalising Hypothesis Virtues in Knowledge Graphs: A General Theoretical Framework and its Validation in Literature-Based Discovery Experiments
Vít Nováček
Insight @ NUI Galway (formerly known as DERI), IDA Business Park, Lower Dangan, Galway, Ireland
Abstract
We introduce an approach to discovery informatics that uses so-called knowledge graphs as the essential representation structure. Knowledge graph is an umbrella term that subsumes various approaches to tractable representation of large volumes of loosely structured knowledge in a graph form. It has been used primarily in the Web and Linked Open Data contexts, but is applicable to any other area dealing with knowledge representation. In the perspective of our approach motivated by the challenges of discovery informatics, knowledge graphs correspond to hypotheses. We present a framework for formalising so-called hypothesis virtues within knowledge graphs. The framework is based on a classic work in philosophy of science, and naturally progresses from mostly informative foundational notions to actionable specifications of measures corresponding to particular virtues. These measures can consequently be used to determine refined sub-sets of knowledge graphs that have large relative potential for making discoveries. We validate the proposed framework by experiments in literature-based discovery. The experiments have demonstrated the utility of our work and its superiority w.r.t. related approaches.
Keywords: discovery informatics, hypotheses as knowledge graphs, hypothesis virtue formalisation, automated knowledge graph construction, evolutionary refinement, literature-based discovery
1. Introduction
Ever since the dawn of the computer age, researchers have been intrigued by the possibility of automating the process of discovery [27]. Today, the field of discovery informatics is getting more relevant than ever before. The large
This work has been supported by the "KI2NA" project funded by Fujitsu Laboratories, Limited in collaboration with Insight, NUI Galway. We also greatly appreciate comments of Pierre-Yves Vandenbussche who helped us to refine the presentation of the article.
Email address: [email protected] (Vít Nováček)
amounts of data that are being made openly available for anyone to explore have an immense potential for making new discoveries, and solutions that would enable this are highly sought after [15].

Knowledge graphs are one of the most universal ways of representing actionable, data-driven knowledge at large scale [8]. They represent knowledge as relationships (edges) between items of interest (vertices), with the possibility of adding additional annotations representing for instance multiple relationship types (i.e., predicates). Such a representation has many advantages like universal applicability and a wealth of well-founded methods for analysing graph structures. Yet the full potential of knowledge graphs for practical applications in knowledge discovery is still largely to be explored [8].

The motivation of the presented work is two-fold. Firstly, we want to propose a general framework for defining features of knowledge graphs that can determine which parts of the graphs have the highest potential for making discoveries. We believe that this can facilitate the process of semi-automated knowledge discovery in domains that have a lot of data available in graph-like format, but suffer from high redundancy and noise (e.g., World Wide Web, social networks or biological pathway databases).

The second motivation is more practical. In our previous work [30], we addressed the problem of extracting simple knowledge graphs from biomedical texts. The graphs were then used for so-called machine-aided skim reading – high-level navigation of a specific domain represented by a textual corpus which was assumed to facilitate the discovery process. Indeed, even highly experienced domain experts were able to discover new and relevant facts using the prototype system. However, the results also contained some noise and connections that were correct, but rather obvious and/or uninteresting. This motivated the validation experiments presented here, which demonstrate that our framework for formalising hypothesis virtues can tackle the problems of noise, redundancy and obviousness in knowledge graphs automatically extracted from texts.

Our approach consists of formalising features applicable to ranking knowledge graphs (or their partitions) based on their potential for making discoveries. This can be used for instance for decomposing knowledge graphs into atomic subgraphs and consequent construction of a graph that has a higher "discovery potential" than the original one. The formalisation is based on widely accepted hypothesis virtues studied in philosophy of science [36]. Examples of virtues are refutability or generality – a good scientific hypothesis has to be falsifiable and should also provide explanations of phenomena outside of its original scope. We present general conditions for each of the virtues and proceed with defining specific measures that conform to these conditions and can be efficiently implemented.

The validation of the approach was performed in the context of literature-based discovery [41]. We extracted knowledge graphs from two de facto standard biomedical corpora traditionally used in the evaluation of literature-based discovery tools. For that we used a very simple and domain-agnostic method that extracts statistically significant co-occurrence relationships. We opted for such a solution to demonstrate the universal applicability of our approach.
From these basic graphs, we constructed refined ones using a genetic algorithm that utilises the hypothesis virtue measures in the fitness function. The refined graphs were analysed according to the evaluation measures used in the literature-based discovery field and compared to related works. The results of the validation were positive, as we outperformed the state of the art in most respects. Moreover, we discovered relevant relationships that have not been covered by any related automated system or manual study. This demonstrates the practical utility of our approach.

Our main contributions are as follows. We have proposed a novel theoretical framework for extensible definition of measures that can be used to analyse the discovery potential of knowledge graphs. We have defined specific measures applicable especially to the refinement of knowledge graphs automatically extracted from texts. We have implemented an evolutionary method for refinement of the automatically extracted knowledge graphs that is applicable out-of-the-box to any domain where English texts are available. We have demonstrated the practical relevance of the presented research by a successful experimental validation in the field of literature-based discovery. Last but not least, we have provided a data package containing a prototype implementation of our approach, results and other data necessary for the replication of our experiments.

The rest of the article is organised as follows. Section 2 presents the general framework for formalising the hypothesis virtues in the context of knowledge graphs. Section 3 then introduces actual measures that follow the general requirements of the hypothesis virtue formalisations. Our approach is experimentally validated in Section 4. The section describes the evolutionary refinement of knowledge graphs extracted from texts and elaborates on the experiments in literature-based discovery. Related approaches are discussed in Section 5. Finally, we conclude the article and outline our future work in Section 6.
2. Formalising Hypothesis Virtues
The foundations of the presented work are built on [36], a classic work in philosophy of science. The work introduces five virtues of hypothesis: conservatism, modesty, simplicity, generality and refutability. These virtues present a comprehensive compilation of the philosophical treatments of discovery ranging from antiquity to modern analytical philosophy, and have been frequently used as a reference for determining the quality of hypotheses in science.

According to [36], the virtue of conservatism reflects the fact that a good hypothesis usually makes rather conservative claims. This is to minimise the risk of error by reaching too far from the state of the art in one step (even though the combination of the particular conservative claims may go very far after all, indeed).
Modesty is related to conservatism – a hypothesis A is more modest than A and B (since A and B entails A), and a more modest hypothesis is considered better as it minimises the risk of wrong and/or redundant claims. The simplicity virtue posits that a good hypothesis should simplify our view of the world by making new claims about it, even though the claims themselves may actually be quite complex. The generality virtue is related to the predictive power of hypothesis – the more phenomena (that have perhaps not even been considered originally) it can explain, the better it is. Finally, refutability means that a hypothesis should be falsifiable in as obvious a manner as possible. This is a factor of utmost importance, as discussed in arguably the most influential work on this topic [33].

In the following, we first define the notions of hypotheses and their claims in the context of knowledge graphs (Section 2.1) and then continue with formalising the five virtues (Section 2.2).
2.1. Hypotheses and Claims in Knowledge Graphs

First we define a universe – a general knowledge graph within which particular hypotheses may be defined.
Definition 1.
A universe graph $U$ is a tuple $(V_U, E_U, \Lambda_V, \Lambda_E)$ where $V_U$ is a set of vertices, $E_U \subseteq V_U \times V_U$ is a set of edges and $\Lambda_V, \Lambda_E$ are sets of labeling maps (i.e., morphisms) that associate values with the universe vertices and edges, respectively.

The labeling maps can, for instance, assign predicate types to edges in semantic networks, assert vertex types like class or individual in ontology knowledge graphs, or associate confidence weights with edges of automatically extracted knowledge graphs. Such a definition can accommodate a broad range of knowledge graphs with varying levels of semantic complexity, while keeping the basic structure still compatible with the analysis methods introduced here. The universe can be either directed or undirected. The experiments presented in this article deal with an undirected universe and therefore we assume undirected graphs in the following unless explicitly stated otherwise.
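As an aside, such a labeled universe maps directly onto common graph libraries. The following minimal sketch (our illustration, not the authors' prototype; it assumes the networkx library, and the vertex names and the "type" label are purely hypothetical) realises the labeling maps as attribute dictionaries:

import networkx as nx

# Undirected universe, as assumed throughout the article.
U = nx.Graph()

# Vertices with an illustrative vertex labeling map ("type").
U.add_node("fish oil", type="term")
U.add_node("platelet aggregation", type="term")

# An edge with a confidence-weight labeling map, as in the running examples.
U.add_edge("fish oil", "platelet aggregation", weight=0.5)

# One member of Lambda_E, viewed as a map from edges to values:
weight_map = {(u, v): d["weight"] for u, v, d in U.edges(data=True)}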
A hypothesis in a universe is defined as follows.

Definition 2.

A hypothesis $H = (V_H, E_H, \Lambda_{HV}, \Lambda_{HE})$ is a subgraph of the universe $U$ such that $V_H \subseteq V_U$, $E_H \subseteq E_U$ and

$\forall \lambda_{HV} \in \Lambda_{HV}\ \exists \lambda_V \in \Lambda_V . \lambda_{HV} \subseteq \lambda_V$,  $\forall \lambda_{HE} \in \Lambda_{HE}\ \exists \lambda_E \in \Lambda_E . \lambda_{HE} \subseteq \lambda_E$.

The second defining condition of the hypothesis subgraph means that any specific labeling map employed by a hypothesis has to be subsumed by a map defined in the universe. This ensures that the universe is closed w.r.t. possible interpretations of the hypotheses existing within it.

Most of the hypothesis virtues critically depend on what a claim of a hypothesis is, and therefore we need to define that as well.
Definition 3.
A claim of a hypothesis $H$ is a simple (i.e., acyclic) path in the graph $H$.

Such a definition presents arguably the most universal view on what a particular knowledge graph may express. No matter what the actual semantics of the relationships in a hypothesis graph are, one can always study what they claim at least in terms of connections of vertices by means of edges, i.e., paths (we will use the terms path and claim interchangeably in the rest of the article). This makes our approach applicable to any type of knowledge graph.

Note that one practical implication of the last definition is that we can consider only connected graphs as hypotheses – if there is no path between two vertices, no claim is being made about them and they should thus be parts of different hypotheses. This is partly related to the open/closed world assumption dichotomy. The fact that there is no connection between vertices does not mean no such connection can exist, it only means nothing is known about it in the context of the given knowledge graph.

The final preliminary definition concerns all claims possibly made by a hypothesis.
Definition 4.
A claim set of a hypothesis $H$ is the set $\Pi(H)$ of all simple paths in the corresponding graph. A claim volume of $H$ is the size of its claim set, i.e., $|\Pi(H)|$.

The claim volume can be very large and is hard to compute even for relatively small graphs [47]. Also, it is not realistic to expect every possible path in a knowledge graph to convey a meaningful claim. Therefore in practice, it is convenient to restrict the claim set to a more manageable size based on case-specific heuristics. However, the maximal possible number of claims is apt as a theoretical notion for describing general knowledge graphs without further information about their domain and more complex semantics.
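To make the notion concrete, the claim set of a small hypothesis graph can be enumerated directly. The sketch below (an illustration only, again assuming networkx) also shows an optional path-length cutoff corresponding to the case-specific restriction of the claim set mentioned above:

from itertools import combinations
import networkx as nx

def claim_set(H, cutoff=None):
    # All simple paths between all vertex pairs, i.e. Pi(H); the cutoff
    # optionally bounds the path length to keep the set manageable.
    claims = []
    for s, t in combinations(H.nodes, 2):
        claims.extend(nx.all_simple_paths(H, s, t, cutoff=cutoff))
    return claims

H = nx.path_graph(4)      # toy hypothesis: 0 - 1 - 2 - 3
print(len(claim_set(H)))  # claim volume |Pi(H)| = 6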
2.2. The Five Virtues

The following five sections present formalisations of the particular hypothesis virtues using the preliminary notions introduced above. Note that we provide general guidelines for measuring the virtues first, giving a minimalistic set of conditions the measures should satisfy. Detailed examples of specific measures facilitating literature-based discovery are discussed in Sections 3 and 4.
2.2.1. Conservatism

Conservative claims should make small steps in a particular direction, however, the combination of the steps can potentially be quite radical (i.e., far-reaching). The conservatism of a path in a hypothesis $H$ can be measured by a function $f : \Pi(H) \rightarrow \mathbb{R}$ that satisfies the following conditions:

1. Assuming a metric $\delta : V_U \times V_U \rightarrow \mathbb{R}$ on the vertices in the universe graph, the function $f$ applied to a path $p = (v_1, v_2, \ldots, v_{|p|})$ is negatively correlated with the $g(\{\delta(v_1, v_2), \delta(v_2, v_3), \ldots, \delta(v_{|p|-1}, v_{|p|})\})$ value, where $g : 2^{\mathbb{R}} \rightarrow \mathbb{R}$ is an aggregation function (e.g., sum, minimum, maximum or arithmetic mean). (Here and in the following, we use broad notions of positive and negative correlation. They are meant to generalise the respective notions of proportionality and inverse proportionality to possibly non-linear, non-algebraic or statistical relationships that may be specific to particular applications.)
2. If radical claims are preferred, then there is an additional requirement for $f$ being positively correlated with the $\delta(v_1, v_{|p|})$ value.

The conservatism of the whole hypothesis $H$ is computed by aggregating all path conservatism measures across the $\Pi(H)$ set. The higher the aggregate value, the larger the conservatism. Due to the complexity of enumerating the $\Pi(H)$ set, practical conservatism measures can target only a subset of all possible paths. For instance, a set of shortest paths between all vertices in $H$ w.r.t. the $\delta$ edge labeling is a viable option as it is comparatively easier to compute and already satisfies condition 1. if sum is used as an aggregation function.

2.2.2. Modesty

Let us refer by $H_\omega$ to the complete graph corresponding to a hypothesis $H$ (i.e., a graph with an edge between any two vertices in $V_H$). Then the modesty of $H$ can be defined as

$\frac{|\Pi(H_\omega)|}{|\Pi(H)|}$.

This number reflects the ratio between all possible claims about the entities covered by $H$ and the actual number of claims being made. The higher the ratio, the larger the modesty (a modest hypothesis minimises the number of claims made in relation to the number of claims that can possibly be made). As mentioned before, computing the number of all simple paths in a graph is extremely difficult in general. Therefore in practice, approximations of the modesty measure are necessary. The approximations, however, should be monotonic w.r.t. the ideal modesty measure: assuming $f, g$ as the ideal and approximate modesty measures, respectively, then $g(H) > g(I)$ if and only if $f(H) > f(I)$ for any two hypotheses $H, I$.

2.2.3. Simplicity

For this virtue, we use the dual notion of complexity which has been extensively studied in the context of graphs [25]. A good hypothesis should simplify our view of the world despite possibly being locally complex [36]. In order to formalise this intuition, let us assume the simplicity of a graph is measured by a function $f : G_U \rightarrow \mathbb{R}$, where $G_U$ is a set of all graphs conceivable in the universe $U$. The function $f$ should satisfy these conditions:

1. Given a hypothesis graph $H$ and a graph complexity measure $c : G_U \rightarrow \mathbb{R}$, $f$ is positively correlated with the expression

$\frac{c(U \setminus H)}{c(U)}$

which reflects the universe simplification rate w.r.t. the hypothesis. (From here on, we use the set-theoretic operators for graphs as a convenience notation for the operations applied on the corresponding vertex and edge sets in the actual tuple representations of the graphs. The labeling sets of the result are assumed to be $\Lambda_V, \Lambda_E$, i.e., the universe ones, unless specified otherwise.)
2. If locally complex hypotheses are preferred, then the function $f$ is also required to be positively correlated with the value $c(H)$.

Strictly speaking, the rate in the first condition should also be higher than 1 in order for the hypothesis to make the universe actually simpler, but practical applications may relax that requirement and just rank the hypotheses based on the measure.

2.2.4. Generality

Generality can be quantified as a number of explanations (i.e., claims) the hypothesis $H$ can provide for 'out-of-scope' phenomena (i.e., vertices) in the $U \setminus H$ graph. This can be expressed as

$g(\{f(u) \mid u \in V_U \setminus V_H\})$,

where the function $g : 2^{\mathbb{R}} \rightarrow \mathbb{R}$ is an aggregation (like sum or arithmetic mean) over all vertices that are out of the $H$ scope. The function $f : V_U \rightarrow \mathbb{R}$ is required to be positively correlated with the $h(\{|\{v \mid v \in p \wedge v \in V_H\}| \mid p \in \Pi_u(U)\})$ value, where $h : 2^{\mathbb{R}} \rightarrow \mathbb{R}$ is another aggregation function and $\Pi_u(U)$ is a set of all simple paths in the universe $U$ that start in the vertex $u$.

The generality definition reflects the basic intuition that the higher the number of $H$ vertices on paths explaining phenomena outside of $H$, the higher the generality of $H$. As the numbers of simple paths can be difficult to compute even if limited to paths starting in single nodes, approximations of this measure are needed for implementations again. Similarly to the modesty condition, we require the approximations to be monotonic w.r.t. the ideal generality measure.

2.2.5. Refutability

Refutability can be seen as a quantification of: 1) the easiness with which the claim volume $|\Pi(H)|$ of a particular hypothesis graph $H$ can be reduced; 2) the rate of the reduction. The atomic part of the process of refutation in the context of knowledge graphs is an invalidation, i.e., removal, of a vertex. Let us assume a decreasing ranking $R : \mathbb{N} \rightarrow V_U$ of the vertices in $H$ based on the number of simple paths that no longer exist in the graph after the vertex removal. Then we can define a top-k refutability as

$\frac{|\Pi(H)|}{|\Pi(H)| + \sum_{i=1}^{k} |\Pi(H/R(i))|}$,

where $H/R(i)$ is a graph resulting from the removal of the first vertex in the ranking $R$ from the graph $H/R(i-1)$; $H/R(0) = H$ by definition. The lower the number of paths still existing after removing the top vertex according to $R$, the higher the refutability. The $|\Pi(H)|$ expression is added to the denominator to avoid potential division by zero, and also to normalise the measure value.

Note that for growing $k$ values, the top-k refutability generally converges to similar values for any given set of hypotheses as the measure is relative to the total number of paths in the graph. Therefore it is practical to use the measure with rather low $k$ values, perhaps even as low as 1 which measures the rate of refutability in a single vertex removal step. Additionally, the ideal measure is difficult to compute and approximations are required in practice again. In particular, one can approximate the $\Pi$ function in the vertex ranking and refutability definition with one that is monotonic w.r.t. it.
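For a rough feel of how such an approximation might look in code, the following sketch (our illustration, not the authors' implementation; networkx assumed) replaces the intractable $\Pi$ function with a shortest-path-based proxy of the kind suggested above:

import networkx as nx

def path_count(G):
    # Practical stand-in for |Pi(G)|: one (shortest) path per connected
    # vertex pair; a monotonic proxy of the kind discussed above.
    return sum(len(d) - 1 for _, d in nx.shortest_path_length(G)) // 2

def top_k_refutability(H, k=1):
    # Rank vertices by the number of (proxy) paths lost after removal.
    def damage(v):
        G = H.copy()
        G.remove_node(v)
        return path_count(H) - path_count(G)
    ranking = sorted(H.nodes, key=damage, reverse=True)
    G, removed = H.copy(), []
    for v in ranking[:k]:
        G.remove_node(v)
        removed.append(path_count(G))
    return path_count(H) / (path_count(H) + sum(removed))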
3. Specific Virtue Measures
In this part, we introduce specific instances of hypothesis virtue measures following the general formalisation presented before. First we give an example of a universe and a couple of associated structures in Section 3.1. These will be used for running examples illustrating the measure details in Section 3.2. Finally, Section 3.3 describes how to use the measures in concert.
3.1. Sample Universe and Auxiliary Structures

The examples throughout this section are all based on an illustrative universe graph $U$ depicted in Figure 1.

[Figure 1: Sample universe graph $U$]
The graph features real-valued edge labels in the $(0, 1]$ interval that represent confidence weights of the edges (the higher the label, the higher the expected degree of association between the corresponding vertices). These edge labels are used when constructing several auxiliary resources from the graph. There are no specific types of edges (i.e., predicates) in the examples since in the experiments reported in this article, we focus only on one type of relationship based on automatically extracted co-occurrence statements.

First of all, we need to define a metric on the vertices. The most straightforward option without any background knowledge on the graph is to use its (weighted) adjacency matrix for constructing characteristic context vectors for every vertex. The vectors can then be used for computing the actual metric. The adjacency matrix $A_U$ of $U$ is presented in Table 1. The context vector $x_{A_U}$ for a vertex $x$ is the row (or column, as the graph is undirected) corresponding to $x$ in the adjacency matrix $A_U$. Using the context vectors, we can define the Euclidean distance (i.e., a metric) on the vertices as $\delta(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ where $x_i, y_i$ correspond to the $i$-th elements of the $x$, $y$ context vectors, respectively. The specific distances (up to the 4th decimal place) between the universe vertices are given in Table 2.

The last auxiliary structure we will need in the following sections (namely for defining complexity measures) is a clustering of the vertices in $U$. An example of a possible clustering is given in Figure 2.
[Figure 2: Cluster structure of $U$]

It is an overlapping clustering that groups vertices with mutual distances below a fixed threshold: $A = \{1, 2\}$, $B = \{1, 4, 5, 6\}$, $C = \{3, 4, 6\}$. Note that the clustering can either be computed from the universe graph itself or provided externally (e.g., in the form of an ontology that defines a taxonomy upon the graph vertices).
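The auxiliary structures above are straightforward to compute. A possible sketch follows (assuming numpy, scipy, scikit-learn and networkx; the edge weights are hypothetical stand-ins, not the values of Figure 1, and note that K-means yields a hard clustering whereas the example clustering is overlapping):

import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

U = nx.Graph()
U.add_weighted_edges_from([(1, 2, 0.5), (2, 3, 1.0), (1, 3, 0.5),
                           (2, 4, 0.5), (2, 6, 0.5), (4, 5, 1.0),
                           (4, 6, 0.5)])  # hypothetical weights

A = nx.to_numpy_array(U, weight="weight")     # adjacency matrix (cf. Table 1)
D = squareform(pdist(A, metric="euclidean"))  # pairwise delta (cf. Table 2)

labels = KMeans(n_clusters=3, n_init=10).fit_predict(A)  # vertex clusters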
Having introduced the sample universe, we can continue with the specific measure definitions which we use later on in the literature-based discovery experiments.

3.2. The Measures

3.2.1. Conservatism

Following the conditions provided in Section 2.2.1, we define a specific instance of the hypothesis graph conservatism measure

$C(H) = \frac{1}{|\pi_s(H, \delta)|} \sum_{p \in \pi_s(H, \delta)} \frac{\delta(v_1, v_{|p|})}{\sum_{i=1}^{|p|-1} \delta(v_i, v_{i+1})}$,

where $\pi_s(H, \delta)$ is a set of all shortest paths in $H$ w.r.t. the Euclidean distance $\delta$ and $p = (v_1, v_2, \ldots, v_{|p|})$ is a specific shortest path of length $|p|$. In other words, the $C$ measure is an arithmetic mean of the shortest path conservatism values where the path conservatism is computed as a fraction of the distance between the extreme vertices of the path and the path length. (Note that if there is only one shortest path guaranteed to exist between any pair of vertices in the $H$ graph, then $|\pi_s(H, \delta)| = \binom{|V_H|}{2}$ as $H$ is expected to be connected.)

The measure satisfies the condition 1. from Section 2.2.1 as it already focuses only on paths with minimal aggregate distance between the consecutive vertices (assuming the sum aggregation). The condition 2. is satisfied as well. For any path $p$, $\delta(v_1, v_{|p|}) \leq \sum_{i=1}^{|p|-1} \delta(v_i, v_{i+1})$. The equality is achieved if and only if the context vectors of the consecutive vertices represent points that lie in a straight line, i.e., maximise the distance between the extreme vertices of the path. Therefore the maximum value 1 of the path conservatism measure is achieved exactly when the extreme distance is maximal.

Example 1.
In Figure 3 there are three hypothesis graphs
$E, F, G$ that exist in the universe $U$ described in Section 3.1. The edges are annotated with the Euclidean distance $\delta$ based on the vertex context vectors (see the examples in Section 3.1 for details).

The numbers of all shortest paths for the hypothesis graphs $E, F, G$ w.r.t. the distance $\delta$ are 3, 3, 6, respectively. The conservatism measure of each hypothesis is the arithmetic mean of its shortest paths' conservatism values. All shortest paths in the triangle $F$ are single edges, so $C(F) = \frac{1}{3}(1 + 1 + 1) = 1$, whereas the two-edge shortest paths $(5, 4, 6)$ in $E$ and $(2, 4, 5)$, $(5, 4, 6)$ in $G$ contribute ratios below 1; evaluating the measure gives $C(F) > C(G) > C(E)$,
[Figure 3: Sample hypothesis graphs
$E, F, G$, with edges annotated by the distance $\delta$]

therefore the hypotheses can be ranked in the $F \succ_C G \succ_C E$ order from the most to the least conservative one.

3.2.2. Modesty

As an approximation of the ideal modesty measure presented in Section 2.2.2, we use the inverse density of the hypothesis graph

$M(H) = \frac{|V_H| (|V_H| - 1)}{2 |E_H|}$.

This function is much easier to compute than the ideal one and is monotonic w.r.t. it. Since the numerators of both functions are fixed, we only need to show that the number of edges is monotonic w.r.t. the number of all simple paths in a hypothesis graph. This is quite easy – an increase in $|E_H|$ (i.e., adding an edge) will cause $|\Pi(H)|$ to grow as well since adding an edge will result in at least one new simple path in $H$, the edge itself. Conversely, if the set $\Pi(H)$ grows, it means that edges had to be added to the $H$ graph as it is the only way how the overall number of paths can be increased.

Example 2.
The number of edges in the
$E, F, G$ graphs from Example 1 is 2, 3, 4, respectively, while the maximum possible number of edges in the corresponding complete graphs is 3, 3, 6. Therefore the modesty values are

$M(E) = \frac{3}{2} = 1.5$, $M(F) = \frac{3}{3} = 1$, $M(G) = \frac{6}{4} = 1.5$

and the modesty ranking of the hypotheses is $E \succ_M F$, $G \succ_M F$. (From here on, we use convenience ordering relations $\succ_X$ for ranking the hypotheses in a decreasing order according to a specific measure $X$: $E \succ_X F$ if and only if $X(E) > X(F)$.)

3.2.3. Simplicity

As stated in Section 2.2.3, we use the dual notion of complexity for measuring hypothesis simplicity. For the specific instance of the measure, we employ Shannon's entropy that has been frequently used for graph complexity [25]. To define the entropy, we utilise the clustering of the hypothesis graph vertices based on their context vectors. Let us assume a vertex labeling $\gamma : V_U \rightarrow 2^L$ where $L$ is a set of cluster identifiers. Then we can define a cluster association probability $p(l, H)$ for a specific cluster $l \in L$ within a hypothesis $H$ as

$p(l, H) = \frac{|\{v \mid v \in V_H \wedge l \in \gamma(v)\}|}{|V_H|}$.

It is a probability that a randomly selected vertex from $H$ belongs to a cluster $l$. If we conceive clusters as higher-level topics the hypothesis graph deals with, then the probability reflects the distribution of the topics across the graph. The $p(l, H)$ values can be used for computing the cluster association entropy for a hypothesis $H$ as

$E(H) = -\sum_{l \in L} p(l, H) \log p(l, H)$.

It reflects the information value of the hypothesis' cluster structure – the more "unpredictably" distributed the clusters, the higher the complexity and also the information value. This conforms to an intuitive assumption that hypotheses dealing with more topics representatively are more informative, i.e., complex. We define two simplicity measures that employ the cluster association entropy and satisfy the respective conditions introduced in Section 2.2.3:

$S_1(H) = E(H)$,  $S_2(H) = \frac{E(U \setminus H)}{E(U)}$.

We use both measures in the following to capture different aspects of simplicity simultaneously.
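A direct transcription of the entropy-based measures could look as follows (our sketch; the logarithm base is not fixed by the text, so base 2 is assumed here):

from math import log2

def cluster_entropy(vertices, gamma):
    # Cluster association entropy E(H); gamma maps a vertex to the set of
    # clusters it belongs to, so the p(l, H) values need not sum to one.
    vertices = list(vertices)
    clusters = set().union(*(gamma[v] for v in vertices))
    e = 0.0
    for l in clusters:
        p = sum(1 for v in vertices if l in gamma[v]) / len(vertices)
        if p > 0:
            e -= p * log2(p)
    return e

# The clustering of the running example (Figures 2 and 4).
gamma = {1: {"A", "B"}, 2: {"A"}, 3: {"C"},
         4: {"B", "C"}, 5: {"B"}, 6: {"B", "C"}}
universe = set(gamma)
S1 = cluster_entropy({4, 5, 6}, gamma)             # S1(E) = E(E)
S2 = (cluster_entropy(universe - {4, 5, 6}, gamma)
      / cluster_entropy(universe, gamma))          # S2(E) = E(U \ E) / E(U)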
Example 3.
In Figure 4 there are the three hypothesis graphs
$E, F, G$ and the universe graph $U$ depicted again, but this time with cluster annotations provided as vertex labels. The cluster association probabilities for each graph are

$p(A, U) = \frac{1}{3}$, $p(B, U) = \frac{2}{3}$, $p(C, U) = \frac{1}{2}$,
$p(A, E) = 0$, $p(B, E) = 1$, $p(C, E) = \frac{2}{3}$,
$p(A, F) = \frac{2}{3}$, $p(B, F) = \frac{1}{3}$, $p(C, F) = \frac{1}{3}$,
[Figure 4: Sample hypothesis graphs
$E, F, G$ and the universe $U$ with cluster annotations: 1: {A,B}, 2: {A}, 3: {C}, 4: {B,C}, 5: {B}, 6: {B,C}]

$p(A, G) = \frac{1}{4}$, $p(B, G) = \frac{3}{4}$, $p(C, G) = \frac{2}{4}$.

The entropies corresponding to these probabilities can be computed for $U$, $E$, $F$, $G$ and the complement graphs $U \setminus E$, $U \setminus F$, $U \setminus G$. The hypothesis $F$ is the lowest-ranking no matter which function we use – it has the lowest entropy and $E(U \setminus F) < E(U)$, therefore it makes the universe more complex. On the other hand, both $E, G$ increase the simplicity of the universe. If only the local complexity of the acceptable hypotheses is relevant (measure $S_1$), then the final ranking is

$G \succ_{S_1} E \succ_{S_1} F$

since $E(G) > E(E)$. However, if the rate of simplifying the universe is more important (measure $S_2$), the ranking is

$E \succ_{S_2} G \succ_{S_2} F$

as $\frac{E(U \setminus E)}{E(U)} > \frac{E(U \setminus G)}{E(U)}$.

3.2.4. Generality

To limit the potentially intractable number of paths in the ideal generality formula introduced in Section 2.2.4, we apply two approximations in its specification. Firstly, we focus only on explanations for the universe vertices $V_{HA}$ that are immediately adjacent to the measured hypothesis $H$. The set of edges that connect these vertices to $H$ can then be defined as $E_{HA} = \{(u, v) \mid (u, v) \in E_U \wedge (u \in V_{HA} \wedge v \in V_H \vee u \in V_H \wedge v \in V_{HA})\}$. The second approximation consists of focusing only on shortest paths w.r.t. the $\delta$ distance. The specific generality measure is then defined as

$G(H) = |\{p \mid p \in \pi_s(U, \delta) \wedge p_1 \in V_{HA} \wedge p_2, p_3, \ldots, p_{|p|} \in V_H\}|$.

The measure corresponds to the number of shortest paths that start in an adjacent vertex and connect it with vertices in the hypothesis graph $H$, thus providing an explanation for it using only $H$. As the graphs $H$ are assumed to be connected, the measure can further be simplified as $G(H) = |E_{HA}| (1 + (|V_H| - 1)) = |E_{HA}| |V_H|$ for graphs where only one shortest path exists between any two vertices.

The $G(H)$ measure uses sum aggregation as the $g$ function present in the general definition. The $f$ function that leads to the presented definition of $G(H)$ returns zero for any vertex from the $V_U \setminus V_H$ set that is not immediately adjacent to $H$. For other vertices, it returns the number of paths that provide an explanation for them in $H$. This number is positively correlated with the number of vertices in $V_H$ as required in the general definition, since the number of paths leading from a vertex to other vertices in a connected graph $H$ is $|V_H| - 1$. The measure is not strictly monotonic w.r.t. the ideal generality measure, though. If the number of shortest paths increases, then the number of all paths naturally has to be higher as well. The other direction is less obvious, and conditional. Assuming the number of all paths in a graph has increased, we have to show that there also has to be more shortest paths. This is not true in general – if edges between distant vertices are added, they may not contribute to increasing the number of shortest paths. However, since the measure intuitively captures the notion of generality in the context of knowledge graphs and is easy to compute, we decided to relax the absolute monotonicity requirement for the sake of practicality.

Example 4.
The sets of vertices adjacent to the
$E, F, G$ hypotheses are

$V_{EA} = \{2\}$, $V_{FA} = \{4, 6\}$, $V_{GA} = \{1, 3\}$

and the corresponding sets of connecting edges are

$E_{EA} = E_{FA} = \{(4, 2), (6, 2)\}$, $E_{GA} = \{(2, 1), (2, 3)\}$.

Since there is only one shortest path between any pair of vertices in our example, the generality measures are

$G(E) = |E_{EA}| \cdot |V_E| = 2 \cdot 3 = 6$, $G(F) = |E_{FA}| \cdot |V_F| = 2 \cdot 3 = 6$, $G(G) = |E_{GA}| \cdot |V_G| = 2 \cdot 4 = 8$

and the resulting ranking is $G \succ_G E$, $G \succ_G F$.

3.2.5. Refutability

Using the shortest paths approximation again, we define a specific refutability measure as

$R_k(H) = \frac{|\pi_s(H, \delta)|}{|\pi_s(H, \delta)| + \sum_{i=1}^{k} |\pi_s(H/R(i), \delta)|}$.

Similarly to Section 3.2.4, we consider only the shortest paths instead of all simple ones, which makes the computation of the measure comparatively easier. Such an approximation is unfortunately not strictly monotonic as shown before, however, we believe that the practicality and intuitiveness of the measure outweighs the partial monotonicity violation.

For the ranking $R$ of the vertices in the $R_k(H)$ measure computation, we use the betweenness centrality which is defined as

$c_B(v, G) = \frac{|\{p \mid p \in \pi_s(G, \delta) \wedge v \in p\}|}{|\pi_s(G, \delta)|}$,

where $v$ is a vertex and $G$ is a graph. In other words, the betweenness centrality of a vertex is the number of shortest paths passing through it divided by the total number of shortest paths. The ranking $R$ ranks the vertices in a decreasing order based on their betweenness centrality. Such a ranking generally does not mean that removal of a high-ranking vertex results in a higher number of shortest paths disappearing when compared to a removal of a lower-ranking vertex – if the graph remains connected in both cases, the number of shortest paths in it will be the same after removal of either node. However, removing a vertex with higher betweenness centrality will result in a relative increase of the remaining paths' lengths. This can lead to a decrease of the graph conservatism and thus also to a decrease of its overall value w.r.t. the hypothesis virtues. Consequently, making a hypothesis weaker more quickly can be seen as refuting it more efficiently. We believe that this justifies the chosen ranking even though it means yet another relaxation of the general requirements. (An alternative option that fully conforms to the requirements would employ simple paths instead of shortest ones and vertex degree instead of betweenness centrality, however, such a solution can easily become intractable.)

Example 5. The sets of shortest paths w.r.t. the $\delta$ distance for the particular hypothesis graphs are

$\pi_s(E, \delta) = \{(5, 4), (5, 4, 6), (4, 6)\}$,
$\pi_s(F, \delta) = \{(2, 1), (2, 3), (1, 3)\}$,
$\pi_s(G, \delta) = \{(2, 4), (2, 4, 5), (2, 6), (4, 5), (4, 6), (5, 4, 6)\}$.

The corresponding vertex betweenness centralities are then

$c_B(4, E) = 1$, $c_B(5, E) = c_B(6, E) = \frac{2}{3}$,
$c_B(1, F) = c_B(2, F) = c_B(3, F) = \frac{2}{3}$,
$c_B(2, G) = c_B(5, G) = c_B(6, G) = \frac{1}{2}$, $c_B(4, G) = \frac{5}{6}$.

The top-1 refutability measure for the hypothesis $E$ can be computed as follows. The centrality-based ranking of the vertices places 4 on the top, therefore we remove it. The result is a disconnected graph consisting of the isolated vertices 5, 6, where no path exists anymore. The top-1 refutability measure of $E$ is thus
$R(E, 1) = \frac{3}{3 + 0} = 1$.

Similarly, the top-1 refutability measures for the remaining two hypotheses (with arbitrary removal vertex selection for $F$ due to the uniform centrality ranking) are

$R(F, 1) = \frac{3}{3 + 1} = 0.75$, $R(G, 1) = \frac{6}{6 + 1} \doteq 0.86$.

The resulting refutability ranking of
$E, F, G$ is $E \succ_R G \succ_R F$.

3.3. Using the Measures in Concert

The specific measures defined in the previous section can be used to rank the hypothesis graphs independently of each other as shown in the examples. However, practical applications will very often imply the necessity to compare hypotheses along all the measures. Lacking any a priori information on which measures may be more relevant for a particular application, we propose the following way of ordering the hypothesis graphs.

Let $\mathcal{H} = \{H_1, H_2, \ldots, H_n\}$ be the set of hypothesis graphs we wish to compare according to a set of measures $\mathcal{X} = \{X_1, X_2, \ldots, X_m\}$ of equal importance. Then we can construct an edge-labeled directed ranking multigraph $R = (\mathcal{H}, E \subseteq \mathcal{H} \times \mathcal{H}, \lambda : E \rightarrow \mathcal{X})$. The multigraph's vertices are the hypotheses in $\mathcal{H}$. The edge set and the labeling function are constructed from the specific measure rankings so that $(H_i, H_j) \in E$, $\lambda(H_i, H_j) = X_k$ if and only if there is a measure $X_k$ such that $H_i \succ_{X_k} H_j$. Using the ranking multigraph $R$, we can define a combined ranking relation $\succ$ on the set $\mathcal{H} \times \mathcal{H}$ as $H_i \succ H_j$ if and only if

$\frac{d_o(H_i, R)}{d_o(H_i, R) + d_i(H_i, R)} > \frac{d_o(H_j, R)}{d_o(H_j, R) + d_i(H_j, R)}$,

where $d_i(H_x, R)$, $d_o(H_x, R)$ is the in-degree and out-degree of the vertex $H_x$ in the multigraph $R$, respectively. In plain words, the combined ranking relation $\succ$ orders the hypotheses based on the relative magnitude of their superiority (out-degree) w.r.t. the specific ranking relations given by the measures.

Example 6.
Figure 5 shows the ranking multigraph corresponding to Examples 1-5. A directed edge from vertex $X$ to $Y$ with a label $Z$ means that $X \succ_Z Y$.
[Figure 5: Ranking multigraph for $E, F, G$]
The in-degrees and out-degrees of
$E, F, G$ in the ranking graph are

$d_i(E) = 3$, $d_o(E) = 4$, $d_i(F) = 6$, $d_o(F) = 1$, $d_i(G) = 3$, $d_o(G) = 7$,

therefore $G \succ E \succ F$ since $\frac{7}{10} > \frac{4}{7} > \frac{1}{7}$.
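The combined ranking can be automated along the following lines (a sketch under the reading suggested by the worked example, where the multigraph edges connect successive elements of each per-measure ranking; ties are broken arbitrarily):

def combined_ranking(hypotheses, measures):
    # Build the ranking multigraph implicitly: for every measure, add an
    # edge from each hypothesis to its successor in that measure's ranking,
    # then order by out-degree / (out-degree + in-degree).
    out_deg = {h: 0 for h in hypotheses}
    in_deg = {h: 0 for h in hypotheses}
    for m in measures:
        ranked = sorted(hypotheses, key=m, reverse=True)
        for better, worse in zip(ranked, ranked[1:]):
            out_deg[better] += 1
            in_deg[worse] += 1
    def score(h):
        total = out_deg[h] + in_deg[h]
        return out_deg[h] / total if total else 0.0
    return sorted(hypotheses, key=score, reverse=True)

# Toy usage with hypothesis names and two made-up measures:
print(combined_ranking(["E", "F", "G"],
                       [lambda h: {"E": 1, "F": 3, "G": 2}[h],
                        lambda h: {"E": 3, "F": 1, "G": 2}[h]]))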
4. Experimental Validation
In order to validate the proposed formalisation of hypothesis virtues in the context of knowledge graphs, we chose to follow up on our work presented in [30] where we addressed automated extraction of conceptual networks from biomedical literature. The work deals with extraction of co-occurrence and similarity relationships from abstracts available on PubMed (c.f., http://www.ncbi.nlm.nih.gov/pubmed) and consequent indexing, querying and navigation of the networks in a knowledge discovery scenario.

As we have shown in [30], the automatically extracted networks can already provide useful insights even for experts in the field, however, they still contain some noise and irrelevant and/or obvious information. Tackling this challenge has been the main practical motivation for the research presented in this article. We believe we can use our approach to identify portions of the automatically extracted graphs that can not only provide a general overview of the domain with less noise, but also isolate valid relationships that are surprising for experts. This can ultimately lead to more efficient machine-aided discovery applications.

In our validation experiments, we utilise the scenarios, data sets and evaluation methodologies elaborated within the field of literature-based discovery which we introduce in Section 4.1 below. Section 4.2 is the methodological core of this part. It presents an evolutionary approach to the refinement of automatically extracted knowledge graphs using the hypothesis virtue measures. Section 4.3 describes the data sets and methods we use for the experimental evaluation. Finally, Section 4.4 discusses the results of the experiments.

Note that we have implemented our approach and the experiments reported in this section using a Python prototype available under the GPL free software license. The corresponding code, experimental data and results are available at http://skimmr.org/hyperkraph/. Detailed README documentation on the implementation and data is provided as a part of the respective archives hosted at the referenced URL. (HYPERKRAPH is a general name we use for the ongoing implementation of prototypes based on the presented research. It stands for HYPothEsis viRtues in Knowledge gRAPHs.)

4.1. Literature-Based Discovery

The field of literature-based discovery is widely considered to stem from the work [44]. Based on [44] and a follow-up article [45], the work [43] introduced the notion of Swanson linking – connecting two pieces of knowledge in isolated documents A and B using concepts from intermediate documents (C) that are directly or indirectly related to A and B. Surveys of recent works addressing this problem are provided in [7, 34, 41].

The application of our framework to refining knowledge graphs automatically extracted from literature is closely related to literature-based discovery. Our goal is to generate a set of graphs that reflect relationships between terms in literature and are optimised w.r.t. hypothesis virtues. Such a structure can very straightforwardly facilitate the process of finding "interesting" links between isolated concepts via intermediates, which is the key problem of literature-based discovery. Therefore we can use the standard approaches and man-made "gold standard" discoveries from that field to experimentally validate our approach in an established application scenario.
4.2. Evolutionary Refinement of Automatically Extracted Knowledge Graphs

The basic assumption we use for validating our framework is that applying the hypothesis virtue measures to refining graphs extracted from literature will facilitate literature-based discovery tasks better than the unrefined graphs. To verify this, we have to tackle the graph refinement first. The key question is:
Given a knowledge graph based on statements automatically extracted from text, how can we refine it so that only the parts of the graph that have comparatively high hypothesis virtue measures remain?
This is essentially an optimisation problem in which we know how to tell whether a solution X is better than Y, but we do not know much about what the actual solutions are and how the main knowledge graph is (or should be) composed of them. Such problems can quite efficiently be tackled by evolutionary computing [9]. In the rest of this section, we describe a specific algorithm for evolutionary refinement of knowledge graphs.
4.2.1. Graph Extraction and Refinement Workflow

Figure 6 presents the high-level overview of the graph extraction and refinement process. First we use our SKIMMR tool [30] to extract basic co-occurrence statements from the input texts. The statements are in the shape of tuples $(t_1, t_2, w_d, T)$, where $t_1, t_2$ are two terms that co-occur in an input text $T$ and $w_d$ is the weight of the co-occurrence based on the sentence distances of the terms within $T$.

In the next step (M2 in Figure 6), we:

1. Use the basic statements to compute corpus-wide co-occurrence weights using normalised point-wise mutual information (illustrated in the sketch below).
2. Encode the terms in the statements using integer identifiers (to optimise the memory usage in the consequent steps).
3. Build a fulltext index upon the lexical vertex labels for accessing them during the evaluation (this mitigates the impact of spelling alternatives and other irregularities in the automatically extracted names).
4. Initialise an undirected edge-labeled universe graph $U$ with edges constructed from the corpus-wide statements. The graph can possibly be limited to edges with normalised point-wise mutual information weights above a pre-defined threshold.
5. Construct a context vector space for the $U$ vertices based on their neighbors and corresponding edge weights.
6. Use the vector space to compute the Euclidean distances between the vertices.

Steps M3 and M4 in Figure 6 perform the K-means clustering of the universe graph $U$ in order to provide a vertex labeling $\gamma$ that associates each vertex with the cluster(s) it belongs to (see Section 4.3.3 for details on the K-means settings in the particular experiments we conducted).
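The weighting in item 1 follows the standard normalised PMI formula; a minimal sketch (the argument names and counting scheme are our assumptions, not necessarily the prototype's):

from math import log

def npmi(pair_count, count_x, count_y, total):
    # Standard normalised PMI in [-1, 1]: PMI(x, y) / -log p(x, y).
    px, py, pxy = count_x / total, count_y / total, pair_count / total
    return log(pxy / (px * py)) / -log(pxy)

w = npmi(12, 40, 30, 1000)   # approx. 0.52 for these toy counts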
[Figure 6: High-level workflow of the graph construction and refinement – D1: texts → M1: SKIMMR → D2: co-occurrence statements → M2: graph generator → D3: edge-labeled graph → M3: K-means module → M4: vertex annotator → D4: clustering → D5: edge- and vertex-labeled graph → M5: evolutionary module → D6: optimised graphs]
[Figure 7: Detailed workflow of the evolutionary refinement – D1: universe graph → M1: initialisation → D2: population → M2: mutation, crossover and validation → D3: expanded population → M3: ranking and trimming → D4: trimmed population, looping back to M2 if not terminating]

At this moment, everything is ready for optimising $U$ according to the hypothesis virtue measures of its sub-graphs. The optimisation step in Figure 6 is performed using a genetic algorithm [9]. Its detailed workflow is presented in Figure 7. The genetic algorithm has the following configurable parameters:

1. mutation and mating probabilities $p_m, p_c$ defining how likely it is for an individual in a population to mutate and mate (i.e., engage in a crossover with another individual);
2. a number $k_m$ defining how many times an individual can attempt to mate in a generation;
3. a maximum number of generations $N_G$;
4. a rate $\rho_p$ of the standard deviation of the population size – it sets the size of the population $P_i$ to $|P_i| = gauss(|P_{i-1}|, \rho_p |P_{i-1}|)$ where $gauss(\mu, \sigma)$ returns a random number from the normal distribution with mean $\mu$ and standard deviation $\sigma$, truncated to an integer;
5. the mean and standard deviation $\mu_i, \sigma_i$ for determining the sizes of the individuals in the initial population.

For specific values of the parameters and a discussion of their influence on the evolution process in our experiments, see Section 4.3.3.

The population is initialised (step M1 in Figure 7) by a repetitive random selection of possibly overlapping stars of size $gauss(\mu_i, \sigma_i)$ from the graph $U$. Stars consist of one "hub" vertex and a set of vertices "fanning out" of the hub via immediate edges. They are a specific type of sub-graphs that can be used as atomic graph construction blocks [25] and thus they are fitting for the purpose of population initialisation.

Step M2 in Figure 7 consists of applying the evolutionary operators on the population and consequent validation of the newly added individuals which discards disconnected ones. The mutation deletes or adds an edge from/to the individual graph with equal probabilities. The crossover combines two parents by randomly selecting half of the edges from each parent and combining them in a new individual. All existing edge labels are copied in the process of creating new individuals.

Step M3 in Figure 7 is essential for the optimisation – it computes the hypothesis virtue measures of each individual in the expanded population and then ranks the population according to the combined ranking $\succ$ introduced in Section 3.3. The population is then trimmed to a random size based on the previous population size (computed using the $\rho_p$ parameter).

Steps M2 and M3 are repeated until a termination condition is met. This can either be reaching a pre-defined number of generations $N_G$, or achieving some sort of population convergence.
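The following skeleton summarises the refinement loop of Figure 7 (a sketch, assuming networkx; the operator details and default parameter values are placeholders rather than the prototype's exact settings, and `rank` stands for the combined hypothesis-virtue ranking of Section 3.3):

import random
import networkx as nx

def mutate(g, universe):
    # Delete or add an edge with equal probabilities (step M2).
    h = g.copy()
    if random.random() < 0.5 and h.number_of_edges() > 1:
        h.remove_edge(*random.choice(list(h.edges)))
    else:
        u, v = random.sample(list(universe.nodes), 2)
        if universe.has_edge(u, v):  # stay a subgraph of the universe
            h.add_edge(u, v, **universe.edges[u, v])
    return h

def crossover(a, b):
    # Combine half of the edges of each parent, copying edge labels.
    child = nx.Graph()
    for parent in (a, b):
        edges = list(parent.edges(data=True))
        for u, v, d in random.sample(edges, len(edges) // 2):
            child.add_edge(u, v, **d)
    return child

def evolve(universe, population, rank, n_gen=50, p_m=0.1, p_c=0.5, rho_p=0.1):
    for _ in range(n_gen):
        offspring = []
        for g in population:
            if random.random() < p_m:
                offspring.append(mutate(g, universe))
            if random.random() < p_c:
                offspring.append(crossover(g, random.choice(population)))
        # Validation: discard disconnected (or empty) individuals.
        valid = [g for g in offspring
                 if g.number_of_nodes() > 0 and nx.is_connected(g)]
        # Ranking and trimming to a randomly perturbed population size.
        size = max(1, int(random.gauss(len(population),
                                       rho_p * len(population))))
        population = rank(population + valid)[:size]
    return population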
4.3. Experimental Set-up

For the evaluation of our approach we chose two standard scenarios in literature-based discovery based on the works [44, 45]. Details on the corresponding data sets and experiments we performed using them are described in the following sections.

4.3.1. Data Sets

We used two data sets in the experimental evaluation, both of which address discovery of connections between previously isolated concepts (and corresponding bodies of literature). One data set is based on [44] that explores the relationship between fish oil and Raynaud's syndrome. The other data set is based on a similar study of previously neglected connections between migraine and magnesium [45]. We refer to these two data sets and the corresponding experiments as $T_R$, $T_M$, respectively.

The initial corpora of texts for the $T_R$, $T_M$ experiments were obtained from PubMed via queries compiled according to the specifications given in [44, 45]. Each of these works defines source and target terms $t_s$, $t_t$ together with a set $I_c$ of intermediate terms that connect them. A query for the PubMed abstracts corresponding to specific $t_s$, $t_t$, $I_c$ is compiled as a disjunction of atomic conjunctions

$\bigvee_{t \in \{t_s, t_t\},\ t_c \in I_c} (t \wedge t_c)$.

The particular queries we used for obtaining the $T_R$, $T_M$ corpora were

("raynaud" AND "blood") OR ("raynaud" AND "viscosity") OR ("raynaud" AND "platelet") OR ("raynaud" AND "vascular") OR ("raynaud" AND "reactivity") OR ("fish oil" AND "blood") OR ("fish oil" AND "viscosity") OR ("fish oil" AND "platelet") OR ("fish oil" AND "vascular") OR ("fish oil" AND "reactivity")

and

("migraine" AND "vasospasm") OR ("migraine" AND "spreading depression") OR ("migraine" AND "vascular reactivity") OR ("migraine" AND "depolarization") OR ("migraine" AND "epilepsy") OR ("migraine" AND "inflammation") OR ("migraine" AND "prostaglandins") OR ("migraine" AND "platelet aggregation") OR ("migraine" AND "serotonin") OR ("migraine" AND "brain anoxia") OR ("migraine" AND "calcium channel blockers") OR ("magnesium" AND "vasospasm") OR ("magnesium" AND "spreading depression") OR ("magnesium" AND "vascular reactivity") OR ("magnesium" AND "depolarization") OR ("magnesium" AND "epilepsy") OR ("magnesium" AND "inflammation") OR ("magnesium" AND "prostaglandins") OR ("magnesium" AND "platelet aggregation") OR ("magnesium" AND "serotonin") OR ("magnesium" AND "brain anoxia") OR ("magnesium" AND "calcium channel blockers"),

respectively. Note that while the $T_M$ query exactly corresponds to the terms given in [45], the $T_R$ query is relaxed to sub-terms as the exact query only yields very few abstracts. The PubMed search was limited to articles indexed until November 1985 and August 1987 for $T_R$, $T_M$, respectively, so that we can compare ourselves to the findings of the original works which have served as a de facto gold standard in the literature-based discovery field [4].

The characteristics of the $T_R$, $T_M$ corpora are summarised in Table 3, where the number of tokens is a sum of the word-length of the documents in the corpus and the number of base statements is the number of the base co-occurrence statements the SKIMMR tool extracted from the corpus.

4.3.2. Knowledge Graph Extraction

To generate knowledge graphs from the text corpora, we use the approach introduced in Section 4.2. We construct the experimental graphs using only co-occurrence statements with above-average positive normalised point-wise mutual information scores. This filters out statements with comparatively low co-occurrence weight. We use the general SKIMMR version that extracts entities based on shallow parsing rather than domain-specific models (see https://github.com/vitnov/SKIMMR for details). This is to demonstrate the generality of our work – if we show that our approach can deliver good results even in quite a specific domain using basic and universally applicable initial text mining, it indicates that it is likely to perform similarly well in any other domain.
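As an aside, the query compilation described in Section 4.3.1 is mechanical; a minimal sketch that reproduces the $T_R$ query above:

def pubmed_query(endpoints, intermediates):
    # Disjunction of atomic conjunctions over {t_s, t_t} x I_c.
    return " OR ".join(f'("{t}" AND "{c}")'
                       for t in endpoints for c in intermediates)

print(pubmed_query(["raynaud", "fish oil"],
                   ["blood", "viscosity", "platelet",
                    "vascular", "reactivity"]))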
The characteristics of the extracted graphs are provided in Tables 4 and 5. The basic characteristics $|V_G|$, $|E_G|$, $dn_G$, $|C_G|$, $|c_G^{max}|$, $|c_G^{avg}|$, $|c_G^{med}|$ in Table 4 are the number of vertices, number of edges, graph density, number of connected components, and the maximum, average and median component size in vertices, respectively. The component-wise characteristics in Table 5 are computed as a weighted arithmetic mean across all the components where the weight is the component size in vertices. The characteristics $rd_G$, $dm_G$ are the graph radius and diameter (minimum and maximum eccentricity, respectively, where the eccentricity of a vertex is its maximum distance to other vertices). The $tr_G$ characteristic is transitivity – the fraction of all possible triangles reflecting the tendency of vertices in the graph to cluster together [39]. The characteristics $asp_G$, $asp_G^\delta$ are average shortest path lengths in terms of edges and the distance labeling, respectively. An additional characteristic of the graphs is the degree distribution depicted in Figure 8 (the plot is log-scaled in both the x- and y-axis).

The extracted graphs both have one large connected component comprising most of the vertices, complemented by other trivial components mostly consisting of one edge. The largest components exhibit the so-called "small-world" property [48] – despite being quite large and having small density, they have relatively small diameters and average shortest paths. This observation is supported by two additional facts. The graphs have relatively high transitivity, i.e., a high tendency of vertices to cluster together which is typical for complex small-world networks [39]. Also, the vertex degree distribution approximately follows the power law as shown in Figure 8, which is characteristic for scale-free networks [31]. This means that the extracted graphs have a relatively densely connected structure with many claims involving frequently repeated concepts, which is largely caused by highly (co)occurring terms. This is perhaps not ideal for making discoveries about as many previously disconnected phenomena as possible, and we later show how our approach can remedy this problem.

4.3.3. Clustering and Refinement Settings

For the clustering, we use the K-means module of the scikit-learn package [32].
[Figure 8: Degree distributions (degree rank plots) for the experimental graphs]

As the algorithm's scalability to large numbers of samples and features is limited by available memory, we partition the set of context vectors corresponding to the universe graph vertices to buckets of size 2,
000 and then run the K-means algorithm on them with the parameter K set to 40. The partitioning is done by incremental random selection of 50 seed vectors from the unpartitioned set, computing their centroid and then filling the partition with the seeds plus up to 1,
950 unpartitioned vectors closest to the centroid. We have experimented with different settings of every parameter, however, we found out that the resulting distributions of vectors into clusters are practically invariant to the settings, with mean and median cluster sizes converging to the same values no matter what the settings were.

The parameters of the evolutionary refinement were the mutation and mating probabilities $p_m$, $p_c$ and the population size deviation rate $\rho_p$ (all set to values in the $(0, 1)$ interval), together with $k_m = 5$, $N_G = 50$, $\mu_i = 100$ and $\sigma_i = 80$. The initial individual size parameters are only reflected in rare extreme cases as the size of the random stars is much more dependent on the data set structure in practice. For the other parameters, we applied values typically used by model approaches presented in the evolutionary computing literature [9]. The number of generations has been set well above a threshold after which the performance of the corresponding populations starts to oscillate around similar evaluation scores (see Section 4.4.1 for details).

The evolutionary refinement with these parameters took 56m and 6h36m for the $T_R$, $T_M$ experiments, respectively, using a laptop made in 2010 with a 4-core CPU, 8GB RAM and the Ubuntu Linux 14.04 OS. The virtue measures (the most demanding part) were computed using six parallel processes. The number of processes can be easily adjusted to the computing power available, which facilitates vertical scalability of the refinement. Horizontal scalability is planned for future versions of the prototype and consists of using a distributed processing library instead of the native Python multiprocessing module.

4.3.4. Evaluation Methods

We use several evaluation methods. Part of them is based on a recent work [4] which defines evidence-based and literature frequency-based evaluation measures within the de facto standard literature-based discovery scenarios elaborated in [44, 45]. The additional benefit of using [4] as a primary reference for the evaluation is that the authors compared results of several representative approaches to literature-based discovery. Thus we can interpret our results within a broader context of the whole field. In addition to the measures defined in [4], we perform qualitative evaluation of the actual contents of the results and compare ourselves to related state of the art where applicable.

The evidence-based evaluation measures the capability of an approach to re-discover the intermediate concepts linking the source and target in the corpus as per discoveries made by human experts. It also measures the importance the approach associates with the re-discovery. For an intermediate $t_c$, the absolute evidence-based evaluation measure directly corresponding to [4] is defined as

$evd(t_c) = \min_{G \in \mathcal{G}_c}(rnk(G))$,

where $\mathcal{G}_c = \{G \mid t_s, t_t, t_c \in V_G \wedge \exists p \in \Pi(G) . p = (t_s, \ldots, t_c, \ldots, t_t)\}$ is a set of solution graphs that contain the source and target terms $t_s, t_t$ linked by the intermediate $t_c$. The function $rnk : \mathcal{G} \rightarrow \mathbb{N}$ is a ranking of all solution graphs $\mathcal{G} = \{G \mid t_s, t_t \in V_G\}$ from the most to the least relevant where the relevance is determined by the specific approach being evaluated.

We construct the sets of ranked solution graphs from the set of individuals in a selected refined generation by: 1. creating a union graph from all population individuals; 2. generating a set of paths between the source and target term vertices that also contain an intermediate vertex; 3. ranking the paths using their hypothesis virtue measures, i.e., the $\succ$ relation, with the population union graph as a universe. The step 2.
can either compute all simple paths or all shortest paths. In our experiments, we use the latter option due to tractability issues. The conception of paths as solution graphs represents another design choice consistent with the previous definitions – a path linking certain concepts is the simplest way of claiming (and potentially also explaining) something about them. (Note that for mapping terms to vertices in the resulting knowledge graphs, we use the fulltext index computed upon the lexical expressions corresponding to the graph vertices. This is done when generating the universe graph, see Section 4.2.1 for details. To get all term manifestations in our automatically extracted knowledge graphs, we look up the term of interest in the index and then manually prune the results to get all alternatives that refer to the corresponding concept.)
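Under this paths-as-solutions design, the evidence-based measures are simple to compute; a sketch (our illustration, with solutions represented as term lists ordered from the most to the least relevant):

def evd(ranked_paths, t_c, t_s, t_t):
    # Absolute measure: best (minimal) rank of a solution linking
    # t_s and t_t via the intermediate t_c; ranks start at 1.
    ranks = [i + 1 for i, p in enumerate(ranked_paths)
             if p[0] == t_s and p[-1] == t_t and t_c in p[1:-1]]
    return min(ranks) if ranks else None

def evd_r(ranked_paths, t_c, t_s, t_t):
    # Mean relative inverse rank of the solutions containing t_c.
    n = len(ranked_paths)
    ranks = [i + 1 for i, p in enumerate(ranked_paths)
             if p[0] == t_s and p[-1] == t_t and t_c in p[1:-1]]
    return sum((n - r + 1) / n for r in ranks) / len(ranks) if ranks else 0.0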
In addition to the absolute evd score, we compute the overall relative importance of an intermediate term $t_c$. This measure is defined as the mean relative inverse rank of the graphs that contain $t_c$ among all solutions, i.e.,

$$evd_r(t_c) = \frac{1}{|\mathcal{G}_c|} \sum_{G \in \mathcal{G}_c} \frac{|\mathcal{G}| - rnk(G) + 1}{|\mathcal{G}|}.$$

It effectively measures the average relative relevance of the hypotheses linking the source and target terms via $t_c$ – the more often the link is discovered in high-ranking graphs, the higher the measure.

The second evaluation method proposed in [4] measures the frequency of the discovered claims in the scientific literature. Similarly to our definition, a path in the result graph is considered a claim in [4]. The literature frequency can be used to define a measure of solution rarity as

$$rar(\mathcal{G}) = \frac{1}{|\pi_s(\mathcal{G}_I)|} \sum_{p \in \pi_s(\mathcal{G}_I)} f_{pm}(Q_A(p)),$$

where $\mathcal{G}_I = \{G \mid G \in \mathcal{G} \wedge \exists t_c \in I_c, p \in \Pi(G)\,.\, p = (t_s, \ldots, t_c, \ldots, t_t)\}$ is the set of solutions that contain an intermediate term, $\pi_s(\mathcal{G}_I) = \bigcup_{G \in \mathcal{G}_I} \pi_s(G)$ is the union of the shortest paths taken across $\mathcal{G}_I$, and $f_{pm}$ is the number of results returned by PubMed for an association query $Q_A(p)$. The query for a path $(p_1, p_2, \ldots, p_{|p|})$ corresponds to the conjunction $\bigwedge_{t \in p} t$ of all terms in the path (with a publication time window limited according to the corresponding experimental corpus). For instance, the path (fish oil, platelet aggregation, Raynaud's syndrome) corresponds to the PubMed query "fish oil" AND "platelet aggregation" AND "Raynaud's syndrome" AND ("0001/01/01"[PDAT] : "1985/11/30"[PDAT]) in the $T_R$ experiment. Finally, the rarity measure can be straightforwardly used for defining an interestingness measure [4] as a normalised inverse of the rarity:

$$int(\mathcal{G}) = \frac{1}{1 + rar(\mathcal{G})}.$$

The qualitative evaluation of the results is based on the sets of topics covered by the particular solutions. A topic is informally defined by potentially relevant terms that lie on a path between the source and target concepts in a solution. Potentially relevant terms are those that refer to non-trivial concepts that may elucidate the meaning of the particular path. Using the notion of topics, we define the measures of topical density, relative topical relevance and relative topical novelty, respectively, as

$$top_d(\mathcal{G}) = \frac{|T_{unq}(\mathcal{G}_I)|}{|T_{all}(\mathcal{G}_I)|}, \quad top_r(\mathcal{G}) = \frac{|T_{rel}(\mathcal{G}_I)|}{|T_{unq}(\mathcal{G}_I)|}, \quad top_n(\mathcal{G}) = \frac{|T_{nvl}(\mathcal{G}_I)|}{|T_{rel}(\mathcal{G}_I)|}$$

for a set $\mathcal{G}$ of all solution graphs. The sets $T_{unq}(\mathcal{G}_I)$, $T_{all}(\mathcal{G}_I)$, $T_{rel}(\mathcal{G}_I)$, $T_{nvl}(\mathcal{G}_I)$ are the sets of unique, all, relevant and novel topics covered by the solution graphs in $\mathcal{G}$ that contain an intermediate term.

The relevance of topics is determined by a review of the available scientific literature. This tells us whether or not a given set of terms can provide a meaningful and non-trivial explanation of the connection between the source and target terms. More specifically, a topic is considered relevant if and only if the following conditions are met simultaneously: 1. the terms in the topic refer to features of a biomedically relevant relationship that can be traced in the literature; 2. the relationship is associated with the corresponding target, source and intermediate terms; 3. the relationship is not trivial – it has to be supported by genuine discoveries presented in the literature, not by obvious statements merely occurring in articles.

A novel topic is one that is relevant and not covered in its entirety by any single published work.
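As an illustration of the literature frequency $f_{pm}$ behind the rarity, interestingness and novelty checks, the association queries can be issued against the public NCBI E-utilities interface; the following is a minimal sketch (the helper names and the hard-coded $T_R$ date window are ours, and a real run should respect the NCBI rate limits):

import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(terms, until="1985/11/30"):
    # f_pm(Q_A(p)): number of PubMed hits for the conjunction of all terms
    # in a path, within the time window of the experimental corpus.
    query = " AND ".join('"%s"' % t for t in terms)
    query += ' AND ("0001/01/01"[PDAT] : "%s"[PDAT])' % until
    response = requests.get(EUTILS, params={"db": "pubmed", "term": query,
                                            "retmode": "json"})
    return int(response.json()["esearchresult"]["count"])

def rarity(shortest_paths):
    # rar: mean literature frequency over the union of shortest claim paths.
    counts = [pubmed_count(path) for path in shortest_paths]
    return sum(counts) / len(counts)

def interestingness(shortest_paths):
    # int: normalised inverse rarity; 1.0 for claims absent from the literature.
    return 1.0 / (1.0 + rarity(shortest_paths))

def is_novel(topic_terms):
    # A relevant topic is novel if no single publication covers all its terms.
    return pubmed_count(topic_terms) == 0

For the example path above, pubmed_count(("fish oil", "platelet aggregation", "Raynaud's syndrome")) would issue exactly the query shown in the text.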
Whether a topic is novel can be determined using a publication search engine such as PubMed, where we check the number of results of a conjunctive query involving all terms in the corresponding claim path. If the number of results is zero, the topic is unique.

We compute the $top_d$, $top_r$, $top_n$ scores for the initially extracted and refined graphs in both experiments, focusing on solutions involving the corresponding source, target and intermediate terms. Whenever applicable, we compare the relevant topics we generated with the topics (re)discovered by related approaches.

4.4. Evaluation Results

We split this section into three parts – first we explain the process of selecting the refined graphs to be evaluated, then we analyse the properties of the selected graphs, and finally we discuss the results of the evaluation.
4.4.1. Selection of the Evaluated Generations

Before analysing the actual results of the evolutionary refinement, we have to select the generation to focus on. A natural criterion for that is the performance of the generations in terms of the evaluation measures. The relative ranking of intermediate concepts (i.e., the $evd_r$ measure) is best suited for this task, as it tells us to what extent the generations tend to “consider” the intermediate connections important. Figure 9 shows how the mean $evd_r$ values for all intermediates evolve throughout the generations for each experiment. The blue and green lines represent the $T_R$, $T_M$ experiments, respectively. The full lines correspond to mean values taken across all intermediate terms (also marked by the “star” character in the plot legend). The dashed lines are mean values omitting the intermediates that are not present in the given generation (marked by the “plus” character in the legend).

Figure 9: Mean relative intermediate ranks through generations

For the $T_R$ experiment, generation 40 clearly performs best, as it contains solution graphs for each intermediate term and their mean relative ranking is very high (within the top 20% of solutions). For the $T_M$ experiment, the situation is less clear. The best generation in terms of the mean across all intermediate terms is number 39; however, if one takes only the present intermediates into account, generations 35-38 all perform better. Yet we decided to further
analyse generation 39, as it covers three intermediates, while generations 35-38 only cover two. From here on, we refer to the selected generations by the $T'_R$, $T'_M$ expressions, respectively.

Further support for the selection of the generations to be analysed can be drawn from the numbers of claims containing the source and target terms, and from the numbers of such claims that also contain an intermediate term. The evolution of these values is depicted in Figure 10. Note that the figure's y-axis is log-scaled due to the different orders of magnitude of the displayed values. Similarly to the previous figure, the blue and green lines represent the $T_R$, $T_M$ experiments, respectively. The full lines correspond to the total number of claims containing the source and target term in a given generation (also marked by the “t” character in the plot legend). The dashed lines represent the fraction of those claims that also contain an intermediate term (marked by the “r” character).

Figure 10: Claim numbers through generations

The total number of relevant claims steadily decreases up until approximately the 20-25th generation and then starts to oscillate. For the relative number of solutions with intermediates, a similar trend can be seen after the 40th generation. This can be interpreted as an indication that the generations are structurally stabilised by then, at least for the evaluation data we work with.
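The generation selection criterion applied above can be sketched as follows (a simplified sketch; evd_r_by_generation is a hypothetical mapping from a generation number to the $evd_r$ value of each intermediate, with None for intermediates absent from that generation):

def select_generation(evd_r_by_generation):
    # Prefer generations covering more intermediates; break ties by the
    # mean evd_r over the intermediates that are actually present.
    def key(generation):
        scores = evd_r_by_generation[generation]
        present = [v for v in scores.values() if v is not None]
        mean = sum(present) / len(present) if present else 0.0
        return (len(present), mean)
    return max(evd_r_by_generation, key=key)

Under this rule, generation 40 is selected for $T_R$, and generation 39 for $T_M$ (coverage of three intermediates outweighing the slightly better means of generations 35-38).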
4.4.2. Properties of the Refined Graphs

Before we proceed with discussing the results, let us have a look at the characteristics of the knowledge graphs corresponding to the generations we selected for evaluation. Tables 6 and 7 present the same type of data as the tables in Section 4.3.2. The extra rows with the ∆ prefixes show the relative difference between the refined and initial graphs. The columns represent exactly
the same measures as in the tables in Section 4.3.2 – the number of vertices, number of edges, graph density, number of connected components, and the maximum, average and median component sizes in nodes ($|V_G|$, $|E_G|$, $dn_G$, $|C_G|$, $|c^{max}_G|$, $|c^{avg}_G|$, $|c^{med}_G|$), and the graph radius, diameter, transitivity and average shortest path lengths in terms of edges and of the distance labelling ($rd_G$, $dm_G$, $tr_G$, $asp_G$, $asp^{\delta}_G$). Figure 11 contains plots of the degree distributions in the refined graphs.

Table 6: Basic characteristics of the evolved experimental graphs (rows $T_R$, $T'_R$, $T_M$, $T'_M$ and the corresponding ∆ rows; columns $|V_G|$, $|E_G|$, $dn_G$, $|C_G|$, $|c^{max}_G|$, $|c^{avg}_G|$, $|c^{med}_G|$)

Table 7: Component-wise characteristics of the evolved experimental graphs (rows $T_R$, $T'_R$, $T_M$, $T'_M$ and the corresponding ∆ rows; columns $rd_G$, $dm_G$, $tr_G$, $asp_G$, $asp^{\delta}_G$)

Figure 11: Degree distributions for the evolved experimental graphs
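The structural measures reported in Tables 6 and 7 are standard graph statistics; a minimal sketch of computing them for an undirected networkx graph follows (the distance labelling variant $asp^{\delta}_G$ is omitted, as it depends on our specific edge labelling):

import statistics
import networkx as nx

def graph_characteristics(G):
    # Component structure (Table 6).
    components = [G.subgraph(c) for c in nx.connected_components(G)]
    sizes = sorted((c.number_of_nodes() for c in components), reverse=True)
    # Radius, diameter and average shortest path length are only defined
    # for connected graphs, so they are taken on the largest component.
    largest = max(components, key=len)
    return {
        "|V_G|": G.number_of_nodes(), "|E_G|": G.number_of_edges(),
        "dn_G": nx.density(G), "|C_G|": len(components),
        "|c_max_G|": sizes[0], "|c_avg_G|": sum(sizes) / len(sizes),
        "|c_med_G|": statistics.median(sizes),
        "rd_G": nx.radius(largest), "dm_G": nx.diameter(largest),
        "tr_G": nx.transitivity(G),
        "asp_G": nx.average_shortest_path_length(largest),
    }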
The refined graphs still contain about 65% and 72% of the original vertices for the $T_R$, $T_M$ experiments, respectively; the edges, however, are much more pruned, to about 9% and 10%, respectively. The graph density is thus lower too (at about 20% of the original value). The numbers of connected components do not change much, which is only to be expected given the nature of the population preparation and the tendency of the evolution process to preserve connectedness. The sizes of the components are more or less proportional to the reduction of the vertex number.

More interesting are the component-wise characteristics of the refined graphs summarised in Table 7. The radius, diameter and average shortest path lengths are all increased, by up to 78% and no less than 32%, despite the graphs getting smaller. The clustering coefficient decreases quite radically – to about 3.8% and 5.4% of the original value for the $T_R$, $T_M$ experiments, respectively. The vertex degree distribution still approximately follows a power law; however, the curve is not as steep as for the original graphs. These combined characteristics indicate that the refined graphs exhibit the small-world property to a much lower extent than the original ones. This means that they are structurally more evenly organised and tend to have fewer vertices or vertex groups that connect large portions of the graph through very few edges. A possible consequence of this fact is lower redundancy and a higher rate of non-obvious connections in the refined graphs. Indeed, the analysis of the data w.r.t. the standard literature-based discovery application scenarios confirms this, as we show in the next section.

4.4.3. Performance of the Refined Graphs

In this section, we first discuss the performance of our experiments w.r.t. the quantitative measures used by related approaches. This is then followed by a qualitative analysis of the knowledge graphs we generated.

Table 8 lists the values of the evd measure for the $T_R$ data set. Our approach (the N column) is compared to the works [4, 42, 49, 11, 16] in the columns C, S, W, G and H, respectively. For our approach, we list both the evd and $evd_r$ values, while for the others only evd is present, as they do not consider $evd_r$. We also provide $|\mathcal{G}_c|$, i.e., the number of solution graphs with intermediates.

Table 8: Evidence-based evaluation results for the $T_R$ data (rows Blood Viscosity, Platelet Aggregation, Vascular Reactivity; for N the columns evd, $evd_r$, $|\mathcal{G}_c|$; for C [4], S [42], W [49], G [11], H [16] the evd values; e.g., Blood Viscosity: evd 5, $evd_r$ 0.98, $|\mathcal{G}_c|$ 1, C 15*)

The evd numbers correspond to the best rank of a result that contains the given intermediate term. The “-” character means that the intermediate cannot be found in any result of that approach. If there is a “Y”, the intermediate can be found in the results but no ranking is provided. Finally, the results with “*” in the C column indicate that the intermediate can only be found indirectly, by manually exploring a broader context of the result [4].

Our approach finds all the intermediate terms, which makes its performance equivalent to or better than the related approaches in this respect. Blood viscosity and platelet aggregation are placed among the top 16% of the results (out of 205 in the $T_R$ experiment), while vascular reactivity is considered to be a relatively less important intermediate.

Table 9 lists the same type of results as Table 8, only for the $T_M$ experiment and a slightly different set of related works. Note that the related works are sometimes inconsistent in the exact wording of the intermediate terms; therefore, we only focused on nine out of the eleven intermediates, namely those for which we were able to clearly match up the different wording alternatives.

The quantitative results of our approach are sparser than in the case of the $T_R$ experiment.
This has been caused mainly by the minimalistic, domain-agnostic approach we chose, which resulted in a relatively low coverage of the intermediate synonyms appearing in the data (the fulltext mapping could only discover terms rather similar to the canonical intermediate form used as a query, while many synonyms are quite dissimilar strings). All related approaches but one [11] use term expansion and mapping based on biomedical vocabularies like MeSH, and some even use quite extensive manual interventions (see Section 5.4 for details). Despite these limitations, we re-discovered five out of nine intermediates. Out of these, only three were discovered using a mature-enough generation of the refined knowledge graph, though.

For the intermediates we managed to find, we achieved results comparable to or better than the other approaches. For instance, three out of five related approaches were not able to re-discover the cortical depression intermediate, which is considered very important in [45].

Intermediate                      evd    evd_r   |G_c|   C [4]   S [42]   W [49]   B [2]   G [11]
Calcium Channel Blockers           -       -       -      22       3        Y       10       1
Epilepsy                          23     0.628     7       9*      -        Y        8       3
Brain Anoxia / Hypoxia             -       -       -       -       5        -        6      77
Inflammation                       -       -       -       3*
                                 335     0.333     2       1*
                                 352     0.274     4       4       1        Y       42      27
Serotonin                          -       -       -       1       1        Y        5       1
Cortical / Spreading Depression   58     0.468     3       -       6        -       45       -
Vascular Mechanism / Reactivity

Table 9: Evidence-based evaluation results for the $T_M$, $T'_M$ data

The overall results of the evidence-based evaluation are encouraging. In the $T_R$ experiment, our approach performed better than [4, 16], worse than [42, 11] and equally to [49]. In the $T_M$ experiment, we bettered [4, 49, 11], while [42, 2] outperformed us. In total, we did better than more than half of the related approaches in terms of the intermediate ranking.

Note that a direct comparison of the ranking results is conceptually difficult, since the approaches generate rather varied forms of results, e.g., mere terms in [11] or oriented multigraphs in [4]. However, we can at least give this basic summary, which we corroborate by analysing the actual contents of the results later on. We also further discuss the major comparative benefits of our approach in Section 5.4.

The rarity and interestingness measures for the two experiments are given in Table 10. We can only compare ourselves to [4], as the measures were defined and used there for the first time.

Table 10: Rarity and interestingness results (columns $rar(\mathcal{G})$ and $int(\mathcal{G})$ for N and C [4]; rows $T_R$, $T_M$)

The average results of our approach are lower than those in [4] for the $T_R$ experiment. However, the median rarity and interestingness of the paths generated in our experiment are 0 and 1, respectively – only about one third of the $T_R$ path associations have a non-zero frequency on PubMed. This means that two thirds of the claims generated by our approach have the same performance in terms of rarity and interestingness as in [4]. The average results of the $T_M$ experiment are better in our case. More than 98% of the $T_M$ claims have zero association frequency on PubMed.
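The relative importance rankings reported in Table 11 below are normalised rank positions of the re-discovery terms under vertex degree and betweenness centrality; a minimal sketch (the exact normalisation shown is our illustrative reading of the measure):

import networkx as nx

def relative_rank(scores, term):
    # Fraction of vertices the given term outranks under the measure;
    # values close to 1.0 mean the term is among the most central vertices.
    ordered = sorted(scores, key=scores.get)
    return ordered.index(term) / (len(ordered) - 1)

def degree_rank(G, term):
    return relative_rank(dict(G.degree()), term)

def betweenness_rank(G, term):
    return relative_rank(nx.betweenness_centrality(G), term)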
Terms            degree ranking                         betweenness centrality ranking
                 $T_R$    $T'_R$   $T_M$    $T'_M$      $T_R$    $T'_R$   $T_M$    $T'_M$
Source, target   0.522    0.609    0.541    0.708       0.537    0.592    0.656    0.736
Intermediates    0.563    0.729    0.557    0.57        0.678    0.699    0.584    0.575

Table 11: Degree-based ranking of the re-discovery terms

The importance of the sources, targets and intermediates in terms of degree is increased by the refinement in both experiments. The increase is largest for the source and target terms in the $T_M$ experiment and for the $T_R$ intermediates. The importance in terms of betweenness centrality increases relatively less, with the $T_M$ intermediates actually becoming slightly less important. These observations are consistent with the evidence-based evaluation in the sense of the “sensitivity” of the experimental data sets towards the source, target and intermediate terms. The refinement of the $T_R$ graph clearly raises the importance of all the vertices, especially the intermediates. Indeed, all the terms are present in relatively highly ranking claims of the resulting $T'_R$ graph. For the $T_M$ data set, where only the importance of the source and target vertices is markedly rising, the results are much sparser – although the $T'_M$ graph contains many claims connecting the source and target, there are relatively few intermediates from [45] present in these claims.

The qualitative analysis of the solution contents further elaborates on the above observations about the initial and resulting graph structures. As specified in Section 4.3.4, the analysis is based on the topics covered by the solution graphs. These are terms that provide additional context for the intermediates in the solutions. We provide comprehensive lists of the unique context topics in Appendix A, together with references to supporting literature.

The contents of Appendix A are summarised in Table 12, which contains the $top_d$, $top_r$, $top_n$ score values for the initial and refined knowledge graphs in both experiments.

Table 12: Topical quality scores ($top_d$, $top_r$, $top_n$ for $T_R$, $T'_R$, $T_M$, $T'_M$)

Note that, as we are not experts in the domains involved, we adopted a very conservative strategy for determining topic relevance: if we could not directly verify a particular relationship between the biomedical concepts present in the solution graphs through a review of the published literature via PubMed, we asserted the corresponding solution irrelevant. We encourage more knowledgeable readers to suggest possible updates of the detailed tables in Appendix A.

The table shows that our approach improves the quality of the extracted knowledge graphs. The topical density $top_d$ (i.e., the ratio of unique topics among the paths connecting the source and target terms) increases by about 27% and 122% for the $T_R$, $T_M$ experiments, respectively. The relevance $top_r$ increases by about 31% and 46% for $T_R$, $T_M$, respectively. Finally, the relative topical novelty $top_n$ increases by about 86% for the $T_M$ experiment. In the case of the $T_R$ experiment, the measure is slightly lower for $T'_R$ than for $T_R$; however, there is only one non-novel solution in both knowledge graphs, and the decrease of the relative $top_n$ value is caused by the lower total number of solutions in the refined graph.

These results confirm our assumption that the refinement improves the quality of the statements extracted from the literature, at least in the context of the two standard literature-based discovery scenarios. The improvement in quality is three-fold. Firstly, the refined knowledge graphs are less redundant (the topical density is higher).
Secondly, there are markedly more relevant solutions in the results. And thirdly, the refined solutions are largely non-obvious (a high $top_n$ measure).

A direct and exact comparison of our qualitative results to the related state of the art is unfortunately impossible due to the afore-mentioned differences in the solution representations. However, we can at least discuss the commonalities and differences informally. Figure 12 displays the hierarchy of topics covered by the $T'_R$ results. Each vertex in the hierarchy graph represents a part of a topic. The roots of the presented hierarchies are the intermediate concepts. The vertices shared across multiple topics have normal outlines, while the vertices that complete the topics on the way from the root have bold outlines.

The impact of glyceryl trinitrate on vasodilation, and consequently also on blood flow, has been studied in the context of a possible treatment of Raynaud's syndrome [19]. Our method reflects these findings by constructing a corresponding connection between Raynaud's syndrome and platelet aggregation, which is quite closely related to blood flow [17]. Phosphatidylcholine, also a relatively common vertex in the generated claims, refers to a class of phospholipids that is closely related to the metabolism of fatty acids, including those found in fish oils [1]. The vertices connected to phosphatidylcholine mediate the relationship between fish oil and platelet aggregation in the solutions. The topic with ADP-induced platelet aggregation [35] specifies the type of platelet aggregation that fish oils can influence. The solutions concerned with the anti-thrombotic effect put this vertex in connection with fish oils, possibly with an intermediate vertex referring to myocardial infarction. This corresponds to the anti-thrombotic effect of fish oils demonstrated for instance in [52]. Finally, one of our solutions identified a link between platelet aggregation and fish oils via their influence on the levels of plasma beta-thromboglobulin, a marker in ischemic heart disease [13].

The solution involving the vascular reactivity intermediate puts it in the context of the influence of fish oils on lower vascular resistance, as discussed for instance in [26].

Figure 12: Hierarchy of relevant topics in $T'_R$

Comparing the contents of our $T'_R$ solutions with the related state of the art approaches, we can only refer to [4] and [44], as the other works generate mere lists of possible intermediates without further context. Many contexts associated with the intermediates as possible explanations of the connections are missing in the related works. Examples are blood flow, glyceryl trinitrate, ADP-induced platelet aggregation, phosphatidylcholine or plasma beta-thromboglobulin within ischemic heart disease.
However, most of these connections are rather explanatory and not essential in the scope of Raynaud's syndrome, despite being valid. In [4], many of the graphs involve epoprostenol (essentially a prostaglandin) as a mediator of the influence of fish oils on platelet aggregation. This is consistent with [44], which establishes the connection between fish oil and platelet aggregation as a result of an increased level of prostaglandins. This context is missing in our results that involve the intermediates; however, it is present twice among the top-ten solutions (at ranks 4 and 8). Once it appears in relation to the action of the drug indomethacin, and then also in relation to luteolytic activity in women with Raynaud's disease. These are potentially interesting findings that extend the results produced by comparable state of the art approaches.

Figure 13 displays the hierarchy of topics covered by the $T'_M$ results. One solution involving the epilepsy intermediate puts it in the context of magnesium being used as a mechanism for the management of reverberating brain waves [40]. These are associated with epilepsy and vertigo attacks, and the solution suggests that treatments for these conditions may be used for migraine as well. The other claims related to epilepsy all share multifocal EEG abnormalities, which are characteristic of epilepsy [29]. Two different types of claims complemented these findings – two solutions dealing with magnesium concentrations in cerebrospinal fluid in relation to migraine [37], and one solution related to transmitter release and nerve stimulation. The solutions involving the cortical spreading depression intermediate were all related to similar concepts as the epilepsy ones. This is not surprising, since cortical spreading depression is quite closely related to seizures [10].

Figure 13: Hierarchy of relevant topics in $T'_M$

Similarly to the $T_R$ experiment, we can only compare the contents of our $T'_M$ solutions to [4] and [45], which are the only works that provide context in addition to the intermediates. The graphs presented in [4] for migraine and magnesium are generally much sparser than those for Raynaud and fish oil (typically containing only the source, target and intermediate node). Moreover, none of the results discussed in the article in detail concern epilepsy or spreading depression. The work [45] confirms the close relationship between epilepsy and cortical spreading depression, which is consistent with a straightforward interpretation
of our results. Our solutions also managed to bring up the relationship between magnesium and cerebrospinal fluid in the context of epileptic attacks. In addition to that, our results appear to strongly associate migraine with multifocal EEG abnormalities. This is consistent with the relationship between the abnormalities and headaches demonstrated for instance in [12]. Other potentially interesting findings not covered by the related works are vertigo, reverberation and the relationship between migraine, cortical spreading depression and rolandic epilepsy [50].
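For completeness, the topical quality scores discussed above reduce to simple set ratios over the manually labelled topic sets (a minimal sketch; the relevance and novelty labels come from the literature review described in Section 4.3.4):

def topical_scores(all_topics, relevant_topics, novel_topics):
    # all_topics: one entry per solution path; topics are hashable term tuples.
    unique_topics = set(all_topics)
    top_d = len(unique_topics) / len(all_topics)       # topical density
    top_r = len(relevant_topics) / len(unique_topics)  # relative topical relevance
    top_n = len(novel_topics) / len(relevant_topics)   # relative topical novelty
    return top_d, top_r, top_n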
5. Related Work
We split this section into four thematic blocks that correspond to the main theoretical and application-specific facets of our work. In particular, we review the areas of: 1. automated discovery; 2. ontology learning; 3. discovery supported by knowledge graphs; 4. literature-based discovery.

5.1. Automated Discovery
Research into ways in which discoveries can be automated or facilitated by machines dates back to the dawn of the digital computer era. The work [27] provides a comprehensive analysis of the discovery process operationalised as creative problem solving. It reviews several classic machine discovery systems and the heuristics used by them, and also mentions several properties of worthy discoveries, such as novelty and value. A more recent related work [18] reviews the major approaches to studying the process of scientific discovery, provides another survey of automated discovery systems and analyses additional features of relevant discoveries, such as surprise. The works [21, 20] review still more machine discovery systems and heuristics, and identify features like refutability and simplicity as essential to discoveries. One of the most recent and relevant works from this area is [23]. It builds on [27, 18] and introduces formalisations of several discovery features. In particular, it models novelty and value using metric spaces, and surprise using Bayesian probabilities.

The discovery features discussed in the referenced works conform to our virtue definitions, although most of the works do not provide a systematic formalisation, only rather application-specific implementations. For instance, refutability and simplicity as reviewed in [21] directly correspond to our virtues. Surprise and novelty discussed in the other works can be modelled by putting emphasis on radical claims, as addressed by the conservatism virtue, only using different distance metrics for each of the respective features. We believe that our approach presents a new way of formalising discovery features that is consistent with the related state of the art, but is more systematic, comprehensive and extensible. In addition, we provide an actionable set of measures implemented in the context of knowledge graphs. This enables the universal applicability of our research, which is not the case for most of the rather specific afore-mentioned approaches.

5.2. Ontology Learning
In the last fifteen years, there has been growing interest in exploring the potential of automatically extracted graph structures for knowledge discovery. Many such approaches can be clustered under the umbrella of ontology learning [22], which aims at extracting complex statements from unstructured textual resources. This is done using specifically tailored methods from AI disciplines like natural language processing and machine learning.

As a recent survey [51] shows, the applicability of existing ontology learning approaches to (semi-)automated knowledge discovery is still quite limited. Many of the techniques depend on manually curated resources. They also introduce a lot of assumptions during the extraction process (based on, for instance, linguistic facts valid only in the context of a particular language or discourse), which limits their universal applicability. Another problem is that the more complex the knowledge representation the learned ontologies use, the more restrictive they are about its meaning. This typically leads to brittleness w.r.t. the often inherently vague and contextual nature of the knowledge they represent, which can easily cause problems in machine-aided knowledge discovery scenarios where we typically want to represent the knowledge implied by the input data in as unbiased a way as possible. Another practical limitation is that most ontology learning systems do not scale very well, as reported in [51].

5.3. Discovery Supported by Knowledge Graphs
More recent works related to machine discovery using knowledge graphs include [6, 8, 28], which also contain comprehensive reviews of prior similar approaches. The approach elaborated in [6] presents methods for knowledge discovery in RDF [24] data based on user-defined query patterns and analytical perspectives. Our approach complements [6] by offering means for the automated analysis and refinement of knowledge graphs using application-independent, well-founded features.

Google's Knowledge Vault [8] presents a web-scale approach to probabilistic knowledge fusion that uses graphs represented in the RDF format. It tackles the scalability vs. accuracy trade-off of the manual and automatic approaches to the construction of knowledge graphs. This is done by refining statements extracted from web content using models learned from pre-existing, highly accurate knowledge bases like YAGO or Freebase. Additional details and the broader theoretical context of the approach introduced in [8] are given in [28], which offers a comprehensive review of relational machine learning approaches in the context of RDF-compatible knowledge graphs. The main advantage of our approach w.r.t. the works [8, 28] is that we are not critically dependent on a background knowledge model. In addition, we present a complementary, well-founded approach to determining which relationships in automatically extracted knowledge graphs are worth preserving. Having said that, the techniques reviewed in [28] can certainly provide valuable hints for future extensions of our approach to graphs with oriented edges representing more than one type of relationship (i.e., RDF graphs).

5.4. Literature-Based Discovery
As our approach has been validated by experiments in literature-based discovery, we need to position ourselves within that field as well. Surveys of related works (focusing mostly on the domain of life sciences) are provided in [7, 34, 41]. The specific approaches we compare ourselves to are described in [4, 2, 16, 49, 42, 11]. In most cases where we were able to directly compare our results with the related works, our approach was at least as good as, and often better than, the state of the art. In addition, we managed to hint at several relevant insights that were not even discussed by the human expert in the original studies [44, 45].

The most significant advantages of our approach are, however, as follows. 1. It is fully automatic – the only manual action we performed was pruning the fulltext search results when mapping terms to the corresponding vertices, and this is only required for the evaluation, not for the method itself. 2. There are no domain-specific dependencies, and thus our work is readily applicable to any field, not just biomedical literature-based discovery. 3. We produce extensive contextual information that can facilitate the interpretation of the results and thus make the machine-aided discovery process more efficient. 4. Our approach is based on theoretical foundations motivated by the state of the art philosophical study of the key features of scientific discoveries.

The works [42, 16, 49] all depend on rather extensive manual effort (definition of semantic types and discovery patterns, result pruning, etc.). The approaches [4, 2, 11] are automated; however, only [4] provides broader context in order to elucidate the connections. Moreover, all works but [11] substantially rely on an external domain-specific source of background knowledge and/or domain-specific NLP tools, namely the MeSH and UMLS vocabularies [3] and the tools SemRep [38] and BioMedLEE [5]. It is quite plausible to assume that without these resources, the related approaches dependent on them would perform much less favourably when compared to our implementation. Last but not least, all the related approaches lack the universally applicable theoretical foundations presented as the core contribution of this article.
6. Conclusions and Future Work
We have presented a novel approach to discovery informatics that is based on a formalisation of hypothesis virtues in the context of knowledge graphs. We have shown that the approach is naturally motivated, well-founded, extensible and universally applicable. It can be used as a broader theoretical frame for other approaches to machine discovery, as briefly outlined in Section 5. We have delivered an implementation of the presented research and performed its experimental validation using standard scenarios in literature-based discovery. A successful comparison with related state of the art tools demonstrates the practical relevance of our work.

In the near future, we will extend the theoretical framework in order to address directed multi-graphs with predicate edge labels and more complex semantics associated with particular edge and vertex types. This will allow for a straightforward application of our approach to more expressive knowledge graphs, such as RDF [24] knowledge bases and ontologies in the Linked Open Data cloud [14]. Furthermore, we intend to continue demonstrating the universality of our framework by using it in other experimental scenarios targeted by related works in machine discovery. We also plan to explore the complex relationships between specific measures and their influence on the properties of the evolutionary refinement process (e.g., convergence, optimality and completeness bounds). This will lead to a deeper understanding of the refinement, and therefore also to more efficient implementations. Finally, and perhaps most importantly, we would like to use our approach in scenarios involving actual new discoveries, in direct collaboration with the corresponding domain experts.
References

[1] Abe, E., Ikeda, K., Nutahara, E., Hayashi, M., Yamashita, A., Taguchi, R., Doi, K., Honda, D., Okino, N., Ito, M., 2014. Novel lysophospholipid acyltransferase PLAT1 of Aurantiochytrium limacinum F26-b responsible for generation of palmitate-docosahexaenoate-phosphatidylcholine and phosphatidylethanolamine. PloS One 9 (8), e102377.
[2] Blake, C., Pratt, W., 2002. Automatically identifying candidate treatments from existing medical literature. In: AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases. pp. 9–13.
[3] Bodenreider, O., 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (suppl 1), D267–D270.
[4] Cameron, D., Kavuluru, R., Rindflesch, T. C., Sheth, A. P., Thirunarayan, K., Bodenreider, O., 2015. Context-driven automatic subgraph creation for literature-based discovery. Journal of Biomedical Informatics. In press.
[5] Chen, L., Friedman, C., 2004. Extracting phenotypic information from the literature via natural language processing. Medinfo 11 (Pt 2), 758–62.
[6] Colazzo, D., Goasdoué, F., Manolescu, I., Roatis, A., 2014. RDF Analytics: Lenses over Semantic Graphs. In: Proceedings of WWW'14. ACM.
[7] de Bruijn, B., Martin, J., 2002. Getting to the (c)ore of knowledge: mining biomedical literature. International Journal of Medical Informatics 67 (1-3), 7–18.
[8] Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W., 2014. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 601–610.
[9] Eiben, A. E., Smith, J., 2007. Introduction to Evolutionary Computing. Springer.
[10] Fabricius, M., Fuhr, S., Willumsen, L., Dreier, J. P., Bhatia, R., Boutelle, M. G., Hartings, J. A., Bullock, R., Strong, A. J., Lauritzen, M., 2008. Association of seizures with cortical spreading depression and peri-infarct depolarisations in the acutely injured human brain. Clinical Neurophysiology 119 (9), 1973–1984.
[11] Gordon, M. D., Lindsay, R. K., 1996. Toward discovery support systems: A replication, re-examination, and extension of Swanson's work on literature-based discovery of a connection between Raynaud's and fish oil. Journal of the American Society for Information Science 47 (2), 116–128.
[12] Guidetti, V., Fornara, R., Marchini, R., Moschetta, A., Pagliarini, M., Ottaviano, S., Seri, S., 1986. Headache and epilepsy in childhood: analysis of a series of 620 children. Functional Neurology 2 (3), 323–341.
[13] Hay, C., Durber, A., Saynor, R., 1982. Effect of fish oil on platelet kinetics in patients with ischaemic heart disease. The Lancet 319 (8284), 1269–1272.
[14] Heath, T., Bizer, C., 2011. Linked Data: Evolving the Web Into a Global Data Space. Morgan & Claypool.
[15] Honavar, V. G., 2014. The promise and potential of big data: A case for discovery informatics. Review of Policy Research 31 (4), 326–330.
[16] Hristovski, D., Friedman, C., Rindflesch, T. C., Peterlin, B., 2006. Exploiting semantic relations for literature-based discovery. In: AMIA Annual Symposium Proceedings. Vol. 2006. American Medical Informatics Association, p. 349.
[17] Jackson, S. P., 2007. The growing complexity of platelet aggregation. Blood 109 (12), 5087–5095.
[18] Klahr, D., Simon, H. A., 1999. Studies of scientific discovery: Complementary approaches and convergent findings. Psychological Bulletin 125 (5), 524.
[19] Kleckner, M. S., Allen, E. V., Wakim, K. G., 1951. The effect of local application of glyceryl trinitrate (nitroglycerine) on Raynaud's disease and Raynaud's phenomenon: studies on blood flow and clinical manifestations. Circulation 3 (5), 681–689.
[20] Langley, P., 2000. The computational support of scientific discovery. International Journal of Human-Computer Studies 53 (3), 393–410.
[21] Langley, P., Zytkow, J. M., 1989. Data-driven approaches to empirical discovery. Artificial Intelligence 40 (1-3), 283–312.
[22] Maedche, A., Staab, S., 2004. Ontology learning. In: Staab, S., Studer, R. (Eds.), Handbook on Ontologies. Springer, Ch. 9, pp. 173–190.
[23] Maher, M. L., Fisher, D. H., 2012. Using AI to evaluate creative designs. In: 2nd International Conference on Design Creativity, Glasgow, UK.
[24] Manola, F., Miller, E., 2004. RDF Primer. Available at (November 2008): .
[25] Mowshowitz, A., Dehmer, M., 2012. Entropy and the complexity of graphs revisited. Entropy 14 (3), 559–570.
[26] Mozaffarian, D., 2007. Fish, n-3 fatty acids, and cardiovascular haemodynamics. Journal of Cardiovascular Medicine 8, S23–S26.
[27] Newell, A., Shaw, J. C., Simon, H. A., 1959. The processes of creative thinking. Rand Corporation, Santa Monica, CA.
[28] Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E., 2015. A review of relational machine learning for knowledge graphs: From multi-relational link prediction to automated knowledge graph construction. arXiv preprint arXiv:1503.00759.
[29] Noriega-Sanchez, A., Markand, O. N., 1976. Clinical and electroencephalographic correlation of independent multifocal spike discharges. Neurology 26 (7), 667–667.
[30] Nováček, V., Burns, G. A., 2014. SKIMMR: Facilitating knowledge discovery in life sciences by machine-aided skim reading. PeerJ. Available at https://peerj.com/articles/483/.
[31] Onnela, J.-P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., Kertész, J., Barabási, A.-L., 2007. Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences 104 (18), 7332–7336.
[32] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
[33] Popper, K., 2005. The Logic of Scientific Discovery. Routledge.
[34] Preiss, J., Stevenson, M., McClure, M. H., 2012. Towards semantic literature based discovery. In: 2012 AAAI Fall Symposium Series: Information Retrieval and Knowledge Discovery in Biomedical Text. Vol. 30. AAAI, pp. 7–18.
[35] Puri, R. N., 1999. ADP-induced platelet aggregation and inhibition of adenylyl cyclase activity stimulated by prostaglandins: signal transduction mechanisms. Biochemical Pharmacology 57 (8), 851–859.
[36] Quine, W. V., Ullian, J. S., 1978. The Web of Belief. McGraw-Hill.
[37] Ramadan, N., Halvorson, H., Vande-Linde, A., Levine, S. R., Helpern, J., Welch, K., 1989. Low brain magnesium in migraine. Headache: The Journal of Head and Face Pain 29 (9), 590–593.
[38] Rindflesch, T. C., Fiszman, M., Libbus, B., 2005. Semantic interpretation for the biomedical research literature. In: Medical Informatics. Springer, pp. 399–422.
[39] Scott, J., 2012. Social Network Analysis. Sage.
[40] Shibata, M., Bures, J., 1975. Techniques for termination of reverberating spreading depression in rats. Journal of Neurophysiology 38 (1), 158–166.
[41] Smalheiser, N. R., 2012. Literature-based discovery: Beyond the ABCs. Journal of the American Society for Information Science and Technology 63 (2), 218–224.
[42] Srinivasan, P., 2004. Text mining: generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology 55 (5), 396–413.
[43] Stegmann, J., Grohmann, G., 2003. Hypothesis generation guided by co-word clustering. Scientometrics 56 (1), 111–135.
[44] Swanson, D. R., 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 30 (1), 7–18.
[45] Swanson, D. R., 1987. Migraine and magnesium: eleven neglected connections. Perspectives in Biology and Medicine 31 (4), 526–557.
[46] Tietjen, G. W., Chien, S., Leroy, E. C., Gavras, I., Gavras, H., Gump, F. E., 1975. Blood viscosity, plasma proteins, and Raynaud syndrome. Archives of Surgery 110 (11), 1343–1346.
[47] Valiant, L. G., 1979. The complexity of enumeration and reliability problems. SIAM Journal on Computing 8 (3), 410–421.
[48] Watts, D. J., Strogatz, S. H., 1998. Collective dynamics of 'small-world' networks. Nature 393 (6684).
[49] Weeber, M., Klein, H., de Jong-van den Berg, L., Vos, R., et al., 2001. Using concepts in literature-based discovery: Simulating Swanson's Raynaud–fish oil and migraine–magnesium discoveries. Journal of the American Society for Information Science and Technology 52 (7), 548–557.
[50] Wirrell, E. C., Hamiwka, L. D., 2006. Do children with benign rolandic epilepsy have a higher prevalence of migraine than those with other partial epilepsies or nonepilepsy controls? Epilepsia 47 (10), 1674–1681.
[51] Wong, W., Liu, W., Bennamoun, M., 2012. Ontology learning from text: A look back and into the future. ACM Computing Surveys 44 (4), 20:1–20:36.
[52] Zhu, B.-Q., Parmley, W. W., 1990. Modification of experimental and clinical atherosclerosis by dietary fish oil. American Heart Journal 119 (1), 168–178.
Appendix A: Context Topics for the Intermediate Terms