Discovering Beaten Paths in Collaborative Ontology-Engineering Projects using Markov Chains
Simon Walk, Philipp Singer, Markus Strohmaier, Tania Tudorache, Mark A. Musen, Natalya F. Noy
Simon Walk (a,*), Philipp Singer (b), Markus Strohmaier (b,c), Tania Tudorache (d), Mark A. Musen (d), Natalya F. Noy (d)
(a) Institute for Information Systems and Computer Media, Graz University of Technology, Austria
(b) GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
(c) Dept. of Computer Science, University of Koblenz-Landau, Germany
(d) Stanford Center for Biomedical Informatics Research, Stanford University, USA
Abstract
Biomedical taxonomies, thesauri and ontologies, such as the International Classification of Diseases as a taxonomy or the National Cancer Institute Thesaurus as an OWL-based ontology, play a critical role in acquiring, representing and processing information about human health. With increasing adoption and relevance, biomedical ontologies have also significantly increased in size. For example, the 11th revision of the International Classification of Diseases, which is currently under active development by the World Health Organization, contains nearly 50,000 classes representing a vast variety of different diseases and causes of death. This evolution in terms of size was accompanied by an evolution in the way ontologies are engineered. Because no single individual has the expertise to develop such large-scale ontologies, ontology-engineering projects have evolved from small-scale efforts involving just a few domain experts to large-scale projects that require effective collaboration between dozens or even hundreds of experts, practitioners and other stakeholders. Understanding the way these different stakeholders collaborate will enable us to improve editing environments that support such collaborations. In this paper, we uncover how large ontology-engineering projects, such as the International Classification of Diseases in its 11th revision, unfold by analyzing usage logs of five different biomedical ontology-engineering projects of varying sizes and scopes using Markov chains. We discover intriguing interaction patterns (e.g., which properties users frequently change after specific given ones) that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development. From our analysis, we identify commonalities and differences between different projects that have implications for project managers, ontology editors, developers and contributors working on collaborative ontology-engineering projects and tools in the biomedical domain.
Keywords: Collaborative ontology engineering; Markov chains; sequential patterns; collaboration; ontology-engineering tool; user interface
1. Introduction
Today, biomedical ontologies play a critical role in acquiring, representing and processing information about human health. For example, the International Classification of Diseases (ICD) is a taxonomy that is used in more than 100 countries to encode patient diseases, to compile health-related statistics and to collect health-related spending statistics. Similarly, the National Cancer Institute's Thesaurus (NCIt) represents an important OWL-based vocabulary for classifying cancer and cancer-related terms.
With their increase in relevance, biomedical taxonomies, thesauri and ontologies have also significantly increased in size to cover new findings and to extend and complement their original areas of application. For example, the 11th revision of the International Classification of Diseases (ICD-11), currently under active development by the World Health Organization (WHO), consists of nearly 50,000 classes representing a vast variety of different diseases and causes of death. In contrast to previous revisions, the foundation component of ICD-11 is implemented as an OWL ontology with a broader scope than previous ICD revisions.
* Corresponding author ([email protected])
This growth was accompanied by a need to adapt the way these ontologies are engineered, as no single individual or small group of domain experts has the expertise to develop such large-scale ontologies. New tools and processes have to be developed in order to coordinate, augment and manage collaboration between the dozens or hundreds of experts, practitioners and stakeholders when engineering an ontology.
Understanding the ways in which such a large number of participants (e.g., more than 100 experts contribute to ICD-11) collaborate with one another when creating a structured knowledge representation is a prerequisite for quality control and effective tool support.
Objectives:
Consequently, we aim at understanding how large collaborative ontology-engineering projects such as ICD- [...]
Preprint submitted to Elsevier August 17, 2018
[...] a single user on any class or (b) a single class by any user in an ontology over time. For example, as depicted in Figure 2, a sequential property path for a single user (user-based) consists of a chronologically ordered list of all properties (e.g., title, definition, etc.) that have been changed by that user on any class, while a sequential property path for a single class (class-based) consists of a chronologically ordered list of properties that were changed on that class by any user. Instead of only modeling sequences for single users or classes, our data contains a set of paths; e.g., each path in the dataset consists of a sequence of properties whose values have been changed by a single user over time. This allows us to tap into accumulated patterns. Concretely, we are interested in studying emerging patterns of subsequent steps in such sequential paths, e.g., which properties users frequently change after a specific given property.
The analyzed datasets range from large-scale datasets such as ICD-11 to smaller ones such as the Ontology for Parasite Lifecycle (OPL). Given the differences of our datasets in a number of salient characteristics, we investigate whether specific patterns can be found across all or only in certain biomedical ontology-engineering projects. Furthermore, we investigate and discuss features of these projects that potentially affect observed patterns that can only be found in specific datasets. This analysis can be seen as a stepping stone for collaborative ontology-engineering project managers to devise infrastructures and tool support to augment collaborative ontology engineering.
Contributions:
We present new insights on social interactions and editing patterns that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development. Specifically, our results indicate that general edit patterns can be found in all investigated datasets, even though they (i) represent different projects with different goals, (ii) use variations of the same ontology editors and tools for the engineering process and (iii) differ in the way the projects are coordinated.
To the best of our knowledge, the work presented in this paper represents the most fine-grained and comprehensive study of patterns in large-scale collaborative ontology-engineering projects in the domain of biomedicine. In addition, our analysis is conducted across five datasets of different sizes, which have been developed using different versions of Collaborative Protégé (Table 1).
2. Collaborative ontology engineering
According to Gruber [1], Borst [2] and Studer et al. [3], an ontology is an explicit specification of a shared conceptualization. In particular, this definition refers to a machine-readable construct (the formalization) that represents an abstraction of the real world (the shared conceptualization), which is especially important in the field of computer science as it allows a computer (among other things) to "understand" relationships between entities and objects that are modeled in an ontology.
Collaborative ontology engineering is a new field of research with many new problems, risks and challenges that we must first identify and then address. In general, contributors to collaborative ontology-engineering projects, similar to traditional collaborative online production systems (e.g., Wikipedia), engage remotely (e.g., via the internet or a client-server architecture) in the development process to create and maintain an ontology. As an ontology represents a formalized and abstract representation of a specific domain, disagreements between authors on certain subjects can occur. Similar to face-to-face meetings, these collaborative ontology-engineering projects need tools that augment collaboration and help contributors in reaching consensus when modeling topics of the real world.
Indeed, the majority of the literature about collaborative ontology engineering focuses on surveying, finding and defining requirements for the tools used in these projects [4, 5].
The Semantic Web community has developed a number of tools aimed at supporting the collaborative development of ontologies. For example, Semantic MediaWiki [6] and its derivatives [7, 8, 9] add semantic, ontology modeling and collaborative features to traditional MediaWiki systems.
Protégé and its extensions for collaborative development, such as WebProtégé and iCAT [10] (see Figure 1 for a screenshot of the iCAT ontology-editor interface), are prominent stand-alone tools that are used by a large community worldwide to develop ontologies in a variety of different projects. Both WebProtégé and Collaborative Protégé provide a robust and scalable environment for collaboration and are used in several large-scale projects, including the development of ICD-11 [11].
Pöschko et al. [12] and Walk et al. [13] have created PragmatiX, a tool to visualize and analyze a collaboratively engineered ontology and aspects of its history and the engineering process, providing quantitative insights into the ongoing collaborative development processes.
Falconer et al. [14] investigated the change-logs of collaborative ontology-engineering projects, showing that users exhibit specific roles when contributing to the ontology, which can be used to group and classify users. Pesquita and Couto [15] investigated whether the location and specific structural features can be used to determine if and where the next change is going to occur in the Gene Ontology.
Note that the term traditional online production systems refers to online platforms that have users collaborate in engineering digital goods, as opposed to a structured knowledge base that is the result of collaborative ontology engineering.
Figure 1: A screenshot of iCAT, a custom-tailored, web-based version of WebProtégé, developed for the collaborative engineering of ICD-11. The left part of the interface visualizes the ICD-11 class hierarchy, the class titles, the number of annotations each class has received (speech bubbles) and its overall progress (color and symbol before the class title). The right part of the interface shows the different user-interface sections (e.g., Title & Definition or Classification Properties), listing all properties and property values for each class.
Goncalves et al. [16, 17, 18] performed an analysis of different versions of ontologies by applying and categorizing Diff algorithms, with the goal of categorizing the differences between consecutive and chronologically ordered versions of the ontologies. Furthermore, they conducted reasoner performance tests and identified factors that potentially increase reasoner performance. For the analysis presented in this paper we were able to rely on ChAO [19], a change-log provided by Protégé and its derivatives that already provides us with detailed and unambiguous logs of changes for the investigated ontologies.
In a similar context, Grau et al. [20, 21] proposed a logical framework for modularity of ontologies and a definition of what is to be considered an ontology module. In general, an ontology module can be used to extract the meaning of a specified set of terms from an ontology. Extracting the right amount of information is especially important for the topic of ontology reuse. According to Grau et al., modularity also represents a crucial factor in collaborative ontology-engineering environments, as modular representations of ontologies are easier to understand, to extend and to reuse, similar to modularity in software engineering projects.
Mikroyannidi et al. [22] investigated the detection and use of (design) patterns in the content of an ontology, using a clustering approach. In contrast to Mikroyannidi et al., our analysis focuses on the detection of sequential patterns in interaction data rather than content.
Strohmaier et al. [23] investigated the hidden social dynamics that take place in collaborative ontology-engineering projects from the biomedical domain and provided new metrics to quantify various aspects of the collaborative engineering processes. Wang et al. [24] have used association-rule mining to analyze user editing patterns in collaborative ontology-engineering projects. The approach presented in this paper uses Markov chains to extract much more fine-grained user-interaction patterns, incorporating a variable amount of historic editing information.
The only requirement for performing the pattern analysis that we present in this paper is the availability of a structured log of changes that can be mapped to the underlying ontology. The majority of the discussed collaborative ontology-engineering environments provide such a log, allowing for a similar analysis. For example, the Semantic MediaWikis store all the changes to the articles, and thus to the ontology, so the application of Markov chains to analyze sequential patterns, as shown in this paper, could be extended to them.
3. Materials & methods
For the analysis conducted in this paper we concentrated our efforts on five ontology-engineering projects in the biomedical domain. Each of the projects (i) has at least two users who contributed to the project, (ii) provides a structured log of changes and (iii) represents knowledge from the biomedical domain. In Section 3.1 we provide a brief history for each dataset and in Section 3.2 we describe the sequential path analysis. To aid readers in understanding the analyses conducted in this paper and their implications, we provide a very brief overview of Markov chains and the involved model selection methodology in Section 3.3.
Table 1 lists the detailed features and observation periods for the following five datasets that we used in our analysis. All datasets have been created either with WebProtégé or special versions of WebProtégé. To be able to conduct the pattern detection analysis for a different dataset, only one requirement needs to be satisfied: the availability of a change-log that can be mapped onto the ontology so that changes can be associated with users and classes without ambiguity.

Table 1: Detailed information of the datasets used for the sequential pattern analysis to extract beaten paths in collaborative ontology-engineering projects.
                          ICD-11      ICTM        NCIt                   BRO         OPL
Classes                   48,771      1,506       102,865                528         393
Changes                   439,229     67,522      294,471                2,507       1,993
DL expressivity           SHOIN(D)    SHOIN(D)    SH                     SHIF(D)     SHOIF
Editor tool               iCAT        iCAT-TM     Collaborative Protégé  WebProtégé  Collaborative Protégé
Users                     109         27          17                     5           3
Bots (changes)            1 (935)     1 (1)       0 (0)                  0 (0)       0 (0)
First change              18.11.2009  02.02.2011  01.06.2010             12.02.2010  09.06.2011
Last change               29.08.2013  17.7.2013   19.08.2013             06.03.2010  23.09.2011
Observation period (ca.)  4 years     2.5 years   3 years                1 month     3 months

The DL expressivity [25, 26] of the five datasets is added to Table 1 to highlight that the investigated ontologies exhibit different strategies regarding their OWL-DL expressivity. As all levels of expressivity shown in Table 1 allow for the definition and assignment of properties and classes, they do not influence the conducted pattern detection analyses. Also, in the case of WebProtégé and its derivatives, the data used for the pattern detection analysis can be extracted from the change-logs, allowing us to avoid parsing and extracting values from OWL directly.
The International Classification of Diseases (ICD) is the international standard for diagnostic classification used to encode information relevant to epidemiology, health management, and clinical use in over 100 United Nations countries. The World Health Organization (WHO) develops ICD and publishes new revisions of the classification every decade or more. The current revision in use is ICD-10, a taxonomy that contains over 15,000 classes. The 11th revision of ICD, ICD-11, is currently under development and brings two major changes with respect to previous revisions. First, ICD-11's foundation component is developed as an OWL ontology using a much richer representation formalism than previous revisions. ICD-11 contains very detailed descriptions of several aspects of diseases, mostly represented as properties in the ontology. Second, the development of ICD-11 takes place in a Web-based collaborative environment, called iCAT (see Figure 1), which allows domain experts around the world to contribute to and review the ontology online. ICD-11 is planned to be finalized in May 2017.
The International Classification of Traditional Medicine (ICTM) is a WHO-led project that aimed to produce an international standard terminology and classification for diagnoses and interventions in Traditional Medicine. ICTM, similarly to ICD-11, implements an OWL-based ontology as its foundation component, which tries to unify the knowledge from the traditional medicine practices of China, Japan and Korea. Its content is authored in four languages: English, Chinese, Japanese and Korean. More than 20 domain experts from the three countries developed ICTM using a customized version of the iCAT system, called iCAT-TM. The development of ICTM was stopped in 2012, and a subset of ICTM is also included as a branch in the ICD-11 ontology.
http://tinyurl.com/ictmbulletin
The National Cancer Institute's Thesaurus (NCIt) [27] has over 100,000 classes and has been in development for more than a decade. It is a reference vocabulary covering areas for clinical care, translational and basic research, and cancer biology. A multidisciplinary team of editors works to edit and update the terminology based on their respective areas of expertise, following a well-defined workflow. A lead editor reviews all changes made by the editors, accepts or rejects them, and publishes a new version of the NCI Thesaurus. The NCI Thesaurus is, at its core, an OWL ontology, which uses many OWL primitives such as defined classes and restrictions. It was named a thesaurus for historical reasons; however, it fully conforms to OWL semantics and thus represents an actual ontology.
The Biomedical Resource Ontology (BRO) originated in the Biositemaps project, an initiative of the Biositemaps Working Group of the NIH National Centers for Biomedical Computing [28]. Biositemaps is a mechanism for researchers working in biomedicine to publish metadata about biomedical data, tools, and services. Applications can then aggregate this information for tasks such as semantic search. BRO is the enabling technology used in Biositemaps: a controlled terminology for describing the resource types, areas of research, and activity of a biomedical related resource. BRO was developed by a small group of editors, who use a Web-based interface (WebProtégé) to modify the ontology and to carry out discussions to reach consensus on their modeling choices.
The Ontology for Parasite Lifecycle (OPL) models the life cycle of T. cruzi, a protozoan parasite that is responsible for a number of human diseases. OPL is an OWL ontology that extends several other OWL ontologies. It uses many OWL constructs such as restrictions and defined classes. Several users from different institutions collaborate on OPL development. This ontology is much smaller and has far fewer users than NCIt, ICD-11, or ICTM.
The ICD-11 dataset used in our analysis did not include the ICTM branch.
http://biositemaps.ncbcs.org
3.2. Sequential interaction paths
For our sequential pattern analysis we analyze three different kinds of paths, which all represent interactions with the underlying ontology. A sequential path is represented by the chronologically ordered list of extracted interactions for either a single user or a single class (see Figure 2). For example, a sequential property path for a single user (user-based) consists of a chronologically ordered list of all properties (e.g., title, definition, etc.) that have been changed by that user on any class, while a sequential property path for a single class (class-based) consists of a chronologically ordered list of properties that were changed on that class by any user.
Figure 2: The top row of the figure depicts an exemplary class-based sequential property path for class C, i.e., the chronologically ordered properties that were changed on class C, ending with P3. The bottom row of the figure depicts a sequential property path for user U (user-based). Analogously, user U first changed P2 and then continued to change further properties.
User-sequence paths:
First, we analyze activity patterns within the collaborative ontology-engineering project. This means that we analyze sequences of users who change a class. We want to detect and describe the different sequential patterns (the structure) that can be extracted from the change-logs of the investigated collaborative ontology-engineering projects.
Structural paths:
Analogously to the user-sequence paths, we investigate edit strategies, such as bottom-up or top-down development, that users follow. Is it possible to detect common patterns of which depth level a user frequently contributes to after a given current depth level? In addition to development strategies, we look at the relationships (e.g., parent, child, sibling, etc.) between the current and the next class a user is going to contribute to.
Property paths:
On a content-based level, we investigate the series of property changes users perform. In particular, we want to identify common successive property changes, i.e., which properties users regularly change consecutively (user-based) and which properties are changed back-to-back for classes (class-based).
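As a minimal sketch of how user-based and class-based property paths could be extracted, assume a change-log whose entries carry a timestamp, a user, a class and a property. The Change record and the property_paths helper below are hypothetical names chosen for illustration; they do not reflect the actual ChAO log format.

```python
from collections import defaultdict
from typing import NamedTuple


class Change(NamedTuple):
    """One change-log entry (hypothetical schema, not the ChAO format)."""
    timestamp: int  # any sortable time value
    user: str
    cls: str        # identifier of the changed class
    prop: str       # name of the changed property


def property_paths(changes, key):
    """Group changes by `key` ("user" or "cls") and return, per group,
    the chronologically ordered list of changed properties."""
    paths = defaultdict(list)
    for change in sorted(changes, key=lambda c: c.timestamp):
        paths[getattr(change, key)].append(change.prop)
    return dict(paths)


# Toy log; users, classes and properties are illustrative only.
log = [
    Change(1, "U", "C", "title"),
    Change(2, "V", "C", "definition"),
    Change(3, "U", "D", "definition"),
    Change(4, "V", "C", "synonym"),
]

user_based = property_paths(log, "user")   # one path per user
class_based = property_paths(log, "cls")   # one path per class
```

In this toy log the same four changes yield different paths depending on the grouping key, which is the distinction between the user-based and the class-based view.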
For the analysis conducted in this paper we adopt the methodology presented by Singer et al. [29], which was mapped to collaborative ontology-engineering change logs by Walk et al. [30], to detect sequential patterns in change-logs of collaborative ontology-engineering projects.
For a better understanding of the collected results, we provide a short description of Markov chains. For an in-depth description of our methodology we point to Singer et al. [29] and Walk et al. [30].
In general, Markov chain models are used for stochastically modeling transitions between states on a given state space. In our case, a Markov chain consists of a finite state space (e.g., properties that a user edits over time; see Section 3.2) and the corresponding transition probabilities (e.g., the probability of changing property j after property i) between these states. Markov chain models are usually described as memoryless, which means that the next state in a sequence only depends on the current one and not on a sequence of preceding ones (also known as the Markovian property). Hence, this property defines serial dependence between adjacent nodes in trajectories; this is where the term "chain" comes from. Such a model is usually called a first-order or memoryless model.
As we are interested in modeling sequential interaction paths of collaborative ontology-engineering projects (see Section 3.2), we fit a Markov chain model on such sequences $D = (x_1, x_2, \ldots, x_n)$ with states from a finite set $S$. Then, we can write the Markovian property as:
$$P(x_{n+1} \mid x_1, x_2, \ldots, x_n) = P(x_{n+1} \mid x_n) \qquad (1)$$
After the model fitting on the data, a Markov chain model is usually represented via a stochastic transition matrix $P$ with elements $p_{ij} = P(x_j \mid x_i)$, where it holds that for all $i$: $\sum_j p_{ij} = 1$. [...] different states. For example, if we fit the Markov chain model on sequential property paths for users (see Section 3.2), element $p_{ij}$ of the transition matrix would tell us the probability that users change property j right after i (e.g., in 60% of all cases). By looking, e.g., for the highest transition probabilities from state i to all other states of $S$, we can identify potential high-frequency patterns in our data.
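To make the transition matrix more concrete, the following sketch estimates first-order transition probabilities from a set of paths by counting transitions and normalizing each row. This is a plain maximum-likelihood estimate chosen for illustration; the methodology of Singer et al. [29] may differ in details such as priors or model selection, and the function name fit_first_order as well as the toy paths are assumptions.

```python
from collections import Counter, defaultdict


def fit_first_order(paths):
    """Estimate a first-order transition matrix from a set of paths.

    Returns a nested dict p with p[i][j] = count(i -> j) / count(i -> any),
    so every row sums to 1.
    """
    counts = defaultdict(Counter)
    for path in paths:
        # Count every pair of adjacent states (current, next).
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    matrix = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        matrix[state] = {nxt: n / total for nxt, n in successors.items()}
    return matrix


# Toy property paths; the property names are illustrative only.
paths = [
    ["title", "definition", "synonym"],
    ["title", "definition", "title"],
    ["title", "synonym"],
]
P = fit_first_order(paths)

# The most likely property to be changed right after "title".
most_likely = max(P["title"], key=P["title"].get)
```

States that never occur as a source (here "synonym") simply have no row; a fuller implementation would have to decide how to treat such states.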
4. Results
In the User-Sequence Paths analysis we investigate patterns emerging when looking at sequences of users who contribute to a class of an ontology. Hence, given a sequence of n contributors for a class over time, we identify consecutive users who edit the class (e.g., user Y frequently contributes to a class after user X).
Analyzing the chronologically ordered list of contributors for each class of the five investigated datasets provides the necessary information to identify users who perform changes on classes after (or before) other users. Note that this analysis on its own, without regarding additional factors, such as the [...] gardener, a contributor focused on pruning ontology classes and fixing syntactical errors) after other users.
Note that throughout this article we usually refer to the entities modeled (i.e., interactions) instead of states. However, we speak about transition probabilities between these entities as we derive them directly from the resulting model transition matrix.
[Figure 3, panels: (a) International Classification of Diseases (ICD-11), (b) International Classification of Traditional Medicine (ICTM), (c) National Cancer Institute Thesaurus (NCIt), (d) Biomedical Resource Ontology (BRO), (e) Ontology for Parasite Lifecycle (OPL). Each panel shows a histogram of change frequency per user above a transition map with axes From User and To User.]
Figure 3: Results for the User-Sequence Paths analysis: The columns and rows of the transition maps (bottom area of Figures 3(a) to 3(e)) represent the transition probabilities between the users of each dataset for a first-order Markov chain, where rows are source users and columns are target users. A sequence (or transition probability) is always read from row to column. Darker colors represent higher transition probabilities while lighter colors indicate lower transition probabilities. Absolute probability values depend on the number of investigated rows and columns; hence relative differences are of greater importance. Darker colored columns identify gardeners, i.e., contributors focused on pruning ontology classes and fixing syntactical errors. The histograms (top area of Figures 3(a) to 3(e)) show the number of changes performed by each user (again for a first-order Markov chain) within the five ontologies in alphabetical order. Note that the y-axes of the histograms are scaled differently for each dataset. All datasets have a few users who contributed the majority of changes, while the rest of the users (the long tail) only contributed a very small number of changes. Note that the transition probabilities depicted in the transition maps are relative numbers for each column and row individually. The sum of all transition probabilities for one row in the transition maps is 1. For example, if User 1 exhibits a transition probability of 0.30 to another User 2, it means that User 2 has a 30% probability of changing a class after User 1. Thus, an inspection of the transition maps and histograms is necessary for proper interpretation. To increase readability we have removed users from the plots who have contributed only a very limited number of changes for ICD-11, ICTM and NCIt.
Path & model description:
To analyze user sequences, we iterated over each class of our datasets and extracted a chronologically ordered list of contributors. For example, a path for a given class can look like the following: User A, User B, User B, User C. As we are interested in uncovering patterns of distinct users, we merged multiple consecutive changes by the same user into a single change; our previous example would then reduce to: User A, User B, User C. By doing so we remove biases that emerge when a single user consecutively changes the same class over and over, as this may result in unreasonably high transition probabilities between equal users. We fit a first-order Markov chain model on this set of paths, where each path represents a single class of the ontology and each element of a path constitutes a change by a single user on the class. The resulting transition probabilities between users then tell us, e.g., the probability that User B changed a class after User A. Hence, they give us thorough insights into frequent consecutive user patterns that emerge when looking at which users contribute to classes in an ontology. For reasons of privacy we obfuscated the usernames and replaced them with generic names.
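The merging of consecutive changes by the same user can be sketched as follows. The helper names collapse_repeats and transitions, as well as the class and user identifiers, are hypothetical and only serve to illustrate the step; the sketch assumes that each class already comes with its chronologically ordered list of contributors.

```python
from itertools import groupby


def collapse_repeats(path):
    """Merge runs of consecutive identical contributors into one entry."""
    return [user for user, _ in groupby(path)]


def transitions(paths):
    """Yield (source user, target user) pairs over all collapsed paths."""
    for path in paths:
        collapsed = collapse_repeats(path)
        yield from zip(collapsed, collapsed[1:])


# One contributor path per class; names are placeholders.
class_paths = {
    "class_1": ["User A", "User B", "User B", "User C"],
    "class_2": ["User A", "User A", "User B", "User A"],
}

pairs = list(transitions(class_paths.values()))
```

Because runs of the same user are collapsed before transitions are counted, no pair ever has the same source and target, which is why the diagonal of the resulting transition map is empty.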
Results:
When investigating the transition probabilities (representing a Markov chain of first order) between contributors (see bottom area of Figures 3(a) to 3(e)) we can identify very active users by looking at darker colored columns of the transition maps. Note that these darker colored columns can also be used to identify gardeners, i.e., contributors focused on pruning ontology classes and fixing syntactical errors. As we have merged all consecutive changes of the same user into one single change, the diagonal, representing the transition probabilities between the same users, is 0. The absolute transition probabilities, depicted next to each transition map, are dependent on the absolute amount of observations and users, and thus are to be interpreted relative to each other for each row individually. When looking at the probabilities between the three most active users (users 66, 45 and 47) and all corresponding target users in ICD-11, we can see that the probabilities are very evenly distributed among them. That is, in the rows (From User) that correspond to the top three most active users, the probabilities to all target users (To User) are very evenly distributed, with very minor exceptions. This indicates that users who contribute many changes to ICD-11 are not followed by specific other contributors, but exhibit an even distribution of users that edited a class after them. Nonetheless, we can clearly identify User 66 to be the most likely user that edits a class after nearly all other users. This suggests that User 66 may represent a gardener in ICD-11.
For NCIt we can clearly observe that User 7 appears to be a gardener, who is checking all the changes contributed by all other users. For BRO, Users 2 and […] are prominent target users, evident in the high transition probabilities as To User (dark columns), i.e., they frequently edit a class after other users do. Interestingly, the user with the highest number of changes (User 1) exhibits very low and evenly distributed transition probabilities (row) and is not necessarily the user that most likely changes a class after other users. This shows us that there does not need to be a connection between the overall activity of users and their activity as a gardener. This could also mean that User 1 is possibly working independently from the other users in BRO, or that User 1 is a domain specialist and all other users only change concepts that have not been worked on by that specialist. However, further investigations in future work are required to confirm this observation, as our Markov chain analysis is not able to determine this kind of distinction. For OPL we can observe that User 3 frequently changes the same classes after User 2. A similar observation can be made for Users 1 and […]. However, one has to keep in mind that User 1 has contributed a limited number of changes, rendering the observed transition probabilities less useful as they rarely occur.
The histograms (see top area of Figures 3(a) to 3(e)) indicate that a small number of users contribute the majority of changes (similar to a long-tail distribution). However, this appears to be more dominant for specific ontologies compared to others. In order to measure the inequality among contributions of changes to a specific ontology by users, we analyzed the
Normalized Entropy, which is determined by calculating the Shannon Entropy and normalizing it by dividing by the logarithm of the length (i.e., number of users) of a distribution. This coefficient measures the statistical dispersion of a distribution, i.e., the coefficient is one if all users contributed equally to the ontology, while it is zero in case of total inequality where a single user conducts all changes. The results indicate that ICD-11 (0.55) exhibits a low entropy value, i.e., the changes are dominated by only a few users. For NCIt (0. […] .64) and ICTM (0.68) we receive medium normalized entropies indicating a more democratic contribution to the ontology by users. A high entropy can be observed for BRO (0. […]).
Interpretation & practical implications:
The transition probabilities for a first-order Markov chain unveil the roles of certain users and can help to identify users or even groups of users who frequently change the same classes. Users that frequently change classes after other users (i.e., exhibit high transition probabilities in their columns) were identified by us as actual gardeners, curators and administrators of the corresponding projects. If certain users always change the same classes after specific other users, it could be worthwhile for project administrators to investigate if these users are actually collaborating, for example by looking at the changed properties and property […]
Footnote: Additionally, we calculated the Gini Coefficient for each distribution, confirming the results presented here.
Footnote: Note that we do not necessarily know whether the differences between these distributions are statistically significant as we are mainly interested in the behavior of single distributions.
The investigation of Structural Paths involves an analysis of different aspects regarding how and where users contribute to the ontology, such as the depth level of the class that users contribute to next (Section 4.2.1) as well as looking at the relationship distances between consecutively changed classes (Section 4.2.2).
In this analysis, we investigate if users concentrate their efforts on specific depth levels of the ontology and if there are certain depth levels that are frequently consecutively changed and receive less concentrated workflows. The gathered results provide the necessary information to implement prefetching mechanisms, potentially helping to minimize the loading and waiting times for contributors. Furthermore, we can determine whether users move along the structure of the underlying ontology when editing classes.
Path & model description:
For this analysis, we stored the chronologically ordered depth levels of each changed class for each user (user-based). The depth level of a class is the length of the shortest path between the root node of the ontology and the corresponding class. For example, a given path for a given user can look like the following: Depth 3 (for class A), Depth 3 (for class A), Depth 3 (for class A), Depth 3 (for class B), Depth 4 (for class C). We merged consecutive changes that were conducted by the same user on the same class into a single transition between the same depth levels. Hence, for our previous example we would merge the three successive changes of class A into just two consecutive ones, which results in the following final depth-level path: Depth 3, Depth 3, Depth 3, Depth 4. This approach helps us to investigate patterns of changing distinct depth levels while still retaining the notion of users consecutively editing the same classes.
Consequently, we fit a first-order Markov chain model on these paths; each path represents a single user and each element of a path represents a corresponding depth level of a class the user has changed. The final transition probabilities give us information about consecutive depth levels that users change over time. For example, they might tell us the probability that users change a class belonging to the third depth level of the ontology after one that has a depth level of 2.
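The path construction and model fitting described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names and the (class, depth) tuple layout are hypothetical, and estimating the transition probabilities by counting transitions and normalizing each row (a maximum-likelihood estimate) is an assumption, since the text only states that a first-order Markov chain was fit.

```python
from collections import defaultdict
from itertools import groupby


def merge_repeated_changes(changes):
    """Build a depth-level path for one user.

    `changes` is a chronologically ordered list of (class_id, depth_level)
    tuples (hypothetical layout). Each run of consecutive changes to the
    same class is shortened to at most two entries, so that a run still
    contributes exactly one transition between the same depth levels.
    """
    path = []
    for _, run in groupby(changes, key=lambda change: change[0]):
        kept = list(run)[:2]
        path.extend(depth for _, depth in kept)
    return path


def fit_first_order_chain(paths):
    """Estimate transition probabilities by counting and row-normalizing
    (assumed estimator; every row of the result sums to 1)."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for source, target in zip(path, path[1:]):
            counts[source][target] += 1
    transitions = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        transitions[source] = {t: n / total for t, n in targets.items()}
    return transitions


# The example from the text: three changes of class A, then B, then C.
example = [("A", 3), ("A", 3), ("A", 3), ("B", 3), ("C", 4)]
path = merge_repeated_changes(example)  # [3, 3, 3, 4]
chain = fit_first_order_chain([path])
```

In a full analysis, one such path would be built per user and all paths would be passed to the fitting function together, so that the resulting transition map aggregates over all contributors.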
Results:
First, the histograms (see top area of Figures 4(a) to 4(e)) show that work is concentrated on certain depth levels of the ontology, with the highest and lowest levels not receiving as much attention as the levels in-between.
As depicted in the transition maps (bottom area of Figures 4(a) to 4(e)), users have a high tendency to edit classes in the same depth levels, visible in the darker colored diagonal. In ICD-11, for the first five depth levels, users appear to have a tendency towards top-down editing, evident in the darker colored squares immediately right of the diagonal, while this tendency turns around into a bottom-up editing behavior, evident in the darker colored squares immediately left of the diagonal, at a depth level of 6 and higher, and appears to be strictly limited to surrounding depth levels. For ICTM (see Figure 4(b)), we can observe a similar trend, again with the tendency towards top-down editing appearing to be minimally more dominant. For NCIt, when only looking at the transition map, we can identify a trend towards bottom-up editing, evident in the squares directly left of the diagonal being darker than the ones right of the diagonal. However, when also considering the absolute number of changes, depicted in the histogram of Figure 4(c), we can infer that the levels with a higher frequency of occurrence, even though their transition probabilities are more evenly distributed, have a greater impact on the editing strategy. This means that while we can see a bottom-up editing behavior for levels 5 to 8 and a top-down editing behavior for levels 1 to 4, classes on levels 1 to 4 are more frequently changed than classes on the other levels, hence a tendency towards top-down editing can be observed. Thus, when users are not changing the same classes, they still exhibit a preference towards top-down editing. Given the short observation periods for BRO and OPL it is hard to infer edit strategies.
However, similar to the other projects, we can observe a concentration on the same depth levels with alternating preferences towards higher and lower depth levels. Similar to ICD-11, all datasets exhibit higher transition probabilities between the immediately surrounding depth levels.
Furthermore, we investigate whether the total number of classes as well as the total number of links to the immediate higher (children; edges to classes one level further away from root) and lower (parents; edges to classes one level closer to root) depth level correlate with our findings (Figures 5(f) to 5(j)). For example, the transition map for ICD-11 (see Figure 4(a)) shows that contributors exhibit a top-down editing behavior for the first five depth levels, with level 5 exhibiting first signs of bottom-up editing. Figure 5(f) shows a higher number of possible transitions to children than to parents, indicating that users are in general likelier to follow top-down editing-strategies when changing classes, following relationships by chance, of the first four levels. This changes for ICD-11 at level 5, with a higher number of transitions to parents than to children, and continues until level 10, resulting in a higher probability of users performing bottom-up editing-strategies when changing classes from levels 6 to 10. The same observations can be made for all other datasets, indicating that the class hierarchy influences the edit behavior of contributors.
Figure 4 panels: (a) International Classification of Diseases (ICD-11); (b) International Classification of Traditional Medicine (ICTM); (c) National Cancer Institute Thesaurus (NCIt); (d) Biomedical Resource Ontology (BRO); (e) Ontology for Parasite Lifecycle (OPL). Transition-map axes: To Depth Level (columns) and From Depth Level (rows), including BREAK; histogram axis: Frequency.
Figure 4: Results for the Depth-Level Paths analysis: The columns and rows of the transition maps (bottom area of Figures 4(a) to 4(e)) represent the transition probabilities of a first-order Markov chain between depth levels, where rows are source depth levels and columns are target depth levels. A sequence (or transition probability) is always read from row to column. Darker colors represent higher transition probabilities while lighter colors indicate lesser transition probabilities. Absolute probability values are dependent on the number of investigated rows and columns, hence relative differences are of greater importance. For classes closer to root a top-down editing manner can be observed, while this is reversed for classes further away from root. The sum of all transition probabilities for one row in the transition maps is 1. For example, if Depth-Level 6 exhibits a transition probability of 0.30 to another Depth-Level 5, it means that a class on Depth-Level 5 has a 30% probability of being changed after a class on Depth-Level 6. The histograms (top area of Figures 4(a) to 4(e)) show the number of changes performed in each depth level aggregated over all users of the respective projects (again for a first-order Markov chain). Throughout all projects, classes located between the first and last few depth levels (in the middle) are changed substantially more frequently than others, suggesting that work is concentrated on some depth levels while others receive none to very few changes at all. Note that the y-axes for all histograms are scaled differently for each dataset. For the x-axes (and columns/rows of the transition maps) we only display depth levels which exhibit at least one change; thus, the depth level sequences are not necessarily continuous from lowest to highest depth level.
Figure 5 panels: (f) International Classification of Diseases (ICD-11); (g) International Classification of Traditional Medicine (ICTM); (h) National Cancer Institute Thesaurus (NCIt); (i) Biomedical Resource Ontology (BRO); (j) Ontology for Parasite Lifecycle (OPL). Axes: Depth-Level (x) and Frequency (y).
Figure 5: Figures 5(f) to 5(j) depict the absolute numbers (y-axis; Frequency) of classes as well as the number of edges (isKindOf) to classes on the immediate lower (parents; closer to root) and higher (children; further away from root) depth level for all depth levels (x-axis; Depth-Level). According to Figures 5(f) to 5(j) the transition probabilities depicted in the transition maps correlate with the total number of edges to children and parents for each depth level across all datasets.
In all datasets, after taking a
BREAK (representing an artificially introduced session break when two consecutive changes of the same user are more than 5 minutes apart; for more information see Section 5.4), users exhibit a clear tendency towards changing classes on certain depth levels (e.g., levels 3 to 5 for ICD-11, levels 4 to 5 for ICTM, levels 4 to 7 for NCIt, levels 2 to 4 for BRO and levels 6 to 9 for OPL).
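The artificial session break mentioned above can be illustrated with a short sketch. It assumes, hypothetically, that each user's changes are available as chronologically ordered (timestamp, state) pairs; the function and constant names are not taken from the paper, and only the 5-minute threshold comes from the text.

```python
from datetime import datetime, timedelta

BREAK = "BREAK"
SESSION_GAP = timedelta(minutes=5)  # threshold stated in the text


def insert_breaks(changes, gap=SESSION_GAP):
    """Insert a BREAK state between two consecutive changes of one user
    whenever they are more than `gap` apart.

    `changes` is a chronologically ordered list of (timestamp, state)
    tuples for a single user (hypothetical layout); `state` can be a depth
    level, a relationship label or a property name.
    """
    path = []
    previous_time = None
    for timestamp, state in changes:
        if previous_time is not None and timestamp - previous_time > gap:
            path.append(BREAK)
        path.append(state)
        previous_time = timestamp
    return path


session = [
    (datetime(2013, 1, 1, 10, 0), 3),
    (datetime(2013, 1, 1, 10, 2), 3),
    (datetime(2013, 1, 1, 10, 9), 4),  # 7 minutes after the previous change
]
path_with_breaks = insert_breaks(session)  # [3, 3, "BREAK", 4]
```

Because BREAK is treated as an ordinary state, the row and column labelled BREAK in the transition maps fall out of the same first-order model without any special handling.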
Interpretation & practical implications:
The results of this analysis show if, to what extent and where (limited to locality being determined by isKindOf relationships) work is conducted and concentrated within the ontology. This information can potentially be used in a variety of ways, for example by ontology-engineering tool developers to adapt the interface of the ontology-engineering tool dynamically to display specific classes after users return from a BREAK. Project managers can adapt milestones and project progress reports to reflect the underlying editing strategies (e.g., top-down editing), for example by aligning progress with created branches (as opposed to complete coverage). Another potential use-case for the results of this analysis involves the prefetching of content in certain environments (e.g., mobile or embedded systems) to minimize waiting times. Across all projects we can observe that classes close to and very far away from the root of the ontology are not edited as frequently as other classes. One explanation for this observation could be that classes in lower depth levels (closer to root) are mainly used as content dividers and are usually created in the beginning of a project. Thus, they may be more stable and less frequently updated. Classes at the higher depth levels (further away from root), on the other hand, most likely require extensive expert knowledge. Hence, only a small number of users have the necessary expertise to contribute to these classes. Additionally, the absolute number of classes in the highest and lowest depth levels is much lower in all investigated datasets. Note that absolute values of depth levels are less important for the interpretation of the results than their relative position (i.e., closest to root, furthest away from root, etc.). For example, a class at level 6 can exhibit different behaviors in ontologies with 6 or 10 levels.
In all projects, except for NCIt, the depth levels where users start to edit the ontology after they return from a BREAK are similar to the ones where they stop editing before taking a BREAK. To be able to make that observation we have to take the absolute numbers of changes on each depth level (top area of Figure 4) into account when looking at the transition probabilities (bottom area of Figure 4). NCIt is the only dataset where users appear to be similarly likely to take a BREAK after changing classes across all depth levels, except for 0 and 12.
When we combine the results of this analysis with the results of the User-Sequence Paths (Section 4.1) we may be able to develop automatic mechanisms to curate and delegate work to users. For example, if we know that a specific user is most probably going to contribute to a class on level 3 and we have a set of classes on that level where that specific user is the most probable next user to contribute to, determined by the User-Sequence Paths analysis, we may combine these two observations to create class (and thus work) suggestions for users.
Given the high number of observed transitions between the same depth levels in the Depth-Level Paths analyses (Section 4.2.1; bottom area of Figure 4), we conducted an additional analysis investigating the relationships between the changed classes for all users. We wanted to know whether all worked-on classes on the same depth levels are siblings, cousins or any other kind of close relative and, in general, whether users follow these hierarchical orders of an ontology when contributing to classes on the same depth level. To further strengthen our observation that users are actually moving along the ontological hierarchy when contributing to an ontology (see Section 4.2.1), we analyzed the relationships between the changed classes for each user. Note that whenever we talk about relationships for this analysis, we refer to the hierarchical isKindOf relationships between two classes, e.g., parent, child, sibling or cousin. For example, when traversing the shortest-path distance of 2, multiple different nodes can be reached, such as a grandparent (i.e., 2 times up), a grandchild (i.e., 2 times down), a sibling (i.e., 1 time up, 1 time down) or even some other relationship (e.g., 1 time down, 1 time up).
Path & model description:
By combining the information from the Depth-Level Paths and the relative movement between depth levels, we inferred the hierarchical relationships between two consecutively changed classes of a single user (user-based). For example, if the difference between the depth levels of the investigated classes would be exactly the size of the shortest-path between them (with the shortest-path being > […]) […] Child, a Parent, an Ancestor or a Descendant of the first-changed class. Given a relative DOWN movement (to a lower depth level) value, depending on the shortest-path value, the second class could be classified as Child (shortest-path of 1) or Descendant (shortest-path > […]) […] Parent and Ancestor with a relative UP movement. A Sibling is defined as the two classes being (i) connected via the same parent with (ii) a shortest-path distance of 2 and (iii) both classes being located on the SAME depth level. A Cousin is used when two classes on the SAME depth level are connected by the same grandparent while exhibiting a shortest-path distance of 4. Every other possible combination of depth level and shortest-path was classified as Other. Self indicates that the same class that was changed last time was changed again. For example, a consecutive change of Sibling and Self means that a change was first performed on a class that is a sibling of the previous class (not displayed in this example) and then another change was performed on the same class; however, now the relationship changed to Self as no new class was involved.
Again, consecutive changes on the same class by the same user have been merged into a single transition (cf. Section 4.2.1), meaning that multiple (more than 2) consecutive changes of the same user on the same class have been merged into Self to Self. Hence, a given path for a single user can, e.g., look like the following: Sibling, Self, Self, Child.
We fit a first-order Markov chain model to the data; each path represents a single user and each element represents a hierarchical relationship between the classes changed by the user. The resulting transition probabilities of the fitted model can then give us insights into common emerging patterns. E.g., we can identify how probable it is that users change a Sibling after a Child.
Results:
When looking at the histograms (see top area of Figures 6(a) to 6(e)), we can observe that the relationships Self, Sibling and Other are highly represented across all datasets.
The transition maps (bottom area of Figures 6(a) to 6(e)) show that after a BREAK, across all five datasets, users tend to change classes “somewhere else” in the ontology, evident in the high transition probability from BREAK towards Other, and are likely not to resume work in the same area of the ontology that they stopped working on. For ICD-11, ICTM and OPL, no matter which relationship type occurs, users tend to edit the same class consecutively (dark colors in the Self column). From this Self relationship, which is also the one that occurs the most often in ICD-11, ICTM and OPL, users are very likely either to change the same class again (Self) or to change a Sibling of the current class.
For NCIt, BRO and OPL we can observe that users, when changing a Parent, are very likely to change a Child of that parent afterwards. Note that this Child does not necessarily have to be the same class that was changed prior to the traversal to Parent. In all datasets, except for OPL, very high transition probabilities towards Other can be observed for all infrequently occurring relationships. In particular for NCIt we can observe that Other is the most frequently observed transition, even before Self and Sibling.
Interpretation & practical implications:
By combining the results of this analysis with the results of the Depth-Level Paths analysis, we can infer that users exhibit a tendency towards top-down editing while contributing to the ontology, when only considering changes that occur on different depth levels. If they concentrate their efforts on the same depth levels, users exhibit a breadth-first editing behavior, meaning that they first concentrate their work on closely related classes (Siblings) on the same depth level before switching to a different branch on the same or any other depth level, either changing the same class multiple times or traversing along siblings of the current class. We can leverage this information not only to refine the previously suggested prefetching of classes but also to enhance possible class recommendations. Similarly, it is possible for ontology-engineering tool developers to minimize the necessary efforts of users to contribute to the ontology by implementing, for example, guided workflows that take the underlying edit strategies of the contributors into account.
Figure 6 panels: (a) International Classification of Diseases (ICD-11); (b) International Classification of Traditional Medicine (ICTM); (c) National Cancer Institute Thesaurus (NCIt); (d) Biomedical Resource Ontology (BRO); (e) Ontology for Parasite Lifecycle (OPL). Transition-map axes: To Relationship (columns) and From Relationship (rows), with the states Ancestor, Child, Cousin, Descendant, Other, Parent, Self, Sibling and BREAK; histogram axis: Frequency.
Figure 6: Results for the Hierarchical-Relationship Paths analysis: The columns and rows of the transition maps (bottom area of Figures 6(a) to 6(e)) represent the transition probabilities of a first-order Markov chain between hierarchical-relationship levels, where rows are source relationships and columns are target relationships. A sequence (or transition probability) is always read from row to column. Darker colors represent higher transition probabilities while lighter colors indicate lesser transition probabilities. Absolute probability values are dependent on the number of investigated rows and columns, hence relative differences are of greater importance. Across all datasets, aside from Self, a very clear trend towards editing the ontology along Siblings can be observed. The histograms (top area of Figures 6(a) to 6(e)) show the total number of occurrences of each relationship in the corresponding datasets aggregated over all users (again for a first-order Markov chain). Note that the y-axes for all histograms are scaled differently for each dataset. For the x-axes (and columns/rows of the transition maps) we only display relationships that occur at least once in the corresponding paths, thus the x-axes could be different from project to project. Given the very high amount of Self and Sibling transitions we can conclude that users, when they contribute to classes on the same depth level, follow a breadth-first strategy, meaning that they first concentrate their work on closely related classes (Siblings) on the same depth level before switching to a different branch on the same or any other depth level.
As classes in ICD-11 and ICTM have a large number of properties and for ICTM certain properties have to be added in multiple languages, the high transition probabilities towards Self (dark colors in the Self column) are not surprising. One possible explanation for this observation for ICD-11 could be the special functionality available in iCAT (for ICD-11) that allows users to export parts of the ontology as spreadsheets for local editing and adding property values. Once contributors have finished editing the spreadsheet they have to enter the data into the system manually, as no automatic import functionality is present. In the iCAT interface, users are simultaneously presented with the ontology tree for navigating through the classes and the corresponding properties and property values. When users select a property they can easily switch between classes, with the selected property staying selected, thus allowing them to quickly enter the same properties for different classes.
A similar behavior, yet not as dominant as in ICD-11 and ICTM, can be observed for NCIt and BRO and even to some extent in OPL, none of which use the export functionality. According to our observations, users travel along the underlying hierarchy when contributing to the ontology. Given the observations made for ICD-11 this behavior can be enforced by providing certain functionalities in the user-interface, especially when they complement the workflows of the contributors.
The results of this analysis have also shown that users are likely to pursue a certain strategy or intermediate goal for their edit sessions, for example changing all classes in a specific (narrow) area of the ontology. This is evident in the observation that after returning from a BREAK, users have a very high tendency to change the ontology “somewhere else” (see the transition probabilities from BREAK towards Other in the top-row of Figure 6), rather than picking up the work where they left off. This discovery is very important for developing class recommenders, as we may use the results of this analysis to suggest closely related classes to the current class a user is working on; however, when that user stays inactive for the duration defined for introducing BREAKs, the recommendation strategy has to be changed.
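The relationship labels used throughout this analysis (see the path and model description of this section) could be assigned along the following lines. This is a hedged sketch, not the authors' code: the function name and parameters are hypothetical, the inputs (depth levels, shortest-path distance, and whether the two classes share a parent or grandparent) are assumed to be precomputed, depth levels are assumed to grow with distance from the root, and a shortest-path distance of 0 is assumed to mean that the same class was changed again.

```python
def classify_relationship(first_depth, second_depth, shortest_path,
                          shared_parent=False, shared_grandparent=False):
    """Label the relationship of the second changed class relative to the
    first one, using the labels shown in Figure 6.

    All parameters are illustrative assumptions: depths are shortest-path
    lengths from the root, `shortest_path` is the distance between the two
    classes along isKindOf edges.
    """
    if shortest_path == 0:
        return "Self"

    difference = second_depth - first_depth
    # Pure vertical movement: the depth difference equals the path length.
    if difference != 0 and abs(difference) == shortest_path:
        if difference > 0:  # DOWN, i.e., further away from the root
            return "Child" if shortest_path == 1 else "Descendant"
        return "Parent" if shortest_path == 1 else "Ancestor"  # UP

    # Movement on the SAME depth level.
    if difference == 0:
        if shortest_path == 2 and shared_parent:
            return "Sibling"
        if shortest_path == 4 and shared_grandparent:
            return "Cousin"

    return "Other"


labels = [
    classify_relationship(3, 3, 2, shared_parent=True),
    classify_relationship(3, 3, 0),
    classify_relationship(3, 4, 1),
]
```

The resulting label sequences can then be merged and passed to a first-order Markov chain in the same way as the depth-level paths.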
Aside from analyzing different aspects of activity (Section 4.1) and the correlation between contribution patterns and the structure of an ontology (Section 4.2), we can use Markov chains to perform an analysis on the properties that are consecutively changed by users in an ontology. This means that, for example, if a property value was edited by a user, we extracted the property (not the value) and created chronologically ordered lists of properties whose values were changed by the corresponding users. For example, if a user changed the title of a specific class, we would extract title, rather than the value inserted into the title property. We provide insights into emerging patterns from different viewing angles for the observations. Thus, we look at property sequences for (a) single users (user-based) and for (b) single classes (class-based); see Section 3.2. We were not able to perform the Property Paths analysis on OPL and BRO as these datasets contain only a very limited number of unique property value changes during our observation periods. We also had to discard the results from NCIt, as the ontology-editing environment for NCIt provides a unique change-queuing mechanism that allows for multiple property values to be changed at the same time, making it impossible to extract chronologically ordered sequential property patterns.
Path & model description:
First, we extracted the properties whose values were changed in ICD-11 and ICTM, sorted either by user and timestamp or by class and timestamp. This resulted in two different types of chronologically ordered property lists, one ordered per user and one ordered per class (for both datasets). The properties in Property Paths represent the ones which can be assigned a value for each class in ICD-11 and ICTM. Whenever a change did not modify a property (e.g., because the change action dealt with moving or creating a class) we added the element no property to the corresponding path. A potential path for a single user or class then may look like: title, title, title, use. Similar to previous analyses, if the same user has consecutively changed the same property (e.g., in the previous example title) on the same class, we merged these multiple changes into one successive change. Analogously, however without the restriction of the same user, if the same property was changed on the same class, we merged these changes into one successive change. For the previous example, if the changes would have been performed editing the referenced properties for a single class, we would end up with the path: title, title, use.
Consequently, we fit a first-order Markov chain model on this set of paths (for users or classes). The final transition probabilities of the model then give us information about the probability of changing a value of one property Y after another property X either for users or for classes. For instance, we can find the property Y that most frequently has been changed after property X for classes.
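The extraction of user-based and class-based property paths described above can be sketched as follows. The change-log layout (dictionaries with the keys timestamp, user, class_id and property) and the function name are hypothetical and chosen only for illustration. Treating consecutive no property entries like any other repeated property when merging is an additional assumption, since the text does not address that case.

```python
from collections import defaultdict
from itertools import groupby

NO_PROPERTY = "no property"


def property_paths(changes, by):
    """Build chronologically ordered property paths.

    `changes` is a list of dicts with the (hypothetical) keys "timestamp",
    "user", "class_id" and "property"; "property" is None when a change did
    not modify a property. `by` is "user" for user-based paths or
    "class_id" for class-based paths.
    """
    grouped = defaultdict(list)
    for change in sorted(changes, key=lambda c: c["timestamp"]):
        grouped[change[by]].append(change)

    paths = {}
    for key, sequence in grouped.items():
        path = []
        # A run is a stretch of consecutive changes to the same property on
        # the same class; it is shortened to at most two entries.
        run_key = lambda c: (c["property"], c["class_id"])
        for (prop, _), run in groupby(sequence, key=run_key):
            label = prop if prop is not None else NO_PROPERTY
            path.extend([label] * min(len(list(run)), 2))
        paths[key] = path
    return paths


log = [
    {"timestamp": 1, "user": "u1", "class_id": "C1", "property": "title"},
    {"timestamp": 2, "user": "u1", "class_id": "C1", "property": "title"},
    {"timestamp": 3, "user": "u1", "class_id": "C1", "property": "title"},
    {"timestamp": 4, "user": "u1", "class_id": "C1", "property": "use"},
]
user_based = property_paths(log, by="user")       # {"u1": ["title", "title", "use"]}
class_based = property_paths(log, by="class_id")  # {"C1": ["title", "title", "use"]}
```

Grouping by user keeps the same-user restriction implicit, while grouping by class drops it, which mirrors the two merge rules in the text.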
Results:
When looking at the histograms (top area in Figures 7(a) to 7(d)) we can see that even after removing infrequently used properties, both datasets exhibit a few properties which have received a high number of changes, while the remaining majority of properties only received a very limited number of changes. For both datasets, aside from no property, the properties use, title and definition appear to be the most frequently used properties. As can be seen in the top area of Figures 7(a) and 7(b), multiple consecutive changes of the same property appear to be fairly common for both datasets. In contrast, when looking at Figures 7(c) and 7(d), which depict the transition probabilities between the sequences of properties changed by each user, we can see an even stronger trend towards consecutively changing the same properties across different classes, especially definition, title and use. For ICD-11, Figures 7(a) and 7(c) show that the class-based approach is less focused on consecutively changing the same property, evident in the brighter diagonal, when compared to the user-based approach. This is due to the export functionality available in iCAT combined with the manual process of inserting the same property for different classes by users of ICD-11. In contrast, such functionality is absent in ICTM, thus leading to similar behaviors for the class-based and user-based approaches for ICTM. The fact that a large portion of successive changes are conducted on the same property for both approaches analyzed for ICTM could also be due to the multilingual nature of the project, meaning that certain properties, such as title and definition, have to be entered multiple times in multiple languages. Similar results have been presented by Wang et al. [24], who used association rule mining techniques to analyze the change-logs of ICD-11 and ICTM.
Contributors in ICD-11 have a high tendency of performing no property changes after they return from a BREAK, followed by use, title and definition. In ICTM, users resume their work primarily by changing the title property and the definition property, followed by no property changes.
Footnote: All properties which were rarely edited have been removed from Figure 7 as they do not hold information but their removal increased the readability of the plots dramatically.
Figure 7 panels: (a) International Classification of Diseases (ICD-11) (Class); (b) International Classification of Traditional Medicine (ICTM) (Class); (c) International Classification of Diseases (ICD-11) (User); (d) International Classification of Traditional Medicine (ICTM) (User). Transition-map axes: To Property (columns) and From Property (rows), with BREAK as an additional state in the user-based panels; histogram axis: Frequency.
Figure 7: Results for the Property Paths analysis: The columns and rows of the transition maps (bottom area of Figures 7(a) to 7(d)) represent the transition probabilities of a first-order Markov chain between consecutively changed properties, where rows are source properties and columns are target properties. Figures 7(a) and 7(b) represent class-based patterns while Figures 7(c) and 7(d) visualize user-based patterns. A sequence (or transition probability) is always read from row to column. Darker colors represent higher transition probabilities while lighter colors indicate lesser transition probabilities. Absolute probability values are dependent on the number of investigated rows and columns, hence relative differences are of greater importance. Across all datasets a very clear trend towards consecutively editing the same properties can be observed. The histograms (top area of Figures 7(a) to 7(d)) show the total edits of each property in the corresponding datasets aggregated over all users and classes (again for a first-order Markov chain). Note that the y-axes for all histograms are scaled differently for each dataset. As ICTM and ICD-11 only share a limited number of properties, the x-axes (and columns/rows of the transition maps) are different from project to project. In both projects and across all 4 different approaches the title, definition and use properties are frequently used. For reasons of readability we were forced to remove properties from the plots which exhibited only a very limited number of changes and thus did not provide substantial information for the purpose of this analysis.
Interpretation & practical implications:
One of the main benefits of this analysis is the identification of commonly and consecutively changed properties for classes and users. In turn, this information might be used to suggest work (e.g., prompting a user to check a certain property by combining the User-Sequence Paths analysis and the Property Paths analysis), or by ontology-engineering tool developers to anticipate the property a user is most likely to change next. The fact that classes appear to exhibit more diverse property-contribution patterns when being changed than users could be a direct result of the multi-lingual nature of ICTM and the already mentioned export functionality present in iCAT. This means that, given the most recent property of a class that was edited, we may predict which property is most likely to be changed next. Similarly, we can predict the property a user is going to edit next.
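The prediction idea described above can be illustrated with a minimal sketch. It assumes that the change-log has already been turned into one sequence of changed properties per class (or per user); the function names and the toy sequences are hypothetical and are not taken from the datasets analyzed in this paper.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order transition probabilities from state sequences.

    Each row of the result corresponds to a source state, each entry in the
    row to a target state (read from row to column).
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count every pair of consecutively changed properties.
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: n / total for dst, n in row.items()}
    return probs


def most_likely_next(probs, current):
    """Return the most probable next state, or None for an unseen state."""
    row = probs.get(current)
    if not row:
        return None
    return max(row, key=row.get)


# Hypothetical sequences of changed properties (one list per class).
sequences = [
    ["title", "title", "definition", "definition", "use"],
    ["title", "definition", "definition"],
    ["use", "use", "title"],
]
probs = transition_probabilities(sequences)
print(most_likely_next(probs, "title"))  # definition
```

Normalizing each row separately mirrors how the transition maps are read (from row to column); a production setting would additionally need smoothing for rarely observed properties, which this sketch omits.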
5. Findings and discussion
In this section we first summarize our findings in Section 5.1 before we briefly discuss the potential applicability of higher order Markov chain models in Section 5.2. Next, we discuss differences between the investigated projects in Section 5.3 and, finally, point out potential limitations of this work in Section 5.4.

We will now discuss our main findings (Table 2) and explore their consequences.
Emergence of micro-workflows: By investigating whether sequential user-contribution patterns (see Section 4.1) can be identified in five different collaborative ontology-engineering projects, we have shown that users appear to work in micro-workflows: for all investigated projects, knowing which user last changed a class provides predictive information about the user who is going to contribute to that class next.

Additionally, although not presented in this paper for reasons of space, we have also conducted an analysis to determine the change type (e.g., adding a property value, moving a class, replacing a property value, etc.) a user is most likely to perform next (as shown in Walk et al. [30] for ICD-11). In this analysis we were able to extract a first-order Markov chain for all datasets presented in this paper, meaning that the last change type that a user performed contains information about the next change type of that user. When combining the information about the user who is most likely to contribute to a class next and the specific change action that this user is most likely to conduct (or the change action that is most likely conducted on a class next), we can create specific tasks for contributors, asking them to perform a certain change on a specific class.

Our results could be used by project managers and ontology-engineering tool developers to identify classes for users and users for classes, helping editors to minimize the effort necessary for finding and identifying classes to contribute to. Moreover, automatic means of curating and delegating work-tasks to users can be derived by ontology-engineering tool developers, which can potentially help to increase participation as discussed in Kittur and Kraut [31].

User roles can be identified: Across all datasets we were able to identify that a limited number of users have contributed the majority of all changes. These highly active users are very likely to be target users for all other users, meaning that they are very likely to change the same class after another user. Across all five datasets, we could identify the roles of these target users as moderators or administrators of the corresponding projects performing maintenance tasks, such as gardening (e.g., pruning outdated classes, fixing errors, etc.) or manual verification of newly added data.

Furthermore, we were able to show that moderators and administrators divide work among each other, as they are not very likely to change the same classes directly after another administrator or moderator, even though these users exhibit the highest absolute numbers of changes in the corresponding projects. Looking at the transition probabilities of Figure 3, it is possible to identify users or even groups of users who have a high tendency to work on the same classes and thus might be collaborators or might be reverting or correcting each other's changes.

Users edit the ontology top-down and breadth-first:
The Depth-Level Paths analysis (see Section 4.2.1) demonstrated that users have a very high tendency of staying in the same depth level when contributing to the ontology. If editors change depth levels while editing the ontology, they exhibit a minimal preference to do so in a top-down rather than a bottom-up manner. Furthermore, the results suggest that users move along the hierarchy, as we were able to show that they follow a top-down editing strategy for classes that are closer to the root node, while this changes to a bottom-up editing strategy for classes closer to the deepest depth levels, and transitions are more likely to occur along the immediate higher or lower depth level.

To further investigate the distances between changed classes at the same depth levels we investigated the Hierarchical Relationship Paths (e.g., child, parent, sibling, cousin, etc.) between these changed classes. We found that users, when they edit classes on the same depth level, follow a breadth-first manner, focusing on editing all the siblings of a class before switching to a completely different area of the ontology to continue their work after a BREAK.

Table 2: A summary of all findings applicable to all investigated biomedical ontologies. All listed findings are discussed in more detail in Section 5.

User-sequence paths (cf. Section 4.1)
  Users work in micro-workflows: Information about which users successively change a class can be identified; i.e., information about who has edited classes in the past contains predictive information about who is going to change a class next.
  User-roles can be identified: Looking at historic data, we can identify different user roles, i.e., administrators and moderators, gardeners (a contributor focused on pruning ontology classes and fixing syntactical errors) and users that frequently interact with (collaborate/revert) each other.

Structural paths (cf. Section 4.2)
  Users' edit behavior is influenced by the class hierarchy: Contributors, when adding content to the ontology, are influenced by the class hierarchy.
  Users edit the ontology top-down and breadth-first: By and large, users exhibit a minor tendency towards top-down editing behavior when changing hierarchy levels while contributing. However, when staying in the same hierarchy level, contributors rather follow a breadth-first edit behavior, moving from one sibling of a class to the next sibling.
  Users edit closely related classes: Contributors have a very high tendency to consecutively change closely related classes, as opposed to randomly and distantly related classes.

Property paths (cf. Section 4.3)
  Users perform property-based workflows: Contributors, when adding content to the ontology, tend to concentrate their efforts on one single property, which is added and edited for multiple classes.

Users edit closely related classes:
In addition to the breadth-first manner that users follow when editing classes in the same depth level, we discovered that users have a very high tendency to work on closely related classes (e.g., the sibling or cousin of the currently changed class). The information collected in Section 4.2 allows us to potentially predict (or narrow down) the class a user is going to contribute to next, which, if accurate, is very valuable information that could be used for a variety of improvements and adaptations. For example, project administrators could adjust the milestones of the development strategy to better reflect the way users contribute to the ontology, while user-interface designers could emphasize certain areas of the ontology to direct users towards specific classes (especially after they return from a BREAK) or implement pre-fetching algorithms to minimize load times. For contributors in particular, identifying and finding classes that they (i) want and (ii) have the necessary expert knowledge to contribute to is a time-consuming task, which can potentially be shortened by implementing a class recommender based on the results of the Structural Paths analysis and the User-Sequence Paths analysis.

Users perform property-based workflows:
The investigation of sequential patterns for property-contributions showed that in ICD-11, users have a very high tendency of consecutively changing the same property across multiple classes. We could also identify specific patterns that emerge when users successively change properties in collaborative ontology-engineering projects.

The results collected in Section 4.3 provide new insights for administrators and ontology-engineering tool developers, as they allow the generation of work-tasks (e.g., Please verify the property title of the class XII Diseases of the skin!). So far, users are always presented first with the section of the interface that allows for changing or adding the title and definition, which could be one explanation for the high probabilities of users changing these properties when returning from a BREAK.

Note that for this analysis we have used the data from ICD-11 and ICTM, which both share a very similar ontology-engineering tool, thus the results might be biased towards the used ontology-editor.
Based on our proposed methodology of using first-order Markov chain models (see Section 3.3), which resulted in the findings summarized in Section 5.1, we currently focus on detecting patterns derived only from successive interactions within collaborative ontology-engineering projects. This means that we identify how likely it is that one specific interaction follows another one (e.g., which user edits a class after another one). This follows from the definition of a first-order Markov chain, which is based on the Markovian property postulating that the next interaction only depends on the current one.

In contrast, Markov chain models can also be defined on higher orders; this means that the next state of the model (or interaction in our case) depends on a series of preceding ones instead of only the current one. For example, a second-order Markov chain model postulates that the next state depends on the current state and also the previous one. Previous studies suggest that human navigation on the Web might be better modeled by using higher order models compared to first-order models (e.g., [32, 29]). Hence, we could assume that this might also be the case for our use-case. By also modeling our data with such higher order models, we would potentially be able to identify longer patterns (e.g., User A regularly edits a class after User B and User C). Also, possible recommender systems could benefit from the additional predictive power of such higher order chains. While highly interesting, this analysis is out of scope for this article, which is why we leave it open for future work.

5.3. Differences between the investigated projects

Even though each project exhibits a different number of depth levels, which all receive a different amount of attention by the contributors, we can observe commonalities of edit strategies between them. For example, the levels 3 to 6 exhibit the highest number of changes in our observation period for ICD-11, while for OPL these levels are 6 and 7.

Regarding the hierarchical relationships we can see that consecutively changing the same class is very likely to happen in ICD-11, ICTM, BRO and OPL regardless of the source relationship (evident in the darker colored Self columns in Figures 6(a), 6(b), 6(d) and 6(e)). This
Self-relationship is still very prominent for NCIt; however, the transition probabilities towards Self are not as dominant as they are for the other datasets.

Another observation depicted in the transition maps is the clear focus on transitions from Sibling to Sibling across three out of five datasets, with the exception of ICTM and OPL. One explanation for ICTM could be the fact that some properties of the ontology are multi-lingual and thus require users to add multiple languages for the same property, which are all stored as a single change. For OPL, transitions other than those towards Self are generally scarce, indicating that users focused on editing and entering multiple property values (or one property value) of a single class before continuing to the next class.

When looking at the sequence of changed properties for each class (in contrast to: for each user) we can observe a concentration on consecutively changing the same property in ICTM, which is most likely a direct result of the multi-lingual nature of the properties used in this project. In ICD-11, on the other hand, transitions between changed properties of classes are much more diverse and less focused on transitions between the same properties. This observation indicates that either not all possible properties have received a substantial number of values and/or that users make use of the special export functionality of iCAT, so that successively changing the same property is less common as the content is only inserted once into the system.

In the User-Interface Sections Paths analysis we have mapped the changed properties to the corresponding sections of the user interface of the used ontology-engineering tools, which essentially represents a more abstract analysis of the
Property Paths analysis. By investigating the sequences of user interface sections we could confirm that, for ICD-11, users have a very high tendency to consecutively change the same properties for multiple classes, evident in the scarce transitions between different sections and the high concentration on transitions between the same sections. For ICTM this behavior was not as distinctive as it was for ICD-11, which could be due to the missing export functionality and therefore the lack of the previously explained manual import sessions.

In general these observations indicate that the absence or presence of a given functionality of the ontology-engineering tool can produce (and influence) different editing behaviors when developing an ontology.

Footnote: Note that it is necessary to apply model selection techniques as described in [29] in order to identify the most appropriate Markov chain order based on statistically significant improvements of higher orders compared to lower orders.

We were not able to recreate the exact class hierarchy of the ontology for every single change across our observation periods for all datasets. This limitation is partly due to a lack of detail in the change-logs. Thus, we decided to base our analysis on all five ontologies as they were at the latest point in time, which is also what would most likely be used in a real-world scenario.

For example, if a class was changed by a user while it was located on depth level 3 and at a later point in time moved to a different location where it now resides at depth level 5, we would assume that this class has always been on depth level 5. Please note that this bias is only present in the Structural Paths analyses (Section 4.2). To measure the extent of the potential bias, we counted all changes that were performed on a class before it was moved within the ontology. Applying this rule to our change dataset, we collected a total of 116,204 of 439,[...] [...],958 of 67,522 for ICTM. These numbers represent about 1/[...] [...] [...],507 (ca. 1/10), and for OPL 2 of 1,993 of all changes were performed on classes which have been moved afterwards.

Note that an additional requirement for the identification of sequential patterns in collaborative ontology-engineering projects using Markov chains is the availability of rather large change-logs. In general, the less common entities (e.g., properties) are in the change-log, the (exponentially) more observations have to be available in order to detect more fine-grained patterns. Without enough observations (changes), the identification of sequential patterns is either very hard, and can only be approximated, or not possible at all. As can be seen in Table 1, we have selected all of our datasets to satisfy this requirement, as all chosen datasets exhibit a substantial number of changes.

Furthermore, we have included artificial session breaks into our analysis as described by Walk et al. [30] to analyze where or what users start to edit in the ontology and where or what users edit before they take a break. For all user-based analyses we have introduced a
BREAK if two consecutive changes of the same user were more than 5 minutes apart.

All analyses in this paper are based on isKindOf relationships for determining distances and locations within the ontology. We plan on further expanding this analysis by investigating the impact of other kinds of relationships and other features that are available in ontologies on our pattern detection approach.

Even though all datasets presented in this paper were created with WebProtégé or one of its derivatives, there is only one requirement for practitioners who want to perform this analysis on other ontologies: the availability of a change-log (in the granularity required for the intended analyses) that can be mapped onto the underlying ontology. Note that it would be possible to conduct this analysis for ontologies created by single individuals, meaning that “collaboration” is only a requirement when the nature of the analysis requires investigating transitions between multiple users.

Also, the kind of knowledge base (classification, taxonomy or ontology), the representation language used (e.g., OWL and OWL-DL expressivity, RDF, Turtle) or the development tool of a particular collaborative ontology-engineering project does not prohibit conducting a pattern analysis as presented in this paper, as long as the underlying knowledge base (and thus the change-log) exhibits the necessary granularity and the semantic properties of interest for the analysis.

However, this also means that the differences between the knowledge representation languages used (i.e., expressivity and types) are not considered by our analysis, with NCIt being a thesaurus and the rest of the investigated datasets being ontologies. Thus, whenever differences are observed between NCIt and the remaining datasets, further research is warranted to determine the origin of this observation.

Furthermore, the analysis presented relies on investigating usage logs of collaborative ontology-engineering projects by looking at changes performed by users of the corresponding systems. As this only represents one possible way of interacting with the underlying ontology, albeit the most frequently used one, extending the conducted Markov chain investigation to include, for example, discussions for consensus building, suggestions of terms by users or automatic imports warrants future work.
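The session-break rule described in this section (a BREAK whenever two consecutive changes of the same user are more than 5 minutes apart) can be sketched as follows. This is a minimal illustration, assuming that the changes of a single user are available as time-ordered (timestamp, state) pairs; the function name `insert_breaks` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

BREAK = "BREAK"


def insert_breaks(changes, max_gap=timedelta(minutes=5)):
    """Insert an artificial BREAK state between changes that lie too far apart.

    `changes` is a list of (timestamp, state) pairs of one user, sorted by
    timestamp. A BREAK is inserted only if the gap is strictly longer than
    `max_gap`.
    """
    result = []
    previous_time = None
    for timestamp, state in changes:
        if previous_time is not None and timestamp - previous_time > max_gap:
            result.append(BREAK)
        result.append(state)
        previous_time = timestamp
    return result


# Hypothetical change-log excerpt of one user (state = changed property).
changes = [
    (datetime(2013, 1, 1, 10, 0), "title"),
    (datetime(2013, 1, 1, 10, 2), "title"),
    (datetime(2013, 1, 1, 10, 10), "definition"),  # 8 minutes later
    (datetime(2013, 1, 1, 10, 15), "use"),         # exactly 5 minutes later
]
with_breaks = insert_breaks(changes)
print(with_breaks)  # ['title', 'title', 'BREAK', 'definition', 'use']
```

Treating BREAK as an ordinary state means that the resulting sequences can be passed unchanged to any transition-counting routine, which is how the BREAK rows and columns in the user-based transition maps can be read.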
6. Related work
For the analysis and evaluation conducted in this paper, we identified relevant information and publications in the domains of (i) Markov chain models, (ii) collaborative authoring systems and (iii) sequential pattern mining.
In the past, Markov chain models have been heavily applied for modeling Web navigation; some sample applications of Markov chains can be found in [33, 34, 35, 36, 37, 38]. Also, the Random Surfer model in Google’s PageRank [39] can be seen as a special case of a Markov chain.

Previously, researchers investigated whether human navigation is memoryless (i.e., of first order) in a series of studies (e.g., [40, 36]). These studies mostly showed that the memoryless model seems to be a quite plausible abstraction (see e.g., [41, 42, 37, 38]). Recently, a study picked up on these investigations and suggested that the Markovian assumption (i.e., property) might be wrong [32]. However, this study did not reveal any statistically significant improvements of higher order models. Singer et al. [29] solved this problem by developing a framework for determining the appropriate order of a Markov chain for a given set of input data. In Walk et al. [30] we applied and mapped the presented framework onto structured logs of changes and provided an in-depth description of the requirements and steps necessary to use the framework in this setting.

In this paper we present a detailed analysis of sequential patterns by applying and analyzing Markov chains across the change-logs of five collaborative ontology-engineering projects in the biomedical domain. A more detailed explanation of the steps necessary to apply Markov chains to the change-logs of collaborative ontology-engineering projects is presented in Walk et al. [30]. Note that we focus on applying first-order Markov chain models in this work, while we see the application of higher order models as highly interesting future work, as discussed in Section 5.2.
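To make the difference between first-order and higher order models concrete, the following minimal sketch estimates transition probabilities conditioned on the last `order` states. It is only an illustration under simple assumptions (maximum-likelihood counts, hypothetical user names) and does not implement the model selection framework of Singer et al. [29].

```python
from collections import Counter, defaultdict


def higher_order_probabilities(sequences, order=2):
    """Estimate P(next state | previous `order` states) from sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            # The context is the tuple of the `order` preceding states.
            context = tuple(seq[i - order:i])
            counts[context][seq[i]] += 1
    return {
        context: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for context, row in counts.items()
    }


# Hypothetical sequences of users who successively edited the same class.
sequences = [
    ["User B", "User C", "User A"],
    ["User B", "User C", "User A"],
    ["User D", "User C", "User B"],
    ["User B", "User C", "User D"],
]
first_order = higher_order_probabilities(sequences, order=1)
second_order = higher_order_probabilities(sequences, order=2)

# After "User C" alone, "User A" follows in 2 of 4 cases; after the pair
# ("User B", "User C"), "User A" follows in 2 of 3 cases.
print(first_order[("User C",)]["User A"])
print(second_order[("User B", "User C")]["User A"])
```

The number of possible contexts grows exponentially with the order, which is one reason why higher order models require considerably more observations and a statistical comparison of orders before they can be preferred over a first-order model.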
Research on collaborative authoring systems such as Wikipedia has in part focused on developing methods and studying factors that improve article quality or increase user participation. These problems represent important facets of collaborative authoring systems, and solutions to tackle these problems are of interest for collaborative ontology-engineering projects.

For example, Cabrera and Cabrera [43] demonstrated that minimizing the costs and efforts necessary for users to contribute can potentially lead to higher contribution rates. Another approach, also presented by Cabrera and Cabrera [43], focuses on providing an environment where interactions and communication between contributors are encouraged and performed frequently over a long period of time to establish a group identity and to promote personal responsibility.

More recent research on collaborative authoring systems, such as Wikipedia, focuses on describing and defining not only the act of collaboration amongst strangers and uncertain situations that contribute to a digital good [44] but also on antagonism and sabotage of said systems [45]. It has also been discovered only recently that the number of Wikipedia editors is slowly but steadily declining [46]. Therefore Halfaker et al. [47] have analyzed what impact reverts have on new editors of Wikipedia. Kittur and Kraut [31] showed that an increase in participation can be achieved by directly delegating specific tasks to contributors. As simple as this approach may appear, the identification of work (and thus specific tasks) is still a tedious and time-consuming process, which can only partly be automated due to its complexity.

With the analysis that we described here, we provide new results that we can use to tackle some of the problems for collaborative authoring systems. These problems are also present in collaborative ontology-engineering projects. For example, we can identify new tasks by combining the results of the User-Sequence Paths (Section 4.1) and Property Paths (Section 4.3) analyses to suggest classes, and the corresponding properties to work on, to users.
In 1995, Agrawal and Srikant [48] first addressed the problem of sequential pattern mining. They stated that, given a collection of chronologically ordered sequences, sequential pattern mining is about discovering all sequential patterns weighted according to the number of sequences that contain these patterns. The presented algorithm represents one of the first a priori sequential pattern mining algorithms. This means that a specific pattern cannot occur more frequently (above a threshold) if a sub-pattern of this pattern occurs less often (below that threshold). Other examples of a priori algorithms are [49, 50].

One of the biggest problems associated with a priori based sequential pattern mining algorithms was (in the worst case) the exponential number of generated candidates. To tackle this problem, Han et al. [51] developed the FP-Growth algorithm. Many researchers have adapted different algorithms and approaches for different domains to anticipate changing requirements, such as Wang and Han [52] and Hsu et al. [53], who analyzed algorithms for sequential pattern mining in the biomedical domain.

In Walk et al. [30] the authors have presented a novel application of Markov chains to mine and determine sequential patterns from the structured logs of changes of collaborative ontology-engineering projects. Making use of this framework, we investigate differences and commonalities across five different collaborative ontology-engineering projects from the biomedical domain.
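The a priori property mentioned above can be illustrated with a small sketch. It makes the simplifying assumption that a sequence contains a pattern if the pattern's items occur in the same order, not necessarily consecutively, and that each element is a single item rather than an itemset; the helper names and sample sequences are hypothetical and this is not the algorithm of Agrawal and Srikant [48] itself.

```python
def contains(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` as an ordered subsequence."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator up to the first match,
    # so the order of the pattern items is respected.
    return all(item in remaining for item in pattern)


def support(sequences, pattern):
    """Count the sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sequences)


# Hypothetical sequences of changed properties.
sequences = [
    ["title", "definition", "use"],
    ["title", "use"],
    ["definition", "title", "use"],
    ["use", "title"],
]
print(support(sequences, ["title"]))                       # 4
print(support(sequences, ["title", "use"]))                # 3
print(support(sequences, ["title", "definition", "use"]))  # 1
```

Because every sequence that contains a pattern also contains all of its sub-patterns, the support can only shrink as a pattern grows; a priori style algorithms exploit this to discard candidate patterns whose sub-patterns already fall below the threshold.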
7. Conclusions & future work
In this work, we discovered intriguing social and sequential patterns that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development. Specifically, our results indicate that patterns can be found in all investigated projects, even though the National Cancer Institute Thesaurus (NCIt), the International Classification of Diseases (ICD-11), the International Classification of Traditional Medicine (ICTM), the Ontology for Parasite Lifecycle (OPL) and the Biomedical Resource Ontology (BRO) (i) represent different projects with different goals, (ii) use variations of the same ontology-editors and tools for the engineering process and (iii) differ in the way the projects are coordinated. Using the presented Markov chain analysis, multiple different user-roles could be identified in all investigated datasets. We were also able to see that users work in micro-workflows, meaning that given a specific user, we can identify the most likely users that are editing a specific class next, again independent of the investigated project. When contributing to a project that is created using WebProtégé, iCAT, iCAT-TM or Collaborative Protégé, users exhibit a tendency to do so in a top-down and breadth-first manner, editing primarily closely related classes while moving along the ontological hierarchy. In ICD-11 and ICTM we were able to identify property-based workflows, meaning that users concentrate their efforts on adding and editing values for one specific property for multiple classes.

The analysis presented not only provides new insights about the engineering and development processes of each single project, but also shows that the analysis of sequential patterns potentially provides actionable insights for different stakeholders in collaborative ontology-engineering projects.
Furthermore, the information about the next possible action (e.g., a user, a change-type, a property, a set of classes) or the combination of several of these next actions could be used by ontology-engineering tool developers to support users in collaboratively creating an ontology, for example by making use of the Property Paths analysis to highlight, prefetch, rearrange or adjust sections and content of the interface dynamically, according to the user’s needs.

The next logical step to further deepen our understanding of collaborative ontology-engineering projects involves applying the gathered results to productive and live environments, for example as a plug-in for (Web)Protégé. Simultaneously, this would allow us to collect valuable data to quantify the usefulness and actionability of the results generated with our presented approach in real-world scenarios.

Additionally, expanding the Markov chain analysis to take other types of interactions (e.g., discussions, automatic imports and term suggestions by users) into account represents a potential topic of future work. This also includes a detailed analysis of human factors in terms of user studies (e.g., with a heuristic evaluation or A/B testing) or more sophisticated approaches, such as eye tracking, to assess the usefulness of the presented results for supporting users when collaboratively engineering an ontology.

Furthermore, as change tracking and click tracking data will likely become available more broadly in the future, we believe that the analysis of this paper and the possible benefits of putting the results into practical use represent an important step towards the development of better (and simpler) ontology editors, which can dynamically anticipate the editing style of the users. Project administrators could make use of the results of the analysis, for example by allowing for easier delegation of work to the “right” users. This is even more relevant when considering that the Markov chain analysis is not computationally intensive, making it highly suitable for productive use.

As biomedical ontologies play an increasingly critical role in acquiring, representing, and processing information about human health, we can use quantitative analysis of editing behavior to generate potentially useful insights for building better tools and infrastructures to support these tasks.
Acknowledgement
This work was generously funded by a Marshall Plan Scholarship with support from Graz University of Technology. Further, this work is supported in part by grants GM086587 and GM103316 from the U.S. National Institutes of Health.