What is all this new MeSH about? Exploring the semantic provenance of new descriptors in the MeSH thesaurus
Anastasios Nentidis, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras
Abstract
The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary widely used in biomedical knowledge systems, particularly for semantic indexing of scientific literature. As the MeSH hierarchy evolves through annual version updates, some new descriptors are introduced that were not previously available. This paper explores the conceptual provenance of these new descriptors. In particular, we investigate whether such new descriptors have been previously covered by older descriptors and what their current relation to them is. To this end, we propose a framework to categorize new descriptors based on their current relation to older descriptors. Based on the proposed classification scheme, we quantify, analyse and present the different types of new descriptors introduced in MeSH during the last fifteen years. The results show that only about 25% of new MeSH descriptors correspond to new emerging concepts, whereas the rest were previously covered by one or more existing descriptors, either implicitly or explicitly. Most of them were covered by a single existing descriptor and they usually end up as descendants of it in the current hierarchy, gradually leading towards a more fine-grained MeSH vocabulary. These insights about the dynamics of the thesaurus are useful for the retrospective study of scientific articles annotated with MeSH, but could also be used to inform the policy of updating the thesaurus in the future.
Keywords
MeSH · terminology extension · semantic indexing

National Center for Scientific Research “Demokritos”, Athens, Greece
E-mail: {tasosnent, akrithara, paliourg}@iit.demokritos.gr

Aristotle University of Thessaloniki, Thessaloniki, Greece
E-mail: {nentidis, greg}@csd.auth.gr

arXiv preprint [cs.DL]

The Medical Subject Headings (MeSH) thesaurus is a collection of hierarchically organized entities for annotating biomedical knowledge, primarily literature in PubMed/MEDLINE, with topic labels. The basic conceptual entity is the MeSH concept, which is a collection of synonymous terms for a particular domain meaning. Each concept has a preferred term, which is also used as the name of the concept. MeSH concepts are not directly used for annotating the literature. Each MeSH concept belongs to exactly one MeSH descriptor, which is a collection of closely related concepts, and forms the basic element used for annotating biomedical literature with topic labels. All the concepts and terms of a descriptor are equivalent for the purposes of indexing and searching MEDLINE. Beyond MeSH concepts and descriptors, MeSH also provides some Supplementary Concept Records (SCRs) that are directly used for annotating articles with labels for substances, rare diseases and organisms.

https://meshb.nlm.nih.gov/

As the MeSH hierarchy evolves through annual updates, new descriptors are introduced that were previously unavailable. This evolution of MeSH is essential, in order to follow the development of knowledge in the field. For example, new descriptors can be more fine-grained than old ones, providing a level of detail previously unavailable in the vocabulary. On the other hand, new high-level descriptors can also be added, providing new groupings of topics, in light of the current understanding of the domain. In some cases, the topics covered by the new descriptors may have been present in MeSH previously, covered by older descriptors. However, some new descriptors may cover topics that are totally new to the vocabulary, representing emerging concepts in the domain.

Despite their necessity for keeping MeSH up-to-date, the introduction of new descriptors raises practical challenges.
Several applications, such as (semi-)automated semantic indexing of biomedical literature with MeSH labels, are based on supervised learning techniques that exploit accumulated data from previous use of the vocabulary. However, for new descriptors no such annotated literature is available at the time of their introduction. Therefore, it becomes important to devise a mapping of existing literature to the new descriptors. Towards this direction, for each new descriptor we are interested in whether the corresponding topic was already covered by old descriptors in MeSH or not.

In this context, the basic questions motivating this study on the provenance of new descriptors are the following:
– To what extent do the new MeSH descriptors cover emerging domain concepts that are really new for the MeSH thesaurus?
– For those new descriptors that do not cover emerging domain concepts, can we identify older descriptors that were used to cover these concepts?
– What is the current relation of the new descriptors with the old ones that they are related to?
– Is there any pattern over time concerning the introduction of new descriptors in MeSH and how the new descriptors relate to the old ones?

The main contribution of this work consists in developing a conceptual framework for exploring the provenance of new MeSH descriptors considering the hierarchical structure of the thesaurus. In particular, we describe an approach for identifying predecessor descriptors that used to cover the topic of a new descriptor previously. Namely, a coding system is introduced for organizing the new descriptors based on two key dimensions: a) whether and how they have been covered in MeSH prior to their introduction, and b) their current position in the hierarchy in relation to their predecessors. In addition, a method is developed for the computational identification of predecessors and conceptual provenance codes for new MeSH descriptors.
Finally, based on the proposed framework we perform an analysis that sheds light on the conceptual provenance of descriptors introduced in MeSH during the last fifteen years.

The rest of this paper is structured as follows. In Section 2 some background knowledge is summarized, regarding the structure of basic elements of the MeSH thesaurus and their relationships. In Section 3 we provide a brief overview of work related to the extension of biomedical thesauri with new concepts, focusing on the MeSH thesaurus. In Section 4 we propose a framework for identifying the predecessors of new descriptors and introduce new types of conceptual provenance to characterize the current relation of new MeSH descriptors with their predecessors. In Section 5 we propose a method to automatically analyze the versions of the MeSH hierarchy, in order to identify the various types of provenance. In Section 6 we present and discuss the results of this analysis, which lead to useful insights about the evolution of MeSH. Finally, in Section 7 we conclude and indicate potential uses of our results in future research.

Fig. 1 MeSH concepts are grouped into descriptors which are hierarchically organized and can also have a “Previous Indexing” note (PI). Each Supplementary Concept Record (SCR) is mapped to at least one descriptor.
In the MeSH hierarchy, each descriptor has exactly one preferred concept and may also have some subordinate (narrower, broader, or related) concepts that attach additional terms to the descriptor. For example, the descriptor for Dementia (Fig. 1) consists of three concepts: the preferred concept, which has two synonymous terms (“Dementia” and “Amentia”), and two narrower concepts with a single term each. The preferred concept is the reference point for defining the subordinate concepts as narrower, broader, or related. Therefore, we consider the preferred concept as the dominant entity representing the main topic of a descriptor.

The MeSH descriptors are hierarchically organized so that most descriptors have at least one broader descriptor as parent. For example, the “Alzheimer Disease” (AD) descriptor has two parents, namely the descriptors “Dementia” and “Tauopathies”, as shown in Fig. 1. Additionally, there are some top-level descriptors that have no parents and are the roots of the trees in the MeSH hierarchy, which are called MeSH trees. The MeSH trees are grouped into sixteen MeSH categories, and each descriptor belongs to one or more MeSH trees and corresponding MeSH categories. For example, the AD descriptor belongs to the “Nervous System Diseases (C10)” tree in the “Diseases” (C) MeSH category and to the “Mental Disorders (F03)” tree in the “Psychiatry and Psychology” (F) category.

The exact position of a descriptor in a tree is determined by one or more tree numbers or tree paths. Each MeSH tree number of a descriptor extends a tree number of some parent, and recursively includes the tree numbers for a series of ancestors reaching up to the corresponding root. For example, the AD descriptor has two tree numbers extending the tree numbers of “Dementia” (F03.615.400.100, C10.228.140.380.100) and one tree number extending the tree number of “Tauopathies” (C10.574.945.249).

The SCRs, which are also known as Supplementary Chemical Records, have a similar conceptual structure to descriptors, with one preferred MeSH concept and potentially some subordinate ones, but they are not part of any descriptor and are not directly included in the MeSH hierarchy. However, they are mapped to at least one descriptor. In Fig. 1 for example, the SCR “Presenile And Senile Dementia” is mapped to the AD descriptor, as indicated by a dotted arrow towards the latter. In practice, this means that when indexers in PubMed/MEDLINE use this SCR to annotate an article, the article will also get automatically annotated with the mapped descriptors [12]. This mapping is important because it defines which descriptors cover the meaning of each SCR at the level of main MeSH topic annotations, which are primarily used for indexing and searching the literature.
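As an illustration of how tree numbers encode positions in the hierarchy, the following minimal Python sketch derives the parent position, the ancestor positions, the tree and the category from a single tree number. It assumes only the dot-separated format of the examples above; the function names are ours, introduced for illustration, and are not part of any MeSH tooling.

```python
from typing import List, Optional


def parent_tree_number(tree_number: str) -> Optional[str]:
    """Return the tree number of the parent position, or None for a tree root."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None


def ancestor_tree_numbers(tree_number: str) -> List[str]:
    """Return the tree numbers of all ancestor positions, from the root downwards."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def tree_of(tree_number: str) -> str:
    """Return the root of the MeSH tree, e.g. 'C10' for Nervous System Diseases."""
    return tree_number.split(".")[0]


def category_of(tree_number: str) -> str:
    """Return the MeSH category letter, e.g. 'C' for Diseases."""
    return tree_number[0]


# Tree numbers of the "Alzheimer Disease" descriptor, as quoted in the text.
AD_TREE_NUMBERS = ["F03.615.400.100", "C10.228.140.380.100", "C10.574.945.249"]

for tn in AD_TREE_NUMBERS:
    print(tn, "-> parent:", parent_tree_number(tn),
          "| tree:", tree_of(tn), "| category:", category_of(tn))
```

Under these assumptions, the three parent positions recovered for AD are the two tree numbers of “Dementia” and the tree number of “Tauopathies”, in agreement with Fig. 1.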
The MeSH evolution is often studied in the broader context of the evolution of dynamic biomedical terminologies [8]. In this area, the effort has often been on defining and studying elementary and composite changes that require one or more basic operations, e.g. adding, removing, merging, splitting, editing and moving elements of a biomedical terminology. For instance, the CONCORDIA framework, which stands for “CONcept and Change-Operation Representation for any DIAlect”, was proposed for representing, reporting and documenting different types of change in medical terminologies [13], and MeSH was explicitly considered in this study. Studying different types of change in MeSH is also the focus of the work presented in this paper. The basic premise is that by studying the basic operations that lead to a change, one can identify the conceptual source of new elements.

Though MeSH is not an ontology, it is often mentioned or even treated as such in relevant literature, and it has also been transformed into a (meta-)ontology, in an effort to formally express all knowledge about semantically indexing MEDLINE [1]. McCray and Lee [11] studied the evolution of MeSH in an ontological context, as a conceptualization of the biomedical domain. They focused on the evolution of MeSH category “Psychiatry and Psychology” (F), capturing and quantifying change at the level of descriptors, as well as at more fine-grained terminological and lexical levels. In particular, they investigated whether change reflects the evolution of corresponding knowledge in the biomedical domain. Their results reveal that change in MeSH reflects both the evolution of biomedical knowledge, as well as some internal ontological restructuring efforts, such as the separation of behaviors from disorders.

Recently, Balogh et al. [3] studied the evolution of MeSH as a network focusing on the addition and removal of links between MeSH descriptors, which they call “attachment” and “detachment” of links respectively.
Interestingly, they investigated whether these rewiring events are associated with certain descriptor properties, such as the number of parents or descendants in the hierarchy. Their results suggest that old MeSH descriptors with many descendants appear to receive and lose children descriptors more than expected by chance. On the other hand, descriptors with many ancestors appear to receive and lose children descriptors less than expected by chance.

More recently, Cardoso et al. [5] suggested the interlinking of distinct versions of MeSH, developing a historical knowledge graph, to extend queries for biomedical literature retrieval and to support the maintenance of semantic annotations. In particular, they introduce “evolution connections” between descriptor elements (e.g. terms) in different versions. In some cases, these evolutionary relationships can indicate the conceptual provenance for new descriptors, whereas in other cases they express more fine-grained internal restructuring, such as relocating a term from one MeSH concept to another. Identifying the conceptual provenance of descriptors is also central in the work presented in this paper. However, the focus here is at the level of topics and the goal is to capture additional provenance connections, beyond MeSH concepts, namely through SCRs and Previous Indexing information.

Some related studies also focus on the identification of the elements that need to change in MeSH and try to automate this process. For example, Sari [14] proposed an approach for propagating changes already incorporated in the Gene Ontology into appropriate changes in MeSH. Other studies attempt to predict the extension of MeSH based on different approaches. For instance, Fabian et al. [10] proposed a method for finding siblings to a set of MeSH terms, analyzing the structure and the content of HTML pages in the Web. Guo et al. [16], on the other hand, proposed a “structure-based” method for recommending new siblings for MeSH descriptors, which was exclusively based on the positions of existing terms in the MeSH hierarchy. Eljasik-Swoboda et al. [9] proposed an embedding-based method for suggesting new sub-topics for existing topics of MeSH, combining both knowledge about the hierarchy and the analysis of documents already annotated with specific MeSH labels.

Other studies proposed approaches that, beyond the analysis of annotated corpora and the structure of the hierarchy, incorporate temporal information about the history of MeSH as well. Tsatsaronis et al. [15] proposed a method to predict which MeSH descriptors should be expanded with new children. This method combined information about the number of articles annotated with each descriptor in PubMed with information about the hierarchical position of each descriptor in MeSH and temporal features that capture changes. Cardoso et al. [6] also proposed a method for identifying concepts that require revision, based on structural and temporal information, as well as information from other resources including the UMLS and article annotations in PubMed.
Beyond the expansion of concepts with more children, their method also suggested other types of revision, such as removal and relocation.

Finally, the MeSH thesaurus has also been considered in some studies in the context of topic modeling and evolution in the biomedical domain. In these works, MeSH labels are treated as keywords to develop a network of keyword co-occurrence from a document corpus, where latent (meta-)topics can be detected as clusters or communities. In this context, Castillo et al. [7] aligned such (meta-)topics of MeSH terms extracted from different time intervals based on the similarity of the corresponding sets. Then, they present an overview of the evolution of these matched (meta-)topics as a phylogeny-inspired network, where evolutionary events like merge and split can be identified. Balili et al. [2] propose the TermBall approach for both tracking and forecasting the evolution of such (meta-)topics of MeSH terms, treating them as evolving communities in the term co-occurrence network. The work presented here investigates evolutionary relations between biomedical topics as well; however, we focus at the level of MeSH descriptors, as used for indexing in PubMed, rather than the broader level of latent (meta-)topics reflecting the evolution of a domain.

In this study, we focus on newly added descriptors during the extension of MeSH. In particular, we study whether the meaning of each new descriptor has been covered by old descriptors previously, and if so, how its new position in the hierarchy relates to those of its predecessors. In contrast to most related work in the biomedical terminology evolution context, which focuses either on the operations to implement a change (addition, merge etc.) or on general features of the descriptors, such as depth in the hierarchy, we aim at characterizing the new descriptors according to their conceptual provenance.
In other words, we examine how they are related to their predecessors in previous versions of MeSH. In order to achieve this characterization, we investigate how we can identify the predecessors and introduce provenance types that provide new insight into the study of MeSH evolution.
In this section, we introduce a conceptual model to characterize and group new MeSH descriptors based on their conceptual provenance. That is, we investigate the cases of previous coverage of new descriptors during the extension of MeSH. We define the notion of Previous Host (PH), as a predecessor of a new descriptor, and describe categories of descriptors based on how these predecessors can be identified. Subsequently, we introduce types of conceptual provenance, to characterise interesting cases of new descriptors, based on their current relation with each of their PHs in the hierarchy of MeSH.

4.1 MeSH extension and provenance

As the MeSH hierarchy evolves, the new descriptors introduced may cover domain concepts that are not totally new to the vocabulary. Some concepts may have been explicitly present in the previous version of MeSH. In particular, a concept of a new descriptor may have been available as a subordinate concept of an old descriptor or as an SCR concept. The latter case, of turning an SCR concept into a descriptor, is usually reported in a textual note in the new descriptor, called Public MeSH Note (PMN). For example, the “Adenocarcinoma of Lung” descriptor that was introduced in 2019, shown in Fig. 2, has a PMN field indicating its previous state as an SCR mapped to the “Adenocarcinoma” and “Lung Neoplasms” descriptors. Therefore, literature for “Adenocarcinoma of Lung” annotated in 2018 can be found with “Adenocarcinoma” and “Lung Neoplasms” topic labels.

Fig. 2 The promotion of the SCR “Adenocarcinoma of Lung” into a descriptor in 2019.

In addition, even in cases where the concepts of the new descriptor have not been explicitly available as such, their meaning may have been implicitly covered by old descriptors. Such information is usually available as a Previous-Indexing note (PI) in the new descriptors, as in the case of the “Tauopathies” descriptor in Fig. 1. The PI note indicates that some old descriptors were used to annotate literature for the topic of the new descriptor, during a specific period prior to its introduction. For example, the “tau Proteins” descriptor was used to annotate articles about “Tauopathies” since 1997. This changed in 2002, with the introduction of a descriptor for “Tauopathies”.

In this work, we refer to the MeSH version of the introductory year of a new descriptor as version 1, and the last year before version 1 as version 0. In addition, we refer to such old descriptors that were used to annotate literature for the topic of the new descriptor in the version prior to its introduction (version 0), as its Previous Hosts (PHs). Apart from identifying the PHs of a new descriptor, the current relation of the new descriptor with its PHs is also important. For example, the new descriptor “Adenocarcinoma of Lung” was positioned in the hierarchy as a child of its two PHs (Fig. 2). Therefore, literature for “Adenocarcinoma of Lung” is still covered by “Adenocarcinoma” and “Lung Neoplasms”, as done prior to the introduction of the new descriptor. On the other hand, the new descriptor for “Tauopathies” is not hierarchically related with its PH “tau Proteins”.
As a result, literature for “Tauopathies” is not covered by the “tau Proteins” descriptor after the introduction of the new descriptor in 2002.

Although the above cases are common, they are not the only types of relation one encounters between a new descriptor and its PH(s). Furthermore, as the MeSH hierarchy keeps evolving, the relation of a descriptor with one or more of its PHs can change in subsequent years, complicating the situation further. Therefore, this relationship depends on the version of MeSH considered, which we call reference version. In this work, aiming at a profound understanding and improved handling of new MeSH descriptors, we investigate their origin. That is, we examine whether they have been covered by descriptors (PHs) in the corresponding version 0, and if so, what their current relation to each of these descriptors is. In order to better quantify and organize these cases we define types of “conceptual provenance” for the new descriptors.

4.2 Previous Hosts (PHs)

A PH of a new descriptor is defined as a descriptor that was used to annotate articles for the topic of the new descriptor in the version 0 of the new descriptor. In that sense, we say that the PH used to cover the topic of the new descriptor for the purposes of literature annotation in version 0. However, it is not required that a PH used to cover the topic exclusively. That is, a PH may have been used for indexing other topics as well, apart from the topic of the new descriptor. Therefore, several new descriptors may share the same PH. In addition, it is not required that a PH used to cover the topic of the new descriptor entirely.
That is, a PH may have been used for indexing only part of a topic, for example in cases where a new high-level descriptor is added to provide a new grouping of related topics.

A formal definition for a PH descriptor d0 for a new descriptor d1 can be based on the condition of topic-overlap as follows:
– The topic-overlap(d1, d0, v) is true when articles for the main topic of d1 used to be indexed under d0 in the MeSH version v. In all other cases, topic-overlap(d1, d0, v) is false.
– The previous-host(d1, d0), denoting that d0 is a PH of d1, is true when topic-overlap(d1, d0, v) is true, where v is the version 0 of d1. In all other cases, previous-host(d1, d0) is false.

This definition of a PH focuses only on the MeSH version that precedes the introduction of the new descriptor (version 0). Descriptors that used to cover the topic in older versions can be recursively described as the PHs of a PH and so on. However, our original motivation is to characterize each new descriptor based on whether its topic was already covered by MeSH at the time of its introduction (version 1), or not. Therefore, in this work we do not track the history of each new topic in the distant past.

As already discussed in subsection 4.1, there are two types of coverage for a new descriptor in a previous version of MeSH: a) explicit coverage, which is based on the conceptual structure of MeSH descriptors and SCRs into concepts, and b) implicit coverage, which can be identified based on the PI information. Based on the coverage type, we also characterize the corresponding PHs. That is, an explicit PH used to host a subordinate concept, or used to be mapped from an SCR, that corresponds to the new descriptor. On the other hand, an implicit PH was used by the indexers for annotating articles that correspond to the topic of the new descriptor, without any explicit link with the latter in its conceptual structure.

Explicit PHs are of primary importance, as they provide strong conceptual links to the new descriptors. In our quest for a conceptual link to PHs, we focus on the preferred concept of each new descriptor. This is because the preferred concept is the dominant entity that represents the main meaning of a descriptor, as well as the vast majority of articles indexed with the descriptor. For descriptors that are new to the vocabulary, in the absence of any explicit PH, we exploit the PI field to identify any implicit PH. The “Tauopathies” descriptor, shown in Fig. 1, is such a case of a new descriptor without any explicit PH, where the PI field is exploited to identify the implicit PH “tau Proteins”.

4.3 Provenance Categories

For the purpose of identifying the PHs of a new descriptor d1, we seek its preferred concept in the corresponding version 0 of MeSH. Based on whether and how we identify it in existing descriptors, we define four cases (categories) of conceptual provenance, as depicted in Fig. 3 and described below:

Fig. 3 Identifying the provenance category and the Previous Hosts (PHs) for a new descriptor.
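The identification procedure of Fig. 3, with the four categories detailed in the remainder of this subsection, can be sketched in Python as follows. The `MeshSnapshot` structure, its field names and the `provenance_category` function are hypothetical simplifications introduced here for illustration, and the order in which the two explicit cases are checked is an assumption. This is a sketch of the decision logic only, not the implementation of the method presented in Section 5.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class MeshSnapshot:
    """Hypothetical, simplified view of version 0 of MeSH."""
    # concept name -> the single descriptor holding it in this version
    concept_to_descriptor: Dict[str, str] = field(default_factory=dict)
    # concept name -> the SCR holding it in this version
    concept_to_scr: Dict[str, str] = field(default_factory=dict)
    # SCR name -> descriptors the SCR is mapped to
    scr_to_descriptors: Dict[str, Set[str]] = field(default_factory=dict)


def provenance_category(preferred_concept: str,
                        version0: MeshSnapshot,
                        latest_pi_descriptors: Iterable[str] = ()) -> Tuple[int, Set[str]]:
    """Return (category, previous hosts) for the preferred concept of a new descriptor."""
    # Category 1: the concept was a subordinate concept of an old descriptor.
    d0 = version0.concept_to_descriptor.get(preferred_concept)
    if d0 is not None:
        return 1, {d0}
    # Category 2: the concept belonged to an SCR; its mapped descriptors are the PHs.
    scr = version0.concept_to_scr.get(preferred_concept)
    if scr is not None:
        return 2, set(version0.scr_to_descriptors.get(scr, set()))
    # Category 3: new concept with Previous Indexing information (implicit PHs).
    pi = set(latest_pi_descriptors)
    if pi:
        return 3, pi
    # Category 4: new emerging concept, no PH at all.
    return 4, set()


# Toy snapshot built from the examples discussed in this subsection.
v0 = MeshSnapshot(
    concept_to_descriptor={"Prunus africana": "Pygeum"},
    concept_to_scr={"Adenocarcinoma of Lung": "Adenocarcinoma of Lung"},
    scr_to_descriptors={"Adenocarcinoma of Lung": {"Adenocarcinoma", "Lung Neoplasms"}},
)
print(provenance_category("Prunus africana", v0))
print(provenance_category("Zika Virus Infection", v0,
                          ["Arbovirus Infections", "Flavivirus Infections"]))
```

In this sketch the caller is expected to pass only the most recent PI descriptors, that is, those used until the introduction of the new descriptor; selecting them from the PI note is left out.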
Category 1. Old Concept:
Although d1 is a new descriptor, its preferred concept is available in the previous version of MeSH (version 0) as a subordinate concept of another descriptor d0. In this case of explicit coverage, since d0 used to hold the preferred concept of d1, topic-overlap(d1, d0, version 0) is true. The descriptor d0 is therefore an explicit PH of d1. In addition, as each MeSH concept can only belong to a single descriptor in a given version of MeSH [12], d0 is the unique PH of d1.

For example, “Prunus africana”, shown in Fig. 4, introduced in 2016 as a descriptor, was a subordinate (narrower) concept of the “Pygeum” descriptor, which is not included in MeSH any more. In this case the unique PH of “Prunus africana” is the “Pygeum” descriptor, which explicitly included the concept “Prunus africana” in the version prior to the introduction of a dedicated descriptor for it.

Category 2. Old SCR:
Alternatively, since the SCRs account for a large volume of domain concepts that are not included in MeSH descriptors [4], the preferred concept of d1 may have been available as a concept in an SCR scr, prior to the introduction of d1 (version 0). In this second case of explicit coverage, for each descriptor d0 mapped from scr it holds that the literature indexed under scr was also indexed under d0. Therefore, topic-overlap(d1, d0, version 0) is also true. As a result, each descriptor d0 is an explicit PH of d1. For example, the descriptor “Adenocarcinoma of Lung”, introduced in 2019, was previously available as an SCR mapped to the descriptors “Lung Neoplasms” and “Adenocarcinoma” (Fig. 2). These two descriptors are the explicit PHs of the new descriptor.

Fig. 4 An example of descriptor succession.

Category 3. New PI Concept:
The preferred concept may be new, introduced together with the new descriptor d1. For such new descriptors, if previous-indexing (PI) information is available, this means that some other descriptors were previously used to index articles for the topic of d1 (new PI concept). Therefore, the preferred concept of d1, though new in the MeSH thesaurus, was previously indexed under some older descriptors with other concepts, hence implicitly covered by them. In such cases of implicit coverage, the PI descriptors that were used until the introduction of d1 are the ones with the most recent ending year in the accompanying period (version 0). Therefore, for each descriptor d0 that was used until the introduction of d1, we have that topic-overlap(d1, d0, version 0) is true. As a result, the most recent PI descriptors are the implicit PHs of the new descriptor d1.

For example, “Zika Virus Infection” was introduced as a descriptor in 2015, and was not previously present as a concept in MeSH. However, it is annotated with a PI note, revealing that the descriptors named “Arbovirus Infections” and “Flavivirus Infections” have been used for indexing literature relevant to “Zika Virus Infection” until 2015. Therefore, these two descriptors are the PHs of “Zika Virus Infection”.

Category 4. New Emerging Concept:
On the other hand, there exist new descriptors where no PI information is available, no PH can be identified and their set of PHs is empty. Such totally new descriptors are expected to include emerging domain concepts without significant presence in prior literature. Therefore, the curators begin indexing articles for a domain topic previously not indexed under any specific MeSH descriptor.

For example, “Long Term Adverse Effects” was introduced in 2015, as presented in Fig. 5. No “Long Term Adverse Effects” concept was previously present in MeSH and no PI information is available to report that articles for “Long Term Adverse Effects” were indexed under some particular descriptor until 2015. Therefore, this totally new descriptor has no PHs at all.

Fig. 5 An example of descriptor emersion.

4.4 Provenance Types

Having identified the PHs and the provenance category of each new descriptor, next we investigate the hierarchical relation of the new descriptor with each one of its PHs. This relation starts with the introduction of the new descriptor in the MeSH hierarchy, but may change in the course of the years, as the hierarchy evolves further. Therefore, to characterise the relation of a new descriptor d1 with a PH d0 in the context of a given reference version of MeSH, we focus on two basic properties of this relation in the corresponding hierarchy, namely the relation type of d1 with d0 and the distance between them in the hierarchy.
– The relation type(d1, d0) is: (a) ancestor when d1 has at least one tree number that includes a tree number of d0, (b) descendant when d0 has at least one tree number that includes a tree number of d1, (c) unrelated when none of the tree numbers of d0 includes or is included in any of the tree numbers of d1, and (d) undefined when d0 is not present in the reference version of MeSH.
– The distance(d1, d0) is the number of other descriptors included in the shortest path connecting d1 and d0. If d0 is located in a position in the hierarchy that is not connected with d1, then the distance(d1, d0) can be considered to be infinite. If d0 is not present in the reference version of the hierarchy, the distance(d1, d0) is undefined.
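The two properties defined above can be computed directly from tree numbers. The following Python sketch is a minimal illustration under our own assumptions: a tree number "includes" another when the latter is a proper prefix of it at the level of dot-separated components, a PH absent from the reference version is represented by `None`, and the shortest path is sought only along ancestor and descendant links. The function names are hypothetical, and the tree numbers starting with `X01` are made up for the example.

```python
import math
from typing import Optional, Sequence, Tuple, Union


def _parts(tree_number: str) -> Tuple[str, ...]:
    return tuple(tree_number.split("."))


def _includes(longer: Tuple[str, ...], shorter: Tuple[str, ...]) -> bool:
    """True when `shorter` is a proper component-wise prefix of `longer`."""
    return len(shorter) < len(longer) and longer[:len(shorter)] == shorter


def relation_type(d1: Sequence[str], d0: Optional[Sequence[str]]) -> str:
    """Relation type of new descriptor d1 with PH d0, given their tree numbers."""
    if d0 is None:  # d0 is not present in the reference version
        return "undefined"
    t1 = [_parts(t) for t in d1]
    t0 = [_parts(t) for t in d0]
    if any(_includes(a, b) for a in t1 for b in t0):
        return "ancestor"    # a tree number of d1 includes one of d0
    if any(_includes(b, a) for a in t1 for b in t0):
        return "descendant"  # a tree number of d0 includes one of d1
    return "unrelated"


def distance(d1: Sequence[str], d0: Optional[Sequence[str]]) -> Union[int, float, None]:
    """Number of other descriptors on the shortest path between d1 and d0."""
    if d0 is None:
        return None  # undefined
    gaps = [abs(len(a) - len(b)) - 1
            for a in map(_parts, d1) for b in map(_parts, d0)
            if _includes(a, b) or _includes(b, a)]
    return min(gaps) if gaps else math.inf


# Tree numbers quoted in Section 2, used here only to exercise the functions.
AD = ["F03.615.400.100", "C10.228.140.380.100", "C10.574.945.249"]
DEMENTIA = ["F03.615.400", "C10.228.140.380"]
TAUOPATHIES = ["C10.574.945"]

print(relation_type(AD, DEMENTIA), distance(AD, DEMENTIA))
print(relation_type(DEMENTIA, TAUOPATHIES), distance(DEMENTIA, TAUOPATHIES))
print(relation_type(["X01.100.200.300"], ["X01.100"]), distance(["X01.100.200.300"], ["X01.100"]))
```

Comparing components rather than raw strings avoids treating, for instance, a position ending in `22` as an ancestor of a sibling position ending in `228`.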
For example, the relation between the new descriptor “Adenocarcinoma of Lung” and its PH “Lung Neoplasms” has ancestor relation type and zero distance with MeSH 2019 as reference version (see Fig. 2). On the other hand, the current relation of “Prunus africana” and its PH “Pygeum”, in the context of MeSH 2020 as reference version, has undefined relation type and undefined distance, since “Pygeum” is no longer present in the hierarchy (see Fig. 4). Based on the relation type and the distance of the relation between a new descriptor d1 and a PH d0, we also define some cases of interest, which we call conceptual provenance types.

Type 0. Emersion: No PH found.
For new descriptors in category 4, where no PH can be identified. In these cases there is no PH for which to investigate the current relation, therefore we define the trivial type of provenance emersion, which includes all descriptors of provenance category 4 and only descriptors of category 4. This exceptional type of provenance does not reflect the relationship with any PH, therefore it is not based on relation type and distance in a specific reference version of MeSH. The meaning of such a completely new descriptor is emerging when the new descriptor is introduced, and is characterized as emergent hereafter. The “Long Term Adverse Effects” descriptor, introduced in 2015, is an example of emersion (see Fig. 5).
Type 1. Succession: relation type(d1, d0) = undefined and distance(d1, d0) = undefined.
For some new descriptors, a PH may no longer be present in the reference version of MeSH. In this case, d1 is considered one of the successors of d0, because at least some of the articles that used to be annotated with d0, in version 0 for d1, are annotated with d1 instead, in the reference version of MeSH. In the example of Fig. 4, the new descriptor “Prunus africana” is a case of succession, as its PH is not available in the context of the reference version, MeSH 2020.

Type 2. Subdivision: relation type(d1, d0) = ancestor and distance(d1, d0) = 0.
A new descriptor d1 , whosePH d0 has become its parent. In this case, d0 coversthe topic of the new descriptor entirely, but d1 sup-ports the partition of the corresponding literature intomore fine-grained conceptual sets. This is the most ex-pected type of relation between new descriptors andtheir PHs, as the vocabulary evolves towards more de-tailed descriptors to support more precise topic annota-tions. In the subdivision example of Fig. 6, “RegulatedCell Death” introduced in 2020, used to be indexed un-der “Cell Death” until 2019, which became its parent. Fig. 6 “Regulated Cell Death” as a subdivision of “CellDeath”, the submersion of “Ferroptosis” and the detachment of “Necroptosis”.
Type 3. Submersion: relation type(d1, d0) = ancestor and distance(d1, d0) > 0. A new descriptor d1, whose PH d0 has become an ancestor, but not a parent. This is similar to subdivision, as they both are characterized by ancestor relation type, but at least one other descriptor appears between d0 and d1 in the hierarchy. This is also in accordance with the evolution towards more detailed descriptors, as d0 keeps covering the topic of the new descriptor entirely. However, the distance between them suggests that intermediate levels of detail are also available.

“Ferroptosis”, introduced in 2020 (Fig. 6), is an example of submersion, as it was indexed under “Cell Death” until 2019, which is now an ancestor but not a parent of it. In this example, the fact that “Regulated Cell Death”, which operates as the intermediate level of detail, was introduced together with “Ferroptosis” can explain why “Ferroptosis” articles were previously indexed under “Cell Death” instead of “Regulated Cell Death”.

Type 4. Overtopping: relation type(d1, d0) = descendant.
A new descriptor d1, whose PH d0 has become its descendant. In this case, although literature for the new topic used to be indexed under d0 in the past (version 0), d1 is an ancestor of d0 in the reference version of MeSH, hence broader than it. Such new descriptors provide a new grouping of the old topics, potentially enhanced with additional terms for the aggregate topic. In the example depicted in Fig. 7, “Crystal Arthropathies”, introduced in 2017, has two implicit PHs, as it was indexed as “Chondrocalcinosis” and “Gout” until 2016. Both of them are children of “Crystal Arthropathies” in 2020, hence overtopped by it. Such cases seem less expected than the ones with ancestor relation type (subdivision and submersion), as this situation suggests that d0 used to cover only a part of the topic of d1. In addition, overtopping is less interesting from a practical point of view, as the use of the narrowest descriptor that covers a topic is a common MeSH-indexing practice [12]. Therefore, though different levels of detail may exist between the new descriptor and its descended PH, splitting this small group of cases based on the distance would not add particular value.

Fig. 7 The “Crystal Arthropathies” overtopping its PHs.

Type 5. Detachment: relation type(d1, d0) = unrelated.
A new descriptor d1 that is not related to its PH d0 with any of the above relations. In this case, d1 is detached from d0, placed in a position where neither of them is included by the other. In the example of Fig. 6, “Necroptosis”, introduced in 2020 as a child to “Regulated Cell Death” in the “Phenomena and Processes” MeSH category (G), was previously indexed as “Necrosis”. Although “Necrosis” used to be a child of “Cell Death” in 2019, in 2020 it belongs only to the “Diseases” MeSH category (C) and is not directly related to “Necroptosis”. Therefore, we consider the “Necroptosis” descriptor to be detached from its PH “Necrosis” in 2020.

Detached descriptors may be positioned quite close to their PH in terms of distance, but are not related as ancestors or descendants to it. In the example of Fig. 8, “Undiagnosed Diseases”, introduced in 2020 as a child descriptor to “Disease Attributes”, was previously indexed under “Rare Diseases”, which is also a child of “Disease Attributes”. However, we consider that “Undiagnosed Diseases” is detached from its PH “Rare Diseases”, as their topics are effectively disjoint. That is, neither of the two topics includes the other in the reference version (MeSH 2020).

Fig. 8 The detachment of “Undiagnosed Diseases” from “Rare Diseases”.

Fig. 9 The “Shoulder Dystocia” descriptor has two provenance codes, namely code 3.2 for the subdivision of the PH “Dystocia” and code 3.5 for the detachment from the PH “Shoulder”.
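The five non-trivial provenance types defined above, together with the trivial emersion type, reduce to a small decision rule over the relation type and the distance. A minimal sketch of that rule follows; the function names are hypothetical, and the relation type and distance are assumed to have already been computed for each PH.

```python
def provenance_type(relation: str, dist) -> tuple:
    """Map the (relation type, distance) of one PH to a provenance type."""
    if relation == "undefined":
        return (1, "Succession")     # the PH is absent from the reference version
    if relation == "ancestor":
        # distance 0: the PH is the parent; otherwise a more distant ancestor
        return (2, "Subdivision") if dist == 0 else (3, "Submersion")
    if relation == "descendant":
        return (4, "Overtopping")    # any distance
    if relation == "unrelated":
        return (5, "Detachment")     # any distance, including infinite
    raise ValueError(f"unknown relation type: {relation!r}")


def provenance_types(ph_relations) -> set:
    """Provenance types of a new descriptor given one (relation, distance) pair per PH."""
    if not ph_relations:
        return {(0, "Emersion")}     # no PH at all
    return {provenance_type(relation, dist) for relation, dist in ph_relations}
```

For “Shoulder Dystocia”, whose PH “Dystocia” became its parent and whose PH “Shoulder” is unrelated to it, `provenance_types([("ancestor", 0), ("unrelated", float("inf"))])` yields both Subdivision and Detachment. The infinite distance used here for “Shoulder” is an assumption made for the illustration.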
Provenance codes: In order to easily refer to both the category and the type of conceptual provenance, we adopt a composite provenance code, with a prefix indicating the category of a descriptor and a suffix indicating the type of its relation to some PH, separated by a dot, as shown in Table 1. For example, the provenance code for “Necroptosis” (Fig. 6) is 3.5, indicating provenance category 3 for “new concept”, as the PH has been identified based on PI information, and provenance type 5 for detachment from “Necrosis”. Similarly, the provenance code for “Prunus africana” (Fig. 4) is 1.1, with category 1 for “old concept” and type 1 for succession of “Pygeum”. In the special case of type 0, emersion, the preferred concept of the new descriptor is by definition in category 4, hence all emersion cases have the trivial provenance code 4.0.

As some new descriptors can have more than one PH, the provenance types described above are not mutually exclusive. Therefore, a new descriptor can have multiple provenance codes. This is not true for the provenance categories, therefore all provenance codes of a specific new descriptor begin with the same prefix. For example, “Shoulder Dystocia”, depicted in Fig. 9, was introduced in 2020 as a child descriptor to “Dystocia”. Articles for shoulder dystocia were indexed as both “Dystocia” and “Shoulder” until 2019, hence it is both a case of subdivision (3.2) of the PH “Dystocia”, which became its parent, and a case of detachment (3.5) from the PH “Shoulder”, which is not directly related with the new descriptor.

Table 1 Provenance codes characterizing the relationship of a new descriptor with a PH, encoding categories and types as prefixes and suffixes respectively. The exceptional case of emersion type corresponds to code 4.0.

Provenance Type    relation type   distance    1. Old concept   2. Old SCR   3. New PI concept
.1 Succession      undefined       undefined   1.1              2.1          3.1
.2 Subdivision     ancestor        0           1.2              2.2          3.2
.3 Submersion      ancestor        > 0         1.3              2.3          3.3
.4 Overtopping     descendant      ≥ 0         1.4              2.4          3.4
.5 Detachment      unrelated       ≥ 0         1.5              2.5          3.5

In this section, we describe the computational tools developed for the automated identification of new MeSH descriptors, their PHs and provenance codes, in the context of the conceptual model introduced in section 4. These tools access the original source files of MeSH, as provided by NLM, in the MeSH XML format. Therefore, all available information is accessible by the tools and any new versions of the hierarchy can be directly incorporated upon release. Figure 10 illustrates the sequence of processing steps that are involved in relating new descriptors to their PHs. The source code of the tools is openly available on GitHub.

5.1 Harvesting MeSH versions

As mentioned in previous sections, we focus our analysis on the provenance of descriptors that are present in a reference version of MeSH, namely the latest one. Therefore, we do not process descriptors that appear and disappear in various older versions. However, we are still interested in annotating descriptors that appear in older versions and remain available in the reference version. As a result, we need to process older versions as well, covering a period from year 0 to year N.

In particular, the process starts with the harvesting of MeSH files for different versions of the hierarchy.
This step begins with parsing the basic XML file for each year to extract the descriptors available in this version. This set of descriptors is then compared to those of the previous year to identify the new ones. The same process is repeated for each year, with the exception of the very first one, for which no previous version is available. Apart from the basic file comprising the MeSH descriptors, the XML file of the SCRs is also parsed for each version, to extract the corresponding set of available SCRs. These are needed for the identification of provenance categories and types. (Source code: https://github.com/tasosnent/MeSH_Extension)

Extracting descriptor attributes: For the descriptors of interest, a number of attributes need to be extracted, in order to help us trace their provenance. The most important attribute is the MeSH code of the descriptor, which is the unique identifier considered for checking descriptor existence and identity. Other relevant information includes the positions of a descriptor in the hierarchy (tree numbers), its preferred concept and the content of the PMN and the PI fields. Most parts of this attribute extraction step are quite straightforward, as we primarily rely on the unique identifiers of the entities involved in the analysis. For example, the information needed for identifying the earlier status of a descriptor as a subordinate concept in its version 0 is the unique concept identifier of its preferred concept. This is because we need to compare this identifier with the identifiers of subordinate concepts of any descriptor in version 0.

However, automated extraction of information from the PMN and PI fields proved more challenging, as these fields contain information in semi-structured text, meant to be read by humans. Therefore, the structure of this text is inconsistent, while descriptors and SCRs are mentioned with their current preferred terms, instead of the corresponding unique identifiers. Consequently, we adopted a semi-automated approach, based on regular expressions, in order to extract information from these fields. In the large majority of cases we managed to minimize the required manual effort, as described below.
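The harvesting and attribute extraction steps described above can be sketched as follows. This is an illustration under stated assumptions, not the code of the released tools: the element names (`DescriptorRecord`, `DescriptorUI`, `TreeNumberList`, `ConceptList`, `PublicMeSHNote`, `PreviousIndexingList`) follow the MeSH descriptor XML as we understand it and should be checked against the NLM documentation, while the two inline samples, their identifiers and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_descriptors(xml_text: str) -> dict:
    """Return {descriptor UI: extracted attributes} for one MeSH version."""
    descriptors = {}
    for rec in ET.fromstring(xml_text).iter("DescriptorRecord"):
        preferred = rec.find("ConceptList/Concept[@PreferredConceptYN='Y']")
        descriptors[rec.findtext("DescriptorUI")] = {
            "name": rec.findtext("DescriptorName/String"),
            "tree_numbers": [t.text for t in rec.findall("TreeNumberList/TreeNumber")],
            "preferred_concept": None if preferred is None else preferred.findtext("ConceptUI"),
            "pmn": (rec.findtext("PublicMeSHNote") or "").strip(),
            "pi": [p.text for p in rec.findall("PreviousIndexingList/PreviousIndexing")],
        }
    return descriptors


def new_descriptors(previous: dict, current: dict) -> dict:
    """Descriptors whose unique identifier is absent from the previous version."""
    return {ui: attrs for ui, attrs in current.items() if ui not in previous}


# Hypothetical two-year sample; identifiers and names are made up.
PARENT = """
  <DescriptorRecord>
    <DescriptorUI>D900001</DescriptorUI>
    <DescriptorName><String>Example Parent</String></DescriptorName>
    <TreeNumberList><TreeNumber>G99.100</TreeNumber></TreeNumberList>
    <ConceptList>
      <Concept PreferredConceptYN="Y"><ConceptUI>M900001</ConceptUI></Concept>
    </ConceptList>
  </DescriptorRecord>"""
CHILD = """
  <DescriptorRecord>
    <DescriptorUI>D900002</DescriptorUI>
    <DescriptorName><String>Example Child</String></DescriptorName>
    <PublicMeSHNote>2020</PublicMeSHNote>
    <PreviousIndexingList>
      <PreviousIndexing>Example Parent (2003-2019)</PreviousIndexing>
    </PreviousIndexingList>
    <TreeNumberList><TreeNumber>G99.100.200</TreeNumber></TreeNumberList>
    <ConceptList>
      <Concept PreferredConceptYN="Y"><ConceptUI>M900002</ConceptUI></Concept>
    </ConceptList>
  </DescriptorRecord>"""
XML_2019 = "<DescriptorRecordSet>" + PARENT + "</DescriptorRecordSet>"
XML_2020 = "<DescriptorRecordSet>" + PARENT + CHILD + "</DescriptorRecordSet>"

added = new_descriptors(parse_descriptors(XML_2019), parse_descriptors(XML_2020))
```

Comparing on the unique identifier rather than on the preferred term keeps the comparison robust to term changes between versions, which is the reason the text gives for relying on identifiers.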
Extraction from the PMN field: The PMN (Public MeSH Note) field of a MeSH descriptor typically consists of sentences separated by semicolons and may provide varying information, such as the year the descriptor was introduced and changes in the preferred term. Of particular interest for this work are PMN sentences that report earlier status of the descriptor as an SCR. This is done with expressions of the form “X was indexed under Y”, where X is the SCR and Y comprises one or more descriptors together with the corresponding time periods, as shown in the example of Fig. 2. This is useful, as in some cases an SCR that gets “promoted” to descriptor may undergo some minor term modifications and receive a new identifier. In such cases, exploiting the PMN is the only way to identify the old SCR for the new descriptor, which would otherwise be considered totally new.

Fig. 10 The computational process for identifying new descriptors and annotating them with provenance codes.

Therefore, when attempting to associate a new descriptor to an earlier SCR, we start by comparing the identifier of the preferred concept of the descriptor to the concept identifiers in earlier SCRs. If this exact-match search fails, we resort to the use of the PMN expressions mentioned above. In particular, we first use regular expressions to extract from the PMN field the preferred term (X) of the old SCR and map it to some SCR identifier in the corresponding version of MeSH. In our analysis, this method managed to automatically identify the missing links for the majority of cases (74%) where the PMN field matches the “X was indexed under Y” expression and the exact-match search fails.

For the few remaining cases, we calculated the similarity of X and the current descriptor name to earlier SCR terms. Based on this similarity, the system produced best-match suggestions, which were confirmed manually. More details about this method are available in a technical report available online. There is also a small number of cases where more than one old SCR is reported in the PMN field. In such cases, only the first SCR was considered, as this usually corresponds to the preferred concept of the new descriptor, representing its central meaning.

Extraction from the PI field:
The PI (Previous Indexing) field of a MeSH descriptor can be used to link a new descriptor to old ones, when such a link is not provided explicitly, that is by a previous state of the descriptor as a subordinate concept or SCR. The PI field contains a list of semi-structured notes in English. Each note usually consists of the relevant descriptors for a previous period, often followed by the corresponding time period in parentheses (Fig. 1). Exploiting this pattern, we used regular expressions to extract the terms and the corresponding time periods (some exceptions not fitting the patterns were identified and handled manually). In cases where the PI field consists of multiple notes, all the descriptors with the most recent end year are considered as PHs, as done for “Shoulder Dystocia” in the example of Fig. 9. Any older PI elements are neglected.

The technical report with more details on the semi-automated extraction is available at https://docs.google.com/document/d/1J3X5OlrkIErDR-qJf0KT669Du9xndfPNBRumhYS9Yxw/edit?usp=sharing

Selecting provenance type: In the last part of the MeSH harvesting step, each new descriptor is annotated with conceptual provenance codes. In particular, the first step is to select the provenance category based on the previous state of the current preferred concept as a subordinate concept or an SCR concept in the corresponding version 0, as depicted in the schema of Fig. 3. Then, the provenance type is selected, based on the current relation of the new descriptor to its PHs, which have been identified by the extraction process. Combining the provenance types with the category, the complete set of provenance codes is formed. The end result is a collection of all the new descriptors that have been introduced during the period considered and remain available in the reference version of MeSH. These descriptors are annotated with their basic information and provenance annotations, and stored in CSV files named after the year that corresponds to the version for each descriptor.
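The selection of PHs from the PI field and the formation of provenance codes, as described in this section, can be sketched as below. The sketch is not the released implementation: it assumes that each PI note holds a single term followed by a period of the form `(YYYY-YYYY)`, whereas real notes are less regular and, as stated above, some were handled manually. The function names and the years in the usage note are hypothetical.

```python
import re

# Assumed note shape: "<term> (<start year>-<end year>)".
PI_NOTE = re.compile(r"^\s*(?P<term>.+?)\s*\((?P<start>\d{4})\s*-\s*(?P<end>\d{4})\)\s*$")


def parse_pi_note(note: str):
    """Return (term, start year, end year), or None when the note does not fit."""
    match = PI_NOTE.match(note)
    if match is None:
        return None  # left for manual handling
    return match.group("term"), int(match.group("start")), int(match.group("end"))


def select_previous_hosts(notes) -> list:
    """Keep the terms of the notes that share the most recent end year."""
    parsed = [p for p in map(parse_pi_note, notes) if p is not None]
    if not parsed:
        return []
    latest = max(end for _, _, end in parsed)
    return [term for term, _, end in parsed if end == latest]


def provenance_codes(category: int, type_numbers) -> list:
    """Combine one category prefix with each distinct type suffix."""
    return sorted(f"{category}.{t}" for t in set(type_numbers))
```

For instance, `select_previous_hosts(["Dystocia (1966-2019)", "Shoulder (1966-2019)", "Older Term (1966-1990)"])` keeps “Dystocia” and “Shoulder”, and `provenance_codes(3, [2, 5])` gives the two codes 3.2 and 3.5 reported for “Shoulder Dystocia”. The start years and the third note are illustrative only.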
Table 2 The distribution of the 6,915 new descriptors (2006-2020) into provenance codes. The total per category can be lower than the sum of distinct type counts, as the types are not mutually exclusive.

Prov. Type        1. Old con.   2. Old SCR   3. New PI con.   Total/type
.1 Succession     21            12           84               117
.2 Subdivision    276           967          1,603            2,846
.3 Submersion     47            535          506              1,088
.4 Overtopping    24            7            91               122
.5 Detachment     151           364          1,313            1,828
Total/category    519           1,616        3,060
The total for category 4, Emersion (4.0), is 1,720.

… identified and annotated all the descriptors introduced during this period, considering MeSH 2020 as the reference version. In other words, we are interested in the current status of the descriptors, but we use the year of their introduction (version 1) in order to identify their previous hosts (PHs) and provenance category. The result of the computational processing is a CSV file for each MeSH version, comprising the new descriptors introduced this year and their provenance annotations.

As a final step, these files are parsed and analysed to produce statistics and diagrams that provide alternative views of conceptual provenance in the course of MeSH expansion. In particular, the diagrams that are generated present the frequencies of provenance categories, types and codes per year of introduction and in total. Based on these diagrams, we attempt to answer the basic questions driving this study and identify patterns and observations that may be of interest for understanding the dynamics of MeSH extension.

6.2 Overview of new descriptors and their provenance

Table 2 presents the distribution of new descriptors into provenance categories and types. In total, 6,915 descriptors were introduced in MeSH since 2006 and were retained until 2020 (they are publicly available at https://github.com/tasosnent/MeSH_Extension/blob/main/NewDescriptors_2006_2020.csv). This corresponds to an extension of about 30%, compared with the 22,997 descriptors available back in 2005, and indicates that about 23% of all current descriptors have been introduced during the last fifteen years.

New descriptors introduced for new concepts that have been implicitly covered in their version 0 by old descriptors (category 3) form the most frequent provenance category, accounting for about 44% of all new descriptors.
New descriptors for old concepts that have been explicitly covered in previous versions account for about 31% of all new descriptors, with the majority of cases covered by SCRs (category 2; 1,616 descriptors in Table 2) rather than by subordinate concepts (category 1; 519 descriptors). That is, new descriptors for old concepts are introduced more often for promoting SCRs (category 2), rather than for promoting subordinate concepts by restructuring old descriptors (category 1). On the other hand, new descriptors for emerging concepts (category 4), that are totally new for the MeSH vocabulary, account for 25% of all new descriptors. This relatively low frequency of Emersion suggests that in most cases new descriptors are linked to domain entities that are already covered by other descriptors, either implicitly (category 3) or explicitly (categories 1 and 2). Therefore, the new conceptual entities that are very often introduced (categories 3 and 4 account for 69% of new descriptors) are not completely novel, but they usually offer dedicated descriptors to known concepts (category 3).

Furthermore, the annual distribution of new descriptor categories, shown in Fig. 11, confirms the consistently high frequency of categories 3 and 4 throughout the years. In particular, both the introduction of descriptors for new PI concepts and for new emerging concepts account for at least around 100 cases annually for the whole period of study. However, category 4 is more stable around its mean value (AVG) of almost 115 cases per year, with a standard deviation (SD) of 22 cases, whereas category 3 presents more variation around its mean of 204 cases (SD ∼85 cases), reaching up to 300 and 400 cases in certain years.

On the other hand, the promotion of existing SCRs into descriptors (category 2) seems the least predictable category, with an AVG around 108 and an SD around 131 cases per year. In particular, in certain years (e.g. 2006, 2019) there seems to be a surge of such cases, while in others the number is much smaller. Finally, the evolution of existing subordinate MeSH concepts into independent descriptors (category 1) seems the least frequent and the most stable category, with an AVG of around 35 and an SD of around 13 new descriptors per year.

The extreme peak of more than 900 new descriptors observed in 2006 may be the result of an effort at NLM to restructure descriptors for chemicals that combined meanings for activity and structure. This effort, which has been spanning many years, was continued in 2006. In addition, promoting SCRs to descriptors was particularly encouraged this year in NLM, which is in agreement with the fact that this peak seems to be almost exclusively attributed to promoted SCRs (category 2), which are known to represent mainly chemicals. This is also confirmed by the distribution of new descriptors into MeSH categories (Fig. 12), as 73% of the new descriptors introduced in 2006 belong to “Chemicals and Drugs” (D). This relative frequency for 2006 far exceeds the overall relative frequency of category D for the whole period considered, that is around 41%.

Fig. 11 Frequency of provenance categories for new descriptors, per year of introduction.

Two less extreme peaks are also observed in 2011 and 2017, with the introduction of about 600 new descriptors each. In contrast to the 2006 peak, these ones seem to be primarily attributed to category 3 cases, as other categories present frequencies close to the ones of the adjacent years. In addition, the distribution of the corresponding new descriptors into MeSH categories suggests that, though the chemicals category D has relatively high frequencies in these years, other MeSH categories also have considerable contribution to these peaks. In other words, these peaks of new descriptors for new PI concepts (category 3) seem to be more evenly distributed across MeSH categories than the 2006 peak of category 2 cases.

(Footnote: Cho, Dan-Sung (NIH/NLM), personal communication.)
For 2011, this is in agreement with a focus on projects related to categories “Biological Sciences” (G) and “Analytical, Diagnostic and Therapeutic Techniques, and Equipment” (E) in MeSH. The peak of 2017, on the other hand, seems to be affected by the “MeSH Protein Project”, as part of which almost 290 new descriptors were added. The aim of this project was to achieve alignment of gene families, as described by the HUGO Gene Nomenclature Committee (HGNC), with protein classes in MeSH. In addition, more new descriptors than usual were introduced in 2017 for some less frequent MeSH categories, such as “Health Care” (N) and “Persons” (M).

(Footnote: Cho, Dan-Sung (NIH/NLM), personal communication.)

Regarding the provenance types of new descriptors, Subdivision (.2) is the most common case (41%), followed by Detachment (.5, 26%) and Emersion (.0, 25%). Submersion also has a considerable frequency of 16%, but Succession (.1) and Overtopping (.4) are quite scarce, accounting for about 2% each. This distribution seems to be in agreement with the expected low frequency of new descriptors being broader than their PHs (Overtopping) or having their PHs removed from the vocabulary (Succession). However, the frequency of new descriptors that are no longer covered by any of their PHs (Detachment) seems quite notable, representing 35% of non-emerging new descriptors (categories 1, 2 and 3). This implies that the addition of dedicated descriptors for concepts that used to be covered by older descriptors (PHs) often serves the removal of these subordinate, supplementary or implicitly covered concepts from these PHs, improving the specificity of the latter.

Fig. 12 Frequency of MeSH categories for new descriptors, per year of introduction. The four MeSH categories accounting for at least 10% of new descriptors each are presented independently. The remaining twelve categories, which have an overall frequency of less than 10% of new descriptors each, are collectively presented as “Other Categories”.

On the other hand, the majority of new descriptors appear to be still covered by their PHs, offering subtopics to the latter. In particular, about 55% of all the new descriptors have at least one ancestor in their PHs, that is they belong to Subdivision or Submersion cases, with the latter being far less frequent, as expected (16%). This suggests that only about half of the new descriptors end up as descendants of their PHs. However, focusing on the 5,195 non-emerging new descriptors, which actually have at least one PH (categories 1, 2 and 3), this relative frequency increases to 73%, with Subdivision accounting for 55% of the cases and Submersion for only 21% of them. This is in agreement with the expected evolution of the topic vocabulary towards more fine-grained descriptors. The latter support more precise topic annotations and retrieval, especially when more documents are accumulated for some descriptors over the years.

Figure 13 presents the annual distribution of new descriptors into provenance types. Despite annual fluctuations, there seems to be a clear separation of the frequent types (Emersion, Subdivision, and Detachment) from the infrequent ones (Succession and Overtopping) throughout the period of study. Finally, the Submersion type seems to fall in between the two groups. In addition, it seems that the infrequent types of Succession and Overtopping vary the least through the years (SD 7 and 5 respectively). The more frequent types of Subdivision, Detachment and Submersion seem to be the least predictable (SD 81, 56 and 66 respectively), whereas the trivial type of Emersion, though quite frequent as well, appears to be relatively stable, as already noticed for category 4.

As with MeSH categories, the surge of cases in certain years is not evenly distributed across all provenance types. Although the representation of all provenance types appears to be close to their overall relative frequency in the peak of 2011, this is not always the case. In 2006, Submersion seems to be over-represented, accounting for 31% of the cases, which is almost double its overall relative frequency for the period of study (16%). This could be related to the complex organization of chemical SCRs into groups and subgroups. For example, “Receptors, Scavenger” as well as its six classes (“Scavenger Receptors, Class A” etc.) used to be SCRs indexed under “Receptors, Immunologic” until their promotion into descriptors in 2006. Although “Receptors, Scavenger” was added as a child (2.2) to its PH “Receptors, Immunologic”, the six classes were added as children of “Receptors, Scavenger”, hence more distant descendants of “Receptors, Immunologic” (2.3).

On the other hand,
Detachment seems to be over-represented in the peak of 2017, accounting for 39% of the new descriptors, whereas its overall relative frequency for the whole period is 26%. Some of these Detachment cases are new descriptors for protein domains or motifs detached from the corresponding protein descriptors, which can be related to the “MeSH Protein Project”. For example, the new descriptor “Methyl CpG Binding Domain” was detached from its PH “DNA-Binding Proteins”. In addition, several new descriptors in MeSH categories “Health Care” (N) and “Persons” (M) appear to represent medical professions detached from the corresponding medical domains. For example, the new descriptor “Nephrologists” was detached from its PH “Nephrology”.

Fig. 13 Frequency of provenance types for new descriptors, per year of introduction.

Some of the types, in particular Subdivision (.2) and Detachment (.5), seem to be correlated in the way they increase or decrease over the years. It would therefore be of interest to investigate whether the correlation of their annual frequencies observed in Fig. 13 should be attributed to the addition of descriptors that exhibit both these provenance types simultaneously. This is only possible in category 2 and category 3, where the availability of multiple PHs for a new descriptor can lead to multiple provenance codes. In practice, however, new descriptors with multiple provenance codes are not very common, representing almost 17% of all new descriptors in these two categories.

Focusing on the majority of new descriptors that have a single provenance type, we compare the annual frequencies of Subdivision (.2) and Detachment (.5) (Fig. 14). The correlation of the frequencies seems to be preserved in the frequent category 3 (blue lines with square markers). In other words, even when looking at distinct new descriptors that share no common provenance types, Subdivision (3.2) and Detachment (3.5) seem to fluctuate in the same way across the years. For category 2, on the other hand (green lines with triangle markers), Detachment (2.5) does not seem to keep up with Subdivision (2.2), which presents some high peaks (2006, 2016, 2019). This is reasonable, as the link of the new descriptors to their PHs is stronger in category 2, which is based on explicit coverage, compared to category 3, where the PHs used to cover the new descriptors only implicitly.

It appears that in category 3, the number of new descriptors that are added as children of their PHs is usually comparable to the number of those that are detached from their PHs. This observation could be the effect of an internal procedure in the maintenance of MeSH and may warrant further investigation. On the other hand, the frequency of emerging descriptors without any PHs (Emersion, 4.0) (Fig. 13) exhibits fluctuations that are not particularly correlated to the other frequent types of provenance. This suggests that the addition of descriptors with totally new preferred concepts forms a distinct subset of the new descriptors added each year.
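As a worked check, the overall shares reported in this section can be recomputed from the counts of Table 2 and the category 4 total of 1,720. The short sketch below does this arithmetic; the variable and function names are ours, and the counts are copied from Table 2. Shares that depend on descriptors with more than one type, such as the 73% of non-emerging descriptors with at least one ancestor PH, cannot be derived from the table totals and are not recomputed here.

```python
# Counts copied from Table 2 (types) and from its note (category 4).
TYPE_TOTALS = {
    "Succession": 117,
    "Subdivision": 2846,
    "Submersion": 1088,
    "Overtopping": 122,
    "Detachment": 1828,
}
CATEGORY_TOTALS = {1: 519, 2: 1616, 3: 3060, 4: 1720}

ALL_NEW = sum(CATEGORY_TOTALS.values())          # all new descriptors, 2006-2020
NON_EMERGING = ALL_NEW - CATEGORY_TOTALS[4]      # descriptors with at least one PH


def share(count: int, total: int) -> int:
    """Percentage of `count` in `total`, rounded to the nearest integer."""
    return round(100 * count / total)


category_shares = {c: share(n, ALL_NEW) for c, n in CATEGORY_TOTALS.items()}
explicit_share = share(CATEGORY_TOTALS[1] + CATEGORY_TOTALS[2], ALL_NEW)
type_shares = {t: share(n, ALL_NEW) for t, n in TYPE_TOTALS.items()}
type_shares_non_emerging = {t: share(n, NON_EMERGING) for t, n in TYPE_TOTALS.items()}
extension = share(ALL_NEW, 22997)                # relative to the descriptors of 2005
```

The recomputed values agree with the text: categories 3 and 4 account for 44% and 25% of all new descriptors, categories 1 and 2 together for 31%, and Detachment for 35% of the 5,195 non-emerging descriptors.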
In this work we proposed a novel conceptual framework for organizing and studying the conceptual provenance of new descriptors in the Medical Subject Headings (MeSH) hierarchy. In particular, we defined the notion of the previous host (PH), as a descriptor covering the main topic of a new descriptor prior to its introduction, and suggested an approach to identify such PHs for a new descriptor. Then, based on the current relationship of the descriptor with its PHs, we also defined a set of provenance types and codes. In addition, we developed an open-source computational process for the automated extraction, annotation and analysis of new descriptors, using the raw files of different versions of MeSH as distributed by the US National Library of Medicine (NLM). Employing this approach, we investigated the conceptual provenance of new MeSH descriptors for the period 2006-2020.

Fig. 14 Frequency of type Subdivision (.2) and Detachment (.5) in new descriptors introduced during the last fifteen years, per provenance category. The asterisk (*) denotes that only descriptors with a single type are considered, excluding descriptors combining more than one type.

The results reveal that about 115 new descriptors for emerging concepts (category 4) are introduced each year quite steadily. These descriptors represent about 25% of all new descriptors of the study period, indicating that the majority of the new descriptors cover non-emerging domain concepts that are not really new for the MeSH thesaurus. Less than half of these non-emerging concepts were explicitly covered in MeSH prior to the introduction of dedicated descriptors for them (category 1 and category 2). The majority of non-emerging concepts, though not explicitly included in older versions of MeSH, used to be indexed under specific older descriptors (PHs) that covered their meaning implicitly (category 3).

This suggests that the main force which is consistently driving the extension of MeSH during this period is the need to explicitly cover more conceptual entities, namely a stable annual amount of new emerging concepts (category 4) and a similar or greater amount of new PI concepts (category 3) that used to be implicitly covered by MeSH.
The need to introduce descriptors for reorganizing concepts that are already explicitly covered (category 1 and category 2) appears to be auxiliary, with low numbers of new descriptors for most years. However, in certain years, we also observed a surge in the promotion of existing SCRs into descriptors (category 3), particularly for chemicals. Such surges in category 2 and category 3 seem to be related to internal MeSH projects and resource allocation in NLM.

In addition, the results on conceptual provenance types reveal that more than 70% of all non-emerging new descriptors (categories 1, 2 and 3) become subtopics of their PHs' topics. That is, they remain under the coverage of the latter, usually as children of them (.2, Subdivision) and less often as more distant descendants (.3, Submersion). However, the number of new descriptors that are detached from their PHs (.5, Detachment) is also considerable, particularly for implicit PHs (category 3). These observations suggest that the extension of MeSH primarily serves the need to enrich the MeSH thesaurus with more detailed subtopics, supporting the annotation of articles with new fine-grained topic labels. Nevertheless, it appears that a notable number of new descriptors also serve to rid the PHs of some implicitly covered topics, rendering the PHs more precise as well.

This grouping can be particularly useful for improving semantic indexing models for new descriptors. For example, the articles annotated with the PHs of a new descriptor can be a source of weakly-labeled data for topical annotations. In addition, the provenance types can provide indications for the prevalence of such weak labels. In the case of Detachment, for example, we may expect that only a small part of the articles annotated with the PHs will be relevant to the new descriptor. In the case of new descriptors for new emerging concepts (category 4), on the other hand, Zero-Shot Learning approaches may be more appropriate, as no PHs are available as a source of weak labels.

Although our findings primarily provide insight to researchers working with MeSH, we also believe that the proposed viewpoint is of more general interest. In particular, it can be used to analyse the extension dynamics of other similar topic hierarchies. The annotations of conceptual provenance produced by the proposed method capture the hierarchical relationship of a new topic with the topics that were previously used in its place. Such information can be used to characterise and group the topics, facilitating the process of maintaining topic hierarchies.

Our future plans include the investigation of further uses of the provenance information provided by the proposed method. In particular, we are examining whether new descriptors with the same provenance category, types or codes present similarities that can be exploited in the semantic indexing of documents with newly introduced labels. Additionally, we are looking into the use of the provenance information for predicting ontological expansion. Last but not least, we would like to explore the use of the conceptual framework and computational procedures for tasks related to the maintenance of the hierarchy itself, such as identifying special cases and inconsistencies in textual descriptive fields.
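As an illustration of the three provenance types named in this section, the following minimal Python sketch assigns a type to a new descriptor from its position relative to one PH in the current hierarchy. It is not the open-source process described above: the parent map `toy_parents`, the function names and the descriptor identifiers are all hypothetical, and treating Detachment as the absence of any ancestor or descendant relation in either direction is an assumption made here for simplicity. Any other configuration returns `None`, since the remaining type codes are not described in this section.

```python
from typing import Dict, Optional, Set


def ancestors(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect all ancestors of `node` in a (possibly multi-parent) hierarchy."""
    seen: Set[str] = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def provenance_type(new: str, ph: str, parents: Dict[str, Set[str]]) -> Optional[str]:
    """Return the provenance type of `new` with respect to one previous host `ph`."""
    if new == ph:
        return None
    if ph in parents.get(new, ()):
        return "Subdivision"   # .2: the new descriptor is a child of its PH
    if ph in ancestors(new, parents):
        return "Submersion"    # .3: a more distant descendant of its PH
    if new not in ancestors(ph, parents):
        return "Detachment"    # .5 (assumed): no hierarchical relation to its PH
    return None                # other types are outside the scope of this sketch


# Hypothetical toy hierarchy: child -> set of parents (not real MeSH identifiers).
toy_parents: Dict[str, Set[str]] = {
    "ph": {"root"},
    "child": {"ph"},
    "grandchild": {"child"},
    "other": {"root"},
}

for descriptor in ("child", "grandchild", "other"):
    print(descriptor, provenance_type(descriptor, "ph", toy_parents))
```

A descriptor with several PHs could be handled by calling `provenance_type` once per PH and combining the results, which mirrors the distinction drawn above between descriptors with a single type and those combining more than one.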
Acknowledgements
This research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 697). We are grateful to James Mork and Dan-Sung Cho from the National Library of Medicine (NLM) for kindly providing valuable feedback on this work.