What is all this new MeSH about? Exploring the semantic provenance of new descriptors in the MeSH thesaurus

Abstract

The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary widely used in biomedical knowledge systems, particularly for semantic indexing of scientific literature. As the MeSH hierarchy evolves through annual version updates, some new descriptors are introduced that were not previously available. This paper explores the conceptual provenance of these new descriptors. In particular, we investigate whether such new descriptors have been previously covered by older descriptors and what their current relation to them is. To this end, we propose a framework to categorize new descriptors based on their current relation to older descriptors. Based on the proposed classification scheme, we quantify, analyse and present the different types of new descriptors introduced in MeSH during the last fifteen years. The results show that only about 25% of new MeSH descriptors correspond to new emerging concepts, whereas the rest were previously covered by one or more existing descriptors, either implicitly or explicitly. Most of them were covered by a single existing descriptor and they usually end up as descendants of it in the current hierarchy, gradually leading towards a more fine-grained MeSH vocabulary. These insights about the dynamics of the thesaurus are useful for the retrospective study of scientific articles annotated with MeSH, but could also be used to inform the policy of updating the thesaurus in the future.


Anastasios Nentidis · Anastasia Krithara · Grigorios Tsoumakas · Georgios Paliouras


Keywords

MeSH · terminology extension · semantic indexing

National Center for Scientific Research “Demokritos”, Athens, Greece
E-mail: {tasosnent, akrithara, paliourg}@iit.demokritos.gr

Aristotle University of Thessaloniki, Thessaloniki, Greece
E-mail: {nentidis, greg}@csd.auth.gr

arXiv [cs.DL]

The Medical Subject Headings (MeSH) thesaurus (https://meshb.nlm.nih.gov/) is a collection of hierarchically organized entities for annotating biomedical knowledge, primarily literature in PubMed/MEDLINE, with topic labels. The basic conceptual entity is the MeSH concept, which is a collection of synonymous terms for a particular domain meaning. Each concept has a preferred term, which is also used as the name of the concept. MeSH concepts are not directly used for annotating the literature. Each MeSH concept belongs to exactly one MeSH descriptor, which is a collection of closely related concepts and forms the basic element used for annotating biomedical literature with topic labels. All the concepts and terms of a descriptor are equivalent for the purposes of indexing and searching MEDLINE. Beyond MeSH concepts and descriptors, MeSH also provides some Supplementary Concept Records (SCRs) that are directly used for annotating articles with labels for substances, rare diseases and organisms.

As the MeSH hierarchy evolves through annual updates, new descriptors are introduced that were previously unavailable. This evolution of MeSH is essential in order to follow the development of knowledge in the field. For example, new descriptors can be more fine-grained than old ones, providing a level of detail previously unavailable in the vocabulary. On the other hand, new high-level descriptors can also be added, providing new groupings of topics in the light of the current understanding of the domain. In some cases, the topics covered by the new descriptors may have been present in MeSH previously, covered by older descriptors. However, some new descriptors may cover topics that are totally new to the vocabulary, representing emerging concepts in the domain.

Despite their necessity for keeping MeSH up-to-date, the introduction of new descriptors raises practical challenges.
Several applications, such as (semi-)automated semantic indexing of biomedical literature with MeSH labels, are based on supervised learning techniques that exploit accumulated data from previous use of the vocabulary. However, for new descriptors no such annotated literature is available at the time of their introduction. Therefore, it becomes important to devise a mapping of existing literature to the new descriptors. To this end, for each new descriptor we are interested in whether the corresponding topic was already covered by old descriptors in MeSH or not.

In this context, the basic questions motivating this study on the provenance of new descriptors are the following:

– To what extent do the new MeSH descriptors cover emerging domain concepts that are really new for the MeSH thesaurus?
– For those new descriptors that do not cover emerging domain concepts, can we identify older descriptors that were used to cover these concepts?
– What is the current relation of the new descriptors with the old ones that they are related to?
– Is there any pattern over time concerning the introduction of new descriptors in MeSH and how the new descriptors relate to the old ones?

The main contribution of this work is a conceptual framework for exploring the provenance of new MeSH descriptors, considering the hierarchical structure of the thesaurus. In particular, we describe an approach for identifying predecessor descriptors, which previously covered the topic of a new descriptor. Namely, a coding system is introduced for organizing the new descriptors based on two key dimensions: a) whether and how they have been covered in MeSH prior to their introduction, and b) their current position in the hierarchy in relation to their predecessors. In addition, a method is developed for the computational identification of predecessors and conceptual provenance codes for new MeSH descriptors.
Finally, based on the proposed framework we perform an analysis that sheds light on the conceptual provenance of descriptors introduced in MeSH during the last fifteen years.

The rest of this paper is structured as follows. In Section 2 some background knowledge is summarized, regarding the structure of basic elements of the MeSH thesaurus and their relationships. In Section 3 we provide a brief overview of work related to the extension of biomedical thesauri with new concepts, focusing on the MeSH thesaurus. In Section 4 we propose a framework for identifying the predecessors of new descriptors and introduce new types of conceptual provenance to characterize the current relation of new MeSH descriptors with their predecessors. In Section 5 we propose a method to automatically analyze the versions of the MeSH hierarchy, in order to identify the various types of provenance. In Section 6 we present and discuss the results of this analysis, which lead to useful insights about the evolution of MeSH. Finally, in Section 7 we conclude and indicate potential uses of our results in future research.

Fig. 1 MeSH concepts are grouped into descriptors, which are hierarchically organized and can also have a “Previous Indexing” note (PI). Each Supplementary Concept Record (SCR) is mapped to at least one descriptor.

In the MeSH hierarchy, each descriptor has exactly one preferred concept and may also have some subordinate (narrower, broader, or related) concepts that attach additional terms to the descriptor. For example, the descriptor for Dementia (Fig. 1) consists of three concepts: the preferred concept, which has two synonymous terms (“Dementia” and “Amentia”), and two narrower concepts with a single term each. The preferred concept is the reference point for defining the subordinate concepts as narrower, broader, or related. Therefore, we consider the preferred concept as the dominant entity representing the main topic of a descriptor.

The MeSH descriptors are hierarchically organized so that most descriptors have at least one broader descriptor as parent. For example, the “Alzheimer Disease” (AD) descriptor has two parents, namely the descriptors “Dementia” and “Tauopathies”, as shown in Fig. 1. Additionally, there are some top-level descriptors that have no parents and are the roots of the trees in the MeSH hierarchy, which are called MeSH trees. The MeSH trees are grouped into sixteen MeSH categories, and each descriptor belongs to one or more MeSH trees and corresponding MeSH categories. For example, the AD descriptor belongs to the “Nervous System Diseases (C10)” tree in the “Diseases” (C) MeSH category and to the “Mental Disorders (F03)” tree in the “Psychiatry and Psychology” (F) MeSH category.

The exact position of a descriptor in a tree is determined by one or more tree numbers or tree paths. Each MeSH tree number of a descriptor extends a tree number of some parent, and recursively includes the tree numbers of a series of ancestors reaching up to the corresponding root. For example, the AD descriptor has two tree numbers extending the tree numbers of “Dementia” (F03.615.400.100, C10.228.140.380.100) and one tree number extending the tree number of “Tauopathies” (C10.574.945.249).

The SCRs, which are also known as Supplementary Chemical Records, have a similar conceptual structure to descriptors, with one preferred MeSH concept and potentially some subordinate ones, but they are not part of any descriptor and are not directly included in the MeSH hierarchy. However, they are mapped to at least one descriptor. In Fig. 1 for example, the SCR “Presenile And Senile Dementia” is mapped to the AD descriptor, as indicated by a dotted arrow towards the latter. In practice, this means that when indexers in PubMed/MEDLINE use this SCR to annotate an article, the article will also get automatically annotated with the mapped descriptors [12]. This mapping is important because it defines which descriptors cover the meaning of each SCR at the level of main MeSH topic annotations, which are primarily used for indexing and searching the literature.
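The way a tree number encodes the positions of all its ancestors can be illustrated with a short Python sketch. The helper names below are hypothetical and introduced only for this illustration; the tree numbers are those quoted above for the AD descriptor, and reading the category from the first character is an assumption based on the C10 and F03 examples.

```python
def ancestor_tree_numbers(tree_number: str) -> list[str]:
    """Tree numbers of all ancestors, from the parent up to the tree root."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def tree_root(tree_number: str) -> str:
    """The MeSH tree a position belongs to, e.g. 'C10'."""
    return tree_number.split(".")[0]


def mesh_category(tree_number: str) -> str:
    """The MeSH category letter of a position, e.g. 'C' (assumed: first character)."""
    return tree_number[0]


# Tree numbers of the "Alzheimer Disease" (AD) descriptor, as given in the text.
ad_tree_numbers = ["F03.615.400.100", "C10.228.140.380.100", "C10.574.945.249"]

for tn in ad_tree_numbers:
    print(mesh_category(tn), tree_root(tn), ancestor_tree_numbers(tn))
```

Under these assumptions, the first ancestor of each AD tree number is the position of one of its parents (“Dementia” or “Tauopathies”), and the last one is the root of the corresponding MeSH tree.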

The evolution of MeSH is often studied in the broader context of the evolution of dynamic biomedical terminologies [8]. In this area, the effort has often been on defining and studying elementary and composite changes that require one or more basic operations, e.g. adding, removing, merging, splitting, editing and moving elements of a biomedical terminology. For instance, the CONCORDIA framework, which stands for “CONcept and Change-Operation Representation for any DIAlect”, was proposed for representing, reporting and documenting different types of change in medical terminologies [13], and MeSH was explicitly considered in this study. Studying different types of change in MeSH is also the focus of the work presented in this paper. The basic premise is that by studying the basic operations that lead to a change, one can identify the conceptual source of new elements.

Though MeSH is not an ontology, it is often mentioned or even treated as such in relevant literature, and it has also been transformed into a (meta-)ontology, in an effort to formally express all knowledge about semantically indexing MEDLINE [1]. McCray and Lee [11] studied the evolution of MeSH in an ontological context, as a conceptualization of the biomedical domain. They focused on the evolution of the MeSH category “Psychiatry and Psychology” (F), capturing and quantifying change at the level of descriptors, as well as at more fine-grained terminological and lexical levels. In particular, they investigated whether change reflects the evolution of corresponding knowledge in the biomedical domain. Their results reveal that change in MeSH reflects both the evolution of biomedical knowledge and some internal ontological restructuring efforts, such as the separation of behaviors from disorders.

Recently, Balogh et al. [3] studied the evolution of MeSH as a network, focusing on the addition and removal of links between MeSH descriptors, which they call “attachment” and “detachment” of links respectively.
Interestingly, they investigated whether these rewiring events are associated with certain descriptor properties, such as the number of parents or descendants in the hierarchy. Their results suggest that old MeSH descriptors with many descendants appear to receive and lose children descriptors more than expected by chance. On the other hand, descriptors with many ancestors appear to receive and lose children descriptors less than expected by chance.

More recently, Cardoso et al. [5] suggested the interlinking of distinct versions of MeSH, developing a historical knowledge graph, to extend queries for biomedical literature retrieval and to support the maintenance of semantic annotations. In particular, they introduce “evolution connections” between descriptor elements (e.g. terms) in different versions. In some cases, these evolutionary relationships can indicate the conceptual provenance of new descriptors, whereas in other cases they express more fine-grained internal restructuring, such as relocating a term from one MeSH concept to another. Identifying the conceptual provenance of descriptors is also central in the work presented in this paper. However, the focus here is at the level of topics and the goal is to capture additional provenance connections, beyond MeSH concepts, namely through SCRs and Previous Indexing information.

Some related studies also focus on the identification of the elements that need to change in MeSH and try to automate this process. For example, Sari [14] proposed an approach for propagating changes already incorporated in the Gene Ontology into appropriate changes in MeSH. Other studies attempt to predict the extension of MeSH based on different approaches. For instance, Fabian et al. [10] proposed a method for finding siblings to a set of MeSH terms, analyzing the structure and the content of HTML pages on the Web. Guo et al. [16], on the other hand, proposed a “structure-based” method for recommending new siblings for MeSH descriptors that was exclusively based on the positions of existing terms in the MeSH hierarchy. Eljasik-Swoboda et al. [9] proposed an embedding-based method for suggesting new sub-topics for existing topics of MeSH, combining both knowledge about the hierarchy and the analysis of documents already annotated with specific MeSH labels.

Other studies proposed approaches that, beyond the analysis of annotated corpora and the structure of the hierarchy, also incorporate temporal information on the history of MeSH. Tsatsaronis et al. [15] proposed a method to predict which MeSH descriptors should be expanded with new children. This method combined information about the number of articles annotated with each descriptor in PubMed with information about the hierarchical position of each descriptor in MeSH and temporal features that capture changes. Cardoso et al. [6] also proposed a method for identifying concepts that require revision, based on structural and temporal information, as well as information from other resources including the UMLS and article annotations in PubMed.
Beyond the expansion of concepts with more children, their method also suggested other types of revision, such as removal and relocation.

Finally, the MeSH thesaurus has also been considered in some studies in the context of topic modeling and evolution in the biomedical domain. In these works, MeSH labels are treated as keywords to develop a network of keyword co-occurrence from a document corpus, where latent (meta-)topics can be detected as clusters or communities. In this context, Castillo et al. [7] aligned such (meta-)topics of MeSH terms extracted from different time intervals based on the similarity of the corresponding sets. Then, they presented an overview of the evolution of these matched (meta-)topics as a phylogeny-inspired network, where evolutionary events like merge and split can be identified. Balili et al. [2] proposed the TermBall approach for both tracking and forecasting the evolution of such (meta-)topics of MeSH terms, treating them as evolving communities in the term co-occurrence network. The work presented here investigates evolutionary relations between biomedical topics as well; however, we focus on the level of MeSH descriptors, as used for indexing in PubMed, rather than the broader level of latent (meta-)topics reflecting the evolution of a domain.

In this study, we focus on newly added descriptors during the extension of MeSH. In particular, we study whether the meaning of each new descriptor has been covered by old descriptors previously, and if so, how its new position in the hierarchy relates to those of its predecessors. In contrast to most related work in the biomedical terminology evolution context, which focuses either on the operations to implement a change (addition, merge etc.) or on general features of the descriptors, such as depth in the hierarchy, we aim at characterizing the new descriptors according to their conceptual provenance, that is, how they are related to their predecessors in previous versions of MeSH. In order to achieve this characterization, we investigate how we can identify the predecessors and introduce provenance types that provide a new insight into the study of MeSH evolution.

In this section, we introduce a conceptual model to characterize and group new MeSH descriptors based on their conceptual provenance. That is, we investigate the cases of previous coverage of new descriptors during the extension of MeSH. We define the notion of Previous Host (PH), as a predecessor of a new descriptor, and describe categories of descriptors based on how these predecessors can be identified. Subsequently, we introduce types of conceptual provenance, to characterise interesting cases of new descriptors, based on their current relation with each of their PHs in the hierarchy of MeSH.

4.1 MeSH extension and provenance

As the MeSH hierarchy evolves, the new descriptors introduced may cover domain concepts that are not totally new to the vocabulary. Some concepts may have been explicitly present in the previous version of MeSH. In particular, a concept of a new descriptor may have been available as a subordinate concept of an old descriptor or as an SCR concept. The latter case, of turning an SCR concept into a descriptor, is usually reported in a textual note in the new descriptor, called Public MeSH Note (PMN). For example, the “Adenocarcinoma of Lung” descriptor that was introduced in 2019, shown in Fig. 2, has a PMN field indicating its previous state as an SCR mapped to the “Adenocarcinoma” and “Lung Neoplasms” descriptors. Therefore, literature for “Adenocarcinoma of Lung” annotated in 2018 can be found with the “Adenocarcinoma” and “Lung Neoplasms” topic labels.

Fig. 2 The promotion of the SCR “Adenocarcinoma of Lung” into a descriptor in 2019.

In addition, even in cases where the concepts of the new descriptor have not been explicitly available as such, their meaning may have been implicitly covered by old descriptors. Such information is usually available as a Previous-Indexing note (PI) in the new descriptors, as in the case of the “Tauopathies” descriptor in Fig. 1. The PI note indicates that some old descriptors were used to annotate literature for the topic of the new descriptor, during a specific period prior to its introduction. For example, the “tau Proteins” descriptor was used to annotate articles about “Tauopathies” since 1997. This changed in 2002, with the introduction of a descriptor for “Tauopathies”.

In this work, we refer to the MeSH version of the introductory year of a new descriptor as version 1, and the last year before version 1 as version 0. In addition, we refer to such old descriptors that were used to annotate literature for the topic of the new descriptor in the version prior to its introduction (version 0) as its Previous Hosts (PHs). Apart from identifying the PHs of a new descriptor, the current relation of the new descriptor with its PHs is also important. For example, the new descriptor “Adenocarcinoma of Lung” was positioned in the hierarchy as a child of its two PHs (Fig. 2). Therefore, literature for “Adenocarcinoma of Lung” is still covered by “Adenocarcinoma” and “Lung Neoplasms”, as was done prior to the introduction of the new descriptor. On the other hand, the new descriptor for “Tauopathies” is not hierarchically related with its PH “tau Proteins”.
As a result, literature for “Tauopathies” is not covered by the “tau Proteins” descriptor after the introduction of the new descriptor in 2002.

Although the above cases are common, they are not the only types of relation one encounters between a new descriptor and its PH(s). Furthermore, as the MeSH hierarchy keeps evolving, the relation of a descriptor with one or more of its PHs can change in subsequent years, complicating the situation further. Therefore, this relationship depends on the version of MeSH considered, which we call reference version. In this work, aiming at a profound understanding and improved handling of new MeSH descriptors, we investigate their origin. That is, whether they have been covered by descriptors (PHs) in the corresponding version 0, and if so, what their current relation to each of these descriptors is. In order to better quantify and organize these cases we define types of “conceptual provenance” for the new descriptors.

4.2 Previous Hosts (PHs)

A PH of a new descriptor is defined as a descriptor that was used to annotate articles for the topic of the new descriptor in the version 0 of the new descriptor. In that sense, we say that the PH used to cover the topic of the new descriptor for the purposes of literature annotation in version 0. However, it is not required that a PH used to cover the topic exclusively. That is, a PH may have been used for indexing other topics as well, apart from the topic of the new descriptor. Therefore, several new descriptors may share the same PH. In addition, it is not required that a PH used to cover the topic of the new descriptor entirely.
That is, a PH may have been used for indexing only part of a topic, for example in cases where a new high-level descriptor is added to provide a new grouping of related topics.

A formal definition for a PH descriptor d0 for a new descriptor d1 can be based on the condition of topic-overlap as follows:

– The topic-overlap(d1, d0, v) is true when articles for the main topic of d1 used to be indexed under d0 in the MeSH version v. In all other cases, topic-overlap(d1, d0, v) is false.
– The previous-host(d1, d0), denoting that d0 is a PH of d1, is true when topic-overlap(d1, d0, v) is true, where v is the version 0 of d1. In all other cases, previous-host(d1, d0) is false.

This definition of a PH focuses only on the MeSH version that precedes the introduction of the new descriptor (version 0). Descriptors that used to cover the topic in older versions can be recursively described as the PHs of a PH and so on. However, our original motivation is to characterize each new descriptor based on whether its topic was already covered by MeSH, at the time of its introduction (version 1), or not. Therefore, in this work we do not track the history of each new topic in the distant past.

As already discussed in subsection 4.1, there are two types of coverage for a new descriptor in a previous version of MeSH: a) explicit coverage, which is based on the conceptual structure of MeSH descriptors and SCRs into concepts, and b) implicit coverage, which can be identified based on the PI information. Based on the coverage type, we also characterize the corresponding PHs. That is, an explicit PH used to host a subordinate concept, or used to be mapped from an SCR, that corresponds to the new descriptor. On the other hand, an implicit PH was used by the indexers for annotating articles that correspond to the topic of the new descriptor, without any explicit link with the latter in its conceptual structure.

Explicit PHs are of primary importance, as they provide strong conceptual links to the new descriptors. In our quest for a conceptual link to PHs, we focus on the preferred concept of each new descriptor. This is because the preferred concept is the dominant entity that represents the main meaning of a descriptor, as well as the vast majority of articles indexed with the descriptor. For descriptors that are new to the vocabulary, in the absence of any explicit PH, we exploit the PI field to identify any implicit PH. The “Tauopathies” descriptor, shown in Fig. 1, is such a case of a new descriptor without any explicit PH, where the PI field is exploited to identify the implicit PH “tau Proteins”.

4.3 Provenance Categories

For the purpose of identifying the PHs of a new descriptor d1, we seek its preferred concept in the corresponding version 0 of MeSH. Based on whether and how we identify it in existing descriptors, we define four cases (categories) of conceptual provenance, as depicted in Fig. 3 and described below.

Fig. 3 Identifying the provenance category and the Previous Hosts (PHs) for a new descriptor.
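The decision flow of Fig. 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a description of the actual implementation: the `Descriptor` and `SCR` classes, their field names, and the representation of a PI note as (descriptor name, first year, last year) tuples are hypothetical. The toy version 0 snapshot mixes examples from different years, and the 2001 ending year for “tau Proteins” is inferred from the statement that the situation changed in 2002.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    name: str
    preferred_concept: str
    subordinate_concepts: set[str] = field(default_factory=set)
    # Hypothetical encoding of the PI note: (descriptor name, first year, last year).
    previous_indexing: list[tuple[str, int, int]] = field(default_factory=list)


@dataclass
class SCR:
    name: str
    concepts: set[str]
    mapped_to: list[str]  # names of the descriptors this SCR is mapped to


def provenance_category(d1, old_descriptors, old_scrs):
    """Return (category, PH names) for new descriptor d1, given a version 0 snapshot."""
    concept = d1.preferred_concept
    # Category 1: the concept was a subordinate concept of an old descriptor.
    for d0 in old_descriptors:
        if concept in d0.subordinate_concepts:
            return 1, [d0.name]
    # Category 2: the concept was in an SCR; the mapped descriptors are the PHs.
    for scr in old_scrs:
        if concept in scr.concepts:
            return 2, list(scr.mapped_to)
    # Category 3: new concept with a PI note; keep the most recent PI descriptors.
    if d1.previous_indexing:
        latest = max(end for _, _, end in d1.previous_indexing)
        return 3, [name for name, _, end in d1.previous_indexing if end == latest]
    # Category 4: no PH can be identified.
    return 4, []


# Toy version 0 snapshot built from the examples in the text.
old_descriptors = [Descriptor("Pygeum", "Pygeum", {"Prunus africana"})]
old_scrs = [SCR("Adenocarcinoma of Lung", {"Adenocarcinoma of Lung"},
                ["Adenocarcinoma", "Lung Neoplasms"])]

new_descriptors = [
    Descriptor("Prunus africana", "Prunus africana"),
    Descriptor("Adenocarcinoma of Lung", "Adenocarcinoma of Lung"),
    Descriptor("Tauopathies", "Tauopathies",
               previous_indexing=[("tau Proteins", 1997, 2001)]),
    Descriptor("Long Term Adverse Effects", "Long Term Adverse Effects"),
]
for d1 in new_descriptors:
    print(d1.name, provenance_category(d1, old_descriptors, old_scrs))
```

The checks are ordered from explicit to implicit coverage, so the PI note is consulted only when no explicit PH exists, mirroring the priority given to explicit PHs in subsection 4.2.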

Category 1. Old Concept:

Although d1 is a new descriptor, its preferred concept is available in the previous version of MeSH (version 0) as a subordinate concept of another descriptor d0. In this case of explicit coverage, since d0 used to hold the preferred concept of d1, topic-overlap(d1, d0, version 0) is true. The descriptor d0 is therefore an explicit PH of d1. In addition, as each MeSH concept can only belong to a single descriptor in a given version of MeSH [12], d0 is the unique PH of d1.

For example, “Prunus africana”, shown in Fig. 4, introduced in 2016 as a descriptor, was a subordinate (narrower) concept of the “Pygeum” descriptor, which is not included in MeSH any more. In this case the unique PH of “Prunus africana” is the “Pygeum” descriptor, which explicitly included the concept “Prunus africana” in the version prior to the introduction of a dedicated descriptor for it.

Category 2. Old SCR:

Alternatively, since the SCRs account for a large volume of domain concepts that are not included in MeSH descriptors [4], the preferred concept of d1 may have been available as a concept in an SCR scr, prior to the introduction of d1 (version 0). In this second case of explicit coverage, for each descriptor d0 mapped from scr, the literature indexed under scr was also indexed under d0. Therefore, topic-overlap(d1, d0, version 0) is also true. As a result, each descriptor d0 is an explicit PH of d1. For example, the descriptor “Adenocarcinoma of Lung”, introduced in 2019, was previously available as an SCR mapped to the descriptors “Lung Neoplasms” and “Adenocarcinoma” (Fig. 2). These two descriptors are the explicit PHs of the new descriptor.

Fig. 4 An example of descriptor succession.

Category 3. New PI Concept:

The preferred concept may be new, introduced together with the new descriptor d1. For such new descriptors, if previous-indexing (PI) information is available, this means that some other descriptors were previously used to index articles for the topic of d1 (new PI concept). Therefore, the preferred concept of d1, though new in the MeSH thesaurus, was previously indexed under some older descriptors with other concepts, hence implicitly covered by them. In such cases of implicit coverage, the PI descriptors that were used until the introduction of d1 are the ones with the most recent ending year in the accompanying period (version 0). Therefore, for each descriptor d0 that was used until the introduction of d1, topic-overlap(d1, d0, version 0) is true. As a result, the most recent PI descriptors are the implicit PHs of the new descriptor d1.

For example, “Zika Virus Infection” was introduced as a descriptor in 2015, and was not previously present as a concept in MeSH. However, it is annotated with a PI note, revealing that the descriptors named “Arbovirus Infections” and “Flavivirus Infections” have been used for indexing literature relevant to “Zika Virus Infection” until 2015. Therefore, these two descriptors are the PHs of “Zika Virus Infection”.

Category 4. New Emerging Concept:

On the other hand, there exist new descriptors for which no PI information is available, so no PH can be identified and their set of PHs is empty. Such totally new descriptors are expected to include emerging domain concepts without significant presence in prior literature. Therefore, the curators begin indexing articles for a domain topic previously not indexed under any specific MeSH descriptor.

For example, “Long Term Adverse Effects” was introduced in 2015, as presented in Fig. 5. No “Long Term Adverse Effects” concept was previously present in MeSH and no PI information is available to report that articles for “Long Term Adverse Effects” were indexed under some particular descriptor until 2015. Therefore, this totally new descriptor has no PHs at all.

Fig. 5 An example of descriptor emersion.

4.4 Provenance Types

Having identified the PHs and the provenance category of each new descriptor, next we investigate the hierarchical relation of the new descriptor with each one of its PHs. This relation starts with the introduction of the new descriptor in the MeSH hierarchy, but may change in the course of the years, as the hierarchy evolves further. Therefore, to characterise the relation of a new descriptor d1 with a PH d0 in the context of a given reference version of MeSH, we focus on two basic properties of this relation in the corresponding hierarchy, namely the relation type of d1 with d0 and the distance between them in the hierarchy.

– The relation type(d1, d0) is: (a) ancestor when d1 has at least one tree number that includes a tree number of d0, (b) descendant when d0 has at least one tree number that includes a tree number of d1, (c) unrelated when none of the tree numbers of d0 includes or is included in any of the tree numbers of d1, and (d) undefined when d0 is not present in the reference version of MeSH.
– The distance(d1, d0) is the number of other descriptors included in the shortest path connecting d1 and d0. If d0 is located in a position in the hierarchy that is not connected with d1, then the distance(d1, d0) can be considered to be infinite. If d0 is not present in the reference version of the hierarchy, the distance(d1, d0) is undefined.
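The two properties defined above can be computed directly from tree numbers. The following is a minimal sketch, assuming that the reference version is available as a dictionary from descriptor names to tree numbers (a hypothetical representation) and that paths run only along tree-number edges, so that positions in different trees are treated as unconnected. An undefined distance is returned as `None`. The tree numbers of “Dementia” and “Tauopathies” are derived from the AD tree numbers quoted in Section 2; AD is used here only to exercise the functions, and “Hypothetical Descriptor” with its tree number is invented for illustration.

```python
import math
from typing import Optional


def extends(longer: str, shorter: str) -> bool:
    """True when tree number `longer` includes `shorter` as a proper prefix."""
    a, b = longer.split("."), shorter.split(".")
    return len(a) > len(b) and a[: len(b)] == b


def relation_type(d1: str, d0: str, tree_numbers: dict[str, list[str]]) -> str:
    if d0 not in tree_numbers:
        return "undefined"  # d0 is absent from the reference version
    pairs = [(a, b) for a in tree_numbers[d1] for b in tree_numbers[d0]]
    if any(extends(a, b) for a, b in pairs):
        return "ancestor"  # a tree number of d1 includes one of d0
    if any(extends(b, a) for a, b in pairs):
        return "descendant"  # a tree number of d0 includes one of d1
    return "unrelated"


def distance(d1: str, d0: str, tree_numbers: dict[str, list[str]]) -> Optional[float]:
    if d0 not in tree_numbers:
        return None  # undefined
    best = math.inf
    for a in tree_numbers[d1]:
        for b in tree_numbers[d0]:
            sa, sb = a.split("."), b.split(".")
            common = 0  # number of leading segments shared by both positions
            for x, y in zip(sa, sb):
                if x != y:
                    break
                common += 1
            if common == 0:
                continue  # different trees: assumed not connected
            edges = (len(sa) - common) + (len(sb) - common)
            best = min(best, edges - 1)  # descriptors strictly between the two
    return best


reference = {
    "Alzheimer Disease": ["F03.615.400.100", "C10.228.140.380.100", "C10.574.945.249"],
    "Dementia": ["F03.615.400", "C10.228.140.380"],
    "Tauopathies": ["C10.574.945"],
    "Hypothetical Descriptor": ["D12.776"],  # invented tree number
}

for d0 in ["Dementia", "Tauopathies", "Hypothetical Descriptor", "Pygeum"]:
    print(d0, relation_type("Alzheimer Disease", d0, reference),
          distance("Alzheimer Disease", d0, reference))
```

Because a descriptor can occupy several positions, the sketch takes the minimum over all pairs of tree numbers; it does not consider paths that switch between different positions of the same intermediate descriptor.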

For example, the relation between the new descriptor “Adenocarcinoma of Lung” and its PH “Lung Neoplasms” has ancestor relation type and zero distance with MeSH 2019 as reference version (see Fig. 2). On the other hand, the current relation of “Prunus africana” and its PH “Pygeum”, in the context of MeSH 2020 as reference version, has undefined relation type and undefined distance, since “Pygeum” is no longer present in MeSH (see Fig. 4). Based on the relation type and the distance of the relation between a new descriptor d1 and a PH d0, we also define some cases of interest, which we call conceptual provenance types.

Type 0. Emersion: No PH found.

For new descriptorsin category 4, where no PH can be identiﬁed. In thesecases there is no PH for which to investigate the currentrelation, therefore we deﬁne the trivial type of prove-nance emersion , which includes all descriptors of prove-nance category 4 and only descriptors of category 4 .This exceptional type of provenance does not reﬂectthe relationship with any PH, therefore it is not basedon relation type and distance in a speciﬁc reference ver-sion of MeSH. The meaning of such a completely newdescriptor is emerging when the new descriptor is intro-duced, and is characterized as emergent hereafter. The“Long Term Adverse Eﬀects” descriptor introduced in2015, is an example of emersion (see Fig. 5).

Type 1. Succession: relation type(d1, d0) = undefined and distance(d1, d0) = undefined. For some new descriptors, a PH may no longer be present in the reference version of MeSH. In this case, d1 is considered one of the successors of d0, because at least some of the articles that used to be annotated with d0, in version 0 for d1, are annotated with d1 instead, in the reference version of MeSH. In the example of Fig. 4, the new descriptor “Prunus africana” is a case of succession, as its PH is not available in the context of the reference version, MeSH 2020.

Type 2. Subdivision: relation type(d1, d0) = ancestor and distance(d1, d0) = 0. A new descriptor d1, whose PH d0 has become its parent. In this case, d0 covers the topic of the new descriptor entirely, but d1 supports the partition of the corresponding literature into more fine-grained conceptual sets. This is the most expected type of relation between new descriptors and their PHs, as the vocabulary evolves towards more detailed descriptors to support more precise topic annotations. In the subdivision example of Fig. 6, “Regulated Cell Death”, introduced in 2020, used to be indexed under “Cell Death” until 2019, which became its parent.

Fig. 6 “Regulated Cell Death” as a subdivision of “Cell Death”, the submersion of “Ferroptosis” and the detachment of “Necroptosis”.

Type 3. Submersion: relation type(d1, d0) = ancestor and distance(d1, d0) > 0. A new descriptor d1, whose PH d0 has become an ancestor, but not a parent. This is similar to subdivision, as they are both characterized by ancestor relation type, but at least one other descriptor appears between d0 and d1 in the hierarchy. This is also in accordance with the evolution towards more detailed descriptors, as d0 keeps covering the topic of the new descriptor entirely. However, the distance between them suggests that intermediate levels of detail are also available.

“Ferroptosis”, introduced in 2020 (Fig. 6), is an example of submersion, as it was indexed under “Cell Death” until 2019, which is now an ancestor but not a parent of it. In this example, the fact that “Regulated Cell Death”, which operates as the intermediate level of detail, was introduced together with “Ferroptosis” can explain why “Ferroptosis” articles were previously indexed under “Cell Death” instead of “Regulated Cell Death”.

Type 4. Overtopping: relation type(d1, d0) = descendant. A new descriptor d1, whose PH d0 has become its descendant. In this case, although literature for the new topic used to be indexed under d0 in the past (version 0), d1 is an ancestor of d0 in the reference version of MeSH, hence broader than it. Such new descriptors provide a new grouping of the old topics, potentially enhanced with additional terms for the aggregate topic. In the example depicted in Fig. 7, “Crystal Arthropathies”, introduced in 2017, has two implicit PHs, as it was indexed as “Chondrocalcinosis” and “Gout” until 2016. Both of them are children of “Crystal Arthropathies” in 2020, hence overtopped by it. Such cases seem less expected than the ones with ancestor relation type (subdivision and submersion), as this situation suggests that d0 used to cover only a part of the topic of d1. In addition, overtopping is less interesting from a practical point of view, as the use of the narrower descriptor that covers a topic is a common MeSH-indexing practice [12]. Therefore, though different levels of detail may exist between the new descriptor and its descended PH, splitting this small group of cases based on the distance would not add particular value.

Fig. 7 The “Crystal Arthropathies” overtopping its PHs.

Type 5. Detachment: relation type(d1, d0) = unrelated. A new descriptor d1 that is not related to its PH d0 with any of the above relations. In this case, d1 is detached from d0, placed in a position where neither is included by the other. In the example of Fig. 6, “Necroptosis”, introduced in 2020 as a child of “Regulated Cell Death” in the “Phenomena and Processes” MeSH category (G), was previously indexed as “Necrosis”. Although “Necrosis” used to be a child of “Cell Death” in 2019, in 2020 it belongs only to the “Diseases” MeSH category (C) and is not directly related to “Necroptosis”. Therefore, we consider the “Necroptosis” descriptor to be detached from its PH “Necrosis” in 2020.

Detached descriptors may be positioned quite close to their PH in terms of distance, but are not related as ancestors or descendants to it. In the example of Fig. 8, “Undiagnosed Diseases”, introduced in 2020 as a child descriptor of “Disease Attributes”, was previously indexed under “Rare Diseases”, which is also a child of “Disease Attributes”. However, we consider that “Undiagnosed Diseases” is detached from its PH “Rare Diseases”, as their topics are effectively disjoint. That is, neither of the two topics includes the other in the reference version (MeSH 2020).

Fig. 8 The detachment of “Undiagnosed Diseases” from “Rare Diseases”.

Fig. 9 The “Shoulder Dystocia” descriptor has two provenance codes, namely code 3.2 for the subdivision of the PH “Dystocia” and code 3.5 for the detachment from the PH “Shoulder”.
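The type definitions above amount to a small decision rule over the relation type and the distance. The following sketch restates that rule under a few assumptions that are ours rather than the paper's: relation types are passed as plain strings, an absent PH has relation type `"undefined"`, and a descriptor without any PH is given as an empty list.

```python
def provenance_type(relation, distance):
    """Map the relation of a new descriptor with one PH to a provenance type.

    Returns a (type number, type name) pair following Types 1 to 5 above.
    """
    if relation == "undefined":
        return 1, "Succession"      # PH absent from the reference version
    if relation == "ancestor":
        # A parent (distance 0) is a subdivision; a farther ancestor is a submersion.
        return (2, "Subdivision") if distance == 0 else (3, "Submersion")
    if relation == "descendant":
        return 4, "Overtopping"     # distance is not used for this type
    if relation == "unrelated":
        return 5, "Detachment"
    raise ValueError(f"unknown relation type: {relation!r}")

def provenance_types(ph_relations):
    """Collect the provenance types of a descriptor over all its PHs.

    `ph_relations` holds one (relation, distance) pair per PH; an empty
    list stands for a descriptor with no PH, i.e. Type 0 (Emersion).
    """
    if not ph_relations:
        return {(0, "Emersion")}
    return {provenance_type(rel, dist) for rel, dist in ph_relations}
```

For instance, a descriptor with one PH as parent and one unrelated PH, as in the “Shoulder Dystocia” case, yields both `(2, "Subdivision")` and `(5, "Detachment")`, which reflects that the types are assigned per PH.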

Provenance codes: In order to easily refer to both the category and the type of conceptual provenance, we adopt a composite provenance code, with a prefix indicating the category of a descriptor and a suffix indicating the type of its relation to some PH, separated by a dot, as shown in Table 1. For example, the provenance code for “Necroptosis” (Fig. 6) is 3.5, indicating provenance category 3 for “new concept”, as the PH has been identified based on PI information, and provenance type 5 for detachment from “Necrosis”. Similarly, the provenance code for “Prunus africana” (Fig. 4) is 1.1, with category 1 for “old concept” and type 1 for succession of “Pygeum”. In the special case of type 0, emersion, the preferred concept of the new descriptor is by definition in category 4; hence, all emersion cases have the trivial provenance code 4.0.

As some new descriptors can have more than one PH, the provenance types described above are not mutually exclusive. Therefore, a new descriptor can have multiple provenance codes. This is not true for the provenance categories; therefore all provenance codes of a specific new descriptor begin with the same prefix. For example, “Shoulder Dystocia”, depicted in Fig. 9, was introduced in 2020 as a child descriptor of “Dystocia”. Articles for shoulder dystocia were indexed as both “Dystocia” and “Shoulder” until 2019, hence it is both a case of subdivision (3.2) of the PH “Dystocia”, which became its parent, and a case of detachment (3.5) from the PH “Shoulder”, which is not directly related with the new descriptor.

Table 1 Provenance codes characterizing the relationship of a new descriptor with a PH, encoding categories and types as prefixes and suffixes respectively. The exceptional case of emersion type corresponds to code 4.0.

                     Properties                     Provenance Category
Provenance Type      relation type   distance      1. Old concept   2. Old SCR   3. New PI concept
.1 Succession        undefined       undefined     1.1              2.1          3.1
.2 Subdivision       ancestor        0             1.2              2.2          3.2
.3 Submersion        ancestor        > 0           1.3              2.3          3.3
.4 Overtopping       descendant      ≥ 0           1.4              2.4          3.4
.5 Detachment        unrelated       ≥ 0           1.5              2.5          3.5

In this section, we describe the computational tools developed for the automated identification of new MeSH descriptors, their PHs and provenance codes, in the context of the conceptual model introduced in section 4. These tools access the original source files of MeSH, as provided by NLM, in the MeSH XML format. Therefore, all available information is accessible by the tools and any new versions of the hierarchy can be directly incorporated upon release. Figure 10 illustrates the sequence of processing steps that are involved in relating new descriptors to their PHs. The source code of the tools is openly available on GitHub (https://github.com/tasosnent/MeSH_Extension).

5.1 Harvesting MeSH versions

As mentioned in previous sections, we focus our analysis on the provenance of descriptors that are present in a reference version of MeSH, namely the latest one. Therefore, we do not process descriptors that appear and disappear in various older versions. However, we are still interested in annotating descriptors that appear in older versions and remain available in the reference version. As a result, we need to process older versions as well, covering a period from year 0 to year N.

In particular, the process starts with the harvesting of MeSH files for different versions of the hierarchy. This step begins with parsing the basic XML file for each year to extract the descriptors available in this version. This set of descriptors is then compared to those of the previous year to identify the new ones. The same process is repeated for each year, with the exception of the very first one, for which no previous version is available. Apart from the basic file comprising the MeSH descriptors, the XML file of the SCRs is also parsed for each version, to extract the corresponding set of available SCRs. These are needed for the identification of provenance categories and types.

Extracting descriptor attributes: For the descriptors of interest, a number of attributes need to be extracted, in order to help us trace their provenance. The most important attribute is the MeSH code of the descriptor, which is the unique identifier considered for checking descriptor existence and identity. Other relevant information includes the positions of a descriptor in the hierarchy (tree numbers), its preferred concept and the content of the PMN and the PI fields. Most parts of this attribute extraction step are quite straightforward, as we primarily rely on the unique identifiers of the entities involved in the analysis. For example, the information needed for identifying the earlier status of a descriptor as a subordinate concept in its version 0 is the unique concept identifier of its preferred concept. This is because we need to compare this identifier with the identifiers of subordinate concepts of any descriptor in version 0.

However, automated extraction of information from the PMN and PI fields proved more challenging, as these fields contain information in semi-structured text, meant to be read by humans. Therefore the structure of this text is inconsistent, while descriptors and SCRs are mentioned with their current preferred terms, instead of the corresponding unique identifiers. Consequently, we adopted a semi-automated approach, based on regular expressions, in order to extract information from these fields. In the large majority of cases we managed to minimize the required manual effort, as described below.
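The year-over-year comparison described in this section can be sketched as a set difference over descriptor identifiers. The sketch below is only an illustration, not the code of the released tools: the element names `DescriptorRecord` and `DescriptorUI` are assumed to follow NLM's descriptor XML files, the helper `_sample` and the identifiers `D1` to `D5` are made up, and real files would be read from disk rather than built inline.

```python
import xml.etree.ElementTree as ET

def descriptor_ids(xml_text):
    """Collect the unique identifiers of all descriptor records in one version."""
    root = ET.fromstring(xml_text)
    return {rec.findtext("DescriptorUI") for rec in root.iter("DescriptorRecord")}

def new_descriptors(ids_by_year, reference_year):
    """Identify, per year, the descriptors that are new and still in the reference version.

    The first year is skipped, as no previous version is available for it.
    """
    years = sorted(ids_by_year)
    reference = ids_by_year[reference_year]
    result = {}
    for previous, current in zip(years, years[1:]):
        introduced = ids_by_year[current] - ids_by_year[previous]
        result[current] = introduced & reference  # drop descriptors removed later
    return result

def _sample(ids):
    """Build a tiny, hypothetical descriptor file for illustration only."""
    records = "".join(
        f"<DescriptorRecord><DescriptorUI>{ui}</DescriptorUI></DescriptorRecord>"
        for ui in ids
    )
    return f"<DescriptorRecordSet>{records}</DescriptorRecordSet>"

ids_by_year = {
    2018: descriptor_ids(_sample(["D1", "D2"])),
    2019: descriptor_ids(_sample(["D1", "D2", "D3", "D4"])),
    2020: descriptor_ids(_sample(["D1", "D3", "D5"])),
}
new_by_year = new_descriptors(ids_by_year, reference_year=2020)
```

In this toy example `D4` appears in 2019 but is gone by 2020, so it is not reported, in line with the choice to ignore descriptors that appear and disappear in older versions.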

Extraction from the PMN field: The PMN (Public MeSH Note) field of a MeSH descriptor typically consists of sentences separated by semicolons and may provide varying information, such as the year the descriptor was introduced and changes in the preferred term. Of particular interest for this work are PMN sentences that report the earlier status of the descriptor as an SCR. This is done with expressions of the form “X was indexed under Y”, where X is the SCR and Y comprises one or more descriptors together with the corresponding time periods, as shown in the example of Fig. 2. This is useful as in some cases an SCR that gets “promoted” to descriptor may undergo some minor term modifications and receive a new identifier. In such cases, exploiting the PMN is the only way to identify the old SCR for the new descriptor, which would otherwise be considered totally new.

Fig. 10 The computational process for identifying new descriptors and annotating them with provenance codes.

Therefore, when attempting to associate a new descriptor with an earlier SCR, we start by comparing the identifier of the preferred concept of the descriptor to the concept identifiers in earlier SCRs. If this exact-match search fails, we resort to the use of the PMN expressions mentioned above. In particular, we first use regular expressions to extract from the PMN field the preferred term (X) of the old SCR and map it to some SCR identifier in the corresponding version of MeSH. In our analysis, this method managed to automatically identify the missing links for the majority of cases (74%) where the PMN field matches the “X was indexed under Y” expression and the exact-match search fails.

For the few remaining cases, we calculated the similarity of X and the current descriptor name to earlier SCR terms. Based on this similarity, the system produced best-match suggestions, which were confirmed manually. More details about this method are available in a technical report available online (https://docs.google.com/document/d/1J3X5OlrkIErDR-qJf0KT669Du9xndfPNBRumhYS9Yxw/edit?usp=sharing). There is also a small number of cases where more than one old SCR is reported in the PMN field. In such cases, only the first SCR was considered, as this usually corresponds to the preferred concept of the new descriptor, representing its central meaning.

Extraction from the PI field: The PI (Previous Indexing) field of a MeSH descriptor can be used to link a new descriptor to old ones, when such a link is not provided explicitly, that is by a previous state of the descriptor as a subordinate concept or SCR. The PI field contains a list of semi-structured notes in English. Each note usually consists of the relevant descriptors for a previous period, often followed by the corresponding time period in parentheses (Fig. 1). Exploiting this pattern, we used regular expressions to extract the terms and the corresponding time periods. Some exceptions not fitting the patterns were identified and handled manually. In cases where the PI field consists of multiple notes, all the descriptors with the most recent end year are considered as PHs, as done for “Shoulder Dystocia” in the example of Fig. 9. Any older PI elements are neglected.

Selecting provenance type: In the last part of the MeSH harvesting step, each new descriptor is annotated with conceptual provenance codes. In particular, the first step is to select the provenance category based on the previous state of the current preferred concept as a subordinate concept or an SCR concept in the corresponding version 0, as depicted in the schema of Fig. 3. Then, the provenance type is selected, based on the current relation of the new descriptor to its PHs, which have been identified by the extraction process. Combining the provenance types with the category, the complete set of provenance codes is formed. The end result is a collection of all the new descriptors that have been introduced during the period considered and remain available in the reference version of MeSH. These descriptors are annotated with their basic information and provenance annotations, and stored in CSV files named after the year that corresponds to the version of introduction (version 1) of each descriptor.
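The PI-based selection of PHs and the final composition of provenance codes can be sketched as follows. This is a simplified illustration and not the regular expressions of the released tools: the note shape `Term (start-end)` is an assumption based on the description above, the third PI note in the usage note is made up, and notes that do not fit the pattern are simply skipped here, whereas the paper handles such exceptions manually.

```python
import re

# Assumed note shape: a descriptor term followed by a period in parentheses,
# e.g. "Dystocia (1966-2019)".
PI_NOTE = re.compile(r"^\s*(?P<term>.+?)\s*\((?P<start>\d{4})-(?P<end>\d{4})\)\s*$")

def latest_previous_hosts(pi_notes):
    """Keep the descriptors whose period has the most recent end year."""
    parsed = []
    for note in pi_notes:
        match = PI_NOTE.match(note)
        if match:  # notes without a recognisable period are skipped in this sketch
            parsed.append((match["term"], int(match["end"])))
    if not parsed:
        return []
    latest = max(end for _, end in parsed)
    return [term for term, end in parsed if end == latest]

def provenance_codes(category, type_by_ph):
    """Combine a provenance category with the type found for each PH.

    `type_by_ph` maps each PH term to its provenance type number (1 to 5).
    Category 4 has no PH, so it always yields the trivial code 4.0.
    """
    if category == 4:
        return ["4.0"]
    return sorted({f"{category}.{ptype}" for ptype in type_by_ph.values()})
```

For example, given the notes `["Dystocia (1966-2019)", "Shoulder (1966-2019)", "Delivery, Obstetric (1966-1990)"]`, the first two terms are kept as PHs, and `provenance_codes(3, {"Dystocia": 2, "Shoulder": 5})` returns `["3.2", "3.5"]`, the two codes reported for “Shoulder Dystocia”.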

Table 2 The distribution of the 6,915 new descriptors (2006-2020) into provenance codes. The total per category can be lower than the sum of distinct type counts, as the types are not mutually exclusive.

                   Prov. Category
Prov. Type         1. Old con.   2. Old SCR   3. New PI con.   Total/type
.1 Succession      21            12           84               117
.2 Subdivision     276           967          1,603            2,846
.3 Submersion      47            535          506              1,088
.4 Overtopping     24            7            91               122
.5 Detachment      151           364          1,313            1,828
Total/category     519           1,616        3,060

The total for category 4, Emersion (4.0), is 1,720.

identified and annotated all the descriptors introduced during this period, considering MeSH 2020 as the reference version. In other words, we are interested in the current status of the descriptors, but we use the year of their introduction, version 1, in order to identify their previous hosts (PHs) and provenance category. The result of the computational processing is a CSV file for each MeSH version, comprising the new descriptors introduced this year and their provenance annotations, which are publicly available at https://github.com/tasosnent/MeSH_Extension/blob/main/NewDescriptors_2006_2020.csv.

As a final step, these files are parsed and analysed to produce statistics and diagrams that provide alternative views of conceptual provenance in the course of MeSH expansion. In particular, the diagrams that are generated present the frequencies of provenance categories, types and codes per year of introduction and in total. Based on these diagrams, we attempt to answer the basic questions driving this study and identify patterns and observations that may be of interest for understanding the dynamics of MeSH extension.

6.2 Overview of new descriptors and their provenance

Table 2 presents the distribution of new descriptors into provenance categories and types. In total, 6,915 descriptors were introduced in MeSH since 2006 and were retained until 2020. This corresponds to an extension of about 30%, compared with the 22,997 descriptors available back in 2005, and indicates that about 23% of all current descriptors have been introduced during the last fifteen years.

New descriptors introduced for new concepts that were implicitly covered in their version 0 by old descriptors (category 3) form the most frequent provenance category, accounting for about 44% of all new descriptors.
New descriptors for old concepts that have been explicitly covered in previous versions account for about 31% of all new descriptors, with the majority of cases covered by SCRs (category 2, ∼23%) rather than by subordinate concepts (category 1, ∼8%). That is, new descriptors for old concepts are introduced more often for promoting SCRs (category 2) than for promoting subordinate concepts by restructuring old descriptors (category 1). On the other hand, new descriptors for emerging concepts (category 4), which are totally new for the MeSH vocabulary, account for 25% of all new descriptors. This relatively low frequency of Emersion suggests that in most cases new descriptors are linked to domain entities that are already covered by other descriptors, either implicitly (category 3) or explicitly (categories 1 and 2). Therefore, the new conceptual entities that are very often introduced (categories 3 and 4 account for 69% of new descriptors) are not completely novel, but they usually offer dedicated descriptors to known concepts (category 3).

Furthermore, the annual distribution of new descriptor categories, shown in Fig. 11, confirms the consistently high frequency of categories 3 and 4 throughout the years. In particular, the introduction of descriptors for new PI concepts and for new emerging concepts each accounts for at least around 100 cases annually for the whole period of study. However, category 4 is more stable around its mean value (AVG) of almost 115 cases per year, with a standard deviation (SD) of 22 cases, whereas category 3 presents more variation around its mean of 204 cases (SD ∼85 cases), reaching up to 300 and 400 cases in certain years.

On the other hand, the promotion of existing SCRs into descriptors (category 2) seems the least predictable category, with an AVG around 108 and an SD around 131 cases per year. In particular, in certain years (e.g. 2006, 2019) there seems to be a surge of such cases, while in others the number is much smaller. Finally, the evolution of existing subordinate MeSH concepts into independent descriptors (category 1) seems the least frequent and the most stable category, with an AVG of around 35 and an SD of around 13 new descriptors per year.

Fig. 11 Frequency of provenance categories for new descriptors, per year of introduction.

The extreme peak of more than 900 new descriptors observed in 2006 may be the result of an effort at NLM to restructure descriptors for chemicals that combined meanings for activity and structure. This effort, which has been spanning many years, was continued in 2006. In addition, promoting SCRs to descriptors was particularly encouraged this year at NLM, which is in agreement with the fact that this peak seems to be almost exclusively attributed to promoted SCRs (category 2), which are known to represent mainly chemicals. This is also confirmed by the distribution of new descriptors into MeSH categories (Fig. 12), as 73% of the new descriptors introduced in 2006 belong to “Chemicals and Drugs” (D). This relative frequency for 2006 far exceeds the overall relative frequency of category D for the whole period considered, that is around 41%.

Two less extreme peaks are also observed in 2011 and 2017, with the introduction of about 600 new descriptors each. In contrast to the 2006 peak, these seem to be primarily attributable to category 3 cases, as the other categories present frequencies close to the ones of the adjacent years. In addition, the distribution of the corresponding new descriptors into MeSH categories suggests that, though the chemicals category D has relatively high frequencies in these years, other MeSH categories also contribute considerably to these peaks. In other words, these peaks of new descriptors for new PI concepts (category 3) seem to be more evenly distributed across MeSH categories than the 2006 peak of category 2 cases.

Cho, Dan-Sung (NIH/NLM), personal communication.

For 2011, this is in agreement with a focus in MeSH on projects related to categories “Biological Sciences” (G) and “Analytical, Diagnostic and Therapeutic Techniques, and Equipment” (E). The peak of 2017, on the other hand, seems to be affected by the “MeSH Protein Project”, as part of which almost 290 new descriptors were added. The aim of this project was to achieve alignment of gene families, as described by the HUGO Gene Nomenclature Committee (HGNC), with protein classes in MeSH. In addition, more new descriptors than usual were introduced in 2017 for some less frequent MeSH categories, such as “Health Care” (N) and “Persons” (M).

Regarding the provenance types of new descriptors, Subdivision (.2) is the most common case (41%), followed by Detachment (.5, 26%) and Emersion (.0, 25%).
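These relative frequencies follow from the totals of Table 2, as the short check below illustrates. The counts are copied from Table 2 and its note on category 4; the variable and function names are ours, and rounding to whole percentages is an assumption about how the figures in the text were obtained.

```python
# Totals per provenance type, from Table 2 (Emersion from the note on category 4).
type_totals = {
    "Succession": 117,
    "Subdivision": 2846,
    "Submersion": 1088,
    "Overtopping": 122,
    "Detachment": 1828,
    "Emersion": 1720,
}
TOTAL_NEW = 6915  # all new descriptors, 2006-2020

def share(count, total):
    """Relative frequency as a whole percentage."""
    return round(100 * count / total)

# Share of each type among all new descriptors.
shares = {name: share(count, TOTAL_NEW) for name, count in type_totals.items()}

# Descriptors with at least one PH (categories 1, 2 and 3).
non_emerging = TOTAL_NEW - type_totals["Emersion"]
detachment_non_emerging = share(type_totals["Detachment"], non_emerging)
```

Because the types are not mutually exclusive, a descriptor with several PHs can be counted under more than one type, so the shares add up to more than 100%.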

Submersion also has a considerable frequency of 16%, but Succession (.1) and Overtopping (.4) are quite scarce, accounting for about 2% each. This distribution seems to be in agreement with the expected low frequency of new descriptors being broader than their PHs (Overtopping) or having their PHs removed from the vocabulary (Succession). However, the frequency of new descriptors that are no longer covered by any of their PHs (Detachment) seems quite notable, representing 35% of non-emerging new descriptors (categories 1, 2 and 3). This implies that the addition of dedicated descriptors for concepts that used to be covered by older descriptors (PHs) often serves the removal of these subordinate, supplementary or implicitly covered concepts from these PHs, improving the specificity of the latter.

Fig. 12 Frequency of MeSH categories for new descriptors, per year of introduction. The four MeSH categories accounting for at least 10% of new descriptors each are presented independently. The remaining twelve categories, which have an overall frequency of less than 10% of new descriptors each, are collectively presented as “Other Categories”.

On the other hand, the majority of new descriptors appear to be still covered by their PHs, offering subtopics to the latter. In particular, about 55% of all the new descriptors have at least one ancestor in their PHs, that is, they belong to Subdivision or Submersion cases, with the latter being far less frequent, as expected (16%). This suggests that only about half of the new descriptors end up as descendants of their PHs. However, focusing on the 5,195 non-emerging new descriptors that actually have at least one PH (categories 1, 2 and 3), this relative frequency increases to 73%, with Subdivision accounting for 55% of the cases and Submersion for only 21% of them. This is in agreement with the expected evolution of the topic vocabulary towards more fine-grained descriptors. The latter support more precise topic annotations and retrieval, especially when more documents are accumulated for some descriptors over the years.

Figure 13 presents the annual distribution of new descriptors into provenance types. Despite annual fluctuations, there seems to be a clear separation of the frequent types (Emersion, Subdivision, and Detachment) from the infrequent ones (Succession and Overtopping) throughout the period of study. Finally, the Submersion type seems to fall in between the two groups. In addition, it seems that the infrequent types of Succession and Overtopping vary the least through the years (SD 7 and 5 respectively). The more frequent types of Subdivision, Detachment and Submersion seem to be the least predictable (SD 81, 56 and 66 respectively), whereas the trivial type of Emersion, though quite frequent as well, appears to be relatively stable, as already noticed for category 4.

As with MeSH categories, the surge of cases in certain years is not evenly distributed across all provenance types. Although the representation of all provenance types appears to be close to their overall relative frequency in the peak of 2011, this is not always the case. In 2006, Submersion seems to be over-represented, accounting for 31% of the cases, which is roughly double its overall relative frequency for the period of study (16%). This could be related to the complex organization of chemical SCRs into groups and subgroups. For example, “Receptors, Scavenger” as well as the six classes of them (“Scavenger Receptors, Class A” etc.) used to be SCRs indexed under “Receptors, Immunologic” until their promotion into descriptors in 2006. Although “Receptors, Scavenger” was added as a child (2.2) of its PH “Receptors, Immunologic”, the six classes were added as children of “Receptors, Scavenger”, hence more distant descendants of “Receptors, Immunologic” (2.3).

Fig. 13 Frequency of provenance types for new descriptors, per year of introduction.

On the other hand, Detachment seems to be over-represented in the peak of 2017, accounting for 39% of the new descriptors, whereas its overall relative frequency for the whole period is 26%. Some of these Detachment cases are new descriptors for protein domains or motifs detached from the corresponding protein descriptors, which can be related to the “MeSH Protein Project”. For example, the new descriptor “Methyl CpG Binding Domain” was detached from its PH “DNA-Binding Proteins”. In addition, several new descriptors in MeSH categories “Health Care” (N) and “Persons” (M) appear to represent medical professions detached from the corresponding medical domains. For example, the new descriptor “Nephrologists” was detached from its PH “Nephrology”.

Some of the types, in particular Subdivision (.2) and Detachment (.5), seem to be correlated in the way they increase or decrease over the years. It would therefore be of interest to investigate whether the correlation of their annual frequencies observed in Fig. 13 should be attributed to the addition of descriptors that exhibit both these provenance types simultaneously. This is only possible in category 2 and category 3, where the availability of multiple PHs for a new descriptor can lead to multiple provenance codes. In practice, however, new descriptors with multiple provenance codes are not very common, representing almost 17% of all new descriptors in these two categories.

Focusing on the majority of new descriptors that have a single provenance type, we compare the annual frequencies of Subdivision (.2) and Detachment (.5) (Fig. 14). The correlation of the frequencies seems to be preserved in the frequent category 3 (blue lines with square markers). In other words, even when looking at distinct new descriptors that share no common provenance types, Subdivision (3.2) and Detachment (3.5) seem to fluctuate in the same way across the years. For category 2, on the other hand (green lines with triangle markers), Detachment (2.5) does not seem to keep up with Subdivision (2.2), which presents some high peaks (2006, 2016, 2019). This is reasonable, as the link of the new descriptors to their PHs is stronger in category 2, which is based on explicit coverage, compared to category 3, where the PHs used to cover the new descriptors only implicitly.

It appears that in category 3, the number of new descriptors that are added as children of their PHs is usually comparable to the number that are detached from their PHs. This observation could be the effect of an internal procedure in the maintenance of MeSH and may warrant further investigation. On the other hand, the frequency of emerging descriptors without any PHs (Emersion, 4.0) (Fig. 13) exhibits fluctuations that are not particularly correlated to the other frequent types of provenance. This suggests that the addition of descriptors with totally new preferred concepts forms a distinct subset of the new descriptors added each year.
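The co-variation of two annual series discussed above is assessed visually in this study. One way to quantify it, which is our suggestion and not part of the reported analysis, is the Pearson correlation coefficient of the annual counts. The sketch below uses made-up counts purely for illustration; the real per-year counts would come from the CSV files produced by the harvesting step.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical annual counts of single-type descriptors (NOT the study's data).
subdivision_per_year = [120, 80, 95, 150, 60]
detachment_per_year = [100, 70, 90, 130, 55]
r = pearson(subdivision_per_year, detachment_per_year)
```

A value of `r` close to 1 would indicate that the two types rise and fall together across the years, while a value near 0 would indicate no linear association.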

In this work we proposed a novel conceptual framework for organizing and studying the conceptual provenance of new descriptors in the Medical Subject Headings (MeSH) hierarchy. In particular, we defined the notion of the previous host (PH), as a descriptor covering the main topic of a new descriptor prior to its introduction, and suggested an approach to identify such PHs for a new descriptor. Then, based on the current relationship of the descriptor with its PHs, we also defined a set of provenance types and codes. In addition, we developed an open-source computational process for the automated extraction, annotation and analysis of new descriptors, using the raw files of different versions of MeSH as distributed by the US National Library of Medicine (NLM). Employing this approach, we investigated the conceptual provenance of new MeSH descriptors for the period 2006-2020.

Fig. 14 Frequency of type Subdivision (.2) and Detachment (.5) in new descriptors introduced during the last fifteen years, per provenance category. The asterisk (*) denotes that only descriptors with a single type are considered, excluding descriptors combining more than one type.

The results reveal that about 115 new descriptors for emerging concepts (category 4) are introduced each year quite steadily. These descriptors represent about 25% of all new descriptors of the study period, indicating that the majority of the new descriptors cover non-emerging domain concepts that are not really new for the MeSH thesaurus. Less than half of these non-emerging concepts were explicitly covered in MeSH prior to the introduction of dedicated descriptors for them (category 1 and category 2). The majority of non-emerging concepts, though not explicitly included in older versions of MeSH, used to be indexed under specific older descriptors (PHs) that covered their meaning implicitly (category 3).

This suggests that the main force consistently driving the extension of MeSH during this period is the need to explicitly cover more conceptual entities, namely a stable annual number of new emerging concepts (category 4) and a similar or greater number of new PI concepts (category 3) that used to be implicitly covered by MeSH.
The need to introduce descriptors for reorganizing concepts that are already explicitly covered (category 1 and category 2) appears to be auxiliary, with low numbers of new descriptors for most years. However, in certain years, we also observed a surge in the promotion of existing SCRs into descriptors (category 3), particularly for chemicals. Such surges in category 2 and category 3 seem to be related to internal MeSH projects and resource allocation in NLM.

In addition, the results on conceptual provenance types reveal that more than 70% of all non-emerging new descriptors (categories 1, 2 and 3) become subtopics of their PHs’ topics. That is, they remain under the coverage of the latter, usually as children of them (.2, Subdivision) and less often as more distant descendants (.3, Submersion). However, the number of new descriptors that are detached from their PHs (.5, Detachment) is also considerable, particularly for implicit PHs (category 3). These observations suggest that the extension of MeSH primarily serves the need to enrich the MeSH thesaurus with more detailed subtopics, supporting the annotation of articles with new fine-grained topic labels. Nevertheless, it appears that a notable number of new descriptors also serve to rid the PHs of some implicitly covered topics, rendering the PHs more precise as well.

This grouping can be particularly useful for improving semantic indexing models for new descriptors. For example, the articles annotated with the PHs of a new descriptor can be a source of weakly-labeled data for topical annotations. In addition, the provenance types can provide indications of the prevalence of such weak labels. In the case of Detachment, for example, we may expect that only a small part of the articles annotated with the PHs will be relevant to the new descriptor. In the case of new descriptors for new emerging concepts (category 4), on the other hand, Zero-Shot Learning approaches may be more appropriate, as no PHs are available as a source of weak labels.

Although our findings primarily provide insight for researchers working with MeSH, we also believe that the proposed viewpoint is of more general interest. In particular, it can be used to analyse the extension dynamics of other similar topic hierarchies. The annotations of conceptual provenance produced by the proposed method capture the hierarchical relationship of a new topic with the topics that were previously used in its place. Such information can be used to characterise and group the topics, facilitating the process of maintaining topic hierarchies.

Our future plans include the investigation of further uses of the provenance information provided by the proposed method. In particular, we are examining whether new descriptors with the same provenance category, types or codes present similarities that can be exploited in the semantic indexing of documents with newly introduced labels. Additionally, we are looking into the use of the provenance information for predicting ontological expansion. Last but not least, we would like to explore the use of the conceptual framework and computational procedures for tasks related to the maintenance of the hierarchy itself, such as identifying special cases and inconsistencies in textual descriptive fields.
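As an illustration of the subtopic types and the weak-labeling idea discussed above, the following is a minimal sketch and not the open-source process developed in this work. It assumes that each descriptor is represented by a single dot-separated MeSH tree number (in practice a descriptor may have several), and that article annotations are available as a plain mapping from article ids to sets of descriptor ids. The function names `relation_to_host` and `weak_label_candidates`, and all identifiers in the usage example, are hypothetical.

```python
from typing import Dict, List, Optional, Set


def relation_to_host(new_tree: str, host_tree: str) -> Optional[str]:
    """Relate a new descriptor to a previous host (PH) via tree numbers.

    Returns "Subdivision" when the new descriptor is a child of the PH,
    "Submersion" when it is a more distant descendant, and None otherwise.
    Detachment and any other type are NOT decided here (assumption: they
    need information beyond a single pair of tree numbers).
    """
    prefix = host_tree + "."
    if not new_tree.startswith(prefix):
        return None
    # Number of hierarchy levels between the PH and the new descriptor.
    extra_levels = new_tree[len(prefix):].count(".") + 1
    return "Subdivision" if extra_levels == 1 else "Submersion"


def weak_label_candidates(
    article_labels: Dict[str, Set[str]], previous_hosts: Set[str]
) -> List[str]:
    """Return the ids of articles annotated with at least one PH.

    An empty PH set (as for emerging concepts) yields no candidates,
    so a different strategy, such as zero-shot learning, is needed.
    """
    if not previous_hosts:
        return []
    return sorted(
        article_id
        for article_id, labels in article_labels.items()
        if labels & previous_hosts
    )


if __name__ == "__main__":
    # Made-up tree numbers and ids, for illustration only.
    print(relation_to_host("A01.100.200", "A01.100"))
    articles = {"a1": {"PH_A"}, "a2": {"PH_B", "PH_C"}, "a3": {"PH_C"}}
    print(weak_label_candidates(articles, {"PH_C"}))
```

The candidates returned are only weak labels: how many of them are actually relevant to the new descriptor is expected to depend on the provenance type, and to be low in the case of Detachment.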

Acknowledgements

This research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 697). We are grateful to James Mork and Dan-Sung Cho from the National Library of Medicine (NLM) for kindly providing valuable feedback on this work.
