Multilingual enrichment of disease biomedical ontologies
Léo Bouscarrat, Antoine Bonnefoy, Cécile Capponi, Carlos Ramisch
EURA NOVA, Marseille, France
Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
{leo.bouscarrat, antoine.bonnefoy}@euranova.eu
{leo.bouscarrat, cecile.capponi, carlos.ramisch}@lis-lab.fr

Abstract
Translating biomedical ontologies is an important challenge, but doing it manually requires much time and money. We study the possibility of using open-source knowledge bases to translate biomedical ontologies. We focus on two aspects: coverage and quality. We look at the coverage of two biomedical ontologies focusing on diseases with respect to Wikidata for 9 European languages (Czech, Dutch, English, French, German, Italian, Polish, Portuguese and Spanish) for both ontologies, plus Arabic, Chinese and Russian for the second one. We first use direct links between Wikidata and the studied ontologies and then use second-order links by going through other intermediate ontologies. We then compare the quality of the translations obtained thanks to Wikidata with a commercial machine translation tool, here Google Cloud Translation.
Keywords: biomedical, ontology, translation, wikidata
1. Introduction
Biomedical ontologies, like Orphanet (INSERM, 1999b), play an important role in many downstream tasks (Andronis et al., 2011; Li et al., 2015; Phan et al., 2017), especially in natural language processing (Maldonado et al., 2017; Nayel and Shashrekha, 2019). Today, the vast majority of these ontologies are either only available in English or have restrictive licenses that reduce the scope of their usage. There is nowadays a real focus on reducing the prominence of English, thus on working on less-resourced languages. To do so, there is a need for resources in other languages, but the creation of such resources costs time and money. At the same time, the Internet hosts incredible projects aiming to gather a maximum of knowledge in a maximum of languages. One of them is the collaborative encyclopedia Wikipedia, opened in 2001, which currently exists in more than 300 languages. As it contains mainly plain text, it is hard to use as a resource as is. However, several knowledge bases have been built from it: DBpedia (Lehmann et al., 2015) and Wikidata (Vrandečić and Krötzsch, 2014). The main difference between these two knowledge graphs is the update process: while Wikidata is manually updated by users, DBpedia extracts its information directly from Wikipedia. Compared to biomedical ontologies, they are structured using less expressive formalisms and they gather information about a larger domain. They are open-source, thus can be used for any downstream task. Each entity has a preferred label, and sometimes also alternative labels that can be used as synonyms. For example, the entity Q574227 in Wikidata has a preferred label in English along with the English alternative labels Albright Hereditary Osteodystrophy-Like Syndrome and Brachydactyly Mental Retardation Syndrome. Moreover, entities in these two knowledge bases also have translations in several languages. For example, the entity Q574227 in Wikidata has a preferred label in English and the preferred label Zespół delecji 2q37 in Polish. They also feature some links between their own entities and entities in external biomedical ontologies. For example, the entity Q574227 in Wikidata has a property Orphanet ID (P1550) whose value links it to the corresponding Orphanet entry.

By using both kinds of resources, biomedical ontologies and open-source knowledge bases, we could partially enrich biomedical ontologies in languages other than English. As links between the entities of these resources already exist, we expect good quality. To further enrich them we could even look at second-order links, since many biomedical ontologies also contain links to other ontologies. The goal of this work is twofold:

• to study the coverage of such open-source collaborative knowledge graphs compared to biomedical ontologies,

• to study the quality of the translations using first- and second-order links, and to compare this quality with the quality obtained by machine translation tools.

This paper is part of a long-term project whose goal is to work on multilingual disease extraction from news with strategies based on dictionary expansion. Consequently, we need a multilingual vocabulary of diseases normalized with respect to an ontology. Thus, we focus on one kind of biomedical ontology, that is, ontologies about diseases.
2. Resources and Related Work
There has already been some work trying to use open-source knowledge bases to translate biomedical ontologies. Bretschneider et al. (2014) obtain a German-English medical dictionary using DBpedia. The goal is to perform information extraction from a German biomedical corpus. They could not directly use the RadLex ontology (Langlotz, 2006) as it is only available in English. So, they first extract term candidates from their German corpus. Then, they try to match the candidates with the pairs in their German-English dictionary. If a candidate is in the dictionary, they use the translation to match with the RadLex ontology. Finally, this term candidate, along with the match in the RadLex ontology, is processed by a human to validate the matching.

Figure 1: Example of first-order link (left) and second-order link (right)

Alba et al. (2017) create a language-independent method to maintain up-to-date ontologies by extracting new instances from text. This method is based on a human-in-the-loop who helps tune scores and thresholds for the extraction. Their method requires some "contexts" to start finding new entities to add to the ontology. To bootstrap the contexts, they can either ask a human to annotate some data or use an oracle made of the dictionary extracted from DBpedia and Wikidata using word matching on the corpus. They then look for good candidates, i.e., a set of words surrounding an item, by looking for elements in contexts similar to the ones found using the bootstrapping. Then, a human-in-the-loop validates the newly found entities, adding them to the dictionary if they are correct, or down-voting the context if they are not relevant entities.

Hailu et al. (2014) work on the translation of the Gene Ontology from English to German and compare three different approaches: DBpedia, the Google Translate API without context, and the Google Translate API with context.
To find the terms in DBpedia they use keyword-based search. After a human evaluation, they find that translations obtained with DBpedia have the lowest coverage (only 25%) and quality compared to those obtained with the Google Translate API. However, to compare the quality of the different methods they only use the translation of 75 terms obtained with DBpedia against 1,000 with the Google Translate API. They also note that synonyms could be a useful tool for machine translation and that using keyword-based exact-match queries to match the two sources could explain the low coverage.

Silva et al. (2015) compare three methods to translate SNOMED CT from English to Portuguese: DBpedia, ICD-9 and Google Translate. To verify the quality of the different approaches they use the CPARA ontology, which has been hand-mapped to SNOMED CT. It is composed of 191 terms and focused on allergies and adverse reactions. They report a coverage of 10% with ICD-9, 37% with DBpedia and 100% with Google Translate. To compare the quality of their translations they use the Jaro similarity (Jaro, 1989).

We elaborate on these ideas by adding some elements. First of all, compared to Hailu et al. (2014) and Silva et al. (2015), we use already existing properties to perform the matching between the biomedical ontology and the knowledge graph, which should improve the quality with regard to the previous works. We also go further than these first-order links and explore the possibility of using second-order links to improve the coverage of the mappings between the sources. Compared to the same works, we also present a more complete study: Hailu et al. (2014) only evaluate 75 terms and Silva et al. (2015) 191 terms, whereas we compare the coverage and quality of the entire biomedical ontology containing 10,444 terms.
Furthermore, as we want to use the result of this work for biomedical entity recognition, synonyms of entities are really important for recall and also for normalisation; thus we also quantify the difference in the number of synonyms between the original biomedical ontology and those found with Wikidata.

In this work, as we focus on diseases, we use a free dataset extracted from Orphanet (INSERM, 1999b) to perform the evaluation. Orphanet is a resource built to gather and improve knowledge about rare diseases. Through Orphadata (INSERM, 1999a), free datasets of aggregated data are updated monthly. One of them is about rare diseases, including cross-references to other ontologies. The Orphadata dataset contains the translation of 10,444 entities for English, French, German, Spanish, Dutch, Italian and Portuguese, 10,418 entities in Polish and 9,323 in Czech. All the translations have been validated by experts, thus can be used as a gold standard for multilingual ontology enrichment. One issue of this dataset is that rare diseases are, by definition, not well known. Therefore, one may expect a lower coverage than for a less focused dataset; thus we propose to also measure the coverage of another dataset, Disease Ontology (Schriml et al., 2019). However, we cannot use it to evaluate the translation task as it does not contain translations.

As an external knowledge base, we use Wikidata. It has many links to external ontologies, especially links to biomedical ontologies such as wdt:P1550 for Orphanet, wdt:P699 for Disease Ontology, and wdt:P492 for the Online Mendelian Inheritance in Man (OMIM). It is also important to note that, of the 9 languages we studied, only the Czech Wikipedia has less than 1,000,000 articles. This information can be used as a proxy for the completeness of the information in each language on Wikidata. We prefer it over DBpedia as we find it easier to use, especially to find the properties.

As a machine translation tool, we use Google Cloud Translation.
It is a paid service offered by Google Cloud.

3. Methods and Experiments

In this section, we first define the notations used in this paper, then we describe how we extract the first- and second-order links from our sources. Afterwards, we describe how we perform machine translation. The evaluation metrics are subsequently explained and finally we describe our evaluation protocol.
We define:

• e_i^S is an entity in the source knowledge base S, S ∈ {O, W, B}, where O is Orphanet, W is Wikidata and B are all the other external biomedical ontologies used. An entity is either a concept in an ontology or in a knowledge graph.

• E_S = {e_i^S}_{i=1...|E_S|} is the set of all the entities in the source S.

• E = E_O ∪ E_W ∪ E_B is the set of all the entities in all the sources.

• L_l(e) is the preferred label of the entity e in the language l, or ∅ if there is no label in this language.

• 𝓛_l(e) is the set of all the possible labels of the entity e in the language l, or ∅ if there is no label in this language. Furthermore, L_l(e) ∈ 𝓛_l(e).

• T is a set of links, such that t ∈ T with t = (e_i^s, e_j^s'), s ≠ s'.

• G = (E, T) is an undirected graph.

• V(e_i) = {e_j ∈ E | ∃t ∈ T, t = (e_i, e_j)} is the set of all the neighbours of the entity e_i.

• W(e) = {v ∈ V(e) | v ∈ E_W} is the set of all the neighbours of the entity e that are in Wikidata.

• MT({s_1, ..., s_n}, l) is a function that returns the labels {s_1, ..., s_n} translated from English to the language l thanks to Google Cloud Translation.

The first step of our method consists in gathering all the information about the sources. To obtain the gold translations, we use Orphadata. We collected all the JSON files from their website on January 15, 2020. We extract the OrphaNumber, the Name, the SynonymList and the ExternalReferenceList of each element in the files.

For Wikidata we use the SPARQL endpoint. We query all the entities having an Orphanet ID property (wdt:P1550), and, for these entities, we obtain all their preferred labels (rdfs:label) and synonyms (skos:altLabel) in the 9 European languages included in Orphanet. The base aggregator of the synonyms uses a comma to separate them.
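For illustration, this extraction step can be sketched as a SPARQL query against the Wikidata endpoint. The property wdt:P1550 and the rdfs:label / skos:altLabel predicates are the ones named above; the function name, the example language "fr" and the "|" separator are our own illustrative choices:

```python
def build_orphanet_query(lang: str, separator: str = "|") -> str:
    """Build a SPARQL query returning, for each Wikidata entity with an
    Orphanet ID (wdt:P1550), its preferred label and its synonyms in one
    language. GROUP_CONCAT gets a non-comma separator because a comma
    can occur inside a label (e.g. "49,XXXYY syndrome")."""
    return f'''
SELECT ?item ?orpha ?label
       (GROUP_CONCAT(DISTINCT ?alt; separator="{separator}") AS ?synonyms)
WHERE {{
  ?item wdt:P1550 ?orpha .
  OPTIONAL {{ ?item rdfs:label ?label . FILTER(LANG(?label) = "{lang}") }}
  OPTIONAL {{ ?item skos:altLabel ?alt . FILTER(LANG(?alt) = "{lang}") }}
}}
GROUP BY ?item ?orpha ?label
'''

# The query would then be sent to https://query.wikidata.org/sparql,
# requesting JSON results (HTTP plumbing omitted here).
query = build_orphanet_query("fr")
```

The non-comma separator matters because the default comma aggregation makes synonym lists ambiguous, as discussed below.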
In our case, this is error-prone because the comma can also be part of the label; for example, one of the alternative labels of the entity Q55786560 is 49,XXXYY syndrome. We therefore concatenate the synonyms with another symbol. Thanks to the property which gives the OrphaNumber of the related entity in Orphanet, we can create links t = (e^O, e^W) between an entity e_i^W in Wikidata and an entity e_j^O in Orphanet. The queries were made on April 1, 2020.

The mapping is then trivial, as we have the OrphaNumber in the two sources. On the left of Figure 1 we can see that the entity Q1077505 in Wikidata has a property Orphanet ID linking it to an Orphanet entity, thus we can create the link t = (Q1077505_W, e^O). Nonetheless, the mapping is not always unary, because several Wikidata entities can be linked to the same Orphanet entity. Formally, the set of Orphanet entities with at least one first-order link is:

E_F = {e ∈ E_O | ∃w ∈ E_W, (e, w) ∈ T}

Orphanet provides some external references to auxiliary ontologies. We add these references to our graph: t = (e^O, e^B) ∈ T. Even if there are already first-order links between Orphanet and Wikidata, we cannot ensure that all the entities are linked. To improve the coverage of translations, we can use second-order links, creating an indirect link when entities from Wikidata and Orphanet are linked to the same entity in a third external source B. For example, on the right of Figure 1, we extract the link between the entity Q1495005 of Wikidata and the entity 121270 of OMIM. We also extract from Orphanet that the entity 1551 of Orphanet is linked to the same entity of OMIM. Therefore, as a second-order relation, the entity Q1495005 of Wikidata and the entity 1551 of Orphanet are linked.

The objective is to find links t' = (e^W, e^B) where ∃v ∈ V(e^B) with v ∈ E_O. Consequently, we are looking for links between entities from Wikidata and the external biomedical ontologies, whenever the entity in the external biomedical ontology already has a link with an entity in Orphanet. For that purpose, we extract all the links between Wikidata and the external biomedical ontologies in the same fashion as from Orphanet, using the appropriate Wikidata properties. (The SPARQL endpoint https://query.wikidata.org/sparql can be queried through the interface at https://query.wikidata.org/; we made a package to extract entities from Wikidata: https://github.com/euranova/wikidata_property_extraction.) In the previous example, we create the links (Q1495005_W, OMIM:121270_B) ∈ T and (1551_O, OMIM:121270_B) ∈ T.

We can now map Wikidata and Orphanet using second-order links. The set of Orphanet entities with at least one second-order link is denoted as:

C = {e ∈ E_O | ∃(w, b) ∈ E_W × E_B, (e, b) ∈ T, (w, b) ∈ T}

We also define the set of all the second-order linked Wikidata entities of a specific Orphanet entity:

C(e_O) = {w ∈ E_W | ∃b ∈ E_B, (e_O, b) ∈ T, (w, b) ∈ T}

We use Google Cloud Translation as a machine translation tool to translate the labels of the ontology from English to a target language. As we want to have the same entities in the test set as for Wikidata, for each language we only translate the Orphanet entities which have at least one first-order link to an entity in Wikidata with a label in the target language. So for an entity e and a language l, the output of Google Cloud Translation is MT(𝓛_en(e), l).

In this section, we define the different evaluation metrics that are used to evaluate the efficiency of the method.
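Building the second-order mapping amounts to a join on the shared external identifiers. A minimal sketch, with our own function and variable names, using the Figure 1 example as data:

```python
from collections import defaultdict

def second_order_links(wiki_to_ext, orpha_to_ext):
    """Join Wikidata and Orphanet entities that both point to the same
    entity in a third biomedical ontology (e.g. OMIM). Inputs are lists
    of (entity, external_id) pairs; output maps each Orphanet entity to
    its set of second-order Wikidata neighbours C(e_O)."""
    ext_to_wiki = defaultdict(set)
    for w, b in wiki_to_ext:          # links (e_W, e_B)
        ext_to_wiki[b].add(w)
    c = defaultdict(set)
    for o, b in orpha_to_ext:         # links (e_O, e_B)
        for w in ext_to_wiki[b]:
            c[o].add(w)
    return {o: ws for o, ws in c.items() if ws}

# Figure 1 example: Q1495005 and Orphanet entity 1551 both link to OMIM:121270.
links = second_order_links([("Q1495005", "OMIM:121270")],
                           [("1551", "OMIM:121270")])
print(links)  # {'1551': {'Q1495005'}}
```

In the real pipeline the pairs would come from the Wikidata SPARQL extraction and from Orphadata's ExternalReferenceList.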
To estimate the coverage of Wikidata on a biomedical ontology we use the following metric:

Coverage(E_1, E_2, l) = |{e ∈ E_1 | 𝓛_l(e) ≠ ∅}| / |{e' ∈ E_2 | 𝓛_l(e') ≠ ∅}|

where E_1 and E_2 are sets of entities.

In order to evaluate the quality of the translations, we follow Silva et al. (2015) in choosing the Jaro similarity, which is a type of edit distance. We made this choice as we are looking at entities. Whereas other measures such as BLEU (Papineni et al., 2002) are widely used for translation tasks, they have been designed for full sentences instead of relatively short ontology labels. The Jaro similarity is defined as:

J(s, s') = 1/3 (m/|s| + m/|s'| + (m − t)/m)

with s and s' two strings, |s| the length of s, t half the number of transpositions, and m the number of matching characters. Two characters from s and s' are matching if they are the same and not further apart than ⌊max(|s|, |s'|)/2⌋ − 1. The Jaro similarity ranges between 0 and 1, where the score is 1 when the two strings are the same.

However, since one Orphanet entity may have several neighbour Wikidata entities, we cannot use the Jaro similarity directly. We choose to use the max, considering the quality of the closest entity:

J_max(s, [s_1, ..., s_n]) = max_{s' ∈ [s_1, ..., s_n]} J(s, s')

To assess the quality of the translations, we create 4 different measures with different goals. For each entity in each language, there is a preferred label L_l(e) and a set of all the possible labels 𝓛_l(e). All of the metrics range between 0 and 1, the higher the better.
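As a reference point, the Jaro similarity and J_max defined above can be implemented directly. This is a sketch of the textbook algorithm (tested implementations exist, e.g. in the jellyfish Python library):

```python
def jaro(s1, s2):
    """Jaro similarity, following the definition in the text."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Two characters match if equal and within this sliding window.
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters appearing in a different order;
    # the formula uses half this number as t.
    k = trans = 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                trans += 1
            k += 1
    half_trans = trans / 2
    return (m / len(s1) + m / len(s2) + (m - half_trans) / m) / 3

def jaro_max(s, candidates):
    """J_max: best Jaro similarity of s against a list of labels."""
    return max((jaro(s, c) for c in candidates), default=0.0)

print(round(jaro("martha", "marhta"), 3))  # 0.944
```

The classic "martha"/"marhta" pair has 6 matches and one transposition pair, giving (1 + 1 + 5/6)/3 ≈ 0.944.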
M_pl(e, [e_1, ..., e_n], l) = J_max(L_l(e), [L_l(e_1), ..., L_l(e_n)])

M_bl(e, [e_1, ..., e_n], l) = J_max(L_l(e), ∪_{i=1}^{n} 𝓛_l(e_i))

M_mbl(e, [e_1, ..., e_n], l) = mean_{s ∈ 𝓛_l(e)} J_max(s, ∪_{i=1}^{n} 𝓛_l(e_i))

M_Mbl(e, [e_1, ..., e_n], l) = max_{s ∈ 𝓛_l(e)} J_max(s, ∪_{i=1}^{n} 𝓛_l(e_i))

M_pl, for principal label, compares the preferred labels from Orphanet and Wikidata. This number is expected to be high, but as there is no reason that Wikidata and Orphanet use the same preferred label, we do not expect it to be the highest score. Nonetheless, as Wikidata is a collaborative platform, a score of 1 on a high number of entities in a different language could also indicate that the translations come from Orphanet.

M_bl, for best label, compares the preferred label from Orphanet against all the labels in Wikidata. The goal here is to verify that the preferred label of Orphanet is available in Wikidata.

M_mbl, for mean best label, takes the average of the similarity of each label in Orphanet against all the labels in Wikidata. This score can be seen as a completeness score: it evaluates the ability to find all the labels of Orphanet in Wikidata.

M_Mbl, for max best label, takes the maximum of the similarity of one label in Orphanet against all the labels in Wikidata. The question behind this metric is: do we have at least one label in common between Orphanet and Wikidata? A low score here could mean that the relation is erroneous. We expect a score close to 1 here.

We use the same measures for the machine-translated dataset; however, the difference between M_pl and M_bl is expected to be smaller, as we are sure that the preferred label of the translated dataset is the translation of the preferred label from Orphanet.

To obtain a score for these measures on the entire dataset, we compute the average of the scores over all Orphanet entities.
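The four measures share one structure: a best-match similarity against either the preferred labels or the union of all labels. The sketch below shows that structure; for self-containment we substitute difflib's ratio for the Jaro similarity as the base string metric, and the labels are toy data, not actual Orphanet content:

```python
from difflib import SequenceMatcher
from statistics import mean

def sim(s1, s2):
    """Stand-in string similarity (the paper uses the Jaro similarity)."""
    return SequenceMatcher(None, s1, s2).ratio()

def sim_max(s, candidates):
    return max((sim(s, c) for c in candidates), default=0.0)

def metrics(orpha_pref, orpha_labels, wiki_entities):
    """wiki_entities: one (preferred_label, set_of_all_labels) pair per
    linked Wikidata entity of the Orphanet entity."""
    all_wiki = set().union(*(labels for _, labels in wiki_entities))
    wiki_prefs = [pref for pref, _ in wiki_entities]
    return {
        "M_pl": sim_max(orpha_pref, wiki_prefs),   # preferred vs preferred
        "M_bl": sim_max(orpha_pref, all_wiki),     # preferred vs all labels
        "M_mbl": mean(sim_max(s, all_wiki) for s in orpha_labels),
        "M_Mbl": max(sim_max(s, all_wiki) for s in orpha_labels),
    }

# Toy entity with one linked Wikidata entity.
scores = metrics("alpha syndrome",
                 {"alpha syndrome", "alpha disease"},
                 [("alpha syndrome", {"alpha syndrome", "syndrome alpha"})])
print(scores["M_bl"])  # 1.0: the preferred label exists among the labels
```

Averaging each metric over all Orphanet entities yields the per-language scores reported in the results.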
The first step of our experiments is the extraction of first-order and second-order links from Wikidata and Orphanet as explained in 3.2. Once these links are available, we study them, starting with their coverage.

Table 1: Scores of the different methods with the different metrics (M_pl, M_bl, M_mbl, M_Mbl) as a function of the language. 1st W represents the quality of the first-order links with Wikidata, 1+2nd W the first- and second-order links, and GCT the translations obtained by Google Cloud Translation.

To evaluate the coverage of Wikidata for each language, we compute
Coverage(E_F, E_O, l) for the 9 languages. We also compute Coverage(C, E_O, l) for second-order links. As Orphanet is focused on rare diseases, we do not expect a high coverage in Wikidata. To verify this hypothesis, we perform the same evaluation on the Disease Ontology, which does not focus on rare diseases.

Then, we study the quality of the different methods. We apply the 4 quality metrics defined in 3.4.3. for each language to each method:

• First-order links: mean_{e_O ∈ E_F} M(e_O, W(e_O), l)

• Second-order links: mean_{e_O ∈ C} M(e_O, C(e_O), l)

• Machine translation: mean_{e_O ∈ E_F} M(e_O, MT(𝓛_en(e_O), l), l)

Finally, we look at the number of labels we can obtain from each source:

• Orphanet: mean_{e ∈ E_F} |𝓛_l(e)|

• Wikidata: mean_{e ∈ E_F} Σ_{w ∈ W(e)} |𝓛_l(w)|

• GCT: mean_{e ∈ E_F} |MT(𝓛_en(e), l)|

The number of synonyms of an entity e in a language l is |𝓛_l(e)|, after removing duplicates. We average this over all the entities which take part in a first-order link between Wikidata and Orphanet.
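The coverage computation itself reduces to counting entities with at least one label in the target language. A minimal sketch, with a hypothetical data layout of our own (labels keyed by entity and language):

```python
def coverage(e1, e2, labels, lang):
    """Coverage(E_1, E_2, l): entities of E_1 with at least one label in
    language l, over entities of E_2 with at least one label in l."""
    has_label = lambda e: bool(labels.get((e, lang)))
    return sum(map(has_label, e1)) / sum(map(has_label, e2))

# Toy example: 2 of the 4 Orphanet entities have a first-order link,
# and every entity has a German label.
labels = {
    ("o1", "de"): {"Krankheit A"}, ("o2", "de"): {"Krankheit B"},
    ("o3", "de"): {"Krankheit C"}, ("o4", "de"): {"Krankheit D"},
}
e_first_order = ["o1", "o2"]           # E_F
e_orphanet = ["o1", "o2", "o3", "o4"]  # E_O
print(coverage(e_first_order, e_orphanet, labels, "de"))  # 0.5
```

The same function covers both Coverage(E_F, E_O, l) and Coverage(C, E_O, l) by swapping the first argument.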
4. Results
In this part, we first present the results on the coverage of Wikidata on Orphanet, then we present the quality of the translations. Afterwards, we show results about the number of synonyms in both sources and finally we discuss these results. The results can be reproduced with this code: https://github.com/euranova/orphanet_translation
First, we evaluate the coverage for each language, i.e., the percentage of entities in Orphanet which have at least one translation in Wikidata. The Orphadata dataset contains translations in English, French, German, Spanish, Dutch, Italian, Portuguese, Polish and Czech. For Wikidata, the results depend on the language, as not all the entities have translations in every language.

Language     Orphanet   Wikidata (%)
English      10,444     8,870 (84.9%)
French       10,444     5,038 (48.2%)
German       10,444     1,946 (18.6%)
Spanish      10,444     1,565 (15.0%)
Polish       10,171     1,329 (13.1%)
Italian      10,444     1,175 (11.3%)
Portuguese   10,444       921 (8.8%)
Dutch        10,444       888 (8.5%)
Czech         9,323       452 (4.8%)

Table 2: Number of translated entities in Orphanet and number of Orphanet entities having at least one translation in Wikidata with first-order links. The percentage of coverage is shown in parentheses.

As we can see in Table 2, coverage depends on the language. The coverage of English gives us the number of entities from Orphanet having at least one link with Wikidata: 84.9% of the entities are already linked to at least one entity in Wikidata. It means that the OrphaNumber property is widely used. We can also note that the French Wikidata seems to carry more information about rare diseases than the German one. Indeed, the French and German Wikipedias have approximately the same global size, but the German Wikidata contains much less information about rare diseases.
As of February 6, 2020: https://meta.wikimedia.org/wiki/List_of_Wikipedias

Language     Cov 1st (%)      Cov 1st+2nd (%)
English      8,870 (84.9%)    9,317 (89.2%)
French       5,038 (48.2%)    7,922 (75.9%)
German       1,946 (18.6%)    6,350 (60.8%)
Spanish      1,565 (15.0%)    6,122 (58.6%)
Polish       1,329 (13.1%)    5,797 (57.0%)
Italian      1,175 (11.3%)    5,715 (54.7%)
Portuguese     921 (8.8%)     5,016 (48.0%)
Dutch          888 (8.5%)     5,081 (48.6%)
Czech          452 (4.8%)     3,180 (34.1%)

Table 3: Coverage in terms of number and percentage of entities in Wikidata linked to Orphanet using first-order links (Cov 1st) and first- plus second-order links (Cov 1st+2nd).

The next question is how many new links we can obtain by gathering second-order links. Table 3 shows that the second-order links improve the coverage. For English, the improvement is small. However, for all the other languages, second-order links really help to increase the coverage; they seem to be particularly helpful for average-resourced languages. We have used ICD-10, Medical Subject Headings (MeSH), Online Mendelian Inheritance in Man (OMIM) and the Unified Medical Language System (UMLS) as auxiliary ontologies.
Even if the coverage for Orphanet in English is already high, Orphanet is focused on rare diseases, which is a really specific domain. This specificity could have an impact on the coverage, as Wikidata is not made by experts. To verify whether the specificity of this ontology has an influence on coverage, we have also looked at another biomedical ontology on diseases, Disease Ontology. It is also about diseases but does not focus on rare diseases. Thus, this difference in generality is expected to have an impact on the coverage. The Disease Ontology contains 12,171 concepts. We plan to use it for future work on other languages: Arabic, Russian and Chinese. These three languages also have Wikipedias with more than 1,000,000 articles on which we could rely.

As expected, this less specialized ontology seems to have better coverage than Orphanet. Table 4 shows that, even if the coverage for all the languages is better than for Orphanet, the difference is not the same for all the languages. In particular, Spanish has a coverage in Disease Ontology superior to that in Orphadata by more than 11%. We do not have an explanation for these differences. We do not compute the second-order links for Disease Ontology because 97.2% of the Disease Ontology entities are already linked using first-order links.
The next question concerns the quality of the translations obtained. We can expect high-quality translations from Google Cloud Translation, but to what extent? We also want to compare the quality of translations obtained from Wikidata using first-order and second-order links. The ontology we use is heavily linked directly to Wikidata, but this is not the case for all the ontologies. For ontologies with lower first-order coverage, one could expect a higher increase from the second-order coverage, as observed in Table 3.

Language     Wikidata (%)
English      11,833 (97.2%)
French        7,156 (58.8%)
Spanish       3,178 (26.1%)
Arabic        2,507 (20.6%)
German        2,500 (20.5%)
Italian       2,098 (17.2%)
Polish        1,869 (15.3%)
Chinese       1,789 (14.7%)
Portuguese    1,748 (14.3%)
Russian       1,706 (14.0%)
Dutch         1,650 (13.6%)
Czech         1,001 (8.2%)

Table 4: Number of Disease Ontology entities having at least one translation in Wikidata with first-order links. The percentage of coverage is shown in parentheses.

The first line of Table 1 shows the matching between the English labels of the entities of Orphanet and Wikidata. M_bl and M_Mbl are interesting here as they can be used as an indicator of a good match. A score of 1 means that one of the labels of Wikidata is the same as the preferred label from Orphanet (M_bl) or one of the labels from Orphanet (M_Mbl). Considering that the scores are close to 1, the matching seems to be good.

In Table 1 we can see that Google Cloud Translation gives the best translations when evaluated with the Jaro similarity. Nonetheless, there are still some small dissimilarities depending on the language: it seems to work well for Spanish and less well for German and Polish. We can also note that for Portuguese, while the preferred label is well translated (M_pl, M_bl), this is less the case for the synonyms (M_mbl).

Then, the first-order links from Wikidata also give satisfactory results, with dissimilarities between the languages.
In particular, first-order links seem to work better than average for French. Compared to second-order links, first-order links are always better and the decrease in quality between the two is substantial. Some noise is probably added by the intermediate ontologies.

Hailu et al. (2014) suggest that synonyms play an important role in translation. Therefore, in addition to high-quality translation, we are also interested in a high number of synonyms. In our case, the synonyms are the different labels available for each language for Orphanet and Wikidata, and the translations of the English labels for Google Cloud Translation. We want to evaluate the richness of each method in terms of number of synonyms. For a fair comparison, for each language we only work on the subset where the entities in Wikidata have at least one label in the evaluated language.

Lang   Orphanet   Wiki 1st   Wiki 1+2nd   GCT
EN     2.3        5.8        166.77       2.3
FR     2.36       1.49        10.59       2.39
DE     2.56       1.84         5.93       2.65
ES     2.26       2.61         9.50       2.39
PL     2.54       2.01         6.88       2.65
IT     2.36       1.85         3.50       2.5
PT     1.62       1.60         2.40       2.41
NL     2.6        1.74         3.74       2.48
CS     2.2        1.74         1.71       2.13

Table 5: Average number of labels in the different sources as a function of the language. For Orphanet we only use the subset of entities linked to entities in Wikidata with at least one label in the studied language. For Google Cloud Translation, the labels are the translations of the English labels of Orphanet.

Table 5 shows that Orphanet generally has more synonyms than Wikidata when using first-order links only. The fact that GCT obtains more synonyms than Orphanet for most languages (except Dutch and Czech) means that, on the studied subset, Orphanet has more labels in English than in the other languages. English is the exception: for this language Wikidata is more diverse than Orphanet.

When using first- and second-order links, the number of synonyms is much higher, especially for English. This is related to the fact that second-order links add many new relations.
These new relations always have labels in English but not always in other languages.
5. Discussion
Regarding coverage, in terms of entities only, the coverage of first-order links is already high for Orphanet and Disease Ontology, respectively 84.9% and 97.2% (for English as, in our case, all the entities have English labels). The issue comes from the labels: even if Wikidata is multilingual, in our study we see that the information is mainly in English and French, while for the other studied languages the results are substantially worse. All the entities with a link have labels in English, more than half have labels in French, and then, for German, only around 20% of the 8,870 linked entities in Wikidata have at least one label in that language. The languages we study are among the most used languages in Wikipedia. Thus, there is already a substantial number of entities whose labels could be translated from English into another of these languages. As Wikidata is a collaborative project, this number should only increase over time. Second-order links help a lot for languages other than English.

Regarding quality, Google Cloud Translation is the best method. Compared to the results obtained by Silva et al. (2015) on the translation of a subpart of MeSH into Portuguese, the quality of the label translations seems to have greatly improved. The translations obtained through first-order links are not so distant from Google Cloud Translation. However, the quality of the translations obtained through second-order links differs substantially from that of the first-order links. We can also expect Google Cloud Translation to have an advantage, as Orphanet is primarily maintained in English and French and then translated by experts into other languages.
Even if Google Cloud Translation is not free, translating the entirety of the English labels of Orphanet would only cost around $16 with the pricing as of February 6, 2020.

For the synonyms, as Orphanet seems to have more labels in English than in the other languages, translating all the labels from English into the different languages yields more synonyms than Orphanet provides in those languages. Moreover, Wikidata is poorer in terms of synonyms than Orphanet, except for English. This is interesting: since Google Cloud Translation seems to produce good translations, having more synonyms in English also means that, by translating them with Google Cloud Translation, we could obtain more synonyms in the other languages. It is also important to note that Google Cloud Translation only provides one translation per label. Second-order links also bring many more synonyms for all the languages, but especially for those which have a larger Wikidata.
6. Conclusions and Future Work
One of the limitations of this work concerns information that was not used. In particular, in Orphanet and Wikidata, when an entity is linked to another ontology, there is additional information about the nature of the link, for example, whether it is an exact match or a more general entity. We did not use this information at all, and it could be used to improve the links we create. Wikidata also contains more information about the entities than just the labels; e.g., Jiang et al. (2013) extract multilingual textual definitions.

We also focused our study on one type of biomedical entity, diseases. The results of this work may not generalize to all types of entities. Hailu et al. (2014) found equivalent results for the translation of the Gene Ontology between English and German, but Silva et al. (2015) did not find the same results on their partial translation of MeSH.

Another limitation is our study of synonyms. Having the maximum number of synonyms is useful for entity recognition and normalization. Here we only studied the synonyms quantitatively, and did not explore their quality and diversity. First- and second-order link extraction from Wikidata seems to be a good method to obtain more synonyms. A further assessment with an expert who could validate the synonyms would be interesting.

Furthermore, as we are interested in entity recognition, a low coverage of the ontology is not correlated with a low coverage of entities in a corpus. In Bretschneider et al. (2014), by only translating a small sub-part of an ontology, the authors could improve the coverage of the entities in their corpus by a high margin.
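The unused link-nature information mentioned above could be exploited with a simple filter: keep only cross-references declared as exact matches when propagating labels, and treat broader or narrower matches separately. The sketch below illustrates the idea; the identifiers and the field names are hypothetical, not the actual Orphanet or Wikidata schema.

```python
# Hypothetical cross-reference records: "nature" encodes the kind of
# mapping (exact vs. broader match), as declared by the source ontology.
# All identifiers and field names here are illustrative.
xrefs = [
    {"source": "ORPHA:0001", "target": "EXT:0001", "nature": "exact"},
    {"source": "ORPHA:0001", "target": "EXT:0002", "nature": "broader"},
    {"source": "ORPHA:0002", "target": "EXT:0003", "nature": "exact"},
]

# Only exact matches are safe for transferring labels as translations;
# a broader match would import labels of a more general concept.
exact_links = [x for x in xrefs if x["nature"] == "exact"]
print(len(exact_links))  # 2
```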
It would be interesting to verify this on a dataset for disease recognition.

To summarize, as of now, Google Cloud Translation seems to be the best way to translate an ontology about diseases. If the ontology does not have many synonyms, Wikidata could be a way to expand the ontology language-wise. Wikidata also contains other information about its entities which could be interesting but has not been used in this study, such as symptoms and links to Wikipedia pages.

Bibliographical References
Alba, A., Coden, A., Gentile, A. L., Gruhl, D., Ristoski, P., and Welch, S. (2017). Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop. In Proceedings of the Knowledge Capture Conference (K-CAP 2017), pages 1–8, Austin, TX, USA. ACM Press.

Andronis, C., Sharma, A., Virvilis, V., Deftereos, S., and Persidis, A. (2011). Literature mining, ontologies and information visualization for drug repurposing. Briefings in Bioinformatics, 12(4):357–368, June.

Bretschneider, C., Oberkampf, H., Zillner, S., Bauer, B., and Hammon, M. (2014). Corpus-based Translation of Ontologies for Improved Multilingual Semantic Annotation. In Proceedings of the Third Workshop on Semantic Web and Information Extraction, pages 1–8, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

Hailu, N. D., Cohen, K. B., and Hunter, L. E. (2014). Ontology translation: A case study on translating the Gene Ontology from English to German. In Natural Language Processing and Information Systems: International Conference on Applications of Natural Language to Information Systems (NLDB), 8455:33–38, June.

INSERM. (1999a). Orphadata: free access data from Orphanet. Accessed: 2020-02-11.

INSERM. (1999b). Orphanet: an online rare disease and orphan drug data base. Accessed: 2020-02-11.

Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420.

Jiang, G. D., Solbrig, H. R., and Chute, C. G. (2013). A semantic web-based approach for harvesting multilingual textual definitions from Wikipedia to support ICD-11 revision. CEUR-WS.

Langlotz, C. (2006). RadLex: a new method for indexing online educational materials. Radiographics: a review publication of the Radiological Society of North America, Inc, 26(6):1595.

Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., et al. (2015). DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Li, J., Zheng, S., Chen, B., Butte, A. J., Swamidass, S. J., and Lu, Z. (2015). A survey of current trends in computational drug repositioning. Briefings in Bioinformatics, 17(1):2–12, March.

Maldonado, R., Goodwin, T. R., Skinner, M. A., and Harabagiu, S. M. (2017). Deep learning meets biomedical ontologies: knowledge embeddings for epilepsy. In AMIA Annual Symposium Proceedings, volume 2017, page 1233. American Medical Informatics Association.

Nayel, H. A. and Shashrekha, H. L. (2019). Integrating Dictionary Feature into A Deep Learning Model for Disease Named Entity Recognition. arXiv:1911.01600 [cs], November.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Phan, N., Dou, D., Wang, H., Kil, D., and Piniewski, B. (2017). Ontology-based deep learning for human behavior prediction with explanations in health social networks. Information Sciences, 384:298–313.

Schriml, L. M., Mitraka, E., Munro, J., Tauber, B., Schor, M., Nickle, L., Felix, V., Jeng, L., Bearer, C., Lichenstein, R., et al. (2019). Human Disease Ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Research, 47(D1):D955–D962.

Silva, M. J., Chaves, T., and Simoes, B. (2015). An ontology-based approach for SNOMED CT translation. ICBO 2015.

Vrandečić, D. and Krötzsch, M. (2014). Wikidata: a free collaborative knowledgebase.