Spanish Biomedical and Clinical Language Embeddings
Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Casimiro Pio Carrino, Ona De Gibert, Aitor Gonzalez-Agirre, Marta Villegas
Barcelona Supercomputing Center
February 26, 2021

Corresponding author: Marta Villegas ([email protected])
Abstract
We computed both word and sub-word embeddings using fastText. For the sub-word embeddings, we selected the Byte Pair Encoding (BPE) algorithm to represent the sub-words. We evaluated the biomedical word embeddings, obtaining better results than previous versions and showing that, with more data, we obtain better representations.
Introduction

The effectiveness of BERT-like (Devlin et al., 2019) and GPT-like (Brown et al., 2020) language models is corroborated for most Natural Language Processing tasks; however, computing more traditional embeddings is still useful, as they remain competitive for tasks with little data and in scenarios without large computational resources. These new embeddings use a new Spanish Biomedical Corpus, a Spanish Clinical Corpus, and BPE sub-word embeddings. We explain the process of generating the embeddings from two unprecedented Spanish health corpora. First, we describe the data and the cleaning process, then we explain the embedding methods and, finally, we report the evaluation results.
Corpora

We have developed two types of embeddings using two different corpora: the Spanish Biomedical Corpora and the Spanish Clinical Corpora. Since the Spanish Biomedical Corpora is of a much larger magnitude in size than the Clinical Corpora, we decided to compute the embeddings separately and provide them as distinct resources.
Biomedical Corpora

We used a large biomedical corpus gathered from a variety of medical resources, namely scientific literature, clinical cases and crawled data. Table 1 shows the composition of the largest Spanish Biomedical Corpora ever made. The corpus includes: cardiology clinical cases, radiology clinical cases, clinical cases books, COVID clinical cases, EMEA clinical cases (Tiedemann, 2012), medical patents, a Life Sciences Wikipedia download, the BARR2 background set (Intxaurrondo et al., 2018), PubMed data, REEC, Medline data, Gen-

EMEA is a corpus of biomedical documents retrieved from the European Medicines Agency (EMEA). It includes documents related to medicinal products and their translations into 22 official languages of the European Union. REEC is the Spanish registry of clinical studies (Registro español de estudios clínicos): https://reec.aemps.es/reec/public/web.html
Clinical Corpora

The Clinical Corpora comprises 5 main corpora; the information contained in these corpora consists mainly of COVID-19 cases and stroke (ictus) cases.
Cleaning

The sources of both the biomedical and clinical corpora come in multiple formats: PDF, WARC, plain text, etcetera. We cleaned each corpus independently, applying a cleaning pipeline with customized operations designed to read data in different formats, split it into sentences, perform language detection, remove noisy and malformed sentences, deduplicate, and eventually output the data with its original document boundaries. Finally, in order to avoid repetitive content, we concatenated all the individual corpora and deduplicated common documents among them again.
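The actual pipeline is not published with this paper; the following is a minimal sketch of the steps it describes (sentence splitting, noise filtering, document-level deduplication), using only the standard library. The splitting regex and noise thresholds are illustrative assumptions, not the authors' settings, and language detection is omitted.

```python
import hashlib
import re

def split_sentences(text):
    # Naive sentence splitter; a real pipeline would use a
    # language-aware splitter (this regex is an assumption).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def is_noisy(sentence, min_chars=10, max_digit_ratio=0.4):
    # Heuristic filter for malformed sentences; thresholds are illustrative.
    if len(sentence) < min_chars:
        return True
    digits = sum(c.isdigit() for c in sentence)
    return digits / len(sentence) > max_digit_ratio

def clean_corpus(documents):
    """Clean a list of raw document strings, deduplicating whole documents."""
    seen = set()
    cleaned = []
    for doc in documents:
        sentences = [s for s in split_sentences(doc) if not is_noisy(s)]
        if not sentences:
            continue
        # Deduplicate by hashing the normalized document content.
        key = hashlib.sha1(" ".join(sentences).lower().encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(sentences)  # document boundaries are preserved
    return cleaned
```

Hashing normalized content (rather than comparing raw strings) is what makes the final cross-corpus deduplication pass cheap: the same set of seen hashes can be carried over when the individual corpora are concatenated.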
Embeddings

We provide two types of embeddings: fastText word embeddings and BPE sub-word embeddings.
FastText embeddings are explained in Bojanowski et al. (2017). We tokenized the sentences and used the script available on the fastText website. As embedding sizes we used 50, 100 and 300 dimensions. As embedding methods, we used CBOW and Skip-gram. For the Biomedical Corpora, we set the minimum threshold for word frequency to 1, but for the Clinical Corpora we increased the threshold to 4 to avoid leaking sensitive data.

BPE embeddings are introduced in Heinzerling and Strube (2018). The vocabulary size parameter controls the sub-word splitting mechanism. We set the vocabulary size to 8,000 for the Clinical domain and 10,000 for the Biomedical domain. For the uncased version, the corpus is lowercased before computing the BPE vocabulary. After computing the BPE sub-words, fastText embeddings are computed using the official script, omitting the word threshold in the clinical corpus.
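The vocabulary size parameter mentioned above bounds how many merge operations BPE learns: a larger budget yields longer, more word-like sub-words. The toy implementation below shows the core algorithm (repeatedly merging the most frequent adjacent symbol pair); real tools such as the one behind BPEmb add many refinements, so this is only an illustration, and the sample words are invented.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from characters,
    # with an end-of-word marker so suffixes can be learned.
    vocab = Counter()
    for word in words:
        vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: fuse every occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

With a vocabulary budget of 8,000 or 10,000 as used here, frequent medical stems end up as single sub-word units, which is what makes the subsequent fastText training on sub-words effective.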
Evaluation

We evaluated the biomedical embeddings using the same scenario as a previous work (Soares et al., 2019), on the PharmaCoNER dataset (Gonzalez-Agirre et al., 2019). Clinical word embeddings and sub-word embeddings are not evaluated due to the lack of a suitable evaluation scenario. Table 2 shows the results.

The fastText script is available at https://fasttext.cc/docs/en/unsupervised-tutorial.html

                                         Cased              Uncased
Version  Method     Corpora       Validation   Test   Validation   Test
v1.0     Skip-gram  Wiki             88.55    87.78       -          -
                    SciELO           89.47    87.31       -          -
                    SciELO+Wiki      89.42    88.17       -          -
v2.0     CBOW       Wiki             86.55    85.46     86.70      86.34
                    SciELO           88.11    87.75     86.99      87.58
                    SciELO+Wiki      88.68    86.58     86.65      85.27
         Skip-gram  Wiki             88.62    87.16     88.31      87.43
                    SciELO           89.66    88.77     89.57      89.61
                    SciELO+Wiki      88.76    88.64     89.82      88.28
v3.0     CBOW       Bio-Corpus       88.92    88.12     88.86      88.41
         Skip-gram  Bio-Corpus       90.91

Table 2: Bio-Corpus embeddings (v3.0) compared to previous versions of the Spanish Biomedical Word Embeddings.

With this new version (v3.0) of the Biomedical Word Embeddings, we obtain almost 91% and 90% on the cased and uncased validation sets, respectively, with the Skip-gram embedding method. On the test set we obtain almost 90% for both the cased and uncased variants using the Skip-gram embedding method. We improve on the results of previous versions of the embeddings in both validation and test sets.
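The released word embeddings follow the plain-text word2vec format that fastText emits: a header line with vocabulary size and dimension, then one word per line followed by its vector components. A minimal loader and cosine-similarity query could look like this; the toy vocabulary and values are invented for illustration, not taken from the released files.

```python
import math

def load_vec(lines):
    """Parse the word2vec/fastText text format: header, then 'word v1 v2 ...'."""
    it = iter(lines)
    n_words, dim = map(int, next(it).split())  # header: vocab size, dimension
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim
        vectors[word] = values
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(vectors, query, k=1):
    """Return the k words most similar to `query` by cosine similarity."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda x: -x[1])[:k]
```

For example, with a three-word toy file, `nearest` returns the vocabulary entry whose vector points in the most similar direction to the query word's vector.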
Conclusions

In this work, we provide new materials for the Natural Language Processing community in the Spanish medical domain. With these resources we aim to address the lack of resources in medical AI caused by the sensitivity of medical data. We explained how our corpora are composed, the embedding methods we used and the evaluation we followed. We have shown that, with a larger corpus, our embeddings capture more information and obtain better evaluation results. Our embeddings show a steady improvement as more corpora have become available.
Materials
All the embeddings have been uploaded to Zenodo:

• Biomedical Word Embeddings: https://zenodo.org/record/4543236
• Biomedical Sub-word Embeddings: https://zenodo.org/record/4557459
• Clinical Word Embeddings: https://zenodo.org/record/4552042
• Clinical Sub-word Embeddings: https://zenodo.org/record/4555598
Acknowledgements
This work has been partially funded by the State Secretariat for Digitalization and Artificial Intelligence (SEDIA) to carry out specialised technical support activities in supercomputing within the framework of the Plan TL signed on 14 December 2018; and by the ICTUSnet INTERREG Sudoe programme.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information, 2017.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.

Aitor Gonzalez-Agirre, Montserrat Marimon, Ander Intxaurrondo, Obdulia Rabal, Marta Villegas, and Martin Krallinger. PharmaCoNER: Pharmacological substances, compounds and proteins named entity recognition track. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, pages 1–10, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5701.

Benjamin Heinzerling and Michael Strube. BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA). ISBN 979-10-95546-00-9.

Ander Intxaurrondo, Montserrat Marimon, Aitor Gonzalez-Agirre, Jose Antonio López-Martín, Heidy Rodriguez, Jesus Santamaría, Marta Villegas, and Martin Krallinger. Finding Mentions of Abbreviations and Their Definitions in Spanish Clinical Cases: The BARR2 Shared Task Evaluation Results. In IberEval@SEPLN, volume 2150 of CEUR Workshop Proceedings, pages 280–289. CEUR-WS.org, 2018.

Martin Krallinger, Jordi Armengol-Estapé, Ona De Gibert, Casimiro Pio Carrino, Aitor Gonzalez-Agirre, Asier Gutiérrez-Fandiño, and Marta Villegas. Spanish Biomedical Crawled Corpus, February 2021. URL https://doi.org/10.5281/zenodo.4561971

Felipe Soares, Marta Villegas, Aitor Gonzalez-Agirre, Martin Krallinger, and Jordi Armengol-Estapé. Medical word embeddings for Spanish: Development and evaluation. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 124–133, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-1916.

Jörg Tiedemann. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). ISBN 978-2-9517408-7-7.

Marta Villegas, Ander Intxaurrondo, A. Gonzalez-Agirre, M. Marimon, and Martin Krallinger. The MeSpEN Resource for English-Spanish Medical Machine Translation and Terminologies: Census of Parallel Corpora, Glossaries and Term Translations. In