Representing Semantified Biological Assays in the Open Research Knowledge Graph
Marco Anteghini, Jennifer D'Souza, Vitor A.P. Martins dos Santos, Sören Auer
Lifeglimmer GmbH, Markelstr. 38, 12163 Berlin, Germany
Wageningen University & Research, Laboratory of Systems & Synthetic Biology, Stippeneng 4, 6708 WE, Wageningen, The Netherlands
{anteghini,vds}@lifeglimmer.com
TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
{jennifer.dsouza,soeren.auer}@tib.eu
Abstract.
In the biotechnology and biomedical domains, recent text mining efforts advocate for machine-interpretable, and preferably semantified, documentation formats of laboratory processes. These include wet-lab protocols, (in)organic materials synthesis reactions, genetic manipulations, and procedures for faster computer-mediated analysis and predictions. Herein, we present our work on the representation of semantified bioassays in the Open Research Knowledge Graph (ORKG). In particular, we describe a work-in-progress semantification system that generates, automatically and quickly, the critical mass of semantified bioassay data needed to foster a consistent user audience that adopts the ORKG for recording their bioassays, and to facilitate the organisation of research according to FAIR principles.
Keywords:
Bioassays · Open Research Knowledge Graph · Open Science Graphs
More and more scholarly digital library initiatives aim at fostering the digitalization of traditional document-based scholarly articles [1,2,3,6,10,11,18,26]. This means structuring and organizing, in a fine-grained manner, knowledge elements from previously unstructured scholarly articles in a Knowledge Graph. These efforts are analogous to the digital transformation seen in recent years in other information-rich publishing and communication services, e.g., e-commerce product catalogs instead of mail-order catalogs, or online map services instead of printed street maps. For these services, the traditional document-based publication was not just digitized (by making digitized PDFs of the analog artifacts available) but has seen a comprehensively transformative digitalization.
Supported by TIB Leibniz Information Centre for Science and Technology, the EU H2020 ERC project ScienceGRaph (GA ID: 819536) and the ITN PERICO (GA ID: 812968).
Of the available scholarly knowledge digitalization avenues [1,2,3,6,10,11,18], we highlight the Open Research Knowledge Graph (ORKG) [12]. It is a next-generation digital library (DL) that focuses on ingesting information in scholarly articles as machine-actionable knowledge graphs (KG). In it, an article is represented with both (bibliographic) metadata and semantic descriptions (as subject-predicate-object triples) of its contributions. ORKG has a number of advantages: 1) it enables flexible semantic content modeling (i.e., ontologized or not, depending on the user or domain); 2) it semantifies contributions at various levels of granularity from shallow to fine-grained; and 3) it publishes persistent KG links per article contribution that it contains. For further technical details about the platform, we refer the reader to the introductory paper [12].
The ORKG DL aims to integrate and interlink contributions’ KGs for Science at large, i.e., multidisciplinarily. Thus far, ongoing efforts are in place for integrating scholarly contributions from at least two disciplines, viz. Math [21] and the Natural Language Processing subdomain in AI [9]. Moreover, the ORKG also has a separate feature to automatically import individual articles’ contributions data found tabulated in survey articles [20]; one example is an ORKG object for surveyed Earth Science articles’ contributions. Since surveys are written in most disciplines, this latter feature directly targets the ORKG aim; however, its sole limitation is that it is restricted to those papers that have been surveyed. On the other hand, with the per-domain semantification models, articles not surveyed can also be modeled in the ORKG.
In this paper, we describe our ongoing work in extending the ORKG to integrate biological assays from the Biochemistry discipline. For bioassays, a semantification model already exists as the BioAssay Ontology (BAO) [25].
However, we need to design a pragmatic workflow for integrating bioassays semantified by the BAO in the ORKG DL. To this end, we discuss the manual and automatic processes of integrating such semantified data in the ORKG DL. Furthermore, we show how these semantified data, once integrated in the ORKG, are amenable to advanced computational processing support for the researcher.
With the volume of research burgeoning [14], adopting a finer-grained semantification as KG for scholarly content representation is compelling. Better semantification means better machine actionability, which in turn means innumerable possibilities of advanced computational functions on scholarly content. One function especially pertinent in this era of the publications deluge [13] is computational support to alleviate the cognitive burden of manual information ingestion. This is precisely the computational support we showcase from the ORKG DL over our integrated bioassay KGs, consequently highlighting the benefits of digitalizing bioassays and of the ORKG DL platform.
This support allows practitioners to easily search for similar bioassays and to compare these semantically structured bioassays on their key properties.
Why integrate bioassays in a knowledge graph?
Until their recent semantification in an expert-annotated dataset of 983 bioassays [7,22,24] based on the BAO [25], bioassays were published in the form of plain text. Integrating their semantified counterpart in a KG facilitates their advanced computational processing. Consider that key assay concepts related to biological screening, including Perturbagen, Participants, Meta Target and Detection Technology, will be machine-actionable. This widens the potential for relational enrichment and interlinking when integrated with machine-interpretable formats of wet lab protocols and inorganic materials synthesis reactions and procedures [15,16,17,19]. Furthermore, in this era of neural-based ML technologies, KG-based word embeddings foster new inferential discovery mechanisms given that they encode high-dimensional semantic spaces [5], although bioassay KGs remain untested in this respect.
Why the ORKG DL [2]? The core of the setup of knowledge-based digitalized information flows is the distributed, decentralized, collaborative creation and evolution of information models. Moreover, vocabularies, ontologies, and knowledge graphs are needed to establish a common understanding of the data between the various stakeholders. And, importantly, these technologies must be integrated into the infrastructure and processes of search and knowledge exchange toward a research library of the future. The ORKG DL is such a solution. Implemented within TIB, as a central library and information centre for science and technology, it also promises development longevity: the Leibniz Association institutional networks present a critical mass of application domains and users to enhance the infrastructure and continuously integrate new knowledge disciplines.
With these considerations in place, the work described in the subsequent sections is being carried forth. Next, we describe our approach in the context of two main research questions.
RQ1:
What are the steps for manually digitalizing a Bioassay in the ORKG?
The digitalization is based on the prior requirement that text-based bioassays are semantified based on the BioAssay Ontology (BAO) [25]. This is the manual aspect of the digitalization process involving domain experts or the assay authors themselves. In Figure 1, we show an example of a manually pre-semantified bioassay integrated in ORKG. This bioassay was semantified on eight properties based on the BAO. It was drawn from an expert-annotated set of 983 bioassays [22,24]. In terms of salient features, the bioassays in this dataset have 53 triple semantic statements on average with a minimum of 5 and a maximum of 92 statements; there are 42 different types of bioassays (e.g., luciferase reporter gene assay, protein-protein interaction assay; see the appendix for the full list); and there are 11 assay formats (e.g., cell-based, biochemical). Thus, the manual semantification task complexity can be viewed as 53 modeling decisions on average.
In gist, the manual digitalization of a bioassay in the ORKG includes: 1) a BAO-based semantification step: forming subject-predicate-object triples of
encoding high-dimensional semantic spaces of the underlying text, obviate the need to make explicit considerations for features of the text. Moreover, they significantly outperform systems designed based on explicit features [8], with due credit to the system by Clark et al. [7], which was designed prior to the onset of this revolutionary technology. Next, our hybrid workflow is designed toward a practical end: to be integrated in the ORKG DL, which has a predominant focus on the digitalization of scholarly knowledge content multidisciplinarily, thus setting it apart from any existing DL.
RQ2:
What are the modules needed in the hybrid digitalization of Bioassays in the ORKG?
Essentially, given a new bioassay text input, we are implementing two modules in a two-step workflow as follows: 1) an automated semantifier; and 2) a human-in-the-loop curation of the predicted labels either by the assay author or a dedicated curator. Unlike the manual workflow, this presents a much easier and less time-intensive task for the human. They would merely be selecting the correctly predicted triples, deleting the incorrect ones, or defining new ones as needed. Assuming a well-trained machine learning module, the latter two steps may be entirely omitted. Toward this hybrid workflow, as work in progress, the automated semantifier is in development, and we are also implementing extensions in the ORKG infrastructure to include additional front-end views as assay curation interfaces.
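The curation step of the hybrid workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ORKG implementation: the names `Triple`, `CuratedAssay` and `curate`, the identifier `assay_1`, and the predicate labels are all hypothetical, and the predicted triples are hard-coded stand-ins for the output of an automated semantifier.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# A semantic statement as a (subject, predicate, object) triple.
Triple = Tuple[str, str, str]


@dataclass
class CuratedAssay:
    accepted: List[Triple]  # predictions the curator kept
    rejected: List[Triple]  # predictions the curator deleted
    added: List[Triple]     # statements the curator defined by hand

    def statements(self) -> List[Triple]:
        """Final statements to record for the assay."""
        return self.accepted + self.added


def curate(predicted: Iterable[Triple],
           rejected: Iterable[Triple],
           added: Iterable[Triple]) -> CuratedAssay:
    """Apply a curator's decisions to the semantifier's predictions."""
    rejected_set = set(rejected)
    accepted, dropped = [], []
    for triple in predicted:
        (dropped if triple in rejected_set else accepted).append(triple)
    # Skip manual additions that duplicate an accepted prediction.
    new = [t for t in added if t not in accepted]
    return CuratedAssay(accepted, dropped, new)


# Hypothetical predictions for one assay (illustrative values only).
predicted = [
    ("assay_1", "has assay format", "cell-based"),
    ("assay_1", "has bioassay type", "kinase activity"),
]
curated = curate(
    predicted,
    rejected=[predicted[1]],
    added=[("assay_1", "has bioassay type", "luciferase reporter gene")],
)
```

With a well-trained semantifier, `rejected` and `added` would both be empty and the curator's task reduces to confirming the predictions.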
Premise: We need an information processing tool that can be used by biomedical practitioners to quickly comprehend bioassays’ key properties.
The ORKG DL has a computational feature to generate and publish surveys in the form of tabulated comparisons of the KG nodes [20]. To demonstrate this feature, we manually entered the data of three semantified bioassays in the ORKG DL. Applying the ORKG survey feature to the three assays then aggregates their semantified graph nodes in tabulated comparisons across the assays. This is depicted in Figure 2. With such structured computations enabled, we have a novel approach to uncovering and presenting information relying on aggregated scholarly knowledge. The computation shown in Fig. 2 aligns closely with the notion of the traditional survey article, except that it is fully automated and operates on machine-actionable knowledge elements. The BAO-semantified assays are compared side-by-side on their graph nodes. Thus, tracking the progress on bioassays can be eased from a task of several days to a few minutes.
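The idea of aggregating semantified statements into a side-by-side comparison can be illustrated with a short sketch. It is an assumption-laden toy, not the ORKG survey feature itself: the function `compare`, the assay names, and the predicate labels are hypothetical, and the statements are given as plain (predicate, object) pairs rather than retrieved from the graph.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def compare(assays: Dict[str, List[Tuple[str, str]]]):
    """Pivot per-assay (predicate, object) statements into a table.

    Returns a header row and one row per property; each row holds the
    property name followed by the value(s) of every assay, or "-" when
    an assay has no statement for that property.
    """
    assay_ids = list(assays)
    table = defaultdict(lambda: {a: [] for a in assay_ids})
    for assay_id, statements in assays.items():
        for predicate, obj in statements:
            table[predicate][assay_id].append(obj)
    header = ["property"] + assay_ids
    rows = [
        [p] + [", ".join(table[p][a]) or "-" for a in assay_ids]
        for p in sorted(table)
    ]
    return header, rows


# Hypothetical statements for two assays (illustrative values only).
header, rows = compare({
    "assay A": [("has assay format", "cell-based"),
                ("has bioassay type", "luciferase reporter gene")],
    "assay B": [("has assay format", "biochemical")],
})
for row in [header] + rows:
    print(" | ".join(row))
```

Each row corresponds to one property shared across the graph, so missing values become visible at a glance, which is what makes the tabulated view useful for tracking progress across assays.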
Thus, in this paper, we outlined a vision in two separate workflows for integrating bioassay knowledge in the ORKG DL, together with our ongoing work to this end. The implications of structured and machine-actionable bioassay knowledge are broad.
Fig. 2: Comparisons of semantified bioassays in the ORKG digital library. Online
To mention just one in the particular context of the current Covid-19 pandemic: the discovery of cures for diseases can be greatly expedited if scientists are given intelligent information access tools, and our work toward automatically semantifying bioassays is a step in this direction.
To this end, the workflows prescribed in this work offer the possibility to choose between a manual and a semi-automatic strategy for bioassays’ semantification within a real-world digital library.
We would like to invite interested researchers to collaborate with us on the following topics: 1) generating a large dataset of semantically structured bioassays; 2) user evaluation of our semi-automated system for semantically structuring bioassay data.
We deem this a starting point for a discussion in the community, ultimately leading to more clearly defined technical requirements and a roadmap for fulfilling the potential of the ORKG as a next-generation digital library for fine-grained semantified access to scholarly content.
References
1. Aryani, A., Poblet, M., Unsworth, K., Wang, J., Evans, B., Devaraju, A., Hausstein, B., Klas, C.P., Zapilko, B., Kaplun, S.: A research graph dataset for connecting research data repositories using rd-switchboard. Scientific data, 180099 (2018)
2. Auer, S.: Towards an open research knowledge graph (Jan 2018). https://doi.org/10.5281/zenodo.1157185
3. Baas, J., Schotten, M., Plume, A., Côté, G., Karimi, R.: Scopus as a curated, high-quality bibliometric data source for academic research in quantitative science studies. Quantitative Science Studies (1), 377–386 (2020)
4. Beltagy, I., Lo, K., Cohan, A.: Scibert: A pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3606–3611 (2019)
5. Bianchi, F., Rossiello, G., Costabello, L., Palmonari, M., Minervini, P.: Knowledge graph embeddings and explainable ai. arXiv preprint arXiv:2004.14843 (2020)
6. Birkle, C., Pendlebury, D.A., Schnell, J., Adams, J.: Web of science as a data source for research on scientific and scholarly activity. Quantitative Science Studies (1), 363–376 (2020)
7. Clark, A.M., Bunin, B.A., Litterman, N.K., Schürer, S.C., Visser, U.: Fast and accurate semantic annotation of bioassays exploiting a hybrid of machine learning and user confirmation. PeerJ, e524 (2014)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
9. D’Souza, J., Auer, S.: Nlpcontributions: An annotation scheme for machine reading of scholarly contributions in natural language processing literature (2020)
10. Fricke, S.: Semantic scholar. Journal of the Medical Library Association: JMLA (1), 145 (2018)
11. Hendricks, G., Tkaczyk, D., Lin, J., Feeney, P.: Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies (1), 414–427 (2020)
12. Jaradeh, M.Y., Oelen, A., Farfar, K.E., Prinz, M., D’Souza, J., Kismihók, G., Stocker, M., Auer, S.: Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge. In: Proceedings of the 10th International Conference on Knowledge Capture. pp. 243–246 (2019)
13. Jinha, A.E.: Article 50 million: an estimate of the number of scholarly articles in existence. Learned Publishing (3), 258–263 (2010)
14. Johnson, R., Watkinson, A., Mabe, M.: The stm report. An overview of scientific and scholarly publishing. 5th edition October (2018)
15. Kononova, O., Huo, H., He, T., Rong, Z., Botari, T., Sun, W., Tshitoyan, V., Ceder, G.: Text-mined dataset of inorganic materials synthesis recipes. Scientific data (1), 1–11 (2019)
16. Kulkarni, C., Xu, W., Ritter, A., Machiraju, R.: An annotated corpus for machine reading of instructions in wet lab protocols. In: NAACL: HLT, Volume 2 (Short Papers). pp. 97–106. New Orleans, Louisiana (Jun 2018). https://doi.org/10.18653/v1/N18-2016
17. Kuniyoshi, F., Makino, K., Ozawa, J., Miwa, M.: Annotating and extracting synthesis process of all-solid-state batteries from scientific literature. In: LREC. pp. 1941–1950 (2020)
18. Manghi, P., Atzori, C., Bardi, A., Shirrwagen, J., Dimitropoulos, H., La Bruzzo, S., Foufoulas, I., Löhden, A., Bäcker, A., Mannocci, A., Horst, M., Baglioni, M., Czerniak, A., Kiatropoulou, K., Kokogiannaki, A., De Bonis, M., Artini, M., Ottonello, E., Lempesis, A., Nielsen, L.H., Ioannidis, A., Bigarella, C., Summan, F.: Openaire research graph dump (Dec 2019). https://doi.org/10.5281/zenodo.3516918
19. Mysore, S., Jensen, Z., Kim, E., Huang, K., Chang, H.S., Strubell, E., Flanigan, J., McCallum, A., Olivetti, E.: The materials science procedural text corpus: Annotating materials synthesis procedures with shallow semantic structures. In: Proceedings of the 13th Linguistic Annotation Workshop. pp. 56–64 (2019)
20. Oelen, A., Jaradeh, M.Y., Stocker, M., Auer, S.: Generate fair literature surveys with scholarly knowledge graphs. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020. pp. 97–106. JCDL ’20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3383583.3398520
21. Runnwerth, M., Stocker, M., Auer, S.: Operational research literature as a use case for the open research knowledge graph. In: Bigatti, A.M., Carette, J., Davenport, J.H., Joswig, M., de Wolff, T. (eds.) Mathematical Software - ICMS 2020 - 7th International Conference, Braunschweig, Germany, July 13-16, 2020, Proceedings. Lecture Notes in Computer Science, vol. 12097, pp. 327–334. Springer (2020). https://doi.org/10.1007/978-3-030-52200-1_32
22. Schürer, S.C., Vempati, U., Smith, R., Southern, M., Lemmon, V.: Bioassay ontology annotations facilitate cross-analysis of diverse high-throughput screening data sets. Journal of biomolecular screening (4), 415–426 (2011)
23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
24. Vempati, U.D., Przydzial, M.J., Chung, C., Abeyruwan, S., Mir, A., Sakurai, K., Visser, U., Lemmon, V.P., Schürer, S.C.: Formalization, annotation and analysis of diverse drug and probe screening assay datasets using the bioassay ontology (bao). PloS one (11), e49198 (2012)
25. Visser, U., Abeyruwan, S., Vempati, U., Smith, R.P., Lemmon, V., Schürer, S.C.: Bioassay ontology (bao): a semantic description of bioassays and high-throughput screening results. BMC bioinformatics (1), 257 (2011)
26. Wang, K., Shen, Z., Huang, C., Wu, C.H., Dong, Y., Kanakia, A.: Microsoft academic graph: When experts are not enough. Quantitative Science Studies (1), 396–413 (2020)
A Bioassay types
Bioassay types
protein-protein interaction          hydrolase activity
kinase activity                      protein-small molecule interaction
viability                            beta lactamase reporter gene
cytochrome P450 enzyme activity      luciferase enzyme activity
luciferase reporter gene             oxidoreductase activity
protein unfolding                    chaperone activity
lyase activity                       transporter
plasma membrane potential            dye redistribution
calcium redistribution               apoptosis
beta lactamase reporter gene         beta galactosidase reporter gene
phosphatase activity                 cAMP redistribution
IP1 redistribution                   cell morphology
phosphorylation                      transferase activity
isomerase activity                   protein redistribution
radioligand binding                  signal transduction
ion channel                          platelet activation
fluorescent protein reporter gene    protein-DNA interaction
protease activity                    cell permeability
protein stability                    protein-turnover
localization                         organism behavior
cytotoxicity                         cell growth
Table 1: List of the different bioassay types present in our dataset
B Preliminary Results of Automated Semantification: SciBERT-based Bioassay Semantifier
The semantic statements depicted in Figure 3 were automatically generated by a SciBERT-based [4] neural semantification system. These predictions were made for the same bioassay text depicted in Figure 1. Comparing the automatically generated statements against the reference, we see that almost all the manually curated labels are correctly predicted. Among 16 manually curated labels, excluding those we omit in our training procedure (e.g., has title, PubChem AID, Deposit Date, has incubation time value, has concentration unit), the model accurately predicts 12 statements, while the remaining predictions were deemed by a domain specialist as valid additional candidates to incorporate in the reference set (e.g., has significant direction, has concentration throughput).