Unsupervised Identification of Study Descriptors in Toxicology Research: An Experimental Study

Drahomira Herrmannova, Steven R. Young, Robert M. Patton, Christopher G. Stahl
Oak Ridge National Laboratory, TN, USA
{herrmannovad, youngsr, pattonrm, stahlcg}@ornl.gov

Nicole C. Kleinstreuer
NICEATM, NTP, NIEHS, NIH, Research Triangle Park, NC, USA
[email protected]

Mary S. Wolfe
NTP, NIEHS, NIH, Research Triangle Park, NC, USA
[email protected]
Abstract
Identifying and extracting data elements such as study descriptors in publication full texts is a critical yet manual and labor-intensive step required in a number of tasks. In this paper we address the question of identifying data elements in an unsupervised manner. Specifically, provided a set of criteria describing specific study parameters, such as species, route of administration, and dosing regimen, we develop an unsupervised approach to identify text segments (sentences) relevant to the criteria. A binary classifier trained to identify publications that met the criteria performs better when trained on the candidate sentences than when trained on sentences randomly picked from the text, supporting the intuition that our method is able to accurately identify study descriptors.
1 Introduction

Extracting data elements such as study descriptors from publication full texts is an essential step in a number of tasks including systematic review preparation (Jonnalagadda et al., 2015), construction of reference databases (Kleinstreuer et al., 2016), and knowledge discovery (Smalheiser, 2012). These tasks typically involve domain experts identifying relevant literature pertaining to a specific research question or a topic being investigated, identifying passages in the retrieved articles that discuss the sought-after information, and extracting structured data from these passages. The extracted data is then analyzed, for example to assess adherence to existing guidelines (Kleinstreuer et al., 2016). Figure 1 shows an example text excerpt with information relevant to a specific task (assessment of adherence to existing guidelines (Kleinstreuer et al., 2016)) highlighted.

Figure 1: Text excerpt from a reference database of rodent uterotrophic bioassay publications (Kleinstreuer et al., 2016). The text in this example was manually annotated by one of the authors to highlight information relevant to guidelines for performing uterotrophic bioassays set forth by (OECD, 2007).

Extracting the data elements needed in these tasks is a time-consuming and at present largely manual process which requires domain expertise. For example, in systematic review preparation, information extraction generally constitutes the most time-consuming task (Tsafnat et al., 2014). This situation is made worse by the rapidly expanding body of potentially relevant literature, with more than one million papers added to PubMed each year (Landhuis, 2016). Therefore, data annotation and extraction presents an important challenge for automation.

A typical approach to automated identification of relevant information in biomedical texts is to infer a prediction model from labeled training data; such a model can then be used to assign predicted labels to new data instances. However, obtaining training data for creating such prediction models can be very costly, as it involves the very step these models are trying to automate: manual data extraction. Furthermore, depending on the task at hand, the types of information being extracted may vary significantly.
For example, in systematic reviews of randomized controlled trials this information generally includes the patient group, the intervention being tested, the comparison, and the outcomes of the study (PICO elements) (Tsafnat et al., 2014). In toxicology research the extraction may focus on routes of exposure, dose, and necropsy timing (Kleinstreuer et al., 2016). Previous work has largely focused on identifying specific pieces of information such as biomedical events (Gonzalez et al., 2015) or PICO elements (Jonnalagadda et al., 2015). However, depending on the domain and the end goal of the extraction, these may be insufficient to comprehensively describe a given study.

Therefore, in this paper we focus on unsupervised methods for identifying text segments (such as sentences or fixed-length sequences of words) relevant to the information being extracted. We develop a model that can be used to identify text segments from text documents without labeled data and that only requires the current document itself, rather than an entire training corpus linked to the target document. More specifically, we utilize representation learning methods (Mikolov et al., 2013a), where words or phrases are embedded into the same vector space. This allows us to compute semantic relatedness among text fragments, in particular between sentences or text segments in a given document and a short description of the type of information being extracted from the document, by using similarity measures in the feature space. The model has the potential to speed up identification of relevant segments in text and therefore to expedite annotation of domain-specific information without reliance on costly labeled data.

We have developed and tested our approach on a reference database of rodent uterotrophic bioassays (Kleinstreuer et al., 2016), in which studies are labeled according to their adherence to test guidelines set forth in (OECD, 2007). Each study in the database is assigned a label determining whether or not it met each of six main criteria defined by the guidelines (https://ntp.niehs.nih.gov/pubhealth/evalatm/test-method-evaluations/endocrine-disruptors/ref-data/edhts.html); however, the database does not contain sentence-level annotations or any information about where the criteria were mentioned in each publication. Due to the lack of fine-grained annotations, supervised learning methods cannot be easily applied to aid annotating new publications or to annotate related but distinct types of studies. This database therefore presents an ideal use case for unsupervised approaches.

While our approach does not require any labeled data to work, we use the labels available in the dataset to evaluate it. We train a binary classification model for identifying publications which satisfied given criteria and show the model performs better when trained on relevant sentences identified by our method than when trained on sentences randomly picked from the text. Furthermore, for three out of the six criteria, a model trained solely on the relevant sentences outperforms a model which utilizes the full text. The results of our evaluation support the intuition that semantic relatedness to criteria descriptions can help in identifying text sequences discussing sought-after information.

There are two main contributions of this work. We present an unsupervised method that employs representation learning to identify text segments from publication full text which are relevant to, or contain, specific sought-after information (such as the number of dose groups).
In addition, we explore a new dataset which has not previously been used in the field of information extraction.

The remainder of this paper is organized as follows. In the following section we provide more details of the task and the dataset used in this study. In Section 3 we describe our approach. In Section 4 we evaluate our model and discuss our results. In Section 5 we compare our work to existing approaches. Finally, in Section 6 we provide ideas for further study.

2 Task and Dataset

This section provides more details about the specific task and the dataset used in our study which motivated the development of our model.
Significant efforts in toxicology research are being devoted towards developing new in vitro methods for testing chemicals, due to the large number of untested chemicals in use and the time and cost required by traditional in vivo methods (2-3 years and millions of dollars per chemical (Judson et al., 2009)). To facilitate the development of novel in vitro methods and assess the adherence to existing study guidelines, a curated database of high-quality in vivo rodent uterotrophic bioassay data extracted from research publications has recently been developed and published (Kleinstreuer et al., 2016).

The creation of the database followed the study protocol design set forth in (OECD, 2007), which is composed of six minimum criteria (MC, Table 1). An example of information pertaining to the criteria is shown in Figure 1. Only studies which met all six minimum criteria were considered guideline-like (GL) and were included in a follow-up detailed study and the final database (Kleinstreuer et al., 2016). However, of the 670 publications initially considered for inclusion, only 93 (approximately 14%) met all six criteria.

The version of the database which contains both GL and non-GL studies consists of 670 publications (spanning the years 1938 through 2014) with results from 2,615 uterotrophic bioassays. Specifically, each entry in the database describes one study, and studies are linked to publications using PubMed reference numbers (PMIDs).
Criteria name | Description
MC 1: Animal model | Immature rats, ovariectomized (OVX) adult rats, or OVX adult mice are acceptable (immature mice are not acceptable). OVX animals: OVX should be performed between six and eight weeks of age (allowing at least 14 days post-surgery before dosing for rats and seven days post-surgery for mice). Immature rats: dosing should begin between postnatal day (PND) 18 and PND 21, and be completed by PND 25.
MC 2: Group size | Each control group should have a minimum of three animals and each test group should have a minimum of five animals.
MC 3: Route of administration | Acceptable routes of administration: oral gavage (p.o.), subcutaneous (s.c.) injection, or intraperitoneal (i.p.) injection.
MC 4: Number of dose groups | Minimum of two dose level groups. Must have positive control and negative control.
MC 5: Dosing interval | Dosing for a minimum of three consecutive days. Complete by PND 25 in immature animals.
MC 6: Necropsy timing | Should be carried out 18-36 hours after the last dose.
Table 1: Minimum criteria for guideline-like studies. The descriptions are reprinted here from (Kleinstreuer et al., 2016).

Each study is assigned seven 0/1 labels: one for each of the minimum criteria and one for the overall GL/non-GL label. The database also contains more detailed subcategories for each label (for example a "species" label for MC 1) which were not used in this study. The publication PDFs were provided to us by the database creators. We used the Grobid library (https://github.com/kermitt2/grobid) to convert the PDF files into structured text. After removing documents with missing PDF files and documents which were not converted successfully, we were left with 624 full text documents.

Each publication contains on average 3.7 studies (separate bioassays); 194 publications contain a single study, while the rest contain two or more studies (with 82 being the most bioassays per publication). The following excerpt shows an example sentence mentioning multiple bioassays (with different study protocols):

With the exception of the first study (experiment 1), which had group sizes of 12, all other studies had group sizes of 8.
For this experiment we did not distinguish between publications describing a single study and those describing multiple studies. Instead, our focus was on retrieving all text segments (which may be related to multiple studies) relevant to each of the criteria. For each MC, if a document contained multiple studies with different labels, we discarded that document from our analysis of that criterion; if a document contained multiple studies with the same label, we simply combined those labels into a single document-level label. Table 2 shows the final size of the dataset.
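As an illustration, the per-criterion label aggregation just described can be captured in a few lines of Python. This is a minimal sketch; the function name and the use of None as a "drop this document" marker are our assumptions, not details from the original paper.

    def aggregate_labels(study_labels):
        """Collapse the 0/1 labels of all studies in one document (for a
        single MC) into one document-level label. Agreement yields that
        label; disagreement yields None, meaning the document is dropped
        from the analysis of this criterion."""
        unique = set(study_labels)
        return unique.pop() if len(unique) == 1 else None

For example, aggregate_labels([1, 1, 1]) returns 1, while aggregate_labels([0, 1]) returns None.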
Criteria | 0 | 1 | Total | % of 1
MC 1 | 414 | 175 | 589 | 29.71
MC 2 | 35 | 577 | 612 | 94.28
MC 3 | 70 | 536 | 606 | 88.45
MC 4 | 309 | 206 | 515 | 40.00
MC 5 | 96 | 490 | 586 | 83.62
MC 6 | 228 | 340 | 568 | 59.86
GL | 522 | 72 | 594 | 12.12
Table 2: Label statistics. Column 0 shows the number of publications per MC which did not meet the criterion and column 1 shows the number of publications which met it. The last column shows the proportion of positive (i.e., criterion met) labels.

3 Approach

In this section we describe the method we used for retrieving text segments related to the criteria described in the previous section. The intuition is based on question answering systems: we treat the criteria descriptions (Table 1) as the question and the text segments within the publication that discuss the criteria as the answer. Given a full text publication, the goal is to find the text segments most likely to contain the answer.

We represent the criteria descriptions and the text segments extracted from the documents as vectors of features, and utilize relatedness measures to retrieve the text segments most similar to the descriptions. A similar step is typically performed by most question answering (QA) systems: both the input documents and the question are represented as sequences of embedding vectors, and a retrieval system then compares the document and question representations to retrieve the text segments most likely to contain the answer (Mishra and Jain, 2016).

To account for the variations in language that can be used to describe the criteria, we represent words as vectors generated using Word2Vec (Mikolov et al., 2013a). The following two excerpts show two different ways MC 6 was described in text:
Animals were killed 24 h after being injected and their uteri were removed and weighed.

All animals were euthanized by exposure to ethyl ether 24 h after the final treatment.
We hypothesize that the use of word embedding features will allow us to detect relevant words which are not present in the criteria descriptions. (Mikolov et al., 2013b) have shown that an important feature of Word2Vec embeddings is that similar words have similar vectors because they appear in similar contexts. We utilize this feature to calculate similarity between the criteria descriptions and text segments (such as sentences) extracted from each document. A high-level overview of our approach is shown in Figure 2.

We use the following method to retrieve the most relevant text segments:
Segment extraction:
First, we break each document down into shorter sequences, such as sentences or word sequences of fixed length. While the first option (sentences) results in text which is easier to process, it has the disadvantage of producing sequences of varying length, which may affect the resulting similarity value. For simplicity, in this study we use the sentence version.
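For concreteness, a minimal sketch of this step follows. The regex-based sentence splitter and the window size are simplifying assumptions on our part; the paper does not specify which sentence splitter was used.

    import re

    def extract_segments(text, mode="sentence", window=40):
        """Break a document into candidate segments: sentences (the option
        used in this study) or fixed-length word sequences (the alternative
        mentioned above)."""
        if mode == "sentence":
            # Naive split on sentence-final punctuation followed by whitespace;
            # a proper tokenizer (e.g. NLTK punkt) would handle abbreviations.
            return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        words = text.split()
        return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]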
Figure 2: High-level overview of our approach. The dotted line represents an optional step of finding smaller sub-segments within the candidate segments. For example, in our case, we first retrieve the most similar sentences and in the second step find the most similar continuous 5-grams within those sentences.
Segment/description representation:
We represent each sequence and the input description as a sequence of vector representations. For this study we utilized Word2Vec embeddings (Mikolov et al., 2013a) trained using the Gensim library on our corpus of 624 full text publications.
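A sketch of this step using Gensim is shown below. The hyperparameter values are illustrative assumptions, as the paper does not report the settings used.

    from gensim.models import Word2Vec

    def train_embeddings(tokenised_sentences):
        """Train Word2Vec embeddings on the corpus of 624 full text
        publications, supplied as a list of lower-cased token lists."""
        return Word2Vec(
            sentences=tokenised_sentences,
            vector_size=100,  # embedding dimensionality (Gensim >= 4.0 API)
            window=5,         # context window size
            min_count=2,      # drop tokens seen fewer than twice
            workers=4,
        )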
Word to word similarities:
Next we calculate the similarity between each word vector from each sequence $s_i$ and each word vector from the input description $d$ using cosine similarity. The output of this step is a similarity matrix $S_i \in \mathbb{R}^{N_i \times M_d}$ for each sequence $s_i$, where $N_i$ is the number of unique words in the sequence and $M_d$ is the number of unique words in the description $d$.
Segment to description similarities:
To obtain a similarity value representing the relatedness of each sequence to the input description, we first convert each matrix $S_i$ into a vector $v_i \in \mathbb{R}^{N_i}$ by choosing the maximum similarity value for each word in the sequence, that is, $v_i = \max_{\text{rows}}(S_i)$. Each sequence is then assigned a similarity value $r_i \in \mathbb{R}$ calculated as $r_i = \text{avg}(v_i)$. In the future we plan to experiment with different ways of calculating the relatedness of the sequences to the descriptions, such as computing the similarity of embeddings created from the text fragments using approaches like Doc2Vec (Le and Mikolov, 2014). In this study, after finding the top sentences, we further break each sentence down into continuous n-grams to find the specific part of the sentence discussing the MC. We repeat the same process described above to calculate the relatedness of each n-gram to the description.
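The two similarity steps above can be sketched with NumPy and a trained Gensim model as follows. This is a minimal sketch; the handling of out-of-vocabulary words is our assumption.

    import numpy as np

    def segment_similarity(model, segment_words, description_words):
        """Relatedness r_i of one segment to a criterion description:
        build the cosine similarity matrix S_i over unique word pairs,
        take the row-wise maximum to obtain v_i, and average it."""
        seg = [w for w in set(segment_words) if w in model.wv]
        desc = [w for w in set(description_words) if w in model.wv]
        if not seg or not desc:
            return 0.0  # no in-vocabulary words to compare
        A = np.stack([model.wv[w] for w in seg])
        B = np.stack([model.wv[w] for w in desc])
        # Unit-normalise so the dot product below is cosine similarity.
        A /= np.linalg.norm(A, axis=1, keepdims=True)
        B /= np.linalg.norm(B, axis=1, keepdims=True)
        S = A @ B.T              # S_i, shape (N_i, M_d)
        v = S.max(axis=1)        # v_i = max_rows(S_i)
        return float(v.mean())   # r_i = avg(v_i)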
Candidate segments:
For each document we select the top k text segments (sentences in the first step and 5-grams in the second step) most similar to the description.

Figures 3, 4, and 5 show example annotations generated using our method for the first three criteria. For this example we ran our method on the abstract of the target document rather than the full text and highlighted only the single most similar sentence. The abstract used to produce these figures is the same as the abstract shown in Figure 1. In all three figures, the lighter yellow color highlights the sentence which was found to be the most similar to a given MC description, the darker red color shows the top 5-gram found within the top sentence, and the bold underlined text is the text we are looking for (the correct answer). Annotations generated for the remaining three criteria are shown in Appendix A.

Due to space limitations, Figures 3, 4, and 5 show results generated on abstracts rather than on full text; however, we have observed similarly accurate results when applying our method to full text. The only difference between the abstract and full text versions is how many top sentences we retrieved. When working with abstracts only, we observed that if the criterion was discussed in the abstract, it was generally sufficient to retrieve the single most similar sentence. However, as a criterion may be mentioned in multiple places within the document, when working with full text documents we retrieved and analyzed the top k sentences instead of just a single sentence. In this case we typically found the correct sentence or sentences among the top 5 sentences. We have also observed that similar sentences which do not discuss the criterion directly (i.e., the "incorrect" sentences) typically discuss related topics. For example, consider the following three sentences:

After weaning on pnd 21, the dams were euthanized by CO2 asphyxiation and the juvenile females were individually housed.

Six CD(SD) rat dams, each with reconstituted litters of six female pups, were received from Charles River Laboratories (Raleigh, NC, USA) on offspring postnatal day (pnd) 16.

This validation study followed OECD TG 440, with six female weanling rats (postnatal day 21) per dose group and six treatment groups.
These three sentences were extracted from the abstract and the full text of a single document (document 20981862, the abstract of which is shown in Figures 1 and 3-8). They were retrieved as the most similar to MC 1, with similarity scores of 70.61, 65.31, and 63.69, respectively. The third sentence contains the "answer" to MC 1 (underlined). However, the top two sentences also discuss the animals used in the study (more specifically, the animals' housing and their origin).
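Combining the hypothetical helpers sketched above, the two-step retrieval (top sentences, then top continuous 5-grams within them) for one document and one criterion description might look as follows; the default k value is illustrative.

    def top_segments(model, document_text, description, k=5, n=5):
        """Rank sentences against one MC description, keep the top k,
        then rescore the continuous n-grams inside those sentences."""
        desc_tokens = description.lower().split()
        score = lambda text: segment_similarity(model, text.lower().split(), desc_tokens)
        sentences = extract_segments(document_text, mode="sentence")
        top_sentences = sorted(sentences, key=score, reverse=True)[:k]
        ngrams = []
        for sent in top_sentences:
            words = sent.split()
            ngrams += [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
        top_ngrams = sorted(ngrams, key=score, reverse=True)[:k]
        return top_sentences, top_ngrams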
4 Evaluation

The goal of this experiment was to explore empirically whether our approach truly identifies mentions of the minimum criteria in text. As we did not have any fine-grained annotations that could be used to directly evaluate whether our model identifies the correct sequences, we used a different methodology. We utilized the existing 0/1 labels available in the database (discussed in Section 2) to train one binary classifier for each MC. The task of each classifier is to determine whether or not a publication met the given criterion. We then compared a baseline classifier trained on all full text with three other models:

• A model which, instead of all full text, utilized only the top k sentences most similar to the given MC. The top k sentences were identified using the model introduced in the previous section.

• A model which utilized only the k least similar sentences.

• A model which utilized only k random sentences (but none of the top or bottom k sentences; the sentences were chosen at random from the interval $(k, n-k)$, where $n$ is the number of sentences in the document and sentences are sorted from the most similar to the least similar).

Figure 3: Annotations generated using our method for the abstract from Figure 1. The sentence which was found to be the most similar to the description for "MC 1: Animal model" is highlighted in yellow and the most similar sequence of words within that sentence is highlighted in red. The text we are looking for is highlighted with bold underlined text. For this example we ran our method on the abstract of the target document rather than the full text and highlighted only the single most similar sentence.

Figure 4: Annotations generated using our method for "MC 2: Group size". The highlighting used is the same as in Figure 3.

The only difference between the four models is which sentences from each document are passed to the classifier for training and testing. The intuition is that a classifier utilizing the correct sentences should outperform the other models.

To avoid selecting the same sentences across the three models, we removed documents which contained fewer than 3k sentences (so that three disjoint sets of k sentences could be drawn; row "Number of documents" in Table 3 shows how many documents satisfied this condition). In all of the experiments presented in this section, the publication full text was tokenized, lower-cased, and stemmed, and stop words were removed. All models used a Bernoulli Naïve Bayes classifier (the scikit-learn implementation, with a uniform class prior) trained on binary occurrence matrices created using 1-3-grams extracted from the publications, with n-grams appearing in only one document removed. The complete results obtained from leave-one-out cross validation are shown in Table 3. In all cases we report classification accuracy. In the case of the random-k sentences model, the accuracy was averaged over 10 runs of the model.

We compare the results to two baselines: (1) a baseline obtained by classifying all documents as belonging to the majority class (baseline 1 in Table 3) and (2) a baseline obtained using the same setup (features and classification algorithm) as in the top-/random-/bottom-k sentences models but utilizing all full text instead of selected sentences (baseline 2 in Table 3).
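The classification setup can be sketched with scikit-learn as follows. Fitting the vectorizer before the cross-validation split mirrors the global removal of single-document n-grams described above; the helper name and minor details are our assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import LeaveOneOut
    from sklearn.naive_bayes import BernoulliNB

    def loo_accuracy(documents, labels):
        """Leave-one-out accuracy of a Bernoulli Naive Bayes classifier on
        binary 1-3-gram occurrence features. `documents` holds either the
        full text or only the selected k sentences per publication."""
        # binary=True gives occurrence (not count) features; min_df=2 drops
        # n-grams appearing in only one document.
        vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True, min_df=2)
        X = vectorizer.fit_transform(documents)
        y = np.asarray(labels)
        correct = 0
        for train_idx, test_idx in LeaveOneOut().split(X):
            clf = BernoulliNB(fit_prior=False)  # uniform class prior
            clf.fit(X[train_idx], y[train_idx])
            correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
        return correct / len(y)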
Figure 5: Annotations generated using our method for "MC 3: Route of administration". The highlighting used is the same as in Figure 3.

Approach | MC1 | MC2 | MC3 | MC4 | MC5 | MC6
Baseline 1: Most frequent label | 70.35 | 94.43 | 88.74 | 59.48 | 84.30 | 60.44
Baseline 2: All full text | 78.25 | 92.06 | 89.59 | 67.94 | 84.83 | 74.05
Top-k sentences | 76.84 | 91.55 | 87.71 | 68.35 | 88.54 | 74.23
Bottom-k sentences | 70.00 | 91.39 | 88.23 | 63.10 | 80.60 | 63.70
Random-k sentences | 73.26 | 93.72 | 88.43 | 65.65 | 85.29 | 68.28
Number of documents | 570 | 592 | 586 | 496 | 567 | 551
Number of pos. labels | 169 | 559 | 520 | 201 | 478 | 333

Table 3: Evaluation results (classification accuracy, %).

Table 3 shows that for four out of the six criteria (MC 1, MC 4, MC 5, and MC 6) the top-k sentences model outperforms baseline 1 as well as the bottom-k and random-k sentences models by a significant margin. Furthermore, for three of the six criteria (MC 4, MC 5, and MC 6) the top-k sentences model also outperforms the baseline 2 model (the model which utilized all full text). This seems to confirm our hypothesis that semantic relatedness of sentences to the criteria descriptions helps in identifying sentences discussing the criteria, especially given that for these three criteria the top-k sentences model outperforms baseline 2 despite being given less information to learn from (selected sentences only vs. all full text).

For two of the criteria (MC 2 and MC 3) this is not the case: the top-k sentences model performs worse than both other models in the case of MC 3 and worse than the random-k model in the case of MC 2. One possible explanation is class imbalance. In the case of MC 2, only 33 out of 592 publications (5.57%) represent negative examples (Table 3). As the top-k sentences model picks only sentences closely related to MC 2, it is possible that due to the class imbalance the top sentences do not contain enough negative examples to learn from. On the other hand, the bottom-k and random-k sentences models may select text not necessarily related to the criteria but potentially containing linguistic patterns which the model learns to associate with the criteria; for example, certain chemicals may require the use of a study protocol which is not aligned with the MC, and the model may key in on the appearance of these chemicals in text rather than on the appearance of MC indicators. The situation is similar in the case of MC 3. We would like to emphasize that the goal of this experiment was not to achieve state-of-the-art results but to investigate empirically the viability of utilizing semantic relatedness of text segments to criteria descriptions for identifying relevant segments.
5 Related Work

In this section we present the studies most similar to our work. We focus on unsupervised methods for information extraction from biomedical texts.

Many methods for biomedical data annotation and extraction exist which utilize labeled data and supervised learning approaches ((Liu et al., 2016) and (Gonzalez et al., 2015) provide a good overview of a number of these methods); however, unsupervised approaches in this area are much scarcer. One such approach was introduced by (Zhang and Elhadad, 2013), who proposed a model for unsupervised Named Entity Recognition. Similar to our approach, their model is based on calculating the similarity between vector representations of candidate phrases and existing entities. However, their vector representations are created using a combination of TF-IDF weights and word context information, and their method relies on a terminology. More recently, (Chen and Sokolova, 2018) utilized Word2Vec and Doc2Vec embeddings for unsupervised sentiment classification in medical discharge summaries.

A number of previous studies have focused on unsupervised extraction of relations such as protein-protein interactions (PPI) from biomedical texts. For example, (Quan et al., 2014) utilized several techniques, namely kernel-based pattern clustering and dependency parsing, to extract PPI from biomedical texts. (Alicante et al., 2016) introduced a system for unsupervised extraction of entities and relations between these entities from clinical texts written in Italian, which utilized a thesaurus for extraction of entities and clustering methods for relation extraction. (Rink and Harabagiu, 2011) also used clinical texts and proposed a generative model for unsupervised relation extraction. Another approach focusing on relation extraction was proposed by (Madkour et al., 2007); it constructs a graph which is then used to derive domain-independent patterns for extracting protein-protein interactions.

A similar but distinct approach to unsupervised extraction is distant supervision. Like unsupervised extraction methods, distant supervision methods do not require any labeled data, but they make use of weakly labeled data, such as data extracted from a knowledge base. Distant supervision has been applied to relation extraction (Liu et al., 2014), extraction of gene interactions (Mallory et al., 2015), PPI extraction (Thomas et al., 2012; Bobić et al., 2012), and identification of PICO elements (Wallace et al., 2016). The advantage of our approach compared to distantly supervised methods is that it does not require an underlying knowledge base or a similar source of data.
6 Conclusions and Future Work

In this paper we presented a method for unsupervised identification of text segments relevant to specific sought-after information being extracted from scientific documents. Our method is entirely unsupervised and only requires the current document itself and the input descriptions, rather than a corpus linked to this document. The method utilizes short descriptions of the information being extracted from the documents and the ability of word embeddings to capture word context. Consequently, it is domain independent and can potentially be applied to another set of documents and criteria with minimal effort. We applied the method to a corpus of toxicology documents and a set of guideline protocol criteria to be extracted from the documents, and showed that the identified text segments are highly accurate. Furthermore, a binary classifier trained to identify publications that met the criteria performed better when trained on the candidate sentences than when trained on sentences randomly picked from the text, supporting our intuition that our method is able to accurately identify relevant text segments in full text documents.

There are a number of things we plan to investigate next. In our initial experiment we utilized criteria descriptions which were not designed to be used by our model. One possible improvement could be replacing the current descriptions with example sentences, taken from the documents, which contain the sought-after information. We also plan to test our method on an annotated dataset, for example using existing annotated PICO element datasets (Boudin et al., 2010).
Acknowledgments

Support for this research was provided by a grant from the National Institute of Environmental Health Sciences (AES 16002-001), National Institutes of Health, to Oak Ridge National Laboratory. This research was supported in part by an appointment to the Oak Ridge National Laboratory ASTRO Program, sponsored by the U.S. Department of Energy and administered by the Oak Ridge Institute for Science and Education.

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

References
Anita Alicante, Anna Corazza, Francesco Isgrò, and Stefano Silvestri. 2016. Unsupervised entity and relation extraction from clinical records in Italian. Computers in Biology and Medicine, 72:263–275.

Tamara Bobić, Roman Klinger, Philippe Thomas, and Martin Hofmann-Apitius. 2012. Improving distantly supervised extraction of drug-drug and protein-protein interactions. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 35–43. Association for Computational Linguistics.

Florian Boudin, Jian-Yun Nie, Joan C Bartlett, Roland Grad, Pierre Pluye, and Martin Dawes. 2010. Combining classifiers for robust PICO element detection. BMC Medical Informatics and Decision Making, 10(1):29.

Qufei Chen and Marina Sokolova. 2018. Word2vec and doc2vec in unsupervised sentiment analysis of clinical discharge summaries. arXiv preprint arXiv:1805.00352.

Graciela H Gonzalez, Tasnia Tahsin, Britton C Goodale, Anna C Greene, and Casey S Greene. 2015. Recent advances and emerging applications in text and data mining for biomedical discovery. Briefings in Bioinformatics, 17(1):33–42.

Siddhartha R. Jonnalagadda, Pawan Goyal, and Mark D. Huffman. 2015. Automating data extraction in systematic reviews: a systematic review. Systematic Reviews, 4(1).

Richard Judson, Ann Richard, David J Dix, Keith Houck, Matthew Martin, Robert Kavlock, Vicki Dellarco, Tala Henry, Todd Holderman, Philip Sayre, et al. 2009. The toxicity data landscape for environmental chemicals. Environmental Health Perspectives, 117(5):685.

Nicole C. Kleinstreuer, Patricia C. Ceger, David G. Allen, Judy Strickland, Xiaoqing Chang, Jonathan T. Hamm, and Warren M. Casey. 2016. A curated database of rodent uterotrophic bioactivity. Environmental Health Perspectives, 124(5).

Esther Landhuis. 2016. Scientific literature: Information overload. Nature, 535(7612):457–458.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Feifan Liu, Jinying Chen, Abhyuday Jagannatha, and Hong Yu. 2016. Learning for biomedical information extraction: Methodological review of recent advances. arXiv preprint arXiv:1606.07993.

Mengwen Liu, Yuan Ling, Yuan An, and Xiaohua Hu. 2014. Relation extraction from biomedical literature with minimal supervision and grouping strategy. In Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on, pages 444–449. IEEE.

Amgad Madkour, Kareem Darwish, Hany Hassan, Ahmed Hassan, and Ossama Emam. 2007. BioNoculars: extracting protein-protein interactions from biomedical text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 89–96. Association for Computational Linguistics.

Emily K Mallory, Ce Zhang, Christopher Ré, and Russ B Altman. 2015. Large-scale extraction of gene interactions from full-text literature using DeepDive. Bioinformatics, 32(1):106–113.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

Amit Mishra and Sanjay Kumar Jain. 2016. A survey on question answering systems with classification. Journal of King Saud University-Computer and Information Sciences, 28(3):345–361.

OECD. 2007. Test No. 440: Uterotrophic Bioassay in Rodents. In OECD Guidelines for the Testing of Chemicals, Section 4. OECD Publishing, Paris.

Changqin Quan, Meng Wang, and Fuji Ren. 2014. An unsupervised text mining method for relation extraction from biomedical literature. PLoS ONE, 9(7):e102039.

Bryan Rink and Sanda Harabagiu. 2011. A generative model for unsupervised discovery of relations and argument classes from clinical texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 519–528. Association for Computational Linguistics.

Neil R. Smalheiser. 2012. Literature-based discovery: Beyond the ABCs. Journal of the American Society for Information Science and Technology, 63(2):218–224.

Philippe Thomas, Tamara Bobić, Ulf Leser, Martin Hofmann-Apitius, and Roman Klinger. 2012. Weakly labeled corpora as silver standard for drug-drug and protein-protein interaction. In Proceedings of the Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM) on Language Resources and Evaluation Conference (LREC).

Guy Tsafnat, Paul Glasziou, Miew Keen Choong, Adam Dunn, Filippo Galgani, and Enrico Coiera. 2014. Systematic review automation technologies. Systematic Reviews, 3(1):74.

Byron C Wallace, Joel Kuiper, Aakash Sharma, Mingxi Zhu, and Iain J Marshall. 2016. Extracting PICO sentences from clinical trial reports using supervised distant supervision. The Journal of Machine Learning Research, 17(1):4572–4596.

Shaodian Zhang and Noémie Elhadad. 2013. Unsupervised biomedical named entity recognition: Experiments with clinical and biological texts. Journal of Biomedical Informatics, 46(6):1088–1098.
Appendix A: Supplemental Material