Experiment Segmentation in Scientific Discourse as Clause-level Structured Prediction using Recurrent Neural Networks
Pradeep Dasigi, Gully A.P.C. Burns, Eduard Hovy, and Anita de Waard

Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
Information Sciences Institute, Viterbi School of Engineering, University of Southern California, Marina del Rey, CA 90292, USA
Elsevier Research Data Services, Jericho, VT 05465, USA
Abstract
We propose a deep learning model for identifying structure within experiment narratives in scientific literature. We take a sequence labeling approach to this problem, and label clauses within experiment narratives to identify the different parts of the experiment. Our dataset consists of paragraphs taken from open access PubMed papers, labeled with rhetorical information as a result of our pilot annotation. Our model is a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells that labels clauses. The clause representations are computed by combining word representations using a novel attention mechanism that involves a separate RNN. We compare this model against LSTMs where the input layer has simple or no attention, and against a feature-rich CRF model. Furthermore, we describe how our work could be useful for information extraction from scientific literature.
Introduction
An important part of science is communicating results. There are well established rhetorical guidelines (Alley, 1996) for scientific writing that are used across disciplines, and consequently, narratives describing evidence within a scientific investigation are expected to have a certain structure. Typically, the description begins with background information which has already been established, followed by motivating hypotheses that introduce the experiment, the methods used, the results obtained, and the inferences made based on those results. Understanding this structure is important since it enables the higher-level construction of the general argument of the paper. The reader assembles the pieces in order to understand what was done, why it was done, what prior knowledge it builds upon and/or refutes, and with what certainty the final conclusions should be accepted. Without such an overall model of the experiment, the reader has nothing but basic assertions.

In this work, our aim is to identify these discourse elements given an experiment narrative. We view the task at hand as a sequence labeling problem: given a sequence of clauses from a paragraph describing an experiment, we seek to label the clauses with their discourse type. There exist several proposals for experiment discourse models (Liakata, 2010; Nawaz, Thompson, and Ananiadou, 2010; Mizuta and Collier, 2004; Nwogu, 1997). We adopt the discourse type taxonomy for biological papers suggested by De Waard and Pander Maat (2012), and define our problem as identifying the discourse type of each clause in a given experiment description. The taxonomy contains seven types (Table 1), and Figure 1 shows an example paragraph broken down into clauses and tagged with discourse types.

While there has been some variation in the level of granularity of text in prior discourse processing work, for our task the appropriate level of processing is clearly the clause level. As shown in Figure 1, many sentences in our data contain clauses of different kinds. For example, a pattern we observe frequently is when an author writes "To understand phenomenon X, we performed experiment Y", yielding a 'goal' followed by a 'method' clause in a single sentence. Using the main and subordinate clauses from a Stanford parse provided good segregation of this structure; a minimal sketch appears at the end of this section.

For this work, we focus on Systems Biology (SB) papers concerning signaling pathways in cancer cells. Typically, researchers in this field use a number of small-scale experimental assays to investigate molecular events; see Voit (2012) and Svoboda and Reenstra (2002) for textbook and review introductions. There can easily be as many as 20-30 separate small experiments in any study, each providing evidence for the interpretive assertions being made. Our goal is to partition the text of SB papers to identify small-scale passages that describe the goals, methods, results and implications of each experiment. By convention, subfigures denoted by '1A', '1B', '1C' etc. each describe data from a separate experiment and are directly referenced in the narrative (see Figure 1).

Figure 1: Example of an experiment description tagged with discourse types. From Angers-Loustau et al. (1999), "Protein tyrosine phosphatase-PEST regulates focal adhesion disassembly, migration, and cytokinesis in fibroblasts", J. Cell Bio. 144:1019-31.
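As a minimal illustration of this clause splitting (not the actual preprocessing code), the following sketch treats every S/SBAR subtree of a hand-written constituency parse as a clause; real Stanford Parser output and a deduplication step over nested spans would be needed in practice.

```python
# Minimal sketch of clause extraction from a constituency parse.
# Assumptions: the parse string is hand-written (not real Stanford Parser
# output), and treating every S/SBAR subtree as a clause is a
# simplification; nested clauses also appear inside their parent here,
# whereas a real segmenter would deduplicate the spans.
from nltk import Tree

parse = Tree.fromstring(
    "(ROOT (S"
    " (S (VP (TO To) (VP (VB understand) (NP (NN phenomenon) (NN X)))))"
    " (, ,) (NP (PRP we))"
    " (VP (VBD performed) (NP (NN experiment) (NN Y)))))"
)

def clause_spans(tree):
    """Yield the word sequence under every clause-level (S/SBAR) node."""
    for subtree in tree.subtrees(lambda t: t.label() in ("S", "SBAR")):
        yield " ".join(subtree.leaves())

for span in clause_spans(parse):
    print(span)
# -> "To understand phenomenon X , we performed experiment Y" (matrix clause)
#    "To understand phenomenon X"                              (goal clause)
```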
Related Work

Identifying structure within scientific papers
There is a significant amount of prior work aimed at scientific discourse processing. Teufel and Moens (2002) and Teufel and Moens (1999) describe argumentative zoning (AZ), a way of classifying scientific papers at a sentence level into zones, thus extracting the structure from entire papers. Hirohata et al. (2008) use a 4-way classification scheme for abstracts of scientific papers, identifying objectives, methods, results and conclusions. Liakata (2010) described a three-layer finer grained annotation scheme for sentence-level annotation (with 11 separate categorical labels) for identifying the core scientific concepts of papers. Classification performance for machine learning systems automatically tagging scientific sentences was F-score = 0.51 for LibSVM classifiers (Liakata et al., 2012). There is extensive overlap between leaf elements of the CoreSC schema and our simpler discourse type model: 'Hypothesis', 'Goal', 'Method', and 'Result' are shared between both annotation sets, and tags like 'Background' and 'Conclusion' map to our 'fact' and 'implication' tags.

Guo et al. (2010) used SVM and Naïve Bayes classifiers to compare the three schemes described above. Gupta and Manning (2011) also studied the problem of extracting the focus, techniques and domain of research papers to identify the influence of research communities over each other.

In these studies, the focus of research is largely centered on modeling the discourse being used to construct a scientific argument, driving towards understanding "sentiment expressed towards cited work, ownership of ideas, and speech acts which express rhetorical statements typical for scientific argumentation" (Teufel, 2000). These are driven by human-to-human communication processes of the scientific literature, rather than using discourse elements to support machine reading of a semantic representation of scientific findings from primary experimental research papers. Our focus is specifically on identifying text pertaining to experimental evidence for scientific IE, rather than focusing on authors' interpretations of those findings.

Deep Learning for structured prediction and text classification
There is a great amount of work in classification and structured prediction over text and other modalities that uses deep learning. Particularly in sequence labeling tasks in text (Collobert et al., 2011), words are represented as vectors and used as features to train a tagger. One advantage of this approach is the reusability of pre-trained word vectors (Mikolov et al., 2014; Pennington, Socher, and Manning, 2014) as features in various tasks. In our task, the sequences being labeled are clauses instead of words. We obtain vector representations of clauses by summarizing those of the words in the clauses.

Table 1: Seven label taxonomy from De Waard and Pander Maat (2012)

  Goal        - Research goal
  Fact        - A known fact, a statement taken to be true by the author
  Result      - The outcome of an experiment
  Hypothesis  - A claim proposed by the author
  Method      - Experimental method
  Problem     - An unresolved or contradictory issue
  Implication - An interpretation of the results

Attention has been used for complex tasks like question answering (Hermann et al., 2015) and machine translation (Bahdanau, Cho, and Bengio, 2014). In sequence-to-sequence learning problems like machine translation (Bahdanau, Cho, and Bengio, 2014), parsing (Vinyals et al., 2015) and image caption generation (Xu et al., 2015), one network is used to encode the input modality and a different network to decode into the output modality, with the decoder using attention to learn which parts of the input to attend to when generating a given part of the output sequence. While our work does use two different models, one for encoding clause representations as a function of word representations and another for decoding clause labels from clause representations, the two models operate at different granularities.
Comparison with RST based discourse parsing
General domain discourse parsing is a well-studied problem. While there are many discourse theories (see Marcu (2000), chapter 2 for an overview), Rhetorical Structure Theory (RST) by Mann and Thompson (1988) has received a lot of attention. It is generally accepted that relations between non-overlapping chunks of text need to be considered to account for the overall meaning (Marcu, 2000). Accordingly, rhetorical relations are central in RST for marking the structure. In contrast, the taxonomy we use applies to the clauses themselves, instead of the relations between them. This is made possible by the specificity of our domain: in the general case, it may not be possible to identify the type of a clause in isolation. However, it has to be noted that the information conveyed by our clause-centric formalism may also be expressed using a relation-centric discourse formalism like RST. Figure 2 shows one possible RST tree for the text shown in Figure 1.

Figure 2: Tree using RST relations for the text in Figure 1. The numbers indicate the clauses from the text shown in Figure 1.
Approach
We call our system Scientific Discourse Tagger (SciDT). Our pipeline is shown in Figure 3. The input to the tagger is a set of clauses from a paragraph. They are first embedded to obtain a tensor $D \in \mathbb{R}^{c \times w \times d}$, where $c$ is the number of clauses, $w$ is the number of words in each clause, and each word is represented as a $d$-dimensional vector. The tensors are zero-padded along the clause and word dimensions if needed. The next step is to summarize the clause representations to obtain $D_{summ} \in \mathbb{R}^{c \times d}$, a matrix corresponding to the entire paragraph, with one vector per clause. Finally, $D_{summ}$ is fed to a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) to label the clauses. We propose two ways of summarizing the clause representations below. Both variants use attention to learn the weights of words within a clause based on their importance for the labeling task, and compute a weighted average of the word representations using those weights to get the clause representation. The attention component and the LSTM-RNN are trained jointly. We use pretrained representations for words and fix them during training. The tensor shapes involved are sketched below.
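To make these shapes concrete, here is a minimal numpy sketch of the data flow with illustrative sizes; the uniform attention weights are a placeholder for the learned weights described in the next subsection.

```python
# Shape walk-through of the SciDT pipeline with illustrative sizes;
# the uniform attention below stands in for the learned attention.
import numpy as np

c, w, d = 10, 20, 200     # clauses per paragraph, words per clause, vector size
D = np.zeros((c, w, d))   # embedded paragraph, zero-padded along c and w

A = np.full((c, w), 1.0 / w)             # one weight per word (placeholder)
D_summ = np.einsum("cw,cwd->cd", A, D)   # weighted sum of word vectors
assert D_summ.shape == (c, d)            # one vector per clause -> LSTM input
```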
Attention with and without context

Both variants take as input the tensor $D$. The output in both cases is a matrix $A \in \mathbb{R}^{c \times w}$, which contains the attention weights of all the words in the paragraph. We first project the input words to a lower dimensional space in both cases, using a projection operator $P \in \mathbb{R}^{d \times p}$:

$D_l = \tanh(D \cdot P) \in \mathbb{R}^{c \times w \times p}$  (1)

The low dimension representations are then scored differently by each variant.

Out of context
This model defines a simple scoring operator $s_s \in \mathbb{R}^p$ that scores each word based only on its low dimension representation. The scoring is out of context because each word is scored in isolation.

$D_l^i = D_l[i, :, :] \in \mathbb{R}^{w \times p}$  (2)
$a_s^i = \mathrm{softmax}(D_l^i \cdot s_s) \in \mathbb{R}^w$  (3)
$A_s = [a_s^1\ a_s^2\ \cdots\ a_s^c] \in \mathbb{R}^{c \times w}$  (4)

Equation 2 corresponds to selecting all the words in the $i$th clause of the paragraph. Equation 3 shows the computation of attention scores for all words in the clause, and Equation 4 simply puts the clause scores together to get the paragraph-level attention values.
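The computation above can be sketched in a few lines of numpy; the randomly initialized parameters here stand in for the learned ones.

```python
# Numpy sketch of the out-of-context scorer (Equations 1-4). P and s_s
# are random stand-ins; in the model they are learned jointly with the LSTM.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

c, w, d, p = 10, 20, 200, 50
rng = np.random.default_rng(0)
D = rng.normal(size=(c, w, d))   # embedded paragraph
P = rng.normal(size=(d, p))      # projection operator
s_s = rng.normal(size=p)         # scoring operator

D_l = np.tanh(D @ P)                                       # Eq. 1: (c, w, p)
A_s = np.stack([softmax(D_l[i] @ s_s) for i in range(c)])  # Eqs. 2-4: (c, w)
```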
Clause context

In this variant, we score words in a clause in the context of other words that occur in the clause. Concretely, as shown in the equations below, this is a recurrent scoring mechanism that uses an RNN to score each word in a clause as a function of its low-dimension representation and its previous context in the clause, given by the hidden layer of the RNN. It has to be noted that the recurrence in this scoring model is over words in a clause, while that in the LSTM described previously is over clauses.

$D_l^i = D_l[i, :, :] \in \mathbb{R}^{w \times p}$  (5)
$h_j^i = \tanh(D_l^i[j, :] \cdot W_{IH} + h_{j-1}^i \cdot W_{HH}) \in \mathbb{R}^p$  (6)
$a_r^i = \mathrm{softmax}(h^i \cdot s_r) \in \mathbb{R}^w$  (7)
$A_r = [a_r^1\ a_r^2\ \cdots\ a_r^c] \in \mathbb{R}^{c \times w}$  (8)

Equation 5, Equation 7 and Equation 8 are similar to Equation 2, Equation 3 and Equation 4 respectively. The operator $s_r \in \mathbb{R}^p$ is similar to $s_s$ from simple attention. In Equation 6, we apply the standard RNN recurrence to update the hidden state, using the parameter $W_{IH} \in \mathbb{R}^{p \times p}$, operating on the input word at the current timestep $j$, and $W_{HH} \in \mathbb{R}^{p \times p}$, operating on the hidden state from the previous timestep $j-1$. In Equation 7, $h^i \in \mathbb{R}^{w \times p}$ stacks the hidden states of all words in the clause.
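A corresponding sketch of the recurrent scorer, again with random stand-ins for the learned parameters:

```python
# Numpy sketch of the clause-context scorer (Equations 5-8): a plain RNN
# runs over the words of each clause, and the hidden states are scored.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

c, w, p = 10, 20, 50
rng = np.random.default_rng(0)
D_l = rng.normal(size=(c, w, p))   # low-dimension words (output of Eq. 1)
W_IH = rng.normal(size=(p, p))     # input-to-hidden weights
W_HH = rng.normal(size=(p, p))     # hidden-to-hidden weights
s_r = rng.normal(size=p)           # scoring operator

def clause_attention(D_il):
    """D_il: (w, p) words of one clause -> (w,) attention distribution."""
    h = np.zeros(p)
    hidden = []
    for j in range(D_il.shape[0]):
        h = np.tanh(D_il[j] @ W_IH + h @ W_HH)   # Eq. 6
        hidden.append(h)
    return softmax(np.stack(hidden) @ s_r)       # Eq. 7

A_r = np.stack([clause_attention(D_l[i]) for i in range(c)])  # Eq. 8: (c, w)
```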
Input to LSTM

A weighted sum of the input tensor is computed, with the weights coming from the attention model, and it is fed to the LSTM:

$D_{summ}[i, :] = A[i, :] \cdot D[i, :, :] \in \mathbb{R}^d$  (9)

The above equation shows the composed representation of the $i$th clause, stored as the $i$th row of $D_{summ} \in \mathbb{R}^{c \times d}$.

Experiments
Implementation Details
Figure 3: Our Scientific Discourse Tagging (SciDT) pipeline. The input is a list of clauses, which is embedded to get the tensor $D$, containing one vector per word. Summarization involves converting $D$ into a matrix $D_{summ}$ that has one vector per clause, which is then passed to an LSTM-RNN for labeling. The lower part of the figure shows the two ways of summarizing the $i$th clause (given by $D[i]$).

We used the 200-dimensional vectors trained on PubMed Central data by Pyysalo et al. (2013) as input representations, and projected them down to 50 dimensions to keep the parameter space of the entire model under control. The projection operator is also trained along with the rest of the pipeline. LSTMs were implemented using Keras (Chollet, 2015) and attention using Theano (Bergstra et al., 2010); our code is publicly available at https://github.com/edvisees/sciDT. We trained for at most 100 epochs, while monitoring accuracy on held-out validation data for early stopping. We used ADAM (Kingma and Ba, 2014) as the optimization algorithm. Dropout was used on the input to the attention layer.
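As a starting point, the following is a hedged sketch of this training setup written against the current Keras API, not the authors' released code (which used the 2015 Keras/Theano stack); the data is random, and the LSTM width, dropout rate and early-stopping patience are illustrative assumptions rather than reported values.

```python
# Hedged sketch of the training setup. Random stand-in data; the LSTM
# width (64), dropout rate (0.5) and patience (5) are assumptions.
import numpy as np
import keras

c, d, n_labels = 10, 200, 8              # clauses, vector size, label count
X = np.random.normal(size=(64, c, d))    # stand-in for summarized clauses
y = keras.utils.to_categorical(
    np.random.randint(n_labels, size=(64, c)), n_labels)

model = keras.Sequential([
    keras.Input(shape=(c, d)),
    keras.layers.Dropout(0.5),                     # rate assumed
    keras.layers.LSTM(64, return_sequences=True),  # one output per clause
    keras.layers.Dense(n_labels, activation="softmax"),
])
model.compile(optimizer="adam",                    # ADAM, as in the paper
              loss="categorical_crossentropy", metrics=["accuracy"])

stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5)
model.fit(X, y, epochs=100,                        # "at most 100 epochs"
          validation_split=0.1, callbacks=[stop])
```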
Data Preprocessing, Annotations and Pipelines

We created a scientific discourse marked dataset from 75 papers in the area of intercellular cancer pathways, taken from the Open Access subset of PubMed Central. Using a multithreaded preprocessing pipeline, we extracted the Results sections of each of those papers, and parsed all the sentences using the Stanford Parser (Socher et al., 2013). This process separated the main and subordinate clauses of each sentence, which we process as a sequence over separate paragraphs. We asked domain experts to label each of those clauses using the seven label taxonomy suggested by De Waard and Pander Maat (2012). We also added a None label for those clauses that do not fall under any category. Each sequence in the dataset corresponds to the clauses extracted from a paragraph, so we make the assumption that paragraphs are minimal experiment narratives. On the whole, our dataset consists of 392 experiment descriptions with a total of 4497 clauses. This is an ongoing annotation effort, and we intend to make a bigger dataset in the future. Please contact the authors if you would like to use this dataset for your research.

Figure 4 shows $p(position \mid type)$ values of discourse types at various positions in a paragraph, estimated from the entire annotated dataset; the computation is sketched below. It can be seen that goal, fact, problem and hypothesis are more likely at the beginning of a paragraph compared to other locations, whereas method peaks before the middle, result at the middle, and implication clearly towards the end of a paragraph. This trend supports the expected narrative structure described in the Introduction.
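A toy version of the position-profile computation: each paragraph is split into five equal position bins and $p(position \mid type)$ is estimated per label (the two labeled paragraphs below are made up for illustration).

```python
# Sketch of the computation behind Figure 4, on made-up toy paragraphs.
from collections import Counter, defaultdict

paragraphs = [
    ["goal", "method", "method", "result", "result", "implication"],
    ["fact", "hypothesis", "method", "result", "implication"],
]

counts = defaultdict(Counter)            # counts[label][position_bin]
for para in paragraphs:
    for i, label in enumerate(para):
        position_bin = min(int(5 * i / len(para)), 4)   # bins 0..4
        counts[label][position_bin] += 1

for label, ctr in counts.items():
    total = sum(ctr.values())
    print(label, [round(ctr[b] / total, 2) for b in range(5)])
```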
Figure 4: Chart showing probabilities of each discourse type at various positions in a paragraph. Each paragraph is broken into five parts.

Table 2: Accuracies and F-scores from 5-fold cross validation of SciDT in various settings and a CRF. Simple attention corresponds to the out of context variant, and recurrent to the clause context variant.

  Model  Attention  Accuracy  F-score
  CRF    -          0.6942    0.6818
  SciDT  None       0.6912    0.6749
  SciDT  Simple     0.7379    0.7261
  SciDT  Recurrent

Figure 5: Examples of attention probabilities assigned by the recurrent model to words in parts of clauses and their correctly predicted labels. Darker shades show higher attention weights.
Results and Analysis
A baseline model we compare against is a Conditional Random Field (Lafferty, McCallum, and Pereira, 2001) that uses as features part-of-speech tags, the identities of verbs and adverbs, the presence of figure references and citations, and hand-crafted lexicon features that indicate specific discourse types (for example, words like demonstrate and suggest indicate implication; phrases like data not shown indicate result); a sketch of this feature set appears at the end of this section. In addition, we also test a variant of our model that does not use attention in the input layer, where the clause vectors are obtained simply as an average of the vectors of the words in them. Table 2 shows accuracy scores and weighted averages of F-scores from 5-fold cross validation of the two baseline models and the two attention based models; F-scores of all classes were averaged, weighted by the number of points within each class in the gold standard. The performance of the averaged-input SciDT is comparable to the CRF model, whereas the two attention models perform better. The performance of the recurrent attention SciDT model shows the importance of modeling context in attention.

On closer examination of the attention weights assigned to words in unseen paragraphs, we noticed stronger trends in the recurrent attention based model. In particular, the main verbs of the sentences received the highest attention in many cases in the recurrent model. That the verb form is an important indicator of discourse type was shown by De Waard and Pander Maat (2012). Figure 5 shows examples of parts of clauses and the attention weights assigned to the words. These indicate the general trend of words relevant to discourse classes being given higher attention: investigate whether indicates Goal, analysis is a Method word, and strongly suggest is a phrase expected in Implication.

Some of the errors made by the LSTM may be attributed to the model itself, but there were also some exceptions to the assumption that clauses are the smallest units of discourse. There are some infrequent cases where clauses had components of multiple discourse types in them. Moreover, the syntactic parser we used to separate clauses was sometimes inaccurate, resulting in incorrect clause splits.
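As a rough illustration of the baseline's feature set mentioned above, the following sketch shows the kind of clause features the CRF might consume; the lexicons and regular expressions are our assumptions, not the exact features used, and the real system derives POS tags and verb identities from the parser.

```python
# Illustrative clause features for the CRF baseline. Lexicons and regexes
# are assumed for the sketch, not taken from the actual system.
import re

GOAL_LEXICON = {"investigate", "examine", "determine"}        # assumed
IMPLICATION_LEXICON = {"suggest", "demonstrate", "indicate"}  # assumed

def clause_features(clause, pos_tags):
    """Map one clause (plus its POS tags) to a CRF feature dict."""
    tokens = clause.lower().split()
    return {
        "verbs": [w for w, t in zip(tokens, pos_tags) if t.startswith("VB")],
        "has_figure_ref": bool(re.search(r"fig(ure)?\.?\s*\d+[a-z]?",
                                         clause, re.I)),
        "has_citation": bool(re.search(r"\([^)]*\d{4}\)", clause)),
        "goal_lexicon": any(w in GOAL_LEXICON for w in tokens),
        "implication_lexicon": any(w in IMPLICATION_LEXICON for w in tokens),
    }

print(clause_features("To investigate whether X binds Y (Fig. 1A)",
                      ["TO", "VB", "IN", "NN", "VBZ", "NN"]))
```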
Discussion and Conclusion
We introduced a sequence labeling approach for identifying the discourse elements within experiment descriptions. Our model uses an attention mechanism over word representations to obtain clause representations. The results show that our attention based composition mechanism used to encode clauses adds value to the LSTM model. Visualizations show that the clause context model does indeed learn to attend to words important for the final tagging decision. In the future, we shall extend the idea of contextual attention to attend to words based on context at the paragraph level.

Our system can complement existing IE tools that operate on scientific literature, and provide useful epistemic and contextual features. Identifying the structure of experiments provides additional context information that can help various downstream tasks. For event co-reference, one can use the structure to more accurately resolve anaphora links; for example, a reference made in an implication statement is likely to refer to some entity in a result statement that it follows. It has to be noted that the taxonomy we used also provides epistemic information which is helpful in information extraction (IE): IE systems need not process clauses labeled as hypothesis or goal, since they do not contain events that actually occurred. Going forward, our goal is to read, assemble and model mechanisms describing complex phenomena from collections of relevant scientific documents.

The application of our methods can reveal a small-scale discourse structure to contextualize, report and interpret evidence from individual experiments in a fine-grained context. This could be used to support cyclic models of scientific reasoning, where data from individual experiments can be placed in an appropriate interpretive context within an informatics system that synthesizes knowledge across many papers. Beyond the scope of direct applications to IE, this work may be applied to Semantic Web representations of scientific knowledge and biocuration pipelines to accelerate knowledge acquisition.

References
Alley, M. 1996. The Craft of Scientific Writing. Springer Science & Business Media.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bergstra, J.; Breuleux, O.; Bastien, F.; Lamblin, P.; Pascanu, R.; Desjardins, G.; Turian, J.; Warde-Farley, D.; and Bengio, Y. 2010. Theano: A CPU and GPU math compiler in Python. In van der Walt, S., and Millman, J., eds., Proceedings of the 9th Python in Science Conference, 3-10.

Chollet, F. 2015. Keras. https://github.com/fchollet/keras.

Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.

De Waard, A., and Pander Maat, H. 2012. Verb form indicates discourse segment type in biological research papers: Experimental evidence. Journal of English for Academic Purposes.

Guo, Y.; Korhonen, A.; Liakata, M.; Silins, I.; Sun, L.; and Stenius, U. 2010. Identifying the information structure of scientific abstracts: An investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, 99-107. Association for Computational Linguistics.

Gupta, S., and Manning, C. 2011. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of the 5th International Joint Conference on Natural Language Processing, 1-9. Chiang Mai, Thailand: Asian Federation of Natural Language Processing.

Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1684-1692.

Hirohata, K.; Okazaki, N.; Ananiadou, S.; and Ishizuka, M. 2008. Identifying sections in scientific abstracts using conditional random fields.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Liakata, M. 2010. Zones of conceptualisation in scientific papers: A window to negative and speculative statements. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, 1-4. Association for Computational Linguistics.

Liakata, M.; Saha, S.; Dobnik, S.; Batchelor, C.; and Rebholz-Schuhmann, D. 2012. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics (Oxford, England).

Mann, W. C., and Thompson, S. A. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse.

Marcu, D. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press.

Mizuta, Y., and Collier, N. 2004. An annotation scheme for a rhetorical analysis of biology articles. In LREC, 1737-1740.

Nawaz, R.; Thompson, P.; and Ananiadou, S. 2010. Evaluating a meta-knowledge annotation scheme for bio-events. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, 69-77. Association for Computational Linguistics.

Nwogu, K. N. 1997. The medical research paper: Structure and functions. English for Specific Purposes.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, 1532-43.

Pyysalo, S.; Ginter, F.; Moen, H.; Salakoski, T.; and Ananiadou, S. 2013. Distributional semantics resources for biomedical text processing. In Proceedings of Languages in Biology and Medicine.

Socher, R.; Bauer, J.; Manning, C. D.; and Ng, A. Y. 2013. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

Svoboda, K. K. H., and Reenstra, W. R. 2002. Approaches to studying cellular signaling: A primer for morphologists. The Anatomical Record.

Teufel, S., and Moens, M. 1999. In Towards Standards and Tools for Discourse Tagging, 84-93.

Teufel, S., and Moens, M. 2002. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics.

Teufel, S. 2000. Argumentative Zoning: Information Extraction from Scientific Text. Ph.D. Dissertation, School of Cognitive Science, University of Edinburgh, Edinburgh.

Vinyals, O.; Kaiser, Ł.; Koo, T.; Petrov, S.; Sutskever, I.; and Hinton, G. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2755-2763.

Voit, E. 2012. A First Course in Systems Biology. Garland Science, 1st edition.

Xu, K.; Ba, J.; Kiros, R.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.