Experiment Segmentation in Scientific Discourse as Clause-level Structured Prediction using Recurrent Neural Networks
Pradeep Dasigi, Gully A.P.C. Burns, Eduard Hovy, and Anita de Waard

Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
Information Sciences Institute, Viterbi School of Engineering, University of Southern California, Marina del Rey, CA 90292, USA
Elsevier Research Data Services, Jericho, VT 05465, USA
Abstract
We propose a deep learning model for identifying structure within experiment narratives in scientific literature. We take a sequence labeling approach to this problem, and label clauses within experiment narratives to identify the different parts of the experiment. Our dataset consists of paragraphs taken from open access PubMed papers, labeled with rhetorical information as a result of our pilot annotation. Our model is a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells that labels clauses. The clause representations are computed by combining word representations using a novel attention mechanism that involves a separate RNN. We compare this model against LSTMs where the input layer has simple or no attention, and against a feature-rich CRF model. Furthermore, we describe how our work could be useful for information extraction from scientific literature.
Introduction
An important part of science is communicating results. There are well established rhetorical guidelines (Alley, 1996) for scientific writing that are used across disciplines, and consequently, narratives describing evidence within a scientific investigation are expected to have a certain structure. Typically, the description begins with background information which has already been established, followed by motivating hypotheses that introduce the experiment, the methods used, the results obtained, and the inferences made based on those results. Understanding this structure is important since it enables the higher-level construction of the general argument of the paper. The reader assembles the pieces in order to understand what was done, why it was done, what prior knowledge it builds upon and/or refutes, and with what certainty the final conclusions should be accepted. Without such an overall model of the experiment, the reader has nothing but basic assertions.

In this work, our aim is to identify these discourse elements given an experiment narrative. We view the task at hand as a sequence labeling problem: given a sequence of clauses from a paragraph describing an experiment, we seek to label the clauses with their discourse type. There exist several proposals for experiment discourse models (Liakata, 2010; Nawaz, Thompson, and Ananiadou, 2010; Mizuta and Collier, 2004; Nwogu, 1997). We adopt the discourse type taxonomy for biological papers suggested by De Waard and Pander Maat (2012), and define our problem as identifying the discourse type of each clause in a given experiment description. The taxonomy contains seven types (Table 1), and Figure 1 shows an example paragraph broken down into clauses and tagged with discourse types.

While there has been some variation in the level of granularity of text in prior discourse processing work, for our task the appropriate level of processing is clearly the clause level. As shown in Figure 1, many sentences in our data contain clauses of different kinds. For example, a pattern we observe frequently is when an author writes "To understand phenomenon X, we performed experiment Y", yielding a 'goal' followed by a 'method' clause in a single sentence. Using the main and subordinate clauses from a Stanford parse provided good segregation of this structure; a minimal sketch appears at the end of this section.

For this work, we focus on Systems Biology (SB) papers concerning signaling pathways in cancer cells. Typically, researchers in this field use a number of small-scale experimental assays to investigate molecular events; see Voit (2012) and Svoboda and Reenstra (2002) for textbook and review introductions. There can easily be as many as 20-30 separate small experiments in any study, each providing evidence for the interpretive assertions being made. Our goal is to partition the text of SB papers to identify small-scale passages that describe the goals, methods, results and implications of each experiment. By convention, subfigures denoted by '1A', '1B', '1C' etc. each describe data from a separate experiment and are directly referenced in the narrative (see Figure 1).

Figure 1: Example of an experiment description tagged with discourse types. From Angers-Loustau et al. (1999), "Protein tyrosine phosphatase-PEST regulates focal adhesion disassembly, migration, and cytokinesis in fibroblasts", J. Cell Bio. 144:1019-31.
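As a minimal illustration of this clause splitting (not the actual preprocessing code), the following sketch treats every S/SBAR subtree of a hand-written constituency parse as a clause; real Stanford Parser output and a deduplication step over nested spans would be needed in practice.

```python
# Minimal sketch of clause extraction from a constituency parse.
# Assumptions: the parse string is hand-written (not real Stanford Parser
# output), and treating every S/SBAR subtree as a clause is a
# simplification; nested clauses also appear inside their parent here,
# whereas a real segmenter would deduplicate the spans.
from nltk import Tree

parse = Tree.fromstring(
    "(ROOT (S"
    " (S (VP (TO To) (VP (VB understand) (NP (NN phenomenon) (NN X)))))"
    " (, ,) (NP (PRP we))"
    " (VP (VBD performed) (NP (NN experiment) (NN Y)))))"
)

def clause_spans(tree):
    """Yield the word sequence under every clause-level (S/SBAR) node."""
    for subtree in tree.subtrees(lambda t: t.label() in ("S", "SBAR")):
        yield " ".join(subtree.leaves())

for span in clause_spans(parse):
    print(span)
# -> "To understand phenomenon X , we performed experiment Y" (matrix clause)
#    "To understand phenomenon X"                              (goal clause)
```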
Related Work

Identifying structure within scientific papers
There is a significant amount of prior work aimed at scientific discourse processing. Teufel and Moens (2002) and Teufel and Moens (1999) describe argumentative zoning (AZ), a way of classifying scientific papers at a sentence level into zones, thus extracting the structure from entire papers. Hirohata et al. (2008) use a 4-way classification scheme for abstracts of scientific papers, identifying objectives, methods, results and conclusions. Liakata (2010) described a three-layer finer grained annotation scheme for sentence-level annotation (with 11 separate categorical labels) for identifying the core scientific concepts of papers. Classification performance for machine learning systems automatically tagging scientific sentences was F-score = 0.51 for LibSVM classifiers (Liakata et al., 2012). There is extensive overlap between leaf elements of the CoreSC schema and our simpler discourse type model: 'Hypothesis', 'Goal', 'Method', and 'Result' are shared between both annotation sets, and tags like 'Background' and 'Conclusion' map to our 'fact' and 'implication' tags.

Guo et al. (2010) used SVM and Naïve Bayes classifiers to compare the three schemes described above. Gupta and Manning (2011) also studied the problem of extracting the focus, techniques and domain of research papers to identify the influence of research communities over each other.

In these studies, the focus of research is largely centered on modeling the discourse being used to construct a scientific argument, driving towards understanding "sentiment expressed towards cited work, ownership of ideas, and speech acts which express rhetorical statements typical for scientific argumentation" (Teufel, 2000). These are driven by human-to-human communication processes of the scientific literature, rather than using discourse elements to support machine reading of a semantic representation of scientific findings from primary experimental research papers. Our focus is specifically on identifying text pertaining to experimental evidence for scientific IE, rather than focusing on authors' interpretations of those findings.

Deep Learning for structured prediction and text classification
There is a great amount of work in classification and structured prediction over text and other modalities that uses deep learning. Particularly in sequence labeling tasks in text (Collobert et al., 2011), words are represented as vectors and used as features to train a tagger. One advantage of this approach is the reusability of pre-trained word vectors (Mikolov et al., 2014; Pennington, Socher, and Manning, 2014) as features in various tasks. In our task, the sequences being labeled are clauses instead of words. We obtain vector representations of clauses by summarizing those of the words in the clauses.

Table 1: Seven label taxonomy from De Waard and Pander Maat (2012)

  Goal        - Research goal
  Fact        - A known fact, a statement taken to be true by the author
  Result      - The outcome of an experiment
  Hypothesis  - A claim proposed by the author
  Method      - Experimental method
  Problem     - An unresolved or contradictory issue
  Implication - An interpretation of the results

Attention has been used for complex tasks like question answering (Hermann et al., 2015) and machine translation (Bahdanau, Cho, and Bengio, 2014). In sequence-to-sequence learning problems like machine translation (Bahdanau, Cho, and Bengio, 2014), parsing (Vinyals et al., 2015) and image caption generation (Xu et al., 2015), one network is used to encode the input modality and a different network to decode into the output modality, with the decoder using attention to learn which parts of the input to attend to when generating a given part of the output sequence. While our work does use two different models, one for encoding clause representations as a function of word representations and another for decoding clause labels from clause representations, the two models operate at different granularities.
Comparison with RST based discourse parsing
General domain discourse parsing is a well-studied problem. While there are many discourse theories (see Marcu (2000), chapter 2 for an overview), Rhetorical Structure Theory (RST) by Mann and Thompson (1988) has received a lot of attention. It is generally accepted that relations between non-overlapping chunks of text need to be considered to account for the overall meaning (Marcu, 2000). Accordingly, rhetorical relations are central in RST for marking the structure. In contrast, the taxonomy we use applies to the clauses themselves, instead of the relations between them. This is made possible by the specificity of our domain: in the general case, it may not be possible to identify the type of a clause in isolation. However, it has to be noted that the information conveyed by our clause-centric formalism may also be expressed using a relation-centric discourse formalism like RST. Figure 2 shows one possible RST tree for the text shown in Figure 1.

Figure 2: Tree using RST relations for the text in Figure 1. The numbers indicate the clauses from the text shown in Figure 1.
Approach
We call our system Scientific Discourse Tagger (SciDT). Our pipeline is shown in Figure 3. The input to the tagger is a set of clauses from a paragraph. They are first embedded to obtain a tensor $D \in \mathbb{R}^{c \times w \times d}$, where $c$ is the number of clauses, $w$ is the number of words in each clause, and each word is represented as a $d$-dimensional vector. The tensors are zero-padded along the clause and word dimensions if needed. The next step is to summarize the clause representations to obtain $D_{summ} \in \mathbb{R}^{c \times d}$, a matrix corresponding to the entire paragraph, with one vector per clause. Finally, $D_{summ}$ is fed to a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) to label the clauses. We propose two ways of summarizing the clause representations below. Both variants use attention to learn the weights of words within a clause based on their importance for the labeling task, and compute a weighted average of the word representations using those weights to get the clause representation. The attention component and the LSTM-RNN are trained jointly. We use pretrained representations for words and fix them during training. The tensor shapes involved are sketched below.
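To make these shapes concrete, here is a minimal numpy sketch of the data flow with illustrative sizes; the uniform attention weights are a placeholder for the learned weights described in the next subsection.

```python
# Shape walk-through of the SciDT pipeline with illustrative sizes;
# the uniform attention below stands in for the learned attention.
import numpy as np

c, w, d = 10, 20, 200     # clauses per paragraph, words per clause, vector size
D = np.zeros((c, w, d))   # embedded paragraph, zero-padded along c and w

A = np.full((c, w), 1.0 / w)             # one weight per word (placeholder)
D_summ = np.einsum("cw,cwd->cd", A, D)   # weighted sum of word vectors
assert D_summ.shape == (c, d)            # one vector per clause -> LSTM input
```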
Attention with and without context

Both variants take as input the tensor $D$. The output in both cases is a matrix $A \in \mathbb{R}^{c \times w}$, which contains the attention weights of all the words in the paragraph. We first project the input words to a lower dimensional space in both cases, using a projection operator $P \in \mathbb{R}^{d \times p}$:

$D_l = \tanh(D \cdot P) \in \mathbb{R}^{c \times w \times p}$  (1)

The low dimension representations are then scored differently by each variant.

Out of context
This model defines a simple scoring operator $s_s \in \mathbb{R}^p$ that scores each word based only on its low dimension representation. The scoring is out of context because each word is scored in isolation.

$D_l^i = D_l[i, :, :] \in \mathbb{R}^{w \times p}$  (2)
$a_s^i = \mathrm{softmax}(D_l^i \cdot s_s) \in \mathbb{R}^w$  (3)
$A_s = [a_s^1\ a_s^2\ \cdots\ a_s^c] \in \mathbb{R}^{c \times w}$  (4)

Equation 2 corresponds to selecting all the words in the $i$th clause of the paragraph. Equation 3 shows the computation of attention scores for all words in the clause, and Equation 4 simply puts the clause scores together to get the paragraph-level attention values.
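The computation above can be sketched in a few lines of numpy; the randomly initialized parameters here stand in for the learned ones.

```python
# Numpy sketch of the out-of-context scorer (Equations 1-4). P and s_s
# are random stand-ins; in the model they are learned jointly with the LSTM.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

c, w, d, p = 10, 20, 200, 50
rng = np.random.default_rng(0)
D = rng.normal(size=(c, w, d))   # embedded paragraph
P = rng.normal(size=(d, p))      # projection operator
s_s = rng.normal(size=p)         # scoring operator

D_l = np.tanh(D @ P)                                       # Eq. 1: (c, w, p)
A_s = np.stack([softmax(D_l[i] @ s_s) for i in range(c)])  # Eqs. 2-4: (c, w)
```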
Clause context

In this variant, we score words in a clause in the context of other words that occur in the clause. Concretely, as shown in the equations below, this is a recurrent scoring mechanism that uses an RNN to score each word in a clause as a function of its low-dimension representation and its previous context in the clause, given by the hidden layer of the RNN. It has to be noted that the recurrence in this scoring model is over words in a clause, while that in the LSTM described previously is over clauses.

$D_l^i = D_l[i, :, :] \in \mathbb{R}^{w \times p}$  (5)
$h_j^i = \tanh(D_l^i[j, :] \cdot W_{IH} + h_{j-1}^i \cdot W_{HH}) \in \mathbb{R}^p$  (6)
$a_r^i = \mathrm{softmax}(h^i \cdot s_r) \in \mathbb{R}^w$  (7)
$A_r = [a_r^1\ a_r^2\ \cdots\ a_r^c] \in \mathbb{R}^{c \times w}$  (8)

Equation 5, Equation 7 and Equation 8 are similar to Equation 2, Equation 3 and Equation 4 respectively. The operator $s_r \in \mathbb{R}^p$ is similar to $s_s$ from simple attention. In Equation 6, we apply the standard RNN recurrence to update the hidden state, using the parameter $W_{IH} \in \mathbb{R}^{p \times p}$, operating on the input word at the current timestep $j$, and $W_{HH} \in \mathbb{R}^{p \times p}$, operating on the hidden state from the previous timestep $j-1$. In Equation 7, $h^i \in \mathbb{R}^{w \times p}$ stacks the hidden states of all words in the clause.
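A corresponding sketch of the recurrent scorer, again with random stand-ins for the learned parameters:

```python
# Numpy sketch of the clause-context scorer (Equations 5-8): a plain RNN
# runs over the words of each clause, and the hidden states are scored.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

c, w, p = 10, 20, 50
rng = np.random.default_rng(0)
D_l = rng.normal(size=(c, w, p))   # low-dimension words (output of Eq. 1)
W_IH = rng.normal(size=(p, p))     # input-to-hidden weights
W_HH = rng.normal(size=(p, p))     # hidden-to-hidden weights
s_r = rng.normal(size=p)           # scoring operator

def clause_attention(D_il):
    """D_il: (w, p) words of one clause -> (w,) attention distribution."""
    h = np.zeros(p)
    hidden = []
    for j in range(D_il.shape[0]):
        h = np.tanh(D_il[j] @ W_IH + h @ W_HH)   # Eq. 6
        hidden.append(h)
    return softmax(np.stack(hidden) @ s_r)       # Eq. 7

A_r = np.stack([clause_attention(D_l[i]) for i in range(c)])  # Eq. 8: (c, w)
```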
Input to LSTM

A weighted sum of the input tensor is computed, with the weights coming from the attention model, and it is fed to the LSTM:

$D_{summ}[i, :] = A[i, :] \cdot D[i, :, :] \in \mathbb{R}^d$  (9)

The above equation shows the composed representation of the $i$th clause, stored as the $i$th row of $D_{summ} \in \mathbb{R}^{c \times d}$.

Experiments
Implementation Details
Figure 3: Our Scientific Discourse Tagging (SciDT) pipeline. The input is a list of clauses, which is embedded to get the tensor $D$, containing one vector per word. Summarization involves converting $D$ into a matrix $D_{summ}$ that has one vector per clause, which is then passed to an LSTM-RNN for labeling. The lower part of the figure shows the two ways of summarizing the $i$th clause (given by $D[i]$).

We used the 200-dimensional vectors trained on PubMed Central data by Pyysalo et al. (2013) as input representations, and projected them down to 50 dimensions to keep the parameter space of the entire model under control. The projection operator is also trained along with the rest of the pipeline. LSTMs were implemented using Keras (Chollet, 2015) and attention using Theano (Bergstra et al., 2010); our code is publicly available at https://github.com/edvisees/sciDT. We trained for at most 100 epochs, while monitoring accuracy on held-out validation data for early stopping. We used ADAM (Kingma and Ba, 2014) as the optimization algorithm. Dropout was used on the input to the attention layer.
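As a starting point, the following is a hedged sketch of this training setup written against the current Keras API, not the authors' released code (which used the 2015 Keras/Theano stack); the data is random, and the LSTM width, dropout rate and early-stopping patience are illustrative assumptions rather than reported values.

```python
# Hedged sketch of the training setup. Random stand-in data; the LSTM
# width (64), dropout rate (0.5) and patience (5) are assumptions.
import numpy as np
import keras

c, d, n_labels = 10, 200, 8              # clauses, vector size, label count
X = np.random.normal(size=(64, c, d))    # stand-in for summarized clauses
y = keras.utils.to_categorical(
    np.random.randint(n_labels, size=(64, c)), n_labels)

model = keras.Sequential([
    keras.Input(shape=(c, d)),
    keras.layers.Dropout(0.5),                     # rate assumed
    keras.layers.LSTM(64, return_sequences=True),  # one output per clause
    keras.layers.Dense(n_labels, activation="softmax"),
])
model.compile(optimizer="adam",                    # ADAM, as in the paper
              loss="categorical_crossentropy", metrics=["accuracy"])

stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5)
model.fit(X, y, epochs=100,                        # "at most 100 epochs"
          validation_split=0.1, callbacks=[stop])
```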
Data Preprocessing, Annotations and Pipelines

We created a scientific discourse marked dataset from 75 papers in the area of intercellular cancer pathways, taken from the Open Access subset of PubMed Central. Using a multithreaded preprocessing pipeline, we extracted the Results sections of each of those papers, and parsed all the sentences using the Stanford Parser (Socher et al., 2013). This process separated the main and subordinate clauses of each sentence, which we process as a sequence over separate paragraphs. We asked domain experts to label each of those clauses using the seven label taxonomy suggested by De Waard and Pander Maat (2012). We also added a None label for those clauses that do not fall under any category. Each sequence in the dataset corresponds to the clauses extracted from a paragraph, so we make the assumption that paragraphs are minimal experiment narratives. On the whole, our dataset consists of 392 experiment descriptions with a total of 4497 clauses. This is an ongoing annotation effort, and we intend to make a bigger dataset in the future. Please contact the authors if you would like to use this dataset for your research.

Figure 4 shows $p(position \mid type)$ values of discourse types at various positions in a paragraph, estimated from the entire annotated dataset; the computation is sketched below. It can be seen that goal, fact, problem and hypothesis are more likely at the beginning of a paragraph compared to other locations, whereas method peaks before the middle, result at the middle, and implication clearly towards the end of a paragraph. This trend supports the expected narrative structure described in the Introduction.
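A toy version of the position-profile computation: each paragraph is split into five equal position bins and $p(position \mid type)$ is estimated per label (the two labeled paragraphs below are made up for illustration).

```python
# Sketch of the computation behind Figure 4, on made-up toy paragraphs.
from collections import Counter, defaultdict

paragraphs = [
    ["goal", "method", "method", "result", "result", "implication"],
    ["fact", "hypothesis", "method", "result", "implication"],
]

counts = defaultdict(Counter)            # counts[label][position_bin]
for para in paragraphs:
    for i, label in enumerate(para):
        position_bin = min(int(5 * i / len(para)), 4)   # bins 0..4
        counts[label][position_bin] += 1

for label, ctr in counts.items():
    total = sum(ctr.values())
    print(label, [round(ctr[b] / total, 2) for b in range(5)])
```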
Figure 4: Chart showing probabilities of each discourse type at various positions in a paragraph. Each paragraph is broken into five parts.

Table 2: Accuracies and F-scores from 5-fold cross validation of SciDT in various settings and a CRF. Simple attention corresponds to the out of context variant, and recurrent to the clause context variant.

  Model  Attention  Accuracy  F-score
  CRF    -          0.6942    0.6818
  SciDT  None       0.6912    0.6749
  SciDT  Simple     0.7379    0.7261
  SciDT  Recurrent

Figure 5: Examples of attention probabilities assigned by the recurrent model to words in parts of clauses and their correctly predicted labels. Darker shades show higher attention weights.
Results and Analysis
A baseline model we compare against is a Conditional Random Field (Lafferty, McCallum, and Pereira, 2001) that uses as features part-of-speech tags, the identities of verbs and adverbs, the presence of figure references and citations, and hand-crafted lexicon features that indicate specific discourse types (for example, words like demonstrate and suggest indicate implication; phrases like data not shown indicate result); a sketch of this feature set appears at the end of this section. In addition, we also test a variant of our model that does not use attention in the input layer, where the clause vectors are obtained simply as an average of the vectors of the words in them. Table 2 shows accuracy scores and weighted averages of F-scores from 5-fold cross validation of the two baseline models and the two attention based models; F-scores of all classes were averaged, weighted by the number of points within each class in the gold standard. The performance of the averaged-input SciDT is comparable to the CRF model, whereas the two attention models perform better. The performance of the recurrent attention SciDT model shows the importance of modeling context in attention.

On closer examination of the attention weights assigned to words in unseen paragraphs, we noticed stronger trends in the recurrent attention based model. In particular, the main verbs of the sentences received the highest attention in many cases in the recurrent model. That the verb form is an important indicator of discourse type was shown by De Waard and Pander Maat (2012). Figure 5 shows examples of parts of clauses and the attention weights assigned to the words. These indicate the general trend of words relevant to discourse classes being given higher attention: investigate whether indicates Goal, analysis is a Method word, and strongly suggest is a phrase expected in Implication.

Some of the errors made by the LSTM may be attributed to the model itself, but there were also some exceptions to the assumption that clauses are the smallest units of discourse. There are some infrequent cases where clauses had components of multiple discourse types in them. Moreover, the syntactic parser we used to separate clauses was sometimes inaccurate, resulting in incorrect clause splits.
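As a rough illustration of the baseline's feature set mentioned above, the following sketch shows the kind of clause features the CRF might consume; the lexicons and regular expressions are our assumptions, not the exact features used, and the real system derives POS tags and verb identities from the parser.

```python
# Illustrative clause features for the CRF baseline. Lexicons and regexes
# are assumed for the sketch, not taken from the actual system.
import re

GOAL_LEXICON = {"investigate", "examine", "determine"}        # assumed
IMPLICATION_LEXICON = {"suggest", "demonstrate", "indicate"}  # assumed

def clause_features(clause, pos_tags):
    """Map one clause (plus its POS tags) to a CRF feature dict."""
    tokens = clause.lower().split()
    return {
        "verbs": [w for w, t in zip(tokens, pos_tags) if t.startswith("VB")],
        "has_figure_ref": bool(re.search(r"fig(ure)?\.?\s*\d+[a-z]?",
                                         clause, re.I)),
        "has_citation": bool(re.search(r"\([^)]*\d{4}\)", clause)),
        "goal_lexicon": any(w in GOAL_LEXICON for w in tokens),
        "implication_lexicon": any(w in IMPLICATION_LEXICON for w in tokens),
    }

print(clause_features("To investigate whether X binds Y (Fig. 1A)",
                      ["TO", "VB", "IN", "NN", "VBZ", "NN"]))
```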
Discussion and Conclusion
We introduced a sequence labeling approach for identifying the discourse elements within experiment descriptions. Our model uses an attention mechanism over word representations to obtain clause representations. The results show that our attention based composition mechanism used to encode clauses adds value to the LSTM model. Visualizations show that the clause context model does indeed learn to attend to words important for the final tagging decision. In the future, we shall extend the idea of contextual attention to attend to words based on context at the paragraph level.

Our system can complement existing IE tools that operate on scientific literature, and provide useful epistemic and contextual features. Identifying the structure of experiments provides additional context information that can help various downstream tasks. For event co-reference, one can use the structure to more accurately resolve anaphora links; for example, a reference made in an implication statement is likely to refer to some entity in a result statement that it follows. It has to be noted that the taxonomy we used also provides epistemic information which is helpful in information extraction (IE): IE systems need not process clauses labeled as hypothesis or goal, since they do not contain events that actually occurred. Going forward, our goal is to read, assemble and model mechanisms describing complex phenomena from collections of relevant scientific documents.

The application of our methods can reveal a small-scale discourse structure to contextualize, report and interpret evidence from individual experiments in a fine-grained context. This could be used to support cyclic models of scientific reasoning, where data from individual experiments can be placed in an appropriate interpretive context within an informatics system that synthesizes knowledge across many papers. Beyond the scope of direct applications to IE, this work may be applied to Semantic Web representations of scientific knowledge and biocuration pipelines to accelerate knowledge acquisition.

References
Alley, M. 1996. The Craft of Scientific Writing. Springer Science & Business Media.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bergstra, J.; Breuleux, O.; Bastien, F.; Lamblin, P.; Pascanu, R.; Desjardins, G.; Turian, J.; Warde-Farley, D.; and Bengio, Y. 2010. Theano: A CPU and GPU math compiler in Python. In van der Walt, S., and Millman, J., eds., Proceedings of the 9th Python in Science Conference, 3-10.

Chollet, F. 2015. Keras. https://github.com/fchollet/keras.

Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.

De Waard, A., and Pander Maat, H. 2012. Verb form indicates discourse segment type in biological research papers: Experimental evidence. Journal of English for Academic Purposes.

Guo, Y.; Korhonen, A.; Liakata, M.; Silins, I.; Sun, L.; and Stenius, U. 2010. Identifying the information structure of scientific abstracts: An investigation of three different schemes. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, 99-107. Association for Computational Linguistics.

Gupta, S., and Manning, C. 2011. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of the 5th International Joint Conference on Natural Language Processing, 1-9. Chiang Mai, Thailand: Asian Federation of Natural Language Processing.

Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1684-1692.

Hirohata, K.; Okazaki, N.; Ananiadou, S.; and Ishizuka, M. 2008. Identifying sections in scientific abstracts using conditional random fields.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Liakata, M. 2010. Zones of conceptualisation in scientific papers: A window to negative and speculative statements. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, 1-4. Association for Computational Linguistics.

Liakata, M.; Saha, S.; Dobnik, S.; Batchelor, C.; and Rebholz-Schuhmann, D. 2012. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics (Oxford, England).

Mann, W. C., and Thompson, S. A. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse.

Marcu, D. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press.

Mizuta, Y., and Collier, N. 2004. An annotation scheme for a rhetorical analysis of biology articles. In LREC, 1737-1740.

Nawaz, R.; Thompson, P.; and Ananiadou, S. 2010. Evaluating a meta-knowledge annotation scheme for bio-events. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, 69-77. Association for Computational Linguistics.

Nwogu, K. N. 1997. The medical research paper: Structure and functions. English for Specific Purposes.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, 1532-43.

Pyysalo, S.; Ginter, F.; Moen, H.; Salakoski, T.; and Ananiadou, S. 2013. Distributional semantics resources for biomedical text processing. In Proceedings of Languages in Biology and Medicine.

Socher, R.; Bauer, J.; Manning, C. D.; and Ng, A. Y. 2013. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

Svoboda, K. K. H., and Reenstra, W. R. 2002. Approaches to studying cellular signaling: A primer for morphologists. The Anatomical Record.

Teufel, S., and Moens, M. 1999. In Towards Standards and Tools for Discourse Tagging, 84-93.

Teufel, S., and Moens, M. 2002. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics.

Teufel, S. 2000. Argumentative Zoning: Information Extraction from Scientific Text. Ph.D. Dissertation, School of Cognitive Science, University of Edinburgh, Edinburgh.

Vinyals, O.; Kaiser, Ł.; Koo, T.; Petrov, S.; Sutskever, I.; and Hinton, G. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2755-2763.

Voit, E. 2012. A First Course in Systems Biology. Garland Science, 1st edition.

Xu, K.; Ba, J.; Kiros, R.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.