Bridging the Knowledge Gap: Enhancing Question Answering with World and Domain Knowledge
Travis R. Goodwin and Dina Demner-Fushman
Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, USA
Abstract
In this paper we present OSCAR (Ontology-based Semantic Composition Augmented Regularization), a method for injecting task-agnostic knowledge from an ontology or knowledge graph into a neural network during pretraining. We evaluated the impact of including OSCAR when pretraining BERT with Wikipedia articles by measuring the performance when fine-tuning on two question answering tasks involving world knowledge and causal reasoning and one requiring domain (healthcare) knowledge, and obtained improved accuracy compared to pretraining BERT without OSCAR, achieving new state-of-the-art results on two of the tasks.

1 Introduction

"The detective flashed his badge to the police officer." The nearly effortless ease with which we, as humans, can understand this simple statement belies the depth of semantic knowledge needed for its understanding: What is a detective? What is a police officer? What is a badge? What does it mean to flash a badge? Why would the detective need to flash his badge to the police officer? Understanding this sentence requires knowing the answers to all these questions and relies on the reader's knowledge about the world.

As shown in Figure 1, suppose we were interested in determining whether, upon showing the police officer his badge, it is more plausible that the detective would be let into the crime scene or that the police officer would confiscate the detective's badge. Although both scenarios are certainly possible, our accumulated expectations about the world suggest it would be very extraordinary for the police officer to confiscate the detective's badge rather than allow him to enter the crime scene.
Premise: The detective flashed his badge to the police officer.
What is the most likely effect?
A: The police officer confiscated the detective's badge.
B: The police officer let the detective enter the crime scene.
Figure 1: Example of a question requiring commonsense and causal reasoning (Roemmele et al., 2011).
As evidence of Grice's Maxim of Quantity (Grice, 1975), this shared knowledge of the world is rarely explicitly stated in text. Fortunately, some of this knowledge can be extracted from ontologies and knowledge bases. For example, ConceptNet (Speer et al., 2017) indicates that a detective is a
TYPE OF police officer and is CAPABLE OF finding evidence; that evidence can be LOCATED AT a crime scene; and that a badge is a TYPE OF authority symbol.

While neural networks have been shown to obtain state-of-the-art performance on many types of question answering and reasoning tasks from raw data (Devlin et al., 2018; Rajpurkar et al., 2016; Manning, 2015), there has been relatively little investigation into how to inject ontological knowledge into deep learning models, with most prior attempts embedding ontological information outside of the network itself (Wang et al., 2017).

In this paper we present a pretraining regularization technique we call OSCAR (Ontological Semantic Composition Augmented Regularization) which is capable of injecting world knowledge and ontological relationships into a deep neural network. We show that incorporating OSCAR into BERT's pretraining injects sufficient world knowledge to improve fine-tuned performance on three question answering datasets. The main contributions of this work are:

1. OSCAR, a regularization method for injecting ontological information and semantic composition into deep learning models;

2. Empirical evidence showing the impact of OSCAR on two tasks requiring world knowledge, causal reasoning, and discourse understanding, even with as few as 500 training examples, as well as a task requiring medical domain knowledge; and

3. An open-source implementation of OSCAR and BERT supporting mixed precision training, non-TPU model distribution, and enhanced numerical stability.

Pretraining.
The idea of training a model on a related problem before training on the problem of interest has been shown effective for many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017; Howard and Ruder, 2018). More recent uses of pretraining adapt transfer learning by first training a network on a language modeling task, and then fine-tuning (retraining) that model for a supervised problem of interest (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). Pretraining in this way has the advantage that the model can build on previous parameters to reduce the amount of information it needs to learn for a specific downstream task. Conceptually, the model can be viewed as applying what it has already learned from the language modeling task when learning the downstream task.
BERT (Bidirectional Encoder Representations from Transformers) is a pretrained neural network which has been shown to obtain state-of-the-art results on eleven natural language processing tasks after fine-tuning (Devlin et al., 2018). Despite its strong empirical performance, the architecture of BERT is relatively simple: layers of transformers (Vaswani et al., 2017) are stacked to process each sentence (twelve in BERT base and twenty-four in BERT large).
Incorporating OSCAR into BERT's pretraining requires an embedded ontology and a text corpus. In our experiments we used ConceptNet 5 as our ontology, relying on the pretrained entity embeddings known as ConceptNet NumberBatch (Speer et al., 2017). Our text corpus was a 2019 dump of English Wikipedia articles with templates expanded.
Virtually all neural networks designed for natural language processing represent language as a sequence of words, subwords, or characters. By contrast, ontologies and knowledge bases encode semantic information about entities, which may correspond to individual nouns (e.g., "badge") or multiword phrases ("police officer"). Consequently, injecting world and domain knowledge from a knowledge base into the network requires semantically decomposing the information about an entity into the supporting information about its constituent words. To do this, OSCAR is implemented using a three-step approach illustrated in Figure 2:
Step 1. entities are recognized in a sentence using a Finite State Transducer (FST);
Step 2. the sequence of subwords corresponding to each entity is semantically composed to produce an entity-level encoding; and
Step 3. the average energy between the composed entity encoding and the pretrained entity encoding from the ontology is used as a regularization term in the pretraining loss function.

By training the model to compose sequences of subwords into entities, during back-propagation the semantics of each entity are decomposed and injected into the network based on the neural activations associated with its constituent words, as sketched below.
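The following minimal sketch (Python/NumPy, not the paper's actual implementation) shows how the three steps could be combined during pretraining; the callables `compose`, `project`, and `energy`, and the `weight` factor are illustrative placeholders.

```python
import numpy as np

def pretraining_loss_with_oscar(mlm_loss, subword_vectors, mentions,
                                compose, project, energy, weight=1.0):
    """Combine the three OSCAR steps into the pretraining loss.

    subword_vectors : (N, d) array of the model's subword representations.
    mentions        : (start, length, pretrained_embedding) triples from Step 1.
    compose/project/energy : stand-ins for Steps 2 and 3.
    weight          : hypothetical scaling factor; the paper adds the average
                      energy directly as a regularization term.
    """
    if not mentions:
        return mlm_loss
    energies = [energy(project(compose(subword_vectors[s:s + l])), e)
                for s, l, e in mentions]
    return mlm_loss + weight * float(np.mean(energies))
```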
Entity Detection. We recognized entities online during training and inference by (1) tokenizing each entity in our ontology using the same tokenizer used to prepare the BERT pretraining data, and (2) compiling a Finite State Transducer to detect sequences of subword IDs corresponding to entities. The FST, illustrated in Figure 3, allowed us to detect entities on-the-fly without hard-coding a specific ontology or inducing any discernible change in training or inference time. Formally, let $X = x_1, x_2, \cdots, x_N$ represent the sequence of words in a sentence. The FST processes $X$ and returns three sequences, $s_1, s_2, \cdots, s_M$; $l_1, l_2, \cdots, l_M$; and $e_1, e_2, \cdots, e_M$, representing the start offset, length, and pretrained embedded representation of every mention of any entity in the ontology. Note: including or excluding subsumed and de-masked entities (as illustrated in Figure 2) provided no discernible effect in our experiments.
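As an illustration, a simplified trie-based matcher can stand in for the compiled FST; the names below are hypothetical, and the real implementation returns pretrained NumberBatch embeddings rather than entity identifiers.

```python
def build_trie(entity_subword_ids):
    """Build a trie over the subword IDs of every ontology entity.

    entity_subword_ids: dict mapping an entity identifier to the tuple of
    subword IDs produced by the BERT tokenizer for that entity's name.
    """
    trie = {}
    for entity_id, ids in entity_subword_ids.items():
        node = trie
        for tok in ids:
            node = node.setdefault(tok, {})
        node["entity"] = entity_id            # terminal marker
    return trie

def detect_entities(subword_ids, trie):
    """Return (start, length, entity_id) triples for every mention in a sentence,
    including subsumed (overlapping) mentions."""
    mentions = []
    for start in range(len(subword_ids)):
        node, length = trie, 0
        for tok in subword_ids[start:]:
            if tok not in node:
                break
            node, length = node[tok], length + 1
            if "entity" in node:
                mentions.append((start, length, node["entity"]))
    return mentions
```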
[Figure 2 example: the pretraining sentence "British policy during the Spanish Civil War was officially that of [MASK]", in which the FST detects entities such as "British", "Policy", and "Spanish Civil War", subsumed entities (†) such as "Spanish", "Civil War", and "Intervention", and de-masked entities (‡) such as "Nonintervention".]
Figure 2: Architecture of OSCAR when injecting ontology knowledge from ConceptNet into BERT, where '†' indicates subsumed entities, '‡' indicates de-masked entities, N is the length of the input sentence, M is the number of entities detected in the sentence, and K is the number of entities with embeddings in ConceptNet.

Figure 3: Finite State Transducer (FST) used to detect entities during pretraining; each node corresponds to a word ID, and $e_i$ indicates the $i$-th pretrained entity embedding in ConceptNet's NumberBatch.

Semantic Composition. The role of semantic composition is to learn a composed representation $c_1, c_2, \cdots, c_M$ for each entity detected in $X$ such that $c_i = \operatorname{compose}(x_{s_i}, x_{s_i+1}, \cdots, x_{s_i+l_i})$. As pretraining BERT is computationally expensive, we considered three computationally efficient methods for composing words and subwords into entities.

Recurrent Additive Networks (RANs) are a simplified alternative to LSTM- or GRU-based recurrent neural networks that use only additive connections between successive layers and have been shown to obtain similar performance with 38% fewer learnable parameters (Lee et al., 2017). Given a sequence of words $x_1, x_2, \cdots, x_L$ we use the following layers to accumulate information about how the semantics of each word in an entity contribute to the overall semantics of the entity:

$$\tilde{m}_t = W_m x_t \quad \text{(1a)}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(1b)}$$
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(1c)}$$
$$m_t = i_t \circ \tilde{m}_t + f_t \circ m_{t-1} \quad \text{(1d)}$$
$$h_t = g(m_t) \quad \text{(1e)}$$

where $[\bullet]$ represents vector concatenation, $\tilde{m}_t$ represents the content layer which encodes any new semantic information provided by word $x_t$, $\circ$ indicates an element-wise product, $i_t$ represents the input gate, $f_t$ represents the forget gate, $m_t$ represents the internal memories about the entity, and $h_t$ is the output layer encoding accumulated semantics about word $x_t$. We define the composed entity $c_i := h_{s_i + l_i}$ for the sequence beginning with $x_{s_i}$.
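A NumPy sketch of the RAN composition in Equations 1a-1e follows; the zero initial states and the choice of g = tanh are assumptions not stated in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ran_compose(x_seq, W_m, W_i, b_i, W_f, b_f, g=np.tanh):
    """Compose one entity from its subword vectors with Equations 1a-1e.

    x_seq : (L, d_in) array of subword representations for one mention.
    Shapes assumed: W_m (d, d_in); W_i, W_f (d, d + d_in); b_i, b_f (d,).
    """
    d = W_m.shape[0]
    h, m = np.zeros(d), np.zeros(d)
    for x_t in x_seq:
        m_tilde = W_m @ x_t                       # (1a) content layer
        hx = np.concatenate([h, x_t])
        i_t = sigmoid(W_i @ hx + b_i)             # (1b) input gate
        f_t = sigmoid(W_f @ hx + b_f)             # (1c) forget gate
        m = i_t * m_tilde + f_t * m               # (1d) memory update
        h = g(m)                                  # (1e) output layer
    return h   # composed entity encoding c_i
```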
Linear Recurrent Additive Networks. A second, simpler version of a RAN omits the content and output layers (i.e., Equations 1a and 1e), and Equation 1d is updated to depend on $x_t$ directly: $m_t = i_t \circ x_t + f_t \circ m_{t-1}$. We define the composed entity $c_i := m_{s_i + l_i}$ for the sequence of subwords beginning with $x_{s_i}$.
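A corresponding sketch of the Linear RAN update; the gates here are computed from the previous memory and the current subword, which is an assumption (the text leaves the gate inputs implicit once the output layer is removed).

```python
import numpy as np

def linear_ran_compose(x_seq, W_i, b_i, W_f, b_f):
    """Linear RAN composition: no content or output layers.

    x_seq must share the memory's dimensionality, since
    m_t = i_t * x_t + f_t * m_{t-1}.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    m = np.zeros(x_seq.shape[1])
    for x_t in x_seq:
        mx = np.concatenate([m, x_t])
        i_t = sigmoid(W_i @ mx + b_i)
        f_t = sigmoid(W_f @ mx + b_f)
        m = i_t * x_t + f_t * m
    return m   # composed entity encoding c_i
```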
Linear Interpolation. The third, simplest form of semantic composition represents the semantics of an entity as an unordered linear combination of the semantics of its constituent words, i.e., $c_i := W_e (x_{s_i} + x_{s_i+1} + \cdots + x_{s_i + l_i}) + l_i \cdot b_e$.

Energy Regularization. We project the composed entities into the same vector space as the pretrained entity embeddings from the ontology, and measure the average energy across all entities detected in the sentence:

$$R_{\text{OSCAR}} = \frac{1}{M} \sum_{i=1}^{M} f(W_p c_i + b_p, e_i) \quad \text{(2)}$$

where $f$ is an energy function capturing the energy between the composed entity $c_i$ and the pretrained entity embedding $e_i$. We considered three energy functions: (1) the Euclidean distance, (2) the absolute distance, and (3) the angular distance.
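A sketch of the linear composition, the three candidate energy functions, and Equation 2; reading the absolute distance as the L1 distance and the angular distance as the arccosine of cosine similarity are assumptions.

```python
import numpy as np

def linear_compose(x_seq, W_e, b_e):
    """Linear interpolation: unordered sum of subword vectors, length-scaled bias."""
    return W_e @ np.sum(x_seq, axis=0) + len(x_seq) * b_e

def euclidean_energy(c, e):
    return np.linalg.norm(c - e)

def absolute_energy(c, e):
    # "absolute distance" read here as the L1 distance
    return np.sum(np.abs(c - e))

def angular_energy(c, e):
    # scale-invariant angle between the two vectors
    cos = np.dot(c, e) / (np.linalg.norm(c) * np.linalg.norm(e) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def r_oscar(composed, pretrained, W_p, b_p, energy_fn=euclidean_energy):
    """Equation 2: average energy between the projected composed entities and
    their pretrained ontology embeddings."""
    return float(np.mean([energy_fn(W_p @ c + b_p, e)
                          for c, e in zip(composed, pretrained)]))
```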
Premise: Gina misplaced her phone at her grandparents'. It wasn't anywhere in the living room. She realized she was in the car before. She grabbed her dad's keys and ran outside.
Ending A:
She found her phone in the car.
Ending B:
She didn’t want her phone anymore.
Figure 4: Example of a Story Cloze question (correct answer is A).
Consumer Health Question:
Can sepsis be prevented. Can someone get this from a hospital?
FAQ A:
Who gets sepsis?
FAQ B:
What is the economic cost of sepsis?
Figure 5: Example of a Recognizing Question Entailment (RQE) question (correct answer is A).
We evaluated the impact of OSCAR on three question answering tasks requiring world or domain knowledge and causal reasoning.
Choice of Plausible Alternatives (CoPA) provides 500 training and 500 testing two-choice questions and requires a system to choose the most plausible cause or effect entailed by the premise, as illustrated in Figure 1 (Roemmele et al., 2011).
The Story Cloze Test evaluates story understanding, story generation, and script learning and requires a system to choose the correct ending to a four-sentence story, as illustrated in Figure 4 (Mostafazadeh et al., 2016). In our experiments we used only the 3,744 labeled stories.
Recognizing Question Entailment:
To overcome the complexity of healthcare questions, Ben Abacha and Demner-Fushman (2016) proposed to simplify clinical question answering by answering sub-questions using Recognizing Question Entailment (RQE). The RQE collection consists of 8,588 training and 302 testing pairs of consumer health questions (CHQs) and frequently asked questions (FAQs), with labels indicating whether answering the FAQ entails answering the CHQ, as illustrated in Figure 5.

Table 1 presents the results of BERT when pretrained on Wikipedia with and without OSCAR, the state-of-the-art, and the average performance of different semantic composition methods and energy functions when calculating OSCAR.
Model               CoPA    Cloze    RQE
BERT                55.2    74.200   74.834
OSCAR
SotA                71.2
OSCAR: Linear RAN

Table 1: Accuracy when fine-tuning BERT pretrained on Wikipedia data and pretrained on Wikipedia data with OSCAR.
OSCAR provided a significant improvement in accuracy for both common sense causal reasoning tasks, indicating that OSCAR was able to inject useful world knowledge into the network. While less pronounced than in the general domain, for the clinical domain OSCAR provided a modest improvement over standard BERT, and both improved over the state-of-the-art.

When comparing semantic composition methods, the Linear method had the most consistent performance across both domains; the Recurrent Additive Network (RAN) obtained the lowest performance on the general domain and the highest performance on medical texts, while the Linear RAN exhibited the opposite behavior.

In terms of energy functions, the Euclidean distance was the most consistent, the Angular distance was the best for the Story Cloze and RQE tasks, and the Absolute distance was the best for CoPA. The Angular distance (being scale invariant) is least affected by the number of subwords constituting an entity, while the Absolute distance is most affected. Consequently, we believe the Absolute distance was only effective on the CoPA evaluation because the entities in CoPA are typically very short (single words or subwords).

In future work we hope to explore the impact of incorporating different ontologies and knowledge graphs as well as alternative types of entity embeddings (Bordes et al., 2013; Lin et al., 2015; Wang et al., 2014; Nickel et al., 2011; Xiong et al., 2018).

References
Asma Ben Abacha and Dina Demner-Fushman. 2016. Recognizing question entailment for medical question answering. In AMIA 2016, American Medical Informatics Association Annual Symposium, Chicago, IL, USA, November 12-16, 2016.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787-2795. Curran Associates, Inc.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3079-3087. Curran Associates, Inc.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

H. Paul Grice. 1975. Logic and conversation. In Syntax and Semantics 3: Speech Acts, pages 41-58.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339. Association for Computational Linguistics.

Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent additive networks. arXiv preprint arXiv:1705.07393.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2181-2187. AAAI Press.

Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP '02, pages 63-70, Stroudsburg, PA, USA. Association for Computational Linguistics.

Christopher D. Manning. 2015. Computational linguistics and deep learning. Computational Linguistics, 41(4):701-707.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839-849. Association for Computational Linguistics.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, pages 809-816, USA. Omnipress.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756-1765. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392. Association for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI Conference on Artificial Intelligence, pages 4444-4451.

Ole Tange. 2018. GNU Parallel 2018. Ole Tange.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.

Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724-2743.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI'14, pages 1112-1119. AAAI Press.

Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-shot relational learning for knowledge graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1980-1990. Association for Computational Linguistics.
A Hyper-parameter Tuning
A.1 Fine-tuning
For each fine-tuning task, we used a greedy approach to hyper-parameter tuning by incrementally and independently optimizing: the batch size; the initial learning rate; whether to include subsumed entities ∈ {yes, no}; and whether to include masked entities ∈ {yes, no}.

For CoPA, the Story Cloze task, and RQE we found the same optimal batch size (16) and initial learning rate. We also found that including subsumed and masked entities was optimal, at a small net improvement in accuracy.
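One possible reading of this greedy procedure, sketched with illustrative names and placeholder candidate values (the paper's candidate sets are not all legible in the source):

```python
def greedy_tune(train_and_eval, search_space, defaults):
    """Greedy tuning: optimize one hyper-parameter at a time, keeping the best
    value found for each before moving on to the next.

    train_and_eval(config) -> validation accuracy; all names are illustrative.
    """
    best = dict(defaults)
    for name, candidates in search_space.items():
        scored = {value: train_and_eval({**best, name: value}) for value in candidates}
        best[name] = max(scored, key=scored.get)
    return best

# Hypothetical usage:
# best = greedy_tune(run_finetuning,
#                    {"batch_size": [8, 16, 32],
#                     "subsumed_entities": ["yes", "no"],
#                     "masked_entities": ["yes", "no"]},
#                    defaults={"batch_size": 16,
#                              "subsumed_entities": "yes",
#                              "masked_entities": "yes"})
```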
A.2 Pretraining

We pretrained BERT using a 2019 Wikipedia dump formatted for Wikipedia's Cirrus search engine. Preprocessing relied on NLTK's Punkt sentence segmenter (Loper and Bird, 2002) and the WordPiece subword tokenizer provided with BERT. We used the vocabulary from BERT base (not large) and a maximum sequence length of 384 subwords, training with 320 warm-up steps.
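A minimal sketch of this preprocessing pipeline using NLTK's Punkt segmenter; the WordPiece tokenizer is passed in as a callable rather than reproduced here, and the function name is illustrative.

```python
import nltk

def preprocess_article(text, wordpiece_tokenize, max_len=384):
    """Segment an article into sentences and WordPiece subwords."""
    nltk.download("punkt", quiet=True)            # Punkt sentence model
    examples = []
    for sentence in nltk.sent_tokenize(text):
        subwords = wordpiece_tokenize(sentence)
        examples.append(subwords[:max_len])       # enforce the max sequence length
    return examples
```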
B BERT Modifications

We used a modified version of BERT allowing for mixed precision training. This necessitated a number of minor changes to improve numerical stability around softmax operations. Training was performed on single nodes with 4 Tesla P100s each (multiple variants of OSCAR were trained simultaneously using five such nodes at a time). Non-TPU multi-GPU support was added to BERT based on Horovod (https://eng.uber.com/horovod/) and relying on Open MPI.
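For illustration, one common numerical-stability pattern around softmax in mixed precision training (not necessarily the exact modification used here):

```python
import numpy as np

def stable_softmax(logits, axis=-1):
    """Softmax computed in float32 with max-subtraction for numerical stability."""
    logits32 = np.asarray(logits, dtype=np.float32)             # upcast from float16
    shifted = logits32 - np.max(logits32, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)
```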
C State-of-the-Art

State-of-the-art was determined using the official leaderboards maintained by the task organizers. The RQE evaluation has not concluded; as such, we use the organizers' baseline as the state-of-the-art.
D OSCAR Pretraining Recommendations
When pretraining BERT (or another model) with OSCAR, we make the following recommendations:

1. ignore subsumed entities: including subsumed entities provided only a marginal average relative improvement in accuracy at a 10% increase in total training time;

2. ignore masked entities: de-masking entities had little overall impact on model performance (a marginal average relative improvement) and no discernible effect on training time;

3. use linear composition, as it exhibits consistent performance and requires 50% less training time than the RAN and 40% less than the Linear RAN; and

4. select the energy function based on the average length of entities in the fine-tuning tasks: Angular distance with long entities, Absolute distance with short entities, and Euclidean distance with varied entities.

E Stanford Question Answering Dataset
We evaluated the impact of OSCAR on the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al. 2016), version 1.1, and observed no discernible change in performance between BERT pretrained on Wikipedia without OSCAR and with OSCAR. The lack of impact of OSCAR for SQuAD is unsurprising, as the vast majority of SQuAD questions can be answered directly by surface-level information in the text. However, the lack of impact shows that injecting world knowledge with OSCAR does not come at the expense of model performance for tasks that require little external knowledge.

CoPA's "leaderboard" is available at http://people.ict.usc.edu/~gordon/copa.html. The Story Cloze leaderboard is available at https://competitions.codalab.org/competitions/15333.