ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning
Michael Boratko∗, Xiang Lorraine Li∗, Rajarshi Das∗, Tim O'Gorman∗, Dan Le, Andrew McCallum
College of Information and Computer Sciences, University of Massachusetts Amherst
{mboratko,xiangl,rajarshi,togorman,dhle,mccallum}@cs.umass.edu

Abstract
Given questions regarding some prototypical situation, such as "Name something that people usually do before they leave the house for work?", a human can easily answer them via acquired experiences. There can be multiple right answers for such questions, with some more common for a situation than others. This paper introduces a new question answering dataset for training and evaluating common-sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international trivia game show, FAMILY-FEUD. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers. We also propose an open-domain task where a model has to output a ranked list of answers, ideally covering all prototypical answers for a question. On evaluating our dataset with various competitive state-of-the-art models, we find there is a significant gap between the best model and human performance on a number of evaluation metrics.
Humans possess the innate ability to implicitly reason using a wealth of shared background knowledge, much of which is acquired via experiences. For example, consider the question in Figure 1: "Name something that people usually do before they leave the house for work?". Humans can easily answer such questions with 'prototypical' answers, i.e. a set of answers that they commonly associate with situations invoked by the question. These questions require common-sense reasoning; however, because of its 'common' nature, such knowledge is often not explicitly captured in text (reporting bias) (Gordon and Van Durme, 2013).

* Equal contribution. Data and interactive demo available at http://protoqa.com

[Figure 1: Example questions with crowd-sourced, manually clustered answer categories: (i) Name something that people usually do before they leave for work? (ii) Name a piece of equipment that you are likely to find at your office and not at home? (iii) Name something that an athlete would not keep in her refrigerator. (iv) Name something that you might forget in a hotel room? Caption: We focus on common-sense reasoning over prototypical situations when there could be many different answers but some are more common than others. Our task is in an open-domain style (not a multiple-choice format). Answers to a question are crowd-sourced from 100 workers and are then manually clustered into categories. To perform well, a model has to output a ranked list of answers covering multiple categories.]

We present a new dataset to train and evaluate models about their common-sense knowledge of prototypical situations. In such situations, there are often multiple right answers, with some answers more prototypical (common) than others, thereby forming a distribution over them. For example, when we polled
100 people (Figure 1), popular answers to the previous question were 'shower/cleaning' (43) or 'breakfast' (30). However, we also received very reasonable answers in the tail such as 'lock door/grab keys' (7), 'say goodbye' (4) and 'pray' (1). We think that, for artificial intelligence (AI) systems to achieve human-level common-sense reasoning, they should be able to match the distribution over prototypical answers.

Our dataset and task are inspired by a source of naturally occurring common sense questions (not developed for any particular NLP task) used in a competitive trivia game show: FAMILY-FEUD. FAMILY-FEUD is a long-running trivia game show which started on American television in 1976 and has been adapted internationally in more than 50 countries. The game show is played by asking participants questions such as those in Figure 1, and they receive points not for a "correct" answer, but when their answer matches the answers from other people surveyed, and in proportion to how many people gave that answer. What makes FAMILY-FEUD appealing is the fact that the original answers to each question were collected by a professional polling company by doing a telephone survey of 100 different people all over the country and were further clustered into meaningful categories (e.g. cantaloupe, honeydew, watermelons as 'melons'), thereby automatically giving us a distribution over both the answers and the underlying concepts those answers refer to.

We present a common-sense reasoning task in which a model has to output a ranked list of answers for each question. Our evaluation metrics (§ 3) are designed to encourage both popularity and diversity in answers, i.e. models that not only predict more popular answers but also cover all plausible answer categories are encouraged. We present both a publicly available set of around 9.7K questions, along with 7-8 labeled answer categories for each question oriented towards this kind of commonsense knowledge, and a newly annotated and unseen test set for evaluation. This test set provides a set of 15,400 crowd-sourced human judgments over 154 new questions. These questions were created by us by perturbing existing FAMILY-FEUD questions to ensure that they do not occur in any iteration of the game show, while maintaining the same level of common-sense and prototypical reasoning needed to answer them (§ 2.4). The crowd-sourced answers were further categorized manually by two expert annotators, creating the same setup as in the original game show.

Recent common-sense reasoning benchmarks often use a multiple-choice paradigm, where the task is to identify the most plausible answer from a list of options (Mostafazadeh et al., 2016; Zellers et al., 2018; Talmor et al., 2019). However, it has been shown that language models (LMs) trained on large amounts of unlabeled data such as BERT (Devlin et al., 2019) do exceedingly well on these datasets, achieving human-level accuracy on a few of them. To counter this, Zellers et al. (2019) propose a new adversarial filtering approach where benchmarks evolve in an adversarial way as new models are proposed, and introduce the HELLASWAG dataset created by using powerful LMs such as BERT. Soon after its introduction, RoBERTa (Liu et al., 2019) improved upon the accuracy of the BERT model by 45 points. A primary reason for this is that generating hard negative examples for the multiple-choice format is challenging, even for humans (Schwartz et al., 2017; Gururangan et al., 2018; Poliak et al., 2018).

Instead of the multiple-choice paradigm, we set up our task in an open-domain question answering (QA) format where a model has to output a ranked list of answers which is 'matched' to crowd-sourced answers in each category. While such an approach can penalize a correct model prediction when it does not match an existing reference answer, we counter this issue (a) by collecting 100 answers per question, a number substantially higher than any other work in open-domain QA, (b) by proposing evaluation metrics which use large lexical resources such as WordNet (Miller, 1995) to perform matching, and (c) by focusing upon methods to score ranked lists of answers, instead of focusing upon a top score. We suggest that open-domain evaluation of some common-sense reasoning tasks is a natural and realistic paradigm, one which shares natural similarities to the evaluation challenges found in various natural language generation tasks such as summarization (Radev et al., 2003) and translation (Callison-Burch et al., 2010), and should be an area of active research.

We evaluate this dataset on a variety of baseline models: from models built on a symbolic common-sense knowledge store such as ConceptNet (Speer et al., 2017), to QA models powered by large masked LMs such as BERT, to the direct prediction of answers in a language-modeling paradigm using a large GPT-2 LM (Radford et al., 2018). While most models perform quite poorly at this challenging task, when GPT-2 was fine-tuned using the FAMILY-FEUD training set, its performance improved drastically, although remaining significantly below human-level performance.

The contributions of this paper are as follows. (a) We introduce a large-scale QA dataset regarding common-sense knowledge of prototypical situations, and a rich evaluation set for models trained upon that data. (b) We present this as an open-domain task, and review a range of directions for robust evaluation in this open-domain setting, both with rich data (large sets of reference answers and clustering over answers) and with evaluation measures such as a WordNet-based similarity. (c) We also design evaluation metrics that encourage models to provide diverse answers covering all plausible answer categories. (d) We evaluate our dataset on existing models, and reveal the strong ability of large contextualized language models when fine-tuned on this data. Finally, (e) we discuss the gap between model and human performance on this task, showing that this is still a challenging task for models, with room for improvement.
Three publicly available fan websites for the show Family Feud (http://shorturl.at/guW34, http://shorturl.at/bhKS1, http://shorturl.at/rFMT2), which have transcribed questions from the show, were used to collect a large collection of questions. Well over 10,000 questions were collected with their answers, and a set of 9,762 questions remained after filtering, quality control, and de-duplication. That filtering included the omission of classes of questions (such as name a vegetable) which did not evaluate interesting commonsense knowledge.

While any "commonsense" dataset inherently bears the risk of encoding culture-specific information and biases, questions regarding prototypical scenarios and prototypical behavior are naturally quite susceptible to this, and we note it as an important issue to be aware of regarding such data. A small subset of 29 questions which might be viewed as problematic or encoding stereotypes were explicitly labeled and will be released separately with the training data, so that one might evaluate the extent to which models trained on such a task might acquire undesirable biases.
In order to focus upon a rich, open-ended answer generation task, we collected 100 answers for each question from the crowd-sourcing platform FigureEight, and then provided rich, double-annotated clustering over those answers (for example, https://protoqa.com/). By gathering large sets of possible answers and clustering them, we can provide rough distributions over the expected answers, increasing the ability to recognize any way of expressing one of those answers.

We gathered a test set of new questions with an eye towards maintaining the same domain and the same commonsense reasoning seen in the training data. In order to maintain similarity to existing questions, we removed a set of questions from the scraped data and then perturbed important aspects of them. For example, given an existing question "Name something a person might forget to put on if they leave the house in a hurry.", changes of polarity and events would derive a related question "Name something that people usually do before they leave the house for work". Deriving such unseen test questions was especially important to avoid the risk of having a publicly-available question be included in the training data for contextual language models; by making new data, we can be more confident that any high-performing model has not yet seen the data.

Having derived new questions, we then created tasks on FigureEight for each of those questions to be answered by 100 workers. To match the training data (which is inherently grounded in US culture), we limited workers to US locations. Low-quality workers were automatically detected through test questions during annotation, and the clustering pass provided a second manual quality control check. This left us with 154 questions, which we split into a test set and a dev set of 102 and 52 questions respectively.
After the initial collection of 100 answers for each question, we then clustered the answers of each question. Each list was manually clustered by two different experts familiar with the task. The clusterings were generated separately and then compared, and a final clustering was agreed on. During this clustering phase, answers could be marked as invalid as well, most commonly due either to low-quality annotators or to a clear misunderstanding of a question. In order to keep these clusters roughly similar to the granularity of answers used in the training data and to avoid low-quality evaluation, we eliminated questions for which the 8 most popular clusters did not cover more than 85 of the 100 responses.

Since each set of answers was clustered twice and adjudicated, we measure the agreement with a cluster agreement metric, BLANC (Recasens and Hovy, 2011; Luo et al., 2014), an extension of the Rand index used to score coreference clustering. Using this, the similarity between the clusters produced by any two annotators averaged out to a BLANC score of 83.17, suggesting a coherent amount of agreement regarding the clustering of answers.

[Figure 2: Number of answer clusters per question, in the scraped (blue) and crowdsourced (red) datasets.]

Figure 2 illustrates how this crowd-sourced test set relates to the training data; the actual size of the largest clusters remains similar between the two datasets, but our data tends to have more clusters, generally capturing all possible answers within the top 8 clusters, but often using seven or eight clusters. More clusters provide a more relaxed evaluation, as we include more answer strings from smaller clusters, which also provides us with more interesting answer strings.
The data presented here involves a range of different types of commonsense knowledge. To examine the distribution of different kinds of reasoning, and to examine whether that distribution varied between the publicly available data and the crowdsourced development and test set, we propose a small inventory of six types of commonsense reasoning often present in these questions. These types consist of (1) Mental or Social Reasoning, (2) Knowledge of Prototypical Situations which one is familiar with, (3) Reasoning about Novel, Complex Events, (4) Negation and Exceptions and understanding their consequences, (5) Specific Entity Knowledge of named people, locations, or organizations, and finally (6) Knowledge of Habitual Activities of specific occupations or types of entities.

To study the distribution over the data, we took 25 questions from the training collection and 25 questions from the crowd-sourced development set, and marked each one with any number of the six categories which seemed necessary for the question, as a simple approximation of prior works which examine the types of knowledge required for reasoning tasks (LoBue and Yates, 2011; Boratko et al., 2018). Table 1 illustrates examples of questions with these types, and the frequency of each type is shown in Table 2. The counts shown for each dataset illustrate that while the creation methodology varied between the two resources, the kind of commonsense reasoning evaluated is quite similar between the two corpus types. The greatest difference to note is that the crowd-sourced data makes less use of questions regarding specific entities, which were avoided as they tended to involve fact-based world knowledge rather than commonsense reasoning.
Question | Example Answers | Types
Name a profession where you might be fired if you lost your voice | radio host, teacher | 3, 4, 6
Name something a boy scout might learn. | knot tying, camping | 2, 5, 6
Name a bad sport for someone who is afraid of the water. | diving, water polo | 1, 3, 6
Name something a monk probably would not own. | weapons, smartphone | 2, 4, 6
Name something parents tell their kids not to do | steal, smoke | 1, 2, 4, 6
Name a reason why someone would wear gloves | cold weather, cleaning | 2, 3

Table 1: Examples of questions from the collected (top 3) and crowd-sourced (bottom 3) development sets.

Reasoning type | Scraped Dev | Crowd-sourced
Mental/Social | 16% | 12%
Prototypical Events | 68% | 80%
Event Reasoning | 28% | 40%
Negation | 12% | 20%
Specific Entities | 20% | 4%
Habitual Activity | 40% | 24%

Table 2: Percentage of questions utilizing each reasoning type.

Recent commonsense reasoning benchmarks often use a multiple-choice paradigm where the task is to identify the most plausible answer from a list of options (Zellers et al., 2018, 2019). However, generating challenging negative examples is hard, so that often within months of the release of a dataset, models may achieve human-level or near-human-level performance, as occurred with BERT for the SWAG dataset and RoBERTa for the HELLASWAG dataset. Such issues highlight the difficulty of establishing robust and stable metrics using negative samples and adversarial methods alone. It has also been shown that generating negative examples is hard even for humans, who can inadvertently introduce annotation artifacts which models can easily identify in order to solve the task (Schwartz et al., 2017; Gururangan et al., 2018; Poliak et al., 2018).

An appealing alternative for benchmarking models is via open-domain answer generation tasks where the model has to generate the correct answers. This side-steps the need to find challenging negative examples. However, this paradigm introduces another challenge: the possibility of models getting wrongly penalized for predictions not in the list of correct answers. This problem is also faced in other natural language generation tasks such as machine translation (MT), summarization and dialog generation.

Our solution to the above problem is to collect and cluster a large number of open-ended responses (100 crowd-sourced responses in our case). This is much higher than for other typical tasks (e.g. there are usually very few reference translations or summaries). Also, compared to summarization or MT, a prototypical answer for our task is only a word or a short phrase, making the problem less severe. Furthermore, as described shortly, we do not restrict ourselves to rigid exact matches and propose a similarity measure that uses synonyms from WordNet (Miller, 1995).

The next subsections describe how we score a ranked list of model predictions with respect to the answer clusters containing the 100 crowd-sourced responses. We first describe a similarity function to compute the similarity between two strings (§ 3.1) and then we describe how we score the ranked list while encouraging diversity in the answers (§ 3.2).
With the large number of raw answers retrieved for each question, exact string matching of a new answer to those in each answer cluster works surprisingly well. Still, it is clear that reasonable answer strings (e.g. synonyms or slightly embellished phrases) may be incorrectly marked as wrong with such a stringent matching criterion. METEOR (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009) addressed similar issues in machine translation via stemming and synonym matching. We take a similar approach, expanding the set of answer clusters using WordNet synsets, and comparing all possible partitions of the tokenization of the raw strings. For more details, please see the appendix.
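As a rough illustration of the idea (a minimal sketch, not the released evaluation code; the helper names are ours and the NLTK WordNet corpus is assumed to be available via nltk.download("wordnet")), two strings can be treated as matching when they share a WordNet synset, falling back to exact string comparison:

```python
# Minimal sketch: synonym-aware matching of a predicted answer to a cluster.
from nltk.corpus import wordnet as wn

def synset_score(x: str, y: str) -> float:
    """Return 1.0 if the strings match exactly or share a WordNet synset, else 0.0."""
    if x == y:
        return 1.0
    # Multi-word lemmas are stored with underscores in WordNet.
    sx = set(wn.synsets(x.replace(" ", "_")))
    sy = set(wn.synsets(y.replace(" ", "_")))
    return 1.0 if sx & sy else 0.0

def answer_matches_cluster(pred: str, cluster: list[str]) -> bool:
    """A predicted answer matches a cluster if it matches any reference string in it."""
    return any(synset_score(pred, ref) > 0 for ref in cluster)

# "automobile" and "car" share the synset car.n.01, so this returns True.
print(answer_matches_cluster("automobile", ["car", "vehicle"]))
```

The full scoring function used for evaluation additionally handles multi-token answers and partitions of the tokens, as detailed in the appendix.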
As mentioned before, we want to design evaluation metrics that favor models that can cover all plausible answer categories and not just predict the most popular answer. We first compute an alignment score between each answer in the ranked list and each of our answer clusters. The alignment score is computed as the maximum score between the predicted answer string and any reference string present in the cluster, scaled by the size of the cluster. After computing the alignment scores between all pairs of answers and clusters, we employ the Hungarian matching algorithm (Kuhn, 1955; Munkres, 1957) to compute the exact optimal matching of answers to clusters. It is worth noting that a model which produces a ranked list of answers only in one cluster will be penalized, and a model which maximally covers all plausible clusters will score the maximum. Lastly, to make the comparison between lists of different lengths uniform, we propose the following metrics.

1. MAX ANSWERS @ k limits the total number of answers allowed to up to k answers.
2. MAX INCORRECT @ k allows unlimited answers, but stops after k unmatched answers are provided.

We also report a weaker metric that eschews the size of the cluster (Set Intersection) in Table 3. Note that since our scores are always calculated as a percentage of the max score one could receive, MAX ANSWERS @ k is slightly different from hits@k in this setting.
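A minimal sketch of this scoring step is shown below (assuming the alignment and normalization described above; the helper names are ours, not the released evaluation code, and the normalization by the total cluster count is one plausible reading of "percentage of the max score"):

```python
# Minimal sketch: align predicted answers to answer clusters with the Hungarian
# algorithm, crediting each cluster at most once.
import numpy as np
from scipy.optimize import linear_sum_assignment

def score_ranked_list(preds, clusters, string_sim):
    """preds: ranked answer strings; clusters: list of (reference_strings, count).
    string_sim(a, b) -> [0, 1], e.g. exact match or the synset score sketched above."""
    # Alignment score between every predicted answer and every cluster:
    # best similarity to any reference string, scaled by the cluster size.
    scores = np.zeros((len(preds), len(clusters)))
    for i, pred in enumerate(preds):
        for j, (refs, count) in enumerate(clusters):
            scores[i, j] = max(string_sim(pred, r) for r in refs) * count
    # Hungarian matching finds the assignment of answers to clusters
    # that maximizes the total credited count.
    rows, cols = linear_sum_assignment(scores, maximize=True)
    total = scores[rows, cols].sum()
    max_total = sum(count for _, count in clusters)  # perfect coverage of all clusters
    return total / max_total

# Two predictions covering two clusters of sizes 43 and 30 receive a perfect score.
clusters = [(["shower", "bathe"], 43), (["breakfast", "eat"], 30)]
print(score_ranked_list(["take a shower", "eat breakfast"], clusters,
                        lambda a, b: float(b in a)))
```

For MAX ANSWERS @ k the ranked list would be truncated to its first k answers before this matching, and for MAX INCORRECT @ k one would stop adding answers once k of them fail to match any cluster.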
We explore three baseline models for this task: a knowledge-base baseline using ConceptNet (Speer et al., 2017), a QA-based model which retrieves related posts in a discussion forum for each question, and a language-modeling baseline which examines how well modern pre-trained language models do at directly producing the answers.
ConceptNet (Speer et al., 2017) is a knowledge base containing common-sense-related triples which has been shown to be helpful for various downstream tasks (Zhong et al., 2019; Wang et al., 2019). This makes it a good potential source for solving this task, as well as for assessing how well this dataset captures existing notions of common sense. For example, for the question 'Besides music, name something you might hear on a morning radio show' and the answer 'weather', the ConceptNet triples (listen to radio, Cause, you hear local weather report) and (listen to radio, HasSubevent, hear weather report) provide valid support for the answer of weather report. We use this idea to construct the ConceptNet baseline.

The ConceptNet baseline is a purely symbolic baseline which can be evaluated without the need for training. Given a question, we extract a list of keywords from the question by removing stop words. We perform the same keyword extraction procedure for term1 and term2 of the ConceptNet triples. Provided with the list of keywords, we compare them against term1 of the ConceptNet triples. If there is any overlap between the question keywords and the term1 keywords, we return the corresponding term2 elements as possible answers to the question. The answers are further ranked by the corresponding ConceptNet triple score.

Note that this is a fairly generous baseline, since the model is able to return an unlimited number of answers; however, the resulting answers will be noisy as well, which prevents it from performing well on the Max Incorrect ranking task. The set intersection score will give some indication of the overlap with the common sense captured in ConceptNet, and provides an idea of how much gain could be had from a more sophisticated ranking of the outputs.
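A minimal sketch of this keyword-overlap lookup is shown below, over a toy in-memory triple list rather than the full ConceptNet graph (the triples, weights, and stopword set are illustrative):

```python
# Minimal sketch: rank term2 of triples whose term1 shares a keyword with the question.
STOPWORDS = {"name", "something", "that", "people", "usually", "do", "before",
             "they", "the", "a", "for", "to", "you", "might"}

TRIPLES = [  # (term1, relation, term2, weight); illustrative only
    ("listen to radio", "HasSubevent", "hear weather report", 2.0),
    ("leave house for work", "HasPrerequisite", "get dressed", 1.5),
    ("leave house for work", "HasPrerequisite", "eat breakfast", 1.0),
]

def keywords(text: str) -> set[str]:
    """Lowercase tokens with stopwords removed."""
    return {tok for tok in text.lower().split() if tok not in STOPWORDS}

def conceptnet_baseline(question: str, triples=TRIPLES) -> list[str]:
    """Return candidate answers ranked by the weight of the supporting triple."""
    q_kw = keywords(question)
    scored = [(weight, term2) for term1, _, term2, weight in triples
              if q_kw & keywords(term1)]
    return [term2 for weight, term2 in sorted(scored, reverse=True)]

print(conceptnet_baseline("Name something people usually do before they leave the house for work?"))
```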
As this dataset is in the form of questions and answers, it may be treated as a QA dataset, although the content is far from the fact-based data usually modeled in QA tasks. As the training set only shows answers out of context, one must use distant supervision in order to train a QA model on the data, a well-explored situation in modern QA work (Joshi et al., 2017).

We should note that, unlike factoid-based QA, one may expect there to be a limit in the performance of such models, as commonsense data is well-known to have a reporting bias (Gordon and Van Durme, 2013) wherein many parts of general knowledge are never explicitly stated in text. Because of that, models trained in this paradigm cannot be expected to find explicit statements of the generalizations (which would often be left unstated) but can only hope to learn how to identify situations where a particular fact is presupposed or entailed.

To train a model in this approach, we collected a set of 85,781 documents by using a web search for each question. All searches were constrained to Reddit, which contains a large amount of advice and personal narratives of a domain useful for the task. For any post matching that query, any strings matching an answer to that question in the training data would be treated as a positive example for the QA model. Table 4 illustrates the kind of examples found for a single query "name something you do at a concert", which illustrates that while many examples are roughly correct in that they address those activities in a concert environment, learning a QA model from them is more difficult.

For the baseline results reported here, we fine-tune the "BERT for QA" model of the transformers package (Wolf et al., 2019) designed for SQuAD 2.0 (Rajpurkar et al., 2018), fine-tuning BERT-large (Devlin et al., 2019). At test time, the model was applied to all passages of all Reddit threads found on the first page of the search query for the question, and the 20-best scores from each passage were combined together, reporting a ranked list using the summed scores.
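A rough sketch of the test-time aggregation follows (simplified to the single best span per passage rather than the 20-best lists described above; the SQuAD 2.0 checkpoint name is illustrative):

```python
# Minimal sketch: run an extractive QA model over retrieved passages and sum
# the confidence of identical answer strings to produce one ranked list.
from collections import defaultdict
from transformers import pipeline

qa = pipeline("question-answering",
              model="deepset/bert-large-uncased-whole-word-masking-squad2")

def ranked_answers(question: str, passages: list[str]) -> list[str]:
    """Aggregate per-passage answer spans into a single ranked answer list."""
    scores = defaultdict(float)
    for passage in passages:
        pred = qa(question=question, context=passage)  # best span + confidence
        scores[pred["answer"].lower().strip()] += pred["score"]
    return sorted(scores, key=scores.get, reverse=True)

passages = ["Before I leave for work I always take a shower and grab my keys."]
print(ranked_answers("Name something people do before they leave for work?", passages))
```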
We also report a language model generation baseline, motivated by the strong representational power of modern language models. The baseline uses the GPT2-large model (Radford et al., 2019) from the Hugging Face PyTorch implementation (Wolf et al., 2019). We obtain answer predictions from the language model either after fine-tuning on our training data or without any fine-tuning.

We transform the original question with hand-designed transformation rules so that it is compatible with the GPT2 training data, e.g. "Name something people do when they wake up." → "One thing people do when they wake up is ...". The hand-designed rules are attached in the appendix of the paper. The transformed question is used as the input to the language model, GPT2 is expected to finish the sentence, and we take the generated tokens as our predicted answer. The reported fine-tuning result is trained on the scraped corpus, with the best model selected based on performance on our annotated dev data.

In order to generate diverse answers for a given sentence, we use Nucleus Sampling (Holtzman et al., 2019) as our decoding method. We draw 300 sampled answers for each question and then group them by counts. The returned list is ranked by each answer's occurrence in the 300 samples.
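A minimal sketch of this generate-then-count procedure with the transformers library is shown below (not the exact experimental code; the checkpoint name, prompt, and sampling hyperparameters are illustrative):

```python
# Minimal sketch: nucleus-sample many short completions and rank them by frequency.
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

def sampled_answers(prompt: str, n_samples: int = 300) -> list[str]:
    """Rank generated continuations of the prompt by how often they are sampled."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # nucleus sampling (Holtzman et al., 2019)
        top_p=0.9,
        max_new_tokens=5,          # answers are short words or phrases
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    continuations = [
        tokenizer.decode(out[prompt_len:], skip_special_tokens=True).strip()
        for out in outputs
    ]
    return [ans for ans, _ in Counter(continuations).most_common()]

print(sampled_answers("One thing people do when they wake up is")[:5])
```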
Metric (%) | ConceptNet | QA Model | GPT2 | GPT2 Fine-Tuned | Human
Ranking @ 10 | 6.9 | 20.0 | 31.2 | 58.2 | -
Exact Match | - | - | 28.9 | 32.8 | 46.1
Set Intersection @ 10 | - | - | 26.7 | 44.4 | 65.8

Table 3: Results on the annotated test set.
Q: Name something you do at a concert:
A: But you are always expected to clap for the spalla.
A: I'll often buy a drink for something to do, or check my email on my phone, or whatever, to kill time. once the band starts i'm focused on that

Table 4: Examples of distant-learning positive examples used for training the QA baseline.
As can be seen from Table 3, the fine-tuned GPT2 model, which independently samples answers, performs the best. To benchmark human performance against such models, we collected 30 human responses per question and aggregated them by counts (like the GPT2 predictions). The last column of Table 3 reports the human performance.
Discussion
It is clear from Table 3 that both the KB- and QA-based models significantly underperform on our dataset. The low performance of the KB baseline hints at the low coverage in ConceptNet of the knowledge required for answering the prototypical scenarios of our questions. Similarly, powerful QA models and large LMs trained on large corpora also seem not to have the common sense knowledge required to answer the questions. Surprisingly, the performance of the GPT2 model that was further fine-tuned on our training data improved significantly, suggesting the usefulness of our accompanying training set. However, human performance on all our metrics significantly outperforms all baselines, suggesting a large scope for improvement.
A wide variety of commonsense reasoning datasets now exist, although none address the same kind of commonsense generalizations evaluated here. Datasets exist evaluating plausible contexts, reasons or results for physical commonsense, social reasoning, visual question answering and abductive reasoning (Bisk et al., 2019; Bhagavatula et al., 2019; Huang et al., 2019; Sap et al., 2019b; Zellers et al., 2018, 2019), but they differ in evaluating against negative samples during evaluation. The ATOMIC dataset (Sap et al., 2019a) is a commonsense reasoning dataset more similar to that proposed here; while ATOMIC utilizes if-then reasoning (such as the resultant states and motivations of participants), it also assumes an open-domain task of freely predicting strings, although they evaluate using human assessment. Related veins of work study commonsense reasoning and inference or entailment (Zhang et al., 2017; Bowman et al., 2015; Roemmele et al., 2011; Levesque et al., 2012).

This particular dataset might be said to study generalizations, or prototypical events and situations. This naturally has connections to the modeling of scripts and frames (Schank and Abelson, 1977; Chambers and Jurafsky, 2009; Fillmore et al., 1976; Ferraro and Van Durme, 2016), but we assume no need to predict latent structures.
We have presented a new common sense dataset with many novel features. The inclusion of counts over clusters of answers provides a very rich structure to train and evaluate with. The collection of a large set of answers and a proposed automated method of assigning answers to clusters facilitate an open-ended style of evaluation, which is often the desired use-case for these models. As shown in Table 3, existing fine-tuned state-of-the-art models still have a long way to go before modeling the distribution of this common sense data.
In addition to the elements of this task which are appealing from a common sense modeling perspective, the inherent appeal of this task to humans opens a number of possibilities for future data collection and evaluation. While the high availability of crowdsource workers has led to great progress in dataset generation, it is not without its flaws, and weeding out poor-quality responses is often non-trivial for more interesting tasks. On the other hand, millions of people play this game as an app on their phones, not for any monetary gain but simply for their own enjoyment. In the future we propose to collect more data by creating a form of this game, leveraging people's natural interest and enjoyment and using mechanism design to encourage high-quality answers to more common sense questions.
We thank the IESL and NLP labs at UMass Amherst for their generous efforts in assisting with data collection. This work was supported in part by the Center for Intelligent Information Retrieval and the Center for Data Science, in part by the Chan Zuckerberg Initiative under the project Scientific Knowledge Base Construction, and in part by DARPA. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

http://shorturl.at/lnoEZ
References
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2019. PIQA: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641.

Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, et al. 2018. A systematic classification of knowledge, reasoning, and context within the ARC dataset. arXiv preprint arXiv:1806.00358.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17-53, Uppsala, Sweden. Association for Computational Linguistics.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Francis Ferraro and Benjamin Van Durme. 2016. A unified Bayesian model of scripts, frames and language. In Thirtieth AAAI Conference on Artificial Intelligence.

Charles J. Fillmore et al. 1976. Frame semantics and the nature of language. In Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language and Speech.

Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In AKBC.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. arXiv preprint arXiv:1909.00277.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.

Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly.

Alon Lavie and Michael J. Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23(2-3):105-115.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In International Conference on the Principles of Knowledge Representation and Reasoning.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Peter LoBue and Alexander Yates. 2011. Types of common-sense knowledge needed for recognizing textual entailment. In ACL.

Xiaoqiang Luo, Sameer Pradhan, Marta Recasens, and Eduard Hovy. 2014. An extension of BLANC to system mentions. In ACL.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. In NAACL.

James Munkres. 1957. Algorithms for the assignment and transportation problems. SIAM.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In *SEM.

Dragomir R. Radev, Simone Teufel, Horacio Saggion, Wai Lam, John Blitzer, Hong Qi, Arda Celebi, Danyu Liu, and Elliott Drabek. 2003. Evaluation challenges in large-scale document summarization. In ACL.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Tech report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Marta Recasens and Eduard Hovy. 2011. BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019a. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019b. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals, and Understanding. Hillsdale, NJ: Lawrence Erlbaum.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In CoNLL.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In NAACL.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, et al. 2019. Improving natural language inference using external knowledge in the science questions domain. In AAAI.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL.

Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. TACL.

Wanjun Zhong, Duyu Tang, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2019. Improving question answering by commonsense-based pre-training. In CCF International Conference on Natural Language Processing and Chinese Computing.

A WordNet Similarity Function
1. Let S be the set of synsets in WordNet, and let S(x) be the set of synsets associated with the string x.

2. Let SynsetSim(X, Y) : S × S → [0, 1] be a score for synset similarity, e.g.

   SynsetSim(X, Y) := 1 if X = Y, and 0 otherwise.
3. A given string may correspond to multiple synsets. Given two strings x and y we define

   SynsetsScore(x, y) = max { SynsetSim(S_x, S_y) : S_x ∈ S(x), S_y ∈ S(y) }.
4. Some valid answer strings may not correspond to a synset at all, so we define

   SubstringScore(x, y) = max( SynsetsScore(x, y), ExactMatch(x, y) ).
5. Some answers are several words long, and therefore won't map to a synset even if some substring would. To account for this, we tokenize and strip stopwords from both the predicted and ground-truth answer strings. To compare these sets of tokens A and B, we let M(A, B) be the set of all possible (partial) matchings between elements in A and B, and then define

   TokensScore(A, B) = max_{m ∈ M(A, B)} [ Σ_{(a, b) ∈ m} SubstringScore(a, b) ] / max(|A|, |B|).
6. We repeat this process for every element in an answer cluster C, which is a set of strings obtained from the survey, and then set the overall score for this answer cluster to be

   WordNetScore(x, C) = max { TokensScore(T(x), T(y)) : y ∈ C },

   where T(·) denotes the tokenization described in step 5.

Remark.
Fully tokenizing the input has the potential to lose information, since some WordNet clusters are labeled with multiple words. Consider comparing "chewing gum" with "gum". The above process would assign this a score of 0.5, because tokenizing yields ["chewing", "gum"]; however "chewing gum" is, itself, in the same WordNet synset as "gum". The solution to this problem in general is to compare all possible partitions of the tokens, and define the overall PartitionsScore to be the maximum among all pairs of possible partitions of the predicted answer and the ground-truth string. We replace the TokensScore with this PartitionsScore to capture such situations.
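A minimal sketch of the token-level matching in steps 5-6 is given below, using the Hungarian solver to find the best partial matching of tokens (the helper names are ours; substring_score stands in for the SubstringScore defined above, and the stopword list is illustrative):

```python
# Minimal sketch: TokensScore as an assignment problem over token pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment

STOPWORDS = {"a", "an", "the", "of", "to"}  # illustrative stopword list

def tokens(text: str) -> list[str]:
    return [t for t in text.lower().split() if t not in STOPWORDS]

def tokens_score(pred: str, ref: str, substring_score) -> float:
    """Best partial matching of tokens, normalized by the longer token set."""
    a, b = tokens(pred), tokens(ref)
    if not a or not b:
        return 0.0
    scores = np.array([[substring_score(x, y) for y in b] for x in a])
    rows, cols = linear_sum_assignment(scores, maximize=True)
    return scores[rows, cols].sum() / max(len(a), len(b))

# "chewing gum" vs "gum": one of two tokens matches exactly, giving 0.5,
# which is why the remark above also compares whole partitions of the tokens.
print(tokens_score("chewing gum", "gum", lambda x, y: float(x == y)))
```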
With a scoring method as described, it is possible for an answer to receive a positive score for multiple clusters. We take the following approach:

1. Round the scores to {0, 1} to make a "hard" cluster decision.
2. For a given question, if some predicted answers match with multiple clusters, we choose the maximum matching with respect to the final score.

B GPT2 Transformation Rules