Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning
Rachel Gardner Maya Varma Clare Zhu Ranjay Krishna
Department of Computer Science, Stanford University, CA
{rachel0, mvarma2, clarezhu, ranjaykrishna}@cs.stanford.edu

Abstract
Datasets extracted from social networks and online forums are often prone to the pitfalls of natural language, namely the presence of unstructured and noisy data. In this work, we seek to enable the collection of high-quality question-answer datasets from social media by proposing a novel task for automated quality analysis and data cleaning: question-answer (QA) plausibility. Given a machine- or user-generated question and a crowd-sourced response from a social media user, we determine if the question and response are valid; if so, we identify the answer within the free-form response. We design BERT-based models to perform the QA plausibility task, and we evaluate the ability of our models to generate a clean, usable question-answer dataset. Our highest-performing approach consists of a single-task model which determines the plausibility of the question, followed by a multi-task model which evaluates the plausibility of the response as well as extracts answers (Question Plausibility AUROC=0.75, Response Plausibility AUROC=0.78, Answer Extraction F1=0.665).
Introduction

Large, densely-labeled datasets are a critical requirement for the creation of effective supervised learning models. The pressing need for high quantities of labeled data has led many researchers to collect data from social media platforms and online forums (Abu-El-Haija et al., 2016; Thomee et al., 2016; Go et al., 2009). Due to the presence of noise and the lack of structure in these data sources, manual quality analysis (usually performed by paid crowdworkers) is necessary to extract structured labels, filter irrelevant examples, standardize language, and perform other preprocessing tasks before the data can be used. However, obtaining dataset annotations in this manner is a time-consuming and expensive process that is often prone to errors.

In this work, we develop automated data cleaning and verification mechanisms for extracting high-quality data from social media platforms. We specifically focus on the creation of question-answer datasets, in which each data instance consists of a question about a topic and the corresponding answer. In order to filter noise and improve data quality, we propose the task of question-answer (QA) plausibility, which includes the following three steps:

• Determine question plausibility: Depending on the type of dataset being constructed, the question posed to respondents may be generated by a machine or a human. We determine the likelihood that the question is both relevant and answerable.

• Determine response plausibility: We predict whether the user's response contains a reasonable answer to the question.

• Extract answer from free-form response: If the response is deemed to be plausible, we identify and extract the segment of the response that directly answers the question.

Because we assume social media users generally answer questions in good faith (and are posed questions which they can answer), we can assume plausible answers are correct ones (Park et al., 2019). Necessarily, if this property were not satisfied, then any adequate solutions would require the very domain knowledge of interest. Therefore, we look to apply this approach toward data with this property.

In this study, we demonstrate an application of QA plausibility in the context of visual question answering (VQA), a well-studied problem in the field of computer vision (Antol et al., 2015). We assemble a large VQA dataset with images collected from an image-sharing social network, machine-generated questions related to the content of the image, and responses from social media users. We then train a multi-task BERT-based model and evaluate the ability of the model to perform the three subtasks associated with QA plausibility. The methods presented in this work hold potential for reducing the need for manual quality analysis of crowdsourced data as well as enabling the use of question-answer data from unstructured environments such as social media platforms. All code is available at https://github.com/rachel-1/qa_plausibility.

Related Work

Prior studies on the automated labeling task for datasets derived from social media typically focus on the generation of noisy labels; models trained on such datasets often rely on weak supervision to learn relevant patterns. However, approaches for noisy label generation, such as Snorkel (Ratner et al., 2017) and CurriculumNet (Guo et al., 2018), often use functions or other heuristics to generate labels. One such example is the Sentiment140 dataset, which consists of 1.6 million tweets labeled with corresponding sentiments based on the emojis present in the tweet (Go et al., 2009). In this case, the presence of just three category labels (positive, neutral, negative) simplifies the labeling task and reduces the effects of incorrect labels on trained models; however, this problem becomes increasingly more complex and difficult to automate as the number of annotation categories increases.

Previous researchers have studied question relevance by reasoning explicitly about the information available to answer the question. Several VQA studies have explicitly extracted premises, or assumptions made by questions, to determine if the original question is relevant to the provided image (Mahendru et al., 2017; Prabhakar et al., 2018). A number of machine comprehension models have been devised to determine the answerability of a question given a passage of text (Rajpurkar et al., 2018; Back et al., 2020). In contrast, we are able to leverage the user's freeform response to determine if the original question was valid. Our model is also tasked with supporting machine-generated questions, which may be unanswerable and lead to noisy user-generated responses.

While the concept of answer plausibility in user responses has also been previously explored, existing approaches use hand-crafted rules and knowledge sources (Smith et al., 2005). By using a learned approach, we give our system the flexibility to adapt with the data and cover a wider variety of cases.
Dataset

The dataset consists of questions and responses collected from an image-sharing social media platform. We utilize an automated question-generation bot in order to access public image posts, generate a question based on image features, and record data from users that replied to the question, as shown in Figure 1 (Krishna et al., 2019). Because the question-generation bot was designed to maximize information gain, it generates questions across a wide variety of categories, including objects, attributes, spatial relationships, and activities (among others). For the sake of space, we refer readers to the original paper for more information on the method of question generation and the diversity of the resulting questions asked. All users that contributed to the construction of this dataset were informed that they were participating in a research study, and IRB approval was obtained for this work. For the privacy of our users, the dataset will not be released at this time. Rather than focus on the specific dataset, we wish to instead present a general method for cleaning user-generated datasets and argue its generality even to tasks such as visual question answering.
Figure 1: An example question and response pair collected from social media. The bot asks "What is the girl wearing?" and the user responds "he is a boy" (for privacy reasons, the accompanying image is shown as a stock photo from pixabay.com). Note that since the questions are generated by a bot, the question may not always be relevant to the image, as demonstrated here.
Figure 2: Overview of the QA plausibility task, with representative examples. Given a question and user response, we determine if the question and response are plausible given the image. If so, we then extract a structured answer label from the response. (The figure depicts the pipeline stages: question relevance, response relevance, answer extraction, and vocabulary reduction; implausible questions are replaced by the generic "What is in the image?", while implausible responses are discarded.)

The dataset was labeled by crowdworkers on Amazon Mechanical Turk (AMT), who performed three annotation tasks, as shown in Table 1: (1) determine if the question was plausible, (2) determine if the response was plausible, and (3) if the response was deemed to be plausible, extract an answer span. Plausible questions and answers are defined as those that accurately refer to the content of the image.
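For concreteness, each labeled example produced by this annotation procedure can be viewed as a small structured record. The sketch below is illustrative only: the class and field names are our own rather than the schema of the actual annotation pipeline, and the extracted span in the example is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabeledExample:
    """One crowd-labeled question-response pair (illustrative schema, not the authors' code)."""
    question: str                       # machine-generated question posed to the user
    response: str                       # free-form reply from the social media user
    question_plausible: bool            # task (1): does the question accurately refer to the image?
    response_plausible: bool            # task (2): does the response contain a reasonable answer?
    answer_span: Optional[str] = None   # task (3): filled in only when the response is plausible

# Mirrors example (3) in Table 1: the question is implausible, but the user's
# correction still yields a usable answer (the span shown here is hypothetical).
example = LabeledExample(
    question="What is on top of the cake?",
    response="that is not cake that's chicken",
    question_plausible=False,
    response_plausible=True,
    answer_span="chicken",
)
```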
#   Question / Response                        Valid?   % of dataset
1   Q: ...                                     Y        ...
    R: ...                                     Y
2   Q: What is the person doing?               Y        22.8
    R: not much lol                            N
3   Q: What is on top of the cake?             N        11.4
    R: that is not cake that's chicken         Y
4   Q: What is the hamster doing?              N        15.3
    R: that is not a hamster                   N

Table 1: Representative examples of cases present in the data, and the percentage of examples represented by each class in our dataset. Examples (1) and (2) have valid questions that accurately refer to the corresponding images, while (3) and (4) do not correctly refer to objects in the image. However, in example (3), the user identifies the error made by the bot and correctly refers to the object in the image; as a result, this response is classified as valid.

It is important to note that since the question-generation process is automated, the question could be unrelated to the image due to bot errors; however, in such situations where the question is deemed to be implausible, the response may still be valid if it accurately refers to the content of the image. If the response is judged to be plausible, the AMT crowdworker must then extract the answer span from the user's response. In order to capture the level of detail we required (while discouraging AMT crowdworkers from simply copy/pasting the entire response), we set the maximum length of an answer span to be five words for the labeling step. However, the final model itself is not limited to answers of any particular length.

For cost reasons, each example was labeled by only one annotator. While we could have averaged labels across annotators, we found that the majority of the labeling errors were due to misunderstandings of the non-standard task, meaning that errors were localized to particular annotators rather than randomly spread across examples. This issue was mitigated by adding a qualifying task and manually reviewing a subset of labels per worker for the final data collection.

While one might expect images to be necessary (or at least helpful) for determining question and response plausibility, we found that human annotators were able to determine the validity of the inputs based solely on text, without the need for the accompanying image. In our manual analysis of several hundred examples (approximately 5% of the dataset), we found that every example which required the image to label properly could be categorized as a "where" question. When the bot asked questions of the general form "where is the X" or "where was this taken," users assumed our bot had basic visual knowledge and was therefore asking a question not already answered by the image (such as "where is the dog now" or "what part of the world was this photo taken in"). This led to valid responses that did not pertain to image features and were therefore not helpful for training downstream models. Table 2 gives one such example. Once we removed these questions from the dataset, we could not find a single remaining example that required image data to label properly. As a result, we were able to explore the QA plausibility task in a VQA setting, despite not examining image features.

Question / Response                      Valid?
Q: Where is the dog?                     Y
R: sitting next to me on the sofa        N

Table 2: Example requiring analysis of the original image (removed from the dataset along with other "where" questions, which often lead to confusion).

Our preprocessing steps and annotation procedure resulted in a total of 7200 question-response pairs with answer labels. We use a standard split of 80% of the dataset for training, 10% for validation, and 10% for testing.
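As a rough illustration of this preprocessing step, the "where"-question filter and the 80/10/10 split could be implemented along the following lines. This is a sketch under our own assumptions: the prefix heuristic, the use of scikit-learn's train_test_split, and the LabeledExample records from the earlier sketch are not taken from the paper.

```python
from sklearn.model_selection import train_test_split

def is_where_question(example):
    """Heuristic flag for the 'where' questions that required image context to label (assumption)."""
    return example.question.lower().startswith("where")

def filter_and_split(labeled_examples, seed=0):
    """Drop 'where' questions, then return an 80/10/10 train/validation/test split."""
    filtered = [ex for ex in labeled_examples if not is_where_question(ex)]
    train, held_out = train_test_split(filtered, test_size=0.2, random_state=seed)
    val, test = train_test_split(held_out, test_size=0.5, random_state=seed)
    return train, val, test
```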
Model Architecture

As shown in Figure 3, we utilized a modified BERT model to perform the three sub-tasks associated with QA plausibility. The model accepts a concatenation of the machine-generated question and user response as input, with the [CLS] token inserted at the start of the sentence and the [SEP] token inserted to separate the question and response (e.g., "[CLS] What is the girl wearing ? [SEP] he is a boy [SEP]").

Figure 3: Model architecture. The question and user response serve as input to a modified BERT model, which will output question plausibility, response plausibility, and an answer label.
In order to perform the question plausibility classification task, the pooled transformer output is passed through a dropout layer (p=0.5), a fully connected layer, and a softmax activation function. An identical approach is used for response plausibility classification. To extract the answer span, encoded hidden states corresponding to the last attention block are passed through a single fully connected layer and softmax activation; this yields two probability distributions over tokens, with the first representing the start token and the second representing the end token. The final model output includes the probability that the question and response are plausible, with each expressed as a score between 0 and 1; if the response is deemed to be plausible, the model also provides the answer label, which is expressed as a substring of the user response.
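A minimal sketch of this architecture in PyTorch with the HuggingFace transformers library is shown below. It is not the authors' released implementation (available at the repository linked above): the use of the pooled [CLS] output, the single shared span head producing start and end logits, and the head dimensions are our reading of the description above, and logits are returned so that the softmax can be applied by the loss function or at inference time.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class QAPlausibilityModel(nn.Module):
    """BERT encoder with two plausibility heads and one span-extraction head (sketch)."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size           # 768 for BERT Base
        self.dropout = nn.Dropout(p=0.5)
        self.question_head = nn.Linear(hidden, 2)       # question plausible / implausible
        self.response_head = nn.Linear(hidden, 2)       # response plausible / implausible
        self.span_head = nn.Linear(hidden, 2)           # start and end logits for every token

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        pooled = self.dropout(outputs.pooler_output)    # pooled [CLS] representation
        question_logits = self.question_head(pooled)
        response_logits = self.response_head(pooled)
        # Hidden states from the final attention block feed the span head.
        start_logits, end_logits = self.span_head(outputs.last_hidden_state).split(1, dim=-1)
        return question_logits, response_logits, start_logits.squeeze(-1), end_logits.squeeze(-1)

# The question and response are packed into a single "[CLS] ... [SEP] ... [SEP]" sequence.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
inputs = tokenizer("What is the girl wearing?", "he is a boy", return_tensors="pt")
model = QAPlausibilityModel()
q_logits, r_logits, start_logits, end_logits = model(**inputs)
```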
Experiments
We utilized a pretrained BERT Base Uncased model, which has 12 layers, 110 million parameters, a hidden layer size of 768, and a vocabulary size of 30,522. We trained several single-task and multi-task variants of our model in order to measure performance on the three subtasks associated with QA plausibility. In the multi-task setting, loss values from the separate tasks are combined; however, an exception to this exists if the user's response is classified as implausible. In these cases, the answer span extraction loss is manually set to zero and the answer extraction head is not updated. We evaluated performance on question and response plausibility by computing accuracy and AUC-ROC scores. Performance on the answer span extraction task was evaluated with F1 scores, which measure overlap between the predicted answer label and the true answer (Rajpurkar et al., 2018).
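The combined loss could be written along the following lines. This is a sketch rather than the authors' code: it assumes cross-entropy losses for each head, gold response labels (with 1 meaning plausible) gating the span loss, and an unweighted sum of the task losses.

```python
import torch.nn.functional as F

def multitask_loss(q_logits, r_logits, start_logits, end_logits,
                   q_labels, r_labels, start_positions, end_positions):
    """Sum of the three task losses; the span loss is zeroed for implausible responses (sketch)."""
    loss_question = F.cross_entropy(q_logits, q_labels)
    loss_response = F.cross_entropy(r_logits, r_labels)

    # Per-example losses over start/end token positions.
    loss_start = F.cross_entropy(start_logits, start_positions, reduction="none")
    loss_end = F.cross_entropy(end_logits, end_positions, reduction="none")

    # Zero out the answer-extraction loss whenever the response is labeled implausible,
    # so the answer extraction head is not updated on examples without a valid answer.
    plausible = (r_labels == 1).float()
    loss_span = ((loss_start + loss_end) * plausible).sum() / plausible.sum().clamp(min=1.0)

    return loss_question + loss_response + loss_span
```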
Results

We investigated the performance of our BERT model on the various subtasks associated with QA plausibility. Results are summarized in Table 3. Single-task models trained individually on the subtasks achieved an AUC-ROC score of 0.75 on the question plausibility task, an AUC-ROC score of 0.77 on the response plausibility task, and an F1 score of 0.568 on the answer extraction task. A multi-task model trained simultaneously on all three tasks demonstrated decreased performance on the question and response plausibility tasks when compared to the single-task models. We found that the highest performance was achieved when a single-task model trained on the question plausibility task was followed by a multi-task model trained on both the response plausibility and answer extraction tasks; this model achieved an AUC-ROC score of 0.75 on question plausibility, an AUC-ROC score of 0.79 on response plausibility, and an F1 score of 0.665 on answer extraction.

Combined Task                     Question Plausibility   Response Plausibility   Answer Extraction
                                  Acc        AUROC        Acc        AUROC        F1
Question Plausibility (QP) only   ...%       0.75         -          -            -
Response Plausibility (RP) only   -          -            64.62%     0.7674       -
Answer Extraction only            -          -            -          -            0.568
RP and Answer Extraction          -          -            ...        0.79         0.665
QP, RP and Answer Extraction      63.90%     0.6803       60.91%     0.6881       0.6160

Table 3: Model Evaluation Metrics. Performance metrics of our model are shown here. Multi-task learning helps improve performance when the model is simultaneously trained on the response plausibility and answer extraction subtasks, but decreases performance when the model is simultaneously trained on all three subtasks.

Our results suggest that multi-task learning is most effective when the tasks are closely related, such as with response plausibility and answer extraction. Since the BERT architecture is extremely quick for both training and evaluation, we found that the increase in performance afforded by using a single-task model and multi-task model in series was worth the overhead of training two separate models. It is worth noting that a more complicated model architecture might have been able to better accommodate the loss terms from all three subtasks, but we leave such efforts to future work.
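Since the best configuration chains the single-task question-plausibility model with the multi-task response/answer model, the resulting data-cleaning step at inference time can be organized as a simple two-stage filter. The sketch below is ours rather than the released code: the two predictor callables, the 0.5 decision threshold, and the rewriting of implausible questions to the generic "What is in the image?" (following Figure 2) are all assumptions.

```python
def clean_qa_pairs(examples, predict_question_plausibility, predict_response_and_answer,
                   threshold=0.5):
    """Two-stage cleaning: single-task question model, then multi-task response/answer model.

    `examples` is an iterable of (question, response) string pairs. The two predictors are
    assumed to wrap trained models: the first returns a question-plausibility probability,
    the second returns a response-plausibility probability and an extracted answer span.
    """
    cleaned = []
    for question, response in examples:
        # Stage 1: score the machine-generated question. Following Figure 2, an implausible
        # question is replaced by a generic one rather than discarded, since the response
        # may still be valid (see Table 1, example 3).
        if predict_question_plausibility(question, response) < threshold:
            question = "What is in the image?"

        # Stage 2: score the response and extract a structured answer label.
        response_prob, answer_span = predict_response_and_answer(question, response)
        if response_prob < threshold:
            continue  # discard responses that do not actually answer the question
        cleaned.append({"question": question, "answer": answer_span})
    return cleaned
```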
Conclusion

Deep learning studies are often hindered by lack of access to large datasets with accurate labels. In this paper, we introduced the question-answer plausibility task in an effort to automate the data cleaning process for question-answer datasets collected from social media. We then presented a multi-task deep learning model based on BERT, which accurately identified the plausibility of machine-generated questions and user responses as well as extracted structured answer labels. Although we specifically focused on the visual question answering problem in this paper, we expect that our results will be useful for other question-answer scenarios, such as in settings where questions are user-generated or images are not available.

Overall, our approach can help improve the deep learning workflow by processing and cleaning the noisy and unstructured natural language text available on social media platforms. Ultimately, our work can enable the generation of large-scale, high-quality datasets for artificial intelligence models.
References
Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A large-scale video classification benchmark.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering.

Seohyun Back, Sai Chetan Chinthakindi, Akhil Kedia, Haejun Lee, and J. Choo. 2020. NeurQuRI: Neural question requirement inspector for answerability prediction in machine reading comprehension. In ICLR.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision.

Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, and Dinglong Huang. 2018. CurriculumNet: Weakly supervised learning from large-scale web images. CoRR, abs/1808.01097.

Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2019. Information maximizing visual question generation. CoRR, abs/1903.11207.

Aroma Mahendru, Viraj Prabhu, Akrit Mohapatra, Dhruv Batra, and Stefan Lee. 2017. The promise of premise: Harnessing question premises in visual question answering. EMNLP 2017, abs/1705.00601.

Junwon Park, Ranjay Krishna, Pranav Khadpe, Li Fei-Fei, and Michael Bernstein. 2019. AI-based request augmentation to increase crowdsourcing participation. In Proceedings of the Seventh AAAI Conference on Human Computation and Crowdsourcing.

Prakruthi Prabhakar, Nitish Kulkarni, and Linghao Zhang. 2018. Question relevance in visual question answering. arXiv preprint abs/1807.08435.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. Association for Computational Linguistics (ACL), abs/1806.03822.

Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. CoRR, abs/1711.10160.

Troy Smith, Thomas M. Repede, and Steven L. Lytinen. 2005. Determining the plausibility of answers to questions. American Association for Artificial Intelligence.

Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research.