LRG at SemEval-2021 Task 4: Improving Reading Comprehension with Abstract Words using Augmentation, Linguistic Features and Voting

Abheesht Sharma* (Dept. of CS&IS, BITS Pilani, Goa Campus), [email protected]
Harshit Pandey* (Dept. of Computer Science, Pune University), [email protected]
Gunjan Chhablani* (Oracle India Pvt. Ltd., Hyderabad, India), [email protected]
Yash Bhartia* (Dept. of CS&IS, BITS Pilani, Goa Campus), [email protected]
Tirtharaj Dash (APP Centre for AI Research, BITS Pilani, Goa Campus), [email protected]

* Equal contribution. Author ordering determined by coin flip.
Abstract
In this article, we present our methodologies for SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning. Given a fill-in-the-blank-type question and a corresponding context, the task is to predict the most suitable word from a list of 5 options. There are three sub-tasks within this task: Imperceptibility (subtask-I), Non-Specificity (subtask-II), and Intersection (subtask-III). We use encoders of transformer-based models pre-trained on the masked language modelling (MLM) task to build our Fill-in-the-Blank (FitB) models. Moreover, to model imperceptibility, we define certain linguistic features, and to model non-specificity, we leverage information from hypernyms and hyponyms provided by a lexical database. Specifically, for non-specificity, we try out augmentation techniques as well as other statistical techniques. We also propose variants, namely Chunk Voting and Max Context, to take care of input length restrictions for BERT, etc. Additionally, we perform a thorough ablation study, and use Integrated Gradients to explain our predictions on a few samples. Our best submissions achieve accuracies of 75.31% and 77.84% on the test sets for subtask-I and subtask-II, respectively. For subtask-III, we achieve accuracies of 65.64% and 62.27%. The code is available here.
1 Introduction

A very common assessment in schools is question-answering based on a given "comprehension passage". Students are given a comprehension passage, from which they are supposed to glean the necessary information and answer short questions (such as fill-in-the-blank-type questions) based on what they have garnered from the passage. While trying to find the most appropriate word for a blank, the children look at the words surrounding the blank (the "context"). The word should be such that, when it fills the blank, the sentence makes sense and is grammatically correct. Inspired by this, and perhaps after the enormous success of Transformers (Vaswani et al., 2017), researchers at Google came up with a number of "pretraining tasks" and built knowledge-heavy language models which could be fine-tuned on various natural language processing (NLP) downstream tasks. One of the earlier pretraining tasks was Masked Language Modelling (MLM), one of the two pretraining tasks of the breakthrough model BERT (Devlin et al., 2019). The approach here was similar to how kids are taught language at school: some tokens in the text are randomly "masked" and the model is trained to predict these masked tokens.

SemEval-2021 Task 4 (Zheng et al., 2021) focuses on a similar idea. Every sample has an article and a corresponding question. The question has a blank which the model is supposed to fill from a set of 5 options. The novelty of the task lies in its 3 subtasks: Imperceptibility (subtask-I), Non-Specificity (subtask-II), and Intersection (subtask-III). A description of these subtasks is given in Section 3. In this work, we propose using BERT and its derivative models such as DistilBERT (Sanh et al., 2019), ALBERT (Lan et al., 2019) and RoBERTa (Liu et al., 2019). Further, we propose 2 BERT variants: (1) BERT Voting; (2) BERT Max Context. Most importantly, we also model the concepts of imperceptibility and non-specificity. For imperceptibility, we create statistical embeddings using features that have a high correlation with concreteness. For non-specificity, we propose two approaches: (1) we augment the dataset by replacing some nouns in the article with their hypernyms; and (2) we use the options' hyponyms to decide the most appropriate option. We also experiment with GA-Reader (Dhingra et al., 2017b) and GSAMN-based approaches (Lai et al., 2019) by trying out their various combinations with BERT.

In Section 2, we perform a succinct literature survey. Section 3 elucidates our approach, including the modelling aspect, the various variants of the base model, and the different ways we model imperceptibility and non-specificity. In Section 4, we perform an extensive ablation and comparative study.

2 Related Work

The advent of large-scale question answering systems began with straightforward tasks, like the one introduced by the SimpleQuestions dataset (Bordes et al., 2015), which consisted of knowledge-base fact triples that were later used to answer questions. However, this dataset would only judge a model on its ability to relate the facts to the question at hand. The purpose of NLP research is to create a generalised model that can answer questions based on any context; thus, datasets like CNN/Daily Mail (Hermann et al., 2015) and SQuAD (Rajpurkar et al., 2016) were created. In a typical question-answering dataset, an original and anonymised context is provided before each question. Before transformers, methods based on LSTMs/GRUs were used to achieve good results on the aforementioned tasks. These datasets, however, always had the answers present in the passage.

The CLOTH dataset (Xie et al., 2018) focuses on passages from middle-school and high-school texts, with multiple fill-in-the-blanks per passage. The ReCAM dataset (Zheng et al., 2021) puts a twist on archetypal fill-in-the-blank datasets by providing answer choices that are abstract in some form and that are not available in the passage itself. Models created for this QA task have to take into account semantic relations between the options and the context. GA-Reader (Dhingra et al., 2017b) is one such model, which utilises a multi-hop architecture with a novel attention mechanism and serves as a baseline for this task.
3 Approach

Fill-in-the-Blank (FitB) Model

The first model we employ follows a cloze-style question answering approach, in which we use various pretrained transformer models as encoders, followed by a decoder layer, which helps us select the correct answer.
Figure 1: An example of a hypernymy tree ("Animal" branches into "Dog" and "Fish"; "Dog" into "Terrier" and "Hound"; "Fish" into "Seafish" and "Freshwater Fish").
Specifically, we leverage BERT along with some of its popular and successful variants, such as DistilBERT, ALBERT, and RoBERTa. In the MLM task, tokens in the text are randomly masked, and the model is trained in a self-supervised way to predict these masked tokens. Conceptually, these transformer-based models are expected to take care of bidirectional context while predicting the masked token.

In our method, the transformer model first learns contextual embeddings of the article and the question. In the next block, the embedding of the masked token (i.e., the blank) is passed through a fully-connected layer whose number of outputs corresponds to the size of the vocabulary of the pretrained model. Each candidate option is tokenised using the WordPiece tokeniser (Wu et al., 2016) and mapped to the corresponding entries in the output vocabulary space. If a candidate option generates multiple tokens, we average the mapped scores. The model chooses the option with the highest regression score. An overview of the model is given in Figure 2.
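As an illustration, the following is a minimal sketch of this option-scoring step using HuggingFace Transformers. The model name, the toy article/question/options, and all variable names are illustrative assumptions rather than our released code; the point is that the logits at the masked position are averaged over each option's sub-tokens and the argmax is taken.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Minimal sketch of FitB option scoring (model name and example inputs are illustrative).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

article = "The city postponed its elections because of a stalemate over the electoral law."
question = "Protesters marched over the delay in the country's [MASK] elections."
options = ["legislative", "annual", "local", "municipal", "devastating"]

# Encode the (article, question) pair; long articles are truncated to 512 tokens.
inputs = tokenizer(article, question, return_tensors="pt", truncation=True, max_length=512)
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]  # scores over the whole vocabulary

# Score each option by averaging the logits of its WordPiece sub-tokens.
scores = []
for option in options:
    sub_ids = tokenizer(option, add_special_tokens=False)["input_ids"]
    scores.append(logits[sub_ids].mean().item())

print(options[int(torch.tensor(scores).argmax())])
```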
Figure 2: Architecture of the transformer-based FitB model. (a) BERT FitB Encoder: the tokenised article and question (with [CLS], [SEP] and [MASK] tokens) are passed through an MLM-based transformer encoder, and the masked-token embedding is fed to a linear layer to produce a regression score for every token in the vocabulary. (b) BERT FitB Classifier: each option's tokens are looked up in these scores, averaged, and the argmax gives the predicted label.

Modelling Imperceptibility

Nouns can be clearly demarcated into two broad categories: concrete nouns and abstract nouns. Concrete nouns are words that represent tangible concepts, i.e., any noun referring to a name, place, object, material, etc. is considered a concrete word. Concrete words refer to concepts that can be perceived by the five human senses: sight, sound, smell, taste, and touch. In contrast, any noun alluding to an abstract concept that cannot be experienced by our senses is an abstract word (Spreen and Schulz, 1966). In subtask-I, the model has to predict the most accurate and most imperceptible word from the given options. To model the imperceptibility of every word, we incorporate certain linguistic features which are highly correlated with the notion of "imperceptibility". These linguistic features are listed below:
Length and Frequency of the Word
In the existing literature, authors have claimed that there is strong evidence that concrete words are, in general, shorter than abstract words (Tanaka et al., 2013). A reasonable justification is that more frequently used words tend to be short (Feng et al., 2011), and since humans have a penchant for describing objects, places, or things near them, these frequently used words are generally concrete nouns. It is rather intuitive that humans would prefer ease of pronunciation for oft-used words. Moreover, many abstract words in the English language are formed by adding suffixes to a root word: "coarse" becomes "coarseness", "forget" becomes "forgetfulness", and so on (Tanaka et al., 2013).
Number of Senses of the Word
In linguistics, polysemy refers to the capacity of a word to have multiple meanings or senses. Abstract nouns are observed to be more "polysemous" than concrete nouns (Tanaka et al., 2013). For example, in WordNet (Fellbaum, 1998), the word "dog" has 8 senses, while the word "love" has 10 senses.
Number of Hyponyms
Tanaka et al. (2013) find a direct correlation between the abstractness of a noun and the number of hyponyms the word has. We consider the number of hyponyms of the most commonly occurring sense of the word, and the average number of hyponyms over all the senses of the word.
Score-based Features
Abstract nouns evoke emotions in humans. SentiWordNet (Baccianella et al., 2010), another lexical database like WordNet, scores each word sense by how positive, negative or objective it is. Abstract words have a higher positive/negative score, while concrete words have a higher objective score. Again, we consider these scores for the most commonly occurring sense, and the average scores over all the senses of the word.
Depth in Hypernymy Tree
This feature is more suited to non-specificity. However, we include it as a feature of imperceptibility since the concepts of imperceptibility and non-specificity are related. For example, consider the words "money" and "property": the latter is more imperceptible and non-specific than the former. Moreover, this feature is particularly useful for subtask-III. The depth of a word in the hypernymy tree is directly proportional to the concreteness of the word.

From the features above, we obtain a 13-dimensional vector for every word in the lexicon. The embedding is constructed so that every dimension is directly proportional to the concreteness of the word. For example, the length of a word is, in general, inversely proportional to its concreteness, so we take the length dimension of the vector as (large value − length of word). The large value chosen was the same for all features which are inversely proportional to concreteness.
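To make this concrete, here is a hedged sketch of how a few of these features could be computed with NLTK's WordNet and SentiWordNet interfaces (assuming the wordnet and sentiwordnet corpora have been downloaded). The exact feature set, the scaling, and the LARGE_VALUE constant are illustrative assumptions rather than our actual 13-dimensional implementation.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

LARGE_VALUE = 100  # illustrative constant for flipping abstractness-aligned features

def linguistic_features(word):
    """Sketch of a few concreteness-aligned features for a noun."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    feats = {}
    # Concrete words tend to be shorter, so flip the length.
    feats["length"] = LARGE_VALUE - len(word)
    # Abstract nouns tend to be more polysemous, so flip the sense count.
    feats["num_senses"] = LARGE_VALUE - len(synsets)
    if synsets:
        first = synsets[0]  # most commonly occurring sense
        # Abstract nouns tend to have more hyponyms, so flip the count.
        feats["num_hyponyms"] = LARGE_VALUE - len(first.hyponyms())
        # Deeper in the hypernymy tree generally means more concrete.
        feats["hypernym_depth"] = first.max_depth()
        # Concrete words tend to have a higher objectivity score.
        feats["objectivity"] = swn.senti_synset(first.name()).obj_score()
    return feats

print(linguistic_features("dog"))
print(linguistic_features("love"))
```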
Towards improving the trained model, we use a method which we term the Difference Method. If the difference between the top-2 probabilities predicted by the model is greater than a certain threshold, the model is sure of the prediction it has made. However, if the difference is less than this tunable threshold, the model is ambivalent about whether the option with the highest probability or the option with the second-highest probability is correct. In this case, we count for how many dimensions the value of the linguistic embedding of the second word is less than the value of the linguistic embedding of the first word. If the majority of the values (i.e., 7 of the 13) are less, we change the prediction of the model to the second-most probable option. The threshold is tuned on the dev set.

Furthermore, we use a Threshold Method towards improving model performance. If the highest probability is less than a tunable threshold, the model is unsure of its prediction, and we apply the improvement approaches to the option with the second-highest probability.
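The decision rule of the Difference Method can be sketched as follows; the function and variable names, and the default threshold value, are illustrative assumptions.

```python
import numpy as np

def difference_method(option_probs, ling_embeddings, threshold=0.1):
    """option_probs: model probabilities for the 5 options.
    ling_embeddings: 13-dimensional concreteness-aligned vectors, one per option."""
    order = np.argsort(option_probs)[::-1]
    top1, top2 = order[0], order[1]
    # If the model is confident, keep its prediction.
    if option_probs[top1] - option_probs[top2] > threshold:
        return top1
    # Otherwise, count the dimensions in which the runner-up looks less concrete
    # (i.e., more imperceptible) than the top choice.
    less_concrete = np.sum(ling_embeddings[top2] < ling_embeddings[top1])
    # Flip to the runner-up if it is less concrete in a majority (7 of 13) of dimensions.
    return top2 if less_concrete >= 7 else top1
```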
Modelling Non-Specificity

According to Spreen and Schulz (1966), a highly specific word refers to a very particular instance, while a non-specific word refers to a generic concept, i.e., it encompasses many classes/instances. For example, consider the words "animal", "bird" and "eagle"; they are listed in increasing order of specificity.

We find parallels between the definition of specificity/non-specificity and the linguistic phenomenon of hypernymy. Schreuder and Baayen (1995) define a hypernym as "a word with a general meaning that has basically the same meaning of a more specific word". The more specific word is the corresponding hyponym. In simpler terms, each word is related to some super-types and sub-types, called hypernyms and hyponyms, respectively. In linguistics, hyponymy is a semantic relation between a hyponym denoting a subtype and a hypernym denoting a supertype.

For example, in Figure 1, if we consider the word "dog" and traverse up the hypernymy tree, we find that its hypernym is "animal", which is much broader than "dog". On the other hand, as we go down the hypernymy tree, we find more specific terms for the word "dog", such as "terrier". Essentially, hyponyms represent "IS-A" relationships: for example, a "terrier" is a "dog". We leverage the hypernymy property of words to help the model decide the most non-specific option. The two methods we implement are:
Hypernym Augmentation Method
In order to infuse a sense of non-specificity (other than training on the given dataset for non-specificity), we augment the dataset for subtask-I. We randomly select n nouns from the article using a basic POS-tagging pipeline. For each noun, we use the Lesk algorithm (Lesk, 1986) to find the most appropriate sense of the word based on its context. For this sense of the word, we find its hypernyms, pick one uniformly at random from this list, and replace the noun in the article with the hypernym. We do this for all n combinations, i.e., corresponding to every sample, we have n augmented samples. Furthermore, we randomly mask tokens in this dataset and train BERT on the MLM task on this dataset. This serves a dual purpose: firstly, it acts as a form of domain adaptation, and secondly, it infuses a sense of non-specificity into the model.

While fine-tuning BERT MLM on the augmented dataset, we freeze two layers due to time and computational constraints. We then replace the normal BERT encoder in our BERT FitB model with the BERT encoder fine-tuned on the augmented dataset.
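A hedged sketch of this augmentation step using NLTK (its POS tagger, Lesk implementation, and WordNet interface) is shown below. For simplicity it replaces n nouns in a single pass rather than producing n separate augmented samples, and the function name is illustrative. It assumes the relevant NLTK resources (punkt, averaged_perceptron_tagger, wordnet) have been downloaded.

```python
import random
import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def augment_with_hypernyms(article, n=3):
    """Replace up to n randomly chosen nouns in the article with one of their hypernyms."""
    tokens = nltk.word_tokenize(article)
    tagged = nltk.pos_tag(tokens)
    noun_positions = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("NN")]
    augmented = list(tokens)
    for i in random.sample(noun_positions, min(n, len(noun_positions))):
        # Disambiguate the noun's sense from its context with the Lesk algorithm.
        sense = lesk(tokens, tokens[i], pos=wn.NOUN)
        if sense is None or not sense.hypernyms():
            continue
        # Pick a hypernym uniformly at random and substitute its first lemma.
        hypernym = random.choice(sense.hypernyms())
        augmented[i] = hypernym.lemmas()[0].name().replace("_", " ")
    return " ".join(augmented)

print(augment_with_hypernyms("The dog chased a ball across the park."))
```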
Hyponyms Options Method

Here, we use the Difference Method / Threshold Method. If the model is sure of its prediction, we keep the prediction. Otherwise, we generate hyponyms for each option using WordNet. After the hyponyms are tokenised, we map each hyponym token to the trained model's output vocabulary space and obtain the corresponding scores. We then take the maximum score amongst all of the hyponyms as the predicted probability for that option. The reason for incorporating this approach pertains to how the transformer models were pretrained. Consider the following sentence: "He had [MASK] and it was bitter". Now, suppose that we have two options: "beer" and "drink". Generally, our transformer-based model would look at the word "bitter" and predict "beer". However, "drink" is more non-specific than "beer".
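A minimal sketch of this re-scoring step is given below. It assumes the per-vocabulary-token logits at the masked position (token_logits) and the tokenizer from the FitB sketch above, and the helper name is illustrative.

```python
from nltk.corpus import wordnet as wn

def hyponym_option_scores(options, token_logits, tokenizer):
    """Score each option by the best-scoring candidate among the option and its hyponyms."""
    scores = []
    for option in options:
        candidates = {option}
        for synset in wn.synsets(option, pos=wn.NOUN):
            for hypo in synset.hyponyms():
                candidates.add(hypo.lemmas()[0].name().replace("_", " "))
        # Score a candidate by the mean logit of its sub-tokens; keep the maximum.
        scores.append(max(
            token_logits[tokenizer(c, add_special_tokens=False)["input_ids"]].mean().item()
            for c in candidates
        ))
    return scores
```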
To address the limitations of the vanilla transformer-based models, we attempt multiple modifications to the proposed baseline transformer models, specifically for BERT. The major limitation of the pretrained BERT model we use is the restriction on the length of the tokenised inputs: only 512 tokens from a sample can be processed by BERT in one pass, so some articles end up getting truncated and context is lost. The following are some of the modifications we made to improve the performance of our models:
Voting
We tokenise the question and the article. We split the article into chunks and pair each chunk with the question such that the length of the tokenised (chunk, question) pair is 512. While splitting the article into chunks, we keep a maximum-overlap stride of 128 so that the context of the previous chunk is not lost. We train the model on these newly formed (chunk, question) pairs. During inference, we take a weighted sum of the logits. For BERT FitB Voting (Similarity), the weights are calculated as:

weight_{ij} = \frac{u_i \cdot v_j}{\|u_i\| \, \|v_j\|}    (1)

where u_i is the embedding of the question in the i-th sample, and v_j is the embedding of the j-th chunk of that sample's article. To find the embeddings, we extract the [CLS] embedding from a pretrained BERT encoder.

We also try an alternate way of defining the weights:

weight_{ij} = \frac{|\{q_i \text{ toks.}\} \cap \{chunk_j \text{ toks.}\}|}{|\{chunk_j \text{ toks.}\}|}    (2)

where \{q_i \text{ toks.}\} is the set of tokens in the i-th sample's question, \{chunk_j \text{ toks.}\} is the set of tokens in the j-th chunk of the sample, and |\cdot| denotes the cardinality of a set. We call this method BERT FitB Voting (Exact Matching).

We normalise the computed weights:

norm\_weight_{ij} = \frac{weight_{ij}}{\sum_{j=1}^{n_i} weight_{ij}}    (3)

where n_i is the number of chunks in the i-th sample. The idea is that the higher the similarity between the question and the article's chunk, the higher the weight assigned to the logits returned by the trained model with that question-chunk pair as input. In Equation 2, we compute the fraction of tokens common between the question and the chunk.
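For concreteness, here is a hedged sketch of the Exact Matching weights (Equations 2 and 3) and the weighted vote; the function names and tensor shapes are illustrative assumptions.

```python
import torch

def exact_matching_weights(question_tokens, chunk_token_lists):
    """Eqs. 2 and 3: fraction of each chunk's tokens shared with the question, normalised."""
    q = set(question_tokens)
    weights = [len(q & set(chunk)) / len(set(chunk)) for chunk in chunk_token_lists]
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def vote(chunk_logits, weights):
    """Weighted sum of per-chunk option logits; chunk_logits has shape [n_chunks, n_options]."""
    w = torch.tensor(weights).unsqueeze(1)
    return (w * chunk_logits).sum(dim=0)
```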
Max Context

This method is a slight modification of the Voting method. Instead of training the model on all (chunk, question) pairs for a particular sample, we train it on the pair with the highest weight, where the weights are calculated as described in Equation 2.

We also propose a few modifications to the baseline provided by the organisers, namely GA-Reader (Dhingra et al., 2017a).
GA-Reader BERT
We use GA-Reader on top of BERT embeddings. This could lead to a potential improvement in performance for subtask-I, as BERT embeddings are more feature-rich than GloVe embeddings.
GA-BERT
Based on the Gated-Attention Reader, we came up with an approach that uses Gated-Attention across two BERT streams. The first stream takes the question as input and works like the regular BERT model. The second stream takes the article as input. Assume the outputs of layer L are Q_L and A_L for the question and article streams, respectively. Then, layer L+1 of the question stream receives Q_L as input, while layer L+1 of the article stream receives GA(Q_L, A_L), where GA is the Gated-Attention function. This is done for all 12 layers of BERT-Base. Finally, two types of heads are attached to this model: Selection and Pooling (similar to BERT FitB), and Attention Classification (similar to GA-Reader). The logits from each head are concatenated and a fully-connected layer is added on top. Since this is a major change to the architecture of BERT, this model needs a significant amount of pretraining.
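The gated-attention operation between the two streams can be sketched as follows. This follows the general Gated-Attention Reader formulation (attend over the question for each article token, then gate element-wise); the exact function used in our experiments may differ.

```python
import torch
import torch.nn.functional as F

def gated_attention(q_states, a_states):
    """GA(Q_L, A_L): q_states is [len_q, d], a_states is [len_a, d]."""
    # For each article token, compute attention weights over the question tokens...
    attn = F.softmax(a_states @ q_states.t(), dim=-1)  # [len_a, len_q]
    q_summary = attn @ q_states                        # [len_a, d]
    # ...and gate the article representations element-wise.
    return a_states * q_summary
```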
Answer-Attention

Since GA-Reader also attends to the candidate answer embeddings, we also attempt an approach where we pass the options to the BERT model. On the option embeddings and the [MASK] token embedding, we apply multiplicative attention (dot product) to get attention scores. These scores are directly used as logits for the prediction.

Table 1: Results of the vanilla Fill-in-the-Blank (FitB) models and the GA-Reader baseline.

Model                         Variant    Imperceptibility (Val / Test)    Non-Specificity (Val / Test)
BERT Fill-in-the-Blank        base       67.03% / 66.77%                  64.39% / 65.74%
BERT Fill-in-the-Blank        large
DistilBERT Fill-in-the-Blank  base       67.03% / 66.02%                  63.69% / 62.67%
RoBERTa Fill-in-the-Blank     base       52.45% / 51.11%                  33.73% / 35.99%
RoBERTa Fill-in-the-Blank     large      51.02% / 52.44%                  33.14% / 34.95%
ALBERT Fill-in-the-Blank      base-v2    31.42% / 30.46%                  31.84% / 31.14%
ALBERT Fill-in-the-Blank      large-v2   31.06% / 30.76%                  30.08% / 33.27%
GA-Reader (baseline)          -          21.23% / 21.51%                  21.50% / 21.86%
GSAMN-BERT

Lai et al. (2019) propose a combination of Gated-Attention and Self-Attention: Gated Self-Attention (GSA). They show improvements on smaller datasets compared to Compare-Aggregate approaches. We use two GSA layers on top of BERT embeddings, and use the same decoder and selection method as BERT FitB.
4 Experiments and Results

In all our experiments, we use the PyTorch implementations of the transformer-based models provided by HuggingFace (Wolf et al., 2019). The metric for all 3 subtasks is accuracy. To obtain the linguistic features described in Section 3 for subtask-I, and the hypernyms and hyponyms for subtask-II, we use WordNet, the lexical database provided through NLTK (Bird and Loper, 2004), a Python library. For both subtasks, we train our models on the train + trial data and evaluate them on the dev set.
For all our experiments, we use the Adam optimiser (Kingma and Ba, 2017) and cross-entropy loss. To choose the optimal set of hyperparameters, we run a grid search on our models and settle on a learning rate of 1e-5. Schedulers such as the linear scheduler, cosine annealing scheduler, etc. seem to have a negative impact on the results. For the FitB models, we keep all layers unfrozen. Additionally, the maximum input length is kept at 512. We train our models for 4 epochs with a batch size of 2.
Among the vanilla models, BERT FitB Large performs the best. This is understandable for DistilBERT and ALBERT, since these models are pruned and distilled for faster computation. Notably, DistilBERT gives comparable performance to BERT FitB Base. A slightly surprising observation is the degradation in accuracy when using RoBERTa. This could be because, even though it was pretrained more robustly than BERT on the MLM task, it was not pretrained on the Next Sentence Prediction task, and hence might perform worse on textual entailment-style tasks. A peculiar observation is that the large variants of the ALBERT FitB and RoBERTa FitB models perform worse than their base variants, which may imply that more training data is needed to train the large variants. For subtask-I, in Table 2, we also report the results of BERT Ensemble, in which we ensemble (i.e., average the predictions of) two checkpoints saved during the training process.

When it comes to the Difference Method using linguistic features for imperceptibility, we observe an improvement on the dev set, but a slight fall on the test set. This might be solved by careful tuning of the threshold. In the future, we aspire to learn embeddings using these linguistic features as input to common models such as Word2Vec (Mikolov et al., 2013).

For non-specificity, with the hypernym augmentation method, BERT FitB achieves lower accuracy. A possible reason is that replacing nouns with their hypernyms changes the meaning of the sentence in some contexts (even though we use the Lesk algorithm for WSD, not all hypernyms make sense); for example, the word "drink" being replaced with "food". For the hyponyms method, we can improve our results by recursively generating hyponyms for a particular option, instead of taking only the immediate hyponyms. Again, threshold tuning may help.

Figure 3: Explanation of a correctly classified sample from subtask-I (Imperceptibility): a news passage about delayed senate and municipal elections in Haiti, with word-wise attribution highlighting. Options: Local, Annual, Legislative, Municipal, Devastating. The correct option is highlighted in green.
Table 2: Results and ablation study of the improvement methods on subtask-I.

Model                  Variant    Val Acc.    Test Acc.
BERT FitB LF           large      75.75%      75.06%
DistilBERT FitB LF     base       68.10%      65.73%
BERT FitB ENS          large      75.15%
BERT FitB ENS LF       large      75.87%      75.26%
BERT FitB EM           large      76.58%      76.35%
BERT FitB EM LF        large

Abbreviations used in Tables 2 and 3: LF = Linguistic Features, ENS = Ensemble, FitB = Fill-in-the-Blank, EM = Exact Matching, VS = Voting (Similarity), MC = Max Context, Aug = Augmentation, Hypr = Hypernym, Hypo = Hyponym.
In Table 3, a positive sign for the Difference Method / Threshold Method is the improvement in the results of BERT FitB Voting (Exact Matching) when we consider the hyponyms: the accuracy jumps from 72.86% to 75.79% on the dev set and from 77.83% to 78.98% on the test set. This reinforces our claim that with more careful tuning of the threshold, we might also see improvements on the test set for the other methods.

BERT FitB Voting performs better than vanilla BERT FitB on both subtasks. This is intuitive, since the latter truncates the article to 512 tokens without any consideration of how much context is lost. Voting, on the other hand, considers all contexts and hence gives superior performance.

For GA-Reader-BERT, when compared with the GA-Reader baseline, the accuracy improves from 21% to 39% on the subtask-I dev set. Due to computational restrictions, we could not pretrain GA-BERT, and only fine-tuned it for subtask-I to get an idea of its performance, which was sub-optimal (19%).
Table 3: Results and ablation study of the improvement methods on subtask-II.

Model                    Variant    Val Acc.    Test Acc.
BERT FitB Hypo           large      75.09%      72.83%
BERT FitB Hypr Aug       large      62.26%      60.78%
BERT FitB Hypr Aug Hypo  large      64.51%      55.52%
BERT FitB EM             large      72.86%      77.83%
BERT FitB EM Hypo        large
BERT FitB VS             large      73.09%      77.59%
BERT FitB VS Hypo        large      75.56%      78.63%
BERT MC                  large      71.33%      71.21%

The Answer-Attention system gave us a dev score of ≈61% on subtask-I, which is much higher than the baseline. GSAMN-BERT-Cloze achieves ≈31% accuracy on the subtask-I dev set. The reasons for this could be a lack of pretraining, unlike the original paper, or a different way of obtaining the output logits. We see an improvement when we reduce the number of GSA layers to one.

Interpreting Predictions with Integrated Gradients

We use the method of Integrated Gradients (Sundararajan et al., 2017) and follow Ramnath et al. (2020) to compute word-wise attribution scores for BERT FitB on both subtasks. We compute the Integrated Gradients of the target with respect to the embedding outputs, using the Riemann Right approximation method with n_steps = 25. After obtaining the token-wise attribution scores, we obtain word-wise attribution scores using a token-to-word offset mapping. We pick the top-10 word-wise attribution scores and normalise them. To implement IG, we use the Captum library (Kokhlikyan et al., 2020). For brevity, we present one example for each subtask.
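A hedged sketch of this attribution computation with Captum is given below; the example sentence, the target option, and the pad-token baseline are illustrative choices rather than our exact pipeline.

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "Doctors have called for a tax on sugary [MASK] and drinks."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
target_id = tokenizer("food", add_special_tokens=False)["input_ids"][0]

def forward_fn(input_ids, attention_mask):
    # Score of the target option at the masked position.
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return logits[:, mask_pos, target_id]

lig = LayerIntegratedGradients(forward_fn, model.bert.embeddings)
baseline = torch.full_like(inputs["input_ids"], tokenizer.pad_token_id)
attributions = lig.attribute(
    inputs["input_ids"],
    baselines=baseline,
    additional_forward_args=(inputs["attention_mask"],),
    n_steps=25,
    method="riemann_right",
)
token_scores = attributions.sum(dim=-1).squeeze(0)  # one attribution score per token
```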
Figure 4: Explanation of a correctly classified sample from subtask-II (Non-Specificity): a news passage about a proposed tax on sugary drinks, with word-wise attribution highlighting. Options: Food, Terms, Head, Unit, Snacks. The correct option is highlighted in green.

Table 4: Submitted results for subtask-III: testing the performance of a system that is trained on one subtask and evaluated on the other.

Imperceptibility model                       Test Acc.
BERT FitB
DistilBERT FitB                              52.16%
DistilBERT FitB + Linguistic Features        51.61%
BERT FitB + Linguistic Features              65.54%
BERT FitB Ensemble + Linguistic Features     64.95%

Non-Specificity model                        Test Acc.
BERT FitB                                    61.83%
BERT FitB with Hyponyms                      59.95%
BERT FitB with Hypernym Augmentation         45.98%
BERT FitB Voting (Exact Matching)
In Fig. 3, the correct answer is "legislative". The attribution scores of words like senate, senators, municipal and President are high, as is demonstrated by the intensity of the colour. The word "legislative" is, in a sense, more imperceptible than any of the words mentioned above: the senate is the legislative branch of the government, senators are its members, and municipal refers to municipal corporations, which are grassroots governing bodies. Moreover, other words such as elections, political and country also have high attribution scores. These words are related to "legislative", which shows that BERT FitB is not only able to learn the concept of imperceptibility, but is also able to predict a suitable word.

Similarly, in Fig. 4, the correct answer is "food". Note that "snacks" is also an option; however, "food" is more non-specific than "snacks" and hence is the correct option. Another interesting observation is the high attribution scores for words/phrases like calories, beverages, sugar and sugary drinks. This backs the fact that the model is able to learn the concept of non-specificity, i.e., the above-mentioned words are essentially hyponyms of "food".

5 Conclusion

We tried out myriad approaches, taking care not only to focus on the architectural aspect, but also on how to quantify imperceptibility and non-specificity. Although we did not achieve favourable improvements in all approaches, we did observe gains in accuracy on the dev set. We reckon that with more careful tuning of parameters such as the threshold in the Difference Method, we will be able to achieve these gains on the test set. We further interpreted the outputs of transformer-based models using Integrated Gradients, and demonstrated that transformer models are able to learn the concepts of imperceptibility and non-specificity. In the future, we intend to solidify our proposed approaches and carry out further research in this interesting field.
References
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214-217, Barcelona, Spain. Association for Computational Linguistics.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale Simple Question Answering with Memory Networks.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017a. Gated-attention readers for text comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1832-1846, Vancouver, Canada. Association for Computational Linguistics.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2017b. Gated-Attention Readers for Text Comprehension.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Shi Feng, Zhiqiang Cai, and Danielle McNamara. 2011. Simulating human ratings on word concreteness.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching Machines to Read and Comprehend.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization.

Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-Richardson. 2020. Captum: A unified and generic model interpretability library for PyTorch.

Tuan Lai, Quan Hung Tran, Trung Bui, and Daisuke Kihara. 2019. A gated self-attention memory network for answer selection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5953-5959, Hong Kong, China. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. CoRR, abs/1909.11942.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, pages 24-26.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text.

Sahana Ramnath, Preksha Nema, Deep Sahni, and Mitesh M. Khapra. 2020. Towards interpreting BERT for reading comprehension based QA. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3236-3242, Online. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.

Robert Schreuder and R. Harald Baayen. 1995. Modelling morphological processing. In Laurie B. Feldman, editor, Morphological aspects of language processing, pages 131-154. Erlbaum.

Otfried Spreen and Rudolph W. Schulz. 1966. Parameters of abstraction, meaningfulness, and pronunciability for 329 nouns. Journal of Verbal Learning and Verbal Behavior, 5(5):459-468.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. CoRR, abs/1703.01365.

Shinya Tanaka, Adam Jatowt, Makoto Kato, and Katsumi Tanaka. 2013. Estimating content concreteness for finding comprehensible documents. pages 475-484.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2018. Large-scale cloze test dataset created by teachers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.