AVA: an Automatic eValuation Approach to Question Answering Systems
Thuy Vu
Amazon Alexa, Manhattan Beach, CA, USA ([email protected])
Alessandro Moschitti
Amazon Alexa, Manhattan Beach, CA, USA ([email protected])
Abstract
We introduce AVA, an automatic evaluation approach for Question Answering, which, given a set of questions associated with Gold Standard answers, can estimate system Accuracy. AVA uses Transformer-based language models to encode question, answer, and reference text. This allows for effectively measuring the similarity between the reference and an automatic answer, biased towards the question semantics. To design, train, and test AVA, we built multiple large training, development, and test sets on both public and industrial benchmarks. Our innovative solutions achieve up to 74.7% in F1 score in predicting human judgement for single answers. Additionally, AVA can be used to evaluate the overall system Accuracy with an RMSE ranging from 0.02 to 0.09, depending on the availability of multiple references.
1 Introduction

Accuracy evaluation is essential both to guide system development and to estimate system quality, which is important for researchers, developers, and users. This is often conducted using benchmarking datasets, containing a data sample, possibly representative of the target data distribution, provided with Gold Standard (GS) labels (typically produced with a human annotation process). The evaluation is done by comparing the system output with the expected labels using some metrics.

This approach unfortunately falls short when dealing with generation tasks, for which the system output may span a large, possibly infinite, set of correct items. For example, in the case of Question Answering (QA) systems, the set of correct answers for the question,
Where is Rome located?, is large. As it is impossible, also for cost reasons, to annotate all possible system output, the standard approach is to manually re-evaluate the new output of the system. This dramatically limits experimentation velocity, while significantly increasing development costs.

Another viable solution in specific domains consists in automatically generating an evaluation score between the system and the reference answers, which correlates with human judgement. The BLEU score, for example, is one popular measure in Machine Translation (Papineni et al., 2002). This, however, can only be applied to specific tasks, and even in those cases it typically shows limitations (Way, 2018). As a consequence, there is active research in learning methods to automatically evaluate MT systems (Ma et al., 2019), while human evaluation remains a requirement in machine translation benchmarking (Barrault et al., 2019).

QA would definitely benefit from a similar approach, but automatic evaluation is technically more complex for several reasons. First, segment-overlapping metrics such as BLEU, METEOR, or ROUGE do not work, since the correctness of an answer only loosely depends on the match between the reference and candidate answers. For example, two text candidates can be correct and incorrect even if they only differ by one word (or even one character), e.g., for the question,
Who was the 43rd president of the USA?, a correct answer is George W. Bush, while the very similar answer,
George H. W. Bush, is wrong.

Second, the matching between the answer candidates and the reference must be carried out at the semantic level, and it is radically affected by the question semantics. For example, match(t, r | q1) can be true while match(t, r | q2) is false, where t and r are a pair of answer candidate and reference, and q1 and q2 are two different questions. This can especially happen for so-called non-factoid questions, e.g., asking for a description, opinion, manner, etc., which are typically answered by a fairly long explanatory text. For example, Table 1 shows an example of a non-factoid question.

Question: What does cause left arm pain?
Reference: Arm pain can be caused by a wide variety of problems, ranging from joint injuries to compressed nerves; if it radiates into your left arm, it can even be a sign of a heart attack.
Answer 1: It is possible for left arm pain to be caused from straining the muscles of the arm, pending heart attack, or it can also be caused from indigestion.
Answer 2: Anxiety can cause muscles in the arm to become tense, and that tension could lead to pain.
Answer 3: In many cases, arm pain actually originates from a muscular problem in your neck or upper spine.
Table 1: Example of a non-factoid question and three different valid answers, which share similarity with respect to the question.

However, if the question were, what may cause anxiety?, Answer 1 and Answer 3 would intuitively look less related to Answer 2.

In this paper, we study the design of models for measuring the Accuracy of QA systems. In particular, we design several pre-trained Transformer models (Devlin et al., 2018; Liu et al., 2019) that encode the triple of question q, candidate t, and reference r in different ways. Most importantly, we built (i) two datasets for training and testing the point-wise estimation of QA system output, i.e., the evaluation of whether an answer is correct or not, given a GS answer; and (ii) two datasets constituted by sets of outputs from several QA systems, for which AVA is supposed to estimate the Accuracy.

The results show a high Accuracy for point-wise models, up to 75%. Regarding the overall Accuracy estimation, AVA can almost always replicate the ranking of systems in terms of Accuracy performed by humans. Finally, the RMSE with respect to human evaluation depends on the dataset, ranging from 2% to 10%, with an acceptable Std. Dev. lower than 3-4%.

The structure of the paper is as follows: we begin with the description of the problem in Sec. 3. This is then followed by the details of the data construction and model design, which are key aspects for system development, in Sections 4 and 5. We study the performance of our models in three different evaluation scenarios in Sec. 6.

2 Related Work

Automatic evaluation has been an interesting research topic for decades (Papineni et al., 2002; Magnini et al., 2002). There are two typical strategies to design an automatic evaluator: supervised and unsupervised. In machine translation, for example, BLEU (Papineni et al., 2002) has been a very popular unsupervised evaluation method for the task. There are also other, supervised methods recently proposed, most notably (Ma et al., 2019). For dialog systems, neural-based automatic evaluators have also been studied (Ghazarian et al., 2019; Lowe et al., 2017; Tao et al., 2017; Kannan and Vinyals, 2017).

QA has been studied early in the literature (Green et al., 1961), and QA has recently been used to evaluate a summarization task (Eyal et al., 2019). Automatic evaluation for QA was addressed by Magnini et al. (2002) and also for multiple sub-domain QA systems (Leidner and Callison-Burch, 2003; Lin and Demner-Fushman, 2006; Shah and Pomerantz, 2010; Gunawardena et al., 2015). However, little progress has been made in the past two decades towards obtaining a standard method. Automating QA evaluation is still an open problem and there is no recent work supporting it.

q: What is the population of California?
r: With slightly more than 39 million people (according to 2016 estimates), California is the nation's most populous state; its population is almost one and a half times that of second-place Texas (28 million).
s: 39 million
t: The resident population of California has been steadily increasing over the past few decades and has increased to 39.56 million people in 2018.

Table 2: An example of input data.
We target the automatic evaluation of QA systems, for which system Accuracy (the percentage of correct answers) is the most important measure. We also consider more complex measures such as MAP and MRR in the context of Answer Sentence Reranking/Selection.
The task of reranking answer sentence candidates provided by a retrieval engine can be modeled with a classifier scoring the candidates. Let q be a question and T_q = {t_1, ..., t_n} be a set of answer sentence candidates for q; we define R as a ranking function, which orders the candidates in T_q according to a score, p(q, t_i), indicating the probability that t_i is a correct answer for q. Popular methods modeling R include Compare-Aggregate (Yoon et al., 2019), inter-weighted alignment networks (Shen et al., 2017), and BERT (Garg et al., 2020).

3.2 Automatic Evaluation of QA Accuracy

The evaluation of system Accuracy can be approached in two ways: (i) the evaluation of the single answer provided by the target system, which we call point-wise evaluation; and (ii) the aggregated evaluation over a set of questions, which we call system-wise evaluation.

We define the former as a function A(q, r, t_i) → {0, 1}, where r is a reference answer (GS answer) and the output is simply a correct/incorrect label. Table 2 shows an example question associated with a reference, a system answer, and a short answer.

A configuration of A is applied to compute the final Accuracy of a system using an aggregator function. In other words, to estimate the overall system Accuracy, we simply treat the point-wise AVA predictions as if they were the GS. For example, in the case of the Accuracy measure, we simply average the AVA predictions, i.e., (1/|Q|) Σ_{q∈Q} A(q, r, t_i[, s]), where s is a short answer (e.g., used in machine reading). It is an optional input, which we only use for a baseline, described in Section 4.1.
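To make the system-wise aggregation concrete, the following minimal sketch (our own illustration, not code from the paper) maps point-wise AVA scores to binary decisions and averages them over a question set; the 0.5 decision threshold mirrors the one used later in the error analysis, and all function names are ours.

```python
from statistics import mean

def point_wise_eval(ava_score: float, threshold: float = 0.5) -> int:
    """Map an AVA score for (q, r, t) to a binary correct/incorrect label."""
    return int(ava_score > threshold)

def system_accuracy(ava_scores):
    """Aggregate point-wise AVA decisions into a system-wise Accuracy estimate.

    ava_scores holds one AVA score per question, for the single answer the
    target system returned to that question.
    """
    return mean(point_wise_eval(s) for s in ava_scores)

# Example: three questions, one system answer each.
print(system_accuracy([0.92, 0.12, 0.77]))  # -> 0.666...
```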
The main intuition behind building an automatic evaluator for QA is that the model should capture (i) the same information a standard QA system uses, while (ii) exploiting the semantic similarity between the system answer and the reference, biased by the information asked by the question. We build two types of models: (i) a linear classifier, which is more interpretable and can help us verify our design hypothesis, and (ii) Transformer-based methods, which have been successfully used in several language understanding tasks.

Given an input example (q, r, s, t), our classifier uses the following similarity features: x_1 = sim-token(s, r), x_2 = sim-text(r, t), x_3 = sim-text(r, q), and x_4 = sim-text(q, t), where sim-token between s and r is a binary feature testing if r is included in s, and sim-text is a sort of Jaccard similarity:

sim-text(s_i, s_j) = 2 |tok(s_i) ∩ tok(s_j)| / (|tok(s_i)| + |tok(s_j)|),

where tok(s) is a function that splits s into tokens. (The short answer s can be very effective, but it adds an additional annotation cost; we therefore limit its use to the baseline model, aiming for a lower-cost AVA model.)
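As an illustration of the baseline features, here is a minimal sketch assuming simple whitespace tokenization; the helper names (tok, sim_token, sim_text, feature_vector) are ours, and the symmetric containment test in sim_token is a simplifying assumption.

```python
def tok(s: str):
    """Split a text into lowercased tokens (whitespace tokenization assumed)."""
    return s.lower().split()

def sim_token(short_answer: str, reference: str) -> float:
    """Binary containment test between the short answer and the reference
    (made symmetric here as a simplifying assumption)."""
    a, b = short_answer.lower(), reference.lower()
    return float(a in b or b in a)

def sim_text(s_i: str, s_j: str) -> float:
    """Jaccard-like similarity: 2|tok(s_i) ∩ tok(s_j)| / (|tok(s_i)| + |tok(s_j)|)."""
    a, b = tok(s_i), tok(s_j)
    if not a and not b:
        return 0.0
    return 2 * len(set(a) & set(b)) / (len(a) + len(b))

def feature_vector(q: str, r: str, s: str, t: str):
    """x = (x1, x2, x3, x4) as defined for the linear baseline."""
    return [sim_token(s, r), sim_text(r, t), sim_text(r, q), sim_text(q, t)]
```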
Let x = f(q, r, s, t) = (x_1, x_2, x_3, x_4) be the similarity feature vector describing our evaluation tuple. We train a weight vector w on a dataset D = {d_i : (x_i, l_i)} using an SVM, where l_i is a binary label indicating whether t answers q or not. We compute the point-wise evaluation of t as the test x · w > α, where α is a threshold trading off Precision for Recall, as in standard classification approaches.

Transformer-based architectures have proved to be powerful language models, which can capture complex similarity patterns. Thus, they are suitable methods to improve the basic approach described in the previous section. Following the linear classifier modeling, we propose three different ways to exploit the relations among the members of the tuple (q, r, s, t).

Let B be a pre-trained language model, e.g., the recently proposed BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), or ALBERT (Lan et al., 2020). We use a language model to compute the embedding representation of the tuple members: B(a, a′) → x ∈ R^d, where (a, a′) is a sentence pair, x is the output representation of the pair, and d is the dimension of the output representation. The classification layer is a standard feedforward network, A(x) = W⊤x + b, where W and b are parameters we learn by fine-tuning the model on a dataset D. We describe different designs for A as follows.

A1: Text-Pair Embedding. We build a language model representation for pairs of members of the tuple x = (q, r, t) by simply inputting them to Transformer models B in the standard sentence-pair fashion. We consider four different configurations of A_1, one for each of the pairs (q, r), (q, t), (r, t), and one for the triplet (q, r, t), modeled as the concatenation of the previous three. The representation for each pair is produced by a different and independent BERT instance, i.e., B_p. More formally, we have the three models A_1(B_p(p)), ∀ p ∈ D_1, where D_1 = {(q, r), (q, t), (r, t)}. Additionally, we design a model over (q, r, t) with A_1(∪_{p∈D_1} B_p(p)), where ∪ means concatenation of the representations. We do not use the short answer s, as its contribution is minimal when using powerful Transformer-based models.

A2: Improved Text-Triple Embedding. The models of the previous section are limited to pair representations. We improve this by designing B models that can capture pattern dependencies across q, r and t. To achieve this, we concatenate pairs of the three pieces of text above. We indicate this string concatenation with the ◦ operator. Specifically, we consider D_2 = {(q, r ◦ t), (r, q ◦ t), (t, q ◦ r)} and propose the following A_2 models. As before, we have the individual models A_2(B_p(p)), ∀ p ∈ D_2, as well as the combined model A_2(∪_{p∈D_2} B_p(p)), where again we use different instances of B and fine-tune them together accordingly.
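The sketch below illustrates the A_1-style design, one RoBERTa instance per text pair with the concatenated [CLS] representations fed to a linear classification layer; the A_2 variants only change the input pairs to include concatenated texts. It is a simplified reading of the architecture, not the authors' implementation; the class name, pooling choice, and example inputs are ours.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class PairEncoderEvaluator(nn.Module):
    """A1-style evaluator: one Transformer instance per text pair, concatenated
    [CLS] representations, and a linear layer A(x) = W^T x + b (a sketch)."""

    def __init__(self, model_name="roberta-base", num_pairs=3):
        super().__init__()
        self.encoders = nn.ModuleList(
            AutoModel.from_pretrained(model_name) for _ in range(num_pairs))
        hidden = self.encoders[0].config.hidden_size
        self.classifier = nn.Linear(hidden * num_pairs, 2)

    def forward(self, encoded_pairs):
        # encoded_pairs: one tokenizer output per pair, e.g. for (q, r), (q, t), (r, t).
        # For an A2-style model, the second element of each pair would be a
        # concatenation such as "q <sep> t".
        cls_vectors = [enc(**inp).last_hidden_state[:, 0]
                       for enc, inp in zip(self.encoders, encoded_pairs)]
        return self.classifier(torch.cat(cls_vectors, dim=-1))

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
q = "what does cause left arm pain?"
r = "Arm pain can be caused by a wide variety of problems."
t = "Anxiety can cause muscles in the arm to become tense."
inputs = [tokenizer(a, b, return_tensors="pt", truncation=True, max_length=128)
          for a, b in [(q, r), (q, t), (r, t)]]
logits = PairEncoderEvaluator()(inputs)  # shape (1, 2): correct / incorrect
```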
A3: Peer Attention for Pairs of Transformer-based Models. Our previous designs instantiate a different B for each pair, learning the feature representations of the target pair and the relations between its members during the fine-tuning process. This individual optimization prevents capturing patterns across the representations of different pairs, as there is no strong connection between the B instances. Indeed, the combination of feature representations only happens in the last classification layer.

We propose peer-attention to encourage feature transfer between different B instances. The idea, similar to the encoder-decoder setting in Transformer-based models (Vaswani et al., 2017), is to introduce an additional decoding step for each pair. Figure 1 depicts our proposed setting for learning the representation of two different pairs, a = (a, a′) and g = (g, g′). The standard approach learns representations for these two in one pass, via B_a and B_g. In the peer-attention setting, the representation output after processing one pair, captured in H_[CLS], is input to the second pass of fine-tuning for the other pair. Thus, the representation of one pair can attend over the representation of the other pair during the decoding stage. This allows the feature representations from each B instance to be shared during both training and prediction.

We now describe the datasets we created to develop AVA. First, we build two large-scale datasets for the standard QA task, namely
AS2-NQ and AS2-GPD, derived from the Google Natural Questions dataset and our internal dataset, respectively. The construction of the datasets is described in Section 5.1. Second, we describe our approach to generate labelled data for AVA using the datasets for the QA task, described in Section 5.2. Finally, we
build an additional dataset constituted by a set of systems and their output on target test sets, which can be used to evaluate the ability of AVA to estimate end-to-end system performance (system-wise evaluation), described in Section 5.3.

Figure 1: Peer attention on (a, a′) and (g, g′).

Google Natural Questions (NQ) is a large-scale dataset for the machine reading task (Kwiatkowski et al., 2019). Each question is associated with a Wikipedia page and at least one long paragraph (long answer) that contains the answer to the question. The long answer may contain additional annotations of a short answer, a succinct extractive answer from the long paragraph. A long answer usually consists of multiple sentences, thus NQ is not directly applicable to our setting.

We create AS2-NQ from NQ by leveraging both long answer and short answer annotations. In particular, for a given question, the (correct) answers are the sentences in the long answer paragraphs that contain an annotated short answer. The other sentences from the Wikipedia page are considered incorrect. The negative examples can be of the following types: (i) sentences that are in the long answer but do not contain an annotated short answer (it is possible that these sentences still contain the short answer); (ii) sentences that are not part of the long answer but contain a short answer as a subphrase (such occurrences are generally accidental); and (iii) all the other sentences in the document.

The generation of negative examples impacts the robustness of the trained model when selecting the correct answer out of the incorrect ones. AS2-NQ has four labels that describe the possible confusion levels of a sentence candidate. We apply the same processing to both the training and development sets of NQ. This dataset enables an effective transfer step (Garg et al., 2020). Table 3 shows the statistics of the dataset.
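As a rough illustration of the AS2-NQ labeling described above, the sketch below assigns one of four labels to each document sentence; the label names and the string-containment test for short answers are our assumptions (NQ actually provides token offsets).

```python
def label_sentences(document_sentences, long_answer_sentences, short_answers):
    """Assign AS2-NQ-style labels to the candidate sentences of one NQ question.

    Label names are ours; they mirror the positive class and the three negative
    types described above.
    """
    labeled = []
    for sent in document_sentences:
        in_long = sent in long_answer_sentences
        # String containment is a simplifying assumption.
        has_short = any(sa.lower() in sent.lower() for sa in short_answers)
        if in_long and has_short:
            label = "positive"
        elif in_long:
            label = "negative_in_long_answer"
        elif has_short:
            label = "negative_short_answer_match"
        else:
            label = "negative_other"
        labeled.append((sent, label))
    return labeled
```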
A search engine using a large index can retrieve more relevant documents than those available in Wikipedia. Thus, we retrieved high-probability relevant candidates as follows: we (i) retrieved the top 500 relevant documents; (ii) automatically extracted the top 100 sentences, ranked by a BERT model over all sentences of the documents; and (iii) had all the top 100 sentences manually annotated as correct or incorrect answers. This process does not guarantee that we have all correct answers, but the probability of missing them is much lower than for other datasets. In addition, this dataset is richer than AS2-NQ, as it consists of answers from multiple sources. Furthermore, the average number of answers per question is also higher than in AS2-NQ. Table 4 shows the statistics of the dataset.
The AS2 datasets from the previous section consist of a set of questions Q. Each q ∈ Q has candidates T_q = {t_1, ..., t_n}, comprised of both correct answers C_q and incorrect answers C̄_q, with T_q = C_q ∪ C̄_q. We construct the dataset for point-wise automatic evaluation (described in Section 4) as follows: to have positive and negative examples for AVA, we first filter the QA datasets to keep only questions that have at least two correct answers, which is critical to build positive examples.

Formally, let ⟨q, r, t, l⟩ be an input for AVA. We define AVA-Positives = ⟨q; (r, t) ∈ C_q × C_q and r ≠ t⟩, and we build negative examples as AVA-Negatives = ⟨q; (r, t) ∈ C_q × C̄_q⟩.

We create AVA-NQ and AVA-GPD from the QA datasets AS2-NQ and AS2-GPD. The statistics are presented on the right side of Tables 3 and 4.
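A minimal sketch of this pair-generation step; the function name and the tuple format (q, r, t, label) are ours.

```python
from itertools import permutations

def build_ava_examples(question, correct, incorrect):
    """Build AVA point-wise examples from one AS2 question.

    correct / incorrect are the candidate sets C_q and C̄_q. Positives pair two
    distinct correct answers (one acts as reference r, the other as candidate t);
    negatives pair a correct reference with an incorrect candidate. Questions
    with fewer than two correct answers are skipped, as described above.
    """
    examples = []
    if len(correct) < 2:
        return examples
    for r, t in permutations(correct, 2):      # (r, t) ∈ C_q × C_q, r ≠ t
        examples.append((question, r, t, 1))
    for r in correct:
        for t in incorrect:                    # (r, t) ∈ C_q × C̄_q
            examples.append((question, r, t, 0))
    return examples
```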
To test AVA at the level of overall system Accuracy, we need a sample of systems and their output on different test sets. We create a dataset of candidate answers collected from eight systems for a set of 1,340 questions. The questions were sampled from an anonymized set of user utterances; we only considered information inquiry questions. The systems differ from each other in multiple ways, including: (i) modeling: Compare-Aggregate (CNN-based) and different Transformer-based architectures with different hyper-parameter settings; (ii) training: the systems are trained on different resources; and (iii) candidates: the pools of candidates for the selected answers are different.
We study the following performance aspects of AVA in predicting: (i) the correctness of the individual answers provided by systems to questions (point-wise estimation); and (ii) the overall system Accuracy. We evaluate QA Accuracy as well as passage reranking performance, in comparison with human labeling. The first aspect studies the capacity of our different machine learning models, whereas the second provides a perspective on the practical use of AVA to develop QA systems.
We train and test models using the AVA-NQ and AVA-GPD datasets, described in Section 5.2. We also evaluate the point-wise performance on the WikiQA and TREC-QA datasets.
Table 5 summarizes the configurations we consider for training and testing. For the linear classifier baseline, we built a vanilla SVM classifier using scikit-learn, setting the probability parameter to enable Platt scaling calibration of the SVM score. We developed our Transformer-based evaluators on top of the HuggingFace Transformers library (Wolf et al., 2019). We use RoBERTa-Base as the initial pre-trained model for each B instance (Liu et al., 2019). We use the default hyperparameter setting for typical GLUE trainings. This includes (i) the use of the AdamW variant (Loshchilov and Hutter, 2017) as optimizer, (ii) the same learning rate for all fine-tuning exercises, and (iii) the maximum sequence length set to 128. The number of iterations is set to 2, and we use a development set to enable early stopping based on the F1 measure after the first iteration. We fix the same batch size in all experiments to avoid possible performance discrepancies caused by different batch size settings.
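For the linear baseline described above, a minimal scikit-learn sketch could look as follows; the feature values are synthetic stand-ins for the feature_vector(q, r, s, t) output from the earlier sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for feature vectors and their binary labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 4))
y_train = (X_train[:, 1] + X_train[:, 3] > 1.0).astype(int)

# probability=True turns on Platt scaling, so the SVM outputs calibrated
# probabilities; alpha plays the role of the Precision/Recall threshold.
svm = SVC(probability=True)
svm.fit(X_train, y_train)

alpha = 0.5
scores = svm.predict_proba(X_train)[:, 1]
point_wise_predictions = (scores > alpha).astype(int)
```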
Table 3: AS2-NQ and AVA-NQ Statistics
Table 4: AS2-GPD and AVA-GPD Statistics
Model              Setting
Linear Classifier  using the 4 features x_i
A_1                one model for each pair in D_1 and one for all of D_1
A_2                all possible combinations from D_2
A_3                the most probable setting from A_2

Table 5: The AVA configurations used in training.
We study the performance of AVA in evaluating passage reranker systems, which differ not only in methods but also in domains and application settings. We employ the following evaluation strategies to benchmark AVA.
Point-wise Evaluation
We study the performance of AVA on point-wise estimation using traditional Precision, Recall, and F1. These metrics indicate the performance of AVA in predicting whether an answer candidate is correct or not.
System-wise Evaluation
We measure AVA when used in a simple aggregator to compute the overall system performance over a test set. The metrics we consider are Precision-at-1 (P@1), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR) when computing the performance on TREC-QA and WikiQA, since these datasets contain ranks of answers. In contrast, we only use P@1 on the ADS dataset, as it only includes the selected answer of each system.

We use Kendall's Tau-b to measure the correlation between the ranking produced by AVA and the one available in the GS: τ = (c − d) / (c + d), where c and d are the numbers of concordant and discordant pairs between the two rankings (we use scipy.stats.kendalltau).

We additionally analyze the gap between each performance value given by AVA and the one computed with the GS, using the root mean square error: RMSE(a, h) = √((1/n) Σ_{i=1}^{n} (a_i − h_i)²), where a and h are the measures given by AVA and by human annotation, respectively.
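A minimal sketch of these two measures, using scipy.stats.kendalltau as mentioned above; the per-system scores are synthetic.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-system Accuracy (P@1) as measured by AVA and by humans.
ava   = np.array([0.72, 0.87, 0.89, 0.94, 0.74, 0.83])
human = np.array([0.70, 0.88, 0.90, 0.96, 0.71, 0.79])

tau, p_value = kendalltau(ava, human)          # Kendall's tau-b, handles ties
rmse = np.sqrt(np.mean((ava - human) ** 2))    # RMSE(a, h)

print(f"Kendall tau-b = {tau:.3f} (p = {p_value:.3f}), RMSE = {rmse:.3f}")
```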
We evaluate the performance of AVA in predicting whether an answer t is correct for a question q, given a reference r. Table 6 summarizes the results: column 1 reports the names of the systems described in Section 4, while columns 2 and 3 show the F1 measured on AVA-NQ and AVA-GPD, respectively. We note that: (i) the F1 on AVA-GPD is much higher than the one on AVA-NQ; this is because the former dataset is much larger than the latter; (ii) A_1({(q, r)}) cannot predict whether an answer is correct, as it does not use the answer in its representation, thus its Accuracy is lower than 7%; (iii) A_1({(r, t)}) is already a reasonable model, mainly based on the paraphrasing between r and t; (iv) A_1({(q, t)}) is also a good model, as it is as powerful as a QA system; (v) the A models that take the entire triplet q, r and t are the most accurate, achieving an F1 of almost 74%; (vi) the use of combinations of triplets, e.g., A_2({(r, q ◦ t), (t, q ◦ r)}), provides an even more accurate model; and, finally, (vii) the peer-attention model, i.e., A_3((r, q ◦ t), (t, q ◦ r)), reaches almost 75%.

Table 6: F1 of the point-wise models (Linear Classifier and the A_1, A_2, and A_3 configurations of Table 5) on the AVA-NQ and AVA-GPD test sets; the Linear Classifier obtains 0.0000 and 0.3999, respectively.
Dataset       Metric   Kendall   RMSE    τ p
TREC-QA-Dev   P@1      1.000     0.003   0.000

Table 7: System-wise evaluation on TREC-QA and WikiQA using the AVA model A_3((r, q ◦ t), (t, q ◦ r)).

We evaluate the ability of AVA to predict the Accuracy of QA systems as well as the performance on answer sentence reranking tasks. We conduct two evaluation studies with two public datasets, TREC-QA and WikiQA, and an internal ADS dataset.
For TREC-QA and WikiQA, we ran a set of different models on the development and test sets and compared their results with the performance measured by AVA, using one of the best models according to the point-wise evaluation, i.e., A_3((r, q ◦ t), (t, q ◦ r)). More specifically, we apply each model m to select the best answer t from the list of candidates for q in the dataset. We first compute the performance of model m based on the provided annotations; the metrics include Accuracy or Precision-at-1 (P@1), MAP, and MRR. We then run AVA for (q, t), using the GS answers of q as reference r. The final AVA score is the average of the AVA scores applied to the different references for q. Before computing the Accuracy on the test set, we tune the AVA threshold to minimize the RMSE between the Accuracy (P@1) measured by AVA and the one computed with the GS on the development set of each dataset. We use these thresholds to evaluate the results on the test sets.

We considered six different models, including one Compare-Aggregate (CNN) trained model and five other Transformer-based models. Four of the latter are collected from public resources (Garg et al., 2020; github.com/alexa/wqa_tanda). These models differ in their architectures and training data, thus their output is rather different. We removed questions that have no correct or no incorrect answers.

Table 7 reports the overall results averaged over the six models. We note that: (i) setting the right threshold on the dev. set, the error on P@1 is 0; (ii) this is not the case for MAP, which is a much harder value to predict as it requires estimating an entire ranking; (iii) on the TREC-QA test set, AVA has an error ranging from 2 to 4.1 points on any measure; (iv) on the WikiQA test set, the error is higher, reaching 10%, probably due to a larger complexity of the questions; and (v) the std. dev. is low, suggesting that AVA can be used to estimate system performance.

Additionally, we compute the Kendall's Tau-b correlation between the rankings of the six systems sorted by performance (P@1) according to the GS and to AVA. We observe a perfect correlation on TREC-QA and a rather high correlation on WikiQA. This means that AVA can be used to determine whether a model is better than another, which is desirable when developing new systems. The low p-values indicate the reliability of our results.

Finally, Table 8 shows the comparison between the performance evaluated with the GS (Gold) and with AVA for all six models. The predictions of AVA are close to those from human judgement.

                       M1     M2     M3     M4     M5     M6
TREC-QA-Dev   Gold P@1 0.717  0.870  0.891  0.935  0.739  0.826
              Gold MAP 0.691  0.858  0.913  0.912  0.769  0.796
              Gold MRR 0.819  0.923  0.937  0.967  0.835  0.890
              AVA  P@1 0.717  0.870  0.891  0.935  0.739  0.826
              AVA  MAP 0.688  0.831  0.864  0.857  0.717  0.772
              AVA  MRR 0.809  0.920  0.940  0.967  0.803  0.876
TREC-QA-Test  Gold P@1 0.596  0.885  0.904  0.962  0.712  0.788
              Gold MAP 0.661  0.873  0.894  0.904  0.771  0.801
              Gold MRR 0.763  0.933  0.945  0.976  0.820  0.869
              AVA  P@1 0.635  0.904  0.962  0.981  0.712  0.827
              AVA  MAP 0.639  0.845  0.896  0.886  0.680  0.789
              AVA  MRR 0.764  0.936  0.981  0.990  0.793  0.880
WikiQA-Dev    Gold P@1 0.545  0.727  0.455  0.545  0.636  0.727
              Gold MAP 0.636  0.744  0.656  0.621  0.755  0.781
              Gold MRR 0.720  0.831  0.695  0.703  0.803  0.864
              AVA  P@1 0.545  0.727  0.455  0.545  0.636  0.727
              AVA  MAP 0.523  0.751  0.643  0.617  0.713  0.774
              AVA  MRR 0.568  0.841  0.682  0.698  0.788  0.841
WikiQA-Test   Gold P@1 0.563  0.844  0.781  0.688  0.813  0.781
              Gold MAP 0.634  0.778  0.753  0.746  0.834  0.820
              Gold MRR 0.746  0.917  0.876  0.833  0.906  0.883
              AVA  P@1 0.625  0.781  0.719  0.656  0.719  0.656
              AVA  MAP 0.660  0.750  0.687  0.683  0.705  0.704
              AVA  MRR 0.732  0.820  0.783  0.741  0.791  0.762

Table 8: Details of the system-wise evaluation on TREC-QA and WikiQA, comparing the GS (Gold) and the AVA model A_3((r, q ◦ t), (t, q ◦ r)).

DS   Split      Evaluator  S1     S2     S3     S4     S5     S6     S7     S8     Kendall  RMSE    τ p
ADS  Dev (20%)  AVA        0.215  0.278  0.22   0.369  0.285  0.294  0.283  0.355  0.929    0.0004  0.0198

Table 9: Details of the system-wise evaluation on the ADS benchmark dataset.
q: when were the nobel prize awards first given?
  t: among them is the winner of the first prize in 1901, sully prudhomme.
  r: leo tolstoy lost the first literature prize in 1901 to the forgettable rene f. a. sully prudhomme.
  TANDA score: 0.0001; AVA score: 0.596

q: what branch of the service did eileen marie collins serve in?
  t: the first woman to command a space shuttle mission, air force col. eileen collins, sees her flight next month as "a great challenge" in more ways than one.
  r: shuttle commander eileen collins, a working mother and air force colonel, was set to make history as the first woman to command a space mission.
  TANDA score: 0.046; AVA score: 0.895

q: what was johnny appleseed's real name?
  t: appleseed, whose real name was john chapman, planted many trees in the early 1800s.
  r: whitmore said he was most fascinated with the story of john chapman, who is better known as johnny appleseed.
  TANDA score: 0.026; AVA score: 0.948

q: when was the challenger space shuttle disaster?
  t: sept. 29, 1988 americans return to space aboard the shuttle discovery, after a 32-month absence in the wake of the challenger accident.
  r: challenger was lost on its 10th mission during a 1986 launch accident that killed seven crew members.
  TANDA score: 0.995; AVA score: 0.080

q: when did jack welch become chairman of general electric?
  t: everyone knew it was coming, but now they know when: john f. welch jr., the chairman of general electric, will retire after the company's annual meeting in april 2001.
  r: welch has turned what had been a $25 billion manufacturing company in 1981 into a $100 billion behemoth that derives huge portions of its revenues from more profitable services.
  TANDA score: 0.968; AVA score: 0.064

Table 10: Examples showing that AVA can detect failures of the state-of-the-art model by Garg et al. (2020).
We use the ADS dataset in this evaluation. The task is more challenging, as AVA only receives one best answer per system, selected from different candidate pools; there was also no control over the sources of the candidates. Table 9 shows the result. We note a lower correlation, due to the fact that the eight evaluated systems have very close Accuracy. On the other hand, the RMSE is rather low, 3.1%, and the std. dev. is also acceptable, suggesting an error of less than 7% with high probability.

Table 10 reports some example questions from the TREC-QA test set, the top candidate selected by the TANDA system (Garg et al., 2020), the classification score of the latter, and the AVA score. AVA judges an answer correct if its score is larger than 0.5. We note that even when the score of the TANDA system is low, AVA can assign the answer a very high score, indicating that it is correct (see the first three examples). Conversely, a wrong answer can be classified as such by AVA, even if TANDA assigned it a very large score (see the last two examples).
We presented AVA, an automatic evaluation method for QA systems. Specifically, we discussed our data collection strategy and model design to enable the development of AVA. First, we collected seven different datasets, classified into three different types, which we used to develop AVA in different stages. Second, we proposed different Transformer-based modeling designs for AVA to exploit the feature signals relevant to the problem. Our extensive experimentation has shown the effectiveness of AVA for different types of evaluation: point-wise and system-wise, over Accuracy, MAP, and MRR.

References
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1-61, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In NAACL 2019, pages 3938-3948, Minneapolis, Minnesota. Association for Computational Linguistics.

Siddhant Garg, Thuy Vu, and Alessandro Moschitti. 2020. TANDA: Transfer and adapt pre-trained transformer models for answer sentence selection.

Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. CoRR, abs/1904.10635.

Bert F. Green, Jr., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. 1961. Baseball: An automatic question-answerer. In Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference, IRE-AIEE-ACM '61 (Western), pages 219-224, New York, NY, USA. ACM.

Tilani Gunawardena, Nishara Pathirana, Medhavi Lokuhetti, Roshan G. Ragel, and Sampath Deegalla. 2015. Performance evaluation techniques for an automatic question answering system.

Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. CoRR, abs/1701.08198.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. TACL.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.

Jochen L. Leidner and Chris Callison-Burch. 2003. Evaluating question answering systems using FAQ answer injection. In Proceedings of the 6th Annual CLUK Research Colloquium.

Jimmy J. Lin and Dina Demner-Fushman. 2006. Methods for automatically evaluating answers to complex questions. Information Retrieval, 9:565-587.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116-1126, Vancouver, Canada. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In WMT 2019, pages 62-90, Florence, Italy. Association for Computational Linguistics.

Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. 2002. Towards automatic evaluation of question/answering systems. In LREC.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL 2002, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Chirag Shah and Jefferey Pomerantz. 2010. Evaluating and predicting answer quality in community QA. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 411-418, New York, NY, USA. ACM.

Gehui Shen, Yunlun Yang, and Zhi-Hong Deng. 2017. Inter-weighted alignment network for sentence pair modeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1179-1189, Copenhagen, Denmark. Association for Computational Linguistics.

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2017. RUBER: an unsupervised method for automatic evaluation of open-domain dialog systems. CoRR, abs/1701.03079.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.

Andy Way. 2018. Quality expectations of machine translation. CoRR, abs/1803.08409.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.

Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. 2019. A compare-aggregate model with latent clustering for answer selection.