Hierarchical Ranking for Answer Selection
Hang Gao, Mengting Hu, Renhong Cheng, Tiegang Gao
Nankai University
Abstract
Answer selection is the task of choosing the positive answers from a pool of candidate answers for a given question. In this paper, we propose a novel strategy for answer selection, called hierarchical ranking. We introduce three levels of ranking: point-level ranking, pair-level ranking, and list-level ranking. They formulate their optimization objectives by employing supervisory information from different perspectives to achieve the same goal of ranking candidate answers. Therefore, the three levels of ranking are related and can promote each other. We take the well-performing compare-aggregate model as the backbone and explore three schemes to implement the idea of applying the hierarchical rankings jointly: the scheme under the Multi-Task Learning (MTL) strategy, the Ranking Integration (RI) scheme, and the Progressive Ranking Integration (PRI) scheme. Experimental results on two public datasets, WikiQA and TREC-QA, demonstrate that the proposed hierarchical ranking is effective. Our method achieves state-of-the-art (non-BERT) performance on both TREC-QA and WikiQA.
Introduction
Answer selection is a basic task in question answering. Given a question and a list of candidate sentences, the machine needs to select the positive answers, which are sentences that can answer the question. Other sentences are called negative answers.

Recently, attention-based neural networks have performed well in this task. A number of recent works (He, Gimpel, and Lin 2015; Santos et al. 2016; Bian et al. 2017; Wang, Hamza, and Florian 2017) have proposed multiple variants of attention mechanisms for different scenarios. Besides, we observe that previous models mainly have three different usages of the supervisory signal.

Some works (Devlin et al. 2019; Tay et al. 2017; Wang, Hamza, and Florian 2017; Shen, Yang, and Deng 2017; Deng et al. 2019; Shen et al. 2018) regard the supervisory signals as the correlation between the question and answer. They utilize supervisory signals to learn to rank at point level. This is intuitive and has been widely adopted. Some works (Rao, He, and Lin 2016; He, Gimpel, and Lin 2015; Santos et al. 2016) construct positive-negative pairs according to supervisory signals, and distinguish the positive answers from the negative answers by learning their differences. This aims to rank the candidate answers at pair level. Some other models (Bian et al. 2017; Wang and Jiang 2017) treat the supervisory signals as list structure information and learn to rank the candidate answers, which can be called learning to rank at list level. In conclusion, these methods learn to rank at different levels of granularity.

Figure 1: Difference between the three levels of ranking. The dashed box shows the different granularities of different levels of ranking. The red box indicates a positive question-answer pair and the blue box indicates a negative one.

Based on the experience of deep learning, different uses of supervisory information lead to different optimization objectives in training. The point-level ranking focuses on optimizing the relevance score of each question-answer pair. The pair-level ranking learns the correlation among candidates in a contrastive way, which distinguishes the positive and negative answers; the models are optimized with a triplet ranking loss to allocate a higher score to the positive pair (Q, A+) than to the negative one (Q, A−). The list-level ranking aims to fit the predictions over a list of question-answer pairs to the ground truth. In other words, all three can rank the answer sentences through different optimization objectives, which shows that they are related and complementary. We argue that applying these hierarchical rankings jointly can bring performance improvement.

Under the idea of combining the hierarchical rankings, an intuitive solution is to employ multi-task learning, which treats the three rankings as three tasks. However, there exist internal relationships between the three optimization goals. For the same data, they focus on different levels of data granularity, which allows them to promote and inspire each other. For example, matching a (Q, A) pair better might improve the pair-level comparison, and the list-level ranking might benefit from the pair-level approach. Simple multi-task learning might be insufficient to make full use of these relationships.

In this paper, we propose a novel strategy for answer selection, called hierarchical ranking. Different from previous works that learn to rank at a fixed level, there are three levels of ranking in our strategy: point-level ranking, pair-level ranking, and list-level ranking. We design three schemes to implement the proposed hierarchical ranking strategy. Specifically, we first introduce the scheme under the Multi-Task Learning (MTL) strategy. Then, we propose the Ranking Integration (RI) scheme to make better use of the internal relationships between the three levels of ranking. Finally, we explore the Progressive Ranking Integration (PRI) scheme, which further enhances the combination of internal relations.

In summary, the main contributions of this paper are as follows:
• We revisit the three classical methods for answer selection. We propose a novel strategy, called hierarchical ranking, for exploiting their merits and the internal relationships between them. To the best of our knowledge, this is the first work to address this task with the three levels jointly.
• To implement the hierarchical ranking strategy, we propose three schemes: the scheme under Multi-Task Learning (MTL), Ranking Integration (RI) and Progressive Ranking Integration (PRI).
• Extensive experiments are conducted on the public datasets TREC-QA and WikiQA. Our model achieves state-of-the-art (non-BERT) performance on both. Results demonstrate the effectiveness of the proposed hierarchical ranking.

The rest of the paper is organized as follows: we introduce related work on the answer selection task in section 2. In section 3, we introduce the proposed strategy in detail. Then the experimental analysis and comparisons of the methods are reported in section 4. At last, we draw a conclusion for the entire paper.

Related Work
Mainstream Architectures.
The Siamese Network (Bromley et al. 1994) is a popular architecture. It calculates the relevance score, using measures such as Euclidean distance or cosine similarity, from the sentence vectors of the given question and answer. Many works are based on the Siamese Network (Yin et al. 2016; Santos et al. 2016; Rao, He, and Lin 2016; Tan, Xiang, and Zhou 2015).

A disadvantage of the siamese architecture is that some key information may be lost when sentences are compressed into sentence vectors during feature extraction. To solve this problem, another architecture has been proposed, called the Compare-Aggregate Network (Wang and Jiang 2017). In the Compare-Aggregate Network, matching is based on smaller units such as words or phrases, and the comparison process is more specific. Thus, the compare-aggregate network performs better. Much recent work is based on this architecture (Wang and Jiang 2017; Wang, Hamza, and Florian 2017; Bian et al. 2017; Yoon et al. 2019). Considering the flexible structure and excellent performance of this architecture, we build models based on it to validate our ideas.
Mainstream Approaches.
Previous work typically addressed the task of answer selection with three approaches: pointwise, pairwise and listwise. Benefiting from the development of neural networks, all three approaches have achieved good performance.

The pointwise approach (Devlin et al. 2019; Tay et al. 2017; Wang, Hamza, and Florian 2017; Shen, Yang, and Deng 2017; Deng et al. 2019; Shen et al. 2018) usually trains a binary classification model or a logistic regression model with cross-entropy loss. This approach regards a question and each of its candidate answers as separate training pairs, which simplifies training and makes it easy to reuse many other classification models.

The pairwise approach (Rao, He, and Lin 2016; He, Gimpel, and Lin 2015; Santos et al. 2016) learns the correlation among candidates in a contrastive way. Compared with the pointwise method, this approach employs the supervisory signals by connecting the positive answers and negative answers in the candidate list.

The listwise approach (Bian et al. 2017; Wang and Jiang 2017) aims to learn the group structure information of the question-answer pairs. This approach can compare all the candidate answers globally. For more details, we refer interested readers to (Li 2011; Liu 2009).
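To make the contrast between the three families of objectives concrete, the following dependency-free sketch computes a pointwise cross-entropy loss, a pairwise margin loss, and a listwise KL-style loss over one candidate list. The toy scores are hypothetical stand-ins for the outputs of a neural scoring model; this illustrates the loss shapes only, not any particular paper's implementation.

```python
import math

def pointwise_loss(scores, labels):
    """Binary cross-entropy over each (question, answer) pair."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))  # sigmoid relevance probability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(scores)

def pairwise_loss(scores, labels, margin=1.0):
    """Hinge loss over all positive-negative score pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum(max(0.0, margin - (p - n)) for p, n in pairs) / len(pairs)

def listwise_loss(scores, labels):
    """KL divergence between the normalized labels and softmax(scores)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    probs = [e / sum(exps) for e in exps]           # softmax over the list
    target = [y / sum(labels) for y in labels]      # normalized gold labels
    # Only terms with non-zero target mass contribute.
    return sum(t * math.log(t / p) for t, p in zip(target, probs) if t > 0)
```

The pointwise loss treats every candidate independently, the pairwise loss only constrains relative order within positive-negative pairs, and the listwise loss matches a distribution over the whole candidate list.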
Model Architecture
In this section, we first formulate the task, then give a brief description of the backbone network. Next, we describe the scheme under the multi-task learning strategy, the ranking integration scheme and the progressive ranking integration scheme, and explain their differences and the motivations behind them.
Formulation
For a given question and a candidate answer, let Q denote the question, A denote the answer, and y the ground-truth label representing whether A is positive or not.

Backbone Network
Considering the excellent performance of the compare-aggregate model (Wang and Jiang 2017; Bian et al. 2017; Yoon et al. 2019), we take it as the backbone to prove the effectiveness of the proposed hierarchical ranking strategy. A typical compare-aggregate model has the following layers: encoding layer, interaction layer, comparison and aggregation layer, and prediction layer.
Encoding Layer
In this layer, for a given question Q = {q_1, q_2, ..., q_n} and an answer A = {a_1, a_2, ..., a_m}, we first map them into word embedding sequences E^q = {e^q_1, e^q_2, ..., e^q_n} and E^a = {e^a_1, e^a_2, ..., e^a_m}, respectively. Then, we use a modified version of LSTM/GRU (Wang and Jiang 2017) to obtain the context representations H^q and H^a, which contain context information:

H^a = sigmoid(E^a W^i + b^i) ⊙ tanh(E^a W^u + b^u)
H^q = sigmoid(E^q W^i + b^i) ⊙ tanh(E^q W^u + b^u)    (1)

where W^i, W^u, b^i and b^u are parameters of the modified LSTM/GRU.
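Concretely, Eq. (1) gates a tanh candidate with a sigmoid gate, element-wise. A minimal dependency-free sketch with toy dimensions follows; in the real model W^i, W^u, b^i, b^u are learned, so the small matrices below are purely illustrative.

```python
import math

def gated_encoding(E, Wi, bi, Wu, bu):
    """sigmoid(E·Wi + bi) ⊙ tanh(E·Wu + bu) for one token sequence.
    E: list of d-dim embedding vectors; Wi, Wu: d×h matrices (lists of
    rows); bi, bu: h-dim bias vectors. Returns the list of h-dim states."""
    def affine(W, x, b):
        # x·W + b for a single embedding vector x
        h = len(b)
        return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
                for j in range(h)]
    H = []
    for e in E:
        gate = [1.0 / (1.0 + math.exp(-v)) for v in affine(Wi, e, bi)]
        cand = [math.tanh(v) for v in affine(Wu, e, bu)]
        H.append([g * c for g, c in zip(gate, cand)])  # element-wise product
    return H
```

Because the gate lies in (0, 1) and the candidate in (−1, 1), every encoded value is bounded in (−1, 1).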
Figure 2: The architecture of the model that follows the MTL strategy. [;] denotes the vector concatenation operation.
Interaction Layer
After obtaining the representations of the input question and answer, we utilize a standard co-attention module to learn the interactions between the question-answer pair:

M = H^q (H^a)^T    (2)

Then soft alignment is used to obtain the alignments Ĥ^q and Ĥ^a to each other as follows:

Ĥ^q = softmax(M) H^a
Ĥ^a = softmax(M^T) H^q    (3)

Comparison and Aggregation Layer
To compare the aligned question Ĥ^q with H^q at word level, we adopt a comparison function to get the matching sequences, and likewise for the aligned answer Ĥ^a and H^a:

C^q = Ĥ^q ⊙ H^q
C^a = Ĥ^a ⊙ H^a    (4)

where the operator ⊙ is element-wise multiplication. C^a = {c^a_1, c^a_2, ..., c^a_m} and C^q = {c^q_1, c^q_2, ..., c^q_n} are feature sequences. Then, we apply a one-layer CNN to aggregate each matching sequence into a fixed-dimensional vector:

r^q = CNN([c^q_1, c^q_2, ..., c^q_n])
r^a = CNN([c^a_1, c^a_2, ..., c^a_m])    (5)

We concatenate r^q and r^a together as a matching vector r = [r^q; r^a].

Prediction Layer
In the prediction layer, we pass r through a multi-layer perceptron for prediction.

Multi-task Learning Strategy
As mentioned before, multi-task learning (MTL) is an intuitive solution for using the three levels of ranking jointly. The architecture of the model that follows MTL is shown in Figure 2. It follows hard parameter sharing. In the MTL strategy, the three levels of ranking are treated as three tasks.

Considering that rankings at different granularities have different concerns, the encoding layer and the interaction layer are shared between the three rankings, while the comparison and aggregation layers and the prediction layers are level-specific. The advantage of sharing the encoding layer and the interaction layer is that the low-level representation can benefit from all three ranking objectives. The goal of the comparison and aggregation layer is to obtain the comparison features of the question and answer, and the comparison features at different granularities are distinct. Therefore, applying level-specific comparison and aggregation layers helps to learn specific features from different perspectives.

With the three level-specific comparison and aggregation layers, we obtain three comparison features. Let r_point denote the feature for point-level ranking, r_pair the feature for pair-level ranking, and r_list the feature for list-level ranking. Note that the three comparison and aggregation layers have the same architecture. Next, we describe the level-specific prediction layers and the losses used for training. At each step, we feed a given question and all its candidate answers to the model; for a clearer description, we suppose there are k positive answers and t negative answers in the candidate list.

Point-level Ranking
The point-level ranking strategy aims to optimize the predicted probability distribution of each question-answer pair to be consistent with the ground truth. Thus, we regard the task as a classification problem. For the i-th question-answer pair, we feed r^i_point into a two-layer perceptron for prediction and employ cross-entropy loss for training:

p^i_point = softmax(MLP(r^i_point))
L_point = −(1 / (k + t)) Σ_{i=1}^{k+t} log P(y_i | p^i_point)    (6)

where y_i is the ground-truth label for the i-th question-answer pair. In the testing phase, the scores used for sorting candidate answers are the probabilities that the answer is positive.

Pair-level Ranking
The pair-level ranking strategy is to optimize the relative ordering of a pair of prediction scores. The main idea behind it is noise contrastive estimation. For the i-th question-answer pair, we feed r^i_pair into a two-layer perceptron to calculate a relevance score:

p^i_pair = MLP(r^i_pair)    (7)

Here, we introduce two methods to construct the pairs for pair-level ranking. In the first method, we generate all possible positive-negative pairs. In the second method, we pick the negative pair with the highest relevance score and require the scores of all positive pairs to be higher than it. For the given k positive answers and t negative answers, we obtain k · t pairs in the first method and k pairs in the second method. For a positive-negative pair, let p^+_pair denote the relevance score of the positive pair and p^−_pair denote that of the negative pair. We use a margin loss to assign larger scores
to positive pairs than to negative pairs for each (p^+_pair, p^−_pair). The loss for pair-level ranking can be formulated as follows:

L_pair = (1 / (k · t)) Σ_{i=1}^{k} Σ_{j=1}^{t} max(0, M − (p^+_pair − p^−_pair))    (8)

where M is the margin, treated as a hyperparameter. Note that the above loss follows the first method of positive-negative pair generation. In the testing phase, p^i_pair is used for sorting candidate answers.

Figure 3: The proposed three RI schemes and two PRI schemes. The red dashed line represents the integrated representation employed in testing. Note that the architecture of the bottom layers in RI and PRI is the same as in the MTL model, so it is omitted for conciseness.

List-level Ranking
The list-level ranking strategy is to optimize the relative ordering of the list of prediction scores. We feed r^i_list into a two-layer perceptron to calculate a relevance score. Then we calculate the probability of each of the k + t pairs being positive and utilize it to rank the candidate answers. We utilize a KL-divergence loss for training; besides, the label vector Y needs to be normalized:

p^i_list = MLP(r^i_list)
p_list = softmax({p^1_list, ..., p^{k+t}_list})
Ȳ = Y / Σ_{i=1}^{k+t} y_i
L_list = (1 / (k + t)) KL(p_list || Ȳ)    (9)

In the testing phase, p_list is the list of scores used for sorting candidate answers.

Joint Loss
The MTL strategy is to minimize the joint loss:

L = λ_point L_point + λ_pair L_pair + λ_list L_list    (10)

where λ_point, λ_pair and λ_list balance the effect of each loss.

An Example of How Batching Works
Here, we give an example to clarify how batching works for the different ranking strategies. Suppose a given question has 2 positive answers and 3 negative answers. There are 5 question-answer pairs for point-level ranking and one list of question-answer pairs for list-level ranking. For pair-level ranking, there are 6 or 3 positive-negative pairs for training, depending on the method of positive-negative pair generation. For convenience of explanation, assume that each question in the dataset has 2 positive answers and 3 negative answers. Then, if the batch size is set to 10, there are 50 question-answer pairs for the pointwise approach, 60 or 30 positive-negative pairs for the pairwise approach, and 10 lists of pairs for the listwise approach. As this example shows, the amount of data used for training differs across the ranking granularities.
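The counting in the example above can be sketched as a small helper (following the first, all-pairs method of pair generation; the function name is ours):

```python
def instance_counts(k, t, batch_questions):
    """Training instances per ranking level for one batch, assuming each
    question has k positive and t negative candidates and that all k*t
    positive-negative pairs are generated."""
    point = batch_questions * (k + t)  # one instance per question-answer pair
    pair = batch_questions * k * t     # every positive-negative combination
    lists = batch_questions            # one candidate list per question
    return point, pair, lists

print(instance_counts(2, 3, 10))  # (50, 60, 10), matching the example
```

The growth of the pairwise count with k·t is why the pair-construction method matters for training cost.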
Ranking Integration
As mentioned before, the three different ranking strategies can complement and reinforce each other to achieve the goal of ranking the list of candidate answers. The multi-task learning strategy is insufficient to make full use of these internal relationships. In this section, we explore several more direct schemes, called Ranking Integration (RI).

An intuitive observation is that the extracted features differ under different optimization objectives. Thus, the three levels of optimization goals can learn features from different perspectives. The features obtained by point-level methods contain information about the relevance between Q and A. The pair-level features contain the comparison information between an opposite pair (Q, A+) and (Q, A−). The list-level features contain information about comparing (Q, A) with all other pairs. Based on these considerations, we argue that it is reasonable to integrate the features from the three approaches. The enhanced feature is helpful for improving the performance of the model. This is where the idea of the RI scheme comes from.

As shown in the top row of Figure 3, RI contains three schemes. In each RI scheme, we take one level of ranking as the main objective and the other two rankings as auxiliary objectives. The main objective is used for the final prediction, and the auxiliary objectives are used for training distinct matching features at the specific granularities and for providing more evidence for the main objective. The main difference between RI and MTL is that the matching feature used for the final prediction is enhanced in RI. Next, we describe the feature enhancement scheme in RI(point), RI(pair) and RI(list).

In RI(point), we take the point-level ranking as the main objective. The features extracted from the pair-level ranking and list-level ranking are used to enhance the feature used for point-level ranking.
r̂_pair = r_pair
r̂_list = r_list
r̂_point = [r̂_pair; r̂_list; r_point]    (11)

where r̂_point is the enhanced feature. We pass r̂_pair and r̂_list through the level-specific prediction layers for the training of the auxiliary objectives, and pass r̂_point through the point-level prediction layer for the final prediction. The losses used to train the different rankings are the same as those described in the MTL strategy.

Similarly, in RI(pair), we take the pair-level ranking as the main objective:

r̂_point = r_point
r̂_list = r_list
r̂_pair = [r̂_point; r̂_list; r_pair]    (12)

In RI(list), we take the list-level ranking as the main objective:

r̂_point = r_point
r̂_pair = r_pair
r̂_list = [r̂_point; r̂_pair; r_list]    (13)

Progressive Ranking Integration
In this section, we introduce the proposed novel Progressive Ranking Integration (PRI). As shown in the bottom row of Figure 3, PRI contains two schemes. Compared with RI, PRI employs a progressive way to integrate features and compute losses. The idea is borrowed from the divide-and-conquer technique. For a given question and its candidate answers, the point-level ranking strategy concentrates on determining the relevance between the question and an answer. We can get the ranking result of a positive pair and a negative pair (pair-level ranking) by combining the results of relevance judgments in pairs. In other words, the solution of the point-level ranking strategy can be combined to give a solution to the pair-level ranking. Similarly, combining the ranking results of multiple positive-negative pairs yields the ranking result of a list of candidate answers; that is, the solution of the pair-level ranking strategy can be combined to give a solution to the list-level ranking. This follows the divide-and-conquer technique. We implement this idea in PRI(list). Besides, we also explore an inverse process of PRI(list), called PRI(point). The intuition behind PRI(point) is that it is also reasonable to first sort the candidate answers altogether (list level), then gradually refine this to sort the pairs, and finally optimize the similarity of each question-answer pair. In conclusion, PRI(list) aims to rank the candidate answers from individual to list, and PRI(point) aims to obtain the ranking results by progressive refinement.
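The progressive integration described above amounts to nested feature concatenation: each level's enhanced feature absorbs the previous level's enhanced feature. A toy sketch, with Python lists standing in for the feature vectors and list addition standing in for vector concatenation (function names are ours):

```python
def pri_list(r_point, r_pair, r_list):
    """Fine-to-coarse integration: each enhanced feature concatenates the
    previously enhanced feature with the current level's own feature."""
    r_hat_point = r_point
    r_hat_pair = r_hat_point + r_pair  # [r̂_point ; r_pair]
    r_hat_list = r_hat_pair + r_list   # [r̂_pair ; r_list]
    return r_hat_point, r_hat_pair, r_hat_list

def pri_point(r_point, r_pair, r_list):
    """The inverse (coarse-to-fine) process, from list level down to point."""
    r_hat_list = r_list
    r_hat_pair = r_hat_list + r_pair   # [r̂_list ; r_pair]
    r_hat_point = r_hat_pair + r_point # [r̂_pair ; r_point]
    return r_hat_point, r_hat_pair, r_hat_list
```

Note how, unlike in RI, the list-level feature in pri_list indirectly contains the point-level feature through the pair-level one.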
PRI(list)
Following the idea of PRI(list), the finer-grained ranking can be treated as the foundation for the next-level ranking. More formally:

r̂_point = r_point
r̂_pair = [r̂_point; r_pair]
r̂_list = [r̂_pair; r_list]    (14)

PRI(point)
In this scheme, we explore an inverse process of PRI(list) as follows, and take the output of the point-level ranking as the final prediction:

r̂_list = r_list
r̂_pair = [r̂_list; r_pair]
r̂_point = [r̂_pair; r_point]    (15)

Experiments
In this section, we first introduce the datasets and evaluation metrics used in the experiments. Then, the model parameters and training settings are introduced in detail. Finally, we analyze the experimental results and validate our methods.
Datasets and Metrics
We utilize two public datasets, TREC-QA and WikiQA. Many previous works also used these two datasets to evaluate model performance.

The TREC-QA dataset (Wang, Smith, and Mitamura 2007) was collected from TREC QA track 8-13 data. In this paper, we only report results on the Clean version of TREC-QA, which consists of 1,229 questions with 53,417 question-answer pairs in the train set, 65 questions with 1,117 pairs in the development set, and 68 questions with 1,442 pairs in the test set.

WikiQA (Yang, Yih, and Meek 2015) is a common dataset for answer selection. We follow (Yang, Yih, and Meek 2015) to remove all questions with no correct candidate answers. The filtered WikiQA has 873 questions with 8,627 question-answer pairs in the train set, 126 questions with 1,130 pairs in the development set, and 243 questions with 2,351 pairs in the test set.

Following previous works, MAP (mean average precision) and MRR (mean reciprocal rank) are adopted as our evaluation metrics.
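For reference, MAP and MRR average the following per-question quantities over all questions (a sketch; `labels` are the gold annotations of one question's candidates, sorted by predicted score, and questions without positives are assumed filtered out, as in the WikiQA preprocessing above):

```python
def average_precision(labels):
    """AP for one ranked candidate list (1 = positive answer)."""
    hits, precisions = 0, []
    for rank, y in enumerate(labels, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at each hit
    return sum(precisions) / hits

def reciprocal_rank(labels):
    """1 / rank of the first positive answer in the ranked list."""
    for rank, y in enumerate(labels, start=1):
        if y == 1:
            return 1.0 / rank

ranked = [0, 1, 0, 1, 0]  # toy ranking: positives at ranks 2 and 4
# AP = (1/2 + 2/4) / 2 = 0.5, RR = 1/2
```

MAP is the mean of the per-question AP values, and MRR the mean of the per-question reciprocal ranks.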
Compared Methods
We compare the proposed schemes against competitive baselines following the ACL wiki leaderboard (https://aclweb.org/aclwiki/Question_Answering_(State_of_the_art)). It should be noted that some other research lines utilize transfer learning (Lai et al. 2019) or external knowledge (Madabushi, Lee, and Barnden 2018; Shen et al. 2018) such as knowledge graphs, or the pre-trained model BERT (Lai et al. 2019; Shao et al. 2019); these are good works, but for fairness we do not use them as baselines. Besides, for a fair comparison, we reimplement DCA (Bian et al. 2017) but do not clip the number of candidate answers corresponding to each question to the same number. We also present the current state-of-the-art BERT (Devlin et al. 2019) fine-tuning results and RoBERTa (Liu et al. 2019) fine-tuning results on each dataset, as reported by Laskar, Huang, and Hoque.

Models                                               WikiQA         TREC-QA
                                                     MAP    MRR     MAP    MRR
AP-CNN (Santos et al. 2016)                          0.689  0.696   0.753  0.851
MP-CNN (He, Gimpel, and Lin 2015)                    0.693  0.709   0.777  0.836
NCE (Rao, He, and Lin 2016)                          0.701  0.718   0.801  0.877
L.D.C (Wang, Mi, and Ittycheriah 2016)               0.706  0.723   0.771  0.845
PWIM (He and Lin 2016)                               0.709  0.723   -      -
HyperQA (Tay, Tuan, and Hui 2018a)                   0.712  0.727   0.784  0.865
BIMPM (Wang, Hamza, and Florian 2017)                0.718  0.731   0.802  0.875
IWAN (Shen, Yang, and Deng 2017)                     0.733  0.750   0.822  0.889
DCA (Bian et al. 2017)                               0.736  0.749   0.813  0.867
IWAN+sCARNN (Tran et al. 2018)                       -      -       0.829  0.875
MCAN (Tay, Tuan, and Hui 2018b)                      -      -       0.838  0.904
BERT Fine-Tuning (Laskar, Huang, and Hoque 2020)     0.843  0.857   0.905  0.967
RoBERTa Fine-Tuning (Laskar, Huang, and Hoque 2020)  0.900  0.915   0.936  0.978
CA(point)                                            0.721  0.735   0.809  0.867
CA(pair)                                             0.725  0.737   0.832  0.889
CA(list)                                             0.736  0.749   0.809  0.863
MTL(point)                                           0.734  0.747   0.831  0.885
MTL(pair)                                            0.734  0.745   0.837  0.895
MTL(list)                                            0.739  0.751   0.835  0.892
RI(point)                                            0.734  0.747   0.823  0.880
RI(pair)                                             0.737  0.749   -      -
RI(list)                                             0.740  0.751   0.835  0.892
PRI(point)                                           0.732  0.743   0.825  0.890
PRI(list)                                            0.742  0.754   0.841  0.898

Table 1: Evaluation results on WikiQA and the Clean version of TREC-QA in terms of MAP and MRR. A T-test comparing PRI(list) with the strong baseline DCA shows that the improvements are statistically significant.

Our Methods
To verify the proposed strategies, we implement several models as follows. Note that the level indicated in parentheses is the ranking level used for the final prediction.
• CA. CA(point), CA(pair) and CA(list) are the basic compare-aggregate models, each trained with only one specific ranking.
• MTL. MTL(point), MTL(pair) and MTL(list) are the models that follow the MTL strategy.
• RI. RI(point), RI(pair) and RI(list) are the models that follow the RI scheme.
• PRI. PRI(point) and PRI(list) are the models that follow the PRI scheme.
Implementation details
We implement our model with PyTorch (version 1.5.0) and train on one Tesla P100 GPU. We follow Bian et al. to tokenize and pad the sentences, but do not clip the number of candidate answers corresponding to each question to the same number.
Model details.
For both WikiQA and TREC-QA, the pre-trained 300-dimensional GloVe word vectors (Pennington, Socher, and Manning 2014), trained on the 840B Common Crawl corpus, are used to initialize the word embeddings, and the embeddings of out-of-vocabulary words are initialized to zeros. For TREC-QA, all embeddings are fixed during training. For WikiQA, all embeddings are fine-tuned. The dimension of the hidden states is 300. We use [1,2,3,4,5] as the kernel sizes of the one-layer CNN. The output channel of the CNN is 150. For WikiQA, we adopt k-max (Bian et al. 2017) to filter irrelevant words. In the pair-level ranking, for WikiQA, we apply the margin loss to each positive-negative pair and normalize the prediction scores with sigmoid; the margin is set to 0.8. For TREC-QA, we adopt the alternative solution, which picks the negative case with the highest relevance score and requires all positive scores to be higher than it; the margin is set to 1.
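The TREC-QA variant described above (pairing every positive with the single highest-scoring negative) can be sketched as follows; the scores are hypothetical and the function name is ours:

```python
def hardest_negative_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Pair each positive answer with the single highest-scoring negative
    and apply the margin loss of Eq. (8) to those k pairs only."""
    hardest = max(neg_scores)  # the most confusing negative candidate
    losses = [max(0.0, margin - (p - hardest)) for p in pos_scores]
    return sum(losses) / len(losses)

# Toy scores: the hardest negative (1.4) is compared against each positive.
loss = hardest_negative_margin_loss([2.0, 1.0], [0.3, 1.4, -0.2])  # ≈ 0.9
```

Compared with enumerating all k·t pairs, this focuses the gradient on the single hardest negative, at the cost of ignoring the easier ones.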
Training details.
For training, we adopt Adam to optimize the models. The learning rate for model parameters is 5e-4. For WikiQA, the learning rate for embeddings is 5e-5. For TREC-QA, embeddings are not updated during training. Mini-batches are used to train the models; each batch contains 30 questions and their candidate answers. We use early stopping to prevent overfitting: training stops when the MAP on the development set has not increased for 10 epochs in succession. We set λ_pair to 1 and λ_list to 1. For WikiQA, we set λ_point to 2, and to 1 for TREC-QA. All the results we report are the average of 5 runs with random seeds [0,1,2,3,4].

Models               WikiQA         TREC-QA
                     MAP    MRR     MAP    MRR
PRI(list)            0.742  0.754   0.841  0.898
(1) w/o point-level  0.735  0.746   0.829  0.883
(2) w/o pair-level   0.739  0.752   0.833  0.881
(3) PRI(all list)    0.726  0.737   0.814  0.859

Table 2: Ablation study.

Experimental Analysis
The experimental results are reported in Tables 1 and 2. The best scores (besides BERT) on each metric are marked in bold. All the results are the average score of 5 runs.
Compare with baselines.
Table 1 shows the results compared with the baselines. Firstly, we observe that it is effective to use the three levels of ranking jointly for training. In this paper, we propose the scheme under the MTL strategy, the RI scheme and the PRI scheme, and all of them outperform the CA models that follow only one specific ranking. Specifically, MTL(list), RI(list) and PRI(list) perform better than CA(list) on both WikiQA and TREC-QA in terms of MAP and MRR, and similarly, MTL(point), RI(point) and PRI(point) perform better than CA(point). This proves that the main ranking objectives can benefit from the other two auxiliary objectives. As such, the three schemes we propose for hierarchical ranking are effective and crucial. Secondly, we observe that in some cases RI performs better than MTL; the improvement is marginal, but RI still contributes to performance. Thirdly, we observe that PRI brings further performance improvement on top of RI; in particular, PRI(list) performs better than RI(list) on both datasets. Finally, we observe that, except for BERT, the proposed PRI(list) achieves the best performance on all metrics on WikiQA and RI(pair) achieves the best performance on all metrics on TREC-QA. Besides, PRI(list) achieves the second-best performance on MAP on TREC-QA, and the performance of RI(list) is also better than that of the other baselines (besides BERT). Specifically, compared with the strong baseline DCA on TREC-QA, our method PRI(list) significantly outperforms it by +0.028 on MAP and +0.031 on MRR. This shows that our network, which employs the hierarchical ranking strategy, is successful.
Ablation study.
In this section, we aim to demonstrate the effectiveness of the joint ranking. We take the best-performing PRI(list) as the full-configuration model, and the following models are constructed: (1) "w/o point-level": we remove the point-level ranking part in PRI(list) and keep only the pair-level ranking as an auxiliary objective. (2) "w/o pair-level": we remove the pair-level ranking part in PRI(list) and keep only the point-level ranking as an auxiliary objective. (3) "PRI(all list)": we set all three objectives to list-level ranking. The results are reported in Table 2.

Firstly, we observe that removing either the point-level ranking or the pair-level ranking causes performance degradation. This shows that both the point-level ranking and the pair-level ranking are necessary.

Secondly, we observe that replacing all three rankings with list-level ranking significantly decreases the performance. This proves that it is necessary to apply the three levels of ranking jointly in the PRI scheme, and that the performance improvement comes from the hierarchical ranking strategy rather than from more parameters.

Figure 4: L_list loss curves under different strategies ("list-level only" vs. PRI(list)).

Loss
As shown in Figure 4, we plot the curves of L_list for the different schemes. We observe that L_list in the PRI(list) scheme decreases faster than in "list-level only". This shows that, with the help of the multiple ranking approaches and feature integration, our method PRI(list) needs fewer epochs to converge.

Conclusion
In this paper, we propose a novel strategy for answer selection, called hierarchical ranking. There are three levels of ranking in the proposed strategy: point-level ranking, pair-level ranking, and list-level ranking. To implement the hierarchical ranking strategy, we first introduce a scheme under the Multi-Task Learning (MTL) strategy, then propose the Ranking Integration (RI) scheme, and furthermore explore the idea of integrating features progressively via Progressive Ranking Integration (PRI). Experimental results demonstrate that the proposed hierarchical ranking strategy is effective: all three schemes under the hierarchical ranking strategy outperform models that follow only one specific ranking approach, and the proposed RI and PRI can further improve performance. Our method achieves state-of-the-art (non-BERT) performance on both TREC-QA and WikiQA.
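The structural difference between the parallel MTL scheme and the progressive PRI scheme can be sketched as follows. This is a deliberately toy illustration under our own assumptions (the head wiring and scoring are stand-ins, not the paper's architecture): in MTL the three ranking heads independently read the shared backbone features, while in PRI each level consumes the features produced by the previous level.

```python
def ranking_head(features):
    """Stand-in for one ranking module: produces a score for the
    candidate list plus enriched features for downstream use."""
    score = sum(features) / len(features)   # toy scoring function
    return score, features + [score]        # expose features to the next level

def mtl_scheme(shared_features):
    """MTL: point-, pair-, and list-level heads all read the shared
    backbone features independently; their losses are simply summed."""
    return [ranking_head(shared_features)[0] for _ in range(3)]

def pri_scheme(shared_features):
    """PRI (as we read it): heads are chained point -> pair -> list,
    each level consuming the features the previous level produced,
    so the rankings are integrated progressively."""
    features = shared_features
    scores = []
    for _ in range(3):
        score, features = ranking_head(features)
        scores.append(score)
    return scores  # the last score is the main (list-level) output
```

The chaining is the point: in `pri_scheme` the list-level head sees features already shaped by the point- and pair-level objectives, whereas in `mtl_scheme` the three heads only interact through the shared backbone.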
References
Bian, W.; Li, S.; Yang, Z.; Chen, G.; and Lin, Z. 2017. A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection. In (CIKM), 1987–1990.
Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, R. 1994. Signature verification using a "siamese" time delay neural network. In (NeurIPS), 737–744.
Deng, Y.; Xie, Y.; Li, Y.; Yang, M.; Du, N.; Fan, W.; Lei, K.; and Shen, Y. 2019. Multi-task learning with multi-view attention for answer selection and knowledge base question answering. In (AAAI), 6318–6325.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In (NAACL-HLT), 4171–4186.
He, H.; Gimpel, K.; and Lin, J. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In (EMNLP), 1576–1586.
He, H.; and Lin, J. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In (NAACL-HLT), 937–948.
Lai, T.; Tran, Q. H.; Bui, T.; and Kihara, D. 2019. A Gated Self-attention Memory Network for Answer Selection. In (EMNLP-IJCNLP), 5952–5958.
Laskar, M. T. R.; Huang, J.; and Hoque, E. 2020. Contextualized Embeddings based Transformer Encoder for Sentence Similarity Modeling in Answer Selection Task. In (LREC).
Li, H. 2011. Learning to rank for information retrieval and natural language processing. Synthesis Lectures on Human Language Technologies.
Foundations and Trends in Information Retrieval (COLING), 3283–3294.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global Vectors for Word Representation. In (EMNLP), 1532–1543.
Rao, J.; He, H.; and Lin, J. 2016. Noise-contrastive estimation for answer selection with deep neural networks. In (CIKM), 1913–1916.
Santos, C. N. D.; Tan, M.; Xiang, B.; and Zhou, B. 2016. Attentive Pooling Networks. arXiv: Computation and Language.
Shao, B.; Gong, Y.; Qi, W.; Duan, N.; and Lin, X. 2019. Aggregating Bidirectional Encoder Representations Using MatchLSTM for Sequence Matching. In (EMNLP-IJCNLP), 6058–6062.
Shen, G.; Yang, Y.; and Deng, Z.-H. 2017. Inter-weighted alignment network for sentence pair modeling. In (EMNLP), 1179–1189.
Shen, Y.; Deng, Y.; Yang, M.; Li, Y.; Du, N.; Fan, W.; and Lei, K. 2018. Knowledge-aware attentive neural network for ranking question answer pairs. In (SIGIR), 901–904.
Tan, M.; Xiang, B.; and Zhou, B. 2015. LSTM-based Deep Learning Models for Non-factoid Answer Selection. arXiv: Computation and Language.
Tay, Y.; Phan, M. C.; Tuan, L. A.; and Hui, S. C. 2017. Learning to rank question answer pairs with holographic dual LSTM architecture. In (SIGIR), 695–704.
Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018a. Hyperbolic representation learning for fast and efficient neural question answering. In (WSDM), 583–591.
Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018b. Multi-cast attention networks. In (KDD), 2299–2308.
Tran, Q. H.; Lai, T.; Haffari, G.; Zukerman, I.; Bui, T.; and Bui, H. 2018. The context-dependent additive recurrent neural net. In (NAACL-HLT), 1274–1283.
Wang, M.; Smith, N. A.; and Mitamura, T. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In (EMNLP-CoNLL), 22–32.
Wang, S.; and Jiang, J. 2017. A Compare-Aggregate Model for Matching Text Sequences. In (ICLR).
Wang, Z.; Hamza, W.; and Florian, R. 2017. Bilateral Multi-Perspective Matching for Natural Language Sentences. In (IJCAI), 4144–4150.
Wang, Z.; Mi, H.; and Ittycheriah, A. 2016. Sentence Similarity Learning by Lexical Decomposition and Composition. In (COLING), 1340–1349.
Yang, Y.; Yih, W.-t.; and Meek, C. 2015. WikiQA: A challenge dataset for open-domain question answering. In (EMNLP), 2013–2018.
Yin, W.; Schütze, H.; Xiang, B.; and Zhou, B. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics 4: 259–272.
Yoon, S.; Dernoncourt, F.; Kim, D. S.; Bui, T.; and Jung, K. 2019. A compare-aggregate model with latent clustering for answer selection. In (CIKM).