Modelling Interaction of Sentence Pair with Coupled-LSTMs
Pengfei Liu, Xipeng Qiu∗, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{pfliu14, xpqiu, xjhuang}@fudan.edu.cn
∗ Corresponding author.

Abstract
Recently, there is rising interest in modelling the interactions of two sentences with deep neural networks. However, most of the existing methods encode two sequences with separate encoders, in which a sentence is encoded with little or no information from the other sentence. In this paper, we propose a deep architecture to model the strong interaction of a sentence pair with two coupled-LSTMs. Specifically, we introduce two coupled ways to model the interdependences of two LSTMs, coupling the local contextualized interactions of two sentences. We then aggregate these interactions and use dynamic pooling to select the most informative features. Experiments on two very large datasets demonstrate the efficacy of our proposed architecture and its superiority to state-of-the-art methods.
Introduction

Distributed representations of words or sentences have been widely used in many natural language processing (NLP) tasks, such as text classification [Kalchbrenner et al., 2014], question answering and machine translation [Sutskever et al., 2014], and so on. Among these tasks, a common problem is modelling the relevance/similarity of a sentence pair, which is also called text semantic matching. Recently, deep learning based models have attracted substantial interest in text semantic matching and have achieved great progress [Hu et al., 2014; Qiu and Huang, 2015; Wan et al., 2016].

According to the phase at which two sentences interact, previous models can be classified into three categories.
Weak Interaction Models
Some early works focus on sentence-level interactions, such as ARC-I [Hu et al., 2014], CNTN [Qiu and Huang, 2015] and so on. These models first encode the two sequences separately with some basic (neural bag-of-words, BOW) or advanced (RNN, CNN) components of neural networks, and then compute the matching score based on the distributed vectors of the two sentences. In this paradigm, the two sentences have no interaction until the final phase.
Semi-Interaction Models
Some improved methods focus on utilizing multi-granularity representations (word, phrase and sentence level), such as MultiGranCNN [Yin and Schütze, 2015] and Multi-Perspective CNN [He et al., 2015]. Another kind of model uses a soft attention mechanism to obtain the representation of one sentence conditioned on the representation of the other sentence, such as ABCNN [Yin et al., 2015] and Attention LSTM [Rocktäschel et al., 2015; Hermann et al., 2015]. These models alleviate the weak interaction problem, but are still insufficient to model the contextualized interaction at the word and phrase level.
Strong Interaction Models
These models directly build an interaction space between two sentences and model the interaction at different positions, e.g., ARC-II [Hu et al., 2014] and MV-LSTM [Wan et al., 2016]. Such models make it easy to capture the difference between the semantic capacities of two sentences.

In this paper, we propose a new deep neural network architecture to model the strong interactions of two sentences. Different from modelling two sentences with separate LSTMs, we utilize two interdependent LSTMs, called coupled-LSTMs, which fully affect each other at different time steps. The output of the coupled-LSTMs at each step depends on both sentences. Specifically, we propose two interdependent ways for the coupled-LSTMs: a loosely coupled model (LC-LSTMs) and a tightly coupled model (TC-LSTMs). Similar to bidirectional LSTMs for a single sentence [Schuster and Paliwal, 1997; Graves and Schmidhuber, 2005], four directions can be used in coupled-LSTMs. To utilize the information of all four directions of the coupled-LSTMs, we aggregate them and adopt a dynamic pooling strategy to automatically select the most informative interaction signals. Finally, we feed them into a fully connected layer, followed by an output layer to compute the matching score.

The contributions of this paper can be summarized as follows.

1. Different from architectures that use a similarity matrix, our proposed architecture directly models the strong interactions of two sentences with coupled-LSTMs, which can capture useful local semantic relevances of the two sentences. Our architecture can also capture multiple granular interactions by stacking several coupled-LSTMs layers.

2. Compared to previous works on text matching, we perform extensive empirical studies on two very large datasets. The massive scale of the datasets allows us to train very deep neural networks. Experiment results demonstrate that our proposed architecture is more effective than state-of-the-art methods.

Long short-term memory network (LSTM) [Hochreiter and Schmidhuber, 1997] is a type of recurrent neural network (RNN) [Elman, 1990] that specifically addresses the issue of learning long-term dependencies. LSTM maintains a memory cell that updates and exposes its content only when deemed necessary.

While there are numerous LSTM variants, here we use the LSTM architecture of [Jozefowicz et al., 2015], which is similar to the architecture of [Graves, 2013] but without peep-hole connections.

We define the LSTM units at each time step $t$ to be a collection of vectors in $\mathbb{R}^d$: an input gate $\mathbf{i}_t$, a forget gate $\mathbf{f}_t$, an output gate $\mathbf{o}_t$, a memory cell $\mathbf{c}_t$ and a hidden state $\mathbf{h}_t$, where $d$ is the number of LSTM units. The elements of the gating vectors $\mathbf{i}_t$, $\mathbf{f}_t$ and $\mathbf{o}_t$ are in $[0, 1]$.

The LSTM is precisely specified as follows:

$$\begin{bmatrix} \tilde{\mathbf{c}}_t \\ \mathbf{o}_t \\ \mathbf{i}_t \\ \mathbf{f}_t \end{bmatrix} = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} T_{A,b} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{h}_{t-1} \end{bmatrix}, \qquad (1)$$

$$\mathbf{c}_t = \tilde{\mathbf{c}}_t \odot \mathbf{i}_t + \mathbf{c}_{t-1} \odot \mathbf{f}_t, \qquad (2)$$

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t), \qquad (3)$$

where $\mathbf{x}_t$ is the input at the current time step; $T_{A,b}$ is an affine transformation which depends on the parameters of the network $A$ and $b$; $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication. Intuitively, the forget gate controls the amount by which each unit of the memory cell is erased, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state.

The update of each LSTM unit can be written precisely as

$$(\mathbf{h}_t, \mathbf{c}_t) = \mathrm{LSTM}(\mathbf{h}_{t-1}, \mathbf{c}_{t-1}, \mathbf{x}_t). \qquad (4)$$

Here, the function $\mathrm{LSTM}(\cdot, \cdot, \cdot)$ is a shorthand for Eqs. (1)-(3).
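For concreteness, the per-step update of Eqs. (1)-(4) can be sketched in plain NumPy as follows. This is only a minimal illustration under our own choice of weight layout; the function name lstm_step and the toy dimensions are not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step, Eqs. (1)-(3): W has shape (4d, d_in + d), b has shape (4d,)."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b      # affine transformation T_{A,b}
    c_tilde = np.tanh(z[0:d])                      # candidate cell value
    o_t = sigmoid(z[d:2 * d])                      # output gate
    i_t = sigmoid(z[2 * d:3 * d])                  # input gate
    f_t = sigmoid(z[3 * d:4 * d])                  # forget gate
    c_t = c_tilde * i_t + c_prev * f_t             # Eq. (2)
    h_t = o_t * np.tanh(c_t)                       # Eq. (3)
    return h_t, c_t

# toy usage: d_in = 3 embedding dims, d = 4 hidden units, a sentence of 5 words
rng = np.random.default_rng(0)
d_in, d = 3, 4
W, b = rng.normal(scale=0.1, size=(4 * d, d_in + d)), np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, W, b)                # Eq. (4)
```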
To deal with two sentences, one straightforward method is to model them with two separate LSTMs. However, this method has difficulty modelling the local interactions of two sentences. An improved way is to introduce an attention mechanism, which has been used in many tasks, such as machine translation [Bahdanau et al., 2014] and question answering [Hermann et al., 2015].

Figure 1: Four different coupled-LSTMs: (a) parallel LSTMs, (b) attention LSTMs, (c) loosely coupled-LSTMs, (d) tightly coupled-LSTMs.

Inspired by multi-dimensional recurrent neural networks [Graves et al., 2007; Graves and Schmidhuber, 2009; Byeon et al., 2015] and grid LSTM [Kalchbrenner et al., 2015] in the computer vision community, we propose two models to capture the interdependences between two parallel LSTMs, called coupled-LSTMs (C-LSTMs).

To facilitate our models, we first give some definitions. Given two sequences $X = x_1, x_2, \cdots, x_n$ and $Y = y_1, y_2, \cdots, y_m$, we let $\mathbf{x}_i \in \mathbb{R}^d$ denote the embedded representation of the word $x_i$. The standard LSTM has one temporal dimension. When dealing with a sentence, the LSTM regards the position as the time step: at position $i$ of sentence $x_{1:n}$, the output $\mathbf{h}_i$ reflects the meaning of the subsequence $x_{1:i} = x_1, \cdots, x_i$.

To model the interaction of two sentences as early as possible, we define $\mathbf{h}_{i,j}$ to represent the interaction of the subsequences $x_{1:i}$ and $y_{1:j}$.

Figures 1(c) and 1(d) illustrate our two proposed models. For an intuitive comparison with weak interaction models, we also show parallel LSTMs and attention LSTMs in Figures 1(a) and 1(b). We describe our two proposed models as follows.

Loosely Coupled-LSTMs (LC-LSTMs)

To model the local contextual interactions of two sentences, we enable two LSTMs to be interdependent at different positions. Inspired by Grid LSTM [Kalchbrenner et al., 2015] and word-by-word attention LSTMs [Rocktäschel et al., 2015], we propose a loosely coupled model for two interdependent LSTMs.

More concretely, we refer to $\mathbf{h}^{(1)}_{i,j}$ as the encoding of subsequence $x_{1:i}$ in the first LSTM influenced by the output of the second LSTM on subsequence $y_{1:j}$. Meanwhile, $\mathbf{h}^{(2)}_{i,j}$ is the encoding of subsequence $y_{1:j}$ in the second LSTM influenced by the output of the first LSTM on subsequence $x_{1:i}$. $\mathbf{h}^{(1)}_{i,j}$ and $\mathbf{h}^{(2)}_{i,j}$ are computed as

$$\mathbf{h}^{(1)}_{i,j} = \mathrm{LSTM}(\mathbf{H}^{(1)}_{i-1}, \mathbf{c}^{(1)}_{i-1,j}, \mathbf{x}_i), \qquad (5)$$
$$\mathbf{h}^{(2)}_{i,j} = \mathrm{LSTM}(\mathbf{H}^{(2)}_{j-1}, \mathbf{c}^{(2)}_{i,j-1}, \mathbf{y}_j), \qquad (6)$$

where

$$\mathbf{H}^{(1)}_{i-1} = [\mathbf{h}^{(1)}_{i-1,j}, \mathbf{h}^{(2)}_{i-1,j}], \qquad (7)$$
$$\mathbf{H}^{(2)}_{j-1} = [\mathbf{h}^{(1)}_{i,j-1}, \mathbf{h}^{(2)}_{i,j-1}]. \qquad (8)$$

The hidden states of LC-LSTMs are the combination of the hidden states of the two interdependent LSTMs, whose memory cells are separated.

Tightly Coupled-LSTMs (TC-LSTMs)

Inspired by the configuration of the multi-dimensional LSTM [Byeon et al., 2015], we further conflate both the hidden states and the memory cells of the two LSTMs. We assume that $\mathbf{h}_{i,j}$ directly models the interaction of the subsequences $x_{1:i}$ and $y_{1:j}$, and depends on the two previous interactions $\mathbf{h}_{i-1,j}$ and $\mathbf{h}_{i,j-1}$, where $i, j$ are the positions in sentences $X$ and $Y$. We define the tightly coupled-LSTMs units as follows.
$$\begin{bmatrix} \tilde{\mathbf{c}}_{i,j} \\ \mathbf{o}_{i,j} \\ \mathbf{i}_{i,j} \\ \mathbf{f}^1_{i,j} \\ \mathbf{f}^2_{i,j} \end{bmatrix} = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} T_{A,b} \begin{bmatrix} \mathbf{x}_i \\ \mathbf{y}_j \\ \mathbf{h}_{i,j-1} \\ \mathbf{h}_{i-1,j} \end{bmatrix}, \qquad (9)$$

$$\mathbf{c}_{i,j} = \tilde{\mathbf{c}}_{i,j} \odot \mathbf{i}_{i,j} + [\mathbf{c}_{i,j-1}, \mathbf{c}_{i-1,j}]^T \begin{bmatrix} \mathbf{f}^1_{i,j} \\ \mathbf{f}^2_{i,j} \end{bmatrix}, \qquad (10)$$

$$\mathbf{h}_{i,j} = \mathbf{o}_{i,j} \odot \tanh(\mathbf{c}_{i,j}), \qquad (11)$$

where the gating units $\mathbf{i}_{i,j}$ and $\mathbf{o}_{i,j}$ determine which memory units are affected by the inputs through $\tilde{\mathbf{c}}_{i,j}$, and which memory cells are written to the hidden units $\mathbf{h}_{i,j}$. $T_{A,b}$ is an affine transformation which depends on the parameters of the network $A$ and $b$. In contrast to the standard LSTM defined over time, each memory unit $\mathbf{c}_{i,j}$ of the tightly coupled-LSTMs has two preceding states $\mathbf{c}_{i,j-1}$ and $\mathbf{c}_{i-1,j}$ and two corresponding forget gates $\mathbf{f}^1_{i,j}$ and $\mathbf{f}^2_{i,j}$.

Our two proposed coupled-LSTMs can be formulated as

$$(\mathbf{h}_{i,j}, \mathbf{c}_{i,j}) = \text{C-LSTMs}(\mathbf{h}_{i-1,j}, \mathbf{h}_{i,j-1}, \mathbf{c}_{i-1,j}, \mathbf{c}_{i,j-1}, \mathbf{x}_i, \mathbf{y}_j), \qquad (12)$$

where C-LSTMs can be either TC-LSTMs or LC-LSTMs.
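The tightly coupled update of Eqs. (9)-(11) can be sketched in NumPy as below. This is only an illustrative sketch under our own weight layout; the name tc_lstm_step and the toy grid loop are not the authors' implementation. The loosely coupled variant of Eqs. (5)-(8) would instead run two standard LSTM updates whose recurrent inputs concatenate the two hidden states of Eqs. (7)-(8).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tc_lstm_step(x_i, y_j, h_left, h_up, c_left, c_up, W, b):
    """One tightly coupled-LSTMs step at grid position (i, j), Eqs. (9)-(11).

    h_left, c_left come from position (i, j-1); h_up, c_up from (i-1, j).
    W has shape (5d, 2*d_in + 2d): candidate, output gate, input gate and
    two forget gates (one per preceding memory state).
    """
    d = c_left.shape[0]
    z = W @ np.concatenate([x_i, y_j, h_left, h_up]) + b
    c_tilde = np.tanh(z[0:d])
    o = sigmoid(z[d:2 * d])
    i = sigmoid(z[2 * d:3 * d])
    f1 = sigmoid(z[3 * d:4 * d])                   # forget gate for c_{i,j-1}
    f2 = sigmoid(z[4 * d:5 * d])                   # forget gate for c_{i-1,j}
    c = c_tilde * i + c_left * f1 + c_up * f2      # Eq. (10)
    h = o * np.tanh(c)                             # Eq. (11)
    return h, c

# toy usage over a 2 x 3 interaction grid (sentence lengths n=2, m=3)
rng = np.random.default_rng(0)
d_in, d, n, m = 3, 4, 2, 3
W, b = rng.normal(scale=0.1, size=(5 * d, 2 * d_in + 2 * d)), np.zeros(5 * d)
X, Y = rng.normal(size=(n, d_in)), rng.normal(size=(m, d_in))
H, C = np.zeros((n + 1, m + 1, d)), np.zeros((n + 1, m + 1, d))   # zero boundary states
for i in range(1, n + 1):
    for j in range(1, m + 1):
        H[i, j], C[i, j] = tc_lstm_step(X[i - 1], Y[j - 1],
                                        H[i, j - 1], H[i - 1, j],
                                        C[i, j - 1], C[i - 1, j], W, b)
```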
The input at step $(i, j)$ of the coupled-LSTMs consists of two types of information: the temporal dimension $\mathbf{h}_{i-1,j}, \mathbf{h}_{i,j-1}, \mathbf{c}_{i-1,j}, \mathbf{c}_{i,j-1}$ and the depth dimension $\mathbf{x}_i, \mathbf{y}_j$. The difference between TC-LSTMs and LC-LSTMs lies in how they depend on information from the temporal and depth dimensions.

Interaction Between Temporal Dimensions
TC-LSTMs model the interactions at position $(i, j)$ by merging the internal memories $\mathbf{c}_{i-1,j}, \mathbf{c}_{i,j-1}$ and hidden states $\mathbf{h}_{i-1,j}, \mathbf{h}_{i,j-1}$ along the row and column dimensions. In contrast, LC-LSTMs first use two standard LSTMs in parallel, producing hidden states $\mathbf{h}^{(1)}_{i,j}$ and $\mathbf{h}^{(2)}_{i,j}$ along the row and column dimensions respectively, which are then merged together before flowing to the next step.

Interaction Between Depth Dimension
In TC-LSTMs, each hidden state $\mathbf{h}_{i,j}$ at the higher layer receives a fusion of the information $\mathbf{x}_i$ and $\mathbf{y}_j$ flowed from the lower layer. In LC-LSTMs, however, the information $\mathbf{x}_i$ and $\mathbf{y}_j$ is accepted by the two corresponding LSTMs at the higher layer separately.

The two architectures have their own characteristics: TC-LSTMs give stronger interactions among different dimensions, while LC-LSTMs ensure that the two sequences interact closely without being conflated, by using two separate LSTMs.

Comparison of LC-LSTMs and Word-by-Word Attention LSTMs
The main idea of attention LSTMs is that the representation of sentence $X$ is obtained dynamically based on the degree of alignment between the words in sentences $X$ and $Y$, which is an asymmetric, unidirectional encoding. In LC-LSTMs, by contrast, the hidden state at each step is obtained with consideration of the interaction between the two sequences, in a symmetrical encoding fashion.
In this section, we present an end-to-end deep architecture for matching two sentences, as shown in Figure 2.
To model sentences with a neural model, we first need to transform the one-hot representation of each word into a distributed representation. All words of the two sequences $X = x_1, x_2, \cdots, x_n$ and $Y = y_1, y_2, \cdots, y_m$ are mapped into low-dimensional vector representations, which are taken as the input of the network.

After the embedding layer, we use our proposed coupled-LSTMs to capture the strong interactions between the two sentences. A basic block consists of five layers. We first use four directional coupled-LSTMs to model the local interactions with different information flows, and then sum the outputs of these LSTMs in an aggregation layer. To increase the learning capability of the coupled-LSTMs, we stack the basic blocks on top of each other.
Four Directional Coupled-LSTMs Layers
The C-LSTMs are defined along a certain pre-defined direction; we can extend them to access the surrounding context in all directions. Similar to bi-directional LSTMs, there are four directions in coupled-LSTMs:

$$(\mathbf{h}^1_{i,j}, \mathbf{c}^1_{i,j}) = \text{C-LSTMs}(\mathbf{h}_{i-1,j}, \mathbf{h}_{i,j-1}, \mathbf{c}_{i-1,j}, \mathbf{c}_{i,j-1}, \mathbf{x}_i, \mathbf{y}_j),$$
$$(\mathbf{h}^2_{i,j}, \mathbf{c}^2_{i,j}) = \text{C-LSTMs}(\mathbf{h}_{i-1,j}, \mathbf{h}_{i,j+1}, \mathbf{c}_{i-1,j}, \mathbf{c}_{i,j+1}, \mathbf{x}_i, \mathbf{y}_j),$$
$$(\mathbf{h}^3_{i,j}, \mathbf{c}^3_{i,j}) = \text{C-LSTMs}(\mathbf{h}_{i+1,j}, \mathbf{h}_{i,j+1}, \mathbf{c}_{i+1,j}, \mathbf{c}_{i,j+1}, \mathbf{x}_i, \mathbf{y}_j),$$
$$(\mathbf{h}^4_{i,j}, \mathbf{c}^4_{i,j}) = \text{C-LSTMs}(\mathbf{h}_{i+1,j}, \mathbf{h}_{i,j-1}, \mathbf{c}_{i+1,j}, \mathbf{c}_{i,j-1}, \mathbf{x}_i, \mathbf{y}_j).$$

Aggregation Layer
The aggregation layer sums the outputs of the four directional coupled-LSTMs into a vector:

$$\hat{\mathbf{h}}_{i,j} = \sum_{d=1}^{4} \mathbf{h}^d_{i,j}, \qquad (13)$$

where the superscript $d$ of $\mathbf{h}_{i,j}$ denotes the different directions.
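A minimal NumPy sketch of this multi-directional scheme is given below. The step object here only stands in for an LC-/TC-LSTMs cell; the names directional_pass and ToyStep, the zero-padded boundary states, and the toy dimensions are our own illustration choices, not the paper's implementation.

```python
import numpy as np

def directional_pass(X, Y, step, reverse_x=False, reverse_y=False):
    """Scan the (n x m) interaction grid in one of the four directions.

    step(x_i, y_j, h_row, h_col) stands in for a C-LSTMs update; h_row / h_col
    are the hidden states of the two preceding grid positions.
    """
    n, m, d = X.shape[0], Y.shape[0], step.d
    H = np.zeros((n + 2, m + 2, d))                      # zero-padded boundary states
    xs = range(n - 1, -1, -1) if reverse_x else range(n)
    ys = range(m - 1, -1, -1) if reverse_y else range(m)
    di = 1 if reverse_x else -1                          # offset to the preceding row state
    dj = 1 if reverse_y else -1                          # offset to the preceding column state
    for i in xs:
        for j in ys:
            H[i + 1, j + 1] = step(X[i], Y[j],
                                   H[i + 1 + di, j + 1],
                                   H[i + 1, j + 1 + dj])
    return H[1:n + 1, 1:m + 1]

class ToyStep:
    """Placeholder for a coupled-LSTMs cell: a single tanh mixing layer."""
    def __init__(self, d_in, d, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(d, 2 * d_in + 2 * d))
        self.d = d
    def __call__(self, x_i, y_j, h_row, h_col):
        return np.tanh(self.W @ np.concatenate([x_i, y_j, h_row, h_col]))

# aggregation layer, Eq. (13): sum the outputs of the four directional passes
rng = np.random.default_rng(1)
X, Y, step = rng.normal(size=(2, 3)), rng.normal(size=(4, 3)), ToyStep(3, 5)
H_hat = sum(directional_pass(X, Y, step, rx, ry)
            for rx in (False, True) for ry in (False, True))    # shape (2, 4, 5)
```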
Figure 2: Architecture of coupled-LSTMs for sentence-pair encoding. Inputs are fed to four C-LSTMs followed by an aggregation layer. Blue cuboids represent different contextual information from four directions.
Stacking C-LSTMs Blocks
To increase the capability of the network to learn multiple granularities of interactions, we stack several blocks (four C-LSTMs layers and one aggregation layer) to form deep architectures.
Pooling Layer

The output of the stacked coupled-LSTMs layers is a tensor $\mathbf{H} \in \mathbb{R}^{n \times m \times d}$, where $n$ and $m$ are the lengths of the sentences and $d$ is the number of hidden neurons. We apply dynamic pooling to automatically extract an $\mathbb{R}^{p \times q}$ subsampling matrix from each slice $\mathbf{H}_i \in \mathbb{R}^{n \times m}$, similar to [Socher et al., 2011].

More formally, for each slice matrix $\mathbf{H}_i$, we partition its rows and columns into $p \times q$ roughly equal grids. These grids are non-overlapping. We then select the maximum value within each grid. Since each slice $\mathbf{H}_i$ consists of the hidden states of one neuron at different positions, the pooling operation can be regarded as keeping the most informative interactions captured by that neuron.

Thus, we get a $p \times q \times d$ tensor, which is further reshaped into a vector. The vector obtained by the pooling layer is fed into a fully connected layer to obtain a final, more abstract representation.
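A sketch of this dynamic pooling step is shown below. The function name dynamic_pool and the equal-split strategy via linspace are our own assumptions about one reasonable way to form the "roughly equal" grids; the paper does not specify the exact partitioning code.

```python
import numpy as np

def dynamic_pool(H, p, q):
    """Dynamic max-pooling of a tensor H of shape (n, m, d) into (p, q, d).

    Rows and columns are partitioned into p x q roughly equal, non-overlapping
    grids; the maximum value inside each grid is kept per hidden neuron.
    """
    n, m, d = H.shape
    row_edges = np.linspace(0, n, p + 1, dtype=int)      # roughly equal row splits
    col_edges = np.linspace(0, m, q + 1, dtype=int)      # roughly equal column splits
    out = np.empty((p, q, d))
    for a in range(p):
        for b in range(q):
            block = H[row_edges[a]:row_edges[a + 1], col_edges[b]:col_edges[b + 1]]
            out[a, b] = block.reshape(-1, d).max(axis=0)
    return out

# toy usage: pool a 7 x 9 x 50 interaction tensor with (p, q) = (2, 1)
H = np.random.default_rng(0).normal(size=(7, 9, 50))
v = dynamic_pool(H, 2, 1).reshape(-1)                    # flattened into a 100-d vector
```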
Output Layer

The form of the output layer depends on the type of task. There are two popular types of text matching tasks in NLP: ranking tasks, such as community question answering, and classification tasks, such as textual entailment.

1. For a ranking task, the output is a scalar matching score, which is obtained by a linear transformation after the last fully-connected layer.

2. For a classification task, the outputs are the probabilities of the different classes, which are computed by a softmax function after the last fully-connected layer.
Our proposed architecture can deal with different sentence matching tasks. The loss function varies with the task.
Max-Margin Loss for Ranking Task
Given a positive sentence pair $(X, Y)$ and its corresponding negative pair $(X, \hat{Y})$, the matching score $s(X, Y)$ should be larger than $s(X, \hat{Y})$. For this task, we use the contrastive max-margin criterion [Bordes et al., 2013; Socher et al., 2013] to train our models on the matching task.

Table 1: Hyper-parameters for our model on the two tasks.

                          MQA     RTE
Embedding size            100     100
Hidden layer size         50      50
Initial learning rate     0.05    0.005
Regularization            E-      E-
Pooling (p, q)            (2,1)   (1,1)

The ranking-based loss is defined as

$$L(X, Y, \hat{Y}) = \max(0,\ 1 - s(X, Y) + s(X, \hat{Y})), \qquad (14)$$

where $s(X, Y)$ is the predicted matching score for $(X, Y)$.
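A minimal sketch of Eq. (14) follows. The margin value of 1 matches the standard contrastive max-margin criterion, but the exact margin digit is not legible in the source, so treat it as an assumption; the function name ranking_loss is ours.

```python
def ranking_loss(score_pos, score_neg, margin=1.0):
    """Contrastive max-margin loss of Eq. (14) for one (X, Y, Y_hat) triple.

    score_pos = s(X, Y), score_neg = s(X, Y_hat); the margin of 1 is the
    conventional choice and is assumed here.
    """
    return max(0.0, margin - score_pos + score_neg)

# toy usage: the loss vanishes once the positive pair outscores the negative
# pair by at least the margin
print(ranking_loss(2.3, 0.7))   # 0.0
print(ranking_loss(0.9, 0.6))   # 0.7
```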
Cross-entropy Loss for Classification Task

Given a sentence pair $(X, Y)$ and its label $l$, the output $\hat{l}$ of the neural network is the probabilities of the different classes. The parameters of the network are trained to minimise the cross-entropy of the predicted and true label distributions:

$$L(X, Y; l, \hat{l}) = -\sum_{j=1}^{C} l_j \log(\hat{l}_j), \qquad (15)$$

where $l$ is the one-hot representation of the ground-truth label, $\hat{l}$ is the vector of predicted label probabilities, and $C$ is the number of classes.

To minimize the objective, we use stochastic gradient descent with the diagonal variant of AdaGrad [Duchi et al., 2011]. To prevent exploding gradients, we perform gradient clipping by scaling the gradient when its norm exceeds a threshold [Graves, 2013].
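For illustration, a sketch of Eq. (15) and of norm-based gradient clipping is given below; the function names and the epsilon guard are our own choices, not the paper's code.

```python
import numpy as np

def cross_entropy(l_onehot, l_hat, eps=1e-12):
    """Cross-entropy loss of Eq. (15) for one sentence pair."""
    return -np.sum(l_onehot * np.log(l_hat + eps))

def clip_by_norm(grad, threshold):
    """Rescale the gradient when its L2 norm exceeds the threshold [Graves, 2013]."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

# toy usage with C = 3 classes
l = np.array([0.0, 1.0, 0.0])              # one-hot ground truth
l_hat = np.array([0.2, 0.7, 0.1])          # predicted class probabilities
print(cross_entropy(l, l_hat))             # ~0.357
g = clip_by_norm(np.array([30.0, 40.0]), threshold=5.0)   # norm 50 -> rescaled to 5
```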
In this section, we investigate the empirical performance of our proposed model on two different text matching tasks: a classification task (recognizing textual entailment) and a ranking task (matching of question and answer).

The word embeddings for all of the models are initialized with the 100d GloVe vectors (840B-token version, [Pennington et al., 2014]) and fine-tuned during training to improve the performance. The other parameters are initialized by randomly sampling from a uniform distribution over a small symmetric interval. For each task, we take the hyperparameters which achieve the best performance on the development set, via a small grid search over combinations of the initial learning rate, the l2 regularization strength, and the threshold value of the gradient norm [5, 10, 100]. The final hyper-parameters are given in Table 1.

We compare our models against the following methods:

• Neural bag-of-words (NBOW): each sequence is represented as the sum of the embeddings of the words it contains; the two representations are then concatenated and fed to an MLP.
• Single LSTM: a single LSTM encodes the two sequences, as used in [Rocktäschel et al., 2015].
• Parallel LSTMs: the two sequences are encoded by two LSTMs separately, then concatenated and fed to an MLP.
• Attention LSTMs: an attentive LSTM encodes the two sentences into a semantic space, as used in [Rocktäschel et al., 2015].
• Word-by-word Attention LSTMs: an improvement of the attention LSTM that introduces a word-by-word attention mechanism, as used in [Rocktäschel et al., 2015].

Table 2: Results on the SNLI corpus.

Model                                                 k     |θ|M    Train   Test
NBOW                                                  100   80K     77.9    75.1
single LSTM [Rocktäschel et al., 2015]                100   111K    83.7    80.9
parallel LSTMs [Bowman et al., 2015]                  100   221K    84.8    77.6
Attention LSTM [Rocktäschel et al., 2015]             100   252K    83.2    82.3
Attention (w-by-w) LSTM [Rocktäschel et al., 2015]    100   252K    83.7    83.5
LC-LSTMs (Single Direction)                           50    45K     80.8    80.5
LC-LSTMs                                              50    45K     81.5    80.9
four stacked LC-LSTMs                                 50    135K    85.0    84.3
TC-LSTMs (Single Direction)                           50    77.5K   81.4    80.1
TC-LSTMs                                              50    77.5K   82.2    81.6
four stacked TC-LSTMs                                 50    190K    86.7
Recognizing textual entailment (RTE) is a task to determine the semantic relationship between two sentences. We use the Stanford Natural Language Inference corpus (SNLI) [Bowman et al., 2015]. This corpus contains 570K sentence pairs, and all of the sentences and labels stem from human annotators. SNLI is two orders of magnitude larger than all other existing RTE corpora. Therefore, the massive scale of SNLI allows us to train powerful neural networks such as our proposed architecture in this paper.
Results
Table 2 shows the evaluation results on SNLI. The third column of the table gives the number of parameters of the different models, excluding the word embeddings.

Our two proposed C-LSTMs models with four stacked blocks outperform all the competitor models, which indicates that our thinner and deeper network does work effectively.

Besides, we can see that both LC-LSTMs and TC-LSTMs benefit from the multi-directional layers, and the latter obtains larger gains than the former. We attribute this discrepancy between the two models to their different mechanisms of controlling the information flow from the depth dimension.

Compared with the attention LSTMs, our two models achieve comparable results using much fewer parameters.

Figure 3: Illustration of two interpretable neurons ((a) the 3rd neuron, (b) the 17th neuron) and some word pairs captured by these neurons. The darker patches denote higher activations.
By stacking C-LSTMs, their performance improves significantly, and the four stacked TC-LSTMs achieve the best accuracy on this dataset. Moreover, we can see that TC-LSTMs achieve better performance than LC-LSTMs on this task, which needs fine-grained reasoning over pairs of words as well as phrases.

Understanding Behaviors of Neurons in C-LSTMs
To get an intuitive understanding of how the C-LSTMs work on this problem, we examined the neuron activations in the last aggregation layer while evaluating the test set using TC-LSTMs. We find that some cells are bound to certain roles.

Let $h_{i,j,k}$ denote the activation of the $k$-th neuron at position $(i, j)$, where $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, m\}$. By visualizing the hidden state $h_{i,j,k}$ and analyzing the maximum activation, we find that there exist multiple interpretable neurons. For example, when some contextualized local perspectives are semantically related at point $(i, j)$ of the sentence pair, the activation value of the hidden neuron $h_{i,j,k}$ tends to be maximal, meaning that the model can capture some reasoning patterns.

Figure 3 illustrates this phenomenon. In Figure 3(a), a neuron shows its ability to monitor the local contextual interactions about color. The activation in the patch including the word pair "(red, green)" is much higher than the others. This is an informative pattern for predicting the relation of these two sentences, whose ground truth is contradiction. An interesting observation is that there are two words describing color in the sentence "A person in a red shirt and black pants hunched over." Our model ignores the irrelevant word "black", which indicates that this neuron selectively captures the pattern by contextual understanding, not just word-level interaction.

In Figure 3(b), another neuron shows that it can capture local contextual interactions such as "(walking down the street, outside)". These patterns can be easily captured by the pooling layer and provide strong support for the final prediction.

Table 3 lists multiple interpretable neurons and some representative word or phrase pairs which activate these neurons. These cases show that our models can capture contextual interactions beyond the word level.
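One simple way such neurons can be located is sketched below: for a given neuron $k$, find the grid position whose aggregated hidden state maximally activates it and read off the corresponding word pair. The function name max_activation_pair and the toy sentences are hypothetical; this is only an illustration of the inspection procedure, not the authors' analysis code.

```python
import numpy as np

def max_activation_pair(H, k, sent_x, sent_y):
    """Return the word pair (x_i, y_j) that maximally activates neuron k.

    H is the (n, m, d) tensor of aggregated hidden states h_{i,j};
    sent_x and sent_y are the tokenized sentences.
    """
    i, j = np.unravel_index(np.argmax(H[:, :, k]), H.shape[:2])
    return sent_x[i], sent_y[j], H[i, j, k]

# toy usage with random activations in place of a trained model's outputs
rng = np.random.default_rng(0)
x_words = "a person in a red shirt".split()
y_words = "a person is wearing a green shirt".split()
H = rng.normal(size=(len(x_words), len(y_words), 50))
print(max_activation_pair(H, k=3, sent_x=x_words, sent_y=y_words))
```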
Error Analysis
Although our C-LSTMs models are more sensitive to the discrepancy of the semantic capacity between two sentences, some semantic mistakes at the phrasal level still exist. For example, our models failed to capture the key informative pattern when predicting the entailment sentence pair "A girl takes off her shoes and eats blue cotton candy / The girl is eating while barefoot."

Besides, despite the large size of the training corpus, it is still very difficult to solve some cases, which depend on a combination of world knowledge and context-sensitive inferences. For example, given the entailment pair "a man grabs his crotch during a political demonstration / The man is making a crude gesture", all models predict "neutral". This analysis suggests that some architectural improvements or external world knowledge are necessary to eliminate all errors, instead of simply scaling up the basic model.

Table 3: Multiple interpretable neurons and the word-pairs/phrase-pairs captured by these neurons.

Index of Cell    Word or Phrase Pairs
-th              (in a pool, swimming), (near a fountain, next to the ocean), (street, outside)
-th              (doing a skateboard, skateboarding), (sidewalk with, inside), (standing, seated)
-th              (blue jacket, blue jacket), (wearing black, wearing white), (green uniform, red uniform)
-th              (a man, two other men), (a man, two girls), (an old woman, two people)

Matching question answering (MQA) is a typical task for semantic matching: given a question, we need to select the correct answer from a set of candidate answers.

In this paper, we use a dataset collected from Yahoo! Answers with the getByCategory function provided in the Yahoo! Answers API, which produces a large set of questions and their corresponding best answers. We then select the pairs in which the lengths of questions and answers are both in the interval [4, …], thus obtaining the question-answer pairs that form the positive pairs. For negative pairs, we first use each question's best answer as a query to retrieve the top results from the whole answer set with Lucene, from which 4 or 9 answers are selected randomly to construct the negative pairs. The whole dataset is divided into training, validation and testing data with proportion 20 : 1 : 1. Moreover, we use two test settings: selecting the best answer from 5 and from 10 candidates, respectively.

Table 4: Results on the Yahoo question-answer pairs dataset.

Model                          k     P@1(5)   P@1(10)
Random Guess                   -     20.0     10.0
NBOW                           50    63.9     47.6
single LSTM                    50    68.2     53.9
parallel LSTMs                 50    66.9     52.1
Attention LSTMs                50    73.5     62.0
LC-LSTMs (Single Direction)    50    75.4     63.0
LC-LSTMs                       50    76.1     64.1
three stacked LC-LSTMs         50
TC-LSTMs (Single Direction)    50    74.3     62.4
TC-LSTMs                       50    74.9     62.9
three stacked TC-LSTMs         50    77.0     65.3
Results
Results on MQA are shown in Table 4. For our models, since stacking more than three blocks does not yield significant improvements on this task, we use three stacked C-LSTMs.

By analyzing the evaluation results of question-answer matching in Table 4, we can see that the strong interaction models (attention LSTMs, our C-LSTMs) consistently outperform the weak interaction models (NBOW, parallel LSTMs) by a large margin, which suggests the importance of modelling the strong interaction of two sentences.

Our two proposed C-LSTMs surpass the competitor methods, and C-LSTMs augmented with multi-directional layers and multiple stacked blocks fully utilize multiple levels of abstraction to further boost the performance.

Additionally, LC-LSTMs are superior to TC-LSTMs on this task. The reason may be that MQA is a relatively simple task that requires less reasoning ability than the RTE task. Moreover, LC-LSTMs have fewer parameters than TC-LSTMs, which helps the former avoid overfitting on a relatively smaller corpus.
Related Work

Our architecture for sentence pair encoding can be regarded as a strong interaction model, a family that has been explored in previous work.

An intuitive paradigm is to compute the similarities between all the words or phrases of the two sentences. Socher et al. [2011] first used this paradigm for paraphrase detection, where the representations of words and phrases are learned with recursive autoencoders. Wan et al. [2016] used LSTMs to enhance the positional contextual interactions of the words and phrases between two sentences; there, the input of the LSTM for one sentence does not involve the other sentence. A major limitation of this paradigm is that the interaction of the two sentences is captured by a pre-defined similarity measure, so it is not easy to increase the depth of the network. Compared with this paradigm, we can stack our C-LSTMs to model multiple-granularity interactions of two sentences.

Rocktäschel et al. [2015] used two LSTMs equipped with an attention mechanism to capture the interaction between two sentences. This architecture is asymmetrical for the two sentences, so the obtained final representation is sensitive to the order of the two sentences. Compared with the attentive LSTM, our proposed C-LSTMs are symmetrical and model the local contextual interaction of the two sequences directly.
Conclusion

In this paper, we propose an end-to-end deep architecture to capture the strong interaction information of a sentence pair. Experiments on two large-scale text matching tasks demonstrate the efficacy of our proposed model and its superiority to competitor models. Besides, our visualization analysis revealed that multiple interpretable neurons in our proposed models can capture the contextual interactions of words and phrases.

In future work, we would like to incorporate some gating strategies into the depth dimension of our proposed models, like highway or residual networks, to enhance the interactions between the depth and other dimensions and thus train deeper and more powerful neural networks.

References

[Bahdanau et al., 2014] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ArXiv e-prints, September 2014.
[Bordes et al., 2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.

[Bowman et al., 2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.

[Byeon et al., 2015] Wonmin Byeon, Thomas M. Breuel, Federico Raue, and Marcus Liwicki. Scene labeling with LSTM recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3547–3555, 2015.

[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[Elman, 1990] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[Graves and Schmidhuber, 2005] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.

[Graves and Schmidhuber, 2009] Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, pages 545–552, 2009.

[Graves et al., 2007] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. Multi-dimensional recurrent neural networks. In Artificial Neural Networks – ICANN 2007, pages 549–558. Springer, 2007.

[Graves, 2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[He et al., 2015] Hua He, Kevin Gimpel, and Jimmy Lin. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1576–1586, 2015.

[Hermann et al., 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692, 2015.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Hu et al., 2014] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, 2014.

[Jozefowicz et al., 2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of The 32nd International Conference on Machine Learning, 2015.

[Kalchbrenner et al., 2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, 2014.

[Kalchbrenner et al., 2015] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543, 2014.

[Qiu and Huang, 2015] Xipeng Qiu and Xuanjing Huang. Convolutional neural tensor network architecture for community-based question answering. In Proceedings of International Joint Conference on Artificial Intelligence, 2015.

[Rocktäschel et al., 2015] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.

[Schuster and Paliwal, 1997] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681, 1997.

[Socher et al., 2011] Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, 2011.

[Socher et al., 2013] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[Wan et al., 2016] Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. A deep architecture for semantic matching with multiple positional sentence representations. In AAAI, 2016.

[Yin and Schütze, 2015] Wenpeng Yin and Hinrich Schütze. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 901–911, 2015.

[Yin et al., 2015] Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193, 2015.