Enhancing Sentence Relation Modeling with Auxiliary Character-level Embedding
Peng Li
Computer Science and Engineering, University of Texas at Arlington
[email protected]
Heng Huang
Computer Science and Engineering, University of Texas at Arlington
[email protected]
Abstract
Neural network based approaches for sentence relation modeling automatically generate hidden matching features from raw sentence pairs. However, the quality of the matching feature representation may not be satisfactory for complex semantic relations such as entailment or contradiction. To address this challenge, we propose a new deep neural network architecture that jointly leverages pre-trained word embeddings and auxiliary character embeddings to learn sentence meanings. The two kinds of word sequence representations are fed into a multi-layer bidirectional LSTM to learn an enhanced sentence representation. After that, we construct matching features, followed by another temporal CNN that learns high-level hidden matching feature representations. Experimental results demonstrate that our approach consistently outperforms existing methods on standard evaluation datasets.
Traditional approaches (Lai and Hockenmaier, 2014; Zhao et al., 2014; Jimenez et al., 2014) to sentence relation modeling tasks such as paraphrase identification, question answering, recognizing textual entailment, and semantic textual similarity prediction usually build supervised models on top of a variety of hand-crafted features. Hundreds of features generated at different linguistic levels are exploited to boost classification. With the success of deep learning, there has been much interest in applying deep neural network based techniques to further improve prediction performance (Socher et al., 2011b; Iyyer et al., 2014; Yin and Schutze, 2015).

A key component of deep neural networks is the word embedding, which serves as a lookup table for word representations. From low-level NLP tasks such as language modeling, POS tagging, named entity recognition, and semantic role labeling (Collobert et al., 2011; Mikolov et al., 2013), to high-level tasks such as machine translation, information retrieval, and semantic analysis (Kalchbrenner and Blunsom, 2013; Socher et al., 2011a; Tai et al., 2015), deep word representation learning has demonstrated its importance: all of these tasks gain performance improvements by further learning either word-level or sentence-level representations. On the other hand, some researchers have found that character-level convolutional networks (Kim et al., 2016; Zhang et al., 2015) are useful for extracting information from raw signals in tasks such as language modeling and text classification.

In this work, we focus on deep neural network based sentence relation modeling. We explore treating each sentence as a raw signal at the character level, and apply a temporal (one-dimensional) Convolutional Neural Network (CNN) (Collobert et al., 2011), a Highway Multilayer Perceptron (HMLP), and a multi-layer bidirectional LSTM (Long Short-Term Memory) (Graves et al., 2013) to learn sentence representations. We propose a new deep neural network architecture that jointly leverages pre-trained word embeddings and character embeddings to represent the meanings of sentences. More specifically, our approach first generates two kinds of word sequence representations. One kind is the composition of pre-trained word vectors; the other comprises word vectors generated from a character-level convolutional network. We then inject the two sequence representations into a bidirectional LSTM: the forward LSTM accepts the pre-trained word embedding output, and the backward LSTM accepts the auxiliary character CNN embedding output. The final sentence representation is the concatenation of the two directions. After that, we construct matching features, followed by another temporal CNN that learns high-level hidden matching feature representations. Figure 1 shows the neural network architecture for general sentence relation modeling.

Our model shows that, when trained on small datasets, combining pre-trained word embeddings with auxiliary character-level embeddings can improve the sentence representation. Word embeddings help capture general word semantics, whereas character-level embeddings help model task-specific word meanings. Note that the auxiliary character-level embedding based sentence representation does not require knowledge of the words or even the syntactic structure of a language.
The enhanced sentence representation generated by the multi-layer bidirectional LSTM encapsulates both character-level and word-level information. Furthermore, it strengthens the matching features that are generated by computing similarity measures on sentence pairs. Quantitative evaluations on a standard dataset demonstrate the effectiveness and advantages of our method.

[Figure 1: Neural Network Architecture for Deep Matching Feature Learning. M-BLSTM is the multi-layer bidirectional LSTM. Orange marks sequence representations concatenated from pre-trained word vectors; purple marks sequence representations concatenated from word vectors generated by the character-level convolutional network and HMLP.]

Besides pre-trained word vectors, we are also interested in generating word vectors from characters. To achieve this, we leverage a deep convolutional neural network (ConvNet). The model accepts a sequence of encoded characters as input. The encoding is done by prescribing an alphabet of size m for the input language and quantizing each character with one-hot encoding. The sequence of characters is then transformed into a sequence of such m-dimensional vectors with fixed length l. Any character exceeding length l is ignored, and any character not in the alphabet is quantized as an all-zero vector. The alphabet used in our model consists of 36 characters: 26 English letters and 10 digits. Below, we introduce the character-level temporal convolutional neural network.

Temporal convolution applies a one-dimensional convolution over an input sequence. The one-dimensional convolution is an operation between a vector of weights m ∈ R^m and a vector of inputs viewed as a sequence x ∈ R^n. The vector m is the filter of the convolution. Concretely, we think of x as the input token and x_i ∈ R as a single feature value associated with the i-th character in this token. The idea behind the one-dimensional convolution is to take the dot product of the vector m with each m-gram in the token x to obtain another sequence c:

c_j = m^T x_{j-m+1:j}.    (1)

Usually, x_i is not a single value but a d-dimensional vector, so that x ∈ R^{d×n}. There exist two types of one-dimensional convolution operations. One is called Time Delay Neural Networks (TDNNs); the other was introduced by (Collobert et al., 2011). In a TDNN, the weights m ∈ R^{d×m} form a matrix, and each row of m is convolved with the corresponding row of x. In the (Collobert et al., 2011) architecture, a sequence of length n is represented as

x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n,    (2)

where \oplus is the concatenation operation. In general, let x_{i:i+j} refer to the concatenation of characters x_i, x_{i+1}, ..., x_{i+j}. A convolution operation involves a filter w ∈ R^{hk}, which is applied to a window of h characters to produce a new feature. For example, a feature c_i is generated from a window of characters x_{i:i+h-1} by

c_i = f(w · x_{i:i+h-1} + b).    (3)

Here b ∈ R is a bias term and f is a nonlinear function such as the thresholding function f(x) = \max\{0, x\}. This filter is applied to each possible window of characters in the sequence \{x_{1:h}, x_{2:h+1}, ..., x_{n-h+1:n}\} to produce a feature map

c = [c_1, c_2, ..., c_{n-h+1}],    (4)

with c ∈ R^{n-h+1}.
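To make the character quantization and the narrow temporal convolution of Eqs. (3)-(4) concrete, the following NumPy sketch implements both steps. The 36-character alphabet and the kernel width of 3 follow the paper; the token length l = 15 and the function names quantize and temporal_conv are illustrative assumptions, not the authors' exact implementation.

    import numpy as np

    # 26 English letters + 10 digits, as in the paper; characters outside the
    # alphabet are quantized as all-zero vectors.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
    CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}

    def quantize(token, l=15):
        """One-hot encode a token into an (alphabet_size x l) matrix (l is assumed)."""
        x = np.zeros((len(ALPHABET), l))
        for j, ch in enumerate(token.lower()[:l]):   # characters beyond length l are ignored
            if ch in CHAR2IDX:
                x[CHAR2IDX[ch], j] = 1.0
        return x

    def temporal_conv(x, w, b=0.0):
        """Narrow 1D convolution (Eqs. 3-4): slide a width-h filter over the character axis."""
        d, n = x.shape
        h = w.shape[1]                               # w has shape (d, h)
        c = np.empty(n - h + 1)
        for i in range(n - h + 1):
            c[i] = max(0.0, np.sum(w * x[:, i:i + h]) + b)   # thresholding nonlinearity
        return c                                     # feature map of length n - h + 1

    x = quantize("embedding")
    w = 0.1 * np.random.randn(len(ALPHABET), 3)      # kernel width 3, as in the experiments
    print(temporal_conv(x, w).shape)                 # (13,) for l = 15, h = 3

In the full model, a bank of such filters followed by max-pooling over time yields the character-level word vector that is passed to the HMLP described next.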
On top of the convolutional layers, we build a Highway Multilayer Perceptron (HMLP) layer to further enhance the character-level word embeddings. A conventional MLP applies an affine transformation followed by a nonlinearity to obtain a new set of features:

z = g(W y + b).    (5)

One layer of a highway network instead computes

z = t \odot g(W_H y + b_H) + (1 - t) \odot y,    (6)

where g is a nonlinearity, t = \sigma(W_T y + b_T) is called the transform gate, and (1 - t) is called the carry gate. Similar to the memory cells in LSTM networks, highway layers allow some dimensions of the input to be carried adaptively straight through to the output, which makes it easier to train deep networks.
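A minimal NumPy sketch of the highway transformation in Eq. (6) follows. The dimensionality of 50 hidden units matches the experimental setup; the weight scaling and the negative transform-gate bias are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def highway(y, W_H, b_H, W_T, b_T):
        """Highway layer (Eq. 6): z = t * g(W_H y + b_H) + (1 - t) * y."""
        t = sigmoid(W_T @ y + b_T)                         # transform gate
        return t * np.tanh(W_H @ y + b_H) + (1.0 - t) * y  # (1 - t) is the carry gate

    d = 50                                   # HMLP hidden units from the experiments (input assumed to match)
    y = np.random.randn(d)                   # character-CNN word vector after pooling
    W_H, b_H = 0.1 * np.random.randn(d, d), np.zeros(d)
    W_T, b_T = 0.1 * np.random.randn(d, d), np.full(d, -2.0)  # negative bias favors carrying the input
    print(highway(y, W_H, b_H, W_T, b_T).shape)                # (50,)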
We now have two kinds of word sequence representations. One kind is the composition of pre-trained word vectors; the other comprises word vectors generated from the character-level convolutional network. We inject the two sequence representations into a bidirectional LSTM to learn the sentence representation. More specifically, the forward LSTM accepts the pre-trained word embedding output and the backward LSTM accepts the character CNN embedding output. The final sentence representation is the concatenation of the two directions.

Recurrent neural networks (RNNs) are capable of modeling sequences of varying lengths via the recursive application of a transition function on a hidden state. For example, at each time step t, an RNN takes the input vector x_t ∈ R^n and the hidden state vector h_{t-1} ∈ R^m, then applies an affine transformation followed by an element-wise nonlinearity such as the hyperbolic tangent to produce the next hidden state vector h_t:

h_t = \tanh(W x_t + U h_{t-1} + b).    (7)

A major issue with RNNs using this transition function is that it is difficult to learn long-range dependencies during training, because the components of the gradient vector can grow or decay exponentially (Bengio et al., 1994).

The LSTM architecture (Hochreiter and Schmidhuber, 1998) addresses the problem of learning long-range dependencies by introducing a memory cell that is able to preserve state over long periods of time. Concretely, at each time step t, the LSTM unit is a collection of vectors in R^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t, and a hidden state h_t. We refer to d as the memory dimensionality of the LSTM. One LSTM step takes x_t, h_{t-1}, c_{t-1} as input and produces h_t, c_t via the following transition equations:

i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}),
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}),
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}),
u_t = \tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}),
c_t = i_t \odot u_t + f_t \odot c_{t-1},
h_t = o_t \odot \tanh(c_t),    (8)

where \sigma(·) and \tanh(·) are the element-wise sigmoid and hyperbolic tangent functions and \odot is the element-wise multiplication operator.

One shortcoming of conventional RNNs is that they can only make use of previous context. In textual entailment, the decision is made after the whole sentence pair has been digested, so exploring future context should yield a better representation of sequence meaning. The bidirectional RNN architecture (Graves et al., 2013) provides a solution by also making predictions based on future words. At each time step t, the model maintains two hidden states, one for the left-to-right propagation \overrightarrow{h}_t and one for the right-to-left propagation \overleftarrow{h}_t. The hidden state of the bidirectional LSTM is the concatenation of the forward and backward hidden states. The following equations illustrate the main idea:

\overrightarrow{h}_t = \tanh(\overrightarrow{W} x_t + \overrightarrow{U} \overrightarrow{h}_{t-1} + \overrightarrow{b}),
\overleftarrow{h}_t = \tanh(\overleftarrow{W} x_t + \overleftarrow{U} \overleftarrow{h}_{t+1} + \overleftarrow{b}).    (9)

Deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence of the next. Assuming the same hidden layer function is used for all N layers in the stack, the hidden vectors h^n are computed iteratively from n = 1 to N and t = 1 to T:

h^n_t = \tanh(W h^{n-1}_t + U h^n_{t-1} + b).    (10)

Multilayer bidirectional RNNs can be implemented by replacing each hidden vector h^n with the forward and backward vectors \overrightarrow{h}^n and \overleftarrow{h}^n, and ensuring that every hidden layer receives input from both the forward and backward layers at the level below. Furthermore, we can apply LSTM memory cells to the hidden layers to construct a multi-layer bidirectional LSTM.

Finally, we concatenate the sequence hidden matrix \overrightarrow{M} ∈ R^{n×d} and the reversed sequence hidden matrix \overleftarrow{M} ∈ R^{n×d} to form the sentence representation, where n is the number of layers and d is the memory dimensionality of the LSTM.
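The PyTorch sketch below shows one way to realize the dual-input bidirectional LSTM described above: a forward LSTM reads the pre-trained word-vector sequence, a backward LSTM reads the character-CNN word-vector sequence in reverse, and the two hidden-state sequences are concatenated. The class and argument names are ours; the dimensions (300-d GloVe vectors, 100-d character vectors, memory dimension 100, 2 layers, sentence length 37) follow the experimental settings.

    import torch
    import torch.nn as nn

    class DualInputBiLSTM(nn.Module):
        """Sketch: forward LSTM reads pre-trained word vectors, backward LSTM reads
        char-CNN word vectors on the reversed sequence; hidden states are concatenated."""
        def __init__(self, word_dim=300, char_dim=100, mem_dim=100, layers=2):
            super().__init__()
            self.fwd = nn.LSTM(word_dim, mem_dim, num_layers=layers, batch_first=True)
            self.bwd = nn.LSTM(char_dim, mem_dim, num_layers=layers, batch_first=True)

        def forward(self, word_seq, char_seq):
            # word_seq: (batch, n, word_dim), char_seq: (batch, n, char_dim)
            h_fwd, _ = self.fwd(word_seq)                        # left-to-right states
            h_bwd, _ = self.bwd(torch.flip(char_seq, dims=[1]))  # right-to-left states
            h_bwd = torch.flip(h_bwd, dims=[1])                  # re-align with time steps
            return torch.cat([h_fwd, h_bwd], dim=-1)             # (batch, n, 2 * mem_dim)

    model = DualInputBiLSTM()
    rep = model(torch.randn(1, 37, 300), torch.randn(1, 37, 100))
    print(rep.shape)   # torch.Size([1, 37, 200])

PyTorch's built-in bidirectional LSTM cannot feed different inputs to the two directions, which is why the two directions are modeled here as separate LSTMs.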
In the next section, we use these two matrices to generate matching feature planes via linear algebra operations.

Inspired by (Tai et al., 2015), we apply element-wise merging to the first sentence matrix M_1 ∈ R^{n×d} and the second sentence matrix M_2 ∈ R^{n×d}. Similar to previous work, we define two simple matching feature planes (FPs) with the equations below:
FP_1 = M_1 \odot M_2,
FP_2 = |M_1 - M_2|,    (11)

where \odot is the element-wise multiplication. The FP_1 measure can be interpreted as an element-wise comparison of the signs of the input representations, and the FP_2 measure as the distance between the input representations. In addition to the above measures, we found that the following feature plane also improves performance:

FP_3 = 1dConv(Reshape(Join(M_1, M_2))),    (12)

where 1dConv denotes a one-dimensional convolution and Join concatenates the two representations. The intuition behind FP_3 is to let the one-dimensional convolution preserve the information that is common to the sentence pair.

Recall that the multi-layer bidirectional LSTM generates a sentence representation matrix M ∈ R^{n×d} by concatenating the sentence hidden matrix \overrightarrow{M} ∈ R^{n×d} and the reversed sentence hidden matrix \overleftarrow{M} ∈ R^{n×d}. We then conduct element-wise merging to form a feature plane M_{fp} ∈ R^{n×d}. Therefore, the final input to the temporal convolution layer is a 3D tensor I ∈ R^{f×n×d}, where f is the number of matching feature planes, n is the number of layers, and d is the memory dimensionality of the LSTM. Note that the 3D convolutional-layer input I can be viewed as an image in which each feature plane is a channel. In the computer vision and image processing communities, a spatial 2D convolution is typically applied over an input image composed of several input planes; in the experiment section we compare 2D convolution with 1D convolution. In order to apply temporal convolution, we reshape I into a 2D tensor.

The matching feature planes can be viewed as the channels of an image in image processing. In our scenario, these feature planes hold the matching information, and we use a temporal convolutional neural network to learn hidden matching features. The mechanism of the temporal CNN here is the same as that of the character-level temporal CNN, but the kernels are entirely different.

It is important to design a good topology for the CNN so that it can learn hidden features from heterogeneous feature planes. After several experiments, we found two topological graphs that can be deployed in the architecture; Figure 2 and Figure 3 show the two CNN graphs.

[Figure 2: CNN Topology I. Figure 3: CNN Topology II.]

In Topology I, we stack a temporal convolution with kernel width 1 and a tanh activation on top of each feature plane, and then deploy another temporal convolution and tanh activation with kernel width 2. In Topology II, we first stack a temporal convolution and tanh activation with kernel width 2, and then deploy another temporal convolution and tanh activation with kernel width 1. Experimental results show that Topology I is slightly better than Topology II. This is reasonable: the feature planes are heterogeneous, so it only makes sense to compare values across different feature planes after each plane has been transformed by a convolution and tanh activation.
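A PyTorch sketch of the matching feature planes in Eqs. (11)-(12) is given below. The exact Reshape/Join layout and the 1D-convolution hyperparameters for FP_3 are assumptions; here the two representations are concatenated along the row axis and mapped back to the original number of rows.

    import torch
    import torch.nn as nn

    def matching_feature_planes(M1, M2, conv1d):
        """Sketch of Eqs. (11)-(12): element-wise product, absolute difference,
        and a 1D convolution over the joined representations."""
        fp1 = M1 * M2                      # sign agreement between the two representations
        fp2 = torch.abs(M1 - M2)           # element-wise distance
        joined = torch.cat([M1, M2], dim=1)                   # (batch, 2n, d) -- "Join"
        fp3 = conv1d(joined.transpose(1, 2)).transpose(1, 2)  # convolve along the joined axis
        return torch.stack([fp1, fp2, fp3], dim=1)            # (batch, 3 feature planes, n, d)

    n, d = 2, 100                          # layers and memory dimensionality (assumed)
    M1, M2 = torch.randn(1, n, d), torch.randn(1, n, d)
    # Hypothetical 1D convolution that maps the joined 2n rows back to n rows.
    conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=2, stride=2)
    planes = matching_feature_planes(M1, M2, conv)
    print(planes.shape)   # torch.Size([1, 3, 2, 100])

The stacked tensor plays the role of the 3D input I ∈ R^{f×n×d} described above, which is then reshaped and passed to the temporal CNN.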
We selected two related sentence relation modeling tasks: the semantic relatedness task, which measures the degree of semantic relatedness of a sentence pair by assigning a relatedness score ranging from 1 (completely unrelated) to 5 (very related); and the textual entailment task, which determines whether the truth of a text entails the truth of another text called the hypothesis. We use the standard SICK (Sentences Involving Compositional Knowledge) dataset (http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools) for evaluation. It consists of about 10,000 English sentence pairs annotated for relatedness in meaning and entailment.

We initialize our word representations with the publicly available 300-dimensional GloVe word vectors (http://nlp.stanford.edu/projects/glove/). The LSTM memory dimension is 100 and the number of layers is 2. For the CharCNN model we use a threshold activation function on top of each temporal convolution and max-pooling pair. The CharCNN input frame size equals the alphabet size, and the output frame size is 100. The maximum sentence length is 37. The kernel width of each temporal convolution is set to 3 with step 1, and the HighwayMLP has 50 hidden units. Training is done through stochastic gradient descent over shuffled mini-batches with the AdaGrad update rule (Duchi et al., 2011). The learning rate is set to 0.05 and the mini-batch size is 25. The model parameters were regularized with per-minibatch L2 regularization. Note that the word embeddings were fixed during training.

The semantic relatedness prediction task measures the degree of semantic relatedness of a sentence pair by assigning a relatedness score ranging from 1 (completely unrelated) to 5 (very related). More formally, given a sentence pair, we wish to predict a real-valued similarity score in the range [1, K], where K > 1 is an integer. The sequence 1, 2, ..., K is the ordinal scale of similarity, where higher scores indicate greater degrees of similarity. We can predict the similarity score ŷ by predicting the probability that the learned hidden representation x_h belongs to each point of the ordinal scale. This is done by projecting the input representation onto a set of hyperplanes, each of which corresponds to a class; the distance from the input to a hyperplane reflects the probability that the input is located at the corresponding scale point.

Mathematically, the similarity score ŷ can be written as

\hat{y} = r^T \hat{p}_\theta(y | x_h) = r^T \mathrm{softmax}(W x_h + b) = r^T \frac{e^{W_i x_h + b_i}}{\sum_j e^{W_j x_h + b_j}},    (13)

where r^T = [1 2 ... K] and the weight matrix W and bias b are parameters. In order to define the task objective function, we construct a sparse target distribution p that satisfies y = r^T p:

p_i = y - \lfloor y \rfloor          if i = \lfloor y \rfloor + 1,
p_i = \lfloor y \rfloor - y + 1      if i = \lfloor y \rfloor,
p_i = 0                              otherwise,    (14)

for 1 ≤ i ≤ K. The objective function is then the regularized KL-divergence between p and \hat{p}_\theta:

J(\theta) = \frac{1}{m} \sum_{k=1}^{m} \mathrm{KL}(p^{(k)} || \hat{p}^{(k)}_\theta) + \lambda ||\theta||_2^2,    (15)

where m is the number of training pairs and the superscript k indicates the k-th sentence pair (Tai et al., 2015).
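The following PyTorch sketch implements the sparse target of Eq. (14) and the KL objective of Eq. (15) for a single sentence pair; the regularization term and minibatch averaging are omitted, and the helper names are ours.

    import torch
    import torch.nn.functional as F

    def sparse_target(y, K=5):
        """Eq. (14): spread a real-valued score y in [1, K] over the two nearest ordinal classes."""
        p = torch.zeros(K)
        floor = int(torch.floor(torch.tensor(y)))
        p[floor - 1] = floor - y + 1          # class floor(y) (1-indexed)
        if floor < K:
            p[floor] = y - floor              # class floor(y) + 1
        return p

    def relatedness_loss(logits, y, K=5):
        """Eqs. (13) and (15): KL divergence between the sparse target p and the softmax prediction."""
        log_p_hat = F.log_softmax(logits, dim=-1)
        p = sparse_target(y, K)
        return F.kl_div(log_p_hat, p, reduction="sum")

    r = torch.arange(1, 6, dtype=torch.float)             # ordinal scale [1 ... K]
    logits = torch.randn(5)
    print(relatedness_loss(logits, 3.4).item())
    print((r * F.softmax(logits, dim=-1)).sum().item())   # predicted score y_hat = r^T p_theta

In training, this per-pair loss would be averaged over a mini-batch of 25 pairs and combined with the L2 penalty and the AdaGrad update described above.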
For the textual entailment recognition task, we want to maximize the likelihood of the correct class, which is equivalent to minimizing the negative log-likelihood (NLL). More specifically, the label ŷ given the input is predicted by a softmax classifier that takes the learned hidden representation x_h as input:

\hat{p}_\theta(y | x_h) = \mathrm{softmax}(W x_h + b),
\hat{y} = \arg\max_y \hat{p}_\theta(y | x_h).    (16)

The objective function is the negative log-likelihood of the true class labels y^{(k)}:

J(\theta) = -\frac{1}{m} \sum_{k=1}^{m} \log \hat{p}_\theta(y^{(k)} | x_h^{(k)}) + \lambda ||\theta||_2^2,    (17)

where m is the number of training pairs and the superscript k indicates the k-th sentence pair.

Tables 1 and 2 show the Pearson correlation and accuracy comparison results for the semantic relatedness and textual entailment tasks. Combining CharCNN with the multi-layer bidirectional LSTM yields better performance than traditional machine learning methods such as the SVM and MaxEnt approaches (Proisl and Evert, 2014; Lai and Hockenmaier, 2014), which rely on many handcrafted features. Note that our method does not need an extra handcrafted feature extraction procedure, nor does it leverage external linguistic resources such as WordNet or parsing, which produce the best results in (Tai et al., 2015). More importantly, both task prediction results are close to the state-of-the-art, which indicates that our approach can successfully predict heterogeneous tasks simultaneously. For the semantic relatedness task, recent research (Tai et al., 2015) proposed a tree-structured LSTM whose Pearson correlation score reaches 0.863; compared with their approach, our method does not use dependency parsing and can be applied to tasks involving multiple languages.

We should point out that we implemented the method in (Tai et al., 2015), but the results were not as good as those of our method; here we report the results from their paper. Based on our experiments, we believe the method in (Tai et al., 2015) is very sensitive to initialization and thus may not achieve good performance in different settings. Our method, in contrast, is quite stable, which may benefit from the joint task training.
Table 1: Semantic Relatedness Task Comparison.

Method | Pearson Correlation | Features | Reported in
MaxEnt | 0.799 | 137 | (Lai and Hockenmaier, 2014)
Decision tree | 0.804 | 214 | (Jimenez et al., 2014)
RNN | 0.827 | N/A | StanfordNLP run5
Logical Inference | 0.827 | 32 | (Bjerva et al., 2014)
MaxEnt, SVM, kNN, GB, RF | 0.828 | 72 | (Zhao et al., 2014)
WordEmbedding+MB-LSTM+Temp-CNN | 0.849 | 0 | Our implementation
CharCNN+MB-LSTM+Temp-CNN | 0.851 | 0 | Our implementation

Table 2: Textual Entailment Task Comparison.

Method | Accuracy | Features | Reported in
SVM | 0.823 | 41 | (Proisl and Evert, 2014)
Decision tree | 0.831 | 214 | (Jimenez et al., 2014)
MaxEnt, SVM, kNN, GB, RF | 0.836 | 72 | (Zhao et al., 2014)
WordEmbedding+MB-LSTM+Temp-CNN | 0.838 | 0 | Our implementation
CharCNN+MB-LSTM+Temp-CNN | 0.842 | 0 | Our implementation
MaxEnt | 0.846 | 137 | (Lai and Hockenmaier, 2014)

In this experiment, we compare tree LSTMs with sequential LSTMs. A limitation of the sequential LSTM architecture is that it only allows strictly sequential information propagation. Tree LSTMs, by contrast, allow richer network topologies in which each LSTM unit can incorporate information from multiple child units. As in standard LSTM units, each Tree-LSTM unit (indexed by j) contains input and output gates i_j and o_j, a memory cell c_j, and a hidden state h_j. The difference between the standard LSTM unit and tree LSTM units is that the gating vectors and memory cell updates depend on the states of possibly many child units. Additionally, instead of a single forget gate, the tree LSTM unit contains one forget gate f_{jk} for each child k. This allows the tree LSTM unit to selectively incorporate information from each child.

We use the dependency-tree child-sum tree LSTM proposed by (Tai et al., 2015) as our baseline. Given a tree, let C(j) denote the set of children of node j. The child-sum tree LSTM transition equations are the following:

\tilde{h}_j = \sum_{k \in C(j)} h_k,
i_j = \sigma(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}),
f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}),
o_j = \sigma(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}),
u_j = \tanh(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}),
c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k,
h_j = o_j \odot \tanh(c_j).    (18)
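For reference, the following is a minimal PyTorch sketch of the child-sum Tree-LSTM update of Eq. (18), which we use only as the baseline from (Tai et al., 2015); the packing of the parameters into fused linear layers is an implementation convenience of ours.

    import torch
    import torch.nn as nn

    class ChildSumTreeLSTMCell(nn.Module):
        """Sketch of the child-sum Tree-LSTM transition (Eq. 18), following Tai et al. (2015)."""
        def __init__(self, in_dim, mem_dim):
            super().__init__()
            self.W = nn.Linear(in_dim, 4 * mem_dim)          # input projections for i, o, u, f
            self.U_iou = nn.Linear(mem_dim, 3 * mem_dim, bias=False)
            self.U_f = nn.Linear(mem_dim, mem_dim, bias=False)
            self.mem_dim = mem_dim

        def forward(self, x_j, child_h, child_c):
            # child_h, child_c: (num_children, mem_dim)
            h_tilde = child_h.sum(dim=0)                     # sum of children hidden states
            wi, wo, wu, wf = self.W(x_j).split(self.mem_dim, dim=-1)
            ui, uo, uu = self.U_iou(h_tilde).split(self.mem_dim, dim=-1)
            i = torch.sigmoid(wi + ui)
            o = torch.sigmoid(wo + uo)
            u = torch.tanh(wu + uu)
            f = torch.sigmoid(wf.unsqueeze(0) + self.U_f(child_h))  # one forget gate per child
            c = i * u + (f * child_c).sum(dim=0)
            return o * torch.tanh(c), c

    cell = ChildSumTreeLSTMCell(in_dim=300, mem_dim=100)
    h, c = cell(torch.randn(300), torch.randn(2, 100), torch.randn(2, 100))
    print(h.shape, c.shape)   # torch.Size([100]) torch.Size([100])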
Table 3 shows the comparison between the tree-based and sequential methods. If we do not deploy the CNN, the simple Tree LSTM yields a better result than the traditional LSTM but a worse one than the bidirectional LSTM. This is reasonable, since the bidirectional LSTM can enhance the sentence representation by concatenating forward and backward representations. We found that adding the CNN layer decreases accuracy in this scenario, because when feeding into the CNN we have to reshape the feature planes, otherwise the convolution does not work; for example, if we set the convolution kernel width to 2, the input 2D tensor must have a shape larger than 2. To boost performance with the CNN, we need more matching features. We found that the multi-layer bidirectional LSTM can incorporate more features and achieves the best performance compared with the single-layer bidirectional LSTM.

Table 3: Results of Tree LSTM vs. Sequence LSTM on auxiliary char embedding.

Method | Accuracy | Pearson
Dep-Tree LSTM | 0.833 | 0.849
Dep-Tree LSTM + CNN | 0.798 | 0.822
LSTM | 0.812 | 0.833
LSTM + CNN | 0.776 | 0.810
1-Bidirectional LSTM | 0.834 | 0.848
1-Bidirectional LSTM + CNN | 0.821 | 0.846

Existing neural sentence models mainly fall into two groups: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). In regular 1D CNNs (Collobert et al., 2011; Kalchbrenner and Blunsom, 2013; Kim, 2014), a fixed-size window slides over time (successive words in the sequence) to extract local features of a sentence; these features are then pooled into a vector, usually by taking the maximum value in each dimension, for supervised learning. The convolutional unit, when combined with max-pooling, can act as a compositional operator with a local selection mechanism, as in the recursive autoencoder (Socher et al., 2011b). However, semantically related words that do not fall within one filter cannot be captured effectively by this shallow architecture. (Kalchbrenner et al., 2014) built deep convolutional models so that local features can mix at higher-level layers, but deep convolutional models may result in worse performance (Kim, 2014).

On the other hand, RNNs can take advantage of parsing or dependency tree structure information (Socher et al., 2011b; Socher et al., 2014). (Iyyer et al., 2014) used a dependency-tree recursive neural network to map text descriptions to quiz answers. Each node in the tree is represented as a vector, and information is propagated recursively along the tree by some elaborate semantic composition. One major drawback of such networks is the long propagation path of information from nodes near the leaves: because the gradient may vanish when propagated through a deep path, such long dependencies bury illuminating information under a complicated neural architecture, making training difficult. To address this issue, (Tai et al., 2015) proposed Tree-Structured Long Short-Term Memory Networks. This motivates us to investigate a multi-layer bidirectional LSTM that directly models sentence meanings without parsing for the RTE task.
In this paper, we proposed a new deep neural network architecture that jointly leverages pre-trained word embeddings and character embeddings to learn sentence meanings. Our approach first generates two kinds of word sequence representations as inputs into a bidirectional LSTM to learn the sentence representation. After that, we construct matching features, followed by another temporal CNN that learns high-level hidden matching feature representations. Our experiments show that combining pre-trained word embeddings with auxiliary character-level embeddings improves the sentence representation: the enhanced sentence representation generated by the multi-layer bidirectional LSTM encapsulates both character-level and word-level information, and it further strengthens the matching features generated by computing similarity measures on sentence pairs. Experimental results on benchmark datasets demonstrate that our framework achieves state-of-the-art performance compared with other deep neural network based approaches.
References

[Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2).

[Bjerva et al.2014] Johannes Bjerva, Johan Bos, Rob van der Goot, and Malvina Nissim. 2014. The Meaning Factory: Formal semantics for recognizing textual entailment and determining semantic similarity. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

[Graves et al.2013] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 273–278.

[Hochreiter and Schmidhuber1998] Sepp Hochreiter and Jürgen Schmidhuber. 1998. Long short-term memory. Neural Computation, 9(8).

[Iyyer et al.2014] Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Empirical Methods in Natural Language Processing.

[Jimenez et al.2014] Sergio Jimenez, George Duenas, Julia Baquero, and Alexander Gelbukh. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

[Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.

[Kalchbrenner et al.2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

[Kim et al.2016] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence.

[Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

[Lai and Hockenmaier2014] Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

[Proisl and Evert2014] Thomas Proisl and Stefan Evert. 2014. Robust semantic similarity at multiple levels using maximum weight matching. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

[Socher et al.2011a] Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

[Socher et al.2011b] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics.

[Socher et al.2014] Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics.

[Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

[Yin and Schutze2015] Wenpeng Yin and Hinrich Schutze. 2015. MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 63–73.

[Zhang et al.2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

[Zhao et al.2014] Jiang Zhao, Tian Tian Zhu, and Man Lan. 2014. ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.