Improved Sentence Modeling using Suffix Bidirectional LSTM
Siddhartha Brahma
IBM Research AI, Almaden, USA
Abstract
Recurrent neural networks have become ubiquitous in computing representations of sequential data, especially textual data in natural language processing. In particular, Bidirectional LSTMs are at the heart of several neural models achieving state-of-the-art performance in a wide variety of tasks in NLP. However, BiLSTMs are known to suffer from sequential bias – the contextual representation of a token is heavily influenced by tokens close to it in a sentence. We propose a general and effective improvement to the BiLSTM model which encodes each suffix and prefix of a sequence of tokens in both forward and reverse directions. We call our model Suffix Bidirectional LSTM or SuBiLSTM. This introduces an alternate bias that favors long range dependencies. We apply SuBiLSTMs to several tasks that require sentence modeling. We demonstrate that using SuBiLSTM instead of a BiLSTM in existing models leads to improvements in performance in learning general sentence representations, text classification, textual entailment and paraphrase detection. Using SuBiLSTM we achieve new state-of-the-art results for fine-grained sentiment classification and question classification.
Introduction
Recurrent Neural Networks (RNN) (Elman 1990) have emerged as a powerful tool for modeling sequential data. Vanilla RNNs have largely given way to more sophisticated recurrent architectures like Long Short-Term Memory (Hochreiter and Schmidhuber 1997) and the simpler Gated Recurrent Unit (Cho et al. 2014), owing to their superior gradient propagation properties. The importance of LSTMs in natural language processing, where a sentence as a sequence of tokens represents a fundamental unit, has risen exponentially over the past few years. A LSTM processing a sentence in the forward direction produces distributed representations of its prefixes. A Bidirectional LSTM (BiLSTM in short) (Schuster and Paliwal 1997; Graves and Schmidhuber 2005) additionally processes the sentence in the reverse direction (starting from the last token), producing representations of the suffixes (in the reverse direction). For every token t in the sentence, a BiLSTM thus produces a contextual representation of t based on its prefix and suffix in the sentence.

Despite their sophisticated design, it is well known that LSTMs suffer from sequential bias (Pascanu, Mikolov, and Bengio 2013). The hidden state of a LSTM is heavily influenced by the last few tokens it has processed. This implies that the contextual representation of t is highly influenced by the tokens close to it in the sequential order, with tokens farther away being less influential. Computing contextual representations that capture long range dependencies is a challenging research problem, with numerous applications.

In this paper, we propose a simple, general and effective technique to compute contextual representations that capture long range dependencies. For each token t, we encode both its prefix and suffix in both the forward and reverse direction. Notably, the encoding of the suffix in the forward direction is biased towards tokens sequentially farther away to the right of t. Similarly, the encoding of the prefix in the reverse direction is biased towards tokens sequentially farther away to the left of t. Further, we combine the prefix and suffix representations by a simple max-pooling operation to produce a richer contextual representation of t in both the forward and reverse direction. We call our model Suffix BiLSTM or SuBiLSTM in short. A SuBiLSTM has the same representation length as a BiLSTM with the same hidden dimension.

We consider two versions of SuBiLSTMs – a tied version where the suffixes and prefixes in each direction are encoded using the same LSTM, and an untied version where two different LSTMs are used. Note that, as in a BiLSTM, we always use different LSTMs for the forward and reverse direction. In general a SuBiLSTM can be used as a drop in replacement in any model that uses the intermediate states of a BiLSTM, without changing any other parts of the model. However, the main motivation for introducing SuBiLSTMs is to apply it to problems that require whole sentence modeling, e.g. text classification, where the richer contextual information can be helpful. We demonstrate the effectiveness of SuBiLSTM on several sentence modeling tasks in NLP – general sentence representation, text classification, textual entailment and paraphrase detection. In each of these tasks, we show gains by simply replacing BiLSTMs in strong base models, achieving a new state-of-the-art in fine grained sentiment classification and question classification.
Suffix Bidirectional LSTM
Let $s$ be a sequence with $n$ tokens. We use $s[i:j]$ to denote the sequence of embeddings of the tokens from $s[i]$ to $s[j]$, where $j$ may be less than $i$.

[Figure 1: Schematics of SuBiLSTM. The large solid purple arrow represents prefixes and the large solid seagreen arrow represents suffixes. Their directions represent the encoding direction of the corresponding LSTMs. Best viewed in color.]

Let $\overrightarrow{L}_p$ represent a LSTM that encodes prefixes of $s$ in the forward direction. For the $i$-th token $s[i]$, we have
$$\overrightarrow{h}_{p,i} = \overrightarrow{L}_p(s[1:i]) \qquad (1)$$
Let $\overrightarrow{L}_s$ represent a LSTM that encodes suffixes of $s$ in the forward direction.
$$\overrightarrow{h}_{s,i} = \overrightarrow{L}_s(s[i:n]) \qquad (2)$$
Note that the $\overrightarrow{h}_{p,i}$ can be computed in a single pass over $s$, while computing the $\overrightarrow{h}_{s,i}$ needs a total of $n$ passes over progressively smaller suffixes of $s$. Now consider $\overleftarrow{L}_p$ and $\overleftarrow{L}_s$ that encode the prefixes and suffixes of $s$ in the reverse direction.
$$\overleftarrow{h}_{p,i} = \overleftarrow{L}_p(s[i:1]) \qquad (3)$$
$$\overleftarrow{h}_{s,i} = \overleftarrow{L}_s(s[n:i]) \qquad (4)$$
Note that both $\overrightarrow{h}_{p,i}$ and $\overleftarrow{h}_{p,i}$ encode the same prefix, but in different directions. Similarly, $\overrightarrow{h}_{s,i}$ and $\overleftarrow{h}_{s,i}$ encode the same suffix, but in different directions. See Fig. 1 for a schematic illustration.

We have four vectors $\overrightarrow{h}_{p,i}$, $\overrightarrow{h}_{s,i}$, $\overleftarrow{h}_{p,i}$, $\overleftarrow{h}_{s,i}$ that constitute the context of $s[i]$. Using these, we define the following contextual representation of $s[i]$.
$$H^{\mathrm{SuBiLSTM}}_i = \left[\max\left\{\overrightarrow{h}_{p,i}, \overrightarrow{h}_{s,i}\right\};\ \max\left\{\overleftarrow{h}_{p,i}, \overleftarrow{h}_{s,i}\right\}\right] \qquad (5)$$
Here $;$ is the concatenation operator. This defines the SuBiLSTM model. We also define another representation where the two LSTMs encoding the sequence in the same direction are the same or their weights are tied. This defines the
SuBiLSTM-Tied model, which concretely is
$$H^{\mathrm{SuBiLSTM\text{-}Tied}}_i = \left[\max\left\{\overrightarrow{h}_{p,i}, \overrightarrow{h}_{s,i}\right\};\ \max\left\{\overleftarrow{h}_{p,i}, \overleftarrow{h}_{s,i}\right\}\right] \qquad (6)$$
where $\overrightarrow{L}_p \equiv \overrightarrow{L}_s$ and $\overleftarrow{L}_p \equiv \overleftarrow{L}_s$. In contrast to SuBiLSTM, a standard BiLSTM uses the following contextual representation of $s[i]$.
$$H^{\mathrm{BiLSTM}}_i = \left[\overrightarrow{h}_{p,i};\ \overleftarrow{h}_{s,i}\right] \qquad (7)$$
For a fixed hidden dimension, SuBiLSTM and SuBiLSTM-Tied have the same representation length as a BiLSTM. Importantly, SuBiLSTM-Tied uses the same number of parameters as a BiLSTM, while SuBiLSTM uses twice as many.
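To make the definitions concrete, the following is a minimal, unbatched sketch of how the token representations in Eqs. (1)–(7) could be computed in PyTorch. It is our illustration rather than the authors' released implementation; the class name, constructor arguments and the `tied` flag are assumptions made for exposition.

```python
# Minimal single-sentence sketch of SuBiLSTM (Eqs. 1-7); assumes PyTorch.
# Names (SuBiLSTM, hidden_dim, tied) are illustrative, not from the paper's code.
import torch
import torch.nn as nn

class SuBiLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, tied=False):
        super().__init__()
        self.fwd_prefix = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.bwd_suffix = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # In the tied variant the same LSTM encodes prefixes and suffixes
        # within a given direction; otherwise two separate LSTMs are used.
        self.fwd_suffix = self.fwd_prefix if tied else nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.bwd_prefix = self.bwd_suffix if tied else nn.LSTM(input_dim, hidden_dim, batch_first=True)

    @staticmethod
    def _last_state(lstm, span):
        # Encode a (1, length, input_dim) span and return the final hidden state.
        _, (h, _) = lstm(span)
        return h.squeeze(0).squeeze(0)

    def forward(self, x):
        # x: (n, input_dim) embeddings of one sentence.
        n = x.size(0)
        reps = []
        for i in range(n):
            prefix, suffix = x[: i + 1], x[i:]
            h_p_fwd = self._last_state(self.fwd_prefix, prefix.unsqueeze(0))          # Eq. (1)
            h_s_fwd = self._last_state(self.fwd_suffix, suffix.unsqueeze(0))          # Eq. (2)
            h_p_bwd = self._last_state(self.bwd_prefix, prefix.flip(0).unsqueeze(0))  # Eq. (3)
            h_s_bwd = self._last_state(self.bwd_suffix, suffix.flip(0).unsqueeze(0))  # Eq. (4)
            # Eq. (5)/(6): max-pool the prefix and suffix encodings per direction, then concatenate.
            reps.append(torch.cat([torch.max(h_p_fwd, h_s_fwd), torch.max(h_p_bwd, h_s_bwd)]))
        return torch.stack(reps)  # (n, 2 * hidden_dim), same shape as a BiLSTM's outputs
```

Each token re-encodes its prefix and suffix from scratch here, which already hints at the quadratic cost discussed later; a batched variant is sketched in the subsection on time complexity.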
Interpretations of SuBiLSTM

Notice that $\overrightarrow{h}_{s,i}$ is biased towards tokens that are sequentially to the right and farthest away from $s[i]$. Combining it with $\overrightarrow{h}_{p,i}$, which is influenced more by tokens close to and to the left of $s[i]$, creates a representation of $s[i]$ that is dependent on and influenced by tokens both close and far away from it. The same argument can be repeated in the reverse direction with $\overleftarrow{h}_{p,i}$ and $\overleftarrow{h}_{s,i}$. We argue that this is a richer contextual representation of $s[i]$ which can help in better sentence modeling, as compared to BiLSTMs, where the representation is biased towards sequentially close tokens.

As an alternate viewpoint, for every token $s[i]$, SuBiLSTM creates two representations of its prefix $s[1:i]$, namely $\overrightarrow{h}_{p,i}$ and $\overleftarrow{h}_{p,i}$. Their concatenation $[\overrightarrow{h}_{p,i}; \overleftarrow{h}_{p,i}]$ is equivalent to an encoding of the prefix with a BiLSTM consisting of $\overrightarrow{L}_p$ and $\overleftarrow{L}_p$. Similarly, $[\overrightarrow{h}_{s,i}; \overleftarrow{h}_{s,i}]$ is an encoding of the suffix $s[i:n]$ by a BiLSTM consisting of $\overrightarrow{L}_s$ and $\overleftarrow{L}_s$. Thus $H^{\mathrm{SuBiLSTM}}_i$ can be interpreted as the max-pooling of the bidirectional representations of the prefix and suffix of $s[i]$ into a compact representation. This may be contrasted with a BiLSTM, where the prefix is encoded by a LSTM in the forward direction and the suffix is encoded by another LSTM in the reverse direction. SuBiLSTM thus tries to capture more information by encoding the prefix and suffix in a bidirectional manner. In general, the prefix and suffix encodings can be combined in other ways, e.g. concatenation, mean or through a learned gating function. However, we use max-pooling because it is a simple parameterless operation and it performs better in our experiments. Since both SuBiLSTM and SuBiLSTM-Tied produce representations of each token $s[i]$ in the same way as a BiLSTM, they can be used as drop in replacements for a BiLSTM in any model that uses these representations.
Time complexity of a SuBiLSTM

To compute the contextual representations of a minibatch of sentences using a SuBiLSTM, we calculate all the $\overrightarrow{h}_{p,i}$ in one pass using $\overrightarrow{L}_p$. We then create several minibatches (determined by the maximum length of a sentence in the minibatch, $n_{\max}$) of successively smaller suffixes starting at $i$, for each $i \in [1:n_{\max}]$, and use $\overrightarrow{L}_s$ to compute the encodings $\overrightarrow{h}_{s,i}$. The same procedure is repeated for the minibatch of sentences with tokens reversed to compute $\overleftarrow{h}_{s,i}$ and $\overleftarrow{h}_{p,i}$. As an optimization, several of the minibatches of the shorter suffixes can be combined to form larger minibatches. The worst case time complexity of computing all the representations is quadratic in $n_{\max}$, as compared to the linear time complexity using a BiLSTM. As we show in later sections, the increased time complexity is offset by the consistent gains in performance on several sentence modeling tasks. The encodings of the different suffixes can be computed in parallel, which can speed up computation greatly on modern hardware.
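As a rough illustration of the batching scheme just described, the sketch below computes the forward suffix encodings $\overrightarrow{h}_{s,i}$ for a right-padded minibatch, one suffix start index at a time. This is our sketch under assumed conventions (PyTorch, right-padded tensors, a single-layer unidirectional `nn.LSTM` passed in as `suffix_lstm`), not the authors' code.

```python
# Sketch of the suffix batching described above. Assumes a right-padded batch
# x of shape (batch, n_max, input_dim), true lengths as a LongTensor `lengths`,
# and a single-layer unidirectional nn.LSTM as `suffix_lstm`. Names are ours.
import torch
from torch.nn.utils.rnn import pack_padded_sequence

def forward_suffix_encodings(suffix_lstm, x, lengths):
    batch, n_max, _ = x.shape
    h_s = x.new_zeros(batch, n_max, suffix_lstm.hidden_size)
    for i in range(n_max):  # one LSTM pass per start index -> worst case quadratic in n_max
        active = (lengths > i).nonzero(as_tuple=True)[0]  # sentences long enough to have this suffix
        if active.numel() == 0:
            break
        suffixes = x[active, i:, :]                     # the suffixes s[i:] of the active sentences
        packed = pack_padded_sequence(suffixes, (lengths[active] - i).cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = suffix_lstm(packed)               # final hidden state encodes each suffix
        h_s[active, i] = h_n[-1]
    return h_s  # h_s[b, i]: forward encoding of the suffix of sentence b starting at token i
```

Running the same routine on the token-reversed batch yields the reverse-direction encodings, as described above; the loop makes the worst case quadratic dependence on $n_{\max}$ explicit.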
Evaluation, Datasets, Training and Testing

We evaluate the representational power of SuBiLSTM using several sentence modeling tasks and datasets from NLP. We do not concern ourselves with designing new models for SuBiLSTM. Rather, for each task, we take a strongly performing base model that uses the token representations of a BiLSTM and replace it with SuBiLSTM. The training procedures are kept exactly the same.
General Sentence Representation
First, we investigate whether a SuBiLSTM can be trained to produce good general sentence representations that transfer well to several NLP tasks. As the base model, we use the recently proposed InferSent (Conneau et al. 2017). It was shown to give strong results on a set of 10 NLP tasks encapsulated in the SentEval benchmark (Conneau and Kiela 2018). The representation of a sentence is a max-pooling of the token representations produced by a SuBiLSTM.
$$H^{\mathrm{SuBiLSTM}}(s) = \max_{i \in [1:n]} H^{\mathrm{SuBiLSTM}}_i \qquad (8)$$
where $H^{\mathrm{SuBiLSTM}}_i$ is defined in (5). The representation $H^{\mathrm{SuBiLSTM\text{-}Tied}}(s)$ is defined similarly. We train the model on the textual entailment task, where a pair of sentences (premise and hypothesis) needs to be classified into one of three classes – entailment, contradiction and neutral. Let $u$ be the encoding of the premise according to (8) and let $v$ be the encoding of the hypothesis. Using a Siamese architecture, the combined vector $[u; v; |u - v|; u \cdot v]$ is used as the representation of the pair, which is then passed through two fully connected layers and a final classification layer.

Training. We use the combination of the Stanford Natural Language Inference (SNLI) (Bowman et al. 2015) and the MultiNLI (Williams, Nangia, and Bowman 2018) datasets to train. We set the hidden dimension of the LSTMs in SuBiLSTM to 2048, which produces a 4096 dimensional encoding for each sentence. The two fully connected layers are of 512 dimensions each. The tokens in the sentence are embedded using GloVe embeddings (Pennington, Socher, and Manning 2014), which are not updated during training. We follow the same training procedure used for training the InferSent model in (Conneau et al. 2017).
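A small sketch of the sentence encoding in Eq. (8) and the Siamese pair features described above; it assumes a `subilstm` callable returning the per-token representations (for instance, the module sketched earlier), and the function names are ours rather than the paper's.

```python
# Sketch of Eq. (8) and the Siamese pair features [u; v; |u - v|; u * v].
# `subilstm(x)` is assumed to return the (n, 2d) token representations.
import torch

def encode_sentence(subilstm, x):
    # Eq. (8): max-pool the SuBiLSTM token representations over the sentence.
    return subilstm(x).max(dim=0).values            # (2d,)

def nli_pair_features(u, v):
    # Combined premise/hypothesis vector fed to the fully connected layers.
    return torch.cat([u, v, (u - v).abs(), u * v])  # (8d,)
```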
Testing. We test the sentence representations learned by SuBiLSTM on the SentEval benchmark. This benchmark consists of 6 text classification tasks (MR, CR, SUBJ, MPQA, SST, TREC) with accuracy as the performance measure. There is one task on paraphrase detection (MRPC), with accuracy and F1, and one on entailment classification (SICK-E), with accuracy as the performance measure.

[Table 1: Summary of the datasets and their classification tasks used in our evaluation.]
Text Classification
We pick two representative tasks for text classification – sentiment classification and question classification. As the base model, we use the Biattentive-Classification-Network (BCN) proposed by (McCann et al. 2017), which was shown to give strong performance on several text classification datasets, especially in association with CoVe embeddings (McCann et al. 2017). The BCN model uses two BiLSTMs to encode a sentence. The intermediate states of the first BiLSTM are used to compute a self-attention matrix. This is followed by further processing and a second BiLSTM before a final classification layer. Our hypothesis is that the richer contextual representations of SuBiLSTM should help such attention based sentence models. For our experiments, we replace only the first BiLSTM with a SuBiLSTM.
Training and Testing. For sentiment classification, we use the Stanford Sentiment Treebank dataset (Socher et al. 2013), both in its binary (SST-2) and fine-grained (SST-5) forms. For question classification, we use the TREC (Voorhees 2001) dataset, both in its 6 class (TREC-6) and 50 class (TREC-50) forms. The hidden dimension of the LSTMs is set to 300. Distinct from (McCann et al. 2017), we use dropout after the embedding layer and before the classification layer. The two maxout layers are fixed at reduction factors of 4 and 2. We also apply weight decay to the parameters during optimization, which is done using Adam (Kingma and Ba 2015) with a learning rate of 1e-3. We experiment with two versions of the initial embedding – one using GloVe only and the other using both GloVe and CoVe, both of which are fixed during training. Validation and testing are done using the sets associated with the SST and TREC datasets.
Textual Entailment
As mentioned above, the textual entailment problem is the task of classifying a pair of sentences into three classes – entailment, contradiction and neutral. It is an important and canonical text matching problem in NLP. To test SuBiLSTM for this task, we pick ESIM (Chen et al. 2017) as the base model. ESIM has been shown to achieve state-of-the-art results on the SNLI dataset and has been the basis of further improvements. Like BCN above, ESIM uses two BiLSTM layers to encode sentences, with an inter-sentence attention mechanism in between. In our experiments, we only replace the first BiLSTM with a SuBiLSTM.

Training and Testing. We use 300 dimensional GloVe embeddings to initialize the word embeddings (which are also updated during training) and use 300 dimensional LSTMs. We follow the same training procedure as (Chen et al. 2017). Validation and testing are on the corresponding sets in the SNLI dataset.

Table 2: Performance of SuBiLSTM on the SentEval benchmark. The first 8 methods contain both unsupervised and supervised ones. FastSent is from (Hill, Cho, and Korhonen 2016), SkipThought is described in (Kiros et al. 2015), DisSent in (Nie, Bennett, and Goodman 2018), CNN-LSTM in (Gan et al. 2017), Byte mLSTM in (Radford, Józefowicz, and Sutskever 2017), QuickThoughts in (Logeswaran and Lee 2018) and MultiTask in (Subramanian et al. 2018). Our base model is InferSent (Conneau et al. 2017). Bold indicates the best performance among the SuBiLSTM models and the base model.

Model                 MR    CR    SUBJ  MPQA  SST   TREC  MRPC       SICK-R  SICK-E  STSB
Other Existing Methods
FastSent+AE           71.8  76.7  88.8  81.5  -     80.4  71.2/79.1  -       -       -
SkipThought-LN        79.4  83.1  93.7  89.3  82.9  88.4  -          0.858   79.5    -
DisSent               80.1  84.9  93.6  90.1  84.1  93.6  75.0/-     0.849   83.7    -
CNN-LSTM              77.8  82.1  93.6  89.4  -     92.6  76.5/83.8  0.862   -       -
Byte mLSTM            86.9  91.4  94.6  88.5  -     -     75.0/82.8  0.792   -       -
MultiTask             82.5  87.7  94.0  90.9  83.2  93.0  78.6/84.4  0.888   87.8    0.789
QuickThoughts         82.4  86.0  94.8  90.2  87.6  92.4  76.9/84.0  0.874   -       -
Supervised Training on AllNLI (4096 dimensions)
BiLSTM (InferSent)    81.1  86.3  92.4  90.2  84.6  88.2  76.2/83.1  0.884   86.3    0.758
BiLSTM-2layer         81.3  86.2  92.0  90.2

[Figure 2: Gains by using SuBiLSTM (Avg. ∆ = 0.58%) and SuBiLSTM-Tied (Avg. ∆ = 0.59%) in the SentEval tasks. For MRPC we use F1 percentage, for SICK-R and STSB we use 100 × Pearson correlation and for the rest accuracy percentages. The Avg. ∆ is the average of the 10 values.]
Paraphrase Detection
In this task, a pair of sentences needs to be classified according to whether they are paraphrases of each other. To demonstrate the effectiveness of SuBiLSTM in a model that does not use any attention mechanism on the token representations, we use the same Siamese architecture used for training general sentence representations described above, except with one fully connected layer at the end, followed by a ReLU activation.
Training and Testing. We use 300 dimensional GloVe embeddings to initialize the word embeddings (which are also updated during training) and use 600 as the hidden dimension of all LSTMs and also as the dimension of the fully connected layer. We apply dropout after the word embedding layer and after the ReLU activation. Training is done using the Adam optimizer with a learning rate of 1e-3. We use the QUORA dataset (Iyer et al. 2017) to train and test our models. A summary of the various datasets used in our evaluation is given in Table 1.
Baselines
For each of the tasks, we compare SuBiLSTM and SuBiLSTM-Tied with a single-layer BiLSTM and a 2-layer BiLSTM encoder with the same hidden dimension. While a SuBiLSTM-Tied encoder has the same number of parameters as a single-layer BiLSTM, a SuBiLSTM has twice as many. In contrast, a 2-layer BiLSTM has more parameters than either of the SuBiLSTM variants if the hidden dimension is at least as large as the input dimension, which is the case in all our models. By comparing with a 2-layer BiLSTM baseline, we account for the larger number of parameters used in SuBiLSTM and also check whether the long range contextual information captured by SuBiLSTM can easily be replicated by adding more layers to the BiLSTM.
Table 3: Comparison of text classification methods on the four datasets – SST-2, SST-5, TREC-6 and TREC-50. For each of them, we show accuracy numbers for BCN with SuBiLSTM and BCN with BiLSTM (base model), both with and without CoVe embeddings. The best performing ones among these are shown in bold.

SST-2 (accuracy)
NSE (Munkhdalai and Yu 2017a)                          89.7
BCN+Char+CoVe (McCann et al. 2017)                     90.3
Byte mLSTM (Radford, Józefowicz, and Sutskever 2017)   91.8
BCN with BiLSTM                                        89.3
BCN with 2-layer BiLSTM                                89.5
BCN with SuBiLSTM                                      89.8
BCN with SuBiLSTM-Tied                                 89.7
BCN with BiLSTM+CoVe                                   90.1
BCN with 2-layer BiLSTM+CoVe                           90.5
BCN with SuBiLSTM+CoVe                                 91.0
BCN with SuBiLSTM-Tied+CoVe

TREC-6 (accuracy)
BCN+Char+CoVe (McCann et al. 2017)                     95.8
TBCNN (Mou et al. 2015)                                96.0
LSTM-CNN (Zhou et al. 2016)                            96.1
BCN with BiLSTM                                        95.2
BCN with 2-layer BiLSTM                                95.5
BCN with SuBiLSTM                                      95.8
BCN with SuBiLSTM-Tied                                 96.2
BCN with BiLSTM+CoVe                                   95.8
BCN with 2-layer BiLSTM+CoVe                           95.8
BCN with SuBiLSTM+CoVe                                 96.0
BCN with SuBiLSTM-Tied+CoVe                            95.8

SST-5 (accuracy)
TE-LSTM (Huang, Qian, and Zhu 2017)                    52.6
NTI (Munkhdalai and Yu 2017b)                          53.1
BCN+Char+CoVe (McCann et al. 2017)                     53.7
BCN with BiLSTM                                        53.2
BCN with 2-layer BiLSTM                                53.5
BCN with SuBiLSTM                                      53.2
BCN with SuBiLSTM-Tied                                 53.4
BCN with BiLSTM+CoVe                                   53.6
BCN with 2-layer BiLSTM+CoVe                           54.0
BCN with SuBiLSTM+CoVe                                 54.5
BCN with SuBiLSTM-Tied+CoVe                            56.2

TREC-50 (accuracy)
BCN+Char+CoVe (McCann et al. 2017)                     90.2
RulesUHC (da Silva et al. 2011)                        90.8
Rules (Madabushi and Lee 2016)                         97.2
BCN with BiLSTM                                        89.8
BCN with 2-layer BiLSTM                                89.4
BCN with SuBiLSTM                                      89.8
BCN with SuBiLSTM-Tied                                 89.4
BCN with BiLSTM+CoVe                                   90.0
BCN with 2-layer BiLSTM+CoVe                           89.2
BCN with SuBiLSTM+CoVe
BCN with SuBiLSTM-Tied+CoVe
Experimental Results
In this section, for the sake of brevity, the terms SuBiLSTM and SuBiLSTM-Tied will sometimes refer to the base models where the BiLSTM has been replaced by our models.
General Sentence Representation
The performance of SuBiLSTM and SuBiLSTM-Tied on the 10 transfer tasks in SentEval is shown in Table 2. In all the tasks, SuBiLSTM and SuBiLSTM-Tied match or exceed the performance of the base model InferSent that uses a BiLSTM. For SuBiLSTM, among the classification tasks, the gains for SUBJ (0.8%), MPQA (0.5%) and TREC (1.6%) over InferSent are particularly notable. There is also a substantial gain of 1.2% in the semantic textual similarity task (STSB). The performance of SuBiLSTM-Tied also follows a similar trend, gaining 0.6% for SUBJ, 0.5% for SST, 2.2% for TREC and 1.3% for STSB. The better performance on STSB is noteworthy, as the sentence representations derived from a SuBiLSTM can take advantage of the long range dependencies it encodes. The 2-layer BiLSTM based model performs comparably to the single layer BiLSTM, despite using a much larger number of parameters.

In Fig. 2 we plot the absolute gains made by SuBiLSTM and SuBiLSTM-Tied over BiLSTM for all the 10 tasks. It is interesting to note that both models perform comparably on an average, although SuBiLSTM has twice as many parameters as SuBiLSTM-Tied. The performance of our models is still some way off from MultiTask (Subramanian et al. 2018); but they use a training dataset which is two orders of magnitude larger, with a complex set of learning objectives. QuickThoughts (Logeswaran and Lee 2018) also uses a much larger unsupervised dataset. It is possible that SuBiLSTM coupled with the training objectives and datasets used in these two works will provide substantial gains over the existing results.
Text Classification
The performance of SuBiLSTM and SuBiLSTM-Tied on the four text classification datasets is shown in Table 3. In three of these tasks (SST-2, SST-5 and TREC-50), SuBiLSTM-Tied using GloVe and CoVe embeddings performs the best. It performs notably better than the single layer BiLSTM based base model BCN on SST-2 and SST-5, achieving a new state-of-the-art accuracy of 56.2% on fine-grained sentiment classification (SST-5). On TREC-6, the best result is obtained for SuBiLSTM-Tied using GloVe embeddings only, a new state-of-the-art accuracy of 96.2%. There is no substantial improvement on the TREC-50 dataset.

For text classification, we observe that SuBiLSTM-Tied performs better than SuBiLSTM, and CoVe embeddings give a boost in most cases. The performance of the base model BCN with a 2-layer BiLSTM is slightly better than with the single layer BiLSTM in all cases except TREC-50. However, despite using a larger number of parameters, it does not perform better than either SuBiLSTM or SuBiLSTM-Tied. This implies that the richer contextual information captured by a SuBiLSTM cannot easily be replicated by adding more layers to the BiLSTM. Note that BCN uses a self-attention mechanism on top of the token representations and it is able to exploit the richer representations provided by SuBiLSTM.

Table 4: Accuracy of SuBiLSTM and BiLSTM on the SNLI test set with ESIM as the base model.

Model                                  Test
ESIM with BiLSTM (Chen et al. 2017)    88.0
DIIN (Gong, Luo, and Zhang 2018)       88.0
BCN+Char+CoVe (McCann et al. 2017)     88.1
DR-BiLSTM (Ghaeini et al. 2018)        88.5
CAFE (Tay, Tuan, and Hui 2018)         88.5
ESIM with BiLSTM (Ours)                87.8
ESIM with 2-layer BiLSTM (Ours)        87.9
ESIM with SuBiLSTM                     88.3
ESIM with SuBiLSTM-Tied                88.2
ESIM with BiLSTM (Ensemble)            88.6
ESIM with 2-layer BiLSTM (Ensemble)    88.7
ESIM with SuBiLSTM (Ensemble)          89.1
ESIM with SuBiLSTM-Tied (Ensemble)     89.1
Textual Entailment
The performance of SuBiLSTM and SuBiLSTM-Tied on the SNLI dataset is shown in Table 4. Our implementation of ESIM, when using a BiLSTM, achieves 87.8% accuracy. Using a SuBiLSTM, the accuracy jumps to 88.3%, and to 88.2% for the Tied version. On using the 2-layer BiLSTM, accuracy improves only marginally by 0.1%. This is aligned with the results shown for text classification above. Here again, the attention mechanism on top of the token representations benefits from the long range contextual information captured by SuBiLSTM. Note that ESIM uses an inter-sentence attention mechanism and is able to exploit the better token representations provided by SuBiLSTM across sentences. We also report the performance of an ensemble of 5 models. Both the SuBiLSTM versions achieve an accuracy of 89.1%, while the BiLSTM based ones perform worse.
Paraphrase Detection
The accuracies obtained on the QUORA dataset are shown in Table 5. Note that unlike the BCN and ESIM models, we use a simple Siamese architecture without any attention mechanism. In fact, the representation of a sentence in this case is simply the max-pooling of all the intermediate representations of the SuBiLSTM. Even in this case, we observe gains over both single layer and 2-layer BiLSTMs, although slightly smaller than with the attention based models. The best model (SuBiLSTM) achieves 88.2%, at par with a more complex attention based model, BiMPM (Wang, Hamza, and Florian 2017).

Table 5: Accuracy of SuBiLSTM and BiLSTM on the QUORA test set with a Siamese base model. All previous results use attention mechanisms.

Model                                     Test
BiMPM (Wang, Hamza, and Florian 2017)     88.2
pt-DECATTchar (Tomar et al. 2017)         88.4
DIIN (Gong, Luo, and Zhang 2018)          89.1
MwAN (Tan et al. 2018)                    89.1
BiLSTM                                    87.8
2-layer BiLSTM                            87.9
SuBiLSTM                                  88.2
SuBiLSTM-Tied                             88.1

[Figure 3: Gains from using SuBiLSTM and SuBiLSTM-Tied over a single layer BiLSTM on all the datasets. The difference is between the best figures obtained for each model. For SentEval we use the average score and accuracy for the rest.]
Comparison of SuBiLSTM and SuBiLSTM-Tied
The results shown above clearly demonstrate the efficacy of using SuBiLSTMs in existing models geared towards four different sentence modeling tasks. The relative performances of SuBiLSTM and SuBiLSTM-Tied are fairly close to each other, as shown by the relative gains in Fig. 3. SuBiLSTM-Tied works better on small datasets (SST and TREC), probably owing to the regularizing effect of using the same LSTM to encode both suffixes and prefixes. For the larger datasets (SNLI and QUORA), SuBiLSTM slightly edges out the tied version owing to its larger capacity. The training complexity of both the models is similar and hence, with half the parameters, SuBiLSTM-Tied should be the more favored model for sentence modeling tasks.

Related Work
Recurrent Neural Networks (Elman 1990) have emerged as one of the most powerful tools for computing distributed representations of sequential data. The problems of training vanilla RNNs (Bengio, Simard, and Frasconi 1994) were addressed by more sophisticated models – most notably the Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) and the simpler GRU (Cho et al. 2014). Over the years, several alternatives to the basic RNN model have been proposed. A Dilated-RNN (Chang et al. 2017) uses progressively dilated connections between recurrent nodes to extract long range dependencies efficiently. A Skip-RNN (Chang et al. 2017) learns to skip state updates rather than applying them at each token in a sequence, thereby achieving faster training and inference times. Recurrent Highway Networks (Zilly et al. 2017) allow for multiple state updates via highway connections at each time step, and a Clockwork-RNN (Koutník et al. 2014) updates its state at multiple timescales. The idea of capturing long term dependencies in better ways has given rise to memory augmented architectures like Neural Turing Machines (Graves, Wayne, and Danihelka 2014) and TopicRNNs (Dieng et al. 2017).

In this paper, we focus on LSTMs. As shown by the work of (Józefowicz, Zaremba, and Sutskever 2015) and (Greff et al. 2016), LSTMs represent a robust recurrent neural network architecture for modeling sequential data. In particular, LSTMs are a core component in several state-of-the-art neural models for NLP tasks like language modeling (Melis, Dyer, and Blunsom 2018; Merity, Keskar, and Socher 2018), textual entailment (Chen et al. 2017), question answering (Seo et al. 2017), semantic role labeling (He et al. 2017) and named entity recognition (Ma and Hovy 2016).

A unidirectional RNN processes a sequence in a single direction, usually following the natural order specific to the sequence. Bidirectional RNNs, where two distinct recurrent networks process the input sequence in opposite directions, were first proposed by (Schuster and Paliwal 1997). This allows the model to have a representation of the prefix and the suffix at each intermediate point in the sequence, thereby providing context in both directions. Following the work by (Graves and Schmidhuber 2005), Bidirectional LSTMs have become a mainstay for sequence representation tasks. The concept of having encodings of different contexts has since been generalized to Multidimensional LSTMs (Graves and Schmidhuber 2008) and Grid LSTMs (Kalchbrenner, Danihelka, and Graves 2016).

In the recently proposed Twin-Networks (Serdyuk et al. 2018), the authors show that forcing the prefix encoding in a BiLSTM to be close to the suffix encoding in the reverse direction acts as a regularizer and helps capture more long term dependencies. We take a more direct approach – explicitly encoding the suffix in the forward direction and forcing an interaction with the prefix encoding through a max-pooling. Although we focus on LSTMs in this paper, our idea generalizes trivially to other RNN cells.
Conclusion
We propose SuBiLSTM and SuBiLSTM-Tied, a simple, general and effective improvement to the BiLSTM model, where the prefix and suffix of each token in a sentence are encoded in both forward and reverse directions to capture long range dependencies. We demonstrate gains in performance by replacing BiLSTMs in existing models for several sentence modeling tasks. The main drawback of our method is the quadratic time complexity required to compute the representations in a SuBiLSTM. As a future direction of work, we intend to explore variants of SuBiLSTM where only suffixes of fixed or small random lengths are computed. We also plan to utilize the information (e.g. encodings of subsequences) exposed by SuBiLSTM in more novel ways.
References
Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.

Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Chang, S.; Zhang, Y.; Han, W.; Yu, M.; Guo, X.; Tan, W.; Cui, X.; Witbrock, M.; Hasegawa-Johnson, M.; and Huang, T. S. 2017. Dilated recurrent neural networks. In NIPS.

Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; and Jiang, H. 2017. Enhancing and combining sequential and tree LSTM for natural language inference. In ACL.

Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Conneau, A., and Kiela, D. 2018. SentEval: An evaluation toolkit for universal sentence representations. CoRR abs/1803.05449.

Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bordes, A. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

da Silva, J. P. C. G.; Coheur, L.; Mendes, A. C.; and Wichert, A. 2011. From symbolic to sub-symbolic information in question classification. Artif. Intell. Rev.

Dieng, A. B.; Wang, C.; Gao, J.; and Paisley, J. 2017. TopicRNN: A recurrent neural network with long-range semantic dependency. In ICLR.

Elman, J. L. 1990. Finding structure in time. Cognitive Science.

Gan, Z.; Pu, Y.; Henao, R.; Li, C.; He, X.; and Carin, L. 2017. Learning generic sentence representations using convolutional neural networks. In EMNLP.

Ghaeini, R.; Hasan, S. A.; Datla, V. V.; Liu, J.; Lee, K.; Qadir, A.; Ling, Y.; Prakash, A.; Fern, X. Z.; and Farri, O. 2018. DR-BiLSTM: Dependent reading bidirectional LSTM for natural language inference. In NAACL-HLT.

Gong, Y.; Luo, H.; and Zhang, J. 2018. Natural language inference over interaction space. In ICLR.

Graves, A., and Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks.

Graves, A., and Schmidhuber, J. 2008. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS.

Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural Turing machines.

Greff, K.; Srivastava, R. K.; Koutnik, J.; Steunebrink, B. R.; and Schmidhuber, J. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems.

He, L.; Lee, K.; Lewis, M.; and Zettlemoyer, L. 2017. Deep semantic role labeling: What works and what's next. In ACL.

Hill, F.; Cho, K.; and Korhonen, A. 2016. Learning distributed representations of sentences from unlabelled data. In HLT-NAACL.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.

Huang, M.; Qian, Q.; and Zhu, X. 2017. Encoding syntactic knowledge in neural networks for sentiment classification. ACM Trans. Inf. Syst.

Iyer, S.; Dandekar, N.; and Csernai, K. 2017. First Quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.

Józefowicz, R.; Zaremba, W.; and Sutskever, I. 2015. An empirical exploration of recurrent network architectures. In Bach, F. R., and Blei, D. M., eds., ICML.

Kalchbrenner, N.; Danihelka, I.; and Graves, A. 2016. Grid long short-term memory. In ICLR.

Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.

Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In NIPS.

Koutník, J.; Greff, K.; Gomez, F.; and Schmidhuber, J. 2014. A clockwork RNN. In ICML.

Logeswaran, L., and Lee, H. 2018. An efficient framework for learning sentence representations. In ICLR.

Ma, X., and Hovy, E. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL.

Madabushi, H., and Lee, M. 2016. High accuracy rule-based question classification using question syntax and semantics. In COLING.

McCann, B.; Bradbury, J.; Xiong, C.; and Socher, R. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Melis, G.; Dyer, C.; and Blunsom, P. 2018. On the state of the art of evaluation in neural language models. In ICLR.

Merity, S.; Keskar, N. S.; and Socher, R. 2018. Regularizing and optimizing LSTM language models. In ICLR.

Mou, L.; Peng, H.; Li, G.; Xu, Y.; Zhang, L.; and Jin, Z. 2015. Tree-based convolution: A new neural architecture for sentence modeling. In EMNLP.

Munkhdalai, T., and Yu, H. 2017a. Neural semantic encoders. In EACL.

Munkhdalai, T., and Yu, H. 2017b. Neural tree indexers for text understanding. In EACL.

Nie, A.; Bennett, E. D.; and Goodman, N. D. 2018. DisSent: Sentence representation learning from explicit discourse relations.

Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In ICML.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP.

Radford, A.; Józefowicz, R.; and Sutskever, I. 2017. Learning to generate reviews and discovering sentiment. CoRR.

Schuster, M., and Paliwal, K. K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.

Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional attention flow for machine comprehension. In ICLR.

Serdyuk, D.; Ke, N. R.; Sordoni, A.; Pal, C.; and Bengio, Y. 2018. Twin networks: Using the future as a regularizer. In ICLR.

Socher, R.; Perelygin, A.; Wu, J. Y.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Subramanian, S.; Trischler, A.; Bengio, Y.; and Pal, C. J. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR.

Tan, C.; Wei, F.; Wang, W.; Lv, W.; and Zhou, M. 2018. Multiway attention networks for modeling sentence pairs. In IJCAI.

Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018. A compare-propagate architecture with alignment factorization for natural language inference.

Tomar, G. S.; Duque, T.; Täckström, O.; Uszkoreit, J.; and Das, D. 2017. Neural paraphrase identification of questions with noisy pretraining. In SWCN@EMNLP.

Voorhees, E. M. 2001. The TREC question answering track. Nat. Lang. Eng.

Wang, Z.; Hamza, W.; and Florian, R. 2017. Bilateral multi-perspective matching for natural language sentences. In IJCAI.

Williams, A.; Nangia, N.; and Bowman, S. R. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Zhou, P.; Qi, Z.; Zheng, S.; Xu, J.; Bao, H.; and Xu, B. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In COLING.

Zilly, J. G.; Srivastava, R. K.; Koutník, J.; and Schmidhuber, J. 2017. Recurrent highway networks. In ICML.