Improved Sentence Modeling using Suffix Bidirectional LSTM
Siddhartha Brahma
IBM Research AI, Almaden, USA
Abstract
Recurrent neural networks have become ubiquitous in computing representations of sequential data, especially textual data in natural language processing. In particular, Bidirectional LSTMs are at the heart of several neural models achieving state-of-the-art performance in a wide variety of tasks in NLP. However, BiLSTMs are known to suffer from sequential bias – the contextual representation of a token is heavily influenced by tokens close to it in a sentence. We propose a general and effective improvement to the BiLSTM model which encodes each suffix and prefix of a sequence of tokens in both forward and reverse directions. We call our model Suffix Bidirectional LSTM or SuBiLSTM. This introduces an alternate bias that favors long range dependencies. We apply SuBiLSTMs to several tasks that require sentence modeling. We demonstrate that using SuBiLSTM instead of a BiLSTM in existing models leads to improvements in performance in learning general sentence representations, text classification, textual entailment and paraphrase detection. Using SuBiLSTM we achieve new state-of-the-art results for fine-grained sentiment classification and question classification.
Introduction
Recurrent Neural Networks (RNN) (Elman 1990) have emerged as a powerful tool for modeling sequential data. Vanilla RNNs have largely given way to more sophisticated recurrent architectures like Long Short-Term Memory (Hochreiter and Schmidhuber 1997) and the simpler Gated Recurrent Unit (Cho et al. 2014), owing to their superior gradient propagation properties. The importance of LSTMs in natural language processing, where a sentence as a sequence of tokens represents a fundamental unit, has risen exponentially over the past few years. A LSTM processing a sentence in the forward direction produces distributed representations of its prefixes. A Bidirectional LSTM (BiLSTM in short) (Schuster and Paliwal 1997; Graves and Schmidhuber 2005) additionally processes the sentence in the reverse direction (starting from the last token), producing representations of the suffixes (in the reverse direction). For every token t in the sentence, a BiLSTM thus produces a contextual representation of t based on its prefix and suffix in the sentence.

Despite their sophisticated design, it is well known that LSTMs suffer from sequential bias (Pascanu, Mikolov, and Bengio 2013). The hidden state of a LSTM is heavily influenced by the last few tokens it has processed. This implies that the contextual representation of t is highly influenced by the tokens close to it in the sequential order, with tokens farther away being less influential. Computing contextual representations that capture long range dependencies is a challenging research problem, with numerous applications.

In this paper, we propose a simple, general and effective technique to compute contextual representations that capture long range dependencies. For each token t, we encode both its prefix and suffix in both the forward and reverse direction. Notably, the encoding of the suffix in the forward direction is biased towards tokens sequentially farther away to the right of t. Similarly, the encoding of the prefix in the reverse direction is biased towards tokens sequentially farther away to the left of t. Further, we combine the prefix and suffix representations by a simple max-pooling operation to produce a richer contextual representation of t in both the forward and reverse direction. We call our model Suffix BiLSTM or SuBiLSTM in short. A SuBiLSTM has the same representation length as a BiLSTM with the same hidden dimension.

We consider two versions of SuBiLSTMs – a tied version where the suffixes and prefixes in each direction are encoded using the same LSTM, and an untied version where two different LSTMs are used. Note that, as in a BiLSTM, we always use different LSTMs for the forward and reverse direction. In general a SuBiLSTM can be used as a drop in replacement in any model that uses the intermediate states of a BiLSTM, without changing any other parts of the model. However, the main motivation for introducing SuBiLSTMs is to apply it to problems that require whole sentence modeling, e.g. text classification, where the richer contextual information can be helpful. We demonstrate the effectiveness of SuBiLSTM on several sentence modeling tasks in NLP – general sentence representation, text classification, textual entailment and paraphrase detection. In each of these tasks, we show gains by simply replacing BiLSTMs in strong base models, achieving a new state-of-the-art in fine grained sentiment classification and question classification.
Suffix Bidirectional LSTM
Let $s$ be a sequence with $n$ tokens. We use $s[i:j]$ to denote the sequence of embeddings of the tokens from $s[i]$ to $s[j]$, where $j$ may be less than $i$.

[Figure 1: Schematics of SuBiLSTM. The large solid purple arrow represents prefixes and the large solid seagreen arrow represents suffixes. Their directions represent the encoding direction of the corresponding LSTMs. Best viewed in color.]

Let $\overrightarrow{L}_p$ represent a LSTM that encodes prefixes of $s$ in the forward direction. For the $i$-th token $s[i]$, we have
$$\overrightarrow{h}_{p,i} = \overrightarrow{L}_p(s[1:i]) \qquad (1)$$
Let $\overrightarrow{L}_s$ represent a LSTM that encodes suffixes of $s$ in the forward direction.
$$\overrightarrow{h}_{s,i} = \overrightarrow{L}_s(s[i:n]) \qquad (2)$$
Note that the $\overrightarrow{h}_{p,i}$ can be computed in a single pass over $s$, while computing the $\overrightarrow{h}_{s,i}$ needs a total of $n$ passes over progressively smaller suffixes of $s$. Now consider $\overleftarrow{L}_p$ and $\overleftarrow{L}_s$ that encode the prefixes and suffixes of $s$ in the reverse direction.
$$\overleftarrow{h}_{p,i} = \overleftarrow{L}_p(s[i:1]) \qquad (3)$$
$$\overleftarrow{h}_{s,i} = \overleftarrow{L}_s(s[n:i]) \qquad (4)$$
Note that both $\overrightarrow{h}_{p,i}$ and $\overleftarrow{h}_{p,i}$ encode the same prefix, but in different directions. Similarly, $\overrightarrow{h}_{s,i}$ and $\overleftarrow{h}_{s,i}$ encode the same suffix, but in different directions. See Fig. 1 for a schematic illustration.

We have four vectors $\overrightarrow{h}_{p,i}$, $\overrightarrow{h}_{s,i}$, $\overleftarrow{h}_{p,i}$, $\overleftarrow{h}_{s,i}$ that constitute the context of $s[i]$. Using these, we define the following contextual representation of $s[i]$.
$$H^{\mathrm{SuBiLSTM}}_i = \left[\max\left\{\overrightarrow{h}_{p,i}, \overrightarrow{h}_{s,i}\right\};\ \max\left\{\overleftarrow{h}_{p,i}, \overleftarrow{h}_{s,i}\right\}\right] \qquad (5)$$
Here $;$ is the concatenation operator. This defines the SuBiLSTM model. We also define another representation where the two LSTMs encoding the sequence in the same direction are the same or their weights are tied. This defines the
SuBiLSTM-Tied model, which concretely is
$$H^{\mathrm{SuBiLSTM\text{-}Tied}}_i = \left[\max\left\{\overrightarrow{h}_{p,i}, \overrightarrow{h}_{s,i}\right\};\ \max\left\{\overleftarrow{h}_{p,i}, \overleftarrow{h}_{s,i}\right\}\right] \qquad (6)$$
where $\overrightarrow{L}_p \equiv \overrightarrow{L}_s$ and $\overleftarrow{L}_p \equiv \overleftarrow{L}_s$. In contrast to SuBiLSTM, a standard BiLSTM uses the following contextual representation of $s[i]$.
$$H^{\mathrm{BiLSTM}}_i = \left[\overrightarrow{h}_{p,i};\ \overleftarrow{h}_{s,i}\right] \qquad (7)$$
For a fixed hidden dimension, SuBiLSTM and SuBiLSTM-Tied have the same representation length as a BiLSTM. Importantly, SuBiLSTM-Tied uses the same number of parameters as a BiLSTM, while SuBiLSTM uses twice as many.
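To make the definitions concrete, the following is a minimal, unbatched sketch of how the token representations in Eqs. (1)–(7) could be computed in PyTorch. It is our illustration rather than the authors' released implementation; the class name, constructor arguments and the `tied` flag are assumptions made for exposition.

```python
# Minimal single-sentence sketch of SuBiLSTM (Eqs. 1-7); assumes PyTorch.
# Names (SuBiLSTM, hidden_dim, tied) are illustrative, not from the paper's code.
import torch
import torch.nn as nn

class SuBiLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, tied=False):
        super().__init__()
        self.fwd_prefix = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.bwd_suffix = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # In the tied variant the same LSTM encodes prefixes and suffixes
        # within a given direction; otherwise two separate LSTMs are used.
        self.fwd_suffix = self.fwd_prefix if tied else nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.bwd_prefix = self.bwd_suffix if tied else nn.LSTM(input_dim, hidden_dim, batch_first=True)

    @staticmethod
    def _last_state(lstm, span):
        # Encode a (1, length, input_dim) span and return the final hidden state.
        _, (h, _) = lstm(span)
        return h.squeeze(0).squeeze(0)

    def forward(self, x):
        # x: (n, input_dim) embeddings of one sentence.
        n = x.size(0)
        reps = []
        for i in range(n):
            prefix, suffix = x[: i + 1], x[i:]
            h_p_fwd = self._last_state(self.fwd_prefix, prefix.unsqueeze(0))          # Eq. (1)
            h_s_fwd = self._last_state(self.fwd_suffix, suffix.unsqueeze(0))          # Eq. (2)
            h_p_bwd = self._last_state(self.bwd_prefix, prefix.flip(0).unsqueeze(0))  # Eq. (3)
            h_s_bwd = self._last_state(self.bwd_suffix, suffix.flip(0).unsqueeze(0))  # Eq. (4)
            # Eq. (5)/(6): max-pool the prefix and suffix encodings per direction, then concatenate.
            reps.append(torch.cat([torch.max(h_p_fwd, h_s_fwd), torch.max(h_p_bwd, h_s_bwd)]))
        return torch.stack(reps)  # (n, 2 * hidden_dim), same shape as a BiLSTM's outputs
```

Each token re-encodes its prefix and suffix from scratch here, which already hints at the quadratic cost discussed later; a batched variant is sketched in the subsection on time complexity.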
Interpretations of SuBiLSTM

Notice that $\overrightarrow{h}_{s,i}$ is biased towards tokens that are sequentially to the right and farthest away from $s[i]$. Combining it with $\overrightarrow{h}_{p,i}$, which is influenced more by tokens close to and to the left of $s[i]$, creates a representation of $s[i]$ that is dependent on and influenced by tokens both close and far away from it. The same argument can be repeated in the reverse direction with $\overleftarrow{h}_{p,i}$ and $\overleftarrow{h}_{s,i}$. We argue that this is a richer contextual representation of $s[i]$ which can help in better sentence modeling, as compared to BiLSTMs, where the representation is biased towards sequentially close tokens.

As an alternate viewpoint, for every token $s[i]$, SuBiLSTM creates two representations of its prefix $s[1:i]$, namely $\overrightarrow{h}_{p,i}$ and $\overleftarrow{h}_{p,i}$. Their concatenation $[\overrightarrow{h}_{p,i}; \overleftarrow{h}_{p,i}]$ is equivalent to an encoding of the prefix with a BiLSTM consisting of $\overrightarrow{L}_p$ and $\overleftarrow{L}_p$. Similarly, $[\overrightarrow{h}_{s,i}; \overleftarrow{h}_{s,i}]$ is an encoding of the suffix $s[i:n]$ by a BiLSTM consisting of $\overrightarrow{L}_s$ and $\overleftarrow{L}_s$. Thus $H^{\mathrm{SuBiLSTM}}_i$ can be interpreted as the max-pooling of the bidirectional representations of the prefix and suffix of $s[i]$ into a compact representation. This may be contrasted with a BiLSTM, where the prefix is encoded by a LSTM in the forward direction and the suffix is encoded by another LSTM in the reverse direction. SuBiLSTM thus tries to capture more information by encoding the prefix and suffix in a bidirectional manner. In general, the prefix and suffix encodings can be combined in other ways, e.g. concatenation, mean or through a learned gating function. However, we use max-pooling because it is a simple parameterless operation and it performs better in our experiments. Since both SuBiLSTM and SuBiLSTM-Tied produce representations of each token $s[i]$ in the same way as a BiLSTM, they can be used as drop in replacements for a BiLSTM in any model that uses these representations.
Time complexity of a SuBiLSTM

To compute the contextual representations of a minibatch of sentences using a SuBiLSTM, we calculate all the $\overrightarrow{h}_{p,i}$ in one pass using $\overrightarrow{L}_p$. We then create several minibatches (determined by the maximum length of a sentence in the minibatch, $n_{\max}$) of successively smaller suffixes starting at $i$, for each $i \in [1:n_{\max}]$, and use $\overrightarrow{L}_s$ to compute the encodings $\overrightarrow{h}_{s,i}$. The same procedure is repeated for the minibatch of sentences with tokens reversed to compute $\overleftarrow{h}_{s,i}$ and $\overleftarrow{h}_{p,i}$. As an optimization, several of the minibatches of the shorter suffixes can be combined to form larger minibatches. The worst case time complexity of computing all the representations is quadratic in $n_{\max}$, as compared to the linear time complexity using a BiLSTM. As we show in later sections, the increased time complexity is offset by the consistent gains in performance on several sentence modeling tasks. The encodings of the different suffixes can be computed in parallel, which can speed up computation greatly on modern hardware.
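As a rough illustration of the batching scheme just described, the sketch below computes the forward suffix encodings $\overrightarrow{h}_{s,i}$ for a right-padded minibatch, one suffix start index at a time. This is our sketch under assumed conventions (PyTorch, right-padded tensors, a single-layer unidirectional `nn.LSTM` passed in as `suffix_lstm`), not the authors' code.

```python
# Sketch of the suffix batching described above. Assumes a right-padded batch
# x of shape (batch, n_max, input_dim), true lengths as a LongTensor `lengths`,
# and a single-layer unidirectional nn.LSTM as `suffix_lstm`. Names are ours.
import torch
from torch.nn.utils.rnn import pack_padded_sequence

def forward_suffix_encodings(suffix_lstm, x, lengths):
    batch, n_max, _ = x.shape
    h_s = x.new_zeros(batch, n_max, suffix_lstm.hidden_size)
    for i in range(n_max):  # one LSTM pass per start index -> worst case quadratic in n_max
        active = (lengths > i).nonzero(as_tuple=True)[0]  # sentences long enough to have this suffix
        if active.numel() == 0:
            break
        suffixes = x[active, i:, :]                     # the suffixes s[i:] of the active sentences
        packed = pack_padded_sequence(suffixes, (lengths[active] - i).cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = suffix_lstm(packed)               # final hidden state encodes each suffix
        h_s[active, i] = h_n[-1]
    return h_s  # h_s[b, i]: forward encoding of the suffix of sentence b starting at token i
```

Running the same routine on the token-reversed batch yields the reverse-direction encodings, as described above; the loop makes the worst case quadratic dependence on $n_{\max}$ explicit.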
Evaluation, Datasets, Training and Testing

We evaluate the representational power of SuBiLSTM using several sentence modeling tasks and datasets from NLP. We do not concern ourselves with designing new models for SuBiLSTM. Rather, for each task, we take a strongly performing base model that uses the token representations of a BiLSTM and replace it with SuBiLSTM. The training procedures are kept exactly the same.
General Sentence Representation
First, we investigate whether a SuBiLSTM can be trained to produce good general sentence representations that transfer well to several NLP tasks. As the base model, we use the recently proposed InferSent (Conneau et al. 2017). It was shown to give strong results on a set of 10 NLP tasks encapsulated in the SentEval benchmark (Conneau and Kiela 2018). The representation of a sentence is a max-pooling of the token representations produced by a SuBiLSTM.
$$H^{\mathrm{SuBiLSTM}}(s) = \max_{i \in [1:n]} H^{\mathrm{SuBiLSTM}}_i \qquad (8)$$
where $H^{\mathrm{SuBiLSTM}}_i$ is defined in (5). The representation $H^{\mathrm{SuBiLSTM\text{-}Tied}}(s)$ is defined similarly. We train the model on the textual entailment task, where a pair of sentences (premise and hypothesis) needs to be classified into one of three classes – entailment, contradiction and neutral. Let $u$ be the encoding of the premise according to (8) and let $v$ be the encoding of the hypothesis. Using a Siamese architecture, the combined vector $[u; v; |u - v|; u \cdot v]$ is used as the representation of the pair, which is then passed through two fully connected layers and a final classification layer.

Training. We use the combination of the Stanford Natural Language Inference (SNLI) (Bowman et al. 2015) and the MultiNLI (Williams, Nangia, and Bowman 2018) datasets to train. We set the hidden dimension of the LSTMs in SuBiLSTM to 2048, which produces a 4096 dimensional encoding for each sentence. The two fully connected layers are of 512 dimensions each. The tokens in the sentence are embedded using GloVe embeddings (Pennington, Socher, and Manning 2014), which are not updated during training. We follow the same training procedure used for training the InferSent model in (Conneau et al. 2017).
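A small sketch of the sentence encoding in Eq. (8) and the Siamese pair features described above; it assumes a `subilstm` callable returning the per-token representations (for instance, the module sketched earlier), and the function names are ours rather than the paper's.

```python
# Sketch of Eq. (8) and the Siamese pair features [u; v; |u - v|; u * v].
# `subilstm(x)` is assumed to return the (n, 2d) token representations.
import torch

def encode_sentence(subilstm, x):
    # Eq. (8): max-pool the SuBiLSTM token representations over the sentence.
    return subilstm(x).max(dim=0).values            # (2d,)

def nli_pair_features(u, v):
    # Combined premise/hypothesis vector fed to the fully connected layers.
    return torch.cat([u, v, (u - v).abs(), u * v])  # (8d,)
```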
Testing. We test the sentence representations learned by SuBiLSTM on the SentEval benchmark. This benchmark consists of 6 text classification tasks (MR, CR, SUBJ, MPQA, SST, TREC) with accuracy as the performance measure. There is one task on paraphrase detection (MRPC), with accuracy and F1, and one on entailment classification (SICK-E), with accuracy as the performance measure.

[Table 1: Summary of the datasets and their classification tasks used in our evaluation.]
Text Classification
We pick two representative tasks for text classification – sentiment classification and question classification. As the base model, we use the Biattentive-Classification-Network (BCN) proposed by (McCann et al. 2017), which was shown to give strong performance on several text classification datasets, especially in association with CoVe embeddings (McCann et al. 2017). The BCN model uses two BiLSTMs to encode a sentence. The intermediate states of the first BiLSTM are used to compute a self-attention matrix. This is followed by further processing and a second BiLSTM before a final classification layer. Our hypothesis is that the richer contextual representations of SuBiLSTM should help such attention based sentence models. For our experiments, we replace only the first BiLSTM with a SuBiLSTM.
Training and Testing. For sentiment classification, we use the Stanford Sentiment Treebank dataset (Socher et al. 2013), both in its binary (SST-2) and fine-grained (SST-5) forms. For question classification, we use the TREC (Voorhees 2001) dataset, both in its 6 class (TREC-6) and 50 class (TREC-50) forms. The hidden dimension of the LSTMs is set to 300. Distinct from (McCann et al. 2017), we use dropout after the embedding layer and before the classification layer. The two maxout layers are fixed at reduction factors of 4 and 2. We also apply weight decay to the parameters during optimization, which is done using Adam (Kingma and Ba 2015) with a learning rate of 1e-3. We experiment with two versions of the initial embedding – one using GloVe only and the other using both GloVe and CoVe, both of which are fixed during training. Validation and testing are done using the sets associated with the SST and TREC datasets.
Textual Entailment
As mentioned above, the textual entailment problem is the task of classifying a pair of sentences into three classes – entailment, contradiction and neutral. It is an important and canonical text matching problem in NLP. To test SuBiLSTM for this task, we pick ESIM (Chen et al. 2017) as the base model. ESIM has been shown to achieve state-of-the-art results on the SNLI dataset and has been the basis of further improvements. Like BCN above, ESIM uses two BiLSTM layers to encode sentences, with an inter-sentence attention mechanism in between. In our experiments, we only replace the first BiLSTM with a SuBiLSTM.

Training and Testing. We use 300 dimensional GloVe embeddings to initialize the word embeddings (which are also updated during training) and use 300 dimensional LSTMs. We follow the same training procedure as (Chen et al. 2017). Validation and testing are on the corresponding sets in the SNLI dataset.

Table 2: Performance of SuBiLSTM on the SentEval benchmark. The first 8 methods contain both unsupervised and supervised ones. FastSent is from (Hill, Cho, and Korhonen 2016), SkipThought is described in (Kiros et al. 2015), DisSent in (Nie, Bennett, and Goodman 2018), CNN-LSTM in (Gan et al. 2017), Byte mLSTM in (Radford, Józefowicz, and Sutskever 2017), QuickThoughts in (Logeswaran and Lee 2018) and MultiTask in (Subramanian et al. 2018). Our base model is InferSent (Conneau et al. 2017). Bold indicates the best performance among the SuBiLSTM models and the base model.

Model                 MR    CR    SUBJ  MPQA  SST   TREC  MRPC       SICK-R  SICK-E  STSB
Other Existing Methods
FastSent+AE           71.8  76.7  88.8  81.5  -     80.4  71.2/79.1  -       -       -
SkipThought-LN        79.4  83.1  93.7  89.3  82.9  88.4  -          0.858   79.5    -
DisSent               80.1  84.9  93.6  90.1  84.1  93.6  75.0/-     0.849   83.7    -
CNN-LSTM              77.8  82.1  93.6  89.4  -     92.6  76.5/83.8  0.862   -       -
Byte mLSTM            86.9  91.4  94.6  88.5  -     -     75.0/82.8  0.792   -       -
MultiTask             82.5  87.7  94.0  90.9  83.2  93.0  78.6/84.4  0.888   87.8    0.789
QuickThoughts         82.4  86.0  94.8  90.2  87.6  92.4  76.9/84.0  0.874   -       -
Supervised Training on AllNLI (4096 dimensions)
BiLSTM (InferSent)    81.1  86.3  92.4  90.2  84.6  88.2  76.2/83.1  0.884   86.3    0.758
BiLSTM-2layer         81.3  86.2  92.0  90.2

[Figure 2: Gains by using SuBiLSTM (Avg. ∆ = 0.58%) and SuBiLSTM-Tied (Avg. ∆ = 0.59%) in the SentEval tasks. For MRPC we use F1 percentage, for SICK-R and STSB we use 100 × Pearson correlation and for the rest accuracy percentages. The Avg. ∆ is the average of the 10 values.]
Paraphrase Detection
In this task, a pair of sentences needs to be classified according to whether they are paraphrases of each other. To demonstrate the effectiveness of SuBiLSTM in a model that does not use any attention mechanism on the token representations, we use the same Siamese architecture used for training general sentence representations described above, except with one fully connected layer at the end, followed by a ReLU activation.
Training and Testing. We use 300 dimensional GloVe embeddings to initialize the word embeddings (which are also updated during training) and use 600 as the hidden dimension of all LSTMs and also as the dimension of the fully connected layer. We apply dropout after the word embedding layer and after the ReLU activation. Training is done using the Adam optimizer with a learning rate of 1e-3. We use the QUORA dataset (Iyer et al. 2017) to train and test our models. A summary of the various datasets used in our evaluation is given in Table 1.
Baselines
For each of the tasks, we compare SuBiLSTM and SuBiLSTM-Tied with a single-layer BiLSTM and a 2-layer BiLSTM encoder with the same hidden dimension. While a SuBiLSTM-Tied encoder has the same number of parameters as a single-layer BiLSTM, a SuBiLSTM has twice as many. In contrast, a 2-layer BiLSTM has more parameters than either of the SuBiLSTM variants if the hidden dimension is at least as large as the input dimension, which is the case in all our models. By comparing with a 2-layer BiLSTM baseline, we account for the larger number of parameters used in SuBiLSTM and also check whether the long range contextual information captured by SuBiLSTM can easily be replicated by adding more layers to the BiLSTM.
Table 3: Comparison of text classification methods on the four datasets – SST-2, SST-5, TREC-6 and TREC-50. For each of them, we show accuracy numbers for BCN with SuBiLSTM and BCN with BiLSTM (base model), both with and without CoVe embeddings. The best performing ones among these are shown in bold.

SST-2 (accuracy)
NSE (Munkhdalai and Yu 2017a)                          89.7
BCN+Char+CoVe (McCann et al. 2017)                     90.3
Byte mLSTM (Radford, Józefowicz, and Sutskever 2017)   91.8
BCN with BiLSTM                                        89.3
BCN with 2-layer BiLSTM                                89.5
BCN with SuBiLSTM                                      89.8
BCN with SuBiLSTM-Tied                                 89.7
BCN with BiLSTM+CoVe                                   90.1
BCN with 2-layer BiLSTM+CoVe                           90.5
BCN with SuBiLSTM+CoVe                                 91.0
BCN with SuBiLSTM-Tied+CoVe

TREC-6 (accuracy)
BCN+Char+CoVe (McCann et al. 2017)                     95.8
TBCNN (Mou et al. 2015)                                96.0
LSTM-CNN (Zhou et al. 2016)                            96.1
BCN with BiLSTM                                        95.2
BCN with 2-layer BiLSTM                                95.5
BCN with SuBiLSTM                                      95.8
BCN with SuBiLSTM-Tied                                 96.2
BCN with BiLSTM+CoVe                                   95.8
BCN with 2-layer BiLSTM+CoVe                           95.8
BCN with SuBiLSTM+CoVe                                 96.0
BCN with SuBiLSTM-Tied+CoVe                            95.8

SST-5 (accuracy)
TE-LSTM (Huang, Qian, and Zhu 2017)                    52.6
NTI (Munkhdalai and Yu 2017b)                          53.1
BCN+Char+CoVe (McCann et al. 2017)                     53.7
BCN with BiLSTM                                        53.2
BCN with 2-layer BiLSTM                                53.5
BCN with SuBiLSTM                                      53.2
BCN with SuBiLSTM-Tied                                 53.4
BCN with BiLSTM+CoVe                                   53.6
BCN with 2-layer BiLSTM+CoVe                           54.0
BCN with SuBiLSTM+CoVe                                 54.5
BCN with SuBiLSTM-Tied+CoVe                            56.2

TREC-50 (accuracy)
BCN+Char+CoVe (McCann et al. 2017)                     90.2
RulesUHC (da Silva et al. 2011)                        90.8
Rules (Madabushi and Lee 2016)                         97.2
BCN with BiLSTM                                        89.8
BCN with 2-layer BiLSTM                                89.4
BCN with SuBiLSTM                                      89.8
BCN with SuBiLSTM-Tied                                 89.4
BCN with BiLSTM+CoVe                                   90.0
BCN with 2-layer BiLSTM+CoVe                           89.2
BCN with SuBiLSTM+CoVe
BCN with SuBiLSTM-Tied+CoVe
Experimental Results
In this section, for the sake of brevity, the terms SuBiLSTM and SuBiLSTM-Tied will sometimes refer to the base models where the BiLSTM has been replaced by our models.
General Sentence Representation
The performance of SuBiLSTM and SuBiLSTM-Tied on the 10 transfer tasks in SentEval is shown in Table 2. In all the tasks, SuBiLSTM and SuBiLSTM-Tied match or exceed the performance of the base model InferSent that uses a BiLSTM. For SuBiLSTM, among the classification tasks, the gains for SUBJ (0.8%), MPQA (0.5%) and TREC (1.6%) over InferSent are particularly notable. There is also a substantial gain of 1.2% in the semantic textual similarity task (STSB). The performance of SuBiLSTM-Tied also follows a similar trend, gaining 0.6% for SUBJ, 0.5% for SST, 2.2% for TREC and 1.3% for STSB. The better performance on STSB is noteworthy, as the sentence representations derived from a SuBiLSTM can take advantage of the long range dependencies it encodes. The 2-layer BiLSTM based model performs comparably to the single layer BiLSTM, despite using a much larger number of parameters.

In Fig. 2 we plot the absolute gains made by SuBiLSTM and SuBiLSTM-Tied over BiLSTM for all the 10 tasks. It is interesting to note that both models perform comparably on an average, although SuBiLSTM has twice as many parameters as SuBiLSTM-Tied. The performance of our models is still some way off from MultiTask (Subramanian et al. 2018); but they use a training dataset which is two orders of magnitude larger, with a complex set of learning objectives. QuickThoughts (Logeswaran and Lee 2018) also uses a much larger unsupervised dataset. It is possible that SuBiLSTM coupled with the training objectives and datasets used in these two works will provide substantial gains over the existing results.
Text Classification
The performance of SuBiLSTM and SuBiLSTM-Tied on the four text classification datasets is shown in Table 3. In three of these tasks (SST-2, SST-5 and TREC-50), SuBiLSTM-Tied using GloVe and CoVe embeddings performs the best. It performs notably better than the single layer BiLSTM based base model BCN on SST-2 and SST-5, achieving a new state-of-the-art accuracy of 56.2% on fine-grained sentiment classification (SST-5). On TREC-6, the best result is obtained for SuBiLSTM-Tied using GloVe embeddings only, a new state-of-the-art accuracy of 96.2%. There is no substantial improvement on the TREC-50 dataset.

For text classification, we observe that SuBiLSTM-Tied performs better than SuBiLSTM, and CoVe embeddings give a boost in most cases. The performance of the base model BCN with a 2-layer BiLSTM is slightly better than with the single layer BiLSTM in all cases except TREC-50. However, despite using a larger number of parameters, it does not perform better than either SuBiLSTM or SuBiLSTM-Tied. This implies that the richer contextual information captured by a SuBiLSTM cannot easily be replicated by adding more layers to the BiLSTM. Note that BCN uses a self-attention mechanism on top of the token representations and it is able to exploit the richer representations provided by SuBiLSTM.

Table 4: Accuracy of SuBiLSTM and BiLSTM on the SNLI test set with ESIM as the base model.

Model                                  Test
ESIM with BiLSTM (Chen et al. 2017)    88.0
DIIN (Gong, Luo, and Zhang 2018)       88.0
BCN+Char+CoVe (McCann et al. 2017)     88.1
DR-BiLSTM (Ghaeini et al. 2018)        88.5
CAFE (Tay, Tuan, and Hui 2018)         88.5
ESIM with BiLSTM (Ours)                87.8
ESIM with 2-layer BiLSTM (Ours)        87.9
ESIM with SuBiLSTM                     88.3
ESIM with SuBiLSTM-Tied                88.2
ESIM with BiLSTM (Ensemble)            88.6
ESIM with 2-layer BiLSTM (Ensemble)    88.7
ESIM with SuBiLSTM (Ensemble)          89.1
ESIM with SuBiLSTM-Tied (Ensemble)     89.1
Textual Entailment
The performance of SuBiLSTM and SuBiLSTM-Tied on the SNLI dataset is shown in Table 4. Our implementation of ESIM, when using a BiLSTM, achieves 87.8% accuracy. Using a SuBiLSTM, the accuracy jumps to 88.3%, and to 88.2% for the Tied version. On using the 2-layer BiLSTM, accuracy improves only marginally by 0.1%. This is aligned with the results shown for text classification above. Here again, the attention mechanism on top of the token representations benefits from the long range contextual information captured by SuBiLSTM. Note that ESIM uses an inter-sentence attention mechanism and is able to exploit the better token representations provided by SuBiLSTM across sentences. We also report the performance of an ensemble of 5 models. Both the SuBiLSTM versions achieve an accuracy of 89.1%, while the BiLSTM based ones perform worse.
Paraphrase Detection
The accuracies obtained on the QUORA dataset are shown in Table 5. Note that unlike the BCN and ESIM models, we use a simple Siamese architecture without any attention mechanism. In fact, the representation of a sentence in this case is simply the max-pooling of all the intermediate representations of the SuBiLSTM. Even in this case, we observe gains over both single layer and 2-layer BiLSTMs, although slightly smaller than with the attention based models. The best model (SuBiLSTM) achieves 88.2%, at par with a more complex attention based model, BiMPM (Wang, Hamza, and Florian 2017).

Table 5: Accuracy of SuBiLSTM and BiLSTM on the QUORA test set with a Siamese base model. All previous results use attention mechanisms.

Model                                     Test
BiMPM (Wang, Hamza, and Florian 2017)     88.2
pt-DECATTchar (Tomar et al. 2017)         88.4
DIIN (Gong, Luo, and Zhang 2018)          89.1
MwAN (Tan et al. 2018)                    89.1
BiLSTM                                    87.8
2-layer BiLSTM                            87.9
SuBiLSTM                                  88.2
SuBiLSTM-Tied                             88.1

[Figure 3: Gains from using SuBiLSTM and SuBiLSTM-Tied over a single layer BiLSTM on all the datasets. The difference is between the best figures obtained for each model. For SentEval we use the average score and accuracy for the rest.]
Comparison of SuBiLSTM and SuBiLSTM-Tied
The results shown above clearly demonstrate the efficacy of using SuBiLSTMs in existing models geared towards four different sentence modeling tasks. The relative performances of SuBiLSTM and SuBiLSTM-Tied are fairly close to each other, as shown by the relative gains in Fig. 3. SuBiLSTM-Tied works better on small datasets (SST and TREC), probably owing to the regularizing effect of using the same LSTM to encode both suffixes and prefixes. For the larger datasets (SNLI and QUORA), SuBiLSTM slightly edges out the tied version owing to its larger capacity. The training complexity of both the models is similar and hence, with half the parameters, SuBiLSTM-Tied should be the more favored model for sentence modeling tasks.

Related Work
Recurrent Neural Networks (Elman 1990) have emerged as one of the most powerful tools for computing distributed representations of sequential data. The problems of training vanilla RNNs (Bengio, Simard, and Frasconi 1994) were addressed by more sophisticated models – most notably the Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) and the simpler GRU (Cho et al. 2014). Over the years, several alternatives to the basic RNN model have been proposed. A Dilated-RNN (Chang et al. 2017) uses progressively dilated connections between recurrent nodes to extract long range dependencies efficiently. A Skip-RNN (Chang et al. 2017) learns to skip state updates rather than applying them at each token in a sequence, thereby achieving faster training and inference times. Recurrent Highway Networks (Zilly et al. 2017) allow for multiple state updates via highway connections at each time step, and a Clockwork-RNN (Koutník et al. 2014) updates its state at multiple timescales. The idea of capturing long term dependencies in better ways has given rise to memory augmented architectures like Neural Turing Machines (Graves, Wayne, and Danihelka 2014) and TopicRNNs (Dieng et al. 2017).

In this paper, we focus on LSTMs. As shown by the work of (Józefowicz, Zaremba, and Sutskever 2015) and (Greff et al. 2016), LSTMs represent a robust recurrent neural network architecture for modeling sequential data. In particular, LSTMs are a core component in several state-of-the-art neural models for NLP tasks like language modeling (Melis, Dyer, and Blunsom 2018; Merity, Keskar, and Socher 2018), textual entailment (Chen et al. 2017), question answering (Seo et al. 2017), semantic role labeling (He et al. 2017) and named entity recognition (Ma and Hovy 2016).

A unidirectional RNN processes a sequence in a single direction, usually following the natural order specific to the sequence. Bidirectional RNNs, where two distinct recurrent networks process the input sequence in opposite directions, were first proposed by (Schuster and Paliwal 1997). This allows the model to have a representation of the prefix and the suffix at each intermediate point in the sequence, thereby providing context in both directions. Following the work by (Graves and Schmidhuber 2005), Bidirectional LSTMs have become a mainstay for sequence representation tasks. The concept of having encodings of different contexts has since been generalized to Multidimensional LSTMs (Graves and Schmidhuber 2008) and Grid LSTMs (Kalchbrenner, Danihelka, and Graves 2016).

In the recently proposed Twin-Networks (Serdyuk et al. 2018), the authors show that forcing the prefix encoding in a BiLSTM to be close to the suffix encoding in the reverse direction acts as a regularizer and helps capture more long term dependencies. We take a more direct approach – explicitly encoding the suffix in the forward direction and forcing an interaction with the prefix encoding through a max-pooling. Although we focus on LSTMs in this paper, our idea generalizes trivially to other RNN cells.
Conclusion
We propose SuBiLSTM and SuBiLSTM-Tied, a simple, general and effective improvement to the BiLSTM model, where the prefix and suffix of each token in a sentence are encoded in both forward and reverse directions to capture long range dependencies. We demonstrate gains in performance by replacing BiLSTMs in existing models for several sentence modeling tasks. The main drawback of our method is the quadratic time complexity required to compute the representations in a SuBiLSTM. As a future direction of work, we intend to explore variants of SuBiLSTM where only suffixes of fixed or small random lengths are computed. We also plan to utilize the information (e.g. encodings of subsequences) exposed by SuBiLSTM in more novel ways.
References
Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.

Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Chang, S.; Zhang, Y.; Han, W.; Yu, M.; Guo, X.; Tan, W.; Cui, X.; Witbrock, M.; Hasegawa-Johnson, M.; and Huang, T. S. 2017. Dilated recurrent neural networks. In NIPS.

Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; and Jiang, H. 2017. Enhancing and combining sequential and tree LSTM for natural language inference. In ACL.

Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Conneau, A., and Kiela, D. 2018. SentEval: An evaluation toolkit for universal sentence representations. CoRR abs/1803.05449.

Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bordes, A. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

da Silva, J. P. C. G.; Coheur, L.; Mendes, A. C.; and Wichert, A. 2011. From symbolic to sub-symbolic information in question classification. Artif. Intell. Rev.

Dieng, A. B.; Wang, C.; Gao, J.; and Paisley, J. 2017. TopicRNN: A recurrent neural network with long-range semantic dependency. In ICLR.

Elman, J. L. 1990. Finding structure in time. Cognitive Science.

Gan, Z.; Pu, Y.; Henao, R.; Li, C.; He, X.; and Carin, L. 2017. Learning generic sentence representations using convolutional neural networks. In EMNLP.

Ghaeini, R.; Hasan, S. A.; Datla, V. V.; Liu, J.; Lee, K.; Qadir, A.; Ling, Y.; Prakash, A.; Fern, X. Z.; and Farri, O. 2018. DR-BiLSTM: Dependent reading bidirectional LSTM for natural language inference. In NAACL-HLT.

Gong, Y.; Luo, H.; and Zhang, J. 2018. Natural language inference over interaction space. In ICLR.

Graves, A., and Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks.

Graves, A., and Schmidhuber, J. 2008. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS.

Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural Turing machines.

Greff, K.; Srivastava, R. K.; Koutnik, J.; Steunebrink, B. R.; and Schmidhuber, J. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems.

He, L.; Lee, K.; Lewis, M.; and Zettlemoyer, L. 2017. Deep semantic role labeling: What works and what's next. In ACL.

Hill, F.; Cho, K.; and Korhonen, A. 2016. Learning distributed representations of sentences from unlabelled data. In HLT-NAACL.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.

Huang, M.; Qian, Q.; and Zhu, X. 2017. Encoding syntactic knowledge in neural networks for sentiment classification. ACM Trans. Inf. Syst.

Iyer, S.; Dandekar, N.; and Csernai, K. 2017. First Quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.

Józefowicz, R.; Zaremba, W.; and Sutskever, I. 2015. An empirical exploration of recurrent network architectures. In Bach, F. R., and Blei, D. M., eds., ICML.

Kalchbrenner, N.; Danihelka, I.; and Graves, A. 2016. Grid long short-term memory. In ICLR.

Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.

Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In NIPS.

Koutník, J.; Greff, K.; Gomez, F.; and Schmidhuber, J. 2014. A clockwork RNN. In ICML.

Logeswaran, L., and Lee, H. 2018. An efficient framework for learning sentence representations. In ICLR.

Ma, X., and Hovy, E. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL.

Madabushi, H., and Lee, M. 2016. High accuracy rule-based question classification using question syntax and semantics. In COLING.

McCann, B.; Bradbury, J.; Xiong, C.; and Socher, R. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Melis, G.; Dyer, C.; and Blunsom, P. 2018. On the state of the art of evaluation in neural language models. In ICLR.

Merity, S.; Keskar, N. S.; and Socher, R. 2018. Regularizing and optimizing LSTM language models. In ICLR.

Mou, L.; Peng, H.; Li, G.; Xu, Y.; Zhang, L.; and Jin, Z. 2015. Tree-based convolution: A new neural architecture for sentence modeling. In EMNLP.

Munkhdalai, T., and Yu, H. 2017a. Neural semantic encoders. In EACL.

Munkhdalai, T., and Yu, H. 2017b. Neural tree indexers for text understanding. In EACL.

Nie, A.; Bennett, E. D.; and Goodman, N. D. 2018. DisSent: Sentence representation learning from explicit discourse relations.

Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In ICML.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP.

Radford, A.; Józefowicz, R.; and Sutskever, I. 2017. Learning to generate reviews and discovering sentiment. CoRR.

Schuster, M., and Paliwal, K. K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.

Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional attention flow for machine comprehension. In ICLR.

Serdyuk, D.; Ke, N. R.; Sordoni, A.; Pal, C.; and Bengio, Y. 2018. Twin networks: Using the future as a regularizer. In ICLR.

Socher, R.; Perelygin, A.; Wu, J. Y.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Subramanian, S.; Trischler, A.; Bengio, Y.; and Pal, C. J. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR.

Tan, C.; Wei, F.; Wang, W.; Lv, W.; and Zhou, M. 2018. Multiway attention networks for modeling sentence pairs. In IJCAI.

Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018. A compare-propagate architecture with alignment factorization for natural language inference.

Tomar, G. S.; Duque, T.; Täckström, O.; Uszkoreit, J.; and Das, D. 2017. Neural paraphrase identification of questions with noisy pretraining. In SWCN@EMNLP.

Voorhees, E. M. 2001. The TREC question answering track. Nat. Lang. Eng.

Wang, Z.; Hamza, W.; and Florian, R. 2017. Bilateral multi-perspective matching for natural language sentences. In IJCAI.

Williams, A.; Nangia, N.; and Bowman, S. R. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Zhou, P.; Qi, Z.; Zheng, S.; Xu, J.; Bao, H.; and Xu, B. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In COLING.

Zilly, J. G.; Srivastava, R. K.; Koutník, J.; and Schmidhuber, J. 2017. Recurrent highway networks. In ICML.