Learning Robust, Transferable Sentence Representations for Text Classification
Under review as a conference paper at ICLR 2019
Wasi Uddin Ahmad∗, Xueying Bai§, Nanyun Peng†, Kai-Wei Chang∗
∗University of California, Los Angeles; §Stony Brook University; †University of Southern California
[email protected], [email protected], [email protected], [email protected]

ABSTRACT
Although deep recurrent neural networks (RNNs) demonstrate strong performance in text classification, training RNN models is often expensive and requires an extensive collection of annotated data which may not be available. To overcome the data limitation issue, existing approaches leverage either pre-trained word embeddings or sentence representations to lift the burden of training RNNs from scratch. In this paper, we show that jointly learning sentence representations from multiple text classification tasks and combining them with pre-trained word-level and sentence-level encoders results in robust sentence representations that are useful for transfer learning. Extensive experiments and analyses using a wide range of transfer and linguistic tasks endorse the effectiveness of our approach.
1 INTRODUCTION
Recent advances in deep neural networks have demonstrated the capability to build highly accurate models by training on vast amounts of data. The efficiency of these techniques comes from their ability to learn an encoder that converts raw inputs into useful continuous feature representations. These successes are primarily credited to the availability of ample resources, such as an extensive collection of training data. However, collecting a sufficient amount of manually annotated data is not always feasible, especially for domains requiring expert annotators.

While human-annotated data is limited, there are abundant resources that can be used to lift the burden of learning representations from scratch and thus reduce the requirement of having a large amount of training data. In the context of modeling natural languages, many success stories showed that learned representations at both the word and sentence levels are transferable to other tasks. These pre-trained representations enable us to model many natural language processing (NLP) tasks such as text classification (Bailey & Chopra, 2018) and named entity recognition (Cherry & Guo, 2015) with only a few thousand examples.

At the word level, pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014), which encode each word into a continuous vector representation, have been widely used in many applications (Seo et al., 2017; Lee et al., 2017; Venugopalan et al., 2017; Teney et al., 2017). A few recent methods propose to construct contextualized word vectors to address the issue that the meaning of a word should be context-dependent. For example, Peters et al. (2018) leveraged a large unannotated corpus to train such contextualized word vectors by feeding word sequences into a deep recurrent neural network (RNN) and generating representations based on the hidden states of the RNN corresponding to the respective words. This results in impressive performance in many NLP applications (Lee et al., 2017; Peters et al., 2017; He et al., 2017).

At the sentence level, Conneau et al. (2017) showed that an LSTM-based sentence encoder (Hochreiter & Schmidhuber, 1997) trained on an annotated corpus for natural language inference (NLI) (Bowman et al., 2015) can capture useful features that are transferable to a wide range of text classification tasks. A few follow-up studies (Subramanian et al., 2018; Logeswaran & Lee, 2018; Cer et al., 2018) extended the approach by leveraging large-scale data and studied how to learn better transferable sentence representations.

However, all the existing approaches considered training word or sentence level representations from scratch. In contrast, we argue that by leveraging pre-trained embeddings/encoders and employing multiple large-scale supervised text classification datasets, we can learn more robust and transferable sentence representations. The primary research question that we address is how to build robust and transferable representations for sentence classification.
The key challenges are twofold: 1) how to transfer only salient features by distinguishing generic information from task-specific information when learning an encoder, and 2) how to combine representations at the word and sentence levels to build strong transferable representations for sentence classification.

To address the first challenge, we propose to leverage multi-task learning (MTL) to jointly train sentence encoders on three large-scale text classification corpora, which cover a variety of domains and two language classification tasks, textual entailment and question paraphrasing. We exploit an MTL architecture that learns to separate generic representations from task-specific representations using adversarial training. While generic representations capture language-specific information, i.e., language structure, syntax, and semantics that are useful uniformly across a variety of language tasks, the task-specific representations encode domain knowledge that is helpful if the source and transfer tasks are homogeneous. Our experimental results show that when the shared and task-specific encoders are combined, they become more effective and applicable to a wide range of transfer tasks.

Besides, we combine our MTL-based sentence encoders with another existing sentence encoder (Subramanian et al., 2018) trained with different learning signals, and contextualized word vectors (Peters et al., 2018), to build a more robust and transferable sentence encoder. We evaluate our encoder on 15 transfer (Conneau & Kiela, 2018) and 10 linguistic probing (Conneau et al., 2018) tasks. Experimental results demonstrate that our proposed sentence encoder better captures linguistic information and provides a significant improvement over existing transfer learning approaches.

2 RELATED WORK
Our work is closely related to sentence representation learning, multi-task learning, and transfer learning, and we briefly review each of these areas in this section.

• Sentence Representation Learning.
Training neural networks to form useful sentence representations has become a core component of many machine learning models. Many approaches have been proposed to learn distributional sentence representations that capture syntactic and semantic regularities. These approaches range from models that compose word embeddings (Le & Mikolov, 2014; Arora et al., 2017; Wieting et al., 2016) to models with complex network architectures (Zhao et al., 2015; Wang & Jiang, 2016; Liu et al., 2016c; Lin et al., 2017). Unsupervised approaches have also been proposed in the literature, utilizing a large collection of unlabeled text corpora to learn distributional sentence representations. For example, Kiros et al. (2015) revised the skip-gram model (Mikolov et al., 2013) to learn a generic sentence encoder, called SkipThought, which was further improved by using layer normalization (Ba et al., 2016). Among other closely related works, the technique proposed by Hill et al. (2016) fell short of SkipThought, while Logeswaran & Lee (2018) showed improvement over skip-thought vectors.

Unlike word embeddings, sentence representations learned in an unsupervised fashion lack reasoning about semantic relationships between sentences. To this end, Conneau et al. (2017) proposed to train a universal sentence encoder in the form of a bidirectional LSTM using supervised natural language inference data, outperforming unsupervised approaches like SkipThought. Subramanian et al. (2018) proposed to build a general-purpose sentence encoder by learning from a joint objective of classification, machine translation, parse tree generation, and unsupervised skip-thought tasks. Compared to their approach, we propose to utilize multiple text classification datasets by leveraging a multi-task learning approach and to combine the resulting encoders with existing contextualized word vectors (McCann et al., 2017; Peters et al., 2018) to learn robust and transferable sentence representations. Recent works (Cer et al., 2018; Perone et al., 2018) explored RNN-free sentence encoders and evaluated sentence representation learning methods using a variety of downstream and linguistic tasks.

• Multi-task Learning (MTL).
Multi-task learning has been successfully used in a wide range of natural language processing applications, including text classification (Liu et al., 2017), machine translation (Luong et al., 2016), sequence labeling (Rei, 2017), sequence tagging (Peng & Dredze, 2017), dependency parsing (Peng et al., 2017), etc. Recent works (Liu et al., 2016b; Zhang et al., 2017b; Liu et al., 2016a) proposed multi-task learning architectures with different methods of sharing information across the participant tasks. To facilitate scaling and transferring when a large number of tasks are involved, Zhang et al. (2017a) proposed to embed labels by considering semantic correlations among tasks. To investigate how transferable end-to-end neural network architectures are for NLP applications, Mou et al. (2016) proposed to use multi-task learning on sentence classification tasks. In contrast to these prior works, we aim to learn via multi-task learning a universal sentence encoder that is transferable to a wide range of heterogeneous tasks.

• Transfer Learning.
Transfer learning stores the knowledge gained from solving source tasks (which usually have abundant annotated data) and applies it to other tasks (which usually suffer from insufficient annotated data to train complex models) to combat the inadequate supervision problem. It has become prevalent in many computer vision applications (Sharif Razavian et al., 2014; Antol et al., 2015) where image features were trained on ImageNet (Deng et al., 2009), and in applications where word vectors (Pennington et al., 2014; Mikolov et al., 2013) were trained on large unlabeled corpora. Despite the benefits of using pre-trained word embeddings, many NLP applications still suffer from the lack of high-quality generic sentence representations that can help unseen tasks. In this work, we combine sentence representations learned using MTL with contextualized word vectors to obtain more robust sentence representations that transfer better.
3 SENTENCE REPRESENTATIONS LEARNING
Our goal is to leverage available text corpora and existing sentence and word encoders to build a universal sentence encoder. In the following, we first define the sentence encoder and then describe a multi-task learning approach that learns sentence representations jointly on multiple text classification tasks. Then we discuss how to combine the learned sentence representations with existing sentence and contextualized word vectors.

3.1 SENTENCE ENCODER
A typical text classification model consists of two parts: a representation learning component, also known as an encoder, that converts input text sequences into fixed-size vectors, and a classifier component that takes the vector representations and predicts the final class labels. The encoder is usually realized by a high-complexity neural network architecture and requires a large amount of data to train, as opposed to the classifier, which is generally simple (e.g., a linear model). When enough training examples are provided, the encoder and the classifier can be trained jointly from scratch in an end-to-end fashion. However, when data is insufficient, this approach is infeasible. Instead, we can pre-train the encoder on other tasks (a.k.a. source tasks) and transfer the learned encoder to the target task. In this case, we only require a few labeled examples to train the low-complexity classifier on top of the pre-trained encoder. We discuss how to build the pre-trained encoder below.

We follow Conneau et al. (2017) to build a transferable encoder based on a one-layer bidirectional LSTM with max pooling (BiLSTM-max). Formally, given a sentence with T words, [w_1, w_2, ..., w_T], the encoder first runs two LSTM models over the input text from both directions:

    \overrightarrow{h}_t = \mathrm{LSTM}(\overrightarrow{h}_{t-1}, w_t), \qquad \overleftarrow{h}_t = \mathrm{LSTM}(\overleftarrow{h}_{t+1}, w_t),    (1)

and h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t] \in \mathbb{R}^d is the t-th hidden vector of the BiLSTM, where d is the dimensionality of the LSTM hidden units. To form a fixed-size vector representation of variable-length sentences, the maximum value is selected over each dimension of the hidden units:

    s_j = \max_{t \in [1, ..., T]} h_{j,t}, \qquad j = 1, ..., d,    (2)

where s_j is the j-th element of the sentence embedding s.

For some transfer tasks (e.g., textual entailment and similarity measuring), the goal is to predict the relationship between two sentences. Therefore, the input involves two sentences (s_1, s_2). We generate the representation of input instances as [s_1, s_2, s_1 - s_2, s_1 \odot s_2], where \odot denotes element-wise multiplication and [\cdot, \cdot] denotes vector concatenation.
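To make the encoder concrete, the following is a minimal sketch of a BiLSTM-max encoder and the pair-feature construction described above. It is an illustration rather than the authors' implementation; the use of PyTorch, the default hidden size, and all names are assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """Sketch of a one-layer BiLSTM-max sentence encoder (Eq. 1-2)."""

    def __init__(self, word_dim=300, hidden_dim=1024):  # dimensions are placeholders
        super().__init__()
        # One-layer bidirectional LSTM; forward and backward states are concatenated.
        self.lstm = nn.LSTM(word_dim, hidden_dim, num_layers=1,
                            bidirectional=True, batch_first=True)

    def forward(self, words):
        # words: (batch, T, word_dim) pre-trained word embeddings
        h, _ = self.lstm(words)     # h: (batch, T, 2 * hidden_dim)
        s, _ = h.max(dim=1)         # max over time for each dimension (Eq. 2)
        return s                    # fixed-size sentence embedding

def pair_features(s1, s2):
    """Input representation for sentence-pair tasks: [s1, s2, s1 - s2, s1 * s2]."""
    return torch.cat([s1, s2, s1 - s2, s1 * s2], dim=-1)
```

For example, `BiLSTMMaxEncoder()(embedded_batch)` maps a batch of pre-embedded sentences to one vector per sentence, and `pair_features` builds the input fed to a pair classifier.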
Multi-task learning. Multi-task learning has been shown to be effective in many text classification tasks. However, its effectiveness in learning transferable sentence representations is comparably less studied. In this paper, we investigate the utility offered by several large-scale text classification tasks. We adopt a shared-private architecture in which all tasks share one common BiLSTM-max encoder and each task k additionally has its own private BiLSTM-max encoder, with a task-specific classifier on top. (In this setting, the classifier is the last layer of the network architecture.) Given an input sentence of task k, its shared representation s_s^k and private representation s_p^k can be computed using Eq. (1)–(2), and they are concatenated to construct the sentence embedding: s^k = [s_s^k, s_p^k].
Adversarial training. Ideally, we want the private encoders to learn only task-specific features and the shared encoder to learn generic features. To achieve this goal, we adopt the adversarial training strategy proposed by Liu et al. (2017) and introduce a discriminator on top of the shared BiLSTM-max sentence encoder. The goal of the discriminator D is to identify which task an encoded sentence s^k comes from, and the adversarial training requires the shared sentence encoder to generate representations that can "fool" the discriminator. In this way, the shared encoder is forced not to carry task-related information. The discriminator is defined as

    D(s^k) = \mathrm{softmax}(W s^k + b),

where W and b are the parameters of the discriminator. Optimizing the adversarial loss,

    L_{adv} = \min_{\theta_E} \Big( \max_{\theta_D} \Big( \sum_{k=1}^{K} \sum_{i=1}^{N_k} d_i^k \log\big[D(E(s))\big] \Big) \Big),

has two competing goals: the discriminator tries to maximize the classification accuracy (inside the parentheses), and the sentence encoder tries to confuse it (and thus minimize the classification accuracy). E and D represent the shared sentence encoder and the discriminator, respectively, and \theta_E and \theta_D are the model parameters of E and D. d_i^k denotes the ground-truth label indicating the type of the current task. To encourage the shared and private encoders to capture different aspects of the sentences, the following term is added:

    L_{diff} = \sum_{k=1}^{K} \big\| H_s^{k\top} H_p^k \big\|_F^2,

where \|\cdot\|_F^2 is the squared Frobenius norm. Here, H_s^k and H_p^k are matrices whose rows are the hidden vectors (see Eq. (1)) generated by the shared and private encoders given an input sentence of task k. The final loss function is a weighted combination of three parts:

    L = L_{multi\text{-}task} + \beta L_{adv} + \gamma L_{diff},

where \beta and \gamma are hyper-parameters, and L_{multi-task} is a simple summation of the cross-entropy losses of the individual tasks. We tune \beta and \gamma over a small set of candidate values and present the best \beta and \gamma values for the different multi-task learning settings in Table 4 (provided in the appendix).
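As a sketch of how the three loss terms fit together, the code below computes L_multi-task, L_adv, and L_diff for one batch per task. The paper does not specify how the min-max game is optimized, so the gradient-reversal trick used here, and the assumed interface in which each encoder returns both its hidden states and a pooled embedding, are illustrative choices rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad  # reversed gradients push the shared encoder to fool D

def mtl_losses(shared_enc, private_encs, discriminator, task_heads,
               batches, beta, gamma):
    """batches: dict task_id -> (word_embeddings, labels).
    Each encoder is assumed to return (hidden_states, pooled_embedding)."""
    l_task = l_adv = l_diff = 0.0
    for k, (x, y) in batches.items():
        h_s, s_s = shared_enc(x)              # shared hidden states / embedding
        h_p, s_p = private_encs[k](x)         # private hidden states / embedding
        s_k = torch.cat([s_s, s_p], dim=-1)   # s^k = [s_s^k, s_p^k]
        # Cross-entropy of task k's classifier (L_multi-task term).
        l_task = l_task + F.cross_entropy(task_heads[k](s_k), y)
        # Adversarial term: the discriminator predicts the task id from the
        # shared embedding; gradient reversal implements the min-max game.
        task_logits = discriminator(GradReverse.apply(s_s))
        task_ids = torch.full((x.size(0),), k, dtype=torch.long)
        l_adv = l_adv + F.cross_entropy(task_logits, task_ids)
        # Orthogonality term: squared Frobenius norm of H_s^k^T H_p^k (L_diff).
        l_diff = l_diff + torch.bmm(h_s.transpose(1, 2), h_p).pow(2).sum()
    return l_task + beta * l_adv + gamma * l_diff
```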
3.2 UNIFYING SENTENCE EMBEDDINGS AND CONTEXTUALIZED VECTORS

Existing studies (Subramanian et al., 2018; Peters et al., 2018) leverage large amounts of data to train sentence or word representations. However, it is sometimes impractical to assume access to such large-scale data or computational resources. In these circumstances, we can combine existing encoders in a post-processing step to leverage these humongous data sources. In this work, we show that by combining our MTL-based sentence encoder with an existing sentence encoder (Subramanian et al., 2018) and a contextualized word representation (Peters et al., 2018) encoder, we achieve state-of-the-art transfer performance on a wide variety of text classification tasks.

The contextualized word vectors refer to the hidden states generated by a BiLSTM (as in Eq. (1)) given a sequence of words (a sentence) as input. To form a fixed-size sentence representation from the contextual word vectors, we apply average pooling (we tried max pooling, but it consistently performed worse than average pooling). Although Peters et al. (2018) suggested learning the weights of the contextual word vectors, we do not learn any additional weights, because we combine the encoders in a post-processing step without additional training.
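A small sketch of this post-processing combination is given below: contextual word states are average-pooled into one vector per sentence and concatenated with the pre-computed sentence embeddings. The function names, and the assumption that Sent2vec, GenSen, and ELMo vectors have already been computed, are ours rather than the paper's.

```python
import torch

def average_pool(contextual_states, lengths):
    """contextual_states: (batch, T, dim) hidden states; lengths: (batch,) true lengths."""
    mask = (torch.arange(contextual_states.size(1))[None, :] < lengths[:, None]).float()
    summed = (contextual_states * mask.unsqueeze(-1)).sum(dim=1)
    return summed / lengths.clamp(min=1).float().unsqueeze(-1)

def combined_representation(sent2vec_emb, gensen_emb, elmo_states, lengths):
    """Concatenate MTL sentence embeddings, GenSen embeddings, and pooled ELMo vectors."""
    elmo_sent = average_pool(elmo_states, lengths)
    return torch.cat([sent2vec_emb, gensen_emb, elmo_sent], dim=-1)
```

Because no parameters are introduced, the combined representation can be built offline for any downstream dataset and fed directly to a simple classifier.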
4 EXPERIMENTS
In this section, we first show that our proposed sentence encoder can achieve state-of-the-art transfer performance. Then we demonstrate that combining the multi-task trained sentence representations with other sentence and word vectors yields better universal sentence representations. To better understand our results, we provide a thorough ablation study and confirm that our combined encoder can learn robust and transferable sentence representations.

4.1 EXPERIMENTAL SETUP
We use three large-scale textual entailment and paraphrasing tasks to train sentence encoders with multi-task learning, and combine these with an existing sentence encoder and contextualized word embeddings to compose the final sentence representations. We test the generalizability of the sentence embeddings on fifteen transfer tasks. A detailed description of the source and transfer tasks is presented in Table 5 in the appendix. In addition, we perform a quantitative analysis using ten probing tasks to show what linguistic information is captured by our proposed sentence encoders.
Source tasks.
The first two source tasks are natural language inference (NLI), which determines whether a natural language hypothesis can be inferred from a natural language premise. We consider SNLI (Bowman et al., 2015) and Multi-Genre NLI (MNLI) (Williams et al., 2017), which consist of sentence pairs manually labeled with one of three categories: entailment, contradiction, and neutral. Following Conneau et al. (2017), we also conduct experiments that combine the SNLI and MNLI datasets, which we denote as AllNLI. The third source task is Quora question paraphrase (QQP) detection, based on a dataset of 404k question pairs. We use the same Quora dataset split as in Wang et al. (2017). We present and discuss the source task performances in Appendix A.

Transfer and probing tasks.
We evaluate the sentence encoders on fifteen transfer and ten probing tasks using the SentEval toolkit (Conneau & Kiela, 2018). Among the transfer tasks, six are text classification tasks covering sentiment analysis (MR, SST), question type (TREC), product reviews (CR), subjectivity/objectivity (SUBJ), and opinion polarity (MPQA). The rest of the transfer tasks (SICK-E, SICK-R, MRPC, STSB, and STS12–16) are semantic relatedness and textual similarity tasks. We test whether our sentence encoders capture linguistic (surface, syntactic, and semantic) information using the ten probing tasks suggested in Conneau et al. (2018).
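The transfer evaluation keeps the sentence encoder frozen and fits only a simple classifier on top of its embeddings. The sketch below illustrates that protocol with scikit-learn; it is not the SentEval API, and the `embed` callable is a placeholder for any of the encoders discussed in this paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_transfer(embed, train_sents, train_labels, test_sents, test_labels):
    """embed: callable mapping a list of sentences to an (n, d) numpy array of embeddings."""
    clf = LogisticRegression(max_iter=1000)          # low-complexity classifier on frozen features
    clf.fit(embed(train_sents), np.asarray(train_labels))
    return clf.score(embed(test_sents), np.asarray(test_labels))
```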
Hyper-parameter tuning.
We carefully tune the hyper-parameters and report test performance with the best parameters. We train with SGD with weight decay, starting from a fixed initial learning rate. At each epoch, we divide the learning rate by a constant factor if the development accuracy decreases. We use mini-batch training, and training is stopped when the learning rate falls below a small threshold. For the task-specific classifier, we use a multi-layer perceptron with a single hidden layer. We consider several values starting from 256 for the number of hidden units in the BiLSTM and pick the setting that yields the best performance. We use GloVe word vectors (Pennington et al., 2014) trained on billions of tokens as fixed word embeddings.
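The following sketch illustrates the training schedule described above (SGD with weight decay, dividing the learning rate when development accuracy drops, and stopping once the learning rate falls below a threshold). All numeric values are placeholders, not the paper's settings.

```python
import torch

def train(model, train_loader, evaluate_dev, lr=0.1, weight_decay=1e-5,
          decay_factor=5.0, min_lr=1e-5, max_epochs=50):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    best_dev = 0.0
    for epoch in range(max_epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
        dev_acc = evaluate_dev(model)
        if dev_acc < best_dev:                 # accuracy decreased: shrink the learning rate
            lr /= decay_factor
            for group in optimizer.param_groups:
                group['lr'] = lr
        best_dev = max(best_dev, dev_acc)
        if lr < min_lr:                        # stop when the learning rate gets too small
            break
```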
4.2 EVALUATION ON TRANSFER TASKS

We benchmark the performance of our proposed encoders on fifteen transfer tasks against several baselines, including unsupervised models, sentence encoders, contextualized word encoders, and supervised models. Tables 1 and 2 summarize the results. From block 4 of Table 1, we see that our combined encoders achieve the best performance on 5/8 transfer tasks, which demonstrates the efficacy of our proposed unified sentence encoder. We further analyze the performance of our proposed combined encoders from two aspects: the improvement achieved by multi-task learned sentence encoders and the efficiency of the combination.

Model Type                                  MR    CR    SUBJ  MPQA  SST   TREC  SICK-E  MRPC
1. Unsupervised sentence representation learning
1.1. FastSent (Hill et al., 2016)           70.8  78.4  88.7  80.6  -     76.8  -       72.2/80.3
1.2. SkipThought (Kiros et al., 2015)       76.5  80.1  93.6  87.1  82.0  92.2  82.3    73.0/82.0
1.3. USE (Transformer) (Cer et al., 2018)   81.4  87.4  93.9  87.0  85.4  92.5  -       -
1.4. Byte mLSTM (Radford et al., 2017)
Table 1: Evaluation of sentence representations on a set of 8 tasks using a logistic regression classifier. "SP" and "ASP" in rows 2.3 and 2.4 refer to the shared-private and adversarial shared-private multi-task learning models. Values indicate accuracy (accuracy/F1 for MRPC) on the test sets, and bold-faced values denote the best transfer performance. We employ an averaging bag-of-words technique to form sentence embeddings, using features from all three layers of ELMo.
Model Type                        SICK-R  STSB   Semantic Textual Similarity (STS)
                                                 2012   2013   2014   2015   2016
InferSent (Conneau et al., 2017)  0.884   0.756
Table 2: Transfer evaluation on the semantic relatedness and textual similarity tasks. "SP" and "ASP" in block 3 refer to the shared-private and adversarial shared-private multi-task learning models. In block 4, we use the ASP setting for Sent2vec. We use features from the top layer of ELMo to produce sentence embeddings. Values indicate the Pearson correlation coefficient on the test sets, and bold-faced values denote the best performance across all the models.

From block 2 of Table 1, we see that the MTL-based sentence encoder, Sent2vec, outperforms the single-task sentence encoder InferSent on 7 out of 8 transfer tasks (comparing row 2.1 with rows 2.3–2.4). Sent2vec also provides competitive performance on the other 7 tasks, as shown in Table 2. The results demonstrate that learning from multiple tasks helps capture more generalizable features that are suitable for transfer learning. In addition, when using adversarial training, we observe improvements on 9 out of 15 transfer tasks compared to the non-adversarial setting (see Tables 1 and 2). To investigate the advantages of adversarial training, we provide a detailed comparison between the shared and private encoders with and without adversarial training in Appendix C.

Combining sentence encoders and contextualized word vectors improves the transfer learning performance significantly (comparing row 4.3 to rows 2.2, 2.4, and 3.2 in Table 1). First, we see from Table 2 that when there are no training examples available for a task (the STS12–16 and STSB tasks), sentence embeddings (blocks 1 and 3) perform better than contextualized vectors (block 2). The result demonstrates the necessity of learning generic sentence representations so that they can be used directly (without training) in transfer tasks. Although Sent2vec outperforms ELMo on 12/15 tasks (comparing row 2.4 to 3.2 in Table 1 and row ELMo to Sent2vec (ASP) in Table 2), it falls short of GenSen on most of the transfer tasks, since GenSen is trained using 124M sentence pairs while Sent2vec is trained on 1.4M pairs. Because training an encoder on massive datasets requires large computational resources, the direct utilization of GenSen through a post-processing step can bring the benefits of large resources in a computationally efficient way. Second, contextualized vectors (ELMo) perform better on specific tasks like SST, and contextualized vectors have been shown capable of capturing specific linguistic properties (more details in the analysis part). These observations indicate that utilizing contextualized vectors may help learn better sentence representations. As a result, in block 4 of Table 1, when Sent2vec, GenSen, and ELMo are combined, we observe a significant improvement on 4 out of 8 tasks (MR, CR, SST, and SICK-E) over the individual components and competitive performance on the other tasks, which confirms the efficiency of the combination.

Figure 1: (a) Weights learned by the transfer tasks for contextualized vectors (ELMo, CoVe) and sentence vectors, which refer to a concatenation of all the private and shared encoders of the adversarial shared-private multi-task model. (b) Weights learned by the transfer tasks for the private (task-specific) and shared encoders of the adversarial shared-private multi-task model.
4.3 ANALYSIS
Impact of sentence embeddings and contextualized word vectors.
To analyze the contributions of our proposed sentence encoder and of the sentence representations derived from contextualized word vectors (CoVe, ELMo) in a combined encoder during transfer learning, we design a classifier with a different network architecture. The classifier first generates predicted class probabilities based on a softmax layer using each sentence representation as input. The predictions are then combined by a pooling layer with a weight parameter for each encoder. By investigating the learned weights in the pooling layer, we can understand which encoder contributes the most. The learned weights are shown in Figure 1(a). Although contextualized vectors have higher weights in 7/10 tasks, sentence vectors make non-negligible contributions to every task and play dominant roles in tasks like SST and SICK-R. As we have shown in Table 1, the combined encoder performs better than the individual encoders (comparing row 4.1 to rows 2.4, 3.1, and 3.2), indicating that the contributions of the sentence and contextual word encoders are quite complementary.
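A minimal sketch of this analysis classifier is given below: each encoder's representation gets its own softmax head, and the per-encoder class probabilities are mixed with learned weights that can be inspected after training. The architecture details beyond what the paragraph above describes (e.g., normalizing the mixing weights with a softmax) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedPoolClassifier(nn.Module):
    def __init__(self, encoder_dims, num_classes):
        super().__init__()
        # One softmax head per encoder, plus one learnable mixing weight per encoder.
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in encoder_dims])
        self.weights = nn.Parameter(torch.zeros(len(encoder_dims)))

    def forward(self, representations):
        # representations: list of (batch, dim_i) tensors, one per encoder
        probs = [F.softmax(head(rep), dim=-1)
                 for head, rep in zip(self.heads, representations)]
        alpha = F.softmax(self.weights, dim=0)   # inspect alpha to see each encoder's share
        return sum(a * p for a, p in zip(alpha, probs))
```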
Impact of source tasks on transfer tasks. To understand the influence of the source tasks on the transfer tasks, we conduct a similar analysis as in Figure 1(a) and show the learned weights assigned to the private (task-specific) and shared (generic) encoders of the ASP model in Figure 1(b). In general, for target tasks that are similar to the source tasks, the private encoders get higher weights; otherwise, the shared encoder does better. The combination of the shared and private encoders enables the transfer task to choose the best mixture, and thus achieves the best results. Most of the transfer tasks assign large weights to the SNLI task-specific encoder and low weights to the QQP task-specific encoder, which explains why representations learned on SNLI are especially effective for transfer tasks, as noted in Conneau et al. (2017). Besides, with adversarial training enforced, the shared encoder gets a lower weight from most of the transfer tasks than with non-adversarial training (see Figure 3 in the appendix), demonstrating the efficiency of adversarial training in separating generic and task-specific representations.
Probing for linguistic properties.
To understand what linguistic properties are captured by our proposed sentence encoders and the unified encoders, we conduct experiments on the ten probing tasks proposed in Conneau et al. (2018). Results are provided in Table 3.
Model Type  SentLen  WC    TreeDepth  TopConst  BShift  Tense  SubjNum  ObjNum  SOMO  CoordInv
InferSent   84.0     90.5  38.6       47.3      62.3    87.1   85.9     81.5    59.8  68.5
GenSen      93.9
Table 3: Probing task accuracies with an MLP as the classifier. For ELMo, the same bag-of-words averaging technique is employed as for the downstream transfer tasks. When ELMo is combined with Sent2vec and GenSen, features only from the top layer are used to fit in a single GPU (Titan X). Bold-faced values denote the best results across the board.

Figure 2: Comparing test performance of supervised learning (using BCN), word-level (ELMo) and sentence-level representations (Sent2vec, GenSen), and their combination (ComboRep refers to Sent2vec + GenSen + ELMo) on the SST and SICK-E tasks as the training dataset size is varied.

Sentence representations show superiority on hard semantic tasks like SOMO and CoordInv, while contextualized word vectors perform better at capturing surface and syntactic properties. Moreover, the MTL encoders (Sent2vec) outperform both contextualized word vectors (ELMo) and the sentence encoder trained on a single task (InferSent) on the hard semantic tasks. The unified sentence encoder that combines both the sentence and contextual word representations captures most of the linguistic properties.
Impact of training data.
Finally, we study the sample efficiency of the sentence and contextualized word encoders, as well as a strong supervised learning baseline, BCN (McCann et al., 2017), trained from scratch on the SST and SICK-E tasks. The results are shown in Figure 2. We see that the transfer setting has better sample efficiency, especially when training data is limited. Besides, our proposed sentence encoder Sent2vec outperforms the GenSen encoder on the SST task but falls short on the SICK-E task. We show that the combined sentence encoder has higher sample efficiency (it can be trained with fewer labeled examples) than the individual ones. We compare single- and multi-task sentence encoders by varying the dataset size and present the results in Appendix B.

5 CONCLUSION
In this paper, we propose to leverage available large-scale text classification datasets and existing word and sentence encoding models to learn a universal sentence encoder. We utilize multi-task learning (MTL) to train sentence encoders that learn both generic and task-specific sentence representations from three heterogeneous text classification corpora. Experiments show that the MTL-trained representations outperform sentence encoders trained on a single task on a variety of transfer sentence classification tasks. We then further combine these sentence encoders with an existing multi-task pre-trained sentence encoder (trained with a different set of tasks) and a contextualized word representation learner. Our proposed unified sentence encoder yields significant improvements over the state-of-the-art sentence representations on transfer learning tasks. Extensive comparisons and thorough analysis using 15 transfer datasets and 10 linguistic probing tasks endorse the robustness of our proposed universal sentence encoder.

REFERENCES
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations, 2017.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Katherine Bailey and Sunny Chopra. Few-shot text classification with pre-trained word embeddings and a human in the loop. arXiv preprint arXiv:1804.02063, 2018.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.

Colin Cherry and Hongyu Guo. The unreasonable effectiveness of word representations for twitter named entity recognition. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 735–745, 2015.

Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 248–255. IEEE, 2009.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. Deep semantic role labeling: What works and what's next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 473–483, 2017.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Yangfeng Ji and Jacob Eisenstein. Discriminative improvements to distributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 891–896, 2013.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294–3302, 2015.

Alice Lai and Julia Hockenmaier. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 329–334, 2014.

Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1188–1196, 2014.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference resolution. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. International Conference on Learning Representations, 2017.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Deep multi-task learning with shared memory. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016a.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016b.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Adversarial multi-task learning for text classification. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.

Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. Learning natural language inference using bidirectional LSTM model and inner-attention. arXiv preprint arXiv:1605.09090, 2016c.

Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. International Conference on Learning Representations, 2018.

Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. In International Conference on Learning Representations, 2016.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6297–6308, 2017.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. How transferable are neural networks in NLP applications? Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.

Hao Peng, Sam Thomson, and Noah A Smith. Deep multitask learning for semantic dependency parsing. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.

Nanyun Peng and Mark Dredze. Multi-task domain adaptation for sequence tagging. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 91–100, 2017.

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.

Christian S Perone, Roberto Silveira, and Thomas S Paula. Evaluation of sentence embeddings in downstream and linguistic probing tasks. arXiv preprint arXiv:1806.06259, 2018.

Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.

Marek Rei. Semi-supervised multitask learning for sequence labeling. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. International Conference on Learning Representations, 2017.

Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813, 2014.

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations, 2018.

Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, and Kate Saenko. Captioning images with diverse objects. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Shuohang Wang and Jing Jiang. Learning natural language inference with LSTM. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.

Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic sentence embeddings. In International Conference on Learning Representations, 2016.

Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.

Honglun Zhang, Liqiang Xiao, Wenqing Chen, Yongkun Wang, and Yaohui Jin. Multi-task label embedding for text classification. arXiv preprint arXiv:1710.07210, 2017a.

Honglun Zhang, Liqiang Xiao, Yongkun Wang, and Yaohui Jin. A generalized recurrent neural architecture for text classification with multi-task learning. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017b.

Han Zhao, Zhengdong Lu, and Pascal Poupart. Self-adaptive hierarchical sentence model. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pp. 4069–4076, 2015.

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.
Figure 3: Weights learned by the transfer tasks for the private (task-specific) and shared encoders of the shared-private multi-task model (a) with and (b) without adversarial training.
Tasks                 β      γ
QQP and SNLI          0.01   0.05
SNLI and MNLI         0.005  0.001
QQP and AllNLI        0.01   0.05
QQP, SNLI and MNLI    0.005  0.001
Table 4: Best β and γ values for the adversarial shared-private model on different sets of tasks.

Name   N    V      Task            C
Binary and multi-class classification tasks
MR     11k  20.3k  sentiment       2
CR     4k   5.7k   product review  2
SUBJ   10k  22.6k  subj/obj        2
MPQA   11k  6.2k   opinion         2
SST    70k  17.5k  sentiment       2
TREC   6k   9.7k   question-type   6
Recognizing textual entailment tasks
SNLI†

Table 5: Statistics of the datasets for multi-task learning and the transfer tasks. N is the number of samples, V is the vocabulary size, and C is the number of classes or score range. † denotes the datasets that are used in multi-task learning.

A EVALUATION ON SOURCE TASKS
In this section, we discuss the performance of the shared-private multi-task learning (MTL) frameworks on different combinations of the QQP, SNLI, and MNLI datasets as the source tasks. We concatenate the representations generated by the shared and private encoders to form sentence embeddings. The results are presented in Table 6. We compare the performance of MTL with the models trained on single tasks as in Conneau et al. (2017). Table 6 shows that learning from multiple tasks performs better than learning from a single task. However, to our surprise, adversarial training does not always excel on the source tasks, but we show in the transfer evaluation that adversarial training boosts the transfer learning performance.
Model Type                    QQP           SNLI          MNLI
                              dev   test    dev   test    dev        test
Learning from in-domain single task
(Conneau et al., 2017)        87.1  86.7    84.7  84.5    70.2/70.8  70.8/69.8
Learning from 2 datasets and 2 tasks (SNLI and MNLI)
Shared-Private                -     -       85.0
Adversarial Shared-Private    -     -       84.9  84.9    70.9/71.4  71.0/70.0
Learning from 2 datasets and 2 tasks (QQP and SNLI)
Shared-Private                87.0  86.8    84.8  84.7    -          -
Adversarial Shared-Private    87.5
Table 6: Validation and test accuracy on the source tasks obtained through various multi-task learning architectures. Bold-faced values indicate the best performance across all the models.
Model Type                  MR    CR    SUBJ  MPQA  SST   TREC  SICK-R  SICK-E  MRPC       STS14
Sentence representation learning from a single task
BiLSTM-Max (on SNLI)        80.1  85.3  92.6  89.1  83.6  89.2  0.885   86.0    75.2/82.4  .66/.64
BiLSTM-Max (on QQP)         79.2  84.6  92.6  88.8  83.5  88.0  0.861   82.4    74.8/82.8  .62/.60
BiLSTM-Max (on MNLI)        81.2  85.8  93.1  89.5  83.4  88.8  0.863   84.7    75.9/83.1  .66/.63
BiLSTM-Max (on AllNLI)      80.9  86.3  93.2  89.2  83.3  88.8  0.887   86.7    76.4/83.4  .69/.66
Sentence representation learning from two tasks (QQP and SNLI)
Shared-Private              80.5  84.8  93.4  89.1  84.0  90.2  0.881   86.1    75.1/83.2  .65/.62
Adversarial Shared-Private  80.9  85.4  93.4  89.2  83.6  90.8  0.886   86.9    76.5/82.9  .68/.65
Sentence representation learning from two tasks (SNLI and MNLI)
Shared-Private              81.7  86.4  93.7  89.6  84.8  89.2  0.885   86.7    76.3/82.7  .67/.64
Adversarial Shared-Private  81.2  86.0  93.0  89.3  83.7  90.4  0.886                      .68/.65
Table 7: Transfer test results for various single-task and multi-task learning architectures trained on combinations of the QQP, SNLI, and MNLI datasets. Bold-faced values indicate the best performance among all models in this table.
Data Size           MR  CR  SUBJ  MPQA  SST  TREC  SICK-R  SICK-E  STS14  MRPC
Same for MTL, STL   +   +   +     −     +    +     −       −       +      −
Larger for MTL      +   +   +     −     +    +     +       +       +      +

Table 8: The accuracy differences between MTL and STL when training with different sizes of data. For the same data size, MTL and STL are trained on an equal amount of annotated data. For the larger data size for MTL, MTL is trained on two datasets while STL is trained on one dataset (less data).
B SINGLE-TASK VS. MULTI-TASK LEARNING WITH VARYING TRAINING DATA
When we compare MTL to STL for transfer learning, one fundamental question arises: does the improvement in transfer learning via MTL come only from having more annotated data? Comparing the performance of AllNLI in the single-task setting and {SNLI, MNLI} in the multi-task setting in Table 7, we observe significant improvements on 7/10 tasks. In both settings, the amount of training data is the same. To further verify the hypothesis that the improvements in transfer learning do not come solely from having more annotated data, we design an experiment that samples an equal amount of data (225k training examples) from SNLI and QQP to match the size of the full SNLI dataset. We found a 0.26% average improvement on the transfer tasks compared to single-task learning (STL) on the SNLI dataset. With the full SNLI and QQP datasets, we observe a larger improvement (0.69% on average) on the transfer tasks compared to STL on the SNLI dataset. The first row of Table 8 shows that MTL is beneficial in this setting, and the second row demonstrates that with additional data, MTL achieves larger improvements.
C PRIVATE ENCODERS VS. SHARED ENCODER
To verify our hypothesis that the shared encoder learns generic features that are more suitable for transfer learning, and that with adversarial training enforced the shared encoder becomes more effective, we provide a detailed comparison of the private and shared encoders in Table 9. However, by concatenating the shared and task-specific representations, we can achieve better transfer performance, which indicates that transfer tasks also benefit from task-specific features, especially when the source and transfer tasks are homogeneous (more details are provided in the ablation analysis).
Model Type                 MR    CR    SUBJ  MPQA  SST   TREC  SICK-R  SICK-E  MRPC       STS14
Shared-Private (trained on SNLI and MNLI)
Private Encoder (on SNLI)  79.5  84.0  92.7  89.1  82.0  87.8  0.881   84.8    75.0/82.7  .65/.63
Private Encoder (on MNLI)  80.6  84.6  92.7  89.2  82.9  88.0  0.853   83.8    75.1/82.9  .60/.58
Shared Encoder             80.9  86.4  92.9  89.5  84.0  88.0  0.879   84.8    75.8/83.1  .69/.65
Combined Encoder           81.7  86.4  93.7                                               .70/.67

Shared Encoder             80.5  84.8  92.6  89.2  83.2  82.6  0.876   84.8    75.5/83.1  .57/.56
Combined Encoder           81.9  85.9  93.0                                               .66/.64