Improving Robustness and Generality of NLP Models Using Disentangled Representations
Jiawei Wu♣, Xiaoya Li♣, Xiang Ao♦, Yuxian Meng♣, Fei Wu♠ and Jiwei Li♠♣
♠ Department of Computer Science and Technology, Zhejiang University
♦ Key Lab of Intelligent Information Processing of Chinese Academy of Sciences
♣ ShannonAI
{jiawei_wu, xiaoya_li, yuxian_meng, jiwei_li}@[email protected], [email protected]
Abstract
Supervised neural networks, which first map an input x to a single representation z, and then map z to the output label y, have achieved remarkable success in a wide range of natural language processing (NLP) tasks. Despite their success, neural models lack both robustness and generality: small perturbations to inputs can result in completely different outputs, and the performance of a model trained on one domain drops drastically when tested on another domain.

In this paper, we present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning. Instead of mapping x to a single representation z, the proposed strategy maps x to a set of representations {z_1, z_2, ..., z_K} while forcing them to be disentangled. These representations are then mapped to different logits l_1, ..., l_K, the ensemble of which is used to make the final prediction y. We propose different methods to incorporate this idea into currently widely-used models, including adding an L2 regularizer on the z_i's or adding Total Correlation (TC) under the framework of the variational information bottleneck (VIB). We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.

1 Introduction

Supervised neural networks have achieved remarkable success in a wide range of NLP tasks, such as language modeling (Xie et al., 2017; Devlin et al., 2018a; Liu et al., 2019; Joshi et al., 2020; Meng et al., 2019b), machine reading comprehension (Seo et al., 2016; Yu et al., 2018), and machine translation (Sutskever et al., 2014; Vaswani et al., 2017b; Meng et al., 2019a). Despite this success, neural models lack both robustness and generality and are extremely fragile: the output label can be changed by a minor change of a single pixel in an image (Szegedy et al., 2013; Goodfellow et al., 2014b; Nguyen et al., 2015; Papernot et al., 2017; Yuan et al., 2019) or a single token in a document (Li et al., 2016; Papernot et al., 2016; Jia and Liang, 2017; Zhao et al., 2017; Ebrahimi et al., 2017; Jia et al., 2019b). Neural models also lack domain adaptation abilities (Mou et al., 2016; Daumé III, 2009): a model trained on one domain can hardly generalize to new test distributions (Fisch et al., 2019; Levy et al., 2017). Although different avenues have been proposed to address model robustness, such as augmenting the training data using rule-based lexical substitutions (Liang et al., 2017; Ribeiro et al., 2018) or paraphrase models (Iyyer et al., 2018), building robust and domain-adaptive neural models remains a challenge.

In a standard supervised learning setup, a neural network model first maps an input x to a single vector z = f(x). z can be viewed as the hidden feature representing x, and is transformed to its logit l, followed by a softmax operator to output the target label y. At training time, the parameters involved in mapping from x ∈ X to z and then to y are learned. At test time, the trained model makes a prediction when presented with a new instance x′ ∈ X′. This methodology works well if X and X′ come from exactly the same distribution, but suffers significantly if not. This is because the implicit representation learned through supervised signals can easily overfit to the training domain X, and the mapping function f(x), which is trained only on X, can be confused by out-of-domain features in x′, such as lexical, pragmatic, and syntactic variations not seen in the training set (Ettinger et al., 2017).
We can also interpret the weakness of this methodology from a domain adaptation point of view (Daume III and Marcu, 2006; Daumé III, 2009; Tan et al., 2009; Patel et al., 2014): it is crucial to separate source-specific features, target-specific features and general features (features shared by sources and targets). One of the most naive strategies for domain adaptation is to ask the model to use only general features at test time. In the standard x → z → y setup, all features, including source-specific, target-specific and general features, are entangled in z. Due to the lack of interpretability of neural models (Li et al., 2015; Linzen et al., 2016; Lei et al., 2016; Koh and Liang, 2017), it is impossible to disentangle them.

Inspired by recent work in disentangled representation learning (Bengio et al., 2013; Kim and Mnih, 2018; Hjelm et al., 2018; Kumar et al., 2018; Locatello et al., 2019), we propose to improve robustness and generality of NLP models using disentangled representations. Different from mapping x to a single representation z and then to y, the proposed strategy first maps x to a set of distinct representations Z = {z_1, ..., z_K}, which are then individually projected to logits l_1, ..., l_K. The l_i's are ensembled to make the final prediction of y. In this setup, we wish the z_i's or l_i's to be disentangled from each other as much as possible, which potentially improves both robustness and generality. For the former, the decision of y is more immune to small changes in x: even if small changes lead to significant changes in some z_i's or l_i's, others may remain invariant, and the ultimate influence on y can be further regulated when the l_i's are combined. For the latter, different l_i's have the potential to disentangle, or partially disentangle, source-specific, target-specific and general features.

Practically, we propose two ways to disentangle representations: adding an L2 regularizer on the distances between the z_i's, and adding a Total Correlation (TC) term under the framework of the variational information bottleneck (VIB). The main contributions of this paper are as follows:

• We present two methods to improve the robustness and generality of NLP models in the view of disentangled representation learning and the information bottleneck theory.

• Extensive experiments on domain adaptation and defense against adversarial attacks show that the proposed methods are able to provide better robustness compared with conventional task-specific models, which indicates the effectiveness of the theory of information bottleneck and disentangled representation learning for NLP tasks.

The rest of this paper is organized as follows: we present related work in Section 2. Models are detailed in Section 3 and Section 4. We present experimental results and analysis in Section 5, followed by a brief conclusion in Section 6.
2 Related Work

Disentangled Representation Learning

Disentangled representation learning was first proposed by Bengio et al. (2013). InfoGAN (Chen et al., 2016) disentangled the representation by maximizing the mutual information between a small subset of the GAN's noise latent variables and the observation. Kim and Mnih (2018) learned disentangled representations in VAEs by encouraging the distribution of representations to be factorial, and hence independent across dimensions. Hjelm et al. (2018) learned disentangled representations by simultaneously estimating and maximizing the mutual information between input data and learned high-level representations. Chen et al. (2018) proposed β-TCVAE, which encourages the model to find statistically independent factors in the data distribution by imposing a total correlation (TC) penalty. Similarly, Kumar et al. (2018) learned disentangled latents from unlabeled observations by introducing a regularizer over the induced prior.

Information Bottleneck

The Information Bottleneck (IB) principle was first proposed by Tishby et al. (2000). It treats the supervised learning task as an optimization problem that squeezes the information from an input about the output through an information bottleneck. In the information bottleneck, the mutual information I(X; Y) is used as the measure of relevant information between the input x and the output y. Tishby and Zaslavsky (2015) and Shwartz-Ziv and Tishby (2017) proposed to use it as a theoretical tool for analyzing and understanding representations in deep neural networks. Alemi et al. (2016) proposed a deep variational version of the IB principle (VIB), which allows the distributions to be parameterized by deep neural networks. In the field of NLP, not much attention has been paid to the Information Bottleneck principle. Li and Eisner (2019) proposed to extract task-specific information (where tasks are defined by the output y) from pretrained word embeddings using VIB. Less relevant work is from Kong et al. (2019), which proposed a self-supervised objective that maximizes the mutual information between global sentence representations and n-grams in the sentence.

Domain Adaptation

Domain adaptation evaluates a model's ability to generalize across domains, and many efforts have been devoted to designing more powerful cross-domain models (Daumé III, 2009; Kim et al., 2015; Lee et al., 2018; Adel et al., 2017; Yang et al., 2018; Ruder, 2019). Sun et al. (2016) proposed CORAL, a method that minimizes domain shift by aligning the second-order statistics of source and target distributions without requiring any target labels; Lin and Lu (2018) added domain-adaptive layers on top of the model; Jia et al. (2019a) used cross-domain language models as a bridge across domains for domain adaptation. Li et al. (2019b) and Du et al. (2020) applied adversarial learning to learn cross-domain models for the task of sentiment analysis. For machine translation, the core idea is to utilize large available parallel data for training NMT models and adapt them to domains with small data (Chu et al., 2017), where data augmentation (Sennrich et al., 2016a; Ul Haq et al., 2020), meta-learning (Gu et al., 2018) and finetuning methods (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016; Dakwale, 2017) have been proposed to achieve this goal.
Adversarial Attacks

Deep neural networks are fragile when attacked by adversarial examples (Goodfellow et al., 2014a; Arjovsky et al., 2017; Mirza and Osindero, 2014). In the context of NLP, Sato et al. (2018) built a candidate pool that includes adversarial examples and used the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014b) to select a candidate word for replacement. Papernot et al. (2016b) showed that the forward derivative (Papernot et al., 2016a) can be used to produce adversarial sequences manipulating both the sequence output and the classification predictions made by an RNN. Liang et al. (2017) designed three perturbation strategies for word-level attack: insertion, modification and removal. Miyato et al. (2016), Sato et al. (2018), Zhu et al. (2020) and Zhou et al. (2020) restricted the directions of perturbations toward existing words in the input embedding space. Ebrahimi et al. (2017) proposed a novel token transformation method by computing derivatives with respect to a few character-edit operations. Other methods either generate certified defenses (Jia et al., 2019b; Huang et al., 2019; Shi et al., 2020), or generate examples that maintain lexical correctness, grammatical correctness and semantic similarity (Ren et al., 2019a).

3 L2 Regularizer on Z

Here, we present our first attempt to learn disentangled representations with an L2 regularizer. We first map the input x to multiple representations Z = {z_1, z_2, ..., z_K} and wish the different z_i's to be disentangled. To obtain Z, we could use independent sets of parameters of RNNs (Hochreiter and Schmidhuber, 1997; Mikolov et al., 2010), CNNs (Krizhevsky et al., 2012; Kalchbrenner et al., 2014) or Transformers (Vaswani et al., 2017b), which mimics the idea of model ensembling. To avoid the parameter and memory intensity of the ensemble setup, we adopt the following simple method: we first map x to a single vector representation z using RNNs or CNNs, and then separate sub-representations from z using distinct projection matrices, each of which tries to capture a certain aspect of the features:

$$ z_i = W_i z, \quad i = 1, \cdots, K \tag{1} $$

where z, z_i ∈ R^{d×1}, W_i ∈ R^{d×d}, and K is the number of disentangled representations.

To make sure that these sub-representations are actually disentangled, we enforce a regularizer on the L2 distance between each pair of them:

$$ \mathcal{L}_{\text{reg}} = -\sum_{i \neq j} \lVert z_i - z_j \rVert_2 \tag{2} $$

The regularizer assumes that the distance between representations in Euclidean space is in accordance with the distinctiveness of the features that are most salient for predictions. Each z_i is next mapped to a logit l_i:

$$ l_i = W \cdot z_i, \quad i = 1, \cdots, K \tag{3} $$

where W ∈ R^{T×d} and T denotes the number of predefined classes for the supervised learning task. Next, we aggregate the weighted logits into a single final logit l = Σ_i α_i l_i, where α_i is the weight associated with l_i. The weights can be computed with a softmax operator by introducing a learnable parameter w_a ∈ R^{d×1}:

$$ \alpha = \mathrm{softmax}\big([z_1^{\top} w_a, \cdots, z_K^{\top} w_a]\big) \tag{4} $$

Combining the cross-entropy loss against the gold label ŷ with the L2 regularizer on Z, we obtain the final training objective:

$$ \mathcal{L}_{\text{separate}} = \mathrm{CE}\big(\mathrm{softmax}(l), \hat{y}\big) + \beta\, \mathcal{L}_{\text{reg}} \tag{5} $$

where β is a hyper-parameter controlling the weight of the regularizer. The method can be adapted to any neural network.
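To make the setup concrete, below is a minimal PyTorch sketch of this method (the module and variable names are our own; the paper does not release code). It implements the projections of Eq. 1, the attention-weighted logit ensemble of Eqs. 3-4, and the pairwise-distance regularizer of Eq. 2, with the sign chosen so that minimizing the loss pushes the sub-representations apart:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledClassifier(nn.Module):
    """Sketch of the L2-regularizer method: a shared encoding z is projected
    into K sub-representations whose logits are ensembled with learned
    attention weights, while pairwise L2 distances are encouraged to grow."""

    def __init__(self, d: int, num_classes: int, K: int = 4):
        super().__init__()
        self.projections = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(K)])  # W_i (Eq. 1)
        self.classifier = nn.Linear(d, num_classes, bias=False)                            # shared W (Eq. 3)
        self.attention = nn.Linear(d, 1, bias=False)                                       # w_a (Eq. 4)

    def forward(self, z: torch.Tensor):
        # z: (batch, d), produced by any encoder (CNN/RNN/Transformer).
        zs = [proj(z) for proj in self.projections]                      # K tensors, each (batch, d)
        logits = torch.stack([self.classifier(zi) for zi in zs], dim=1)  # (batch, K, T)
        alpha = F.softmax(torch.cat([self.attention(zi) for zi in zs], dim=1), dim=1)  # (batch, K)
        final_logit = (alpha.unsqueeze(-1) * logits).sum(dim=1)          # l = sum_i alpha_i * l_i
        # Negative pairwise L2 distance: minimizing it pushes the z_i apart (Eq. 2).
        reg = z.new_zeros(())
        for i in range(len(zs)):
            for j in range(i + 1, len(zs)):
                reg = reg - (zs[i] - zs[j]).pow(2).sum(dim=-1).sqrt().mean()
        return final_logit, reg

# Training objective of Eq. 5:
#   final_logit, reg = model(z)
#   loss = F.cross_entropy(final_logit, y) + beta * reg
```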
Albeit simple, this model is significantly better at learning disentangled features and is less prone to adversarial attacks, as we will show in the experiments later.

4 Variational Information Bottleneck with Total Correlation

Many recent works (Alemi et al., 2016; Higgins et al., 2017; Burgess et al., 2018) have shown that the information bottleneck is more suitable for learning robust and general features than task-specific end-to-end models, due to the flexibility provided by its learned structure. Here we first go through the preliminaries of the variational information bottleneck (VIB) (Alemi et al., 2016), and then detail how it can be adapted for learning disentangled representations by adding a Total Correlation (TC) regularizer (Ver Steeg and Galstyan, 2015; Steeg, 2017; Gao et al., 2018).
Let p(z|x) denote an encoding of x, which maps x to a representation z. The key point of IB is to learn an encoding that is maximally informative about the target y, measured by the mutual information between z and y, denoted by I(y, z). Unfortunately, modeling I(y, z) alone is not enough, since the model can always set z = x to ensure a maximally informative representation, which is not helpful for learning general features. Instead, we need to find the best z subject to a constraint on its complexity, leading to a penalty on the mutual information between x and z. The IB objective is thus:

$$ \mathcal{L}_{\mathrm{IB}} = I(z, y; \theta) - \beta\, I(z, x; \theta) \tag{6} $$

where β controls the trade-off between I(z, y) and I(z, x). Intuitively, the first term encourages z to be predictive of y and the second term enforces z to be a concise representation of x.

Leaving details to the appendix, we can obtain a lower bound on I(z, y) and an upper bound on I(z, x):

$$ I(z, y) \geq \int p(x)\, p(y \mid x)\, p(z \mid x) \log q(y \mid z)\, dx\, dy\, dz $$
$$ I(z, x) \leq \int p(x)\, p(z \mid x) \log \frac{p(z \mid x)}{r(z)}\, dx\, dz \tag{7} $$

where q(y|z) and r(z) are variational approximations to p(y|z) and p(z), respectively. We immediately have a lower bound on Eq. 6:

$$ I(Z, Y) - \beta I(Z, X) \geq \int p(x)\, p(y \mid x)\, p(z \mid x) \log q(y \mid z)\, dx\, dy\, dz - \beta \int p(x)\, p(z \mid x) \log \frac{p(z \mid x)}{r(z)}\, dx\, dz = \mathcal{L}_{\mathrm{VIB}} \tag{8} $$

To compute this in practice, we approximate p(x, y) using the empirical data distribution p(x, y) = (1/N) Σ_{n=1}^N δ_{x_n}(x) δ_{y_n}(y), leading to:

$$ \mathcal{L}_{\mathrm{VIB}} \approx \frac{1}{N} \sum_{n=1}^{N} \Big[ \int p(z \mid x_n) \log q(y_n \mid z)\, dz - \beta \int p(z \mid x_n) \log \frac{p(z \mid x_n)}{r(z)}\, dz \Big] \tag{9} $$

Using the reparameterization trick (Kingma and Welling, 2013) to rewrite p(z|x) dz = p(ε) dε, where z = f(x, ε) is a deterministic function of x and a Gaussian random variable ε, we put everything together and negate the bound to obtain the following objective to minimize:

$$ \mathcal{L}_{\mathrm{VIB}} = \frac{1}{N} \sum_{n=1}^{N} \Big\{ \mathbb{E}_{p(\epsilon)}\big[-\log q(y_n \mid f(x_n, \epsilon))\big] + \beta\, D_{\mathrm{KL}}\big(p(z \mid x_n),\, r(z)\big) \Big\} \tag{10} $$

p(z|x) is set to N(z | f_e^μ(x), f_e^Σ(x)), where f_e is an MLP mapping the input x to a stochastic encoding z. The output dimension of f_e is 2D, where the first D outputs encode μ and the remaining D outputs encode σ. We then sample ε ∼ N(0, I) and combine them as z = μ + ε · σ. We treat r(z) = N(z | 0, I), and q(y|z) as a softmax classifier. Eq. 10 can be trained by directly backpropagating through examples, and the gradient is an unbiased estimate of the true gradient.

It is worth noting that Eq. 6 resembles the form of β-VAE (Higgins et al., 2017), an unsupervised model for learning disentangled representations built upon the Variational Autoencoder (VAE) (Kingma and Welling, 2013). Burgess et al. (2018) showed from an information bottleneck view that β-VAE mimics the behavior of the information bottleneck and learns to disentangle representations.

While VIB provides a neat way of parameterizing the information bottleneck approach and efficiently training the model with the reparameterization trick, the learned representation only contains the minimal statistics required to predict the target label; it does not immediately have the ability to disentangle the learned representations.
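As a concrete reference for Eq. 10, the following is a compact PyTorch sketch of a VIB classifier (the encoder architecture and sizes are illustrative assumptions): f_e outputs μ and σ, z is sampled with the reparameterization trick, and the KL term against r(z) = N(0, I) is computed in closed form:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Sketch of the variational information bottleneck (Eq. 10): an MLP
    encoder f_e maps x to (mu, sigma); z = mu + eps * sigma with
    eps ~ N(0, I); q(y|z) is a softmax classifier."""

    def __init__(self, d_in: int, d_z: int, num_classes: int):
        super().__init__()
        # The encoder's output dimension is 2D: the first D values are mu,
        # the remaining D parameterize sigma.
        self.encoder = nn.Sequential(nn.Linear(d_in, 2 * d_z), nn.ReLU(),
                                     nn.Linear(2 * d_z, 2 * d_z))
        self.decoder = nn.Linear(d_z, num_classes)  # q(y|z)

    def forward(self, x: torch.Tensor):
        mu, raw_sigma = self.encoder(x).chunk(2, dim=-1)
        sigma = F.softplus(raw_sigma)               # keep sigma positive
        z = mu + torch.randn_like(sigma) * sigma    # reparameterization trick
        logits = self.decoder(z)
        # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims.
        kl = 0.5 * (mu.pow(2) + sigma.pow(2) - 2 * torch.log(sigma) - 1).sum(-1).mean()
        return logits, kl

# One training step for Eq. 10 (beta trades prediction off against compression):
#   logits, kl = model(x)
#   loss = F.cross_entropy(logits, y) + beta * kl
```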
To tackle this issue, another regularizer, the Total Correlation (TC) (Ver Steeg and Galstyan, 2015; Steeg, 2017; Gao et al., 2018), is added to disentangle z:

$$ \mathrm{TC}(z_1, \dots, z_K \mid x) = \sum_{i=1}^{K} H(z_i \mid x) - H(z_1, \cdots, z_K \mid x) = D_{\mathrm{KL}}\Big[ p(z_1, \cdots, z_K \mid x),\, \prod_{i=1}^{K} p(z_i \mid x) \Big] \tag{11} $$

The TC term measures the dependence between the p(z_i|x)'s. The penalty on TC forces the model to find statistically independent factors in the features. In particular, TC(z_1, ..., z_K | x) is zero if and only if all p(z_i|x)'s are independent, in which case we say that they are disentangled. The training objective is thus defined as follows:

$$ \mathcal{L}_{\mathrm{VIB+TC}} = \frac{1}{N} \sum_{i=1}^{K} \sum_{n=1}^{N} \Big\{ \mathbb{E}_{p(\epsilon)}\big[-\log q(y_n \mid z_i)\big] + \beta\, D_{\mathrm{KL}}\big(p(z_i \mid x_n),\, r(z_i)\big) \Big\} + \lambda\, D_{\mathrm{KL}}\Big[ p(z_1, \dots, z_K \mid x),\, \prod_{i=1}^{K} p(z_i \mid x) \Big] \tag{12} $$

where β and λ are hyper-parameters adjusting the trade-off between the two factors. p(z_i|x) is set to N(z | f_{e,i}^μ(x), f_{e,i}^Σ(x)), in a similar way to p(z|x), except that f_{e,i}^μ(x) and f_{e,i}^Σ(x) are scalars. Eq. 12 can also be directly trained with an unbiased estimate of the true gradient.
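The paper does not spell out how the TC term of Eq. 11 is estimated during training. A common practical choice, sketched below under our own assumptions, is the minibatch estimator popularized by β-TCVAE (Chen et al., 2018); note that it estimates TC on the aggregated posterior over a minibatch rather than conditionally on a single x:

```python
import math
import torch

def total_correlation(z: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Minibatch estimate of TC = KL( q(z) || prod_j q(z_j) ), in the spirit of
    beta-TCVAE (Chen et al., 2018). z, mu, sigma all have shape (batch, K),
    with each p(z_i|x) a Gaussian with scalar mean and variance."""
    batch = z.shape[0]
    # log q(z_n | x_m) per dimension, for every sample pair: (batch, batch, K)
    log_qz_given_x = torch.distributions.Normal(
        mu.unsqueeze(0), sigma.unsqueeze(0)).log_prob(z.unsqueeze(1))
    # log q(z) ~= logsumexp_m sum_j log q(z_j | x_m) - log(batch)
    log_qz = torch.logsumexp(log_qz_given_x.sum(-1), dim=1) - math.log(batch)
    # log prod_j q(z_j) ~= sum_j [ logsumexp_m log q(z_j | x_m) - log(batch) ]
    log_prod_qzj = (torch.logsumexp(log_qz_given_x, dim=1) - math.log(batch)).sum(-1)
    return (log_qz - log_prod_qzj).mean()

# Added to the VIB loss with weight lambda, as in Eq. 12:
#   loss = vib_loss + lam * total_correlation(z, mu, sigma)
```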
5 Experiments

In this section, we describe experimental results. We conduct experiments in two NLP subfields: domain adaptation and defense against adversarial attacks.

5.1 Domain Adaptation

The goal of domain adaptation tasks is to test whether a model trained on one domain (the source domain) works well when tested on another domain (the target domain). In the domain adaptation setup, there should be at least labeled source-domain data for training and labeled target-domain data for test. Setups differ in whether there is also a small amount of labeled target-domain data for training, or unlabeled target-domain data for unsupervised training (Jia et al., 2019a). In this paper, we adopt the most basic setting, in which there is neither labeled nor unlabeled target-domain data for training, to straightforwardly test a model's ability for domain adaptation. We perform experiments on the following domain adaptation tasks: named entity recognition (NER), part-of-speech tagging (POS), machine translation (MT) and text classification (CLS). The L2 regularizer, VIB and VIB+TC models are built on top of the representations of the last layer for fair comparison.

NER
For the task of NER, we followed the setup in Daumé III (2009) and used the ACE06 dataset as the source domain and the CoNLL 2003 NER data as the target domain. The training set of ACE06 contains 256,145 examples; the dev and test sets from CoNLL03 contain 5,258 and 8,806 examples, respectively.
Method      | NER         | POS   | MT    | CLS-sentiment | CLS-deception
Baseline    | 97.88       | 90.12 | 34.61 | 87.4          | 87.5
VIB         | 98.02 (+0.) | (+0.) | (+0.) | (+1.)         | (+1.)
VIB+TC      | (+0.)       | (+1.) | (+0.) | (+2.)         | (+1.)
Regularizer | 98.21 (+0.) | (+1.) | (+0.) | (+1.)         | (+1.)

Table 1: Results for domain adaptation. The evaluation metric for NER, POS and CLS is accuracy, and that for MT is the BLEU score (Papineni et al., 2002).

For evaluation, we followed Daumé III (2009) and report only label accuracy. We used the MRC-NER model (Li et al., 2019a) as the baseline, which achieves SOTA performance on a wide range of NER tasks. MRC-NER transforms tagging tasks into MRC-style span prediction tasks: category descriptions are first concatenated with the text to tag, and the concatenation is then fed to a BERT-large model (Devlin et al., 2018a) to predict the corresponding start and end indices of each entity. All models are trained using Adam (Kingma and Ba, 2014) with a polynomial learning rate schedule, warmup for 4K steps, and weight decay. The learning rate is selected from {1e-5, 2e-5, 3e-5, 5e-5}, with the dropout rate set to 0.2.

POS
For the task of POS, we followed the setup in Daumé III (2009). The source domain is the WSJ portion of the Penn Treebank, containing 950,028 training examples. The target domain is PubMed, with the dev and test sets containing 1,987 and 14,554 examples, respectively. We used the BERT-large model as the backbone. The model is optimized using Adam (Kingma and Ba, 2014).
Machine Translation
We used the WMT 2014 English-German dataset for training, which contains about 4.5 million sentence pairs, and the TED talks dataset (Duh, 2018) for test. We use the Transformer-base model (Vaswani et al., 2017a) as the backbone, in which the encoder and decoder each have 6 layers. Sentences are encoded using BPE (Sennrich et al., 2016b), with a shared source-target vocabulary of about 37,000 tokens. For fair comparison, we used the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.98 for all models. For the base setup, following Vaswani et al. (2017a), the dimensionality of inputs and outputs d_model is set to 512, and the inner-layer dimensionality d_ff is set to 2,048.
Text Classification
For text classification, we used two datasets. The first is sentiment analysis on reviews: we used 450K Yelp reviews for training and ∼3K Amazon reviews for test (Li et al., 2018). The task is cast as binary classification, deciding whether a review carries positive or negative sentiment. We also used the deceptive opinion spam detection dataset (Li et al., 2014), a binary text classification task that determines whether a review is fake. We used the hotel reviews, 800 customer reviews in total, for training, and the 400 restaurant reviews for test. For baselines, we used the BERT-large model (Devlin et al., 2018b) as the backbone, where the [CLS] representation is first mapped to a scalar and then passed through a sigmoid function. We report accuracy on the test set.
Results
Results for domain adaptation are shown in Table 1. As can be seen, for all tasks VIB+TC performs best among the four models, followed by the proposed L2 regularizer model, which in turn is followed by the VIB model without disentanglement. The vanilla VIB model outperforms the baseline supervised model: the VIB model maps an input to multiple representations, and this operation to some degree separates features in a natural way. The L2 regularizer method consistently outperforms VIB and underperforms VIB+TC, because VIB+TC uses the TC term to disentangle features deliberately, a property the vanilla VIB model does not have. These experimental results demonstrate the importance of learning disentangled features for domain adaptation.

5.2 Defense Against Adversarial Attacks

We evaluate the proposed methods on tasks of defense against adversarial attacks. We conduct experiments on text classification and natural language inference, defending against two recently proposed attacks: PWWS and GA.
IMDB

Method      | BoW: Clean PWWS GA w/LM GA w/o LM | CNN: Clean PWWS GA w/LM GA w/o LM | LSTM: Clean PWWS GA w/LM GA w/o LM
Orig.       | 88.7  12.4   2.1   0.7            | 90.0  18.1   4.2   2.0            | 89.7   1.4   2.5   0.1
VIB         | 88.6  22.4  19.0  11.5            | 89.3  36.1  34.7  13.1            | 88.9  14.2  31.4   7.6
VIB+TC      | 89.1
Regularizer |

AGNews

Method      | BoW: Clean PWWS GA w/LM GA w/o LM | CNN: Clean PWWS GA w/LM GA w/o LM | LSTM: Clean PWWS GA w/LM GA w/o LM
Orig.       |
Regularizer |

Table 2: Results for the IMDB and AGNews datasets. Orig. stands for the original baseline, on which all other methods are based. Accuracy is reported for comparison.

SNLI

Method      | Clean | PWWS | GA w/LM | GA w/o LM
Orig.       |
Regularizer | 90.4  | 48.1 | 59.6    | 27.2
Table 3: Results for the SNLI dataset. We report accuracy for all models.

PWWS (Ren et al., 2019b), short for Probability Weighted Word Saliency, performs text adversarial attacks based on word substitutions with synonyms, where the word replacement order is determined by both word saliency and prediction probability. GA (Alzantot et al., 2018) uses a language model to remove candidate substitute words that do not fit within the context; we report accuracy under GA attacks both with and without the LM.

Following Zhou et al. (2020), for text classification we use two datasets: IMDB (Internet Movie Database) and the AG News corpus (Del Corso et al., 2005). IMDB contains 50,000 movie reviews for binary (positive vs. negative) sentiment classification, and AGNews contains roughly 30,000 news articles for 4-class classification. We use three base models: bag-of-words models, CNNs and two-layer LSTMs. The bag-of-words model first averages the embeddings of the constituent words of the input, then passes the average embedding through a feedforward network to obtain a 100d vector, which is mapped to the final logit. CNNs and LSTMs map input text sequences to vectors, which are fed to a sigmoid for IMDB and a softmax for AGNews.

For natural language inference, we conduct experiments on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), which consists of 570,000 English sentence pairs. The task is cast as 3-class classification, assigning an entailment, contradiction, or neutral label to each sentence pair. All models use BERT as the backbone, are trained with the cross-entropy loss, and have their hyper-parameters tuned on the validation set.
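As a rough illustration of this family of attacks, below is a simplified greedy substitution loop in the spirit of PWWS; the real algorithm additionally weights the replacement order by a word-saliency score, and predict_proba and synonyms here are hypothetical stand-ins for the victim model and a WordNet-style synonym source:

```python
from typing import Callable, List

def greedy_substitution_attack(words: List[str], label: int,
                               predict_proba: Callable[[List[str]], List[float]],
                               synonyms: Callable[[str], List[str]]) -> List[str]:
    """Simplified PWWS-style attack: repeatedly apply the synonym substitution
    that most reduces the probability of the gold label, stopping when the
    predicted label flips or no substitution helps."""
    words = list(words)
    for _ in range(len(words)):  # at most one substitution per position
        base = predict_proba(words)[label]
        best_drop, best_pos, best_word = 0.0, None, None
        for i, w in enumerate(words):
            for cand in synonyms(w):
                trial = words[:i] + [cand] + words[i + 1:]
                drop = base - predict_proba(trial)[label]
                if drop > best_drop:
                    best_drop, best_pos, best_word = drop, i, cand
        if best_pos is None:
            break                            # no substitution reduces confidence
        words[best_pos] = best_word
        probs = predict_proba(words)
        if max(range(len(probs)), key=probs.__getitem__) != label:
            break                            # attack succeeded: the label flipped
    return words
```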
Results
Table 2 shows results for the IMDB and AGNews datasets, and Table 3 shows results for the SNLI dataset. When tested on the clean datasets, where no attack is performed, the variational methods, i.e., VIB and VIB+TC, underperform the baseline model. This is in line with our expectation: because of the need to model the KL divergence between p(z|x) and r(z), the variational methods do not map inputs to label predictions as directly as standard supervised models. But the variational methods significantly outperform the supervised baselines when attacks are performed, owing to the flexibility offered by the disentangled latent representations; under attack, VIB+TC outperforms VIB due to the disentanglement introduced by the TC term. As expected, the L2 regularizer model outperforms the baseline model in terms of robustness against adversarial attacks. It is also interesting that, with the L2 regularizer, the model performs at least comparably to, and sometimes better than, the baseline in the setup without adversarial attacks, which demonstrates that disentangled representations can also help alleviate overfitting, leading to better performance.

Figure 1: Heatmaps for models without (top) and with (bottom) the L2 regularizer.

5.3 Effect of Regularization Strength

Next, we explore how the strength of the regularization terms in VIB+TC and Regularizer affects performance. Specifically, we vary the coefficient hyperparameter β in Regularizer and γ (i.e., λ in Eq. 12) in VIB+TC to show their influence on defense against adversarial attacks. We use the IMDB dataset for evaluation with CNNs as baselines, and for each setting we tune all other hyperparameters on the validation set.

Table 4: Results of varying the hyperparameter β in the Regularizer method on IMDB (columns: Clean, PWWS, GA w/LM, GA w/o LM). Accuracy is reported for comparison. The backbone model is CNN.

Table 5: Results of varying the hyperparameter γ in the VIB+TC method on IMDB (columns: Clean, PWWS, GA w/LM, GA w/o LM). Accuracy is reported for comparison. The backbone model is CNN.

Results are shown in Table 4 and Table 5. As the tables show, the best results are achieved at moderate values of the two hyperparameters: for both methods, performance first rises as the hyperparameter value increases, and then drops as we continue increasing it.
Besides, the difference between the best and worst results for the same model is surprisingly large (e.g., under the PWWS attack, the gap is 4.6 for Regularizer and 4.1 for VIB+TC), indicating both the importance and the sensitivity of the introduced regularizers.

5.4 Visualization

It would be interesting to visualize how the disentangled z_i's encode the information of different parts of the input. Unlike feature-based models such as SVMs, it is intrinsically hard to measure the influence of units of one layer on another layer in a neural architecture (Zeiler and Fergus, 2014; Yosinski et al., 2014; Bau et al., 2017; Koh and Liang, 2017). We turn to the first-derivative saliency method, a widely used tool for visualizing the influence of a change in the input on the model's predictions (Erhan et al., 2009; Simonyan et al., 2013; Li et al., 2015). Specifically, we want to visualize the influence of an input token embedding e on the j-th dimension of z_i, denoted by z_i^j. In a deep neural model, z_i^j is a highly non-linear function of e. The first-derivative saliency method approximates z_i^j with a linear function of e by computing the first-order Taylor expansion:

$$ z_i^j \approx w_i^j(e)^{\top} e + b \tag{13} $$

where w_i^j(e) is the derivative of z_i^j with respect to the embedding e:

$$ w_i^j(e) = \frac{\partial z_i^j}{\partial e}\Big|_{e} \tag{14} $$

The magnitude (absolute value) of the derivative indicates the sensitivity of the final decision to a change in one particular word embedding, telling us how much a specific token contributes to z_i. Summing over j, the influence of e on z_i is given by:

$$ S_i(e) = \sum_j \big| w_i^j(e) \big| \tag{15} $$

Figure 1 plots the heatmaps of S_i(e) with respect to word input vectors for models with and without the TC regularizer. As can be seen, by pushing representations to be disentangled, different representations are able to encode separate meanings of the text: z_1 tends to encode more positive information while z_2 tends to encode negative information. This ability to separate features and cluster meanings potentially improves the model's robustness.
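Eqs. 13-15 can be computed directly with automatic differentiation. The sketch below assumes a hypothetical model that, given a sequence of input embeddings, returns the list of sub-representations {z_i}:

```python
import torch

def saliency(model, embeddings: torch.Tensor, i: int) -> torch.Tensor:
    """First-derivative saliency S_i(e) of Eq. 15: the summed absolute
    gradients of the i-th sub-representation z_i w.r.t. each input embedding.
    embeddings: (seq_len, emb_dim); model(embeddings) is assumed to return a
    list of sub-representation vectors [z_1, ..., z_K]."""
    embeddings = embeddings.detach().requires_grad_(True)
    z_i = model(embeddings)[i]                       # shape (d,)
    scores = torch.zeros(embeddings.shape[0])
    for j in range(z_i.shape[0]):                    # w_i^j(e) = dz_i^j / de  (Eq. 14)
        grad, = torch.autograd.grad(z_i[j], embeddings, retain_graph=True)
        scores = scores + grad.abs().sum(dim=-1)     # accumulate |w_i^j(e)| per token (Eq. 15)
    return scores
```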
6 Conclusion

In this paper, we present methods to improve the robustness and generality of NLP models on various tasks from the perspective of information bottleneck theory and disentangled representation learning. In particular, we find that the two variational methods, VIB and VIB+TC, perform well on cross-domain and adversarial-attack defense tasks. The proposed simple yet effective end-to-end method of learning disentangled representations with an L2 regularizer performs comparably well on cross-domain tasks, and better than vanilla non-disentangled models on adversarial-attack defense tasks, which shows the effectiveness of disentangled representations.

References

Tameem Adel, Han Zhao, and Alexander Wong. 2017. Unsupervised domain adaptation with a relaxed covariate shift assumption. In Thirty-First AAAI Conference on Artificial Intelligence.

Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. 2016. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In Proceedings of Machine Learning Research, volume 70, pages 214–223, International Convention Centre, Sydney, Australia. PMLR.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541–6549.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. 2018. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599.

Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. 2018. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–391.

Thomas M Cover and Joy A Thomas. 2012. Elements of Information Theory. John Wiley & Sons.

Praveen Dakwale. 2017. Fine-tuning for neural machine translation with limited degradation across in- and out-of-domain data.

Hal Daumé III. 2009. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815.

Hal Daume III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126.

Gianna M. Del Corso, Antonio Gullí, and Francesco Romani. 2005. Ranking a stream of news. In WWW '05, pages 97–106, New York, NY, USA. Association for Computing Machinery.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018b. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Chunning Du, Haifeng Sun, Jingyu Wang, Qi Qi, and Jianxin Liao. 2020. Adversarial and domain-aware BERT for cross-domain sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4019–4028.

Kevin Duh. 2018. The multitarget TED talks task.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2009. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1.

Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. arXiv preprint arXiv:1711.01505.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. arXiv preprint arXiv:1910.09753.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897.

Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. 2018. Auto-encoding total correlation explanation. arXiv preprint arXiv:1802.05822.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014a. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014b. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor O.K. Li. 2018. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437.

I. Higgins, Loïc Matthey, A. Pal, C. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. arXiv preprint arXiv:1909.01492.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059.

Chen Jia, Xiaobo Liang, and Yue Zhang. 2019a. Cross-domain NER using cross-domain language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2464–2474.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019b. Certified robustness to adversarial word substitutions. arXiv preprint arXiv:1909.00986.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.

Hyunjik Kim and Andriy Mnih. 2018. Disentangling by factorising. arXiv preprint arXiv:1802.05983.

Young-Bum Kim, Karl Stratos, Ruhi Sarikaya, and Minwoo Jeong. 2015. New transfer learning techniques for disparate label sets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 473–482, Beijing, China. Association for Computational Linguistics.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1885–1894. JMLR.org.

Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. 2019. A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. 2018. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations.

Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2018. Transfer learning for named-entity recognition with neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Jiwei Li, Myle Ott, Claire Cardie, and Eduard Hovy. 2014. Towards a general rule for identifying deceptive opinion spam. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1566–1576.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437.

Xiang Lisa Li and Jason Eisner. 2019. Specializing word embeddings (for parsing) by information bottleneck. arXiv preprint arXiv:1910.00163.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2019a. A unified MRC framework for named entity recognition. arXiv preprint arXiv:1910.11476.

Zheng Li, Xin Li, Ying Wei, Lidong Bing, Yu Zhang, and Qiang Yang. 2019b. Transferable end-to-end aspect-based sentiment analysis with selective adversarial learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4590–4600, Hong Kong, China. Association for Computational Linguistics.

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006.

Bill Yuchen Lin and Wei Lu. 2018. Neural adaptation layers for cross-domain named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2012–2022, Brussels, Belgium. Association for Computational Linguistics.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. 2019. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4114–4124, Long Beach, California, USA. PMLR.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation, Da Nang, Vietnam.

Yuxian Meng, Xiangyuan Ren, Zijun Sun, Xiaoya Li, Arianna Yuan, Fei Wu, and Jiwei Li. 2019a. Large-scale pretraining for neural machine translation with tens of billions of sentence pairs. arXiv preprint arXiv:1909.11861.

Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, and Jiwei Li. 2019b. Glyce: Glyph-vectors for Chinese character representations. In Advances in Neural Information Processing Systems, pages 2742–2753.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? arXiv preprint arXiv:1603.06111.

Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436.

N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. 2016a. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387.

N. Papernot, P. McDaniel, A. Swami, and R. Harang. 2016b. Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016 - 2016 IEEE Military Communications Conference, pages 49–54.

Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. 2014. Visual domain adaptation: An overview of recent advances. IEEE Signal Processing Magazine, 2.

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019a. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Florence, Italy. Association for Computational Linguistics.

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019b. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.

Sebastian Ruder. 2019. Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway.

Motoki Sato, Jun Suzuki, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Interpretable adversarial perturbation in input embedding space for text. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4323–4330. International Joint Conferences on Artificial Intelligence Organization.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.

Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, and Cho-Jui Hsieh. 2020. Robustness verification for transformers. In International Conference on Learning Representations.

Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

Greg Ver Steeg. 2017. Unsupervised learning via total correlation explanation. arXiv preprint arXiv:1706.08984.

Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of frustratingly easy domain adaptation.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Songbo Tan, Xueqi Cheng, Yuefen Wang, and Hongbo Xu. 2009. Adapting naive Bayes to domain adaptation for sentiment analysis. In European Conference on Information Retrieval, pages 337–349. Springer.

Naftali Tishby, Fernando C. N. Pereira, and William Bialek. 2000. The information bottleneck method. CoRR, physics/0004057.

Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE.

Sami Ul Haq, Sadaf Abdul Rauf, Arslan Shoukat, and Noor-e Hira. 2020. Improving document-level neural machine translation with domain adaptation. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 225–231, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. CoRR, abs/1706.03762.

Greg Ver Steeg and Aram Galstyan. 2015. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics, pages 1004–1012.

Ziang Xie, Sida I Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.

Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879–3889, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.

Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. 2019. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems, 30(9):2805–2824.

Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2017. Generating natural adversarial examples. arXiv preprint arXiv:1710.11342.

Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Chang, and Xuanjing Huang. 2020. Defense against adversarial attacks in NLP via Dirichlet neighborhood ensemble. arXiv preprint arXiv:2006.11627.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.

Derivation of Variational Information Bottleneck
Below we reproduce the derivation of VIB from Alemi et al. (2016). We first decompose the joint distribution p(X, Y, Z) as

$$ p(X, Y, Z) = p(X)\, p(Y \mid X)\, p(Z \mid X, Y) = p(X)\, p(Y \mid X)\, p(Z \mid X) \tag{16} $$

using the Markov assumption that Z depends on (X, Y) only through X. Then, for the first term in the IB objective I(Z, Y) − βI(Z, X), we write it out in full:

$$ I(Z, Y) = \int p(y, z) \log \frac{p(y, z)}{p(y)\, p(z)}\, dy\, dz = \int p(y, z) \log \frac{p(y \mid z)}{p(y)}\, dy\, dz \tag{17} $$

where p(y|z) is fully defined by the encoder and the Markov chain:

$$ p(y \mid z) = \int p(x, y \mid z)\, dx = \int p(y \mid x)\, p(x \mid z)\, dx = \int \frac{p(y \mid x)\, p(z \mid x)\, p(x)}{p(z)}\, dx \tag{18} $$

Let q(y|z) be a variational approximation to p(y|z). By the fact that the KL divergence is non-negative, we have

$$ D_{\mathrm{KL}}\big(p(Y \mid Z),\, q(Y \mid Z)\big) \geq 0 \;\Rightarrow\; \int p(y \mid z) \log p(y \mid z)\, dy \geq \int p(y \mid z) \log q(y \mid z)\, dy \tag{19} $$

and hence

$$ I(Z, Y) \geq \int p(y, z) \log \frac{q(y \mid z)}{p(y)}\, dy\, dz = \int p(y, z) \log q(y \mid z)\, dy\, dz - \int p(y) \log p(y)\, dy = \int p(y, z) \log q(y \mid z)\, dy\, dz + H(Y) \tag{20} $$

We omit the second term, since the entropy of y is a constant independent of the model, and rewrite p(y, z) as

$$ p(y, z) = \int p(x, y, z)\, dx = \int p(x)\, p(y \mid x)\, p(z \mid x)\, dx \tag{21} $$

which gives

$$ I(Z, Y) \geq \int p(x)\, p(y \mid x)\, p(z \mid x) \log q(y \mid z)\, dx\, dy\, dz \tag{22} $$

For the term βI(Z, X), we can similarly expand

$$ I(Z, X) = \int p(x, z) \log \frac{p(z \mid x)}{p(z)}\, dz\, dx = \int p(x, z) \log p(z \mid x)\, dz\, dx - \int p(z) \log p(z)\, dz \tag{23} $$

Computing p(z) is intractable, so we introduce a variational approximation r(z). Again using the non-negativity of the KL divergence, we have

$$ I(Z, X) \leq \int p(x)\, p(z \mid x) \log \frac{p(z \mid x)}{r(z)}\, dx\, dz \tag{24} $$

At last we have

$$ I(Z, Y) - \beta I(Z, X) \geq \int p(x)\, p(y \mid x)\, p(z \mid x) \log q(y \mid z)\, dx\, dy\, dz - \beta \int p(x)\, p(z \mid x) \log \frac{p(z \mid x)}{r(z)}\, dx\, dz \triangleq \mathcal{L}_{\mathrm{VIB}} \tag{25} $$

To compute p(x, y) we use the empirical data distribution p(x, y) = (1/N) Σ_{n=1}^N δ_{x_n}(x) δ_{y_n}(y); with the reparameterization trick p(z|x) dz = p(ε) dε, negating the bound yields the training objective minimized in the main text:

$$ \mathcal{L}_{\mathrm{VIB}} \triangleq \frac{1}{N} \sum_{n=1}^{N} \Big\{ \mathbb{E}_{p(\epsilon)}\big[-\log q(y_n \mid f(x_n, \epsilon))\big] + \beta\, D_{\mathrm{KL}}\big(p(z \mid x_n),\, r(z)\big) \Big\} $$