Combining Deep Generative Models and Multi-lingual Pretraining for Semi-supervised Document Classification
Yi Zhu, Ehsan Shareghi, Yingzhen Li, Roi Reichart, Anna Korhonen
Yi Zhu♥  Ehsan Shareghi♠♥  Yingzhen Li♦∗  Roi Reichart♣  Anna Korhonen♥
♥ Language Technology Lab, University of Cambridge
♠ Department of Data Science & AI, Monash University
♦ Department of Computing, Imperial College London
♣ Faculty of Industrial Engineering and Management, Technion, IIT
{yz568,alk23}@cam.ac.uk, [email protected]@imperial.ac.uk, [email protected]

Abstract
Semi-supervised learning through deep generative models and multi-lingual pretraining techniques have orchestrated tremendous success across different areas of NLP. Nonetheless, their development has happened in isolation, while the combination of both could potentially be effective for tackling task-specific labelled data shortage. To bridge this gap, we combine semi-supervised deep generative models and multi-lingual pretraining to form a pipeline for the document classification task. Compared to strong supervised learning baselines, our semi-supervised classification framework is highly competitive and outperforms the state-of-the-art counterparts in low-resource settings across several languages.

Multi-lingual pretraining has been shown to effectively use unlabelled data through learning shared representations across languages that can be transferred to downstream tasks (Artetxe and Schwenk, 2019; Devlin et al., 2019; Wu and Dredze, 2019; Conneau and Lample, 2019). Nonetheless, the lack of labelled data still leads to inferior performance of the same model compared to those trained in languages with more labelled data, such as English (Zeman et al., 2018; Zhu et al., 2019).

Semi-supervised learning is another appealing paradigm that supplements the labelled data with unlabelled data, which is easy to acquire (Blum and Mitchell, 1998; Zhou and Li, 2005; McClosky et al., 2006, inter alia). In particular, deep generative models (DGMs) such as the variational autoencoder (VAE; Kingma and Welling (2014)) are capable of capturing complex data distributions at scale with rich latent representations, and they have been used for semi-supervised learning in various NLP tasks (Xu et al., 2017; Yin et al., 2018; Choi et al., 2019; Xie and Ma, 2019), as well as for inducing cross-lingual word embeddings (Wei and Deng, 2017) and for representation learning in combination with Transformers via pretraining (Li et al., 2020).

To leverage the benefits of both worlds, we propose a pipeline method that combines semi-supervised DGMs (SDGMs) based on the M1+M2 model (Kingma et al., 2014) with multi-lingual pretraining. The pretrained model serves as a multi-lingual encoder, and SDGMs can operate on top of it independently of the encoding architecture. To highlight such independence, we experiment with two pretraining settings: (1) our LSTM-based cross-lingual VAE, and (2) the current state-of-the-art (SOTA) multi-lingual BERT (Devlin et al., 2019).

Our experiments on document classification in several languages show promising results via the SDGM framework with different encoders, outperforming the SOTA supervised counterparts. We also illustrate that the end-to-end training of M1+M2, which was previously considered too unstable to train (Maaløe et al., 2016), is possible with a reformulation of the objective function.

∗ Work done while at Microsoft Research Cambridge.
Code is available at https://github.com/cambridgeltl/mling_sdgms.

Variational Autoencoder.
VAE consists of a stochastic neural encoder q_φ(z|x) that maps an input x to a latent representation z, and a neural decoder p_θ(x|z) that reconstructs x, jointly trained by maximising the evidence lower bound (ELBO) of the marginal likelihood of the data:

$\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$   (1)

where the first term (reconstruction) maximises the expectation of the data likelihood under the posterior distribution of z, and the Kullback-Leibler (KL) divergence regularises the distance between the learned posterior and the prior of z.

$\mathcal{L}(x,y) = \underbrace{\mathbb{E}_{q_\phi(z_1|x)}\big[\log p_\theta(x|z_1)\big]}_{\text{Reconstruction}} - \underbrace{\mathbb{E}_{q_\phi(z_1|x)\,q_\phi(z_2|z_1,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)}\Big]}_{\text{KL}} + \underbrace{\log p(y)}_{\text{Constant}}$

$\mathcal{U}(x) = \underbrace{\mathbb{E}_{q_\phi(z_1|x)}\big[\log p_\theta(x|z_1)\big]}_{\text{Reconstruction}} - \underbrace{\mathbb{E}_{q_\phi(z_1|x)\,q_\phi(y|z_1)\,q_\phi(z_2|z_1,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} + \log \tfrac{q_\phi(y|z_1)}{p(y)}\Big]}_{\text{KL}}$

[Graphical models: (a) M1+M2 with variables x, z_1, z_2, y; (b) VAE with variables x, z.]

Table 1: Labelled and unlabelled objectives for the M1+M2 model (left), and its corresponding graphical model (right).
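As an illustration of Eq. 1, the sketch below computes the two ELBO terms for a Gaussian encoder with the reparameterisation trick. It is a minimal sketch only; the module names, dimensions and Bernoulli decoder are our illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    """Minimal VAE sketch: Gaussian q_phi(z|x), Bernoulli-style decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Reconstruction term E_q[log p_theta(x|z)] (here a Bernoulli log-likelihood)
        rec = -F.binary_cross_entropy_with_logits(self.dec(z), x, reduction="none").sum(-1)
        # Analytic KL( q_phi(z|x) || N(0, I) )
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return (rec - kl).mean()  # maximise this quantity (Eq. 1)
```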
Semi-supervised Learning with VAEs.
The SDGM we use for semi-supervised learning is M1+M2 (Kingma et al., 2014), a graphical model (Table 1 (right)) with two layers of stochastic variables z_1 and z_2, each following an isotropic Gaussian distribution. The first layer encodes the input sequence x into a deterministic hidden representation h, and outputs the posterior distribution of z_1:

$q_\phi(z_1|x) = \mathcal{N}\big(\mu_\phi(h), \mathrm{diag}(\sigma^2_\phi(h))\big)$   (2)

As our SDGM is independent of the encoding architecture, we use different pretrained multi-lingual models to obtain h, μ_φ(h), and σ_φ(h), described in §3. The second layer computes the posterior distribution of z_2, conditioned on a sample z_1 from q_φ(z_1|x) and a class variable y.

When we use labelled data, i.e. y is observed, q_φ(z_2|z_1, y) can be obtained directly. With unlabelled data, we calculate the posterior q_φ(z_2, y|z_1) = q_φ(y|z_1) q_φ(z_2|z_1, y) by inferring y with the classifier q_φ(y|z_1), and integrate over all possible values of y. Therefore, the ELBO for the labelled data S_l = {x, y} is L(x, y):

$\mathbb{E}_{q_\phi(z_1,z_2|x,y)}\Big[\log \tfrac{p_\theta(x,y,z_1,z_2)}{q_\phi(z_1,z_2|x,y)}\Big] = \mathcal{L}(x,y) \le \log p(x,y)$

and for the unlabelled data S_u = {x} it is U(x):

$\mathbb{E}_{q_\phi(z_1,z_2,y|x)}\Big[\log \tfrac{p_\theta(x,y,z_1,z_2)}{q_\phi(z_1,z_2,y|x)}\Big] = \mathcal{U}(x) \le \log p(x)$

where the generative part is p_θ(x, y, z_1, z_2) = p(y) p(z_2) p_θ(z_1|z_2, y) p_θ(x|z_1), p(y) is a uniform distribution as the prior of y, p(z_2) is a standard Gaussian distribution as the prior of z_2, and p_θ(x|z_1) is the decoder, which can have different architectures depending on the encoder (§4). The objective function maximises both the labelled and unlabelled ELBOs, while also training the classifier directly with the labelled data:

$\mathcal{J} = \sum_{(x,y)\in\mathcal{S}_l}\big(\mathcal{L}(x,y) + \alpha\,\mathcal{J}_{cls}(x,y)\big) + \sum_{x\in\mathcal{S}_u}\mathcal{U}(x)$

where J_cls(x, y) = E_{q_φ(z_1|x)}[q_φ(y|z_1)], and α is a hyperparameter to tune. Considering the factorisation of the model according to the graphical model, we can rewrite L(x, y) and U(x) as shown in Table 1 (left). The reconstruction term is the expected log-likelihood of the input sequence x, and is the same for both ELBOs. The KL term regularises the posterior distributions of z_1 and z_2 according to their priors. Additionally for U(x), as mentioned before, we first infer y and treat it as if it were observed, so we compute the expected KL term over q_φ(y|z_1), regularised by KL(q_φ(y|z_1) ‖ p(y)).

Due to its training difficulty, M1+M2 is trained layer-wise in Kingma et al. (2014), where the first layer is trained according to Eq. 1 and fixed before the second layer is trained on top. However, in our experiments (§4.1) we found that M1+M2 is easier to train end-to-end. We attribute this to our mathematical reformulation of the objective functions, giving rise to a more stable optimisation schedule.
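The sketch below shows how the labelled and unlabelled objectives can be implemented on top of any encoder that already yields μ_φ(h) and σ_φ(h). It is an illustrative sketch under our own assumptions: the module names, the decoder interface (a callable returning log p_θ(x|z_1)), and the log-likelihood form of the classifier term are not taken from the released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over the last dimension
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1.0, dim=-1)

class M1M2Head(nn.Module):
    # SDGM head operating on q_phi(z1|x) produced by any pretrained encoder
    def __init__(self, z1_dim, z2_dim, n_classes, decoder):
        super().__init__()
        self.n_classes = n_classes
        self.cls = nn.Linear(z1_dim, n_classes)                # q_phi(y|z1)
        self.q_z2 = nn.Linear(z1_dim + n_classes, 2 * z2_dim)  # q_phi(z2|z1,y)
        self.p_z1 = nn.Linear(z2_dim + n_classes, 2 * z1_dim)  # p_theta(z1|z2,y)
        self.decoder = decoder                                 # callable: (z1, x) -> log p_theta(x|z1)

    def elbo_y(self, x, mu1, logvar1, z1, y_onehot):
        # Single-sample ELBO terms for a fixed class assignment y (Table 1, left)
        mu2, logvar2 = self.q_z2(torch.cat([z1, y_onehot], -1)).chunk(2, -1)
        z2 = mu2 + torch.exp(0.5 * logvar2) * torch.randn_like(mu2)
        mu1p, logvar1p = self.p_z1(torch.cat([z2, y_onehot], -1)).chunk(2, -1)
        rec = self.decoder(z1, x)
        kl = gaussian_kl(mu2, logvar2, torch.zeros_like(mu2), torch.zeros_like(logvar2)) \
           + gaussian_kl(mu1, logvar1, mu1p, logvar1p)
        return rec - kl + math.log(1.0 / self.n_classes)       # + log p(y) under a uniform prior

    def objective(self, x, mu1, logvar1, y=None, alpha=1.0):
        # y: LongTensor of shape (B,) for labelled batches, None for unlabelled ones
        z1 = mu1 + torch.exp(0.5 * logvar1) * torch.randn_like(mu1)
        log_q_y = F.log_softmax(self.cls(z1), -1)
        if y is not None:
            # Labelled: L(x,y) plus the classifier term (log-likelihood form of J_cls)
            y_onehot = F.one_hot(y, self.n_classes).float()
            return self.elbo_y(x, mu1, logvar1, z1, y_onehot) \
                 + alpha * log_q_y.gather(-1, y[:, None]).squeeze(-1)
        # Unlabelled: marginalise over y with q_phi(y|z1) and add its entropy,
        # which together give U(x) = E_q[rec - kl] - KL( q(y|z1) || p(y) )
        q_y = log_q_y.exp()
        total = 0.0
        for c in range(self.n_classes):
            y_c = F.one_hot(torch.full((z1.size(0),), c, dtype=torch.long, device=z1.device),
                            self.n_classes).float()
            total = total + q_y[:, c] * self.elbo_y(x, mu1, logvar1, z1, y_c)
        return total - (q_y * log_q_y).sum(-1)
```

LSTM-based Encoder with VAE Pretraining.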
Our pretraining is based on the framework of Wei and Deng (2017), in which a cross-lingual VAE is pretrained with a parallel corpus as input. However, a parallel corpus is expensive to obtain, and only the resulting cross-lingual embeddings, rather than the whole encoder, could be used, due to the parallel-input limitation of the model. To address these shortcomings, we propose a non-parallel cross-lingual VAE (NXVAE), which has the same graphical model as the vanilla VAE. Each language i is associated with its own word embedding matrix, and its input sequence x_i is processed via a two-layer BiLSTM (Hochreiter and Schmidhuber, 1997) shared across languages. We use the concatenation of the BiLSTM last hidden states as h, and compute q_φ(z_1|x_i) with Eq. 2, so that z_1 becomes the joint cross-lingual semantic space. A language-specific bag-of-words decoder (BOW; Miao et al. (2016)) is then used to reconstruct the input sequence. Additionally, we optimise a language discriminator as an adversary (Lample et al., 2018a) to encourage the mixing of different language representations and to keep the shared encoder language-agnostic. After pretraining NXVAE, we transfer the whole encoder, including μ_φ(h) and σ_φ(h), directly into our SDGM framework and treat it as the q_φ(z_1|x) component of the model (§4.1).
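The sketch below illustrates the shape of such an encoder: per-language embeddings, a shared two-layer BiLSTM, a Gaussian posterior over the joint space, and a language-specific BOW decoder. The adversarial language discriminator is omitted; layer sizes follow the appendix where recoverable, otherwise they are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NXVAEEncoder(nn.Module):
    # Sketch: language-specific embeddings, shared 2-layer BiLSTM, Gaussian q_phi(z1|x)
    def __init__(self, vocab_sizes, emb_dim=300, hid_dim=600, z_dim=300):
        super().__init__()
        self.embs = nn.ModuleList(nn.Embedding(v, emb_dim) for v in vocab_sizes)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.mu = nn.Linear(2 * hid_dim, z_dim)
        self.logvar = nn.Linear(2 * hid_dim, z_dim)

    def forward(self, tokens, lang):
        x = self.embs[lang](tokens)                    # language-specific embeddings
        _, (h_n, _) = self.bilstm(x)                   # h_n: (layers * 2, B, hid)
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)      # last layer, both directions
        return self.mu(h), self.logvar(h)              # parameters of q_phi(z1|x)

class BOWDecoder(nn.Module):
    # Language-specific bag-of-words decoder: log p_theta(x|z1) as a sum of token log-probs
    def __init__(self, vocab_size, z_dim=300):
        super().__init__()
        self.out = nn.Linear(z_dim, vocab_size)

    def forward(self, z1, tokens):
        log_probs = F.log_softmax(self.out(z1), dim=-1)  # (B, V)
        return log_probs.gather(-1, tokens).sum(-1)      # sum over token positions
```

Multi-lingual BERT Encoder.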
To show that our SDGM is effective with other encoding architectures, we use the pretrained multi-lingual BERT (mBERT; Devlin et al. (2019); https://github.com/google-research/bert/blob/master/multilingual.md) as our encoder. Given an input sequence, the pooled [CLS] representation is used as h to compute q_φ(z_1|x) (Eq. 2). Different from NXVAE, we initialise the parameters of μ_φ(h) and σ_φ(h) randomly.

We perform document classification on the class-balanced multilingual document classification corpus (MLDoc; Schwenk and Li (2018)). Each document is assigned to one of four news topic classes: corporate/industrial (C), economics (E), government/social (G), and markets (M). We experiment with five representative languages, EN, DE, FR, RU, ZH, and use the 1k-instance training set along with the standard development and test sets. For experiments with varying labelled data size, the remaining training data from the larger MLDoc training corpus is used as unlabelled data. The full statistics are shown in Table 2. Three languages (EN, DE, FR) are tested for the LSTM encoder with VAE pretraining (§4.1), and all five languages for the mBERT encoder (§4.2). All documents are lowercased. We report accuracy for evaluation, following Schwenk and Li (2018).

For all experiments, we use Adam (Kingma and Ba, 2015) as the optimiser, but with different learning rates for the two settings and for pretraining. We implemented the model with PyTorch (https://pytorch.org/).

For pretraining NXVAE, we use three language pairs, EN-DE, EN-FR and DE-FR, constructed from the Europarl v7 parallel corpus (Koehn, 2005), where only two language pairs are available, EN-DE and EN-FR, which consist of four datasets in total: EN_{EN-DE}, DE_{EN-DE}, EN_{EN-FR} and FR_{EN-FR}. For DE-FR, we pair DE_{EN-DE} and FR_{EN-FR} directly as pseudo-parallel data. We trim all datasets to exactly the same sentence size, and preprocess them with tokenization, lowercasing, substitution of digits with 0, and removal of all punctuation, redundant spaces and empty lines.
             C     E     G     M   Total
EN  train  270   234   252   244   1000
    dev    228   238   266   268   1000
    test   991  1000  1030   979   4000
DE  train  270   240   245   245   1000
    dev    229   268   266   237   1000
    test   984  1026  1022   968   4000
FR  train  227   262   258   253   1000
    dev    257   237   237   269   1000
    test   999   973   998  1030   4000
RU  train  261   288   184   267   1000
    dev    265   272   204   259   1000
    test  1073  1121   706  1100   4000
ZH  train  294   286   109   311   1000
    dev    324   300    93   283   1000
    test  1169  1215   363  1253   4000
Table 2: Statistics of MLDoc in five languages. Instance numbers for each class along with the total numbers are shown. For each language, the three rows are the training, development and test set instance numbers.

We randomly sample a small part of the parallel sentences to build a development set. For models which do not require parallel input, e.g. NXVAE, we mix the two datasets of a language pair together. To avoid KL collapse during pretraining, a weight α on the KL term in Eq. 1 is tuned and then fixed (Higgins et al., 2017; Alemi et al., 2018). We only run one trial with a fixed random seed for both pretraining and document classification. Training details can be found in the Appendix.

As our supervised baselines we compare with the following two groups. (I) NXVAE-based supervised models, which place a multi-layer perceptron classifier on top of the pretrained NXVAE encoder, denoted NXVAE-z_1 (q_φ(y|z_1)) or NXVAE-h (q_φ(y|h)) depending on the representation fed into the classifier; or NXVAE-z_1 models initialised with different pretrained embeddings: random initialisation (RAND), mono-lingual fastText (FT; Bojanowski et al. (2017)), unsupervised cross-lingual MUSE (Lample et al., 2018b), pretrained embeddings from Wei and Deng (2017) (PEMB), and the resulting embeddings from our pretrained NXVAE (NXEMB). All embeddings are pretrained on the same Europarl data. (II) We also pretrain a word-based BERT (BERTW) with a parameter size akin to NXVAE on the same data, and fine-tune it directly. We also trained subword-based models for BERT and NXVAE and observed similar trends (see the Appendix).

Word pair        Lang   kNNs (k = 3)
president (EN)   EN     mr, madam, gentlemen
                 DE     präsident, herr, kommissar
präsident (DE)   EN     president, mr, madam
                 DE     herr, kommissar, herren
great (EN)       EN     deal, with, a
                 DE     große, eine, gute
groß (DE)        EN     striking, gets, lucrative
                 DE     gering, heikel, hoch
said (EN)        EN     already, as, been
                 DE     gesagt, mit, dem
sagte (DE)       EN     he, rightly, said
                 DE     vorhin, kollege, kommissar

Table 3: Cosine similarity-based nearest neighbours of words (left column) in the embedding spaces of EN and DE.

For our semi-supervised experiments, we test two types of decoders with different model capacities: BOW and GRU (Cho et al., 2014). We use M1+M2+BOW (GRU) to denote the model with joint training using a specific decoder, and M1+M2 to denote the original model in Kingma et al. (2014) with layer-wise training. We also add a semi-supervised self-training method (McClosky et al., 2006) for BERTW to leverage the unlabelled data (BERTW+ST), where we iteratively add predicted unlabelled data whenever the model achieves a better dev. accuracy, until convergence.
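The self-training loop is only loosely specified in the paper, so the following is an illustrative sketch under our own assumptions; train_fn, eval_fn and predict_fn are placeholder callables for fine-tuning, dev. evaluation and prediction, not functions from the released code.

```python
def self_training(model, labelled, unlabelled, dev, train_fn, eval_fn, predict_fn):
    # Illustrative BERTW+ST loop: keep growing the labelled pool with model predictions
    # on the unlabelled data for as long as development accuracy improves.
    best_acc = 0.0
    pool = list(labelled)
    while True:
        train_fn(model, pool)
        acc = eval_fn(model, dev)
        if acc <= best_acc or not unlabelled:
            break                                  # stop once dev. accuracy no longer improves
        best_acc = acc
        pseudo = [(x, predict_fn(model, x)) for x in unlabelled]
        pool = list(labelled) + pseudo             # add predicted unlabelled data
    return model, best_acc
```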
Qualitative Results.
Table 3 illustrates the quality of the learned alignments in the cross-lingual space of NXVAE for EN-DE word pairs.

Classification Results.
Table 4 (EN-DE) shows that, among the supervised models, NXVAE-z_1 substantially outperforms the other supervised baselines, with the exception of BERTW. The fact that NXVAE-z_1 is significantly better than NXVAE-h suggests that pretraining has enabled z_1 to learn more general knowledge transferable to this task. Combined with SDGMs, our best pipeline outperforms all baselines across data sizes and languages, including BERTW+ST, with bigger gaps in the fewer-labelled-data scenarios. We observe the same performance trend in both the supervised and semi-supervised DGM settings on EN-FR and DE-FR.

For the decoder, BOW outperforms the GRU, a finding in line with the results of Artetxe et al. (2019), which suggests that a few keywords seem to suffice for this task. The poor performance of the original M1+M2 implies a domain discrepancy between pretraining and task data, and highlights the impact of fine-tuning. In addition, our NXEMB, as a byproduct of NXVAE, performs comparably well with MUSE, and better than all other embedding models, including its parallel counterpart PEMB. We also compared this against a more complex Skip Deep Generative Model (Maaløe et al., 2016), but found that end-to-end M1+M2 performs better (details in the Appendix).
[Table 4 layout: test accuracy with 32, 64, 128 and 1K labelled instances, for each of EN-DE (EN, DE), EN-FR (EN, FR) and DE-FR (DE, FR); rows: NXVAE-h, NXVAE-z_1 with RAND, FT, MUSE, PEMB and NXEMB embeddings, BERTW, M1+M2, M1+M2+BOW, M1+M2+GRU, and BERTW+ST.]
Table 4: MLDoc test accuracy for the EN-DE, EN-FR and DE-FR pairs. The best results for supervised and semi-supervised models are in bold.

We use the cased mBERT, a 12-layer Transformer (Vaswani et al., 2017) trained on Wikipedia in 104 languages with a 100k shared WordPiece vocabulary. The training corpus is larger than Europarl by orders of magnitude, and high-resource languages account for most of the corpus. We use the best SDGM setup (M1+M2+BOW, §4.1) on top of the mBERT encoder against the mBERT supervised model with a linear layer as classifier (SUP-h) in 5 representative languages (EN, DE, FR, RU, ZH). We report the results over 5 runs due to the training instability of BERT (Dodge et al., 2020; Mosbach et al., 2020).
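The sketch below shows how the pooled [CLS] representation can feed either a single linear classifier (SUP-h) or the Gaussian posterior q_φ(z_1|x) of the SDGM head. It uses the Hugging Face Transformers API that the appendix mentions for classification; the model name and z_1 dimension are taken from standard mBERT and Table 12, and the class and method names are our own illustrative choices.

```python
import torch.nn as nn
from transformers import BertModel

class MBertSupH(nn.Module):
    # SUP-h: a single linear layer on top of mBERT's pooled [CLS] representation
    def __init__(self, n_classes, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.cls = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.cls(h)

class MBertSDGMEncoder(nn.Module):
    # For M1+M2+BOW: the same pooled representation parameterises q_phi(z1|x) (Eq. 2),
    # with mu_phi and sigma_phi initialised randomly.
    def __init__(self, z_dim=768, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hid = self.bert.config.hidden_size
        self.mu, self.logvar = nn.Linear(hid, z_dim), nn.Linear(hid, z_dim)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.mu(h), self.logvar(h)
```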
Classification Results.
Figure 1 demonstrates that M1+M2+BOW outperforms the SOTA supervised mBERT (SUP-h) on average across all languages. This corroborates the effectiveness of our SDGM in leveraging unlabelled data within the smaller-labelled-data regime, as well as its independence from the encoding architecture.
[Figure 1: five panels, one per language (EN, DE, FR, RU, ZH); x-axis: labelled size (8, 16, 32); y-axis: test accuracy; models: SUP-h and M1+M2+BOW.]

Figure 1: Boxplot of test accuracy scores for SUP-h and M1+M2+BOW over 5 runs. The mean is shown as a white dot. The dashed line is the test mean accuracy of SUP-h trained on 1k labelled data of the corresponding language.
As expected, the gap is generally larger with 8 and 16 labelled instances, but reduces as the data size grows to 32. The variance shows a similar pattern, but with relatively large values because of the instability of mBERT. Interestingly, the performance difference seems to be more notable in high-resource languages with more pretraining data, whereas in languages with less pretraining text or vocabulary overlap, such as RU and ZH, the two models achieve closer results.

We bridged between multi-lingual pretraining and deep generative models to form a semi-supervised learning framework for document classification. While outperforming SOTA supervised models in several settings, we showed that the benefits of SDGMs are orthogonal to the encoding architecture or pretraining procedure. This opens up a new avenue for SDGMs in low-resource NLP by incorporating unlabelled data, potentially from different domains and languages. Our preliminary results in the cross-lingual zero-shot setting with SDGMs+NXVAE are promising, and we will continue the exploration in this direction as future work.
Acknowledgments
This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (648909). The first author would like to thank Victor Prokhorov and Xiaoyu Shen for their comments on this work. The authors would like to thank the three anonymous reviewers for their helpful suggestions.

Compared to the smaller pretraining corpus (§4.1), we found that the representations pretrained on the large corpus are less prone to overfitting to the training instances of the task. We observe that training without the KL regularisation yields better performance for SDGMs+mBERT.
References
Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. 2018. Fixing a broken ELBO. In ICML, volume 80, pages 159–168.
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. CoRR, abs/1910.11856.
Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, 7:597–610.
Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100. ACM.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.
Jihun Choi, Taeuk Kim, and Sang-goo Lee. 2019. A cross-sentence latent variable model for semi-supervised text sequence matching. In ACL, pages 4747–4761.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In NeurIPS, pages 7057–7067.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186.
Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A. Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. CoRR, abs/2002.06305.
Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.
Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In ICLR.
Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. In ICLR.
Chunyuan Li, Xiang Gao, Yuan Li, Xiujun Li, Baolin Peng, Yizhe Zhang, and Jianfeng Gao. 2020. Optimus: Organizing sentences via pre-trained modeling of a latent space. CoRR, abs/2004.04092.
Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. 2016. Auxiliary deep generative models. In ICML, volume 48, pages 1445–1453.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In NAACL.
Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In ICML, volume 48, pages 1727–1736.
Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. CoRR, abs/2006.04884.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc.
Holger Schwenk and Xian Li. 2018. A corpus for multilingual document classification in eight languages. In LREC.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 5998–6008.
Liangchen Wei and Zhi-Hong Deng. 2017. A variational autoencoding approach for inducing cross-lingual word embeddings. In IJCAI, pages 4165–4171.
Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In EMNLP, pages 833–844.
Zhongbin Xie and Shuai Ma. 2019. Dual-view variational autoencoders for semi-supervised text matching. In IJCAI, pages 5306–5312.
Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In AAAI, pages 3358–3364.
Pengcheng Yin, Chunting Zhou, Junxian He, and Graham Neubig. 2018. StructVAE: Tree-structured latent variable models for semi-supervised semantic parsing. In ACL, pages 754–765.
Daniel Zeman, Jan Hajic, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.
Zhi-Hua Zhou and Ming Li. 2005. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng., 17(11):1529–1541.
Yi Zhu, Benjamin Heinzerling, Ivan Vulić, Michael Strube, Roi Reichart, and Anna Korhonen. 2019. On the importance of subword information for morphological tasks in truly low-resource languages. In CoNLL, pages 216–226.
A Derivations of semi-supervised ELBOs
We derive the full ELBOs of both labelled and unlabelled data for M1+M2 and the Auxiliary Skip Deep Generative Model (AUX; Maaløe et al. (2016)). We first use (·) to represent the different conditional variables of the two models, so that the derivations can be unified, and then realise it with the model-specific conditions at the end. As mentioned in a footnote of the main paper, we compare M1+M2 with AUX for the LSTM encoder with VAE pretraining, but found that the simpler M1+M2 performs better; results on AUX can be found in §D.

As written in the paper, the labelled ELBO for both models is:

$\mathbb{E}_{q_\phi(z_1,z_2|x,y)}\Big[\log \tfrac{p_\theta(x,y,z_1,z_2)}{q_\phi(z_1,z_2|x,y)}\Big] = \mathcal{L}(x,y) \le \log p(x,y)$

Expanding the ELBO, we have:

$\mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|\cdot)}\big[\log p(z_2) + \log p_\theta(z_1|z_2,y) + \log p_\theta(x|\cdot) + \log p(y) - \log q_\phi(z_2|\cdot) - \log q_\phi(z_1|x)\big]$
$= \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|\cdot)}\big[\log p_\theta(x|\cdot)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|\cdot)}\Big[\log \tfrac{q_\phi(z_2|\cdot)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} - \log p(y)\Big]$

After realising (·), we obtain the labelled ELBOs for M1+M2 and AUX given in the main paper:

$\mathcal{L}_{M1+M2}(x,y) = \mathbb{E}_{q_\phi(z_1|x)}\big[\log p_\theta(x|z_1)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|z_1,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} - \log p(y)\Big]$

$\mathcal{L}_{AUX}(x,y) = \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|z_1,x,y)}\big[\log p_\theta(x|z_1,z_2,y)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|z_1,x,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,x,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} - \log p(y)\Big]$

For the unlabelled ELBO, y is unobserved:

$\mathbb{E}_{q_\phi(z_1,z_2,y|x)}\Big[\log \tfrac{p_\theta(x,y,z_1,z_2)}{q_\phi(z_1,z_2,y|x)}\Big] = \mathcal{U}(x) \le \log p(x)$

After expansion:

$= \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|\cdot) q_\phi(z_2|\cdot)}\big[\log p_\theta(x|\cdot)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|\cdot) q_\phi(z_2|\cdot)}\Big[\log \tfrac{q_\phi(z_2|\cdot)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} + \log \tfrac{q_\phi(y|\cdot)}{p(y)}\Big]$

Similarly, we obtain the unlabelled ELBOs of M1+M2 and AUX:

$\mathcal{U}_{M1+M2}(x) = \mathbb{E}_{q_\phi(z_1|x)}\big[\log p_\theta(x|z_1)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|z_1) q_\phi(z_2|z_1,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} + \log \tfrac{q_\phi(y|z_1)}{p(y)}\Big]$

$\mathcal{U}_{AUX}(x) = \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|z_1,x) q_\phi(z_2|z_1,x,y)}\big[\log p_\theta(x|z_1,z_2,y)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|z_1,x) q_\phi(z_2|z_1,x,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,x,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} + \log \tfrac{q_\phi(y|z_1,x)}{p(y)}\Big]$

In our experiments, we sample z_1 and z_2 once during inference, so both the labelled and unlabelled ELBOs can be approximated by:

$\mathcal{L}(x,y) \approx \log p_\theta(x|\cdot) + \log p(y) - \mathrm{KL}\big(q_\phi(z_2|\cdot)\,\|\,p(z_2)\big) - \mathrm{KL}\big(q_\phi(z_1|x)\,\|\,p_\theta(z_1|z_2,y)\big)$

$\mathcal{U}(x) \approx \log p_\theta(x|\cdot) - \mathrm{KL}\big(q_\phi(y|\cdot)\,\|\,p(y)\big) - \mathbb{E}_{q_\phi(y|\cdot)}\big[\mathrm{KL}(q_\phi(z_2|\cdot)\,\|\,p(z_2))\big] - \mathbb{E}_{q_\phi(y|\cdot)}\big[\mathrm{KL}(q_\phi(z_1|x)\,\|\,p_\theta(z_1|z_2,y))\big]$
B Factorisation of M1+M2 and AUX

The two models have different factorisations, with M1+M2 written as:

$q_\phi(z_1,z_2|x,y) = q_\phi(z_1|x)\, q_\phi(z_2|z_1,y)$
$q_\phi(z_1,z_2,y|x) = q_\phi(z_1|x)\, q_\phi(y|z_1)\, q_\phi(z_2|z_1,y)$
$p_\theta(x,y,z_1,z_2) = p(y)\, p(z_2)\, p_\theta(z_1|z_2,y)\, p_\theta(x|z_1)$
$\mathcal{J}_{cls}(x,y) = \mathbb{E}_{q_\phi(z_1|x)}\big[q_\phi(y|z_1)\big]$

and AUX factorised as follows:

$q_\phi(z_1,z_2|x,y) = q_\phi(z_1|x)\, q_\phi(z_2|z_1,x,y)$
$q_\phi(z_1,z_2,y|x) = q_\phi(z_1|x)\, q_\phi(y|z_1,x)\, q_\phi(z_2|z_1,x,y)$
$p_\theta(x,y,z_1,z_2) = p(y)\, p(z_2)\, p_\theta(z_1|z_2,y)\, p_\theta(x|z_1,z_2,y)$
$\mathcal{J}_{cls}(x,y) = \mathbb{E}_{q_\phi(z_1|x)}\big[q_\phi(y|z_1,x)\big]$

where q_φ(z_1|x), q_φ(z_2|·), and p_θ(z_1|z_2,y) are parameterised as diagonal Gaussians, and the other distributions are defined as:

$q_\phi(y|\cdot) = \mathrm{Cat}\big(y \,|\, \pi_\phi(\cdot)\big)$
$p(y) = \mathrm{Cat}(y \,|\, \pi)$
$p(z_2) = \mathcal{N}(z_2 \,|\, 0, I)$
$p_\theta(x|\cdot) = f(x, \cdot\,; \theta)$

where Cat(·) is a multinomial distribution and y is treated as a latent variable if it is unobserved in the unlabelled case. f(x, ·; θ) serves as the decoder and calculates the likelihood of the input sequence x.
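The generative factorisation of M1+M2, p(y) p(z_2) p_θ(z_1|z_2,y) p_θ(x|z_1), is what Appendix E relies on for conditional document generation. The following is a minimal sketch of ancestral sampling under that factorisation, reusing the M1M2Head sketch from the main text; the decoder_sample callable (e.g. top words of a BOW decoder or greedy GRU decoding) is an assumed interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(head, decoder_sample, y, z2_dim):
    # Ancestral sampling: y fixed by the caller, z2 ~ N(0, I),
    # z1 ~ p_theta(z1|z2, y), and finally x ~ p_theta(x|z1).
    # y: LongTensor of shape (n,) holding the desired class for each sample.
    y_onehot = F.one_hot(y, head.n_classes).float()
    z2 = torch.randn(y.size(0), z2_dim)
    mu1, logvar1 = head.p_z1(torch.cat([z2, y_onehot], -1)).chunk(2, -1)
    z1 = mu1 + torch.exp(0.5 * logvar1) * torch.randn_like(mu1)
    return decoder_sample(z1)
```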
C Details on LSTM Encoder with VAE Pretraining

C.1 Data preprocessing and statistics
We use two pairs of data from Europarl v7 (Koehn, 2005), EN-DE and EN-FR, which consist of four datasets in total: EN_{EN-DE}, DE_{EN-DE}, EN_{EN-FR}, and FR_{EN-FR}. For the DE-FR data, we take the datasets DE_{EN-DE} and FR_{EN-FR}.

For each language pair, the sentences on the same line of the two datasets are a pair of parallel sentences. We apply the following preprocessing to each dataset: tokenization; lowercasing; substituting digits with 0; removing all punctuation; removing redundant spaces and empty lines. We then trim all four datasets to exactly the same sentence size. We randomly split off a small part of the parallel sentences to build a dev. set, which leads to a 1.89m-line training set and a 13995-line dev. set for each language. We then shuffle each dataset so that each language pair is no longer parallel (for both the train and dev. sets).

Our goal is to merge the two datasets of each pair and scramble them into a single dataset. In practice, we keep each dataset separate and randomly sample a batch from one language alternately during pretraining, so that the data from both languages are mixed.
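The preprocessing in C.1 amounts to a few string operations per line; the sketch below is an illustrative version under our own assumptions, with the tokenizer passed in as a placeholder callable rather than the tool actually used.

```python
import re

def preprocess_line(line, tokenize):
    # Illustrative Europarl preprocessing: tokenize, lowercase,
    # map digits to 0, and strip punctuation-only tokens.
    tokens = tokenize(line.lower())
    tokens = [re.sub(r"\d", "0", t) for t in tokens]
    tokens = [t for t in tokens if re.search(r"[^\W_]", t)]  # drop pure punctuation
    return " ".join(tokens)

def preprocess_corpus(lines, tokenize):
    out = [preprocess_line(l, tokenize) for l in lines]
    return [l for l in out if l]                              # remove empty lines
```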
C.2 Model and training details

Instead of optimising the standard VAE objective, we optimise the following objective for NXVAE (Higgins et al., 2017; Alemi et al., 2018):

$\mathcal{J}(x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \alpha\, \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$   (3)

where we manually tune the fixed hyperparameter α on the EN-DE data to reach a good empirical balance between the reconstruction and the KL. We select the best-performing α and apply it for the pretraining of the other language pairs as well. The model and training details of NXVAE are shown in Table 5 (left).

C.3 Pretraining other models
For MLDoc supervised document classification, we also pretrain other baseline models to compare with, only for the EN-DE pair:

Cross-lingual VAE with parallel input (PEMB; Wei and Deng (2017)):
For the model of Wei and Deng (2017), we run the original code directly on the same EN-DE Europarl data without changing any of the model architecture. Since the model requires parallel input, we take the preprocessed and split EN-DE data. However, we do not shuffle each dataset, but rather feed them as parallel input to the model, so that this model and our corresponding NXVAE use the same amount and content of data.

Subword-based non-parallel cross-lingual VAE (SNXVAE):
Instead of having a separate vocabulary and decoder for each language, we use a single vocabulary and decoder for SNXVAE. We build the vocabulary with SentencePiece (https://github.com/google/sentencepiece) with a size of 1e4. All other settings are the same as for NXVAE. Its model and training details can be found in Table 5 (right).

Word- and subword-based BERT models (BERTW/BERTSW): For BERTW, we change the vocabulary and model size to be comparable with NXVAE. Note that the vocabulary size of BERTW is the same as the intersected vocabulary size of the two languages in NXVAE. We only use the masked language model objective during pretraining, and discard the next sentence prediction objective. For BERTSW, we use the same vocabulary as SNXVAE and set the model to a similar parameter size as SNXVAE. Both word- and subword-based models are trained with https://github.com/google-research/bert. The model and training details of BERTW and BERTSW are shown in Table 6.

D More Results on Document Classification
D.1 LSTM Encoder with VAE Pretraining

Supervised Learning.
Our base model is NXVAE-z_1, which adds an MLP classifier q_φ(y|z_1) on top of the encoder with the same architecture as NXVAE. The same applies to the subword-based model SNXVAE-z_1. NXVAE-h takes the deterministic h as the input to the classifier. All our baseline models with pretrained embeddings use the architecture of NXVAE-z_1. For fastText (FT), we train the embeddings of both languages with the same data, EN_{EN-DE} and DE_{EN-DE}. For MUSE, we align the pretrained FT embeddings. For BERTW and BERTSW, we use the Transformers library (https://github.com/huggingface/transformers) for classification and initialise the models with the corresponding pretrained parameters. All model and training details can be found in Table 7. The comparison between word-based and subword-based models is shown in Table 8.

Semi-supervised learning with SDGMs.
The main model (NXVAE) and training details are the same as in supervised learning. Besides M1+M2, we also compare with AUX (Maaløe et al., 2016) with the two decoder types. The training details are shown in Table 9. Regarding decoding with the GRU, all conditional latent variables of p_θ(x|·) are fed as extra input at each decoding step (Xu et al., 2017). We tune all semi-supervised models on EN_{EN-DE} with 32 labels in the semi-supervised setting, and then apply the chosen values to all other languages and data sizes. We tune only one hyperparameter: the scaling factor β in the weight for the classification loss α of the original SDGM paper (Maaløe et al., 2016), α = β · (N_l + N_u)/N_l, where N_l and N_u are the numbers of labelled and unlabelled data points. We tune β over a fixed grid of values, pick the β with the best dev. performance for each model, and randomly select one when there is a tie. We then use this fixed β for all other experiments across different training data sizes and languages.

The results of AUX can be seen in Table 10, along with the M1+M2 results from the original paper. The parameter size of each model is shown in Table 11.

D.2 mBERT Encoder
The supervised model (SUP-h) adds a single linear transformation layer on the pooled [CLS] representation of mBERT, and M1+M2+BOW adds the corresponding SDGM framework on the same mBERT output. As mBERT uses a shared WordPiece vocabulary across languages, the parameter size of the same model is identical for every language. All model and training details, along with parameter sizes, can be found in Table 12.

For tuning the hyperparameter of M1+M2+BOW, different from the LSTM encoder with VAE pretraining, we set α fixed to α = β. We tune β on EN with 8 labels in the semi-supervised setting over 5 trials from a fixed grid of values, pick the β with the best average dev. performance, and then apply it to all other languages and data sizes. We report the mean and variance over 5 trials, and the full results for both models can be seen in Table 13.

E Conditional document generation
Semi-supervised deep generative models can not only explore complex data distributions, but are also equipped with the ability to generate documents conditioned on latent codes, which is another advantage over other semi-supervised models. We follow Kingma et al. (2014) by varying the latent variable y for generation, fixing z_2 either sampled from the prior (Table 14) or obtained from the input through the inference model (Table 15), and generating sequence samples from the trained semi-supervised models M1+M2+BOW and M1+M2+GRU. All models are trained on EN_{EN-FR} with 128 labelled data points.

Overall, all models generate words or utterances directly related to the class, with the class labels among the top nouns generated by the BOW models, and the subjects/objects in sentences from the GRU also pertaining to the corresponding classes. However, we also observe that the utterances from the GRU are not fluent, with many repetitions. We argue that this is caused by the high proportion of UNK in the training corpus, which makes sequence generation harder, supported by the fact that the most probable word in all BOW decoders is always UNK.

Hyperparameter            NXVAE                                    SNXVAE
vocabulary size           4e4 (EN), 5e4 (DE, FR)                   1e4
embedding size            300                                      300
embedding dropout         0.2                                      0.2
encoder                   BiLSTM                                   BiLSTM
encoder input dimension   300                                      300
encoder hidden dimension  600 for each direction                   600 for each direction
encoder layer number      2                                        2
encoder dropout           0.2                                      0.2
z dimension               300                                      300
parameter size            41.8M (EN-DE and EN-FR) / 44.9M (DE-FR)  17.8M
α in Equation 3           { , 0.2, 0.5, 1.0 }

Table 5: Model and training details of NXVAE.
Hyperparameter                BERTW    BERTSW
vocabulary size               84101    10005
hidden size                   300      300
max position embeddings       512      512
hidden dropout prob           0.1      0.1
hidden activation             gelu     gelu
intermediate size             2100     1800
num attention heads           12       12
attention probs dropout prob  0.1      0.1
num hidden layers             12       11
parameter size                45.0M    19.1M

Table 6: Model and training details of BERTW and BERTSW.
Hyperparameter   BERTW/BERTSW                   VAE-based
vocabulary       same as pretrained model       same as pretrained model
training epoch   5000                           5000
early stopping   1000 epochs on dev. accuracy   1000
batch size       16                             16

Table 7: LSTM encoder with VAE pretraining: model and training details of MLDoc supervised document classification. The running time is calculated on EN_{EN-DE} with 32 labelled data for all models.
[Table 8 layout: EN-DE pair, columns EN and DE with 32, 64, 128 and FULL labelled instances; rows: BERTW, BERTSW, NXVAE-z_1, SNXVAE-z_1.]

Table 8: LSTM encoder with VAE pretraining: comparison of word-based and subword-based models for BERT and NXVAE in MLDoc supervised document classification. Word-based results are directly from the original paper.
Hyperparameter   M1+M2   M1+M2+BOW   M1+M2+GRU   AUX+BOW   AUX+GRU
training epoch   5000    5000        5000        5000      5000
early stopping   1000    1000        1000        1000      1000
z_1 dim          300     300         300         300       300
z_2 dim          300     300         300         300       300
tie embedding    -       False       False       False     False

Table 9: LSTM encoder with VAE pretraining: model and training details of MLDoc semi-supervised document classification. The running time is calculated on EN_{EN-DE} with 32 labelled data for all models.
[Table 10 layout: test accuracy with 32, 64, 128 and FULL labelled instances, for each of EN-DE (EN, DE), EN-FR (EN, FR) and DE-FR (DE, FR); rows: M1+M2, M1+M2+BOW, M1+M2+GRU, AUX+BOW, AUX+GRU.]

Table 10: LSTM encoder with VAE pretraining: test accuracy of AUX models. The header numbers denote the number of labelled training data instances. The best results are in bold. Other results related to M1+M2 are directly from the original paper.
[Table 11 layout: parameter sizes for EN, DE and FR; embedding models NXVAE-h (26.8M, 29.8M, 29.8M), NXVAE-z_1, SNXVAE-z_1, BERTW, BERTSW, and the semi-supervised models M1+M2, M1+M2+BOW, M1+M2+GRU, AUX+BOW, AUX+GRU.]

Table 11: LSTM encoder with VAE pretraining: parameter size of all supervised and semi-supervised models. The difference between NXVAE-based models and BERTW is caused by the language-specific vocabulary of NXVAE, where only one vocabulary is used for mono-lingual document classification.
Hyperparameter   SUP-h                         M1+M2+BOW
vocabulary size  1e5                           1e5
z_1 dim          768                           768
z_2 dim          768                           768
tie embedding    True                          True
best β           -                             10.0
training epoch   500                           500
early stopping   100 epochs on dev. accuracy   100
batch size       4                             4

Table 12: mBERT encoder: model and training details of MLDoc document classification. The running time is calculated on EN_{EN-DE} with 8 labelled data for both models.

Model              8            16           32           1K
EN  SUP-h          42.2 (4.7)   68.9 (9.7)   82.4 (3.0)   94.2 (0.8)
    M1+M2+BOW      (12.8)       (2.8)        (1.5)        -
DE  SUP-h          55.9 (9.9)   63.5 (10.2)  81.5 (6.5)   95.0 (0.3)
    M1+M2+BOW      (11.5)       (6.3)        (2.6)        -
FR  SUP-h          38.6 (3.3)   55.9 (11.4)  78.5 (3.0)   93.5 (0.7)
    M1+M2+BOW      (4.6)        (9.1)        (2.7)        -
RU  SUP-h          49.4 (6.0)   53.8 (2.6)   68.2 (5.2)   87.2 (0.4)
    M1+M2+BOW      (6.0)        (4.6)        (2.3)        -
ZH  SUP-h          63.4 (12.5)  70.7 (6.5)   81.2 (3.9)   91.1 (0.1)
    M1+M2+BOW      (11.1)       (2.4)        (3.8)        -

Table 13: mBERT encoder: MLDoc average test accuracy for both SUP-h and M1+M2+BOW. The variance is shown in brackets after the mean score. The header row denotes the number of labelled instances. The best results are in bold.

Class   M1+M2+BOW   M1+M2+GRU
C
1: UNK, industry, credibility, agreement, ticket, co, decision, con-cept, ltd, people, sale, government, market, president, designations,minister, firm, plans, partner, deal 1: the bank said it lump of the united ... the new girls ltd saidthe concept ... the new extraordinary and the concept ... said thestatement ...2: UNK, ticket, year, shares, days, results, age, net, demand,securities, period, stock, concept, construction, bank, programme,procedure, statement, value, commission 2: the bank of organisation said on thursday that it had revoked bythe first girls ... first year to ... E
1: UNK, finance, market, loophole, budget, surprise, bank, ba-sis, issue, government, system, exchanges, committee, municipal,world, securities, holding, net, confidence, minister 1: the international basic fund said on acknowledged that it saidon publish to vote on publish to a bank said on publish ...2: UNK, ticket, city, escalation, finance, bank, budget, concept,revenue, net, price, sale, trade, tax, prices, markets, series, rate,fund, pack 2: the bank of submitting on publish florence said on acknowl-edged that ... it said on publish that ... to the new coherent said onacknowledged to bumping the bank said the bank ... G
1: UNK, government, state, minister, delay, pension, work, presi-dent, plans, summit, ticket, people, procedure, conference, ambas-sador, country, talks, opposition, nations, house 1: the president remarkable said on thursday it surprise of ethno-cide arrival the infidels of the islamic of the waterway the bankwas ...2: UNK, state, president, war, police, office, authorities, prob-lem, information, result, country, rights, committee, city, people,biodiversity, justice, health, securities, issue 2: the summit in the authors and a virtual geological and the firsttime of the first party of the first time of ... M
1: UNK, ticket, phase, market, government, minister, markets,banks, bank, budget, floor, points, rate, traders, procedure, strength,economy, finance, prices, loophole 1: the database distinctions the market closed sharply entire onthursday on acknowledged ...2: UNK, markets, market, stock, loophole, points, trade, shares,ticket, corporate, speaker, issues, fund, bank, group, exchanges,results, anticipation, companies, surprise 2: the following of the the the ries and not have embargo costsunveiling on publish pleading a impact of the japanese ... marketand a bank was to be of the bank ...
Table 14: Generated samples from M1+M2+GRU (BOW) for class C (Corporate/Industrial), E (Economics), G (Government/Social), and M (Markets). We randomly sample z_2 from the prior while varying y.

Input document 1 (E): Fiat shares lost nearly two percent on Wednesday, slipping below the psychologically important 4,000 lire level in thin trading on a generally easier Milan Bourse, traders said. "The stock has gradually lost ground but without any major sell orders. At the moment there just isn't any interest in Fiat," one trader said. At 1439 GMT, Fiat was quoted 1.99 percent off at 3,980 lire, after touching a day's low of 3,970 lire, in volume of just under four million shares. The all-share Mibtel index posted a 0.47 percent fall. – Milan newsroom +392 66129589

Reconstruction 1: fiat shares lost nearly two percent on UNK slipping below the psychologically important UNK lire level in thin trading on a generally easier milan UNK traders UNK UNK stock has gradually lost ground but without any major sell UNK at the moment there just UNK any interest in UNK one trader UNK at UNK UNK fiat was quoted UNK percent off at UNK UNK after touching a UNK low of UNK UNK in volume of just under four million UNK the UNK UNK index posted a UNK percent UNK UNK milan UNK UNK UNK

Input document 2 (G): The top prosecutor of Honduras said on Wednesday that his country is a haven for money laundering. "In Honduras it's easy to launder money, the system allows it," Edmundo Orellana told reporters. "It's permitted because there is no law in Honduras that obligates a Honduran to explain the origin of his wealth." Honduran authorities estimate that $300 million in illegal drug profits is laundered through the country each year. Money laundering is not classified as an offence in Honduras, although legislators have been working on a bill to outlaw it since last year.

Reconstruction 2: the top prosecutor of honduras said on wednesday that his country is a haven for money UNK UNK honduras UNK easy to launder UNK the system allows UNK UNK UNK told UNK UNK permitted because there is no law in honduras that UNK a honduran to explain the origin of his UNK honduran authorities estimate that UNK million in illegal drug profits is laundered through the country each UNK money laundering is not classified as an offence in UNK although legislators have been working on a bill to outlaw it since last UNK
Class   M1+M2+BOW   M1+M2+GRU
C
1: UNK, ticket, profit, concept, net, market, escalation, share, results,shares, delay, group, revision, profits, period, misery, statement, bank,key, procedure 1: the bank said on fourthly it has inject requirement of the first groupof ...2: UNK, concept, ticket, group, market, shares, delay, president,stock, companies, bank, statement, government, stake, price, co,state, girls, meeting, ltd 2: the bank of organisation said on acknowledged that it had ameeting ... E
1: UNK, ticket, escalation, inflation, key, revision, delay, period,floor, consumer, bank, contexts, result, instance, show, market, level,government, gross, price 1: the bank of submitting on publish florence said on acknowledgedthat it said on publish that ... the new coherent ... to the bank ...2: UNK, ticket, bank, government, finance, market, state, budget, tax,minister, rate, delay, debt, issue, trade, investment, surprise, policy,sale, procedure 2: the international basic fund said on acknowledged that it said onpublish ... to vote on acknowledged to a bank ... G
1: UNK, world, ticket, policies, time, surprise, procedure, demand,campaigns, group, team, president, match, communities, place, min-ister, bank, government, number, relief 1: the ana police said acknowledged it had a tackling ...2: UNK, president, government, people, state, minister, pension,police, designations, meeting, talks, opposition, leaders, country,security, result, statement, authorities, peace, summit 2: the president remarkable said on thursday that it surprise of ethno-cide arrival infidels of her wines of her recall and the white house of... M
1: UNK, shares, ticket, contexts, touch, market, stock, points, esca-lation, share, traders, phase, immigrants, procedure, price, pledges,revision, agriculture, group , level 1: the bank of the settlement following the following vocationalmeda of the deal was delay ... and the market ...2: UNK, market, ticket, bank, traders, anticipation, delay, procedure,trade, prices, immigrants, rate, government, money, meda, escalation,demands, exchange, points, reallocation 2: the bank of the settlement following the following vocational valueof the relative gains of ...
Table 15: Generated samples from M1+M2+GRU (BOW) by varying class label y. We take z_2 obtained from the two input documents above through the inference model.