Combining Deep Generative Models and Multi-lingual Pretraining for Semi-supervised Document Classification
Yi Zhu, Ehsan Shareghi, Yingzhen Li, Roi Reichart, Anna Korhonen
Yi Zhu♥  Ehsan Shareghi♠♥  Yingzhen Li♦∗  Roi Reichart♣  Anna Korhonen♥
♥ Language Technology Lab, University of Cambridge
♠ Department of Data Science & AI, Monash University
♦ Department of Computing, Imperial College London
♣ Faculty of Industrial Engineering and Management, Technion, IIT
{yz568,alk23}@cam.ac.uk, [email protected]@imperial.ac.uk, [email protected]

Abstract
Semi-supervised learning through deep generative models and multi-lingual pretraining techniques have orchestrated tremendous success across different areas of NLP. Nonetheless, their development has happened in isolation, while the combination of both could potentially be effective for tackling task-specific labelled data shortage. To bridge this gap, we combine semi-supervised deep generative models and multi-lingual pretraining to form a pipeline for the document classification task. Compared to strong supervised learning baselines, our semi-supervised classification framework is highly competitive and outperforms the state-of-the-art counterparts in low-resource settings across several languages.

Multi-lingual pretraining has been shown to effectively use unlabelled data through learning shared representations across languages that can be transferred to downstream tasks (Artetxe and Schwenk, 2019; Devlin et al., 2019; Wu and Dredze, 2019; Conneau and Lample, 2019). Nonetheless, the lack of labelled data still leads to inferior performance of the same model compared to those trained in languages with more labelled data, such as English (Zeman et al., 2018; Zhu et al., 2019).

Semi-supervised learning is another appealing paradigm that supplements the labelled data with unlabelled data, which is easy to acquire (Blum and Mitchell, 1998; Zhou and Li, 2005; McClosky et al., 2006, inter alia). In particular, deep generative models (DGMs) such as the variational autoencoder (VAE; Kingma and Welling (2014)) are capable of capturing complex data distributions at scale with rich latent representations, and they have been used for semi-supervised learning in various NLP tasks (Xu et al., 2017; Yin et al., 2018; Choi et al., 2019; Xie and Ma, 2019), as well as for inducing cross-lingual word embeddings (Wei and Deng, 2017) and for representation learning in combination with Transformers via pretraining (Li et al., 2020).

To leverage the benefits of both worlds, we propose a pipeline method that combines semi-supervised DGMs (SDGMs) based on the M1+M2 model (Kingma et al., 2014) with multi-lingual pretraining. The pretrained model serves as a multi-lingual encoder, and SDGMs can operate on top of it independently of the encoding architecture. To highlight such independence, we experiment with two pretraining settings: (1) our LSTM-based cross-lingual VAE, and (2) the current state-of-the-art (SOTA) multi-lingual BERT (Devlin et al., 2019).

Our experiments on document classification in several languages show promising results via the SDGM framework with different encoders, outperforming the SOTA supervised counterparts. We also illustrate that the end-to-end training of M1+M2, which was previously considered too unstable to train (Maaløe et al., 2016), is possible with a reformulation of the objective function.

∗ Work done while at Microsoft Research Cambridge.
Code is available at https://github.com/cambridgeltl/mling_sdgms.

Variational Autoencoder.
VAE consists of a stochastic neural encoder q_φ(z|x) that maps an input x to a latent representation z, and a neural decoder p_θ(x|z) that reconstructs x, jointly trained by maximising the evidence lower bound (ELBO) of the marginal likelihood of the data:

$\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$   (1)

where the first term (reconstruction) maximises the expectation of the data likelihood under the posterior distribution of z, and the Kullback-Leibler (KL) divergence regularises the distance between the learned posterior and the prior of z.

$\mathcal{L}(x,y) = \underbrace{\mathbb{E}_{q_\phi(z_1|x)}\big[\log p_\theta(x|z_1)\big]}_{\text{Reconstruction}} - \underbrace{\mathbb{E}_{q_\phi(z_1|x)\,q_\phi(z_2|z_1,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)}\Big]}_{\text{KL}} + \underbrace{\log p(y)}_{\text{Constant}}$

$\mathcal{U}(x) = \underbrace{\mathbb{E}_{q_\phi(z_1|x)}\big[\log p_\theta(x|z_1)\big]}_{\text{Reconstruction}} - \underbrace{\mathbb{E}_{q_\phi(z_1|x)\,q_\phi(y|z_1)\,q_\phi(z_2|z_1,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} + \log \tfrac{q_\phi(y|z_1)}{p(y)}\Big]}_{\text{KL}}$

[Graphical models: (a) M1+M2 with variables x, z_1, z_2, y; (b) VAE with variables x, z.]

Table 1: Labelled and unlabelled objectives for the M1+M2 model (left), and its corresponding graphical model (right).
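As an illustration of Eq. 1, the sketch below computes the two ELBO terms for a Gaussian encoder with the reparameterisation trick. It is a minimal sketch only; the module names, dimensions and Bernoulli decoder are our illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    """Minimal VAE sketch: Gaussian q_phi(z|x), Bernoulli-style decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Reconstruction term E_q[log p_theta(x|z)] (here a Bernoulli log-likelihood)
        rec = -F.binary_cross_entropy_with_logits(self.dec(z), x, reduction="none").sum(-1)
        # Analytic KL( q_phi(z|x) || N(0, I) )
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return (rec - kl).mean()  # maximise this quantity (Eq. 1)
```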
Semi-supervised Learning with VAEs.
The SDGM we use for semi-supervised learning is M1+M2 (Kingma et al., 2014), a graphical model (Table 1 (right)) with two layers of stochastic variables z_1 and z_2, each following an isotropic Gaussian distribution. The first layer encodes the input sequence x into a deterministic hidden representation h, and outputs the posterior distribution of z_1:

$q_\phi(z_1|x) = \mathcal{N}\big(\mu_\phi(h), \mathrm{diag}(\sigma^2_\phi(h))\big)$   (2)

As our SDGM is independent of the encoding architecture, we use different pretrained multi-lingual models to obtain h, μ_φ(h), and σ_φ(h), described in §3. The second layer computes the posterior distribution of z_2, conditioned on a sample z_1 from q_φ(z_1|x) and a class variable y.

When we use labelled data, i.e. y is observed, q_φ(z_2|z_1, y) can be obtained directly. With unlabelled data, we calculate the posterior q_φ(z_2, y|z_1) = q_φ(y|z_1) q_φ(z_2|z_1, y) by inferring y with the classifier q_φ(y|z_1), and integrate over all possible values of y. Therefore, the ELBO for the labelled data S_l = {x, y} is L(x, y):

$\mathbb{E}_{q_\phi(z_1,z_2|x,y)}\Big[\log \tfrac{p_\theta(x,y,z_1,z_2)}{q_\phi(z_1,z_2|x,y)}\Big] = \mathcal{L}(x,y) \le \log p(x,y)$

and for the unlabelled data S_u = {x} it is U(x):

$\mathbb{E}_{q_\phi(z_1,z_2,y|x)}\Big[\log \tfrac{p_\theta(x,y,z_1,z_2)}{q_\phi(z_1,z_2,y|x)}\Big] = \mathcal{U}(x) \le \log p(x)$

where the generative part is p_θ(x, y, z_1, z_2) = p(y) p(z_2) p_θ(z_1|z_2, y) p_θ(x|z_1), p(y) is a uniform distribution as the prior of y, p(z_2) is a standard Gaussian distribution as the prior of z_2, and p_θ(x|z_1) is the decoder, which can have different architectures depending on the encoder (§4). The objective function maximises both the labelled and unlabelled ELBOs, while also training the classifier directly with the labelled data:

$\mathcal{J} = \sum_{(x,y)\in\mathcal{S}_l}\big(\mathcal{L}(x,y) + \alpha\,\mathcal{J}_{cls}(x,y)\big) + \sum_{x\in\mathcal{S}_u}\mathcal{U}(x)$

where J_cls(x, y) = E_{q_φ(z_1|x)}[q_φ(y|z_1)], and α is a hyperparameter to tune. Considering the factorisation of the model according to the graphical model, we can rewrite L(x, y) and U(x) as shown in Table 1 (left). The reconstruction term is the expected log-likelihood of the input sequence x, and is the same for both ELBOs. The KL term regularises the posterior distributions of z_1 and z_2 according to their priors. Additionally for U(x), as mentioned before, we first infer y and treat it as if it were observed, so we compute the expected KL term over q_φ(y|z_1), regularised by KL(q_φ(y|z_1) ‖ p(y)).

Due to its training difficulty, M1+M2 is trained layer-wise in Kingma et al. (2014), where the first layer is trained according to Eq. 1 and fixed before the second layer is trained on top. However, in our experiments (§4.1) we found that M1+M2 is easier to train end-to-end. We attribute this to our mathematical reformulation of the objective functions, giving rise to a more stable optimisation schedule.
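The sketch below shows how the labelled and unlabelled objectives can be implemented on top of any encoder that already yields μ_φ(h) and σ_φ(h). It is an illustrative sketch under our own assumptions: the module names, the decoder interface (a callable returning log p_θ(x|z_1)), and the log-likelihood form of the classifier term are not taken from the released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over the last dimension
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1.0, dim=-1)

class M1M2Head(nn.Module):
    # SDGM head operating on q_phi(z1|x) produced by any pretrained encoder
    def __init__(self, z1_dim, z2_dim, n_classes, decoder):
        super().__init__()
        self.n_classes = n_classes
        self.cls = nn.Linear(z1_dim, n_classes)                # q_phi(y|z1)
        self.q_z2 = nn.Linear(z1_dim + n_classes, 2 * z2_dim)  # q_phi(z2|z1,y)
        self.p_z1 = nn.Linear(z2_dim + n_classes, 2 * z1_dim)  # p_theta(z1|z2,y)
        self.decoder = decoder                                 # callable: (z1, x) -> log p_theta(x|z1)

    def elbo_y(self, x, mu1, logvar1, z1, y_onehot):
        # Single-sample ELBO terms for a fixed class assignment y (Table 1, left)
        mu2, logvar2 = self.q_z2(torch.cat([z1, y_onehot], -1)).chunk(2, -1)
        z2 = mu2 + torch.exp(0.5 * logvar2) * torch.randn_like(mu2)
        mu1p, logvar1p = self.p_z1(torch.cat([z2, y_onehot], -1)).chunk(2, -1)
        rec = self.decoder(z1, x)
        kl = gaussian_kl(mu2, logvar2, torch.zeros_like(mu2), torch.zeros_like(logvar2)) \
           + gaussian_kl(mu1, logvar1, mu1p, logvar1p)
        return rec - kl + math.log(1.0 / self.n_classes)       # + log p(y) under a uniform prior

    def objective(self, x, mu1, logvar1, y=None, alpha=1.0):
        # y: LongTensor of shape (B,) for labelled batches, None for unlabelled ones
        z1 = mu1 + torch.exp(0.5 * logvar1) * torch.randn_like(mu1)
        log_q_y = F.log_softmax(self.cls(z1), -1)
        if y is not None:
            # Labelled: L(x,y) plus the classifier term (log-likelihood form of J_cls)
            y_onehot = F.one_hot(y, self.n_classes).float()
            return self.elbo_y(x, mu1, logvar1, z1, y_onehot) \
                 + alpha * log_q_y.gather(-1, y[:, None]).squeeze(-1)
        # Unlabelled: marginalise over y with q_phi(y|z1) and add its entropy,
        # which together give U(x) = E_q[rec - kl] - KL( q(y|z1) || p(y) )
        q_y = log_q_y.exp()
        total = 0.0
        for c in range(self.n_classes):
            y_c = F.one_hot(torch.full((z1.size(0),), c, dtype=torch.long, device=z1.device),
                            self.n_classes).float()
            total = total + q_y[:, c] * self.elbo_y(x, mu1, logvar1, z1, y_c)
        return total - (q_y * log_q_y).sum(-1)
```

LSTM-based Encoder with VAE Pretraining.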
Our pretraining is based on the framework of Wei and Deng (2017), in which a cross-lingual VAE is pretrained with a parallel corpus as input. However, a parallel corpus is expensive to obtain, and only the resulting cross-lingual embeddings, rather than the whole encoder, could be used, due to the parallel-input limitation of the model. To address these shortcomings, we propose a non-parallel cross-lingual VAE (NXVAE), which has the same graphical model as the vanilla VAE. Each language i is associated with its own word embedding matrix, and its input sequence x_i is processed via a two-layer BiLSTM (Hochreiter and Schmidhuber, 1997) shared across languages. We use the concatenation of the BiLSTM last hidden states as h, and compute q_φ(z_1|x_i) with Eq. 2, so that z_1 becomes the joint cross-lingual semantic space. A language-specific bag-of-words decoder (BOW; Miao et al. (2016)) is then used to reconstruct the input sequence. Additionally, we optimise a language discriminator as an adversary (Lample et al., 2018a) to encourage the mixing of different language representations and to keep the shared encoder language-agnostic. After pretraining NXVAE, we transfer the whole encoder, including μ_φ(h) and σ_φ(h), directly into our SDGM framework and treat it as the q_φ(z_1|x) component of the model (§4.1).
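The sketch below illustrates the shape of such an encoder: per-language embeddings, a shared two-layer BiLSTM, a Gaussian posterior over the joint space, and a language-specific BOW decoder. The adversarial language discriminator is omitted; layer sizes follow the appendix where recoverable, otherwise they are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NXVAEEncoder(nn.Module):
    # Sketch: language-specific embeddings, shared 2-layer BiLSTM, Gaussian q_phi(z1|x)
    def __init__(self, vocab_sizes, emb_dim=300, hid_dim=600, z_dim=300):
        super().__init__()
        self.embs = nn.ModuleList(nn.Embedding(v, emb_dim) for v in vocab_sizes)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.mu = nn.Linear(2 * hid_dim, z_dim)
        self.logvar = nn.Linear(2 * hid_dim, z_dim)

    def forward(self, tokens, lang):
        x = self.embs[lang](tokens)                    # language-specific embeddings
        _, (h_n, _) = self.bilstm(x)                   # h_n: (layers * 2, B, hid)
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)      # last layer, both directions
        return self.mu(h), self.logvar(h)              # parameters of q_phi(z1|x)

class BOWDecoder(nn.Module):
    # Language-specific bag-of-words decoder: log p_theta(x|z1) as a sum of token log-probs
    def __init__(self, vocab_size, z_dim=300):
        super().__init__()
        self.out = nn.Linear(z_dim, vocab_size)

    def forward(self, z1, tokens):
        log_probs = F.log_softmax(self.out(z1), dim=-1)  # (B, V)
        return log_probs.gather(-1, tokens).sum(-1)      # sum over token positions
```

Multi-lingual BERT Encoder.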
To show that our SDGM is effective with other encoding architectures, we use the pretrained multi-lingual BERT (mBERT; Devlin et al. (2019); https://github.com/google-research/bert/blob/master/multilingual.md) as our encoder. Given an input sequence, the pooled [CLS] representation is used as h to compute q_φ(z_1|x) (Eq. 2). Different from NXVAE, we initialise the parameters of μ_φ(h) and σ_φ(h) randomly.

We perform document classification on the class-balanced multilingual document classification corpus (MLDoc; Schwenk and Li (2018)). Each document is assigned to one of four news topic classes: corporate/industrial (C), economics (E), government/social (G), and markets (M). We experiment with five representative languages, EN, DE, FR, RU, ZH, and use the 1k-instance training set along with the standard development and test sets. For experiments with varying labelled data size, the remaining training data from the larger MLDoc training corpus is used as unlabelled data. The full statistics are shown in Table 2. Three languages (EN, DE, FR) are tested for the LSTM encoder with VAE pretraining (§4.1), and all five languages for the mBERT encoder (§4.2). All documents are lowercased. We report accuracy for evaluation, following Schwenk and Li (2018).

For all experiments, we use Adam (Kingma and Ba, 2015) as the optimiser, but with different learning rates for the two settings and for pretraining. We implemented the model with PyTorch (https://pytorch.org/).

For pretraining NXVAE, we use three language pairs, EN-DE, EN-FR and DE-FR, constructed from the Europarl v7 parallel corpus (Koehn, 2005), where only two language pairs are available, EN-DE and EN-FR, which consist of four datasets in total: EN_{EN-DE}, DE_{EN-DE}, EN_{EN-FR} and FR_{EN-FR}. For DE-FR, we pair DE_{EN-DE} and FR_{EN-FR} directly as pseudo-parallel data. We trim all datasets to exactly the same sentence size, and preprocess them with tokenization, lowercasing, substitution of digits with 0, and removal of all punctuation, redundant spaces and empty lines.
             C     E     G     M   Total
EN  train  270   234   252   244   1000
    dev    228   238   266   268   1000
    test   991  1000  1030   979   4000
DE  train  270   240   245   245   1000
    dev    229   268   266   237   1000
    test   984  1026  1022   968   4000
FR  train  227   262   258   253   1000
    dev    257   237   237   269   1000
    test   999   973   998  1030   4000
RU  train  261   288   184   267   1000
    dev    265   272   204   259   1000
    test  1073  1121   706  1100   4000
ZH  train  294   286   109   311   1000
    dev    324   300    93   283   1000
    test  1169  1215   363  1253   4000
Table 2: Statistics of MLDoc in five languages. Instance numbers for each class along with the total numbers are shown. For each language, the three rows are the training, development and test set instance numbers.

We randomly sample a small part of the parallel sentences to build a development set. For models which do not require parallel input, e.g. NXVAE, we mix the two datasets of a language pair together. To avoid KL collapse during pretraining, a weight α on the KL term in Eq. 1 is tuned and then fixed (Higgins et al., 2017; Alemi et al., 2018). We only run one trial with a fixed random seed for both pretraining and document classification. Training details can be found in the Appendix.

As our supervised baselines we compare with the following two groups. (I) NXVAE-based supervised models, which place a multi-layer perceptron classifier on top of the pretrained NXVAE encoder, denoted NXVAE-z_1 (q_φ(y|z_1)) or NXVAE-h (q_φ(y|h)) depending on the representation fed into the classifier; or NXVAE-z_1 models initialised with different pretrained embeddings: random initialisation (RAND), mono-lingual fastText (FT; Bojanowski et al. (2017)), unsupervised cross-lingual MUSE (Lample et al., 2018b), pretrained embeddings from Wei and Deng (2017) (PEMB), and the resulting embeddings from our pretrained NXVAE (NXEMB). All embeddings are pretrained on the same Europarl data. (II) We also pretrain a word-based BERT (BERTW) with a parameter size akin to NXVAE on the same data, and fine-tune it directly. We also trained subword-based models for BERT and NXVAE and observed similar trends (see the Appendix).

Word pair        Lang   kNNs (k = 3)
president (EN)   EN     mr, madam, gentlemen
                 DE     präsident, herr, kommissar
präsident (DE)   EN     president, mr, madam
                 DE     herr, kommissar, herren
great (EN)       EN     deal, with, a
                 DE     große, eine, gute
groß (DE)        EN     striking, gets, lucrative
                 DE     gering, heikel, hoch
said (EN)        EN     already, as, been
                 DE     gesagt, mit, dem
sagte (DE)       EN     he, rightly, said
                 DE     vorhin, kollege, kommissar

Table 3: Cosine similarity-based nearest neighbours of words (left column) in the embedding spaces of EN and DE.

For our semi-supervised experiments, we test two types of decoders with different model capacities: BOW and GRU (Cho et al., 2014). We use M1+M2+BOW (GRU) to denote the model with joint training using a specific decoder, and M1+M2 to denote the original model in Kingma et al. (2014) with layer-wise training. We also add a semi-supervised self-training method (McClosky et al., 2006) for BERTW to leverage the unlabelled data (BERTW+ST), where we iteratively add predicted unlabelled data whenever the model achieves a better dev. accuracy, until convergence.
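The self-training loop is only loosely specified in the paper, so the following is an illustrative sketch under our own assumptions; train_fn, eval_fn and predict_fn are placeholder callables for fine-tuning, dev. evaluation and prediction, not functions from the released code.

```python
def self_training(model, labelled, unlabelled, dev, train_fn, eval_fn, predict_fn):
    # Illustrative BERTW+ST loop: keep growing the labelled pool with model predictions
    # on the unlabelled data for as long as development accuracy improves.
    best_acc = 0.0
    pool = list(labelled)
    while True:
        train_fn(model, pool)
        acc = eval_fn(model, dev)
        if acc <= best_acc or not unlabelled:
            break                                  # stop once dev. accuracy no longer improves
        best_acc = acc
        pseudo = [(x, predict_fn(model, x)) for x in unlabelled]
        pool = list(labelled) + pseudo             # add predicted unlabelled data
    return model, best_acc
```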
Qualitative Results.
Table 3 illustrates the quality of the learned alignments in the cross-lingual space of NXVAE for EN-DE word pairs.

Classification Results.
Table 4 (EN-DE) shows that, among the supervised models, NXVAE-z_1 substantially outperforms the other supervised baselines, with the exception of BERTW. The fact that NXVAE-z_1 is significantly better than NXVAE-h suggests that pretraining has enabled z_1 to learn more general knowledge transferable to this task. Combined with SDGMs, our best pipeline outperforms all baselines across data sizes and languages, including BERTW+ST, with bigger gaps in the fewer-labelled-data scenarios. We observe the same performance trend in both the supervised and semi-supervised DGM settings on EN-FR and DE-FR.

For the decoder, BOW outperforms the GRU, a finding in line with the results of Artetxe et al. (2019), which suggests that a few keywords seem to suffice for this task. The poor performance of the original M1+M2 implies a domain discrepancy between pretraining and task data, and highlights the impact of fine-tuning. In addition, our NXEMB, as a byproduct of NXVAE, performs comparably well with MUSE, and better than all other embedding models, including its parallel counterpart PEMB. We also compared this against a more complex Skip Deep Generative Model (Maaløe et al., 2016), but found that end-to-end M1+M2 performs better (details in the Appendix).
[Table 4 layout: test accuracy with 32, 64, 128 and 1K labelled instances, for each of EN-DE (EN, DE), EN-FR (EN, FR) and DE-FR (DE, FR); rows: NXVAE-h, NXVAE-z_1 with RAND, FT, MUSE, PEMB and NXEMB embeddings, BERTW, M1+M2, M1+M2+BOW, M1+M2+GRU, and BERTW+ST.]
Table 4: MLDoc test accuracy for the EN-DE, EN-FR and DE-FR pairs. The best results for supervised and semi-supervised models are in bold.

We use the cased mBERT, a 12-layer Transformer (Vaswani et al., 2017) trained on Wikipedia in 104 languages with a 100k shared WordPiece vocabulary. The training corpus is larger than Europarl by orders of magnitude, and high-resource languages account for most of the corpus. We use the best SDGM setup (M1+M2+BOW, §4.1) on top of the mBERT encoder against the mBERT supervised model with a linear layer as classifier (SUP-h) in 5 representative languages (EN, DE, FR, RU, ZH). We report the results over 5 runs due to the training instability of BERT (Dodge et al., 2020; Mosbach et al., 2020).
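The sketch below shows how the pooled [CLS] representation can feed either a single linear classifier (SUP-h) or the Gaussian posterior q_φ(z_1|x) of the SDGM head. It uses the Hugging Face Transformers API that the appendix mentions for classification; the model name and z_1 dimension are taken from standard mBERT and Table 12, and the class and method names are our own illustrative choices.

```python
import torch.nn as nn
from transformers import BertModel

class MBertSupH(nn.Module):
    # SUP-h: a single linear layer on top of mBERT's pooled [CLS] representation
    def __init__(self, n_classes, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.cls = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.cls(h)

class MBertSDGMEncoder(nn.Module):
    # For M1+M2+BOW: the same pooled representation parameterises q_phi(z1|x) (Eq. 2),
    # with mu_phi and sigma_phi initialised randomly.
    def __init__(self, z_dim=768, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hid = self.bert.config.hidden_size
        self.mu, self.logvar = nn.Linear(hid, z_dim), nn.Linear(hid, z_dim)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.mu(h), self.logvar(h)
```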
Classification Results.
Figure 1 demonstrates that M1+M2+BOW outperforms the SOTA supervised mBERT (SUP-h) on average across all languages. This corroborates the effectiveness of our SDGM in leveraging unlabelled data within the smaller-labelled-data regime, as well as its independence from the encoding architecture.
[Figure 1: five panels, one per language (EN, DE, FR, RU, ZH); x-axis: labelled size (8, 16, 32); y-axis: test accuracy; models: SUP-h and M1+M2+BOW.]

Figure 1: Boxplot of test accuracy scores for SUP-h and M1+M2+BOW over 5 runs. The mean is shown as a white dot. The dashed line is the test mean accuracy of SUP-h trained on 1k labelled data of the corresponding language.
As expected, the gap is generally larger with 8 and 16 labelled instances, but reduces as the data size grows to 32. The variance shows a similar pattern, but with relatively large values because of the instability of mBERT. Interestingly, the performance difference seems to be more notable in high-resource languages with more pretraining data, whereas in languages with less pretraining text or vocabulary overlap, such as RU and ZH, the two models achieve closer results.

We bridged between multi-lingual pretraining and deep generative models to form a semi-supervised learning framework for document classification. While outperforming SOTA supervised models in several settings, we showed that the benefits of SDGMs are orthogonal to the encoding architecture or pretraining procedure. This opens up a new avenue for SDGMs in low-resource NLP by incorporating unlabelled data, potentially from different domains and languages. Our preliminary results in the cross-lingual zero-shot setting with SDGMs+NXVAE are promising, and we will continue the exploration in this direction as future work.
Acknowledgments
This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (648909). The first author would like to thank Victor Prokhorov and Xiaoyu Shen for their comments on this work. The authors would like to thank the three anonymous reviewers for their helpful suggestions.

Compared to the smaller pretraining corpus (§4.1), we found that the representations pretrained on the large corpus are less prone to overfitting to the training instances of the task. We observe that training without the KL regularisation yields better performance for SDGMs+mBERT.
References
Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. 2018. Fixing a broken ELBO. In ICML, volume 80, pages 159–168.
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. CoRR, abs/1910.11856.
Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, 7:597–610.
Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100. ACM.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.
Jihun Choi, Taeuk Kim, and Sang-goo Lee. 2019. A cross-sentence latent variable model for semi-supervised text sequence matching. In ACL, pages 4747–4761.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In NeurIPS, pages 7057–7067.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186.
Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A. Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. CoRR, abs/2002.06305.
Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.
Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In ICLR.
Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. In ICLR.
Chunyuan Li, Xiang Gao, Yuan Li, Xiujun Li, Baolin Peng, Yizhe Zhang, and Jianfeng Gao. 2020. Optimus: Organizing sentences via pre-trained modeling of a latent space. CoRR, abs/2004.04092.
Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. 2016. Auxiliary deep generative models. In ICML, volume 48, pages 1445–1453.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In NAACL.
Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In ICML, volume 48, pages 1727–1736.
Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. CoRR, abs/2006.04884.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc.
Holger Schwenk and Xian Li. 2018. A corpus for multilingual document classification in eight languages. In LREC.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 5998–6008.
Liangchen Wei and Zhi-Hong Deng. 2017. A variational autoencoding approach for inducing cross-lingual word embeddings. In IJCAI, pages 4165–4171.
Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In EMNLP, pages 833–844.
Zhongbin Xie and Shuai Ma. 2019. Dual-view variational autoencoders for semi-supervised text matching. In IJCAI, pages 5306–5312.
Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In AAAI, pages 3358–3364.
Pengcheng Yin, Chunting Zhou, Junxian He, and Graham Neubig. 2018. StructVAE: Tree-structured latent variable models for semi-supervised semantic parsing. In ACL, pages 754–765.
Daniel Zeman, Jan Hajic, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.
Zhi-Hua Zhou and Ming Li. 2005. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng., 17(11):1529–1541.
Yi Zhu, Benjamin Heinzerling, Ivan Vulić, Michael Strube, Roi Reichart, and Anna Korhonen. 2019. On the importance of subword information for morphological tasks in truly low-resource languages. In CoNLL, pages 216–226.
A Derivations of semi-supervised ELBOs
We derive the full ELBOs of both labelled and unlabelled data for M1+M2 and the Auxiliary Skip Deep Generative Model (AUX; Maaløe et al. (2016)). We first use (·) to represent the different conditional variables of the two models, so that the derivations can be unified, and then realise it with the model-specific conditions at the end. As mentioned in a footnote of the main paper, we compare M1+M2 with AUX for the LSTM encoder with VAE pretraining, but found that the simpler M1+M2 performs better; results on AUX can be found in §D.

As written in the paper, the labelled ELBO for both models is:

$\mathbb{E}_{q_\phi(z_1,z_2|x,y)}\Big[\log \tfrac{p_\theta(x,y,z_1,z_2)}{q_\phi(z_1,z_2|x,y)}\Big] = \mathcal{L}(x,y) \le \log p(x,y)$

Expanding the ELBO, we have:

$\mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|\cdot)}\big[\log p(z_2) + \log p_\theta(z_1|z_2,y) + \log p_\theta(x|\cdot) + \log p(y) - \log q_\phi(z_2|\cdot) - \log q_\phi(z_1|x)\big]$
$= \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|\cdot)}\big[\log p_\theta(x|\cdot)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|\cdot)}\Big[\log \tfrac{q_\phi(z_2|\cdot)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} - \log p(y)\Big]$

After realising (·), we obtain the labelled ELBOs for M1+M2 and AUX given in the main paper:

$\mathcal{L}_{M1+M2}(x,y) = \mathbb{E}_{q_\phi(z_1|x)}\big[\log p_\theta(x|z_1)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|z_1,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} - \log p(y)\Big]$

$\mathcal{L}_{AUX}(x,y) = \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|z_1,x,y)}\big[\log p_\theta(x|z_1,z_2,y)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(z_2|z_1,x,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,x,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} - \log p(y)\Big]$

For the unlabelled ELBO, y is unobserved:

$\mathbb{E}_{q_\phi(z_1,z_2,y|x)}\Big[\log \tfrac{p_\theta(x,y,z_1,z_2)}{q_\phi(z_1,z_2,y|x)}\Big] = \mathcal{U}(x) \le \log p(x)$

After expansion:

$= \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|\cdot) q_\phi(z_2|\cdot)}\big[\log p_\theta(x|\cdot)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|\cdot) q_\phi(z_2|\cdot)}\Big[\log \tfrac{q_\phi(z_2|\cdot)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} + \log \tfrac{q_\phi(y|\cdot)}{p(y)}\Big]$

Similarly, we obtain the unlabelled ELBOs of M1+M2 and AUX:

$\mathcal{U}_{M1+M2}(x) = \mathbb{E}_{q_\phi(z_1|x)}\big[\log p_\theta(x|z_1)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|z_1) q_\phi(z_2|z_1,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} + \log \tfrac{q_\phi(y|z_1)}{p(y)}\Big]$

$\mathcal{U}_{AUX}(x) = \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|z_1,x) q_\phi(z_2|z_1,x,y)}\big[\log p_\theta(x|z_1,z_2,y)\big] - \mathbb{E}_{q_\phi(z_1|x) q_\phi(y|z_1,x) q_\phi(z_2|z_1,x,y)}\Big[\log \tfrac{q_\phi(z_2|z_1,x,y)}{p(z_2)} + \log \tfrac{q_\phi(z_1|x)}{p_\theta(z_1|z_2,y)} + \log \tfrac{q_\phi(y|z_1,x)}{p(y)}\Big]$

In our experiments, we sample z_1 and z_2 once during inference, so both the labelled and unlabelled ELBOs can be approximated by:

$\mathcal{L}(x,y) \approx \log p_\theta(x|\cdot) + \log p(y) - \mathrm{KL}\big(q_\phi(z_2|\cdot)\,\|\,p(z_2)\big) - \mathrm{KL}\big(q_\phi(z_1|x)\,\|\,p_\theta(z_1|z_2,y)\big)$

$\mathcal{U}(x) \approx \log p_\theta(x|\cdot) - \mathrm{KL}\big(q_\phi(y|\cdot)\,\|\,p(y)\big) - \mathbb{E}_{q_\phi(y|\cdot)}\big[\mathrm{KL}(q_\phi(z_2|\cdot)\,\|\,p(z_2))\big] - \mathbb{E}_{q_\phi(y|\cdot)}\big[\mathrm{KL}(q_\phi(z_1|x)\,\|\,p_\theta(z_1|z_2,y))\big]$
B Factorisation of M1+M2 and AUX

The two models have different factorisations, with M1+M2 written as:

$q_\phi(z_1,z_2|x,y) = q_\phi(z_1|x)\, q_\phi(z_2|z_1,y)$
$q_\phi(z_1,z_2,y|x) = q_\phi(z_1|x)\, q_\phi(y|z_1)\, q_\phi(z_2|z_1,y)$
$p_\theta(x,y,z_1,z_2) = p(y)\, p(z_2)\, p_\theta(z_1|z_2,y)\, p_\theta(x|z_1)$
$\mathcal{J}_{cls}(x,y) = \mathbb{E}_{q_\phi(z_1|x)}\big[q_\phi(y|z_1)\big]$

and AUX factorised as follows:

$q_\phi(z_1,z_2|x,y) = q_\phi(z_1|x)\, q_\phi(z_2|z_1,x,y)$
$q_\phi(z_1,z_2,y|x) = q_\phi(z_1|x)\, q_\phi(y|z_1,x)\, q_\phi(z_2|z_1,x,y)$
$p_\theta(x,y,z_1,z_2) = p(y)\, p(z_2)\, p_\theta(z_1|z_2,y)\, p_\theta(x|z_1,z_2,y)$
$\mathcal{J}_{cls}(x,y) = \mathbb{E}_{q_\phi(z_1|x)}\big[q_\phi(y|z_1,x)\big]$

where q_φ(z_1|x), q_φ(z_2|·), and p_θ(z_1|z_2,y) are parameterised as diagonal Gaussians, and the other distributions are defined as:

$q_\phi(y|\cdot) = \mathrm{Cat}\big(y \,|\, \pi_\phi(\cdot)\big)$
$p(y) = \mathrm{Cat}(y \,|\, \pi)$
$p(z_2) = \mathcal{N}(z_2 \,|\, 0, I)$
$p_\theta(x|\cdot) = f(x, \cdot\,; \theta)$

where Cat(·) is a multinomial distribution and y is treated as a latent variable if it is unobserved in the unlabelled case. f(x, ·; θ) serves as the decoder and calculates the likelihood of the input sequence x.
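The generative factorisation of M1+M2, p(y) p(z_2) p_θ(z_1|z_2,y) p_θ(x|z_1), is what Appendix E relies on for conditional document generation. The following is a minimal sketch of ancestral sampling under that factorisation, reusing the M1M2Head sketch from the main text; the decoder_sample callable (e.g. top words of a BOW decoder or greedy GRU decoding) is an assumed interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(head, decoder_sample, y, z2_dim):
    # Ancestral sampling: y fixed by the caller, z2 ~ N(0, I),
    # z1 ~ p_theta(z1|z2, y), and finally x ~ p_theta(x|z1).
    # y: LongTensor of shape (n,) holding the desired class for each sample.
    y_onehot = F.one_hot(y, head.n_classes).float()
    z2 = torch.randn(y.size(0), z2_dim)
    mu1, logvar1 = head.p_z1(torch.cat([z2, y_onehot], -1)).chunk(2, -1)
    z1 = mu1 + torch.exp(0.5 * logvar1) * torch.randn_like(mu1)
    return decoder_sample(z1)
```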
C Details on LSTM Encoder with VAE Pretraining

C.1 Data preprocessing and statistics
We use two pairs of data from Europarl v7 (Koehn, 2005), EN-DE and EN-FR, which consist of four datasets in total: EN_{EN-DE}, DE_{EN-DE}, EN_{EN-FR}, and FR_{EN-FR}. For the DE-FR data, we take the datasets DE_{EN-DE} and FR_{EN-FR}.

For each language pair, the sentences on the same line of the two datasets are a pair of parallel sentences. We apply the following preprocessing to each dataset: tokenization; lowercasing; substituting digits with 0; removing all punctuation; removing redundant spaces and empty lines. We then trim all four datasets to exactly the same sentence size. We randomly split off a small part of the parallel sentences to build a dev. set, which leads to a 1.89m-line training set and a 13995-line dev. set for each language. We then shuffle each dataset so that each language pair is no longer parallel (for both the train and dev. sets).

Our goal is to merge the two datasets of each pair and scramble them into a single dataset. In practice, we keep each dataset separate and randomly sample a batch from one language alternately during pretraining, so that the data from both languages are mixed.
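The preprocessing in C.1 amounts to a few string operations per line; the sketch below is an illustrative version under our own assumptions, with the tokenizer passed in as a placeholder callable rather than the tool actually used.

```python
import re

def preprocess_line(line, tokenize):
    # Illustrative Europarl preprocessing: tokenize, lowercase,
    # map digits to 0, and strip punctuation-only tokens.
    tokens = tokenize(line.lower())
    tokens = [re.sub(r"\d", "0", t) for t in tokens]
    tokens = [t for t in tokens if re.search(r"[^\W_]", t)]  # drop pure punctuation
    return " ".join(tokens)

def preprocess_corpus(lines, tokenize):
    out = [preprocess_line(l, tokenize) for l in lines]
    return [l for l in out if l]                              # remove empty lines
```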
C.2 Model and training details

Instead of optimising the standard VAE objective, we optimise the following objective for NXVAE (Higgins et al., 2017; Alemi et al., 2018):

$\mathcal{J}(x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \alpha\, \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$   (3)

where we manually tune the fixed hyperparameter α on the EN-DE data to reach a good empirical balance between the reconstruction and the KL. We select the best-performing α and apply it for the pretraining of the other language pairs as well. The model and training details of NXVAE are shown in Table 5 (left).

C.3 Pretraining other models
For MLDoc supervised document classification, we also pretrain other baseline models to compare with, only for the EN-DE pair:

Cross-lingual VAE with parallel input (PEMB; Wei and Deng (2017)):
For the model of Wei and Deng (2017), we run the original code directly on the same EN-DE Europarl data without changing any of the model architecture. Since the model requires parallel input, we take the preprocessed and split EN-DE data. However, we do not shuffle each dataset, but rather feed them as parallel input to the model, so that this model and our corresponding NXVAE use the same amount and content of data.

Subword-based non-parallel cross-lingual VAE (SNXVAE):
Instead of having a separate vocabulary and decoder for each language, we use a single vocabulary and decoder for SNXVAE. We build the vocabulary with SentencePiece (https://github.com/google/sentencepiece) with a size of 1e4. All other settings are the same as for NXVAE. Its model and training details can be found in Table 5 (right).

Word- and subword-based BERT models (BERTW/BERTSW): For BERTW, we change the vocabulary and model size to be comparable with NXVAE. Note that the vocabulary size of BERTW is the same as the intersected vocabulary size of the two languages in NXVAE. We only use the masked language model objective during pretraining, and discard the next sentence prediction objective. For BERTSW, we use the same vocabulary as SNXVAE and set the model to a similar parameter size as SNXVAE. Both word- and subword-based models are trained with https://github.com/google-research/bert. The model and training details of BERTW and BERTSW are shown in Table 6.

D More Results on Document Classification
D.1 LSTM Encoder with VAE Pretraining

Supervised Learning.
Our base model is NXVAE-z_1, which adds an MLP classifier q_φ(y|z_1) on top of the encoder with the same architecture as NXVAE. The same applies to the subword-based model SNXVAE-z_1. NXVAE-h takes the deterministic h as the input to the classifier. All our baseline models with pretrained embeddings use the architecture of NXVAE-z_1. For fastText (FT), we train the embeddings of both languages with the same data, EN_{EN-DE} and DE_{EN-DE}. For MUSE, we align the pretrained FT embeddings. For BERTW and BERTSW, we use the Transformers library (https://github.com/huggingface/transformers) for classification and initialise the models with the corresponding pretrained parameters. All model and training details can be found in Table 7. The comparison between word-based and subword-based models is shown in Table 8.

Semi-supervised learning with SDGMs.
The main model (NXVAE) and training details are the same as in supervised learning. Besides M1+M2, we also compare with AUX (Maaløe et al., 2016) with the two decoder types. The training details are shown in Table 9. Regarding decoding with the GRU, all conditional latent variables of p_θ(x|·) are fed as extra input at each decoding step (Xu et al., 2017). We tune all semi-supervised models on EN_{EN-DE} with 32 labels in the semi-supervised setting, and then apply the chosen values to all other languages and data sizes. We tune only one hyperparameter: the scaling factor β in the weight for the classification loss α of the original SDGM paper (Maaløe et al., 2016), α = β · (N_l + N_u)/N_l, where N_l and N_u are the numbers of labelled and unlabelled data points. We tune β over a fixed grid of values, pick the β with the best dev. performance for each model, and randomly select one when there is a tie. We then use this fixed β for all other experiments across different training data sizes and languages.

The results of AUX can be seen in Table 10, along with the M1+M2 results from the original paper. The parameter size of each model is shown in Table 11.

D.2 mBERT Encoder
The supervised model (SUP-h) adds a single linear transformation layer on the pooled [CLS] representation of mBERT, and M1+M2+BOW adds the corresponding SDGM framework on the same mBERT output. As mBERT uses a shared WordPiece vocabulary across languages, the parameter size of the same model is identical for every language. All model and training details, along with parameter sizes, can be found in Table 12.

For tuning the hyperparameter of M1+M2+BOW, different from the LSTM encoder with VAE pretraining, we set α fixed to α = β. We tune β on EN with 8 labels in the semi-supervised setting over 5 trials from a fixed grid of values, pick the β with the best average dev. performance, and then apply it to all other languages and data sizes. We report the mean and variance over 5 trials, and the full results for both models can be seen in Table 13.

E Conditional document generation
Semi-supervised deep generative models can not only explore complex data distributions, but are also equipped with the ability to generate documents conditioned on latent codes, which is another advantage over other semi-supervised models. We follow Kingma et al. (2014) by varying the latent variable y for generation, fixing z_2 either sampled from the prior (Table 14) or obtained from the input through the inference model (Table 15), and generating sequence samples from the trained semi-supervised models M1+M2+BOW and M1+M2+GRU. All models are trained on EN_{EN-FR} with 128 labelled data points.

Overall, all models generate words or utterances directly related to the class, with the class labels among the top nouns generated by the BOW models, and the subjects/objects in sentences from the GRU also pertaining to the corresponding classes. However, we also observe that the utterances from the GRU are not fluent, with many repetitions. We argue that this is caused by the high proportion of UNK in the training corpus, which makes sequence generation harder, supported by the fact that the most probable word in all BOW decoders is always UNK.

Hyperparameter            NXVAE                                    SNXVAE
vocabulary size           4e4 (EN), 5e4 (DE, FR)                   1e4
embedding size            300                                      300
embedding dropout         0.2                                      0.2
encoder                   BiLSTM                                   BiLSTM
encoder input dimension   300                                      300
encoder hidden dimension  600 for each direction                   600 for each direction
encoder layer number      2                                        2
encoder dropout           0.2                                      0.2
z dimension               300                                      300
parameter size            41.8M (EN-DE and EN-FR) / 44.9M (DE-FR)  17.8M
α in Equation 3           { , 0.2, 0.5, 1.0 }

Table 5: Model and training details of NXVAE.
Hyperparameter                BERTW    BERTSW
vocabulary size               84101    10005
hidden size                   300      300
max position embeddings       512      512
hidden dropout prob           0.1      0.1
hidden activation             gelu     gelu
intermediate size             2100     1800
num attention heads           12       12
attention probs dropout prob  0.1      0.1
num hidden layers             12       11
parameter size                45.0M    19.1M

Table 6: Model and training details of BERTW and BERTSW.
Hyperparameter   BERTW/BERTSW                   VAE-based
vocabulary       same as pretrained model       same as pretrained model
training epoch   5000                           5000
early stopping   1000 epochs on dev. accuracy   1000
batch size       16                             16

Table 7: LSTM encoder with VAE pretraining: model and training details of MLDoc supervised document classification. The running time is calculated on EN_{EN-DE} with 32 labelled data for all models.
[Table 8 layout: EN-DE pair, columns EN and DE with 32, 64, 128 and FULL labelled instances; rows: BERTW, BERTSW, NXVAE-z_1, SNXVAE-z_1.]

Table 8: LSTM encoder with VAE pretraining: comparison of word-based and subword-based models for BERT and NXVAE in MLDoc supervised document classification. Word-based results are directly from the original paper.
Hyperparameter   M1+M2   M1+M2+BOW   M1+M2+GRU   AUX+BOW   AUX+GRU
training epoch   5000    5000        5000        5000      5000
early stopping   1000    1000        1000        1000      1000
z_1 dim          300     300         300         300       300
z_2 dim          300     300         300         300       300
tie embedding    -       False       False       False     False

Table 9: LSTM encoder with VAE pretraining: model and training details of MLDoc semi-supervised document classification. The running time is calculated on EN_{EN-DE} with 32 labelled data for all models.
[Table 10 layout: test accuracy with 32, 64, 128 and FULL labelled instances, for each of EN-DE (EN, DE), EN-FR (EN, FR) and DE-FR (DE, FR); rows: M1+M2, M1+M2+BOW, M1+M2+GRU, AUX+BOW, AUX+GRU.]

Table 10: LSTM encoder with VAE pretraining: test accuracy of AUX models. The header numbers denote the number of labelled training data instances. The best results are in bold. Other results related to M1+M2 are directly from the original paper.
[Table 11 layout: parameter sizes for EN, DE and FR; embedding models NXVAE-h (26.8M, 29.8M, 29.8M), NXVAE-z_1, SNXVAE-z_1, BERTW, BERTSW, and the semi-supervised models M1+M2, M1+M2+BOW, M1+M2+GRU, AUX+BOW, AUX+GRU.]

Table 11: LSTM encoder with VAE pretraining: parameter size of all supervised and semi-supervised models. The difference between NXVAE-based models and BERTW is caused by the language-specific vocabulary of NXVAE, where only one vocabulary is used for mono-lingual document classification.
Hyperparameter   SUP-h                         M1+M2+BOW
vocabulary size  1e5                           1e5
z_1 dim          768                           768
z_2 dim          768                           768
tie embedding    True                          True
best β           -                             10.0
training epoch   500                           500
early stopping   100 epochs on dev. accuracy   100
batch size       4                             4

Table 12: mBERT encoder: model and training details of MLDoc document classification. The running time is calculated on EN_{EN-DE} with 8 labelled data for both models.

Model              8            16           32           1K
EN  SUP-h          42.2 (4.7)   68.9 (9.7)   82.4 (3.0)   94.2 (0.8)
    M1+M2+BOW      (12.8)       (2.8)        (1.5)        -
DE  SUP-h          55.9 (9.9)   63.5 (10.2)  81.5 (6.5)   95.0 (0.3)
    M1+M2+BOW      (11.5)       (6.3)        (2.6)        -
FR  SUP-h          38.6 (3.3)   55.9 (11.4)  78.5 (3.0)   93.5 (0.7)
    M1+M2+BOW      (4.6)        (9.1)        (2.7)        -
RU  SUP-h          49.4 (6.0)   53.8 (2.6)   68.2 (5.2)   87.2 (0.4)
    M1+M2+BOW      (6.0)        (4.6)        (2.3)        -
ZH  SUP-h          63.4 (12.5)  70.7 (6.5)   81.2 (3.9)   91.1 (0.1)
    M1+M2+BOW      (11.1)       (2.4)        (3.8)        -

Table 13: mBERT encoder: MLDoc average test accuracy for both SUP-h and M1+M2+BOW. The variance is shown in brackets after the mean score. The header row denotes the number of labelled instances. The best results are in bold.

Class   M1+M2+BOW   M1+M2+GRU
C
1: UNK, industry, credibility, agreement, ticket, co, decision, con-cept, ltd, people, sale, government, market, president, designations,minister, firm, plans, partner, deal 1: the bank said it lump of the united ... the new girls ltd saidthe concept ... the new extraordinary and the concept ... said thestatement ...2: UNK, ticket, year, shares, days, results, age, net, demand,securities, period, stock, concept, construction, bank, programme,procedure, statement, value, commission 2: the bank of organisation said on thursday that it had revoked bythe first girls ... first year to ... E
1: UNK, finance, market, loophole, budget, surprise, bank, ba-sis, issue, government, system, exchanges, committee, municipal,world, securities, holding, net, confidence, minister 1: the international basic fund said on acknowledged that it saidon publish to vote on publish to a bank said on publish ...2: UNK, ticket, city, escalation, finance, bank, budget, concept,revenue, net, price, sale, trade, tax, prices, markets, series, rate,fund, pack 2: the bank of submitting on publish florence said on acknowl-edged that ... it said on publish that ... to the new coherent said onacknowledged to bumping the bank said the bank ... G
1: UNK, government, state, minister, delay, pension, work, presi-dent, plans, summit, ticket, people, procedure, conference, ambas-sador, country, talks, opposition, nations, house 1: the president remarkable said on thursday it surprise of ethno-cide arrival the infidels of the islamic of the waterway the bankwas ...2: UNK, state, president, war, police, office, authorities, prob-lem, information, result, country, rights, committee, city, people,biodiversity, justice, health, securities, issue 2: the summit in the authors and a virtual geological and the firsttime of the first party of the first time of ... M
1: UNK, ticket, phase, market, government, minister, markets,banks, bank, budget, floor, points, rate, traders, procedure, strength,economy, finance, prices, loophole 1: the database distinctions the market closed sharply entire onthursday on acknowledged ...2: UNK, markets, market, stock, loophole, points, trade, shares,ticket, corporate, speaker, issues, fund, bank, group, exchanges,results, anticipation, companies, surprise 2: the following of the the the ries and not have embargo costsunveiling on publish pleading a impact of the japanese ... marketand a bank was to be of the bank ...
Table 14: Generated samples from M1+M2+GRU (BOW) for class C (Corporate/Industrial), E (Economics), G (Government/Social), and M (Markets). We randomly sample z_2 from the prior while varying y.

Input document 1 (E): Fiat shares lost nearly two percent on Wednesday, slipping below the psychologically important 4,000 lire level in thin trading on a generally easier Milan Bourse, traders said. "The stock has gradually lost ground but without any major sell orders. At the moment there just isn't any interest in Fiat," one trader said. At 1439 GMT, Fiat was quoted 1.99 percent off at 3,980 lire, after touching a day's low of 3,970 lire, in volume of just under four million shares. The all-share Mibtel index posted a 0.47 percent fall. – Milan newsroom +392 66129589

Reconstruction 1: fiat shares lost nearly two percent on UNK slipping below the psychologically important UNK lire level in thin trading on a generally easier milan UNK traders UNK UNK stock has gradually lost ground but without any major sell UNK at the moment there just UNK any interest in UNK one trader UNK at UNK UNK fiat was quoted UNK percent off at UNK UNK after touching a UNK low of UNK UNK in volume of just under four million UNK the UNK UNK index posted a UNK percent UNK UNK milan UNK UNK UNK

Input document 2 (G): The top prosecutor of Honduras said on Wednesday that his country is a haven for money laundering. "In Honduras it's easy to launder money, the system allows it," Edmundo Orellana told reporters. "It's permitted because there is no law in Honduras that obligates a Honduran to explain the origin of his wealth." Honduran authorities estimate that $300 million in illegal drug profits is laundered through the country each year. Money laundering is not classified as an offence in Honduras, although legislators have been working on a bill to outlaw it since last year.

Reconstruction 2: the top prosecutor of honduras said on wednesday that his country is a haven for money UNK UNK honduras UNK easy to launder UNK the system allows UNK UNK UNK told UNK UNK permitted because there is no law in honduras that UNK a honduran to explain the origin of his UNK honduran authorities estimate that UNK million in illegal drug profits is laundered through the country each UNK money laundering is not classified as an offence in UNK although legislators have been working on a bill to outlaw it since last UNK
Class   M1+M2+BOW   M1+M2+GRU
C
1: UNK, ticket, profit, concept, net, market, escalation, share, results,shares, delay, group, revision, profits, period, misery, statement, bank,key, procedure 1: the bank said on fourthly it has inject requirement of the first groupof ...2: UNK, concept, ticket, group, market, shares, delay, president,stock, companies, bank, statement, government, stake, price, co,state, girls, meeting, ltd 2: the bank of organisation said on acknowledged that it had ameeting ... E
1: UNK, ticket, escalation, inflation, key, revision, delay, period,floor, consumer, bank, contexts, result, instance, show, market, level,government, gross, price 1: the bank of submitting on publish florence said on acknowledgedthat it said on publish that ... the new coherent ... to the bank ...2: UNK, ticket, bank, government, finance, market, state, budget, tax,minister, rate, delay, debt, issue, trade, investment, surprise, policy,sale, procedure 2: the international basic fund said on acknowledged that it said onpublish ... to vote on acknowledged to a bank ... G
1: UNK, world, ticket, policies, time, surprise, procedure, demand,campaigns, group, team, president, match, communities, place, min-ister, bank, government, number, relief 1: the ana police said acknowledged it had a tackling ...2: UNK, president, government, people, state, minister, pension,police, designations, meeting, talks, opposition, leaders, country,security, result, statement, authorities, peace, summit 2: the president remarkable said on thursday that it surprise of ethno-cide arrival infidels of her wines of her recall and the white house of... M
1: UNK, shares, ticket, contexts, touch, market, stock, points, esca-lation, share, traders, phase, immigrants, procedure, price, pledges,revision, agriculture, group , level 1: the bank of the settlement following the following vocationalmeda of the deal was delay ... and the market ...2: UNK, market, ticket, bank, traders, anticipation, delay, procedure,trade, prices, immigrants, rate, government, money, meda, escalation,demands, exchange, points, reallocation 2: the bank of the settlement following the following vocational valueof the relative gains of ...
Table 15: Generated samples from M1+M2+GRU (BOW) by varying class label y. We take z_2 obtained from the two input documents above through the inference model.