Published as a conference paper at ICLR 2021

CMV-BERT: CONTRASTIVE MULTI-VOCAB PRE-TRAINING OF BERT
Wei Zhu
Department of Computer Science, East China Normal University, Shanghai, China

Daniel Cheung
AI4ALL, San Diego, California, US
albert [email protected]

ABSTRACT
In this work, we present CMV-BERT, which improves the pretraining of a language model via two ingredients: (a) contrastive learning, which is well studied in the area of computer vision; (b) multiple vocabularies, one of which is fine-grained and the other coarse-grained. The two methods both provide different views of an original sentence, and both are shown to be beneficial. Experiments on downstream tasks demonstrate that our proposed CMV-BERT is effective in improving pretrained language models.
1 INTRODUCTION
Pretrained language models (PLMs), including BERT Devlin et al. (2018) and its variants Yang et al. (2019); Liu et al. (2019b), have proven beneficial for many natural language processing (NLP) tasks, such as text classification, question answering Rajpurkar et al. (2018) and natural language inference (NLI) Bowman et al. (2015), on English, Chinese and many other languages. However, the sentence-level task from BERT Devlin et al. (2018) has been challenged by more recent works like SpanBERT Joshi et al. (2019), and thus many alternatives have been proposed. In light of the development of contrastive learning in computer vision, we propose to introduce contrastive learning into the pretraining of language models.

Figure 1: An illustration of the architecture of CMV-BERT.

We now introduce our proposed method, CMV-BERT, as depicted in Figure 1. The prerequisite is to learn two different vocabularies, one fine-grained and one coarse-grained, from the text corpus, following the procedure introduced in Zhu (2020). With two vocabularies, we have two tokenizers and two embedding layers. A sentence is tokenized and embedded in both branches. An encoder is shared between the two branches, while the pooling layers are branch-specific and do not share parameters; cosine similarity is calculated between the outputs of the two pooling layers. Our contrastive learning framework is inspired by SimSiam Chen & He (2020), and we also adopt the stop-gradient strategy in training. During finetuning, we only use one of the vocabularies as the final vocabulary, so finetuning is the same as for vanilla BERT.

Note that our proposed training objective is off-the-shelf and can easily be combined with other pretraining tasks. Experiments show that our proposed method is beneficial when applied together with the masked language model (MLM) task.

2 RELATED WORK

Before and since Devlin et al. (2018), a large amount of literature on pretrained language models has appeared, pushing the NLP community forward at a speed never witnessed before. Peters et al. (2018) is one of the earliest PLMs that learns contextualized representations of words. GPTs Radford et al. (2018; 2019) and BERT Devlin et al. (2018) take advantage of the Transformer Vaswani et al. (2017). GPTs are uni-directional and make predictions on the input text in an auto-regressive manner, while BERT is bi-directional and makes predictions on the whole or part of the input text. At its core, what makes BERT so powerful are its pretraining tasks, i.e., masked language modeling (MLM) and next sentence prediction (NSP), of which the former is more important. Since BERT, a series of improvements have been proposed. The first branch of literature improves the model architecture of BERT. ALBERT Lan et al. (2019) makes BERT more lightweight via embedding factorization and cross-layer parameter sharing. Zaheer et al. (2020) improve BERT's performance on longer sequences by employing sparser attention.

The second branch of literature improves the training of BERT. Liu et al. (2019b) stabilize and improve the training of BERT with a larger corpus. More work has focused on new language pretraining tasks. ALBERT Lan et al. (2019) introduces sentence order prediction (SOP). StructBERT Wang et al. (2019) designs two novel pre-training tasks, a word structural task and a sentence structural task, for learning better representations of tokens and sentences. ERNIE 2.0 Sun et al. (2019) proposes a series of pretraining tasks and applies continual learning to incorporate them. ELECTRA Clark et al. (2020) has a GAN-style pretraining task that efficiently utilizes all tokens in pretraining. Our work is closely related to this branch of literature, as we design a novel pretraining objective by incorporating multiple vocabularies. Our proposed task focuses on intra-sentence contextual learning, and it can easily be combined with sentence structural tasks like SOP.

Another branch of literature looks into the role of words in pre-training. Although not mentioned in Devlin et al. (2018), the authors propose whole word masking in their open-source repository, which is effective for pretraining BERT. In SpanBERT Joshi et al. (2019), text spans are masked in pre-training, and the learned model substantially enhances performance on span selection tasks. Word segmentation is especially important for Chinese PLMs: Cui et al. (2019) and Sun et al. (2019) both show that masking tokens in units of natural Chinese words instead of single Chinese characters can significantly improve Chinese PLMs. Zhu (2020) proposed a series of novel training tasks that leverage multiple vocabularies. Compared to this literature, we propose to leverage vocabularies of different granularity to naturally form different views of one sentence, which are used in contrastive learning, as in the sketch below.
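To make the objective concrete, the following is a minimal PyTorch sketch of the CMV contrastive branch as described in Section 1: per-vocabulary embedding layers, a shared encoder, unshared pooling heads, and a symmetrized negative cosine loss with stop-gradient. This is a sketch under our assumptions, not the exact released implementation; the pooling form, the module names, and all sizes beyond the two vocabulary sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CMVContrastiveHead(nn.Module):
    """SimSiam-style contrastive objective over two vocabulary views.

    The Transformer encoder is shared between the fine-grained and the
    coarse-grained branch; the embedding layers and the pooling heads
    are branch-specific. Sizes and module names are illustrative.
    """

    def __init__(self, encoder, fine_vocab=21128, coarse_vocab=69341,
                 embed_size=128, hidden_size=256):
        super().__init__()
        self.encoder = encoder  # shared encoder: (B, T, H) -> (B, T, H)
        self.fine_embed = nn.Embedding(fine_vocab, embed_size)
        self.coarse_embed = nn.Embedding(coarse_vocab, embed_size)
        self.proj = nn.Linear(embed_size, hidden_size)  # factorized embedding
        self.fine_pool = nn.Linear(hidden_size, hidden_size)    # unshared
        self.coarse_pool = nn.Linear(hidden_size, hidden_size)  # pooling heads

    def _branch(self, token_ids, embed, pool):
        h = self.encoder(self.proj(embed(token_ids)))  # contextualized states
        return torch.tanh(pool(h[:, 0]))               # pool the first token

    def forward(self, fine_ids, coarse_ids):
        z1 = self._branch(fine_ids, self.fine_embed, self.fine_pool)
        z2 = self._branch(coarse_ids, self.coarse_embed, self.coarse_pool)
        # Symmetrized negative cosine similarity with stop-gradient
        # (detach), following SimSiam (Chen & He, 2020).
        return -(F.cosine_similarity(z1, z2.detach()).mean()
                 + F.cosine_similarity(z2, z1.detach()).mean()) / 2
```

In pretraining, this loss would be added on top of the MLM (and SOP) losses; at finetuning time only one branch, i.e., one vocabulary with its embedding and pooling layers, is kept, so the model reduces to a vanilla BERT.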
3 EXPERIMENTS
3.1 SETUP
For the pre-training corpus, we use Chinese Wikipedia. We follow Zhu (2020) to build two vocabularies. The fine-grained vocabulary is character-level and its size is 21,128; the coarse-grained vocabulary is word-level and its size is 69,341.

We pretrain two models. The first is our baseline model, which is the standard BERT. For pretraining, whole word masking is adopted for the MLM task: in total, 15% of the words (from Chinese word segmentation) in the corpus are masked, and these are then tokenized into different tokens under the different vocabularies. The sentence order prediction (SOP) task is also used. The second model also has the MLM task, but we add the CMV objective to its pretraining.

Table 1: Main results on the Chinese benchmark datasets (tasks br, lcqmc and xnli with accuracy as the metric, and ccks with exact match/F1; rows compare ALBERT and CMV under the fine-grained and coarse-grained vocabularies). For each task and each model, experiments are repeated 10 times, and the average and standard deviation of the scores are reported.

In this article, all models use ALBERT as the encoder. We adopt a smaller parameter setting: the number of layers is 3, the embedding size is 128 and the hidden size is 256. Other configurations remain the same as in ALBERT Lan et al. (2019), and the pretraining hyper-parameters are almost the same as in ALBERT Lan et al. (2019). The maximum sequence length is 512, where sequence length is counted under the fine-grained vocabulary. The batch size is 1024, and all models are trained for 12.5k steps. The pretraining optimizer is LAMB and the learning rate is 1e-4. For finetuning, the sequence length is 256, the learning rate is 2e-5, the optimizer is Adam Kingma & Ba (2015), and the batch size is set so that each epoch contains fewer than 1000 steps. Each model is run on a given task 10 times, and the average performance scores and standard deviations are reported for reproducibility. A minimal configuration sketch is given below.
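For concreteness, here is a hedged sketch of the encoder configuration and hyper-parameters above, written with the AlbertConfig class from the HuggingFace transformers library. The use of this particular library, the number of attention heads, and the intermediate size are our assumptions; the remaining values are taken from the text.

```python
from transformers import AlbertConfig

# Small ALBERT encoder as described above. The three stated sizes
# (3 layers, embedding 128, hidden 256) come from the text; the number
# of heads and the intermediate size are assumptions.
config = AlbertConfig(
    vocab_size=21128,             # fine-grained, character-level vocabulary
    embedding_size=128,
    hidden_size=256,
    num_hidden_layers=3,
    num_attention_heads=4,        # assumed; must divide hidden_size
    intermediate_size=1024,       # assumed (4x hidden, the usual ratio)
    max_position_embeddings=512,  # pretraining maximum sequence length
)

# Hyper-parameters stated in the text.
pretrain_hparams = {"optimizer": "LAMB", "batch_size": 1024,
                    "train_steps": 12_500, "learning_rate": 1e-4}
finetune_hparams = {"optimizer": "Adam", "max_seq_length": 256,
                    "learning_rate": 2e-5}
```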
3.2 BENCHMARK TASKS
For downstream tasks, we select one text classification task: Book Review (br), collected from Douban by Liu et al. (2019a). For sentence-pair classification, we include the following two datasets: (1) XNLI (xnli) from Conneau et al. (2018); (2) LCQMC (lcqmc) Liu et al. (2018). We also investigate one NER task: CCKS NER (ccks), which is collected from medical records.

3.3 EXPERIMENTAL RESULTS
The upper rows of Table 1 give the results for vanilla ALBERT, and the lower rows report the results for CMV. The results show that our proposed CMV task is effective in improving the pretrained model's downstream performance, and that CMV is beneficial for both sentence-level and token-level tasks. In addition, we can see that the fine-grained vocabulary is more suitable for token-level tasks.
4 CONCLUSIONS
In this work, we first propose a novel method, CMV, which integrates contrastive learning into language model pretraining. We then construct our contrastive learning framework with the help of multiple vocabularies and SimSiam. Experiments show that CMV improves PLMs' downstream performance on both sentence-level and token-level tasks.

Dataset and evaluation links: https://embedding.github.io/evaluation/ , https://book.douban.com/ , https://biendata.com/competition/CCKS2017 2/

REFERENCES
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. arXiv e-prints, art. arXiv:1508.05326, August 2015.

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv e-prints, art. arXiv:2011.10566, November 2020.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv e-prints, art. arXiv:2003.10555, March 2020.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2475–2485, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1269.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. Pre-training with whole word masking for Chinese BERT. arXiv e-prints, art. arXiv:1906.08101, June 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. K-BERT: Enabling language representation with knowledge graph. arXiv e-prints, art. arXiv:1909.07606, September 2019a.

Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952–1962, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv e-prints, art. arXiv:1907.11692, July 2019b.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, 2018.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. arXiv e-prints, 2018.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. arXiv e-prints, 2019.

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. arXiv e-prints, art. arXiv:1806.03822, June 2018.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE 2.0: A continual pre-training framework for language understanding. arXiv e-prints, art. arXiv:1907.12412, July 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, and Luo Si. StructBERT: Incorporating language structures into pre-training for deep language understanding. arXiv e-prints, art. arXiv:1908.04577, August 2019.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv e-prints, art. arXiv:1906.08237, June 2019.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. arXiv e-prints, art. arXiv:2007.14062, July 2020.

Wei Zhu. MVP-BERT: Redesigning vocabularies for Chinese BERT and multi-vocab pretraining. arXiv e-prints.