Pre-Training a Language Model Without Human Language
Cheng-Han Chiang
National Taiwan University, Taiwan [email protected]
Hung-yi Lee
National Taiwan University, Taiwan [email protected]
Abstract
In this paper, we study how the intrinsic nature of pre-training data contributes to fine-tuned downstream performance. To this end, we pre-train different transformer-based masked language models on several corpora with certain features, and we fine-tune those language models on GLUE benchmarks. We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks. Our results also show that pre-training on structured data does not always make the model acquire ability that can be transferred to natural language downstream tasks. To our great astonishment, we uncover that pre-training on certain non-human language data gives GLUE performance close to that of a model pre-trained on another non-English language.
1 Introduction

Neural language models (LMs) are prevalent in today's natural language processing (NLP) community, and they are indispensable to a variety of NLP tasks. Researchers have devoted themselves to understanding what these models have learned and how they work. Probing a trained model is widely used to understand to what extent a model learns certain linguistic features (Kovaleva et al., 2019; Hewitt and Manning, 2019; Tenney et al., 2019, 2018; Lin et al., 2019). Another line of research focuses more on how training corpora affect the trained LMs (Micheli et al., 2020; Gururangan et al., 2020; Zhang et al., 2020).

In this work, we aim to understand how downstream performance varies across models pre-trained on data with particular traits. The core question we set out to answer is: what factors in the pre-training data make a pre-trained transformer LM perform better on downstream tasks than its trained-from-scratch counterpart? To answer this question, we pre-train many different transformer LMs on datasets from miscellaneous disciplines, ranging from amino acid sequences in complex living organisms to artificial data generated by a simple Python script. We then fine-tune them on English downstream tasks. The process is illustrated in Figure 1.

Recently, Papadimitriou and Jurafsky (2020) proposed to train an LSTM LM on a non-natural language dataset and test the LM's perplexity on natural language. They observed that LSTM LMs trained on structured datasets give perplexity far lower than those trained on unstructured data. While the observations are intriguing, this setting does not match the setting widely applied nowadays, in which we fine-tune pre-trained LMs on downstream tasks. This is the first paper investigating whether masked language model (MLM) pre-training on non-natural language aids downstream natural language task performance.

Based on the experiments, we have the following observations:

• We reveal that fine-tuning models pre-trained on unstructured data outperforms models trained from scratch on downstream tasks.
• We find that structured pre-training data is not a sufficient condition for a pre-trained model to perform well on NLP tasks.
• We discover that pre-training on a simple artificial dataset with hierarchical structure leads to downstream performance comparable to models pre-trained on human language.
• Our experiments show that token distribution is not a key factor in how well the model transfers to downstream tasks, while the number of token embeddings used during pre-training affects downstream performance.

Figure 1: Workflow of our experiments: we first pre-train the whole masked language model on L1 (protein sequences in this figure), and fine-tune the whole model on English downstream tasks. We then test the performance on the fine-tuned downstream task. It takes about 3 days to finish the whole process on a single V100.

2 Experiment Setup

In our experiments, we pre-train n RoBERTa-base (Liu et al., 2019) models on n different types of pre-training data. We call the pre-training data L1 (first language). We then evaluate the pre-trained models' ability by fine-tuning them on different downstream tasks. The overall workflow is illustrated in Figure 1.

We adopt the classic GLUE (Wang et al., 2019) benchmark to evaluate the models pre-trained on different L1s, excluding WNLI following Devlin et al. (2019). For each task, we use a fixed set of hyperparameters and the same random seed to fine-tune the model, and we report the results on the evaluation set.
Details regarding all experiments can be found in Appendix A.

Our experiment setup may seem at first sight to resemble the Test for Inductive Bias via Language Model Transfer (TILT) proposed by Papadimitriou and Jurafsky (2020), which pre-trains an LSTM LM on L1, then fine-tunes only the word embeddings on Spanish, and tests the perplexity on Spanish. However, the main purpose of TILT is to analyze the encoding of grammatical structure in LMs, so they fine-tune only the word embeddings and not the whole LSTM on Spanish. In contrast, our goal is to understand what factors in the pre-training data make the pre-trained model perform better than models trained from scratch on downstream tasks.

3 Pre-training Data

We use two baseline pre-training datasets for our experiments: the random baseline and the Zipf baseline. Both corpora have 29995 tokens, excluding 5 special tokens. For the random baseline, we draw the tokens from a uniform distribution and form sequences with a length of 90 to 120 tokens. For the Zipf baseline, we sample the tokens from the same uni-gram distribution as English. We also pre-train an English MLM on a subset of the English Wikipedia to serve as the performance upper bound. The pre-training corpus size is around 80MB for each of these three datasets.

We select several pre-training corpora from distinct disciplines that contain structure, including a biological dataset, a programming language corpus, an artificial dataset with a hierarchical structure, and a human language.

The biological dataset we adopt is the amino acid sequence corpus obtained from Min et al. (2019). The characteristics of a protein are determined by its primary structure, i.e., the amino acid sequence. Chemical bonds between amino acids determine the secondary and tertiary structure of the folded protein, which further determines the functions of the protein. We use the one-letter abbreviation (A-Z) to represent each amino acid, and the total number of tokens in this dataset is 36M.

For the programming language, we use the Habeas corpus from Movshovitz-Attias and Cohen (2013), which contains tokenized Java code. We use the code from Papadimitriou and Jurafsky (2020) to extract the data and remove tokens that are labeled as comments, making the training corpus contain only programming language. The total number of tokens in the pre-training data is 10M, and the vocabulary size of the model is 30K.

The artificial dataset we construct has a vocabulary size of 28996, and the total number of tokens in the training data is 23.5M. The dataset is generated by the following stack-based grammar, following Papadimitriou and Jurafsky (2020): at each time step $t$, we sample $X_t$ from a Bernoulli distribution with a fixed $P(X_t = 1)$. If $X_t = 1$, we sample a token based on English's uni-gram distribution, place the sampled token at position $t$ of the generated sequence, and push the same token onto the stack. When $X_t = 0$, we pop the top element of the stack and put the popped token at position $t$ of the generated sequence. Figure 2 shows a simple example. We can observe from Figure 2 that a sequence generated in this manner contains a nesting, hierarchical parentheses structure, which is similar to the dependency tree structure in natural language.

Figure 2: An illustration of the artificial dataset.

The last dataset used is a human language. We select a human language different from the language of the downstream tasks to compare the effect of non-human language pre-training data. We use Kannada from the OSCAR dataset (Suárez et al., 2020). Kannada is a language predominantly spoken by the people in the southwestern region of India. The main reason we choose this dataset lies in its subject(S)-object(O)-verb(V) structure, different from the S-V-O structure of our target language used in fine-tuning. The pre-training corpus size is 160MB, and the vocabulary size used in pre-training is 30K.
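To make the generation procedure concrete, below is a minimal Python sketch of the stack-based generator for the artificial dataset described above. The push probability `p_push`, the target sequence length, and the unigram weights are placeholders rather than the exact values used in our experiments.

```python
import random

def generate_sequence(vocab, weights, p_push=0.4, target_len=100):
    """Generate one nested, parenthesis-like token sequence.

    vocab:   list of token ids
    weights: unigram probabilities (e.g. a Zipf-like distribution)
    p_push:  probability of opening a new token (assumed value here)
    """
    stack, seq = [], []
    while len(seq) < target_len:
        if stack and random.random() > p_push:
            # close the most recently opened token
            seq.append(stack.pop())
        else:
            # open a new token drawn from the unigram distribution
            tok = random.choices(vocab, weights=weights, k=1)[0]
            seq.append(tok)
            stack.append(tok)
    # close whatever is still open so the sequence is well nested
    while stack:
        seq.append(stack.pop())
    return seq

# example: a 100-token vocabulary with Zipf-like weights
print(generate_sequence(list(range(100)), [1.0 / (r + 1) for r in range(100)]))
```

Because every popped token repeats an earlier pushed token, a model trained on such data must track long-distance, hierarchically nested dependencies in order to predict masked tokens.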
4 Results

The overall results are shown in Table 1. In this section, we discuss how certain aspects of the pre-training corpora affect how good a model can become. By the word good, we refer to the model's ability to be fine-tuned on downstream tasks, that is, the performance improvement over training the model from scratch on downstream tasks.
4.1 Structured and Unstructured Pre-training Data

We intend to answer this question: is structured data the key to a good pre-trained model? We compare the models pre-trained on structured data with models pre-trained on the unstructured baselines. If the downstream performance of models pre-trained on structured data can beat their unstructured counterparts, then we may conclude that structure in the pre-training data is a key factor in the success of pre-trained transformer language models.

From the first two blocks of Table 1, we find that models pre-trained on unstructured data outperform the models trained from scratch. This suggests that a pre-trained model can still aid downstream performance despite the seemingly meaningless pre-training corpora.

From the third block in Table 1, we find that pre-training on structured data may not always lead to a better model. Models pre-trained on amino acid sequences and Java code are almost on a par with the models trained from scratch. Not much to our surprise, the model pre-trained on Kannada performs far better than the two baseline models.

Remarkably, fine-tuning the model pre-trained on artificial data gives performance comparable to the model pre-trained on Kannada. This implies that it might be worth trying to pre-train a model on this kind of hierarchically nested, structured dataset and fine-tune the model on some low-resource languages to obtain decent downstream performance. The artificial dataset contains no semantic knowledge useful for downstream natural language tasks, so it is reasonable to infer that most of the knowledge the model learns from pre-training is the skill of modeling hierarchical structure and long-term dependency. Equipped with this ability, the model can outrun models trained on unstructured data.

Our results show that models benefit from pre-training on certain types of structured corpora, while not every structured corpus leads to a good pre-trained model for NLP downstream tasks.
4.2 The Effect of Token Distribution

We notice that the two baseline models' performance is similar on almost all downstream tasks. This indicates that the uni-gram distribution of tokens in the training corpora makes little difference to the downstream performance when the pre-training data themselves are unstructured. We further ask whether this is also the case when the data is structured. We construct the artificial dataset as in Section 3, and aside from sampling based on the Zipf distribution, we create another dataset whose tokens are sampled from the uniform distribution over tokens, except for special tokens.
No Pre-train
Pre-train En
Rand. Baseline
Zipf Baseline
Amino Acid
Java Script
Kannada
Artificial (Uni.)
Artificial (Zipf)
Artificial (5000)
Artificial (500)
Artificial (50)
Artificial (50-s)

Table 1: Downstream results of different pre-trained models and of the model trained from scratch on the downstream tasks (No Pre-train in the first row). The evaluation metric for MRPC and QQP is the F1 score, the Spearman correlation coefficient is reported for STS-B, and the remaining tasks are evaluated with accuracy. Results for MNLI are the average of matched and mismatched. Please refer to Section 4.2 and Section 4.3 for the meaning of the parentheses in the last two blocks; 50-s stands for 50-substitute in Section 4.3. Abbreviations used: En: English, Rand.: random, Uni.: uniform.

The results, demonstrated in the fourth block of Table 1, show that even when the pre-training data is structured, the token distribution still has little influence on how well the model can be fine-tuned.
4.3 The Number of Tokens Used in Pre-training

This section investigates whether the mismatch between the vocabulary size used during pre-training and that used during fine-tuning contributes to how well the pre-trained model performs on downstream tasks. To study the influence of vocabulary size, we construct different artificial datasets by sampling tokens from different bin sizes (50, 500, and 5000), i.e., different numbers of distinct tokens in the pre-training data. While the vocabulary size during pre-training differs across these models, their actual word embedding table sizes are still the same.

From the last block in Table 1, we observe that the averaged performance significantly degrades when only 50 tokens are used during pre-training, while the performance gradually recovers as the token number mismatch between pre-training and fine-tuning narrows. Tokens appearing in the pre-training data receive disproportionately larger gradients than tokens not in the pre-training data during pre-training, and this artifact cripples the downstream performance.

The above observation makes it hard to tell whether the model pre-trained on amino acid sequences fails to perform well on downstream tasks merely because of the token number mismatch. Thus, we conduct further experiments to remove the undesirable artifact arising from the mismatch. Say we only use the first 50 tokens (excluding special tokens) during pre-training while the remaining 29950 token embeddings are not used; then, before fine-tuning the model on downstream tasks, we substitute those unused token embeddings with the 50 used token embeddings. We call this setting 50-substitute. In this case, different tokens share the same token embeddings when the model starts to be fine-tuned.

From the last row in Table 1, we find that the model recovers its ability to be fine-tuned when pre-trained on the artificial dataset. However, when performing the same substitution on the model pre-trained on amino acids, the model still fails to be fine-tuned. Together with Section 4.1, we can conclude that the main reason a pre-trained model fails to transfer to human language downstream tasks lies in the intrinsic properties of the pre-training data.
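As a rough illustration of the 50-substitute manipulation, the unused embedding rows can be overwritten with copies of the trained rows before fine-tuning. The sketch below assumes a Hugging Face checkpoint, a hypothetical checkpoint path, and that the non-special tokens used during pre-training occupy the first ids after the special tokens; it is not our exact script.

```python
import torch
from transformers import RobertaForSequenceClassification

# hypothetical path to a model pre-trained with only the first 50 non-special tokens
model = RobertaForSequenceClassification.from_pretrained("path/to/artificial-50-checkpoint")

num_special = 5   # special tokens keep their own embeddings
num_used = 50     # tokens actually seen during pre-training

with torch.no_grad():
    emb = model.roberta.embeddings.word_embeddings.weight   # (vocab_size, hidden)
    used_rows = emb[num_special:num_special + num_used].clone()
    for i in range(num_special + num_used, emb.size(0)):
        # every token unseen in pre-training reuses one of the 50 trained embeddings
        emb[i] = used_rows[(i - num_special) % num_used]
```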
4.4 Fine-tuning the Word Embeddings on English MLM Before Fine-tuning on GLUE

It is natural to fine-tune the word embeddings of the pre-trained models on English before fine-tuning on GLUE. This aligns the word embeddings of the L1, acquired during pre-training, with the word embeddings of English. We conduct experiments similar to those in Table 1; the only difference is that we fine-tune the word embeddings and the language model head of the pre-trained model with MLM on English before fine-tuning on GLUE. We find that the performance improves slightly in most cases, with the improvement for Java code being the most salient. We leave the detailed results to Appendix B.
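A minimal sketch of this extra stage, assuming a Hugging Face checkpoint pre-trained on L1: all transformer parameters are frozen and only the word embeddings and LM head are updated by English MLM training. The checkpoint path is a placeholder.

```python
from transformers import RobertaForMaskedLM

# hypothetical path to a checkpoint pre-trained on a non-English L1
model = RobertaForMaskedLM.from_pretrained("path/to/l1-checkpoint")

# freeze everything, then unfreeze only the word embeddings and the LM head,
# so English MLM training realigns the vocabulary-level parameters only
for param in model.parameters():
    param.requires_grad = False
for param in model.roberta.embeddings.word_embeddings.parameters():
    param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

# the model can then be trained with the standard MLM objective on English
# Wikipedia, e.g. using transformers.Trainer with DataCollatorForLanguageModeling
```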
5 Conclusion

We study how pre-training data might and might not affect the downstream performance of a transformer-based pre-trained LM. We find that fine-tuning models pre-trained on data without any structure can surpass the performance obtained by training directly from scratch on downstream tasks. Our results also show that pre-training on structured non-human language corpora does not always equip the model to perform competently on downstream tasks. We also discover that pre-training on a certain artificial dataset gives downstream performance comparable to pre-training on another natural language. We reveal that the token distribution in the pre-training corpora barely affects the pre-trained model's performance on downstream tasks. Last, our experiments show that the number of token embeddings used during pre-training greatly contributes to the downstream performance, while this can be mitigated by some manipulations of the token embeddings in certain cases. We hope our analysis provides insights into what kind of pre-training data makes a pre-trained model a good pre-trained model.
Broader Impact
We find a surprisingly simple artificial dataset for pre-training a language model, and we believe that our work has the potential to be applied to low-resource languages for which pre-training data are scarce. We think our work does not cause any ethical issues.
References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4356–4365.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open Sesame: Getting inside BERT's linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Vincent Micheli, Martin d'Hoffschmidt, and François Fleuret. 2020. On the importance of pre-training data volume for compact language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7853–7858.

Seonwoo Min, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, and Sungroh Yoon. 2019. Pre-training of deep bidirectional protein sequence representations with structural information. arXiv preprint arXiv:1912.05625.

Dana Movshovitz-Attias and William Cohen. 2013. Natural language models for predicting programming comments. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 35–40.

Isabel Papadimitriou and Dan Jurafsky. 2020. Learning music helps you read: Using transfer to study linguistic structure in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829–6839.

Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. 2018. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Yian Zhang, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman. 2020. When do you need billions of words of pretraining data? arXiv preprint arXiv:2011.04946.

A Experiment Details
We give the detailed model architecture of our RoBERTa-base models and the hyperparameters used in pre-training.
A.1 Model
We use RoBERTa-base, a 12-layer transformer model with hidden dimension 768 and 12 attention heads per layer. The total number of parameters of the model is around 110M. We pre-train RoBERTa using the Huggingface (Wolf et al., 2019) code base.
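For reference, a configuration along these lines can be instantiated with Hugging Face Transformers as follows. The vocabulary size differs per L1 (see Table 3) and the value here is only an example, and the position count is an assumption derived from the 128-token maximum length in Table 2.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=30000,             # varies per pre-training corpus (Table 3)
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=130,  # assumption: 128 positions plus RoBERTa's offset
)
model = RobertaForMaskedLM(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```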
A.2 Hyperparameters
The hyperparameters used in all pre-training experiments are listed in Table 2.

Batch size      150
Learning rate   5E-5
Total steps     200K
Warmup steps    10K
Max position    128
Table 2: Pre-training hyperparameters for RoBERTa.
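As an illustration of how the hyperparameters in Table 2 map onto the Hugging Face Trainer, a rough sketch is given below. The `model`, `tokenizer`, and `train_dataset` objects are assumed to be built from the L1 corpus, the output path is a placeholder, and the per-device batch size and accumulation steps are only one way to reach an effective batch size of 150.

```python
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# `model`, `tokenizer`, and `train_dataset` are assumed to come from the L1 corpus
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)  # assumed masking rate

args = TrainingArguments(
    output_dir="checkpoints",        # placeholder output path
    per_device_train_batch_size=30,
    gradient_accumulation_steps=5,   # 30 x 5 = effective batch size of 150
    learning_rate=5e-5,
    max_steps=200_000,
    warmup_steps=10_000,
)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()
```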
A.3 Pre-training Data
We put all the details related to the pre-training data in Table 3. We provide download links to the pre-training datasets, along with the training and validation loss at the end of pre-training. The artificial data and the baseline datasets can be generated following the script in our code. The train/evaluation splits can be found in the supplementary materials. We also include the vocabulary size (including special tokens) of each model in the last column. The vocabulary files are obtained by training a WordPiece tokenizer on the training data of the Java, Kannada, and Wikipedia datasets.
A.4 Fine-tuning Details
We fine-tune on GLUE using the Huggingface (Wolf et al., 2019) code base. The model fine-tuned in this section is RoBERTa-base with a classification head on top of the last transformer layer. The whole fine-tuned model has around 110M parameters.
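A minimal sketch of loading one of our L1 checkpoints for GLUE fine-tuning with the Hugging Face code base; the checkpoint path is a placeholder, and `num_labels` depends on the task (e.g., 2 for most classification tasks, 1 for STS-B regression).

```python
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

checkpoint = "path/to/l1-checkpoint"   # hypothetical pre-trained L1 model

tokenizer = RobertaTokenizerFast.from_pretrained(checkpoint)
# a fresh classification head is placed on top of the last transformer layer;
# the whole model (encoder + head) is fine-tuned on the GLUE task
model = RobertaForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```

The per-task learning rates, batch sizes, and step counts are those listed in Table 5.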
A.4.1 Dataset
We provide statistics on the 8 GLUE tasks we use in Table 4.
A.4.2 Fine-tuning Hyperparameters
We list the hyperparameters used in fine-tuning on GLUE in Table 5.
A.5 Resource
Our computation resource is the V100 GPU. Pre-training a RoBERTa model with the hyperparameters given in Table 2 takes 60 hours on a single V100, and fine-tuning the pre-trained model on the 8 GLUE tasks with the hyperparameters in Table 5 takes about 12 hours on a V100.
B Fine-tune the Model on English MLM Before Fine-tuning on GLUE
This appendix gives the detailed experiment data for Section 4.4.

Dataset               Link        Training Loss   Eval. Loss   Vocab Size
Wikipedia             Wikidump    2.204           3.354        30000
Java                  Java data   0.03227         1.025        30000
Amino Acid            PLUS        2.041           2.201        28895
Kannada               OSCAR       2.366           3.128        30000
Random baseline       NA          9.428           9.467        30000
Zipf Baseline         NA          6.351           6.446        30000
Artificial (Uniform)  NA          1.996           2.409        29991
Artificial (Zipf)     NA          1.599           1.774        29991
Artificial (50)       NA          1.558           1.754        29991
Artificial (500)      NA          1.563           1.762        29991
Artificial (5000)     NA          1.548           1.701        29991

Table 3: Details of the datasets used in pre-training.
Figure 3: Workflow of our experiments for Section 4.4. Stage 1: L1 MLM pre-training; Stage 2: English MLM fine-tuning (transformer fixed, only the word embeddings and LM head are updated); Stage 3: GLUE fine-tuning of the whole model; Stage 4: GLUE testing (all parameters fixed). We first pre-train the whole masked language model on L1 (protein sequences in this figure), and then fine-tune only the word embeddings and the language model head on English Wikipedia. The third stage is fine-tuning the whole model on English downstream tasks, and the last stage is testing the performance on the fine-tuned downstream task.
Task     Examples (train / dev / test)
MRPC     3.6K / 0.4K / 1.7K
RTE      2.4K / 0.2K / 3K
STS-B    5.7K / 1.5K / 1.3K
QNLI     104K / 5.4K / 5.4K
QQP      363K / 40.4K / 391.0K
CoLA     8.5K / 1.0K / 1.1K
MNLI     392.7K / 9.8K + 9.8K / 9.8K + 9.8K
SST-2    67.4K / 0.9K / 1.8K

Table 4: Statistics (train / dev / test) of the GLUE tasks. MNLI contains matched and mismatched dev and test sets. We did not evaluate our models' performance on the test sets.
Task       LR        BSZ   RoBERTa DR   Classifier DR   TS      WS     MSL
CoLA       1.00E-05  16    0            0.1             5336    320    128
STS-B      2.00E-05  16    0            0.1             3598    214    128
SST-2      1.00E-05  32    0            0.1             20935   1256   128
MNLI       3.00E-05  128   0            0.1             10000   1000   128
QNLI       1.00E-05  32    0            0.1             33112   1986   128
QQP        5.00E-05  128   0            0.1             14000   1000   128
RTE        3.00E-05  32    0            0.1             800     200    128
MRPC       2.00E-05  32    0            0.1             800     200    128
SQuAD2.0   3.00E-05  48    0            0.1             8144    814    128

Table 5: Hyperparameters used for fine-tuning on downstream tasks. LR: Learning Rate. BSZ: Batch Size. DR: Dropout Rate. TS: Training Steps. WS: Warmup Steps. MSL: Maximum Sequence Length.
L1                     STS-B   QNLI   QQP   CoLA   SST-2   MNLI   MRPC   RTE   Avg
No Pre-train
Pre-train on English
Random Baseline
Zipf Baseline
Amino Acid
Java Script
Kannada
Artificial (Uniform)
Artificial (Zipf)      0.79    0.79   0.83  0.11   0.82    0.72   0.75   0.57  0.67