Pre-Training a Language Model Without Human Language
Cheng-Han Chiang
National Taiwan University, Taiwan [email protected]
Hung-yi Lee
National Taiwan University, Taiwan [email protected]
Abstract
In this paper, we study how the intrinsic nature of pre-training data contributes to fine-tuned downstream performance. To this end, we pre-train different transformer-based masked language models on several corpora with certain features, and we fine-tune those language models on GLUE benchmarks. We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks. Our results also show that pre-training on structured data does not always make the model acquire ability that can be transferred to natural language downstream tasks. To our great astonishment, we uncover that pre-training on certain non-human language data gives GLUE performance close to that of a model pre-trained on another non-English language.
1 Introduction

Neural language models (LMs) are prevalent in today's natural language processing (NLP) community, and they are indispensable to a variety of NLP tasks. Researchers have devoted themselves to understanding what these models have learned and how they work. Probing a trained model is widely used to understand to what extent a model learns certain linguistic features (Kovaleva et al., 2019; Hewitt and Manning, 2019; Tenney et al., 2019, 2018; Lin et al., 2019). Another line of research focuses more on how training corpora affect the trained LMs (Micheli et al., 2020; Gururangan et al., 2020; Zhang et al., 2020).

In this work, we aim to understand how downstream performance varies across models pre-trained on data with particular traits. The core question we set out to answer is: what factors in the pre-training data make a pre-trained transformer LM perform better on downstream tasks than its trained-from-scratch counterpart? To answer this question, we pre-train many different transformer LMs on datasets from miscellaneous disciplines, ranging from amino acid sequences in complex living organisms to artificial data generated by a simple Python script. We then fine-tune them on English downstream tasks. The process is illustrated in Figure 1.

Recently, Papadimitriou and Jurafsky (2020) proposed to train an LSTM LM on a non-natural language dataset and test the LM's perplexity on natural language. They observed that LSTM LMs trained on structured datasets give perplexity far lower than those trained on unstructured data. While the observations are intriguing, this setting does not match the setting widely applied nowadays, in which we fine-tune pre-trained LMs on downstream tasks. This is the first paper investigating whether masked language model (MLM) pre-training on non-natural language aids downstream natural language task performance.

Based on the experiments, we have the following observations:

• We reveal that fine-tuning models pre-trained on unstructured data outperforms models trained from scratch on downstream tasks.
• We find that structured pre-training data is not a sufficient condition for a pre-trained model to perform well on NLP tasks.
• We discover that pre-training on a simple artificial dataset with hierarchical structure leads to downstream performance comparable to models pre-trained on human language.
• Our experiments show that token distribution is not a key factor in how well the model transfers to downstream tasks, while the number of token embeddings used during pre-training affects downstream performance.

Figure 1: Workflow of our experiments: we first pre-train the whole masked language model on L1 (protein sequences in this figure), and fine-tune the whole model on English downstream tasks. We then test the performance on the fine-tuned downstream task. It takes about 3 days to finish the whole process on a single V100.

2 Experiment Setup

In our experiments, we pre-train n RoBERTa-base (Liu et al., 2019) models on n different types of pre-training data. We call the pre-training data L1 (first language). We then evaluate the pre-trained models' ability by fine-tuning them on different downstream tasks. The overall workflow is illustrated in Figure 1.

We adopt the classic GLUE (Wang et al., 2019) benchmark to evaluate the models pre-trained on different L1s, excluding WNLI following Devlin et al. (2019). For each task, we use a fixed set of hyperparameters and the same random seed to fine-tune the model, and we report the results on the evaluation set.
Details regarding all experiments can be found in Appendix A.

Our experiment setup may seem at first sight to resemble the Test for Inductive Bias via Language Model Transfer (TILT) proposed by Papadimitriou and Jurafsky (2020), which pre-trains an LSTM LM on L1, then fine-tunes only the word embeddings on Spanish, and tests the perplexity on Spanish. However, the main purpose of TILT is to analyze the encoding of grammatical structure in LMs, so they fine-tune only the word embeddings and not the whole LSTM on Spanish. In contrast, our goal is to understand what factors in the pre-training data make the pre-trained model perform better than models trained from scratch on downstream tasks.

3 Pre-training Data

We use two baseline pre-training datasets for our experiments: the random baseline and the Zipf baseline. Both corpora have 29995 tokens, excluding 5 special tokens. For the random baseline, we draw the tokens from a uniform distribution and form sequences with a length of 90 to 120 tokens. For the Zipf baseline, we sample the tokens from the same uni-gram distribution as English. We also pre-train an English MLM on a subset of the English Wikipedia to serve as the performance upper bound. The pre-training corpus size is around 80MB for each of these three datasets.

We select several pre-training corpora from distinct disciplines that contain structure, including a biological dataset, a programming language corpus, an artificial dataset with a hierarchical structure, and a human language.

The biological dataset we adopt is the amino acid sequence corpus obtained from Min et al. (2019). The characteristics of a protein are determined by its primary structure, i.e., the amino acid sequence. Chemical bonds between amino acids determine the secondary and tertiary structure of the folded protein, which further determines the functions of the protein. We use the one-letter abbreviation (A-Z) to represent each amino acid, and the total number of tokens in this dataset is 36M.

For the programming language, we use the Habeas corpus from Movshovitz-Attias and Cohen (2013), which contains tokenized Java code. We use the code from Papadimitriou and Jurafsky (2020) to extract the data and remove tokens that are labeled as comments, making the training corpus contain only programming language. The total number of tokens in the pre-training data is 10M, and the vocabulary size of the model is 30K.

The artificial dataset we construct has a vocabulary size of 28996, and the total number of tokens in the training data is 23.5M. The dataset is generated by the following stack-based grammar, following Papadimitriou and Jurafsky (2020): at each time step $t$, we sample $X_t$ from a Bernoulli distribution with a fixed $P(X_t = 1)$. If $X_t = 1$, we sample a token based on English's uni-gram distribution, place the sampled token at position $t$ of the generated sequence, and push the same token onto the stack. When $X_t = 0$, we pop the top element of the stack and put the popped token at position $t$ of the generated sequence. Figure 2 shows a simple example. We can observe from Figure 2 that a sequence generated in this manner contains a nesting, hierarchical parentheses structure, which is similar to the dependency tree structure in natural language.

Figure 2: An illustration of the artificial dataset.

The last dataset used is a human language. We select a human language different from the language of the downstream tasks to compare the effect of non-human language pre-training data. We use Kannada from the OSCAR dataset (Suárez et al., 2020). Kannada is a language predominantly spoken by the people in the southwestern region of India. The main reason we choose this dataset lies in its subject(S)-object(O)-verb(V) structure, different from the S-V-O structure of our target language used in fine-tuning. The pre-training corpus size is 160MB, and the vocabulary size used in pre-training is 30K.
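To make the generation procedure concrete, below is a minimal Python sketch of the stack-based generator for the artificial dataset described above. The push probability `p_push`, the target sequence length, and the unigram weights are placeholders rather than the exact values used in our experiments.

```python
import random

def generate_sequence(vocab, weights, p_push=0.4, target_len=100):
    """Generate one nested, parenthesis-like token sequence.

    vocab:   list of token ids
    weights: unigram probabilities (e.g. a Zipf-like distribution)
    p_push:  probability of opening a new token (assumed value here)
    """
    stack, seq = [], []
    while len(seq) < target_len:
        if stack and random.random() > p_push:
            # close the most recently opened token
            seq.append(stack.pop())
        else:
            # open a new token drawn from the unigram distribution
            tok = random.choices(vocab, weights=weights, k=1)[0]
            seq.append(tok)
            stack.append(tok)
    # close whatever is still open so the sequence is well nested
    while stack:
        seq.append(stack.pop())
    return seq

# example: a 100-token vocabulary with Zipf-like weights
print(generate_sequence(list(range(100)), [1.0 / (r + 1) for r in range(100)]))
```

Because every popped token repeats an earlier pushed token, a model trained on such data must track long-distance, hierarchically nested dependencies in order to predict masked tokens.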
4 Results

The overall results are shown in Table 1. In this section, we discuss how certain aspects of the pre-training corpora affect how good a model can become. By the word good, we refer to the model's ability to be fine-tuned on downstream tasks, that is, the performance improvement over training the model from scratch on downstream tasks.
4.1 Structured and Unstructured Pre-training Data

We intend to answer this question: is structured data the key to a good pre-trained model? We compare the models pre-trained on structured data with models pre-trained on the unstructured baselines. If the downstream performance of models pre-trained on structured data can beat their unstructured counterparts, then we may conclude that structure in the pre-training data is a key factor in the success of pre-trained transformer language models.

From the first two blocks of Table 1, we find that models pre-trained on unstructured data outperform the models trained from scratch. This suggests that a pre-trained model can still aid downstream performance despite the seemingly meaningless pre-training corpora.

From the third block in Table 1, we find that pre-training on structured data may not always lead to a better model. Models pre-trained on amino acid sequences and Java code are almost on a par with the models trained from scratch. Not much to our surprise, the model pre-trained on Kannada performs far better than the two baseline models.

Remarkably, fine-tuning the model pre-trained on artificial data gives performance comparable to the model pre-trained on Kannada. This implies that it might be worth trying to pre-train a model on this kind of hierarchically nested, structured dataset and fine-tune the model on some low-resource languages to obtain decent downstream performance. The artificial dataset contains no semantic knowledge useful for downstream natural language tasks, so it is reasonable to infer that most of the knowledge the model learns from pre-training is the skill of modeling hierarchical structure and long-term dependency. Equipped with this ability, the model can outrun models trained on unstructured data.

Our results show that models benefit from pre-training on certain types of structured corpora, while not every structured corpus leads to a good pre-trained model for NLP downstream tasks.
4.2 The Effect of Token Distribution

We notice that the two baseline models' performance is similar on almost all downstream tasks. This indicates that the uni-gram distribution of tokens in the training corpora makes little difference to the downstream performance when the pre-training data themselves are unstructured. We further ask whether this is also the case when the data is structured. We construct the artificial dataset as in Section 3, and aside from sampling based on the Zipf distribution, we create another dataset whose tokens are sampled from the uniform distribution over tokens, except for special tokens.
No Pre-train
Pre-train En
Rand. Baseline
Zipf Baseline
Amino Acid
Java Script
Kannada
Artificial (Uni.)
Artificial (Zipf)
Artificial (5000)
Artificial (500)
Artificial (50)
Artificial (50-s)

Table 1: Downstream results of different pre-trained models and of the model trained from scratch on the downstream tasks (No Pre-train in the first row). The evaluation metric for MRPC and QQP is the F1 score, the Spearman correlation coefficient is reported for STS-B, and the remaining tasks are evaluated with accuracy. Results for MNLI are the average of matched and mismatched. Please refer to Section 4.2 and Section 4.3 for the meaning of the parentheses in the last two blocks; 50-s stands for 50-substitute in Section 4.3. Abbreviations used: En: English, Rand.: random, Uni.: uniform.

The results, demonstrated in the fourth block of Table 1, show that even when the pre-training data is structured, the token distribution still has little influence on how well the model can be fine-tuned.
4.3 The Number of Tokens Used in Pre-training

This section investigates whether the mismatch between the vocabulary size used during pre-training and that used during fine-tuning contributes to how well the pre-trained model performs on downstream tasks. To study the influence of vocabulary size, we construct different artificial datasets by sampling tokens from different bin sizes (50, 500, and 5000), i.e., different numbers of distinct tokens in the pre-training data. While the vocabulary size during pre-training differs across these models, their actual word embedding table sizes are still the same.

From the last block in Table 1, we observe that the averaged performance significantly degrades when only 50 tokens are used during pre-training, while the performance gradually recovers as the token number mismatch between pre-training and fine-tuning narrows. Tokens appearing in the pre-training data receive disproportionately larger gradients than tokens not in the pre-training data during pre-training, and this artifact cripples the downstream performance.

The above observation makes it hard to tell whether the model pre-trained on amino acid sequences fails to perform well on downstream tasks merely because of the token number mismatch. Thus, we conduct further experiments to remove the undesirable artifact arising from the mismatch. Say we only use the first 50 tokens (excluding special tokens) during pre-training while the remaining 29950 token embeddings are not used; then, before fine-tuning the model on downstream tasks, we substitute those unused token embeddings with the 50 used token embeddings. We call this setting 50-substitute. In this case, different tokens share the same token embeddings when the model starts to be fine-tuned.

From the last row in Table 1, we find that the model recovers its ability to be fine-tuned when pre-trained on the artificial dataset. However, when performing the same substitution on the model pre-trained on amino acids, the model still fails to be fine-tuned. Together with Section 4.1, we can conclude that the main reason a pre-trained model fails to transfer to human language downstream tasks lies in the intrinsic properties of the pre-training data.
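As a rough illustration of the 50-substitute manipulation, the unused embedding rows can be overwritten with copies of the trained rows before fine-tuning. The sketch below assumes a Hugging Face checkpoint, a hypothetical checkpoint path, and that the non-special tokens used during pre-training occupy the first ids after the special tokens; it is not our exact script.

```python
import torch
from transformers import RobertaForSequenceClassification

# hypothetical path to a model pre-trained with only the first 50 non-special tokens
model = RobertaForSequenceClassification.from_pretrained("path/to/artificial-50-checkpoint")

num_special = 5   # special tokens keep their own embeddings
num_used = 50     # tokens actually seen during pre-training

with torch.no_grad():
    emb = model.roberta.embeddings.word_embeddings.weight   # (vocab_size, hidden)
    used_rows = emb[num_special:num_special + num_used].clone()
    for i in range(num_special + num_used, emb.size(0)):
        # every token unseen in pre-training reuses one of the 50 trained embeddings
        emb[i] = used_rows[(i - num_special) % num_used]
```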
4.4 Fine-tuning the Word Embeddings on English MLM Before Fine-tuning on GLUE

It is natural to fine-tune the word embeddings of the pre-trained models on English before fine-tuning on GLUE. This aligns the word embeddings of the L1, acquired during pre-training, with the word embeddings of English. We conduct experiments similar to those in Table 1; the only difference is that we fine-tune the word embeddings and the language model head of the pre-trained model with MLM on English before fine-tuning on GLUE. We find that the performance improves slightly in most cases, with the improvement for Java code being the most salient. We leave the detailed results to Appendix B.
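A minimal sketch of this extra stage, assuming a Hugging Face checkpoint pre-trained on L1: all transformer parameters are frozen and only the word embeddings and LM head are updated by English MLM training. The checkpoint path is a placeholder.

```python
from transformers import RobertaForMaskedLM

# hypothetical path to a checkpoint pre-trained on a non-English L1
model = RobertaForMaskedLM.from_pretrained("path/to/l1-checkpoint")

# freeze everything, then unfreeze only the word embeddings and the LM head,
# so English MLM training realigns the vocabulary-level parameters only
for param in model.parameters():
    param.requires_grad = False
for param in model.roberta.embeddings.word_embeddings.parameters():
    param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

# the model can then be trained with the standard MLM objective on English
# Wikipedia, e.g. using transformers.Trainer with DataCollatorForLanguageModeling
```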
5 Conclusion

We study how pre-training data might and might not affect the downstream performance of a transformer-based pre-trained LM. We find that fine-tuning models pre-trained on data without any structure can surpass the performance obtained by training directly from scratch on downstream tasks. Our results also show that pre-training on structured non-human language corpora does not always equip the model to perform competently on downstream tasks. We also discover that pre-training on a certain artificial dataset gives downstream performance comparable to pre-training on another natural language. We reveal that the token distribution in the pre-training corpora barely affects the pre-trained model's performance on downstream tasks. Last, our experiments show that the number of token embeddings used during pre-training greatly contributes to the downstream performance, while this can be mitigated by some manipulations of the token embeddings in certain cases. We hope our analysis provides insights into what kind of pre-training data makes a pre-trained model a good pre-trained model.
Broader Impact
We find a surprisingly simple artificial dataset for pre-training a language model, and we believe that our work has the potential to be applied to low-resource languages for which pre-training data are scarce. We think our work does not cause any ethical issues.
References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4356–4365.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open Sesame: Getting inside BERT's linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Vincent Micheli, Martin d'Hoffschmidt, and François Fleuret. 2020. On the importance of pre-training data volume for compact language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7853–7858.

Seonwoo Min, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, and Sungroh Yoon. 2019. Pre-training of deep bidirectional protein sequence representations with structural information. arXiv preprint arXiv:1912.05625.

Dana Movshovitz-Attias and William Cohen. 2013. Natural language models for predicting programming comments. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 35–40.

Isabel Papadimitriou and Dan Jurafsky. 2020. Learning music helps you read: Using transfer to study linguistic structure in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829–6839.

Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. 2018. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Yian Zhang, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman. 2020. When do you need billions of words of pretraining data? arXiv preprint arXiv:2011.04946.

A Experiment Details
We give the detailed model architecture of our RoBERTa-base models and the hyperparameters used in pre-training.
A.1 Model
We use RoBERTa-base, a 12-layer transformer model with hidden dimension 768 and 12 attention heads per layer. The total number of parameters of the model is around 110M. We pre-train RoBERTa using the Huggingface (Wolf et al., 2019) code base.
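For reference, a configuration along these lines can be instantiated with Hugging Face Transformers as follows. The vocabulary size differs per L1 (see Table 3) and the value here is only an example, and the position count is an assumption derived from the 128-token maximum length in Table 2.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=30000,             # varies per pre-training corpus (Table 3)
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=130,  # assumption: 128 positions plus RoBERTa's offset
)
model = RobertaForMaskedLM(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```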
A.2 Hyperparameters
The hyperparameters used in all pre-training experiments are listed in Table 2.

Batch size      150
Learning rate   5E-5
Total steps     200K
Warmup steps    10K
Max position    128
Table 2: Pre-training hyperparameters for RoBERTa.
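As an illustration of how the hyperparameters in Table 2 map onto the Hugging Face Trainer, a rough sketch is given below. The `model`, `tokenizer`, and `train_dataset` objects are assumed to be built from the L1 corpus, the output path is a placeholder, and the per-device batch size and accumulation steps are only one way to reach an effective batch size of 150.

```python
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# `model`, `tokenizer`, and `train_dataset` are assumed to come from the L1 corpus
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)  # assumed masking rate

args = TrainingArguments(
    output_dir="checkpoints",        # placeholder output path
    per_device_train_batch_size=30,
    gradient_accumulation_steps=5,   # 30 x 5 = effective batch size of 150
    learning_rate=5e-5,
    max_steps=200_000,
    warmup_steps=10_000,
)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()
```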
A.3 Pre-training Data
We put all the details related to the pre-training data in Table 3. We provide download links to the pre-training datasets, along with the training and validation loss at the end of pre-training. The artificial data and the baseline datasets can be generated following the script in our code. The train/evaluation splits can be found in the supplementary materials. We also include the vocabulary size (including special tokens) of each model in the last column. The vocabulary files are obtained by training a WordPiece tokenizer on the training data of the Java, Kannada, and Wikipedia datasets.
A.4 Fine-tuning Details
We fine-tune on GLUE using the Huggingface (Wolf et al., 2019) code base. The model fine-tuned in this section is RoBERTa-base with a classification head on top of the last transformer layer. The whole fine-tuned model has around 110M parameters.
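A minimal sketch of loading one of our L1 checkpoints for GLUE fine-tuning with the Hugging Face code base; the checkpoint path is a placeholder, and `num_labels` depends on the task (e.g., 2 for most classification tasks, 1 for STS-B regression).

```python
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

checkpoint = "path/to/l1-checkpoint"   # hypothetical pre-trained L1 model

tokenizer = RobertaTokenizerFast.from_pretrained(checkpoint)
# a fresh classification head is placed on top of the last transformer layer;
# the whole model (encoder + head) is fine-tuned on the GLUE task
model = RobertaForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```

The per-task learning rates, batch sizes, and step counts are those listed in Table 5.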
A.4.1 Dataset
We provide statistics on the 8 GLUE tasks we use in Table 4.
A.4.2 Fine-tuning Hyperparameters
We list the hyperparameters used in fine-tuning on GLUE in Table 5.
A.5 Resource
Our computation resource is the V100 GPU. Pre-training a RoBERTa model with the hyperparameters given in Table 2 takes 60 hours on a single V100, and fine-tuning the pre-trained model on the 8 GLUE tasks with the hyperparameters in Table 5 takes about 12 hours on a V100.
B Fine-tune the Model on English MLM Before Fine-tuning on GLUE
This appendix gives the detailed experiment data for Section 4.4.

Dataset               Link        Training Loss   Eval. Loss   Vocab Size
Wikipedia             Wikidump    2.204           3.354        30000
Java                  Java data   0.03227         1.025        30000
Amino Acid            PLUS        2.041           2.201        28895
Kannada               OSCAR       2.366           3.128        30000
Random baseline       NA          9.428           9.467        30000
Zipf Baseline         NA          6.351           6.446        30000
Artificial (Uniform)  NA          1.996           2.409        29991
Artificial (Zipf)     NA          1.599           1.774        29991
Artificial (50)       NA          1.558           1.754        29991
Artificial (500)      NA          1.563           1.762        29991
Artificial (5000)     NA          1.548           1.701        29991

Table 3: Details of the datasets used in pre-training.
Figure 3: Workflow of our experiments for Section 4.4. Stage 1: L1 MLM pre-training; Stage 2: English MLM fine-tuning (transformer fixed, only the word embeddings and LM head are updated); Stage 3: GLUE fine-tuning of the whole model; Stage 4: GLUE testing (all parameters fixed). We first pre-train the whole masked language model on L1 (protein sequences in this figure), and then fine-tune only the word embeddings and the language model head on English Wikipedia. The third stage is fine-tuning the whole model on English downstream tasks, and the last stage is testing the performance on the fine-tuned downstream task.
Task     Examples (train / dev / test)
MRPC     3.6K / 0.4K / 1.7K
RTE      2.4K / 0.2K / 3K
STS-B    5.7K / 1.5K / 1.3K
QNLI     104K / 5.4K / 5.4K
QQP      363K / 40.4K / 391.0K
CoLA     8.5K / 1.0K / 1.1K
MNLI     392.7K / 9.8K + 9.8K / 9.8K + 9.8K
SST-2    67.4K / 0.9K / 1.8K

Table 4: Statistics (train / dev / test) of the GLUE tasks. MNLI contains matched and mismatched dev and test sets. We did not evaluate our models' performance on the test sets.
Task       LR        BSZ   RoBERTa DR   Classifier DR   TS      WS     MSL
CoLA       1.00E-05  16    0            0.1             5336    320    128
STS-B      2.00E-05  16    0            0.1             3598    214    128
SST-2      1.00E-05  32    0            0.1             20935   1256   128
MNLI       3.00E-05  128   0            0.1             10000   1000   128
QNLI       1.00E-05  32    0            0.1             33112   1986   128
QQP        5.00E-05  128   0            0.1             14000   1000   128
RTE        3.00E-05  32    0            0.1             800     200    128
MRPC       2.00E-05  32    0            0.1             800     200    128
SQuAD2.0   3.00E-05  48    0            0.1             8144    814    128

Table 5: Hyperparameters used for fine-tuning on downstream tasks. LR: Learning Rate. BSZ: Batch Size. DR: Dropout Rate. TS: Training Steps. WS: Warmup Steps. MSL: Maximum Sequence Length.
L1                     STS-B   QNLI   QQP   CoLA   SST-2   MNLI   MRPC   RTE   Avg
No Pre-train
Pre-train on English
Random Baseline
Zipf Baseline
Amino Acid
Java Script
Kannada
Artificial (Uniform)
Artificial (Zipf)      0.79    0.79   0.83  0.11   0.82    0.72   0.75   0.57  0.67