When Do You Need Billions of Words of Pretraining Data?
Yian Zhang,∗ Alex Warstadt,∗ Haau-Sing Li, and Samuel R. Bowman
Dept. of Computer Science, Dept. of Linguistics, Center for Data Science
New York University
{yian.zhang, warstadt, xl3119, bowman}@nyu.edu
∗Equal contribution.

Abstract
NLP is currently dominated by general-purpose pretrained language models like RoBERTa, which achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? We adopt four probing methods—classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks—and draw learning curves that track the growth of these different measures of linguistic ability with respect to pretraining data volume, using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M, and 1B words. We find that LMs require only about 10M or 100M words to learn representations that reliably encode most syntactic and semantic features we test. A much larger quantity of data is needed in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.
Introduction

Pretrained language models (LMs) like BERT and RoBERTa have become ubiquitous in NLP. These models use massive datasets on the order of tens or even hundreds of billions of words (Brown et al., 2020) to learn linguistic features and world knowledge, and they can be fine-tuned to achieve good performance on many downstream tasks.
Figure 1: Overall learning curves for the four probing methods. For each method, we compute overall performance for each RoBERTa model tested as the macro average over sub-tasks' performance after normalization. We fit a logistic curve which we scale to have a maximum value of 1.
Much recent work has used probing methods to evaluate what these models have and have not learned (Belinkov and Glass, 2019; Tenney et al., 2019b; Rogers et al., 2020; Ettinger, 2020). Since most of these works only focus on models pretrained on a fixed data volume (usually billions of words), many interesting questions regarding the effect of the amount of pretraining data remain unanswered: What do data-rich models know that models with less pretraining data do not? How much pretraining data is required for LMs to learn different grammatical features and linguistic phenomena? Which of these skills do we expect to improve if we increase the pretraining data to over 30 billion words? Which aspects of grammar can be learned from data volumes on par with the input to human learners, around 10M to 100M words (Hart and Risley, 1992)?

With these questions in mind, we probe the MiniBERTas (Warstadt et al., 2020b), a group of RoBERTa models pretrained on 1M, 10M, 100M, and 1B words, and RoBERTa BASE (Liu et al., 2019), pretrained on about 30B words, using four methods: First, we use standard classifier probing on the edge probing suite of NLP tasks (Tenney et al., 2019b) to measure the quality of the syntactic and semantic features that can be extracted by a downstream classifier with each level of pretraining. Second, we apply minimum description length probing (Voita and Titov, 2020) to the edge probing suite, with the goal of quantifying the accessibility of these features. Third, we probe the models' knowledge of various syntactic phenomena using unsupervised acceptability judgments on the BLiMP suite (Warstadt et al., 2020a). Fourth, we fine-tune the models on five tasks from SuperGLUE (Wang et al., 2019) to measure their ability to solve conventional NLU tasks.

Figure 1 shows the interpolated learning curves for these four methods as a function of the amount of pretraining data. We have two main findings: First, the results of the three probing methods we adopt show that the linguistic knowledge of RoBERTa pretrained on 100M words is already very close to that of RoBERTa BASE, which is pretrained on around 30B words. Second, RoBERTa requires billions of words of pretraining data to make substantial improvements in performance on downstream NLU tasks. From these results, we conclude that there are skills critical to solving downstream NLU tasks that LMs can only acquire with billions of words of pretraining data, and that we need to look beyond probing for linguistic features to explain why LMs improve at these large data scales.
Methods

We probe the MiniBERTas, a set of 12 RoBERTa models pretrained from scratch by Warstadt et al. (2020b) on 1M, 10M, 100M, and 1B words sampled from a combination of Wikipedia and Smashwords, the sources that Devlin et al. (2019) use to pretrain BERT, and a subset of those used for RoBERTa. Warstadt et al. ran pretraining 25 times with varying hyperparameter values for each of 1M, 10M, and 100M, and 10 times for 1B. For each dataset size, they released the three models with the lowest dev set perplexity, yielding 12 models in total. The MiniBERTas are available at https://huggingface.co/nyu-mll. We also test the publicly available RoBERTa BASE (Liu et al., 2019), which is pretrained on about 30B words, and three RoBERTa BASE models with randomly initialized parameters. (RoBERTa BASE is available at https://github.com/pytorch/fairseq/tree/master/examples/roberta; in addition to Wikipedia and Smashwords, it is also trained on news and web data.)

We probe the MiniBERTas using four methods: classifier probing on the edge probing suite, minimum description length probing on the edge probing suite, unsupervised acceptability judgments on BLiMP, and fine-tuning on NLU tasks from SuperGLUE. The code for all four experiments can be found at https://github.com/nyu-mll/pretraining-learning-curves. In each probing experiment, we test all 16 models on each task involved. For all experiments except for BLiMP, we use min-max normalization to adjust the results into the range of [0, 1], where 0 represents the worst score of any model on the task (usually a randomly initialized one), and 1 represents the best score of any model (usually RoBERTa BASE). The unnormalized results are included in the appendix. We plot the results in a figure for each task, where the y-axis is the (normalized) score and the x-axis is the amount of pretraining data; we plot the no-pretraining random baseline with an x-value of 1. To show the overall trend of improvement, we use non-linear least squares to fit a logistic function to the points after log-transforming the x-values. We assume log-logistic learning curves because of the goodness of the fit to our empirical findings; it may also be reasonable to fit an exponential learning curve (Heathcote et al., 2000).
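As a concrete illustration, the normalization and curve-fitting procedure can be sketched as follows. This is a minimal reconstruction rather than our released code: the scores, function names, and initial parameter guesses are invented for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def min_max_normalize(scores):
    """Rescale one task's scores to [0, 1] across all models."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def logistic(x, ymax, x0, k):
    """Logistic curve over x = log10(pretraining words)."""
    return ymax / (1 + np.exp(-k * (x - x0)))

# Invented example: one task's scores at 1M, 10M, 100M, 1B, and 30B words.
words = np.array([1e6, 1e7, 1e8, 1e9, 3e10])
scores = np.array(min_max_normalize([0.42, 0.68, 0.81, 0.83, 0.84]))

# Non-linear least squares fit after log-transforming the x-values.
(ymax, x0, k), _ = curve_fit(logistic, np.log10(words), scores,
                             p0=[1.0, 7.0, 1.0], maxfev=10000)
print(f"projected max {ymax:.2f}, fastest growth near 10^{x0:.1f} words")
```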
Classifier Probing

We use the widely-adopted probing approach of Ettinger et al. (2016), Adi et al. (2017), and others—which we call classifier probing—to test the extent to which linguistic features like part-of-speech and coreference are encoded in the MiniBERTa representations. In these experiments we freeze the representations and train MLP classifiers for the ten probing tasks in the edge probing suite (Tenney et al., 2019b). (Task data sources: Part-of-Speech, Constituents, Entities, SRL, and OntoNotes coref. from Weischedel et al. (2013), Dependencies from Silveira et al. (2014), Sem. Proto Role 1 from Teichert et al. (2017), Sem. Proto Role 2 from Rudinger et al. (2018), Relations (SemEval) from Hendrickx et al. (2010), Winograd coref. from Rahman and Ng (2012) and White et al. (2017).)

Figure 2: Classifier probing results for each task in the edge probing suite, adjusted using min-max normalization. The overall results are identical in each subplot, and are repeated to make comparisons easier. For context, we also plot BERT LARGE performance for each task as reported by Tenney et al. (2019a).

Admittedly, classifier probing has recently come under scrutiny. Hewitt and Liang (2019) and Voita and Titov (2020) caution that the performance achieved in the classifier probing setting reflects a combined effort of the representations and the probe, so a probing classifier's performance does not precisely reveal the quality of the representations. However, we think it is still valuable to include this experimental setting for two reasons: First, the downstream classifier setting and F1 evaluation metric make these experiments easier to interpret in the context of earlier results than results from relatively novel probing metrics like minimum description length. Second, we focus on relative differences between models rather than absolute performance and include a randomly initialized baseline model in the comparison. When the model representations are random, the probe's performance reflects the probe's own ability to solve the target task. Therefore, any improvements over this baseline value are due to the representation rather than the probe itself. On the other hand, since other probing methods are well motivated, we also look to minimum description length probing (Voita and Titov, 2020) in the next section to quantify not just how well a probe can perform, but how complex the probe is.
Task formulation and training
Following Tenney et al., we take the input for each task to be a pair of token spans or a single span of tokens. For each task T, if T is a pairwise task, we train two attention pooling functions f_T^1 and f_T^2, and for each span pair (S_i^1, S_i^2) we generate a representation pair (r_i^1, r_i^2) = (f_T^1(S_i^1), f_T^2(S_i^2)). Then for each label L_j of T, the probe (which is an MLP) takes in (r_i^1, r_i^2) and performs a binary classification to predict whether L_j is the correct label. For tasks that involve only a single span (Part-of-Speech, Constituents, and Entities), S_i^2 and f_T^2 are omitted. We adopt the 'mix' representation approach, so each token representation (t_i^k)_p from S_i^k = {(t_i^k)_1, (t_i^k)_2, ...} is a linear combination of RoBERTa's layer activations projected to a 256-dimensional space.

For each task, we fix the validation interval to be 1000 steps, early stopping patience to be 20 steps, and learning rate patience to be 5 steps, and we sample 5 combinations of batch size and learning rate randomly to tune the model with the lowest MLM perplexity at each pretraining scale, using the Adam optimizer (Kingma and Ba, 2014). The search ranges for batch size and learning rate are { } and { }, respectively. We use the best hyperparameter setting to train all the models of that scale on the task.
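The probe architecture described above can be sketched as follows, assuming the span token vectors have already been mixed and projected to 256 dimensions. This is our own simplified PyTorch reconstruction, not the implementation we actually ran; the hidden layer size is illustrative.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention pooling over the token vectors of one span."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, span_reprs):                 # (span_len, dim)
        weights = torch.softmax(self.scorer(span_reprs), dim=0)
        return (weights * span_reprs).sum(0)       # (dim,)

class EdgeProbe(nn.Module):
    """Pools the span(s), then scores every candidate label with an MLP;
    each label is treated as an independent binary decision."""
    def __init__(self, dim, n_labels, pairwise=True):
        super().__init__()
        self.pairwise = pairwise
        self.pool1 = AttentionPool(dim)
        self.pool2 = AttentionPool(dim) if pairwise else None
        in_dim = dim * (2 if pairwise else 1)
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.Tanh(),
                                 nn.Linear(256, n_labels))

    def forward(self, span1, span2=None):
        r = self.pool1(span1)
        if self.pairwise:
            r = torch.cat([r, self.pool2(span2)], dim=-1)
        return self.mlp(r)                         # one logit per label

# Toy usage with random 256-d token vectors for a 3- and a 5-token span.
probe = EdgeProbe(dim=256, n_labels=20)
logits = probe(torch.randn(3, 256), torch.randn(5, 256))
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.zeros(20))
```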
Results

We plot the experiment results in Figure 2, and in each subplot we also plot the overall edge probing performance, which we calculate for each MiniBERTa as its average F1 score on the 10 edge probing tasks (after normalization). From the single-task curves we conclude that most of the feature learning occurs with <20M words. Most plots show broadly similar learning curves, which rise sharply with less than 1M words of pretraining data, reach the point of fastest growth around 1M words, and are nearly saturated with 100M words. The most notable exception to this pattern is the Winograd task, which only rises significantly between 1B and 30B words of pretraining data. (These results are also somewhat more noisy due to well-known idiosyncrasies of this task.) As the Winograd task is designed to test commonsense knowledge and reasoning, we infer that these features require more data to encode than syntactic and semantic ones.

There are some general differences that we can observe between different types of tasks. Figure 3 shows the aggregated learning curves of syntactic, semantic, and commonsense tasks.

Figure 3: Edge probing results for each group of tasks, adjusted using min-max normalization. Syntactic tasks are Part-of-Speech, Dependencies, and Constituents. The commonsense task is Winograd coref. Semantic tasks are all remaining tasks.

The syntactic learning curve rises slightly earlier than the semantic one, and 90% of the improvements in syntactic learning can be made with about 10M words, while the semantic curve is still rising slightly after 100M. This is not surprising, as semantic computation is generally thought to depend on syntactic representations (Heim and Kratzer, 1998), and Tenney et al. (2019a) report a similar result. The commonsense learning curve (for Winograd coref. only) clearly rises far later, and is projected to continue to rise long after syntactic and semantic features stop improving.
Minimum Description Length Probing

In this experiment, we study the MiniBERTas with minimum description length (MDL) probing (Voita and Titov, 2020), with the goal of revealing not only the total amount of feature information extracted by the probe, but also the effort taken by the probe to extract the features. MDL measures the minimum number of bits needed to transmit the labels for a given task, given that both the sender and the receiver have access to the pretrained model's encoding of the data. In general, it is more efficient for the sender not to directly transmit the labels, but to instead transmit a decoder model that can be used to extract the labels from the representations. If a decoder cannot losslessly recover the labels, then some additional information must be transmitted as well. In this way, fewer bits are required to transmit the same information, i.e. the data is compressed.

The MDL of a dataset for an encoder model is thus a sum of the estimates of two terms: The data codelength is the number of bits needed to transmit the labels assuming the receiver has the trained decoder model, i.e. the cross-entropy loss of the decoder. The model codelength is the number of bits needed to transmit the decoder parameters. There is a tradeoff between data codelength and model codelength: A simpler decoder is likely to have worse performance (i.e. decreasing model codelength often increases data codelength), and vice-versa.

We adopt Voita and Titov's online code estimation of MDL. We compute the online code by partitioning the training data into 11 portions {(x_j, y_j)}_{j=t_{i-1}+1}^{t_i} for 1 ≤ i ≤ 11. The values of t_0, t_1, ..., t_11 are the numbers of examples corresponding to the following proportions of the training data: 0%, 0.1%, 0.2%, 0.4%, 0.8%, 1.6%, 3.2%, 6.25%, 12.5%, 25%, 50%, 100%. Then for each i ∈ [1, 10], we train an MLP with parameters θ_i on portions 1 through i, and compute its loss on portion i+1. Finally we compute the online codelength as the sum of the ten loss values and the codelength of the first data portion under a uniform prior:

L_online(y^n | x^n) = t_1 log K − Σ_{i=1}^{10} log p_{θ_i}(y_{t_i+1 : t_{i+1}} | x_{t_i+1 : t_{i+1}}),   (1)

where the number of labels K = 2 for all edge probing tasks.

In Voita and Titov's MDL experiments with the edge probing suite, the authors convert the edge probing tasks to multi-class classification problems. Our implementation skips this step and follows the Tenney et al. edge probing task formulation with one classifier head per candidate label, so that we can include the tasks that involve multiple correct labels, enabling a full comparison between our MDL results and our conventional edge probing results. Since Voita and Titov report that MDL is stable across reasonable hyperparameter settings, we use the settings described in Tenney et al. (2019b) for all the models we probe.
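Given the per-portion losses, the online codelength of Equation 1 reduces to a short computation. The sketch below is a minimal illustration under our reading of the equation; the loss values are invented, and in practice each entry would be the summed cross-entropy (in bits) of probe θ_i evaluated on portion i+1.

```python
import numpy as np

def online_codelength(portion_losses_bits, t1, n_labels=2):
    """Online (prequential) codelength, following Equation 1: the first
    t1 labels cost t1 * log2(K) bits under a uniform code; each later
    portion costs the cross-entropy (in bits) of the probe trained on
    all preceding portions."""
    return t1 * np.log2(n_labels) + sum(portion_losses_bits)

# Invented example: ten summed per-portion losses in bits, with
# t1 = 100 examples (0.1% of a hypothetical 100k-example training set).
losses = [310.0, 420.0, 690.0, 1150.0, 1880.0,
          3050.0, 5240.0, 8410.0, 12980.0, 20870.0]
print(f"online codelength: {online_codelength(losses, t1=100):.0f} bits")
```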
Results

Figure 4: MDL results for each edge probing task. We do not plot a logistic curve for the Winograd coref. results because we could not find an adequate fit.

We plot the online code results on the top of Figure 4. The overall codelength shows a similar trend to edge probing: Most of the reduction in feature codelength is achieved with fewer than 100M words. MDL for syntactic features decreases even sooner. Results for Winograd coref. are idiosyncratic, probably due to the failure of the probes to learn the task.

The changes in model codelength and data codelength are shown on the bottom of Figure 4. We compute the data codelength following Voita and Titov (2020) using the training set loss of a classifier trained on the entire training set, and the model codelength is the total codelength minus the data codelength. The monotonically decreasing data codelength simply reflects the fact that the more data-rich RoBERTa models have smaller loss. When it comes to the model codelength, however, we generally observe the global minimum for the randomly initialized models (at "None"). This is expected, and simply reflects the fact that the decoder can barely extract any feature information from the random representations (i.e. the probe can barely learn to recognize these features even given the full training set). On many tasks, the model codelength starts to decrease when the pretraining data volume is large enough, suggesting that large-scale pretraining may increase the data regularity of the feature information in the representations, making them more accessible to a downstream classifier. However, the decreasing trend is not consistent among all tasks, and therefore more evidence needs to be collected before we reach any conclusions about feature accessibility.
Unsupervised Acceptability Judgments

We use the BLiMP benchmark (Warstadt et al., 2020a) to test models' knowledge of individual grammatical phenomena in English. BLiMP is a challenge set of 67 tasks, each containing 1000 minimal pairs of sentences that highlight a particular morphological, syntactic, or semantic phenomenon. Minimal pairs in BLiMP consist of two sentences that differ only by a single edit, but contrast in grammatical acceptability. BLiMP is designed for unsupervised evaluation of language models using a forced-choice acceptability judgment task: A language model classifies a minimal pair correctly if it assigns a higher likelihood to the acceptable sentence. We follow the MLM scoring method of Salazar et al. (2020) to compare candidates.
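Concretely, MLM scoring masks each token of a sentence in turn and sums the log-probabilities the model assigns to the original tokens. The sketch below shows the forced-choice comparison using the Hugging Face transformers API on the public roberta-base checkpoint; the minimal pair is a classic agreement example chosen for illustration, not necessarily a sentence from BLiMP.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_log_likelihood(sentence):
    """Mask each position in turn; sum log p(original token)."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):        # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
# The pair counts as correct if the acceptable sentence scores higher.
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))
```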
Results
We plot learning curves for BLiMP in Figure 5. Warstadt et al. organize the 67 tasks in BLiMP into 12 categories based on the phenomena tested, and for each category we plot the average accuracy for the tasks in the category. We do not normalize results in this plot. For the no-data baseline, we plot chance accuracy of 50% rather than making empirical measurements from random RoBERTa models.

Figure 5: BLiMP results by category. BLiMP has 67 tasks which belong to 12 linguistic phenomena. For each task the objective is to predict the more grammatically acceptable sentence of a minimal pair in an unsupervised setting. For context, we also plot human agreement with BLiMP reported by Warstadt et al. (2020a) and RoBERTa LARGE performance reported by Salazar et al. (2020).

We find the greatest improvement in overall BLiMP performance between 1M and 100M words of pretraining data. With 100M words, sensitivity to contrasts in acceptability overall is within 9 accuracy points of humans, and improves only 6 points with additional data. This shows that substantial knowledge of many grammatical phenomena can be acquired from 100M words of raw text.

We also observe significant variation in how much data is needed to learn different phenomena. We see the steepest learning curves on agreement phenomena, with nearly all improvements occurring between 1M and 10M words. For phenomena involving wh-dependencies, i.e. filler-gap dependencies and island effects, we observe shallow and delayed learning curves, with 90% of possible improvements occurring between 1M and 100M words. These differences can most likely be ascribed to two factors: First, agreement phenomena tend to involve more local dependencies, while wh-dependencies tend to be long-distance. Second, agreement phenomena are highly frequent, with a large proportion of sentences containing multiple instances of determiner-noun and subject-verb agreement, while wh-dependencies are comparatively rare. Finally, we observe that the phenomena tested in the quantifiers category are never effectively learned, even by RoBERTa BASE. These phenomena include subtle semantic contrasts—for example, Nobody ate {more than, *at least} two cookies—which may involve difficult-to-learn pragmatic knowledge (Cohen and Krifka, 2014).

Fine-tuning on SuperGLUE

SuperGLUE is a benchmark suite of eight classification-based language-understanding tasks (Wang et al., 2019). We test each MiniBERTa on five SuperGLUE tasks on which we expect to see significant variation at these scales. (Task data sources: CB from De Marneffe et al. (2019), BoolQ from Clark et al. (2019), COPA from Roemmele et al. (2011), WiC from Pilehvar and Camacho-Collados (2019), Miller (1995), and Schuler (2005), and RTE from Dagan et al. (2006), Bar Haim et al. (2006), Giampiccolo et al. (2007), and Bentivogli et al. (2009).) The hyperparameter search range used for each task is described in the appendix.
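For illustration, a single fine-tuning step in this setting looks roughly as follows. This is a schematic sketch using the Hugging Face transformers API rather than our actual training setup; the checkpoint name, example, label convention, and learning rate are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Schematic: fine-tune a pretrained checkpoint on a two-class
# SuperGLUE-style task (e.g. BoolQ: does the passage answer "yes"?).
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tok(["is the sky blue"],                          # question
            ["The sky usually appears blue in daylight."],  # passage
            return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([1])  # 1 = "yes"

model.train()
loss = model(**batch, labels=labels).loss  # cross-entropy over 2 classes
loss.backward()
optimizer.step()
optimizer.zero_grad()
```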
Results
We plot the results on SuperGLUE in Figure 6. Improvements in SuperGLUE performance require a relatively large volume of pretraining data. For most tasks, the point of fastest improvement in our interpolated curve occurs with more than 1B words. None of the tasks (with the possible exception of CommitmentBank) show any significant sign of saturation at 30B words. This suggests that some key NLU skills are not learned with fewer than billions of words, and that models are likely to continue improving on these tasks given 10 to 100 times more pretraining data.
Discussion

Having established the learning curves for each of these probing methods individually, we can begin to draw some broader conclusions about how increasing pretraining data affects Transformer MLMs. Figure 1 plots the overall learning curves for these four methods together.

Figure 6: SuperGLUE results. The metric for BoolQ, COPA, WiC, and RTE is accuracy, and for CB it is the average of accuracy and F1 score. For context, we plot RoBERTa LARGE performance reported at https://github.com/pytorch/fairseq/tree/master/examples/roberta.

The most striking result is that improvements in NLU task performance require far more data than improvements in representations of linguistic features as measured by these methods. Classifier probing, MDL probing, and acceptability judgment performance all improve rapidly between 1M and 10M words and show little improvement beyond 100M words. By contrast, performance on the NLU tasks in SuperGLUE appears to improve most rapidly with over 1B words and likely continues improving at much larger data scales.

This implies that at least some of the skills that RoBERTa uses to solve typical NLU tasks require billions of words to be acquired. It is likely then that the features being tested by the edge probing suite and BLiMP are not the key skills implicated in improvements in NLU performance at these large scales. While edge probing features such as dependency and semantic role are undoubtedly crucial to solving NLU tasks, a model that can extract and encode a large proportion of these features (e.g. the 100M-word models) may still perform poorly on SuperGLUE.

Commonsense knowledge may play a large role in explaining SuperGLUE performance. This hypothesis is backed up by results from the Winograd edge probing task, which suggest that relatively little commonsense knowledge can be learned with fewer than 1B words. The notion that commonsense knowledge takes more data to learn is not surprising: Intuitively, humans mainly acquire commonsense knowledge through non-linguistic information, and so a model learning from text without grounding should require far more data than a human is exposed to in order to acquire comparable knowledge. However, as our experiments focus mainly on linguistic knowledge, additional work is needed to give a more complete picture of the acquisition of commonsense knowledge.

Another possible explanation of the delay in the rise of the SuperGLUE curve is that being able to encode certain features does not imply being able to use them to solve practical tasks. In other words, even if a RoBERTa model pretrained on 10M–100M words is already able to represent the linguistic features we target, it is not guaranteed that it is able to use them in downstream tasks. This corresponds to the finding of Warstadt et al. (2020b) that RoBERTa can learn to reliably extract many linguistic features with little pretraining data, but that it takes orders of magnitude more pretraining data for those features to be used preferentially when generalizing. Therefore, it may be a promising research direction to develop methods to efficiently pretrain and fine-tune NLP models to make better use of the linguistic features they already recognize.

In light of Warstadt et al.'s (2020b) findings, we had initially hypothesized that feature accessibility as measured by MDL might show a shallower or later learning curve than other probing methods. (Warstadt et al.'s experiments are quite different from ours: They measure RoBERTa's preference for linguistic features over surface features during fine-tuning on ambiguous classification tasks. This requires strong inductive biases, which may not correspond straightforwardly to MDL.) This hypothesis is not supported by our findings: Figure 1 shows no obvious difference between the classifier probing curve and the MDL probing curve. However, this does not prove that the accessibility of linguistic features does not improve with massive pretraining sets, nor does it prove that the information about a feature and its accessibility improve at the same rate.

While those conclusions may turn out to be correct, another possibility is that the setting and methods we adopt fail to adequately differentiate between feature information and accessibility.
The bottom of Figure 4 shows that for most tasks the data codelength has a much larger variance across pretraining volumes than the model codelength, and thus the change in overall codelength predominantly reflects the decrease in the loss of a classifier trained on the full training set. Therefore, it is not surprising that the MDL curve resembles that of classifier probing. However, comparing model codelengths alone does not reliably reveal feature accessibility either, since the model codelength is not optimized individually but as a part of the overall codelength. New probing methods related to MDL address different aspects of these problems (Whitney et al., 2020; Pimentel et al., 2020a) and may yield different conclusions.
Related Work

Probing neural network representations has been an active area of research in recent years (Rogers et al., 2020; Belinkov and Glass, 2019). With the advent of large pretrained Transformers like BERT (Devlin et al., 2019), numerous papers have used classifier probing methods to attempt to locate linguistic features in learned representations, with striking positive results (Tenney et al., 2019b; Hewitt and Manning, 2019). However, another thread has found problems with many probing methods: Classifier probes can learn too much from training data (Hewitt and Liang, 2019) and can fail to distinguish between features that are extractable and features that are actually used (Voita and Titov, 2020; Pimentel et al., 2020b; Elazar et al., 2020). Moreover, it is advisable to look to a variety of probing methods, as different probing methods often yield contradictory results (Warstadt et al., 2019).

There have also been a few earlier studies investigating the relationship between pretraining data volume and linguistic knowledge in language models. Studies of unsupervised acceptability judgments find fairly consistent evidence of rapid improvements in linguistic knowledge up to about 10M words of pretraining data, after which improvements slow down for most phenomena. van Schijndel et al. (2019) find large improvements in knowledge of subject-verb agreement and reflexive binding up to 10M words, and few improvements between 10M and 80M words. Hu et al. (2020) find that GPT-2 trained on 42M words performs roughly as well on a syntax benchmark as a similar model trained on 100 times that amount. Other studies have investigated how one model's linguistic knowledge changes during the training process, as a function of the number of updates (Saphra and Lopez, 2019; Chiang et al., 2020).

Raffel et al. (2020) also investigate how performance on SuperGLUE (and other downstream tasks) improves with pretraining dataset sizes between about 8M and 34B words. In contrast to our findings, they find that models with around 500M words of pretraining data can perform similarly on downstream tasks to models with 34B words. This discrepancy may arise from several factors. First, the architecture and pretraining for their T5 model are not identical to RoBERTa's or the MiniBERTas'. Second, they pretrain their models for a fixed number of iterations (totaling 34B tokens), whereas the MiniBERTas were trained with early stopping. Nonetheless, this result suggests that the number of unique tokens might matter less than the number of iterations, within reasonable limits.

There is also some recent work that investigates the effect of pretraining data size in other languages. Micheli et al. (2020) pretrain BERT-based language models on 10MB, 100MB, 500MB, 1000MB, 2000MB, and 4000MB of French text and test them on a question answering task. They find that the French MLM pretrained on 100MB of raw text performs similarly on the task to the ones pretrained on larger datasets, and that corpus-specific self-supervised learning does not make a significant difference. Martin et al. (2020) also show that French MLMs can already learn a lot from small-scale pretraining.
Conclusion

We track the ability of language models to acquire representations of linguistic features as a function of the amount of pretraining data. We use a variety of probing methods, from which we determine that linguistic features are mostly learnable with 100M words of data, while NLU task performance requires far more data.

Our results do not explain what causes NLU task performance to improve with large quantities of data. To answer these questions, one can use causal probing methods like amnesic probing (Elazar et al., 2020), in which features are removed from a representation. We would also like to understand the differences between learning curves for various linguistic features, for instance through the lens of the hypothesis that some features acquired early on play a role in bootstrapping knowledge of other features. Finally, our results show that, to the extent that Transformer LMs like RoBERTa even approach human language understanding, they require far more data than humans to do so. Extending this investigation to other pretraining settings and studying different model architectures, pretraining tasks, and pretraining data domains—including ones that more closely resemble human learners—could help indicate promising directions for closing this gap.
Acknowledgments
We thank Haokun Liu and Ian Tenney for providing technical support on the edge probing experiment, and Elena Voita for support with MDL. This project has benefited from financial support to SB by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program), by Samsung Research (under the project Improving Deep Learning using Latent Structure), by Intuit, Inc., and in-kind support by the NYU High-Performance Computing Center. This material is based upon work supported by the National Science Foundation under Grant Nos. 1850208 and 1922658. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of ICLR Conference Track, Toulon, France.

Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Yonatan Belinkov and James R. Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Textual Analysis Conference (TAC).

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv preprint 2005.14165.

David C Chiang, Sung-Feng Huang, and Hung-yi Lee. 2020. Pretrained language model embryology: The birth of ALBERT. arXiv preprint arXiv:2010.02480.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936.

Ariel Cohen and Manfred Krifka. 2014. Superlative quantifiers and meta-speech acts. Linguistics and Philosophy, 37(1):41–90.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer.

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23, pages 107–124.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2020. When BERT forgets how to POS: Amnesic probing of linguistic properties and MLM predictions. arXiv preprint 2006.00995.

Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Association for Computational Linguistics.

Betty Hart and Todd R. Risley. 1992. American parenting of language-learning children: Persisting differences in family-child interactions observed in natural home environments. Developmental Psychology, 28(6):1096.

Andrew Heathcote, Scott Brown, and Douglas JK Mewhort. 2000. The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7(2):185–207.

Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell, Oxford.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38, Uppsala, Sweden. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger Levy. 2020. A systematic assessment of syntactic generalization in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1725–1744, Online. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.

Vincent Micheli, Martin d'Hoffschmidt, and François Fleuret. 2020. On the importance of pre-training data volume for compact language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7853–7858, Online. Association for Computational Linguistics.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics.

Tiago Pimentel, Naomi Saphra, Adina Williams, and Ryan Cotterell. 2020a. Pareto probing: Trading off accuracy for complexity. arXiv preprint arXiv:2010.02180.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020b. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622, Online. Association for Computational Linguistics.

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5231–5247, Online. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777–789, Jeju Island, Korea. Association for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium Series.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. In Findings of EMNLP.

Rachel Rudinger, Adam Teichert, Ryan Culkin, Sheng Zhang, and Benjamin Van Durme. 2018. Neural-Davidsonian semantic proto-role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 944–955, Brussels, Belgium. Association for Computational Linguistics.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712, Online. Association for Computational Linguistics.

Naomi Saphra and Adam Lopez. 2019. Understanding learning dynamics of language models with SVCCA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3257–3267, Minneapolis, Minnesota. Association for Computational Linguistics.

Marten van Schijndel, Aaron Mueller, and Tal Linzen. 2019. Quantity doesn't buy quality syntax with neural language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5831–5837, Hong Kong, China. Association for Computational Linguistics.

Karin Kipper Schuler. 2005. VerbNet: A Broad-coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 2897–2904, Reykjavik, Iceland. European Languages Resources Association (ELRA).

Adam Teichert, Adam Poliak, Benjamin Van Durme, and Matthew Gormley. 2017. Semantic proto-role labeling. In AAAI Conference on Artificial Intelligence.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of ICLR.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems.

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, Sheng-Fu Wang, Jason Phang, Anhad Mohananey, Phu Mon Htut, Paloma Jeretič, and Samuel R. Bowman. 2019. Investigating BERT's knowledge of language: Five analysis methods with NPIs. In Proceedings of EMNLP-IJCNLP, pages 2870–2880.

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020a. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.

Alex Warstadt, Yian Zhang, Haau-Sing Li, Haokun Liu, and Samuel R Bowman. 2020b. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ralph Weischedel, Martha Palmer, Marcus Mitchell, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium.

Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 996–1005.

William F Whitney, Min Jae Song, David Brandfonbrener, Jaan Altosaar, and Kyunghyun Cho. 2020. Evaluating representations by the complexity of learning low-loss predictors. arXiv preprint arXiv:2009.07368.

A Appendices
Figure 7: Our absolute edge probing dev set results (not normalized) compared to BERT LARGE test set results from Tenney et al. (2019b).

Figure 8: Our absolute SuperGLUE results (not normalized) compared to RoBERTa LARGE results from Liu et al. (2019).
Task  | Batch Size | Learning Rate | Validation Interval | Max Epochs
BoolQ | { }        | { }           | { }                 | { }
COPA  | { }        | { }           | { }                 | { }
RTE   | { }        | { }           | { }                 | { }

Table 1: Hyperparameter search ranges for the SuperGLUE tasks. Our search ranges are largely dependent on those used in Pruksachatkun et al. (2020).

Table 2: BLiMP results for each model (the MiniBERTas and RoBERTa BASE), overall and by category: ANA. AGR, ARG. STR, BINDING, CTRL. RAIS., D-N AGR, ELLIPSIS, FILLER GAP, IRREGULAR, ISLAND, NPI, QUANTIFIERS, and S-V AGR. 5-gram, LSTM, TXL, and GPT-2 scores come from Warstadt et al. (2020a).