COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, Xia Song
Abstract
We present COCO-LM, a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences. COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences. It creates more challenging pretraining inputs, where noises are sampled based on their likelihood in the auxiliary language model. COCO-LM then pretrains with two tasks: The first task, corrective language modeling, learns to correct the auxiliary model's corruptions by recovering the original tokens. The second task, sequence contrastive learning, ensures that the language model generates sequence representations that are invariant to noises and transformations. In our experiments on the GLUE and SQuAD benchmarks, COCO-LM outperforms recent pretraining approaches in various pretraining settings and few-shot evaluations, with higher pretraining efficiency. Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
1. Introduction
Pretrained language models (PLMs) (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2019) have revolutionized the way AI systems process natural languages. By pretraining on large text corpora (Raffel et al., 2019; Brown et al., 2020) and scaling Transformers (Vaswani et al., 2017) to millions and billions of parameters (Devlin et al., 2019; Raffel et al., 2019), the state-of-the-art in many language-related tasks has been refreshed at a historic speed in the past several years.

On the other hand, within the standard language model pretraining framework, it is observed that the empirical performance of PLMs on downstream tasks only improves linearly with the exponential growth of parameter size and pretraining cost (Kaplan et al., 2020). This is unsustainable as PLMs have reached trillions of parameters (Brown et al., 2020; Fedus et al., 2021).

Recent research has revealed some intrinsic limitations of existing pretraining frameworks that may result in this sub-linear efficiency. One challenge is that pretraining with randomly altered texts (e.g., randomly masked tokens) yields many non-informative signals that are no longer useful after a certain amount of pretraining (Roberts et al., 2020; Guu et al., 2020; Ye et al., 2020). Another is that pretraining at the token level does not explicitly learn language semantics at the sequence level, and Transformers may not generalize to higher-level semantics efficiently during pretraining (Li et al., 2020; Thakur et al., 2020).

In this paper, we aim to overcome these limitations with a new self-supervised learning framework, COCO-LM, that pretrains Language Models by COrrecting and COntrasting text sequences with more challenging noises. To construct more informative pretraining signals, COCO-LM leverages an auxiliary language model, similar to the generator in ELECTRA (Clark et al., 2020b), to corrupt text sequences by sampling more contextually plausible noises from its masked language modeling (MLM) probability. Different from the replaced token detection task in ELECTRA, COCO-LM revives a language modeling task, corrective language modeling (CLM), which pretrains the Transformer to not only detect the challenging noises in the corrupted texts, but also correct them via a multi-task setting.

To improve the learning of sequence-level semantics, COCO-LM introduces a sequence-level pretraining task, sequence contrastive learning (SCL), that uses contrastive learning to enforce the pretraining model to align the corrupted text sequence and its cropped original sequence close in the representation space, while pushing them away from other random sequences. This encourages the model to leverage more information from the entire sequence to produce sequence representations that are invariant to token-level alterations.

COCO-LM significantly improves the generalization ability of language models on a variety of downstream tasks in the GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016) benchmarks. It outperforms recent approaches (Clark et al., 2020b; He et al., 2020; Ke et al., 2020) by large margins, e.g., in MNLI accuracy and SQuAD 2.0 EM with base model training. It is also more cost-effective and has better few-shot ability in downstream tasks.
Our thorough analyses also reveal that the benefits of COCO-LM come from its challenging pretraining signals, more contextualized token representations, and regularized sequence representations. We plan to open-source our code and pretrained models.
2. Related Work
Designing better pretraining tasks than standard language modeling (Bengio et al., 2003; Devlin et al., 2019) is an important research topic in language representation learning (Radford et al., 2019; Song et al., 2019). For example, XLNet proposes permutation language modeling that conducts MLM in an autoregressive manner (Yang et al., 2019); UniLM uses pseudo MLM to unify autoregressive and MLM tasks for both language representation and generation (Dong et al., 2019; Bao et al., 2020). Lewis et al. (2019) conduct a thorough study of these variants and show MLM is still among the most effective in many applications.

One way to make MLM more informative is to mask more informative positions/spans (Joshi et al., 2019; Song et al., 2019; Guu et al., 2020) or to automatically learn masking positions (Ye et al., 2020). By masking more informative tokens, the pretrained language models focus more on the semantics required to recover those tokens (e.g., entities and attributes). This significantly boosts the generalization ability of the pretrained models in semantic-centric tasks, including more accurate question answering and more factually correct language generation (Guu et al., 2020; Roberts et al., 2020; Rosset et al., 2020).

Instead of optimizing mask positions, ELECTRA (Clark et al., 2020b) employs an auxiliary network to corrupt the input sequence with more challenging noises. It uses the MLM-trained auxiliary model to replace tokens with samples from its MLM probability, and pretrains the main Transformer to detect the replaced tokens via binary classification. The two networks are pretrained jointly: The auxiliary model generates more and more challenging noises for the main Transformer to detect; the main Transformer so trained achieves strong performance in downstream tasks.

Despite its empirical advantage, there are concerns about whether ELECTRA's binary classification task misses some properties of language modeling. The ELECTRA authors explored a standard language model task on the corrupted text sequence (All-Token LM), but observed performance degradation (Clark et al., 2020b). ELECTRIC (Clark et al., 2020a) proposes a language model task that contrasts the original tokens against noises sampled from a cloze model. Although it underperforms ELECTRA on GLUE, ELECTRIC still pretrains a language model which can be used in tasks like scoring the feasibility of a generated text sequence. MC-BERT uses a multiple-choice task which selects original tokens from plausible alternatives and performs on par with ELECTRA (Xu et al., 2020). COCO-LM also leverages an auxiliary network to generate pretraining inputs but uses two new pretraining tasks, one of which is a language modeling task.

Another frontier in pretraining research is to incorporate sentence-level signals, for example, next sentence prediction (Devlin et al., 2019), sentence ordering (Lan et al., 2019), and previous sentence prediction (Wang et al., 2019). However, RoBERTa found the next sentence prediction task not beneficial and only uses the token-level MLM task (Liu et al., 2019). The benefits of sentence-level pretraining tasks are usually observed on specific tasks (Chi et al., 2020; Lewis et al., 2020) such as modeling long-form texts (Ravula et al., 2020) and grounded question answering (Guu et al., 2020).

Recent successes of contrastive learning with language are mainly achieved in the fine-tuning stage. Gunel et al.
(2020) conduct supervised contrastive learning on GLUE and improve few-shot accuracy in fine-tuning. Xiong et al. (2020) use contrastive learning in dense text retrieval, using relevant query-document labels to construct contrast pairs. CERT (Fang & Xie, 2020) conducts continued training from BERT using contrastive pairs generated from back-translation (Pham et al., 2020) but underperforms RoBERTa.
3. Method
In this section, we first recap ELECTRA-style language model pretraining and then present COCO-LM.
3.1. ELECTRA-Style Language Model Pretraining

In the masked language modeling (MLM) task (Devlin et al., 2019), the pretraining model, often a Transformer (Vaswani et al., 2017), takes an input sequence $X^{\text{orig}} = [x_1, \dots, x_i, \dots, x_n]$ with some tokens randomly replaced by [MASK] symbols and learns to predict the original tokens:
$$[x_1, \dots, \text{[MASK]}_i, \dots, x_n] \xrightarrow{\text{Transformer}} H^{\text{mlm}} \xrightarrow{\text{LM Head}} X^{\text{mlm}},$$
where $H^{\text{mlm}} = [h^{\text{mlm}}_1, \dots, h^{\text{mlm}}_i, \dots, h^{\text{mlm}}_n]$ is the contextualized representation of the input sequence, $X^{\text{mlm}} = [x^{\text{mlm}}_1, \dots, x^{\text{mlm}}_i, \dots, x^{\text{mlm}}_n]$ is the sequence with masked positions filled in by MLM-predicted tokens, and LM Head is a classification layer over the vocabulary, defined below.
Figure 1.
The overview of COCO-LM. The auxiliary Transformer is pretrained by MLM. We sample output tokens from its LM probability to construct a corrupted sequence, which is used as the pretraining input of the main Transformer for Corrective Language Modeling. The corrupted sequence also forms a positive sequence pair with the cropped original sequence in Sequence Contrastive Learning.

The LM Head learns to predict the original token from the vocabulary $V$ with the following probability:
$$p_{\text{LM}}(x_i \mid h_i) = \frac{\exp(x_i^{\top} h_i)}{\sum_{x_j \in V} \exp(x_j^{\top} h_i)}.$$
The token embeddings $x_j$ are parameters shared between the Transformer input layer and the LM Head output layer. The Transformer is trained via the cross-entropy loss between $X^{\text{mlm}}$ and $X^{\text{orig}}$ on the masked positions (Devlin et al., 2019).

The randomly chosen masks do not always provide the best pretraining signals: Many masked tokens are trivial common words that may not push the Transformer to capture meaningful language semantics (Guu et al., 2020), while some might be too hard or have many false negatives (Xu et al., 2020). Pretraining on those masks is not guaranteed to elevate the language model's generalization ability.

Clark et al. (2020b) developed a new framework, ELECTRA, which, instead of working on the masked sequences directly, first leverages an auxiliary MLM model (the "generator") to infer a sequence $X^{\text{mlm}}$ as the pretraining input for the main network (the "discriminator"). The latter learns to detect which tokens are replaced via a binary classification task called "replaced token detection":
$$X^{\text{orig}} \xrightarrow{\text{Aux. Transformer}} X^{\text{mlm}} \xrightarrow{\text{Main Transformer}} H^{\text{disc}} \xrightarrow{\text{Sigmoid}} Y.$$
The main Transformer detects whether each input token is kept or replaced, i.e., $y_i = 1$ iff $x_i = x^{\text{mlm}}_i$, using the sigmoid binary classification head. The two networks are pretrained side-by-side: The auxiliary Transformer is trained by MLM and outputs more plausible and challenging token replacements $x^{\text{mlm}}_i \neq x_i$; the main network learns to better detect the deceiving replacements in $X^{\text{mlm}}$.

The auxiliary network is solely used to construct pretraining signals and is discarded after pretraining. The main Transformer is fine-tuned for downstream tasks and is quite effective in a wide range of them (Clark et al., 2020b). The source of its empirical advantage, however, is somewhat of a mystery (Clark et al., 2020a). After all, the main Transformer is not even trained as a language model; it merely learns from the binary classification task, yet performs well on tasks where modeling sophisticated language semantics is required (Wang et al., 2018).

Clark et al. (2020b) explored an All-Token MLM task which trains the main Transformer to predict the original token besides detecting replacements, but it decreased the generalization ability of ELECTRA. Later, Clark et al. (2020a) proposed ELECTRIC, which uses language modeling probabilities to distinguish the original tokens from noises sampled from a cloze model instead of the auxiliary language model. It does not outperform ELECTRA but maintains the language modeling capability, which is necessary in some applications (Clark et al., 2020a).
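To make this signal-construction step concrete, the following is a minimal PyTorch sketch of how an auxiliary MLM can be used to corrupt a batch of sequences in the spirit of ELECTRA and COCO-LM. The function and tensor names (corrupt_with_aux_mlm, aux_model, mask_prob) are our own illustrative assumptions, not the released implementation; we only assume aux_model returns per-position vocabulary logits.

```python
import torch

def corrupt_with_aux_mlm(aux_model, input_ids, mask_token_id, mask_prob=0.15):
    """Build an ELECTRA/COCO-LM style corrupted sequence X^mlm.

    aux_model(masked_input) is assumed to return vocabulary logits of shape
    [batch, seq_len, vocab]; all names here are illustrative only.
    """
    # 1. Randomly choose positions to mask (the standard MLM ratio).
    masked_pos = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    masked_input = input_ids.masked_fill(masked_pos, mask_token_id)

    # 2. Let the auxiliary MLM fill the masked positions by sampling from its
    #    output distribution (more plausible than uniformly random noise).
    with torch.no_grad():
        logits = aux_model(masked_input)                       # [B, L, V]
        sampled = torch.distributions.Categorical(logits=logits).sample()

    corrupted = torch.where(masked_pos, sampled, input_ids)    # X^mlm
    # 3. Binary labels: y_i = 1 iff the token equals the original one.
    is_original = (corrupted == input_ids).long()
    return masked_input, corrupted, masked_pos, is_original
```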
3.2. COCO-LM Pretraining

COCO-LM also employs an auxiliary Transformer to corrupt text sequences with more challenging noises; we later show this is critical for the pretrained model's generalization ability (Sec. 5.2). Different from ELECTRA (Clark et al., 2020b), COCO-LM first revives the language modeling task on the corrupted text sequences. Then it introduces a new sequence-level task with contrastive learning (Sec. 3.2.2). The framework of COCO-LM is illustrated in Figure 1.

3.2.1. Corrective Language Modeling
Corrective Language Modeling (CLM) is a token-level pretraining task: Given a text sequence $X^{\text{mlm}}$ corrupted by the auxiliary network, CLM aims to recover the original tokens:
$$X^{\text{orig}} \xrightarrow{\text{Aux. Transformer}} X^{\text{mlm}} \xrightarrow{\text{CLM}} X^{\text{orig}}.$$
The noises in $X^{\text{mlm}}$ are considered plausible in the context by the auxiliary network and thus are more challenging.

The main Transformer performs the CLM task as:
$$X^{\text{mlm}} \xrightarrow{\text{Main Transformer}} H^{\text{clm}} \xrightarrow{\text{CLM Head}} X^{\text{orig}}; \quad \text{CLM Head}(h_i) = \arg\max_{x} \, p_{\text{CLM}}(x \mid h_i).$$
Here $H^{\text{clm}}$ are the representations from the main Transformer. The CLM head is similar to the one used in All-Token MLM (Clark et al., 2020b): a standard language modeling softmax plus a copy mechanism:
$$p_{\text{CLM}}(x_i \mid h_i) = \mathbb{1}(x_i = x^{\text{mlm}}_i)\, p_{\text{copy}}(y_i = 1 \mid h_i) + p_{\text{copy}}(y_i = 0 \mid h_i)\, \frac{\exp(x_i^{\top} h_i)}{\sum_{j} \exp(x_j^{\top} h_i)},$$
$$p_{\text{copy}}(y_i \mid h_i) = \frac{\exp(y_i \cdot w_{\text{copy}}^{\top} h_i)}{\exp(w_{\text{copy}}^{\top} h_i) + 1},$$
where $\mathbb{1}(\cdot)$ is the indicator function and $w_{\text{copy}}$ is a learnable weight. The copy mechanism adds the probability $p_{\text{copy}}(y_i = 1 \mid h_i)$ of copying the input word using the binary classification layer $p_{\text{copy}}(\cdot)$.

As shown in Clark et al. (2020b), pretraining $p_{\text{CLM}}(\cdot)$ using only a language modeling loss (All-Token MLM) performs worse than using the simple binary replaced token detection. We find that this is mainly due to All-Token MLM's ineffectiveness in handling the noises from the auxiliary language model: recovering the original token is much harder than merely detecting the replacement.

CLM improves the learning of token recovery using two techniques: a standard multi-task setting that explicitly learns the copy mechanism $p_{\text{copy}}(\cdot)$ with binary labels, and a stop-gradient ($\text{sg}$) operation that prevents the hard language modeling task from disturbing the copy mechanism:
$$\mathcal{L}_{\text{copy}} = - \sum_{i} \log p_{\text{copy}}(y^{*}_i \mid h_i), \quad (1)$$
$$\mathcal{L}_{\text{LM}} = - \sum_{i} \log p_{\text{CLM}}(x_i \mid h_i) = - \sum_{i} \log \left( \mathbb{1}(x_i = x^{\text{mlm}}_i)\, p^{\text{sg}}_{\text{copy}}(y_i = 1 \mid h_i) + p^{\text{sg}}_{\text{copy}}(y_i = 0 \mid h_i)\, \frac{\exp(x_i^{\top} h_i)}{\sum_{j} \exp(x_j^{\top} h_i)} \right), \quad (2)$$
$$\mathcal{L}_{\text{CLM}} = \lambda_{\text{copy}} \mathcal{L}_{\text{copy}} + \mathcal{L}_{\text{LM}},$$
where $y^{*}_i$ is the ground-truth label of the copy mechanism. The $\text{sg}$ superscript in Eqn. (2) denotes that gradients from $\mathcal{L}_{\text{LM}}$ do not update $p_{\text{copy}}(\cdot)$; $\lambda_{\text{copy}}$ is a hyperparameter balancing the two tasks. The binary cross-entropy loss in Eqn. (1) is dedicated to learning the copying probability, and we prevent the learning of $p_{\text{copy}}(\cdot)$ from being disturbed by the harder LM task $\mathcal{L}_{\text{LM}}$. This way, the main Transformer first learns the easier binary classification task, and uses the learned copy mechanism to improve the learning of the harder task.
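The following is a minimal PyTorch sketch of the CLM loss in Eqns. (1)-(2), including the copy mechanism and the stop-gradient on $p_{\text{copy}}(\cdot)$ inside the LM term. The tensor names and the use of a mean (rather than a sum) over positions are our own simplifications, not the released implementation.

```python
import torch
import torch.nn.functional as F

def clm_loss(h, token_emb, copy_w, input_ids, original_ids, lambda_copy=50.0):
    """Corrective Language Modeling loss (multi-task copy + LM, Eqns. 1-2).

    h:            [B, L, D] main-Transformer hidden states on X^mlm
    token_emb:    [V, D]    tied token embedding matrix (LM head weights)
    copy_w:       [D]       weight vector of the binary copy head
    input_ids:    [B, L]    corrupted tokens X^mlm
    original_ids: [B, L]    original tokens X^orig
    """
    # Binary copy head: p_copy(y_i = 1 | h_i) = sigmoid(w_copy^T h_i).
    copy_logit = torch.einsum("bld,d->bl", h, copy_w)

    # Eqn. (1): copy loss with ground-truth labels y*_i (1 iff token kept).
    y_star = (input_ids == original_ids).float()
    loss_copy = F.binary_cross_entropy_with_logits(copy_logit, y_star)

    # LM softmax over the vocabulary with tied embeddings.
    lm_prob = torch.softmax(torch.einsum("bld,vd->blv", h, token_emb), dim=-1)
    p_lm_orig = lm_prob.gather(-1, original_ids.unsqueeze(-1)).squeeze(-1)

    # Eqn. (2): stop-gradient on the copy probability inside the LM loss.
    p_copy_sg = torch.sigmoid(copy_logit).detach()
    indicator = (input_ids == original_ids).float()
    p_clm = indicator * p_copy_sg + (1.0 - p_copy_sg) * p_lm_orig
    loss_lm = -torch.log(p_clm + 1e-10).mean()

    return lambda_copy * loss_copy + loss_lm   # L_CLM
```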
3.2.2. Sequence Contrastive Learning

Besides token-level pretraining, COCO-LM introduces a sequence contrastive learning (SCL) task, which pretrains the model to provide better sequence-level representations. Specifically, in SCL, each original sequence is transformed separately via MLM replacement ($X^{\text{mlm}}$) and random cropping:
$$X^{\text{orig}} \xrightarrow{\text{Crop}} X^{\text{crop}}.$$
A training batch $B = \{X^{\text{mlm}}_1, \dots, X^{\text{mlm}}_N, X^{\text{crop}}_1, \dots, X^{\text{crop}}_N\}$ contains both MLM-replaced and cropped sequences (the crop operation keeps a random contiguous span of the original sequence to maintain its major meaning). We use the following contrastive learning loss to align the sequence representations of the positive pairs $(X^{\text{mlm}}_j, X^{\text{crop}}_j)$ and $(X^{\text{crop}}_j, X^{\text{mlm}}_j)$, in contrast to random pairs as negatives:
$$\mathcal{L}_{\text{SCL}} = - \sum_{\text{pos. pair } (X_j, X_k) \in B} \log \frac{\exp(s_j^{\top} s_k)}{\sum_{X_l \in B \setminus \{X_j\}} \exp(s_j^{\top} s_l)},$$
where $s_j = h_{j:\text{[CLS]}} / \lVert h_{j:\text{[CLS]}} \rVert$ is the $l_2$-normalized [CLS] sequence representation. This contrastive learning task requires the sequence embeddings of $X^{\text{mlm}}_j$ and $X^{\text{crop}}_j$ to be close to each other while staying away from other random sequences in the same batch $B$. This encourages the main network to produce representations invariant to minor token-level alterations (Purushwalkam & Gupta, 2020).

As the first step to leverage contrastive learning in sequence-level pretraining, we keep everything straightforward: a simple cropping is used as data augmentation and the default temperature is used in the softmax. Advanced data transformations (Qu et al., 2020) and hyperparameter explorations (Oord et al., 2018; Chen et al., 2020) may further improve COCO-LM but are reserved for future work.
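Below is a minimal PyTorch sketch of the SCL loss under these definitions: $2N$ $l_2$-normalized [CLS] vectors, positives at offset $N$, and all other in-batch sequences as negatives. The names and batching convention are illustrative assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def scl_loss(cls_mlm, cls_crop, temperature=1.0):
    """Sequence Contrastive Learning over a batch of N positive pairs.

    cls_mlm:  [N, D] [CLS] states of the MLM-corrupted sequences
    cls_crop: [N, D] [CLS] states of the cropped original sequences
    Each (X^mlm_j, X^crop_j) is a positive pair; all other 2N-2 sequences
    in the batch act as negatives. The paper uses the default temperature.
    """
    n = cls_mlm.size(0)
    s = F.normalize(torch.cat([cls_mlm, cls_crop], dim=0), dim=-1)   # [2N, D]
    sim = s @ s.t() / temperature                                    # [2N, 2N]
    # Exclude X_j itself from the softmax denominator.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=s.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # The positive of sequence j is j + N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(s.device)
    return F.cross_entropy(sim, targets)
```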
3.2.3. COCO-LM Training

Putting the two tasks together, the pretraining framework of COCO-LM can be summarized as:
$$X^{\text{orig}} \xrightarrow{\text{MLM}} X^{\text{mlm}}, \quad X^{\text{orig}} \xrightarrow{\text{Crop}} X^{\text{crop}}; \quad (3)$$
$$X^{\text{mlm}} \xrightarrow{\text{CLM}} X^{\text{orig}}, \quad X^{\text{mlm}} \xleftrightarrow{\text{SCL}} X^{\text{crop}}. \quad (4)$$
In Eqn. (3) we construct the pretraining signals for COCO-LM: The auxiliary network is pretrained by standard MLM to provide corrupted training sequences $X^{\text{mlm}}$; the original sequence is cropped to form a simple augmentation $X^{\text{crop}}$. In Eqn. (4) we leverage these signals to pretrain the main network, by correcting the replaced tokens at the token level (CLM) and by contrasting the representations of the replaced and cropped texts at the sequence level (SCL). The auxiliary network and the main network are pretrained side-by-side in COCO-LM's self-supervised learning framework.
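Putting the pieces together, a single COCO-LM pretraining step could be sketched as below, reusing the helper functions from the earlier sketches. random_crop, masked_lm_loss, and the main_model attributes (token_emb, copy_w) are hypothetical placeholders rather than the actual codebase.

```python
def coco_lm_step(aux_model, main_model, input_ids, mask_token_id):
    """One COCO-LM pretraining step following Eqns. (3)-(4) (sketch only)."""
    # Eqn. (3): construct the pretraining signals.
    masked, corrupted, masked_pos, _ = corrupt_with_aux_mlm(
        aux_model, input_ids, mask_token_id)
    cropped = random_crop(input_ids)          # keep a contiguous span of X^orig

    # Auxiliary network: standard MLM loss on the masked positions.
    loss_aux = masked_lm_loss(aux_model, masked, input_ids, masked_pos)

    # Eqn. (4): the main network corrects the corrupted sequence (CLM) ...
    h_mlm = main_model(corrupted)             # [B, L, D] hidden states
    loss_clm = clm_loss(h_mlm, main_model.token_emb, main_model.copy_w,
                        corrupted, input_ids)

    # ... and contrasts the corrupted and cropped views of X^orig (SCL).
    h_crop = main_model(cropped)
    loss_scl = scl_loss(h_mlm[:, 0], h_crop[:, 0])   # [CLS] at position 0

    # Equal task weights apart from lambda_copy inside the CLM loss (Appendix E).
    return loss_aux + loss_clm + loss_scl
```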
Model | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | RTE | MRPC | STS-B | AVG | SQuAD 2.0 EM | SQuAD 2.0 F1

Base Models: Wikipedia + BookCorpus
BERT (Devlin et al., 2019) | 84.50/- | 91.30 | 91.70 | 93.20 | 58.90 | 68.60 | 87.30 | – | – | – | –

Base Models: Bigger Training Data and/or More Training Steps
XLNet (Yang et al., 2019) | 86.80/- | 91.40 | 91.70 | 94.70 | 60.20 | 74.00 | 88.20 | 89.50 | 84.56 | 80.20 | –
RoBERTa (Liu et al., 2019) | 87.60/- | 91.90 | 92.80 | 94.80 | 63.60 | 78.70 | 90.20 | – | – | – | –

Table 1. Single model results on the GLUE and SQuAD 2.0 development sets. All our runs are five-run medians on GLUE and ten-run averages on SQuAD 2.0. Results not available in public reports are marked "–". Our evaluation metrics are Spearman correlation for STS-B, Matthews correlation for CoLA, and accuracy for the other GLUE tasks. AVG is the average of the eight GLUE tasks.
4. Experimental Setup
This section describes our experiment setups.
Pretraining Setting:
We employ two standard pretraining settings, base and base++. Base is the standard BERT base training configuration (Devlin et al., 2019): pretraining on Wikipedia (https://dumps.wikimedia.org/enwiki/) and BookCorpus (Zhu et al., 2015) (16 GB of texts) for 256 million samples on 512-token sequences (or 125K batches with 2048 batch size).
Base++ trains the model with the same configuration but larger corpora and/or more training steps. We follow the settings in XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and UniLM V2 (Bao et al., 2020), which add in OpenWebText (https://skylion007.github.io/OpenWebTextCorpus/), CC-News (Liu et al., 2019), and STORIES (Trinh & Le, 2018), for a total of roughly 160 GB of texts. We train for about 4 billion samples (with 2048 batch size), the same as Liu et al. (2019). There are inevitable variations in the pretraining corpora used in different work. Our base corpus is obtained from the authors of MC-BERT (Xu et al., 2020) and TUPE (Ke et al., 2020). Our base++ corpus is most similar to those used in UniLM (Dong et al., 2019; Bao et al., 2020).

Downstream Tasks:
We use the tasks included in the GLUE benchmark (Wang et al., 2018) and SQuAD 2.0 reading comprehension (Rajpurkar et al., 2016). The fine-tuning protocols are based on the open-source implementation released by Ke et al. (2020) for the GLUE tasks and by huggingface (Wolf et al., 2019) for SQuAD. All pretrained models are evaluated with the same fine-tuning protocols, and the reported results are the median/average of five/ten random seeds on GLUE/SQuAD. Please refer to the Appendix for more details.

Model Architecture: Our main network uses the RoBERTa base architecture (Liu et al., 2019): a 12-layer Transformer with 768 hidden size and BPE tokenization (Sennrich et al., 2015), plus T5 relative position encoding (Raffel et al., 2019). Our auxiliary network is the same except that we use a shallower Transformer (still with 768 hidden size).

Baselines:
We list the reported numbers from many recent studies on GLUE and SQuAD where available (more details in Appendix C). To reduce the variance in data processing/environments and provide fair comparisons, we also implement, pretrain, and fine-tune RoBERTa and ELECTRA under exactly the same setting, marked with "(Ours)".
Implementation Details:
Our implementation is built upon the open-source release of MC-BERT (Xu et al., 2020) and its ELECTRA reproduction based on fairseq (Ott et al., 2019). Standard hyperparameters are used in pretraining and fine-tuning. We conduct pretraining on Nvidia DGX-2 boxes. The hyperparameter settings and pretraining environments are listed in Appendix D.
5. Evaluation Results
In this section, we first present the overall evaluation results and ablations of various techniques in COCO-LM. Then we analyze the influence of its two pretraining tasks.
Group | Method | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | RTE | MRPC | STS-B | AVG
Baseline | RoBERTa (Ours) | 85.61/85.51 | 91.34 | 91.80 | 93.86 | 58.64 | 69.03 | 87.50 | 86.53 | 83.03
Baseline | ELECTRA (Ours) | 86.92/86.72 | 91.86 | 92.56 | 93.64 | 66.50 | 75.28 | 88.46 | 88.04 | 85.39
Original | COCO-LM Base | 88.67/88.35 | 92.02 | 93.00 | 94.08 | 65.41 | 85.42 | 91.51 | 88.61 | 87.05
Pretraining Task | CLM Only | 88.64/88.40 | 92.03 | 93.14 | 93.86 | 66.95 | 80.90 | 89.90 | 88.45 | 86.72
Pretraining Task | SCL Only | 88.62/88.14 | 92.14 | 93.45 | 93.86 | 64.70 | 82.57 | 90.38 | 89.35 | 86.86
Architecture | w/o. Rel-Pos | 88.20/87.75 | 92.17 | 93.44 | 93.75 | 68.09 | 82.64 | 91.19 | 88.90 | 87.27
Architecture | w/o. Shallow-Aux | 88.05/87.75 | 91.88 | 92.71 | 93.64 | 63.73 | 81.53 | 89.50 | 88.24 | 86.14
Noise Construction | w. Randomly Sampled Noises | 84.94/84.74 | 91.36 | 91.08 | 91.63 | 40.82 | 70.50 | 87.34 | 84.86 | 80.30
Noise Construction | w. Fixed Auxiliary | 87.94/87.98 | 92.03 | 92.96 | 93.18 | 64.68 | 81.53 | 89.98 | 88.22 | 86.32
CLM Setup (No SCL) | All-Token LM Only | 87.17/86.97 | 91.74 | 92.58 | 93.75 | 61.02 | 73.54 | 88.70 | 87.70 | 84.51
CLM Setup (No SCL) | CLM w/o. Copy | 88.02/87.87 | 91.81 | 93.11 | 94.53 | 65.71 | 76.60 | 89.42 | 88.17 | 85.91
CLM Setup (No SCL) | CLM w/o. Stop-grad | 88.53/88.19 | 91.95 | 92.88 | 94.32 | 67.52 | 80.76 | 89.66 | 88.78 | 86.78
Table 2. Ablation results on the GLUE Dev. sets. Variations in each group eliminate (w/o.), keep only (Only), or switch (w.) one component.
Figure 2.
COCO-LM Base accuracy on the MNLI Dev. sets (y-axes) at different pretraining hours on four DGX-2 nodes (V100 GPUs). Panels: (a) MNLI-m, (b) MNLI-mm. The final training hours and accuracy of RoBERTa (Ours) and ELECTRA (Ours) are measured in exactly the same settings and computing environments.
Table 1 shows the results of COCO-LM. The smaller GLUE tasks (CoLA, RTE, MRPC, and STS-B) are unstable: much pretraining research omits them, and more advanced fine-tuning strategies are required to achieve stable evaluations (Aghajanyan et al., 2020). On tasks where fine-tuning is more stable (e.g., MNLI and SQuAD), COCO-LM provides the biggest improvements, for example, on MNLI-m accuracy, SQuAD EM, and GLUE AVG over the best base-setting baselines.

For fine-tuning and inference, COCO-LM does not incur extra computation cost, as it has the same architecture as BERT besides relative position embeddings. The extra computation cost in pretraining for better pretrained models is often considered a worthwhile one-time investment. Still, we show the MNLI accuracy of COCO-LM at different pretraining hours versus the full RoBERTa (Ours) and ELECTRA (Ours) runs in Figure 2. The full pretraining of COCO-LM requires more GPU hours compared to ELECTRA, with the cost from CLM and SCL, while both are more costly than RoBERTa due to the auxiliary network. However, COCO-LM turns out to be a better choice in both accuracy and efficiency: it outperforms ELECTRA and RoBERTa by more than 1 point on MNLI with the same compute, while requiring less compute to reach the same accuracy.

We conduct ablation studies on COCO-LM base on the GLUE Dev. sets (Table 2). We reduce variance by fixing seeds and picking median-performing checkpoints from multiple pretraining runs. Nevertheless, there still exists randomness in pretraining that leads to small fluctuations on MNLI.

Pretraining Task.
CLM or SCL individually provides significantly better performance than previous approaches on MNLI. Their advantages are better observed on different tasks, for example, CLM on MNLI-mm and SCL on STS-B. Combining the two in COCO-LM provides a better overall average. In later experiments we further analyze the behavior of these two pretraining tasks.
Architecture.
The two notable differences in the Transformer architecture of COCO-LM are relative position encoding (Rel-Pos) and a shallow auxiliary network instead of the deeper but skinnier one in ELECTRA (Clark et al., 2020b). Removing Rel-Pos leads to better numbers on some tasks but significantly hurts MNLI; its higher GLUE AVG is mainly contributed by CoLA. Using a shallow auxiliary network is more effective than ELECTRA's generator, which has more layers but a smaller hidden dimension.

Pretraining Signal Construction.
Similar to ELECTRA, COCO-LM uses the auxiliary network to sample more challenging noises to push the main language model. This is critical, as the same model with Randomly Sampled Noises performs worse than vanilla RoBERTa. Pretraining the two networks side-by-side provides a learning curriculum for the main network, as the noises from the auxiliary network start near random and become more challenging along the way. Pretraining the main network with a pretrained and Fixed Auxiliary network performs worse.
CLM Setup.
Switching the multi-task learning in CLM to the All-Token MLM loss (Clark et al., 2020b) significantly reduces the model's generalization ability. The copy mechanism and the stop-gradient operation are also important for maintaining CLM's effectiveness. The next experiments analyze how our CLM setup helps handle the challenging noises from the auxiliary network.
Figure 3. The training curves of CLM variations in COCO-LM (CLM, CLM w/o. Copy, CLM w/o. Stop-Grad, and All-Token MLM Only). All x-axes are training steps (in thousands) and y-axes mark the tracked statistics: copy mechanism accuracy (i.e., the main network's binary classification accuracy) on (a) the replaced tokens and (b) the original tokens; CLM accuracy (i.e., the accuracy of outputting the original tokens) on (c) the replaced tokens and (d) the original tokens.
Figure 4. The entropy of learned attention weights (after softmax) in different Transformer layers for COCO-LM, CLM Only, SCL Only, and RoBERTa. The x-axis marks the layer index (smaller means closer to the input tokens). The entropy is averaged over all tokens in the MNLI corpus without fine-tuning.
In this experiment we analyze how CLM helps overcome the challenging noises and enables better pretraining of the main network with a language modeling loss. In Figure 3, we show the pretraining curves of CLM and its three variations: CLM without the copy mechanism (w/o. Copy), CLM without the stop-gradient operation (w/o. Stop-Grad), and All-Token MLM Only, which only uses the LM loss.

The copy mechanism is important to help the model detect challenging noises: CLM (w/o. Copy) mistakes many original tokens as noises, with much worse LM accuracy on original tokens (Figure 3d). This hurts its generalization ability, as shown in Table 2. Also, the noises from the auxiliary network make it quite challenging to learn to copy and correct solely from the language modeling loss, as can be seen from the large gap in copy accuracy between All-Token MLM Only and the rest (Figure 3a and 3b): its $p_{\text{copy}}(\cdot)$ is disturbed too much and never recovers. The stop-gradient operation further helps avoid the disturbance from the hard LM loss to the classification loss. In summary, the multi-task learning and stop-gradient designs in CLM are important for the language model to effectively learn from the challenging pretraining signals instead of being confused by them.
Model | All Words: Self | All Words: Rand. | All Words: Diff. | Stop Words: Self | Stop Words: Rand. | Stop Words: Diff.
RoBERTa | 0.781 | 0.603 | 0.178 | 0.812 | 0.664 | 0.148
ELECTRA | 0.722 | 0.603 | 0.119 | 0.682 | 0.606 | 0.076
COCO-LM | 0.738 | 0.626 | 0.112 | 0.699 | 0.639 | 0.060
CLM Only | 0.721 | 0.616 | 0.105 | 0.718 | 0.651 | 0.067
SCL Only | 0.680 | 0.595 | 0.085 | 0.669 | 0.619 | 0.050

Table 3. Average cosine similarity between representations of the same word in different contexts (Self), between random word pairs (Rand.), and their difference (Diff.), computed on the MNLI corpus without fine-tuning, for all words and for stop words.
What are the differences made by pretraining a Transformer with COCO-LM? In the following, we analyze the learned attention weights and token representations in the main Transformer pretrained by COCO-LM and baselines.
More Spread Attention Weights.
We first calculate the entropy of the attention weights (Clark et al., 2019) learned by COCO-LM base variations and compare it with RoBERTa (Ours) in Figure 4. All COCO-LM variations have significantly higher attention entropy in their last four layers compared to RoBERTa, indicating that each token attends to more other tokens rather than concentrating on a few. The more challenging noises in COCO-LM require the main Transformer to consider a wider range of context.
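As a concrete illustration, here is a small PyTorch sketch of the entropy computation described above; it assumes the per-layer attention maps have already been extracted from the model, and the tensor layout and names are our own rather than tied to any particular codebase.

```python
import torch

def avg_attention_entropy(attentions, attention_mask):
    """Average entropy of attention distributions per layer.

    attentions:     list of [B, heads, L, L] post-softmax attention maps
                    (one entry per Transformer layer)
    attention_mask: [B, L] with 1 for real tokens, 0 for padding
    Returns one averaged entropy value per layer.
    """
    token_mask = attention_mask.bool()
    entropies = []
    for attn in attentions:
        # H(p) = -sum_k p_k log p_k over each query token's attention row.
        ent = -(attn * torch.log(attn + 1e-10)).sum(dim=-1)   # [B, heads, L]
        ent = ent.mean(dim=1)                                  # average over heads
        entropies.append(ent[token_mask].mean().item())        # average over real tokens
    return entropies
```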
More Contextualized Token Representation.
Next, we calculate the self-similarity of the representations of the same word when appearing in different contexts (Ethayarajh, 2019): a less self-similar token representation indicates that the Transformer is more contextualized. The results are shown in Table 3.

Both ELECTRA and COCO-LM have significantly more contextualized representations than RoBERTa. Their challenging noises require the main Transformers to rely on more context to distinguish and recover replaced tokens, while [MASK] tokens in RoBERTa can be easily recognized by the model. Among COCO-LM variations, SCL Only is significantly more contextualized than CLM Only. We further analyze the behaviors and influences of the SCL task in the next experiment.
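A minimal sketch of this self-similarity measurement is shown below; the input format (a per-word collection of contextual vectors) and the function name are illustrative assumptions rather than the analysis script used in the paper.

```python
import torch
import torch.nn.functional as F

def self_similarity(contextual_vectors):
    """Average cosine similarity between representations of the same word
    in different contexts (in the spirit of Ethayarajh, 2019).

    contextual_vectors: dict mapping a word to a [k, D] tensor of its
    hidden states collected from k different sentences (k >= 2).
    """
    sims = []
    for vecs in contextual_vectors.values():
        if vecs.size(0) < 2:
            continue
        v = F.normalize(vecs, dim=-1)
        cos = v @ v.t()                                  # [k, k] pairwise cosines
        k = cos.size(0)
        off_diag = cos[~torch.eye(k, dtype=torch.bool)]  # drop self-pairs
        sims.append(off_diag.mean())
    return torch.stack(sims).mean().item()
```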
Figure 5.
The cosine similarity between the [CLS] embeddings of positive and negative sequence pairs during pretraining. Panels: (a) without SCL, (b) with SCL.
Figure 6.
Few-shot accuracy on MNLI with a fraction of the MNLI training set used (x-axes). Panels: (a) MNLI-m, (b) MNLI-mm. The error bars mark the max/min and the solid lines are the average of five fine-tuning runs.
The last group of experiments shows various notable characteristics of sequence contrastive learning. All the experiments are conducted under the base pretraining setting.
Contrastive Learning As Regularization.
Our contrastive learning task is simple: matching a positive sequence pair (i.e., a cropped sub-sequence and an MLM-corrupted sequence) among random pairs. Even without SCL, one would expect the Transformer to map two sequences with many overlapping terms closer in the representation space by default. However, as shown in Figure 5, this is not the case: when pretrained without SCL, the cosine similarity of the positive pairs is actually lower than that of random negatives. The representation space without SCL is also so anisotropic that random pairs have high cosine similarity. Explicit training at the sequence level with SCL is necessary to regularize the sequence representation space, aligning similar sequences and decoupling random ones.

Better Few-Shot Ability.
One advantage of the more regularized sequence representations with SCL is improved few-shot ability. As shown in Figure 6, SCL provides notable improvements under few-shot settings: "w. SCL" outperforms "w/o. SCL" on MNLI-m/mm when only a small fraction of the fine-tuning labels is used. With only a fraction of the MNLI labels, "w. SCL" reaches MNLI accuracy better than RoBERTa (Ours) fine-tuned with the full data, and performs on par with ELECTRA (Ours) fine-tuned with the full data.
Figure 7.
The t-SNE plots of learned sequence representations with or without SCL. Panels: (a) without SCL, (b) with SCL. The points are sampled from the most semantically similar sentence pairs in STS-B (those with the highest similarity labels). The [CLS] embeddings are obtained without fine-tuning. Some similar pairs are randomly selected and marked by the same shapes.

Alignment and Uniformity.
Another advantage of contrastive learning, known from visual representations, is that it provides better alignment of related pairs and allocates random points more uniformly in the space (Wang & Isola, 2020). To study whether this holds in language representation learning, we plot the representations of semantically similar STS-B sentence pairs from COCO-LM in Figure 7 using t-SNE (Coenen et al., 2019). The similar sentence pairs (marked by the same shapes) are aligned closer when pretrained with SCL; their average cosine similarity is notably higher with SCL than without. The uniformity is less observed: both figures show non-uniform patterns, perhaps because the random negatives used in SCL are not sufficient to regularize the representation space. More sophisticated negative sample construction might improve the uniformity of the language representation space (He et al., 2019; Xiong et al., 2020).
6. Conclusions
In this paper, we present COCO-LM, a self-supervised learning framework that pretrains language models via correcting and contrasting text sequences with more challenging noises. The advantages of COCO-LM over previous pretraining approaches include pretraining with more challenging noises from the auxiliary language model, a multi-task corrective language modeling setting that robustly learns to recover original tokens, and a sequence contrastive learning task that regularizes sequence representations during pretraining. Our experiments demonstrate that COCO-LM not only provides better generalization ability, but also enjoys higher efficiency in terms of downstream task performance achieved per pretraining hour. More importantly, we conduct extensive analyses on the influence of each technique in COCO-LM and their behaviors in different conditions. We hope our studies will inspire more future explorations of effective and efficient pretraining frameworks, including better construction of pretraining signals, more contrastive learning techniques, and new pretraining tasks.
References
Aghajanyan, A., Shrivastava, A., Gupta, A., Goyal, N., Zettlemoyer, L., and Gupta, S. Better fine-tuning by reducing representational collapse. arXiv preprint arXiv:2008.03156, 2020.

Bao, H., Dong, L., Wei, F., Wang, W., Yang, N., Liu, X., Wang, Y., Piao, S., Gao, J., Zhou, M., and Hon, H.-W. UniLMv2: Pseudo-masked language models for unified language model pre-training. Preprint, 2020.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155, 2003.

Bentivogli, L., Clark, P., Dagan, I., and Giampiccolo, D. The fifth PASCAL recognizing textual entailment challenge. In TAC, 2009.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. 2020.

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.

Chi, Z., Dong, L., Wei, F., Yang, N., Singhal, S., Wang, W., Song, X., Mao, X.-L., Huang, H., and Zhou, M. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. arXiv preprint arXiv:2007.07834, 2020.

Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 276-286, August 2019.

Clark, K., Luong, M.-T., Le, Q., and Manning, C. D. Pre-training transformers as energy-based cloze models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 285-294, 2020a.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2020b.

Coenen, A., Reif, E., Yuan, A., Kim, B., Pearce, A., Viégas, F., and Wattenberg, M. Visualizing and measuring the geometry of BERT. arXiv preprint arXiv:1906.02715, 2019.

Dagan, I., Glickman, O., and Magnini, B. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177-190. Springer, 2005.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional Transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171-4186, 2019.

Dolan, W. B. and Brockett, C. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13063-13075, 2019.

Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512, 2019.

Fang, H. and Xie, P. CERT: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766, 2020.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.

Giampiccolo, D., Magnini, B., Dagan, I., and Dolan, B. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pp. 1-9, 2007.

Gunel, B., Du, J., Conneau, A., and Stoyanov, V. Supervised contrastive learning for pre-trained language model fine-tuning. arXiv preprint arXiv:2011.01403, 2020.

Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M.-W. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.
Haim, R. B., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B., and Szpektor, I. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, 2006.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., and Levy, O. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Ke, G., He, D., and Liu, T.-Y. Rethinking the positional encoding in language pre-training. arXiv preprint arXiv:2006.15595, 2020.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

Lewis, M., Ghazvininejad, M., Ghosh, G., Aghajanyan, A., Wang, S., and Zettlemoyer, L. Pre-training via paraphrasing. Advances in Neural Information Processing Systems, 33, 2020.

Li, B., Zhou, H., He, J., Wang, M., Yang, Y., and Li, L. On the sentence embeddings from pre-trained language models. arXiv preprint arXiv:2011.05864, 2020.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

Pham, H., Xie, Q., Dai, Z., and Le, Q. V. Meta pseudo labels. arXiv preprint arXiv:2003.10580, 2020.

Purushwalkam, S. and Gupta, A. Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases. arXiv preprint arXiv:2007.13916, 2020.

Qu, Y., Shen, D., Shen, Y., Sajeev, S., Han, J., and Chen, W. CoDA: Contrast-enhanced and diversity-promoting data augmentation for natural language understanding. arXiv preprint arXiv:2010.08670, 2020.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383-2392, 2016.

Ravula, A., Alberti, C., Ainslie, J., Yang, L., Pham, P. M., Wang, Q., Ontanon, S., Sanghai, S. K., Cvicek, V., and Fisher, Z. ETC: Encoding long and structured inputs in transformers. 2020.

Roberts, A., Raffel, C., and Shazeer, N. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020.

Rosset, C., Xiong, C., Phan, M., Song, X., Bennett, P., and Tiwary, S. Knowledge-aware language model pretraining. arXiv preprint arXiv:2007.00655, 2020.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Shankar, I., Nikhil, D., and Kornél, C. First Quora dataset release: Question pairs, 2017.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642, 2013.
Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pp. 5926-5936, 2019.

Thakur, N., Reimers, N., Daxenberger, J., and Gurevych, I. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. arXiv preprint arXiv:2010.08240, 2020.

Trinh, T. H. and Le, Q. V. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv preprint arXiv:2005.10242, 2020.

Wang, W., Bi, B., Yan, M., Wu, C., Bao, Z., Xia, J., Peng, L., and Si, L. StructBERT: Incorporating language structures into pre-training for deep language understanding. arXiv preprint arXiv:1908.04577, 2019.

Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625-641, 2019.

Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112-1122, 2018.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

Wu, Z., Wang, S., Gu, J., Khabsa, M., Sun, F., and Ma, H. CLEAR: Contrastive learning for sentence representation. arXiv preprint arXiv:2012.15466, 2020.

Xiong, L., Xiong, C., Li, Y., Tang, K.-F., Liu, J., Bennett, P., Ahmed, J., and Overwijk, A. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808, 2020.

Xu, Z., Gong, L., Ke, G., He, D., Zheng, S., Wang, L., Bian, J., and Liu, T.-Y. MC-BERT: Efficient language pre-training via a meta controller. arXiv preprint arXiv:2006.05744, 2020.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pp. 5754-5764, 2019.

Ye, Q., Li, B. Z., Wang, S., Bolte, B., Ma, H., Ren, X., Yih, W.-t., and Khabsa, M. Studying strategically: Learning to mask for closed-book QA. arXiv preprint arXiv:2012.15856, 2020.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19-27, 2015.
Dataset | Size | Task | Metric(s) | Domain
MNLI | 393K | Inference | Accuracy | Misc.
QQP | 364K | Similarity | Accuracy/F1 | Social QA
QNLI | 108K | QA/Inference | Accuracy | Wikipedia
SST-2 | 67K | Sentiment | Accuracy | Movie Reviews
CoLA | 8.5K | Acceptability | Matthews corr. | Misc.
RTE | 2.5K | Inference | Accuracy | Misc.
MRPC | 3.7K | Paraphrase | Accuracy/F1 | News
STS-B | 5.7K | Similarity | Pearson/Spearman corr. | Misc.

Table 4. The list of tasks in GLUE, their training data sizes, language tasks, evaluation metrics, and corpus domains.
A. GLUE Tasks
We provide more details of the tasks included in the GLUE benchmark. Their statistics are listed in Table 4.
MNLI:
Multi-genre Natural Language Inference (Williams et al., 2018) contains 393K train examples obtained via crowdsourcing. The task is to predict whether a given premise sentence entails, contradicts, or is neutral with respect to a given hypothesis sentence.
QQP:
Question Pairs (Shankar et al., 2017) contains 364K train examples from the Quora question-answering website. The task is to determine whether a pair of asked questions are semantically equivalent.
QNLI:
Question Natural Language Inference contains 108K train examples derived from the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). The task is to predict whether a given sentence contains the answer to a given question sentence.
SST-2:
Stanford Sentiment Treebank (Socher et al., 2013) contains 67K train examples extracted from movie reviews with human-annotated sentiment scores. The task is to determine whether the sentence has positive or negative sentiment.
CoLA:
Corpus of Linguistic Acceptability (Warstadt et al., 2019) contains 8.5K train examples from books and journal articles on linguistic theory. The task is to determine whether a given sentence is linguistically acceptable or not.
RTE:
Recognizing Textual Entailment (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) contains 2.5K train examples from textual entailment challenges. The task is to predict whether a given premise sentence entails a given hypothesis sentence or not.
MRPC:
Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) contains 3.7K train examples from online news sources. The task is to predict whether two sentences are semantically equivalent or not.
STS-B:
Semantic Textual Similarity (Cer et al., 2017) contains 5.7K train examples drawn from multiple sources with human annotations on sentence-pair semantic similarity. The task is to predict how semantically similar two sentences are on a 0 to 5 scoring scale.

B. SQuAD Fine-Tuning Details
Our pretraining code is built on top of the MC-BERT codebase, including its data and training pipelines. We noticed that, as an artifact of the data pre-processing (specifically the punctuation and whitespace handling in fairseq), we have to adjust the start and end span offsets in the SQuAD training data to match those in the pre-processed data. After model inference, we post-process the predicted offsets by reversing the previous adjustment to obtain the desired output offsets in the raw data format. As a result, our SQuAD implementation is not exactly the same as those used in previous approaches based on the huggingface codebase. This makes the SQuAD score comparison between our methods and previously reported methods imperfect, due to the different pre-processing and post-processing applied, and also our smaller hyperparameter search space in fine-tuning. The SQuAD results of our own baseline runs are fair comparisons.
C. The Origins of Reported Baseline Scores
The baseline results listed in Table 1 are obtained from their original papers except the following: BERT from (Bao et al., 2020), RoBERTa SQuAD from (Bao et al., 2020), ELECTRA GLUE from (Xu et al., 2020), XLNet base++ from (Bao et al., 2020), and RoBERTa base++ SQuAD from (Bao et al., 2020). When multiple papers report different scores for the same method, we use the highest of them in our comparisons.
D. More Implementation Details
Pretraining and Fine-tuning Costs.
The pretraining cost of COCO-LM's CLM task is similar to ELECTRA's, which is BERT plus the auxiliary network (a fraction of the main network in size). The addition of the SCL task requires one more forward and backward pass on the cropped sequence $X^{\text{crop}}$. With V100 GPUs, one pretraining run takes on the order of hours in the base setting and about two to three weeks in the base++ setting. The fine-tuning costs are the same as BERT plus relative position encodings, as the same Transformer model is used.

MLM Mode for Corrective Language Modeling.
When creating the MLM-replaced sequence $X^{\text{mlm}}$, we find it slightly improves downstream task performance to disable dropout (i.e., set the auxiliary MLM to inference mode) when computing the auxiliary network's output distribution from which plausible replacement tokens are sampled. We hypothesize that this leads to more stable generation of challenging replaced tokens to be corrected by the main Transformer, and thus improves downstream task results.
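A minimal sketch of this trick is shown below; aux_model and masked_input are placeholder names, and we assume a standard PyTorch module whose eval() call disables dropout.

```python
import torch

def aux_sampling_logits(aux_model, masked_input):
    """Compute the auxiliary MLM's output distribution with dropout disabled
    (inference mode) for replacement-token sampling, then restore training mode."""
    was_training = aux_model.training
    aux_model.eval()                       # turn off dropout for sampling only
    with torch.no_grad():
        logits = aux_model(masked_input)
    if was_training:
        aux_model.train()                  # the auxiliary MLM keeps being trained
    return logits
```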
Parameter | Pre-training (base) | Pre-training (base++) | GLUE Fine-tuning | SQuAD Fine-tuning
Max Steps | 125K | 1.95M | – | –
Max Epochs | – | – | {3, 5, 10} | { }
Batch Size | 2048 | 2048 | {16, 32} | {32, 48}
Learning Rate Decay | Linear | Linear | Linear | Linear
Warm-up Proportion | 8% | – | 6% | 10%
Sequence Length | 512 | 512 | 512 | 384
Adam ε | – | – | – | –
Adam (β1, β2) | (0.9, 0.98) | (0.9, 0.98) | (0.9, 0.98) | (0.9, 0.98)
Clip Norm | 2.0 | 2.0 | – | –
Dropout | 0.1 | 0.1 | 0.1 | –

Table 5. Hyperparameters used in pretraining and hyperparameter ranges searched for fine-tuning.
Masking Special Tokens for MLM Training.
BERT only masks real tokens (excluding artificial symbols like [SEP] and [CLS]) for MLM training, while RoBERTa also masks special tokens. We follow the RoBERTa setting, which results in slightly improved performance on some tasks.
E. Hyperparameter Settings
Tuning hyperparameters of pretraining is often too costly and we keep most hyperparameters at their defaults. The auxiliary MLM pretraining uses the standard [MASK] ratio. The crop transformation in the SCL task keeps a contiguous sub-sequence of the original sequence. The softmax temperature in the SCL task is the default. All pretraining tasks in COCO-LM have equal weights except $\lambda_{\text{copy}} = 50$, since the loss of the binary classification task is much lower than those of the LM tasks, which are classification tasks over the full vocabulary. All token embeddings are shared between the auxiliary Transformer and the main Transformer. The detailed hyperparameters used in pretraining and fine-tuning are listed in Table 5.

All reported methods use the exact same (or equivalent) set of hyperparameters for pretraining and fine-tuning for fair comparison. For COCO-LM and all the baselines implemented under our setting, all fine-tuning hyperparameters are searched per task; the median/average of five/ten runs with the same set of five/ten random seeds is reported on GLUE/SQuAD.
Currently, the biggest challenge for PLM research and development is perhaps its prohibitive computation cost. On the one hand, PLMs have influenced a wide range of tasks, and any further technical improvement can matter a lot for downstream applications, considering that PLMs have been deployed in so many NLP tasks. On the other hand, the expensive computing cost and long experimental cycles pose great challenges for careful and thorough studies of the problem space, as any test of a new design comes with a considerable computing cost: pretraining a new language model can easily consume thousands of dollars, or even millions for extra-large models.

Such challenges call for more systematic evaluation pipelines that can accurately and reliably judge whether or not a new PLM is really better than previous ones. Currently, the evaluation of PLMs largely relies on GLUE-style benchmarks, which contain a set of different tasks that are weighted equally for PLM evaluation; usually the average performance over these tasks is treated as the final measure of a PLM's effectiveness. However, we find that the small tasks in GLUE have very high variances, which may provide unreliable indications of a PLM's performance. For example, on CoLA and RTE, fine-tuning with different random seeds from the same pretrained checkpoint can easily result in large differences in the final score.