A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives
Nils Rethmeier and Isabelle Augenstein
German Research Center for AI, Berlin, Germany
University of Copenhagen, Copenhagen, Denmark
[email protected], [email protected]
Abstract
Modern natural language processing (NLP) methods employ self-supervised pretraining objectives such as masked language modeling to boost the performance of various application tasks. These pretraining methods are frequently extended with recurrence, adversarial or linguistic property masking, and more recently with contrastive learning objectives. Contrastive self-supervised training objectives enabled recent successes in image representation pretraining by learning to contrast input-input pairs of augmented images as either similar or dissimilar. However, in NLP, automated creation of text input augmentations is still very challenging because a single token can invert the meaning of a sentence. For this reason, some contrastive NLP pretraining methods contrast over input-label pairs, rather than over input-input pairs, using methods from Metric Learning and Energy Based Models. In this survey, we summarize recent self-supervised and supervised contrastive NLP pretraining methods and describe where they are used to improve language modeling, few or zero-shot learning, pretraining data-efficiency and specific NLP end-tasks. We introduce key contrastive learning concepts with lessons learned from prior research and structure works by applications and cross-field relations. Finally, we point to open challenges and future directions for contrastive NLP to encourage bringing contrastive NLP pretraining closer to recent successes in image representation pretraining.
1 Introduction

Current downstream machine learning applications heavily rely on the effective pretraining of representation learning models. Contrastive learning is one such technique, which enables pretraining of general or task-specific data encoder models in a supervised or self-supervised fashion to increase the downstream performance of language or image representations. While contrastive pretraining in computer vision has enabled the recent successes in self-supervised image representation pretraining, the benefits and best practices of contrastive pretraining in natural language processing (NLP) are still comparatively less established [Jaiswal et al., 2021].

Figure 1: Types of contrastive pretraining, organized as end-task agnostic vs. end-task specific and supervised vs. self-supervised, and works that fall within these categories: CLESS [Rethmeier, 2020], CSS [Klein, 2020], GILE [Pappas, 2019], CONPONO [Iter, 2020], CPC [Oord, 2018], COCO-LM [Meng, 2021], DeCLUTR [Giorgi, 2020], CLEAR [Wu, 2020], CERT [Fang, 2020], BIT [Duan, 2019], UST [Uehara, 2020], OLFMLM [Aroca-Ouellette, 2020], Word2vec [Mikolov, 2013], ALIGN [Jia, 2021], Electric [Clark, 2020], QT [Logeswaran, 2018], CoDA [Qu, 2021], MixText [Chen, 2020], CLIP [Radford, 2021], TCN [Jiang, 2019]. Text-image contrastive works are specially marked in the figure.

However, there is a first line of works on contrastive NLP methods which show strong performance and data-efficiency benefits of (self-)supervised contrastive NLP pretraining, as illustrated in Fig. 1. For example, supervised contrastive pretraining enables zero-shot prediction of unseen text classes and improves few-shot performance [Pappas and Henderson, 2019]. Moreover, task-agnostic self-supervised contrastive pretraining systems have been shown to improve language modeling [Logeswaran and Lee, 2018; Clark et al., 2020; Wu et al., 2020; Giorgi et al., 2020], while [Rethmeier and Augenstein, 2020] develop a data-efficient contrastive pretraining method for improved zero-shot and long-tail learning. Others propose task-specific contrastive self-supervision for pronoun disambiguation [Klein and Nabi, 2020], discourse representation learning [Iter et al., 2020], text summarization [Duan et al., 2019] and other NLP tasks, as we will describe in §3.

Contributions: In this primer to contrastive pretraining, we therefore summarize recent (self-)supervised contrastive NLP pretraining methods and describe how they enable zero-shot learning and improve language modeling, few-shot learning, pretraining data-efficiency or rare event prediction. We cover basic concepts and crucial design lessons of contrastive NLP, while detailing the resulting benefits such as zero-shot prediction and efficient training. Then, we structure existing research as supervised or self-supervised contrastive pretraining and explain connections to energy based models (EBMs), since many works refer to EBMs. Finally, we point out open challenges and outline future and underrepresented research directions in contrastive NLP pretraining.
2 Contrastive Learning Concepts and Lessons Learned

At the core of contrastive methods is the idea of learning to contrast between pairs of similar and dissimilar data points. A pair of similar data points is called a positive sample if both data points are different representations or views of the same data instance. Negative samples are pairs where the two data points are of different data instances. For contrastive learning, such data points can either be input-input $(x_i, x_j)$ or input-label $(x_i, y_c)$ pairs. While contrastive computer vision methods learn from input-input (image-image) pairs $(x_i, x_j)$ [Jaiswal et al., 2021; Chen et al., 2020b], NLP methods additionally use input-output (text, label) pairs $(x_i, y_c)$. Here $x_i$ are input text embeddings, while $y_c$ are label embeddings of a short text that describes a label, i.e. an extreme summarization of the input text, to get two views of said text.

2.1 Noise Contrastive Estimation (NCE)

Noise contrastive estimation is the objective used by most contrastive learning approaches within NLP. Thus, we briefly outline its main variants and the core ideas behind them, while pointing to [Ma and Collins, 2018] for detailed, yet readily understandable explanations of the two main NCE variants (see also the accompanying talk at https://vimeo.com/306156327). Both variants can intuitively be understood as a sub-sampled softmax with $K$ negative samples $a_i^-$ and one positive sample $a_i^+$. The first variant expresses NCE as a binary objective (loss) in the form of maximum log likelihood, where only $K$ negatives are considered:

$$\mathcal{L}_B(\theta, \gamma) = \log \sigma\big(s(x_i, a_i^+; \theta), \gamma\big) + \sum_{k=1}^{K} \log \Big(1 - \sigma\big(s(x_i, a_{i,k}^-; \theta), \gamma\big)\Big) \quad (1)$$

Here, $s(x_i, a_{i,\circ}; \theta)$ is a scoring or similarity function that measures the compatibility between a single text input $x_i$ and another sample $a_{i,\circ}$. As mentioned above, the sample can be another input text or an output label (text), thus modeling NLP tasks as 'text-to-text' prediction similar to language models. The similarity function is typically a cosine similarity, a dot product or a logit (unscaled activation) produced by an input-sample matcher sub-network [Rethmeier and Augenstein, 2020]. The $\sigma(z, \gamma)$ is a scaling function, which for use in eq. (1) is typically the sigmoid $\sigma(z, \gamma) = \exp(z - \gamma)/(1 + \exp(z - \gamma))$ with a hyperparameter $\gamma \geq 0$ (temperature), that is tuned or omitted depending on the way that negative samples $a_i^-$ are attained.

The other NCE objective learns to rank a single positive pair $(x_i, a_i^+)$ over $K$ negative pairs $(x_i, a_{i,k}^-)$:

$$\mathcal{L}_R(\theta) = \log \frac{e^{\bar{s}(x_i, a_i^+; \theta)}}{e^{\bar{s}(x_i, a_i^+; \theta)} + \sum_{k=1}^{K} e^{\bar{s}(x_i, a_{i,k}^-; \theta)}} \quad (2)$$

Here, to improve $\mathcal{L}_R$ or $\mathcal{L}_B$ performance, [Ma and Collins, 2018] propose a regularized scoring function $\bar{s}(x_i, a_{i,\circ}) = s(x_i, a_{i,\circ}) - \log p_N(a_{i,\circ})$ that subtracts the probability of the current sample $a_{i,\circ}$ under a chosen noise distribution $p_N(a_{i,\circ})$. In practice, the noise distribution can be set to 0 [Mnih and Teh, 2012; Wu et al., 2020; Rethmeier and Augenstein, 2020] to save on computation. To robustly learn word embeddings, $p_N(a_{i,\circ})$ can be set as the word probability $p_{word}$ in a corpus [Mikolov et al., 2013b], or as the probability of a sequence under a language model $p_{LM}$ [Deng et al., 2020], when learning contrastive sequence prediction.
Generalization to an arbitrary number of positives: As [Khosla et al., 2020] mention, original contrastive formulations use only one positive pair per text instance (see e.g. [Mikolov et al., 2013b; Logeswaran and Lee, 2018]), while more recent methods mine multiple positives or use multiple gold class annotation representations for contrastive learning [Rethmeier and Augenstein, 2020; Qu et al., 2021]. This means that e.g. the positive term in eq. (1) becomes $\sum_{p=1}^{P} \log \sigma\big(s(x_i, a_{i,p}^+; \theta), \gamma\big)$ to consider $P$ positives.
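To make eqs. (1) and (2) concrete, below is a minimal PyTorch sketch of both NCE variants; a sketch under stated assumptions, not any specific paper's implementation. It assumes the similarity scores $s(x_i, a_{i,\circ}; \theta)$ have already been computed by some encoder, and the positive term accepts $P \geq 1$ positives as in the generalization above; all function names are illustrative.

```python
import torch
import torch.nn.functional as F

def binary_nce_loss(pos_scores, neg_scores, gamma=0.0):
    """Binary NCE, eq. (1), generalized to P positives.
    pos_scores: (B, P) scores s(x_i, a_{i,p}^+); neg_scores: (B, K).
    Eq. (1) is a log likelihood to maximize, so we return its negative."""
    pos_term = F.logsigmoid(pos_scores - gamma).sum(dim=-1)     # sum_p log sigma(s^+ - gamma)
    neg_term = F.logsigmoid(-(neg_scores - gamma)).sum(dim=-1)  # sum_k log(1 - sigma(s^- - gamma))
    return -(pos_term + neg_term).mean()

def ranking_nce_loss(pos_score, neg_scores):
    """Ranking NCE, eq. (2): softmax the single positive pair's score
    over the positive plus the K negative pair scores."""
    logits = torch.cat([pos_score.unsqueeze(-1), neg_scores], dim=-1)
    target = torch.zeros(logits.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, target)

# Toy usage: a batch of 4 instances with P=2 positives and K=16 negatives.
pos, neg = torch.randn(4, 2), torch.randn(4, 16)
print(binary_nce_loss(pos, neg), ranking_nce_loss(pos[:, 0], neg))
```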
Importance of negative sampling semantics and lessons learned: How positive and negative samples are generated or sampled is a key component of effective contrastive learning. [Saunshi et al., 2019] prove and empirically validate that "sampling more negatives improves performance, but only if they are sampled from the same context or block of information such as the same paragraph". Such hard to contrast (classify) negatives are sampled in most works [Mikolov et al., 2013b; Saunshi et al., 2019; Rethmeier and Augenstein, 2020; Iter et al., 2020]. Otherwise, performance can deteriorate due to weak contrast learning of conceptually related classes. Additionally, [Rethmeier and Augenstein, 2020] find that both positive and negative contrastive samples from a long-tail distribution are essential in predicting rare classes and in substantially boosting zero-shot performance, especially over minority classes. [Mikolov et al., 2013b] under-sample negatives of frequent words to stabilize pretraining of word embeddings to a similar effect. Additional practical advice for negative sampling is mentioned in §3.1.
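The negative sampling lesson above can be sketched as follows; a minimal illustration assuming a corpus of sentences with known paragraph membership. All names are hypothetical, and the hard/easy mixing ratio is a free design choice rather than a recommendation from any cited work.

```python
import random

def sample_negatives(anchor_idx, sentences, paragraph_of, k=8, hard_ratio=0.75):
    """Mix hard negatives (same paragraph as the anchor, per [Saunshi et al.,
    2019; Iter et al., 2020]) with easy negatives (sentences from other
    paragraphs). Assumes the corpus contains enough of both kinds."""
    same_par = [i for i, p in enumerate(paragraph_of)
                if p == paragraph_of[anchor_idx] and i != anchor_idx]
    other = [i for i, p in enumerate(paragraph_of)
             if p != paragraph_of[anchor_idx]]
    n_hard = min(int(k * hard_ratio), len(same_par))
    picks = random.sample(same_par, n_hard) + random.sample(other, k - n_hard)
    return [sentences[i] for i in picks]
```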
2.2 Relations to Other Machine Learning Concepts and Energy-Based Models

Contrastive learning methods are closely related to at least four machine learning concepts. First, InfoNCE has been shown to maximize the lower bound of mutual information between different views of the data [van den Oord et al., 2018; Hjelm et al., 2019]. Second, [Zimmermann et al., 2021] show that contrastive learning can invert the underlying data generating process. Third, contrastive objectives draw on metric learning [Musgrave et al., 2020]. Finally, many works describe contrastive learning as an Energy Based Model, EBM, and since this may initially be unfamiliar, we outline popular EBM variations for supervised and self-supervised contrastive text pretraining below.

Figure 2: Contrastive input-output $(X, Y)$ pretraining, as in GILE [Pappas, 2019] (supervision with real labels) and CLESS [Rethmeier, 2020] (contrastive self+supervision with real and pseudo labels). One input text $x_i$ and $n$ output labels $y_c = \{y_c^+, y_c^-\}$ are encoded independently via a medium sized (pretrained or random contrastive) text encoder and a very small, separate label-encoder. This encodes text for $n$ labels with minimal computation to enable large-scale $K$ negative sampling.
Input-output contrastive EBM: The binary NCE variant from eq. (1) is a special case of a "Contrastive Free Energy" loss as described in [Lecun et al., 2006], Fig. 6b, or in [LeCun and Huang, 2005], Fig. 2 and Sec. 3.3, as the negative log-likelihood loss with negative sampling. [Lecun et al., 2006] originally state that an EBM $E$ learns the compatibility between input-output pairs $(x_i, y_c)$ with $x_i \in X$ and $y_c \in Y$:

$$E(X, Y) \quad \text{or} \quad E(W, X, Y) \quad (3)$$

where $W$, or $\theta$ in eq. (1), are model parameters that encode inputs $X$ and labels $Y$. Here, $X$ and $Y$ are views or augmentations of either the same data point (positives), or different data points (negatives). The energy function $E$ measures the compatibility between its parameters $(X, Y)$, where $E(\circ) = 0$ indicates optimal compatibility; e.g. $E(X = \text{Tiger}, Y = \text{felidae}) = 0$ means $X$ and $Y$ match. Note that in the probabilistic framework $P(Y = \text{felidae} \mid X = \text{Tiger}, W) = 1$. Works which use input-output noise contrastive estimation are [Pappas and Henderson, 2019; Rethmeier and Augenstein, 2020], visualized in Fig. 2. They encode an input text $x_i$ using a text-encoder $T$ and a label description $y_c$ via a separate label-encoder $L$, to then concatenate both into a single text input-output encoding pair $(T(x_i), L(y_c))$. Once encoded, the input-label pair similarity is learned via a binary NCE objective $\mathcal{L}_B$ as in eq. (1).

Figure 3: Contrastive input-input $(X, X')$ pretraining, used e.g. by COCO-LM [Meng, 2021], Electric [Clark, 2020], CONPONO [Iter, 2020], CERT [Fang, 2020], DeCLUTR [Giorgi, 2020], CLEAR [Wu, 2021] and BIT. A (momentum-encoder, random contrastive, or contrastively re-pretrained) language model contrasts an input text $x_i$ against $n$ augmented texts $a_i = \{a_i^+, a_i^-\}$. Input-input methods contrast an original text with augmented positive $a_i^+$ and negative $a_i^-$ texts $a_i \in X'$, which requires more computation than input-output methods.

Compared to input-input models described below, these approaches allow for encoding a large number of augmented views, i.e. labels, very compute efficiently via a small label-encoder. This allows them to scale to large sample sizes of positives and negatives, which is crucial to successful contrastive learning. While [Pappas and Henderson, 2019] use this formulation for supervised-only pretraining on label encodings, [Rethmeier and Augenstein, 2020] additionally sample input words $x_i \in X$ as pseudo-label encodings $y'_c = L(x_i)$ for efficient contrastive self-supervised pretraining. Thus, the latter approach unifies supervision and self-supervision as a single task of contrasting real-label encodings $L(y_c)$ or pseudo-label encodings $y'_c = L(x_i)$. The advantage of such methods is that once the NCE classifier is pretrained, it can be reused, i.e. zero-shot transferred, to any downstream task, without having to initialize a new classifier. In fact, unified prediction and zero-shot transfer are properties one would expect to have from pretraining, since most NLP tasks fit into a 'text-to-text' prediction description. As a result of contrastive pseudo-labels, input-output methods enable efficient contrastive self-supervised pretraining [Rethmeier and Augenstein, 2020], even on very small data, with commodity hardware, and without complicated mechanisms like cyclic learning rate schedules, residual layers, warmup, specialized optimizers or normalization, which current large-data pretraining approaches require, as research summarized in [Mosbach et al., 2021] shows. Finally, many input-input contrastive methods rely on re-pretraining already otherwise pretrained Transformer architectures [Fang and Xie, 2020; Deng et al., 2020; Giorgi et al., 2020], since encoding augmented inputs is costly in current Transformer architectures.
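The input-output architecture of Fig. 2 could be sketched as below; a minimal, hypothetical PyTorch rendering (the actual encoders in [Pappas and Henderson, 2019; Rethmeier and Augenstein, 2020] differ), showing why encoding the text once and the $n$ label descriptions with a tiny label encoder keeps large-scale negative sampling cheap:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputOutputContrast(nn.Module):
    """Fig. 2 sketch: a medium text encoder T, a very small label encoder L,
    and dot-product compatibility s(T(x), L(y)) trained with binary NCE."""
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)            # shared word embeddings
        self.text_enc = nn.GRU(dim, dim, batch_first=True)  # placeholder text encoder T
        self.label_enc = nn.Linear(dim, dim)                # tiny label encoder L

    def forward(self, text_ids, label_ids):
        _, h = self.text_enc(self.emb(text_ids))             # encode each text once
        t = h[-1]                                            # (B, dim)
        l = self.label_enc(self.emb(label_ids).mean(dim=1))  # (n, dim): mean word emb.
        return t @ l.t()                                     # (B, n) compatibility scores

model = InputOutputContrast()
texts = torch.randint(0, 10000, (4, 32))   # 4 texts, 32 tokens each
labels = torch.randint(0, 10000, (50, 3))  # 50 label descriptions, 3 words each
scores = model(texts, labels)
targets = torch.zeros(4, 50)
targets[:, 0] = 1.0                        # toy setup: label 0 is each text's positive
loss = F.binary_cross_entropy_with_logits(scores, targets)  # eq. (1), gamma omitted
```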
Input-input contrastive EBM: Input-input methods contrast input texts $X$ with augmented input texts $X'$ rather than with labels $Y$; see Fig. 3. For example, [Clark et al., 2020] replace a subset of input text words $x_{i,w}$ with other words $x_{i,w'}$ sampled from the vocabulary for self-supervised contrastive pretraining. The original text $x_i$ is augmented into a text $a_i$ to provide a positive sample augment $a_i^+$ or a negative sample augment $a_i^-$. Self-supervised pretraining then contrasts pairs $(x_i, a_i)$ of original texts against augmented ones via the binary NCE as in eq. (1); see the sketch at the end of this section. Similar to the EBM in eq. (3), this can be summarized as

$$E(X, X') \quad \text{or} \quad E(W, X, X') \quad (4)$$

As mentioned, current input-input contrast models are hampered by compute-intense augmentation encoding $W(a_i)$.

Contrastive pretraining enables zero-shot learning, improves few-shot learning and increases parameter learning efficiency: [Radford et al., 2021] replace a Transformer by a CNN to speed up self-supervised zero-shot prediction learning by a factor of 3, and add text contrastive pretraining to speed up learning by another factor of 4. [Pappas and Henderson, 2019] show that supervised contrastive pretraining enables supervised zero-shot and improved few-shot learning. [Rethmeier and Augenstein, 2020] run self-supervised contrastive pretraining for unsupervised zero-shot prediction, i.e. without human annotations, and show that this boosts learning performance on long-tail classes. This is done while pretraining on only portions of an already very small text collection of 6 to 60MB of pretraining text. They also demonstrate that rather than adding more data during pretraining, one can also increase self-supervised learning signals instead.
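As referenced above, here is a small sketch of the input-input setup of eq. (4), assuming a generic sentence encoder `enc` (a placeholder, not a specific model). Negative views corrupt meaning via random word replacement in the spirit of [Clark et al., 2020]; the positive view here is light word dropout, one of many possible, and not always meaning-preserving, text augmentations.

```python
import random
import torch

def replace_words(tokens, vocab, p=0.15):
    """Negative view a_i^-: swap roughly p of the words for random vocabulary words."""
    return [random.choice(vocab) if random.random() < p else t for t in tokens]

def drop_words(tokens, p=0.1):
    """Positive view a_i^+: light word dropout; even small edits can flip
    meaning, which is the core difficulty of text augmentation (see §1)."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else tokens

def pair_scores(enc, tokens, vocab, k=8):
    """Score (x_i, a_i) pairs for E(X, X') in eq. (4); `enc` maps a token
    list to a vector. Feed the result into binary_nce_loss from §2.1."""
    x = enc(tokens)
    pos = x @ enc(drop_words(tokens))
    negs = torch.stack([x @ enc(replace_words(tokens, vocab)) for _ in range(k)])
    return pos, negs
```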
3 Contrastive Pretraining Methods

The goal of contrastive pretraining is to initialize model weights for efficient zero-shot transfer or fine-tuning to downstream tasks. Pretraining is either supervised or self-supervised. Supervised contrastive pretraining methods use corpora of hand-annotated data such as paraphrased parallel sentences, textual labels or text summarizations to define text data augmentations for contrastive pretraining. Self-supervised contrastive methods aim to scale pretraining by contrasting automatically augmented input texts $X'$ or textual output pseudo-labels $Y' \sim P(X)$; see §2.2 for input-input vs. input-output contrastive methods. Both self-supervised and supervised contrastive methods are used to train language encoder models from scratch, or can 're-pretrain' or fine-tune an already otherwise pretrained model such as RoBERTa [Liu et al., 2019]. Below, we structure self- and supervised contrastive pretraining by technique and application.

3.1 Self-Supervised Contrastive Pretraining
Input-input contrastive text representation pretraining via automated text augmentation: Fig. 3 compares methods that use input-input contrastive (EBM) learning as overviewed in §2.2. [Qu et al., 2021] use a contrastive momentum encoder over combinations of recently proposed text data augmentations like "cutoff, back translation, adversarial augmentation and mixup". They find that mixing augmentations is most useful when the augmentations provide sufficiently different views of the data. Further, since constructing text augmentations which do not alter the meaning (semantics) of a sentence is very difficult, they introduce two losses to ensure both sufficient difference and semantic consistency of sentence augmentations. They define a consistency loss to guarantee that augmentations lead to similar predictions $y_c$, and a contrastive loss that makes augmented text representations $a_i$ similar to the original text $x_i$ (sketched below). To ensure that a sufficiently large amount of negative text augmentations is sampled, they use an augmentation-embedding memory bank. [Fang and Xie, 2020] only use back-translation, [Wu et al., 2020; Meng et al., 2021] investigate other sentence augmentation methods, [Giorgi et al., 2020] contrast text spans, [Clark et al., 2020; Meng et al., 2021] replace input words by re-sampling a language model, and [Simoulin and Crabbé, 2021] investigate contrastive sentence structure pretraining. Finally, [Meng et al., 2021] also contrast cropped sentences after augmentation via word re-sampling.
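A rough sketch of the two-loss idea described for [Qu et al., 2021] follows. The exact loss forms and weighting in CoDA may differ; the KL-based consistency term and cosine InfoNCE term here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_plus_contrastive(logits_orig, logits_aug,
                                 z_orig, z_aug, z_negs, tau=0.1, lam=1.0):
    """Consistency: original and augmented texts should yield similar
    predictions (KL term). Contrastive: the augmented representation
    should sit closer to its original than to K negatives (InfoNCE)."""
    consistency = F.kl_div(F.log_softmax(logits_aug, dim=-1),
                           F.softmax(logits_orig, dim=-1),
                           reduction='batchmean')
    pos = F.cosine_similarity(z_orig, z_aug, dim=-1) / tau                 # (B,)
    neg = F.cosine_similarity(z_orig.unsqueeze(1), z_negs, dim=-1) / tau   # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    contrastive = F.cross_entropy(logits, torch.zeros(len(pos), dtype=torch.long))
    return consistency + lam * contrastive
```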
Contrasting Next or Surrounding Sentence (or Word) Prediction (NSP, SSP): Sentence prediction is a popular input-input contrastive method as in §2.2. Next sentence prediction (NSP) and surrounding sentence prediction (SSP) take inspiration from the skip-gram model [Mikolov et al., 2013b], where surrounding and non-surrounding words are contrastively predicted given a central word to learn word embeddings using an NCE variant (§2.1) [Mikolov et al., 2013b]. Methods mostly differ in how they sample positive and negative sentences, where negative sampling strategies such as undersampling frequent words, as in [Mikolov et al., 2013a], are crucial. [Logeswaran and Lee, 2018] propose contrastive NSP, to predict the next sentence as a positive sample against $n$ random negative sample sentences. Instead of generating the next sentence, they learn to discriminate which sentence encoding follows a given sentence, as sketched below. This allows them to train a better text encoder model with less computation, but sacrifices the ability to generate text. [Liu et al., 2019] investigate variations of the contrastive NSP objective used in the BERT model. The method contrasts a consecutive sentence as a positive text sample against multiple non-consecutive sentences from other documents as negative text samples. They find that sampling negatives from the same document during self-supervised BERT pretraining is critical to downstream performance, but that removing the original BERT NSP task improves downstream performance. [Iter et al., 2020] find that predicting surrounding sentences in a $k$-sized window around a given central anchor sentence "improves discourse performance of language models". They sample surrounding sentences: (a) randomly from the corpus to construct easy negatives, and (b) from the same paragraph, but outside the context window, as hard (to contrast) negative samples. Contextual negative sampling is theoretically and empirically proven by [Saunshi et al., 2019], who demonstrate that: "increased negative sampling only helps if negatives are taken from the original texts' context or block of information", i.e. the same document, paragraph or sentence. [Aroca-Ouellette and Rudzicz, 2020] study how to combine different variants of the NSP pretraining task with non-contrastive, auxiliary self-supervision signals, while [Simoulin and Crabbé, 2021] explore contrastive sentence structure learning.
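A minimal sketch of this next-sentence discrimination idea, in the spirit of [Logeswaran and Lee, 2018], with a placeholder sentence encoder `enc` and the ranking NCE from §2.1:

```python
import torch
import torch.nn.functional as F

def next_sentence_contrast(enc, sent, true_next, random_sents):
    """Rank the true next sentence's encoding above n random negatives;
    `enc` is assumed to map a sentence to a 1-D embedding vector."""
    q = enc(sent)
    cands = torch.stack([enc(true_next)] + [enc(s) for s in random_sents])
    logits = (cands @ q).unsqueeze(0)                  # (1, 1 + n) similarities
    return F.cross_entropy(logits, torch.tensor([0]))  # positive at index 0
```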
Input-output contrastive text representation pretraining: In Fig. 2, [Rethmeier and Augenstein, 2020] use output label embeddings as an alternative view $Y$ (labels) of text input embeddings $X$ for contrastive learning of (dis-)similar text-label embedding pairs $(X, Y)$ via binary NCE from §2.1. Using a separate label and text encoder allows them to efficiently compute many negative label samples, while encoding the text $X$ only once, unlike the input-input view methods in Fig. 3. They pretrain with random input words as pseudo-labels for self-supervised pretraining on a very small corpus, which despite the limited pretraining data enables unsupervised zero-shot prediction, largely improved few-shot and markedly better rare concept (long-tail) learning.

Distillation: [Sun et al., 2020] propose CoDIR, a contrastive language model distillation method to pretrain a smaller student model from an already pretrained larger teacher such as a masked Transformer language model. Compressing a pretrained language model is challenging because nuances such as interactions between the original layer representations are easily lost, without noticing. For distillation, they extract layer representations from both the large teacher and the small student network over the same or two different input texts, to create a student and a teacher view of said texts. Using the contrastive InfoNCE loss [van den Oord et al., 2018], they then learn to make the student representation similar to the teacher representation for the same input text, and dissimilar if they receive different texts. The score or similarity function in InfoNCE is measured as the cosine distance between mean pooled student and teacher Transformer layer representations. For negative sampling in pretraining, they use text inputs from the same topic, e.g. a Wikipedia article, to mine hard negative samples, i.e. they sample views from similar contexts as recommended for contrastive methods in [Saunshi et al., 2019].
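The distillation objective just described might look as follows; a hedged sketch, since CoDIR's exact pooling, layer choice and negative queue are not reproduced here, and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_hidden, teacher_hidden, teacher_negs, tau=0.07):
    """CoDIR-style sketch: mean-pool layer representations, then pull the
    student toward the teacher view of the same text and away from other
    texts' teacher views (mined from similar contexts as hard negatives).
    student_hidden/teacher_hidden: (B, T, d); teacher_negs: (B, K, T, d)."""
    s = F.normalize(student_hidden.mean(dim=1), dim=-1)   # (B, d)
    t = F.normalize(teacher_hidden.mean(dim=1), dim=-1)   # (B, d)
    n = F.normalize(teacher_negs.mean(dim=2), dim=-1)     # (B, K, d)
    pos = (s * t).sum(-1, keepdim=True) / tau             # cosine (unit vectors)
    neg = torch.einsum('bd,bkd->bk', s, n) / tau
    logits = torch.cat([pos, neg], dim=1)
    return F.cross_entropy(logits, torch.zeros(len(s), dtype=torch.long))
```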
Text generation as a discriminative EBM: [Deng et al., 2020] combine an auto-regressive language model with a contrastive text continuation EBM model for improved text generation. During pretraining, they learn to contrast real text continuations from data and language model generated text continuations via conditional NCE from §2.1. For generation, they sample the top-k text completions from the auto-regressive language model and then score the best continuation via the trained EBM, to markedly improve model perplexity. However, the current approach is computationally expensive.
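At generation time, the described re-ranking could be sketched as follows; a simplified, assumed interface, where `energy_fn` stands in for the trained EBM and `lm_scores` for the language model's log-probabilities of each sampled continuation:

```python
import torch

def ebm_rerank(continuations, lm_scores, energy_fn, prefix):
    """Residual-EBM-style decoding sketch: take the LM's top-k sampled
    continuations and return the one with the best combined score;
    lower energy means a more compatible continuation."""
    energies = torch.tensor([energy_fn(prefix, c) for c in continuations])
    combined = torch.tensor(lm_scores) - energies   # LM log-prob minus energy
    return continuations[int(combined.argmax())]
```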
Cross-modal contrastive representation pretraining:
Representations for zero-shot image classification can be pretrained using image caption text for contrastive self-supervised pretraining. [Jia et al., 2021] automatically mine a large amount of noisy text captions for images in ALIGN, to then noise-filter and use them to construct matching and mismatching pairs of images and augmented text captions for contrastive training. [Radford et al., 2021] use the same idea in CLIP, but pretrain on a large collection of well annotated image caption datasets. Both methods allow for zero-shot image classification and image-to-text or text-to-image generation, and are inherently zero-shot capable. [Radford et al., 2021] also run a zero-shot learning efficiency analysis for CLIP and find two things. First, that using a data efficient CNN text encoder increases zero-shot image prediction convergence 3-fold compared to a Transformer text encoder, which they state to be computationally prohibitive. Second, they find that adding contrastive self-supervised text pretraining increases zero-shot image classification performance 4-fold. Thus, CLIP [Radford et al., 2021] shows that contrastive self-supervised CNN text encoder pretraining can substantially outperform current Transformer pretraining methods, while ALIGN [Jia et al., 2021] also automates the image and caption data collection process to increase data scalability.
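The shared core of CLIP and ALIGN is a symmetric in-batch contrastive loss over matching (image, caption) pairs; a standard sketch, with the encoder outputs assumed to be precomputed:

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(img_emb, txt_emb, tau=0.07):
    """CLIP/ALIGN-style sketch: in a batch of B matching (image, caption)
    pairs, each image's own caption is its positive and the other B-1
    captions serve as negatives (and symmetrically for captions)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                    # (B, B) cosine similarities
    targets = torch.arange(len(img))                # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```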
3.2 Supervised Contrastive Pretraining

Input-output contrastive supervised text representation pretraining: [Pappas and Henderson, 2019] train a two-input-lane Siamese CNN network, which encodes text as the input view $x_i$ in one lane, and labels via a label encoder in a second data view $y_c$, to learn to contrast pairs of $(x_i, y_c)$ as similar (1) or not (0). Rather than encoding labels as multi-hot vectors such as $[0, 1, 0, 0, 1, 0]$, they express each label by a textual description of said label. These textual label descriptions can then be encoded by a label encoder subnetwork, which in the simplest case constructs a label embedding by averaging over the word embeddings of the words that describe a label. However, this requires manually describing each label. Using embeddings of supervised labels, they pretrain a contrastive text classification network on known positive and negative labels, and later apply the pretrained network to unseen classes for zero-shot prediction. Their method thus provides supervised, but zero-shot capable pretraining. While [Rethmeier and Augenstein, 2020] also support supervised contrastive input-output pretraining, they automate label description construction, and conjecture that in real-world scenarios, most labels, e.g. the word 'elephant', are already part of the input vocabulary and can thus be pretrained as word embeddings via methods such as Word2Vec [Mikolov et al., 2013a]. They also note that: "once input words are labels, one can sample input words as pseudo label embeddings for contrastive self-supervised pretraining", as described in §3.1. Either method is contrastively pretrained via binary NCE as described in §2.1. Furthermore, both methods markedly boost few-shot learning and enable zero-shot predictions, while [Rethmeier and Augenstein, 2020] enable unsupervised zero-shot learning via self-supervised contrastive pretraining. The added contrastive self-supervision further boosts few-shot and long-tailed learning performance, while also increasing convergence speed over supervised-only contrastive learning in [Pappas and Henderson, 2019].

Contrasting input views on manual text augmentation: [Klein and Nabi, 2020] use contrastive self-supervised pretraining to refine a pretrained BERT language model to drastically increase performance on pronoun disambiguation and the Winograd Schema commonsense reasoning task. Their method contrasts over candidate trigger words that affect which word a pronoun refers to. They first mine trigger word candidates from text differences in paraphrased sentences and then maximize the contrastive margin between candidate pair likelihoods. This implicitly pretrains a model for common sense concepts, and is similar to contrastive self-supervision in vision [Chen et al., 2020b], with the difference of the latter generating contrastable data augmentations for a given sample. While general pretraining provides little pronoun disambiguation learning signal, their method demonstrates that the design of task-specific contrastive learning can produce strong performance increases in un- and supervised commonsense reasoning.

Contrastive text summarization: [Duan et al., 2019] use a Transformer attention mechanism during abstractive sentence summarization learning to optimize two contrasting loss objectives. One loss maximizes the contributions of tokens with the most attention when predicting the summarized sentence. The other loss is connected to a second decoder head, which learns to minimize the contribution of the attention to other, non-summarization-relevant, tokens. This method can perhaps best be understood as contrastive, layer attention noise reduction. The main drawback of this method is the current dual network head prediction, which introduces greater complexity compared to other contrastive methods.
Cross- and multi-modal supervised contrastive text pretraining for representation learning:
Recent work from computer vision and time series prediction trains with contrastive supervised losses to enable zero-shot learning or improve data-to-text generation. [Jiang et al., 2019] fuse image and text description information into the same representation space for generalized zero-shot learning, i.e. where at test time some classes are unseen (zero-shot), while other classes were seen during training. To do so, they first pretrain a supervised text-image encoder network to contrast (image, text, label) triplets of human annotated image classes. At test time, this contrastive network decides which text description best matches a given image. This works for seen and unseen classes, because classes are represented as text descriptions. [Radford et al., 2021] pretrain on manually annotated textual image descriptions to enable better generalization to unseen image classes. [Uehara et al., 2020] turn stock price time series into textual stock change descriptions, where the contrastive objectives markedly increase the fluency and non-repetitiveness of generated texts, especially when trained with little data.

Dataset construction for contrastive pretraining: [Raganato et al., 2019] automatically create a corpus of contrastive sentences for word sense disambiguation in machine translation by first identifying sense-ambiguous source sentence words, and then creating replacement word candidates to mine sentences for contrastive evaluation.
4 Open Challenges and Future Directions

Challenge: need for many negatives:
Current methods require the sampling of many negative instances for contrastive learning to work well. There is work on the benefits and harms of sampling hard to contrast negatives [Cai et al., 2020], or relevant negatives [Saunshi et al., 2019], which can boost sampling efficiency. However, as seen in [Mikolov et al., 2013b; Rethmeier and Augenstein, 2020], depending on the task, sampling diverse negatives can play an important role. To date, the importance of easy to contrast negative samples is underexplored, but insights from a metric learning survey by [Musgrave et al., 2020] suggest that hard, medium and easy samples may all be necessary, especially for generalization in open class set tasks such as pretraining.
Challenge and directions: text augmentation quality and efficiency:
Self-supervised text augmentation research in NLP (§3.1) is gaining momentum, and [Qu et al., 2021; Chen et al., 2020a] and many others analyze using mixes of recent text data augmentations. However, these input-input contrastive methods often use computationally expensive or non-robust mechanisms like back translation, initializing a new prediction head per downstream task, or reliance on already otherwise pretrained models like RoBERTa. Works on input-output contrastive learning like [Pappas and Henderson, 2019; Rethmeier and Augenstein, 2020] remove these requirements and demonstrate very data-efficient pretraining, which is currently an under-researched, but very desirable property of contrastive learning. [Zimmermann et al., 2021] further solidify these insights and show that contrastive methods effectively recover data properties even from small data sets. While many self-supervised contrastive pretraining methods rely on already pretrained Transformers, the works [Rethmeier and Augenstein, 2020; Clark et al., 2020; Wu et al., 2020; Meng et al., 2021] make important contributions by removing this restriction. [Wu et al., 2020; Iter et al., 2020] propose robustly scalable input augmentation, while [Grill et al., 2020] propose BYOL, which does not require negative sampling and potentially lends itself to improving future contrastive NLP methods.
Challenge: under-researched applications: [Deng et al., 2020] enhance a text generation language model with contrastive importance resampling of language model generated text continuations. [Duan et al., 2019] propose contrastive abstractive sentence summarization, which future work could potentially improve on using techniques such as Momentum Contrast.
Direction: cross-modal generation:
An under-researched direction for contrastive NLP is data-to-text tasks that turn non-text inputs into a textual description. For example, [Uehara et al., 2020] contrastively learn to generate stock change text descriptions from stock price time series using limited data, while works like [Radford et al., 2021; Jia et al., 2021] show that contrastive text supervision and self-supervision can multiply the zero-shot learning efficiency in cross-modal representation learning.
Direction: contrastive (language) model fusion:
[Sun et al., 2020] compress a large language model via contrastive distillation; future work could adapt their method to fuse multiple language models or to mutually transfer knowledge between models.
Direction: commonsense contrastive learning:
The contrastive word sense disambiguation (WSD) dataset construction method by [Raganato et al., 2019] is potentially adaptable to automatically mine inputs for the contrastive pronoun learning method by [Klein and Nabi, 2020].
5 Conclusion

In this primer on contrastive pretraining, we surveyed contrastive learning concepts and their relations to other fields. We also structured contrastive pretraining as self- vs. supervised learning, highlighted existing challenges and provided pointers to future research directions.

References

[Aroca-Ouellette and Rudzicz, 2020] Stéphane Aroca-Ouellette and Frank Rudzicz. On losses for modern language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4970–4981, Online, November 2020. Association for Computational Linguistics.

[Cai et al., 2020] Tiffany Tianhui Cai, Jonathan Frankle, David J. Schwab, and Ari S. Morcos. Are all negatives created equal in contrastive instance discrimination?, 2020.

[Chen et al., 2020a] Jiaao Chen, Zichao Yang, and Diyi Yang. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147–2157, Online, July 2020. Association for Computational Linguistics.

[Chen et al., 2020b] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[Clark et al., 2020] Kevin Clark, Minh-Thang Luong, Quoc Le, and Christopher D. Manning. Pre-training transformers as energy-based cloze models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 285–294, Online, November 2020. Association for Computational Linguistics.

[Deng et al., 2020] Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, and Marc'Aurelio Ranzato. Residual energy-based models for text generation. In International Conference on Learning Representations, 2020.

[Duan et al., 2019] Xiangyu Duan, Hongfei Yu, Mingming Yin, Min Zhang, Weihua Luo, and Yue Zhang. Contrastive attention mechanism for abstractive sentence summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 2019. Association for Computational Linguistics.

[Fang and Xie, 2020] Hongchao Fang and Pengtao Xie. CERT: Contrastive self-supervised learning for language understanding. CoRR, abs/2005.12766, 2020.

[Giorgi et al., 2020] John M. Giorgi, Osvald Nitski, Gary D. Bader, and Bo Wang. DeCLUTR: Deep contrastive learning for unsupervised textual representations, 2020.

[Grill et al., 2020] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020.

[Hjelm et al., 2019] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.

[Iter et al., 2020] Dan Iter, Kelvin Guu, Larry Lansing, and Dan Jurafsky. Pretraining with contrastive sentence objectives improves discourse performance of language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4859–4870, Online, July 2020. Association for Computational Linguistics.

[Jaiswal et al., 2021] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1), 2021.

[Jia et al., 2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision, 2021.

[Jiang et al., 2019] Huajie Jiang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Transferable contrastive network for generalized zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9764–9773, 2019.

[Khosla et al., 2020] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[Klein and Nabi, 2020] Tassilo Klein and Moin Nabi. Contrastive self-supervised learning for commonsense reasoning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7517–7523, Online, July 2020. Association for Computational Linguistics.

[LeCun and Huang, 2005] Yann LeCun and Fu Jie Huang. Loss functions for discriminative training of energy-based models. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005, 2005.

[Lecun et al., 2006] Yann Lecun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning. MIT Press, 2006.

[Liu et al., 2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.

[Logeswaran and Lee, 2018] Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. In International Conference on Learning Representations, 2018.

[Ma and Collins, 2018] Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3698–3707, 2018.

[Meng et al., 2021] Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. COCO-LM: Correcting and contrasting text sequences for language model pretraining. CoRR, abs/2102.08473, 2021.

[Mikolov et al., 2013a] Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, Workshop Track, 2013.

[Mikolov et al., 2013b] Tomáš Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013.

[Mnih and Teh, 2012] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, Madison, WI, USA, 2012. Omnipress.

[Mosbach et al., 2021] Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations, 2021.

[Musgrave et al., 2020] Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 681–699, Cham, 2020. Springer International Publishing.

[Pappas and Henderson, 2019] Nikolaos Pappas and James Henderson. GILE: A generalized input-label embedding for text classification. Transactions of the Association for Computational Linguistics, 7:139–155, 2019.

[Qu et al., 2021] Yanru Qu, Dinghan Shen, Yelong Shen, Sandra Sajeev, Weizhu Chen, and Jiawei Han. CoDA: Contrast-enhanced and diversity-promoting data augmentation for natural language understanding. In International Conference on Learning Representations, 2021.

[Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. Preprint, 2021.

[Raganato et al., 2019] Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. The MuCoW test suite at WMT 2019: Automatically harvested multilingual contrastive word sense disambiguation test sets for machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, August 2019. Association for Computational Linguistics.

[Rethmeier and Augenstein, 2020] Nils Rethmeier and Isabelle Augenstein. Long-tail zero and few-shot learning via contrastive pretraining on and for small data, 2020.

[Saunshi et al., 2019] Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5628–5637. PMLR, 09–15 Jun 2019.

[Simoulin and Crabbé, 2021] Antoine Simoulin and Benoit Crabbé. Contrasting distinct structured views to learn sentence embeddings, 2021.

[Sun et al., 2020] Siqi Sun, Zhe Gan, Yuwei Fang, Yu Cheng, Shuohang Wang, and Jingjing Liu. Contrastive distillation on intermediate representations for language model compression. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 498–508, Online, November 2020. Association for Computational Linguistics.

[Uehara et al., 2020] Yui Uehara, Tatsuya Ishigaki, Kasumi Aoki, Hiroshi Noji, Keiichi Goshima, Ichiro Kobayashi, Hiroya Takamura, and Yusuke Miyao. Learning with contrastive examples for data-to-text generation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2352–2362, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.

[van den Oord et al., 2018] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.

[Wu et al., 2020] Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. CLEAR: Contrastive learning for sentence representation, 2020.

[Zimmermann et al., 2021] Roland S. Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In Proceedings of the 38th International Conference on Machine Learning, 2021.