Calibrate Before Use: Improving Few-Shot Performance of Language Models
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh
Tony Z. Zhao and Eric Wallace contributed equally. UC Berkeley, University of Maryland, UC Irvine. Correspondence to: Eric Wallace <[email protected]>.

Abstract
GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples. We show that this type of few-shot learning can be unstable: the choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art. We demonstrate that this instability arises from the bias of language models towards predicting certain answers, e.g., those that are placed near the end of the prompt or are common in the pre-training data. To mitigate this, we first estimate the model's bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as "N/A". We then fit calibration parameters that cause the prediction for this input to be uniform across answers. On a diverse set of tasks, this contextual calibration procedure substantially improves GPT-3 and GPT-2's average accuracy (up to 30.0% absolute) and reduces variance across different choices of the prompt.
1. Introduction
Few-shot learning, the ability to learn tasks with limited examples, is an important aspect of intelligence (Lake et al., 2015; Yogatama et al., 2019). Recent work shows that large neural language models can perform few-shot learning without finetuning (Radford et al., 2019; Brown et al., 2020). Specifically, GPT-3 (Brown et al., 2020) can perform numerous tasks when provided a few examples in a natural language prompt. For example, to perform sentiment analysis one can condition GPT-3 on a prompt such as:
Input: Subpar acting. Sentiment: Negative
Input: Beautiful film. Sentiment: Positive
Input: Amazing. Sentiment:

where the first two lines correspond to two training examples and the last line is a test example. To make predictions, the model predicts whether the subsequent token is more likely to be the word "Positive" or "Negative".

This style of few-shot "in-context" learning is interesting because it shows that the model can learn without parameter updates. And, more importantly, it has numerous practical advantages over the now-standard approach of finetuning (Radford et al., 2018; Devlin et al., 2019). First, it allows practitioners to "rapidly prototype" NLP models: changing the prompt immediately leads to a new model. Second, it provides a fully natural language interface to a machine learning model, which allows users, even those without technical expertise, to create NLP systems. Finally, since in-context learning reuses the same model for each task, it reduces memory requirements and system complexity when serving many different tasks.

However, despite these promises, we show that GPT-3's accuracy can be highly unstable across different prompts (Section 3). A prompt contains three components: a format, a set of training examples, and a permutation (ordering) for those examples. We show that different choices for these factors can lead to highly different accuracies, e.g., changing the permutation of the training examples in a sentiment analysis prompt can change accuracy from near chance (54%) to near state-of-the-art (93%). This instability implies that GPT-3 users, who typically design prompts manually, cannot expect to consistently obtain good accuracy.

We next analyze what causes this instability. We identify three pitfalls of language models that lead them to be biased toward certain answers during few-shot learning. In particular, they suffer from majority label bias, recency bias, and common token bias (Section 4). The majority label and recency biases lead the model to predict training answers that appear frequently or near the end of the prompt. For example, a prompt that ends with a Negative training example may cause a bias towards the Negative class. On the other hand, the common token bias leads the model to prefer answers that are frequent in its pre-training data, e.g., it prefers "United States" over "Saint Lucia", which is likely suboptimal for the task of interest.
Figure 1.
Few-shot learning can be highly unstable across different choices of the prompt. Above, we plot the mean accuracy (± one standard deviation) across different choices of the training examples for three different datasets and model sizes (AGNews with GPT-3 175B, MIT Director with GPT-3 13B, and DBPedia with GPT-3 2.7B). We show that our method, contextual calibration, improves accuracy, reduces variance, and overall makes tools like GPT-3 more effective for end users.

We identify that these biases typically result in a shift in the output distribution of the model. We can thus counteract these biases by "calibrating" the output distribution. Concretely, we estimate the model's bias towards certain answers by feeding in a dummy test input that is content-free. In the prompt above for example, if we replace "Amazing." with the string "N/A", the model predicts 62% Positive. We then fit the calibration parameters so that the content-free input has uniform scores for each answer. This contextual calibration procedure provides a good setting of the calibration parameters without additional training data.

We test the effectiveness of contextual calibration on a range of tasks (Section 5). Contextual calibration consistently improves GPT-3 and GPT-2's accuracy (up to 30.0% absolute) across different choices of the prompt format and examples (e.g., Figure 1). It also makes the accuracy more stable across different prompts, thus mitigating the need for prompt engineering. Overall, contextual calibration is a simple method that makes language models better few-shot learners: it enables end users to obtain higher accuracy with considerably less effort.
2. Background and Experimental Setup
Neural autoregressive language models (LMs) take as input a sequence of tokens and output a probability distribution over the next token. Large neural LMs can perform tasks in a zero- or few-shot manner using in-context learning (Radford et al., 2019; Brown et al., 2020). To do so, a natural language prompt is fed into the model. This prompt contains three components: a format, a set of training examples, and a permutation (ordering) of the training examples.
Prompt Format
The prompt format is a template which consists of placeholders for the training and test example(s) and possibly a natural language description of the task. For example, the format of the prompt in Section 1 is a template with the style: "Input:" input "Sentiment:" label. Many alternate formats exist, e.g., one could frame the task as question answering.
Prompt Training Examples
The prompt's training examples are used to teach the LM how to solve the task at hand. The prompt from Section 1 consists of two training examples; we refer to this as "two-shot" learning. We also consider "zero-shot" learning, where no training examples are present.
Training Example Permutation
When training examples are used, they have a particular permutation, e.g., the "Subpar acting" example comes first in the prompt from Section 1. The permutation matters because neural language models update their hidden states in a left-to-right fashion.

To make predictions on an input, we slot it into the test placeholder and generate from the LM. For example, see the "Amazing." test example in the prompt from Section 1. For generation tasks, we generate greedily from the LM until it produces a newline character. For classification tasks, the probability for each class is given by the probability assigned to its associated label name, e.g., the words "Negative" and "Positive" for sentiment classification.
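As a concrete illustration of this prediction procedure, the following is a minimal sketch in Python. The function label_logprobs is a hypothetical stand-in for whatever LM API is used; it is assumed to return the log-probability of each candidate label name as the next token given the prompt.

```python
# Minimal sketch of few-shot prompt construction and classification scoring.
# `label_logprobs(prompt, label_names)` is a hypothetical stand-in for an LM
# API call that returns {label_name: log-probability of that token next}.
from typing import Callable, Dict, List, Tuple


def build_prompt(train_examples: List[Tuple[str, str]], test_input: str) -> str:
    """Fill the "Input: ... Sentiment: ..." template from Section 1 and leave
    the test label slot empty."""
    lines = [f"Input: {text} Sentiment: {label}" for text, label in train_examples]
    lines.append(f"Input: {test_input} Sentiment:")
    return "\n".join(lines)


def classify(
    train_examples: List[Tuple[str, str]],
    test_input: str,
    label_names: List[str],
    label_logprobs: Callable[[str, List[str]], Dict[str, float]],
) -> str:
    """Predict the label whose name receives the highest next-token probability."""
    prompt = build_prompt(train_examples, test_input)
    scores = label_logprobs(prompt, label_names)
    return max(label_names, key=lambda name: scores[name])
```

For example, calling classify([("Subpar acting.", "Negative"), ("Beautiful film.", "Positive")], "Amazing.", ["Positive", "Negative"], label_logprobs) reproduces the two-shot prompt from Section 1.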
We use datasets for three tasks: text classification, fact retrieval, and information extraction. We use a fixed prompt format for each dataset unless otherwise specified. We show the format and examples from each dataset in Appendix B.
Text Classification
We study text classification using six datasets: sentiment analysis using SST-2 (Socher et al., 2013), 6-way question classification using TREC (Voorhees & Tice, 2000), textual entailment using 3-way CB (de Marneffe et al., 2019) and binary RTE (Dagan et al., 2005) from SuperGLUE (Wang et al., 2019), and topic classification using the 4-way AGNews (Zhang et al., 2015) and 14-way DBPedia (Zhang et al., 2015) datasets. The prompt in Section 1 shows an example of the sentiment analysis task.

Figure 2.
There is high variance in GPT-3's accuracy as we change the prompt's training examples, as well as the permutation of the examples. Here, we select ten different sets of four SST-2 training examples. For each set of examples, we vary their permutation and plot GPT-3 2.7B's accuracy for each permutation (and its quartiles).

Figure 3.
There is high variance in GPT-3's accuracy as we change the prompt format. In this figure, we use ten different prompt formats for SST-2. For each format, we plot GPT-3 2.7B's accuracy for different sets of four training examples, along with the quartiles.
Fact Retrieval
We evaluate fact retrieval with LAMA (Petroni et al., 2019). The dataset consists of knowledge base triples that are placed into templates with missing objects, e.g., "Obama was born in". We use these templates as our prompts, and remove the relations where the missing answer is not at the end of the template (left-to-right LMs cannot solve these). The answers are always single tokens, and we report average accuracy across all triples.
Information Extraction
We consider information extraction using two slot filling datasets, ATIS (Hemphill et al., 1990) and MIT Movies trivia10k13 (Liu et al., 2012). We use two random slots for each dataset: airline and departure date for ATIS, and director name and movie genre for MIT Movies. The answer for both datasets is a span of text from the input, e.g., the ATIS airline task is to predict "american airlines" when given the sentence "list a flight on american airlines from toronto to san diego". We use Exact Match between the model's generated output and the ground-truth span as our evaluation metric.
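For reference, the Exact Match check can be sketched as follows; the light normalization shown (whitespace stripping and lowercasing) is an assumption for illustration rather than the exact normalization used in our evaluation.

```python
def exact_match(prediction: str, ground_truth: str) -> bool:
    """Return True if the generated span matches the gold span exactly,
    after an assumed light normalization (strip whitespace, lowercase)."""
    return prediction.strip().lower() == ground_truth.strip().lower()


# e.g., the ATIS airline example from above:
assert exact_match("american airlines", "american airlines")
```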
We run our experiments on three sizes of GPT-3 (2.7B, 13B, and 175B parameters) as well as GPT-2 (1.5B parameters). We access GPT-3 using the OpenAI API. We release code to replicate our experiments.
3. Accuracy Varies Highly Across Prompts
This section studies how GPT-3's accuracy changes as we vary each aspect of the prompt (training examples, permutation, format). We focus on a subset of the datasets to simplify our analysis; in Section 5 we show that our findings hold across all of the datasets we study.
GPT-3's accuracy depends highly on both selection and permutation of training examples.
Concretely, we use a fixed prompt format and choose different random sets of training examples. For each set of training examples, we evaluate the accuracy for all possible permutations. Figure 2 shows the results for SST-2 (4-shot, GPT-3 2.7B). Surprisingly, varying the permutation can be as important, or even more important, than which training examples are chosen. For example, varying the permutation of the training examples can cause accuracy to go from near chance (54.3%) to near state-of-the-art (93.4%). For a qualitative example of the sensitivity to permutations, see Table 2 in Appendix A. This high importance on example order is in contrast to standard machine learning, where the ordering of examples during training is typically an afterthought.
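This sweep can be expressed compactly as below; evaluate_accuracy is a hypothetical helper that builds the prompt with the training examples in the given order and returns validation accuracy.

```python
# Sketch of the permutation sweep: evaluate every ordering of a fixed set of
# training examples. With 4 examples there are 4! = 24 orderings.
import itertools
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def permutation_sweep(
    train_examples: Sequence[Tuple[str, str]],
    evaluate_accuracy: Callable[[List[Tuple[str, str]]], float],
) -> Tuple[float, float, float, float]:
    """Return (min, mean, max, std) of accuracy across all permutations."""
    accs = [evaluate_accuracy(list(perm))
            for perm in itertools.permutations(train_examples)]
    return min(accs), mean(accs), max(accs), pstdev(accs)
```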
The variance persists with more data and larger models.
Adding more training examples into the prompt does not necessarily reduce the variance in accuracy. We sweep over the number of training examples for three different datasets in Figure 1 (red curves). The variance remains high even when we use 16 training examples. Moreover, adding more training examples can sometimes hurt accuracy (e.g., mean accuracy drops from 36.0% to 25.9% for DBPedia 0-shot to 1-shot). The variance in accuracy can also remain high when using larger models, e.g., the left of Figure 1.

Figure 4.
Majority label and recency biases cause GPT-3 to become biased towards certain answers and help to explain the high variance across different examples and orderings. Above, we use 4-shot SST-2 with prompts that have different class balances and permutations, e.g., [P P N N] indicates two positive training examples and then two negative. We plot how often GPT-3 2.7B predicts Positive on the balanced validation set. When the prompt is unbalanced, the predictions are unbalanced (majority label bias). In addition, balanced prompts that have one class repeated near the end, e.g., end with two Negative examples, will have a bias towards that class (recency bias).

GPT-3's accuracy depends highly on prompt format.
We next keep the set of training examples and permutations fixed but vary the prompt format. We focus on SST-2, and we manually design an additional 14 prompt formats. The formats include question-answer templates, conversation-style templates, prompts that resemble Web pages, and variations on the label names (all formats are available in Table 7 in Appendix B). The accuracy for ten of the formats is shown in Figure 3. We find that some of the formats are better than others on average. However, all of the formats still suffer from high variance across different training sets.
4. What Causes the High Variance?
We next analyze why GPT-3's accuracy varies across different training examples, permutations, and prompt formats. Concretely, we show that the variance arises because LMs are biased towards outputting answers that are (1) frequent in the prompt (majority label bias), (2) towards the end of the prompt (recency bias), and (3) common in the pre-training data (common token bias).
Majority Label Bias
We find that GPT-3 is biased towards answers that are frequent in the prompt. A trivial case is when a text classification prompt has a class imbalance, e.g., more Positive than Negative sentiment examples. This is demonstrated in the "unbalanced" region of Figure 4: when one class is more common, GPT-3 2.7B is heavily biased towards predicting that class. Since the SST-2 sentiment analysis dataset is balanced, this bias causes large accuracy degradations. The majority label bias also explains why we frequently observe a drop in accuracy when moving from 0-shot to 1-shot: we found that the drop is due to the model frequently repeating the class of the one training example.

The majority label bias also occurs for generation tasks. On the validation set for 4-shot LAMA with GPT-3 2.7B, 50.2% of the model predictions are a repeat of one of the four training answers (the correct repeat rate is 24.7%). Overall, the majority label bias helps to explain why different choices for the training examples heavily influence GPT-3's accuracy: it shifts the distribution of model predictions.
Recency Bias
The model's majority label bias is aggravated by its recency bias: the tendency to repeat answers that appear towards the end of the prompt. The "balanced" region of Figure 4 demonstrates this. For instance, when two Negative examples appear at the end (P P N N), the model will heavily prefer the Negative class. Moreover, the recency bias can outweigh the majority label bias, e.g., the "P P P N" training set leads to nearly 90% of predictions being Negative, despite three of the four training examples being Positive.

Recency bias also affects generation tasks. For 4-shot LAMA, the training answers that are closer to the end of the prompt are more likely to be repeated by the model. Concretely, the model "overpredicts" the answer from the 1st, 2nd, 3rd, and 4th training example by 8.5%, 8.3%, 14.3%, and 16.1%, respectively. (Over all relations, as well as three different sets of training examples, the model repeats the training example at a rate of 20.7%, 19.8%, 29.9%, and 26.8%; the ground-truth repeat rate is 12.2%, 11.5%, 15.6%, and 10.7%. We define "overpredicts" as the model's repeat rate minus the ground-truth repeat rate.) Overall, recency bias helps to explain why the permutation of the training examples is important: the ordering of the examples heavily influences the distribution of the model predictions.
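The overprediction statistic can be computed as in the sketch below; the record layout (each prompt's four training answers, the model prediction, and the gold answer) is an assumption made for illustration.

```python
# Sketch of the per-position repeat-rate analysis: overprediction is the
# model's repeat rate minus the ground-truth repeat rate at each position.
from typing import Dict, List


def overprediction_by_position(records: List[Dict], num_shots: int = 4) -> List[float]:
    """Each record is assumed to hold 'train_answers' (length num_shots),
    'prediction' (model output), and 'gold' (ground-truth answer)."""
    n = len(records)
    rates = []
    for pos in range(num_shots):
        model_repeat = sum(r["prediction"] == r["train_answers"][pos] for r in records) / n
        gold_repeat = sum(r["gold"] == r["train_answers"][pos] for r in records) / n
        rates.append(model_repeat - gold_repeat)
    return rates
```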
Common Token Bias
Finally, we find that GPT-3 is biased towards outputting tokens that are common in its pre-training distribution, which is likely suboptimal for the distribution of answers on the downstream task. A simple case of this occurs for the LAMA fact retrieval dataset, where the model often predicts common entities such as "America" when the ground-truth answer is instead a rare entity.

A more nuanced case of the common token bias occurs for text classification. Recall that the model makes predictions by generating the label name associated with each class. Because certain label names appear more frequently in the pre-training data, the model will be inherently biased towards predicting certain classes. For example, on DBPedia (a balanced 14-way topic classification dataset), GPT-3 predicts the "book" class 11x more often than the "artist" class. In fact, there is a moderate correlation between the frequency of a DBPedia label name and the rate at which GPT-3 predicts its class. (We calculate the frequency of a token on the web using Google Ngrams, https://books.google.com/ngrams; the predictions are from the 0-shot setting on the validation set.) Overall, the common token bias helps to explain why the choice of label names is important, and why the model struggles on rare answers.
The Impact of Biases on Model Predictions
We find that the end result of the above three biases is typically a simple shift in the model's output distribution. For example, Figure 5 visualizes this shift for an SST-2 sentiment prompt.
Figure 5.
The Positive class probability for 25 random test inputs for a particular sentiment analysis prompt (this prompt achieves 67% accuracy). Negative and Positive ground-truth examples are marked with different colors.

The prompt used in Figure 5 and the model's intrinsic biases cause it to frequently predict high confidence for the Positive class. Since the default 50% threshold is used to make predictions, this results in frequent false positives. Importantly, note that if we could optimally set the classification threshold (p(Positive) = 0.68 in this case), the classifier would be highly accurate (94% on the validation set).
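The threshold analysis can be sketched as follows; it assumes a list of (p(Positive), gold label) pairs from a labeled validation set, which is exactly the data that contextual calibration (Section 5) avoids needing.

```python
# Sketch of finding the accuracy-maximizing decision threshold on p(Positive).
from typing import List, Tuple


def best_threshold(probs_and_labels: List[Tuple[float, str]]) -> Tuple[float, float]:
    """Return (threshold, accuracy) that maximizes validation accuracy when
    predicting Positive iff p(Positive) > threshold."""
    def accuracy(t: float) -> float:
        correct = sum(("Positive" if p > t else "Negative") == gold
                      for p, gold in probs_and_labels)
        return correct / len(probs_and_labels)

    candidates = sorted({p for p, _ in probs_and_labels} | {0.5})
    best_t = max(candidates, key=accuracy)
    return best_t, accuracy(best_t)
```

Picking this threshold requires labeled validation data, whereas contextual calibration estimates a similar correction without any labels.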
5. Contextual Calibration
Thus far, we have shown that GPT-3 is biased towards certain answers due to the prompt and the model's intrinsic biases. Here, we look to correct this by "calibrating" the model's output probabilities. (The output of GPT-3 is biased in the sense that its outputs are shifted, similar to how measurement devices such as voltage meters or weighing scales are biased. Just as these devices require "calibration before use", where their outputs are scaled or zeroed out, we hope to apply a similar calibration procedure to LMs. This goal is distinct from statistical calibration (Brier, 1950; Guo et al., 2017), i.e., aligning a model's confidence estimate with its true accuracy.) A common technique for adjusting output probabilities is to apply an affine transformation (Platt, 1999; Guo et al., 2017):

q̂ = softmax(W p̂ + b),    (1)

where a weight matrix W and a bias vector b are applied to the original probabilities p̂ to get the new probabilities q̂. This affine transformation is usually applied to the logits, i.e., prior to the softmax; however, we only have access to GPT-3's output probabilities in the OpenAI API. For classification tasks, p̂ is the set of probabilities that are associated with each label name, renormalized to one. For generation tasks, p̂ is the entire set of probabilities for the first token; we only calibrate the prediction of the first output token, which is reasonable because, for the tasks we consider, the model's predictions are highly deterministic after generating the first token. In this paper, we restrict the matrix W to be diagonal, known as vector scaling (Guo et al., 2017), to prevent the parameters from growing quadratically in the size of p̂ (which is on the order of the vocabulary size for generation tasks).

The main challenge in the zero- or few-shot setting is that we do not have data to learn W and b. We thus propose a novel data-free procedure to infer a good setting of these parameters. The key idea is that the model's bias towards certain answers can be estimated by feeding in a content-free input such as the string "N/A". For example, consider the two-shot prompt:

Input: Subpar acting. Sentiment: Negative
Input: Beautiful film. Sentiment: Positive
Input: N/A Sentiment:

where "N/A" serves as the test input. Ideally, GPT-3 would score this test input as 50% Positive and 50% Negative. However, the model's biases cause it to score this input as 61.8% Positive. Note that this error is contextual: a different choice of the training examples, permutation, and format will lead to different predictions for the content-free input.

We can correct this error by setting W and b so that the class scores for the content-free input are uniform. We first obtain p̂ for the content-free input, denoted p̂_cf. We then set W = diag(p̂_cf)^-1 and b to the all-zero vector. (An alternate solution is to set b to −p̂_cf and W to the identity; empirically, this alternate solution yields higher accuracy for generation tasks, where the dimensionality of p̂ is large, while setting W = diag(p̂_cf)^-1 performs better for classification.) To make test predictions, we compute W p̂ + b and take the argmax.
Implementation Details
This contextual calibration procedure adds trivial amounts of computational overhead and is implemented in a few lines of code (compute and save p̂_cf, then adjust the output probabilities). For the content-free input, many good choices exist, including "N/A", the empty string, and gibberish tokens. In all our experiments, we average the probabilities from three content-free inputs: "N/A", "[MASK]", and the empty string. We found this simple ensemble to achieve the best results for AGNews, and we reuse it for all other datasets; see Section 5.2 for an ablation on the choice of content-free input. One could also craft the content-free input in a task-specific manner. We explore this for LAMA, where we replace the subject with the content-free input, e.g., we use "N/A was born in" as the input.
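A minimal sketch of the full procedure is below, assuming a hypothetical label_probs(prompt, label_names) call that returns the LM's renormalized probabilities over the label names (in the same order), and a prompt_template string that contains the training examples and a {test_input} placeholder.

```python
# Sketch of contextual calibration: estimate the bias with content-free
# inputs, then correct predictions with W = diag(p_cf)^-1 and b = 0.
import numpy as np

CONTENT_FREE_INPUTS = ["N/A", "[MASK]", ""]  # the ensemble used in our experiments


def estimate_p_cf(prompt_template, label_names, label_probs):
    """Average the label-name probabilities over the content-free inputs."""
    probs = [label_probs(prompt_template.format(test_input=cf), label_names)
             for cf in CONTENT_FREE_INPUTS]
    p_cf = np.mean(np.array(probs), axis=0)
    return p_cf / p_cf.sum()


def calibrated_prediction(p_hat, p_cf, label_names):
    """Compute W p_hat + b with W = diag(p_cf)^-1 and b = 0, then take the argmax."""
    W = np.diag(1.0 / p_cf)
    scores = W @ np.asarray(p_hat)  # b is the all-zero vector
    return label_names[int(np.argmax(scores))]
```

At test time, one would call calibrated_prediction(label_probs(prompt_template.format(test_input=x), label_names), p_cf, label_names) for each test input x.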
Dataset | LM | 0-shot (Baseline / Ours) | 1-shot (Baseline / Ours) | 4-shot (Baseline / Ours) | 8-shot (Baseline / Ours)
Text Classification: AGNews, TREC, CB, RTE, SST-2, DBPedia
Fact Retrieval: LAMA
Information Extraction: MIT-G, MIT-D, ATIS-A, ATIS-D
(Per-cell accuracy values omitted.)

Table 1.
Contextual calibration improves accuracy across a range of tasks. We show the mean and standard deviation across different choices of the training examples (the prompt format is fixed). The LM column indicates the GPT-3 size (see Appendix A for GPT-2 results). The Baseline column shows the standard approach of greedy decoding (Brown et al., 2020) and Ours corresponds to greedy decoding after modifying the output probabilities using contextual calibration. We bold the better result of the baseline and ours. MIT-G, MIT-D, ATIS-A, and ATIS-D indicate the MIT Genre, MIT Director, ATIS Airline, and ATIS Departure Date datasets.
Here, we evaluate the effectiveness of contextual calibration across all of our datasets and LMs. We first use a fixed prompt format and select five different random sets of training examples, placing them in an arbitrary order in the prompt. We do not artificially balance the labels of the training examples for the classification tasks. We use the same sets of training examples for the baseline (standard decoding without calibration) and contextual calibration. We use labeling budgets of 0–8 examples; using more than 8 shots causes the cost of querying the OpenAI API to become prohibitively expensive. Table 1 shows the results and Figure 1 in Section 1 plots the same data for a subset of the tasks.
Improves Mean And Worst-Case Accuracy
Contextual calibration dramatically improves GPT-3's average and worst-case accuracy, by up to 30.0% absolute. These gains hold for both classification and generation tasks. Contextual calibration also sometimes allows GPT-3 2.7B to outperform the GPT-3 175B baseline (by up to 19.3%) despite being over 50x smaller.
Can Reduce Variance Across Training Sets
Figure 6 plots the difference in the standard deviation between the baseline and contextual calibration for all tasks from Table 1. Contextual calibration reduces the variance considerably in a majority of cases, and it does not increase variance by much in the remaining cases.
Reduces Drop from 0-shot to 1-shot
For the baseline, there are four cases where there is a drop in accuracy when moving from 0-shot to 1-shot (TREC, AGNews, DBPedia, SST-2). We attribute this drop to the majority label bias (see discussion in Section 4). Calibration removes this drop in three out of four cases.

Figure 6.
Aside from improving mean accuracy, contextual calibration also reduces the standard deviation of accuracy across different choices of the training examples. We plot the difference in standard deviation between contextual calibration and the baseline from Table 1.
Improves GPT-2
We also test GPT-2 1.5B (see Table 4 in Appendix A). We find that, like GPT-3, GPT-2's accuracy also varies highly across different prompts. This suggests that the variance we observe for few-shot in-context learning is a general problem for LMs. Moreover, contextual calibration works out-of-the-box for GPT-2: it improves the mean accuracy and reduces variance for most tasks.
Improves Accuracy Across Formats
In our next set of experiments, we use a fixed set of training examples and vary the prompt format. We use the 15 prompt formats for SST-2 discussed in Section 3. We also create 15 prompt formats for each of three random relations in LAMA (P20, P159, P19) by using the paraphrases of the original LAMA templates generated by Jiang et al. (2020b). Figure 7 shows the results before and after calibration for SST-2, and Figure 9 in Appendix A shows the results for LAMA. Contextual calibration improves the average and worst-case accuracy for both datasets, and reduces the variance for SST-2.
We finally conduct two analyses/ablations on contextual calibration. We first analyze how effective contextual calibration is at inferring a good setting of W. To do so, we compare its accuracy to an "oracle calibration" method that uses the validation set to find the best possible diagonal W. We evaluate this oracle on AGNews, and find that contextual calibration is surprisingly close to it (Figure 8). We also study how the choice of content-free input affects accuracy. In Table 3 in Appendix A, we show the accuracy for SST-2 and AGNews for different choices of the content-free input. The choice of content-free input matters; however, many good choices exist.
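The oracle comparison can be sketched as a brute-force search for the best diagonal W on the labeled validation set; the search grid below is an illustrative assumption, and the key point is that this oracle uses labels that contextual calibration never sees.

```python
# Sketch of "oracle calibration": search diagonal W to maximize validation accuracy.
import itertools
import numpy as np


def oracle_diagonal_W(val_probs, val_labels, grid=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """val_probs: (N, K) array of uncalibrated label probabilities.
    val_labels: (N,) array of gold class indices.
    Returns the best diagonal of W (length K) and its validation accuracy."""
    num_classes = val_probs.shape[1]
    best_diag, best_acc = None, -1.0
    for diag in itertools.product(grid, repeat=num_classes):
        preds = np.argmax(val_probs * np.asarray(diag), axis=1)
        acc = float(np.mean(preds == val_labels))
        if acc > best_acc:
            best_diag, best_acc = np.asarray(diag), acc
    return best_diag, best_acc
```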
Figure 7.
GPT-3 has high variance across different prompt formats; contextual calibration reduces this variance and improves mean accuracy. We show GPT-3 2.7B's mean accuracy (± one standard deviation), with and without calibration, over 15 different prompt formats for SST-2.
6. Discussion
Does Calibration Eliminate the Need to Engineer Prompts?
The motivation behind "prompt engineering" is that not all prompts lead to the same accuracy. Thus, one should tune the prompt's format and examples to achieve the best possible performance (Brown et al., 2020; Gao et al., 2020). Contextual calibration does not eliminate the need to engineer prompts; however, it does mitigate it: contextual calibration makes the accuracy of the best, average, and worst-case prompts more similar (and higher).
Should You Finetune in the Few-shot Setting?
We use a fixed LM with no finetuning. As mentioned in Section 1, there are numerous reasons not to finetune: it enables rapid prototyping, provides a fully natural language interface, and is more efficient in terms of memory requirements and system complexity when serving many different tasks. Moreover, like in-context learning without contextual calibration, finetuning can be unstable in the few-shot setting (Schick & Schütze, 2021). Nevertheless, if these disadvantages are acceptable or avoidable, finetuning can improve accuracy over in-context learning in some cases (Schick & Schütze, 2020; Gao et al., 2020). An interesting direction for future work is to study the interplay between contextual calibration and finetuning, e.g., does contextual calibration alleviate the need to finetune, or vice versa?
7. Related Work
Few-shot Learning with Language Models
Recent work uses LMs to solve NLP tasks, e.g., for story cloze prediction (Schwartz et al., 2017), knowledge base completion (Petroni et al., 2019), and Winograd schemas (Trinh & Le, 2018). Radford et al. (2019) and Brown et al. (2020) show that large LMs can be used to solve a myriad of tasks in a few-shot manner via in-context learning. Our paper provides a simple modification to their setting that improves performance. Asking LMs to complete natural language prompts is also used as a method to "probe" LMs, e.g., analyzing their factual (Petroni et al., 2019; Jiang et al., 2020b; Shin et al., 2020) or commonsense knowledge (Bosselut et al., 2019). Our results suggest that these probing methods may underestimate model accuracy, and we recommend that future work take advantage of contextual calibration.

Figure 8.
Contextual calibration, despite using no training data, achieves similar accuracy to an "oracle" calibration that finds the best W using the validation set. The plot shows GPT-3 175B's mean accuracy (± one standard deviation) on AGNews over different choices of the training examples, for the uncalibrated baseline, contextual calibration, and oracle calibration.

Volatility of Few-shot Learning in NLP
Recent work shows that when using masked language models such as BERT for zero-shot learning, the prompt format can impact accuracy (Petroni et al., 2019; Jiang et al., 2020b; Shin et al., 2020). Independent and concurrent work also shows that when finetuning masked language models on few examples, the choice of training examples can impact results (Schick & Schütze, 2020; Gao et al., 2020). We show that similar instabilities occur for in-context learning (i.e., no finetuning) with left-to-right language models. We also show a surprising instability associated with example ordering. Moreover, unlike past work, we analyze why these instabilities occur, and we use insights from this analysis to mitigate the issues.
Failures of Language Models
We identify failures when LMs are used for in-context learning (e.g., recency bias). Past work identifies similar failures when LMs are used for text generation. For example, neural LMs often repeat themselves (Holtzman et al., 2020), suffer from overconfidence (Braverman et al., 2020; Jiang et al., 2020a), suffer from recency bias (Khandelwal et al., 2018; Ravfogel et al., 2019), and prefer generic responses instead of rare text (Li et al., 2016; Logan et al., 2019). Past work mitigates these degeneracies by modifying the model's output probabilities or generation schemes, e.g., explicitly preventing repetitions (Paulus et al., 2018) or using sampling instead of greedy decoding (Holtzman et al., 2020).
8. Conclusion and Future Work
We show that few-shot learning can be highly volatile across different choices of the prompt. Through a detailed analysis, we identify that this volatility arises from biases in LMs, e.g., their tendency to output recent or common tokens. We use these insights to develop contextual calibration, a simple procedure to adjust the model's output probabilities, which improves accuracy, reduces variance, and overall makes tools like GPT-3 more effective for end users.

Looking at the bigger picture, our results inspire two future research directions in few-shot learning for NLP. First, on the methods side, we show that good few-shot learning requires attention to detail: small but non-trivial decisions such as calibration can greatly influence results. This makes it difficult to correctly develop and compare new methods (e.g., pretraining schemes or model architectures). We thus hope to make other few-shot learning methods more robust, and also to expand our techniques to cover a wider range of tasks (e.g., calibration for open-ended generation). Second, on the analysis side, our results highlight the need to understand what GPT-3 learns from the prompt. The model has an impressive ability to improve with more training examples; however, we show that the model learns some superficial patterns such as repetition of common answers. We hope to better understand and analyze the dynamics of in-context learning in future work.
Acknowledgements
We thank OpenAI for providing academic access to the GPT-3 API. We thank Sewon Min, Nikhil Kandpal, Nelson Liu, Girish Sastry, Marco Tulio Ribeiro, and the members of Berkeley NLP for valuable feedback on the paper.

This work was supported by DARPA under the LwLL program/Grant No. FA8750-19-1-0504, DARPA MCS program under Contract No. N660011924033 with the United States Office Of Naval Research, DARPA and the Air Force Research Laboratory (AFRL), and an NSF award.
References
Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., and Choi, Y. COMET: Commonsense transformers for automatic knowledge graph construction. In ACL, 2019.
Braverman, M., Chen, X., Kakade, S., Narasimhan, K., Zhang, C., and Zhang, Y. Calibration, entropy rates, and memory in language models. In ICML, 2020.
Brier, G. W. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 1950.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In NeurIPS, 2020.
Dagan, I., Glickman, O., and Magnini, B. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, 2005.
de Marneffe, M.-C., Simons, M., and Tonhauser, J. The CommitmentBank: Investigating projection in naturally occurring discourse. In Sinn und Bedeutung, 2019.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
Gao, T., Fisch, A., and Chen, D. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In ICML, 2017.
Hemphill, C. T., Godfrey, J. J., and Doddington, G. R. The ATIS spoken language systems pilot corpus. In Speech and Natural Language Workshop, 1990.
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. In ICLR, 2020.
Jiang, Z., Araki, J., Ding, H., and Neubig, G. How can we know when language models know? arXiv preprint arXiv:2012.00955, 2020a.
Jiang, Z., Xu, F. F., Araki, J., and Neubig, G. How can we know what language models know? In TACL, 2020b.
Khandelwal, U., He, H., Qi, P., and Jurafsky, D. Sharp nearby, fuzzy far away: How neural language models use context. In ACL, 2018.
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. In Science, 2015.
Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. A diversity-promoting objective function for neural conversation models. In NAACL, 2016.
Liu, J., Cyphers, S., Pasupat, P., McGraw, I., and Glass, J. A conversational movie search system based on conditional random fields. In INTERSPEECH, 2012.
Logan, R. L., Liu, N. F., Peters, M. E., Gardner, M., and Singh, S. Barack's wife Hillary: Using knowledge-graphs for fact-aware language modeling. In ACL, 2019.
Paulus, R., Xiong, C., and Socher, R. A deep reinforced model for abstractive summarization. In ICLR, 2018.
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? In EMNLP, 2019.
Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, 1999.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. Technical Report, 2018.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. Technical Report, 2019.
Ravfogel, S., Goldberg, Y., and Linzen, T. Studying the inductive biases of RNNs with synthetic variations of natural languages. In NAACL, 2019.
Schick, T. and Schütze, H. It's not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118, 2020.
Schick, T. and Schütze, H. Exploiting cloze questions for few-shot text classification and natural language inference. In EACL, 2021.
Schwartz, R., Sap, M., Konstas, I., Zilles, L., Choi, Y., and Smith, N. A. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In ACL, 2017.
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., and Singh, S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In EMNLP, 2020.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
Trinh, T. H. and Le, Q. V. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.
Voorhees, E. M. and Tice, D. M. Building a question answering test collection. In SIGIR, 2000.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019.
Yogatama, D., d'Autume, C. d. M., Connor, J., Kocisky, T., Chrzanowski, M., Kong, L., Lazaridou, A., Ling, W., Yu, L., Dyer, C., et al. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373, 2019.
Zhang, X., Zhao, J., and LeCun, Y. Character-level convolutional networks for text classification. In NeurIPS, 2015.
A. Additional Results on Variance and Calibration
Table 2 shows an example of the sensitivity to ordering.
Prompt (test input not shown) | Acc.

Review: the whole thing 's fairly lame , making it par for the course for disney sequels .
Answer: Negative
Review: this quiet , introspective and entertaining independent is worth seeking .
Answer: Positive
Acc.: 88.5%

Review: this quiet , introspective and entertaining independent is worth seeking .
Answer: Positive
Review: the whole thing 's fairly lame , making it par for the course for disney sequels .
Answer: Negative
Acc.: 51.3%
Table 2.
Top: a prompt consisting of two training examples (the test input is not shown) that leads to good test accuracy for GPT-3 2.7B (88.5%). Bottom: simply reversing the order of the two examples causes the accuracy to drop to near random chance (51.3%).
Table 3 demonstrates that the choice of content-free input does affect accuracy; however, many good choices exist.
Content-free Input | SST-2 | AGNews
Uncalibrated Baseline | 66.5 | 48.5
N/A | 74.2 | 64.5
[MASK] | 74.5 | 63.8
'' (empty string) | 72.9 | 64.7
N/A, [MASK], '' | 79.0 | 66.5
the | 69.1 | 59.0
abc | 77.5 | 57.3
the man. | 79.4 | 62.0
dasjhasjkdhjskdhds | 79.3 | 64.5
nfjkhdvy84tr9bpuirvwe | 78.4 | 65.5
Table 3.
We show the accuracy for 1-shot SST-2 and 0-shot AGNews over different choices for the content-free input. The choice of content-free input matters; however, many good choices exist. The token '' indicates the empty string. Recall that in our experiments, we ensemble over N/A, [MASK], and the empty string.
Figure 9 shows how GPT-3's accuracy changes as the prompt format is varied for LAMA, with and without calibration. Table 4 shows the effect of calibration for GPT-2.
B. Prompt Formats Used
Tables 5 and 6 show the default prompt format used for all tasks. Table 7 shows the 15 different formats used when studying the effect of prompt format for SST-2.
GPT-3 2.7BWith Calibration L A M A A cc u r a c y ( % ) Accuracy Over Diff. Formats (P159)
GPT-3 2.7BWith Calibration L A M A A cc u r a c y ( % ) Accuracy Over Diff. Formats (P19)
GPT-3 2.7BWith Calibration
Figure 9.
Contextual calibration improves GPT-3’s accuracy across various prompt formats for LAMA. We plot GPT-2 2.7B’s meanaccuracy over 15 different formats for the LAMA “place of death” relation (P20), “Headquarter Location” relation (P159), and “place ofbirth” relation (P19).
Dataset | LM | 0-shot (Baseline / Ours) | 1-shot (Baseline / Ours) | 4-shot (Baseline / Ours) | 8-shot (Baseline / Ours)
Text Classification: AGNews, TREC, CB, RTE, SST-2, DBPedia (all with GPT-2)
Fact Retrieval: LAMA (GPT-2)
Information Extraction: MIT-G, MIT-D, ATIS-A, ATIS-D (GPT-2)
(Per-cell accuracy values omitted.)

Table 4.
Contextual calibration improves accuracy for GPT-2.
This table is analogous to Table 1 but shows results for GPT-2 XL.
Task: SST-2
Prompt:
Review: This movie is amazing!
Sentiment: Positive
Review: Horrific movie, don't see it.
Sentiment:
Label Names: Positive, Negative

Task: AGNews
Prompt:
Article: USATODAY.com - Retail sales bounced back a bit in July, and new claims for jobless benefits fell last week, the government said Thursday, indicating the economy is improving from a midsummer slump.
Answer: Business
Article: New hard-drive based devices feature color screens, support for WMP 10.
Answer:
Label Names: World, Sports, Business, Technology

Task: TREC
Prompt:
Classify the questions based on whether their answer type is a Number, Location, Person, Description, Entity, or Abbreviation.
Question: How did serfdom develop in and then leave Russia?
Answer Type: Description
Question: When was Ozzy Osbourne born?
Answer Type:
Label Names: Number, Location, Person, Description, Entity, Abbreviation

Task: DBPedia
Prompt:
Classify the documents based on whether they are about a Company, School, Artist, Athlete, Politician, Transportation, Building, Nature, Village, Animal, Plant, Album, Film, or Book.
Article: Geoffrey D. Falksen (born July 31 1982) is an American steampunk writer.
Answer: Artist
Article: The Perrin River is a 1.3-mile-long (2.1 km) tidal river in the U.S. state of Virginia. It is a small inlet on the north shore of the York River near that river's mouth at Chesapeake Bay.
Answer:
Label Names: Company, School, Artist, Athlete, Politician, Transportation, Building, Nature, Village, Animal, Plant, Album, Film, Book

Task: CB
Prompt:
But he ended up eating it himself. I was reluctant to kiss my mother, afraid that somehow her weakness and unhappiness would infect me. Naturally I didn't think for a minute that my life and spirit could stimulate her.
question: her life and spirit could stimulate her mother. True, False, or Neither?
answer: Neither
Valence the void-brain, Valence the virtuous valet. Why couldn't the figger choose his own portion of titanic anatomy to shaft? Did he think he was helping?
question: Valence was helping. True, False, or Neither?
answer:
Label Names: True, False, Neither

Task: RTE
Prompt:
Others argue that Mr. Sharon should have negotiated the Gaza pullout - both to obtain at least some written promises of better Palestinian behavior, and to provide Mr. Abbas with a prime prize to show his people that diplomacy, not violence, delivered Gaza.
question: Mr. Abbas is a member of the Palestinian family. True or False?
answer: False
The program will include Falla's "Night in the Gardens of Spain," Ravel's Piano Concerto in G, Berlioz's Overture to "Beatrice and Benedict," and Roy Harris' Symphony No. 3.
question: Beatrice and Benedict is an overture by Berlioz. True or False?
answer:
Label Names: True, False
Table 5.
The prompts used for text classification. We show one training example per task for illustration purposes. The label names are listed for each task (to make predictions, we check the LM's probability for these tokens).
Task: LAMA
Prompt:
Alexander Berntsson was born in Sweden
Khalid Karami was born in

Task: ATIS (Airline)
Prompt:
Sentence: what are the two american airlines flights that leave from dallas to san francisco in the evening
Airline name: american airlines
Sentence: list a flight on american airlines from toronto to san diego
Airline name:

Task: ATIS (Depart Date)
Prompt:
Sentence: please list any flight available leaving oakland california tuesday arriving philadelphia wednesday
Depart date - Day name: tuesday
Sentence: show me all all flights from pittsburgh to atlanta on wednesday which leave before noon and serve breakfast
Depart date - Day name:

Task: MIT Movies (Genre)
Prompt:
Sentence: last to a famous series of animated movies about a big green ogre and his donkey and cat friends
Genre: animated
Sentence: what is a great comedy featuring the talents of steve carell as a loser looking for a friend
Genre:

Task: MIT Movies (Director)
Prompt:
Sentence: in 2005 director christopher nolan rebooted a legendary dc comics superhero with a darker grittier edge in which movie
Director: christopher nolan
Sentence: what 1967 mike nichols film features dustin hoffman in romantic interludes with anne bancroft as mrs robinson
Director:
Table 6.
The prompts used for generation tasks. We show one training example per task for illustration purposes.
Format ID Prompt Label Names