MAUVE: Human-Machine Divergence Curves for Evaluating Open-Ended Text Generation
An Information Divergence Measure Between Neural Text and Human Text

Krishna Pillutla♥, Swabha Swayamdipta♣, Rowan Zellers♥, John Thickstun♥, Sean Welleck, Yejin Choi♥♣, Zaid Harchaoui♦

♥ Paul G. Allen School of Computer Science & Engineering, University of Washington
♣ Allen Institute for Artificial Intelligence
♦ Department of Statistics, University of Washington
{pillutla,rowanz,thickstn,yejin,zaid}@[email protected]

Abstract
Despite major advances in open-ended text generation, there has been limited progress in designing evaluation metrics for this task. We propose MAUVE, a metric for open-ended text generation which directly compares the distribution of machine-generated text to that of human language. MAUVE measures the mean area under the divergence curve for the two distributions, exploring the trade-off between two types of errors: those arising from parts of the human distribution that the model distribution approximates well, and those it does not. We present experiments across two open-ended generation tasks, in the web text domain and the story domain, and a variety of decoding algorithms and model sizes. Our results show that evaluation under MAUVE indeed reflects more natural behavior with respect to model size, compared to prior metrics. MAUVE's ordering of the decoding algorithms also agrees with that of generation perplexity, the most widely used metric in open-ended text generation; however, MAUVE presents a more principled evaluation metric for the task as it considers both model and human text.

(This is a work-in-progress draft.)

Introduction

The explosive scale of pre-trained neural language models such as OpenAI's GPT-3 (Brown et al., 2020) has revealed that left-to-right neural language models can generate text with remarkable quality and coherence. As a result, open-ended text generation has become a newly emerging research focus for applications around story and dialogue generation, where multiple generations are equally plausible and diversity in generated text is often desired. Despite these advances, however, automatic evaluation metrics for these tasks still consider either a few human references, or none at all (§2); neither approach is sufficient to measure the quality and diversity of machine text, as it compares to human text.
Figure 1: Two different sources of error which contribute to MAUVE(P, Q) for the human text distribution P and the model text distribution Q. Type I error: regions that are unlikely under P but receive high probability under Q; Type II error: mass of P not captured by Q. The center represents how different Q is from P on the common regions supported by both. A scalar λ ∈ (0, 1) quantifies how to attribute the error in the center to either Type I or Type II; different λ values are needed to fully assess the generative model Q w.r.t. the true distribution P, resulting in a divergence curve. The proposed metric MAUVE is given by the area under this curve, as a scalar summary of the divergence.
We introduce MAUVE, an evaluation metric for open-ended generation specifically designed to address these issues. MAUVE directly captures the divergence between the distribution of generated text and the human-written, reference language distribution (§3). Inspired by recent work on evaluating generative models (Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Djolonga et al., 2020) and unlike prior metrics, our metric takes into account two different kinds of errors stemming from systematic differences between the distribution of machine-generated text and that of human language, as illustrated in Fig. 1. Based on the trade-off in the importance assigned to each type of error, we can build a divergence curve between the two distributions, based on different thresholds for the trade-off between the two errors. MAUVE summarizes the divergence between human and machine text as the mean area under this divergence curve. Since neural language models are associated with large distributions which cannot be completely specified, our contributions include a method to operationalize this metric via discretization approaches (Courbariaux et al., 2016).

We provide experiments which compare several decoding approaches and different model sizes on two domains, web text and stories in English (§4). Our results show that evaluation under MAUVE indeed rewards generations from larger models over smaller architectures, confirming trends which have now been established (Brown et al., 2020); these results are more consistent compared to prior metrics. MAUVE also confirms that decoding approaches which do not rely on the unreliable tail of the model distribution (Holtzman et al., 2020) produce better generations than those which do. While these results are in agreement with those under generation perplexity, the most widely used metric in open-ended text generation, MAUVE presents a more principled evaluation metric for the task as it considers both model and human text, unlike perplexity, which only considers model generations. Since our work is still in progress, we conclude with a brief outline of future directions and ethical considerations (§6). Our code is publicly available at https://github.com/krishnap25/mauve-experiments.

Language models are probabilistic models which learn a distribution Q over a text sequence containing tokens from a fixed vocabulary. The task of open-ended text generation involves decoding under the language model Q, i.e., generating text in continuation to a given context. (Unlike targeted generation tasks such as translation or summarization, coherence, creativity, and fluency are the main criteria for open-ended text generation.) This is typically done by generating one token at each time step, in a left-to-right fashion, from the learned distribution Q. The goal, therefore, is to learn a distribution Q which closely resembles the distribution of human text, P; evaluation methods (§2.1) either consider the distributions directly, or samples from them.
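To make the left-to-right generation procedure concrete, the following sketch samples a continuation token by token from a next-token distribution. The `next_token_probs` function and the toy vocabulary are illustrative stand-ins for an actual neural language model; they are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def next_token_probs(context):
    """Stand-in for a neural LM: returns Q(. | context) over the toy vocabulary."""
    logits = rng.normal(size=len(VOCAB))          # a real model would compute these from `context`
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sample_continuation(prompt, max_new_tokens=20):
    """Left-to-right (ancestral) sampling from Q, one token per time step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        token = VOCAB[rng.choice(len(VOCAB), p=probs)]
        tokens.append(token)
        if token == "<eos>":                      # stop once the end-of-sequence token is drawn
            break
    return tokens

print(sample_continuation(["the", "cat"]))
```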
Decoding Approaches
The quality of machine-generated text is deeply dependent on the decoding method, for any given trained language model. Decoding involves either maximization under the original distribution Q (such as greedy decoding) or sampling from it. Other approaches involve modifying Q, either via temperature scaling (Ackley et al., 1985) or by simply truncating the unreliable tail probabilities via top-k sampling (Fan et al., 2018) or nucleus sampling (Holtzman et al., 2020). Finally, Martins et al. (2020) propose entmax sampling, a method to natively train and sample from a sparse text distribution. Decoding strategies are summarized in Table 7 and discussed in detail in Appendix §A.2.

Evaluation Metrics

We classify existing evaluation metrics for text generation into three categories: evaluation of generated text with respect to (a set of) reference text sample(s) (§2.1.1), with respect to the entire model distribution Q (§2.1.2), or evaluation of human text with respect to a re-calibrated model distribution (§2.1.3). Table 1 summarizes existing automatic metrics as well as our proposed metric, MAUVE (§3).
Classical metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) assume the existence of one true reference, or a small set of human references. Designed for directed generation tasks like translation and summarization, these metrics reward task-specific properties of the generated text, such as meaning-preserving translations of the source text, or succinct and faithful summaries of the source. This paradigm is unsuitable in open-ended generation, where multiple continuations are possible for a given context and creative generations are desirable.
It is more natural to evaluate based on the entire learned distribution Q, instead of a few reference samples. The most widely used metric in this category is generation perplexity (Gen. PPL): it is based on the likelihood of generated text under the original Q from the language model. For instance, Holtzman et al. (2020) compute the likelihood (or perplexity) of generated text under Q, and compare it to the likelihood of human text under the same Q. However, rewarding the likelihood (i.e., perplexity) of machine-generated text results in low-diversity generated sequences that have been found empirically to be degenerate or repetitive, and therefore atypical of natural language (Holtzman et al., 2020). Moreover, perplexity also penalizes rare words, which are common in creative language.

Other metrics in this category simply compute statistics of the generated text, such as the repetition frequency (REP), the ratio of unique n-grams (Distinct-n), how likely the text is to terminate (Welleck et al., 2020b), or the verifiability of the text under a given knowledge base (Massarelli et al., 2019). Holtzman et al. (2020) use the Zipf coefficient to measure whether words that are frequent in the training corpus are also likely under the model distribution. While such metrics can reveal important properties of machine text, they are not stand-alone estimators of its quality. Another metric in this category is Self-BLEU (Zhu et al., 2018), which measures the diversity of generated text by computing the average BLEU score between every pair of machine generations under the same context; the set of machine generations thereby approximates the model distribution. However, it is still subject to the limitations of BLEU (Mathur et al., 2020). Overall, such metrics only capture a narrow aspect of machine text, such as the amount of repetition, without providing more systematic measures of how the distribution of machine language deviates from the distribution of human language.

The final category of metrics computes the likelihood of reference or human text (typically) under a recalibrated model distribution. These include three metrics used in Martins et al. (2020): ε-perplexity (ε-PPL), sparsemax score (SP), and the Jensen-Shannon divergence (JS). ε-perplexity computes the perplexity of human text after applying Laplace smoothing to Q. Sparsemax score is a bounded metric designed for sparse text generated by entmax sampling; it indicates the quality of human text after the application of the entmax transformation (Peters et al., 2019) to Q. JS measures the reduction of uncertainty about the model distribution when we see samples of human reference text. Since these metrics are designed for sparse text generation, by definition they cannot apply to deterministic decoding methods such as beam search, where a next-token distribution might not always be available. Moreover, these metrics tend to reward text generated via pure sampling and greedy decoding, which has been shown to be "degenerate" (Holtzman et al., 2020).

Metric | References | Measures | Model Text | Human Text
--- | --- | --- | --- | ---
Generation Perplexity | Fan et al., 2018; Holtzman et al., 2020 | Plausibility of generation | ✓ |
Zipf Coefficient | Holtzman et al., 2020 | Word usage statistics | ✓ |
Self-BLEU | Zhu et al., 2018 | Diversity of generation | ✓ |
Distinct n-gram fraction | Holtzman et al., 2020; Welleck et al., 2020b; Martins et al., 2020 | Repetitiveness of generation | ✓ |
Non-termination Ratio | Welleck et al., 2020a | Consistency | ✓ |
Support per generation | Massarelli et al., 2019 | Verifiability recall | ✓ |
Support per verified | Massarelli et al., 2019 | Verifiability precision | ✓ |
ε-perplexity | Martins et al., 2020 | Language modeling quality | | ✓
Sparsemax Score | Martins et al., 2020 | Language modeling quality | | ✓
Jensen-Shannon Divergence | Martins et al., 2020 | Language modeling quality | | ✓
MAUVE | Our Work | Generation quality and diversity | ✓ | ✓

Table 1: Summary of automated metrics for evaluating open-ended text generation. Classical metrics (such as BLEU; Papineni et al., 2002), which are task-specific and unsuitable for open-ended generation, as well as metrics specific to particular domains such as story generation, have been omitted.
In contrast to prevalent metrics of open-ended text generation, we propose MAUVE, a metric which is based on the model distribution Q over both machine-generated text as well as reference human text; the latter is used to approximate the true distribution, P. The evaluation of generative models must include an assessment of which parts of the true distribution the model is able to approximate well, and which parts it is not.

Consider the task of assessing how close a generative model distribution Q is to the true distribution P of text written by humans. As highlighted in Figure 1, there are two obvious sources of error:

(I) Q places high mass on text which is unlikely under P, or,
(II) Q does not capture a part of P.

The former, which leads to false positives or type I errors, can occur because sampling algorithms tend to generate text with semantic repetitions (Dinan et al., 2019; Holtzman et al., 2020; Welleck et al., 2020b), which is highly unlikely to be written by humans. (To see why this is a false positive, let text x with P(x) ≫ 0 be the positive class and P(x) ≈ 0 be the negative class. If Q(x) ≫ 0 for some negative x, then the model incorrectly considers it a positive, i.e., a false positive, even though x would essentially never be produced by P.) The latter case results in false negatives, also known as type II errors. These can occur, for instance, because some pieces of plausible human text cannot be generated by sparse decoding algorithms such as nucleus (Holtzman et al., 2020) or entmax (Martins et al., 2020) sampling. Indeed, this is why perplexity cannot be used to evaluate sparse language models (Martins et al., 2020).

When P and Q place different non-zero amounts of probability mass on some text, we cannot attribute this discrepancy between P and Q uniquely as a false positive or a false negative. Following recent work (Sajjadi et al., 2018; Djolonga et al., 2020), we must consider a family of type I and type II error values for different λ ∈ (0, 1), where we designate a λ-fraction of this error as type I and a (1 − λ)-fraction as type II.

Figure 2: Left: The curve of (KL(Q | R_λ), KL(P | R_λ)), the formalization of Type I and Type II errors respectively, on web text generated by GPT-2 large with sampling, greedy decoding, and nucleus sampling (p = 0.95); smaller is better. Middle: The corresponding divergence curves of (exp(−c·KL(Q | R_λ)), exp(−c·KL(P | R_λ))) for c = 1. Since x ↦ e^{−x} is monotonically decreasing, larger is better on this curve, and the curve is now fully contained in the unit rectangle [0, 1]^2. Right: The same divergence curves for c = 5. Changing c maintains the same order but allows for a more meaningful comparison of the numerical value of MAUVE. MAUVE is calculated as the area under the divergence curve; the shaded region denotes the MAUVE score of sampling.

MAUVE
To formalize the above discussion, we consider a mixture R_λ = λP + (1 − λ)Q for some λ ∈ (0, 1). We measure the type I and type II errors by how far Q and P are from R_λ in terms of the Kullback-Leibler (KL) divergence, denoted KL(·|·). Concretely, KL(Q | R_λ) penalizes Q if there exists text x such that Q(x) is large but R_λ(x) is small; this measures the type I error. Similarly, KL(P | R_λ) measures the type II error.

The errors KL(Q | R_λ) and KL(P | R_λ) are not unique, since λ was arbitrary. By varying λ ∈ (0, 1), we get a divergence curve,

C(P, Q) = { ( exp(−c · KL(Q | R_λ)), exp(−c · KL(P | R_λ)) ) : R_λ = λP + (1 − λ)Q, λ ∈ (0, 1) },

where c > 0 is a scaling hyperparameter. The transformation x ↦ e^{−cx} is monotonic for all c > 0, i.e., it preserves order. Furthermore, the transformed curve C(P, Q) lies entirely in the unit square [0, 1]^2.

Finally, we define the metric MAUVE(P, Q) ∈ [0, 1] as the area under the divergence curve C(P, Q), a scalar summary of the divergence curve. A larger value of MAUVE denotes better performance. See Figure 2 for an example.

We note that this divergence curve contains more information than the KL divergence KL(P | Q), which can be obtained from the second coordinate of the curve C(P, Q) as λ → 0, and the reverse KL divergence KL(Q | P), which can be obtained from the first coordinate of the curve C(P, Q) as λ → 1. Further, the Jensen-Shannon (JS) divergence, JS(P, Q) = ( KL(P | R_{1/2}) + KL(Q | R_{1/2}) ) / 2, can be obtained from the two coordinates of C(P, Q) at λ = 1/2. MAUVE summarizes all of the divergence curve C(P, Q), not just particular points on it, as the KL or JS divergences do.

Remark 1. The divergence curve C(P, Q) encodes the Pareto frontier of ( KL(P | R), KL(Q | R) ) over all distributions R, not just mixtures of the form R_λ. We prove this in §B.
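The definition above can be computed directly when P and Q are small discrete distributions. The sketch below is illustrative rather than the authors' implementation: the toy histograms, the λ grid, and the corner-point convention are assumptions made for the example.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p | q) for discrete distributions on the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def divergence_curve(p, q, c=5.0, num_lambdas=99):
    """Points (exp(-c*KL(q|R)), exp(-c*KL(p|R))) for R = lam*p + (1-lam)*q, lam in (0, 1)."""
    points = []
    for lam in np.linspace(0.01, 0.99, num_lambdas):
        r = lam * p + (1 - lam) * q
        points.append((np.exp(-c * kl(q, r)), np.exp(-c * kl(p, r))))
    return points

def mauve_score(p, q, c=5.0):
    """Area under the divergence curve, via the trapezoidal rule."""
    # Close the curve at the corners (0, 1) and (1, 0) -- an implementation convention
    # (not spelled out in this draft) so that identical distributions score exactly 1.
    pts = [(0.0, 1.0)] + sorted(divergence_curve(p, q, c)) + [(1.0, 0.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area

p = np.array([0.45, 0.35, 0.15, 0.05])   # toy "human" histogram
q = np.array([0.30, 0.30, 0.30, 0.10])   # toy "model" histogram
print(round(mauve_score(p, q), 3))        # values near 1 indicate closely matching distributions
```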
MAUVE for Evaluating Open-Ended Text Generation

Next, we turn to efficiently computing MAUVE for the evaluation of open-ended text generation. In this case, a language model together with a decoding algorithm induces a distribution over possible generations. However, for even a moderately large sentence length, the size of the support of this distribution is intractably large, particularly for neural language models. Hence, the divergence curve for MAUVE cannot be tractably computed in closed form.

We overcome this problem by using a finite-dimensional, dense representation of text: the embedding of the hidden state for the sentence under a neural language model, similar to the method of Zhang et al. (2020). However, estimating the KL divergence between two high-dimensional distributions from samples is still extremely challenging. We therefore perform one further approximation for computational tractability: we quantize the distribution of hidden states into a discrete multinomial distribution. Towards this, we consider three different quantization methods (a sketch of the first appears after this list):

1. k-means: We cluster the hidden representations using k-means, and represent each by its cluster membership to get a discrete distribution with as many dimensions as the number of clusters. We call this MAUVE-k-means.

2. Deep Residual Mixture Models (DRMM): As a generalization of k-means, we train a deep generative model known as a DRMM (Hämäläinen and Solin, 2020). We convert the soft clustering so obtained into a hard clustering by assigning each point to its most likely cluster, and quantize the data using the cluster membership. We call this MAUVE-DRMM.

3. Lattice quantization of learnt features: We learn a feature representation of the hidden states using a deep network which maintains the neighborhood structure of the data while encouraging the features to be uniform (Sablayrolles et al., 2019). We simply quantize the data on a lattice. We call this MAUVE-Lattice.
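As a rough illustration of the MAUVE-k-means pipeline (not the authors' implementation, which uses FAISS and GPT-2 hidden states), the sketch below takes precomputed embedding matrices for human and model text, applies PCA, clusters them jointly with k-means, and turns the cluster memberships into the two multinomials fed to the divergence-curve computation. The function and variable names, and the use of scikit-learn, are assumptions for this example; `mauve_score` refers to the earlier sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def quantize_to_histograms(human_emb, model_emb, num_clusters=500, pca_var=0.9, seed=0):
    """Joint PCA + k-means over both sets of embeddings; returns two multinomials."""
    all_emb = np.vstack([human_emb, model_emb])
    all_emb = PCA(n_components=pca_var, svd_solver="full").fit_transform(all_emb)
    all_emb = normalize(all_emb)                        # unit l2 norm, as described in the draft
    labels = KMeans(n_clusters=num_clusters, n_init=5, random_state=seed).fit_predict(all_emb)
    n_human = len(human_emb)
    p = np.bincount(labels[:n_human], minlength=num_clusters).astype(float)
    q = np.bincount(labels[n_human:], minlength=num_clusters).astype(float)
    return p / p.sum(), q / q.sum()

# Toy stand-ins for LM hidden states of 1000 human and 1000 model texts (dimension 64).
rng = np.random.default_rng(0)
human_emb = rng.normal(0.0, 1.0, size=(1000, 64))
model_emb = rng.normal(0.2, 1.0, size=(1000, 64))        # slightly shifted "model" distribution
p, q = quantize_to_histograms(human_emb, model_emb, num_clusters=50)
# mauve = mauve_score(p, q, c=5.0)                       # using mauve_score from the earlier sketch
```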
Experiments

We present experiments to show the ability of MAUVE to evaluate the quality of generated text for various state-of-the-art decoding algorithms and models. We compare MAUVE to existing methods of evaluating text generation (see §2.1) across decoding methods and model architectures.

We consider open-ended text generation in two domains: web text and stories. We use size variants of the GPT-2 model (Radford et al., 2019) in each setting. At decoding time, we explore a text-continuation setting, conditioned on a prompt containing human-written text. All experiments were built using pretrained models and functions available in the HuggingFace Transformers library (Wolf et al., 2020).
Web Text Generation
We consider the publicly available analogue of the Webtext dataset (https://github.com/openai/gpt-2-output-dataset). Given its similarity to the training data of GPT-2, we do not finetune GPT-2 on this data, and simply use the released GPT-2 architecture in its small and large variants. At generation time, we use as prompts the first tokens of each of the test examples of the Webtext corpus. The machine generations are allowed to be up to tokens long. As human-written continuations for comparison, we use the corresponding test examples up to a maximum of tokens.

Story Continuation
Given a situation and the start of a story as a prompt, the goal is to continue the story. Here, we use the GPT-2 medium architecture, finetuned on the training set of the WritingPrompts dataset (Fan et al., 2018). We use as a generation prompt the first tokens of randomly chosen samples from the test set of WritingPrompts. The machine generations are allowed to be up to tokens long. The corresponding test examples of WritingPrompts are used as human-written continuations, up to a maximum of tokens.
[Table 2 — rows: Sampling, Greedy, Nucleus (p = 0.95), Human, for each of GPT-2 small and GPT-2 large; columns: Gen. PPL, Zipf Coef., REP, Distinct-4, Self-BLEU, SP, JS, ε-PPL, MAUVE; numeric entries omitted.]

Table 2: Comparing evaluation metrics across different decoding approaches, as well as different GPT-2 architectures, for web text generation. Subscripts indicate the s.d. across 5 runs for the sampling-based methods; greedy decoding, being deterministic, always returns the same value for a given model. For nucleus sampling, we show the best hyperparameter value (from a grid of five candidates) as per MAUVE. Boldfaced numbers indicate performance closest to the human reference when applicable. MAUVE shows that larger models perform better across decoding approaches; moreover, nucleus sampling is the best decoding approach.
[Table 3 — rows: Sampling, Greedy, Nucleus, Human; columns: Gen. PPL, Zipf Coef., REP, Distinct-4, Self-BLEU, SP, JS, ε-PPL, MAUVE; numeric entries omitted.]

Table 3: Comparing evaluation metrics across different decoding approaches, on the GPT-2 medium architecture, for story continuation on the WritingPrompts dataset (Fan et al., 2018). Subscripts indicate the s.d. across 5 runs for the sampling-based methods; greedy decoding, being deterministic, always returns the same value for a given model. For nucleus sampling, we show the best hyperparameter value (from a grid of three candidates) as per MAUVE. Boldfaced numbers indicate performance closest to the human reference when applicable. MAUVE favors nucleus sampling over pure sampling and greedy decoding.
Decoding Algorithms
We generate text by forward sampling from the language model (denoted "sampling"), or from a reshaped language model obtained via a truncation heuristic, namely nucleus sampling (Holtzman et al., 2020). We also consider greedy decoding as a likelihood-maximization-based alternative. The finetuning hyperparameters are detailed in §C.
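For concreteness, the snippet below shows how such continuations can be produced with the HuggingFace Transformers API; the model name, prompt, and length limit are illustrative placeholders rather than the exact settings used in the experiments.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")          # e.g., "gpt2-large" for the larger variant
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The old lighthouse keeper climbed the stairs"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    sampled = model.generate(input_ids, do_sample=True, top_k=0, top_p=1.0, max_length=100)   # pure sampling
    nucleus = model.generate(input_ids, do_sample=True, top_k=0, top_p=0.95, max_length=100)  # nucleus sampling
    greedy = model.generate(input_ids, do_sample=False, max_length=100)                       # greedy decoding

print(tokenizer.decode(nucleus[0], skip_special_tokens=True))
```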
Hyperparameters of MAUVE

We compute MAUVE with machine generations and human samples arising from a common set of prompts. MAUVE-k-means is computed using the k-means implementation in FAISS (Johnson et al., 2019). Prior to clustering, we reduce the dimensionality of the hidden-state representations with PCA and normalize the features to unit ℓ2 norm. For MAUVE-DRMM, we convert the soft clustering into a hard one by assigning each point to its most likely cluster. For MAUVE-Lattice, we train a feature representation of the hidden states and then quantize it using a lattice spherical quantizer. Further details of each method and its hyperparameters can be found in §C.

We find in §4.4 that all discretization algorithms perform similarly. Hence, unless specified otherwise, we use MAUVE-k-means with k = 500 as our default for its ease of computation. We fix c = 5 throughout, as we observed that this allows for a more meaningful comparison of the numerical values of MAUVE (cf. Figure 2).

Reproducibility
We generate five different continuations for each prompt using different random seeds. We present the mean and, where applicable, the standard deviation of each metric over the five sets of generations.
In this work, we compare MAUVE with the following automatic metrics to evaluate the quality of generations (cf. §2.1; Table 1):

• Model text: generation perplexity (Gen. PPL), Zipf coefficient, Self-BLEU, REP (repetition frequency), and Distinct-n;
• Human text: ε-perplexity (ε-PPL), sparsemax score (SP), and Jensen-Shannon divergence (JS).
[Table 4 — rows: Web text, Stories; columns: Gen. PPL, Zipf Coef., REP, Distinct-4, Self-BLEU, SP, JS, ε-PPL; most numeric entries omitted.]

Table 4: Spearman rank correlation between MAUVE and existing metrics. All correlations are statistically significant except ε-PPL on web text. The table shows GPT-2 large for web text generation and GPT-2 medium for story generation.

For the second category, we compute the metrics on the test set for each task, directly using the reshaped or truncated model probabilities from which generations are sampled. We compute Gen. PPL as the perplexity of the model text under the original model it is sampled from (without reshaping or truncation), i.e., the original GPT-2 model.
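As a sketch of how generation perplexity can be computed with a causal language model (the model name and text below are placeholders; the draft does not prescribe this exact code), one can exponentiate the mean token-level cross-entropy of the generated text under the original, untruncated model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def generation_perplexity(text):
    """Perplexity of `text` under the original (unmodified) model distribution."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean next-token cross-entropy.
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

print(generation_perplexity("The quick brown fox jumps over the lazy dog."))
```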
Interpreting Results
Better performance is indicated by larger values of MAUVE and SP, but smaller values of ε-PPL and JS. Other metrics µ, such as Gen. PPL, are computed on the model text; the values of these metrics for a decoding algorithm are only meaningful in comparison to the value of the same metric on human text. Therefore, we prefer smaller values of |µ_model − µ_human|.

Tables 2 and 3 present our comparisons of MAUVE to the prior evaluation metrics (§4.2) on web text generation and story continuation, respectively. First and foremost, we observe that among the decoding approaches, nucleus sampling achieves the best MAUVE, followed by sampling and lastly by greedy decoding. This trend is consistent with the perplexity of the generated text under the original model it was generated from, as well as with the fraction of distinct n-grams. On the other hand, JS favors greedy decoding, which is known to produce extremely degenerate text (Welleck et al., 2020b). Likewise, ε-PPL favors pure sampling, which also produces somewhat degenerate text (Holtzman et al., 2020), while SP appears to be unable to distinguish between pure sampling and nucleus sampling. This makes SP, JS, and ε-PPL unsuitable as metrics to evaluate the quality of generated text.

While most metrics show expected behavior across model architectures (larger models produce better generations), the Zipf coefficient and Self-BLEU prefer generations from GPT-2 small over GPT-2 large. This indicates that while metrics based on word/token statistics are important diagnostic tools, they do not capture the quality of generated text entirely.

We compare the Spearman rank correlation between MAUVE and existing metrics in Table 4. We see a strong, statistically significant correlation (0.83) with generation perplexity in the case of web text, while the correlation in the case of story generation is slightly weaker at 0.69. We note that Distinct-n and Self-BLEU also show strong correlations with MAUVE. However, Table 2 offers a cautionary note: Self-BLEU ranked generations from GPT-2 small over those from GPT-2 large.
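A small sketch of how such rank correlations between metric scores can be computed (the score vectors below are made-up placeholders, not the paper's numbers):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical scores of five decoding configurations under two metrics.
mauve_scores   = np.array([0.91, 0.85, 0.62, 0.40, 0.33])
gen_ppl_scores = np.array([12.3, 14.1, 25.7, 48.0, 60.2])

# Spearman's rho depends only on the rankings, so any monotone rescaling of a
# metric leaves its magnitude unchanged (only the sign can flip).
rho, p_value = spearmanr(mauve_scores, gen_ppl_scores)
print(f"Spearman rho = {rho:.2f}, p-value = {p_value:.3f}")
```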
MAUVE is Sensitive to Generation Length
We plot in Fig. 3 (Left) how MAUVE varies with the maximum generation length for pure, nucleus, and entmax sampling. We observe, for each sampling algorithm, that MAUVE decreases as the length of the generations increases. This shows that the distribution of generated text drifts farther away from that of human text as the length increases.

Second, we observe that entmax sampling scores the highest up to a certain length, after which its performance sharply drops. This is explained by the distribution of sentence lengths in Fig. 3 (Right): the generations from the entmax sampler are, on aggregate, shorter than human text. When the maximum generation length is small enough, this difference in distribution is erased by clipping each generation at the maximum length, and the entmax generations score highly as per MAUVE. This shows that MAUVE is sensitive to the length distribution of the generated text compared to that of human text. This is in contrast to metrics such as ε-PPL and SP, which depend only on the model, not on the generated text.
[Table 5 — rows: Web text, Stories; columns: DRMM, Lattice; recoverable entries: 0.97 (Web text, DRMM) and 0.78 (Stories, DRMM).]

Table 5: Spearman rank correlation between MAUVE-k-means and the other quantization schemes. All entries have small p-values.
Figure 3: Left: MAUVE as the maximum generation length is varied, for pure sampling, nucleus sampling (p = 0.92), and entmax sampling (α = 1.2). The shaded region indicates one s.d. over runs. Right: Length statistics of the generated text, compared to human text.
[Table 6 — Spearman rank correlation between MAUVE-k-means with k = 500 and k ∈ {100, 250, 1000, 1500, 2000}; numeric entries omitted.]

Table 6: Spearman rank correlation between MAUVE-k-means with k = 500 and other values of k. All entries have small p-values.

We now study the effect of varying the different hyperparameters involved in the discretization under MAUVE.

Quantization and Number of Clusters
We see in Table 5 that both MAUVE-DRMM and MAUVE-Lattice correlate very strongly with MAUVE-k-means. Next, we see in Table 6 that MAUVE-k-means with different values of k correlates nearly perfectly with k = 500. We note, however, that the Spearman rank correlation measures the agreement in the rankings induced by the different variants of MAUVE; the absolute values could differ. Overall, these results indicate that MAUVE is robust to the choice of hyperparameters, as long as the selection is consistent.
Effect of Number of Generations
We plot in Figure 4 the value of MAUVE versus the sample size n, with the number of clusters chosen proportionally to n. We observe that a smaller sample size gives an optimistic estimate of MAUVE; this is consistent with (Djolonga et al., 2020, Prop. 8). We also note that a smaller sample size leads to a larger variance in MAUVE.
Figure 4: Effect of the sample size on MAUVE, shown for sampling and nucleus sampling (p = 0.95).

Related Work

Evaluation of generative models is an active area of research in computer vision, where generative adversarial networks (Goodfellow et al., 2014) are commonly used. However, metrics such as the Inception Score (IS; Salimans et al., 2016) are designed for supervised classification settings, and are thus inappropriate for text generation. The Fréchet Inception Distance (FID; Heusel et al., 2017) and its unbiased counterpart, the Kernel Inception Distance (Bińkowski et al., 2018), are both used for evaluating generative models, but unlike MAUVE, they do not take into account a trade-off between different kinds of errors between the learned and the reference distribution. Sajjadi et al. (2018) and Kynkäänniemi et al. (2019) both proposed metrics based on precision-recall curves; Djolonga et al. (2020) proposed a framework which encompasses both of these works. MAUVE extends the above line of work, and is operationalized as a metric for evaluating open-ended text generation, applicable to data generated by large-scale neural language models.
Discussion
We presented MAUVE, a metric for evaluating open-ended text generation. In our experiments, we used MAUVE to compare different decoding methods, models, and hyperparameters within each setting. Our metric is associated with certain hyperparameters, such as the number of generation samples needed to obtain meaningful clusters, and the parameters of the clustering approach itself. Investigating the alignment of MAUVE with human assessment of machine-generated text through a human evaluation would be an interesting avenue for future work. Factors like the length of the text, which may determine the quality of human assessment (Ippolito et al., 2020), could be taken into consideration in such human evaluations.
Ethical Implications
Our metric rewards model text which resembles human-authored text. However, we acknowledge the risks of rewarding systems that try to mimic humans, which is the ultimate goal of open-ended text generation (Bender et al., 2021). While our research is important for developing better language generators, we also encourage the community to pay attention to the development of technology that can reliably distinguish between human and machine text. We leave the extension of our method towards building such systems to future work.
Acknowledgments
This work was supported by NSF CCF-2019844, the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the CIFAR program "Learning in Machines and Brains", a Qualcomm Innovation Fellowship, and faculty research awards.
References
David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. 1985. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proc. of FAccT.
Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. 2018. Demystifying MMD GANs. In Proc. of ICLR.
Mark Braverman, Xinyi Chen, Sham M. Kakade, Karthik Narasimhan, Cyril Zhang, and Yi Zhang. 2019. Calibration, entropy rates, and memory in language models.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1.
Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2019. The second conversational intelligence challenge (ConvAI2).
Josip Djolonga, Mario Lucic, Marco Cuturi, Olivier Bachem, Olivier Bousquet, and Sylvain Gelly. 2020. Precision-Recall Curves Using Information Divergence Frontiers. In International Conference on Artificial Intelligence and Statistics, pages 2550–2559.
Angela Fan, Mike Lewis, and Yann N. Dauphin. 2018. Hierarchical Neural Story Generation. In Proc. of the Association for Computational Linguistics, pages 889–898.
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks. In Proc. of NeurIPS.
Perttu Hämäläinen and Arno Solin. 2020. Deep Residual Mixture Models. arXiv preprint arXiv:2006.12063.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. of NeurIPS, pages 6629–6640, Red Hook, NY, USA. Curran Associates Inc.
Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations.
Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1808–1822, Online. Association for Computational Linguistics.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2019. Improved precision and recall metric for assessing generative models. In NeurIPS.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Pedro Henrique Martins, Zita Marinho, and André F. T. Martins. 2020. Sparse Text Generation. In Conference on Empirical Methods in Natural Language Processing, pages 4252–4273.
Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio Silvestri, and Sebastian Riedel. 2019. How Decoding Strategies Affect the Verifiability of Generated Text. arXiv preprint arXiv:1911.03587.
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online. Association for Computational Linguistics.
Kaisa Miettinen. 2012. Nonlinear Multiobjective Optimization, volume 12. Springer Science & Business Media.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Ben Peters, Vlad Niculae, and André F. T. Martins. 2019. Sparse sequence-to-sequence models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1504–1519, Florence, Italy. Association for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. 2019. Spreading vectors for similarity search. In International Conference on Learning Representations.
Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. 2018. Assessing generative models via precision and recall. In NeurIPS.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs.
Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, and Kyunghyun Cho. 2020a. Consistency of a recurrent language model with respect to incomplete decoding. In Conference on Empirical Methods in Natural Language Processing, pages 5553–5568.
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020b. Neural text generation with unlikelihood training. In International Conference on Learning Representations, ICLR 2020.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Empirical Methods in Natural Language Processing, pages 38–45.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proc. of ICLR.
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models.
Background: Decoding for Natural Language Generation
We start by revisiting some preliminaries for text generation (§A.1) and common decoding strategies (§A.2).
A.1 Preliminaries
Consider a fixed and finite set Y of tokens. Let 𝒴 denote the set of all sequences from Y of maximal length N: 𝒴 = ∪_{n=1,…,N} Y^n. We denote such a sequence as y = (y_1, …, y_n). Let Q(y_i | y_{<i}, x) denote the model's next-token distribution given a context x and the preceding tokens y_{<i} = (y_1, …, y_{i−1}).
A.2 Decoding Strategies

Neural text generation methods employ two types of decoding, based on either (a) sampling or (b) maximization. Each decoding method constructs a distribution in 𝒫(𝒴), the set of probability distributions over 𝒴; a sampling-based decoder samples from this distribution to generate text, while a maximization-based decoder typically finds its mode (or approximates it). The methods are summarized in Table 7.

Original Distribution
A sampling-based decoder samples from the learned distribution Q to generate text, while a maximization-based decoder typically finds the mode of Q (or approximates it). Greedy decoding employs maximization, i.e., it picks the candidate with the highest probability at a given time step. Beam search generalizes greedy decoding by maintaining a beam of size b over a few time steps, selecting the best candidate from the beam at a given time. Both of these approaches employ local approximations, since it is intractable to perform exact maximization. Sampling approaches are, on the other hand, non-deterministic. Pure sampling simply samples a token from Q at every time step. While pure sampling might produce diverse text, it can also generate either very generic text or even incorrect tokens (according to humans), due to the miscalibration of Q (Braverman et al., 2019) on infrequently seen data.

Modified Distribution
More recently, decoding strategies that modify Q to remove low-probability events before sampling have been proposed. Sampling with temperature (Ackley et al., 1985) transforms the distribution to favor higher-probability events. Top-k sampling (Fan et al., 2018) involves sampling from the subset of candidates which have the k highest probabilities; greedy decoding is a special case of this where k = 1. Nucleus or top-p sampling (Holtzman et al., 2020) involves sampling from the nucleus of the distribution, i.e., only considering the smallest set of candidates whose cumulative probability sums to at least p at every time step. In contrast to the above methods, which modify the trained distribution for generation, Martins et al. (2020) propose entmax sampling, which samples from a sparse distribution trained natively to discard tokens undesirable for diverse, non-repetitive generations.

A.3 Sampling-Based Decoders
Sampling-based decoding algorithms depend on a recalibration function σ : 𝒫(Y) → 𝒫(Y). They generate a sequence by sampling from a distribution σ ∘ Q ∈ 𝒫(𝒴) over sequences, defined as

(σ ∘ Q)((y_1, …, y_n) | x) = ∏_{i=1}^{n} [σ(Q(· | x, y_{<i}))]_{y_i},

where we denote y_{<1} = ∅.
[Table 7 (a survey of the different decoding algorithms, including sampling for L steps followed by beam search) omitted.]

Below, we let π denote the next-token distribution. We consider common decoding strategies which fall into this framework; a short code sketch of the truncation-based recalibrations follows this list.

• Pure sampling, where σ(π) = π.

• Top-K sampling (Fan et al., 2018), which uses an integer K as a parameter to define [σ_K(π)]_j ∝ π_j if j ∈ argtopk_K(π), and 0 otherwise. Here, argtopk_K(π) denotes the set of the K largest indices of π.

• Nucleus (top-ρ) sampling (Holtzman et al., 2020), which uses a parameter ρ ∈ (0, 1) to define [σ_nuc,ρ(π)]_j ∝ π_j if j ∈ argtopprob_ρ(π), and 0 otherwise. Here, argtopprob_ρ(π) denotes the smallest set of indices of π whose sum is at least ρ; formally, it is the smallest set J ⊆ Y such that Σ_{j∈J} π_j ≥ ρ.

• Greedy decoding is a special case of top-K decoding with K = 1.

• Consistent alternatives to top-K and nucleus sampling (Welleck et al., 2020a) fall into this framework, with [σ_con,K(π)]_j and [σ_con,nuc,ρ(π)]_j including j_eos, the special end-of-sequence token, in the top-K or top-ρ set.

• The entmax sampler (Martins et al., 2020) corresponds to pure sampling, i.e., σ(π) = π, in this framework; however, we note that the entmax next-token distribution is not parameterized using the softmax.
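The top-K and nucleus recalibrations above can be written directly as operations on a probability vector. The sketch below is illustrative; the function names and the toy vector are not from the paper.

```python
import numpy as np

def top_k_recalibrate(pi, k):
    """Keep the k largest-probability tokens of pi and renormalize (greedy decoding is k = 1)."""
    out = np.zeros_like(pi)
    keep = np.argsort(pi)[-k:]               # indices of the k largest entries
    out[keep] = pi[keep]
    return out / out.sum()

def nucleus_recalibrate(pi, rho):
    """Keep the smallest set of tokens whose cumulative probability is at least rho, then renormalize."""
    order = np.argsort(pi)[::-1]             # token indices, sorted by descending probability
    cum = np.cumsum(pi[order])
    cutoff = np.searchsorted(cum, rho) + 1   # smallest prefix with total mass >= rho
    out = np.zeros_like(pi)
    keep = order[:cutoff]
    out[keep] = pi[keep]
    return out / out.sum()

pi = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
print(top_k_recalibrate(pi, k=2))            # mass only on the two most likely tokens
print(nucleus_recalibrate(pi, rho=0.9))      # smallest prefix covering 90% of the mass
```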
A.4 Deterministic Decoders

Deterministic decoders search through the output space 𝒴 to produce an output y ∈ 𝒴. In general, deterministic decoders aim at maximization, i.e., to compute the mode arg max_{y∈𝒴} Q(y | x) of the distribution Q(· | x). However, one must resort to approximations, since the argmax cannot, in general, be computed exactly in a tractable manner. Instances of maximization-based deterministic decoding algorithms include:

• Greedy decoding, which constructs ȳ ∈ 𝒴 incrementally as ȳ_t = arg max_{j∈Y} Q(j | ȳ_{<t}, x).

B Pareto Optimality of the Divergence Curve

Here, we show the Pareto-optimality property of C(P, Q). The main property we will show in this section is the following; the proof is at the end of the section.

Proposition 2. Consider two distributions P, Q with finite support and a scaling constant c > 0. Let R_λ be such that (e^{−c·KL(Q|R_λ)}, e^{−c·KL(P|R_λ)}) ∈ C(P, Q). Then, R_λ is Pareto-optimal for the pair of objectives (KL(Q|·), KL(P|·)). In other words, there does not exist a distribution R such that KL(Q|R) < KL(Q|R_λ) and KL(P|R) < KL(P|R_λ) simultaneously.

Proof. Let F(P, Q) be the Pareto frontier of (KL(Q|·), KL(P|·)). From (Miettinen, 2012, Thm. 3.4.5, 3.5.4), it follows that

F(P, Q) = { ( KL(Q | R*_λ), KL(P | R*_λ) ) : λ ∈ [0, 1] },  where  R*_λ ∈ arg min_R { λ·KL(P|R) + (1 − λ)·KL(Q|R) }.

We invoke the next lemma to show that R*_λ = λP + (1 − λ)Q, which completes the proof.

Lemma 3. Let P, Q, S be discrete distributions with finite support. For any λ ∈ [0, 1] and λ̄ = 1 − λ, letting R_λ = λP + λ̄Q, we have the identity

λ·KL(P|S) + λ̄·KL(Q|S) = λ·KL(P|R_λ) + λ̄·KL(Q|R_λ) + KL(R_λ|S).

Consequently, we have that R_λ ∈ arg min_S { λ·KL(P|S) + λ̄·KL(Q|S) }.

Proof. By adding and subtracting Σ_i R_{λ,i} log R_{λ,i}, we get

λ·KL(P|S) + λ̄·KL(Q|S)
 = Σ_i [ λ P_i log P_i + λ̄ Q_i log Q_i − R_{λ,i} log S_i ]
 = Σ_i [ λ P_i log (P_i / R_{λ,i}) + λ̄ Q_i log (Q_i / R_{λ,i}) + R_{λ,i} log (R_{λ,i} / S_i) ]
 = λ·KL(P|R_λ) + λ̄·KL(Q|R_λ) + KL(R_λ|S).

The first two terms are independent of S, and the last term is minimized at S = R_λ.

Connection to Djolonga et al. (2020). The Pareto frontier F(P, Q) of (KL(Q|·), KL(P|·)) (defined in the proof of Proposition 2) coincides exactly with the notion of an inclusive divergence frontier, as defined by Djolonga et al. (2020). It follows that the inclusive divergence frontier is related to the divergence curve we have defined as

F(P, Q) = { ( c^{−1} log t_1^{−1}, c^{−1} log t_2^{−1} ) : (t_1, t_2) ∈ C(P, Q) }.

C Additional Details of Experiments

C.1 Training Hyperparameters

Web Text

We use a pre-trained GPT-2 model to generate web text. In addition, we also consider entmax sampling (Martins et al., 2020), which requires finetuning with the entmax loss; we use the authors' code (https://github.com/deep-spin/sparse_text_generation). We finetune GPT-2 large on the training set of web text with the entmax loss for one epoch. The learning rate and optimizer are the defaults used by Martins et al. (2020), i.e., the Adam optimizer with a learning rate that is linearly decayed to zero over the course of training.

Story Continuation

We finetune GPT-2 medium on the training set of the WritingPrompts dataset using the cross-entropy loss for one epoch over the training set.
We use the default optimizer and learning-rate schedule of the HuggingFace Transformers library, i.e., the Adam optimizer. In addition, we also finetune GPT-2 medium with the entmax loss in order to use the entmax sampler (Martins et al., 2020); the finetuning settings are identical to those described in the previous paragraph.

C.2 MAUVE Hyperparameters

MAUVE-k-means

We first run PCA on the data matrix obtained by concatenating the hidden-state representations of the human text and the model text. We keep a fixed fraction of the explained variance and normalize each datapoint to have unit ℓ2 norm. We then run k-means with FAISS for a fixed maximum number of iterations and redos. We represent the human text distribution and the model text distribution by histograms obtained from the cluster memberships.

MAUVE-DRMM

We use the code released by the authors (https://github.com/PerttuHamalainen/DRMM). The number of components per layer and the number of layers together determine the total number of clusters. The learning rate is kept constant at its initial value for the first half of the updates and then annealed quadratically; for more details, see (Hämäläinen and Solin, 2020, Appendix C). For quantization, we assign each point to its most likely cluster.

MAUVE-Lattice

We use the code provided by the authors. We train a feature representation of the hidden states using the triplet loss of Sablayrolles et al. (2019), so that the learnt feature representations are nearly uniformly distributed. We use a multilayer perceptron (MLP) with batch normalization to learn the feature representation; the learning rate is cut twice over the course of training. The learnt feature representations are then quantized using the lattice spherical quantizer. This works as follows: let S_r denote the integral points of the sphere of radius r in R^d. A hidden-state vector x is run through the trained MLP f to get its feature representation f(x), which is quantized to the bucket u given by I(x) = arg min_{u ∈ S_r} ‖r · f(x) − u‖.