Discovering Useful Sentence Representations from Large Pretrained Language Models
Nishant Subramani
Scale AI [email protected]
Nivedita Suresh
Arrive [email protected]
Abstract
Despite the extensive success of pretrained language models as encoders for building NLP systems, they have not seen prominence as decoders for sequence generation tasks. We explore the question of whether these models can be adapted to be used as universal decoders. To be considered "universal," a decoder must have an implicit representation for any target sentence s, such that it can recover that sentence exactly when conditioned on its representation. For large transformer-based language models trained on vast amounts of English text, we investigate whether such representations can be easily discovered using standard optimization methods. We present and compare three representation injection techniques for transformer-based models and three accompanying methods which map sentences to and from this representation space. Experiments show that not only do representations exist for sentences from a variety of genres; more importantly, our methods recover these sentences almost perfectly, without needing complex optimization algorithms and without fine-tuning the underlying language model at all.

Introduction

Recently, pretrained language models such as ELMo, BERT, and T5 have seen widespread success as encoders for a variety of natural language processing tasks, often with little or no finetuning (Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2019). However, this success has not transferred to decoders: most decoders for sequence generation tasks are task-specific and are trained from scratch (Nallapati et al., 2016; Johnson et al., 2017; Aharoni et al., 2019). We explore whether pretrained language models can be modified to be used as "universal" decoders. For a decoder to be considered "universal", it must be able to successfully recover a sentence when conditioned on its implicit sentence representation. Such a decoder would provide many benefits: it would make it possible to train text generation models on small amounts of annotated data, allow considerable parameter sharing in memory- and data-limited environments, and improve zero-shot text generation performance. Imagine you are tasked with building a Kurdish-to-English translation model. You find that there is very little parallel data for this language pair and realize that an end-to-end trainable sequence-to-sequence model cannot be fit well. If you had a universal decoder, you could instead train a Kurdish encoder, which is much smaller than the entire sequence-to-sequence model, and optimize it to work with the universal decoder.

In this work, we take an initial step towards evaluating whether large pretrained language models can be used as universal decoders without fine-tuning. We first define the sentence space of a transformer language model, GPT-2 (Radford et al., 2019), and reparametrize each point in this space to a lower-dimensional point by adding a single bias term z to various locations in the model. Keeping the language model fixed, we optimize z to maximize the likelihood of the original sentence x and recover x from z in order to evaluate how useful the representation is. In other words, we reverse-engineer a sentence representation that generates the target sentence.

Our experiments uncover that we can achieve nearly perfect recoverability with a reparametrized sentence space of dimension equal to the latent dimension of the language model.
That is to say, for nearly all sentences, there exists at least one relatively low-dimensional vector that, by itself, can recover the sentence of interest nearly exactly. Further, we show that this holds for text from a variety of genres, ranging from books to news to movie quotes to Wikipedia. We learn that discovering nearly perfect representations is relatively easy using simple optimization with Adam (Kingma and Ba, 2014), unlike previous work (Subramani et al., 2019). Our experiments show that recoverability increases as the dimensionality of the reparametrized space increases and decreases with increased sentence length, i.e. recoverability is lower for longer sentences. Using PCA, we find that the reparametrized sentence space does not lie on a lower-dimensional linear manifold, which confirms that the intrinsic dimension of the reparametrized space is approximately equal to the latent dimension of the language model.

Figure 1: We add a bias z' based on Equation 2 to three different locations in GPT-2: to the embedding, to the transformer layers, and before the language modeling head. Here 'Embeds' refers to the embedding, 'SA' to self-attention, 'LN' to layer normalization (Ba et al., 2016), 'FFN' to a fully-connected layer, and 'LM Head' to the last fully-connected layer.

Below, we discuss background on transformer-based language models and characterize how these models represent sentences (Vaswani et al., 2017). We show how to reparametrize this space into a lower-dimensional space and define the notion of the recoverability of a sentence in this reparametrized space. We show these for GPT-2, but indicate how our methodology is model-agnostic.

Transformer language models such as GPT-2 represent a sentence x = x_1, ..., x_T as a sequence of hidden states h_1, ..., h_T, which come from the final layer of the transformer model. Since h_i ∈ R^d, where d is the latent dimension of the language model, the model encodes x_1, ..., x_T in a sentence space H ∈ R^{d×T}. Representations in this sentence space are sequence-length dependent, making comparisons between sentences of differing lengths inequitable and making it impossible to measure the efficacy of using an unconditional language model as a universal decoder. To resolve these issues and to make analysis easier, we reparametrize the sentence space into a lower-dimensional, sentence-length agnostic vector space.
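For concreteness, the sketch below shows how this sentence space H can be extracted for a single sentence. It is a minimal sketch assuming the HuggingFace transformers library and PyTorch, which are not named in the paper; it simply collects the final-layer hidden states of GPT-2, one d-dimensional vector per token.

```python
# Minimal sketch: extract the d x T sentence space H of GPT-2 for one sentence.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors of shape (1, T, d);
# the last entry holds the final-layer states h_1, ..., h_T.
H = outputs.hidden_states[-1].squeeze(0).T   # shape (d, T); d = 768 for GPT-2 small
print(H.shape)
```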
We propose to reparametrize the original sentence space H ∈ R^{d×T} to Z ∈ R^{d'}, mapping a sentence-length dependent, high-dimensional vector space into a lower-dimensional, sentence-length agnostic vector space of dimension d'. In our experiments, d' ≤ d. We do this by adding a bias term z ∈ R^{d'} to the fixed language model and finding a ẑ that minimizes the cross-entropy loss of the sentence. We inject z using a projection matrix W_z ∈ R^{d×d'}, which is never trained and is fixed throughout:

$$W_z = [\, I_{d'} ;\; W_{\text{mix}} \,]^\top \qquad (1)$$

Here, W_mix ∈ R^{d'×(d−d')} is a probability weight matrix whose columns sum to 1: we sample each entry from a standard Gaussian and compute a softmax over columns. We randomly permute the independent and dependent components of W_z to avoid an arbitrary, fixed ordering of columns.

Our reparametrization must give us the ability to project a sequence of tokens x = x_1, ..., x_T into a representation z (sentence encoding) and to recover x from z (sentence recovery) via the language model. Without this property, we cannot measure recoverability. Imagine a task-specific encoder trained to produce context for a conditional generation task. The output of such an encoder resembles the z ∈ Z we wish to discover. With our reparametrization approach, we expect z to encode the target sentence via sentence encoding and to regenerate it via sentence recovery.

Representation Injection

We experiment with three z injection locations: the embedding (embed), each layer of the transformer (layers), and the language model head (head). See Figure 1 for details. We also experiment with three representation injection mechanisms that transform z to z' and inject z' into the language model: no ensembling, attention-based ensembling, and interleaved ensembling. Ensembling splits z into k experts and allows those k experts to work together to learn a sentence representation. Here, z is split into a matrix Z ∈ R^{(d'/k)×k} and W_z ∈ R^{d×(d'/k)}. In no ensembling, k = 1, so Z = z. In attention-based ensembling, we use soft attention with the previous layer's hidden state (Bahdanau et al., 2015), allowing the model to learn an adaptive combination of the k vectors per input token. In interleaved ensembling, we use the first vector for the first token, the second for the second token, and so on until we reach the k-th token; after that, we start over again with the first vector. This way, each of the k vectors is responsible for only every k-th token. To do this, we use W_int ∈ R^{T×k}, which consists of ⌈T/k⌉ identity matrices I_k concatenated together, keeping the first T rows. Below are the equations for no ensembling, attention-based ensembling, and interleaved ensembling respectively:

$$Z' = \begin{cases} W_z Z \\ \mathrm{softmax}\big(H_{t-1}\,(W_z Z)\big)\,(W_z Z)^\top \\ W_{\text{int}}\,(W_z Z)^\top \end{cases} \qquad (2)$$

In sentence encoding, we project a sentence x into a representation z via the language model Θ_LM using Equation 2. We estimate z by maximizing the log probability of x while keeping Θ_LM fixed:

$$\hat{z} = \operatorname*{argmax}_{z \in \mathcal{Z}} \sum_{t=1}^{T} \log p\big(x_t \mid x_{<t}, z;\, \Theta_{LM}\big) \qquad (3)$$

In sentence recovery, we decode greedily from the language model conditioned on ẑ to produce a prediction x̂, which we compare against x. We report three recoverability metrics per sentence (EM, PM, and BLEU), averaged over the n random initializations of z, along with per-sentence maxima over those initializations: EM-max, PM-max, and BLEU-max.
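The following sketch, in PyTorch, constructs the fixed projection of Equation 1 and applies the no-ensembling case of Equation 2. Function names such as make_Wz and inject_no_ensembling are illustrative rather than taken from any released code; the single trainable bias z ∈ R^{d'} is mapped to a d-dimensional vector z' that gets added at the chosen injection locations.

```python
import torch

def make_Wz(d: int = 768, d_prime: int = 576, seed: int = 0) -> torch.Tensor:
    """Fixed, untrained projection W_z in R^{d x d'} from Equation 1."""
    g = torch.Generator().manual_seed(seed)
    # W_mix in R^{d' x (d - d')}: standard Gaussian entries, softmax over
    # columns so that each column sums to 1.
    w_mix = torch.softmax(torch.randn(d_prime, d - d_prime, generator=g), dim=0)
    # [I_{d'}; W_mix]^T gives a d x d' matrix; permute its rows so the identity
    # (independent) and mixture (dependent) components are not in a fixed order.
    W = torch.cat([torch.eye(d_prime), w_mix], dim=1).t()
    return W[torch.randperm(d, generator=g)]

def inject_no_ensembling(z: torch.Tensor, W_z: torch.Tensor) -> torch.Tensor:
    """No-ensembling case of Equation 2: z' = W_z z, a single d-dim bias."""
    return W_z @ z

W_z = make_Wz()                            # never trained, fixed throughout
z = torch.zeros(576, requires_grad=True)   # the only trainable parameter
z_prime = inject_no_ensembling(z, W_z)     # shape (768,)
```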
Under the lens of recoverability, we define the intrinsic dimension of the reparametrized sentence space to be the smallest dimension of z (d') that produces a specific target recoverability τ (Bojanowski et al., 2018; Subramani et al., 2019):

$$\hat{d}'(\theta, \tau) = \min_{d'} \big\{\, d' : \mathrm{BLEU}\big(D \mid (d', \theta)\big) > \tau \,\big\} \qquad (4)$$

Here, BLEU is the target recoverability measure for dimension d' and model θ, computed as:

$$\mathrm{BLEU}(D_x \mid \theta, d') = \frac{\sum_{x \in D_x} \sum_{i=1}^{n} \mathrm{BLEU}(\hat{x}_i, x)}{|D_x| \cdot n} \qquad (5)$$

$$\mathrm{BLEU}(D \mid \theta, d') = \frac{1}{|D|} \sum_{D_x \in D} \mathrm{BLEU}(D_x \mid \theta, d') \qquad (6)$$

Here, |D| is the number of corpora, |D_x| is the number of sentences in each corpus, n is the number of different random initializations of z per sentence per corpus, and x̂ is the predicted sentence.

In addition, we analyze the intrinsic dimensionality of Z using principal component analysis, transforming Z ∈ R^{d'} into orthogonal basis vectors. Equipped with these orthogonal bases, we can measure how many components are required to capture a proportion p of the variability in the data using cumulative explained variance.
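As a minimal sketch of this PCA analysis (assuming NumPy and scikit-learn; Z below is a hypothetical array of optimized representations, one row per sentence), the cumulative explained variance curve directly gives the number of components needed to capture a proportion p of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(Z: np.ndarray, p: float = 0.95) -> int:
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches the proportion p."""
    pca = PCA(n_components=min(Z.shape))
    pca.fit(Z)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, p) + 1)

# Example with random data standing in for the recovered z vectors:
Z = np.random.randn(512, 768)
print(components_for_variance(Z, p=0.95))
```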
Data Collection  For experiments on sentence recoverability, we create a dataset which combines four corpora from different genres: movie dialogs (movies), classic books (books), news articles (news), and Wikipedia (wiki). For movies, we choose the Cornell Movie Dialogs corpus (Danescu-Niculescu-Mizil and Lee, 2011), which consists of fictional conversations from 617 raw movie scripts. We choose NLTK's Gutenberg dataset for our books portion, which consists of a subset of texts from Project Gutenberg (Lebert, 2008). Our news subset comes from the Gigaword dataset for abstractive summarization (Graff et al., 2003), consisting of 3.8 million articles. Lastly, our Wikipedia portion comes from WikiText-103 (Merity et al., 2017), a dataset with 28,475 verified articles. For movies, news, and wiki, we extract sentences from each corpus's pre-specified validation set. For books, since NLTK's Gutenberg dataset lacks a pre-specified data split, we consider the entire dataset.

Data Preprocessing  We sentence-tokenize all of our datasets using NLTK's sentence tokenizer. Next, we randomly sample 16 sentences from each corpus, making sure sentences are between 5 and 100 words long according to NLTK's word-level, regular-expression tokenizer. We call this the small recovery corpus (SRC). To construct a larger corpus, the large recovery corpus (LRC), we group sentences by sentence length into 8 bins (5-10, 10-15, 15-20, 20-25, 25-30, 30-35, 35-40, and 40-100) and randomly sample 64 sentences from each bin, ensuring that no sentences overlap between LRC and SRC. Lastly, we create a third corpus, the gibberish recovery corpus (GRC), by sampling tokens uniformly at random with replacement from the GPT-2 vocabulary such that we have 8 gibberish sentences in each of the 8 sentence-length bins above, similarly to Subramani et al. (2019).

Phase I: Experimental Phase  We use SRC to evaluate the best initialization technique (I), injection location (II), and ensembling strategy (III), iteratively and in this order. Refer to Table 1 for details. In these experiments, we use stochastic gradient descent with Adam (Kingma and Ba, 2014) with a learning rate of 0.01, a maximum of 1000 optimization steps, learning-rate decay on a plateau with a patience of 3 and a decay factor of 0.8, a dimensionality of z of 768, and n, the number of random z initializations, of 4. Motivated by observing a few iterations of sentence encoding, we stop optimization early if the learning rate decays below a small threshold. We also stop optimization early if the mean cross-entropy loss reaches a small threshold that depends on the sequence length T. This heuristic is not crucial, but allows experimentation to run quickly without a degradation in performance.

Phase II: Testing Phase  We use LRC to evaluate recoverability in order to estimate the intrinsic dimension of Z (IV). Using the same hyperparameters as in Phase I and choosing the best initialization method, injection location, and ensembling strategy, we estimate the intrinsic dimension of the reparametrized sentence space by varying the dimension of z, d', over 192, 384, 576, and 768.
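A minimal sketch of this sentence-encoding optimization is shown below, assuming HuggingFace transformers and PyTorch. For brevity it injects z' only at the embedding, whereas the experiments here consider the embedding, the transformer layers, and the LM head; the random W_z is a stand-in for the fixed projection of Equation 1, and the loss threshold is a simplification of the length-dependent stopping heuristic.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
for p in model.parameters():          # the language model stays frozen
    p.requires_grad_(False)

ids = tokenizer("It was the best of times.", return_tensors="pt").input_ids
embeds = model.transformer.wte(ids)   # (1, T, d) token embeddings

d, d_prime = model.config.n_embd, 768
W_z = torch.randn(d, d_prime)         # stand-in for the fixed W_z of Equation 1
z = torch.zeros(d_prime, requires_grad=True)

optimizer = torch.optim.Adam([z], lr=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.8, patience=3)

for step in range(1000):
    optimizer.zero_grad()
    out = model(inputs_embeds=embeds + W_z @ z, labels=ids)  # mean token cross-entropy
    out.loss.backward()
    optimizer.step()
    scheduler.step(out.loss.item())
    if out.loss.item() < 0.01:        # simplified early stop
        break
```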
Exp  Init    Location  Ensembling        EM    PM    BLEU  EM-max  PM-max  BLEU-max
I    L2      All       None              98.1  98.4  98.1  100.0   100.0   100.0
     Xavier  All       None              99.0  99.0  98.9  100.0   100.0   100.0
II   Xavier  Embed     None              44.8  44.9  44.6  72.3    72.2    71.9
     Xavier  +Layers   None              98.8  98.8  98.8  100.0   100.0   100.0
     Xavier  Head      None              4.1   3.8   3.3   4.1     3.8     3.3
     Xavier  All       None              99.0  99.0  98.9  100.0   100.0   100.0
III  Xavier  All       Attention (k=2)   82.8  82.2  83.0  97.3    97.3    97.3
     Xavier  All       Attention (k=4)   49.4  49.0  49.5  79.2    79.0    79.9
     Xavier  All       Interleave (k=2)  69.3  68.0  69.7  82.2    81.3    82.6
     Xavier  All       Interleave (k=4)  65.4  65.0  65.4  89.2    89.1    89.2
     Xavier  All       None              99.0  99.0  98.9  100.0   100.0   100.0

Table 1: Recoverability results for Phase I on SRC.

Recoverability on SRC  Experiment I indicates that the initialization strategy does not affect performance significantly, though Xavier normal performs better than L2 normalization. Injection location, on the other hand, has a tremendous effect on performance. Injecting z at the language modeling head alone leads to poor performance, as the final fully-connected layer is severely bottlenecked in terms of capacity (Yang et al., 2018), but injection into the embedding alone allows the transformer model to work with z and learn from it, leading to a 10x improvement over the LM head alone. Above all of this, injecting into the transformer model at every layer, including the embedding, virtually solves the task, achieving nearly perfect recoverability across the board. We theorize that this is because the model sees z at every layer, which makes optimization easier and more stable. We find that additionally injecting into the head leads to a slight further increase in recovery, so we inject z at all three places in all of the following experiments.

Representation injection mechanisms also have a large impact on recovery: both attention-based and interleaved experts perform significantly worse than no experts. These methods suffer from the fact that splitting z into k smaller vectors reduces capacity and makes retaining information more difficult. See Table 1 for details. We find that, regardless of experimental criteria, all six metrics are extremely consistent and correlate nearly perfectly with one another. As a result, we report only mean BLEU scores for the remainder of the experiments.

Intrinsic Dimension via Recoverability  In experiment IV, we estimate the intrinsic dimension of Z. We observe that BLEU increases as d' increases, until at d' = 768 BLEU is nearly perfect, hinting that the intrinsic dimension of Z is approximately 768. However, a lower-dimensional representation can recover most sentences, with recoverability dropping off as sentence length increases; see Figure 2. This is well known: the number of bits needed to encode a sequence grows linearly with its length. We observe low variances in our estimates, especially as d' increases, indicating that the differences in BLEU for different values of d' are statistically significant.

Figure 2: Plot of sentence length vs. BLEU score on LRC for experiment IV with error regions of ±σ.

Figure 3: Cumulative explained variance plot under PCA on LRC with the number of components equal to d' = 768.

Intrinsic Dimension via PCA  We pick the best-performing z under BLEU-max for each sentence from experiment IV with d' = 768 and apply PCA, retaining 768 components (n_comp). We observe that both intrinsic dimension experiments, via PCA and via recoverability, show similar patterns. The shape of the curve in Figure 3 hints that Z does not lie on a lower-dimensional linear manifold and that its intrinsic dimensionality is approximately 768. A number of components close to d' = 576 already explains almost 95% of the data's variance, which supports our observation from experiment IV that d' = 576 achieves nearly perfect BLEU (Figure 2).

Recoverability on GRC  We run the intrinsic dimension experiment on the gibberish dataset (GRC) and find that performance on the real dataset exceeds that on the gibberish dataset at every dimension. This hints that although our representations memorize, they also leverage the language model. Even though BLEU for d' = 576 and d' = 768 on GRC appears high, the error on GRC is 5x that on LRC (Figure 4).

Figure 4: BLEU performance on LRC versus GRC for different dimensionalities of z.

Interpolation  In Figure 6, we show linear interpolations of two pairs of z's that recover sentences exactly. The space is smooth, with well-formed grammatical sentences occupying a broad range of mixture levels λ. Our learned representations seem to have some synonym awareness: "tale" transforms to "story" in the first sentence pair, and "long" transforms to "long-running" when referring to a war. In the second sentence pair, we observe some notion of syntactic awareness: at the 0.7 mixture level, the syntax of the first sentence is retained with mostly words from the second sentence. Lastly, for each individual sentence there exists a fairly large d'-dimensional volume. This could indicate that nearly all sentences have some representative volume from which, if any vector were sampled, sentence recovery could generate that sentence exactly.
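The interpolation itself is just a convex mixture of two optimized representations, as in the short sketch below (PyTorch; recover_sentence is a hypothetical helper wrapping the injection-plus-greedy-decoding recovery step described earlier):

```python
import torch

def interpolate(z_a: torch.Tensor, z_b: torch.Tensor, num_steps: int = 11):
    """Yield (lambda, z) pairs along the line between two representations."""
    for lam in torch.linspace(0.0, 1.0, num_steps):
        yield lam.item(), (1.0 - lam) * z_a + lam * z_b

# for lam, z_mix in interpolate(z_a, z_b):
#     print(f"{lam:.1f}: {recover_sentence(z_mix)}")  # greedy decoding conditioned on z_mix
```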
Figure 5: BLEU performance on LRC stratified by genre for different dimensionalities of z.

Towards a Universal Decoder  We can discover representations that exactly recover target sentences of interest in a low-dimensional space using Adam. Other work found this impossible, achieving low BLEU even for short sentences of fewer than 10 words when applying an analogous technique to LSTM-based language models (Subramani et al., 2019). For sentences of up to 100 words, we discover representations which achieve over 98 BLEU, generalizing to text from a variety of genres (Figure 5). Our representations do not simply memorize, but actually leverage the fixed language model, leading to representations with some interpretability. Lastly, interpolation experiments show that our reparametrized space has some synonym and syntactic awareness, while maintaining a strong prior for sentences to be mostly grammatically correct even in regions near the midpoint between two sentences. As a result, our formulation and representation-space analysis hint that unconditional language models have the potential to be used as universal decoders and that designing an encoder to learn these types of representations may be possible.

Figure 6: Two linear interpolations between perfectly recovered pairs of representations. Pink indicates token overlap with the first sentence, while blue indicates token overlap with the second sentence.

General-purpose Decoders  Large pretrained language models are used for extracting meaningful task-specific representations for different natural language processing tasks (Gulcehre et al., 2015; Zoph et al., 2016; Sriram et al., 2018; Nogueira and Cho, 2019). Other methods pretrain sequence-to-sequence decoders for tasks such as abstractive summarization and neural machine translation (Edunov et al., 2019; Song et al., 2019; Chan et al., 2019). None of these methods analyze sentence representations or evaluate the difficulty of discovering such representations.

Latent Space of Models  Our notion of sentence space resembles work on generative latent optimization, because we also perform inference on an implicit latent variable z, the sentence representation, using a fixed language model θ (Bojanowski et al., 2018). Using ideas about the difficulty of latent-variable optimization and interpolation from prior work on latent-variable language models based on variational autoencoders (Bowman et al., 2016), denoising autoencoders (Lewis et al., 2019), generative adversarial networks (Yu et al., 2017), and plug-and-play models for image and text generation (Nguyen et al., 2017; Dathathri et al., 2019), we develop our notion of the reparametrized sentence space Z and the analyses that follow. We focus on analyzing the sentence space of a fixed pretrained unconditional language model rather than on training or fine-tuning.

Analysis of Language Models  Many works focus on probing language models to understand what they know, evaluating their performance on question-answering or fill-in-the-blank tasks or evaluating how well they transfer to such tasks (Donahue et al., 2020; Tamkin et al., 2020; Hu et al., 2020; Gururangan et al., 2020). We focus on understanding how these models represent sentences, the complexity of that representation, and how easily discoverable those representations are. The goal of identifying the complexity of a sentence representation resembles work that analyzes continuous bag-of-words representations with low-rank subspaces (Mu et al., 2017). Subramanian et al. (2018) learn latent representations based on general-purpose encoders for neural outlines and conclude that these outlines are informative for generation. We focus on a different and more basic question: whether a pretrained language model has the potential to be used as a universal decoder. Recently, there has been work investigating whether LSTM-based language models have sentence representations from which they can recover the original sentence (Subramani et al., 2019). This work is the closest to ours. We extend their work to transformer-based language models and improve upon their reparametrization, leading to representations which are 5x smaller and still achieve nearly perfect recovery across a much greater variety of genres.
Furthermore, we show that our representations are easily discoverable using simple optimization, rather than requiring specialized conjugate gradient methods.

Conclusion

To evaluate whether unconditional language models have the potential to be used as universal decoders without fine-tuning, we introduce a reparametrized sentence space Z. In this space, a sentence is represented as a low-dimensional vector z, which conditions the language model and is optimized so that the model generates that sentence during decoding. We present two methods, sentence encoding and sentence recovery, which allow us to map a sentence to and from Z. Using these procedures, we evaluate whether we can discover representations that recover a sentence nearly perfectly. Further, we measure the intrinsic dimension of Z under the lenses of recoverability and PCA.

We observe that such representations are easily discoverable with simple stochastic optimization, unlike in prior work, even as the genre of the text varies. We find that recoverability increases with the dimension of the reparametrized sentence space, reaching nearly perfect performance when it equals the latent dimension of the model. Experiment IV shows that sentence length and recoverability are inversely related. Analysis using PCA indicates that Z does not lie on a lower-dimensional linear manifold and confirms that the intrinsic dimension of Z is close to the latent dimension d of the language model. Our estimates of the intrinsic dimension are upper bounds, and the associated recoverabilities are lower bounds, due to the non-convexity of the objective function, the stochasticity of the sentence encoding step, and the approximate nature of greedy decoding.

Our sentence representation formulation has many useful properties: nearly perfect recoverability, smoothness in the representation space, and easy representation discovery via simple optimization, indicating the potential for GPT-2 to be used as a universal decoder. As a result, a next step could be to design an encoder that learns mappings from its task-specific input representation space to our reparametrized sentence space. Another avenue for future work could be adapting this approach to other transformer-based language models.

Having a universal decoder could result in tremendous progress for low-resource sequence generation tasks from both a data and a memory perspective. Translation tasks such as Kurdish to English are an ideal use case because they have little parallel data but a target language (English) with abundant monolingual data. Our reparametrized sentence space formulation and the potential of using an unconditional language model as a universal decoder may drive progress in building more generalizable systems with large-scale language models. These models may encode and amplify unwanted biases present in both the data sources and the organizations building them. Many language models are used in commercial NLP applications without much concern for bias mitigation, but our approach could be modified to attempt to mitigate some of these biases. As with sequence generation models broadly, there are significant risks of this research aiding the spread of misinformation. Our work indicates that well-trained large language models have a sentence representation for any well-formed target sentence, so malicious actors could build harmful sequence generation systems for news headline summarization and dialog, to name a few.
Acknowledgments

We gratefully acknowledge Ty Wilkins for many of the visualizations and plots in this paper. We thank members of the Scale AI machine learning team and members of the Arrive engineering team for feedback on iterations of this work.

References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.

Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. ArXiv, abs/1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. 2018. Optimizing the latent space of generative networks. In ICML.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.

William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.

Boxing Chen and Colin Cherry. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In CMCL@ACL.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. In ICLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In ACL.

Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. arXiv preprint arXiv:2005.05339.

Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. arXiv preprint arXiv:1903.09722.

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In ACL.

Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger P. Levy. 2020. A systematic assessment of syntactic generalization in neural language models. arXiv preprint arXiv:2005.03692.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Marie Lebert. 2008. Project Gutenberg (1971–2008).

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. ArXiv, abs/1609.07843.

Jiaqi Mu, Suma Bhat, and Pramod Viswanath. 2017. Representing sentences as low-rank subspaces. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 629–634, Vancouver, Canada. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.

Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR.

Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In ICML.

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. In Interspeech.

Nishant Subramani, Samuel Bowman, and Kyunghyun Cho. 2019. Can unconditional language models recover arbitrary sentences? In NeurIPS.

Sandeep Subramanian, Sai Rajeswar, Alessandro Sordoni, Adam Trischler, Aaron C. Courville, and C. Pal. 2018. Towards text generation with adversarially learned neural outlines. In NeurIPS.

Alex Tamkin, Trisha Singh, Davide Giovanardi, and Noah Goodman. 2020. Investigating transferability in pretrained language models. arXiv preprint arXiv:2004.14975.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In EMNLP.
A Intrinsic Dimensionality Results

We include a table with the recoverability metrics for experiment IV, measuring intrinsic dimension via recoverability, from the original paper, on LRC (the large recovery corpus). The plot in the original paper is consistent with the results in Table 2. Recoverability is highest when the intrinsic dimension is close to the model's hidden dimension, d (768). In Figures 7 and 8, we visualize EM and PM scores for different intrinsic dimensions d' across different sentence lengths. The two plots are very similar to the BLEU vs. sentence length plot provided in the Results section of the paper. Performance metrics for each corpus indicate that average recoverability over sentences is highest for the Movies dataset. This is also consistent with the BLEU-by-genre results we observed in the paper.

Figure 7: Plot of sentence length vs. EM score on LRC for experiment IV with error regions of ±σ.

Figure 8: Plot of sentence length vs. PM score on LRC for experiment IV with error regions of ±σ.

B Interpolation

We provide some more examples of interpolation of sentence representations. In Figure 9, we show another two sentence pairs. On the left, we see the same trends as before, with well-formed, grammatical sentences occupying every level of the interpolation. We observe a mixing of the two sentences when lambda equals 0.5. One interesting finding is that the model outputs "Pacific theater," a very specific historical term used to describe World War II in the Pacific Ocean, and uses it correctly. In the second sentence pair in Figure 9, we observe more synonym awareness, but also further evidence of the nonlinearity of the sentence representation, as the word "Iroquois" is forgotten when lambda equals 0.7 and 0.8. Figure 10 shows a long sentence whose representation, decoded at lambda equal to 0.6, yields thematic and fluent text. Figure 11, however, hints at the nonlinearity of the space, generating gibberish at the end with "B-B-B-B" repeated 24 times.
Dataset   Dimension  EM     PM     BLEU   EM-max  PM-max  BLEU-max
Complete  192        35.10  34.71  35.33  45.11   44.25   45.12
          384        86.33  86.20  86.71  93.90   93.81   94.25
          576        96.19  96.10  96.58  98.50   98.44   98.87
          768        97.99  97.96  98.37  99.32   99.32   99.68
Books     192        34.77  34.25  34.86  44.92   43.88   44.70
          384        85.28  85.14  85.40  92.41   92.28   92.47
          576        96.02  95.83  96.09  98.35   98.12   98.43
          768        97.91  97.90  98.01  99.51   99.50   99.59
News      192        29.52  29.28  30.14  37.17   36.51   37.69
          384        85.87  85.76  86.94  94.16   94.10   95.25
          576        96.25  96.18  97.33  98.01   98.01   99.10
          768        97.38  97.35  98.42  98.20   98.20   99.30
Wiki      192        34.37  33.91  34.36  44.78   43.75   44.49
          384        84.71  84.61  84.76  92.14   92.00   92.12
          576        95.06  94.99  95.15  98.27   98.25   98.28
          768        98.07  98.02  98.14  100.00  100.00  100.00
Movies    192        41.73  41.41  41.95  53.57   52.85   53.59
          384        89.45  89.29  89.76  96.89   96.84   97.16
          576        97.43  97.38  97.75  99.38   99.37   99.65
          768        98.60  98.59  98.91  99.57   99.57   99.84

Table 2: Recoverability results for experiment IV on LRC.