Evaluating representations by the complexity of learning low-loss predictors
William F. Whitney, Min Jae Song, David Brandfonbrener, Jaan Altosaar, Kyunghyun Cho
Courant Institute, New York University
[email protected]
Abstract
We consider the problem of evaluating representations of data for use in solving a downstream task. We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest, and introduce two methods, surplus description length (SDL) and ε sample complexity (εSC). In contrast to prior methods, which measure the amount of information about the optimal predictor that is present in a specific amount of data, our methods measure the amount of information needed from the data to recover an approximation of the optimal predictor up to a specified tolerance. We present a framework to compare these methods based on plotting the validation loss versus training set size (the “loss-data” curve). Existing measures, such as mutual information and minimum description length probes, correspond to slices and integrals along the data-axis of the loss-data curve, while ours correspond to slices and integrals along the loss-axis. We provide experiments on real data to compare the behavior of each of these methods over datasets of varying size, along with a high-performance open-source library for representation evaluation.
One of the first steps in building a machine learning system is selecting a representation of data. Whereas classical machine learning pipelines often begin with feature engineering, the advent of deep learning has led many to argue for pure end-to-end learning where the deep network constructs the features (LeCun et al., 2015). However, huge strides in unsupervised learning (Hénaff et al., 2019; Chen et al., 2020; He et al., 2019; van den Oord et al., 2018; Bachman et al., 2019; Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2019; Brown et al., 2020) have led to a reversal of this trend in the past two years, with common wisdom now recommending that the design of most systems start from a pretrained representation. With this boom in representation learning techniques, practitioners and representation researchers alike face the question: which representation is best for my task?

Simple, traditional means of evaluating representations, such as the validation accuracy of linear probes (Ettinger et al., 2016; Shi et al., 2016; Alain & Bengio, 2016), have been widely criticized (Hénaff et al., 2019; Resnick et al., 2019). So researchers have taken up a variety of alternatives, such as the validation accuracy (VA) of nonlinear probes (Conneau et al., 2018; Hénaff et al., 2019), mutual information (MI) between representations and labels (Bachman et al., 2019; Pimentel et al., 2020), and minimum description length (MDL) of the labels conditioned on the representations (Blier & Ollivier, 2018; Yogatama et al., 2019; Voita & Titov, 2020). We propose two measures to resolve some of the limitations in prior work and propose a simple framework (shown in Figure 1) to contrast the different methods.
Figure 1: Each measure for evaluating representation quality is a simple function of the “loss-data” curve shown here, which plots validation loss of a probe against training dataset size. Left: Validation accuracy (VA), mutual information (MI), and minimum description length (MDL) measure properties of a given dataset, with VA measuring the loss at a finite amount of data, MI measuring it at infinity, and MDL integrating it from zero to n. This dependence on dataset size can lead to misleading conclusions as the amount of available data changes. Middle: Our proposed methods instead measure the complexity of learning a predictor with a particular loss tolerance. ε sample complexity (εSC) measures the number of samples required to reach that loss tolerance, while surplus description length (SDL) integrates the surplus loss incurred above that tolerance. Neither depends on the dataset size.
Right:
A simple example task which illustrates the issue. One representation, which consists of noisy labels, allows quick learning, while the other supports low loss in the limit of data. Evaluating either representation at a particular dataset size risks drawing the wrong conclusion.

Answering the question of “which representation is best for my task?” in a principled manner requires precise notions of “task” and “best”. In this paper, we consider the “task” to be finding a predictor that achieves low population risk on a downstream supervised learning problem, where “low” must be defined on a problem-dependent basis. We define the “best” representation as the one which allows for the most efficient learning of a predictor to solve the task, where we will argue for notions of efficiency in terms of either information or number of samples.

From this perspective, prior methods have clear limitations. VA and MDL measure the amount of information about the optimal predictor that is present in a specific amount of data instead of measuring the amount of information needed from the data to recover an approximation of the optimal predictor up to a specified tolerance. This difference is subtle, but we show (as can be seen in Figure 1) that this makes VA and MDL liable to choose different representations for the same task when given training sets of different sizes. On the other hand, MI measures the lowest loss achievable by any predictor irrespective of the complexity of learning such a function. It is important to note that while these methods do not correspond to our notion of measuring the best representation for a task, there may be different notions of “best” and “task” for which they measure the proper quantity.

To eliminate these issues, we propose two measures. In both of our measures, the user must specify a tolerance ε so that a population loss of less than ε qualifies as solving the task, i.e. approximating the optimal predictor to low error. The first measure is the surplus description length (SDL), which modifies the MDL to measure the complexity of learning an ε approximation of the optimal predictor rather than the complexity of the training dataset. The second is the ε-sample complexity (εSC), which measures the sample complexity of learning an ε approximation of the optimal predictor. We show that these measures resolve the issues with prior work.

We also propose a framework called the loss-data framework, illustrated in Figure 1, that plots the validation loss against the training set size (Talmor et al., 2019; Yogatama et al., 2019; Voita & Titov, 2020). This framework simplifies comparisons between methods. Prior work measures quantities corresponding to integrals (MDL) and slices (VA and MI) along the data-axis. Our work proposes instead measuring integrals (SDL) and slices (εSC) along the loss-axis. This clearly illustrates how prior work makes tacit assumptions about what quantity to measure based on the choice of dataset size. Our work instead makes an explicit, interpretable choice of threshold ε and measures the complexity of solving the task to ε error. We experimentally investigate the behavior of these methods, confirming the sensitivity of VA and MDL to dataset size and illustrating the robustness of SDL and εSC.
Open source library.
To enable reproducible and efficient representation evaluation for representation researchers, we provide a highly optimized library at https://github.com/willwhitney/reprieve. This library enables construction of loss-data curves with arbitrary representations and datasets and is library-agnostic, supporting representations and learning algorithms implemented in any Python ML library. By leveraging the JAX library (Bradbury et al., 2018) to parallelize the training of probes on a single accelerator, this library can construct loss-data curves in around two minutes on one GPU.
In this section we formally present the representation evaluation problem, define our loss-data framework, and show how prior work fits into the framework.
Notation.
We use bold letters to denote random variables. A supervised learning problem is defined by a joint distribution D over observations and labels (X, Y) in the sample space X × Y with density denoted by p. Let the random variable D_n be a sample of n i.i.d. (X, Y) pairs, realized by D_n = (X^n, Y^n) = {(x_i, y_i)}_{i=1}^n. Let R denote a representation space and φ : X → R a representation function. The methods we consider all use parametric probes, which are neural networks p̂_θ : R → P(Y) parameterized by θ ∈ R^d that are trained on D_n to estimate the conditional distribution p(y | x). We often abstract away the details of learning the probe by simply referring to an algorithm A which returns a predictor: p̂ = A(φ(D_n)). Abusing notation, we denote the composition of A with φ by A_φ. Define the population loss and the expected population loss for p̂ = A_φ(D_n), respectively, as

L(A_φ, D_n) = E_{(X,Y)}[−log p̂(Y | X)],    L(A_φ, n) = E_{D_n}[L(A_φ, D_n)].    (1)

In this section we will focus on population quantities, but note that any algorithmic implementation must replace these by their empirical counterparts.

The representation evaluation problem.
The representation evaluation problem asks us to define a real-valued measurement of the quality of a representation φ for solving the task defined by (X, Y). Explicitly, each method defines a real-valued function m(φ, D, A, Ψ) of a representation φ, data distribution D, probing algorithm A, and some method-specific set of hyperparameters Ψ. By convention, minimizing the measure m corresponds to better representations. Defining such a measurement allows us to compare different representations.

The loss-data framework is a lens through which we contrast different measures of representation quality. The key idea, demonstrated in Figure 1, is to plot the loss L(A_φ, n) against the dataset size n. Explicitly, at each n, we train a probing algorithm A using a representation φ to produce a predictor p̂, and then plot the loss of p̂ against n. Similar analysis has appeared in Voita & Titov (2020); Yogatama et al. (2019); Talmor et al. (2019). We observe that we can represent each of the prior measures as points on the curve at fixed x (VA, MI) or integrals of the curve along the x-axis (MDL). Our measures correspond to evaluating points at fixed y (εSC) and integrals along the y-axis (SDL).

Validation accuracy (VA). A simple strategy for evaluating representations is to choose a probe architecture and train it on a limited amount of data from the task and representation of interest (Hénaff et al., 2019; Zhang & Bowman, 2018). On the loss-data curve, this corresponds to evaluation at x = n, so that

m_VA(φ, D, A, n) = L(A_φ, n).    (2)

Mutual information. Mutual information (MI) between a representation φ(X) and targets Y is another often-proposed metric for learning and evaluating representations (Pimentel et al., 2020; Bachman et al., 2019). In terms of entropy, mutual information is equivalent to the information gain about Y from knowing φ(X):

I(φ(X); Y) = H(Y) − H(Y | φ(X)).    (3)

In general mutual information is intractable to estimate for high-dimensional or continuous-valued variables (McAllester & Stratos, 2020), and a common approach is to use a very expressive model for p̂ and maximize a variational lower bound:

I(φ(X); Y) ≥ H(Y) + E_{(X,Y)}[log p̂(Y | φ(X))].    (4)

Since H(Y) is not a function of the parameters, maximizing the lower bound is equivalent to minimizing the negative log-likelihood. Moreover, if we assume that p̂ is expressive enough to represent p and take n → ∞, this inequality becomes tight. As such, MI estimation can be seen as a special case of nonlinear probes as described above, where instead of choosing some particular setting of n we push it to infinity. We formally define the mutual information measure of a representation as

m_MI(φ, D, A) = lim_{n→∞} L(A_φ, n).    (5)

A decrease in this measure reflects an increase in the mutual information. On the loss-data curve, this corresponds to evaluation at x = ∞.
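To make the loss-data curve concrete, the following is a minimal, self-contained sketch (it is not the authors' reprieve library): a synthetic classification task, a fixed random-projection "representation", and a scikit-learn logistic-regression probe stand in for the datasets and MLP probes used in the paper, and VA is read off as a slice of the curve at a chosen n. All names and numbers below are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_pool, n_val, n_classes = 32, 20000, 5000, 10

# Synthetic task: labels depend (noisily) on a linear function of the observations.
W = rng.normal(size=(d, n_classes))
X = rng.normal(size=(n_pool + n_val, d))
y = (X @ W + 2.0 * rng.normal(size=(n_pool + n_val, n_classes))).argmax(axis=1)
X_pool, y_pool, X_val, y_val = X[:n_pool], y[:n_pool], X[n_pool:], y[n_pool:]

R = rng.normal(size=(d, 16)) / np.sqrt(d)
def phi(x):
    # A fixed "representation": here simply a random linear projection of the raw data.
    return x @ R

def expected_loss(n, seeds=5):
    # Estimate L(A_phi, n): average held-out negative log-likelihood of probes
    # trained on n representation-label pairs, averaged over random subsamples.
    losses = []
    for seed in range(seeds):
        idx = np.random.default_rng(seed).choice(n_pool, size=n, replace=False)
        probe = LogisticRegression(max_iter=2000).fit(phi(X_pool[idx]), y_pool[idx])
        proba = np.full((n_val, n_classes), 1e-12)           # floor for classes unseen at small n
        proba[:, probe.classes_] = probe.predict_proba(phi(X_val))
        losses.append(-np.log(proba[np.arange(n_val), y_val]).mean())
    return float(np.mean(losses))

ns = np.unique(np.logspace(1, 4, num=10).astype(int))        # log-spaced dataset sizes
loss_data_curve = [(int(n), expected_loss(n)) for n in ns]    # the loss-data curve
m_va = loss_data_curve[-1][1]                                 # VA: a slice of the curve at a fixed n
print(loss_data_curve)
print("validation loss at n = %d: %.3f" % (ns[-1], m_va))

The MDL measure described next corresponds to a running sum of this same curve, and MI to its limiting value as n grows.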
Minimum description length. Recent studies (Yogatama et al., 2019; Voita & Titov, 2020) propose using the Minimum Description Length (MDL) principle (Rissanen, 1978; Grünwald, 2004) to evaluate representations. These works use an online or prequential code (Blier & Ollivier, 2018) to encode the labels given the representations. The codelength ℓ of Y^n given φ(X^n) is then defined as

ℓ(Y^n | φ(X^n)) = − Σ_{i=1}^n log p̂_i(y_i | φ(x_i)),    (6)

where p̂_i is the output of running a pre-specified algorithm A on the dataset up to element i: p̂_i = A_φ(X_{1:i}, Y_{1:i}). Taking an expectation over the sampled datasets for each i, we define a population variant of the MDL measure (Voita & Titov, 2020) as

m_MDL(φ, D, A, n) = E[ℓ(Y^n | φ(X^n))] = Σ_{i=1}^n L(A_φ, i).    (7)

Thus, m_MDL measures the area under the loss-data curve on the interval x ∈ [0, n].
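As a concrete illustration of the prequential code in Equation (6), here is a minimal, self-contained sketch. The synthetic data, chunk schedule, and logistic-regression probe are stand-ins (the paper uses MLP probes on real tasks), and the first chunk is charged log K nats per label, i.e. a uniform code, following the common convention for prequential codes (Blier & Ollivier, 2018).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d, n, n_classes = 32, 4000, 10
W = rng.normal(size=(d, n_classes))
X = rng.normal(size=(n, d))
y = (X @ W + 2.0 * rng.normal(size=(n, n_classes))).argmax(axis=1)

def phi(x):                      # the representation under evaluation (identity here)
    return x

# Chunk boundaries, log-uniformly spaced, as in the chunked / online code.
boundaries = np.unique(np.logspace(1, np.log10(n), num=8).astype(int))

codelength = boundaries[0] * np.log(n_classes)     # first chunk: uniform code (in nats)
for start, stop in zip(boundaries[:-1], boundaries[1:]):
    # Encode the next chunk with a probe trained on everything transmitted so far.
    probe = LogisticRegression(max_iter=2000).fit(phi(X[:start]), y[:start])
    proba = np.full((stop - start, n_classes), 1e-12)
    proba[:, probe.classes_] = probe.predict_proba(phi(X[start:stop]))
    codelength += -np.log(proba[np.arange(stop - start), y[start:stop]]).sum()

print("prequential description length of the labels: %.1f nats" % codelength)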
Each of the prior methods, VA, MDL, and MI, has limitations that we attempt to resolve with our methods. In this section we present these limitations. In Section 3.1, we describe a toy example which demonstrates why evaluation metrics that depend on the training set size, like VA and MDL, can be misleading. Then in Section 3.2 we argue that MI, which does not depend on the training set size, can be misleading as well, since it is insensitive to the quality of the representation.

As seen in Section 2.2, the representation quality measures of VA and MDL both depend on n, the size of the training set. Because of this dependence, the ranking of representations given by these evaluation metrics can change as n increases. As a result, favoring one representation over others by comparing these metrics at arbitrary n may lead to premature decisions in the machine learning pipeline, since evaluating on a larger training set could give a different representation ranking.
A theoretical example. Let s ∈ {0, 1}^d be a fixed binary vector and consider a data generation process where the {0, 1} label of a data point is given by the parity on s, i.e., y_i = ⟨x_i, s⟩ mod 2, where y_i ∈ {0, 1} and x_i ∈ {0, 1}^d. Let Y^n = {y_i}_{i=1}^n be the given labels and consider the following two representations.

1. Noisy label: z_i = ⟨x_i, s⟩ + e_i mod 2, where e_i ∈ {0, 1} is a random bit with bias α < 1/2.
2. Raw data: x_i.

For the noisy label representation, guessing y_i = z_i achieves validation accuracy of 1 − α for any n, which is information-theoretically optimal. On the other hand, the raw data representation will achieve perfect validation accuracy once the training set contains d linearly independent x_i's. In this case, Gaussian elimination will exactly recover s. The probability that a set of n > d random vectors in {0, 1}^d does not contain d linearly independent vectors decreases exponentially in n − d. Hence, the expected validation accuracy for n sufficiently larger than d will be exponentially close to 1. As a result, the representation ranking given by validation accuracy and description length favors the noisy label representation when n ≪ d, but the raw data representation will be much better in these metrics when n ≫ d. This can be misleading.

Although this is a concocted example for illustration purposes, our experiments in Section 5 show that dependence of representation rankings on n does occur in practice.

MI considers the lowest validation loss achievable with the given representation. This ignores any concerns about statistical or computational complexity of achieving such accuracy, which leads to some counterintuitive properties:

1. MI is insensitive to statistical complexity. Two random variables which are perfectly predictive of one another have maximal MI, though their relationship may be sufficiently complex that it requires exponentially many samples to verify (McAllester & Stratos, 2020).
2. MI is insensitive to computational complexity. For example, the mutual information between an intercepted encrypted message and the enemy's plan is high (Shannon, 1948; Xu et al., 2020), despite the extreme computational cost required to break the encryption.
3. MI is insensitive to representation. By the data processing inequality (Cover & Thomas, 2006), any φ applied to X can only decrease its mutual information with Y; no matter the query, MI always reports that the raw data is at least as good as the best representation.

As a result, we believe that in most settings MI is an undesirable metric for evaluating representations.

All three prior methods lack a predefined notion of successfully solving the task. They will return an ordering of representations regardless of whether or not this order is meaningful. Ultimately, we care about achieving high predictive accuracy on the given task. We would not even care about the rankings of representations if all gave terrible validation loss. That is, there is often an implicit minimum requirement for the validation loss a representation should achieve for it to be considered meaningful. As we will see in the next section, our methods make this requirement explicit.

Surplus description length and ε sample complexity

The methods discussed above measure a property of the data, such as the attainable accuracy on n points, by learning an unspecified function. Instead, we propose to precisely define the function of interest and measure its complexity using data.
Fundamentally we shift from making a statement about the inputs of an algorithm, like validation accuracy and MDL do, to imposing a constraint on the outputs of the algorithm.

Surplus description length (SDL). Imagine trying to efficiently encode a large number of samples of a random variable e which takes values in {1, …, K} with probability p(e). An optimal code for these events has expected length E[ℓ(e)] = E_{e∼p}[−log p(e)] = H(e). If this data is instead encoded using a probability distribution p̂, the expected length (in nats) becomes H(e) + D_KL(p ‖ p̂). We call D_KL(p ‖ p̂) the surplus description length (SDL) from encoding according to p̂ instead of p:

D_KL(p ‖ p̂) = E_{e∼p}[log p(e) − log p̂(e)].    (8)

When the true distribution p is a delta mass, the entire length of a code under p̂ is surplus since log 1 = 0.

Recall that the prequential code for estimating MDL computes the description length of the labels given observations in a dataset by iteratively creating tighter approximations p̂_1 … p̂_n and integrating the area under the curve. Examining Equation (7), we see that

m_MDL(φ, D, A, n) = Σ_{i=1}^n L(A_φ, i) ≥ Σ_{i=1}^n H(Y | φ(X)).    (9)

If H(Y | φ(X)) > 0, the description length grows without bound as n increases. We instead propose to measure the complexity of an approximate labeling function p(Y | φ(X)) by computing the surplus description length of encoding an infinite stream of data according to the online code instead of the true conditional distribution.

Definition 1 (Surplus description length of online codes). Given random variables X, Y ∼ D, a representation function φ, and a learning algorithm A, define

m_SDL(φ, D, A) = Σ_{i=1}^∞ E_{X,Y}[−log p̂_i(Y | φ(X)) + log p(Y | X)]    (10)
             = Σ_{i=1}^∞ [L(A_φ, i) − H(Y | X)].    (11)

We generalize this definition to measure the complexity of learning an approximating conditional distribution with loss ε, rather than the true conditional distribution only:

Definition 2 (Surplus description length of online codes with an arbitrary baseline). Given random variables X, Y ∼ D, a representation function φ, a learning algorithm A, and a loss tolerance ε ≥ H(Y | X), define

m_SDL(φ, D, A, ε) = Σ_{i=1}^∞ [L(A_φ, i) − ε]_+,    (12)

where [c]_+ denotes max(0, c).

In our framework, the surplus description length corresponds to computing the area between the loss-data curve and a baseline set by y = ε. Whereas MDL measures the complexity of a sample of n points, SDL measures the complexity of a function which solves the task to ε tolerance.
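As a small worked example of the quantity being accumulated (numbers chosen purely for illustration): suppose e is a fair coin flip, so p = (1/2, 1/2) and H(e) = ln 2 ≈ 0.693 nats, but we encode it with p̂ = (0.9, 0.1). The expected code length per sample is then E_{e∼p}[−ln p̂(e)] = (−ln 0.9 − ln 0.1)/2 ≈ 1.204 nats, of which D_KL(p ‖ p̂) = (ln(0.5/0.9) + ln(0.5/0.1))/2 ≈ 0.511 nats are surplus. SDL accumulates exactly this kind of excess over the sequence of predictors p̂_1, p̂_2, … produced by the learning algorithm, with Definition 2 replacing the baseline H(Y | X) by the user-chosen tolerance ε.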
Estimating the SDL. Naively computing SDL would require unbounded data and the estimation of L(A_φ, i) for every i. However, if we assume that algorithms are monotonically improving so that L(A_φ, i + 1) ≤ L(A_φ, i), SDL only depends on i up to the first point where L(A_φ, n) ≤ ε. Approximating this integral in practice can be done efficiently by taking a log-uniform partition of the dataset size and computing the Riemann sum as in Voita & Titov (2020). Crucially, if the tolerance ε is set too low or the maximum amount of available data is insufficient, an implementation may report that the given complexity estimate is only a lower bound.

In Appendix A we provide a detailed algorithm for estimating surplus description length, along with a theorem proving its data requirements as a function of the sample complexity and desired confidence.
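A minimal sketch of this procedure, operating on an already-estimated loss-data curve; the dataset sizes and loss values below are made-up placeholders rather than measured results, and the left-edge Riemann sum and lower-bound reporting follow the description above.

def surplus_description_length(ns, losses, eps):
    # ns: increasing dataset sizes; losses: estimated L(A_phi, n) at those sizes.
    # Area between the loss-data curve and y = eps, left-edge Riemann sum.
    sdl = 0.0
    for (n_lo, n_hi), loss in zip(zip(ns[:-1], ns[1:]), losses[:-1]):
        sdl += (n_hi - n_lo) * max(loss - eps, 0.0)
    tight = losses[-1] <= eps      # assumes the loss is non-increasing in n
    return sdl, tight

# Hypothetical measurements (illustrative numbers only).
ns     = [1,    10,   100,  1000, 10000]
losses = [2.30, 1.80, 0.90, 0.35, 0.21]

for eps in [1.0, 0.2]:
    sdl, tight = surplus_description_length(ns, losses, eps)
    print(f"eps={eps}: SDL {'=' if tight else '>'} {sdl:.1f} nats")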
ε sample complexity (εSC)

In addition to the surplus description length, we suggest the use of a second, simpler complexity measure: sample complexity.

Definition 3 (Sample complexity of an ε-loss predictor). Given random variables X, Y ∼ D, a representation function φ, a learning algorithm A, and a loss tolerance ε ≥ H(Y | φ(X)), define

m_εSC(φ, D, A, ε) = min{n ∈ ℕ : L(A_φ, n) ≤ ε}.    (13)

Sample complexity measures the complexity of learning an ε-loss predictor by the number of samples it takes a given algorithm to find it. In our framework, sample complexity corresponds to taking a horizontal slice of the loss-data curve at y = ε, and in this sense it is analogous to VA. Whereas VA makes a statement about the data (by setting n) and reports the accuracy of some function given that data, sample complexity specifies the desired function and determines its complexity by how many samples are needed to learn it.
Estimating the εSC. Given an assumption that algorithms are monotonically improving such that L(A_φ, n + 1) ≤ L(A_φ, n), εSC can be estimated efficiently. With n finite samples of training data, an algorithm may estimate εSC by splitting the data into k uniform-sized bins and estimating L(A_φ, in/k) for i ∈ {1, …, k}. By recursively performing this search on the interval which contains the transition from L > ε to L < ε, an algorithm can rapidly reach a precise estimate or report that m_εSC(φ, D, A, ε) > n.

A more detailed examination of the algorithmic considerations of estimating εSC can be found in Appendix B.
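A minimal sketch of this coarse-to-fine search, assuming access to a roughly monotone estimate of L(A_φ, n); the loss_at function below is a synthetic stand-in for actually training probes, and all constants are illustrative.

import math

def loss_at(n):
    # Synthetic stand-in for "train a probe on n samples and measure validation loss".
    return 2.3 * math.exp(-n / 800.0) + 0.15

def epsilon_sample_complexity(loss_at, eps, n_max, bins=10, rounds=3):
    # Returns a measured n with loss <= eps (an upper bound on eps-SC, refined over a
    # few rounds of grid search), or None if eps-SC exceeds the available data.
    lo, hi = 1, n_max
    if loss_at(n_max) > eps:
        return None
    for _ in range(rounds):
        step = max((hi - lo) // bins, 1)
        for n in range(lo + step, hi + 1, step):
            if loss_at(n) <= eps:          # first bin whose right edge reaches the tolerance
                lo, hi = n - step, n
                break
        if step == 1:
            break
    return hi

for eps in [1.0, 0.2]:
    n_star = epsilon_sample_complexity(loss_at, eps, n_max=50000)
    print(f"eps={eps}: eps-SC <= {n_star}" if n_star else f"eps={eps}: eps-SC > 50000")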
Using objectives other than negative log-likelihood.
Our exposition of εSC uses negative log-likelihood for consistency with other methods, such as MDL, which require it. However, it is straightforward to extend εSC to work with whatever objective function is desired, under the assumption that said objective is monotone with increasing data when using algorithm A.
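For instance, thresholding a curve of validation error rates instead of log-likelihoods yields an εSC in exactly the same way (the error rates below are hypothetical):

ns          = [10,   32,   100,  316,  1000, 3162]
error_rates = [0.41, 0.30, 0.21, 0.12, 0.08, 0.05]

eps = 0.10
eps_sc = next((n for n, err in zip(ns, error_rates) if err <= eps), None)
result = eps_sc if eps_sc is not None else f"> {ns[-1]}"
print(f"eps-SC at eps={eps}: {result}")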
[Table and plot for Figure 2: validation loss, MDL, SDL (ε = 1 and ε = 0.2), and εSC (ε = 1 and ε = 0.2) for the CIFAR, Pixels, and VAE representations at n = 60 and n = 20398, together with the corresponding loss-data curves.]
Figure 2: Results using three representations on the MNIST dataset. We omit MI as for any finite amount of data, the MI measure is the same as the validation loss.

We empirically demonstrate the issue of sensitivity to dataset size with two experiments on real data. For the first, shown in Figure 2, we train probes to solve MNIST from three representations: (1) the last hidden layer of a small convolutional network pretrained on CIFAR-10; (2) raw pixels; and (3) a variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) trained on MNIST. At small n, the VAE representation would appear to be the best according to VA and MDL, but as the amount of data grows, both prefer the CIFAR representation. By contrast, SDL and εSC are aware when they do not have sufficient data to estimate complexity. Even with very little data available, they estimate the complexity of a 1.0-loss function; however, they only bound the complexity of a 0.2-loss function until they have more data. Once SDL and εSC estimate a quantity, such as the complexity of learning a 1.0-loss predictor from the VAE representation, it does not change with n; we suggest that this may be a useful property for an evaluation benchmark.

For the second experiment, we use the part-of-speech task introduced by Hewitt & Liang (2019) as implemented by Voita & Titov (2020), with the same probe architecture and other hyperparameters as those works. We compare the representations given by different layers of a pretrained ELMo (Peters et al., 2018) model. As shown in Figure 3, we find the ranking of the three representations according to VA and MDL changes as the amount of data increases. Meanwhile the measures computed by SDL and εSC reflect the insufficient data in the low-data regime, since the sample complexity is greater than the amount of available training data.

Details of the experiments, including representation training, probe architectures, and hyperparameters, are available in Appendix C.
[Table and plot for Figure 3: validation loss, SDL (ε = 0.5 and ε = 0.1), and εSC (ε = 0.5 and ε = 0.1) for ELMo layers 0, 1, and 2 at n = 461 and n = 474838, together with the corresponding loss-data curves.]
Figure 3: Results using three representations on a part of speech task. We omit MI as for any finite amount of data, the MI measure is the same as the validation loss.
Zhang & Bowman (2018) and Hewitt & Liang (2019) propose random baselines for linguistic tasks to provide context for how much linguistic structure is readily accessible in representations. To show separation between the validation accuracy achieved by these random baselines and representations pretrained on genuine linguistic labels, they had to limit the amount of training data or restrict the capacity of probes. As an alternative, Voita & Titov (2020) propose using the MDL framework, which accounts for the “effort of learning” required by the probes to achieve high validation accuracy, to demonstrate the separation between pretrained representations and random baselines. An earlier work by Yogatama et al. (2019) also uses prequential codes to evaluate representations for linguistic tasks. Talmor et al. (2019) look at the loss-data curve (called “learning curve” in their work) and use a weighted average of the validation loss at various training set sizes to evaluate representations.
In this work we have introduced the loss-data framework for comparing representation evaluation measures and used it to diagnose the issue of sensitivity to dataset size in the validation accuracy and minimum description length measures. We proposed two measures, surplus description length and ε sample complexity, which eliminate this issue by measuring the complexity of learning a predictor which solves the task of interest to ε tolerance. Empirically we showed that sensitivity to dataset size occurs in practice for VA and MDL, while SDL and εSC are robust to the amount of available data and are able to report when it is insufficient to make a judgment.

Each of these measures depends on a choice of algorithm A, including hyperparameters such as probe architecture, which could make the evaluation procedure less robust. To alleviate this, future work might consider a set of algorithms A = {A_i}_{i=1}^K and a method of combining them, such as the model switching technique of Blier & Ollivier (2018); Erven et al. (2012) or a Bayesian prior.

Finally, while existing measures such as VA, MI, and MDL do not measure our notion of the best representation for a task, under other settings they may be the correct choice. For example, if only a fixed set of data will ever be available, selecting representations using VA might be a reasonable choice; and if unbounded data is available for free, perhaps MI is the most appropriate measure. However, in many cases the robustness and interpretability offered by SDL and εSC make them a practical choice for practitioners and representation researchers alike.

Broader Impact
Improving the tools for evaluating representations could lead to better pretrained representations and increasing use of those pretrained models. Given that large-scale learning from scratch is expensive in terms of energy and thus also carbon output (Strubell et al., 2019), making the use of small models on top of pretrained representations more appealing could reduce the carbon footprint of machine learning. However, the training of such large-scale unsupervised models is itself costly.

Work on evaluating representations can be applied not only in the service of building better models, but also inspecting the contents of their learned representations. The field of machine learning fairness focuses on questions of bias in learned models. Recent work (Kurita et al., 2019; Bordia & Bowman, 2019) aims to examine whether the representations in deep learning models contain protected information which might be used in downstream decision-making. Whether or not it would be the correct choice, measures such as ours might be used to compare how readily such protected information can be accessed from a given representation. A major downside is that, just like any other metric that relies on the test accuracy/loss, our measures do not reveal any subtle, undesirable aspects of representation, such as gender bias. We caution users of the proposed approach that the proposed measures need to be complemented with in-depth qualitative analysis in order to avoid any undesirable outcome of representation choice.

As always, we note that machine learning improvements come in the form of “building machines to do X better”. For a sufficiently malicious or ill-informed choice of X, such as surveillance or recidivism prediction, almost any progress in machine learning might indirectly lead to a negative outcome, and our work is not excluded from that.
References
Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. International Conference on Learning Representations, 2016.

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, 2019.

Blier, L. and Ollivier, Y. The description length of deep learning models. In Advances in Neural Information Processing Systems, 2018.

Bordia, S. and Bowman, S. R. Identifying and reducing gender bias in word-level language models. In North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 7–15, 2019.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. E. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Annual Meeting of the Association for Computational Linguistics, pp. 2126–2136, 2018.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley, 2006.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, pp. 4171–4186, 2019.

Erven, T., Grünwald, P., and Rooij, S. Catching up faster by switching sooner: A predictive approach to adaptive estimation with an application to the AIC-BIC dilemma. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 74, 2012.

Ettinger, A., Elgohary, A., and Resnik, P. Probing for semantic evidence of composition by means of simple classification tasks. In Workshop on Evaluating Vector-Space Representations for NLP, pp. 134–139, 2016.

Grünwald, P. A tutorial introduction to the minimum description length principle. arXiv preprint math:0406077, 2004.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

Hénaff, O. J., Razavi, A., Doersch, C., Eslami, S. M. A., and van den Oord, A. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

Hewitt, J. and Liang, P. Designing and interpreting probes with control tasks. In Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing, pp. 2733–2743, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Kurita, K., Vyas, N., Pareek, A., Black, A. W., and Tsvetkov, Y. Measuring bias in contextualized word representations. In Workshop on Gender Bias in Natural Language Processing, pp. 166–172, 2019.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

McAllester, D. and Stratos, K. Formal limitations on the measurement of mutual information. International Conference on Artificial Intelligence and Statistics, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics, pp. 2227–2237, 2018.

Pimentel, T., Valvoda, J., Maudslay, R. H., Zmigrod, R., Williams, A., and Cotterell, R. Information-theoretic probing for linguistic structure. arXiv preprint arXiv:2004.03061, 2020.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Resnick, C., Zhan, Z., and Bruna, J. Probing the state of the art: A critical look at visual representation evaluation. arXiv preprint arXiv:1912.00215, 2019.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

Rissanen, J. Modeling by shortest data description. Automatica, 14:465–471, 1978.

Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423, 1948.

Shi, X., Padhi, I., and Knight, K. Does string-based neural MT learn source syntax? In Empirical Methods in Natural Language Processing, pp. 1526–1534, 2016.

Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in NLP. In ACL, 2019.

Talmor, A., Elazar, Y., Goldberg, Y., and Berant, J. oLMpics – on what language model pre-training captures. arXiv preprint arXiv:1912.13283, 2019.

van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Voita, E. and Titov, I. Information-theoretic probing with minimum description length. arXiv preprint arXiv:2003.12298, 2020.

Xu, Y., Zhao, S., Song, J., Stewart, R., and Ermon, S. A theory of usable information under computational constraints. In International Conference on Learning Representations, 2020.

Yogatama, D., d'Autume, C. d. M., Connor, J., Kocisky, T., Chrzanowski, M., Kong, L., Lazaridou, A., Ling, W., Yu, L., Dyer, C., et al. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373, 2019.

Zhang, K. and Bowman, S. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 359–361, 2018.

Appendix A Algorithmic details for estimating surplus description length
Recall that the SDL is defined as

m_SDL(φ, D, A, ε) = Σ_{n=1}^∞ [L(A_φ, n) − ε]_+.    (14)

For simplicity, we assume that L is bounded in [0, 1]. Note that this can be achieved by truncating the cross-entropy loss.

Algorithm 1: Estimate surplus error
Input: tolerance ε, max iterations M, number of datasets K, representation φ, data distribution D, algorithm A
Output: estimate m̂ of m(φ, D, ε, A) and indicator I of whether this estimate is tight or a lower bound
  Sample K datasets D^k_M ∼ D of size M + 1
  for n = 1 to M do
    For each k ∈ [K], run A on D^k_M[1 : n] to produce a predictor p̂^k_n
    Take K test samples (x^k, y^k) = D^k_M[M + 1]
    Evaluate L̂_n = (1/K) Σ_{k=1}^K ℓ(p̂^k_n, x^k, y^k)
  Set m̂ = Σ_{n=1}^M [L̂_n − ε]_+
  if L̂_M ≤ ε/2 then set I = tight else set I = lower bound
  return m̂, I

In our experiments we replace D^k_M[1 : n] of Algorithm 1 with sampled subsets of size n from a single training set. Additionally, we use between 10 and 20 values of n instead of evaluating L(A_φ, n) at every integer between 1 and M. This strategy, also used by Blier & Ollivier (2018) and Voita & Titov (2020), corresponds to the description length under a code which updates only periodically during transmission of the data instead of after every single point.
Theorem 4. Let the loss function L be bounded in [0, 1] and assume that it is decreasing in n. With (M + 1)K datapoints, if the sample complexity is less than M, the above algorithm returns an estimate m̂ such that with probability at least 1 − δ,

|m̂ − m(φ, D, ε, A)| ≤ M √(log(2M/δ) / (2K)).    (15)

If K ≥ 2 log(1/δ) / ε² and the algorithm returns tight, then with probability at least 1 − δ the sample complexity is less than M and the above bound holds.

Proof. First we apply a Hoeffding bound to show that each L̂_n is estimated well. For any n, we have

P(|L̂_n − L(A_φ, n)| > √(log(2M/δ) / (2K))) ≤ 2 exp(−2K · log(2M/δ) / (2K)) = 2δ / (2M) = δ/M,    (16)

since each ℓ(p̂^k_n, x^k, y^k) is an independent random variable, bounded in [0, 1], with expectation L(A_φ, n).

Now, when the sample complexity is less than M, we use a union bound to translate this to a high-probability bound on the error of m̂, so that with probability at least 1 − δ:

|m̂ − m(φ, D, ε, A)| = | Σ_{n=1}^M [L̂_n − ε]_+ − [L(A_φ, n) − ε]_+ |    (17)
  ≤ Σ_{n=1}^M | [L̂_n − ε]_+ − [L(A_φ, n) − ε]_+ |    (18)
  ≤ Σ_{n=1}^M | L̂_n − L(A_φ, n) |    (19)
  ≤ M √(log(2M/δ) / (2K)).    (20)

This gives us the first part of the claim.

We want to know that when the algorithm returns tight, the estimate can be trusted (i.e. that we set M large enough). Under the assumption of large enough K, and by an application of Hoeffding's inequality, we have that

P(L(A_φ, M) − L̂_M > ε/2) ≤ exp(−Kε²/2) ≤ exp(−(2 log(1/δ)/ε²) · ε²/2) = δ.    (21)

If L̂_M ≤ ε/2, this means that L(A_φ, M) ≤ ε with probability at least 1 − δ. By the assumption of decreasing loss, this means the sample complexity is less than M, so the bound on the error of m̂ holds.
Appendix B Algorithmic details for estimating sample complexity

Recall that ε sample complexity (εSC) is defined as

m_εSC(φ, D, A, ε) = min{n ∈ ℕ : L(A_φ, n) ≤ ε}.    (22)

We estimate m_εSC via recursive grid search. To be more precise, we first define a search interval [1, N], where N is a large enough number such that L(A_φ, N) ≪ ε. Then, we partition the search interval into 10 sub-intervals and estimate the risk of the hypothesis learned from D_n ∼ D^n with high confidence for each sub-interval. We then find the leftmost sub-interval that potentially contains m_εSC and proceed recursively. This procedure is formalized in Algorithm 2 and its guarantee is given by Theorem 5.
Theorem 5. Let the loss function L be bounded in [0, 1] and assume that it is decreasing in n. Then, Algorithm 2 returns an estimate m̂ that satisfies m_εSC(φ, D, A, ε) ≤ m̂ with probability at least 1 − δ.

Proof. By Hoeffding's inequality, the probability that |L̂_n − L(A_φ, n)| ≥ ε/2, where L̂_n is computed with S = 2 log(20k/δ)/ε² independent draws of D_n ∼ D^n and (x, y) ∼ D, is less than δ/(10k). The algorithm terminates after evaluating L̂ on at most 10k different n's. By a union bound, the probability that |L̂_n − L(A_φ, n)| ≤ ε/2 for all n used by the algorithm is at least 1 − δ. Hence, L̂_n ≤ ε/2 implies L(A_φ, n) ≤ ε with probability at least 1 − δ.

Algorithm 2: Estimate sample complexity via recursive grid search
Input: search upper limit N, parameters ε and k, confidence parameter δ, data distribution D, and learning algorithm A.
Output: estimate m̂ such that m_εSC(φ, D, A, ε) ≤ m̂ with probability 1 − δ.
  Let S = 2 log(20k/δ)/ε², and let [ℓ, u] be the search interval initialized at ℓ = 1, u = N.
  for r = 1 to k do
    Partition [ℓ, u] into 10 equispaced bins and let Δ be the length of each bin.
    for j = 1 to 10 do
      Set n = ℓ + jΔ.
      Compute L̂_n = (1/S) Σ_{i=1}^S ℓ(A(D^i_n), x_i, y_i) for S independent draws of D_n and test samples (x, y).
      if L̂_n ≤ ε/2 then
        Set u = n and ℓ = n − Δ; break
  return m̂ = u, which satisfies m_εSC(φ, D, A, ε) ≤ m̂ with probability 1 − δ, where the randomness is over independent draws of D_n and test samples (x, y).

Appendix C Experimental details

In each experiment we first estimate the loss-data curve using a fixed number of dataset sizes n and multiple random seeds, then compute each measure from that curve. Reported values of SDL correspond to the estimated area between the loss-data curve and the line y = ε using Riemann sums with the values taken from the left edge of the interval. This is the same as the chunking procedure of Voita & Titov (2020) and is equivalent to the code length of transmitting each chunk of data using a fixed model and switching models between intervals. Reported values of εSC correspond to the first measured n at which the loss is less than ε.

All of the experiments were performed on a single server with 4 NVidia Titan X GPUs, and on this hardware no experiment took longer than an hour. All of the code for our experiments, as well as that used to generate our plots and tables, is included in the supplement.
For our experiments on MNIST, we implement a highly-performant vectorized library in JAX toconstruct loss-data curves. With this implementation it takes about one minute to estimate theloss-data curve with one sample at each of 20 settings of n . We approximate the loss-data curves at20 settings of n log-uniformly spaced on the interval [10 , and evaluate loss on the test set toapproximate the population loss. At each dataset size n we perform the same number of updates tothe model; we experimented with early stopping for smaller n but found that it made no differenceon this dataset. In order to obtain lower-variance estimates of the expected risk at each n , we run 8random seeds for each representation at each dataset size, where each random seed corresponds to arandom initialization of the probe network and a random subsample of the training set.Probes consist of two-hidden-layer MLPs with hidden dimension 512 and ReLU activations. Allprobes and representations are trained with the Adam optimizer (Kingma & Ba, 2015) with learningrate − .Each representation is normalized to have zero mean and unit variance before probing to ensurethat differences in scaling and centering do not disrupt learning. The representations of the data weevaluate are implemented as follows. Raw pixels.
The raw MNIST pixels are provided by the Pytorch datasets library (Paszke et al.,2019). It has dimension ×
28 = 784 . CIFAR.
The CIFAR representation is given by the last hidden layer of a convolutional neuralnetwork trained on the CIFAR-10 dataset. This representation has dimension 784 to match the size ofthe raw pixels. The network architecture is as follows: nn.Conv2d(1, 32, 3, 1),nn.ReLU(),nn.MaxPool2d(2),nn.Conv2d(32, 64, 3, 1),nn.ReLU(),nn.MaxPool2d(2), n.Flatten(),nn.Linear(1600, 784)nn.ReLU()nn.Linear(784, 10)nn.LogSoftmax() VAE.
The VAE (variational autoencoder; Kingma & Welling (2014); Rezende et al. (2014)) repre-sentation is given by a variational autoencoder trained to generate the MNIST digits. This VAE’slatent variable has dimension 8. We use the mean output of the encoder as the representation of thedata. The network architecture is as follows: self.encoder_layers = nn.Sequential(nn.Linear(784, 400),nn.ReLU(),nn.Linear(400, 400),nn.ReLU(),nn.Linear(400, 400),nn.ReLU(),)self.mean = nn.Linear(400, 8)self.variance = nn.Linear(400, 8)self.decoder_layers = nn.Sequential(nn.Linear(8, 400),nn.ReLU(),nn.Linear(400, 400),nn.ReLU(),nn.Linear(400, 784),)
C.2 Part of speech experiments
We follow the methodology and use the official code of Voita & Titov (2020) for our part of speech experiments using ELMo (Peters et al., 2018) pretrained representations. In order to obtain lower-variance estimates of the expected risk at each n, we run 4 random seeds for each representation at each dataset size, where each random seed corresponds to a random initialization of the probe network and a random subsample of the training set. We approximate the loss-data curves at 10 settings of n log-uniformly spaced over the range of the available data, starting from n = 10. To more precisely estimate εSC, we perform one recursive grid search step: we space 10 settings over the range which in the first round saw L(A_φ, n) transition from above to below ε.

Probes consist of the MLP-2 model of Hewitt & Liang (2019); Voita & Titov (2020) and all training parameters are the same as in those works.

https://github.com/lena-voita/description-length-probing