Nonlinear Information Bottleneck
Artemy Kolchinsky
Santa Fe Institute
[email protected]

Brendan D. Tracey
Santa Fe Institute; Massachusetts Institute of Technology
[email protected]

David H. Wolpert
Santa Fe Institute; Massachusetts Institute of Technology; Arizona State University
[email protected]
Abstract
Information bottleneck [IB] is a technique for extracting information in some 'input' random variable that is relevant for predicting some different 'output' random variable. IB works by encoding the input in a compressed 'bottleneck variable' from which the output can then be accurately decoded. IB can be difficult to compute in practice, and has been mainly developed for two limited cases: (1) discrete random variables with small state spaces, and (2) continuous random variables that are jointly Gaussian distributed (in which case the encoding and decoding maps are linear). We propose a method to perform IB in more general domains. Our approach can be applied to discrete or continuous inputs and outputs, and allows for nonlinear encoding and decoding maps. The method uses a novel upper bound on the IB objective, derived using a non-parametric estimator of mutual information and a variational approximation. We show how to implement the method using neural networks and gradient-based optimization, and demonstrate its performance on the MNIST dataset.
1 Introduction

Imagine that we are provided with two random variables, an 'input' random variable X and an 'output' random variable Y, and that we wish to use X to predict Y. As an example, consider a meteorological scenario, in which X represents recorded data (wind-speed, precipitation, etc.) and Y represents the weather forecast for following days. In many cases, it is useful to extract the information in X that is relevant for predicting Y, e.g., to find the specific (combination of) meteorological features that predict the weather.

This problem is formally considered by the 'information bottleneck' [IB] method [1–3]. Assume that X and Y are jointly distributed according to some P(x, y). IB posits a "bottleneck" variable M which is related to X by the stochastic function P(m|x), called the encoding map. Given M, predictions of Y can be made using the decoding map,

  P(y|m) := P(m, y) / P(m) = [ ∫ P(m|x) P(x, y) dx ] / [ ∫∫ P(m|x) P(x, y′) dx dy′ ] .  (1)

Note that decoding is done with the assumption that M is conditionally independent of Y given X. This guarantees that any information present in M about Y is extracted from X.

The optimal encoding map is selected by minimizing the IB objective, which balances compression of X and accurate prediction of Y,

  L_IB := β I(X;M) − I(Y;M) .

Here I(·;·) is mutual information [6] and β ∈ [0, 1] is a parameter that controls the trade-off between compression and prediction. Thus, IB finds the encoding map which minimizes mutual information between X and M (i.e., M maximally compresses X), while maximizing mutual information between M and Y (i.e., M optimally predicts Y). When β is large, IB will favor maximal compression of X; this can be achieved by making M completely independent of X (thus also from Y). When β is small, IB will favor solutions in which M captures maximal information about Y. In the limit β → 0, M will recover the minimal sufficient statistics in X for Y [7].

(The IB objective is sometimes stated [1, 4, 5] in a different but equivalent form, L_IB := I(X;M) − βI(Y;M). Note also that values β > 1 can be ignored, since these values lead only to 'trivial' solutions. This is because M − X − Y form a Markov chain, and the data processing inequality [6] states that I(X;M) ≥ I(Y;M). Thus, for β > 1, the optimal possible value for L_IB is 0, which can be achieved by making M independent of X.)

The following example illustrates a possible use-case of IB. Consider a situation where observations of X are made at one physical location while the prediction of Y is made at a different physical location. If these two locations are connected by a low-capacity channel, then it is desirable to transmit as little data as possible between them. For instance, suppose that a remote weather station is making detailed recordings of meteorological data, which are then sent to a central server and used to make probabilistic predictions about weather conditions for the next day. If the channel between the weather station and server has low capacity, then it is important that the information transmitted from the weather station to the server is highly compressed. Minimizing the IB objective amounts to finding a compressed representation (M) of meteorological data (X), which can then be transmitted across a low-capacity channel and used to optimally predict future weather (Y). Note that it is also possible to use other measures of compression, such as the Shannon entropy H(M) rather than I(X;M) [8], though such considerations are outside the scope of the present work.
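As a concrete illustration (not part of the original paper), the following minimal sketch evaluates the decoding map of Eq. (1) and the objective L_IB for a small discrete joint distribution and a hand-picked encoding map; all numbers and variable names are illustrative:

```python
import numpy as np

# Toy joint distribution P(x, y): 3 input states, 2 output states.
P_xy = np.array([[0.30, 0.05],
                 [0.05, 0.30],
                 [0.15, 0.15]])
P_x = P_xy.sum(axis=1)

# An arbitrary encoding map P(m | x): 3 input states -> 2 bottleneck states.
P_m_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9],
                        [0.5, 0.5]])

def mutual_information(P_ab):
    """I(A;B) in bits for a joint distribution over (a, b)."""
    P_a = P_ab.sum(axis=1, keepdims=True)
    P_b = P_ab.sum(axis=0, keepdims=True)
    mask = P_ab > 0
    return float((P_ab[mask] * np.log2(P_ab[mask] / (P_a @ P_b)[mask])).sum())

# Joint distributions induced by the encoder (M is cond. indep. of Y given X).
P_xm = P_x[:, None] * P_m_given_x                      # P(x, m)
P_my = P_m_given_x.T @ P_xy                            # P(m, y), numerator of Eq. (1)
P_y_given_m = P_my / P_my.sum(axis=1, keepdims=True)   # decoding map, Eq. (1)

beta = 0.3
L_IB = beta * mutual_information(P_xm) - mutual_information(P_my)
print(P_y_given_m, L_IB)
```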
Unfortunately, it is generally intractable to find the encoding maps that optimize the IB objective. This is because it can be very difficult to evaluate the integrals in Eq. (1), as well as the mutual information terms in the IB objective function. For this reason, until now IB has been mainly developed for two limited cases. The first case is where the random variables have a small number of discrete outcomes [1]. There, computation of Eq. (1) and the mutual information terms, as well as optimization of L_IB, can be done by explicitly representing all entries of the conditional probability distribution P(m|x) (a sketch of the classic iterative scheme for this case is given below). The second case is when X and Y are continuous-valued and jointly Gaussian distributed [4]. Here, the IB optimization problem can be solved analytically, and the resulting optimal encoding and decoding maps are linear.

In this work, we propose a method for performing IB in much more general settings, which we call nonlinear information bottleneck, or nonlinear IB for short. Our method assumes that M is a continuous-valued random variable, but X and Y can be either discrete (possibly with many states) or continuous, and with any joint distribution. Furthermore, as indicated by the term nonlinear IB, the encoding and decoding maps can be nonlinear.

To implement nonlinear IB, we represent the encoding and decoding maps parametrically, and then minimize an upper bound on the IB objective using gradient-based optimization. Our approach makes use of the following techniques:

• We represent the distributions over X and Y using a finite number of data samples.
• We optimize the encoding and decoding maps within some parametric family of densities. The decoding map will not generally equal the integral in Eq. (1), but it can be used to derive an upper bound on the term −I(Y;M) in L_IB.
• We use a non-parametric estimator of mutual information to get an upper bound on the term I(X;M) in L_IB.

In the next section, we describe nonlinear IB in detail, and explain its implementation using off-the-shelf neural network software. In Section 4, we demonstrate it on the MNIST dataset of hand-drawn digits.
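As a brief aside, here is the sketch of the discrete case promised above, using the self-consistent iterative updates of [1]. It is written in the equivalent convention L = I(X;M) − β′ I(Y;M) with β′ = 1/β; the toy distribution and all names are illustrative, and this is not part of the proposed method:

```python
import numpy as np

rng = np.random.default_rng(0)
P_xy = rng.dirichlet(np.ones(4), size=8)       # joint P(x, y): 8 x-states, 4 y-states
P_xy /= P_xy.sum()
P_x = P_xy.sum(axis=1)
P_y_given_x = P_xy / P_x[:, None]

n_m, beta_prime = 8, 5.0                       # bottleneck states; beta' = 1/beta
P_m_given_x = rng.dirichlet(np.ones(n_m), size=8)    # random initial encoder

for _ in range(200):
    P_m = P_x @ P_m_given_x                            # marginal P(m)
    P_my = P_m_given_x.T @ P_xy                        # joint P(m, y)
    P_y_given_m = P_my / P_my.sum(axis=1, keepdims=True)
    # KL( P(y|x) || P(y|m) ) for every (x, m) pair, in nats.
    kl = np.einsum('xy,xmy->xm', P_y_given_x,
                   np.log(P_y_given_x[:, None, :] / P_y_given_m[None, :, :]))
    # Self-consistent update: P(m|x) proportional to P(m) exp(-beta' * KL).
    logits = np.log(P_m)[None, :] - beta_prime * kl
    P_m_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
    P_m_given_x /= P_m_given_x.sum(axis=1, keepdims=True)
```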
It should be noted that an analogy between IB and neural networks has been previously explored [9]. However, the approach proposed here is not an analogy between IB and neural nets, but rather a direct connection demonstrating that the latter can be used to perform the former. In addition, after an earlier formulation of our approach appeared at the RNN Symposium at the NIPS'16 conference [10], we became aware of three recent papers that similarly propose novel ways of performing IB using upper bounds and gradient-based optimization [11–13]. The major difference between our work and these approaches is in the approximation of the compression term, I(X;M). This distinction is discussed in more detail in Section 3. In that section, we also relate our approach to other previous work in machine learning.

2 Proposed approach

In the following, we use H(·) for Shannon entropy, I(·;·) for mutual information [MI], D(·‖·) for Kullback-Leibler [KL] divergence, and C(·‖·) for cross-entropy. All information-theoretic quantities are in units of bits, and all logs are base-2.

Let the input X and the output Y be distributed according to some joint distribution Q(x, y), with marginals indicated by Q(y) and Q(x). We assume that we are provided with a 'training dataset' D = {(x₁, y₁), ..., (x_N, y_N)}, which contains N input-output pairs sampled IID from Q(x, y). Finally, let M indicate the bottleneck variable, with states in R^d. For simplicity, we assume that X and Y are continuous-valued, though our approach extends immediately to the discrete case (with some integrals replaced by sums).

2.1 Upper bound on the IB objective

Let the conditional probability P_θ(m|x), where θ is a vector of parameters, indicate the encoding map from input X to the bottleneck variable M. The IB objective, as a function of the encoding map parameters, is written as:

  L_IB(θ) := β I_θ(X;M) − I_θ(Y;M) .  (2)

In this equation, the first MI term is computed using the joint distribution Q_θ(x, m) := P_θ(m|x) Q(x), while the second MI term is computed using the joint distribution

  Q_θ(y, m) := ∫ P_θ(m|x) Q(x, y) dx .  (3)

We would like to find the encoding map which optimizes the objective, θ⋆ = argmin_θ L_IB(θ). Unfortunately, in many cases this optimization problem is intractable. This is due to the difficulty of computing the integral in Eq. (3) and the MI terms of Eq. (2). However, one can perform approximate IB by minimizing an upper bound on L_IB. Here we derive such an upper bound, and show how it can be optimized.

First, consider some parameterized conditional probability P_φ(y|m) of output given bottleneck, where φ is a vector of parameters. For any such P_φ(y|m), the non-negativity of KL divergence defines a 'variational' lower bound on the second MI term in Eq. (2),

  I_θ(Y;M) = H(Q(Y)) − H(Q_θ(Y|M))
           ≥ H(Q(Y)) − H(Q_θ(Y|M)) − D(Q_θ(Y|M) ‖ P_φ(Y|M))  (4)
           = H(Q(Y)) − C(Q_θ(Y|M) ‖ P_φ(Y|M)) .  (5)

For this reason, we call P_φ(y|m) the variational decoding map. This variational decoding map serves as a tractable approximation to the difficult-to-compute 'optimal' decoding map Q_θ(y|m). This provides us with the following upper bound on the IB objective,

  L_IB(θ) ≤ β I_θ(X;M) − H(Q(Y)) + C(Q_θ(Y|M) ‖ P_φ(Y|M)) .  (6)

The entropy term H(Q(Y)) does not depend on the parameter values, and is thus irrelevant for optimization. The cross-entropy term can be estimated easily from data, as we show below. Finally, consider minimizing the RHS of Eq. (6) as a function of φ. Note that this is equivalent to minimizing the KL divergence between P_φ(y|m) and Q_θ(y|m) (Eq. (4)), and will therefore select the P_φ(y|m) which is 'closest' to the optimal decoding map Q_θ(y|m) within the parametric family. A small numerical check of this bound is sketched below.
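The following sketch (not from the original paper; the toy distribution is arbitrary) checks numerically that the variational bound of Eq. (5) holds for any decoder P_φ(y|m), and is tight when P_φ(y|m) equals the optimal decoder Q_θ(y|m):

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    return float(-(p[p > 0] * np.log2(p[p > 0])).sum())

# Arbitrary joint Q(m, y) over 5 bottleneck states and 3 output states.
Q_my = rng.dirichlet(np.ones(3), size=5) * rng.dirichlet(np.ones(5))[:, None]
Q_m = Q_my.sum(axis=1)
Q_y = Q_my.sum(axis=0)
Q_y_given_m = Q_my / Q_m[:, None]            # the 'optimal' decoder

I_YM = entropy(Q_y) + entropy(Q_m) - entropy(Q_my.ravel())

# Cross-entropy term C(Q(Y|M) || P(Y|M)) for an arbitrary variational decoder.
P_y_given_m = rng.dirichlet(np.ones(3), size=5)
C = float(-(Q_my * np.log2(P_y_given_m)).sum())
assert entropy(Q_y) - C <= I_YM + 1e-12      # Eq. (5): lower bound holds

# With the optimal decoder, the bound is tight.
C_opt = float(-(Q_my * np.log2(Q_y_given_m)).sum())
assert abs((entropy(Q_y) - C_opt) - I_YM) < 1e-9
```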
The main remaining challenge is to estimate the first MI term in L_IB, I_θ(X;M). We now provide a tractable upper bound on this MI term that is based on two assumptions.

First, we assume that the encoding map P_θ(m|x) is the sum of a deterministic differentiable function f_θ(x) and Gaussian noise with covariance matrix σ²I,

  P_θ(m|x) := p_normal(m; f_θ(x), σ²I) ,  (7)

where p_normal(·; μ, Σ) is the p.d.f. of a multivariate Gaussian with mean μ and covariance matrix Σ. Note that σ is considered as one of the parameters in θ, and hence is optimized.

The second assumption we make is that the distribution of f_θ(X) can be approximated as a finite mixture of Gaussians [MoG]. Specifically, this MoG contains one Gaussian component for each training data point i = 1..N in D, with mean f_θ(x_i) and covariance matrix ηI. The parameter η is chosen using cross-validation to maximize the leave-one-out log likelihood of the training data [14],

  η(θ) = argmax_s Σ_i log [ (1/(N−1)) Σ_{j≠i} (2πs)^{−d/2} exp( −‖f_θ(x_i) − f_θ(x_j)‖² / (2s) ) ] ,  (8)

where we've made the dependence of η on θ explicit. Under mild assumptions, MoG models converge to the true distribution as the size of the training dataset grows [15].

To summarize, we've assumed that the distribution of f_θ(X) is a MoG, with each component having covariance η(θ)I. M is then computed by adding Gaussian noise with covariance σ²I to f_θ(X) (Eq. (7)). This means that M is distributed as a MoG, with each component having covariance (η(θ) + σ²)I,

  p(M = m) := (1/N) Σ_{i=1}^{N} p_normal(m; f_θ(x_i), (η(θ) + σ²)I) .

We now propose the following upper bound on the mutual information between X and M:

  I_θ(X;M) = H(M) − H(P_θ(M|X)) ≤ Î_θ^D ,  (9)

where Î_θ^D is defined as

  Î_θ^D := −(1/N) Σ_i log [ (1/N) Σ_j exp( −‖f_θ(x_i) − f_θ(x_j)‖² / (2(η(θ) + σ²)) ) ] − (d/2) log( σ² / (η(θ) + σ²) ) ,  (10)

and (as before) d is the dimensionality of M. We use two techniques to derive this upper bound. First, we note that H(P_θ(M|X)) is the entropy of a multivariate Gaussian with covariance σ²I, which has a simple closed-form expression [6]. Second, we bound H(M) using a non-parametric upper bound on the entropy of a mixture, described in detail in [16]. Combining these two leads to Î_θ^D [16]. This bound can be understood as a 'corrected' kernel-based MI estimator.

Note that Î_θ^D is a differentiable function of θ, and thus can be optimized using gradient-based methods. Furthermore, as shown in [16], when f_θ maps the dataset into several well-separated clusters, this bound becomes an exact estimate of the empirical MI (a commonly-encountered solution to the optimization problem posed here).

Combining Eq. (6) and Eq. (9) provides a tractable upper bound,

  L_IB(θ) ≤ L̂_IB(θ, φ) := β Î_θ^D + C(Q_θ(Y|M) ‖ P_φ(Y|M)) + const ,

where H(Q(Y)) has been absorbed into 'const'. Nonlinear IB seeks parameter values that minimize this upper bound,

  θ⋆, φ⋆ = argmin_{θ,φ} L̂_IB(θ, φ) .  (11)

(A numerical sketch of the estimator Î_θ^D is given below.)
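The following is a minimal NumPy sketch (not from the paper's released code; all names are illustrative) of the estimator Î_θ^D in Eq. (10), with the leave-one-out bandwidth selection of Eq. (8) approximated here by a simple grid search:

```python
import numpy as np
from scipy.spatial.distance import cdist

def loo_bandwidth(F, candidates):
    """Eq. (8): pick kernel variance s maximizing leave-one-out log likelihood."""
    N, d = F.shape
    sq = cdist(F, F, 'sqeuclidean')
    best_s, best_ll = None, -np.inf
    for s in candidates:
        K = np.exp(-sq / (2 * s)) * (2 * np.pi * s) ** (-d / 2)
        np.fill_diagonal(K, 0.0)                     # leave-one-out
        ll = np.log(K.sum(axis=1) / (N - 1)).sum()
        if ll > best_ll:
            best_s, best_ll = s, ll
    return best_s

def mi_upper_bound(F, sigma2, eta):
    """Eq. (10): kernel-based upper bound on I(X; M), in bits."""
    N, d = F.shape
    sq = cdist(F, F, 'sqeuclidean')
    inner = np.exp(-sq / (2 * (eta + sigma2))).mean(axis=1)
    return (-np.log2(inner).mean()
            - (d / 2) * np.log2(sigma2 / (eta + sigma2)))

# Example: bottleneck activations f_theta(x_i) for N = 500 samples, d = 2.
rng = np.random.default_rng(0)
F = rng.normal(size=(500, 2))
eta = loo_bandwidth(F, candidates=np.logspace(-3, 1, 20))
print(mi_upper_bound(F, sigma2=0.1, eta=eta))
```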
2.2 Implementation

We show that the nonlinear IB optimization problem (Eq. (11)) can be carried out using a finite-sized training set and neural network (NN) optimization techniques.

The encoding map P_θ(m|x), as specified by Eq. (7), is computed in the following way: first, several NN layers implement the (possibly nonlinear) deterministic function f_θ(x). The output of these layers is then added to zero-centered Gaussian noise with covariance σ²I, which becomes the state of the bottleneck layer. The parameter vector θ specifies all relevant connection weights and biases of these layers, as well as the noise variance σ². Note that due to the presence of noise, the neural network is stochastic: even with parameters held constant, different states of the bottleneck layer would be sampled during different NN evaluations.

The decoding map P_φ(y|m) is also computed by the neural network. First, the bottleneck layer states are passed through several deterministic layers of the network, whose parameters (connections and biases) are specified by the vector φ. The log decoding probability log P_φ(y|m) is then computed using an appropriately-chosen neural network cost function (e.g., squared error for continuous Y, cross-entropy for discrete Y [17]).

Given a finite training dataset D = {(x₁, y₁), ..., (x_N, y_N)}, the optimal encoding and decoding maps are determined by minimizing

  L̂_IB(θ, φ) ≈ β Î_θ^D − (1/N) Σ_{i=1}^{N} log P_φ(y_i | m_i) ,

where m_i is sampled from P_θ(m|x_i), and Î_θ^D is specified in Eq. (10). Note that this finite-sample approximation converges to L̂_IB as the training dataset size N → ∞.

All terms in the above approximation are differentiable. Thus, a local optimum of L̂_IB can be found by gradient descent. However, there are several important caveats. In practice, we compute the gradient of Σ_{i=1}^{N} log P_φ(y_i|m_i) using stochastic gradient descent (SGD) with mini-batches of size n_SGD. The gradient of the term Î_θ^D is also computed using SGD, but using different mini-batches of size n_MI. Generally, we choose n_MI > n_SGD, because having a larger n_MI significantly improves the estimate of mutual information in high-dimensional spaces (i.e., large d), while having a smaller n_SGD improves generalization performance [18]. In addition, note that the 'width' of the MI estimator, η, is continually updated as the optimization proceeds.

In practice, we carry out the following procedure (a code sketch is given below):

1. A mini-batch D_SGD of size n_SGD and a mini-batch D_MI of size n_MI are randomly sampled from the training dataset D.
2. Holding θ and φ fixed, an optimizer selects the best value of η, according to Eq. (8) computed over D_MI.
3. For each input x_i in D_SGD, a value of m_i is sampled from P_θ(m|x_i). These samples are used to compute the stochastic gradient of −Σ_i log P_φ(y_i|m_i) w.r.t. θ and φ. To this is added β times the stochastic gradient of Î_θ^D w.r.t. θ, computed over D_MI. A step is taken in the direction of this combined gradient.
4. The process repeats.

Note that training is more effective when σ is initially small, so that information about the gradient of −Σ_i log P_φ(y_i|m_i) is not completely destroyed by noise during early training.
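A minimal PyTorch sketch of this training loop follows. This is illustrative only (the paper's released code uses Keras/Theano/TensorFlow): the data here are random placeholders, η is fit by a fixed grid search, the MI term is computed in nats, and the tiny architecture is chosen purely for brevity:

```python
import torch
import torch.nn as nn

d, beta = 2, 0.1
enc = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, d))
dec = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 10))
log_sigma = nn.Parameter(torch.tensor(-2.0))          # noise std, optimized
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters())
                       + [log_sigma], lr=1e-3)

def mi_hat(f, sigma2, eta):
    # Eq. (10) on a mini-batch (in nats for simplicity).
    sq = torch.cdist(f, f) ** 2
    inner = torch.exp(-sq / (2 * (eta + sigma2))).mean(dim=1)
    return -inner.log().mean() - (d / 2) * torch.log(sigma2 / (eta + sigma2))

def loo_eta(f, candidates):
    # Eq. (8) approximated by grid search over candidate kernel variances.
    sq = torch.cdist(f, f) ** 2
    best, best_ll = candidates[0], -float('inf')
    for s in candidates:
        K = torch.exp(-sq / (2 * s)) * (2 * torch.pi * s) ** (-d / 2)
        K.fill_diagonal_(0.0)
        ll = torch.log(K.sum(dim=1) / (len(f) - 1)).sum().item()
        if ll > best_ll:
            best, best_ll = s, ll
    return best

loss_fn = nn.CrossEntropyLoss()
for step in range(1000):
    # Step 1: sample mini-batches (random placeholder data here).
    x_sgd, y_sgd = torch.randn(128, 784), torch.randint(10, (128,))
    x_mi = torch.randn(1000, 784)
    sigma2 = torch.exp(2 * log_sigma)
    with torch.no_grad():                              # step 2: fit eta
        eta = loo_eta(enc(x_mi), [0.01, 0.1, 1.0])
    m = enc(x_sgd) + sigma2.sqrt() * torch.randn(128, d)   # step 3: sample m_i
    loss = loss_fn(dec(m), y_sgd) + beta * mi_hat(enc(x_mi), sigma2, eta)
    opt.zero_grad(); loss.backward(); opt.step()       # step 4: repeat
```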
3 Relation to prior work

3.1 Kernel-based estimates of information-theoretic quantities

One important idea in our approach is the use of a differentiable, kernel-based estimator of mutual information, Î_θ^D. See [19–23] for related ideas about using neural networks to optimize non-parametric estimates of information-theoretic functions. This technique can also be related to the estimation of held-out data likelihood in deep learning models using kernel-based estimators (e.g., [24]). However, in these approaches, held-out data likelihood is estimated only once, as a diagnostic measure once learning is complete. We propose to instead directly optimize these non-parametric estimators.

3.2 Variational IB approaches

After the appearance of an earlier formulation of this approach [10], we became aware of three recent papers that also propose methods for performing IB for continuous, possibly non-Gaussian random variables [11–13]. In this section, we compare these papers with the approach presented here.

As in our work, these papers propose tractable upper bounds on the L_IB objective function which can be optimized using neural-network-based methods. They employ the same variational bound for the MI term I_θ(Y;M) (Eq. (5)) as we do,

  I_θ(Y;M) ≥ H(Q(Y)) − C(Q_θ(Y|M) ‖ P_φ(Y|M)) ,

where P_φ(y|m) is the variational decoding map. These methods (and ours) differ, however, in their treatment of I_θ(X;M). These methods bound I_θ(X;M) using a parametric marginal distribution over the bottleneck variable, R_α(m), where α is some vector of parameters. This gives a variational bound for I_θ(X;M),

  I_θ(X;M) = H(Q_θ(M)) − H(P_θ(M|X))
           ≤ C(Q_θ(M) ‖ R_α(M)) − H(P_θ(M|X))
           = D(P_θ(M|X) ‖ R_α(M)) .  (12)

Combining leads to the following variational bound for L_IB,

  L_IB ≤ L_VIB := β D(P_θ(M|X) ‖ R_α(M)) + C(Q_θ(Y|M) ‖ P_φ(Y|M)) + const .  (13)

The three aforementioned papers differ in how they define the approximate marginal distribution R_α(m). In [12], R_α(m) is a standard multivariate normal distribution, R(m) := p_normal(m; 0, I). In [11], R_α(m) is a product of Student-t distributions, R_α(m) := Π_{i=1}^{d} p_Student(m_i; 0, w_i, ν_i), where w_i and ν_i specify the scale and shape parameters of the i-th dimension. The parameters w_i and ν_i are encoded in α and optimized during learning, in this way tightening the bound in Eq. (12). In [13], two approximating distributions are considered: the improper log-uniform, R(log m) := c, and the log-normal, R(log m) := p_normal(log m; μ, σ²). Additionally, the encoding map consists of a deterministic function with multiplicative noise. Finally, in [11, 12], the encoding map P_θ(m|x) is treated as a deterministic function plus Gaussian noise, similarly to how it is treated here.

In a sense, our proposed methodology also approximates the distribution over the bottleneck variable. However, rather than use a parametric variational distribution, our approximation is based on a non-parametric kernel-based estimator, and as mentioned, the estimate converges to the true Q_θ(M) in the large-sample limit.

These alternative methods have potential advantages and disadvantages compared to our approach. On one hand, they are more computationally efficient: our non-parametric estimator of I_θ(X;M) requires O(n²) operations per SGD batch (where n is the size of the mini-batch), while the variational bound of Eq. (12) requires O(n) operations. On the other hand, our non-parametric estimator is expected to give a better estimate of the MI I_θ(X;M). In fact, in [16], we show that our bound is tight, or near tight, in many cases of interest. This improved estimate of the mutual information will improve the optimization of the compression term (i.e., I_θ(X;M)), and may lead to significant performance improvement in some cases. A thorough comparison of these different approaches remains for future work (see also the Results section).
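For comparison with Eq. (10), the following sketch (illustrative, not taken from any of the cited papers' code) computes the compression term of Eq. (12) for the choice made in [12], where P_θ(m|x) = N(f_θ(x), σ²I) and R(m) = N(0, I); the KL divergence then has a closed form and costs O(n) per batch:

```python
import numpy as np

def vib_compression_term(F, sigma2):
    """Mean over samples of D( N(f_i, sigma2*I) || N(0, I) ), in bits."""
    N, d = F.shape
    # Closed-form Gaussian KL (in nats): 0.5*(||mu||^2 + d*s - d - d*ln s).
    kl_nats = 0.5 * ((F ** 2).sum(axis=1) + d * sigma2 - d - d * np.log(sigma2))
    return float(kl_nats.mean() / np.log(2))

rng = np.random.default_rng(0)
F = rng.normal(size=(500, 2))          # bottleneck means f_theta(x_i)
print(vib_compression_term(F, sigma2=0.1))
```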
3.3 Auto-encoders and variational auto-encoders

The central idea of IB is to find a mapping from inputs to outputs while using a "compressed" intermediate representation. This idea has several precedents in the machine learning literature.

One important example is work on auto-encoders. Auto-encoders are machine learning architectures that attempt to reconstruct a copy of the input X, while using some restricted 'intermediate representations' (typically encoded in the activity of a hidden layer in a neural network). Auto-encoders are conceptually similar to IB, and can be understood as optimizing the encoding map P_θ(m|x) and decoding map P_φ(y|m) where X = Y (i.e., input data is compressed into an intermediate representation, from which the same input data can then be uncompressed). Auto-encoders may also regularize the encoding map, e.g., by placing information-theoretic penalties on the coding length of hidden layer activity [25, 26]. This idea has also been explored in a supervised learning scenario in [27]. In that work, however, hidden layer states were treated as discrete-valued, limiting the flexibility and information capacity of hidden representations. More recently, denoising auto-encoders [28] have attracted attention. Denoising auto-encoders constrain the amount of information between input and hidden layers by injecting noise into the hidden layer activity, somewhat similar to our noisy mapping between input and bottleneck layers. Existing work on auto-encoders has considered either penalizing hidden layer coding length or injecting noise into the map, rather than combining the two as we do here. Because of this, denoising auto-encoders do not have a notion of "optimal" noise level on the training data (since less noise will always improve prediction error on the training data), and thus cannot directly adapt the noise level.

Variational auto-encoders [VAE] [29] are another recent machine-learning architecture. VAEs use neural network techniques to perform unsupervised learning, i.e., to learn generative models from data. VAEs postulate a 'latent variable' M (in our language, a bottleneck variable) distributed according to some R_α(m), where α indicates parameters. Samples of the latent variable are then mapped to observed data space X according to

  P_{α,φ}(x) = ∫ P_φ(x|m) R_α(m) dm ,

where P_φ is a conditional probability density parameterized by φ. Using our terminology, P_φ can be understood as a decoding map given the assumption that X = Y.

Given a dataset D = {x₁, ..., x_N}, VAE attempts to choose parameters φ and α that maximize the marginal likelihood P_{α,φ} of the samples in D. However, computing the marginal likelihood of data is generally intractable. For this reason, VAE minimizes the following upper bound on the negative log likelihood of the data,

  L_VAE(θ, φ, α) = D(P_θ(M|X) ‖ R_α(M)) + C(Q_θ(X|M) ‖ P_φ(X|M)) ,

where Q_θ(x|m) is the Bayesian inverse of P_θ(m|x),

  Q_θ(x|m) := P_θ(m|x) Q(x) / ∫ P_θ(m|x′) Q(x′) dx′ .

As was observed [12, 13], this objective is identical to L_VIB (Eq. (13)), assuming β = 1 and X = Y. Interestingly, by working back through Eq. (12), one can show that the optimal R_α(m) for this objective is

  R⋆(m) = ∫ P_θ(m|x) Q(x) dx .

Fixing this optimal R⋆(m), the VAE upper bound becomes

  L_VAE(θ, φ) = I(X;M) + C(Q_θ(X|M) ‖ P_φ(X|M))
              ≤ Î_θ^D + C(Q_θ(X|M) ‖ P_φ(X|M)) ,

where in the second line we've used the estimator used in nonlinear IB (Eq. (10)).

Thus, the method proposed in this work can also be used to provide a tractable upper bound on the negative log likelihood in VAE. This suggests that our approach may offer a novel way of learning VAE-based generative models from data. Exploring its potential in this domain remains for future work.

4 Results

We demonstrate our approach on the MNIST dataset of images of hand-drawn digits. This dataset contains a set of 60,000 training images and 10,000 testing images. Each image is 28-by-28 pixels, and is classified into one of 10 classes corresponding to the digit identity. X ∈ R^784 is defined to be the vector of the 28-by-28 pixel values, and Y ∈ {0, ..., 9} is defined to be the class label.

Figure 1: Values of β versus estimates of I(X;M) and I(Y;M) on the MNIST dataset. Dashed line indicates H(Y) = log 10.

Figure 2: Bottleneck layer activity (without noise). Left: nonlinear IB (β = 0.); Right: regular supervised learning (β = 0).

The encoding function f_θ was implemented using three NN layers: the first layer consisted of 800 rectified-linear [relu] units, the second layer consisted of 800 relu units, and the third layer consisted of 20 linear units. The state of the bottleneck variable M ∈ R^20 corresponded to the activity of this third layer plus noise. The decoding map P_φ was implemented using two neural-network layers: a layer of 800 relu units, followed by a layer of 10 softmax units, and then the cross-entropy error function. (A sketch of this architecture is given below.)

The optimization was performed using an off-the-shelf deep learning framework [30–32]. The Adam [33] optimizer was used, and training was run for 200 epochs. The initial learning rate was 0.001 and dropped 60% every 10 epochs. Mini-batch sizes of n_SGD = 128 and n_MI = 1000 were used. All relevant code is available at https://github.com/artemyk/nonlinearIB .
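A minimal Keras sketch of this architecture follows. It is illustrative only: for brevity the noise level is fixed rather than optimized, and the compression term Î_θ^D of Eq. (10) is omitted, so this sketch reduces to the β = 0 baseline (see the training-loop sketch in Section 2.2 for the full objective):

```python
from tensorflow import keras

sigma = 0.1  # fixed noise std for illustration; in the paper sigma is optimized

encoder = keras.Sequential([
    keras.layers.Dense(800, activation='relu', input_shape=(784,)),
    keras.layers.Dense(800, activation='relu'),
    keras.layers.Dense(20),                      # bottleneck means f_theta(x)
    keras.layers.GaussianNoise(sigma),           # additive noise, active in training
])
decoder = keras.Sequential([
    keras.layers.Dense(800, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
model = keras.Sequential([encoder, decoder])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy')

(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
model.fit(x_train, y_train, batch_size=128, epochs=1)
```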
Fig. 1 shows values of β versus estimates of I(X;M) and I(Y;M) on the MNIST dataset, for both training and testing datasets. We also plot H(Y) = log 10 as a dashed line (H(Y) is an upper bound on the possible MI term I(Y;M), and a lower bound on the minimal I(X;M) necessary for completely-accurate prediction of Y from M). For these plots, I(X;M) was estimated using Eq. (10), while I(Y;M) was estimated using the lower bound of Eq. (5), I(Y;M) ≥ H(Y) − C(Q_θ(Y|M) ‖ P_φ(Y|M)). As can be seen from the figure, our method permits us to trade off between maximal information about Y versus maximal compression of X by varying β.

Additional insight is provided by considering the intermediate representations uncovered by nonlinear IB (with β = 0.) versus regular supervised learning (with β = 0). To visualize these intermediate representations, we optimized a slightly different network: here, f_θ was implemented by a stack of three NN layers: the first and second layers again consisted of 800 relu units, while the third layer consisted of two linear units. The state of the bottleneck variable M ∈ R² corresponded to the activity of this third layer plus noise. This configuration makes the neural network map the high-dimensional input space of images into a two-dimensional, easily-visualizable representation. The decoding map consisted of an 800-node layer of relu units and a 10-node softmax layer, as before.

Fig. 2 visualizes states of the 2-dimensional bottleneck layer (before adding noise) for a subsample of the training dataset, with colors indicating digit identity. This is shown both for nonlinear IB (β = 0.) and regular supervised learning (β = 0). Both methods successfully separate different digits into different regions of the bottleneck layer state space. However, with regular supervised learning (right), images of each digit are mapped onto a large 'swath' of activity, and bottleneck states carry information about input vectors beyond digit identity. On the other hand, nonlinear IB (left) tends to map images of each digit into a separate and very tight cluster. These bottleneck states carry almost no information about input vectors beyond digit identity.

Finally, it is of interest to compare our approach with the variational methods described in Section 3.2. However, the method described in [12] had no publicly-available code provided. We developed our own implementation of this method (available at https://github.com/artemyk/nonlinearIB ), but could not replicate the published results [12], even after extensive private communication with the authors. Theoretical and numerical comparisons with such variational methods thus remain an important area of investigation for the future.

5 Conclusion

We propose 'nonlinear IB', a method for performing information bottleneck [IB] in novel domains. We assume that the bottleneck variable is continuous. However, unlike previous approaches, the input and output variables can be either discrete or continuous, can be distributed in arbitrary (e.g., non-Gaussian) ways, and the encoding and decoding maps can be nonlinear. Our method is based on a new tractable upper bound on the IB objective. This upper bound can be optimized using gradient-based techniques applied to training data. We show how to implement our method using an off-the-shelf neural network package. We then demonstrate it on the MNIST dataset of handwritten digits, showing that it uncovers very different intermediate representations than those uncovered by traditional supervised learning.

Note that we have discussed nonlinear IB as a technique for finding compressed representations of the input. The method may also be used as an information-theoretic regularizer, i.e., as a way to penalize over-fitting and improve generalization performance. Exploring its efficacy in this domain remains for future work.
Acknowledgments
We thank Steven Van Kuyk for helpful comments. We would also like to thank the Santa Fe Institute for helping to support this research. Artemy Kolchinsky and David H. Wolpert were supported by Grant No. FQXi-RFP-1622 from the FQXi foundation and Grant No. CHE-1648973 from the US National Science Foundation. Brendan D. Tracey was supported by the AFOSR MURI on multi-information sources of multi-physics systems under Award Number FA9550-15-1-0038.
References

[1] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 1999.
[2] Alexander G. Dimitrov and John P. Miller. Neural coding and decoding: communication channels and quantization. Network: Computation in Neural Systems, 12(4):441–472, 2001.
[3] Inés Samengo. Information loss in an optimal maximum likelihood decoding. Neural Computation, 14(4):771–779, 2002.
[4] Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6(Jan):165–188, 2005.
[5] Felix Creutzig, Amir Globerson, and Naftali Tishby. Past-future information bottleneck in dynamical systems. Physical Review E, 79(4):041925, 2009.
[6] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[7] Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.
[8] DJ Strouse and David J. Schwab. The deterministic information bottleneck. Neural Computation, pages 1–20, April 2017. doi: 10.1162/NECO_a_00961.
[9] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406, 2015.
[10] Artemy Kolchinsky and David H. Wolpert. Supervised learning with information penalties. In Recurrent Neural Networks Symposium at NIPS'16, Barcelona, Spain, 2016. URL http://people.idsia.ch/~rupesh/rnnsymposium2016/files/kolchinsky.pdf .
[11] Matthew Chalk, Olivier Marre, and Gasper Tkacik. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pages 1957–1965, 2016.
[12] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.
[13] Alessandro Achille and Stefano Soatto. Information dropout: learning optimal representations through noise. arXiv preprint arXiv:1611.01353, 2016.
[14] Peter Hall and Sally C. Morton. On the estimation of entropy. Annals of the Institute of Statistical Mathematics, 45(1):69–88, 1993.
[15] Grace Wahba. Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation. The Annals of Statistics, pages 15–29, 1975.
[16] Artemy Kolchinsky and Brendan D. Tracey. Estimating mixture entropy with pairwise distances. Entropy, 19(7), 2017. doi: 10.3390/e19070361.
[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[18] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[19] Nicol Norbert Schraudolph. Optimization of Entropy with Neural Networks. PhD thesis, University of California, San Diego, 1995.
[20] Nicol N. Schraudolph. Gradient-based manipulation of nonparametric entropy estimates. IEEE Transactions on Neural Networks, 15(4):828–837, 2004.
[21] Sarit Shwartz, Michael Zibulevsky, and Yoav Y. Schechner. Fast kernel entropy estimation and optimization. Signal Processing, 85(5):1045–1058, 2005.
[22] Kari Torkkola. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3(Mar):1415–1438, 2003.
[23] Kateřina Hlaváčková-Schindler, Milan Paluš, Martin Vejmelka, and Joydeep Bhattacharya. Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441(1):1–46, 2007.
[24] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[25] Geoffrey E. Hinton and Richard S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In Advances in Neural Information Processing Systems, 1994.
[26] Geoffrey E. Hinton and Richard S. Zemel. Minimizing description length in an unsupervised neural network. Preprint, 1997.
[27] G. Deco, W. Finnoff, and H. G. Zimmermann. Elimination of overtraining by a mutual information network. In Stan Gielen and Bert Kappen, editors, ICANN '93, pages 744–749. Springer London, 1993. doi: 10.1007/978-1-4471-2063-6_208.
[28] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
[29] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In The International Conference on Learning Representations (ICLR), 2014.
[30] François Chollet. Keras. https://github.com/fchollet/keras , 2015.
[31] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[32] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[33] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.