Understanding Convolutional Neural Networks with Information Theory: An Initial Exploration
Shujian Yu, Kristoffer Wickstrøm, Robert Jenssen, Jose C. Principe
Shujian Yu, Student Member, IEEE, Kristoffer Wickstrøm, Robert Jenssen, Member, IEEE, and José C. Príncipe, Life Fellow, IEEE

Shujian Yu and José C. Príncipe are with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA (email: yusjlcy9011@ufl.edu). Kristoffer Wickstrøm and Robert Jenssen are with the Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø 9037, Norway (email: {kwi030, robert.jenssen}@uit.no).
Abstract—A novel functional estimator for Rényi's α-entropy and its multivariate extension was recently proposed in terms of the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). However, the utility and possible applications of these new estimators are rather new and mostly unknown to practitioners. In this brief, we first show that our estimators enable straightforward measurement of information flow in realistic convolutional neural networks (CNNs) without any approximation. Then, we introduce the partial information decomposition (PID) framework and develop three quantities to analyze the synergy and redundancy in convolutional layer representations. Our results validate two fundamental data processing inequalities and reveal some fundamental properties concerning CNN training.

Index Terms—Convolutional Neural Networks, Data Processing Inequality, Multivariate Matrix-based Rényi's α-entropy, Partial Information Decomposition.

I. INTRODUCTION
There has been a growing interest in understanding deep neural network (DNN) mapping and training using information theory [1], [2], [3]. According to Shwartz-Ziv and Tishby [4], a DNN should be analyzed by measuring the information quantities that each layer's representation T preserves about the input signal X with respect to the desired signal Y (i.e., I(X;T) with respect to I(T;Y), where I denotes mutual information), which has been called the Information Plane (IP). Moreover, they also empirically show that the common stochastic gradient descent (SGD) optimization undergoes two separate phases in the IP: an early "fitting" phase, in which both I(X;T) and I(T;Y) increase rapidly along with the iterations, and a later "compression" phase, in which there is a reversal such that I(X;T) and I(T;Y) continually decrease. However, the observations so far have been constrained to a simple multilayer perceptron (MLP) on toy data, and were later questioned by some counter-examples in [5].

In our most recent work [6], we used a novel matrix-based Rényi's α-entropy functional estimator [7] to analyze the information flow in stacked autoencoders (SAEs). We observed that the existence of a "compression" phase associated with I(X;T) and I(T;Y) in the IP is predicated on the proper dimension of the bottleneck layer size S of the SAE: if S is larger than the intrinsic dimensionality d [8] of the training data, the mutual information values start to increase up to a point and then go back, approaching the bisector of the IP; if S is smaller than d, the mutual information values increase consistently up to a point and never go back.

Despite the great potential of earlier works [4], [5], [6], there are several open questions when it comes to the application of information theoretic concepts to convolutional neural networks (CNNs). These include but are not limited to:

1) The accurate and tractable estimation of information quantities in CNNs. Specifically, in a convolutional layer, the input signal X is represented by multiple feature maps, as opposed to a single vector in the fully connected layers. Therefore, the quantity we really need to measure is the multivariate mutual information (MMI) between a single variable (e.g., X) and a group of variables (e.g., different feature maps). Unfortunately, the reliable estimation of MMI is widely acknowledged as an intractable or infeasible task in the machine learning and information theory communities [9], especially when each variable is in a high-dimensional space.

2) A systematic framework to analyze CNN layer representations. By interpreting a feedforward DNN as a Markov chain, the existence of the data processing inequality (DPI) is a general consensus [4], [6]. However, it is necessary to identify more inner properties of CNN layer representations using a principled framework, beyond the DPI.

In this brief, we answer these two questions and make the following contributions:

1) By defining a multivariate extension of the matrix-based Rényi's α-entropy functional [10], we show that the information flow, especially the MMI, in CNNs can be measured without knowledge of the probability density function (PDF).

2) By capitalizing on the partial information decomposition (PID) framework [11] and on our sample based estimator for MMI, we develop three quantities that bypass the need to estimate the synergy and redundancy amongst different feature maps in convolutional layers. Our result sheds light on the determination of network depth (number of layers) and width (size of each layer). It also gives insights on network pruning.

II. INFORMATION QUANTITY ESTIMATION IN CNNS
In this section we give a brief introduction to the recently proposed matrix-based Rényi's α-entropy functional estimator [7] and its multivariate extension [10]. Benefiting from this novel definition, we present a simple method to measure MMI between any pairwise layer representations in CNNs.

A. Matrix-based Rényi's α-entropy functional and its multivariate extension

In information theory, a natural extension of the well-known Shannon's entropy is Rényi's α-order entropy [12]. For a random variable X with probability density function (PDF) f(x) in a finite set $\mathcal{X}$, the α-entropy $H_\alpha(X)$ is defined as:

$$H_\alpha(f) = \frac{1}{1-\alpha}\log\int_{\mathcal{X}} f^\alpha(x)\,dx. \qquad (1)$$

Rényi's entropy functional evidences a long track record of usefulness in machine learning and its applications [13]. Unfortunately, the need for accurate PDF estimation impedes its more widespread adoption in data driven science. To solve this problem, [7], [10] suggest similar quantities that resemble quantum Rényi's entropy [14] in terms of the normalized eigenspectrum of the Hermitian matrix of the projected data in an RKHS. The new estimators avoid evaluating the underlying probability distributions, and estimate information quantities directly from data. For brevity, we directly give the following definitions. The theoretical foundations for Definition 1 and Definition 2 are proved respectively in [7] and [10].
Definition 1: Let $\kappa: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ be a real valued positive definite kernel that is also infinitely divisible [15]. Given $X = \{x_1, x_2, \ldots, x_n\}$ and the Gram matrix $K$ obtained from evaluating a positive definite kernel $\kappa$ on all pairs of exemplars, that is $(K)_{ij} = \kappa(x_i, x_j)$, a matrix-based analogue to Rényi's α-entropy for a normalized positive definite (NPD) matrix $A$ of size $n \times n$, such that $\operatorname{tr}(A) = 1$, can be given by the following functional:

$$S_\alpha(A) = \frac{1}{1-\alpha}\log\left(\operatorname{tr}(A^\alpha)\right) = \frac{1}{1-\alpha}\log\left[\sum_{i=1}^{n}\lambda_i(A)^\alpha\right], \qquad (2)$$

where $A_{ij} = \frac{1}{n}\frac{K_{ij}}{\sqrt{K_{ii}K_{jj}}}$ and $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$.
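As an aside, Definition 1 translates almost directly into code. The following is a minimal NumPy sketch, not the authors' reference implementation: the RBF kernel form, the default order α = 1.01 (a value close to 1), and the natural logarithm are assumptions made here for illustration.

```python
import numpy as np

def rbf_gram(x, sigma):
    # Gram matrix of an RBF kernel, K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    # x: (n, d) array of flattened samples.
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def renyi_entropy(K, alpha=1.01):
    # Matrix-based Renyi alpha-entropy of Eq. (2): normalize K so that tr(A) = 1,
    # then evaluate 1/(1 - alpha) * log(sum_i lambda_i(A)^alpha).
    n = K.shape[0]
    d = np.sqrt(np.diag(K))
    A = K / np.outer(d, d) / n                        # A_ij = (1/n) K_ij / sqrt(K_ii K_jj)
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # guard against tiny negative eigenvalues
    return np.log(np.sum(lam ** alpha)) / (1.0 - alpha)
```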
Definition 2: Given a collection of $n$ samples $\{s^i = (x_1^i, x_2^i, \cdots, x_C^i)\}_{i=1}^{n}$, where the superscript $i$ denotes the sample index, each sample contains $C$ ($C \geq 2$) measurements $x_1 \in \mathcal{X}_1$, $x_2 \in \mathcal{X}_2$, $\cdots$, $x_C \in \mathcal{X}_C$ obtained from the same realization, and the positive definite kernels $\kappa_1: \mathcal{X}_1 \times \mathcal{X}_1 \mapsto \mathbb{R}$, $\kappa_2: \mathcal{X}_2 \times \mathcal{X}_2 \mapsto \mathbb{R}$, $\cdots$, $\kappa_C: \mathcal{X}_C \times \mathcal{X}_C \mapsto \mathbb{R}$, a matrix-based analogue to Rényi's α-order joint-entropy among $C$ variables can be defined as:

$$S_\alpha(A_1, A_2, \cdots, A_C) = S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)}\right), \qquad (3)$$

where $(A_1)_{ij} = \kappa_1(x_1^i, x_1^j)$, $(A_2)_{ij} = \kappa_2(x_2^i, x_2^j)$, $\cdots$, $(A_C)_{ij} = \kappa_C(x_C^i, x_C^j)$, and $\circ$ denotes the Hadamard product.

The following corollary (see proof in [10]) serves as a foundation for our Definition 2. Specifically, the first inequality indicates that the joint entropy of a set of variables is greater than or equal to the maximum of the individual entropies of the variables in the set, whereas the second inequality suggests that the joint entropy of a set of variables is less than or equal to the sum of the individual entropies of the variables in the set.
Corollary 1:
Let $A_1, A_2, \cdots, A_C$ be $C$ $n \times n$ positive definite matrices with trace 1 and nonnegative entries, and $(A_1)_{ii} = (A_2)_{ii} = \cdots = (A_C)_{ii} = \frac{1}{n}$ for $i = 1, 2, \cdots, n$. Then the following two inequalities hold:

$$S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)}\right) \geq \max\left[S_\alpha(A_1), S_\alpha(A_2), \cdots, S_\alpha(A_C)\right], \qquad (4)$$

$$S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)}\right) \leq S_\alpha(A_1) + S_\alpha(A_2) + \cdots + S_\alpha(A_C). \qquad (5)$$

B. Multivariate mutual information estimation in CNNs

Fig. 1. Venn diagram of $I(X; \{T_1, T_2, \cdots, T_C\})$.

Suppose there are C filters in a convolutional layer; an input image is therefore represented by C different feature maps. Each feature map characterizes a specific property of the input. This suggests that the amount of information that the convolutional layer gains from the input X is preserved in C information sources $T_1, T_2, \cdots, T_C$.

The Venn diagram for X and $T_1, \cdots, T_C$ is shown in Fig. 1. Specifically, the red circle represents the information contained in X, and each blue circle represents the information contained in one feature map. The amount of information about X that is gained from the C feature maps (i.e., $I(X; \{T_1, T_2, \cdots, T_C\})$) is exactly the shaded area. By applying the inclusion-exclusion principle [16], this shaded area can be computed by summing the area of the red circle (i.e., H(X)) and the area occupied by all blue circles (i.e., $H(T_1, T_2, \cdots, T_C)$), and then subtracting the total joint area occupied by the red circle and all blue circles (i.e., $H(X, T_1, T_2, \cdots, T_C)$). Formally, this indicates that:

$$I(X; \{T_1, T_2, \cdots, T_C\}) = H(X) + H(T_1, T_2, \cdots, T_C) - H(X, T_1, T_2, \cdots, T_C), \qquad (6)$$

where H denotes entropy for a single variable or joint entropy for a group of variables.

Given Eqs. (2), (3) and (6), $I(X; \{T_1, T_2, \cdots, T_C\})$ in a mini-batch of size n can be estimated with:

$$I_\alpha(B; \{A_1, A_2, \cdots, A_C\}) = S_\alpha(B) + S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)}\right) - S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C \circ B}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C \circ B)}\right). \qquad (7)$$

Here, $B, A_1, \cdots, A_C$ denote Gram matrices evaluated on the input tensor and the C feature map tensors, respectively. For example, $A_p$ ($1 \leq p \leq C$) is evaluated on $\{x_p^i\}_{i=1}^{n}$, in which $x_p^i$ refers to the feature map generated from the $i$-th input sample using the $p$-th filter. Obviously, instead of estimating the joint PDF on $\{X, T_1, T_2, \cdots, T_C\}$, which is typically unattainable, one only needs to compute $(C+1)$ Gram matrices using a real valued positive definite kernel that is also infinitely divisible [15].
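Continuing the sketch above, Eqs. (3), (6) and (7) can be composed into a short MMI routine. This is only an illustrative composition of the definitions under the earlier assumptions, not the authors' code.

```python
def joint_renyi_entropy(grams, alpha=1.01):
    # Matrix-based joint entropy of Eq. (3): Hadamard product of the normalized
    # Gram matrices, re-normalized to unit trace before the eigen-decomposition.
    n = grams[0].shape[0]
    H = np.ones((n, n))
    for K in grams:
        d = np.sqrt(np.diag(K))
        H *= K / np.outer(d, d) / n
    A = H / np.trace(H)
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log(np.sum(lam ** alpha)) / (1.0 - alpha)

def multivariate_mi(B, feature_grams, alpha=1.01):
    # Eq. (7): I(X; {T_1,...,T_C}) = S(B) + S(A_1,...,A_C) - S(A_1,...,A_C,B),
    # where B is the Gram matrix of the input batch and A_p that of the p-th feature map.
    return (renyi_entropy(B, alpha)
            + joint_renyi_entropy(feature_grams, alpha)
            - joint_renyi_entropy(list(feature_grams) + [B], alpha))
```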
III. MAIN RESULTS
This section presents three sets of experiments to empirically validate the existence of two DPIs in CNNs, using the novel nonparametric information theoretic (IT) estimators put forth in this work. Specifically, Section III-A validates the existence of two DPIs in CNNs. In Section III-B, we illustrate, via the application of the PID framework in the training phase, some interesting observations associated with different CNN topologies. Following this, we present implications for the determination of network depth and width motivated by these results. We finally point out, in Section III-C, an advanced interpretation of the information plane (IP) that deserves more (theoretical) investigation. Four real-world data sets, namely MNIST [17], Fashion-MNIST [18], HASYv2 [19], and Fruits 360 [20], are selected for evaluation. The characteristics of each data set are summarized in Table I. Note that, compared with the benchmark MNIST and Fashion-MNIST, HASYv2 and Fruits 360 have a significantly larger number of classes as well as higher intraclass variance. For example, in Fruits 360, the apple class contains different varieties (e.g., Crimson Snow, Golden, Granny Smith), and the images are captured from different viewpoints and under varying illumination conditions. Due to page limitations, we only demonstrate the most representative results in the rest of this paper. Additional experimental results are available in Appendix C.
TABLE I
THE NUMBER OF CLASSES (CLASS), THE NUMBER OF TRAINING SAMPLES (TRAIN), THE NUMBER OF TESTING SAMPLES (TEST), AND THE SAMPLE SIZE OF SELECTED DATA SETS.

Data set             Class    Train      Test      Sample Size
MNIST [17]           10       60,000     10,000    28 × 28
Fashion-MNIST [18]   10       60,000     10,000    28 × 28
HASYv2 [19]          369      151,406    16,827    32 × 32
Fruits 360 [20]      111      56,781     19,053    100 × 100

For MNIST and Fashion-MNIST, we consider a LeNet-5 [17] type network which consists of convolutional layers, pooling layers, and fully connected layers. For HASYv2 and Fruits 360, we use a smaller AlexNet [21] type network with more convolutional layers (but fewer filters in each layer) and fully connected layers. We train the CNNs with SGD with momentum and a fixed mini-batch size; the learning rate and the number of training epochs are selected separately for MNIST/Fashion-MNIST and for HASYv2/Fruits 360. Both "sigmoid" and "ReLU" activation functions are tested. For the estimation of MMI, we fix α close to 1 to approximate Shannon's definition, and use the radial basis function (RBF) kernel $\kappa(x_i, x_j) = \exp\!\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$ to obtain the Gram matrices. The kernel size σ is determined based on Silverman's rule of thumb [22], $\sigma = h \times n^{-1/(4+d)}$, where $n$ is the number of samples in the mini-batch, $d$ is the sample dimensionality, and $h$ is an empirical value selected experimentally by taking into consideration the data's average marginal variance. In this paper, we select $h = 5$ for the input signal forward propagation chain and a smaller value of $h$ for the error backpropagation chain.
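To make this setup concrete, a hedged sketch of the kernel-size selection and of the per-layer MMI computation is given below. It continues the functions introduced in Section II; flattening each feature map into a vector (the "vector rastering" discussed in Section IV) and the helper names are assumptions of this sketch, not part of the original implementation.

```python
def silverman_sigma(x, h=5.0):
    # Silverman's rule of thumb for the RBF kernel size: sigma = h * n^(-1/(4+d)),
    # where h is an empirical constant chosen from the data's average marginal variance.
    n, d = x.shape
    return h * n ** (-1.0 / (4 + d))

def layer_mmi(X_batch, feature_maps, alpha=1.01, h=5.0):
    # I(X; {T_1,...,T_C}) for one convolutional layer and one mini-batch.
    # X_batch: (n, d_x) flattened inputs; feature_maps: list of (n, d_t) flattened
    # maps, one entry per filter.
    B = rbf_gram(X_batch, silverman_sigma(X_batch, h))
    grams = [rbf_gram(T, silverman_sigma(T, h)) for T in feature_maps]
    return multivariate_mi(B, grams, alpha)
```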
A. Experimental Validation of Two DPIs

We expect the existence of two DPIs in any feedforward CNN with $K$ hidden layers, i.e., $I(X; T_1) \geq I(X; T_2) \geq \cdots \geq I(X; T_K)$ and $I(\delta_K; \delta_{K-1}) \geq I(\delta_K; \delta_{K-2}) \geq \cdots \geq I(\delta_K; \delta_1)$, where $T_1, T_2, \cdots, T_K$ are the successive hidden layer representations from the first hidden layer to the output layer and $\delta_K, \delta_{K-1}, \cdots, \delta_1$ are the errors from the output layer to the first hidden layer. This is because both $X \rightarrow T_1 \rightarrow \cdots \rightarrow T_K$ and $\delta_K \rightarrow \delta_{K-1} \rightarrow \cdots \rightarrow \delta_1$ form a Markov chain [4], [6]. Fig. 2 shows the DPIs at the initial training stage, after one epoch of training, and at the final training stage. As can be seen, the DPIs hold in most of the cases. Note that there are a few disruptions in the error backpropagation chain, although the curves should be monotonic according to the theory. One possible reason is that when training converges, the error becomes tiny, such that Silverman's rule of thumb is no longer a reliable choice to select the scale parameter σ in our estimator.

Fig. 2. Two DPIs in CNNs. The first row shows the validation results on the MNIST data set, obtained by a CNN with two convolutional layers (denoted Conv. 1 and Conv. 2); the second row shows the validation results on the Fashion-MNIST data set, obtained by the same CNN architecture as for MNIST; the third row shows the validation results on the HASYv2 data set, obtained by a CNN with three convolutional layers (Conv. 1 to Conv. 3); whereas the fourth row shows the validation results on the Fruits 360 data set, obtained by a CNN with four convolutional layers (Conv. 1 to Conv. 4). The columns correspond to the initial iteration, one epoch later, and the end of training. In each subfigure, the blue curve shows the MMI values between the input and different layer representations, whereas the green curve shows the MMI values between the errors in the output layer and in different hidden layers.
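As a usage note, the forward-chain DPI of this section can be checked directly from the per-layer MMI values; a minimal sketch, again under the assumptions stated with the earlier helpers, is:

```python
def check_forward_dpi(X_batch, layers_feature_maps, alpha=1.01, tol=1e-6):
    # layers_feature_maps: one list of flattened feature maps per hidden layer,
    # ordered from the first hidden layer to the last.
    mmi = [layer_mmi(X_batch, fmaps, alpha) for fmaps in layers_feature_maps]
    holds = all(a >= b - tol for a, b in zip(mmi, mmi[1:]))
    return mmi, holds
```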
B. Redundancy and Synergy in Layer Representations
In this section, we explore the properties of different IT quantities during the training of CNNs, with the help of the PID framework. In particular, we are interested in determining the redundancy and synergy amongst different feature maps and how they evolve with training in different CNN topologies. Moreover, we are also interested in identifying some upper and lower limits (if they exist) for these quantities. However, the analysis is not easy, because the set of information equations is underdetermined, as we will show next.
Given the input signal X and two feature maps $T_1$ and $T_2$, the PID framework indicates that the MMI $I(X; \{T_1, T_2\})$ can be decomposed into four non-negative IT components: the synergy $Syn(X; \{T_1, T_2\})$ that measures the information about X provided by the combination of $T_1$ and $T_2$ (i.e., the information that cannot be captured by either $T_1$ or $T_2$ alone); the redundancy $Rdn(X; \{T_1, T_2\})$ that measures the shared information about X that can be provided by either $T_1$ or $T_2$; and the unique information $Unq(X; T_1)$ (or $Unq(X; T_2)$) that measures the information about X that can only be provided by $T_1$ (or $T_2$). Moreover, the unique information, the synergy and the redundancy satisfy (see Fig. 3):

$$I(X; \{T_1, T_2\}) = Syn(X; \{T_1, T_2\}) + Rdn(X; \{T_1, T_2\}) + Unq(X; T_1) + Unq(X; T_2); \qquad (8)$$
$$I(X; T_1) = Rdn(X; \{T_1, T_2\}) + Unq(X; T_1); \qquad (9)$$
$$I(X; T_2) = Rdn(X; \{T_1, T_2\}) + Unq(X; T_2). \qquad (10)$$

Fig. 3. Synergy and redundancy amongst different feature maps. (a) shows the interactions between the input signal and two feature maps; the shaded area indicates the MMI $I(X; \{T_1, T_2\})$. (b) shows the PID of $I(X; \{T_1, T_2\})$.

Notice that we have four IT components (i.e., synergy, redundancy, and two unique information terms), but only three measurements: $I(X; \{T_1, T_2\})$, $I(X; T_1)$, and $I(X; T_2)$. Therefore, we can never determine the IT quantities uniquely. This decomposition of $I(X; \{T_1, T_2\})$ can be straightforwardly extended to more than three variables, thus decomposing $I(X; \{T_1, T_2, \cdots, T_C\})$ into many more components. For example, if C = 4, there are already 166 individual non-negative terms. Admittedly, the PID diagram (see Appendix A for more details) offers an intuitive understanding of the interactions between the input and different feature maps, and our estimators have been shown appropriate for high dimensional data. However, the reliable estimation of each IT component still remains a big challenge because of the underdetermined nature of the problem. In fact, there is no universal agreement on the definition of synergy and redundancy even for three-way interactions among one-dimensional variables, let alone the estimation of each synergy or redundancy among numerous variables in high-dimensional spaces [23], [24].

To this end, we develop three quantities based on the three measurements, by manipulating Eqs. (8)-(10), to characterize intrinsic properties of CNN layer representations. The newly developed quantities avoid the direct estimation of synergy and redundancy. They are:

1) $I(X; \{T_1, T_2, \cdots, T_C\})$, which is exactly the MMI. This quantity measures the amount of information about X that is captured by all feature maps (in one convolutional layer).

2) $\frac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j=i+1}^{C} \left[ I(X; T_i) + I(X; T_j) - I(X; \{T_i, T_j\}) \right]$, which is referred to as the redundancy-synergy trade-off. This quantity measures the (average) redundancy-synergy trade-off in different feature maps. By rewriting Eqs. (8)-(10), we have:

$$I(X; T_i) + I(X; T_j) - I(X; \{T_i, T_j\}) = Rdn(X; \{T_i, T_j\}) - Syn(X; \{T_i, T_j\}). \qquad (11)$$

Obviously, a positive value of this trade-off implies redundancy, whereas a negative value signifies synergy [25]. Here, instead of measuring all PID terms, whose number increases rapidly with C, we sample pairs of feature maps, calculate the information quantities for each pair, and finally compute averages over all pairs to determine whether synergy dominates in the training phase. Note that the pairwise sampling procedure has been used in neuroscience [26] and in a recent information theoretic investigation of the Restricted Boltzmann Machine (RBM) [3].

3) $\frac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j=i+1}^{C} \left[ 2 \times I(X; \{T_i, T_j\}) - I(X; T_i) - I(X; T_j) \right]$, which is referred to as the weighted non-redundant information. This quantity measures the (average) amount of non-redundant information about X that is captured by pairs of feature maps. As can be seen from Eqs. (8)-(10),

$$2 \times I(X; \{T_i, T_j\}) - I(X; T_i) - I(X; T_j) = Unq(X; T_i) + Unq(X; T_j) + 2 \times Syn(X; \{T_i, T_j\}). \qquad (12)$$

We call this quantity "weighted" because we overemphasize the role of synergy; notice that redundancy does not explicitly appear, while the two unique information terms reappear.

One should note that Eqs. (11) and (12) are just two of many equations that can be written, but all of them are linear combinations of more than one IT component. Therefore, we do not introduce any errors in computing Eqs. (11) and (12); we simply work in a linearly projected space of synergy and redundancy.
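For concreteness, the two averaged quantities in 2) and 3) can be computed from pairwise MMI estimates alone. The sketch below reuses the helper functions introduced earlier; it is an illustration of Eqs. (11)-(12) under those stated assumptions, not the authors' code.

```python
from itertools import combinations

def pairwise_pid_quantities(X_batch, feature_maps, alpha=1.01, h=5.0):
    # Average redundancy-synergy trade-off (Eq. (11)) and weighted non-redundant
    # information (Eq. (12)) over all pairs of feature maps in one layer.
    B = rbf_gram(X_batch, silverman_sigma(X_batch, h))
    grams = [rbf_gram(T, silverman_sigma(T, h)) for T in feature_maps]
    mi_single = [multivariate_mi(B, [A], alpha) for A in grams]
    rdn_syn, non_red = [], []
    for i, j in combinations(range(len(grams)), 2):
        mi_pair = multivariate_mi(B, [grams[i], grams[j]], alpha)
        rdn_syn.append(mi_single[i] + mi_single[j] - mi_pair)        # Rdn - Syn
        non_red.append(2.0 * mi_pair - mi_single[i] - mi_single[j])  # Unq_i + Unq_j + 2*Syn
    return float(np.mean(rdn_syn)), float(np.mean(non_red))
```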
We will now experimentally show how these two pairs of IT components (synergy and redundancy from Eq. (11), and synergy with the two unique information terms from Eq. (12)) evolve across different CNN layer changes. We first evaluate the MMI with respect to different CNN topologies in Fig. 4. For MNIST, we demonstrate the MMI values in the first two convolutional layers (denoted Conv. 1 and Conv. 2). Similarly, for Fruits 360, we demonstrate the MMI values in Conv. 1, Conv. 2 and Conv. 3. By the DPI, the maximum amount of information that each convolutional layer representation can capture is exactly the entropy of the input. As can be seen, as the number of filters increases, the total amount of information that each layer captures also increases accordingly. However, it is interesting to see that the MMI values are likely to saturate with only a few filters. For example, in Fruits 360, only a few filters in each of Conv. 1, Conv. 2 and Conv. 3 already make the MMI values reach their maximum value (i.e., the ensemble average entropy across mini-batches) in each layer. More filters increase classification accuracy at first.

Fig. 4. The MMI values in (a) Conv. 1 and Conv. 2 on the MNIST data set; and (b) Conv. 1, Conv. 2 and Conv. 3 on the Fruits 360 data set. The black line indicates the upper bound of the MMI, i.e., the average mini-batch input entropy. The topologies of all competing networks are specified in the legend, in which the successive numbers indicate the number of filters in each convolutional layer. We also report their classification accuracies (%) on the testing set, averaged over Monte-Carlo simulations, in parentheses.
However, increasing the number of filters does not guarantee that classification accuracy increases, and might even degrade performance. We argue that this phenomenon can be explained by the percentage that the redundancy-synergy trade-off or the weighted non-redundant information accounts for in the MMI of each pair of feature maps, i.e., $\frac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j=i+1}^{C} \frac{I(X; T_i) + I(X; T_j) - I(X; \{T_i, T_j\})}{I(X; \{T_i, T_j\})}$ or $\frac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j=i+1}^{C} \frac{2 \times I(X; \{T_i, T_j\}) - I(X; T_i) - I(X; T_j)}{I(X; \{T_i, T_j\})}$. In fact, by referring to Fig. 5, it is obvious that more filters push the network towards an improved redundancy-synergy trade-off, i.e., the synergy gradually dominates in each pair of feature maps as the number of filters increases. That is perhaps one of the main reasons why an increased number of filters can lead to better classification performance, even though the total multivariate mutual information stays the same. However, if we look deeper, it seems that the redundancy-synergy trade-off is always positive, which may suggest that redundancy is always larger than synergy. On the other hand, one should note that the amount of non-redundant information is always less than the MMI (redundancy is non-negative), no matter the number of filters. Therefore, it is impossible to improve the classification performance by blindly increasing the number of filters. This is because the minimum probability of classification error is upper bounded by the MMI expressed in different forms (e.g., [27], [28]).

Having illustrated the DPIs and the redundancy-synergy trade-offs, it is easy to summarize some implications concerning the design and training of CNNs. First, as a possible application of the DPI in the error backpropagation chain, we suggest using the DPI as an indicator of where to perform the "bypass" in the recently proposed Relay backpropagation [29]. Second, the DPIs and the redundancy-synergy trade-off may give some guidelines on the depth and width of CNNs. Intuitively, we need multiple layers to quantify the multi-scale information contained in natural images. However, more layers will lead to severe information loss. The same dilemma applies to the number of filters in convolutional layers: a sufficient number of filters guarantees preservation of the input information and the ability to learn a good redundancy-synergy trade-off. However, increasing the number of filters does not always lead to a performance gain.

Admittedly, it is hard to give a concrete rule to determine the exact number of filters in one convolutional layer from the current results. We still present a possible solution to shed light on this problem. In fact, if we view each filter as an individual feature extractor, the problem of determining the optimal number of filters turns out to be that of seeking a stopping criterion for feature selection. Therefore, the number of filters can be determined by monitoring the value of the conditional mutual information (CMI), i.e., $I(T_r; Y \mid T_s)$, where $T_s$ and $T_r$ denote respectively the selected and the remaining filters, and Y denotes the desired response. Theoretically, $I(T_r; Y \mid T_s)$ is monotonically decreasing if a new filter $t$ is added into $T_s$ [30], but it will never reach zero in practice [31]. Therefore, in order to evaluate the impact of $t$ on $I(T_r; Y \mid T_s)$, we can create a random permutation of $t$ (without permuting the corresponding Y), denoted $\tilde{t}$.
If $I(\{T_r - t\}; Y \mid \{T_s, t\})$ is not significantly smaller than $I(\{T_r - \tilde{t}\}; Y \mid \{T_s, \tilde{t}\})$, $t$ can be discarded and the filter selection is stopped. We term this method CMI-permutation [32]. We refer interested readers to Appendix B for its detailed implementation. Our preliminary results, shown in Fig. 6, suggest that CMI-permutation is likely to underestimate the number of filters. Therefore, additional design efforts are required as future work.

C. Revisiting the Information Plane (IP)
The behavior of the curves in the IP is currently a controversial issue. Recall the discrepancy reported by Saxe et al. [5]: the existence of the compression phase observed by Shwartz-Ziv and Tishby [4] depends on the adopted nonlinearity, in that double-sided saturating nonlinearities like "tanh" or "sigmoid" yield a compression phase, but linear activation functions and single-sided saturating nonlinearities like "ReLU" do not. Interestingly, Noshad et al. [33] employed dependence graphs to estimate mutual information values and observed the compression phase even with "ReLU" activation functions. A similar phenomenon was also observed by Chelombiev et al. [34], in which an entropy-based adaptive binning (EBAB) estimator is developed to enable more robust mutual information estimation that adapts to the hidden activity of neural networks. On the other hand, Goldfeld et al. [35] argued that compression is due to the clustering of layer representations, but that it is hard to observe compression in large networks. We disagree with this attribution of the different behavior to the nonlinear activation functions. Instead, we often forget that estimators rarely share all the properties of the statistically defined quantities [36]. Hence, the variability in the displayed behavior is most likely attributable to the different estimators (Shwartz-Ziv and Tishby [4] use the basic Shannon definition and estimate mutual information by dividing neuron activation values into equal-interval bins, whereas the base estimator used by Saxe et al. [5] provides Kernel Density Estimator (KDE) based lower and upper bounds on the true mutual information [37], [33]), although this argument is rarely invoked in the literature. This is the reason we suggest that a first step, before analyzing the information plane curves, is to show that the employed estimators meet the expectation of the DPI (or similar known properties of the statistical quantities). We show above that our Rényi's entropy estimator passes this test.
Fig. 5. The redundancy-synergy trade-off and the weighted non-redundant information on the MNIST (the first row) and Fruits 360 (the second row) data sets. (a) and (b) demonstrate the percentages of these two quantities with respect to different numbers of filters in Conv. 1, with a fixed number of filters in Conv. 2; (c) and (d) demonstrate them with a fixed number of filters in Conv. 1 but different numbers of filters in Conv. 2. Similarly, (e)-(h) compare these two quantities with respect to different numbers of filters in two convolutional layers, with the number of filters in the remaining layer fixed. In each subfigure, the topologies of all competing networks are specified in the legend.

Fig. 6. Determination of the number of filters in (a) Conv. 2 of LeNet-5 trained on the MNIST data set; and (b) Conv5-3 of VGG-16 trained on the CIFAR-10 data set. CMI-permutation [32] retains only a small subset of the available filters (5 in case (a) and 21 in case (b)).

The IPs for a LeNet-5 type CNN trained on the MNIST and Fashion-MNIST data sets are shown in Fig. 7. From the first column, both I(X;T) and I(T;Y) increase rapidly up to a certain point with the SGD iterations. This result conforms to the description in [35], suggesting that the behavior of CNNs in the IP is not the same as that of the MLPs in [4], [5], [33], and that our intrinsic dimensionality hypothesis in [6] is specific to SAEs. However, if we remove the redundancy in I(X;T) and I(T;Y) and only preserve the unique information and the synergy (i.e., substituting I(X;T) and I(T;Y) with their corresponding (average) weighted non-redundant information defined in Section III-B), it is easy to observe the compression phase in the modified IP. Moreover, it seems that "sigmoid" is more likely to incur compression than "ReLU", an intensity that can be attributed to the nonlinearity. Our results shed light on the discrepancy between [4] and [5], and refine the arguments in [33], [34].

Fig. 7. The Information Plane (IP) and the modified Information Plane (M-IP) of a LeNet-5 type CNN trained on the MNIST (the first row) and Fashion-MNIST (the second row) data sets. The number of filters in Conv. 1, the number of filters in Conv. 2, and the adopted activation function are indicated in the subtitle of each plot. The curves in the IP increase rapidly up to a point without compression (see (a) and (d)). By contrast, it is very easy to observe the compression in the M-IP (see (b), (c) and (e)). Moreover, compared with ReLU, sigmoid is more likely to incur compression (e.g., comparing (b) with (c), or (e) with (f)).
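A point on the modified IP can thus be obtained by replacing both mutual information axes with the averaged quantity of Eq. (12). The following hedged sketch reuses the earlier helpers; treating the labels as one flattened (e.g., one-hot) variable is an assumption of this illustration.

```python
def modified_ip_point(X_batch, Y_batch, feature_maps, alpha=1.01, h=5.0):
    # One M-IP coordinate pair for a layer: the average weighted non-redundant
    # information of the feature maps about the input X and about the labels Y.
    # Y_batch: (n, c) one-hot label array (an illustrative choice).
    _, x_axis = pairwise_pid_quantities(X_batch, feature_maps, alpha, h)
    _, y_axis = pairwise_pid_quantities(Y_batch, feature_maps, alpha, h)
    return x_axis, y_axis
```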
IV. CONCLUSIONS AND FUTURE WORK

This brief presents a systematic method to analyze CNN mappings and training from an information theoretic perspective. Using the multivariate extension of the matrix-based Rényi's α-entropy functional, we validated two data processing inequalities (DPIs) in CNNs. The introduction of the partial information decomposition (PID) framework enables us to pin down the redundancy-synergy trade-off in layer representations. We also analyzed the behaviors of the curves in the information plane, aiming to clarify the debate on the existence of compression in DNNs. We close by highlighting some potential extensions of our methodology and directions of future research:

1) All the information quantities mentioned in this paper are estimated based on a vector rastering of samples, i.e., each layer input (e.g., an input image, a feature map) is first converted to a single vector before entropy or mutual information estimation. Despite its simplicity, this distorts the spatial relationships amongst neighboring pixels. Therefore, a question remains on reliable information theoretic estimation that is feasible within a tensor structure.

2) We look forward to evaluating our estimators on more complex CNN architectures, such as VGGNet [38] and ResNet [39]. According to our observations, it is easy to validate the DPI and the rapid increase of mutual information (in the top layers) of VGG-16 on the CIFAR-10 data set [40] (see Fig. 8).

Fig. 8. DPI of VGG-16 on CIFAR-10: $I(X; T_1) \geq I(X; T_2) \geq \cdots \geq I(X; T_K)$. Layer 1 to Layer 13 are convolutional layers, whereas Layer 14 to Layer 16 are fully-connected layers.

However, it seems that the MMI values in the bottom layers are likely to saturate. The problem arises when we take the Hadamard product of the kernel matrices of each feature map in Eq. (7).
The elements in these (normalized) kernel matrices have values between 0 and 1, and taking the entrywise product of many such matrices, as in a late convolutional layer of VGG-16, will tend towards a matrix with diagonal entries 1/n and nearly zero everywhere else. The eigenvalues of the resulting matrix will then quickly have almost the same value across training epochs.

3) Our estimator described in Eqs. (3) and (6) has two limitations one needs to be aware of. First, our estimators are highly efficient in the scenario of high-dimensional data, regardless of the data properties (e.g., continuous, discrete, or mixed). However, the computational complexity increases almost cubically with the number of samples n, because of the eigenvalue decomposition. Fortunately, it is possible to apply methods such as kernel randomization [41] to reduce the burden to O(n log(n)). By contrast, the well-known kernel density estimator [42] suffers from the curse of dimensionality [43], whereas the k-nearest neighbor estimator [44] requires exponentially many samples for accurate estimation [45]. Second, as we have emphasized in previous work (e.g., [6], [10]), it is important to select an appropriate value for the kernel size σ and the order α of the Rényi entropy functional; otherwise, spurious conclusions may be drawn. Reliable ways to select σ include Silverman's rule of thumb and taking a fixed percentage of the total (median) range of the Euclidean distances between all pairwise data points [46], [47]. On the other hand, the choice of α is associated with the task goal. If the application requires emphasis on the tails of the distribution (rare events) or on multiple modalities, α should be less than 2, possibly approaching 1 from above; α = 2 provides neutral weighting [13]. Finally, if the goal is to characterize modal behavior, α should be greater than 2. This work fixes α close to 1 to approximate Shannon's definition.

Fig. 9. Distribution of Information Bottleneck (IB) objective values for the filters in a convolutional layer of VGG-16 trained on CIFAR-10. Many filters have IB values above the red dashed line, which implies they are less important for classification.

4) Perhaps one of the most promising applications concerning our estimators is filter-level pruning for CNN compression, i.e., discarding whole filters that are less important [48]. Different statistics (e.g., the absolute weight sum [49] or the Average Percentage of Zeros (APoZ) [50]) have been proposed to measure filter importance. Moreover, a recent study [51] suggests that mutual information is a reliable indicator to measure neuron (or filter) importance. Therefore, given an individual filter $T_i$, we suggest evaluating its importance by the Information Bottleneck (IB) [52] objective $I(X; T_i) - \beta I(T_i; Y)$, where X and Y denote respectively the input batch and its corresponding desired output, and β is a Lagrange multiplier. Intuitively, a small value of this objective indicates that $T_i$ obtains a compressed representation of X that is relevant to Y. Therefore, the smaller the objective value, the higher the importance of $T_i$. Fig. 9 demonstrates the distribution of the IB objective values (β = 2) for the filters in a convolutional layer of VGG-16 trained on CIFAR-10. This distribution looks similar to the one obtained by APoZ in [50]; both indicate that a large fraction of the filters are less important and can be discarded with negligible loss in accuracy [49].
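As a sketch of this idea, the per-filter IB score can be computed with the same matrix-based estimators. The use of an RBF kernel on one-hot labels and the helper names below are illustrative assumptions, not the authors' implementation.

```python
def ib_filter_scores(X_batch, Y_onehot, feature_maps, beta=2.0, alpha=1.01, h=5.0):
    # IB objective I(X; T_i) - beta * I(T_i; Y) for every filter in a layer;
    # smaller scores suggest more important filters.
    B = rbf_gram(X_batch, silverman_sigma(X_batch, h))
    Yg = rbf_gram(Y_onehot.astype(float), silverman_sigma(Y_onehot, h))
    scores = []
    for T in feature_maps:
        A = rbf_gram(T, silverman_sigma(T, h))
        scores.append(multivariate_mi(B, [A], alpha) - beta * multivariate_mi(Yg, [A], alpha))
    return np.array(scores)
```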
REFERENCES

[1] N. Tishby and N. Zaslavsky, "Deep learning and the information bottleneck principle," in IEEE ITW, 2015, pp. 1–5.
[2] A. Achille and S. Soatto, "Emergence of invariance and disentanglement in deep representations," JMLR, vol. 19, no. 1, pp. 1947–1980, 2018.
[3] T. Tax, P. A. Mediano, and M. Shanahan, "The partial information decomposition of generative neural network models," Entropy, vol. 19, no. 9, p. 474, 2017.
[4] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," arXiv preprint arXiv:1703.00810, 2017.
[5] A. M. Saxe et al., "On the information bottleneck theory of deep learning," in ICLR, 2018.
[6] S. Yu and J. C. Principe, "Understanding autoencoders with information theoretic concepts," Neural Networks, vol. 117, pp. 104–123, 2019.
[7] L. G. Sanchez Giraldo, M. Rao, and J. C. Principe, "Measures of entropy from data using infinitely divisible kernels," IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 535–548, 2015.
[8] F. Camastra and A. Staiano, "Intrinsic dimension estimation: Advances and open problems," Information Sciences, vol. 328, pp. 26–41, 2016.
[9] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection," JMLR, vol. 13, no. Jan, pp. 27–66, 2012.
[10] S. Yu, L. G. Sanchez Giraldo, R. Jenssen, and J. C. Principe, "Multivariate extension of matrix-based Rényi's α-order entropy functional," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[11] P. L. Williams and R. D. Beer, "Nonnegative decomposition of multivariate information," arXiv preprint arXiv:1004.2515, 2010.
[12] A. Rényi, "On measures of entropy and information," in Proc. of the 4th Berkeley Sympos. on Math. Statist. and Prob., vol. 1, 1961, pp. 547–561.
[13] J. C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Springer Science & Business Media, 2010.
[14] M. Müller-Lennert, F. Dupuis, O. Szehr, S. Fehr, and M. Tomamichel, "On quantum Rényi entropies: A new generalization and some properties," J. Math. Phys., vol. 54, no. 12, p. 122203, 2013.
[15] R. Bhatia, "Infinitely divisible matrices," The American Mathematical Monthly, vol. 113, no. 3, pp. 221–235, 2006.
[16] R. W. Yeung, "A new outlook on Shannon's information measures," IEEE Transactions on Information Theory, vol. 37, no. 3, pp. 466–474, 1991.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[18] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms," arXiv preprint arXiv:1708.07747, 2017.
[19] M. Thoma, "The HASYv2 dataset," arXiv preprint arXiv:1701.08380, 2017.
[20] H. Mureşan and M. Oltean, "Fruit recognition from images using deep learning," Acta Universitatis Sapientiae, Informatica, vol. 10, no. 1, pp. 26–42, 2018.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NeurIPS, 2012, pp. 1097–1105.
[22] B. W. Silverman, Density Estimation for Statistics and Data Analysis. CRC Press, 1986, vol. 26.
[23] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, "Quantifying unique information," Entropy, vol. 16, no. 4, pp. 2161–2183, 2014.
[24] V. Griffith and C. Koch, "Quantifying synergistic mutual information," in Guided Self-Organization: Inception. Springer, 2014, pp. 159–190.
[25] A. J. Bell, "The co-information lattice," in Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, vol. 2003, 2003.
[26] N. Timme, W. Alford, B. Flecker, and J. M. Beggs, "Synergy, redundancy, and multivariate information measures: an experimentalist's perspective," J. Comput. Neurosci., vol. 36, no. 2, pp. 119–140, 2014.
[27] M. Hellman and J. Raviv, "Probability of error, equivocation, and the Chernoff bound," IEEE Transactions on Information Theory, vol. 16, no. 4, pp. 368–372, 1970.
[28] I. Sason and S. Verdú, "Arimoto–Rényi conditional entropy and Bayesian m-ary hypothesis testing," IEEE Transactions on Information Theory, vol. 64, no. 1, pp. 4–25, 2018.
[29] L. Shen and Q. Huang, "Relay backpropagation for effective learning of deep convolutional neural networks," in ECCV, 2016, pp. 467–482.
[30] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[31] N. X. Vinh, J. Chan, and J. Bailey, "Reconsidering mutual information based feature selection: A statistical significance view," in AAAI, 2014.
[32] S. Yu and J. C. Príncipe, "Simple stopping criteria for information theoretic feature selection," Entropy, vol. 21, no. 1, p. 99, 2019.
[33] M. Noshad, Y. Zeng, and A. O. Hero, "Scalable mutual information estimation using dependence graphs," in ICASSP, 2019, pp. 2962–2966.
[34] I. Chelombiev, C. Houghton, and C. O'Donnell, "Adaptive estimators show information compression in deep neural networks," arXiv preprint arXiv:1902.09037, 2019.
[35] Z. Goldfeld et al., "Estimating information flow in neural networks," arXiv preprint arXiv:1810.05728, 2018.
[36] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[37] A. Kolchinsky and B. Tracey, "Estimating mixture entropy with pairwise distances," Entropy, vol. 19, no. 7, p. 361, 2017.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[40] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[41] D. Lopez-Paz, P. Hennig, and B. Schölkopf, "The randomized dependence coefficient," in NeurIPS, 2013, pp. 1–9.
[42] Y.-I. Moon, B. Rajagopalan, and U. Lall, "Estimation of mutual information using kernel density estimators," Physical Review E, vol. 52, no. 3, p. 2318, 1995.
[43] T. Nagler and C. Czado, "Evading the curse of dimensionality in nonparametric density estimation with simplified vine copulas," Journal of Multivariate Analysis, vol. 151, pp. 69–89, 2016.
[44] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[45] S. Gao, G. Ver Steeg, and A. Galstyan, "Efficient estimation of mutual information for strongly dependent variables," in AISTATS, 2015, pp. 277–286.
[46] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[47] R. Jenssen, "Kernel entropy component analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 847–860, 2009.
[48] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in ICCV, 2017, pp. 5058–5066.
[49] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," in ICLR, 2017.
[50] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," arXiv preprint arXiv:1607.03250, 2016.
[51] R. A. Amjad, K. Liu, and B. C. Geiger, "Understanding individual neuron importance using information theory," arXiv preprint arXiv:1804.06679, 2018.
[52] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint physics/0004057, 2000.
[53] D. Koller and M. Sahami, "Toward optimal feature selection," Stanford InfoLab, Tech. Rep., 1996.
[54] S. Yaramakala and D. Margaritis, "Speculative Markov blanket discovery for optimal feature selection," in ICDM, 2005, pp. 809–812.
[55] J. R. Vergara and P. A. Estévez, "A review of feature selection methods based on mutual information," Neural Computing and Applications, vol. 24, no. 1, pp. 175–186, 2014.

APPENDIX
A. Partial Information Diagram
In this section, we briefly recall the ideas behind the partial information decomposition (PID) by Williams and Beer [11]. The PID is a framework to define information decompositions of arbitrarily many random variables, which can be characterized by a partial information (PI) diagram. The general structure of a PI diagram becomes clear when we consider the decomposition for four variables (see Fig. 10), which illustrates the way in which the total information that $R = \{R_1, R_2, R_3\}$ provides about S is distributed across various combinations of sources.

Specifically, each element in R can provide unique information (regions labeled {1}, {2}, and {3}), information redundantly with one other variable ({1}{2}, {1}{3}, and {2}{3}), or information synergistically with one other variable ({12}, {13}, and {23}). Additionally, information can be provided redundantly by all three variables ({1}{2}{3}) or provided by their three-way synergy ({123}). More interesting are the new kinds of terms representing combinations of redundancy and synergy. For instance, the regions marked {1}{23}, {2}{13}, and {3}{12} represent information that is available redundantly from either one variable considered individually or the other two considered together. In general, each PI term represents the redundancy of synergies for a particular collection of sources, corresponding to one distinct way for the components of $R = \{R_1, R_2, R_3\}$ to contribute information about S. Unfortunately, the number of PI terms grows exponentially with the number of elements in R. For example, if $R = \{R_1, R_2, R_3, R_4\}$, there are already 166 individual non-negative terms. Moreover, the reliable estimation of each PID term still remains a big challenge [23], [24].

Fig. 10. Partial information diagrams for (a) three and (b) four variables. For brevity, the sets are abbreviated by the indices of their elements; that is, $\{R_1, R_2\}$ is abbreviated by {12}, and so on. Figure by Williams and Beer [11].
Our method to determine the network width (i.e., the number of filters N) is motivated by the concept of the Markov blanket (MB) [53], [54]. Recall that the MB M of a target variable Y is the smallest subset of S such that Y is conditionally independent of the rest of the variables S − M, i.e., $Y \perp (S - M) \mid M$ [55]. From the perspective of information theory, this indicates that the conditional mutual information (CMI) $I(\{S - M\}; Y \mid M)$ is zero. Therefore, given the selected filter subset $T_s$, the remaining filter subset $T_r$, and the class labels Y, we can obtain a reliable estimate of N by evaluating whether $I(T_r; Y \mid T_s)$ approaches zero. Theoretically, $I(T_r; Y \mid T_s)$ is non-negative and monotonically decreasing with the increase of the size of $T_s$ (i.e., $|T_s|$) [30], but it will never reach zero in practice due to statistical variation and chance agreement between variables [31].

To measure how close $I(T_r; Y \mid T_s)$ is to zero, we select a new candidate filter $t$ in $T_r$ and quantify how $t$ affects the MB condition by creating a random permutation of $t$ (without permuting the corresponding Y), denoted $\tilde{t}$. If $I(\{T_r - t\}; Y \mid \{T_s, t\})$ is not significantly smaller than $I(\{T_r - \tilde{t}\}; Y \mid \{T_s, \tilde{t}\})$, $t$ can be discarded and the filter selection is stopped (i.e., $N = |T_s|$). We term this method CMI-permutation [32]. Algorithm 1 gives its detailed implementation, in which $\mathbb{1}[\cdot]$ denotes the indicator function.
Algorithm 1 CMI-permutation

Require: Selected filter subset $T_s$; remaining filter subset $T_r$; class labels Y; selected filter $t$ (in $T_r$); permutation number P; significance level α.
Ensure: decision (stop filter selection or continue filter selection); network width N.

Estimate $I(\{T_r - t\}; Y \mid \{T_s, t\})$ with the matrix-based Rényi's α-entropy functional estimator [10].
for i = 1 to P do
    Randomly permute $t$ to obtain $\tilde{t}_i$.
    Estimate $I(\{T_r - \tilde{t}_i\}; Y \mid \{T_s, \tilde{t}_i\})$ with the matrix-based Rényi's α-entropy functional estimator [10].
end for
if $\frac{1}{P}\sum_{i=1}^{P} \mathbb{1}\left[ I(\{T_r - t\}; Y \mid \{T_s, t\}) \geq I(\{T_r - \tilde{t}_i\}; Y \mid \{T_s, \tilde{t}_i\}) \right] \leq \alpha$ then
    decision ← continue filter selection.
else
    decision ← stop filter selection. $N \leftarrow |T_s|$.
end if
return decision; N (if filter selection is stopped)
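For completeness, a hedged Python sketch of one CMI-permutation step is given below, built on the estimators sketched in Section II. Representing each filter group by the Hadamard product of its Gram matrices, using an RBF kernel on the labels, and the default values of P and the significance level are all assumptions of this illustration, not the authors' implementation.

```python
def conditional_mi(grams_a, grams_b, grams_c, alpha=1.01):
    # I(A; B | C) = S(A, C) + S(B, C) - S(C) - S(A, B, C), with each joint entropy
    # obtained from the Hadamard product of the corresponding Gram matrices (Eq. (3)).
    return (joint_renyi_entropy(grams_a + grams_c, alpha)
            + joint_renyi_entropy(grams_b + grams_c, alpha)
            - joint_renyi_entropy(grams_c, alpha)
            - joint_renyi_entropy(grams_a + grams_b + grams_c, alpha))

def cmi_permutation_step(selected, remaining, t, Y, P=100, sig=0.05, alpha=1.01, h=5.0):
    # One step of Algorithm 1. `selected` and `remaining` are lists of (n, d) filter
    # activation arrays, t is one element of `remaining`, and Y is an (n, c) label array.
    rng = np.random.default_rng(0)
    G = lambda m: rbf_gram(m, silverman_sigma(m, h))
    rest = [G(m) for m in remaining if m is not t]
    base = [G(m) for m in selected]
    Yg = [G(Y.astype(float))]
    ref = conditional_mi(rest, Yg, base + [G(t)], alpha)
    hits = 0
    for _ in range(P):
        t_perm = t[rng.permutation(t.shape[0])]   # permute t, keep Y fixed
        if ref >= conditional_mi(rest, Yg, base + [G(t_perm)], alpha):
            hits += 1
    return "continue" if hits / P <= sig else "stop"
```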
We finally present additional results to further support the arguments in the main text. Specifically, Fig. 11 evaluates the multivariate mutual information (MMI) values with respect to different CNN topologies trained on the Fashion-MNIST and HASYv2 data sets. Obviously, the MMI values are likely to saturate with only a few filters. Moreover, increasing the number of filters does not guarantee that classification accuracy increases, and might even degrade performance. The tendencies of the redundancy-synergy trade-off and the weighted non-redundant information are shown in Fig. 12. Again, more filters push the network towards an improved redundancy-synergy trade-off, i.e., the synergy gradually dominates in each pair of feature maps as the number of filters increases. Fig. 13 demonstrates the information plane (IP) and the modified information plane (M-IP) for a smaller AlexNet [21] type CNN trained on the HASYv2 and Fruits 360 data sets. Although we did not observe any compression in the IP, it is very easy to observe the compression phase in the M-IP.

Fig. 11. The MMI values in (a) Conv. 1 and Conv. 2 on the Fashion-MNIST data set; and (b) Conv. 1, Conv. 2 and Conv. 3 on the HASYv2 data set. The black line indicates the upper bound of the MMI, i.e., the average mini-batch input entropy. The topologies of all competing networks are specified in the legend, in which the successive numbers indicate the number of filters in each convolutional layer. We also report their classification accuracies (%) on the testing set, averaged over Monte-Carlo simulations, in parentheses.

Fig. 12. The redundancy-synergy trade-off and the weighted non-redundant information on the Fashion-MNIST data set. (a) and (b) demonstrate the percentages of these two quantities with respect to different numbers of filters in Conv. 1, with a fixed number of filters in Conv. 2. (c) and (d) demonstrate the percentages of these two quantities with a fixed number of filters in Conv. 1 but different numbers of filters in Conv. 2. In each subfigure, the topologies of all competing networks are specified in the legend.

Fig. 13. The Information Plane (IP) and the modified Information Plane (M-IP) of a smaller AlexNet type CNN trained on the HASYv2 (the first row) and Fruits 360 (the second row) data sets. The numbers of filters in Conv. 1, Conv. 2 and Conv. 3 are indicated in the subtitle of each plot.