Understanding Convolutional Neural Networks with Information Theory: An Initial Exploration
Shujian Yu, Kristoffer Wickstrøm, Robert Jenssen, Jose C. Principe
Shujian Yu, Student Member, IEEE, Kristoffer Wickstrøm, Robert Jenssen, Member, IEEE, and José C. Príncipe, Life Fellow, IEEE

Shujian Yu and José C. Príncipe are with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA (email: yusjlcy9011@ufl.edu). Kristoffer Wickstrøm and Robert Jenssen are with the Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø 9037, Norway (email: {kwi030, robert.jenssen}@uit.no).
Abstract—A novel functional estimator for Rényi's α-entropy and its multivariate extension was recently proposed in terms of the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). However, the utility and possible applications of these new estimators are rather new and mostly unknown to practitioners. In this brief, we first show that our estimators enable straightforward measurement of information flow in realistic convolutional neural networks (CNNs) without any approximation. Then, we introduce the partial information decomposition (PID) framework and develop three quantities to analyze the synergy and redundancy in convolutional layer representations. Our results validate two fundamental data processing inequalities and reveal some fundamental properties concerning CNN training.

Index Terms—Convolutional Neural Networks, Data Processing Inequality, Multivariate Matrix-based Rényi's α-entropy, Partial Information Decomposition.

I. INTRODUCTION
There has been a growing interest in understanding deep neural network (DNN) mapping and training using information theory [1], [2], [3]. According to Shwartz-Ziv and Tishby [4], a DNN should be analyzed by measuring the information quantities that each layer's representation T preserves about the input signal X with respect to the desired signal Y (i.e., I(X;T) with respect to I(T;Y), where I denotes mutual information), which has been called the Information Plane (IP). Moreover, they also empirically show that the common stochastic gradient descent (SGD) optimization undergoes two separate phases in the IP: an early "fitting" phase, in which both I(X;T) and I(T;Y) increase rapidly along with the iterations, and a later "compression" phase, in which there is a reversal such that I(X;T) and I(T;Y) continually decrease. However, the observations so far have been constrained to a simple multilayer perceptron (MLP) on toy data, and were later questioned by some counter-examples in [5].

In our most recent work [6], we used a novel matrix-based Rényi's α-entropy functional estimator [7] to analyze the information flow in stacked autoencoders (SAEs). We observed that the existence of a "compression" phase associated with I(X;T) and I(T;Y) in the IP is predicated on the proper dimension of the bottleneck layer size S of the SAE: if S is larger than the intrinsic dimensionality d [8] of the training data, the mutual information values start to increase up to a point and then go back, approaching the bisector of the IP; if S is smaller than d, the mutual information values increase consistently up to a point and never go back.

Despite the great potential of earlier works [4], [5], [6], there are several open questions when it comes to the application of information theoretic concepts to convolutional neural networks (CNNs). These include but are not limited to:

1) The accurate and tractable estimation of information quantities in CNNs. Specifically, in a convolutional layer, the input signal X is represented by multiple feature maps, as opposed to a single vector in the fully connected layers. Therefore, the quantity we really need to measure is the multivariate mutual information (MMI) between a single variable (e.g., X) and a group of variables (e.g., different feature maps). Unfortunately, the reliable estimation of MMI is widely acknowledged as an intractable or infeasible task in the machine learning and information theory communities [9], especially when each variable is in a high-dimensional space.

2) A systematic framework to analyze CNN layer representations. By interpreting a feedforward DNN as a Markov chain, the existence of the data processing inequality (DPI) is a general consensus [4], [6]. However, it is necessary to identify more inner properties of CNN layer representations using a principled framework, beyond the DPI.

In this brief, we answer these two questions and make the following contributions:

1) By defining a multivariate extension of the matrix-based Rényi's α-entropy functional [10], we show that the information flow, especially the MMI, in CNNs can be measured without knowledge of the probability density function (PDF).

2) By capitalizing on the partial information decomposition (PID) framework [11] and on our sample based estimator for MMI, we develop three quantities that bypass the need to estimate the synergy and redundancy amongst different feature maps in convolutional layers. Our result sheds light on the determination of network depth (number of layers) and width (size of each layer). It also gives insights on network pruning.

II. INFORMATION QUANTITY ESTIMATION IN CNNS
In this section we give a brief introduction to the recently proposed matrix-based Rényi's α-entropy functional estimator [7] and its multivariate extension [10]. Benefiting from this novel definition, we present a simple method to measure MMI between any pairwise layer representations in CNNs.

A. Matrix-based Rényi's α-entropy functional and its multivariate extension

In information theory, a natural extension of the well-known Shannon's entropy is Rényi's α-order entropy [12]. For a random variable X with probability density function (PDF) f(x) in a finite set $\mathcal{X}$, the α-entropy $H_\alpha(X)$ is defined as:

$$H_\alpha(f) = \frac{1}{1-\alpha}\log\int_{\mathcal{X}} f^\alpha(x)\,dx. \qquad (1)$$

Rényi's entropy functional evidences a long track record of usefulness in machine learning and its applications [13]. Unfortunately, the need for accurate PDF estimation impedes its more widespread adoption in data driven science. To solve this problem, [7], [10] suggest similar quantities that resemble quantum Rényi's entropy [14] in terms of the normalized eigenspectrum of the Hermitian matrix of the projected data in an RKHS. The new estimators avoid evaluating the underlying probability distributions, and estimate information quantities directly from data. For brevity, we directly give the following definitions. The theoretical foundations for Definition 1 and Definition 2 are proved respectively in [7] and [10].
Definition 1: Let $\kappa: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ be a real valued positive definite kernel that is also infinitely divisible [15]. Given $X = \{x_1, x_2, \ldots, x_n\}$ and the Gram matrix $K$ obtained from evaluating a positive definite kernel $\kappa$ on all pairs of exemplars, that is $(K)_{ij} = \kappa(x_i, x_j)$, a matrix-based analogue to Rényi's α-entropy for a normalized positive definite (NPD) matrix $A$ of size $n \times n$, such that $\operatorname{tr}(A) = 1$, can be given by the following functional:

$$S_\alpha(A) = \frac{1}{1-\alpha}\log\left(\operatorname{tr}(A^\alpha)\right) = \frac{1}{1-\alpha}\log\left[\sum_{i=1}^{n}\lambda_i(A)^\alpha\right], \qquad (2)$$

where $A_{ij} = \frac{1}{n}\frac{K_{ij}}{\sqrt{K_{ii}K_{jj}}}$ and $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$.
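As an aside, Definition 1 translates almost directly into code. The following is a minimal NumPy sketch, not the authors' reference implementation: the RBF kernel form, the default order α = 1.01 (a value close to 1), and the natural logarithm are assumptions made here for illustration.

```python
import numpy as np

def rbf_gram(x, sigma):
    # Gram matrix of an RBF kernel, K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    # x: (n, d) array of flattened samples.
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def renyi_entropy(K, alpha=1.01):
    # Matrix-based Renyi alpha-entropy of Eq. (2): normalize K so that tr(A) = 1,
    # then evaluate 1/(1 - alpha) * log(sum_i lambda_i(A)^alpha).
    n = K.shape[0]
    d = np.sqrt(np.diag(K))
    A = K / np.outer(d, d) / n                        # A_ij = (1/n) K_ij / sqrt(K_ii K_jj)
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # guard against tiny negative eigenvalues
    return np.log(np.sum(lam ** alpha)) / (1.0 - alpha)
```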
Definition 2: Given a collection of $n$ samples $\{s^i = (x_1^i, x_2^i, \cdots, x_C^i)\}_{i=1}^{n}$, where the superscript $i$ denotes the sample index, each sample contains $C$ ($C \geq 2$) measurements $x_1 \in \mathcal{X}_1$, $x_2 \in \mathcal{X}_2$, $\cdots$, $x_C \in \mathcal{X}_C$ obtained from the same realization, and the positive definite kernels $\kappa_1: \mathcal{X}_1 \times \mathcal{X}_1 \mapsto \mathbb{R}$, $\kappa_2: \mathcal{X}_2 \times \mathcal{X}_2 \mapsto \mathbb{R}$, $\cdots$, $\kappa_C: \mathcal{X}_C \times \mathcal{X}_C \mapsto \mathbb{R}$, a matrix-based analogue to Rényi's α-order joint-entropy among $C$ variables can be defined as:

$$S_\alpha(A_1, A_2, \cdots, A_C) = S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)}\right), \qquad (3)$$

where $(A_1)_{ij} = \kappa_1(x_1^i, x_1^j)$, $(A_2)_{ij} = \kappa_2(x_2^i, x_2^j)$, $\cdots$, $(A_C)_{ij} = \kappa_C(x_C^i, x_C^j)$, and $\circ$ denotes the Hadamard product.

The following corollary (see proof in [10]) serves as a foundation for our Definition 2. Specifically, the first inequality indicates that the joint entropy of a set of variables is greater than or equal to the maximum of the individual entropies of the variables in the set, whereas the second inequality suggests that the joint entropy of a set of variables is less than or equal to the sum of the individual entropies of the variables in the set.
Corollary 1:
Let $A_1, A_2, \cdots, A_C$ be $C$ $n \times n$ positive definite matrices with trace 1 and nonnegative entries, and $(A_1)_{ii} = (A_2)_{ii} = \cdots = (A_C)_{ii} = \frac{1}{n}$ for $i = 1, 2, \cdots, n$. Then the following two inequalities hold:

$$S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)}\right) \geq \max\left[S_\alpha(A_1), S_\alpha(A_2), \cdots, S_\alpha(A_C)\right], \qquad (4)$$

$$S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)}\right) \leq S_\alpha(A_1) + S_\alpha(A_2) + \cdots + S_\alpha(A_C). \qquad (5)$$

B. Multivariate mutual information estimation in CNNs

Fig. 1. Venn diagram of $I(X; \{T_1, T_2, \cdots, T_C\})$.

Suppose there are C filters in a convolutional layer; an input image is therefore represented by C different feature maps. Each feature map characterizes a specific property of the input. This suggests that the amount of information that the convolutional layer gains from the input X is preserved in C information sources $T_1, T_2, \cdots, T_C$.

The Venn diagram for X and $T_1, \cdots, T_C$ is shown in Fig. 1. Specifically, the red circle represents the information contained in X, and each blue circle represents the information contained in one feature map. The amount of information about X that is gained from the C feature maps (i.e., $I(X; \{T_1, T_2, \cdots, T_C\})$) is exactly the shaded area. By applying the inclusion-exclusion principle [16], this shaded area can be computed by summing the area of the red circle (i.e., H(X)) and the area occupied by all blue circles (i.e., $H(T_1, T_2, \cdots, T_C)$), and then subtracting the total joint area occupied by the red circle and all blue circles (i.e., $H(X, T_1, T_2, \cdots, T_C)$). Formally, this indicates that:

$$I(X; \{T_1, T_2, \cdots, T_C\}) = H(X) + H(T_1, T_2, \cdots, T_C) - H(X, T_1, T_2, \cdots, T_C), \qquad (6)$$

where H denotes entropy for a single variable or joint entropy for a group of variables.

Given Eqs. (2), (3) and (6), $I(X; \{T_1, T_2, \cdots, T_C\})$ in a mini-batch of size n can be estimated with:

$$I_\alpha(B; \{A_1, A_2, \cdots, A_C\}) = S_\alpha(B) + S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C)}\right) - S_\alpha\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_C \circ B}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_C \circ B)}\right). \qquad (7)$$

Here, $B, A_1, \cdots, A_C$ denote Gram matrices evaluated on the input tensor and the C feature map tensors, respectively. For example, $A_p$ ($1 \leq p \leq C$) is evaluated on $\{x_p^i\}_{i=1}^{n}$, in which $x_p^i$ refers to the feature map generated from the $i$-th input sample using the $p$-th filter. Obviously, instead of estimating the joint PDF on $\{X, T_1, T_2, \cdots, T_C\}$, which is typically unattainable, one only needs to compute $(C+1)$ Gram matrices using a real valued positive definite kernel that is also infinitely divisible [15].
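Continuing the sketch above, Eqs. (3), (6) and (7) can be composed into a short MMI routine. This is only an illustrative composition of the definitions under the earlier assumptions, not the authors' code.

```python
def joint_renyi_entropy(grams, alpha=1.01):
    # Matrix-based joint entropy of Eq. (3): Hadamard product of the normalized
    # Gram matrices, re-normalized to unit trace before the eigen-decomposition.
    n = grams[0].shape[0]
    H = np.ones((n, n))
    for K in grams:
        d = np.sqrt(np.diag(K))
        H *= K / np.outer(d, d) / n
    A = H / np.trace(H)
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log(np.sum(lam ** alpha)) / (1.0 - alpha)

def multivariate_mi(B, feature_grams, alpha=1.01):
    # Eq. (7): I(X; {T_1,...,T_C}) = S(B) + S(A_1,...,A_C) - S(A_1,...,A_C,B),
    # where B is the Gram matrix of the input batch and A_p that of the p-th feature map.
    return (renyi_entropy(B, alpha)
            + joint_renyi_entropy(feature_grams, alpha)
            - joint_renyi_entropy(list(feature_grams) + [B], alpha))
```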
III. MAIN RESULTS
This section presents three sets of experiments to empirically validate the existence of two DPIs in CNNs, using the novel nonparametric information theoretic (IT) estimators put forth in this work. Specifically, Section III-A validates the existence of two DPIs in CNNs. In Section III-B, we illustrate, via the application of the PID framework in the training phase, some interesting observations associated with different CNN topologies. Following this, we present implications for the determination of network depth and width motivated by these results. We finally point out, in Section III-C, an advanced interpretation of the information plane (IP) that deserves more (theoretical) investigation. Four real-world data sets, namely MNIST [17], Fashion-MNIST [18], HASYv2 [19], and Fruits 360 [20], are selected for evaluation. The characteristics of each data set are summarized in Table I. Note that, compared with the benchmark MNIST and Fashion-MNIST, HASYv2 and Fruits 360 have a significantly larger number of classes as well as higher intraclass variance. For example, in Fruits 360, the apple class contains different varieties (e.g., Crimson Snow, Golden, Granny Smith), and the images are captured from different viewpoints and under varying illumination conditions. Due to page limitations, we only demonstrate the most representative results in the rest of this paper. Additional experimental results are available in Appendix C.
TABLE I
THE NUMBER OF CLASSES (CLASS), THE NUMBER OF TRAINING SAMPLES (TRAIN), THE NUMBER OF TESTING SAMPLES (TEST), AND THE SAMPLE SIZE OF SELECTED DATA SETS.

Data set             Class    Train      Test      Sample Size
MNIST [17]           10       60,000     10,000    28 × 28
Fashion-MNIST [18]   10       60,000     10,000    28 × 28
HASYv2 [19]          369      151,406    16,827    32 × 32
Fruits 360 [20]      111      56,781     19,053    100 × 100

For MNIST and Fashion-MNIST, we consider a LeNet-5 [17] type network which consists of convolutional layers, pooling layers, and fully connected layers. For HASYv2 and Fruits 360, we use a smaller AlexNet [21] type network with more convolutional layers (but fewer filters in each layer) and fully connected layers. We train the CNNs with SGD with momentum and a fixed mini-batch size; the learning rate and the number of training epochs are selected separately for MNIST/Fashion-MNIST and for HASYv2/Fruits 360. Both "sigmoid" and "ReLU" activation functions are tested. For the estimation of MMI, we fix α close to 1 to approximate Shannon's definition, and use the radial basis function (RBF) kernel $\kappa(x_i, x_j) = \exp\!\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$ to obtain the Gram matrices. The kernel size σ is determined based on Silverman's rule of thumb [22], $\sigma = h \times n^{-1/(4+d)}$, where $n$ is the number of samples in the mini-batch, $d$ is the sample dimensionality, and $h$ is an empirical value selected experimentally by taking into consideration the data's average marginal variance. In this paper, we select $h = 5$ for the input signal forward propagation chain and a smaller value of $h$ for the error backpropagation chain.
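To make this setup concrete, a hedged sketch of the kernel-size selection and of the per-layer MMI computation is given below. It continues the functions introduced in Section II; flattening each feature map into a vector (the "vector rastering" discussed in Section IV) and the helper names are assumptions of this sketch, not part of the original implementation.

```python
def silverman_sigma(x, h=5.0):
    # Silverman's rule of thumb for the RBF kernel size: sigma = h * n^(-1/(4+d)),
    # where h is an empirical constant chosen from the data's average marginal variance.
    n, d = x.shape
    return h * n ** (-1.0 / (4 + d))

def layer_mmi(X_batch, feature_maps, alpha=1.01, h=5.0):
    # I(X; {T_1,...,T_C}) for one convolutional layer and one mini-batch.
    # X_batch: (n, d_x) flattened inputs; feature_maps: list of (n, d_t) flattened
    # maps, one entry per filter.
    B = rbf_gram(X_batch, silverman_sigma(X_batch, h))
    grams = [rbf_gram(T, silverman_sigma(T, h)) for T in feature_maps]
    return multivariate_mi(B, grams, alpha)
```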
A. Experimental Validation of Two DPIs

We expect the existence of two DPIs in any feedforward CNN with $K$ hidden layers, i.e., $I(X; T_1) \geq I(X; T_2) \geq \cdots \geq I(X; T_K)$ and $I(\delta_K; \delta_{K-1}) \geq I(\delta_K; \delta_{K-2}) \geq \cdots \geq I(\delta_K; \delta_1)$, where $T_1, T_2, \cdots, T_K$ are the successive hidden layer representations from the first hidden layer to the output layer and $\delta_K, \delta_{K-1}, \cdots, \delta_1$ are the errors from the output layer to the first hidden layer. This is because both $X \rightarrow T_1 \rightarrow \cdots \rightarrow T_K$ and $\delta_K \rightarrow \delta_{K-1} \rightarrow \cdots \rightarrow \delta_1$ form a Markov chain [4], [6]. Fig. 2 shows the DPIs at the initial training stage, after one epoch of training, and at the final training stage. As can be seen, the DPIs hold in most of the cases. Note that there are a few disruptions in the error backpropagation chain, although the curves should be monotonic according to the theory. One possible reason is that when training converges, the error becomes tiny, such that Silverman's rule of thumb is no longer a reliable choice to select the scale parameter σ in our estimator.

Fig. 2. Two DPIs in CNNs. The first row shows the validation results on the MNIST data set, obtained by a CNN with two convolutional layers (denoted Conv. 1 and Conv. 2); the second row shows the validation results on the Fashion-MNIST data set, obtained by the same CNN architecture as for MNIST; the third row shows the validation results on the HASYv2 data set, obtained by a CNN with three convolutional layers (Conv. 1 to Conv. 3); whereas the fourth row shows the validation results on the Fruits 360 data set, obtained by a CNN with four convolutional layers (Conv. 1 to Conv. 4). The columns correspond to the initial iteration, one epoch later, and the end of training. In each subfigure, the blue curve shows the MMI values between the input and different layer representations, whereas the green curve shows the MMI values between the errors in the output layer and in different hidden layers.
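As a usage note, the forward-chain DPI of this section can be checked directly from the per-layer MMI values; a minimal sketch, again under the assumptions stated with the earlier helpers, is:

```python
def check_forward_dpi(X_batch, layers_feature_maps, alpha=1.01, tol=1e-6):
    # layers_feature_maps: one list of flattened feature maps per hidden layer,
    # ordered from the first hidden layer to the last.
    mmi = [layer_mmi(X_batch, fmaps, alpha) for fmaps in layers_feature_maps]
    holds = all(a >= b - tol for a, b in zip(mmi, mmi[1:]))
    return mmi, holds
```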
B. Redundancy and Synergy in Layer Representations
In this section, we explore the properties of different IT quantities during the training of CNNs, with the help of the PID framework. In particular, we are interested in determining the redundancy and synergy amongst different feature maps and how they evolve with training in different CNN topologies. Moreover, we are also interested in identifying some upper and lower limits (if they exist) for these quantities. However, the analysis is not easy, because the set of information equations is underdetermined, as we will show next.
Given the input signal X and two feature maps $T_1$ and $T_2$, the PID framework indicates that the MMI $I(X; \{T_1, T_2\})$ can be decomposed into four non-negative IT components: the synergy $Syn(X; \{T_1, T_2\})$ that measures the information about X provided by the combination of $T_1$ and $T_2$ (i.e., the information that cannot be captured by either $T_1$ or $T_2$ alone); the redundancy $Rdn(X; \{T_1, T_2\})$ that measures the shared information about X that can be provided by either $T_1$ or $T_2$; and the unique information $Unq(X; T_1)$ (or $Unq(X; T_2)$) that measures the information about X that can only be provided by $T_1$ (or $T_2$). Moreover, the unique information, the synergy and the redundancy satisfy (see Fig. 3):

$$I(X; \{T_1, T_2\}) = Syn(X; \{T_1, T_2\}) + Rdn(X; \{T_1, T_2\}) + Unq(X; T_1) + Unq(X; T_2); \qquad (8)$$
$$I(X; T_1) = Rdn(X; \{T_1, T_2\}) + Unq(X; T_1); \qquad (9)$$
$$I(X; T_2) = Rdn(X; \{T_1, T_2\}) + Unq(X; T_2). \qquad (10)$$

Fig. 3. Synergy and redundancy amongst different feature maps. (a) shows the interactions between the input signal and two feature maps; the shaded area indicates the MMI $I(X; \{T_1, T_2\})$. (b) shows the PID of $I(X; \{T_1, T_2\})$.

Notice that we have four IT components (i.e., synergy, redundancy, and two unique information terms), but only three measurements: $I(X; \{T_1, T_2\})$, $I(X; T_1)$, and $I(X; T_2)$. Therefore, we can never determine the IT quantities uniquely. This decomposition of $I(X; \{T_1, T_2\})$ can be straightforwardly extended to more than three variables, thus decomposing $I(X; \{T_1, T_2, \cdots, T_C\})$ into many more components. For example, if C = 4, there are already 166 individual non-negative terms. Admittedly, the PID diagram (see Appendix A for more details) offers an intuitive understanding of the interactions between the input and different feature maps, and our estimators have been shown appropriate for high dimensional data. However, the reliable estimation of each IT component still remains a big challenge because of the underdetermined nature of the problem. In fact, there is no universal agreement on the definition of synergy and redundancy even for three-way interactions among one-dimensional variables, let alone the estimation of each synergy or redundancy among numerous variables in high-dimensional spaces [23], [24].

To this end, we develop three quantities based on the three measurements, by manipulating Eqs. (8)-(10), to characterize intrinsic properties of CNN layer representations. The newly developed quantities avoid the direct estimation of synergy and redundancy. They are:

1) $I(X; \{T_1, T_2, \cdots, T_C\})$, which is exactly the MMI. This quantity measures the amount of information about X that is captured by all feature maps (in one convolutional layer).

2) $\frac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j=i+1}^{C} \left[ I(X; T_i) + I(X; T_j) - I(X; \{T_i, T_j\}) \right]$, which is referred to as the redundancy-synergy trade-off. This quantity measures the (average) redundancy-synergy trade-off in different feature maps. By rewriting Eqs. (8)-(10), we have:

$$I(X; T_i) + I(X; T_j) - I(X; \{T_i, T_j\}) = Rdn(X; \{T_i, T_j\}) - Syn(X; \{T_i, T_j\}). \qquad (11)$$

Obviously, a positive value of this trade-off implies redundancy, whereas a negative value signifies synergy [25]. Here, instead of measuring all PID terms, whose number increases rapidly with C, we sample pairs of feature maps, calculate the information quantities for each pair, and finally compute averages over all pairs to determine whether synergy dominates in the training phase. Note that the pairwise sampling procedure has been used in neuroscience [26] and in a recent information theoretic investigation of the Restricted Boltzmann Machine (RBM) [3].

3) $\frac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j=i+1}^{C} \left[ 2 \times I(X; \{T_i, T_j\}) - I(X; T_i) - I(X; T_j) \right]$, which is referred to as the weighted non-redundant information. This quantity measures the (average) amount of non-redundant information about X that is captured by pairs of feature maps. As can be seen from Eqs. (8)-(10),

$$2 \times I(X; \{T_i, T_j\}) - I(X; T_i) - I(X; T_j) = Unq(X; T_i) + Unq(X; T_j) + 2 \times Syn(X; \{T_i, T_j\}). \qquad (12)$$

We call this quantity "weighted" because we overemphasize the role of synergy; notice that redundancy does not explicitly appear, while the two unique information terms reappear.

One should note that Eqs. (11) and (12) are just two of many equations that can be written, but all of them are linear combinations of more than one IT component. Therefore, we do not introduce any errors in computing Eqs. (11) and (12); we simply work in a linearly projected space of synergy and redundancy.
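For concreteness, the two averaged quantities in 2) and 3) can be computed from pairwise MMI estimates alone. The sketch below reuses the helper functions introduced earlier; it is an illustration of Eqs. (11)-(12) under those stated assumptions, not the authors' code.

```python
from itertools import combinations

def pairwise_pid_quantities(X_batch, feature_maps, alpha=1.01, h=5.0):
    # Average redundancy-synergy trade-off (Eq. (11)) and weighted non-redundant
    # information (Eq. (12)) over all pairs of feature maps in one layer.
    B = rbf_gram(X_batch, silverman_sigma(X_batch, h))
    grams = [rbf_gram(T, silverman_sigma(T, h)) for T in feature_maps]
    mi_single = [multivariate_mi(B, [A], alpha) for A in grams]
    rdn_syn, non_red = [], []
    for i, j in combinations(range(len(grams)), 2):
        mi_pair = multivariate_mi(B, [grams[i], grams[j]], alpha)
        rdn_syn.append(mi_single[i] + mi_single[j] - mi_pair)        # Rdn - Syn
        non_red.append(2.0 * mi_pair - mi_single[i] - mi_single[j])  # Unq_i + Unq_j + 2*Syn
    return float(np.mean(rdn_syn)), float(np.mean(non_red))
```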
We will now experimentally show how these two pairs of IT components (synergy and redundancy from Eq. (11), and synergy with the two unique information terms from Eq. (12)) evolve across different CNN layer changes. We first evaluate the MMI with respect to different CNN topologies in Fig. 4. For MNIST, we demonstrate the MMI values in the first two convolutional layers (denoted Conv. 1 and Conv. 2). Similarly, for Fruits 360, we demonstrate the MMI values in Conv. 1, Conv. 2 and Conv. 3. By the DPI, the maximum amount of information that each convolutional layer representation can capture is exactly the entropy of the input. As can be seen, as the number of filters increases, the total amount of information that each layer captures also increases accordingly. However, it is interesting to see that the MMI values are likely to saturate with only a few filters. For example, in Fruits 360, only a few filters in each of Conv. 1, Conv. 2 and Conv. 3 already make the MMI values reach their maximum value (i.e., the ensemble average entropy across mini-batches) in each layer. More filters increase classification accuracy at first.

Fig. 4. The MMI values in (a) Conv. 1 and Conv. 2 on the MNIST data set; and (b) Conv. 1, Conv. 2 and Conv. 3 on the Fruits 360 data set. The black line indicates the upper bound of the MMI, i.e., the average mini-batch input entropy. The topologies of all competing networks are specified in the legend, in which the successive numbers indicate the number of filters in each convolutional layer. We also report their classification accuracies (%) on the testing set, averaged over Monte-Carlo simulations, in parentheses.
However, increasing the number of filters does not guarantee that classification accuracy increases, and might even degrade performance. We argue that this phenomenon can be explained by the percentage that the redundancy-synergy trade-off or the weighted non-redundant information accounts for in the MMI of each pair of feature maps, i.e., $\frac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j=i+1}^{C} \frac{I(X; T_i) + I(X; T_j) - I(X; \{T_i, T_j\})}{I(X; \{T_i, T_j\})}$ or $\frac{2}{C(C-1)} \sum_{i=1}^{C} \sum_{j=i+1}^{C} \frac{2 \times I(X; \{T_i, T_j\}) - I(X; T_i) - I(X; T_j)}{I(X; \{T_i, T_j\})}$. In fact, by referring to Fig. 5, it is obvious that more filters push the network towards an improved redundancy-synergy trade-off, i.e., the synergy gradually dominates in each pair of feature maps as the number of filters increases. That is perhaps one of the main reasons why an increased number of filters can lead to better classification performance, even though the total multivariate mutual information stays the same. However, if we look deeper, it seems that the redundancy-synergy trade-off is always positive, which may suggest that redundancy is always larger than synergy. On the other hand, one should note that the amount of non-redundant information is always less than the MMI (redundancy is non-negative), no matter the number of filters. Therefore, it is impossible to improve the classification performance by blindly increasing the number of filters. This is because the minimum probability of classification error is upper bounded by the MMI expressed in different forms (e.g., [27], [28]).

Having illustrated the DPIs and the redundancy-synergy trade-offs, it is easy to summarize some implications concerning the design and training of CNNs. First, as a possible application of the DPI in the error backpropagation chain, we suggest using the DPI as an indicator of where to perform the "bypass" in the recently proposed Relay backpropagation [29]. Second, the DPIs and the redundancy-synergy trade-off may give some guidelines on the depth and width of CNNs. Intuitively, we need multiple layers to quantify the multi-scale information contained in natural images. However, more layers will lead to severe information loss. The same dilemma applies to the number of filters in convolutional layers: a sufficient number of filters guarantees preservation of the input information and the ability to learn a good redundancy-synergy trade-off. However, increasing the number of filters does not always lead to a performance gain.

Admittedly, it is hard to give a concrete rule to determine the exact number of filters in one convolutional layer from the current results. We still present a possible solution to shed light on this problem. In fact, if we view each filter as an individual feature extractor, the problem of determining the optimal number of filters turns out to be that of seeking a stopping criterion for feature selection. Therefore, the number of filters can be determined by monitoring the value of the conditional mutual information (CMI), i.e., $I(T_r; Y \mid T_s)$, where $T_s$ and $T_r$ denote respectively the selected and the remaining filters, and Y denotes the desired response. Theoretically, $I(T_r; Y \mid T_s)$ is monotonically decreasing if a new filter $t$ is added into $T_s$ [30], but it will never reach zero in practice [31]. Therefore, in order to evaluate the impact of $t$ on $I(T_r; Y \mid T_s)$, we can create a random permutation of $t$ (without permuting the corresponding Y), denoted $\tilde{t}$.
If $I(\{T_r - t\}; Y \mid \{T_s, t\})$ is not significantly smaller than $I(\{T_r - \tilde{t}\}; Y \mid \{T_s, \tilde{t}\})$, $t$ can be discarded and the filter selection is stopped. We term this method CMI-permutation [32]. We refer interested readers to Appendix B for its detailed implementation. Our preliminary results, shown in Fig. 6, suggest that CMI-permutation is likely to underestimate the number of filters. Therefore, additional design efforts are required as future work.

C. Revisiting the Information Plane (IP)
The behavior of the curves in the IP is currently a controversial issue. Recall the discrepancy reported by Saxe et al. [5]: the existence of the compression phase observed by Shwartz-Ziv and Tishby [4] depends on the adopted nonlinearity, in that double-sided saturating nonlinearities like "tanh" or "sigmoid" yield a compression phase, but linear activation functions and single-sided saturating nonlinearities like "ReLU" do not. Interestingly, Noshad et al. [33] employed dependence graphs to estimate mutual information values and observed the compression phase even with "ReLU" activation functions. A similar phenomenon was also observed by Chelombiev et al. [34], in which an entropy-based adaptive binning (EBAB) estimator is developed to enable more robust mutual information estimation that adapts to the hidden activity of neural networks. On the other hand, Goldfeld et al. [35] argued that compression is due to the clustering of layer representations, but that it is hard to observe compression in large networks. We disagree with this attribution of the different behavior to the nonlinear activation functions. Instead, we often forget that estimators rarely share all the properties of the statistically defined quantities [36]. Hence, the variability in the displayed behavior is most likely attributable to the different estimators (Shwartz-Ziv and Tishby [4] use the basic Shannon definition and estimate mutual information by dividing neuron activation values into equal-interval bins, whereas the base estimator used by Saxe et al. [5] provides Kernel Density Estimator (KDE) based lower and upper bounds on the true mutual information [37], [33]), although this argument is rarely invoked in the literature. This is the reason we suggest that a first step, before analyzing the information plane curves, is to show that the employed estimators meet the expectation of the DPI (or similar known properties of the statistical quantities). We show above that our Rényi's entropy estimator passes this test.
Fig. 5. The redundancy-synergy trade-off and the weighted non-redundant information on the MNIST (the first row) and Fruits 360 (the second row) data sets. (a) and (b) demonstrate the percentages of these two quantities with respect to different numbers of filters in Conv. 1, with a fixed number of filters in Conv. 2; (c) and (d) demonstrate them with a fixed number of filters in Conv. 1 but different numbers of filters in Conv. 2. Similarly, (e)-(h) compare these two quantities with respect to different numbers of filters in two convolutional layers, with the number of filters in the remaining layer fixed. In each subfigure, the topologies of all competing networks are specified in the legend.

Fig. 6. Determination of the number of filters in (a) Conv. 2 of LeNet-5 trained on the MNIST data set; and (b) Conv5-3 of VGG-16 trained on the CIFAR-10 data set. CMI-permutation [32] retains only a small subset of the available filters (5 in case (a) and 21 in case (b)).

The IPs for a LeNet-5 type CNN trained on the MNIST and Fashion-MNIST data sets are shown in Fig. 7. From the first column, both I(X;T) and I(T;Y) increase rapidly up to a certain point with the SGD iterations. This result conforms to the description in [35], suggesting that the behavior of CNNs in the IP is not the same as that of the MLPs in [4], [5], [33], and that our intrinsic dimensionality hypothesis in [6] is specific to SAEs. However, if we remove the redundancy in I(X;T) and I(T;Y) and only preserve the unique information and the synergy (i.e., substituting I(X;T) and I(T;Y) with their corresponding (average) weighted non-redundant information defined in Section III-B), it is easy to observe the compression phase in the modified IP. Moreover, it seems that "sigmoid" is more likely to incur compression than "ReLU", an intensity that can be attributed to the nonlinearity. Our results shed light on the discrepancy between [4] and [5], and refine the arguments in [33], [34].

Fig. 7. The Information Plane (IP) and the modified Information Plane (M-IP) of a LeNet-5 type CNN trained on the MNIST (the first row) and Fashion-MNIST (the second row) data sets. The number of filters in Conv. 1, the number of filters in Conv. 2, and the adopted activation function are indicated in the subtitle of each plot. The curves in the IP increase rapidly up to a point without compression (see (a) and (d)). By contrast, it is very easy to observe the compression in the M-IP (see (b), (c) and (e)). Moreover, compared with ReLU, sigmoid is more likely to incur compression (e.g., comparing (b) with (c), or (e) with (f)).
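A point on the modified IP can thus be obtained by replacing both mutual information axes with the averaged quantity of Eq. (12). The following hedged sketch reuses the earlier helpers; treating the labels as one flattened (e.g., one-hot) variable is an assumption of this illustration.

```python
def modified_ip_point(X_batch, Y_batch, feature_maps, alpha=1.01, h=5.0):
    # One M-IP coordinate pair for a layer: the average weighted non-redundant
    # information of the feature maps about the input X and about the labels Y.
    # Y_batch: (n, c) one-hot label array (an illustrative choice).
    _, x_axis = pairwise_pid_quantities(X_batch, feature_maps, alpha, h)
    _, y_axis = pairwise_pid_quantities(Y_batch, feature_maps, alpha, h)
    return x_axis, y_axis
```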
IV. CONCLUSIONS AND FUTURE WORK

This brief presents a systematic method to analyze CNN mappings and training from an information theoretic perspective. Using the multivariate extension of the matrix-based Rényi's α-entropy functional, we validated two data processing inequalities (DPIs) in CNNs. The introduction of the partial information decomposition (PID) framework enables us to pin down the redundancy-synergy trade-off in layer representations. We also analyzed the behaviors of the curves in the information plane, aiming to clarify the debate on the existence of compression in DNNs. We close by highlighting some potential extensions of our methodology and directions of future research:

1) All the information quantities mentioned in this paper are estimated based on a vector rastering of samples, i.e., each layer input (e.g., an input image, a feature map) is first converted to a single vector before entropy or mutual information estimation. Despite its simplicity, this distorts the spatial relationships amongst neighboring pixels. Therefore, a question remains on reliable information theoretic estimation that is feasible within a tensor structure.

2) We look forward to evaluating our estimators on more complex CNN architectures, such as VGGNet [38] and ResNet [39]. According to our observations, it is easy to validate the DPI and the rapid increase of mutual information (in the top layers) of VGG-16 on the CIFAR-10 data set [40] (see Fig. 8).

Fig. 8. DPI of VGG-16 on CIFAR-10: $I(X; T_1) \geq I(X; T_2) \geq \cdots \geq I(X; T_K)$. Layer 1 to Layer 13 are convolutional layers, whereas Layer 14 to Layer 16 are fully-connected layers.

However, it seems that the MMI values in the bottom layers are likely to saturate. The problem arises when we take the Hadamard product of the kernel matrices of each feature map in Eq. (7).
The elements in these (normalized) kernel matrices have values between 0 and 1, and taking the entrywise product of many such matrices, as in a late convolutional layer of VGG-16, will tend towards a matrix with diagonal entries 1/n and nearly zero everywhere else. The eigenvalues of the resulting matrix will then quickly have almost the same value across training epochs.

3) Our estimator described in Eqs. (3) and (6) has two limitations one needs to be aware of. First, our estimators are highly efficient in the scenario of high-dimensional data, regardless of the data properties (e.g., continuous, discrete, or mixed). However, the computational complexity increases almost cubically with the number of samples n, because of the eigenvalue decomposition. Fortunately, it is possible to apply methods such as kernel randomization [41] to reduce the burden to O(n log(n)). By contrast, the well-known kernel density estimator [42] suffers from the curse of dimensionality [43], whereas the k-nearest neighbor estimator [44] requires exponentially many samples for accurate estimation [45]. Second, as we have emphasized in previous work (e.g., [6], [10]), it is important to select an appropriate value for the kernel size σ and the order α of the Rényi entropy functional; otherwise, spurious conclusions may be drawn. Reliable ways to select σ include Silverman's rule of thumb and taking a fixed percentage of the total (median) range of the Euclidean distances between all pairwise data points [46], [47]. On the other hand, the choice of α is associated with the task goal. If the application requires emphasis on the tails of the distribution (rare events) or on multiple modalities, α should be less than 2, possibly approaching 1 from above; α = 2 provides neutral weighting [13]. Finally, if the goal is to characterize modal behavior, α should be greater than 2. This work fixes α close to 1 to approximate Shannon's definition.

Fig. 9. Distribution of Information Bottleneck (IB) objective values for the filters in a convolutional layer of VGG-16 trained on CIFAR-10. Many filters have IB values above the red dashed line, which implies they are less important for classification.

4) Perhaps one of the most promising applications concerning our estimators is filter-level pruning for CNN compression, i.e., discarding whole filters that are less important [48]. Different statistics (e.g., the absolute weight sum [49] or the Average Percentage of Zeros (APoZ) [50]) have been proposed to measure filter importance. Moreover, a recent study [51] suggests that mutual information is a reliable indicator to measure neuron (or filter) importance. Therefore, given an individual filter $T_i$, we suggest evaluating its importance by the Information Bottleneck (IB) [52] objective $I(X; T_i) - \beta I(T_i; Y)$, where X and Y denote respectively the input batch and its corresponding desired output, and β is a Lagrange multiplier. Intuitively, a small value of this objective indicates that $T_i$ obtains a compressed representation of X that is relevant to Y. Therefore, the smaller the objective value, the higher the importance of $T_i$. Fig. 9 demonstrates the distribution of the IB objective values (β = 2) for the filters in a convolutional layer of VGG-16 trained on CIFAR-10. This distribution looks similar to the one obtained by APoZ in [50]; both indicate that a large fraction of the filters are less important and can be discarded with negligible loss in accuracy [49].
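As a sketch of this idea, the per-filter IB score can be computed with the same matrix-based estimators. The use of an RBF kernel on one-hot labels and the helper names below are illustrative assumptions, not the authors' implementation.

```python
def ib_filter_scores(X_batch, Y_onehot, feature_maps, beta=2.0, alpha=1.01, h=5.0):
    # IB objective I(X; T_i) - beta * I(T_i; Y) for every filter in a layer;
    # smaller scores suggest more important filters.
    B = rbf_gram(X_batch, silverman_sigma(X_batch, h))
    Yg = rbf_gram(Y_onehot.astype(float), silverman_sigma(Y_onehot, h))
    scores = []
    for T in feature_maps:
        A = rbf_gram(T, silverman_sigma(T, h))
        scores.append(multivariate_mi(B, [A], alpha) - beta * multivariate_mi(Yg, [A], alpha))
    return np.array(scores)
```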
REFERENCES

[1] N. Tishby and N. Zaslavsky, "Deep learning and the information bottleneck principle," in IEEE ITW, 2015, pp. 1–5.
[2] A. Achille and S. Soatto, "Emergence of invariance and disentanglement in deep representations," JMLR, vol. 19, no. 1, pp. 1947–1980, 2018.
[3] T. Tax, P. A. Mediano, and M. Shanahan, "The partial information decomposition of generative neural network models," Entropy, vol. 19, no. 9, p. 474, 2017.
[4] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," arXiv preprint arXiv:1703.00810, 2017.
[5] A. M. Saxe et al., "On the information bottleneck theory of deep learning," in ICLR, 2018.
[6] S. Yu and J. C. Principe, "Understanding autoencoders with information theoretic concepts," Neural Networks, vol. 117, pp. 104–123, 2019.
[7] L. G. Sanchez Giraldo, M. Rao, and J. C. Principe, "Measures of entropy from data using infinitely divisible kernels," IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 535–548, 2015.
[8] F. Camastra and A. Staiano, "Intrinsic dimension estimation: Advances and open problems," Information Sciences, vol. 328, pp. 26–41, 2016.
[9] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection," JMLR, vol. 13, no. Jan, pp. 27–66, 2012.
[10] S. Yu, L. G. Sanchez Giraldo, R. Jenssen, and J. C. Principe, "Multivariate extension of matrix-based Rényi's α-order entropy functional," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[11] P. L. Williams and R. D. Beer, "Nonnegative decomposition of multivariate information," arXiv preprint arXiv:1004.2515, 2010.
[12] A. Rényi, "On measures of entropy and information," in Proc. of the 4th Berkeley Sympos. on Math. Statist. and Prob., vol. 1, 1961, pp. 547–561.
[13] J. C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Springer Science & Business Media, 2010.
[14] M. Müller-Lennert, F. Dupuis, O. Szehr, S. Fehr, and M. Tomamichel, "On quantum Rényi entropies: A new generalization and some properties," J. Math. Phys., vol. 54, no. 12, p. 122203, 2013.
[15] R. Bhatia, "Infinitely divisible matrices," The American Mathematical Monthly, vol. 113, no. 3, pp. 221–235, 2006.
[16] R. W. Yeung, "A new outlook on Shannon's information measures," IEEE Transactions on Information Theory, vol. 37, no. 3, pp. 466–474, 1991.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[18] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms," arXiv preprint arXiv:1708.07747, 2017.
[19] M. Thoma, "The HASYv2 dataset," arXiv preprint arXiv:1701.08380, 2017.
[20] H. Mureşan and M. Oltean, "Fruit recognition from images using deep learning," Acta Universitatis Sapientiae, Informatica, vol. 10, no. 1, pp. 26–42, 2018.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NeurIPS, 2012, pp. 1097–1105.
[22] B. W. Silverman, Density Estimation for Statistics and Data Analysis. CRC Press, 1986, vol. 26.
[23] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, "Quantifying unique information," Entropy, vol. 16, no. 4, pp. 2161–2183, 2014.
[24] V. Griffith and C. Koch, "Quantifying synergistic mutual information," in Guided Self-Organization: Inception. Springer, 2014, pp. 159–190.
[25] A. J. Bell, "The co-information lattice," in Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, vol. 2003, 2003.
[26] N. Timme, W. Alford, B. Flecker, and J. M. Beggs, "Synergy, redundancy, and multivariate information measures: an experimentalist's perspective," J. Comput. Neurosci., vol. 36, no. 2, pp. 119–140, 2014.
[27] M. Hellman and J. Raviv, "Probability of error, equivocation, and the Chernoff bound," IEEE Transactions on Information Theory, vol. 16, no. 4, pp. 368–372, 1970.
[28] I. Sason and S. Verdú, "Arimoto–Rényi conditional entropy and Bayesian m-ary hypothesis testing," IEEE Transactions on Information Theory, vol. 64, no. 1, pp. 4–25, 2018.
[29] L. Shen and Q. Huang, "Relay backpropagation for effective learning of deep convolutional neural networks," in ECCV, 2016, pp. 467–482.
[30] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[31] N. X. Vinh, J. Chan, and J. Bailey, "Reconsidering mutual information based feature selection: A statistical significance view," in AAAI, 2014.
[32] S. Yu and J. C. Príncipe, "Simple stopping criteria for information theoretic feature selection," Entropy, vol. 21, no. 1, p. 99, 2019.
[33] M. Noshad, Y. Zeng, and A. O. Hero, "Scalable mutual information estimation using dependence graphs," in ICASSP, 2019, pp. 2962–2966.
[34] I. Chelombiev, C. Houghton, and C. O'Donnell, "Adaptive estimators show information compression in deep neural networks," arXiv preprint arXiv:1902.09037, 2019.
[35] Z. Goldfeld et al., "Estimating information flow in neural networks," arXiv preprint arXiv:1810.05728, 2018.
[36] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[37] A. Kolchinsky and B. Tracey, "Estimating mixture entropy with pairwise distances," Entropy, vol. 19, no. 7, p. 361, 2017.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[40] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[41] D. Lopez-Paz, P. Hennig, and B. Schölkopf, "The randomized dependence coefficient," in NeurIPS, 2013, pp. 1–9.
[42] Y.-I. Moon, B. Rajagopalan, and U. Lall, "Estimation of mutual information using kernel density estimators," Physical Review E, vol. 52, no. 3, p. 2318, 1995.
[43] T. Nagler and C. Czado, "Evading the curse of dimensionality in nonparametric density estimation with simplified vine copulas," Journal of Multivariate Analysis, vol. 151, pp. 69–89, 2016.
[44] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[45] S. Gao, G. Ver Steeg, and A. Galstyan, "Efficient estimation of mutual information for strongly dependent variables," in AISTATS, 2015, pp. 277–286.
[46] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[47] R. Jenssen, "Kernel entropy component analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 847–860, 2009.
[48] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in ICCV, 2017, pp. 5058–5066.
[49] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," in ICLR, 2017.
[50] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," arXiv preprint arXiv:1607.03250, 2016.
[51] R. A. Amjad, K. Liu, and B. C. Geiger, "Understanding individual neuron importance using information theory," arXiv preprint arXiv:1804.06679, 2018.
[52] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint physics/0004057, 2000.
[53] D. Koller and M. Sahami, "Toward optimal feature selection," Stanford InfoLab, Tech. Rep., 1996.
[54] S. Yaramakala and D. Margaritis, "Speculative Markov blanket discovery for optimal feature selection," in ICDM, 2005, pp. 809–812.
[55] J. R. Vergara and P. A. Estévez, "A review of feature selection methods based on mutual information," Neural Computing and Applications, vol. 24, no. 1, pp. 175–186, 2014.

APPENDIX
A. Partial Information Diagram
In this section, we briefly recall the ideas behind the partial information decomposition (PID) by Williams and Beer [11]. The PID is a framework to define information decompositions of arbitrarily many random variables, which can be characterized by a partial information (PI) diagram. The general structure of a PI diagram becomes clear when we consider the decomposition for four variables (see Fig. 10), which illustrates the way in which the total information that $R = \{R_1, R_2, R_3\}$ provides about S is distributed across various combinations of sources.

Specifically, each element in R can provide unique information (regions labeled {1}, {2}, and {3}), information redundantly with one other variable ({1}{2}, {1}{3}, and {2}{3}), or information synergistically with one other variable ({12}, {13}, and {23}). Additionally, information can be provided redundantly by all three variables ({1}{2}{3}) or provided by their three-way synergy ({123}). More interesting are the new kinds of terms representing combinations of redundancy and synergy. For instance, the regions marked {1}{23}, {2}{13}, and {3}{12} represent information that is available redundantly from either one variable considered individually or the other two considered together. In general, each PI term represents the redundancy of synergies for a particular collection of sources, corresponding to one distinct way for the components of $R = \{R_1, R_2, R_3\}$ to contribute information about S. Unfortunately, the number of PI terms grows exponentially with the number of elements in R. For example, if $R = \{R_1, R_2, R_3, R_4\}$, there are already 166 individual non-negative terms. Moreover, the reliable estimation of each PID term still remains a big challenge [23], [24].

Fig. 10. Partial information diagrams for (a) three and (b) four variables. For brevity, the sets are abbreviated by the indices of their elements; that is, $\{R_1, R_2\}$ is abbreviated by {12}, and so on. Figure by Williams and Beer [11].
Our method to determine the network width (i.e., the number of filters N) is motivated by the concept of the Markov blanket (MB) [53], [54]. Recall that the MB M of a target variable Y is the smallest subset of S such that Y is conditionally independent of the rest of the variables S − M, i.e., $Y \perp (S - M) \mid M$ [55]. From the perspective of information theory, this indicates that the conditional mutual information (CMI) $I(\{S - M\}; Y \mid M)$ is zero. Therefore, given the selected filter subset $T_s$, the remaining filter subset $T_r$, and the class labels Y, we can obtain a reliable estimate of N by evaluating whether $I(T_r; Y \mid T_s)$ approaches zero. Theoretically, $I(T_r; Y \mid T_s)$ is non-negative and monotonically decreasing with the increase of the size of $T_s$ (i.e., $|T_s|$) [30], but it will never reach zero in practice due to statistical variation and chance agreement between variables [31].

To measure how close $I(T_r; Y \mid T_s)$ is to zero, we select a new candidate filter $t$ in $T_r$ and quantify how $t$ affects the MB condition by creating a random permutation of $t$ (without permuting the corresponding Y), denoted $\tilde{t}$. If $I(\{T_r - t\}; Y \mid \{T_s, t\})$ is not significantly smaller than $I(\{T_r - \tilde{t}\}; Y \mid \{T_s, \tilde{t}\})$, $t$ can be discarded and the filter selection is stopped (i.e., $N = |T_s|$). We term this method CMI-permutation [32]. Algorithm 1 gives its detailed implementation, in which $\mathbb{1}[\cdot]$ denotes the indicator function.
Algorithm 1 CMI-permutation

Require: Selected filter subset $T_s$; remaining filter subset $T_r$; class labels Y; selected filter $t$ (in $T_r$); permutation number P; significance level α.
Ensure: decision (stop filter selection or continue filter selection); network width N.

Estimate $I(\{T_r - t\}; Y \mid \{T_s, t\})$ with the matrix-based Rényi's α-entropy functional estimator [10].
for i = 1 to P do
    Randomly permute $t$ to obtain $\tilde{t}_i$.
    Estimate $I(\{T_r - \tilde{t}_i\}; Y \mid \{T_s, \tilde{t}_i\})$ with the matrix-based Rényi's α-entropy functional estimator [10].
end for
if $\frac{1}{P}\sum_{i=1}^{P} \mathbb{1}\left[ I(\{T_r - t\}; Y \mid \{T_s, t\}) \geq I(\{T_r - \tilde{t}_i\}; Y \mid \{T_s, \tilde{t}_i\}) \right] \leq \alpha$ then
    decision ← continue filter selection.
else
    decision ← stop filter selection. $N \leftarrow |T_s|$.
end if
return decision; N (if filter selection is stopped)
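For completeness, a hedged Python sketch of one CMI-permutation step is given below, built on the estimators sketched in Section II. Representing each filter group by the Hadamard product of its Gram matrices, using an RBF kernel on the labels, and the default values of P and the significance level are all assumptions of this illustration, not the authors' implementation.

```python
def conditional_mi(grams_a, grams_b, grams_c, alpha=1.01):
    # I(A; B | C) = S(A, C) + S(B, C) - S(C) - S(A, B, C), with each joint entropy
    # obtained from the Hadamard product of the corresponding Gram matrices (Eq. (3)).
    return (joint_renyi_entropy(grams_a + grams_c, alpha)
            + joint_renyi_entropy(grams_b + grams_c, alpha)
            - joint_renyi_entropy(grams_c, alpha)
            - joint_renyi_entropy(grams_a + grams_b + grams_c, alpha))

def cmi_permutation_step(selected, remaining, t, Y, P=100, sig=0.05, alpha=1.01, h=5.0):
    # One step of Algorithm 1. `selected` and `remaining` are lists of (n, d) filter
    # activation arrays, t is one element of `remaining`, and Y is an (n, c) label array.
    rng = np.random.default_rng(0)
    G = lambda m: rbf_gram(m, silverman_sigma(m, h))
    rest = [G(m) for m in remaining if m is not t]
    base = [G(m) for m in selected]
    Yg = [G(Y.astype(float))]
    ref = conditional_mi(rest, Yg, base + [G(t)], alpha)
    hits = 0
    for _ in range(P):
        t_perm = t[rng.permutation(t.shape[0])]   # permute t, keep Y fixed
        if ref >= conditional_mi(rest, Yg, base + [G(t_perm)], alpha):
            hits += 1
    return "continue" if hits / P <= sig else "stop"
```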
We finally present additional results to further support the arguments in the main text. Specifically, Fig. 11 evaluates the multivariate mutual information (MMI) values with respect to different CNN topologies trained on the Fashion-MNIST and HASYv2 data sets. Obviously, the MMI values are likely to saturate with only a few filters. Moreover, increasing the number of filters does not guarantee that classification accuracy increases, and might even degrade performance. The tendencies of the redundancy-synergy trade-off and the weighted non-redundant information are shown in Fig. 12. Again, more filters push the network towards an improved redundancy-synergy trade-off, i.e., the synergy gradually dominates in each pair of feature maps as the number of filters increases. Fig. 13 demonstrates the information plane (IP) and the modified information plane (M-IP) for a smaller AlexNet [21] type CNN trained on the HASYv2 and Fruits 360 data sets. Although we did not observe any compression in the IP, it is very easy to observe the compression phase in the M-IP.

Fig. 11. The MMI values in (a) Conv. 1 and Conv. 2 on the Fashion-MNIST data set; and (b) Conv. 1, Conv. 2 and Conv. 3 on the HASYv2 data set. The black line indicates the upper bound of the MMI, i.e., the average mini-batch input entropy. The topologies of all competing networks are specified in the legend, in which the successive numbers indicate the number of filters in each convolutional layer. We also report their classification accuracies (%) on the testing set, averaged over Monte-Carlo simulations, in parentheses.

Fig. 12. The redundancy-synergy trade-off and the weighted non-redundant information on the Fashion-MNIST data set. (a) and (b) demonstrate the percentages of these two quantities with respect to different numbers of filters in Conv. 1, with a fixed number of filters in Conv. 2. (c) and (d) demonstrate the percentages of these two quantities with a fixed number of filters in Conv. 1 but different numbers of filters in Conv. 2. In each subfigure, the topologies of all competing networks are specified in the legend.

Fig. 13. The Information Plane (IP) and the modified Information Plane (M-IP) of a smaller AlexNet type CNN trained on the HASYv2 (the first row) and Fruits 360 (the second row) data sets. The numbers of filters in Conv. 1, Conv. 2 and Conv. 3 are indicated in the subtitle of each plot.