How Does Topology of Neural Architectures Impact Gradient Propagation and Model Performance?
Kartikeya Bhardwaj, Guihong Li, and Radu Marculescu
Arm Inc., San Jose, CA 95134
The University of Texas at Austin, Austin, TX 78712
[email protected], [email protected], [email protected]
Abstract
In this paper, we address two fundamental questions in neural architecture design research: (i) How does an architecture topology impact the gradient flow during training? (ii) Can certain topological characteristics of deep networks indicate a priori (i.e., without training) which models, with a different number of parameters/FLOPS/layers, achieve a similar accuracy? To this end, we formulate the problem of deep learning architecture design from a network science perspective and introduce a new metric called NN-Mass to quantify how effectively information flows through a given architecture. We demonstrate that our proposed NN-Mass is more effective than the number of parameters to characterize the gradient flow properties, and to identify models with similar accuracy, despite having significantly different size/compute requirements. Detailed experiments on both synthetic and real datasets (e.g., MNIST and CIFAR-10/100) provide extensive empirical evidence for our insights. Finally, we exploit our new metric to design efficient architectures directly, and achieve up to 3x fewer parameters and FLOPS, while losing minimal accuracy over large CNNs on CIFAR-10.

1 Introduction

Recent research in neural architecture design has driven several breakthroughs in deep learning. Specifically, major contributions have been made in the following two directions: (i) Initialization of model weights [1, 2, 3, 4], and (ii) Topology of the network, which determines how different compute units (e.g., neurons, channels, layers) should be connected to each other [5, 6, 7, 8, 9, 10]. While many attempts have been made to study the impact of initialization on model accuracy [4, 11, 12, 13, 14], good Deep Neural Network (DNN) topologies have been mainly developed either manually (e.g., Resnets, Densenets, etc. [5, 6, 7, 8]) or automatically using Neural Architecture Search (NAS) techniques [9, 10, 15, 16]. However, the impact of topological properties on model performance has not been explored systematically. Hence, there is a significant gap in our understanding of how various topological properties impact the gradient flow and accuracy of DNNs.

In general, the topology (or structure) of networks strongly influences the phenomena taking place over them [17]. For instance, how closely the users of a social network are connected to each other directly affects how fast information propagates through the network [18]. Similarly, a DNN architecture can be seen as a network of different neurons connected together. Therefore, the topology of deep networks can influence how effectively the gradients can flow and, hence, how much information can be learned. Indeed, this can also mean that models with similar topological properties, but significantly different size/compute requirements, can achieve similar accuracy. Models with highly different compute but similar accuracy have been studied in the field of model compression [19, 20, 21, 22, 23]. Moreover, recent NAS has also focused on deploying efficient hardware-aware models [15, 16]. Motivated by the need for (i) understanding the relationship between gradient flow and topology, and (ii) efficient models, we address the following fundamental questions:

1. How does the DNN topology influence the gradient flow through the network?
2. Can topological properties of DNNs indicate a priori (i.e., without training) which models achieve a similar accuracy, despite having vastly different parameters/FLOPS/layers?
To answer the above questions, we first model DNNs as complex networks in order to exploit network science [17] (the study of networks) and quantify their topological properties. To this end, we propose a new metric called NN-Mass that explains the relationship between the topological structure of DNNs and Layerwise Dynamical Isometry (LDI), a property that indicates faithful gradient propagation through the network [11, 4]. Specifically, models with similar NN-Mass should have similar LDI, and thus a similar gradient flow that results in comparable accuracy. With these theoretical insights, we conduct a thorough Neural Architecture Space Exploration (NASE) and show that models with the same width and NN-Mass indeed achieve similar accuracy irrespective of their depth, number of parameters, and FLOPS. Finally, after extensive experiments linking topology and gradient flow, we show how the closed-form expression for NN-Mass can be used to directly design efficient deep networks without searching or training individual models during the search. Overall, we propose a new theoretically-grounded perspective for designing efficient neural architectures that reveals how topology influences the gradient propagation in deep networks.

The rest of the paper is organized as follows: Section 2 discusses the related work and some preliminaries. Then, Section 3 describes our proposed metrics and their theoretical analysis. Section 4 presents detailed experimental results. Section 5 summarizes our work and contributions.
2 Related Work and Preliminaries

NAS techniques [9, 10, 15, 16] have indeed resulted in state-of-the-art neural architectures. More recently, [24, 25] utilized standard network science ideas such as Barabasi-Albert (BA) [26] or Watts-Strogatz (WS) [27] models for NAS. However, like the rest of the NAS research, [24, 25] did not address what characteristics of the topology make various models (with different parameters/FLOPS/layers) achieve similar accuracy. Unlike our work, NAS methods [9, 15, 16, 10, 24, 25] do not connect the topology with the gradient flow.

On the other hand, the impact of initialization on model convergence and gradients has also been studied [1, 2, 4, 11, 12, 13, 14]. Moreover, recent model compression literature attempts to connect pruning at initialization to gradient properties [11]. Again, none of these studies address the impact of the architecture topology on gradient propagation. Hence, our work is orthogonal to prior art that explores the impact of initialization on gradients [1, 2, 4, 11, 12, 13, 14] or pruning [11, 28]. Related work on important network science and gradient propagation concepts is discussed below.
Preliminaries.
In our work, we use the following two well-established concepts:
Definition 1 (Average Degree [17]). The average degree ($\hat{k}$) of a network is the average number of connections a node has; i.e., $\hat{k}$ is given by the number of edges divided by the total number of nodes.

Average degree and degree distribution (i.e., the distribution of nodes' degrees) are important topological characteristics which directly affect how information flows through a network. The dynamics of how fast a signal can propagate through a network heavily depends on the network topology.
Definition 2 (Layerwise Dynamical Isometry (LDI) [11]). A deep network satisfies LDI if the singular values of its layerwise Jacobians at initialization are close to 1 for all layers. Specifically, for a multilayer feed-forward network, let $s_i$ ($W_i$) be the output (weights) of layer $i$ such that $s_i = \phi(h_i)$, $h_i = W_i s_{i-1} + b_i$; then, the Jacobian matrix at layer $i$ is defined as:

$$J_{i,i-1} = \frac{\partial s_i}{\partial s_{i-1}} = D_i W_i$$

Here, $J_{i,i-1} \in \mathbb{R}^{w_i \times w_{i-1}}$, where $w_i$ is the number of neurons in layer $i$, and $D_i^{jk} = \phi'(h_i^j)\,\delta_{jk}$, where $\phi'$ denotes the derivative of the non-linearity $\phi$ and $\delta_{jk}$ is the Kronecker delta [11]. Then, if the singular values $\sigma_j$ of all $J_{i,i-1}$ are close to 1, the network satisfies LDI.

LDI indicates that a signal propagating through the deep network will neither get attenuated nor amplified too much; hence, it ensures faithful propagation of gradients during training [4, 11].
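To make LDI concrete, here is a minimal sketch (our own illustration, not the authors' code) that computes the mean singular value of each layerwise Jacobian $J_{i,i-1} = D_i W_i$ at initialization for a toy fully-connected network; the width, depth, and ELU non-linearity are illustrative choices.

```python
import torch

torch.manual_seed(0)
width, depth = 64, 8
# Gaussian initialization with variance scaling (1/fan_in), as assumed in the text.
weights = [torch.randn(width, width) / width ** 0.5 for _ in range(depth)]

s = torch.randn(width)                       # input signal s_0
for i, W in enumerate(weights, start=1):
    h = W @ s                                # pre-activations h_i (zero bias)
    d = torch.where(h > 0, torch.ones_like(h), torch.exp(h))  # ELU derivative
    J = torch.diag(d) @ W                    # J_{i,i-1} = D_i W_i
    sigma = torch.linalg.svdvals(J)          # singular values of the layerwise Jacobian
    print(f"layer {i}: mean singular value = {sigma.mean().item():.3f}")
    s = torch.where(h > 0, h, torch.exp(h) - 1)               # ELU output s_i
```

If these per-layer mean singular values stay near 1 across depth, the network (approximately) satisfies LDI at initialization.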
3 Proposed Metrics and Theoretical Analysis

We first model DNNs via network science to derive our proposed topological metrics. We then demonstrate the theoretical relationship between NN-Mass and gradient propagation.
We start with a generic multilayer perceptron (MLP) setup with $d_c$ layers containing $w_c$ neurons each. Since our objective is to study the topological properties of neural architectures, we assume shortcut connections (or long-range links) superimposed on top of a typical MLP setup (see Fig. 1(a)).
Figure 1: (a) DNN setup: The DNN (depth $d_c$, width $w_c$) has layer-by-layer short-range connections (gray) with additional long-range links (purple/red). (b) Simulation of Gaussian matrices: Mean singular values vs. the size of a matrix $(w_c + m/2, w_c)$. Mean singular values increase as $m$ increases (more simulations are given in Appendix D). (c) Convolutional layers form a similar topological structure as MLP layers: All input channels contribute to all output channels.

Specifically, all neurons at layer $i$ receive long-range links from a maximum of $t_c$ neurons from previous layers. That is, we randomly select $\min\{w_c(i-1), t_c\}$ neurons from layers $1, 2, \ldots, (i-1)$, and concatenate them at layer $i-1$ (see Fig. 1(a)). Here, $w_c(i-1)$ is the total number of candidate neurons from layers $1, 2, \ldots, (i-1)$ that can supply long-range links; if $t_c$ exceeds this total number of possible candidates, then all neurons from layers $1, 2, \ldots, (i-1)$ are selected. The neurons are concatenated similar to how channels are concatenated in Densenets [6]; the concatenated neurons then pass through a fully-connected layer to generate the output of layer $i$ ($s_i$). As a result, the weight matrix $W_i$ (which is used to generate $s_i$) gets additional weights to account for the incoming long-range links. Similar to recent NAS research [29], our rationale behind selecting random links is that random architectures are often as competitive as carefully designed models. Moreover, the random long-range links on top of fixed short-range links make our architectures a small-world network (Fig. 6, Appendix A) [27], and allow us to use network science to study their topological properties [17, 30, 31, 32].

Like standard CNNs [5, 6], we can generalize this setup to contain multiple ($N_c$) cells of width $w_c$ and depth $d_c$. All long-range links are present only within a cell and do not extend between cells.
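To make the construction concrete, the following is a minimal PyTorch sketch of one such cell with concatenation-type long-range links. This is our own illustration under the setup above; the class name and hyper-parameter values are ours, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LongRangeMLPCell(nn.Module):
    """One cell: d_c layers of width w_c with random, concatenated long-range links."""
    def __init__(self, w_c, d_c, t_c):
        super().__init__()
        self.idx, self.fcs = [], nn.ModuleList()
        for i in range(2, d_c + 1):           # the paper's 1-indexed layers 2..d_c
            # Layer i draws min{w_c(i-1), t_c} random neurons from layers 1..(i-1);
            # per the sums in Eqs. (1)-(2), the cell's last layer gets none.
            n_lr = min(w_c * (i - 1), t_c) if i <= d_c - 1 else 0
            self.idx.append(torch.randperm(w_c * (i - 1))[:n_lr])
            # W_i widens to absorb the concatenated long-range inputs.
            self.fcs.append(nn.Linear(w_c + n_lr, w_c))

    def forward(self, x):
        outs = [x]                            # outputs of all layers so far (width w_c each)
        for idx, fc in zip(self.idx, self.fcs):
            pool = torch.cat(outs, dim=-1)    # every neuron from previous layers
            inp = torch.cat([outs[-1], pool[..., idx]], dim=-1)
            outs.append(F.elu(fc(inp)))
        return outs[-1]

cell = LongRangeMLPCell(w_c=8, d_c=16, t_c=10)
print(cell(torch.randn(4, 8)).shape)          # torch.Size([4, 8])
```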
Our key objectives are twofold: (i) Quantify what topological characteristics of DNN architectures affect their accuracy and gradient flow, and (ii) Exploit such properties to directly design efficient CNNs. To this end, we propose two new metrics called NN-Density and NN-Mass, as defined below.

Definition 3 (Cell-Density). The density of a cell quantifies how densely its neurons are connected via long-range links. Formally, for a cell $c$, the cell-density $\rho_c$ is given by:

$$\rho_c = \frac{\text{Actual long-range links}}{\text{Total possible long-range links}} = \frac{2 \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\}}{w_c (d_c-2)(d_c-1)} \quad (1)$$

For the complete derivation, please refer to Appendix B. With the above definition for cell-density, NN-Density ($\rho_{avg}$) is simply defined as the average density across all cells in a DNN.

Definition 4 (Mass of DNNs). NN-Mass quantifies how effectively information can flow through a given DNN topology. For a given width ($w_c$), models with similar NN-Mass, but different depths ($d_c$) and numbers of parameters/FLOPS, are expected to exhibit similar gradient flow properties.

Note that density is basically mass/volume. Let volume be the total number of neurons in a cell. Then, we can derive the NN-Mass ($m$) by multiplying the cell-density with the total number of neurons in each cell:

$$m = \sum_{c=1}^{N_c} w_c d_c \rho_c = \sum_{c=1}^{N_c} \frac{2 d_c \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\}}{(d_c-2)(d_c-1)} \quad (2)$$

A code sketch that evaluates Eq. (2) is given below. We then explain the use of the above metrics for neural architecture space exploration.
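Since Eq. (2) depends only on $\{d_c, w_c, t_c\}$, NN-Mass can be computed without instantiating (or training) a model. Below is a small Python sketch of Eq. (2); the function name is ours, and it can be checked against the worked example of Appendix F.

```python
def nn_mass(d_c, w_c, t_c):
    """NN-Mass (Eq. 2) for cells of depth d_c[c], width w_c[c],
    and maximum long-range link candidates t_c[c]."""
    m = 0.0
    for d, w, t in zip(d_c, w_c, t_c):
        links = sum(min(w * (i - 1), t) for i in range(2, d))  # sum over i = 2..d-1
        m += 2 * d * links / ((d - 2) * (d - 1))
    return m

# A single MLP cell of width 8, depth 16, with t_c = 10:
print(nn_mass([16], [8], [10]))  # ~21.03
```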
Neural Architecture Space Exploration (NASE). In NASE, we systematically study the design space of DNNs using NN-Mass. Note that NN-Mass is a function of the network width, depth, and long-range links (i.e., the topology of a model). For a fixed number of cells, an architecture can be completely specified by {depth, width, maximum long-range link candidates} per cell $= \{d_c, w_c, t_c\}$. Hence, to perform NASE, we vary $\{d_c, w_c, t_c\}$ to create random architectures with different parameters/FLOPS/layers and NN-Mass. We then train these architectures and characterize their accuracy, topology, and gradient propagation in order to understand the theoretical relationships among them. Without loss of generality, we assume below that the DNN has only one cell of width $w_c$ and depth $d_c$.

Proposition 1 (NN-Mass and average degree of the network (a topological property)). The average degree of a deep network with NN-Mass $m$ is given by $\hat{k} = w_c + m/2$.

The proof of the above result is given in Appendix C.
Intuition.
Proposition 1 states that the average degree of a deep network is $w_c + m/2$ which, given the NN-Mass $m$, is independent of the depth $d_c$. The average degree indicates how well-connected the network is. Hence, it controls how effectively information can flow through a given topology. Therefore, for a given width and NN-Mass, the average amount of information that can flow through various architectures (with different parameters/layers) should be similar (due to the same average degree). Thus, we hypothesize that these topological characteristics might constrain the amount of information being learned by different models. Next, we show the impact of topology on gradient propagation.

Proposition 2 (NN-Mass and LDI). Given a small deep network $f_S$ (depth $d_S$) and a large deep network $f_L$ (depth $d_L$, $d_L \gg d_S$), both with the same NN-Mass $m$ and width $w_c$, the LDI for both models is equivalent. Specifically, if $\Sigma_S^i$ ($\Sigma_L^i$) denotes the singular values of the initial layerwise Jacobians $J_{i,i-1}$ for the small (large) model, then the mean singular values in both models are similar; that is, $\mathbb{E}[\Sigma_S^i] \approx \mathbb{E}[\Sigma_L^i]$.

Proof. To prove the above result, it suffices to show that the initial Jacobians $J_{i,i-1}$ have similar properties for both models (and thus their singular value distributions will be similar). For our setup, the output of layer $i$ is $s_i = \phi(W_i x_{i-1} + b_i)$, where $x_{i-1} = s_{i-1} \cup y_{i-1}$ concatenates the output of layer $i-1$ ($s_{i-1}$) with the neurons $y_{i-1}$ supplying the long-range links (random $\min\{w_c(i-1), t_c\}$ neurons selected uniformly from layers $1$ to $i-1$). Hence, $J_{i,i-1} = \partial s_i / \partial x_{i-1} = D_i W_i$. Compared to a typical MLP scenario (see Definition 2), the sizes of the matrices $D_i$ and $W_i$ increase to account for the incoming long-range links.

For two models $f_S$ and $f_L$, the layerwise Jacobian $J_{i,i-1}$ can differ in two ways: (i) The values inside the Jacobian matrices for $f_S$ and $f_L$ can be different, and/or (ii) The sizes of the layerwise Jacobian matrices for $f_S$ and $f_L$ can be different. Hence, our objective is to show that when the width and NN-Mass are similar, irrespective of the depth of the model (and thus irrespective of the number of parameters/FLOPS), both the values and the sizes of the initial layerwise Jacobians will be similar.

Let us start by considering a linear network: in this case, $J_{i,i-1} = W_i$. Since LDI looks at the properties of the layerwise Jacobians at initialization, and because all models are initialized the same way (e.g., Gaussians with variance scaling; variance scaling takes into account the number of input/output units, so if the width is the same between models of different depths, the distribution at initialization is still similar), the values inside $J_{i,i-1}$ for both $f_S$ and $f_L$ have the same distribution (point (i) above is satisfied). We next show that even the sizes of the layerwise Jacobians for both models are similar if the width and NN-Mass are similar.

How is topology related to the layerwise Jacobians? Since the average degree is the same for both models (see Proposition 1), on average, the number of incoming shortcut connections at a typical layer is $w_c \times m/2$. In other words, since the degree distribution for the random long-range links is Poisson [30] with average degree $\bar{k}_{R|G} \approx m/2$ (see Eq. (7), Appendix C; theoretically, a Poisson process assumes a constant rate of arrival of links), an average of $m/2$ neurons supply long-range links to each layer. Therefore, the Jacobians will theoretically have the same dimensions $(w_c + m/2, w_c)$ irrespective of the depth of the neural network (i.e., point (ii) is also satisfied).

So far, the discussion has considered only a linear network. For a non-linear network, the Jacobian is given as $J_{i,i-1} = D_i W_i$.
As explained in [11], $D_i$ depends on the pre-activations $h_i = W_i x_{i-1} + b_i$. As established in several deep network mean-field theory studies [14, 12, 11, 13], the distribution of pre-activations at layer $i$ ($h_i$) is a Gaussian $\mathcal{N}(0, q^i)$ due to the central limit theorem. Similar to [11, 14], if the input $h$ is chosen to satisfy a fixed point $q^i = q^*$, the distribution of $D_i$ becomes independent of the depth ($\mathcal{N}(0, q^*)$). Therefore, the distribution of both $D_i$ and $W_i$ is similar for different models irrespective of the depth, even for non-linear networks. Moreover, the sizes of the matrices will again be similar due to the similar average degree in both $f_S$ and $f_L$.

Hence, the size and the distribution of values of the Jacobian matrix are similar for both the large and the small model (provided the width and NN-Mass are similar). That is, the distributions and the mean singular values will also be similar: $\mathbb{E}[\Sigma_S^i] \approx \mathbb{E}[\Sigma_L^i]$. In other words, LDI is equivalent between models of different depths if their width and NN-Mass are similar.

We note that the mean singular values increase with NN-Mass. To illustrate this effect, we numerically simulate several Gaussian-distributed matrices of sizes $(w_c + m/2, w_c)$ and compute their mean singular values. Specifically, we vary $m$ for different widths $w_c$ and observe the impact of this size variation on the mean singular values. Fig. 1(b) shows that the mean singular values increase linearly with NN-Mass. In our experiments, we show that this linear trend between mean singular values and NN-Mass holds true for actual non-linear deep networks. A formal proof of this observation and more simulations are given in Appendix D. Note that our results should not be interpreted as "bigger models yield larger mean singular values." We explicitly show in the next section that the relationship between the total number of parameters and the mean singular values is significantly worse than that for NN-Mass. Hence, it is the topological properties that enable LDI in different deep networks, and not the number of parameters.
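This simulation is straightforward to reproduce. Below is a NumPy sketch (our own, assuming standard-normal entries) that computes the mean singular value of random $(w_c + m/2, w_c)$ matrices as $m$ grows, mirroring the style of Fig. 1(b):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_singular_value(w_c, m, trials=20):
    """Mean singular value of Gaussian matrices of size (w_c + m/2, w_c)."""
    h = w_c + m // 2
    return np.mean([np.linalg.svd(rng.standard_normal((h, w_c)),
                                  compute_uv=False).mean()
                    for _ in range(trials)])

for m in [100, 400, 700, 1000]:
    print(f"NN-Mass {m}: mean singular value = {mean_singular_value(64, m):.2f}")
```

The printed values grow roughly linearly with $m$ for a fixed width, as the text describes.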
Remark 1 (The NN-Mass formulation is the same for CNNs). Fig. 1(c) shows a typical convolutional layer. Since all channel-wise convolutions are added together, each output channel is some function of all input channels. This makes the topology of CNNs similar to that of our MLP setup. The key difference is that the nodes in the network (see Fig. 1(a)) now represent channels and not individual neurons. Of note, for our CNN setup, we use three cells (similar to [5, 6]). More details on the CNN setup (including a concrete example of the NN-Mass calculation) are given in Appendices E and F.

Next, we present detailed experimental evidence to validate our theoretical findings.

4 Experimental Results
To perform NASE for MLPs and CNNs, we generate random architectures with different NN-Mass and numbers of parameters (Params) by varying $\{d_c, w_c, t_c\}$. For random MLPs with different $\{d_c, t_c\}$ and $w_c = 8$ (number of cells = 1), we conduct the following experiments on the MNIST dataset: (i) We explore the impact of varying Params and NN-Mass on the test accuracy; (ii) We demonstrate how LDI depends on NN-Mass and Params; (iii) We further show that models with similar NN-Mass (and width) result in similar training convergence, despite having different depths and Params.

After the extensive empirical evidence for our theoretical insights (i.e., the connection between gradient propagation and topology), we next move on to random CNN architectures with three cells. We conduct the following experiments on the CIFAR-10 and CIFAR-100 datasets: (i) We show that NN-Mass can further identify CNNs that achieve similar test accuracy, despite having highly different Params/FLOPS/layers; (ii) We show that NN-Mass is a significantly more effective indicator of model performance than parameter counts; (iii) We also show that our findings hold for CIFAR-100, a much more complex dataset than CIFAR-10. These models are trained for 200 epochs.

Finally, we exploit NN-Mass to directly design efficient CNNs (for CIFAR-10) which achieve accuracy comparable to significantly larger models. For these experiments, the models are trained for 600 epochs. Overall, we train hundreds of different MLP and CNN architectures, with each MLP (CNN) repeated five (three) times with different random seeds, to obtain our results. More setup details (e.g., architecture details, learning rates, etc.) are given in Appendix G (see Tables 2, 3, and 4).
Fig. 2(a) shows test accuracy vs. Params of DNNs with different depths on the MNIST dataset. As evident, even though many models have different Params, they achieve a similar test accuracy. On the other hand, when the same set of models is plotted against NN-Mass, their test accuracy curves cluster together tightly, as shown in Fig. 2(b). To further quantify the above observation, we generate linear fits of test accuracy vs. log(Params) and vs. log(NN-Mass) (see brown markers in Fig. 2(a,b)). For NN-Mass, we achieve a significantly higher goodness-of-fit $R^2$ than that for Params. This demonstrates that NN-Mass can identify DNNs that achieve similar accuracy, even if they have a highly different number of parameters/FLOPS/layers. We next investigate the gradient propagation properties to explain the test accuracy results.

Figure 2: MNIST results: (a) Models with different numbers of parameters (Params) achieve similar test accuracy. (b) Test accuracy curves of models with different depths/Params concentrate when plotted against NN-Mass (test accuracy standard deviations across runs are small). (c,d) Mean singular values of $J_{i,i-1}$ are much better correlated with NN-Mass than with Params.

Layerwise Dynamical Isometry (LDI).
We calculate the mean singular values of the initial layerwise Jacobians, and plot them against Params (see Fig. 2(c)) and NN-Mass (see Fig. 2(d)). Clearly, NN-Mass is far better correlated with the mean singular values than Params. More importantly, just as Proposition 2 predicts, these results show that models with similar NN-Mass and width have equivalent LDI properties, irrespective of the total depth (and, thus, Params) of the network. For example, even though the 32-layer models have more parameters, their mean singular values are similar to those of the 16-layer DNNs. This clearly suggests that the gradient propagation properties are heavily influenced by topological characteristics like NN-Mass, and not just by DNN depth and Params. Of note, the linear trend in Fig. 2(d) is similar to that seen in the Fig. 1(b) simulation.

Figure 3: Models A and C have the same NN-Mass and achieve very similar training convergence, even though they have highly different Params and depth. Model B has significantly fewer layers than C but the same Params, yet achieves a faster training convergence than C (B has a higher NN-Mass than C).
Training Convergence.
The above results pose the following hypotheses: (i) If the gradient flow between DNNs (with similar NN-Mass and width) is similar, their training convergence should be similar, even if they have highly different Params and depths; (ii) If two models have the same Params (and width), but different depths and NN-Mass, then the DNN with higher NN-Mass should have faster training convergence (since its mean singular value will be higher; see the trend in Fig. 2(d)).

To demonstrate that both hypotheses above hold true, we pick three models, A, B, and C, from Fig. 2(a,b) and plot their training loss vs. epochs. Models A and C have similar NN-Mass, but C has more Params and depth than A. Model B has far fewer layers and nearly the same Params as C, but has a higher NN-Mass. Fig. 3 shows the training convergence results for all three models. As evident, the training convergence of model A (7.8K Params, 20 layers) nearly coincides with that of model C (8.8K Params, 32 layers). Moreover, even though model B (8.7K Params, 20 layers) is shallower than the 32-layer model C, the training convergence of B is significantly faster than that of C (due to higher NN-Mass and, therefore, better LDI). Training convergence results for several other models in Fig. 2(a,b) show similar observations (see Fig. 10 in Appendix H.1). These results clearly validate the theoretical insights of Proposition 2, and emphasize the importance of the topological properties of neural architectures in characterizing gradient propagation and model performance. Other similar experiments on synthetic datasets are given in Appendix H.2.
Since we have now established a concrete relationship between gradient propagation and topological properties, in the rest of the paper we show that NN-Mass can be used to identify and design efficient CNNs that achieve similar accuracy as models with significantly higher Params/FLOPS/layers. (For our setup, more parameters lead to more FLOPS; FLOPS results for CNNs are given in Appendix H.8.)
Params. Asevident, models with highly different number of parameters ( e.g. , see models A-E in box W),achieve a similar test accuracy. Note that, there is a large gap in the model size: CNNs in box W T e s t A cc u r a c y Test Accuracy vs. Number of Parameters
W A B C D E
200 400 600 800 1000 1200NN-Mass95.996.096.196.296.396.496.596.6 T e s t A cc u r a c y Test Accuracy vs. NN-Mass
Y ZX A'B' C'D' E' a. b.
Figure 4: CIFAR-10 Width Multiplier wm = 2 :(a) Models with very different Params (box W)achieve similar test accuracies. (b) Models withsimilar accuracy often have similar NN-Mass:Models in W cluster into Z. Results are reportedas the mean of three runs (std. dev. ∼ . ).range from 5M parameters (model A) to 9Mparameters (models D,E). Again, as shown inFig. 4(b), when plotted against NN-Mass, thetest accuracy curves of CNNs with differentdepths cluster together ( e.g. , models A-E in boxW cluster into A’-E’ within bucket Z). Hence,NN-Mass identifies CNNs with similar accuracy,despite having highly different Params/layers.The same holds true for models within X and Y.We now explore the impact of varying modelwidth. In our CNN setup, we control the widthof the models using width multipliers ( wm ) [33,7]. The above results are for wm = 2 . For lowerwidth CNNs ( wm = 1 ), Fig. 5(a) shows thatmodels in boxes U and V concentrate into thebuckets W and Z, respectively (see also otherbuckets). Note that, the 31-layer models do not fall within the buckets (see blue line in Fig. 5(b)).We hypothesize that this could be because the capacity of these models is too small to reach highaccuracy. This does not happen for CNNs with higher width. Specifically, Fig. 5(c) shows the resultsfor wm = 3 . As evident, models with 6M-7M parameters achieve comparable test accuracy asmodels with up to 16M parameters ( e.g. , bucket Y in Fig. 5(d) contains models ranging from { } , all the way to {
64 layers, 16.7M parameters } ). Again, for all widths, thegoodness-of-fit ( R ) for linear fit between test accuracy and log(NN-Mass) achieves high values(0.74-0.90 as shown in Fig. 15 in Appendix H.4). T e s t A cc u r a c y Test Accuracy vs. Number of Parameters -- wm = 1 a. U V
100 200 300 400 500NN-Mass94.5094.7595.0095.2595.5095.7596.00 T e s t A cc u r a c y Test Accuracy vs. NN-Mass -- wm = 1 b. W Y ZX Higher Width (wm=3)Lower Width (wm=1)
200 400 600 800 1000NN-Mass96.196.296.396.496.596.696.796.8 T e s t A cc u r a c y Test Accuracy vs. NN-Mass -- wm = 3 d. W Y ZX T e s t A cc u r a c y Test Accuracy vs. Number of Parameters -- wm = 3 c. Figure 5: Similar observations hold for low- ( wm = 1 ) and high-width ( wm = 3 ) models: (a,b) Many models with very different Params (boxes U and V) cluster into buckets W and Z (see alsoother buckets). (c, d) For high-width, we observe a significantly tighter clustering compared to thelow-width case. Results are reported as the mean of three runs (std. dev. ∼ . ). Comparison between NN-Mass and Parameter Counting.
Next, we quantitatively compare NN-Mass to parameter counts. As shown in Fig. 16 in Appendix H.5, for wm = 2, Params yield a goodness-of-fit $R^2$ that is lower than that for NN-Mass (see Fig. 16(a,b)). However, for higher widths (wm = 3), the parameter count completely fails to predict model performance (Fig. 16(c)). On the other hand, NN-Mass still achieves a significantly higher $R^2$ (see Fig. 16(d)).

Since NN-Mass is a good indicator of model performance, we can in fact use it to predict a priori the test accuracy of completely unknown architectures. The complete details of this experiment and the results are presented in Appendix H.6. We show that a linear model trained on CNNs of depths {31, 40, 49, 64} (see Fig. 15(b)) can successfully predict, with a high $R^2$, the test accuracy of unknown CNNs with previously unseen depths (see Fig. 17 in Appendix H.6).
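This a priori prediction reduces to an ordinary linear fit of accuracy against log(NN-Mass). A minimal sketch is shown below; the (NN-Mass, accuracy) pairs are hypothetical placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical measurements: (NN-Mass, test accuracy) pairs for trained CNNs.
mass = np.array([200.0, 400.0, 600.0, 800.0, 1000.0])
acc = np.array([95.9, 96.2, 96.35, 96.45, 96.55])

# Fit accuracy = a * log(NN-Mass) + b, as in the paper's linear fits.
a, b = np.polyfit(np.log(mass), acc, deg=1)

# Predict the accuracy of an unseen architecture a priori from its NN-Mass alone.
m_new = 755.0
print(f"predicted accuracy at NN-Mass {m_new}: {a * np.log(m_new) + b:.2f}%")
```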
Results for the CIFAR-100 Dataset.

We now corroborate our main findings on the CIFAR-100 dataset, which is significantly more complex than CIFAR-10. To this end, we train the models from Fig. 4 on CIFAR-100. Fig. 18 (see Appendix H.7) once again shows that several models with highly different numbers of parameters achieve similar accuracy. Moreover, Fig. 18(b) demonstrates that these models get clustered when plotted against NN-Mass. Further, a high $R^2$ is achieved for a linear fit on the accuracy vs. log(NN-Mass) plot (see Appendix H.7 and Fig. 18).

Table 1: Parameters, FLOPS, number of layers, and test accuracy of various CNNs on CIFAR-10. Base channels in each cell are [16, 32, 64]; for wm = 2, the cells have [32, 64, 128] channels per layer. Test accuracy is reported as mean ± standard deviation of three runs. DARTS results are reported from [9].

Model architecture                   | Design method         | Parameters/FLOPS | Layers | Specialized search space? | NN-Mass | Test Accuracy
DARTS (first order)                  | NAS [9]               | 3.3M/--          | --     | Yes                       | --      | . ± .
DARTS (second order)                 | NAS [9]               | 3.3M/--          | --     | Yes                       | --      | . ± . %
Train large models to be compressed  | Manual                | 11.89M/3.63G     | 64     | No                        | 1126    | . ± .
Train large models to be compressed  | Manual                | 8.15M/2.54G      | 64     | No                        | 622     | . ± .
Proposed                             | Directly via NN-Mass  | 5M/--            | 40     | No                        | 755     | . ± . %
Proposed                             | Directly via NN-Mass  | 4.6M/--          | 37     | No                        | 813     | . ± . %
Proposed                             | Directly via NN-Mass  | 3.82M/--         | 31     | No                        | 856     | . ± . %
Results for FLOPS.

So far, we have shown results for the number of parameters. However, the results for FLOPS follow a very similar pattern (see Fig. 19 in Appendix H.8). In summary, we have shown that NN-Mass can identify models that yield similar test accuracy, despite having very different parameters/FLOPS/layers. We next use this observation to directly design efficient architectures.
Directly Designing Efficient CNNs.

We train our models for 600 epochs on the CIFAR-10 dataset (similar to the setup in DARTS [9]). Table 1 summarizes the number of parameters, FLOPS, and test accuracy of various CNNs. We first train two large CNN models of about 8M and 12M parameters, with NN-Mass values of 622 and 1126, respectively; both of these models achieve similar accuracy. Next, we train three significantly smaller models: (i) a 5M-parameter model with 40 layers and an NN-Mass of 755, (ii) a 4.6M-parameter model with 37 layers and an NN-Mass of 813, and (iii) a 31-layer, 3.82M-parameter model with an NN-Mass of 856. We set the NN-Mass of our smaller models between 750 and 850 (i.e., within the 600-1100 range of the manually-designed CNNs). Interestingly, we do not need to train any intermediate architectures to arrive at the above efficient CNNs. Indeed, classical NAS involves an initial "search phase" over a space of operations to find the architectures [10]. In contrast, our efficient models can be directly designed using the closed-form Eq. (2) for NN-Mass (see Appendix H.9 for more details), which does not involve any intermediate training or even an initial search phase like prior NAS methods (a sketch of this direct-design step is given below). As explained earlier, this is possible because NN-Mass can identify models with similar performance a priori (i.e., without any training)!

As evident from Table 1, our 5M-parameter model, as well as the 4.6M- and 3.82M-parameter models, reach CIFAR-10 test accuracies that are either comparable to, or only slightly lower than, those of the large CNNs, while reducing Params/FLOPS by up to 3x compared to the 11.89M-parameter/3.63G-FLOPS model. Moreover, DARTS [9], a competitive NAS baseline, achieves comparable accuracy with slightly fewer (3.3M) parameters. However, the search space of DARTS (like that of all other NAS techniques) is very specialized and utilizes many state-of-the-art innovations such as depth-wise separable convolutions [7], dilated convolutions [34], etc. On the contrary, we use regular convolutions with only concatenation-type long-range links in our work and present a theoretically-grounded approach. Indeed, our current objective is not to beat DARTS (or any other NAS technique), but rather to underscore the topological properties that should guide the efficient architecture design process.
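As a rough illustration of this direct-design step, the sketch below (our own; the uniform per-cell $t_c$ search and all numeric values are simplifying assumptions, not the authors' procedure) picks the smallest $t_c$ whose closed-form NN-Mass (Eq. (2)) reaches a target value, with no training involved.

```python
def nn_mass(d_c, w_c, t_c):
    """NN-Mass per Eq. (2)."""
    m = 0.0
    for d, w, t in zip(d_c, w_c, t_c):
        links = sum(min(w * (i - 1), t) for i in range(2, d))
        m += 2 * d * links / ((d - 2) * (d - 1))
    return m

def design_t_c(target_mass, d_c, w_c, t_max=400):
    """Smallest uniform t_c whose NN-Mass reaches the target; no training needed."""
    for t in range(1, t_max):
        t_c = [t] * len(d_c)
        if nn_mass(d_c, w_c, t_c) >= target_mass:
            return t_c
    return None

# E.g., a shallow 3-cell CNN (9 conv layers per cell, wm = 2 channel widths)
# matched to the NN-Mass range (~750-850) of the much larger models.
print(design_t_c(target_mass=800, d_c=[9, 9, 9], w_c=[32, 64, 128]))
```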
5 Conclusion

To answer "How does the topology of neural architectures impact gradient propagation and model performance?", we have proposed a new, network science-based metric called NN-Mass, which quantifies how effectively information flows through a given architecture. We have also established concrete theoretical relationships among NN-Mass, the topological structure of networks, and layerwise dynamical isometry, which ensures faithful propagation of gradients through DNNs.

Our experiments have demonstrated that NN-Mass is significantly more effective than the number of parameters to characterize the gradient flow properties, and to identify models with similar accuracy, despite having a highly different number of parameters/FLOPS/layers. Finally, we have exploited our new metric to design efficient architectures directly, achieving up to 3x fewer parameters and FLOPS while sacrificing minimal accuracy over large CNNs.

By quantifying the topological properties of deep networks, our work serves as an important step towards understanding and designing new neural architectures. Since topology is deeply intertwined with gradient propagation, such topological metrics deserve major attention in future research.

Broader Impact

Neural architecture is a fundamental part of the model design process. For all applications, the very first task is to decide how wide or deep the network should be. With the rising ubiquity of complex topologies (e.g., Resnets, Densenets, NASNet [5, 6, 10]), architecture design decisions today not only encompass simple depth and width, but also necessitate an understanding of how various neurons/channels/layers should be connected. Indeed, this understanding has been missing from prior art. This naturally raises the following question: how can we design efficient architectures if we do not even understand how their topology impacts gradient flow?
With our work, we bridge this gap by bringing a new theoretically-grounded perspective to the design of neural architecture topologies. Having demonstrated that topology is a big part of the gradient propagation mechanism in deep networks, it is essential to include such metrics in the model design process. From a broader perspective, we believe that characterizing these properties will enable more efficient and highly accurate neural architectures for all applications (computer vision, natural language, speech recognition, learning over biological data, etc.). Moreover, our work brings together two important fields: deep learning and network science. Specifically, small-world networks have allowed us to study the relationship between topology and gradient propagation. This encourages new research at the intersection of deep learning and network science, which can ultimately help advance our theoretical understanding of deep networks, while building significantly more efficient models in practice.
References

[1] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-48. Springer, 2012.
[2] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.
[4] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.
[7] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[8] Mark Sandler et al. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
[9] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[10] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697-8710, 2018.
[11] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. A signal propagation perspective for pruning neural networks at initialization. In International Conference on Learning Representations, 2020.
[12] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360-3368, 2016.
[13] Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, and Maciej A. Nowak. Dynamical isometry is achieved in residual networks in a universal way for any activation function. arXiv preprint arXiv:1809.08848, 2018.
[14] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pages 4785-4795, 2017.
[15] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820-2828, 2019.
[16] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
[17] Mark Newman, Albert-László Barabási, and Duncan J. Watts. The Structure and Dynamics of Networks, volume 19. Princeton University Press, 2011.
[18] Albert-László Barabási and Eric Bonabeau. Scale-free networks. Scientific American, 288(5):60-69, 2003.
[19] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[20] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016.
[21] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. JMLR, 18(1):6869-6898, 2017.
[22] Liangzhen Lai, Naveen Suda, and Vikas Chandra. Deep convolutional neural network inference with floating-point weights and fixed-point activations. arXiv preprint arXiv:1703.03073, 2017.
[23] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[24] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.
[25] Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. Discovering neural wirings. arXiv preprint arXiv:1906.00586, 2019.
[26] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.
[27] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440, 1998.
[28] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[29] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638, 2019.
[30] Albert-László Barabási. Network Science (Chapter 3: Random Networks). Cambridge University Press, 2016.
[31] Mark E. J. Newman and Duncan J. Watts. Renormalization group analysis of the small-world network model. Physics Letters A, 263(4-6):341-346, 1999.
[32] Rémi Monasson. Diffusion, localization and dispersion relations on "small-world" lattices. The European Physical Journal B - Condensed Matter and Complex Systems, 12(4):555-567, 1999.
[33] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[34] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Supplementary Information:
How Does Topology of Neural Architectures Impact Gradient Propagation and Model Performance?

A DNNs/CNNs with Long-Range Links are Small-World Networks

Note that the DNNs/CNNs considered in our work have both short-range and long-range links (see Fig. 1(a)). This kind of topology typically falls into the category of small-world networks, which can be represented as a lattice network G (containing short-range links) superimposed with a random network R (to account for long-range links) [32, 31]. This is illustrated in Fig. 6.
Figure 6: (a) Small-world networks in traditional network science are modeled as a superposition of a lattice network (G) and a random network (R) [27, 31, 32]. (b) A DNN/CNN with both short-range and long-range links can be similarly modeled as a random network superimposed on a lattice network. Not all links are shown for simplicity.

B Derivation of Density of a Cell
Note that the maximum number of neurons contributing long-range links at each layer in cell $c$ is given by $t_c$. Also, for a layer $i$, the possible candidates for long-range links (all neurons up to layer $i-1$) number $w_c(i-1)$ (see Fig. 1(a)). Indeed, if $t_c$ is sufficiently large, the initial few layers may not have $t_c$ neurons that can supply long-range links; for these layers, we use all available neurons for long-range links. Therefore, for a given layer $i$, the number of long-range links ($l_i$) is given by:

$$l_i = \begin{cases} w_c(i-1) \times w_c & \text{if } t_c > w_c(i-1) \\ t_c \times w_c & \text{otherwise} \end{cases} \quad (3)$$

where both cases have been multiplied by $w_c$ because, once the neurons are randomly selected, they supply long-range links to all $w_c$ neurons at the current layer $i$ (see Fig. 1(a)). Hence, for an entire cell, the total number of long-range links ($l_c$) is as follows:

$$l_c = w_c \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\} \quad (4)$$

On the other hand, the total number of possible long-range links within a cell ($L$) is simply the sum of the possible candidates at each layer:

$$L = \sum_{i=2}^{d_c-1} w_c(i-1) \times w_c = w_c^2 \left[1 + 2 + \ldots + (d_c-2)\right] = \frac{w_c^2 (d_c-2)(d_c-1)}{2} \quad (5)$$

Using Eq. (4) and Eq. (5), we can rewrite Eq. (1) as:

$$\rho_c = \frac{l_c}{L} = \frac{2 \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\}}{w_c (d_c-2)(d_c-1)} \quad (6)$$
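Eq. (6) can be sanity-checked by counting links explicitly; the sketch below (our own, for an arbitrary configuration) compares the closed form against direct enumeration:

```python
def density_closed_form(d_c, w_c, t_c):
    """Cell-density rho_c from Eq. (6)."""
    num = 2 * sum(min(w_c * (i - 1), t_c) for i in range(2, d_c))
    return num / (w_c * (d_c - 2) * (d_c - 1))

def density_by_counting(d_c, w_c, t_c):
    """rho_c = (actual long-range links) / (all possible long-range links)."""
    actual = sum(w_c * min(w_c * (i - 1), t_c) for i in range(2, d_c))
    possible = sum(w_c * (w_c * (i - 1)) for i in range(2, d_c))
    return actual / possible

print(density_closed_form(16, 8, 20))   # 0.314...
print(density_by_counting(16, 8, 20))   # matches the closed form
```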
C Proof of Proposition 1

Proposition 1 (NN-Mass and average degree of the network (a topological property)). The average degree of a deep network with NN-Mass $m$ is given by $\hat{k} = w_c + m/2$.

Proof. As shown in Fig. 6, deep networks with shortcut connections can be represented as small-world networks consisting of two parts: (i) a lattice network containing only the short-range links, and (ii) a random network superimposed on top of the lattice network to account for the long-range links. For sufficiently deep networks, the average degree of the lattice network is just the width $w_c$ of the network. The average degree of the randomly added long-range links, $\bar{k}_{R|G}$, is given by:

$$\bar{k}_{R|G} = \frac{\text{Number of long-range links added by } R}{\text{Number of nodes}} = \frac{w_c \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\}}{w_c d_c} = \frac{m (d_c-2)(d_c-1)}{2 d_c^2} \approx \frac{m}{2} \quad (7)$$

where the third equality uses Eq. (2) for one cell, and the approximation holds when $d_c \gg 1$ (e.g., for deep networks). Therefore, the average degree of the complete model is given by $w_c + m/2$.

D Proof of Proposition 2
Proposition 2 (NN-Mass and LDI). Given a small deep network $f_S$ (depth $d_S$) and a large deep network $f_L$ (depth $d_L$, $d_L \gg d_S$), both with the same NN-Mass $m$ and width $w_c$, the LDI for both models is equivalent. Specifically, if $\Sigma_S^i$ ($\Sigma_L^i$) denotes the singular values of the initial layerwise Jacobians $J_{i,i-1}$ for the small (large) model, then the mean singular values in both models are similar; that is, $\mathbb{E}[\Sigma_S^i] \approx \mathbb{E}[\Sigma_L^i]$.

Proof. Consider a matrix $M \in \mathbb{R}^{H \times W}$ with $H$ rows and $W$ columns, all entries independently initialized from a Gaussian distribution $\mathcal{N}(0, q)$; we calculate its mean singular value. We first perform a Singular Value Decomposition (SVD) of the given matrix $M$:

$$U \in \mathbb{R}^{H \times H},\ \Sigma \in \mathbb{R}^{H \times W},\ V \in \mathbb{R}^{W \times W} = \mathrm{SVD}(M), \qquad \Sigma = \mathrm{Diag}(\sigma_1, \sigma_2, \ldots, \sigma_K)$$

Given a row vector $\vec{u}_i \in \mathbb{R}^H$ in $U$, and a row vector $\vec{v}_i \in \mathbb{R}^W$ in $V$, we use the following relations of the SVD in our proof:

$$\sigma_i = \vec{u}_i^T M \vec{v}_i, \qquad \vec{u}_i^T \vec{u}_i = 1, \qquad \vec{v}_i^T \vec{v}_i = 1$$

It is hard to directly compute the mean singular value $\mathbb{E}[\sigma_i]$. To simplify the problem, consider $\sigma_i^2$:

$$\sigma_i^2 = \sigma_i \times \sigma_i^T = (\vec{u}_i^T M \vec{v}_i)(\vec{u}_i^T M \vec{v}_i)^T = \vec{u}_i^T M \vec{v}_i \vec{v}_i^T M^T \vec{u}_i = \vec{u}_i^T M M^T \vec{u}_i \quad (8)$$

Substituting $B = M M^T$ (where $B \in \mathbb{R}^{H \times H}$), and using $m_{ij}$ to represent the $ij$-th entry of the given matrix $M$, the entry $b_{ij}$ of $B$ is given by:

$$b_{ij} = \begin{cases} \sum_{k=1}^{H} m_{ik}^2, & i = j \\ \sum_{k=1}^{H} m_{ik} m_{kj}, & i \neq j \end{cases}$$

Since the $m_{ij}$ follow an independent and identical Gaussian distribution $\mathcal{N}(0, q)$, the diagonal entries of $B$ ($b_{ii}$) follow a chi-square distribution with $H$ degrees of freedom: $b_{ii} \sim \chi^2(H)$.

For the non-diagonal entries of $B$, i.e., $i \neq j$, suppose $z_k = xy$, with $x = m_{ik}$ and $y = m_{kj}$; then the probability density function (PDF) of $z_k$ is as follows:

$$\mathrm{PDF}_Z(z_k) = \int_{-\infty}^{\infty} \mathrm{PDF}_X(t) \frac{1}{|t|} \mathrm{PDF}_Y\!\left(\frac{z_k}{t}\right) dt = \int_{-\infty}^{\infty} \frac{1}{2\pi |t|}\, e^{-\frac{t^2 + z_k^2/t^2}{2}}\, dt \quad (9)$$

Based on the probability density function of $z_k$, the expectation of $z_k$ is given by:

$$\mathbb{E}[z_k] = \int_{-\infty}^{\infty} \mathrm{PDF}_Z(z_k)\, z_k\, dz_k$$

As shown in Eq. (9), $\mathrm{PDF}_Z(z_k)$ is an even function, so $\mathrm{PDF}_Z(z_k)\, z_k$ is an odd function; therefore, $\mathbb{E}[z_k] = 0$ and, thus, $\mathbb{E}[b_{ij}] = \sum_{k=1}^{H} \mathbb{E}[z_k] = 0$ when $i \neq j$.

Hence, we can now get the expectation of each entry of the matrix $B$:

$$\mathbb{E}[b_{ij}] = \begin{cases} H, & i = j \\ 0, & i \neq j \end{cases} \qquad \text{that is,} \qquad \mathbb{E}[B] = \mathrm{Diag}(b_{ii}) = H I \quad (10)$$

where $I \in \mathbb{R}^{H \times H}$ is an identity matrix. Combining Eq. (8) and Eq. (10), we get the following results:

$$\mathbb{E}[\sigma_i^2] = \mathbb{E}[\vec{u}_i^T M M^T \vec{u}_i] = \mathbb{E}[\vec{u}_i^T]\, \mathbb{E}[M M^T]\, \mathbb{E}[\vec{u}_i] = \mathbb{E}[\vec{u}_i^T]\, H I\, \mathbb{E}[\vec{u}_i] = H\, \mathbb{E}[\vec{u}_i^T \vec{u}_i] = H \quad (11)$$

Therefore, we have:

$$\mathbb{E}[\sigma_i^2] = H \quad (12)$$

Eq. (12) states that, for a Gaussian $M \in \mathbb{R}^{H \times W}$, $\mathbb{E}[\sigma_i^2]$ depends on the number of rows $H$, and does not depend on $W$. To empirically verify this, we simulate several Gaussian matrices for a range of widths $W$ and heights $H$, and plot $\mathbb{E}[\sigma_i^2]$ vs. $H$ in Fig. 7. As evident, for different $W$, the mean singular values nearly coincide, thereby showing that the mean singular value indeed depends on $H$. Also, for small-enough ranges of $H$, the relationship between $\mathbb{E}[\sigma_i]$ and $H$ can be approximated by a linear trend.

To see the above linear trend between the mean singular values ($\mathbb{E}[\sigma_i]$) and $H$, we now simulate the more realistic scenario that arises for the initial layerwise Jacobian matrices ($J_{i,i-1}$). As explained in the main paper, the layerwise Jacobians will theoretically have dimensions $(w_c + m/2, w_c)$, where $w_c$ is the width of the DNN and $m$ is the NN-Mass. That is, now $M = J_{i,i-1}$, $W = w_c$, and $H = w_c + m/2$. Hence, in Fig. 8, we plot the mean singular values of Gaussian-distributed matrices of size $(w_c + m/2, w_c)$ vs. NN-Mass ($m$). As evident, for $w_c$ ranging from 8 to 256, the mean singular values increase linearly with NN-Mass. We explicitly demonstrate in our experiments that this linear trend holds true for actual non-linear deep networks.

Finally, since the Jacobians have a size of $(w_c + m/2, w_c)$, Eq. (12) suggests that their mean singular values should depend on $H = w_c + m/2$. Hence, when two DNNs have the same NN-Mass and width, their mean singular values should be similar, i.e., $\mathbb{E}[\Sigma_S^i] \approx \mathbb{E}[\Sigma_L^i]$ (irrespective of their depths).

Figure 7: The mean singular value $\mathbb{E}[\sigma_i]$ increases only with $H$ while varying $W$. For small-enough ranges, the $\mathbb{E}[\sigma_i]$ vs. $H$ relationship can be approximated by a linear trend.

Figure 8: To simulate more realistic Jacobian matrices, we calculate the mean singular value of a matrix $M$ of size $[w_c + m/2, w_c]$ ($w_c$ is given by the width in the title of each sub-figure). Clearly, $\mathbb{E}[\sigma_i]$ varies linearly with the corresponding NN-Mass for all $w_c$ values. Moreover, as $w_c$ increases, the mean singular values ($\mathbb{E}[\sigma_i]$) increase. Both observations show that $\mathbb{E}[\sigma_i]$ increases with $\hat{k} = w_c + m/2$ (since the height of the Jacobian matrix, $H = \hat{k}$, depends on both $w_c$ and $m$).
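The result of Eq. (12) is easy to verify numerically. The following sketch (our own, assuming entries drawn from $\mathcal{N}(0, 1)$) averages the squared singular values of tall Gaussian matrices:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_sq_singular(H, W, trials=50):
    """Average squared singular value of H x W standard-normal matrices."""
    vals = [np.linalg.svd(rng.standard_normal((H, W)), compute_uv=False) ** 2
            for _ in range(trials)]
    return np.mean(vals)

for H, W in [(100, 20), (100, 50), (200, 50)]:
    print(f"H={H}, W={W}: E[sigma^2] ~= {mean_sq_singular(H, W):.1f}")  # ~= H
```

In each case the average tracks $H$ and is insensitive to $W$, consistent with Eq. (12).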
E CNN Details

In contrast to our MLP setup, which contains only a single cell of width $w_c$ and depth $d_c$, our CNN setup contains three cells, each containing a fixed number of layers, similar to prior works such as Densenets [6], Resnets [5], etc. However, topologically, a CNN is very similar to an MLP. Since, in a regular convolutional layer, the channel-wise convolutions are added to obtain the final output channel (see Fig. 1(c)), each input channel contributes to each output channel at all layers. This is true for both long-range and short-range links; this makes the topological structure of CNNs similar to our MLP setup shown in Fig. 1(a) in the main paper (the only difference is that now each channel, and not each neuron, is a node in the network).

In the case of CNNs, following standard practice [35], the width (i.e., the number of channels per layer) is increased by a factor of two at each cell as the feature map height and width are reduced by half. After the convolutions, the final feature map is average-pooled and passed through a fully-connected layer to generate the logits. The width of the CNNs is controlled using a width multiplier, wm (as in Wide Resnets [33] and Mobilenets [7]). The base channels in each cell are [16, 32, 64]; for wm = 2, the cells have [32, 64, 128] channels per layer.

Figure 9: An example CNN used to illustrate the NN-Density and NN-Mass calculations: three cells with $d_c = 4$ layers each, widths $w_c = [2, 3, 4]$, and maximum long-range link candidates $t_c = [3, 4, 5]$; no long-range links cross cell boundaries. Not all links are shown in the main figure for simplicity. The inset shows the contributions from all long-range and short-range links: the feature maps of randomly selected channels are concatenated at the current layer (similar to Densenets [6]). At each layer of a given cell, the maximum number of channels that can contribute long-range links is given by $t_c$; e.g., in the widest cell, layer $i = 2$ receives long-range links from $\min\{w_c(i-1), t_c\} = 4$ previous channels, and layer $i = 3$ from $\min\{w_c(i-1), t_c\} = 5$. If a channel is selected, it contributes long-range links to all output channels of the current layer.

F Example: Computing NN-Mass for a CNN
Given the CNN architecture shown in Fig. 9, we now calculate its NN-Mass. This CNN consists of three cells, each containing $d_c = 4$ convolutional layers. The three cells have widths (i.e., numbers of channels per layer) of 2, 3, and 4, respectively; we denote the network width as $w_c = [2, 3, 4]$. Finally, the maximum number of channels that can supply long-range links is given by $t_c = [3, 4, 5]$. That is, the first cell can have a maximum of three long-range link candidates per layer (i.e., previous channels that can supply long-range links), the second cell can have a maximum of four long-range link candidates per layer, and so on. Moreover, as mentioned before, we randomly choose $\min\{w_c(i-1), t_c\}$ channels for long-range links at each layer. The inset of Fig. 9 shows how the long-range links are created by concatenating feature maps from previous layers.

Hence, using $d_c = 4$, $w_c = [2, 3, 4]$, and $t_c = [3, 4, 5]$ for the three cells, we can directly use Eq. (2) to compute the NN-Mass value. Substituting the values into the equation, we obtain $m = 28$. Consequently, the set $\{d_c, w_c, t_c\}$ can be used to specify the architecture of any CNN with concatenation-type long-range links. Therefore, to perform NASE, we vary $\{d_c, w_c, t_c\}$ to obtain architectures with different NN-Mass and NN-Density values.
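This worked example can be reproduced in a few lines of Python (our own sketch of Eq. (2), as in Section 3):

```python
def nn_mass(d_c, w_c, t_c):  # Eq. (2)
    return sum(2 * d * sum(min(w * (i - 1), t) for i in range(2, d))
               / ((d - 2) * (d - 1)) for d, w, t in zip(d_c, w_c, t_c))

# Three cells, four layers each, widths [2, 3, 4], t_c = [3, 4, 5]:
print(nn_mass([4, 4, 4], [2, 3, 4], [3, 4, 5]))  # 28.0, matching the text
```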
G Complete Details of the Experimental Setup

G.1 MLP Setup
We now explain more details of our MLP setup for the MNIST dataset. We create random architectures with different NN-Mass and Params by varying $t_c$ and $d_c$. Moreover, we use just a single cell for all MLP experiments. We fix $w_c = 8$ and vary $d_c$ over several depths; for each depth $d_c$, we vary $t_c$ over a range of values. Specifically, for a given $\{d_c, w_c, t_c\}$ configuration, we create random long-range links at layer $i$ by uniformly sampling $\min\{w_c(i-1), t_c\}$ neurons out of the $w_c(i-1)$ activation outputs from the previous $\{1, 2, \ldots, i-1\}$ layers.

Table 2: CNN architecture details (width multiplier = 2)

Number of Cells | Max. Long-Range Link Candidates (t_c)                                | Depth | Width Multiplier
3               | [10,35,50], [20,45,75], [30,50,100], [40,60,120], [50,70,145]        | 31    | 2
3               | [20,40,70], [30,50,100], [40,80,125], [50,105,150], [60,130,170]     | 40    | 2
3               | [25,50,90], [35,80,125], [50,105,150], [70,130,170], [90,150,210]    | 49    | 2
3               | [30,80,117], [50,110,150], [70,140,200], [90,175,250], [110,215,300] | 64    | 2

We train these random architectures on the MNIST dataset for 60 epochs with the Exponential Linear Unit (ELU) as the activation function. Further, each $\{d_c, w_c, t_c\}$ configuration is trained five times with different random seeds. In other words, during each of the five runs of a specific $\{d_c, w_c, t_c\}$ configuration, the shortcuts are initialized randomly, so these five models are not the same. This kind of setup is used to validate that NN-Mass is indeed a topological property of deep networks, and that the specific connections inside the random architectures do not affect our conclusions. The results are then averaged over all runs: the mean is plotted in Fig. 2, and the standard deviation, which is typically low, is given in the Fig. 2 caption. Overall, this setup results in many MLPs with different Params/FLOPS/layers.
G.2 CNN Setup
Much of the setup for creating long-range links in CNNs is the same as for MLPs, except that we have three cells instead of one. As explained in Appendix E, the widths of the three cells are given by wm × [16, 32, 64], where wm is the width multiplier. Note that, since the three cells have different widths (w_c), t_c also takes a different value in each cell. The depth per cell d_c is the same for all cells; hence, the total depth is 3d_c + 4. For instance, for a 31-layer model, d_c = 9. For most of our experiments, we set the total depth of the CNN to {31, 40, 49, 64}; some experiments also use a second set of four total depths (the previously unknown architectures of Appendix H.6).

Again, we conduct several experiments with different {d_c, w_c, t_c} values, which yield many random CNN architectures. The random long-range link creation process is the same as for MLPs (a sketch of the concatenation scheme is given below); for the CNN experiments, we repeated all experiments three times with different random seeds. The specific {d_c, w_c, t_c} values are given in Tables 2, 3, and 4; each row of these tables represents a different set of {d_c, w_c, t_c} configurations. Of note, all CNNs use the ReLU activation function and Batch Norm layers.

For CNNs, we verify our findings on the CIFAR-10 and CIFAR-100 image classification datasets. The learning rate of all models is initialized to 0.05 and follows a cosine-annealing schedule at each epoch, with a minimum learning rate of 0.0 (see the end of Section H.9 for details on how we fixed these hyper-parameter values). Similar to the setup of prior NAS works, cutout is used for data augmentation. All models are trained in Pytorch on NVIDIA 1080-Ti, Titan Xp, and 2080-Ti GPUs. This completes the experimental setup.
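Since all models are trained in Pytorch, the concatenation-type long-range links inside one cell can be sketched roughly as below. This is a simplified, hypothetical module: strides, transitions between cells, and the classifier head are omitted, and the real cells may wire the links somewhat differently.

```python
import random
import torch
import torch.nn as nn

class RandomDenseCell(nn.Module):
    """One cell with DenseNet-style random long-range links (a sketch;
    the cell input is assumed to already have `width` channels)."""

    def __init__(self, width, depth, t):
        super().__init__()
        self.links, self.convs = [], nn.ModuleList()
        for i in range(1, depth + 1):
            # Layers 1..i-2 expose width*(i-2) candidate channels; a random
            # subset of at most t of them is fixed once, at construction time.
            n_cand = width * max(i - 2, 0)
            k = min(n_cand, t)
            self.links.append(sorted(random.sample(range(n_cand), k)))
            self.convs.append(nn.Sequential(
                nn.Conv2d(width + k, width, kernel_size=3, padding=1),
                nn.BatchNorm2d(width),
                nn.ReLU()))

    def forward(self, x):                        # x: (N, width, H, W)
        feats = []                               # outputs of layers 1..i-1
        for i, conv in enumerate(self.convs, start=1):
            if self.links[i - 1]:
                # Concatenate the selected channels of layers 1..i-2 with
                # the direct input, as in the inset of Fig. 9.
                pool = torch.cat(feats[:i - 2], dim=1)
                x = torch.cat([x, pool[:, self.links[i - 1]]], dim=1)
            x = conv(x)
            feats.append(x)
        return x
```

For example, the first cell of a 31-layer, wm = 2 model from the first row of Table 2 would correspond to RandomDenseCell(width=32, depth=9, t=10).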
Table 3: CNN architecture details (width multiplier = 1)

Number of cells | Max. long-range link candidates (t_c)                         | Depth | Width multiplier
3               | [5,8,12], [10,30,50], [30,40,70], [41,61,91], [50,90,110]     | 31    | 1
3               | [5,9,12], [11,31,51], [31,41,71], [41,62,92], [50,90,109]     | 40    | 1
3               | [5,10,11], [11,31,52], [31,41,73], [42,62,93], [50,90,109]    | 49    | 1
3               | [5,10,12], [11,32,53], [31,42,74], [42,62,94], [49,90,110]    | 64    | 1

Table 4: CNN architecture details (width multiplier = 3)

Number of cells | Max. long-range link candidates (t_c)                                 | Depth | Width multiplier
3               | [10,30,50], [40,60,90], [70,90,130], [100,120,170], [130,150,210]     | 31    | 3
3               | [11,31,51], [42,62,92], [72,93,133], [103,123,173], [133,153,212]     | 40    | 3
3               | [11,31,52], [43,63,93], [73,95,135], [104,124,176], [134,154,214]     | 49    | 3
3               | [12,32,52], [44,64,95], [76,96,136], [106,126,178], [135,156,216]     | 64    | 3

Figure 10: More MNIST training convergence results (panels a and b are repeated from Fig. 2, annotated with models X, Y, Z, D, E, F): (a) Models with different Params achieve similar test accuracy. (b) Test accuracy curves of models with different depths/Params concentrate when plotted against NN-Mass (the test accuracy std. dev. is low). (c, d) Models X and Y have the same NN-Mass and achieve very similar training convergence, even though they have highly different Params and depth. Model Z has significantly fewer layers than Y but nearly the same Params, and yet achieves faster training convergence than Y (Z has higher NN-Mass than Y). The same conclusions hold for models D, E, and F. Note that the training convergence curves of models with similar NN-Mass coincide.
H Additional Results
H.1 More MNIST training convergence results
We pick two groups of three models each, (X, Y, Z) and (D, E, F), and plot their training accuracy vs. epochs. Models X and Y have similar NN-Mass, but Y has more Params and depth than X. Model Z has far fewer layers and nearly the same Params as X, but a higher NN-Mass. Fig. 10(c) shows the training convergence results for all three models. As is evident, the training convergence of model X (8.3K Params, 24 layers) nearly coincides with that of model Y (9.0K Params, 32 layers). Moreover, even though model Z (8.3K Params, 16 layers) is shallower than the 32-layer model Y (and has far fewer Params), the training convergence of Z is significantly faster than that of Y (due to its higher NN-Mass and, therefore, better LDI). These results provide clear evidence for the theoretical insights of Proposition 2, and emphasize the importance of topological properties of neural architectures in characterizing gradient propagation and model performance. Similar observations hold among models D, E, and F.
Figure 11: Illustration of the synthetic datasets Seg4 (a) and Circle4 (b). (a) The Seg20 (Seg30) dataset is similar to Seg4, but divides the [0, 1] range into 20 (30) segments. (b) The Circle20 (or simply Circle) dataset is similar to Circle4, but divides the unit circle into 20 concentric circles.

H.2 Results on synthetic data
In this section, we design a few synthetic experiments for our MLPs to verify that the observations of Section 4.2 hold for diverse datasets. Specifically, we design three datasets: Seg20, Seg30, and Circle20 (or just Circle). Fig. 11(a) illustrates the Seg4 dataset, where the range [0, 1] is broken into 4 segments; similarly, Seg20 (Seg30) breaks the line into 20 (30) segments. The classification problem has two classes (alternating segments belong to the same class). Fig. 11(b) shows the Circle dataset, where the unit circle is broken into concentric circles (the regions between consecutive circles form the classes, again two in total). The details of these datasets are given in Table 5, and a generator sketch is given at the end of this description. Of note, we use the ReLU activation function for these experiments (unlike the ELU used for MNIST).

Table 5: Description of our generated synthetic datasets (training set: i ∈ [1, N_train]; test set: i ∈ [1, N_test]; ⌊·⌋ denotes the floor function)

Dataset name      | Description
Seg20             | Feature: [X_i, X_i], Label: Y_i, with X_i = sample([⌊i⌋, ⌊i⌋ + 1]) and Y_i = ⌊i⌋ mod 2
Seg30             | Feature: [X_i, X_i], Label: Y_i, with X_i = sample([⌊i⌋, ⌊i⌋ + 1]) and Y_i = ⌊i⌋ mod 2
Circle (Circle20) | Feature: [X_i1, X_i2], Label: Y_i, with X_i1 = L_i · cos(rand_num), X_i2 = L_i · sin(rand_num), L_i = sample([⌊i⌋, ⌊i⌋ + 1]), and Y_i = ⌊i⌋ mod 2

For these synthetic experiments, we once again conduct the following: (i) we explore the impact of varying Params and NN-Mass on the test accuracy, and (ii) we demonstrate how LDI depends on NN-Mass and Params.
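The exact sampling constants of Table 5 are not reproduced here; the following generators are a hedged reconstruction of the datasets as described above (uniform draws within segments/rings, alternating labels), with illustrative function names and seeds.

```python
import numpy as np

def make_seg(n, k, seed=0):
    """Seg-k: [0, 1] split into k segments with alternating labels.

    A hedged reconstruction of Table 5; both features are set equal so the
    points lie on a line, as in Fig. 11(a).
    """
    rng = np.random.default_rng(seed)
    seg = rng.integers(0, k, size=n)       # which segment a point falls in
    x = (seg + rng.random(n)) / k          # uniform draw inside that segment
    return np.stack([x, x], axis=1), seg % 2

def make_circle(n, k, seed=0):
    """Circle-k: k concentric rings inside the unit circle, alternating labels."""
    rng = np.random.default_rng(seed)
    ring = rng.integers(0, k, size=n)
    r = (ring + rng.random(n)) / k         # radius inside that ring (L_i)
    theta = 2 * np.pi * rng.random(n)      # rand_num in Table 5
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1), ring % 2

X, y = make_seg(1000, 20)                  # e.g., Seg20 features/labels
```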
Test Accuracy
As shown in Fig. 12(a, b, c) and Fig. 12(d, e, f), NN-Mass characterizes the model performance of DNNs much better than the number of parameters. Again, we quantify these results by fitting a linear model between test accuracy and log(Params), and between test accuracy and log(NN-Mass). Similar to the MNIST case, our results show that the R² of test accuracy vs. NN-Mass is much higher than that vs. Params.
Figure 12: Synthetic results on Seg20 (a, d), Seg30 (b, e), and Circle20 (c, f): (a, b, c) Models with different Params achieve similar test accuracy across all synthetic datasets. (d, e, f) Test accuracy curves for the same set of models come closer together when plotted against NN-Mass.

Figure 13: Synthetic results (Circle20 dataset): The mean singular value of J_{i,i−1} is much better correlated with NN-Mass than with Params.
Layerwise Dynamical Isometry
Fig. 13 shows the LDI results for the Circle20 dataset. Again, higher NN-Mass leads to higher initial singular values. Moreover, NN-Mass is better correlated with LDI than Params. This further explains why networks with similar NN-Mass (rather than similar Params) achieve more similar model performance.
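Numerically, the LDI metric reported in Fig. 13 amounts to the mean singular value of the layerwise Jacobian J_{i,i−1} at initialization. A minimal sketch is given below; the probe layer is illustrative, and a full measurement would traverse all layers of a trained-from-scratch model (including its concatenated long-range inputs).

```python
import torch

def mean_singular_value(layer, x):
    """Mean singular value of J_{i,i-1} = d(layer(x))/dx at the point x,
    where x is one (flattened) activation vector from layer i-1."""
    J = torch.autograd.functional.jacobian(layer, x)  # shape: (out_dim, in_dim)
    return torch.linalg.svdvals(J).mean().item()

# Probe a freshly initialized width-8 layer with ReLU, as in the synthetic setup:
layer = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
x = torch.randn(8)
print(mean_singular_value(layer, x))
```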
H.3 Impact of Varying NN-Density
As a baseline, we show that NN-Density cannot predict the accuracy of models with different depths. We train deep networks with varying NN-Density (see the Table 2 models in Appendix G). Fig. 14 shows that shallower models with higher density can reach accuracy comparable to deeper models with lower density. This is quite reasonable: the shallower models are more densely connected than the deeper networks, which promotes more effective information flow in the shallower CNNs despite their significantly fewer parameters. However, NN-Density alone does not identify models (with different size/compute) that achieve similar accuracy, because CNNs with different depths reach comparable test accuracies at different NN-Density values (e.g., while a 31-layer model at a higher ρ_avg performs close to a 64-layer model at a lower ρ_avg, a 49-layer model already outperforms that 64-layer model at yet another density; see models P, Q, R in Fig. 14). Therefore, NN-Density alone is not sufficient.

Figure 14: CIFAR-10, width multiplier wm = 2: Shallower models with higher density can reach accuracy comparable to deeper models with lower density. This does not help, since models with different depths achieve comparable accuracies at different densities (see models P, Q, R).

Figure 15: Impact of varying width: (a) width multiplier wm = 1 (R² = 0.74), (b) wm = 2 (R² = 0.84), and (c) wm = 3 (R² = 0.90). As width increases, the capacity of the smaller (shallower) models increases and, therefore, the accuracy gap between models of different depths shrinks; hence, the R² of the linear fit increases with width.

H.4 R-Squared of CIFAR-10 Accuracy vs. NN-Mass
Fig. 15 shows the impact of increasing model width on the R² of the linear fit between test accuracy and log(NN-Mass).

H.5 Comparison between NN-Mass and Parameter Counting for CNNs
For MLPs, we have shown that NN-Mass significantly outperforms Params in predicting model performance. For CNNs, we quantitatively demonstrate that while parameter counting can be a useful indicator of test accuracy for models with low width (though still not as good as NN-Mass), it completely fails to predict test accuracy as the width increases. Specifically, in Fig. 16(a), we fit a linear model between test accuracy and log(Params) and find an R² of 0.76, slightly lower than that obtained for NN-Mass (R² = 0.84, see Fig. 16(b)). When the width multiplier of the CNNs increases to three, parameter counting completely fails to fit the test accuracies of the models (Fig. 16(c)). In contrast, NN-Mass significantly outperforms parameter counting for wm = 3, achieving R² = 0.90 (Fig. 16(d)). This demonstrates that NN-Mass is indeed a significantly stronger indicator of model performance than parameter counting (a sketch of this comparison is given below).
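The comparison itself is just ordinary least squares of test accuracy on log(Params) vs. log(NN-Mass). A minimal sketch with placeholder data follows; the arrays stand in for the measured values from the Tables 2-4 runs and are not actual results.

```python
import numpy as np
from scipy import stats

def r_squared_logfit(x, acc):
    """R^2 of the ordinary-least-squares fit acc ~ a * log(x) + b."""
    fit = stats.linregress(np.log(x), acc)
    return fit.rvalue ** 2

# Placeholders standing in for (Params, NN-Mass, test accuracy) of a set of
# trained CNNs; substitute the measured values from the actual runs.
params = np.array([0.3e6, 0.7e6, 1.2e6, 2.1e6, 3.4e6])
masses = np.array([180.0, 420.0, 610.0, 890.0, 1240.0])
acc = np.array([95.1, 95.8, 96.0, 96.3, 96.5])

print(r_squared_logfit(params, acc), r_squared_logfit(masses, acc))
```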
H.6 NN-Mass to Predict Test Accuracy of Unknown Architectures

We now demonstrate that NN-Mass can be used to predict the test accuracy of unknown architectures that have never been trained. To this end, we create a testing set of new architectures by training 20 previously unknown architectures with wm = 2 and four new total depths. For these models, we vary the NN-Density over five values that differ from those of the initial architecture space exploration of Fig. 15(b) and Table 2 (in the initial setting, the {31, 40, 49, 64}-layer models were trained at five other NN-Density values). We then use the linear model trained on the {31, 40, 49, 64}-layer models (see Fig. 15(b)) to predict the test accuracy of the unknown CNNs. Note that our testing set consists of models with both different numbers of layers and different NN-Densities (and, implicitly, different NN-Mass values) than the training set.
Figure 16: NN-Mass vs. parameter counting as indicators of model performance. (a) For wm = 2, log(Params) fits the test accuracy with R² = 0.76. (b) For the same wm = 2 case, log(NN-Mass) fits the test accuracy with a higher R² = 0.84. (c) For higher width (wm = 3), parameter counting completely fails to fit the test accuracy of the various models (very low R²). (d) In contrast, NN-Mass still fits the accuracies with a high R² = 0.90.
Figure 17: The linear model trained in Fig. 15(b) is used to predict the test accuracy of completely new architectures. The resulting R² = 0.79 is still high and comparable to the training R² = 0.84. The linear model was trained on the test accuracies and NN-Mass values of the {31, 40, 49, 64}-layer models; the testing set consists of completely new models with different depths and NN-Densities.

Fig. 17 shows that the testing R² is 0.79 (i.e., the R² obtained when predicting the accuracy of the testing-set models), which is close to the training R² of 0.84 (see Fig. 15(b)). Hence, NN-Mass can be used to predict the test accuracy of models that were never trained before; a sketch of this fit-and-predict protocol is given below.
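A minimal sketch of the fit-and-predict protocol, again with placeholder arrays standing in for the measured (NN-Mass, accuracy) pairs of the training and testing architectures.

```python
import numpy as np

def fit_line(log_mass, acc):                 # acc ~ a * log(mass) + b
    a, b = np.polyfit(log_mass, acc, deg=1)
    return a, b

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Placeholders for the {31,40,49,64}-layer training architectures and the
# never-before-trained testing architectures (illustrative values only).
m_train = np.array([200.0, 500.0, 800.0, 1200.0])
a_train = np.array([95.9, 96.2, 96.4, 96.6])
m_test = np.array([300.0, 650.0, 1000.0])
a_test = np.array([96.0, 96.3, 96.5])

a, b = fit_line(np.log(m_train), a_train)
print(r_squared(a_test, a * np.log(m_test) + b))   # "testing" R^2, cf. Fig. 17
```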
H.7 Results for CIFAR-100

Results for the CIFAR-100 dataset are shown in Fig. 18. As evident, several models achieve similar accuracy despite having highly different numbers of parameters (e.g., see the models within box W in Fig. 18(a)). Again, these models get clustered together when plotted against NN-Mass: specifically, the models within box W in Fig. 18(a) fall into buckets Y and Z in Fig. 18(b). Hence, models that clustered together on the CIFAR-10 dataset also cluster together on CIFAR-100. To quantify these results, we fit a linear model between test accuracy and log(NN-Mass) and, again, obtain a high R² = 0.84 (see Fig. 18(c)). Therefore, our observations hold across multiple image classification datasets.
Figure 18: Similar results are obtained for CIFAR-100 (wm = 2). (a) Models in box W have highly different parameter counts but achieve similar accuracy. (b) These models get clustered into buckets Y and Z. (c) The R² of the linear regression fit is 0.84, which shows that NN-Mass is a good predictor of test accuracy. Results are reported as the mean of three runs (std. dev. is low).

H.8 Results for Floating Point Operations (FLOPS)
All results for FLOPS (for the CNN architectures of Tables 2, 3, and 4) are shown in Fig. 19. As evident, models with highly different FLOPS often achieve similar test accuracy. As shown earlier, many of these CNN architectures cluster together when plotted against NN-Mass.
H.9 NN-Mass for Directly Designing Compressed Architectures
Our theoretical and empirical evidence shows that NN-Mass reliably identifies models that achieve a similar accuracy despite having different numbers of layers and parameters. This observation can be used to directly design efficient CNNs as follows:

• First, train a reference big CNN (with a large number of parameters and layers) that achieves very high accuracy on the target dataset, and calculate its NN-Mass (denoted m_L).

• Next, create a completely new and significantly smaller model with far fewer parameters and layers, but with an NN-Mass (m_S) comparable to or higher than that of the large CNN. This process is very fast, as the new model is created without any a priori training. For instance, to design an efficient CNN of width w_c, depth per cell d_c, and NN-Mass m_S ≈ m_L, we only need to find how many long-range links to add in each cell. Since NN-Mass has a closed-form expression (i.e., Eq. (2)), a simple search over the number of long-range links directly determines the NN-Mass of various candidate architectures (see the sketch after this list); we then select the architecture whose NN-Mass is closest to that of the reference CNN. Unlike current manual or NAS-based methods, our approach does not require training individual architectures during the search.

• Since the NN-Mass of the smaller model is similar to that of the reference CNN, our theoretical and empirical results suggest that the newly generated model will lose only a small amount of accuracy while significantly reducing model size. To validate this, we train the new, significantly smaller model and compare its test accuracy against that of the original large CNN.
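The search in the second step can be sketched as follows, reusing the same reading of Eq. (2) as the Appendix F sketch. For brevity, a single shared t is swept across cells, whereas t could also be varied per cell; all numbers in the example are illustrative.

```python
def nn_mass(d_c, w_c, t_c):
    # Same reading of Eq. (2) as in the Appendix F sketch.
    total = 0.0
    for w, t in zip(w_c, t_c):
        actual = sum(min(t, w * (i - 2)) for i in range(3, d_c + 1))
        possible = sum(w * (i - 2) for i in range(3, d_c + 1))
        total += w * d_c * actual / possible
    return total

def match_mass(m_L, d_c, w_c, t_max=512):
    """Smallest shared t whose NN-Mass reaches the reference mass m_L."""
    for t in range(1, t_max + 1):
        if nn_mass(d_c, w_c, [t] * len(w_c)) >= m_L:
            return t
    return None   # unreachable: the candidate must be made wider or deeper

# Hypothetical example: match the mass of a deep reference CNN with a
# candidate of half the depth (all numbers illustrative).
m_L = nn_mass(20, [32, 64, 128], [60, 120, 200])
print(match_mass(m_L, d_c=10, w_c=[32, 64, 128]))
```

No training is involved in this loop; only the chosen architecture is trained afterwards for validation.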
A note on hyper-parameter (e.g., initial learning rate) optimization. Throughout this work, we optimized hyper-parameters such as the initial learning rate for the largest models and then reused the same values for the smaller models. Hence, if these hyper-parameters were further optimized for the smaller models, the gap between the accuracy curves in Figures 5, 18, 19, etc., would shrink further (i.e., the clustering in the NN-Mass plots would improve further). Similarly, the accuracy gap between the compressed models and the large CNNs in Table 1 would shrink even more if the hyper-parameters were also optimized for the smaller models. We did not optimize the initial learning rates, etc., for the smaller models, as doing so would have caused an explosion in the number of experiments. Since our focus is on the topological properties of CNN architectures, we fixed these hyper-parameters as described above.
Figure 19: Test accuracy vs. FLOPS (GFLOPS) and vs. NN-Mass for the 31-, 40-, 49-, and 64-layer models: (a) CIFAR-10, width multiplier = 1; (b) CIFAR-10, width multiplier = 2; (c) CIFAR-10, width multiplier = 3; (d) CIFAR-100, width multiplier = 2.