How Does Topology of Neural Architectures Impact Gradient Propagation and Model Performance?
Kartikeya Bhardwaj, Guihong Li, and Radu Marculescu
Arm Inc., San Jose, CA 95134
The University of Texas at Austin, Austin, TX 78712
[email protected], [email protected], [email protected]
Abstract
In this paper, we address two fundamental questions in neural architecture design research: (i) How does an architecture topology impact the gradient flow during training? (ii) Can certain topological characteristics of deep networks indicate a priori (i.e., without training) which models, with a different number of parameters/FLOPS/layers, achieve a similar accuracy? To this end, we formulate the problem of deep learning architecture design from a network science perspective and introduce a new metric called NN-Mass to quantify how effectively information flows through a given architecture. We demonstrate that our proposed NN-Mass is more effective than the number of parameters to characterize the gradient flow properties, and to identify models with similar accuracy, despite having significantly different size/compute requirements. Detailed experiments on both synthetic and real datasets (e.g., MNIST and CIFAR-10/100) provide extensive empirical evidence for our insights. Finally, we exploit our new metric to design efficient architectures directly, and achieve up to 3x fewer parameters and FLOPS, while losing minimal accuracy over large CNNs on CIFAR-10.

1 Introduction

Recent research in neural architecture design has driven several breakthroughs in deep learning. Specifically, major contributions have been made in the following two directions: (i) Initialization of model weights [1, 2, 3, 4], and (ii) Topology of the network, which determines how different compute units (e.g., neurons, channels, layers) should be connected to each other [5, 6, 7, 8, 9, 10]. While many attempts have been made to study the impact of initialization on model accuracy [4, 11, 12, 13, 14], good Deep Neural Network (DNN) topologies have been mainly developed either manually (e.g., Resnets, Densenets, etc. [5, 6, 7, 8]) or automatically using Neural Architecture Search (NAS) techniques [9, 10, 15, 16]. However, the impact of topological properties on model performance has not been explored systematically. Hence, there is a significant gap in our understanding of how various topological properties impact the gradient flow and accuracy of DNNs.

In general, the topology (or structure) of networks strongly influences the phenomena taking place over them [17]. For instance, how closely the users of a social network are connected to each other directly affects how fast information propagates through the network [18]. Similarly, a DNN architecture can be seen as a network of different neurons connected together. Therefore, the topology of deep networks can influence how effectively the gradients can flow and, hence, how much information can be learned. Indeed, this can also mean that models with similar topological properties, but significantly different size/compute requirements, can achieve similar accuracy. Models with highly different compute but similar accuracy have been studied in the field of model compression [19, 20, 21, 22, 23]. Moreover, recent NAS has also focused on deploying efficient hardware-aware models [15, 16]. Motivated by the need for (i) understanding the relationship between gradient flow and topology, and (ii) efficient models, we address the following fundamental questions:

1. How does the DNN topology influence the gradient flow through the network?
2. Can topological properties of DNNs indicate a priori (i.e., without training) which models achieve a similar accuracy, despite having vastly different parameters/FLOPS/layers?
To answer the above questions, we first model DNNs as complex networks in order to exploit network science [17] (the study of networks) and quantify their topological properties. To this end, we propose a new metric called NN-Mass that explains the relationship between the topological structure of DNNs and Layerwise Dynamical Isometry (LDI), a property that indicates faithful gradient propagation through the network [11, 4]. Specifically, models with similar NN-Mass should have similar LDI, and thus a similar gradient flow that results in comparable accuracy. With these theoretical insights, we conduct a thorough Neural Architecture Space Exploration (NASE) and show that models with the same width and NN-Mass indeed achieve similar accuracy irrespective of their depth, number of parameters, and FLOPS. Finally, after extensive experiments linking topology and gradient flow, we show how the closed-form expression for NN-Mass can be used to directly design efficient deep networks without searching or training individual models during the search. Overall, we propose a new theoretically-grounded perspective for designing efficient neural architectures that reveals how topology influences the gradient propagation in deep networks.

The rest of the paper is organized as follows: Section 2 discusses the related work and some preliminaries. Then, Section 3 describes our proposed metrics and their theoretical analysis. Section 4 presents detailed experimental results. Section 5 summarizes our work and contributions.
2 Related Work and Preliminaries

NAS techniques [9, 10, 15, 16] have indeed resulted in state-of-the-art neural architectures. More recently, [24, 25] utilized standard network science ideas such as Barabasi-Albert (BA) [26] or Watts-Strogatz (WS) [27] models for NAS. However, like the rest of the NAS research, [24, 25] did not address what characteristics of the topology make various models (with different parameters/FLOPS/layers) achieve similar accuracy. Unlike our work, NAS methods [9, 15, 16, 10, 24, 25] do not connect the topology with the gradient flow.

On the other hand, the impact of initialization on model convergence and gradients has also been studied [1, 2, 4, 11, 12, 13, 14]. Moreover, recent model compression literature attempts to connect pruning at initialization to gradient properties [11]. Again, none of these studies address the impact of the architecture topology on gradient propagation. Hence, our work is orthogonal to prior art that explores the impact of initialization on gradients [1, 2, 4, 11, 12, 13, 14] or pruning [11, 28]. Related work on important network science and gradient propagation concepts is discussed below.
Preliminaries.
In our work, we use the following two well-established concepts:
Definition 1 (Average Degree [17]). The average degree ($\hat{k}$) of a network is the average number of connections a node has; i.e., $\hat{k}$ is given by the number of edges divided by the total number of nodes.

Average degree and degree distribution (i.e., the distribution of nodes' degrees) are important topological characteristics which directly affect how information flows through a network. The dynamics of how fast a signal can propagate through a network heavily depends on the network topology.
Definition 2 (Layerwise Dynamical Isometry (LDI) [11]). A deep network satisfies LDI if the singular values of its layerwise Jacobians at initialization are close to 1 for all layers. Specifically, for a multilayer feed-forward network, let $s_i$ ($W_i$) be the output (weights) of layer $i$ such that $s_i = \phi(h_i)$, $h_i = W_i s_{i-1} + b_i$; then, the Jacobian matrix at layer $i$ is defined as:

$$J_{i,i-1} = \frac{\partial s_i}{\partial s_{i-1}} = D_i W_i$$

Here, $J_{i,i-1} \in \mathbb{R}^{w_i \times w_{i-1}}$, where $w_i$ is the number of neurons in layer $i$, and $D_i^{jk} = \phi'(h_i^j)\,\delta_{jk}$, where $\phi'$ denotes the derivative of the non-linearity $\phi$ and $\delta_{jk}$ is the Kronecker delta [11]. Then, if the singular values $\sigma_j$ of all $J_{i,i-1}$ are close to 1, the network satisfies LDI.

LDI indicates that a signal propagating through the deep network will neither get attenuated nor amplified too much; hence, it ensures faithful propagation of gradients during training [4, 11].
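To make LDI concrete, here is a minimal sketch (our own illustration, not the authors' code) that computes the mean singular value of each layerwise Jacobian $J_{i,i-1} = D_i W_i$ at initialization for a toy fully-connected network; the width, depth, and ELU non-linearity are illustrative choices.

```python
import torch

torch.manual_seed(0)
width, depth = 64, 8
# Gaussian initialization with variance scaling (1/fan_in), as assumed in the text.
weights = [torch.randn(width, width) / width ** 0.5 for _ in range(depth)]

s = torch.randn(width)                       # input signal s_0
for i, W in enumerate(weights, start=1):
    h = W @ s                                # pre-activations h_i (zero bias)
    d = torch.where(h > 0, torch.ones_like(h), torch.exp(h))  # ELU derivative
    J = torch.diag(d) @ W                    # J_{i,i-1} = D_i W_i
    sigma = torch.linalg.svdvals(J)          # singular values of the layerwise Jacobian
    print(f"layer {i}: mean singular value = {sigma.mean().item():.3f}")
    s = torch.where(h > 0, h, torch.exp(h) - 1)               # ELU output s_i
```

If these per-layer mean singular values stay near 1 across depth, the network (approximately) satisfies LDI at initialization.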
3 Proposed Metrics and Theoretical Analysis

We first model DNNs via network science to derive our proposed topological metrics. We then demonstrate the theoretical relationship between NN-Mass and gradient propagation.
We start with a generic multilayer perceptron (MLP) setup with $d_c$ layers containing $w_c$ neurons each. Since our objective is to study the topological properties of neural architectures, we assume shortcut connections (or long-range links) superimposed on top of a typical MLP setup (see Fig. 1(a)).
Figure 1: (a) DNN setup: The DNN (depth $d_c$, width $w_c$) has layer-by-layer short-range connections (gray) with additional long-range links (purple/red). (b) Simulation of Gaussian matrices: Mean singular values vs. the size of a matrix $(w_c + m/2, w_c)$. Mean singular values increase as $m$ increases (more simulations are given in Appendix D). (c) Convolutional layers form a similar topological structure as MLP layers: All input channels contribute to all output channels.

Specifically, all neurons at layer $i$ receive long-range links from a maximum of $t_c$ neurons from previous layers. That is, we randomly select $\min\{w_c(i-1), t_c\}$ neurons from layers $1, 2, \ldots, (i-1)$, and concatenate them at layer $i-1$ (see Fig. 1(a)). Here, $w_c(i-1)$ is the total number of candidate neurons from layers $1, 2, \ldots, (i-1)$ that can supply long-range links; if $t_c$ exceeds this total number of possible candidates, then all neurons from layers $1, 2, \ldots, (i-1)$ are selected. The neurons are concatenated similar to how channels are concatenated in Densenets [6]; the concatenated neurons then pass through a fully-connected layer to generate the output of layer $i$ ($s_i$). As a result, the weight matrix $W_i$ (which is used to generate $s_i$) gets additional weights to account for the incoming long-range links. Similar to recent NAS research [29], our rationale behind selecting random links is that random architectures are often as competitive as carefully designed models. Moreover, the random long-range links on top of fixed short-range links make our architectures a small-world network (Fig. 6, Appendix A) [27], and allow us to use network science to study their topological properties [17, 30, 31, 32].

Like standard CNNs [5, 6], we can generalize this setup to contain multiple ($N_c$) cells of width $w_c$ and depth $d_c$. All long-range links are present only within a cell and do not extend between cells.
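To make the construction concrete, the following is a minimal PyTorch sketch of one such cell with concatenation-type long-range links. This is our own illustration under the setup above; the class name and hyper-parameter values are ours, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LongRangeMLPCell(nn.Module):
    """One cell: d_c layers of width w_c with random, concatenated long-range links."""
    def __init__(self, w_c, d_c, t_c):
        super().__init__()
        self.idx, self.fcs = [], nn.ModuleList()
        for i in range(2, d_c + 1):           # the paper's 1-indexed layers 2..d_c
            # Layer i draws min{w_c(i-1), t_c} random neurons from layers 1..(i-1);
            # per the sums in Eqs. (1)-(2), the cell's last layer gets none.
            n_lr = min(w_c * (i - 1), t_c) if i <= d_c - 1 else 0
            self.idx.append(torch.randperm(w_c * (i - 1))[:n_lr])
            # W_i widens to absorb the concatenated long-range inputs.
            self.fcs.append(nn.Linear(w_c + n_lr, w_c))

    def forward(self, x):
        outs = [x]                            # outputs of all layers so far (width w_c each)
        for idx, fc in zip(self.idx, self.fcs):
            pool = torch.cat(outs, dim=-1)    # every neuron from previous layers
            inp = torch.cat([outs[-1], pool[..., idx]], dim=-1)
            outs.append(F.elu(fc(inp)))
        return outs[-1]

cell = LongRangeMLPCell(w_c=8, d_c=16, t_c=10)
print(cell(torch.randn(4, 8)).shape)          # torch.Size([4, 8])
```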
Our key objectives are twofold: (i) Quantify what topological characteristics of DNN architectures affect their accuracy and gradient flow, and (ii) Exploit such properties to directly design efficient CNNs. To this end, we propose two new metrics called NN-Density and NN-Mass, as defined below.

Definition 3 (Cell-Density). The density of a cell quantifies how densely its neurons are connected via long-range links. Formally, for a cell $c$, the cell-density $\rho_c$ is given by:

$$\rho_c = \frac{\text{Actual long-range links}}{\text{Total possible long-range links}} = \frac{2 \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\}}{w_c (d_c-2)(d_c-1)} \quad (1)$$

For the complete derivation, please refer to Appendix B. With the above definition for cell-density, NN-Density ($\rho_{avg}$) is simply defined as the average density across all cells in a DNN.

Definition 4 (Mass of DNNs). NN-Mass quantifies how effectively information can flow through a given DNN topology. For a given width ($w_c$), models with similar NN-Mass, but different depths ($d_c$) and numbers of parameters/FLOPS, are expected to exhibit similar gradient flow properties.

Note that density is basically mass/volume. Let volume be the total number of neurons in a cell. Then, we can derive the NN-Mass ($m$) by multiplying the cell-density with the total number of neurons in each cell:

$$m = \sum_{c=1}^{N_c} w_c d_c \rho_c = \sum_{c=1}^{N_c} \frac{2 d_c \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\}}{(d_c-2)(d_c-1)} \quad (2)$$

A code sketch that evaluates Eq. (2) is given below. We then explain the use of the above metrics for neural architecture space exploration.
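Since Eq. (2) depends only on $\{d_c, w_c, t_c\}$, NN-Mass can be computed without instantiating (or training) a model. Below is a small Python sketch of Eq. (2); the function name is ours, and it can be checked against the worked example of Appendix F.

```python
def nn_mass(d_c, w_c, t_c):
    """NN-Mass (Eq. 2) for cells of depth d_c[c], width w_c[c],
    and maximum long-range link candidates t_c[c]."""
    m = 0.0
    for d, w, t in zip(d_c, w_c, t_c):
        links = sum(min(w * (i - 1), t) for i in range(2, d))  # sum over i = 2..d-1
        m += 2 * d * links / ((d - 2) * (d - 1))
    return m

# A single MLP cell of width 8, depth 16, with t_c = 10:
print(nn_mass([16], [8], [10]))  # ~21.03
```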
Neural Architecture Space Exploration (NASE). In NASE, we systematically study the design space of DNNs using NN-Mass. Note that NN-Mass is a function of the network width, depth, and long-range links (i.e., the topology of a model). For a fixed number of cells, an architecture can be completely specified by {depth, width, maximum long-range link candidates} per cell $= \{d_c, w_c, t_c\}$. Hence, to perform NASE, we vary $\{d_c, w_c, t_c\}$ to create random architectures with different parameters/FLOPS/layers and NN-Mass. We then train these architectures and characterize their accuracy, topology, and gradient propagation in order to understand the theoretical relationships among them. Without loss of generality, we assume below that the DNN has only one cell of width $w_c$ and depth $d_c$.

Proposition 1 (NN-Mass and average degree of the network (a topological property)). The average degree of a deep network with NN-Mass $m$ is given by $\hat{k} = w_c + m/2$.

The proof of the above result is given in Appendix C.
Intuition.
Proposition 1 states that the average degree of a deep network is $w_c + m/2$ which, given the NN-Mass $m$, is independent of the depth $d_c$. The average degree indicates how well-connected the network is. Hence, it controls how effectively information can flow through a given topology. Therefore, for a given width and NN-Mass, the average amount of information that can flow through various architectures (with different parameters/layers) should be similar (due to the same average degree). Thus, we hypothesize that these topological characteristics might constrain the amount of information being learned by different models. Next, we show the impact of topology on gradient propagation.

Proposition 2 (NN-Mass and LDI). Given a small deep network $f_S$ (depth $d_S$) and a large deep network $f_L$ (depth $d_L$, $d_L \gg d_S$), both with the same NN-Mass $m$ and width $w_c$, the LDI for both models is equivalent. Specifically, if $\Sigma_S^i$ ($\Sigma_L^i$) denotes the singular values of the initial layerwise Jacobians $J_{i,i-1}$ for the small (large) model, then the mean singular values in both models are similar; that is, $\mathbb{E}[\Sigma_S^i] \approx \mathbb{E}[\Sigma_L^i]$.

Proof. To prove the above result, it suffices to show that the initial Jacobians $J_{i,i-1}$ have similar properties for both models (and thus their singular value distributions will be similar). For our setup, the output of layer $i$ is $s_i = \phi(W_i x_{i-1} + b_i)$, where $x_{i-1} = s_{i-1} \cup y_{i-1}$ concatenates the output of layer $i-1$ ($s_{i-1}$) with the neurons $y_{i-1}$ supplying the long-range links (random $\min\{w_c(i-1), t_c\}$ neurons selected uniformly from layers $1$ to $i-1$). Hence, $J_{i,i-1} = \partial s_i / \partial x_{i-1} = D_i W_i$. Compared to a typical MLP scenario (see Definition 2), the sizes of the matrices $D_i$ and $W_i$ increase to account for the incoming long-range links.

For two models $f_S$ and $f_L$, the layerwise Jacobian $J_{i,i-1}$ can differ in two ways: (i) The values inside the Jacobian matrices for $f_S$ and $f_L$ can be different, and/or (ii) The sizes of the layerwise Jacobian matrices for $f_S$ and $f_L$ can be different. Hence, our objective is to show that when the width and NN-Mass are similar, irrespective of the depth of the model (and thus irrespective of the number of parameters/FLOPS), both the values and the sizes of the initial layerwise Jacobians will be similar.

Let us start by considering a linear network: in this case, $J_{i,i-1} = W_i$. Since LDI looks at the properties of the layerwise Jacobians at initialization, and because all models are initialized the same way (e.g., Gaussians with variance scaling; variance scaling takes into account the number of input/output units, so if the width is the same between models of different depths, the distribution at initialization is still similar), the values inside $J_{i,i-1}$ for both $f_S$ and $f_L$ have the same distribution (point (i) above is satisfied). We next show that even the sizes of the layerwise Jacobians for both models are similar if the width and NN-Mass are similar.

How is topology related to the layerwise Jacobians? Since the average degree is the same for both models (see Proposition 1), on average, the number of incoming shortcut connections at a typical layer is $w_c \times m/2$. In other words, since the degree distribution for the random long-range links is Poisson [30] with average degree $\bar{k}_{R|G} \approx m/2$ (see Eq. (7), Appendix C; theoretically, a Poisson process assumes a constant rate of arrival of links), an average of $m/2$ neurons supply long-range links to each layer. Therefore, the Jacobians will theoretically have the same dimensions $(w_c + m/2, w_c)$ irrespective of the depth of the neural network (i.e., point (ii) is also satisfied).

So far, the discussion has considered only a linear network. For a non-linear network, the Jacobian is given as $J_{i,i-1} = D_i W_i$.
As explained in [11], $D_i$ depends on the pre-activations $h_i = W_i x_{i-1} + b_i$. As established in several deep network mean-field theory studies [14, 12, 11, 13], the distribution of pre-activations at layer $i$ ($h_i$) is a Gaussian $\mathcal{N}(0, q^i)$ due to the central limit theorem. Similar to [11, 14], if the input $h$ is chosen to satisfy a fixed point $q^i = q^*$, the distribution of $D_i$ becomes independent of the depth ($\mathcal{N}(0, q^*)$). Therefore, the distribution of both $D_i$ and $W_i$ is similar for different models irrespective of the depth, even for non-linear networks. Moreover, the sizes of the matrices will again be similar due to the similar average degree in both $f_S$ and $f_L$.

Hence, the size and the distribution of values of the Jacobian matrix are similar for both the large and the small model (provided the width and NN-Mass are similar). That is, the distributions and the mean singular values will also be similar: $\mathbb{E}[\Sigma_S^i] \approx \mathbb{E}[\Sigma_L^i]$. In other words, LDI is equivalent between models of different depths if their width and NN-Mass are similar.

We note that the mean singular values increase with NN-Mass. To illustrate this effect, we numerically simulate several Gaussian-distributed matrices of sizes $(w_c + m/2, w_c)$ and compute their mean singular values. Specifically, we vary $m$ for different widths $w_c$ and observe the impact of this size variation on the mean singular values. Fig. 1(b) shows that the mean singular values increase linearly with NN-Mass. In our experiments, we show that this linear trend between mean singular values and NN-Mass holds true for actual non-linear deep networks. A formal proof of this observation and more simulations are given in Appendix D. Note that our results should not be interpreted as "bigger models yield larger mean singular values." We explicitly show in the next section that the relationship between the total number of parameters and the mean singular values is significantly worse than that for NN-Mass. Hence, it is the topological properties that enable LDI in different deep networks, and not the number of parameters.
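This simulation is straightforward to reproduce. Below is a NumPy sketch (our own, assuming standard-normal entries) that computes the mean singular value of random $(w_c + m/2, w_c)$ matrices as $m$ grows, mirroring the style of Fig. 1(b):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_singular_value(w_c, m, trials=20):
    """Mean singular value of Gaussian matrices of size (w_c + m/2, w_c)."""
    h = w_c + m // 2
    return np.mean([np.linalg.svd(rng.standard_normal((h, w_c)),
                                  compute_uv=False).mean()
                    for _ in range(trials)])

for m in [100, 400, 700, 1000]:
    print(f"NN-Mass {m}: mean singular value = {mean_singular_value(64, m):.2f}")
```

The printed values grow roughly linearly with $m$ for a fixed width, as the text describes.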
Remark 1 (The NN-Mass formulation is the same for CNNs). Fig. 1(c) shows a typical convolutional layer. Since all channel-wise convolutions are added together, each output channel is some function of all input channels. This makes the topology of CNNs similar to that of our MLP setup. The key difference is that the nodes in the network (see Fig. 1(a)) now represent channels and not individual neurons. Of note, for our CNN setup, we use three cells (similar to [5, 6]). More details on the CNN setup (including a concrete example of the NN-Mass calculation) are given in Appendices E and F.

Next, we present detailed experimental evidence to validate our theoretical findings.

4 Experimental Results
To perform NASE for MLPs and CNNs, we generate random architectures with different NN-Mass and numbers of parameters (Params) by varying $\{d_c, w_c, t_c\}$. For random MLPs with different $\{d_c, t_c\}$ and $w_c = 8$ (number of cells = 1), we conduct the following experiments on the MNIST dataset: (i) We explore the impact of varying Params and NN-Mass on the test accuracy; (ii) We demonstrate how LDI depends on NN-Mass and Params; (iii) We further show that models with similar NN-Mass (and width) result in similar training convergence, despite having different depths and Params.

After the extensive empirical evidence for our theoretical insights (i.e., the connection between gradient propagation and topology), we next move on to random CNN architectures with three cells. We conduct the following experiments on the CIFAR-10 and CIFAR-100 datasets: (i) We show that NN-Mass can further identify CNNs that achieve similar test accuracy, despite having highly different Params/FLOPS/layers; (ii) We show that NN-Mass is a significantly more effective indicator of model performance than parameter counts; (iii) We also show that our findings hold for CIFAR-100, a much more complex dataset than CIFAR-10. These models are trained for 200 epochs.

Finally, we exploit NN-Mass to directly design efficient CNNs (for CIFAR-10) which achieve accuracy comparable to significantly larger models. For these experiments, the models are trained for 600 epochs. Overall, we train hundreds of different MLP and CNN architectures, with each MLP (CNN) repeated five (three) times with different random seeds, to obtain our results. More setup details (e.g., architecture details, learning rates, etc.) are given in Appendix G (see Tables 2, 3, and 4).
Fig. 2(a) shows test accuracy vs. Params of DNNs with different depths on the MNIST dataset. As evident, even though many models have different Params, they achieve a similar test accuracy. On the other hand, when the same set of models is plotted against NN-Mass, their test accuracy curves cluster together tightly, as shown in Fig. 2(b). To further quantify the above observation, we generate linear fits of test accuracy vs. log(Params) and vs. log(NN-Mass) (see brown markers in Fig. 2(a,b)). For NN-Mass, we achieve a significantly higher goodness-of-fit $R^2$ than that for Params. This demonstrates that NN-Mass can identify DNNs that achieve similar accuracy, even if they have a highly different number of parameters/FLOPS/layers. We next investigate the gradient propagation properties to explain the test accuracy results.

Figure 2: MNIST results: (a) Models with different numbers of parameters (Params) achieve similar test accuracy. (b) Test accuracy curves of models with different depths/Params concentrate when plotted against NN-Mass (test accuracy standard deviations across runs are small). (c,d) Mean singular values of $J_{i,i-1}$ are much better correlated with NN-Mass than with Params.

Layerwise Dynamical Isometry (LDI).
We calculate the mean singular values of the initial layerwise Jacobians, and plot them against Params (see Fig. 2(c)) and NN-Mass (see Fig. 2(d)). Clearly, NN-Mass is far better correlated with the mean singular values than Params. More importantly, just as Proposition 2 predicts, these results show that models with similar NN-Mass and width have equivalent LDI properties, irrespective of the total depth (and, thus, Params) of the network. For example, even though the 32-layer models have more parameters, their mean singular values are similar to those of the 16-layer DNNs. This clearly suggests that the gradient propagation properties are heavily influenced by topological characteristics like NN-Mass, and not just by DNN depth and Params. Of note, the linear trend in Fig. 2(d) is similar to that seen in the Fig. 1(b) simulation.

Figure 3: Models A and C have the same NN-Mass and achieve very similar training convergence, even though they have highly different Params and depth. Model B has significantly fewer layers than C but the same Params, yet achieves a faster training convergence than C (B has a higher NN-Mass than C).
Training Convergence.
The above results pose the following hypotheses: (i) If the gradient flow between DNNs (with similar NN-Mass and width) is similar, their training convergence should be similar, even if they have highly different Params and depths; (ii) If two models have the same Params (and width), but different depths and NN-Mass, then the DNN with higher NN-Mass should have faster training convergence (since its mean singular value will be higher; see the trend in Fig. 2(d)).

To demonstrate that both hypotheses above hold true, we pick three models, A, B, and C, from Fig. 2(a,b) and plot their training loss vs. epochs. Models A and C have similar NN-Mass, but C has more Params and depth than A. Model B has far fewer layers and nearly the same Params as C, but has a higher NN-Mass. Fig. 3 shows the training convergence results for all three models. As evident, the training convergence of model A (7.8K Params, 20 layers) nearly coincides with that of model C (8.8K Params, 32 layers). Moreover, even though model B (8.7K Params, 20 layers) is shallower than the 32-layer model C, the training convergence of B is significantly faster than that of C (due to higher NN-Mass and, therefore, better LDI). Training convergence results for several other models in Fig. 2(a,b) show similar observations (see Fig. 10 in Appendix H.1). These results clearly validate the theoretical insights of Proposition 2, and emphasize the importance of the topological properties of neural architectures in characterizing gradient propagation and model performance. Other similar experiments on synthetic datasets are given in Appendix H.2.
Since we have now established a concrete relationship between gradient propagation and topological properties, in the rest of the paper we show that NN-Mass can be used to identify and design efficient CNNs that achieve similar accuracy as models with significantly higher Params/FLOPS/layers. (For our setup, more parameters lead to more FLOPS; FLOPS results for CNNs are given in Appendix H.8.)
Params. Asevident, models with highly different number of parameters ( e.g. , see models A-E in box W),achieve a similar test accuracy. Note that, there is a large gap in the model size: CNNs in box W T e s t A cc u r a c y Test Accuracy vs. Number of Parameters
W A B C D E
200 400 600 800 1000 1200NN-Mass95.996.096.196.296.396.496.596.6 T e s t A cc u r a c y Test Accuracy vs. NN-Mass
Y ZX A'B' C'D' E' a. b.
Figure 4: CIFAR-10 Width Multiplier wm = 2 :(a) Models with very different Params (box W)achieve similar test accuracies. (b) Models withsimilar accuracy often have similar NN-Mass:Models in W cluster into Z. Results are reportedas the mean of three runs (std. dev. ∼ . ).range from 5M parameters (model A) to 9Mparameters (models D,E). Again, as shown inFig. 4(b), when plotted against NN-Mass, thetest accuracy curves of CNNs with differentdepths cluster together ( e.g. , models A-E in boxW cluster into A’-E’ within bucket Z). Hence,NN-Mass identifies CNNs with similar accuracy,despite having highly different Params/layers.The same holds true for models within X and Y.We now explore the impact of varying modelwidth. In our CNN setup, we control the widthof the models using width multipliers ( wm ) [33,7]. The above results are for wm = 2 . For lowerwidth CNNs ( wm = 1 ), Fig. 5(a) shows thatmodels in boxes U and V concentrate into thebuckets W and Z, respectively (see also otherbuckets). Note that, the 31-layer models do not fall within the buckets (see blue line in Fig. 5(b)).We hypothesize that this could be because the capacity of these models is too small to reach highaccuracy. This does not happen for CNNs with higher width. Specifically, Fig. 5(c) shows the resultsfor wm = 3 . As evident, models with 6M-7M parameters achieve comparable test accuracy asmodels with up to 16M parameters ( e.g. , bucket Y in Fig. 5(d) contains models ranging from { } , all the way to {
64 layers, 16.7M parameters } ). Again, for all widths, thegoodness-of-fit ( R ) for linear fit between test accuracy and log(NN-Mass) achieves high values(0.74-0.90 as shown in Fig. 15 in Appendix H.4). T e s t A cc u r a c y Test Accuracy vs. Number of Parameters -- wm = 1 a. U V
100 200 300 400 500NN-Mass94.5094.7595.0095.2595.5095.7596.00 T e s t A cc u r a c y Test Accuracy vs. NN-Mass -- wm = 1 b. W Y ZX Higher Width (wm=3)Lower Width (wm=1)
200 400 600 800 1000NN-Mass96.196.296.396.496.596.696.796.8 T e s t A cc u r a c y Test Accuracy vs. NN-Mass -- wm = 3 d. W Y ZX T e s t A cc u r a c y Test Accuracy vs. Number of Parameters -- wm = 3 c. Figure 5: Similar observations hold for low- ( wm = 1 ) and high-width ( wm = 3 ) models: (a,b) Many models with very different Params (boxes U and V) cluster into buckets W and Z (see alsoother buckets). (c, d) For high-width, we observe a significantly tighter clustering compared to thelow-width case. Results are reported as the mean of three runs (std. dev. ∼ . ). Comparison between NN-Mass and Parameter Counting.
Next, we quantitatively compare NN-Mass to parameter counts. As shown in Fig. 16 in Appendix H.5, for wm = 2, Params yield a goodness-of-fit $R^2$ that is lower than that for NN-Mass (see Fig. 16(a,b)). However, for higher widths (wm = 3), the parameter count completely fails to predict model performance (Fig. 16(c)). On the other hand, NN-Mass still achieves a significantly higher $R^2$ (see Fig. 16(d)).

Since NN-Mass is a good indicator of model performance, we can in fact use it to predict a priori the test accuracy of completely unknown architectures. The complete details of this experiment and the results are presented in Appendix H.6. We show that a linear model trained on CNNs of depths {31, 40, 49, 64} (see Fig. 15(b)) can successfully predict, with a high $R^2$, the test accuracy of unknown CNNs with previously unseen depths (see Fig. 17 in Appendix H.6).
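This a priori prediction reduces to an ordinary linear fit of accuracy against log(NN-Mass). A minimal sketch is shown below; the (NN-Mass, accuracy) pairs are hypothetical placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical measurements: (NN-Mass, test accuracy) pairs for trained CNNs.
mass = np.array([200.0, 400.0, 600.0, 800.0, 1000.0])
acc = np.array([95.9, 96.2, 96.35, 96.45, 96.55])

# Fit accuracy = a * log(NN-Mass) + b, as in the paper's linear fits.
a, b = np.polyfit(np.log(mass), acc, deg=1)

# Predict the accuracy of an unseen architecture a priori from its NN-Mass alone.
m_new = 755.0
print(f"predicted accuracy at NN-Mass {m_new}: {a * np.log(m_new) + b:.2f}%")
```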
Results for the CIFAR-100 Dataset.

We now corroborate our main findings on the CIFAR-100 dataset, which is significantly more complex than CIFAR-10. To this end, we train the models from Fig. 4 on CIFAR-100. Fig. 18 (see Appendix H.7) once again shows that several models with highly different numbers of parameters achieve similar accuracy. Moreover, Fig. 18(b) demonstrates that these models get clustered when plotted against NN-Mass. Further, a high $R^2$ is achieved for a linear fit on the accuracy vs. log(NN-Mass) plot (see Appendix H.7 and Fig. 18).

Table 1: Parameters, FLOPS, number of layers, and test accuracy of various CNNs on CIFAR-10. Base channels in each cell are [16, 32, 64]; for wm = 2, the cells have [32, 64, 128] channels per layer. Test accuracy is reported as mean ± standard deviation of three runs. DARTS results are reported from [9].

Model architecture                   | Design method         | Parameters/FLOPS | Layers | Specialized search space? | NN-Mass | Test Accuracy
DARTS (first order)                  | NAS [9]               | 3.3M/--          | --     | Yes                       | --      | . ± .
DARTS (second order)                 | NAS [9]               | 3.3M/--          | --     | Yes                       | --      | . ± . %
Train large models to be compressed  | Manual                | 11.89M/3.63G     | 64     | No                        | 1126    | . ± .
Train large models to be compressed  | Manual                | 8.15M/2.54G      | 64     | No                        | 622     | . ± .
Proposed                             | Directly via NN-Mass  | 5M/--            | 40     | No                        | 755     | . ± . %
Proposed                             | Directly via NN-Mass  | 4.6M/--          | 37     | No                        | 813     | . ± . %
Proposed                             | Directly via NN-Mass  | 3.82M/--         | 31     | No                        | 856     | . ± . %
Results for FLOPS.

So far, we have shown results for the number of parameters. However, the results for FLOPS follow a very similar pattern (see Fig. 19 in Appendix H.8). In summary, we have shown that NN-Mass can identify models that yield similar test accuracy, despite having very different parameters/FLOPS/layers. We next use this observation to directly design efficient architectures.
Directly Designing Efficient CNNs.

We train our models for 600 epochs on the CIFAR-10 dataset (similar to the setup in DARTS [9]). Table 1 summarizes the number of parameters, FLOPS, and test accuracy of various CNNs. We first train two large CNN models of about 8M and 12M parameters, with NN-Mass values of 622 and 1126, respectively; both of these models achieve similar accuracy. Next, we train three significantly smaller models: (i) a 5M-parameter model with 40 layers and an NN-Mass of 755, (ii) a 4.6M-parameter model with 37 layers and an NN-Mass of 813, and (iii) a 31-layer, 3.82M-parameter model with an NN-Mass of 856. We set the NN-Mass of our smaller models between 750 and 850 (i.e., within the 600-1100 range of the manually-designed CNNs). Interestingly, we do not need to train any intermediate architectures to arrive at the above efficient CNNs. Indeed, classical NAS involves an initial "search phase" over a space of operations to find the architectures [10]. In contrast, our efficient models can be directly designed using the closed-form Eq. (2) for NN-Mass (see Appendix H.9 for more details), which does not involve any intermediate training or even an initial search phase like prior NAS methods (a sketch of this direct-design step is given below). As explained earlier, this is possible because NN-Mass can identify models with similar performance a priori (i.e., without any training)!

As evident from Table 1, our 5M-parameter model, as well as the 4.6M- and 3.82M-parameter models, reach CIFAR-10 test accuracies that are either comparable to, or only slightly lower than, those of the large CNNs, while reducing Params/FLOPS by up to 3x compared to the 11.89M-parameter/3.63G-FLOPS model. Moreover, DARTS [9], a competitive NAS baseline, achieves comparable accuracy with slightly fewer (3.3M) parameters. However, the search space of DARTS (like that of all other NAS techniques) is very specialized and utilizes many state-of-the-art innovations such as depth-wise separable convolutions [7], dilated convolutions [34], etc. On the contrary, we use regular convolutions with only concatenation-type long-range links in our work and present a theoretically-grounded approach. Indeed, our current objective is not to beat DARTS (or any other NAS technique), but rather to underscore the topological properties that should guide the efficient architecture design process.
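As a rough illustration of this direct-design step, the sketch below (our own; the uniform per-cell $t_c$ search and all numeric values are simplifying assumptions, not the authors' procedure) picks the smallest $t_c$ whose closed-form NN-Mass (Eq. (2)) reaches a target value, with no training involved.

```python
def nn_mass(d_c, w_c, t_c):
    """NN-Mass per Eq. (2)."""
    m = 0.0
    for d, w, t in zip(d_c, w_c, t_c):
        links = sum(min(w * (i - 1), t) for i in range(2, d))
        m += 2 * d * links / ((d - 2) * (d - 1))
    return m

def design_t_c(target_mass, d_c, w_c, t_max=400):
    """Smallest uniform t_c whose NN-Mass reaches the target; no training needed."""
    for t in range(1, t_max):
        t_c = [t] * len(d_c)
        if nn_mass(d_c, w_c, t_c) >= target_mass:
            return t_c
    return None

# E.g., a shallow 3-cell CNN (9 conv layers per cell, wm = 2 channel widths)
# matched to the NN-Mass range (~750-850) of the much larger models.
print(design_t_c(target_mass=800, d_c=[9, 9, 9], w_c=[32, 64, 128]))
```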
5 Conclusion

To answer "How does the topology of neural architectures impact gradient propagation and model performance?", we have proposed a new, network science-based metric called NN-Mass, which quantifies how effectively information flows through a given architecture. We have also established concrete theoretical relationships among NN-Mass, the topological structure of networks, and layerwise dynamical isometry, which ensures faithful propagation of gradients through DNNs.

Our experiments have demonstrated that NN-Mass is significantly more effective than the number of parameters to characterize the gradient flow properties, and to identify models with similar accuracy, despite having a highly different number of parameters/FLOPS/layers. Finally, we have exploited our new metric to design efficient architectures directly, achieving up to 3x fewer parameters and FLOPS while sacrificing minimal accuracy over large CNNs.

By quantifying the topological properties of deep networks, our work serves as an important step towards understanding and designing new neural architectures. Since topology is deeply intertwined with gradient propagation, such topological metrics deserve major attention in future research.

Broader Impact

Neural architecture is a fundamental part of the model design process. For all applications, the very first task is to decide how wide or deep the network should be. With the rising ubiquity of complex topologies (e.g., Resnets, Densenets, NASNet [5, 6, 10]), architecture design decisions today not only encompass simple depth and width, but also necessitate an understanding of how various neurons/channels/layers should be connected. Indeed, this understanding has been missing from prior art. This naturally raises the following question: how can we design efficient architectures if we do not even understand how their topology impacts gradient flow?
With our work, we bridge this gap by bringing a new theoretically-grounded perspective to the design of neural architecture topologies. Having demonstrated that topology is a big part of the gradient propagation mechanism in deep networks, it is essential to include such metrics in the model design process. From a broader perspective, we believe that characterizing these properties will enable more efficient and highly accurate neural architectures for all applications (computer vision, natural language, speech recognition, learning over biological data, etc.). Moreover, our work brings together two important fields: deep learning and network science. Specifically, small-world networks have allowed us to study the relationship between topology and gradient propagation. This encourages new research at the intersection of deep learning and network science, which can ultimately help advance our theoretical understanding of deep networks, while building significantly more efficient models in practice.
References

[1] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-48. Springer, 2012.
[2] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.
[4] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.
[7] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[8] Mark Sandler et al. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
[9] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[10] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697-8710, 2018.
[11] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. A signal propagation perspective for pruning neural networks at initialization. In International Conference on Learning Representations, 2020.
[12] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360-3368, 2016.
[13] Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, and Maciej A. Nowak. Dynamical isometry is achieved in residual networks in a universal way for any activation function. arXiv preprint arXiv:1809.08848, 2018.
[14] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pages 4785-4795, 2017.
[15] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820-2828, 2019.
[16] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
[17] Mark Newman, Albert-László Barabási, and Duncan J. Watts. The Structure and Dynamics of Networks, volume 19. Princeton University Press, 2011.
[18] Albert-László Barabási and Eric Bonabeau. Scale-free networks. Scientific American, 288(5):60-69, 2003.
[19] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[20] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016.
[21] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. JMLR, 18(1):6869-6898, 2017.
[22] Liangzhen Lai, Naveen Suda, and Vikas Chandra. Deep convolutional neural network inference with floating-point weights and fixed-point activations. arXiv preprint arXiv:1703.03073, 2017.
[23] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[24] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.
[25] Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. Discovering neural wirings. arXiv preprint arXiv:1906.00586, 2019.
[26] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.
[27] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440, 1998.
[28] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[29] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638, 2019.
[30] Albert-László Barabási. Network Science (Chapter 3: Random Networks). Cambridge University Press, 2016.
[31] Mark E. J. Newman and Duncan J. Watts. Renormalization group analysis of the small-world network model. Physics Letters A, 263(4-6):341-346, 1999.
[32] Rémi Monasson. Diffusion, localization and dispersion relations on "small-world" lattices. The European Physical Journal B - Condensed Matter and Complex Systems, 12(4):555-567, 1999.
[33] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[34] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Supplementary Information:
How Does Topology of Neural Architectures Impact Gradient Propagation and Model Performance?

A DNNs/CNNs with Long-Range Links are Small-World Networks

Note that the DNNs/CNNs considered in our work have both short-range and long-range links (see Fig. 1(a)). This kind of topology typically falls into the category of small-world networks, which can be represented as a lattice network G (containing short-range links) superimposed with a random network R (to account for long-range links) [32, 31]. This is illustrated in Fig. 6.
Figure 6: (a) Small-world networks in traditional network science are modeled as a superposition of a lattice network (G) and a random network (R) [27, 31, 32]. (b) A DNN/CNN with both short-range and long-range links can be similarly modeled as a random network superimposed on a lattice network. Not all links are shown for simplicity.

B Derivation of Density of a Cell
Note that the maximum number of neurons contributing long-range links at each layer in cell $c$ is given by $t_c$. Also, for a layer $i$, the possible candidates for long-range links (all neurons up to layer $i-1$) number $w_c(i-1)$ (see Fig. 1(a)). Indeed, if $t_c$ is sufficiently large, the initial few layers may not have $t_c$ neurons that can supply long-range links; for these layers, we use all available neurons for long-range links. Therefore, for a given layer $i$, the number of long-range links ($l_i$) is given by:

$$l_i = \begin{cases} w_c(i-1) \times w_c & \text{if } t_c > w_c(i-1) \\ t_c \times w_c & \text{otherwise} \end{cases} \quad (3)$$

where both cases have been multiplied by $w_c$ because, once the neurons are randomly selected, they supply long-range links to all $w_c$ neurons at the current layer $i$ (see Fig. 1(a)). Hence, for an entire cell, the total number of long-range links ($l_c$) is as follows:

$$l_c = w_c \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\} \quad (4)$$

On the other hand, the total number of possible long-range links within a cell ($L$) is simply the sum of the possible candidates at each layer:

$$L = \sum_{i=2}^{d_c-1} w_c(i-1) \times w_c = w_c^2 \left[1 + 2 + \ldots + (d_c-2)\right] = \frac{w_c^2 (d_c-2)(d_c-1)}{2} \quad (5)$$

Using Eq. (4) and Eq. (5), we can rewrite Eq. (1) as:

$$\rho_c = \frac{l_c}{L} = \frac{2 \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\}}{w_c (d_c-2)(d_c-1)} \quad (6)$$
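Eq. (6) can be sanity-checked by counting links explicitly; the sketch below (our own, for an arbitrary configuration) compares the closed form against direct enumeration:

```python
def density_closed_form(d_c, w_c, t_c):
    """Cell-density rho_c from Eq. (6)."""
    num = 2 * sum(min(w_c * (i - 1), t_c) for i in range(2, d_c))
    return num / (w_c * (d_c - 2) * (d_c - 1))

def density_by_counting(d_c, w_c, t_c):
    """rho_c = (actual long-range links) / (all possible long-range links)."""
    actual = sum(w_c * min(w_c * (i - 1), t_c) for i in range(2, d_c))
    possible = sum(w_c * (w_c * (i - 1)) for i in range(2, d_c))
    return actual / possible

print(density_closed_form(16, 8, 20))   # 0.314...
print(density_by_counting(16, 8, 20))   # matches the closed form
```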
C Proof of Proposition 1

Proposition 1 (NN-Mass and average degree of the network (a topological property)). The average degree of a deep network with NN-Mass $m$ is given by $\hat{k} = w_c + m/2$.

Proof. As shown in Fig. 6, deep networks with shortcut connections can be represented as small-world networks consisting of two parts: (i) a lattice network containing only the short-range links, and (ii) a random network superimposed on top of the lattice network to account for the long-range links. For sufficiently deep networks, the average degree of the lattice network is just the width $w_c$ of the network. The average degree of the randomly added long-range links, $\bar{k}_{R|G}$, is given by:

$$\bar{k}_{R|G} = \frac{\text{Number of long-range links added by } R}{\text{Number of nodes}} = \frac{w_c \sum_{i=2}^{d_c-1} \min\{w_c(i-1), t_c\}}{w_c d_c} = \frac{m (d_c-2)(d_c-1)}{2 d_c^2} \approx \frac{m}{2} \quad (7)$$

where the third equality uses Eq. (2) for one cell, and the approximation holds when $d_c \gg 1$ (e.g., for deep networks). Therefore, the average degree of the complete model is given by $w_c + m/2$.

D Proof of Proposition 2
Proposition 2 (NN-Mass and LDI). Given a small deep network $f_S$ (depth $d_S$) and a large deep network $f_L$ (depth $d_L$, $d_L \gg d_S$), both with the same NN-Mass $m$ and width $w_c$, the LDI for both models is equivalent. Specifically, if $\Sigma_S^i$ ($\Sigma_L^i$) denotes the singular values of the initial layerwise Jacobians $J_{i,i-1}$ for the small (large) model, then the mean singular values in both models are similar; that is, $\mathbb{E}[\Sigma_S^i] \approx \mathbb{E}[\Sigma_L^i]$.

Proof. Consider a matrix $M \in \mathbb{R}^{H \times W}$ with $H$ rows and $W$ columns, all entries independently initialized from a Gaussian distribution $\mathcal{N}(0, q)$; we calculate its mean singular value. We first perform a Singular Value Decomposition (SVD) of the given matrix $M$:

$$U \in \mathbb{R}^{H \times H},\ \Sigma \in \mathbb{R}^{H \times W},\ V \in \mathbb{R}^{W \times W} = \mathrm{SVD}(M), \qquad \Sigma = \mathrm{Diag}(\sigma_1, \sigma_2, \ldots, \sigma_K)$$

Given a row vector $\vec{u}_i \in \mathbb{R}^H$ in $U$, and a row vector $\vec{v}_i \in \mathbb{R}^W$ in $V$, we use the following relations of the SVD in our proof:

$$\sigma_i = \vec{u}_i^T M \vec{v}_i, \qquad \vec{u}_i^T \vec{u}_i = 1, \qquad \vec{v}_i^T \vec{v}_i = 1$$

It is hard to directly compute the mean singular value $\mathbb{E}[\sigma_i]$. To simplify the problem, consider $\sigma_i^2$:

$$\sigma_i^2 = \sigma_i \times \sigma_i^T = (\vec{u}_i^T M \vec{v}_i)(\vec{u}_i^T M \vec{v}_i)^T = \vec{u}_i^T M \vec{v}_i \vec{v}_i^T M^T \vec{u}_i = \vec{u}_i^T M M^T \vec{u}_i \quad (8)$$

Substituting $B = M M^T$ (where $B \in \mathbb{R}^{H \times H}$), and using $m_{ij}$ to represent the $ij$-th entry of the given matrix $M$, the entry $b_{ij}$ of $B$ is given by:

$$b_{ij} = \begin{cases} \sum_{k=1}^{H} m_{ik}^2, & i = j \\ \sum_{k=1}^{H} m_{ik} m_{kj}, & i \neq j \end{cases}$$

Since the $m_{ij}$ follow an independent and identical Gaussian distribution $\mathcal{N}(0, q)$, the diagonal entries of $B$ ($b_{ii}$) follow a chi-square distribution with $H$ degrees of freedom: $b_{ii} \sim \chi^2(H)$.

For the non-diagonal entries of $B$, i.e., $i \neq j$, suppose $z_k = xy$, with $x = m_{ik}$ and $y = m_{kj}$; then the probability density function (PDF) of $z_k$ is as follows:

$$\mathrm{PDF}_Z(z_k) = \int_{-\infty}^{\infty} \mathrm{PDF}_X(t) \frac{1}{|t|} \mathrm{PDF}_Y\!\left(\frac{z_k}{t}\right) dt = \int_{-\infty}^{\infty} \frac{1}{2\pi |t|}\, e^{-\frac{t^2 + z_k^2/t^2}{2}}\, dt \quad (9)$$

Based on the probability density function of $z_k$, the expectation of $z_k$ is given by:

$$\mathbb{E}[z_k] = \int_{-\infty}^{\infty} \mathrm{PDF}_Z(z_k)\, z_k\, dz_k$$

As shown in Eq. (9), $\mathrm{PDF}_Z(z_k)$ is an even function, so $\mathrm{PDF}_Z(z_k)\, z_k$ is an odd function; therefore, $\mathbb{E}[z_k] = 0$ and, thus, $\mathbb{E}[b_{ij}] = \sum_{k=1}^{H} \mathbb{E}[z_k] = 0$ when $i \neq j$.

Hence, we can now get the expectation of each entry of the matrix $B$:

$$\mathbb{E}[b_{ij}] = \begin{cases} H, & i = j \\ 0, & i \neq j \end{cases} \qquad \text{that is,} \qquad \mathbb{E}[B] = \mathrm{Diag}(b_{ii}) = H I \quad (10)$$

where $I \in \mathbb{R}^{H \times H}$ is an identity matrix. Combining Eq. (8) and Eq. (10), we get the following results:

$$\mathbb{E}[\sigma_i^2] = \mathbb{E}[\vec{u}_i^T M M^T \vec{u}_i] = \mathbb{E}[\vec{u}_i^T]\, \mathbb{E}[M M^T]\, \mathbb{E}[\vec{u}_i] = \mathbb{E}[\vec{u}_i^T]\, H I\, \mathbb{E}[\vec{u}_i] = H\, \mathbb{E}[\vec{u}_i^T \vec{u}_i] = H \quad (11)$$

Therefore, we have:

$$\mathbb{E}[\sigma_i^2] = H \quad (12)$$

Eq. (12) states that, for a Gaussian $M \in \mathbb{R}^{H \times W}$, $\mathbb{E}[\sigma_i^2]$ depends on the number of rows $H$, and does not depend on $W$. To empirically verify this, we simulate several Gaussian matrices for a range of widths $W$ and heights $H$, and plot $\mathbb{E}[\sigma_i^2]$ vs. $H$ in Fig. 7. As evident, for different $W$, the mean singular values nearly coincide, thereby showing that the mean singular value indeed depends on $H$. Also, for small-enough ranges of $H$, the relationship between $\mathbb{E}[\sigma_i]$ and $H$ can be approximated by a linear trend.

To see the above linear trend between the mean singular values ($\mathbb{E}[\sigma_i]$) and $H$, we now simulate the more realistic scenario that arises for the initial layerwise Jacobian matrices ($J_{i,i-1}$). As explained in the main paper, the layerwise Jacobians will theoretically have dimensions $(w_c + m/2, w_c)$, where $w_c$ is the width of the DNN and $m$ is the NN-Mass. That is, now $M = J_{i,i-1}$, $W = w_c$, and $H = w_c + m/2$. Hence, in Fig. 8, we plot the mean singular values of Gaussian-distributed matrices of size $(w_c + m/2, w_c)$ vs. NN-Mass ($m$). As evident, for $w_c$ ranging from 8 to 256, the mean singular values increase linearly with NN-Mass. We explicitly demonstrate in our experiments that this linear trend holds true for actual non-linear deep networks.

Finally, since the Jacobians have a size of $(w_c + m/2, w_c)$, Eq. (12) suggests that their mean singular values should depend on $H = w_c + m/2$. Hence, when two DNNs have the same NN-Mass and width, their mean singular values should be similar, i.e., $\mathbb{E}[\Sigma_S^i] \approx \mathbb{E}[\Sigma_L^i]$ (irrespective of their depths).

Figure 7: The mean singular value $\mathbb{E}[\sigma_i]$ increases only with $H$ while varying $W$. For small-enough ranges, the $\mathbb{E}[\sigma_i]$ vs. $H$ relationship can be approximated by a linear trend.

Figure 8: To simulate more realistic Jacobian matrices, we calculate the mean singular value of a matrix $M$ of size $[w_c + m/2, w_c]$ ($w_c$ is given by the width in the title of each sub-figure). Clearly, $\mathbb{E}[\sigma_i]$ varies linearly with the corresponding NN-Mass for all $w_c$ values. Moreover, as $w_c$ increases, the mean singular values ($\mathbb{E}[\sigma_i]$) increase. Both observations show that $\mathbb{E}[\sigma_i]$ increases with $\hat{k} = w_c + m/2$ (since the height of the Jacobian matrix, $H = \hat{k}$, depends on both $w_c$ and $m$).
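The result of Eq. (12) is easy to verify numerically. The following sketch (our own, assuming entries drawn from $\mathcal{N}(0, 1)$) averages the squared singular values of tall Gaussian matrices:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_sq_singular(H, W, trials=50):
    """Average squared singular value of H x W standard-normal matrices."""
    vals = [np.linalg.svd(rng.standard_normal((H, W)), compute_uv=False) ** 2
            for _ in range(trials)]
    return np.mean(vals)

for H, W in [(100, 20), (100, 50), (200, 50)]:
    print(f"H={H}, W={W}: E[sigma^2] ~= {mean_sq_singular(H, W):.1f}")  # ~= H
```

In each case the average tracks $H$ and is insensitive to $W$, consistent with Eq. (12).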
E CNN Details

In contrast to our MLP setup, which contains only a single cell of width $w_c$ and depth $d_c$, our CNN setup contains three cells, each containing a fixed number of layers, similar to prior works such as Densenets [6], Resnets [5], etc. However, topologically, a CNN is very similar to an MLP. Since, in a regular convolutional layer, the channel-wise convolutions are added to obtain the final output channel (see Fig. 1(c)), each input channel contributes to each output channel at all layers. This is true for both long-range and short-range links; this makes the topological structure of CNNs similar to our MLP setup shown in Fig. 1(a) in the main paper (the only difference is that now each channel, and not each neuron, is a node in the network).

In the case of CNNs, following standard practice [35], the width (i.e., the number of channels per layer) is increased by a factor of two at each cell as the feature map height and width are reduced by half. After the convolutions, the final feature map is average-pooled and passed through a fully-connected layer to generate the logits. The width of the CNNs is controlled using a width multiplier, wm (as in Wide Resnets [33] and Mobilenets [7]). The base channels in each cell are [16, 32, 64]; for wm = 2, the cells have [32, 64, 128] channels per layer.

Figure 9: An example CNN used to illustrate the NN-Density and NN-Mass calculations: three cells with $d_c = 4$ layers each, widths $w_c = [2, 3, 4]$, and maximum long-range link candidates $t_c = [3, 4, 5]$; no long-range links cross cell boundaries. Not all links are shown in the main figure for simplicity. The inset shows the contributions from all long-range and short-range links: the feature maps of randomly selected channels are concatenated at the current layer (similar to Densenets [6]). At each layer of a given cell, the maximum number of channels that can contribute long-range links is given by $t_c$; e.g., in the widest cell, layer $i = 2$ receives long-range links from $\min\{w_c(i-1), t_c\} = 4$ previous channels, and layer $i = 3$ from $\min\{w_c(i-1), t_c\} = 5$. If a channel is selected, it contributes long-range links to all output channels of the current layer.

F Example: Computing NN-Mass for a CNN
Given the CNN architecture shown in Fig. 9, we now calculate its NN-Mass. This CNN consists of three cells, each containing $d_c = 4$ convolutional layers. The three cells have widths (i.e., numbers of channels per layer) of 2, 3, and 4, respectively; we denote the network width as $w_c = [2, 3, 4]$. Finally, the maximum number of channels that can supply long-range links is given by $t_c = [3, 4, 5]$. That is, the first cell can have a maximum of three long-range link candidates per layer (i.e., previous channels that can supply long-range links), the second cell can have a maximum of four long-range link candidates per layer, and so on. Moreover, as mentioned before, we randomly choose $\min\{w_c(i-1), t_c\}$ channels for long-range links at each layer. The inset of Fig. 9 shows how the long-range links are created by concatenating feature maps from previous layers.

Hence, using $d_c = 4$, $w_c = [2, 3, 4]$, and $t_c = [3, 4, 5]$ for the three cells, we can directly use Eq. (2) to compute the NN-Mass value. Substituting the values into the equation, we obtain $m = 28$. Consequently, the set $\{d_c, w_c, t_c\}$ can be used to specify the architecture of any CNN with concatenation-type long-range links. Therefore, to perform NASE, we vary $\{d_c, w_c, t_c\}$ to obtain architectures with different NN-Mass and NN-Density values.
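This worked example can be reproduced in a few lines of Python (our own sketch of Eq. (2), as in Section 3):

```python
def nn_mass(d_c, w_c, t_c):  # Eq. (2)
    return sum(2 * d * sum(min(w * (i - 1), t) for i in range(2, d))
               / ((d - 2) * (d - 1)) for d, w, t in zip(d_c, w_c, t_c))

# Three cells, four layers each, widths [2, 3, 4], t_c = [3, 4, 5]:
print(nn_mass([4, 4, 4], [2, 3, 4], [3, 4, 5]))  # 28.0, matching the text
```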
G Complete Details of the Experimental Setup

G.1 MLP Setup
We now explain more details of our MLP setup for the MNIST dataset. We create random architectures with different NN-Mass and Params by varying $t_c$ and $d_c$. Moreover, we use just a single cell for all MLP experiments. We fix $w_c = 8$ and vary $d_c$ over several depths; for each depth $d_c$, we vary $t_c$ over a range of values. Specifically, for a given $\{d_c, w_c, t_c\}$ configuration, we create random long-range links at layer $i$ by uniformly sampling $\min\{w_c(i-1), t_c\}$ neurons out of the $w_c(i-1)$ activation outputs from the previous $\{1, 2, \ldots, i-1\}$ layers.

Table 2: CNN architecture details (width multiplier = 2)

Number of Cells | Max. Long-Range Link Candidates (t_c)                                | Depth | Width Multiplier
3               | [10,35,50], [20,45,75], [30,50,100], [40,60,120], [50,70,145]        | 31    | 2
3               | [20,40,70], [30,50,100], [40,80,125], [50,105,150], [60,130,170]     | 40    | 2
3               | [25,50,90], [35,80,125], [50,105,150], [70,130,170], [90,150,210]    | 49    | 2
3               | [30,80,117], [50,110,150], [70,140,200], [90,175,250], [110,215,300] | 64    | 2

We train these random architectures on the MNIST dataset for 60 epochs with the Exponential Linear Unit (ELU) as the activation function. Further, each $\{d_c, w_c, t_c\}$ configuration is trained five times with different random seeds. In other words, during each of the five runs of a specific $\{d_c, w_c, t_c\}$ configuration, the shortcuts are initialized randomly, so these five models are not the same. This kind of setup is used to validate that NN-Mass is indeed a topological property of deep networks, and that the specific connections inside the random architectures do not affect our conclusions. The results are then averaged over all runs: the mean is plotted in Fig. 2, and the standard deviation, which is typically low, is given in the Fig. 2 caption. Overall, this setup results in many MLPs with different Params/FLOPS/layers.
G.2 CNN Setup
Much of the setup for creating long-range links in CNNs is the same as for MLPs, except that we have three cells instead of one. As explained in Appendix E, the widths of the three cells are given by wm × [16, 32, 64], where wm is the width multiplier. Note that, since the three cells have different widths (w_c), t_c also takes a different value in each cell. The depth per cell d_c is the same for all cells; hence, the total depth is 3d_c + 4. For instance, for a 31-layer model, d_c = 9. For most of our experiments, we set the total depth of the CNN to {31, 40, 49, 64}; some experiments also use a second set of four total depths (the previously unknown architectures of Appendix H.6).

Again, we conduct several experiments with different {d_c, w_c, t_c} values, which yield many random CNN architectures. The random long-range link creation process is the same as for MLPs (a sketch of the concatenation scheme is given below); for the CNN experiments, we repeated all experiments three times with different random seeds. The specific {d_c, w_c, t_c} values are given in Tables 2, 3, and 4; each row of these tables represents a different set of {d_c, w_c, t_c} configurations. Of note, all CNNs use the ReLU activation function and Batch Norm layers.

For CNNs, we verify our findings on the CIFAR-10 and CIFAR-100 image classification datasets. The learning rate of all models is initialized to 0.05 and follows a cosine-annealing schedule at each epoch, with a minimum learning rate of 0.0 (see the end of Section H.9 for details on how we fixed these hyper-parameter values). Similar to the setup of prior NAS works, cutout is used for data augmentation. All models are trained in Pytorch on NVIDIA 1080-Ti, Titan Xp, and 2080-Ti GPUs. This completes the experimental setup.
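Since all models are trained in Pytorch, the concatenation-type long-range links inside one cell can be sketched roughly as below. This is a simplified, hypothetical module: strides, transitions between cells, and the classifier head are omitted, and the real cells may wire the links somewhat differently.

```python
import random
import torch
import torch.nn as nn

class RandomDenseCell(nn.Module):
    """One cell with DenseNet-style random long-range links (a sketch;
    the cell input is assumed to already have `width` channels)."""

    def __init__(self, width, depth, t):
        super().__init__()
        self.links, self.convs = [], nn.ModuleList()
        for i in range(1, depth + 1):
            # Layers 1..i-2 expose width*(i-2) candidate channels; a random
            # subset of at most t of them is fixed once, at construction time.
            n_cand = width * max(i - 2, 0)
            k = min(n_cand, t)
            self.links.append(sorted(random.sample(range(n_cand), k)))
            self.convs.append(nn.Sequential(
                nn.Conv2d(width + k, width, kernel_size=3, padding=1),
                nn.BatchNorm2d(width),
                nn.ReLU()))

    def forward(self, x):                        # x: (N, width, H, W)
        feats = []                               # outputs of layers 1..i-1
        for i, conv in enumerate(self.convs, start=1):
            if self.links[i - 1]:
                # Concatenate the selected channels of layers 1..i-2 with
                # the direct input, as in the inset of Fig. 9.
                pool = torch.cat(feats[:i - 2], dim=1)
                x = torch.cat([x, pool[:, self.links[i - 1]]], dim=1)
            x = conv(x)
            feats.append(x)
        return x
```

For example, the first cell of a 31-layer, wm = 2 model from the first row of Table 2 would correspond to RandomDenseCell(width=32, depth=9, t=10).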
Table 3: CNN architecture details (width multiplier = 1)

Number of cells | Max. long-range link candidates (t_c)                         | Depth | Width multiplier
3               | [5,8,12], [10,30,50], [30,40,70], [41,61,91], [50,90,110]     | 31    | 1
3               | [5,9,12], [11,31,51], [31,41,71], [41,62,92], [50,90,109]     | 40    | 1
3               | [5,10,11], [11,31,52], [31,41,73], [42,62,93], [50,90,109]    | 49    | 1
3               | [5,10,12], [11,32,53], [31,42,74], [42,62,94], [49,90,110]    | 64    | 1

Table 4: CNN architecture details (width multiplier = 3)

Number of cells | Max. long-range link candidates (t_c)                                 | Depth | Width multiplier
3               | [10,30,50], [40,60,90], [70,90,130], [100,120,170], [130,150,210]     | 31    | 3
3               | [11,31,51], [42,62,92], [72,93,133], [103,123,173], [133,153,212]     | 40    | 3
3               | [11,31,52], [43,63,93], [73,95,135], [104,124,176], [134,154,214]     | 49    | 3
3               | [12,32,52], [44,64,95], [76,96,136], [106,126,178], [135,156,216]     | 64    | 3

Figure 10: More MNIST training convergence results (panels a and b are repeated from Fig. 2, annotated with models X, Y, Z, D, E, F): (a) Models with different Params achieve similar test accuracy. (b) Test accuracy curves of models with different depths/Params concentrate when plotted against NN-Mass (the test accuracy std. dev. is low). (c, d) Models X and Y have the same NN-Mass and achieve very similar training convergence, even though they have highly different Params and depth. Model Z has significantly fewer layers than Y but nearly the same Params, and yet achieves faster training convergence than Y (Z has higher NN-Mass than Y). The same conclusions hold for models D, E, and F. Note that the training convergence curves of models with similar NN-Mass coincide.
H Additional Results
H.1 More MNIST training convergence results
We pick two groups of three models each, (X, Y, Z) and (D, E, F), and plot their training accuracy vs. epochs. Models X and Y have similar NN-Mass, but Y has more Params and depth than X. Model Z has far fewer layers and nearly the same Params as X, but a higher NN-Mass. Fig. 10(c) shows the training convergence results for all three models. As is evident, the training convergence of model X (8.3K Params, 24 layers) nearly coincides with that of model Y (9.0K Params, 32 layers). Moreover, even though model Z (8.3K Params, 16 layers) is shallower than the 32-layer model Y (and has far fewer Params), the training convergence of Z is significantly faster than that of Y (due to its higher NN-Mass and, therefore, better LDI). These results provide clear evidence for the theoretical insights of Proposition 2, and emphasize the importance of topological properties of neural architectures in characterizing gradient propagation and model performance. Similar observations hold among models D, E, and F.
Figure 11: Illustration of the synthetic datasets Seg4 (a) and Circle4 (b). (a) The Seg20 (Seg30) dataset is similar to Seg4, but divides the [0, 1] range into 20 (30) segments. (b) The Circle20 (or simply Circle) dataset is similar to Circle4, but divides the unit circle into 20 concentric circles.

H.2 Results on synthetic data
In this section, we design a few synthetic experiments for our MLPs to verify that the observations of Section 4.2 hold for diverse datasets. Specifically, we design three datasets: Seg20, Seg30, and Circle20 (or just Circle). Fig. 11(a) illustrates the Seg4 dataset, where the range [0, 1] is broken into 4 segments; similarly, Seg20 (Seg30) breaks the line into 20 (30) segments. The classification problem has two classes (alternating segments belong to the same class). Fig. 11(b) shows the Circle dataset, where the unit circle is broken into concentric circles (the regions between consecutive circles form the classes, again two in total). The details of these datasets are given in Table 5, and a generator sketch is given at the end of this description. Of note, we use the ReLU activation function for these experiments (unlike the ELU used for MNIST).

Table 5: Description of our generated synthetic datasets (training set: i ∈ [1, N_train]; test set: i ∈ [1, N_test]; ⌊·⌋ denotes the floor function)

Dataset name      | Description
Seg20             | Feature: [X_i, X_i], Label: Y_i, with X_i = sample([⌊i⌋, ⌊i⌋ + 1]) and Y_i = ⌊i⌋ mod 2
Seg30             | Feature: [X_i, X_i], Label: Y_i, with X_i = sample([⌊i⌋, ⌊i⌋ + 1]) and Y_i = ⌊i⌋ mod 2
Circle (Circle20) | Feature: [X_i1, X_i2], Label: Y_i, with X_i1 = L_i · cos(rand_num), X_i2 = L_i · sin(rand_num), L_i = sample([⌊i⌋, ⌊i⌋ + 1]), and Y_i = ⌊i⌋ mod 2

For these synthetic experiments, we once again conduct the following: (i) we explore the impact of varying Params and NN-Mass on the test accuracy, and (ii) we demonstrate how LDI depends on NN-Mass and Params.
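The exact sampling constants of Table 5 are not reproduced here; the following generators are a hedged reconstruction of the datasets as described above (uniform draws within segments/rings, alternating labels), with illustrative function names and seeds.

```python
import numpy as np

def make_seg(n, k, seed=0):
    """Seg-k: [0, 1] split into k segments with alternating labels.

    A hedged reconstruction of Table 5; both features are set equal so the
    points lie on a line, as in Fig. 11(a).
    """
    rng = np.random.default_rng(seed)
    seg = rng.integers(0, k, size=n)       # which segment a point falls in
    x = (seg + rng.random(n)) / k          # uniform draw inside that segment
    return np.stack([x, x], axis=1), seg % 2

def make_circle(n, k, seed=0):
    """Circle-k: k concentric rings inside the unit circle, alternating labels."""
    rng = np.random.default_rng(seed)
    ring = rng.integers(0, k, size=n)
    r = (ring + rng.random(n)) / k         # radius inside that ring (L_i)
    theta = 2 * np.pi * rng.random(n)      # rand_num in Table 5
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1), ring % 2

X, y = make_seg(1000, 20)                  # e.g., Seg20 features/labels
```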
Test Accuracy
As shown in Fig. 12(a, b, c) and Fig. 12(d, e, f), NN-Mass characterizes the model performance of DNNs much better than the number of parameters. Again, we quantify these results by fitting a linear model between test accuracy and log(Params), and between test accuracy and log(NN-Mass). Similar to the MNIST case, our results show that the R² of test accuracy vs. NN-Mass is much higher than that vs. Params.
Figure 12: Synthetic results on Seg20 (a, d), Seg30 (b, e), and Circle20 (c, f): (a, b, c) Models with different Params achieve similar test accuracy across all synthetic datasets. (d, e, f) Test accuracy curves for the same set of models come closer together when plotted against NN-Mass.

Figure 13: Synthetic results (Circle20 dataset): The mean singular value of J_{i,i−1} is much better correlated with NN-Mass than with Params.
Layerwise Dynamical Isometry
Fig. 13 shows the LDI results for the Circle20 dataset. Again, higher NN-Mass leads to higher initial singular values. Moreover, NN-Mass is better correlated with LDI than Params. This further explains why networks with similar NN-Mass (rather than similar Params) achieve more similar model performance.
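Numerically, the LDI metric reported in Fig. 13 amounts to the mean singular value of the layerwise Jacobian J_{i,i−1} at initialization. A minimal sketch is given below; the probe layer is illustrative, and a full measurement would traverse all layers of a trained-from-scratch model (including its concatenated long-range inputs).

```python
import torch

def mean_singular_value(layer, x):
    """Mean singular value of J_{i,i-1} = d(layer(x))/dx at the point x,
    where x is one (flattened) activation vector from layer i-1."""
    J = torch.autograd.functional.jacobian(layer, x)  # shape: (out_dim, in_dim)
    return torch.linalg.svdvals(J).mean().item()

# Probe a freshly initialized width-8 layer with ReLU, as in the synthetic setup:
layer = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
x = torch.randn(8)
print(mean_singular_value(layer, x))
```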
H.3 Impact of Varying NN-Density
As a baseline, we show that NN-Density cannot predict the accuracy of models with different depths. We train deep networks with varying NN-Density (see the Table 2 models in Appendix G). Fig. 14 shows that shallower models with higher density can reach accuracy comparable to deeper models with lower density. This is quite reasonable: the shallower models are more densely connected than the deeper networks, which promotes more effective information flow in the shallower CNNs despite their significantly fewer parameters. However, NN-Density alone does not identify models (with different size/compute) that achieve similar accuracy, because CNNs with different depths reach comparable test accuracies at different NN-Density values (e.g., while a 31-layer model at a higher ρ_avg performs close to a 64-layer model at a lower ρ_avg, a 49-layer model already outperforms that 64-layer model at yet another density; see models P, Q, R in Fig. 14). Therefore, NN-Density alone is not sufficient.

Figure 14: CIFAR-10, width multiplier wm = 2: Shallower models with higher density can reach accuracy comparable to deeper models with lower density. This does not help, since models with different depths achieve comparable accuracies at different densities (see models P, Q, R).

Figure 15: Impact of varying width: (a) width multiplier wm = 1 (R² = 0.74), (b) wm = 2 (R² = 0.84), and (c) wm = 3 (R² = 0.90). As width increases, the capacity of the smaller (shallower) models increases and, therefore, the accuracy gap between models of different depths shrinks; hence, the R² of the linear fit increases with width.

H.4 R-Squared of CIFAR-10 Accuracy vs. NN-Mass
Fig. 15 shows the impact of increasing model width on the R² of the linear fit between test accuracy and log(NN-Mass).

H.5 Comparison between NN-Mass and Parameter Counting for CNNs
For MLPs, we have shown that NN-Mass significantly outperforms Params in predicting model performance. For CNNs, we quantitatively demonstrate that while parameter counting can be a useful indicator of test accuracy for models with low width (though still not as good as NN-Mass), it completely fails to predict test accuracy as the width increases. Specifically, in Fig. 16(a), we fit a linear model between test accuracy and log(Params) and find an R² of 0.76, slightly lower than that obtained for NN-Mass (R² = 0.84, see Fig. 16(b)). When the width multiplier of the CNNs increases to three, parameter counting completely fails to fit the test accuracies of the models (Fig. 16(c)). In contrast, NN-Mass significantly outperforms parameter counting for wm = 3, achieving R² = 0.90 (Fig. 16(d)). This demonstrates that NN-Mass is indeed a significantly stronger indicator of model performance than parameter counting (a sketch of this comparison is given below).
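The comparison itself is just ordinary least squares of test accuracy on log(Params) vs. log(NN-Mass). A minimal sketch with placeholder data follows; the arrays stand in for the measured values from the Tables 2-4 runs and are not actual results.

```python
import numpy as np
from scipy import stats

def r_squared_logfit(x, acc):
    """R^2 of the ordinary-least-squares fit acc ~ a * log(x) + b."""
    fit = stats.linregress(np.log(x), acc)
    return fit.rvalue ** 2

# Placeholders standing in for (Params, NN-Mass, test accuracy) of a set of
# trained CNNs; substitute the measured values from the actual runs.
params = np.array([0.3e6, 0.7e6, 1.2e6, 2.1e6, 3.4e6])
masses = np.array([180.0, 420.0, 610.0, 890.0, 1240.0])
acc = np.array([95.1, 95.8, 96.0, 96.3, 96.5])

print(r_squared_logfit(params, acc), r_squared_logfit(masses, acc))
```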
H.6 NN-Mass to Predict Test Accuracy of Unknown Architectures

We now demonstrate that NN-Mass can be used to predict the test accuracy of unknown architectures that have never been trained. To this end, we create a testing set of new architectures by training 20 previously unknown architectures with wm = 2 and four new total depths. For these models, we vary the NN-Density over five values that differ from those of the initial architecture space exploration of Fig. 15(b) and Table 2 (in the initial setting, the {31, 40, 49, 64}-layer models were trained at five other NN-Density values). We then use the linear model trained on the {31, 40, 49, 64}-layer models (see Fig. 15(b)) to predict the test accuracy of the unknown CNNs. Note that our testing set consists of models with both different numbers of layers and different NN-Densities (and, implicitly, different NN-Mass values) than the training set.
Figure 16: NN-Mass vs. parameter counting as indicators of model performance. (a) For wm = 2, log(Params) fits the test accuracy with R² = 0.76. (b) For the same wm = 2 case, log(NN-Mass) fits the test accuracy with a higher R² = 0.84. (c) For higher width (wm = 3), parameter counting completely fails to fit the test accuracy of the various models (very low R²). (d) In contrast, NN-Mass still fits the accuracies with a high R² = 0.90.
Figure 17: The linear model trained in Fig. 15(b) is used to predict the test accuracy of completely new architectures. The resulting R² = 0.79 is still high and comparable to the training R² = 0.84. The linear model was trained on the test accuracies and NN-Mass values of the {31, 40, 49, 64}-layer models; the testing set consists of completely new models with different depths and NN-Densities.

Fig. 17 shows that the testing R² is 0.79 (i.e., the R² obtained when predicting the accuracy of the testing-set models), which is close to the training R² of 0.84 (see Fig. 15(b)). Hence, NN-Mass can be used to predict the test accuracy of models that were never trained before; a sketch of this fit-and-predict protocol is given below.
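A minimal sketch of the fit-and-predict protocol, again with placeholder arrays standing in for the measured (NN-Mass, accuracy) pairs of the training and testing architectures.

```python
import numpy as np

def fit_line(log_mass, acc):                 # acc ~ a * log(mass) + b
    a, b = np.polyfit(log_mass, acc, deg=1)
    return a, b

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Placeholders for the {31,40,49,64}-layer training architectures and the
# never-before-trained testing architectures (illustrative values only).
m_train = np.array([200.0, 500.0, 800.0, 1200.0])
a_train = np.array([95.9, 96.2, 96.4, 96.6])
m_test = np.array([300.0, 650.0, 1000.0])
a_test = np.array([96.0, 96.3, 96.5])

a, b = fit_line(np.log(m_train), a_train)
print(r_squared(a_test, a * np.log(m_test) + b))   # "testing" R^2, cf. Fig. 17
```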
H.7 Results for CIFAR-100

Results for the CIFAR-100 dataset are shown in Fig. 18. As evident, several models achieve similar accuracy despite having highly different numbers of parameters (e.g., see the models within box W in Fig. 18(a)). Again, these models get clustered together when plotted against NN-Mass: specifically, the models within box W in Fig. 18(a) fall into buckets Y and Z in Fig. 18(b). Hence, models that clustered together on the CIFAR-10 dataset also cluster together on CIFAR-100. To quantify these results, we fit a linear model between test accuracy and log(NN-Mass) and, again, obtain a high R² = 0.84 (see Fig. 18(c)). Therefore, our observations hold across multiple image classification datasets.
Figure 18: Similar results are obtained for CIFAR-100 (wm = 2). (a) Models in box W have highly different parameter counts but achieve similar accuracy. (b) These models get clustered into buckets Y and Z. (c) The R² of the linear regression fit is 0.84, which shows that NN-Mass is a good predictor of test accuracy. Results are reported as the mean of three runs (std. dev. is low).

H.8 Results for Floating Point Operations (FLOPS)
All results for FLOPS (for the CNN architectures of Tables 2, 3, and 4) are shown in Fig. 19. As evident, models with highly different FLOPS often achieve similar test accuracy. As shown earlier, many of these CNN architectures cluster together when plotted against NN-Mass.
H.9 NN-Mass for Directly Designing Compressed Architectures
Our theoretical and empirical evidence shows that NN-Mass reliably identifies models that achieve a similar accuracy despite having different numbers of layers and parameters. This observation can be used to directly design efficient CNNs as follows:

• First, train a reference big CNN (with a large number of parameters and layers) that achieves very high accuracy on the target dataset, and calculate its NN-Mass (denoted m_L).

• Next, create a completely new and significantly smaller model with far fewer parameters and layers, but with an NN-Mass (m_S) comparable to or higher than that of the large CNN. This process is very fast, as the new model is created without any a priori training. For instance, to design an efficient CNN of width w_c, depth per cell d_c, and NN-Mass m_S ≈ m_L, we only need to find how many long-range links to add in each cell. Since NN-Mass has a closed-form expression (i.e., Eq. (2)), a simple search over the number of long-range links directly determines the NN-Mass of various candidate architectures (see the sketch after this list); we then select the architecture whose NN-Mass is closest to that of the reference CNN. Unlike current manual or NAS-based methods, our approach does not require training individual architectures during the search.

• Since the NN-Mass of the smaller model is similar to that of the reference CNN, our theoretical and empirical results suggest that the newly generated model will lose only a small amount of accuracy while significantly reducing model size. To validate this, we train the new, significantly smaller model and compare its test accuracy against that of the original large CNN.
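The search in the second step can be sketched as follows, reusing the same reading of Eq. (2) as the Appendix F sketch. For brevity, a single shared t is swept across cells, whereas t could also be varied per cell; all numbers in the example are illustrative.

```python
def nn_mass(d_c, w_c, t_c):
    # Same reading of Eq. (2) as in the Appendix F sketch.
    total = 0.0
    for w, t in zip(w_c, t_c):
        actual = sum(min(t, w * (i - 2)) for i in range(3, d_c + 1))
        possible = sum(w * (i - 2) for i in range(3, d_c + 1))
        total += w * d_c * actual / possible
    return total

def match_mass(m_L, d_c, w_c, t_max=512):
    """Smallest shared t whose NN-Mass reaches the reference mass m_L."""
    for t in range(1, t_max + 1):
        if nn_mass(d_c, w_c, [t] * len(w_c)) >= m_L:
            return t
    return None   # unreachable: the candidate must be made wider or deeper

# Hypothetical example: match the mass of a deep reference CNN with a
# candidate of half the depth (all numbers illustrative).
m_L = nn_mass(20, [32, 64, 128], [60, 120, 200])
print(match_mass(m_L, d_c=10, w_c=[32, 64, 128]))
```

No training is involved in this loop; only the chosen architecture is trained afterwards for validation.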
A note on hyper-parameter (e.g., initial learning rate) optimization. Throughout this work, we optimized hyper-parameters such as the initial learning rate for the largest models and then reused the same values for the smaller models. Hence, if these hyper-parameters were further optimized for the smaller models, the gap between the accuracy curves in Figures 5, 18, 19, etc., would shrink further (i.e., the clustering in the NN-Mass plots would improve further). Similarly, the accuracy gap between the compressed models and the large CNNs in Table 1 would shrink even more if the hyper-parameters were also optimized for the smaller models. We did not optimize the initial learning rates, etc., for the smaller models, as doing so would have caused an explosion in the number of experiments. Since our focus is on the topological properties of CNN architectures, we fixed these hyper-parameters as described above.
Figure 19: Test accuracy vs. FLOPS (GFLOPS) and vs. NN-Mass for the 31-, 40-, 49-, and 64-layer models: (a) CIFAR-10, width multiplier = 1; (b) CIFAR-10, width multiplier = 2; (c) CIFAR-10, width multiplier = 3; (d) CIFAR-100, width multiplier = 2.