An Approximation Algorithm for Optimal Subarchitecture Extraction
Adrian de Wynter
Amazon Alexa, 300 Pine St., Seattle, Washington, USA 98101
[email protected]
Abstract. We consider the problem of finding the set of architectural parameters for a chosen deep neural network which is optimal under three metrics: parameter size, inference speed, and error rate. In this paper we state the problem formally, and present an approximation algorithm that, for a large subset of instances, behaves like an FPTAS with an approximation error of ρ ≤ |1 − ǫ|, and that runs in O(|Ξ| + |W∗_T|(1 + |Θ||B||Ξ|/(ǫ s^{3/2}))) steps, where ǫ and s are input parameters; |B| is the batch size; |W∗_T| denotes the cardinality of the largest weight set assignment; and |Ξ| and |Θ| are the cardinalities of the candidate architecture and hyperparameter spaces, respectively.

Keywords: neural architecture search · deep neural network compression
(This paper has not been fully peer-reviewed.)

1 Introduction

Due to their success and versatility, as well as more available computing power, deep neural networks have become ubiquitous in multiple fields and applications. In an apparent correlation between performance and size, Jurassic neural networks, that is, deep neural networks with parameter sizes dwarfing their contemporaries, such as the 17-billion parameter Turing-NLG (Rosset, 2020), have also achieved remarkable, ground-breaking successes on a variety of tasks.

However, the relatively short history of this field has shown that Jurassic networks tend to be replaced by smaller, higher-performing models. For instance, the record-shattering, then-massive, 138-million parameter VGG-16 by Simonyan and Zisserman (2014) topped the leaderboards of the ImageNet challenge (http://image-net.org/challenges/LSVRC/2014/index; accessed June 2020). Yet, the following year it was displaced by the 25-million parameter ResNet by He et al. (2015); both models were eventually displaced by DPN (Chen et al., 2017), with about half the number of parameters of VGG-16. Another example is the 175-billion parameter GPT-3 from Brown et al. (2020), which was outperformed by the 360-million parameter RoBERTa-large from Liu et al. (2019b) in the SuperGLUE benchmark leaderboard (https://super.gluebenchmark.com/leaderboard; accessed June 2020).

Indeed, many strong developments have come from networks that have fewer parameters, but are based on these larger models. While it is known that overparametrized models train easily (Livni et al., 2014), a recent result by Frankle and Carbin (2019) states that it is possible to remove several neurons from a trained deep neural network and still retain good, or even better, generalizability when compared to the original model. From a computational point of view, this suggests that, for a trained architecture, there exists a smaller version with a weight configuration that works just as well as its larger counterpart. Simple yet powerful algorithms, such as the one by Renda et al. (2020), have already supplied an empirical confirmation of this hypothesis, by providing a procedure through which to obtain high-performing pruned versions of a variety of models.

Our problem, which we dub the optimal subarchitecture extraction (OSE) problem, is motivated by these observations. Loosely speaking, the OSE problem is the task of selecting the best non-trainable parameters for a neural network, such that it is optimal in the sense that it minimizes parameter size, inference speed, and error rate.

Formally, the OSE problem is interesting in its own right: knowing whether such an optimal subarchitecture can (or cannot) be found efficiently has implications on the design and analysis of better algorithms for automated neural network design, which perhaps has an even higher impact on the environment than the training of Jurassic networks. More importantly, hardness results on a problem are not mere statements about the potential running time of solutions for it; they are statements on the intrinsic difficulty of the problem itself.

In this paper, we present a formal characterization of the OSE problem, which we refer to as ose, and prove that it is weakly NP-hard. We then introduce an approximation algorithm to solve it, and prove its time and approximation bounds.
In particular, we show that the algorithm in question, under a large set of inputs, attains an absolute error bound of c ≤ ǫ − 1, and runs in O(|Ξ| + |W∗_T|(1 + |Θ||B||Ξ|/(ǫ s^{3/2}))) steps, where ǫ and s are input parameters; |B| is the batch size; |W∗_T| is the cardinality of the largest weight set assignment; and |Ξ|, |Θ| are the cardinalities of the candidate architecture and hyperparameter spaces, respectively. These results apply for a class of networks that fulfills three conditions: the intermediate layers are considerably more expensive, both in terms of parameter size and operations required, than the input and output functions of the network; the corresponding optimization problem is L-Lipschitz smooth with bounded stochastic gradients; and the training procedure uses stochastic gradient descent as the optimizer. We refer to these assumptions collectively as the strong AB^nC property, and note that it fits the contemporary paradigm of deep neural network training. We also show that if we assume the optimization to be µ-strongly convex, this algorithm is an FPTAS with an approximation ratio of ρ ≤ |1 − ǫ|; and remark that the results from the FPTAS version of the algorithm hold regardless of convexity, if we assume the optimal solution set to be the set of reachable weights by the optimizer under the input hyperparameter sets Θ.

1.1 Related Work

The OSE problem is tied to three subfields of machine learning: neural architecture search (NAS), weight pruning, and general neural network compression techniques. All of these areas are complex and vast; while we attempted to make an account of the most relevant works, it is likely we unknowingly omitted plenty of contributions. We link several surveys and encourage the reader to dive deeper on the subjects of interest.

Given that OSE searches for optimal architectural variations, it is a special case of NAS.
Indeed, true NAS, where neither the weights, architectural parameters, nor even the functions composing the architecture are fully defined, has been explored in both an applied and a theoretical fashion. Several well-known examples of the former can be seen in Zoph and Le (2016); Real et al. (2018); Liu et al. (2019a), and Li et al. (2020), when applied to deep learning, although early work in this area can be seen in Carpenter and Grossberg (1987) and Schaffer et al. (1990). The reader is referred to the survey by Elsken et al. (2019) for NAS, and the book by Hutter et al. (2018) for an overview of its applications in the nascent field of automated neural network design, or AutoML.

From a computational standpoint, our problem can be seen as a special case of the so-called Architecture Search Problem from de Wynter (2019), where the architecture remains fixed, but the non-trainable parameters (e.g., the dimensionality of the hidden layers) for said architecture do not. The hardness results around OSE are not surprising, and are based on several important contributions to the complexity theory of training a neural network in the computational model: the loading problem was shown to be NP-complete for a 3-node architecture with hard threshold (Blum and Rivest, 1988) and continuous activation functions (DasGupta et al., 1995); Arora et al. (2016) showed that a 2-layer neural network with a ReLU activation unit could learn a global minimum in exponential time on the input dimensionality; and, more generally, the well-known results by Klivans and Sherstov (2006) and Daniely (2016) prove that intersections of half-spaces are not efficiently learnable under certain assumptions. It is argued by Shalev-Shwartz and Ben-David (2014) that this result has implications on the impossibility of efficiently training neural networks; likewise, Arora et al. (1993) showed that surrogating a loss function and minimizing the gradient, a common practice in deep learning to circumvent non-convexity, is intractable for a large class of functions. Other works proving stronger versions of these results can be found in Daniely and Shalev-Shwartz (2016).

Our work, however, is more closely related to that by Judd (1990), who showed that approximately loading a network, independently of the type of node used, is NP-complete in the worst case. It presents a very similar approach insofar as both their approach and ours emphasize the architecture of the network. However, our problem differs in several aspects: namely, our definition of an architecture focuses on its combinatorial aspect, rather than on the functional aspect of every layer; we do not constrain ourselves to simply finding the highest-performing architecture with respect to accuracy, but also the smallest bitstring representation, which also implies a lower number of operations performed; and we assume that the input task is learnable by the given architecture, a problem shown to be hard by DasGupta et al. (1995), but also proven to be an insufficient characterization of the generalizability of deep neural networks (Livni et al., 2014).

Indeed, when framed as a deep neural network multiobjective optimization problem, the OSE problem finds similes in the considerable work on one-size-fits-all algorithms designed to improve the performance of a model on these metrics. See, for example, Han et al. (2015).
Other common techniques for neural network compression involve knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015), although, by design, they do not offer the same guarantees as the OSE problem. On the other hand, a very clever method was recently presented by Renda et al. (2020), with much success on a variety of neural networks. We consider OSE to be a problem very similar to theirs, although the method presented there is a form of weight pruning, where the parameters of a trained network are removed with the aim of improving either inference speed or parameter size. On the other hand, OSE is a "bottom up" problem where we begin with an untrained network, and we directly optimize over the search space.

Finally, we would like to emphasize that the idea of a volumetric measure of quality for a multiobjective setting is well-known, and was inspired in our work by Cross et al. (2019) in the context of quantum chip optimization. Moreover, a thorough treatment of the conditions under which more data and more parameters hurt in the context of abyssal models is done in a recent paper by Nakkiran et al. (2020), and a good analysis of the convergence rate of such optimizers in non-convex settings and under similar assumptions to those of the strong AB^nC property can be found in Zaheer et al. (2018a).

The remainder of this paper is structured as follows: in Section 2 we introduce notation and a formal definition of the OSE problem, and conclude by proving its hardness. We then begin in Section 3 by stating and proving some assumptions and properties needed for our algorithm, and subsequently formulate such a procedure. Time and approximation bounds for a variety of input situations are provided in Section 4. Finally, in Section 5, we discuss our work, as well as its potential impact and directions for further research.
2 Problem Statement

We wish to find the set of non-trainable parameters for a deep neural network f : R^p → R^q such that its parameter size, inference speed, and error rate on a set D are optimal amongst all such non-trainable parameters; in addition, the functions that compose f must remain unchanged.

Formally, we refer to a layer as a nonlinear function l_i(x; W_i) that takes in an input x ∈ R^{p_i} and a finite set of trainable weights, or parameters, W_i = {w_{i,1}, . . . , w_{i,k}}, such that every w_{i,j} ∈ W_i is an r-dimensional array w_{i,j} ∈ R^{d^{(1)}_{i,j} × d^{(2)}_{i,j} × ··· × d^{(r)}_{i,j}} for some d^{(1)}_{i,j}, d^{(2)}_{i,j}, . . . , d^{(r)}_{i,j}, and returns an output y ∈ R^{q_i}.

This definition is intentionally abstracted out: in practice, such a component is implemented as a computer program not solely dependent on the dimensionality of the input, but also on other, non-trainable parameters that effect a change on its output value. We can then parametrize said layer with an extra ordered tuple of variables ξ_i = ⟨d^{(1)}_{i,j}, d^{(2)}_{i,j}, . . . , d^{(r)}_{i,j}⟩, and explicitly rewrite the equation for a layer as l_i(x; W_i; ξ_i). For the rest of the paper we will adopt this programmatic, rather than functional, view of a layer, and remark that this perspective change is needed for notational convenience and has little effect on the mathematics governing this problem. We will also assume, for simplicity, that the output of the last layer is binary, that is, f : R^p → {0, 1}.

With the definition of a layer, we can then write a neural network architecture f as a continuous, piecewise-linear function, composed of a sequence of n layers:

    f(x; W; ξ) = l_n(l_{n−1}(. . . (l_1(x; W_1; ξ_1)) . . . ; W_{n−1}; ξ_{n−1}); W_n; ξ_n),    (1)

and formally define the architectural parameters and search spaces:

Definition 1.
The architectural parameter set of a neural network architecture f(x; W; ξ) is a finite ordered tuple ξ = ⟨ξ_1, . . . , ξ_n⟩ of variables such that ξ_i ∈ ξ iff it is non-trainable.

Intuitively, an architectural parameter is different from a trainable parameter because it is a variable which is assigned a value when the architecture is instantiated, likewise remaining unchanged for the lifetime of this function. A change in the value assignment of said variable is capable of effecting change on the three functions we are optimizing over, even when utilizing the same training procedure. However, any assignment of ξ_i can be mapped to a specific set of possible trainable parameter assignments.

Definition 2.
The search space Ξ = {ξ^{(1)}, . . . , ξ^{(m)}} is a finite set of valid assignments for the architectural parameters of a neural network architecture f(x; W; ξ^{(i)}), such that for every assignment ξ^{(i)}, the corresponding weight assignment set W^{(i)} ∈ W is non-empty.

Even though the large majority of neural network components could be written as an affine transformation, for practical purposes we leave out the implementation specifics for each layer; the problem we will be dealing with operates on Ξ, and treats the architecture as a black box. Our only constraint, however, is that all trainable and architectural parameters in Equation 1 must be necessary to compute the output of the network.

The following example illustrates a "small" problem, focusing on a specific architecture and its variations:

Example 1. Let our input be x ∈ R^p, and H, J, A be positive integers. Consider the following architecture, composed (in sequence) of a Transformer (Vaswani et al., 2017), a linear layer, and a sigmoid activation unit σ(·):

    f(x) = σ(W_L softmax((W_K x + b_K)(W_Q x + b_Q)^⊤ / √(H/A))(W_V x + b_V) + b_L)    (2)

Here, W_K, W_V, W_Q ∈ R^{H×p}, W_L ∈ R^{J×H}, and b_K, b_V, b_Q ∈ R^H, b_L ∈ R^J are trainable parameters. Although not explicitly stated in Equation 2, H is constrained to be divisible by A; the latter specifies the step size (i.e., number of rows) of the argument over which the softmax(·) function is to be executed, and acts as a regularization parameter (this parameter is referred to in the literature as the number of attention heads). Thus, the architectural parameters for this architecture are given by ξ = ⟨H, J, A⟩, and the search space would be a subset of the countably infinite number of assignments to H, J, and A that satisfy Equation 2. Finally, note how the Transformer unit in this example is considered to be its own layer, in spite of being a block of three linear layers, a division, and an activation function.
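To make Example 1 concrete, the following is a minimal sketch of Equation 2 in NumPy with a single attention head (A = 1); all function and variable names here are our own illustration rather than part of the formal definition. Note how the assignment ξ = ⟨H, J, A⟩ fixes the shape of every trainable array:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def build_example_architecture(p, H, J, A, rng):
    # The architectural parameters xi = <H, J, A> determine every weight shape.
    assert H % A == 0, "H must be divisible by A"
    return {
        "K": rng.standard_normal((H, p)), "b_K": np.zeros(H),
        "Q": rng.standard_normal((H, p)), "b_Q": np.zeros(H),
        "V": rng.standard_normal((H, p)), "b_V": np.zeros(H),
        "L": rng.standard_normal((J, H)), "b_L": np.zeros(J),
    }

def forward(x, W, H, A):
    # Single-head (A = 1) rendition of Equation 2 for a vector input x.
    K = W["K"] @ x + W["b_K"]
    Q = W["Q"] @ x + W["b_Q"]
    V = W["V"] @ x + W["b_V"]
    attn = softmax(np.outer(K, Q) / np.sqrt(H / A)) @ V
    return 1.0 / (1.0 + np.exp(-(W["L"] @ attn + W["b_L"])))  # sigmoid output

rng = np.random.default_rng(0)
W = build_example_architecture(p=16, H=8, J=2, A=1, rng=rng)
y = forward(np.ones(16), W, H=8, A=1)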
For notational completeness, in the following we revisit or define the parameter size, inference speed, and error of a network.

Definition 3. The parameter size p(f(·; W; ξ)) of a network f(·; W; ξ) is defined as the number of trainable parameters in the network:

    p(f(·; W; ξ)) = Σ_{l_i(·; W_i; ξ_i): l_i ∈ f} |W_i|    (3)

The parameter size of a network is different from the size it occupies in memory, since compressed models will require fewer bits of storage space, but will still have to encode the same number of parameters. In other words, Definition 3 provides a bound on the length of the bitstring representation of f.
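A direct transcription of Definition 3, continuing the sketch above; counting the entries of every trainable array yields p(f):

def parameter_size(W):
    # p(f) from Definition 3: the total number of trainable parameters,
    # summing the entries of every weight array in the network.
    return sum(np.asarray(w).size for w in W.values())

# For Example 1 with p=16, H=8, J=2: 3*(8*16 + 8) + (2*8 + 2) = 426.
print(parameter_size(W))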
Definition 4. The inference speed i(f(·; W; ξ)) of a network f(·; W; ξ) is defined as the total number of steps required to compute an output during the forward propagation step for a fixed-size input.

Similar to the parameter size, Definition 4 provides a lower bound on the requirements to compute an output, barring any implementation-specific optimizations. We must also point out that, outside of the model of computation used, some compression schemes (e.g., floating point compression) also have a direct impact on the actual inference speed of the network, roughly proportional to its compression ratio. We do not consider compressed models further, but we note that such enhancements can be trivially included in these objective functions without any impact on the complexity bounds presented in this paper.
Definition 5.
Let D be a set such that it is sampled i.i.d. from an unknown probability distribution; that is, D = {⟨x_i, y_i⟩}_{i=1,...,m} for x_i ∈ R^p and y_i ∈ {0, 1}. The error rate of a network f(·; W; ξ) with respect to D is defined as:

    e(f(·; W; ξ), D) = (1/|D|) Σ_{⟨x_i, y_i⟩ ∈ D} 1[f(x_i; W; ξ) ≠ y_i]    (4)

where we assume that Boolean assignments evaluate to {0, 1}.

The error e(f(·; W; ξ), D) is directly dependent on the trained parameters, as well as (by extension) the hyperparameters and training procedure used to obtain them. In particular, we note that the optimal weight set W∗ is the one that minimizes the error across all possible assignments in W. Albeit obvious, this insight will allow us to prove the complexity bounds for this problem, as well as provide an approximation algorithm for it. It is then clear that we are unable to pick an optimal architecture set ξ∗ without evaluating the error of all the candidate networks f(·; W∗; ξ^{(j)}), ∀ξ^{(j)}, ξ∗ ∈ Ξ and W∗ ∈ W. The main results of our work, presented in Section 3, focus on obtaining a way to circumvent such a limitation.

Whenever there is no room for ambiguity regarding the arguments, we will write Definitions 3, 4, and 5 as p(f), i(f), and e(f), respectively, and obviate the input parameters to the architecture; moreover, we will refer to D as the dataset, following the definitions above.
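Definition 5 also transcribes directly to code; here, predict is any function thresholding the network's output to {0, 1}, and the two-example dataset is purely illustrative:

def error_rate(predict, dataset):
    # e(f, D) from Definition 5: the fraction of examples the network mislabels.
    mistakes = sum(1 for x, y in dataset if predict(x) != y)
    return mistakes / len(dataset)

# Example: threshold the first output of Example 1's network at 0.5.
dataset = [(np.ones(16), 1), (np.zeros(16), 0)]
print(error_rate(lambda x: int(forward(x, W, H=8, A=1)[0] > 0.5), dataset))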
We are now ready to formally define the optimal subarchitecture extraction problem:

Definition 6 (ose). Given a dataset D = {⟨x_i, y_i⟩}_{i=1,...,m} sampled i.i.d. from an unknown probability distribution, a search space Ξ, a set of possible weight assignments W, a set of hyperparameter combinations Θ, and an architecture f(x) = l_n(. . . l_1(x; W_1; ξ_1) . . . ; W_n; ξ_n), find a valid assignment ξ∗ ∈ Ξ, W∗ ∈ W such that p(f(·; W∗; ξ∗)), i(f(·; W∗; ξ∗)), and e(f(·; W∗; ξ∗), D) are minimal across all ξ^{(i)} ∈ Ξ, W^{(i)} ∈ W, and θ^{(i)} ∈ Θ.

We assume that Ξ is well-posed; that is, Ξ is a finite set for which every assignment ξ^{(k)} ∈ Ξ of architectural parameters to f(·; ·; ξ^{(k)}) is valid. In other words, ξ^{(k)} ∈ Ξ is compatible with every layer l_j(·; ·; ξ^{(k)}_j) ∈ f(·; ·; ξ^{(k)}) in the following manner:

– The dimensionality of the first operation on l_1(·; ·; ξ^{(k)}_1) is compatible with the dimensionality of x_i ∈ R^p, for ⟨x_i, y_i⟩ ∈ D.
– The dimensionality of the last operation on l_n(·; ·; ξ^{(k)}_n) is compatible with the dimensionality of all y_i ∈ R^q for ⟨x_i, y_i⟩ ∈ D.
– For any two subsequent operations, the dimensionality of the first operation is compatible with the dimensionality of the ensuing operation.

Note that the well-posedness of Ξ can be achieved in a programmatic manner and in linear time so long as p and q are guaranteed to be constant. Likewise, it does not need to encode solely dimensionality parameters. Mathematically, we expect Ξ to be the domain of a bijective function between the search space and all possible valid architectures.

Hardness of ose

We prove in Theorem 1 that ose is weakly NP-hard. To begin, we formulate the decision version of ose, ose-dec, as follows:

Definition 7 (ose-dec). Given three numbers k_p, k_i, k_e, and a tuple for ose ⟨f, D, W, Ξ, Θ⟩, is there an assignment

    ξ∗ ∈ Ξ, W∗ ∈ W    (5)

for f(·; W; ξ) such that

    p(f(·; W∗; ξ∗)) ≤ k_p,    (6)
    i(f(·; W∗; ξ∗)) ≤ k_i, and    (7)
    e(f(·; W∗; ξ∗), D) ≤ k_e?    (8)

For the following, we will make the assumption, without loss of generality, that all inputs to Definition 7 are integer-valued.

Lemma 1. ose-dec ∈ NP-hard.

Proof. In Appendix A.
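The decision version is likewise easy to state programmatically. A sketch of the ose-dec predicate for a single candidate assignment, assuming callables for the three metrics of Definitions 3 through 5 (all names here are our own):

def ose_dec_check(candidate, k_p, k_i, k_e, p_fn, i_fn, e_fn, dataset):
    # Definition 7: does <xi*, W*> meet all three thresholds at once?
    xi, weights = candidate
    return (p_fn(weights, xi) <= k_p
            and i_fn(weights, xi) <= k_i
            and e_fn(weights, xi, dataset) <= k_e)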
Theorem 1.
Let f̄ be the size of a network f in bits. Assume that, for any instance of ose-dec, i(f) ∈ Ω(poly(f̄)) for all assignments ξ^{(i)} ∈ Ξ. Then ose-dec is weakly NP-hard.

Proof. By construction. Let I = ⟨f, D, W, Ξ, Θ, k_p, k_i, k_e⟩ be an instance of ose-dec.

Remark that, by definition, the parameter size of a neural network increases with the cardinality of W^{(i)}, which is itself dependent on the choice of ξ^{(i)}. Note also that the inference speed i(f(·; W^{(i)}; ξ^{(i)})) can be obtained for free at the same time as computing the error, as this function runs in Θ(i(f(·; W^{(i)}; ξ^{(i)})) |D|) steps, for any f(·; W^{(i)}; ξ^{(i)}).

Consider now the following exact algorithm:

– Construct tables T_1, . . . , T_{k_p}, each of size |Θ||W^{(i)}|, for every W^{(i)} ∈ W, i < k_p.
– For every table T_i compute e(f(·; W^{(i),j}; ξ^{(i)}), D) for every assignment W^{(i),j} ∈ W^{(i)} and hyperparameter set θ^{(i)} ∈ Θ. If the condition i(f) < k_i and e(f) < k_e holds, stop and return "yes".
– Otherwise, return "no".

It is important to highlight that the algorithm does not perform any training, and so the values for some entries of θ^{(i)} ∈ Θ might be unused. That being said, given the ambiguity surrounding the definition of Θ, we can bound the runtime of this procedure to

    O(|Θ||D| Σ_{i}^{k_p} |W^{(i)}| i(f(·; ·; ξ^{(i)})))    (9)

steps.

Note that, although unnervingly slow, Equation 9 implies this algorithm is polynomial on the cardinalities of Θ, W^{(i)}, and D, rather than on their magnitudes. However, this algorithm actually runs in pseudopolynomial time, since the instance's runtime has a dependency on i(f). From the theorem statement, we can assume that i(f) is lower-bounded by the size in bits of f(·; W^{(i),j}; ξ^{(i)}), and, since |W^{(i)}| ∈ Θ(|f(·; W^{(i),j}; ξ^{(i)})|) in both length and magnitude, it is also lower-bounded by the values assigned to the elements of W^{(i),j} ∈ W^{(i)}. This in turn implies that, if W^{(i),j}_k ∈ W^{(i)} is chosen to be arbitrarily large, the runtime of our algorithm is not guaranteed to be polynomial. It follows that ose-dec is weakly NP-hard.
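A sketch of the exact table-based search from the proof, under our own naming; note that, as in the proof, no training is performed, so the hyperparameter sets θ ∈ Θ may go unused:

def ose_dec_exact(candidates, thetas, k_p, k_i, k_e, p_fn, i_fn, e_fn, dataset):
    # candidates: iterable of (xi, [weight assignments W_(i),j]) pairs.
    for xi, weight_assignments in candidates:
        for weights in weight_assignments:   # one table row per W_(i),j ...
            for theta in thetas:             # ... and per hyperparameter set;
                # theta is deliberately unused: the proof's algorithm trains nothing.
                if (p_fn(weights, xi) <= k_p
                        and i_fn(weights, xi) < k_i
                        and e_fn(weights, xi, dataset) < k_e):
                    return "yes"
    return "no"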
Remark 1. The assumption from Theorem 1 regarding the boundedness of i(f) is not a strong assumption. If it were not to hold in practice, we would be unable to find tractable algorithms to evaluate the error e(f). Computationally speaking, hypothesis spaces are defined in finite-precision machines, and hence are finite.

Corollary 1. Let I = ⟨f, D, W, Ξ, Θ⟩ be an instance of ose such that, ∀ξ^{(i)}, ξ^{(j)} ∈ Ξ, and any assignments of W^{(i)}, W^{(j)} ∈ W,

    e(f(·; W^{(i)}; ξ^{(i)}), D) = e(f(·; W^{(j)}; ξ^{(j)}), D).    (10)

This instance admits a polynomial time algorithm.

Proof.
Note that p(f) and i(f) can be computed in time linear in the size of the architecture by employing a counting argument. Therefore this instance of ose is equivalent to the single-source shortest-paths problem between a fixed source and target s, t. To achieve this, construct a graph with a vertex per every architectural parameter ξ^{(i)}_j ∈ ξ^{(i)}, for all ξ^{(i)} ∈ Ξ. Assign a zero-weight edge between every ξ^{(i)}_j, ξ^{(i)}_{j+1} ∈ ξ^{(i)}, and between s and every ξ^{(i)}_1. Finally, add an edge of weight p(f(·; ·; ξ^{(i)})) + i(f(·; ·; ξ^{(i)})) between the last architectural parameter of every ξ^{(i)} and t. This is well-known to be solvable in polynomial time.

It is important to highlight that the hardness results from this section are not necessarily limiting in practice (cf. Theorem 1), and worst-case analysis only applies to the general case. Current algorithms devised to train deep neural networks rely on the statistical model of learning, and routinely achieve excellent results by approximating a stationary point corresponding to this problem in a tractable number of steps. Nonetheless, computationally speaking, Theorem 1 highlights a key observation about the OSE problem: the problem could be solved (i.e., brute-forced) for small instances efficiently, but the compute power required for larger inputs quickly becomes intractable.

3 An Approximation Algorithm for ose

In this section we introduce our approximation algorithm for ose. Our strategy is simple: we rely on surrogate functions for each of our objective functions, as well as a scalarization of said surrogates which we refer to as the W-coefficient. We begin by describing these functions, and then we show that a solution optimal with respect to the W-coefficient is an optimal solution for ose. We conclude this section by introducing the algorithm.

We surrogate our objective functions such that they are all in terms of the variable set for Ξ, with the ultimate goal of transforming ose into a volume maximization problem. Assuming that p(f) and i(f) are expressible in terms of Ξ is not a strong assumption: if such functions were not at least weakly correlated with Ξ, we could assume their values to be constants for any f(·; ·; ξ) in Ξ, and solve this as a neural network training problem. Seen another way, as the number of trainable parameters increases, we can expect p(f) and i(f) to likewise increase, although not necessarily in a linear fashion. On the other hand, e(f) is a function directly affected by the values of the trained weights W, and not by the architectural parameters. This means that the expressibility of e(f) as a function of Ξ requires stronger assumptions than for the other two metrics, and directly influences our ability to approximate a solution for ose.

For the following, let F be the set of possible architectures, such that it is the codomain of some bijective function A assigning architectural parameters from Ξ to specific candidate architectures,

    A : Ξ → F.    (11)

Additionally, we will use the shorthand poly(Ξ) to denote a polynomial on the variable set {ξ_{i,1}, . . . , ξ_{i,m} : ξ^{(i)} = ⟨ξ_{i,1}, . . . , ξ_{i,m}⟩ ∈ Ξ}, that is,

    poly(Ξ) = poly(ξ_{1,1}, . . . , ξ_{i,1}, . . . , ξ_{i,m}, . . . , ξ_{n,m}).    (12)

Surrogate Inference Speed
We will assign constant-time cost to add-multiply operations, and linear cost on the size of the input for every other operation. We assume that the computations are taken up to some finite precision b:

    î(f(·; W; ξ)) = Σ_{l_i(·; W_i; ξ_i) ∈ f} (number of additions in l_i(·; W_i; ξ_i))
                  + Σ_{l_i(·; W_i; ξ_i) ∈ f} (number of multiplications in l_i(·; W_i; ξ_i))
                  + Σ_{j ∈ [1,...,b]} Σ_{l_i(·; W_i; ξ_i) ∈ f} j · (number of other operations in l_i(·; W_i; ξ_i) of length j).    (13)
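A sketch of Equation 13 for layers described by per-layer operation counts (the dictionary layout is our own illustration):

def surrogate_inference_speed(layers, b):
    # i-hat(f) from Equation 13: unit cost for adds and multiplies,
    # cost j for any other operation on inputs of length j <= b.
    total = 0
    for layer in layers:
        total += layer["additions"] + layer["multiplications"]
        for j in range(1, b + 1):
            total += j * layer["other_ops"].get(j, 0)
    return total

# Example: one layer with 96 adds, 128 multiplies, and 8 other ops of length 4.
print(surrogate_inference_speed(
    [{"additions": 96, "multiplications": 128, "other_ops": {4: 8}}], b=32))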
Lemma 2. For all f ∈ F,

    î(f(·; W; ξ)) ∈ O(poly(Ξ)).    (14)

Proof.
Recall that k-ary mathematical operations performed on a set of inputs will polynomially depend on the size of each of its members, which is given by Õ(n^ω) for ω a constant that often depends on the chosen multiplication algorithm. Given that the size of every member is specified by the length of the bitwise representation of every element, times the number of elements in said member (which is, by definition, encoded in ξ), we can have a loose bound on the number of operations specified by every positional variable ξ_i ∈ ξ, for every ξ ∈ Ξ. By the compositional closure of polynomials, this value remains polynomial on the variable set for Ξ.

Lemma 3.
For any f ∈ F,

    i(f(·; W; ξ)) ∈ O(î(f(·; W; ξ))).    (15)

Proof.
Immediate from Lemma 2.

Lemma 2 implies that we did not need to specify linear cost for all other operations. In fact, if we were to use their actual cost with respect to their bitstring size and the chosen multiplication algorithm, we could still maintain polynomiality, and the bound from Lemma 3 would be tight. However, Equation 13 suffices for the purposes of our analysis, since it implies that the surrogate inference speed of a network is bounded by its size.

It is important to note that Equation 13 is not an entirely accurate depiction of how fast an architecture can compute an output in practice. Multiple factors, ranging from hardware optimizations and other compile-time techniques, to even which processes are running in parallel on the machine, can affect the total wall clock time between computations. Nonetheless, Equation 13 provides a quantification of the approximate "cost" inherent to an architecture, regardless of the extraneous factors, when comparing î(f(·; W^{(k)}; ξ^{(i)})) and î(f(·; W^{(l)}; ξ^{(j)})) for two ξ^{(i)}, ξ^{(j)} ∈ Ξ. (In the deep learning literature, the surrogate inference speed is sometimes measured via the number of floating point operations, or FLOPs, that is, the number of add-multiply operations. See, for example, Molchanov et al. (2017). However, this definition would imply that î(f(·; W; ξ)) ∈ O(i(f(·; W; ξ))), which is undesirable from a correctness point of view.)

Surrogate Parameter Size
We do not surrogate p(f), as it is already expressible as a polynomial on Ξ:

Lemma 4.
For any f ∈ F,

    p(f(·; W; ξ)) ∈ Θ(poly(Ξ)).    (16)

Proof.
By Definition 3, note that p : F → N≥0 is the function taking in an architecture and returning its parameter size. Let p̂ : Ξ → N≥0 be the function taking in the architectural parameter set of the architecture, and returning its parameter size. p̂ is the pullback of A by p. Polynomiality follows from the fact that every r-dimensional set of weights w_{i,j} ∈ R^{d^{(1)}_{i,j} × d^{(2)}_{i,j} × ··· × d^{(r)}_{i,j}}, belonging to some layer l_i(·; W_i; ξ_i), is parametrized in Ξ-space by r corresponding architectural parameters ξ_{i,k} = d^{(k)}_{i,j}. This will contribute a total of Π_{k=1}^{r} d^{(k)}_{i,j} trainable parameters to the evaluation of p(f). On the other hand, the non-trainable architectural parameters that do not correspond to a dimensionality will not affect the value of p(f).

Surrogate Error
It was mentioned in Section 2.3 that the error is hard to compute directly, and hence we must rely on surrogating e(f) with an empirical loss function. Following standard optimization techniques, we then define the surrogate error as the empirical loss function over a subset B ⊂ D:

    ê(f(·; W; ξ), B) = (1/|B|) Σ_{⟨x_i, y_i⟩ ∈ B ⊂ D} ℓ(f(x_i; W; ξ), y_i),    (17)

for some smooth function ℓ : R^p × R^q → [0, 1] that upper bounds Definition 5 and is classification-calibrated (see, for example, Bartlett et al. (2006) and Nguyen et al. (2009)). This function does not need to be symmetric, but in Section 4 we will impose further constraints on it. It is common to minimize the empirical loss by the use of an optimizer such as stochastic gradient descent (SGD), although in deep learning issues arise due to its non-convexity. Our goal is then to show that such a loss function can be approximated as a function of Ξ.
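A sketch of Equation 17, continuing the earlier examples; we use the squared loss as an illustrative smooth ℓ with range [0, 1] (the formal requirements on ℓ are as stated above):

def surrogate_error(predict_proba, batch):
    # e-hat(f, B) from Equation 17: the mean surrogate loss over a batch B of D.
    # Squared loss (q - y)^2 stands in for l; it is smooth and bounded in [0, 1].
    return sum((predict_proba(x) - y) ** 2 for x, y in batch) / len(batch)

batch = [(np.ones(16), 1), (np.zeros(16), 0)]
print(surrogate_error(lambda x: forward(x, W, H=8, A=1)[0], batch))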
Lemma 5. Let I = ⟨f, D, W, Ξ, Θ, ℓ⟩ be an instance of ose, such that the same training procedure and optimizer are called for every f ∈ F. Fix a number of iterations, s, for this training procedure. Then, for any f ∈ F, with fixed B ⊂ D and θ ∈ Θ, ê(f(·; W; ξ), B) corresponds to the image of some polynomial poly_{B,s,θ}(Ξ) evaluated at f.

Proof. Note that a training procedure has, at iteration s, exactly one weight set assigned to any f ∈ F. We can then construct a function that assigns a quality measure to every architecture in F based on ê(f) at s, that is, f : F → [0, 1]. Polynomiality follows from the fact that f is the pullback of ê by p; by Lemma 4, f is then parametrized by a polynomial on Ξ.

Lemma 6.
For any f ∈ F and an appropriate choice of ℓ(·, ·),

    e(f(·; W; ξ), D) ∈ O(ê(f(·; W; ξ), B)).    (18)

Proof. By construction, ℓ(·, ·) is designed to upper bound e(f).

As before, we will employ the shorthands ê(f) and î(f) for Equations 17 and 13. It is clear that any solution strategy for ose will need, from the definition of the surrogate error, to take an extra input ℓ. We will account for that and denote such an instance as I = ⟨f, D, W, Ξ, Θ, ℓ⟩.

The W-coefficient

The final tool we require in order to describe our algorithm is what we refer to as the W-coefficient, which is the scalarization of our objective functions. We begin by defining a crucial tool for our algorithm, the maximum point on Ξ:

Definition 8 (The maximum point on Ξ). Let T(·; W∗_T; ξ_T) be an architecture in F such that ∀f ∈ F, p(f) < p(T) and i(f) < i(T). We refer to T as the maximum point on Ξ.

Note that the maximum point on Ξ is not an optimal point: our goal is to minimize all three functions, and T(·; W∗_T; ξ_T) does not have a known error rate on D. For this reason, the parameters corresponding to the maximum point cannot be seen as a nadir objective vector.

Definition 9.
Let f(·; W; ξ), T(·; W∗_T; ξ_T) ∈ F be two architectures, such that T is a maximum point on Ξ. The W-coefficient of f and T is given by:

    W(f, T) = ((p(T) − p(f)) / p(T)) · ((î(T) − î(f)) / î(T)) · (1 / ê(f)).    (19)

The need for the normalization terms in Equation 19 becomes clear as we note that the practical ranges of the functions might differ considerably.
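A sketch of Equation 19 under our own naming (the 1/ê(f) term assumes ê(f) > 0):

def w_coefficient(p_f, i_f, e_f, p_T, i_T):
    # W(f, T) from Equation 19: two normalized size/speed gains times 1/e-hat;
    # larger is better on every objective simultaneously (assumes e_f > 0).
    return ((p_T - p_f) / p_T) * ((i_T - i_f) / i_T) * (1.0 / e_f)

# A small candidate with low surrogate error scores high:
print(w_coefficient(p_f=426, i_f=256, e_f=0.1, p_T=10_000, i_T=5_000))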
We now prove that the W-coefficient can be seen as the scalarization of our surrogate objective functions p(f), ê(f), and î(f):

Lemma 7. Let T(·; W∗_T; ξ_T) ∈ F be a maximum point on Ξ. Let OPT be the Pareto-optimal set for the multiobjective optimization problem:

    Minimize p(f), ê(f), î(f)
    Subject to D, Θ, and f ∈ F.    (20)

Then f∗(·; W∗; ξ∗) ∈ OPT if and only if W(f∗, T) > W(f, T) ∀f, f∗ ∈ F.

Proof. "If" direction: by Equation 19 and Definition 8, the terms corresponding to p(T) − p(f) (resp. î(T) − î(f)) will be greater whenever p(f) (resp. î(f)) is minimized. Likewise, the term corresponding to ê(f) grows whenever ê(f) is minimized, that is, when the surrogate error is better. An architecture f∗ ∈ OPT will thus correspond to sup{W(f, T) : f ∈ F}.

"Only if" direction: assume ∃f′ ∈ F such that at least one of the following is true: p(f′) < p(f∗), î(f′) < î(f∗), or ê(f′) < ê(f∗), and f′ ∉ OPT. But that would mean that W(f′, T) ≥ W(f∗, T), a contradiction.

Lemma 8.
Let I = ⟨f, D, W, Ξ, Θ, ℓ⟩ be an instance of ose, and let T(·; W∗_T; ξ_T) ∈ F be a maximum point on Ξ with respect to some dataset D. An architecture f∗(·; W∗; ξ∗) ∈ F is optimal over p(f), e(f), and i(f) if and only if W(f∗, T) > W(f, T) ∀f ∈ F.

Proof. It follows immediately from Lemma 7, as well as Lemmas 3, 4, and 6; an architecture that maximizes argmax_{f ∈ F} W(f, T) will belong to the Pareto-optimal set for this instance of ose. (Visually, the W-coefficient is proportional to the volume of the Ξ-dimensional conic section defined by the volume spanned by ⟨p(T) − p(f), i(T) − î(f), ê(f)⟩ in Ξ-space, hence its name.)

Algorithm

Our proposed algorithm is displayed in Algorithm 1. We highlight two particularities of this procedure: it only evaluates a fraction ⌊|Ξ|/ǫ⌋ of the possible architectures, and it does so for a limited number of steps, rather than until convergence. Likewise, the training algorithm remains fixed on every iteration, and is left unspecified. In Section 4 we describe training procedures and optimizers under which we can guarantee optimality for a given instance.
Algorithm 1: Our proposed algorithm to solve ose.
1: Input: Architecture f, dataset D, weights space W, search space Ξ, hyperparameter set Θ, interval size 1 ≤ ǫ ≤ |Ξ|, maximum training steps s > 0, selected loss ℓ.
2: Find a maximum point T(·; W∗_T; ξ∗_T).
3: Obtain expressions for p(T) and î(T) in terms of ξ∗_T.
4: Ξ′ ← Sort the terms in Ξ based on the leading term of p(T).
5: for every hyperparameter set θ ∈ Θ do
6:     for every ǫth set ξ^{(ǫ)} ∈ Ξ′ do
7:         Train a candidate architecture f^{(i)}(·; W^{(i)}; ξ^{(ǫ)}) for s steps, under θ.
8:         Keep track of the largest W-coefficient W(f^{(i)}, T), and its corresponding sets ξ^{(i)} and W^{(i)}.
9:     end for
10: end for
11: return ⟨ξ∗, W∗⟩ corresponding to the first recorded argmax_{f^{(i)}(·; W∗; ξ∗)} W(f^{(i)}, T)
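A sketch of Algorithm 1 in Python, under our own naming; train_for stands in for the fixed, unspecified training procedure, and metrics returns the triple ⟨p(f), î(f), ê(f)⟩:

def ose_approx(arch_space_sorted, thetas, T_metrics, epsilon, s,
               train_for, metrics):
    # Algorithm 1: scan every epsilon-th candidate, train for s steps only,
    # and keep the first argmax of the W-coefficient (Equation 19).
    p_T, i_T = T_metrics                      # from the maximum point T
    best, best_w = None, float("-inf")
    for theta in thetas:                                       # line 5
        for xi in arch_space_sorted[::max(1, int(epsilon))]:   # line 6
            weights = train_for(xi, theta, s)                  # line 7
            p_f, i_f, e_f = metrics(xi, weights)
            w = ((p_T - p_f) / p_T) * ((i_T - i_f) / i_T) / e_f
            if w > best_w:                  # line 8: strict > keeps first argmax
                best, best_w = (xi, weights), w
    return best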
4 Time and Approximation Bounds

In this section we begin by introducing the notion of the AB^nC property, which consolidates the assumptions leading to the surrogate functions from the previous section. We then provide time bounds for Algorithm 1, and show that, if the input presents the AB^nC property, its solution runs in polynomial time. We conclude this section with an analysis of the correctness and error bounds of our algorithm, and prove under which conditions Algorithm 1 behaves like an FPTAS.

The AB^nC Property

It is clear from Algorithm 1 that, without any additional assumptions, the correctness proven in Lemma 8 is not guaranteed outside of asymptotic conditions. Specifically, the work done in the previous section with the aim of expressing p(f), i(f), and e(f) as functions on Ξ implicitly induces an order on the objective functions. Without it, an algorithm that relies on evaluating every ǫth architecture can only provide worst-case guarantees.

The AB^nC property captures sufficient and necessary conditions to guarantee such an order, as well as providing faster solutions for Algorithm 1.

Definition 10 (The weak AB^nC property). Let
A, B, and C be unique layers for an architecture f(·; W; ξ) ∈ F, such that, for any input x:

    f(x; W; ξ) = C(B_n(. . . B_1(A(x; W_A; ξ_A); W_1; ξ_1) . . . ); W_n; ξ_n); W_C; ξ_C),    (21)

where n ≥ 1 is an architectural parameter. We say that f(x; W; ξ) has the weak AB^nC property if, for 1 ≤ i ≤ n:

    p(A(·; W_A; ξ_A)) and p(C(·; W_C; ξ_C)) ∈ o(p(B_i(·; W_{B_i}; ξ_{B_i}))),    (22)
    i(A(·; W_A; ξ_A)) and i(C(·; W_C; ξ_C)) ∈ o(i(B_i(·; W_{B_i}; ξ_{B_i}))).    (23)

Lemma 9. Let I = ⟨f, D, W, Ξ, Θ, ℓ⟩ be an instance of ose. If there exists at least one f ∈ F with the weak AB^nC property, then all f ∈ F have the weak AB^nC property.

Proof. It follows from Definition 10. Let f(·; W^{(f)}; ξ^{(f)}), g(·; W^{(g)}; ξ^{(g)}) ∈ F be two architectures, such that f has the AB^nC property, and let T be a maximal point on Ξ. By Lemma 4, it can be seen that p(f) ∈ Θ(p(g)). In particular, although the value of p(g) changes under an assignment ξ^{(g)}, the function that defines it does not. Therefore, the condition p(A(·; W_A; ξ_A)), p(C(·; W_C; ξ_C)) ∈ o(p(B_i(·; W_{B_i}; ξ_{B_i}))), for some layers A, B_i, C ∈ g, holds. The proof of Equation 23 for the same layers of g is symmetric, by means of Lemma 3.

In spite of the closure from Lemma 9, Algorithm 1 still requires us to find the optimal weight set W∗ for every architecture evaluated. Given that the set W is oftentimes large, the algorithm itself does not adjust to modern deep learning practices. Therefore we strengthen Definition 10 by imposing some further conditions on the metric function ℓ.

Definition 11 (The strong AB^nC property). Let I = ⟨f, D, W, Ξ, Θ, ℓ⟩ be an instance of ose. We say I has the strong AB^nC property if ∃f ∈ F with the weak AB^nC property and ∇_W ℓ is L-Lipschitz smooth with respect to W, with bounded stochastic gradients, for every θ ∈ Θ. That is:

    ||∇ℓ(f(x_i; ·), y_i) − ∇ℓ(f(x_j; ·), y_j)|| ≤ L ||ℓ(f(x_i; ·), y_i) − ℓ(f(x_j; ·), y_j)||,    (24)

and

    ||∇ℓ(f(x_i; ·), y_i)_k|| ≤ G ∀k ∈ [|B|],    (25)

∀⟨x_i, y_i⟩, ⟨x_j, y_j⟩ ∈ B ⊂ D, and some fixed constants L and G.

Remark that the constraints from Definition 11 are fairly common assumptions in the analysis of first-order optimizers in machine learning (Shalev-Shwartz et al., 2009; Zaheer et al., 2018b). While there is no algorithm to find an ǫ-stationary point in the non-convex, non-smooth setting (this limitation, as argued by LeCun et al. (2015), in addition to the large track record of successes by deep learning systems, has little impact on natural problems), approximating a point near it is indeed tractable (Zhang et al., 2020).

We can exploit such a result with the goal of showing that the set F has an ordering with respect to p(f), î(f), and ê(f), through the expected convergence of SGD in the non-convex case. Bounds on this convergence are well-known in the literature (see, for example, Ghadimi and Lan (2013)). However, Definition 11 has looser assumptions, which align directly with a recent result by Nguyen et al. (2020) on a variant of SGD referred to there as shuffling-type, which we reproduce below for convenience:

Lemma 10 (Nguyen et al. (2020)).
Let f(x; w) be an L-Lipschitz smooth function on w, bounded from below and such that ||∇f(x_i; w)|| ≤ G for some G and i ∈ [n]. Given the following problem:

    min_w F(w) = (1/n) Σ_{i=1}^{n} f(x_i; w),    (26)

the number of gradient evaluations from a shuffling-type SGD required to obtain a solution ŵ_T bounded by

    E[||∇F(ŵ_T)||²] ≤ ǫ    (27)

is given by:

    T = ⌊LG[F(w̃_0) − inf_w F(w)] · n / ǫ^{3/2}⌋,    (28)

for a fixed learning rate η = √ǫ / (LG).
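A sketch of one shuffling-type SGD run in the sense of Lemma 10, assuming a per-example gradient oracle grad_fn; the fixed learning rate follows the lemma for illustrative constants L, G, and ǫ:

def shuffling_sgd(w, data, grad_fn, epochs, eta, rng):
    # Shuffling-type SGD: each epoch visits every example exactly once,
    # in a freshly permuted order, with a fixed learning rate eta.
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            w = w - eta * grad_fn(w, data[idx])
    return w

# eta = sqrt(epsilon) / (L * G) per Lemma 10, for assumed L=10, G=5, epsilon=0.01:
eta = np.sqrt(0.01) / (10.0 * 5.0)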
Lemma 11. Let I = ⟨f, D, W, Ξ, Θ, ℓ⟩ be an instance of ose with the strong AB^nC property. Fix a maximum point T and a training procedure with a shuffling-type SGD optimizer. Then, for a fixed number of steps s and fixed θ ∈ Θ, for any f, g ∈ F, if their depths n_f ∈ ξ_f, n_g ∈ ξ_g are such that n_f ≤ n_g, then ê(f) ≤ ê(g).

Proof. Let s_f (resp. s_g) be the total number of steps required to achieve an ǫ-accurate solution for f (resp. g). Then, by Lemma 10, and counting the actual number of operations on the network:

    s_f ≤ C ǫ_f^{−3/2} i(f),    (29)
    s_g ≤ C ǫ_g^{−3/2} i(g),    (30)

where C is a constant equal in both cases due to the parameters of the training procedure. By the weak AB^nC property, s_f < s_g, since n_f < n_g ⟹ i(f) ≤ i(g). Thus, for a fixed step s:

    C ǫ_f^{−3/2} i(f) = C ǫ_g^{−3/2} i(g),    (31)

which immediately implies that ǫ_g ≥ ǫ_f, w.p. 1, with equality guaranteed when i(g) = i(f). Since ǫ_g (resp. ǫ_f) is the radius of a ball centered around the optimal point, then ê(f) ≤ ê(g).

Remark 2.
If we were to change the conditions from Lemma 11 to any loss with convergence guarantees (e.g., a µ-strongly convex ℓ(f(x_i; ·), y_i)), the statement would still hold.

With these tools, we can now state the following lemma:

Lemma 12.
Let I = ⟨f, D, W, Ξ, Θ, ℓ⟩ be an instance of ose, such that I presents the strong AB^nC property. Let T be a maximum point on Ξ. Then, for any f, g ∈ F:

    p(f) ≼ p(g) ⟺ î(f) ≼ î(g),    (32)
    p(f) ≼ p(g) ⟺ ê(f) ≼ ê(g).    (33)

Proof. The proof of Equation 32 is immediate from the definition of the weak AB^nC property: since the inference speed as well as the parameter size are bounded by the depth of the network, there exists a partial ordering on the architectures with respect to the original set. Equation 33 follows immediately from Lemma 11 and the definition of the strong AB^nC property.

Time Bounds

In this section we prove the time complexity of Algorithm 1.

Theorem 2.
Let I = ⟨f, D, W, Ξ, Θ, ℓ⟩ be an instance of ose with the AB^nC property. Then, if Algorithm 1 employs a shuffling-type SGD training procedure, it terminates in

    O(|Ξ| + |W∗_T| (1 + |Θ||B||Ξ| / (ǫ · s^{3/2})))    (34)

steps, where 1 ≤ ǫ ≤ |Ξ| and s > 0 are input parameters; B ⊂ D for B in every θ^{(i)} ∈ Θ; and |W∗_T| = argmax_{w ∈ W}(|w|) is the cardinality of the largest weight set assignment.

Proof. By Lemma 10 and Lemma 11, finding the maximum point T(·; W∗_T; ξ_T) will be bounded by |Ξ|. Likewise, obtaining the expressions for p(T) and î(T) can be done by traversing T and employing a counting argument. By the definition of the weak AB^nC property and Lemma 12, this is bounded by the cardinality of its weight set. In the same vein, grouping the terms of p(T) can be done in constant time by relating it to the depth of the network. Then the initialization time of our algorithm will be:

    O(|Ξ| + |W∗_T|).    (35)

On the other hand, training each of the |Ξ| candidate architectures with an interval size ǫ takes

    O(⌊|Ξ|/ǫ⌋ i(f) |B||Θ| / s^{3/2})    (36)

steps. Adding both equations together, and by Definition 8, this leads to a total time complexity of

    O(|Ξ| + |W∗_T| (1 + |Θ||B||Ξ| / (ǫ · s^{3/2})))    (37)

steps, which concludes the proof.

So far we have assumed training is normally done through the minimization of the surrogate loss ê(f). Other methods, such as Bayesian optimization, simulated annealing, or even random search, can also be used to assign weights to f. In the general case, when the algorithm is guaranteed to return the optimal set of weights regardless of the convexity and differentiability of the input function, any reasonable procedure would take at most O(i(f)|D||W|) steps to obtain such a solution.

As shown in Theorem 1, when the input does not present the AB^nC property, the runtime is not necessarily polynomial. Likewise, its runtime might be worse than a simple linear search when |W| << |Ξ|. This last situation, however, is only seen in zero-shot techniques.

Approximation Bounds

In this section we prove error bounds for Algorithm 1 under the strong AB^nC property. The general case, where it is impossible to assume whether Lemma 12 holds, can only provide asymptotically worst-case guarantees.

Theorem 3.
Let I = ⟨f, D, W, Ξ, Θ, ℓ⟩ be an instance of ose with the strong AB^nC property. For a chosen s > 0 and 1 ≤ ǫ ≤ |Ξ|, if Algorithm 1 employs a shuffling-type SGD training procedure with a fixed learning rate, it will return a solution with a worst-case absolute error bound of c ≤ ǫ − 1 on the space of solutions reachable by this training procedure.

Proof. The proof has two parts: first, proving that the worst-case absolute error bound for an s-optimal solution is at most ǫ − 1; second, showing that c is independent of the choice of s, and that these results extend to an optimal solution found by taking every architecture to convergence with the optimizer.

To begin, assume for simplicity Θ = {θ}. By Lemmas 7 and 8 we know that, for a fixed s, W(f^{s,∗}, T) is s-optimal in the limit where ǫ = |Ξ|. Let OPT_s be the location of the solution in Ξ-space to such an instance, and let OPT_ALG be the corresponding location of the output of Algorithm 1.

The input instance I, by Lemma 11, contains an ordering of the error with respect to the parameter size. Sorting the parameters and then obtaining a solution OPT_ALG means that OPT_s is located within the radius of the ball centered at W(f^s, T)'s index, that is, OPT_s ∈ B(OPT_ALG, r). It is not difficult to see that r = ǫ − 1, and therefore

    OPT_s ≥ OPT_ALG − (ǫ − 1), or    (38)
    OPT_s ≤ OPT_ALG + (ǫ − 1).    (39)

We can simplify the above expression to

    |OPT_s − OPT_ALG| ≤ ǫ − 1,    (40)

which yields the desired absolute error bound.

Now let |Θ| > 1. By line 8 of Algorithm 1, Equation 40 still holds, since an optimal solution W_i over θ^{(i)} ∈ Θ will be returned iff W_i(f^s, T) ≥ W_j(f^s, T) for all θ^{(i)}, θ^{(j)} ∈ Θ.

Consider the case s + 1. Given that the instance has the AB^nC property, Lemma 12 holds, and following Equation 40 we obtain

    |OPT_{s+1} − OPT_ALG| ≤ ǫ − 1.    (41)

By an induction step on some s_T > s, we have:

    |OPT_{s_T} − OPT_ALG| ≤ ǫ − 1,    (42)

from which it immediately follows that c ≤ ǫ − 1 for any choice of s.

To finalize our proof, remark that OPT_{s_T} is only s_T-optimal. Consider then the case of the optimal solution OPT∗, that is, the solution corresponding to the smallest number of steps required to converge to an optimal solution W∗ ∈ W for all f∗ and θ_i ∈ Θ. By the definition of the strong AB^nC property, Lemma 10, and the induction step from the last part:

    |OPT∗ − OPT_ALG| ≤ ǫ − 1,    (43)

which concludes the proof.

While Theorem 3 provides stronger results, we also provide the approximation ratio of Algorithm 1, to be used later.

Corollary 2.
If the instance presents the strong AB^nC property, Algorithm 1 returns a solution with an approximation ratio

    ρ ≤ |1 − ǫ|    (44)

on the space of solutions reachable by this training procedure.

Proof. Follows immediately from Theorem 3 and the definition of an approximation ratio:

    ρ ≤ |(OPT∗ − OPT_ALG) / OPT∗|    (45)
    ρ ≤ |((k − (ǫ − 1))⌊|Ξ|/ǫ⌋ − k⌊|Ξ|/ǫ⌋) / ((k − (ǫ − 1))⌊|Ξ|/ǫ⌋)|    (46)

for some k ≤ ǫ. Plugging it in, we obtain the desired ratio.

Remark 3. Note that OPT∗ does not necessarily correspond to a global optimum, but it corresponds to the optimum reachable under the combination of the given Θ, as well as the chosen optimization algorithm.

The previous results show that an input with the AB^nC property is well-behaved under Algorithm 1. Specifically, they show dependence of their time and error bounds on the input parameters s and ǫ. Theorem 1 suggests that this problem admits an FPTAS, and Theorem 4 crystallizes this observation for a subproblem of ose.

Theorem 4.
Let ose-strong-abnc be a subproblem of ose that only admits instances with the strong AB^nC property. Assume that ê(f; W; ξ) is µ-strongly convex for all instances. Then Algorithm 1 is an FPTAS for ose-strong-abnc, for any choice of optimizer.

Proof. By Theorem 2, the running time of our algorithm is polynomial in both the size of the input and in 1/ǫ. From Corollary 2 we obtain the desired approximation ratio. Given the convexity assumptions, the approximation ratio applies to the global optimal solution which would be obtained via an exhaustive search. By definition, this algorithm is an FPTAS for ose-strong-abnc.

Remark 4.
By Theorem 5, lifting the convexity condition from Theorem 4 would not yield the same global optimization guarantees. However, if we are to consider solely the set of instances reachable by the optimizer's algorithm as the optimal solution space, the results from Theorem 4 still hold without the need to assume µ-strong convexity.

5 Discussion

We presented a formal analysis of the OSE problem, as well as an FPTAS that works for a large class of neural network architectures. The time complexity of Algorithm 1 can be further improved by using a strongly convex loss (Rakhlin et al., 2012) or by employing optimizers other than SGD. A popular choice for the latter is ADAM (Kingma and Ba, 2014), which is known to yield faster convergence guarantees for some convex optimization problems, but not necessarily all (Reddi et al., 2018). Regardless, it is also well-known that SGD finds solutions that are highly generalizable (Kleinberg et al., 2018; Dinh et al., 2017).

In the same vein, tighter approximation bounds can be achieved even for cases where only the weak AB^nC property is present, as one could induce an ordering on the error by making it a convex combination of the candidate's surrogate error and of the maximum point's. This, however, implies that one must train the maximum point to convergence, impacting the runtime of the algorithm by a non-negligible factor which may or may not be recovered in an amortized context.

We began this paper by indicating that obtaining a smaller network with the OSE objective would be more efficient in terms of speed, size, and environmental impact; yet, the reliance of our algorithm on a maximum point on Ξ appears to contradict that. Such a model is, by design, an analysis tool, and does not need to be trained. The optimality of the W-coefficient, as well as the other results from this paper, still hold without it.

Finally, the keen reader will have noticed that we alternated between the computational complexity analysis of the hardness of training a neural network, and its statistical counterpart; we were able to bridge such a gap by treating the training process as a black box and imposing simple convergence constraints. A more in-depth analysis of the differences between both approaches, as well as further hardness results on learning half-spaces, is done by Daniely et al. (2014). This technique proved itself helpful in yielding the results around the convergence and time complexity guarantees of the work in our paper. We hope that it can be further applied to other deep learning problems to obtain algorithms with well-understood performances and limitations.

Acknowledgments
The author would like to thank D. Perry for his detailed questions about this work, which ultimately led to a more generalizable approach. Special thanks are in order for B. d'Iverno and Q. Wang, whose helpful conversations about the approximation ratio were crucial to the analysis.

Bibliography
Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. 2016. Understanding deep neural networks with rectified linear units. CoRR, abs/1611.01491.
Sanjeev Arora, László Babai, Jacques Stern, and Z. Sweedyk. 1993. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science, pages 724–733.
Peter Bartlett and Shai Ben-David. 2002. Hardness results for neural network approximation problems. In Theoretical Computer Science, volume 284, pages 53–66.
Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. 2006. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
Avrim Blum and Ronald L. Rivest. 1988. Training a 3-node neural network is NP-complete. In Proceedings of the 1st International Conference on Neural Information Processing Systems, NIPS'88, pages 494–501, Cambridge, MA, USA. MIT Press.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541.
Gail A. Carpenter and Stephen Grossberg. 1987. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37:54–115.
Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. 2017. Dual path networks. NIPS.
Andrew W. Cross, Lev S. Bishop, Sarah Sheldon, Paul D. Nation, and Jay M. Gambetta. 2019. Validating quantum computers using randomized model circuits. Phys. Rev. A, 100:032328.
Amit Daniely. 2016. Complexity theoretic limitations on learning halfspaces. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC '16, pages 105–117, New York, NY, USA. Association for Computing Machinery.
Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. 2014. From average case complexity to improper learning complexity. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 441–448, New York, NY, USA. Association for Computing Machinery.
Amit Daniely and Shai Shalev-Shwartz. 2016. Complexity theoretic limitations on learning DNF's. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 815–830, Columbia University, New York, New York, USA. PMLR.
Bhaskar DasGupta, Hava T. Siegelmann, and Eduardo Sontag. 1995. On the complexity of training neural networks with continuous activation functions. IEEE Transactions on Neural Networks, 6(6):1490–1504.
Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. 2017. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1019–1028. JMLR.org.
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. Journal of Machine Learning Research, 20:1–21.
Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations.
Saeed Ghadimi and Guanghui Lan. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23:2341–2368.
Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, page 12. The Johns Hopkins University Press, Baltimore.
Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors. 2018. Automated Machine Learning: Methods, Systems, Challenges. Springer. In press, available at http://automl.org/book.
J. Stephen Judd. 1990. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA, USA.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Bobby Kleinberg, Yuanzhi Li, and Yang Yuan. 2018. An alternative view: When does SGD escape local minima? In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2698–2707, Stockholmsmässan, Stockholm, Sweden. PMLR.
Adam R. Klivans and Alexander A. Sherstov. 2006. Cryptographic hardness for learning intersections of halfspaces. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS '06, pages 553–562, USA. IEEE Computer Society.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.
Liam Li, Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. 2020. Geometry-aware gradient algorithms for neural architecture search.
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019a. DARTS: Differentiable architecture search. International Conference on Learning Representations.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. 2014. On the computational efficiency of training neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 855–863, Cambridge, MA, USA. MIT Press.
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning convolutional neural networks for resource efficient transfer learning. In International Conference on Learning Representations.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2020. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations.
Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, and Marten van Dijk. 2020. A unified convergence analysis for shuffling-type gradient methods.
XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. 2009. On surrogate loss functions and f-divergences. The Annals of Statistics, 37(2):876–904.
Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. 2012. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1571–1578, Madison, WI, USA. Omnipress.
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2018. Regularized evolution for image classifier architecture search. ICML AutoML Workshop.
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the convergence of Adam and beyond. In International Conference on Learning Representations.
Alex Renda, Jonathan Frankle, and Michael Carbin. 2020. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations.
Corby Rosset. 2020. Turing-NLG: A 17-billion-parameter language model by Microsoft. Online; published February 13, 2020, accessed May 4, 2020.
J. David Schaffer, Richard A. Caruana, and Larry J. Eshelman. 1990. Using genetic search to exploit the emergent behavior of neural networks. Physica D, 42:244–248.
Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. 2009. Stochastic convex optimization. In COLT.
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Adrian de Wynter. 2019. On the bounds of function approximations. In I. Tetko, V. Kůrková, P. Karpov, and F. Theis, editors, Artificial Neural Networks and Machine Learning – ICANN 2019: Theoretical Neural Computation, volume 11727 of Lecture Notes in Computer Science, pages 401–417. Springer Cham, Heidelberg.
Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. 2018a. Adaptive methods for nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9793–9803. Curran Associates, Inc.
Manzil Zaheer, Sashank J. Reddi, Devendra Singh Sachan, Satyen Kale, and Sanjiv Kumar. 2018b. Adaptive methods for nonconvex optimization. In NeurIPS.
Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Ali Jadbabaie, and Suvrit Sra. 2020. On complexity of finding stationary points of nonsmooth nonconvex functions.
Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578.
Appendices

A Proof of Lemma 1
We can prove Lemma 1 by leveraging previous results from the literature. Our (admittedly simple) argument relies on noting that training a neural network is equivalent to our problem when Ξ = { ξ }. Said problem is known to be computationally hard in both its decision (Bartlett and Ben-David, 2002) and optimization (Shalev-Shwartz and Ben-David, 2014) versions, and has been formulated often. We will refer to its decision version as nn-training-dec, and, for convenience, we reproduce it below with our notation:
Definition 12 (nn-training-dec). Given a dataset sampled i.i.d. from an unknown probability distribution D = {⟨x_i, y_i⟩}_{i=1,...,m}, a set of possible weight assignments W, an architecture f(·; ·; ξ), and a number k, the nn-training-dec problem requires finding a valid combination of parameters W∗ ∈ W such that

    e(f(·; W∗; ξ), D) ≤ k.    (47)
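As an aside, the inequality in Eq. (47) is inexpensive to verify once a candidate W∗ is in hand; the hardness of nn-training-dec lies in finding such a W∗. The fragment below is a schematic rendering of that check, assuming the trained network is available as a plain callable and that e is the empirical 0-1 error; the names error and check_nn_training_dec are ours, for illustration only.

```python
# Schematic check of Eq. (47); names and types are illustrative only.
from typing import Any, Callable, Iterable, Tuple

def error(f: Callable[[Any], Any], D: Iterable[Tuple[Any, Any]]) -> float:
    """Empirical 0-1 error of a trained network f over the dataset D."""
    samples = list(D)
    return sum(f(x) != y for x, y in samples) / len(samples)

def check_nn_training_dec(f: Callable[[Any], Any],
                          D: Iterable[Tuple[Any, Any]],
                          k: float) -> bool:
    """Accept iff the candidate weight assignment (baked into f) meets Eq. (47)."""
    return error(f, D) <= k
```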
Theorem 5. For arbitrary W, ξ, D, and k, nn-training-dec is NP-hard.

Proof. Refer to Section 1.2 for proofs of multiple instances of this problem.

We can now prove Lemma 1:
Proof.