AdaS: Adaptive Scheduling of Stochastic Gradients
Mahdi S. Hosseini and Konstantinos N. Plataniotis
University of Toronto, The Edward S. Rogers Sr. Department of Electrical & Computer Engineering, Toronto, Ontario, M5S 3G4, Canada. [email protected]
Abstract
The choice of step-size used in Stochastic Gradient Descent (SGD) optimization is empirically selected in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to tune the step-size requires extensive practical experience–offering limited insight into how the parameters update–and is not consistent across applications. This work attempts to answer a question of interest to both researchers and practitioners, namely “how much knowledge is gained in iterative training of deep neural networks?”
Answering this question introduces two useful metrics derived from the singular values of the low-rank factorization of convolution layers in deep neural networks. We introduce the notions of “knowledge gain” and “mapping condition” and propose a new algorithm called Adaptive Scheduling (AdaS) that utilizes these derived metrics to adapt the SGD learning rate proportionally to the rate of change in knowledge gain over successive iterations. Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training. Code is available at https://github.com/mahdihosseini/AdaS.

1 Introduction

Stochastic Gradient Descent (SGD), a first-order optimization method [26, 3, 4, 29], has become the mainstream method for training over-parametrized models such as deep neural networks [17, 8]. Attempting to augment this method, SGD with momentum [24, 34] accumulates the historically aligned gradients, which helps in navigating past ravines and towards a more optimal solution. It eventually converges faster and exhibits better generalization compared to vanilla SGD. However, as the step-size (aka global learning rate) is mainly fixed for momentum SGD, it blindly follows these past gradients and can eventually overshoot an optimum and cause oscillatory behavior. From a practical standpoint (e.g. in the context of deep neural networks [2, 28, 17]) it is even more concerning to deploy a fixed global learning rate, as it often leads to poorer convergence, requires extensive tuning, and exhibits strong performance fluctuations over the selection range.

A handful of methods have been introduced over the past decade to solve the latter issues based on adaptive gradient methods [7, 36, 40, 13, 6, 25, 18, 20, 1, 27, 37, 22]. These methods can be represented in the general form

$$\Phi_k \leftarrow \Phi_{k-1} - \frac{\eta_k}{\psi(g_1,\cdots,g_k)}\,\varphi(g_1,\cdots,g_k), \quad (1)$$

where, for some kth iteration, g_i is the stochastic gradient obtained at the ith iteration, φ(g_1,⋯,g_k) is the gradient estimation, and η_k/ψ(g_1,⋯,g_k) is the adaptive learning rate, where ψ(g_1,⋯,g_k) generally relates to the square of the gradients. Each adaptive method therefore attempts to modify the gradient estimation (through the use of momentum) or the adaptive learning rate (through a different choice of ψ(g_1,⋯,g_k)). Furthermore, it is also common to subject η_k to a manually set schedule for more optimal performance and better theoretical convergence guarantees.

Such methods were first introduced in [7] (AdaGrad) by regulating the update size with the accumulated second-order statistical measures of gradients, which provides a robust framework for sparse gradient updates. The issue of vanishing learning rate caused by equally weighted accumulation of gradients is the main drawback of AdaGrad that was raised in [36] (RMSProp), which utilizes the exponential decaying average of gradients instead of accumulation. A variant of first-order gradient measures was also introduced in [40] (AdaDelta), which solves the decaying learning rate problem using an accumulation window, providing a robust framework toward hyper-parameter tuning issues. The adaptive moment estimation in [13] (AdaM) was introduced later to leverage both first and second moment measures of gradients for parameter updating.
AdaM can be seen as the combination of all three adaptive optimizers: AdaGrad, RMSProp and AdaDelta–solving the vanishing learning rate problem and offering a more optimal adaptive learning rate to improve convergence speed and generalization capabilities. Further improvements were made on AdaM using Nesterov momentum [6], long-term memory of past gradients [25], rectified estimations [18], dynamic bound of learning rate [20], the hyper-gradient descent method [1], and loss-based step-size [27]. Methods based on line-search techniques [37] and coin betting [22] are also introduced to avoid bottlenecks caused by the hyper-parameter tuning issues in SGD.

The AdaM optimizer, as well as its other variants, has attracted many practitioners in deep learning for two main reasons: (1) it requires minimal hyper-parameter tuning effort; and (2) it provides an efficient convergence optimization framework. Despite the ease of implementation of such optimizers, there is a growing concern about their poor “generalization” capabilities. They perform well on the given-samples, i.e. training data (at times even better performance can be achieved compared to non-adaptive methods such as in [19, 32, 33]), but perform poorly on the out-of-samples, i.e. test/evaluation data [38]. Despite various research efforts taken for adaptive learning methods, the non-adaptive SGD-based optimizers (such as scheduled learning methods including Warmup techniques [19], Cyclical-Learning [32, 33], and Step-Decaying [8]) are still considered to be the golden frameworks to achieve better performance, at the price of either more epochs for training and/or costly tuning for optimal hyper-parameter configurations given different datasets and models.

Our goal in this paper is twofold: (1) we address the above issues by proposing a new approach for adaptive methods in SGD optimization; and (2) we introduce new probing metrics that enable the monitoring and evaluation of the quality of learning within layers of a Convolutional Neural Network (CNN). Unlike the general trend in most adaptive methods, where the raw measurements from gradients are utilized to adapt the step-size and regulate the gradients (through different choices of adaptive learning rate or gradient estimation), we take a different approach and focus our efforts on the scheduling of the learning rate η_k independently for each convolutional block. Specifically, we first ask “how much of the gradients are useful for SGD updates?” and then translate this into a new concept we call the “knowledge gain”, which is measured from the energy of the low-rank factorization of convolution weights in deep layers. The knowledge gain defines the usefulness of gradients and adapts the next step-size η_k for SGD updates. We summarize our contributions as follows:

1. The new concepts of “knowledge gain” and “mapping condition” are introduced to measure the quality of convolution weights that can be used in iterative training and to provide an answer to these questions: how well are the layers trained given a certain number of epoch steps? Is there enough information obtained via the sequence of updates?

2. We propose a new adaptive scheduling algorithm for SGD called “AdaS” which introduces minimal computational overhead over vanilla SGD and guarantees the increase of knowledge gain over consecutive epochs. AdaS adaptively schedules η_k for every conv block and both generalizes well and outperforms previous adaptive methods, e.g. AdaM.
A pitching factor called the gain-factor is tuned in AdaS to trade off between fast convergence and greedy performance. Code is available at https://github.com/mahdihosseini/AdaS.

3. Thorough experiments are conducted for image classification problems using various datasets and CNN models. We adopt different optimizers and compare their convergence speed and generalization characteristics to our AdaS optimizer.

4. A new probing tool based on knowledge gain and mapping condition is introduced to measure the quality of network training without requiring test/evaluation data. We investigate the relationship between our new quality metrics and performance results.

2 Knowledge Gain in CNN Training
Central to our work is the notion of knowledge gain measured from the convolutional weights of CNNs. Consider the convolutional weights of a particular layer in a CNN defined by a four-way array (aka fourth-order tensor) Φ ∈ R^{N1×N2×N3×N4}, where N1 and N2 are the height and width of the convolutional kernels, and N3 and N4 correspond to the number of input and output channels, respectively. The feature mapping under this convolution operation follows F_O(:,:,ℓ') = Σ_{ℓ=1}^{N3} F_I(:,:,ℓ) ∗ Φ(:,:,ℓ,ℓ'), where F_I and F_O are the input and output feature maps stacked in 3D volumes, and ℓ' ∈ {1, …, N4} is the output index. The well-posedness of this feature mapping can be studied by the generalized spectral decomposition (i.e. SVD) form of the tensor arrays using the Tucker model [14, 30] in full-core tensor mode

$$\Phi = \sum_{\ell_1=1}^{N_1}\sum_{\ell_2=1}^{N_2}\sum_{\ell_3=1}^{N_3}\sum_{\ell_4=1}^{N_4} G(\ell_1,\ell_2,\ell_3,\ell_4)\; u_{\ell_1}\circ u_{\ell_2}\circ u_{\ell_3}\circ u_{\ell_4}, \quad (2)$$

where the core G (containing the singular values) is called an (N1, N2, N3, N4)-tensor, u_{ℓ_i} ∈ R^{N_i} is the factor basis for decomposition, and ∘ is the outer product operation. We use similar notations as in [30] for brevity. Note that Φ can be at most of rank (N1, N2, N3, N4).

The tensor array in (2) is (usually) initialized by random noise sampling for CNN training, such that the mapping under this tensor randomly spans the output dimensions (i.e. the diffusion of knowledge is fully random in the beginning with no learned structure). Throughout an iterative training framework, more knowledge is gained, lying in the tensor space as a mixture of a low-rank manifold and perturbing noise. Therefore, it makes sense to decompose (factorize) the observed tensor within each layer of the CNN as Φ = Φ̂ + E. This decomposes the observed tensor array into a low-rank tensor Φ̂ containing the small-core tensor such that the error residual E = ||Φ − Φ̂||_F is minimized. A similar framework is also used in CNN compression [16, 35, 12, 39]. A handful of techniques (e.g. CP/PARAFAC, TT, HT, truncated-MLSVD, Compression) can be found in [14, 23, 9, 30] to estimate such a small-core tensor. The majority of these solutions are iterative and we therefore take a more careful consideration toward such low-rank decomposition.

An equivalent representation of the tensor decomposition (2) is the vector form x ≜ vec(Φ) = (U_4 ⊗ U_3 ⊗ U_2 ⊗ U_1) g, where vec(·) is a vector obtained by stacking all tensor elements column-wise, g = vec(G), ⊗ is the Kronecker product, and U_i is a factor matrix containing all bases u_{ℓ_i} stacked in column form. Since we are interested in the input and output channels of CNNs for decomposition, we use mode-3 and mode-4 vector expressions, yielding two matrices

$$\Phi_3 = (U_4\otimes U_2\otimes U_1)\,G_3\,U_3^T \quad\text{and}\quad \Phi_4 = (U_3\otimes U_2\otimes U_1)\,G_4\,U_4^T, \quad (3)$$

where Φ_3 ∈ R^{N1N2N4×N3}, Φ_4 ∈ R^{N1N2N3×N4}, and G_3 and G_4 are likewise reshaped forms of the core tensor G. The tensor decomposition (3) is the equivalent representation of (2) decomposed at mode-3 and mode-4. Recall the matrix (two-way array) decomposition, e.g. SVD, such that Φ_3 = UΣV^T where U ≡ U_4 ⊗ U_2 ⊗ U_1, V ≡ U_3, and Σ ≡ G_3 [30].
In other words, to decompose a tensor on a given mode, we first unfold the tensor (on the given mode) and then apply a decomposition method of interest such as SVD.

The presence of noise, however, is still a barrier to a better understanding of the latter reshaped forms. Similar to [16], we revise our goal into the low-rank matrix factorizations Φ_3 = Φ̂_3 + E_3 and Φ_4 = Φ̂_4 + E_4, where a global analytical solution is given by the Variational Bayesian Matrix Factorization (VBMF) technique in [21] as a re-weighted SVD of the observation matrix. This method avoids unnecessarily implementing an iterative algorithm.

Using the above decomposition framework, we introduce the following two definitions.

Definition 1. (Knowledge Gain). For convolutional weights in deep CNNs, define the knowledge gain across a particular channel (i.e. the d-th dimension) as

$$G_{d,p}(\Phi) = \frac{1}{N_d\cdot\sigma_1^p(\hat\Phi_d)}\sum_{i=1}^{N'_d}\sigma_i^p(\hat\Phi_d), \quad (4)$$

where σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_{N'_d} are the low-rank singular values of a single-channel convolutional weight in descending order, N'_d = rank{Φ̂_d}, d stands for the dimension index, and p ∈ {1, 2}. The knowledge gain in (4) is in fact a direct measure of the norm energy of the factorized matrix.
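As a concrete illustration of the mode-3 and mode-4 unfoldings in (3), the following minimal sketch matricizes a convolution weight stored in the PyTorch layout (N4, N3, N1, N2) = (output channels, input channels, kernel height, kernel width); since only singular values are needed downstream, the particular row ordering of the unfolded matrices is immaterial. The EVBMF step of [21] would then be applied to these two matrices (the layer sizes below are arbitrary).

```python
import torch

def unfold_conv_weight(weight: torch.Tensor):
    """Mode-3 and mode-4 matricizations of a conv tensor, cf. Eq. (3)."""
    n4, n3, n1, n2 = weight.shape               # PyTorch layout: (N4, N3, N1, N2)
    phi3 = weight.permute(0, 2, 3, 1).reshape(-1, n3)   # (N1*N2*N4) x N3
    phi4 = weight.permute(1, 2, 3, 0).reshape(-1, n4)   # (N1*N2*N3) x N4
    return phi3, phi4

# example: a randomly initialized 3x3 convolution with 16 -> 32 channels
w = torch.nn.Conv2d(16, 32, kernel_size=3).weight.detach()
phi3, phi4 = unfold_conv_weight(w)
print(phi3.shape, phi4.shape)   # torch.Size([288, 16]) torch.Size([144, 32])
```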
Remark 1. Recall for p = 2 that the summation of squared singular values from Definition 1 is equivalent to the (squared) Frobenius norm, i.e. Σ_{i=1}^{N'_d} σ_i^2(Φ̂_d) = Tr{Φ̂_d^T Φ̂_d} = ||Φ̂_d||_F^2 [11]. Also, for p = 1 the summation of singular values is bounded between ||Φ̂_d||_F ≤ Σ_{i=1}^{N'_d} σ_i(Φ̂_d) ≤ √(N'_d) ||Φ̂_d||_F.

These energies indicate the distance measure from matrix separability obtained from the low-rank factorization (similar to the index of inseparability in neurophysiology [5]). In other words, they measure the space spanned by the low-rank structure. The division factor N_d in (4) also normalizes the gain to G_{d,p} ∈ [0, 1] as a fraction of channel capacity. In this study we are mainly interested in the third- and fourth-dimension measures (i.e. d ∈ {3, 4}).

Definition 2. (Mapping Condition). For convolutional weights in deep CNNs, define the mapping condition across a particular channel (i.e. the d-th dimension) as

$$\kappa_d(\Phi) = \sigma_1(\hat\Phi_d)\,/\,\sigma_{N'_d}(\hat\Phi_d), \quad (5)$$

where σ_1 and σ_{N'_d} are the maximum and minimum low-rank singular values of a single-channel convolutional weight, respectively.

Recall the matrix-vector calculation form mapping an input vector into an output vector, whose numerical stability is defined by the matrix condition number as the ratio of the maximum to the minimum singular value [11]. The convolution operations in CNNs follow a similar concept by mapping input feature images into output features. Accordingly, the mapping condition of the convolutional layers in a CNN is defined by (5) as a direct measurement of the condition number of the low-rank factorized matrices, indicating the well-posedness of the convolution operation.
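The two metrics can then be read off the singular values of each unfolding. The sketch below follows Definitions 1 and 2 directly; a simple relative threshold on the SVD spectrum stands in for the EVBMF factorization of [21], so the printed values (for a random, untrained layer) are illustrative only, and the helper names and tolerance are assumptions of this sketch.

```python
import torch

def knowledge_gain_and_condition(phi_d, n_d, p=1, rel_tol=1e-3):
    """Knowledge gain G_{d,p} (Eq. 4) and mapping condition kappa_d (Eq. 5)
    of one unfolding; SVD thresholding replaces EVBMF for simplicity."""
    s = torch.linalg.svdvals(phi_d)      # singular values, descending
    s = s[s > rel_tol * s[0]]            # crude low-rank truncation
    gain = (s ** p).sum() / (n_d * s[0] ** p)
    return gain.item(), (s[0] / s[-1]).item()

w = torch.nn.Conv2d(16, 32, kernel_size=3).weight.detach()   # (N4, N3, N1, N2)
phi3 = w.permute(0, 2, 3, 1).reshape(-1, w.shape[1])         # mode-3 unfolding
phi4 = w.permute(1, 2, 3, 0).reshape(-1, w.shape[0])         # mode-4 unfolding
g3, k3 = knowledge_gain_and_condition(phi3, n_d=phi3.shape[1])
g4, k4 = knowledge_gain_and_condition(phi4, n_d=phi4.shape[1])
print("G =", (g3 + g4) / 2, "kappa =", (k3 + k4) / 2)        # per-layer averages
```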
3 Adaptive Scheduling (AdaS) for SGD

As an optimization method in deep learning, SGD typically attempts to minimize the loss functions of large networks [3, 4, 17, 8]. Consider the updates on the convolutional weights Φ_k using this optimization:

$$\Phi_k \leftarrow \Phi_{k-1} - \eta_k\,\nabla\tilde f_k(\Phi_{k-1}) \quad\text{for}\quad k\in\{(t-1)K+1,\cdots,tK\}, \quad (6)$$

where t and K correspond to the epoch index and the number of mini-batches, respectively, ∇f̃_k(Φ_{k−1}) = (1/|Ω_k|) Σ_{i∈Ω_k} ∇f_i(Φ_{k−1}) is the average stochastic gradient on the kth mini-batch randomly selected from a batch of n samples Ω_k ⊂ {1, ⋯, n}, and η_k defines the step-size taken toward the opposite direction of the average gradients. The selection of the step-size η_k can be either adaptive with respect to the statistical measure from gradients [7, 40, 36, 13, 6, 25, 20, 37, 18] or could be subject to change in different scheduled learning regimes [19, 32, 33].

In the scheduled learning rate method, the step-size is usually fixed for every tth epoch (i.e. for all K mini-batch updates) and changes according to the schedule assignment for the next epoch (i.e. η_k ≡ η(t)). We set up our problem by accumulating all observed gradients throughout the K mini-batch updates within the tth epoch:

$$\Phi_{k_b} = \Phi_{k_a} - \eta(t)\sum_{k=k_a}^{k_b}\nabla\tilde f_k(\Phi_k), \quad (7)$$

where k_a = (t−1)K + 1 and k_b = tK. Note that the significance of the updates in (7) from the k_a-th to the k_b-th iteration is controlled by the step-size η(t), which directly impacts the rate of the knowledge gain. Here we provide satisfying conditions on the step-size for increasing the knowledge gain in SGD.

Theorem 1. (Increasing Knowledge Gain for SGD). Using the knowledge gain from Definition 1 and setting the step-size of Stochastic Gradient Descent (SGD) proportionate to

$$\eta = \zeta\,\big[G(\Phi_{k_b}) - G(\Phi_{k_a})\big] \quad (8)$$

will guarantee the monotonic increase of the knowledge gain, i.e. G(Φ_{k_b}) ≥ G(Φ_{k_a}), for some existing lower bound on η and some ζ ≥ 0.

We formulate the update rule for AdaS using SGD with momentum as follows:

$$\eta(t,\ell) \leftarrow \beta\cdot\eta(t-1,\ell) + \zeta\cdot\big[\bar G(t,\ell) - \bar G(t-1,\ell)\big] \quad (9)$$
$$v_\ell^k \leftarrow \alpha\cdot v_\ell^{k-1} - \eta(t,\ell)\cdot g_\ell^k \quad (10)$$
$$\theta_\ell^k \leftarrow \theta_\ell^{k-1} + v_\ell^k \quad (11)$$

where k is the current mini-batch, t is the current epoch iteration, ℓ is the conv block index, Ḡ(·) is the average knowledge gain obtained from both mode-3 and mode-4 decompositions, v is the velocity term, and θ are the learnable parameters.

The pseudo-code for our proposed algorithm AdaS is presented in Algorithm 1. Each convolution block in the CNN is assigned an index {ℓ}_{ℓ=1}^{L} where all learnable parameters (e.g. conv, biases, batch-norms, etc.) are called using this index. The goal in AdaS is firstly to call back the convolutional weights within each block, secondly to apply low-rank matrix factorization on the unfolded tensors, and finally to approximate the overall knowledge gain G_ℓ^t and mapping condition κ_ℓ^t. The approximation is done once every epoch and introduces minimal computational overhead over the rest of the optimization framework. The learning rate is computed relative to the rate of change in knowledge gain over two consecutive epoch updates (from previous to current). The learning rate η(t,ℓ) is then further updated by an exponential moving average called the gain-factor, with hyper-parameter β, to accumulate the history of knowledge gain over the sequence of epochs.
In effect, β controls the tradeoff between convergence speed and training accuracy of AdaS. An ablative study on the effect of this parameter is provided in Appendix-B. The computed step-sizes for all conv blocks are then passed through the SGD optimization framework for adaptation. Note that the same step-size is used within each block for all learnable parameters. Code is available at https://github.com/mahdihosseini/AdaS.
Algorithm 1: Adaptive Scheduling (AdaS) for SGD with Momentum
Require: batch size n, number of epochs T, number of conv blocks L, initial step-sizes {η(0,ℓ)}_{ℓ=1}^{L}, initial momentum vectors {v_ℓ^0}_{ℓ=1}^{L}, initial parameter vectors {θ_ℓ^0}_{ℓ=1}^{L}, SGD momentum rate α ∈ [0,1], AdaS gain factor β ∈ [0,1], knowledge gain hyper-parameter ζ = 1, minimum learning rate η_min > 0
for t = 1 : T do
    for ℓ = 1 : L do
        1. unfold tensors using (3): Φ_3 ← mode-3(Φ_ℓ^t) and Φ_4 ← mode-4(Φ_ℓ^t)
        2. apply low-rank factorization [21]: Φ̂_3 ← EVBMF(Φ_3) and Φ̂_4 ← EVBMF(Φ_4)
        3. compute average knowledge gain using (4): Ḡ(t,ℓ) ← [G_{3,p}(Φ̂_3) + G_{4,p}(Φ̂_4)]/2
        4. compute average mapping condition using (5): κ̄(t,ℓ) ← [κ_3(Φ̂_3) + κ_4(Φ̂_4)]/2
        5. compute step momentum: η(t,ℓ) ← β·η(t−1,ℓ) + ζ·[Ḡ(t,ℓ) − Ḡ(t−1,ℓ)]
        6. lower bound the learning rate: η(t,ℓ) ← max(η(t,ℓ), η_min)
    end
    randomly shuffle dataset, generate K mini-batches {Ω_k ⊂ {1,⋯,n}}_{k=1}^{K}
    for k = (t−1)K+1 : tK do
        1. compute gradient: g_ℓ^k ← (1/|Ω_k|) Σ_{i∈Ω_k} ∇_Φ f((x^(i), y^(i)); Φ_ℓ^{k−1}), ℓ ∈ {1,⋯,L}
        2. compute the velocity term: v_ℓ^k ← α·v_ℓ^{k−1} − η(t,ℓ)·g_ℓ^k
        3. apply update: θ_ℓ^k ← θ_ℓ^{k−1} + v_ℓ^k
    end
end
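The per-epoch portion of Algorithm 1 (steps 1–6) can be sketched in a few lines: recompute the average knowledge gain of every conv block, update its step-size by Eq. (9), and clamp it at η_min. As in the earlier sketches, SVD thresholding stands in for the EVBMF step, and the function and argument names (beta, zeta, lr_min) are assumptions of this sketch rather than the interface of the released code.

```python
import torch

def avg_knowledge_gain(weight, p=1, rel_tol=1e-3):
    """Average of the mode-3 and mode-4 knowledge gains (Eq. 4) of one conv
    block; SVD thresholding stands in for EVBMF."""
    gains = []
    for phi in (weight.permute(0, 2, 3, 1).reshape(-1, weight.shape[1]),
                weight.permute(1, 2, 3, 0).reshape(-1, weight.shape[0])):
        s = torch.linalg.svdvals(phi)
        s = s[s > rel_tol * s[0]]
        gains.append(((s ** p).sum() / (phi.shape[1] * s[0] ** p)).item())
    return sum(gains) / 2

def adas_step_sizes(conv_weights, lr_prev, gain_prev, beta=0.9, zeta=1.0,
                    lr_min=1e-4):
    """One per-epoch pass of Algorithm 1, steps 1-6 (Eq. 9 plus lower bound).

    `conv_weights` holds one tensor per conv block; `lr_prev` and `gain_prev`
    carry the step-sizes and knowledge gains from the previous epoch.
    """
    lr_new, gain_new = [], []
    for w, lr, g_old in zip(conv_weights, lr_prev, gain_prev):
        g_new = avg_knowledge_gain(w.detach())
        lr = beta * lr + zeta * (g_new - g_old)     # Eq. (9)
        lr_new.append(max(lr, lr_min))              # step 6: lower bound
        gain_new.append(g_new)
    return lr_new, gain_new
```

The returned per-block step-sizes would then be written into the corresponding SGD parameter groups before the epoch's mini-batch loop; Eqs. (10)–(11) are ordinary momentum-SGD updates and need no modification.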
4 Experiments

We compare our AdaS algorithm to several adaptive and non-adaptive optimizers in the context of image classification. In particular, we implement AdaS with SGD with momentum; four adaptive methods, i.e. AdaGrad [7], RMSProp [36], AdaM [13], and AdaBound [20]; and two non-adaptive momentum SGDs guided by scheduled learning techniques, i.e. OneCycleLR (also known as the super-convergence method) [33] and SGD with StepLR (step decaying) [8]. We further investigate the dependence of CNN training quality on knowledge gain and mapping condition and provide useful insights into the usefulness of different optimizers for training different models. For details on ablative studies and a complete set of experiments, please refer to Appendix-B.
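For orientation, the non-AdaS baselines in this comparison correspond to standard PyTorch optimizers and schedulers; a rough instantiation is sketched below. The learning rates, momentum value, and schedule lengths are placeholders rather than the tuned settings described next, and AdaBound (a third-party package) is omitted.

```python
import torch
from torch.optim import SGD, Adam, Adagrad, RMSprop
from torch.optim.lr_scheduler import StepLR, OneCycleLR

model = torch.nn.Conv2d(3, 16, 3)            # tiny stand-in for VGG16 / ResNet34
params = list(model.parameters())
lr0 = 1e-2                                   # placeholder initial step-size

adaptive_baselines = {                        # in practice, one per training run
    "AdaGrad": Adagrad(params, lr=lr0),
    "RMSProp": RMSprop(params, lr=lr0),
    "AdaM":    Adam(params, lr=lr0),
}

sgd_step = SGD(params, lr=lr0, momentum=0.9)
step_lr = StepLR(sgd_step, step_size=25, gamma=0.5)       # placeholder schedule

sgd_cycle = SGD(params, lr=lr0, momentum=0.9)
one_cycle = OneCycleLR(sgd_cycle, max_lr=lr0,             # placeholder cycle length
                       epochs=250, steps_per_epoch=391)   # 391 = ceil(50000 / 128)
```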
We investigate the efficacy of AdaS with respect to variations in the number of deep layers using VGG16 [31] and ResNet34 [10] and the number of classes using the standard CIFAR-10 and CIFAR-100 datasets [15] for training. The details of pre-processing steps, network implementation, and training/testing frameworks are adopted from the CIFAR GitHub repository (https://github.com/kuangliu/pytorch-cifar) using PyTorch. We set the initial learning rates of AdaGrad, RMSProp, and AdaBound to their suggested default values. We further followed the suggested tuning in [38] for AdaM and for SGD-StepLR (dropping the learning rate by half a magnitude on a fixed epoch schedule) to achieve the best performance, with separate initial learning rates for VGG16 and ResNet34. For SGD-1CycleLR we set the number of epochs for the whole cycle and found the best initial learning rate configuration for each network. To configure the best initial learning rate for AdaS, we performed a dense grid search, again with separate values for VGG16 and ResNet34. Despite the differences in optimal values that are independently obtained for each network, the optimizer performance is fairly robust relative to changes in these values. Each model is trained for a fixed number of epochs over multiple independent runs, and average test accuracy and training losses are reported. The mini-batch size is set to |Ω_k| = 128.
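The data side of this setup can be sketched as follows: CIFAR-10 via torchvision with the augmentation commonly used in the referenced CIFAR repository and the stated mini-batch size of 128. The normalization statistics are the usual CIFAR-10 values and, like the augmentation choices, should be read as assumptions of this sketch.

```python
import torch
import torchvision
import torchvision.transforms as T

cifar_mean, cifar_std = (0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),          # standard CIFAR augmentation
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(cifar_mean, cifar_std),
])
test_tf = T.Compose([T.ToTensor(), T.Normalize(cifar_mean, cifar_std)])

train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=train_tf)
test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True,
                                        transform=test_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,  # |Omega_k| = 128
                                           shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128,
                                          shuffle=False, num_workers=2)
```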
Figure 1: Training performance using different optimizers across two different datasets (CIFAR10 and CIFAR100) and two different CNNs (VGG16 and ResNet34). Panels (a)–(d) show test accuracy and panels (e)–(h) show training loss for VGG16/CIFAR10, ResNet34/CIFAR10, VGG16/CIFAR100, and ResNet34/CIFAR100, respectively.
We first empirically evaluate the effect of the gain-factor β on AdaS convergence by defining eight different grid values of β. The trade-off between the selection of different values of β is demonstrated in Figure 1 (the complete ablation study is provided in Appendix-B). A lower β translates to faster convergence, whereas setting it to higher values yields better final performance–at the cost of requiring more epochs for training. The performance comparison of optimizers is also overlaid in the same figure, where AdaS (with lower β) surpasses all adaptive and non-adaptive methods by a large margin in both test accuracy and training loss during the initial stages of training, whereas SGD-StepLR and AdaS (with higher β) eventually overtake the other methods with more training epochs. Furthermore, AdaGrad, RMSProp, AdaM, and AdaBound all achieve similar or sometimes even lower training losses compared to AdaS (including the two non-adaptive methods), but attain lower test accuracies. Such controversial results were also reported in [38], where adaptive optimizers generalize worse compared to non-adaptive methods. In retrospect, we claim here that AdaS solves this issue by generalizing better than other adaptive optimizers.

We further provide quantitative results on the convergence of all optimizers trained on ResNet34 in Table 1 with a fixed number of training epochs. The rank consistency of AdaS (using a low and a high gain factor β) over other optimizers is evident. For instance, AdaS with the lower β gains test accuracy over the second-best optimizer, AdaM, on CIFAR-100 under a fixed epoch budget.

Table 1: Image classification performance (test accuracy) of ResNet34 on CIFAR-10 and CIFAR-100 with fixed budget epochs. Four adaptive (AdaGrad, RMSProp, AdaM, AdaS) and one non-adaptive (SGD-StepLR) optimizers are deployed for comparison.
[Table 1 values: test accuracy (mean ± half confidence interval) at epoch budgets from 25 to 250, reported separately for CIFAR-10 and CIFAR-100, for AdaGrad, RMSProp, AdaM, SGD-StepLR, and AdaS with a low and a high gain-factor β.]
Figure 2: Evolution of knowledge gain versus mapping condition across different training epochs using ResNet34 on CIFAR10. Panels correspond to AdaGrad, RMSProp, AdaM, AdaBound, SGD-StepLR, SGD-1CycleLR, and AdaS with six different gain factors β. The transition of color shades corresponds to different convolution blocks. The transparency of the scatter plots corresponds to the epoch convergence–the higher transparency is inversely related to the training epoch. For complete results of different optimizers and models, please refer to Appendix-B.

4.3 Dependence of the Quality of Network Training on Knowledge Gain
Both concepts of knowledge gain G and mapping condition κ can be used to probe an intermediate layer of a CNN and quantify the quality of training with respect to different parameter settings. Such quantification does not require test or evaluation data: one can directly measure the “expected performance” of the network throughout the training updates. Our first observation is made by linking the knowledge gain measure to the relative success of each method in test accuracy. For instance, by raising the gain factor β in AdaS, the deeper layers of the CNN eventually gain further knowledge, as shown in Figure 2. This directly affects the success in test performance results. Also, deploying different optimizers yields different behavior in the knowledge gain obtained in different layers of the CNN. Table 2 lists all four numerical measurements of Test Accuracy, Training Loss, Knowledge Gain, and Mapping Condition for different optimizers. Note the rank-order correlation of knowledge gain and test accuracy. Although for both RMSProp and AdaM the knowledge gains are high, the mapping conditions are also high, which deteriorates the overall performance of the network.

Table 2: Performance of ResNet34 on the CIFAR10 dataset reported at the final training epoch for each optimizer. The average scores are reported for G and κ across all convolutional blocks. [Rows: Test Accuracy, Training Loss, Knowledge Gain (G), and Mapping Condition (κ); columns: AdaGrad, RMSProp, AdaM, AdaBound, SGD-1CycleLR, SGD-StepLR, and AdaS with four different gain-factor values.]
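The probing described in this subsection can be sketched as a short loop over the convolution layers of a network, reporting the average G and κ without touching any test data. A randomly initialized torchvision ResNet34 serves as a stand-in for a trained checkpoint here, and SVD thresholding again stands in for EVBMF, so the printed numbers are illustrative only.

```python
import torch
import torchvision

def layer_metrics(weight, p=1, rel_tol=1e-3):
    """Per-layer (G, kappa), averaged over mode-3 and mode-4 unfoldings."""
    gains, conds = [], []
    for phi in (weight.permute(0, 2, 3, 1).reshape(-1, weight.shape[1]),
                weight.permute(1, 2, 3, 0).reshape(-1, weight.shape[0])):
        s = torch.linalg.svdvals(phi)
        s = s[s > rel_tol * s[0]]
        gains.append(((s ** p).sum() / (phi.shape[1] * s[0] ** p)).item())
        conds.append((s[0] / s[-1]).item())
    return sum(gains) / 2, sum(conds) / 2

model = torchvision.models.resnet34(weights=None)   # stand-in for a trained net
per_layer = [layer_metrics(m.weight.detach())
             for m in model.modules() if isinstance(m, torch.nn.Conv2d)]
avg_G = sum(g for g, _ in per_layer) / len(per_layer)
avg_kappa = sum(k for _, k in per_layer) / len(per_layer)
print(f"average G = {avg_G:.3f}, average kappa = {avg_kappa:.2f}")
```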
Our second observation is made by studying the effect of the mapping condition and how it relates to the possible lack of generalizability of each optimizer. Although adaptive optimizers (e.g. RMSProp and AdaM) yield lower training loss, they over-fit perturbing features (mainly caused by incomplete second-order statistical measures, e.g. a diagonal Hessian approximation) and accordingly hamper their generalization [38]. We suspect this unwanted phenomenon is related to the mapping condition within CNN layers. In fact, a mixture of both the average κ and the average G can help to better realize how well each optimizer generalizes in training/testing evaluations.

We conclude by identifying that an ideal optimizer would yield G → 1 and κ → 1 across all layers within a CNN. We highlight that an increase in κ correlates with greater disentanglement between intermediate input and output layers, hampering the flow of information. Further, we identify that increases in knowledge gain strengthen the carriage of information through the network, which enables greater performance.

5 Conclusion

We have introduced a new adaptive method called AdaS to solve the issue of combined fast convergence and high precision performance of SGD in deep neural networks–all in a unified optimization framework. The method combines the low-rank approximation framework in each convolution layer, identifies how much knowledge is gained in the progression of epoch training, and adapts the SGD learning rate proportionally to the rate of change in knowledge gain. AdaS adds minimal computational overhead on the regular SGD algorithm and accordingly provides a well generalized framework to trade off between convergence speed and performance results. Furthermore, AdaS provides an optimization framework which suggests that validation data is no longer required and that the stopping criterion for training can be obtained directly from the training loss. Empirical evaluations reveal the possible existence of a lower bound on the SGD step-size that can monotonically increase the knowledge gain independently for each network convolution layer and accordingly improve the overall performance. AdaS is capable of significant improvements in generalization over traditional adaptive methods (i.e. AdaM) while maintaining their rapid convergence characteristics. We highlight that these improvements come through the application of AdaS to simple SGD with momentum. We further identify that since AdaS adaptively tunes the learning rates η(t,ℓ) independently for all convolutional blocks, it can be deployed with adaptive methods such as AdaM, replacing the traditional scheduling techniques. We postulate that such deployments of AdaS with adaptive gradient updates could introduce greater robustness to the initial learning rate choice and leave this exploration as future work. Finally, we emphasize that, without loss of generality, AdaS can be deployed on fully-connected networks, where the weight matrices can be directly fed into the low-rank factorization for metric evaluations.

Broader Impact
The content of this research is of broad interest to both researchers and practitioners of computer science and engineering who train deep learning models. The method proposed in this paper introduces a new optimization tool that can be adopted for training a variety of models such as Convolutional Neural Networks (CNNs). The proposed optimizer has strong generalizability: it offers both fast convergence and superior performance compared to existing off-the-shelf optimizers. The method further introduces a new concept that measures how well a CNN model is trained by probing different layers of the network and obtaining a quality measure for training. This metric can be of broad interest to computer scientists and engineers developing efficient models that can be tailored to specific applications and datasets.
References

[1] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In
International Conference on Learning Representa-tions , 2018.[2] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In
Neural networks: Tricks of the trade, pages 437–478. Springer, 2012. [3] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In
Proceedings of COMP-STAT’2010 , pages 177–186. Springer, 2010.[4] Léon Bottou. Stochastic gradient descent tricks. In
Neural networks: Tricks of the trade , pages 421–436.Springer, 2012.[5] Didier A Depireux, Jonathan Z Simon, David J Klein, and Shihab A Shamma. Spectro-temporal responsefield characterization with dynamic ripples in ferret primary auditory cortex.
Journal of neurophysiology ,85(3):1220–1234, 2001.[6] Timothy Dozat. Incorporating nesterov momentum into adam. 2016.[7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning andstochastic optimization.
Journal of Machine Learning Research , 12(61):2121–2159, 2011.[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep learning. MIT Press, Cambridge, MA, USA, 2016. [9] Lars Grasedyck, Daniel Kressner, and Christine Tobler. A literature survey of low-rank tensor approximation techniques.
GAMM-Mitteilungen , 36(1):53–78, 2013.[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.In
Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016.[11] Roger A Horn and Charles R Johnson.
Matrix analysis. Cambridge University Press, 2012. [12] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. 4th International Conference on Learning Representations, ICLR 2016. [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [14] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications.
SIAM review , 51(3):455–500,2009.[15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.[16] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-upconvolutional neural networks using fine-tuned cp-decomposition. jan 2015. 3rd International Conferenceon Learning Representations, ICLR 2015.[17] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature , 521(7553):436–444, 2015.[18] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and JiaweiHan. On the variance of the adaptive learning rate and beyond. In
International Conference on LearningRepresentations , 2020.[19] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. 2017. 5thInternational Conference on Learning Representations, ICLR 2017.[20] Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learningrate. In
International Conference on Learning Representations , 2019.[21] Shinichi Nakajima, Masashi Sugiyama, S Derin Babacan, and Ryota Tomioka. Global analytic solutionof fully-observed variational bayesian matrix factorization.
Journal of Machine Learning Research ,14(Jan):1–37, 2013.
[22] Francesco Orabona and Tatiana Tommasi. Training deep networks without learning rates through coin betting. In
Advances in Neural Information Processing Systems , pages 2160–2170, 2017.[23] Ivan V Oseledets. Tensor-train decomposition.
SIAM Journal on Scientific Computing , 33(5):2295–2317,2011.[24] Boris Polyak. Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4:1–17, 1964. [25] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In
InternationalConference on Learning Representations , 2018.[26] Herbert Robbins and Sutton Monro. A stochastic approximation method.
The annals of mathematicalstatistics , pages 400–407, 1951.[27] Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. In
Advances in Neural Information Processing Systems , pages 6433–6443, 2018.[28] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In
International Conference onMachine Learning , pages 343–351, 2013.[29] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic averagegradient.
Mathematical Programming , 162(1-2):83–112, 2017.[30] Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E Papalexakis, andChristos Faloutsos. Tensor decomposition for signal processing and machine learning.
IEEE Transactionson Signal Processing , 65(13):3551–3582, 2017.[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In
International Conference on Learning Representations, 2015. [32] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017. [33] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In
Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications ,volume 11006, page 1100612. International Society for Optics and Photonics, 2019.[34] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization andmomentum in deep learning. In
International conference on machine learning , pages 1139–1147, 2013.[35] Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, and E. Weinan. Convolutional neural networks withlow-rank regularization. jan 2016. 4th International Conference on Learning Representations, ICLR 2016.[36] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average ofits recent magnitude.
COURSERA: Neural networks for machine learning , 4(2):26–31, 2012.[37] Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien.Painless stochastic gradient: Interpolation, line-search, and convergence rates. In
Advances in NeuralInformation Processing Systems , pages 3727–3740, 2019.[38] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal valueof adaptive gradient methods in machine learning. In
Advances in Neural Information Processing Systems ,pages 4148–4158, 2017.[39] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rankand sparse decomposition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7370–7379, 2017. [40] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Appendix-A: Proof of Theorems

The two proofs below correspond to Theorem 1 for p = 2 and p = 1, respectively.

Proof. (Theorem 1, p = 2) For simplicity of notation, we use the replacements A = Φ_{k_a}, B = Σ_{k=k_a}^{k_b} ∇f̃_k(Φ_k), and C = Φ_{k_b}, so the SGD update in (7) becomes C = A − ηB. Using Definition 1, the knowledge gain of the matrix C (assumed to be a column matrix, N ≤ M) is expressed by

$$G(C) = \frac{1}{N\,\sigma_1^2(C)}\sum_{i=1}^{N'}\sigma_i^2(C) = \frac{1}{N\,\|C\|_2^2}\,\mathrm{Tr}\{C^TC\}. \quad (12)$$

An upper bound on the first singular value can be calculated by first recalling its equivalence to the ℓ2-norm and then applying the triangle inequality:

$$\sigma_1^2(C) = \|C\|_2^2 = \|A-\eta B\|_2^2 \le \|A\|_2^2 + \eta^2\|B\|_2^2 + 2\eta\|A\|_2\|B\|_2. \quad (13)$$

By substituting (13) in (12) and expanding the terms in the trace, a lower bound on G(C) is given by

$$G(C) \ge \frac{1}{N\gamma}\Big[\mathrm{Tr}\{A^TA\} - 2\eta\,\mathrm{Tr}\{A^TB\} + \eta^2\,\mathrm{Tr}\{B^TB\}\Big], \quad (14)$$

where γ = ‖A‖₂² + η²‖B‖₂² + 2η‖A‖₂‖B‖₂. The latter inequality can be revised to

$$G(C) \ge \frac{1}{N\gamma}\Big[\Big(\tfrac{\gamma}{\|A\|_2^2} + 1 - \tfrac{\gamma}{\|A\|_2^2}\Big)\mathrm{Tr}\{A^TA\} - 2\eta\,\mathrm{Tr}\{A^TB\} + \eta^2\,\mathrm{Tr}\{B^TB\}\Big] = G(A) + \frac{1}{N\gamma}\underbrace{\Big[\Big(1-\tfrac{\gamma}{\|A\|_2^2}\Big)\mathrm{Tr}\{A^TA\} - 2\eta\,\mathrm{Tr}\{A^TB\} + \eta^2\,\mathrm{Tr}\{B^TB\}\Big]}_{D}. \quad (15)$$

Therefore, the bound in (15) can be written as

$$G(C) - G(A) \ge \frac{1}{N\gamma}\,D. \quad (16)$$

The monotonicity of the knowledge gain in (16) is guaranteed if D ≥ 0. The remaining term D can be expressed as a quadratic function of η:

$$D(\eta) = \Big[\mathrm{Tr}\{B^TB\} - \tfrac{\|B\|_2^2}{\|A\|_2^2}\mathrm{Tr}\{A^TA\}\Big]\eta^2 - 2\Big[\mathrm{Tr}\{A^TB\} + \tfrac{\|B\|_2}{\|A\|_2}\mathrm{Tr}\{A^TA\}\Big]\eta, \quad (17)$$

where the condition for D(η) ≥ 0 in (17) is

$$\eta \ge \max\left\{\frac{2\,\mathrm{Tr}\{A^TB\} + 2\tfrac{\|B\|_2}{\|A\|_2}\mathrm{Tr}\{A^TA\}}{\mathrm{Tr}\{B^TB\} - \tfrac{\|B\|_2^2}{\|A\|_2^2}\mathrm{Tr}\{A^TA\}},\; 0\right\}. \quad (18)$$

Hence, the lower bound in (18) guarantees the monotonicity of the knowledge gain through the update scheme C = A − ηB.

Our final inspection is to check whether the substitution of the step-size (8) in (16) would still hold the inequality condition in (16). Following the substitution, the inequality should satisfy

$$\eta \ge \frac{\zeta}{N\gamma}\,D. \quad (19)$$

We have found that D(η) ≥ 0 for the lower bound in (18), where the inequality in (19) also holds for some ζ ≥ 0, and the proof is done.

Proof. (Theorem 1, p = 1) Following the notation of the previous proof, the knowledge gain of the matrix C is expressed by

$$G(C) = \frac{1}{N\,\sigma_1(C)}\sum_{i=1}^{N'}\sigma_i(C). \quad (20)$$

By stacking all singular values in vector form (and recalling the inequality between the ℓ1 and ℓ2 norms),

$$\Big[\sum_{i=1}^{N'}\sigma_i(C)\Big]^2 = \|\sigma(C)\|_1^2 \ge \|\sigma(C)\|_2^2 = \sum_{i=1}^{N'}\sigma_i^2(C) = \mathrm{Tr}\{C^TC\},$$

and by substituting the matrix composition C, the following inequality holds:

$$\Big[\sum_{i=1}^{N'}\sigma_i(C)\Big]^2 \ge \mathrm{Tr}\{A^TA\} - 2\eta\,\mathrm{Tr}\{A^TB\} + \eta^2\,\mathrm{Tr}\{B^TB\}. \quad (21)$$

An upper bound on the first singular value can be calculated by recalling its equivalence to the ℓ2-norm and the triangle inequality as follows:

$$\sigma_1^2(C) = \|C\|_2^2 = \|A-\eta B\|_2^2 \le \|A\|_2^2 + 2\eta\|A\|_2\|B\|_2 + \eta^2\|B\|_2^2. \quad (22)$$

By substituting the lower bound (21) and the upper bound (22) into (20), a lower bound on the knowledge gain is given by

$$G^2(C) \ge \frac{1}{N^2\gamma}\Big[\mathrm{Tr}\{A^TA\} - 2\eta\,\mathrm{Tr}\{A^TB\} + \eta^2\,\mathrm{Tr}\{B^TB\}\Big],$$

where γ = ‖A‖₂² + 2η‖A‖₂‖B‖₂ + η²‖B‖₂². The latter inequality can be revised to

$$G^2(C) \ge \frac{1}{N^2\gamma}\Big[\frac{N''\gamma}{\|A\|_2^2}\mathrm{Tr}\{A^TA\} + \underbrace{\Big(1-\frac{N''\gamma}{\|A\|_2^2}\Big)\mathrm{Tr}\{A^TA\} - 2\eta\,\mathrm{Tr}\{A^TB\} + \eta^2\,\mathrm{Tr}\{B^TB\}}_{D}\Big], \quad (23)$$

where the lower bound of the first summand term is given by

$$\frac{N''\gamma}{\|A\|_2^2}\mathrm{Tr}\{A^TA\} = \frac{N''\gamma}{\sigma_1^2(A)}\sum_{i=1}^{N''}\sigma_i^2(A) = \frac{N''\gamma}{\sigma_1^2(A)}\,\|\sigma(A)\|_2^2 \ge \frac{\gamma}{\sigma_1^2(A)}\,\|\sigma(A)\|_1^2 = \gamma N^2 G^2(A).$$

Therefore, the bound in (23) is revised to

$$G^2(C) \ge G^2(A) + \frac{1}{N^2\gamma}\,D. \quad (24)$$

Note that γ ≥ 0 (the step-size η ≥ 0 is always positive), so the only condition for the bound in (24) to guarantee monotonicity is D ≥ 0. Here the remaining term D can be expressed as a quadratic function of the step-size, i.e. D(η) = aη² + bη + c, where

$$a = \mathrm{Tr}\{B^TB\} - N''\tfrac{\|B\|_2^2}{\|A\|_2^2}\mathrm{Tr}\{A^TA\}, \quad b = -2\,\mathrm{Tr}\{A^TB\} - 2N''\tfrac{\|B\|_2}{\|A\|_2}\mathrm{Tr}\{A^TA\}, \quad c = -(N''-1)\,\mathrm{Tr}\{A^TA\}.$$

The quadratic function can be factorized as D(η) = a(η − η₁)(η − η₂) with roots η₁ = (−b + √Δ)/(2a) and η₂ = (−b − √Δ)/(2a), where Δ = b² − 4ac. Here c ≤ 0 and, assuming a ≥ 0, we have Δ ≥ 0. Accordingly, η₁ ≥ 0 and η₂ ≤ 0. For the function D(η) to yield a positive value, both factorized elements should be either positive (i.e. η − η₁ ≥ 0 and η − η₂ ≥ 0) or negative (i.e. η − η₁ ≤ 0 and η − η₂ ≤ 0). Here, only the positive conditions hold, which yields η ≥ η₁. The assumption a ≥ 0 is equivalent to Σ_{i=1}^{N'''} σ_i²(B)/σ_1²(B) ≥ N'' Σ_{i=1}^{N''} σ_i²(A)/σ_1²(A). The condition strongly holds for the beginning epochs due to the random initialization of weights, where the low-rank matrix A is indeed a near-empty matrix at epoch t = 0. As epoch training progresses this condition loosens and might not hold. Therefore, the monotonicity of the knowledge gain for p = 1 could be violated in the interim process.
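Before moving to the ablation study, a quick numerical sanity check of Theorem 1 for p = 2: the sketch below builds a nearly rank-one A (mimicking weights early in training, which keeps the denominator of (18) positive), a generic accumulated-gradient matrix B, evaluates the lower bound of (18), and confirms that a step-size above that bound does not decrease the knowledge gain. The matrix sizes and the factor 1.5 are arbitrary illustrative choices.

```python
import torch

torch.manual_seed(0)
n_rows, n_cols = 200, 16

# A: nearly rank-one (early-training weights); B: generic accumulated gradients.
A = torch.randn(n_rows, 1) @ torch.randn(1, n_cols) + 0.01 * torch.randn(n_rows, n_cols)
B = torch.randn(n_rows, n_cols)

def G2(M):
    """Knowledge gain with p = 2 and no truncation, as used in the proof."""
    s = torch.linalg.svdvals(M)
    return (s ** 2).sum() / (M.shape[1] * s[0] ** 2)

specA = torch.linalg.svdvals(A)[0]            # ||A||_2 (spectral norm)
specB = torch.linalg.svdvals(B)[0]            # ||B||_2
a = (B * B).sum() - (specB ** 2 / specA ** 2) * (A * A).sum()
b = 2 * (A * B).sum() + 2 * (specB / specA) * (A * A).sum()
eta_lb = max((b / a).item(), 0.0)             # lower bound from Eq. (18)

eta = 1.5 * eta_lb                            # any step-size above the bound
C = A - eta * B
print(a.item() > 0, round(eta_lb, 3), G2(A).item(), G2(C).item())  # G2(C) >= G2(A)
```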
Appendix-B: AdaS Ablation Study

The ablative analysis of the AdaS optimizer is studied here with respect to different parameter settings. Figure 3 demonstrates the AdaS performance with respect to different ranges of the gain-factor β. Figure 4 demonstrates the knowledge gain for different datasets and networks with respect to different gain-factor settings over successive epochs. Similarly, Figure 5 demonstrates the rank gain (aka the ratio of non-zero singular values of the low-rank structure with respect to channel size) over successive epochs. Mapping conditions are shown in Figure 6, and Figure 7 demonstrates the learning rate approximation through the AdaS algorithm over successive epoch training. The evolution of knowledge gain versus mapping condition is also shown in Figure 8 and Figure 9.
Figure 3: Ablative study of the AdaS momentum rate (gain-factor) β over two different datasets (CIFAR10 and CIFAR100) and two CNNs (VGG16 and ResNet34). The top row corresponds to test accuracies and the bottom row to training losses; columns correspond to CIFAR10/VGG16, CIFAR10/ResNet34, CIFAR100/VGG16, and CIFAR100/ResNet34.

Figure 4: Ablative study of the AdaS momentum rate β versus knowledge gain G over two different datasets (CIFAR10 and CIFAR100) and two CNNs (VGG16 and ResNet34), with one panel per dataset/network/β setting. The transition in color shades from light to dark lines corresponds to the first through last convolution layers in each network.

Figure 5: Ablative study of the AdaS momentum rate β versus rank gain (the ratio of non-zero singular values of the low-rank structure to channel size, rank{Φ̂}) over the same datasets and CNNs. The transition in color shades from light to dark lines corresponds to the first through last convolution layers in each network.

Figure 6: Ablative study of the AdaS momentum rate β versus mapping condition κ over the same datasets and CNNs. The transition in color shades from light to dark lines corresponds to the first through last convolution layers in each network.

Figure 7: Ablative study of the AdaS momentum rate β versus the approximated learning rate η(t,ℓ) over the same datasets and CNNs. The transition in color shades from light to dark lines corresponds to the first through last convolution layers in each network.
Figure 8: Evolution of knowledge gain versus mapping condition over epoch training for the CIFAR10 dataset. Panels cover VGG16 (top) and ResNet34 (bottom) trained with AdaGrad, RMSProp, AdaM, AdaBound, StepLR, OneCycleLR, and AdaS with gain factors up to β = 0.975. The transition of color shades corresponds to different convolution blocks. The transparency of the scatter plots corresponds to the convergence in epochs–the higher the transparency, the faster the convergence.
Figure 9: Evolution of knowledge gain versus mapping condition over epoch training for the CIFAR100 dataset, with the same panel layout as Figure 8 (VGG16 and ResNet34 trained with AdaGrad, RMSProp, AdaM, AdaBound, StepLR, OneCycleLR, and AdaS with varying β).