Collaborative Learning for Deep Neural Networks
Guocong Song
Playground Global, Palo Alto, CA 94306, [email protected]
Wei Chai
Google, Mountain View, CA 94043, [email protected]
Abstract
We introduce collaborative learning, in which multiple classifier heads of the same network are simultaneously trained on the same training data to improve generalization and robustness to label noise with no extra inference cost. It acquires the strengths of auxiliary training, multi-task learning, and knowledge distillation. There are two important mechanisms involved in collaborative learning. First, the consensus of multiple views from different classifier heads on the same example provides supplementary information as well as regularization to each classifier, thereby improving generalization. Second, intermediate-level representation (ILR) sharing with backpropagation rescaling aggregates the gradient flows from all heads, which not only reduces training computational complexity, but also facilitates supervision to the shared layers. The empirical results on the CIFAR and ImageNet datasets demonstrate that deep neural networks learned as a group in a collaborative way significantly reduce the generalization error and increase the robustness to label noise.
When training deep neural networks, we must confront the challenges of general nonconvex optimization problems. Local gradient descent methods that most deep learning systems rely on, such as variants of stochastic gradient descent (SGD), have no guarantee that the optimization algorithm will converge to a global minimum. It is well known that an ensemble of multiple instances of a target neural network trained with different random seeds generally yields better predictions than a single trained instance. However, an ensemble of models is too computationally expensive at inference time. To keep the exact same computational complexity for inference, several training techniques have been developed that add additional networks to the training graph to boost accuracy without affecting the inference graph, including auxiliary training [19], multi-task learning [4, 3], and knowledge distillation [10]. Auxiliary training is introduced to improve the convergence of deep networks by adding auxiliary classifiers connected to certain intermediate layers [19]. However, auxiliary classifiers require specific new designs for their network structures in addition to the target network. Furthermore, it was found later [20] that auxiliary classifiers do not result in obvious improvements in convergence or accuracy. Multi-task learning is an approach to learn multiple related tasks simultaneously so that knowledge obtained from each task can be reused by the others [4, 3, 21]. However, it is not useful for a single-task use case. Knowledge distillation is introduced to facilitate training a smaller network by transferring knowledge from another high-capacity model, so that the smaller one obtains better performance than when trained with labels only [10]. However, distillation is not an end-to-end solution due to having two separate training phases, which consume more training time.

In this paper, we propose a framework of collaborative learning that trains several classifier heads of the same network simultaneously on the same training data to cope with the above challenges. The method acquires the advantages of auxiliary training, multi-task learning, and knowledge distillation, such as appending the exact same network as the target one in the training graph for a single task, sharing intermediate-level representation (ILR), learning from the outputs of other heads (peers) besides the ground-truth labels, and keeping the inference graph unchanged. Experiments have been performed with several popular deep neural networks on different datasets to benchmark performance, and their results demonstrate that collaborative learning provides significant accuracy improvement for image classification problems in a generic way. There are two major mechanisms collaborative learning benefits from: 1) The consensus of multiple views from different classifier heads on the same data provides supplementary information and regularization to each classifier. 2) Besides the computational complexity reduction from ILR sharing, backpropagation rescaling aggregates the gradient flows from all heads in a balanced way, which leads to additional performance enhancement. The per-layer network weight distribution shows that ILR sharing reduces the number of "dead" filter weights in the bottom layers caused by the vanishing gradient issue, thereby enlarging the network capacity.

The major contributions are summarized as follows.
1) Collaborative learning provides a new training framework: for any given model architecture, the proposed collaborative training method can be used to potentially improve accuracy, with no extra inference cost, no need to design another model architecture, and minimal hyperparameter re-tuning. 2) We introduce ILR sharing into co-distillation, which not only enhances training time and memory efficiency but also improves the generalization error. 3) Backpropagation rescaling, which we propose to avoid gradient explosion when the number of heads is large, is also shown to improve accuracy when the number of heads is small. 4) Collaborative learning is demonstrated to be robust to label noise.

In addition to auxiliary training, multi-task learning, and distillation mentioned before, we list other related work as follows.
General label smoothing.
Label smoothing replaces the hard values (1 or 0) in the one-hot labels for a classifier with smoothed values, and is shown to reduce the vulnerability to noisy or incorrect labels in datasets [20]. It regularizes the model and relaxes the confidence on the labels. Temporal ensembling forms a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs to improve the performance of semi-supervised learning [14]. However, it is hard to scale to a large dataset, since temporal ensembling requires memorizing the smoothed label of each data example.
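As a concrete illustration of the smoothing operation described above, the sketch below shows one common formulation of uniform label smoothing; the function name and the default eps=0.1 (matching the "Label smoothing (0.1)" setting that appears later in Table 4) are illustrative assumptions, not code from the paper.

```python
import torch

def smooth_labels(one_hot, eps=0.1):
    """Uniform label smoothing: keep 1 - eps on the true class and
    spread eps evenly over all m classes."""
    m = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / m

# Example: a 4-class one-hot label [0, 1, 0, 0] becomes [0.025, 0.925, 0.025, 0.025].
```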
Two-way distillation.
Co-distillation of two instances of the same neural network is studied in [2], with a focus on training speed-up in a distributed learning environment. Two-way distillation between two networks, which can have the same or different architectures, is also studied in [23]. Each of them alternately optimizes its own network parameters. However, the developed algorithms are far from optimized. First, when different classifiers have different architectures, each of them should have a different weight associated with its loss function to balance the injected backpropagation error flows. Second, multiple copies of the target network proportionally increase the graphics processing unit (GPU) memory consumption and the training time.
Self-distillation/born-again neural networks.
Self-distillation is a kind of distillation in which the student network is identical to the teacher in terms of the network graph. Furthermore, the distillation process can be performed consecutively several times. At each consecutive step, a new identical model is initialized from a different random seed and trained under the supervision of the earlier generation. At the end of the procedure, additional gains can be achieved with an ensemble of multiple student generations [7]. However, multiple self-distillation processes multiply the total training time proportionally; an ensemble of multiple student generations increases the inference time accordingly as well.

In comparison, the major goal of this paper is to improve the accuracy of a target network without changing its inference graph, and to emphasize both accuracy and training efficiency.
The framework of collaborative learning consists of three major parts: the generation of a population of classifier heads in the training graph, the formulation of the learning objective, and optimization for learning a group of classifiers collaboratively. We will describe the details of each of them in the following subsections.
Figure 1: Multiple head patterns for training. (a) Target network; (b) multiple instances; (c) simple ILR sharing; (d) hierarchical ILR sharing. Three colors represent subnets $g_1$, $g_2$, and $g_3$ in (1).
Similar to auxiliary training [19], we add several new classifier heads into the original network graph during training time. At inference time, only the original network is kept and all added parts are discarded. Unlike auxiliary training, each classifier head here has a network identical to the original one in terms of graph structure. This approach leads to advantages over auxiliary training in terms of engineering effort minimization. First, it does not require designing additional networks for the auxiliary classifiers. Second, the structure symmetry of all heads does not require additional different weights associated with the loss functions to balance the injected backpropagation error flows, because an equal weight for each head's objective is optimal for training.

Figure 1 illustrates several patterns to create a group of classifiers in the training graph. Figure 1 (a) is a target network to train. The network can be expressed as $z = g(x; \theta)$, where $g$ is determined by the graph architecture and $\theta$ represents the network parameters. To better explain the following patterns, we assume the network $g$ can be represented as a cascade of three functions or subnets,

$g(x; \theta) = g_3(g_2(g_1(x; \theta_1); \theta_2); \theta_3)$  (1)

where $\theta = [\theta_1, \theta_2, \theta_3]$ and $\theta_i$ includes all parameters of subnet $g_i$ accordingly. In Figure 1 (b), each head is just a new instance of the original network. The output of head $h$ is $z^{(h)} = g(x; \theta^{(h)})$, where $\theta^{(h)}$ is an instance of network parameters for head $h$. Another pattern allows all heads to share ILRs in the same low layers, which is shown in Figure 1 (c). This structure is very similar to multi-task learning [4, 3], in which different supervised tasks share the same input as well as some ILR. However, collaborative learning has the same supervised task for all heads. It can be expressed as $z^{(h)} = g_3(g_2(g_1(x; \theta_1); \theta_2^{(h)}); \theta_3^{(h)})$, where there is only one instance of $\theta_1$ shared by all heads. Furthermore, multi-heads can take advantage of multiple hierarchical ILRs, as shown in Figure 1 (d). The hierarchy is similar to a binary tree in which the branches at the same level are copies of each other. For inference, we just need to keep one head with its dependent nodes and discard the rest. Therefore, the inference graph is identical to the original graph $g$.

It is shown in [17, 5] that the training memory size is roughly proportional to the number of layers/operations. With the multi-instance pattern, the number of parameters in the whole training graph is proportional to the number of heads. Obviously, ILR sharing can proportionally reduce the memory consumption and speed up training, compared to multiple instances without sharing. More interestingly, the empirical results and analysis in Section 4 will demonstrate that ILR sharing is able to boost the classification accuracy as well.
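The sketch below illustrates the simple ILR-sharing pattern of Figure 1 (c) with a toy network. The paper's experiments are implemented in TensorFlow; this PyTorch version, with illustrative module names (SharedStem, Head, CollaborativeNet) and layer sizes, is only a minimal sketch of the graph structure under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SharedStem(nn.Module):
    """Subnet g1: the low layers shared by all heads (a single instance of theta_1)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

    def forward(self, x):
        return self.features(x)

class Head(nn.Module):
    """Subnets g2 and g3 for one head: an identical copy of the upper layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class CollaborativeNet(nn.Module):
    """A shared stem feeding H classifier heads."""
    def __init__(self, num_heads=2, num_classes=10):
        super().__init__()
        self.stem = SharedStem()
        self.heads = nn.ModuleList(Head(num_classes) for _ in range(num_heads))

    def forward(self, x):
        shared = self.stem(x)
        return [head(shared) for head in self.heads]  # list of logit vectors z^(h)
```

At inference time, only the stem and one element of the head list would be kept, so the deployed graph matches the original single-head network.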
The main idea of collaborative learning is that each head learns from the ground-truth labels but also from the whole population through the training process. We focus on multi-class classification problems in this paper. For head $h$, the classifier's logit vector is represented as $z^{(h)} = [z_1, z_2, \ldots, z_m]^{tr}$ for $m$ classes, and the softmax output with temperature $T$ is defined as

$\sigma_i(z^{(h)}; T) = \dfrac{\exp(z_i^{(h)}/T)}{\sum_{j=1}^{m} \exp(z_j^{(h)}/T)}$  (2)

When $T = 1$, (2) is just a normal softmax function. Using a higher value for $T$ produces a softer probability distribution over classes. The loss function for head $h$ is proposed as

$L^{(h)} = \beta J_{\mathrm{hard}}(y, z^{(h)}) + (1 - \beta) J_{\mathrm{soft}}(q^{(h)}, z^{(h)})$  (3)

where $\beta \in (0, 1]$. The objective function with regard to a ground-truth label, $J_{\mathrm{hard}}$, is just the classification loss, i.e., the cross entropy between a one-hot encoding of the label $y$ and the softmax output with a temperature of 1: $J_{\mathrm{hard}}(y, z^{(h)}) = -\sum_{i=1}^{m} y_i \log(\sigma_i(z^{(h)}; 1))$. The soft label of head $h$ is proposed to be a consensus of all other heads' predictions as follows:

$q^{(h)} = \sigma\!\left(\dfrac{1}{H-1}\sum_{j \neq h} z^{(j)}; T\right)$

which combines the multiple views on the same data and contains additional information beyond the ground-truth label. The objective function with regard to the soft label is the cross entropy between the soft label and the softmax output with a certain temperature, i.e.,

$J_{\mathrm{soft}}(q^{(h)}, z^{(h)}) = -\sum_{i=1}^{m} q_i^{(h)} \log(\sigma_i(z^{(h)}; T))$

which can be regarded as a distance measure between the average prediction of the population and the prediction of each head [10]. Minimizing this objective aims at transferring the information from the soft label to the logits and regularizing the training network.
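A minimal sketch of the per-head objectives in (2) and (3), assuming at least two heads and the list of logits produced by a multi-head model such as the one sketched above. Treating the consensus q^(h) as a constant target via detach() and scaling the soft term by T^2 (discussed later in Section 3.3) are implementation assumptions of this sketch, and the default beta value is a placeholder rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def collaborative_loss(logits, labels, beta=0.5, T=2.0):
    """logits: list of [batch, m] tensors, one per head (H >= 2); labels: [batch] int64."""
    H = len(logits)
    total = 0.0
    for h in range(H):
        # Hard objective J_hard: cross entropy against the ground-truth label (temperature 1).
        j_hard = F.cross_entropy(logits[h], labels)
        # Soft label q^(h): tempered softmax of the mean logits of all other heads.
        others = torch.stack([logits[j] for j in range(H) if j != h])
        q = F.softmax(others.mean(dim=0).detach() / T, dim=1)
        # Soft objective J_soft: cross entropy between q^(h) and this head's tempered softmax.
        j_soft = -(q * F.log_softmax(logits[h] / T, dim=1)).sum(dim=1).mean()
        # The T^2 factor keeps hard/soft gradient magnitudes comparable when T is tuned.
        total = total + beta * j_hard + (1.0 - beta) * (T ** 2) * j_soft
    return total
```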
In addition to performance optimization, another design criterion for collaborative learning is to keep the hyperparameters of the training algorithm, e.g., the type of SGD, regularization, and learning rate schedule, the same as those used in individual learning. Thus, collaborative learning can simply be put on top of individual learning. The optimization here is mainly designed to take the new concepts involved in collaborative learning into account, including a group of classifiers and ILR sharing.

Simultaneous SGD.
Since multiple heads are involved in optimization, it seems straightforward to alternately update the parameters associated with each head one by one. This algorithm is used in both [23, 2]. In fact, alternative optimization is popular in generative adversarial networks [8], in which a generator and a discriminator get alternately updated. However, alternative optimization has the following shortcomings. In terms of speed, it is slow because one head needs to recalculate a new prediction after updating its parameters. In terms of convergence, recent work [15, 16] reveals that simultaneous SGD has faster convergence and achieves better performance than the alternative one. Therefore, we propose to apply SGD and update all parameters in the training graph simultaneously according to the total loss, which is the sum of each head's loss plus a regularization term $\Omega(\theta)$:

$L = \sum_{h=1}^{H} L^{(h)} + \lambda \Omega(\theta)$  (4)

We suggest keeping the same regularization and its hyperparameters as individual training when applying collaborative learning. It is important to avoid unnecessary hyperparameter search in practice when introducing a new training approach. The effectiveness of simultaneous SGD will be validated in Section 4.1.
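A minimal sketch of one simultaneous update step implementing (4), reusing the hypothetical CollaborativeNet and collaborative_loss sketches above; a single optimizer over all parameters stands in for updating the shared subnet and every head at once, with weight decay playing the role of the regularization term. The optimizer settings below are placeholders, not the paper's exact configuration.

```python
import torch

model = CollaborativeNet(num_heads=2, num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)

def train_step(images, labels):
    optimizer.zero_grad()
    logits = model(images)                     # one forward pass through the shared stem and all heads
    loss = collaborative_loss(logits, labels)  # sum of every head's loss, as in (4)
    loss.backward()                            # gradients for theta_1 and every theta^(h) at once
    optimizer.step()                           # all parameters updated simultaneously
    return loss.item()
```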
Backpropagation rescaling.

First, we describe an important stability issue with ILR sharing. Assume that there are $H$ heads sharing subnet $g_1(\cdot; \theta_1)$, as shown in Figure 2 (a), in which $\theta_1$ and $\theta_2^{(h)}$ represent the parameters of $g_1$ and those of $g_2$ associated with head $h$, respectively. The output of the shared layers, $x_1$, is fed to all corresponding heads. However, the backward graph becomes a many-to-one connection. According to (4), the backpropagation input for the shared layers is $\nabla_{x_1} L = \sum_{h=1}^{H} \nabla_{x_1} L^{(h)}$. It is not hard to discover an issue: the variance of $\nabla_{x_1} L$ grows as the number of heads grows. Assume that the gradient of each head's loss has a limited variance, i.e., $\mathrm{Var}((\nabla_{x_1} L^{(h)})_i) < \infty$, where $i$ indexes each element in the vector. We should make the system stable, i.e., $\mathrm{Var}((\nabla_{x_1} L)_i) < \infty$, even when $H \to \infty$. Unfortunately, the backpropagation flow of Figure 2 (a) is unstable in the asymptotic sense due to the sum of all gradient flows.

Figure 2: No rescaling vs backpropagation rescaling. (a) No rescaling; (b) backpropagation rescaling. Operation $I$ is described in (5).

Note that simple loss scaling, i.e., $L = \frac{1}{H}\sum_h L^{(h)}$, brings another problem: it results in very slow learning with respect to $\theta_2^{(h)}$. The SGD update is $\theta_2^{(h)} \leftarrow \theta_2^{(h)} - \frac{\eta}{H}\nabla_{\theta_2^{(h)}} L^{(h)}$. For a fixed learning rate $\eta$, $\frac{\eta}{H}\nabla_{\theta_2^{(h)}} L^{(h)} \to 0$ when $H \to \infty$.

Therefore, backpropagation rescaling is proposed to achieve two goals at the same time: to normalize the backpropagation flow into subnet $g_1$ and to keep that in subnet $g_2$ the same as in the single-classifier case. The solution is to add a new operation $I(\cdot)$ between $g_1$ and $g_2$, shown in Figure 2 (b), which is

$I(x_1) = x_1, \quad \nabla_{x_1} I = \dfrac{1}{H}$  (5)

The backpropagation input for the shared layers then becomes

$\nabla_{x_1} L = \dfrac{1}{H}\sum_{h=1}^{H} \nabla_{x_1} L^{(h)}$  (6)

The variance of (6) is then always limited, which is proven in Section 1 of the Supplementary material. Backpropagation rescaling is essential for ILR sharing to achieve better performance while simply reusing a training configuration well tuned in individual learning. Its effectiveness on classification accuracy will be validated in Section 4.1.
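The rescaling operation (5) is an identity in the forward pass whose backward pass divides the incoming gradient by H. The paper's implementation is in TensorFlow; the sketch below expresses the same idea as a PyTorch autograd Function, an illustrative equivalent rather than the authors' code.

```python
import torch

class BackpropRescale(torch.autograd.Function):
    """Operation I(.) of (5): identity forward, gradient scaled by 1/H backward."""

    @staticmethod
    def forward(ctx, x, num_heads):
        ctx.num_heads = num_heads
        return x.view_as(x)  # I(x_1) = x_1: the forward value is unchanged

    @staticmethod
    def backward(ctx, grad_output):
        # Autograd has already summed the H head gradients arriving at x_1;
        # dividing by H yields the average in (6), whose variance stays bounded.
        return grad_output / ctx.num_heads, None

# Usage inside the shared-stem forward pass (see the earlier CollaborativeNet sketch):
#   shared = BackpropRescale.apply(self.stem(x), len(self.heads))
#   return [head(shared) for head in self.heads]
```

Placing the operation between the shared stem and the heads rescales only the gradient entering $g_1$, while each head's own parameter gradients are left exactly as in the single-classifier case.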
Balance between hard and soft loss objectives.

We follow the suggestion in [10] that the backpropagation flow from each soft objective should be multiplied by $T^2$, since the magnitudes of the gradients produced by the soft targets scale as $1/T^2$. This ensures that the relative contributions of the hard and soft targets remain roughly unchanged when tuning $T$.

In supervised learning, it is hard to completely avoid confusion during network training, either due to incorrect labels or data augmentation. For example, random cropping is a very important data augmentation technique when training an image classifier. However, the entire labeled object or a large portion of it occasionally gets cut off, which really challenges the classifier. Since multiple views on the same example have a diversity of predictions, collaborative learning is by nature more robust to label noise than individual learning, which will be validated in Section 4.1.
Experiments
We will evaluate the performance of collaborative learning on various network architectures for several datasets, with analysis of important and interesting observations. We use $T = 2$ and a fixed value of $\beta$ for all experiments. In addition, the performance of any model trained with collaborative learning is evaluated using the first classifier head, without head selection. All experiments are conducted with TensorFlow [1].

The two CIFAR datasets, CIFAR-10 and CIFAR-100, consist of colored natural images with 32x32 pixels [13] and have 10 and 100 classes, respectively. We conduct empirical studies on the CIFAR-10 dataset with ResNet-32, ResNet-110 [9], and DenseNet-40-12 [11]. ResNets and DenseNets for CIFAR are all designed to have three building blocks, residual or dense blocks. For the simple ILR sharing, the split point is just after the first block. For the hierarchical sharing, the two split points are located after the first and second blocks, respectively. Refer to Section 2 in the Supplementary material for the detailed training setup.
Table 1: Test errors (%) on CIFAR-10. All experiments are run 5 times, except those of DenseNet-40-12, which are run 3 times.
ResNet-32 ResNet-110 DenseNet-40-12
Individual learning: Single instance 6.66 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ±

Classification results.
All results are summarized in Table 1. It can be concluded from Table 1 that, with a given training graph pattern, the more classifier heads, the lower the generalization error. More importantly, ILR sharing reduces not only GPU memory consumption and training time but also the generalization error considerably.
Simultaneous vs alternative optimization.
We repeat an experiment that was performed in [23]. It is just a special case of collaborative learning in which we train two instances of ResNet-32 on CIFAR-100 with $T = 1$ and a fixed $\beta$. The only difference is that we replace the alternative optimization [23] with the simultaneous one. It is shown in Table 2 that, based on the corresponding baseline, simultaneous optimization provides an additional 1%+ accuracy gain compared to the alternative one. With $T = 2$, the simultaneous one gains another 1% boost. Thus, simultaneous optimization substantially outperforms the alternative one in terms of accuracy and speed.

Table 2: Alternative optimization [23] vs simultaneous optimization (ours) in terms of test errors of ResNet-32 on CIFAR-100.

                                  Single instance (baseline) | Head 1 in two instances | Head 2 in two instances
[23]                              31.01                      | 28.81                   | 29.25
Collaborative learning, T = 1     30.52 ±                    | ±                       | ±
Collaborative learning, T = 2     ±                          | ±                       | ±
Backpropagation rescaling is argued in Section 3.3 to be theoretically necessary for ILR sharing. We intend to confirm this by experiments on the CIFAR-10 dataset. To train a ResNet-32, we use a simple ILR sharing topology with four heads, with the split point located after the first residual block. The results in Table 3 provide evidence that backpropagation rescaling clearly outperforms the other options, no scaling and loss scaling. While no scaling suffers from too-large gradients in the shared layers, loss scaling results in a too-small factor for updating the parameters of the independent layers. We suggest backpropagation rescaling for all multi-head learning problems beyond collaborative learning.
Table 3: Impact of backprop rescaling. Four heads based on ResNet-32 share the low layers up to the first residual block. With no scaling, the factor for each head's loss is one. With loss scaling, the factor for each head's loss is 1/4.

                         No scaling | Loss scaling | Backprop rescaling
Error (%) of ResNet-32   6.04 ±     | ±            | ±

Noisy label robustness.
In this experiment, we aim at validating the noisy label resistance of collaborative learning on the CIFAR-10 dataset with ResNet-32. Assume that a portion of labels, whose percentage is called the noise level, is corrupted with a uniform distribution over the label set. The partition of images into corrupted and uncorrupted is fixed for all runs; the noisy labels themselves are randomly regenerated every epoch. The results in Figure 3 validate that the test error rates of all collaborative learning setups are substantially lower than the baseline, and that the accuracy gain becomes larger at a considerably higher noise level. This is expected, since the consensus formed by a group is able to mitigate the effect of noisy labels without knowledge of the noise distribution. Another observation is that 4 heads with hierarchical ILR sharing, which consistently provides the lowest error rate at relatively low noise levels, seems worse at a high noise level. We conjecture that the diversity of predictions is more important than better ILR sharing in this scenario. Collaborative learning provides the flexibility to trade off the diversity of predictions from the group against additional supervision and regularization for the common layers.

Figure 3: Test error on CIFAR-10 with label noise. The noise level is the percentage of corrupted labels over the whole training set. The noisy labels are randomly generated every epoch.
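The corruption protocol described above can be sketched as follows; this is an interpretation of the setup (e.g., drawing replacement labels uniformly over all classes, possibly including the true one), not the authors' code, and the function name is illustrative.

```python
import numpy as np

def corrupt_labels(clean_labels, noise_level, num_classes, corrupt_idx=None, rng=None):
    """Return labels where a fixed subset of indices is redrawn uniformly each epoch."""
    rng = rng if rng is not None else np.random.default_rng()
    labels = np.asarray(clean_labels).copy()
    if corrupt_idx is None:
        # The corrupted subset is chosen once per run and then kept fixed.
        n_corrupt = int(noise_level * len(labels))
        corrupt_idx = rng.choice(len(labels), size=n_corrupt, replace=False)
    # The noisy labels of that subset are regenerated at every epoch.
    labels[corrupt_idx] = rng.integers(0, num_classes, size=len(corrupt_idx))
    return labels, corrupt_idx
```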
The ILSVRC 2012 classification dataset consists of 1.2 million images for training and 50,000 for validation [6]. We evaluate how collaborative learning helps improve the performance of the ResNet-50 network. Following the notation in [9], we consider two heads sharing ILRs up to the "conv3_x" block for simple ILR sharing. For the hierarchical sharing with four heads, the two split points are located after the "conv3_x" and "conv4_x" blocks, respectively. Refer to Section 3 in the Supplementary material for the detailed training setup.
Classification error vs training computing resources (GPU memory consumption as well as training time).
Classification error on ImageNet is particularly important because many state-of-the-art computer vision models derive image features or architectures from ImageNet classification models. For instance, a more accurate classifier typically leads to a better object detection model based on that classifier [12].
Table 4: Validation errors of ResNet-50 on ImageNet. Label smoothing, distillation, and collaborative learning all leave the inference memory size and running time unchanged.

                                                              Top-1 error | Top-5 error | Training time | Memory
Individual learning: Baseline                                 23.47       | 6.83        | 1x            | 1x
Individual learning: Label smoothing (0.1)                    23.34       | 6.80        | 1x            | 1x
Distillation: From ensemble of two ResNet-50s                 22.65       | 6.34        | 3.42x         | 1.05x
Collaborative learning: 2 instances                           22.81       | 6.45        | 2x            | 2x
Collaborative learning: 2 heads w/ simple ILR sharing         22.70       | 6.37        | 1.4x          | 1.32x
Collaborative learning: 4 heads w/ hierarchical ILR sharing   22.29       |             |               |
Figure 4: Per-layer weight distribution in trained ResNet-50. Following the notation in [9], the two split points in the hierarchical sharing with four heads are located after the "conv3_x" and "conv4_x" blocks, respectively.

Table 4 summarizes the performance of various training graph patterns with ResNet-50 on ImageNet. As mentioned in Section 3.1, collaborative learning brings some extra training cost since it generates more classifier heads in training, and ILR sharing is designed for training speedup and memory consumption reduction. We have measured GPU memory consumption and training time and also listed them in Table 4. Similar to the CIFAR results, two heads with simple ILR sharing and four heads with hierarchical ILR sharing reduce the validation top-1 error rate significantly in this case, from 23.47% with the baseline to 22.70% and 22.29%, respectively. Note that increasing training time for individual learning does not improve accuracy [22]. Since the convolution filters are shared in the space domain in deep convolutional networks, the memory consumption from storing the intermediate feature maps is much higher than that from the model parameters in training [17]. Therefore, ILR sharing is especially computationally efficient for deep convolutional networks because it contains only one copy of the shared layers. Compared to distillation (whose training time is analyzed in Section 4 of the Supplementary material), collaborative learning can achieve a lower error rate with a much shorter training time in an end-to-end way.
Model weight distribution and mechanisms of ILR sharing.

We have plotted the statistical distribution of each layer's weights of trained ResNet-50 in Figure 4, including the baseline, the distilled version, and the version trained with hierarchical ILR sharing. Refer to Section 5 in the Supplementary material for more results with other training configurations. The first finding is that the weight distribution of the baseline has a very large spike near zero in the bottom layers. We conjecture that the gradients to many weights in the bottom layers vanish to values so small that the weight decay term takes the major effect, which eventually causes near-zero "dead" values. (We ran another experiment in which the weight decay was reduced by half in the first three layers to verify this hypothesis; refer to Section 5 in the Supplementary material for more details.) Compared to distillation, ILR sharing more effectively helps reduce the number of "dead" weights, thereby improving the accuracy. The second finding is that collaborative learning makes the weight distribution more concentrated around zero overall. Note that we also report per-layer model weight standard deviation values in Table 1 in the Supplementary material to further support this claim. The results indicate that the consensus of multiple views on the same data provides additional regularization.

ILR sharing is somewhat related to the concept of hint training [18], in which a teacher transfers its knowledge to a student network by using not only the teacher's predictions but also an ILR. In collaborative learning, ILR sharing can be regarded as an extreme case in which the ILRs of two separate classifier heads converge to the exact same one by forcing them to match. It is reported in [18] that using hints can outperform distillation. To a certain extent, this provides indirect evidence for the possibility of accuracy improvement from ILR sharing.

Again, the two hyperparameters β and T are fixed in all of our experiments. It is possible that more extensive hyperparameter searches may further improve the performance on specific datasets. We evaluate the impact of the hyperparameters β and T and of the split point locations for ResNet-32 on CIFAR-10 in Section 6 in the Supplementary material.

We have proposed a framework of collaborative learning to train a deep neural network within a group of classifiers generated from the target network. The consensus of multiple views from different classifier heads on the same example provides supplementary information as well as regularization to each classifier, thereby improving generalization. By aggregating the gradient flows from all heads in a balanced way, ILR sharing with backpropagation rescaling not only lowers the training computational cost, but also facilitates supervision to the shared layers. Empirical results have also validated the advantages of simultaneous optimization and backpropagation rescaling in group learning. Overall, collaborative learning provides a flexible and powerful end-to-end training approach for deep neural networks to achieve better performance. Collaborative learning also opens up several possibilities for future work. The mechanism of group collaboration and noisy label resistance implies that it may potentially be beneficial to semi-supervised learning. Furthermore, other machine learning tasks, such as regression, may take advantage of collaborative learning as well.
Acknowledgement
We would like to thank Qiqi Yan for many helpful discussions.
References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton. Large scale distributed neural network training through online distillation. In International Conference on Learning Representations (ICLR), 2018.

[3] J. Baxter. Learning internal representations. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, COLT '95, pages 311–320. ACM, 1995.

[4] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 41–48. Morgan Kaufmann, 1993.

[5] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv, abs/1604.06174, 2016.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR09, 2009.

[7] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born again neural networks. In International Conference on Machine Learning (ICML), July 2018.

[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[10] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

[11] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[12] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[13] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical Report, 2009.

[14] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations (ICLR), 2017.

[15] L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems (NIPS), 2017.

[16] V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems (NIPS), 2017.

[17] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design, 2016.

[18] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In International Conference on Learning Representations (ICLR), 2015.

[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[21] Y. Yang and T. Hospedales. Deep multi-task representation learning: A tensor factorisation approach. In International Conference on Learning Representations (ICLR), 2017.

[22] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.

[23] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. arXiv, abs/1706.00384, 2017.

Supplementary Material of Collective Training for Deep Neural Networks
Guocong Song
Playground Global, Palo Alto, CA 94306, [email protected]
Wei Chai
Google, Mountain View, CA 94043, [email protected]

This is equivalent to proving that $\mathrm{Var}\left(\frac{1}{H}\sum_{h=1}^{H} X_h\right) < \infty$ for all $H$ if $\mathrm{Var}(X_h) < \infty$ for all $h$.

Proof.
$\mathrm{Var}\left(\frac{1}{H}\sum_{h=1}^{H} X_h\right) = \sum_{h=1}^{H}\mathrm{Var}(X_h/H) + \sum_{i \neq j}\mathrm{Cov}(X_i/H, X_j/H)$
$\le \frac{1}{H^2}\sum_{h=1}^{H}\mathrm{Var}(X_h) + \frac{1}{H^2}\sum_{i \neq j}|\mathrm{Cov}(X_i, X_j)|$
$\le \frac{1}{H^2}\sum_{h=1}^{H}\mathrm{Var}(X_h) + \frac{1}{H^2}\sum_{i \neq j}\sqrt{\mathrm{Var}(X_i)}\sqrt{\mathrm{Var}(X_j)}$
$\le \frac{1}{H^2}\cdot H^2\max_h(\mathrm{Var}(X_h))$
$= \max_h(\mathrm{Var}(X_h))$

The second inequality follows from the Cauchy–Schwarz inequality. Therefore, if $\mathrm{Var}(X_h) < \infty$ for all $h$, then $\mathrm{Var}\left(\frac{1}{H}\sum_{h=1}^{H} X_h\right) < \infty$ as well.

We adopt a standard data augmentation scheme that is widely used for these two datasets [2, 3]. We train three target networks: ResNet-32, ResNet-110 [2], and DenseNet-40-12 [4]. In training, we use weight decay and a Nesterov momentum of 0.9 for SGD for all networks. ResNet-32 and ResNet-110 are trained with a mini-batch size of 128 for up to 200 epochs. For them, we start with a learning rate of 0.1 and divide it by 10 at 100, 150, and 192 epochs. DenseNet-40-12 is trained with a mini-batch size of 64 for up to 300 epochs. Its learning rate is initially set to 0.1 and is divided by 10 at 150, 225, and 290 epochs.

We adopt the same data augmentation scheme for training images as in [1]. Each network input image is a 224x224-pixel random crop from an augmented image or its horizontal flip, and is then normalized by the per-color mean and standard deviation. We train ResNet-50 [2] with a Nesterov momentum [5] of 0.9 and weight decay for up to 100 epochs. Each GPU consumes 32 images per mini-batch. The learning rate is initially set to 0.1, and is then divided by 10 at 30, 60, and 90 epochs. A single central crop of size 224x224 is applied for validation.

The training time of distillation can be expressed as $T_{\mathrm{train}} = T_t + T_s + T_{tf}$, where $T_t$ is the training time of the teacher network, $T_s$ is that of the student one, and $T_{tf}$ is the forward passing time of the teacher during distillation. For example, when distilling a ResNet-50 from an ensemble of two ResNet-50s, $T_t = 2T_s$ and $T_{tf} \approx 0.4\, T_s$. Therefore, the total training time is roughly 3.4x that of individual learning.

The distributions in the other cases are shown in Figure 1. Per-layer weight standard deviation values are listed for the different training approaches in Table 1. To validate our conjecture that the gradients to many weights in the bottom layers may have vanished to values so small that the weight decay term takes the major effect, eventually causing near-zero "dead" values, we perform an experiment in which the weight decay is halved in the conv1, conv2_x, and conv3_x layers and kept at its original value in the other layers. Figure 2 shows the expected result that a reduced weight decay does reduce the spike in the weight distribution. However, it does not reduce the error rate of the classifier, which is 23.5% for the top-1 error. Therefore, although weight decay is related to these "dead" filter weights, simply reducing weight decay is not a solution to improve accuracy.

Figure 1: Per-layer weight distribution in trained ResNet-50.

Table 1: Per-layer weight standard deviation in ResNet-50.

                                                              conv1 | conv2_x | conv3_x | conv4_x | conv5_x | dense
Individual learning: Baseline                                 0.116 | 0.034   | 0.024   | 0.017   | 0.014   | 0.033
Individual learning: Label smoothing (0.1)                    0.103 | 0.029   | 0.021   | 0.015   | 0.013   | 0.027
Distillation: From ensemble of two ResNet-50s                 0.113 | 0.035   | 0.022   | 0.016   | 0.013   | 0.030
Collaborative learning: 2 instances                           0.077 | 0.024   | 0.016   | 0.011   | 0.009   | 0.022
Collaborative learning: 2 heads w/ simple ILR sharing         0.078 | 0.025   | 0.017   | 0.011   | 0.009   | 0.022
Collaborative learning: 4 heads w/ hierarchical ILR sharing   0.076 | 0.024   | 0.016   | 0.011   | 0.008   | 0.022

We have run some experiments with different β and T values and plotted the results in Fig. 3.
The error is not sensitive to them. Carefully tuning β and T could obtain better results than the current settings (T = 2), but the improvement is expected to be small.

We evaluate the impact of different split point locations in ResNet-32 with 2-head simple ILR sharing on CIFAR-10, and summarize the results in Table 2.
References

[1] S. Gross and M. Wilber. Training and investigating residual nets. https://github.com/facebook/fb.resnet.torch, 2016.

[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[3] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 2016.

[4] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[5] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In the 30th International Conference on Machine Learning (ICML), pages 1139–1147, 2013.
Figure 2: Per-layer weight distribution in trained ResNet-50 with the baseline and the per-layer weight decay setting. For the baseline, the original weight decay value is used in all layers. For the per-layer weight decay setting, the weight decay is halved in the conv1, conv2_x, and conv3_x layers and kept at its original value otherwise. However, the top-1 error rate of this setting is 23.5%, not improved over the baseline.

Table 2: Error of ResNet-32 on CIFAR-10 with different split point locations.
Simple ILR sharing is applied with two heads. RB is short for residual block.

             Before RB 1 | After RB 1 | After RB 2 | After RB 3
Error (%)    6.25 ±      | ±          | ±          | ±