Deep Mutual Learning
Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu
Dalian University of Technology, China; Queen Mary University of London, UK; University of Edinburgh, UK
{ying.zhang, t.xiang}@qmul.ac.uk, [email protected], [email protected]
Abstract
Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network that is better suited to low-memory or fast execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one-way transfer between a static pre-defined teacher and a student, an ensemble of students learn collaboratively and teach each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on the CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, it is revealed that no prior powerful teacher network is necessary – mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
Introduction

Deep neural networks achieve state-of-the-art performance on many problems, but are often very large in depth or width, and contain large numbers of parameters [6, 25]. This has the drawback that they may be slow to execute or demand large memory to store, limiting their use in applications or platforms with low memory or fast execution requirements. This has led to a rapidly growing area of research on smaller and faster models. Achieving compact yet accurate models has been approached in a variety of ways including explicit frugal architecture design [8], model compression [20], pruning [13], binarisation [18] and, most interestingly, model distillation [7].

Distillation-based model compression relates to the observation [3, 2] that small networks often have the same representation capacity as large networks; compared to large networks, they are simply harder to train and to find the right parameters that realise the desired function. That is, the limitation seems to lie in the difficulty of optimisation rather than in the network size [2]. To better learn a small network, the distillation approach starts with a powerful (deep and/or wide) teacher network (or network ensemble), and then trains a smaller student network to mimic the teacher [7, 2, 16, 3]. Mimicking the teacher's class probabilities [7] and/or feature representation [2, 19] conveys additional information beyond the conventional supervised learning target. The optimisation problem of learning to mimic the teacher turns out to be easier than learning the target function directly, and the much smaller student can match or even outperform [19] the larger teacher.

In this paper we explore a different but related idea to model distillation – that of mutual learning. Distillation starts with a powerful large and pre-trained teacher network and performs one-way knowledge transfer to a small untrained student. In contrast, in mutual learning we start with a pool of untrained students who learn simultaneously to solve the task together. Specifically, each student is trained with two losses: a conventional supervised learning loss, and a mimicry loss that aligns each student's class posterior with the class probabilities of other students. Trained in this way, it turns out that each student in such a peer-teaching scenario learns significantly better than when learning alone in a conventional supervised learning scenario. Moreover, student networks trained in this way achieve better results than students trained by conventional distillation from a larger pre-trained teacher. Furthermore, while the conventional understanding of distillation requires a teacher larger and more powerful than the intended student, it turns out that in many cases mutual learning of several large networks also improves performance compared to independent learning.

It is perhaps not obvious why the proposed procedure should work at all. Where does the additional knowledge come from, when the learning process starts out with all small and untrained student networks? Why does it converge to a good solution rather than being hamstrung by groupthink as 'the blind lead the blind'?
Some intuition about these questions can be gained by considering the following. Each student is primarily directed by a conventional supervised learning loss, which means that its performance generally increases and the cohort cannot drift arbitrarily into groupthink. With supervised learning, all networks soon predict the same (true) labels for each training instance; but since each network starts from a different initial condition, their estimates of the probabilities of the next most likely classes vary. It is these secondary quantities that provide the extra information in distillation [7] as well as in mutual learning. In mutual learning, the student cohort effectively pools its collective estimate of the next most likely classes. Finding out – and matching – the other most likely classes for each training instance according to their peers increases each student's posterior entropy [4, 17], which helps them converge to more robust (flatter) minima with better generalisation to testing data. This is related to very recent work on the robustness of high-posterior-entropy solutions (network parameter settings) in deep learning [4, 17], but with a much more informed choice of alternatives than blind entropy regularisation.

Overall, mutual learning provides a simple but effective way to improve the generalisation ability of a network by training collaboratively with a cohort of other networks. Compared with distillation by a pre-trained static large network, collaborative learning by small peers even achieves better performance. Furthermore, we observe that: (i) the efficacy increases with the number of networks in the cohort (by training on small networks only, more of them can fit on one GPU for effective mutual learning); (ii) it applies to a variety of network architectures, and to heterogeneous cohorts consisting of mixed big and small networks; and (iii) even large networks mutually trained in a cohort improve performance compared to independent training. Finally, we note that while our focus is on obtaining a single effective network, the entire cohort can also be used as a highly effective ensemble model.

Related Work
The distillation-based approach to model compression was proposed over a decade ago [3] but was recently re-popularised by [7], which presented some additional intuition about why it works: the higher-entropy soft targets provide extra supervision and regularisation. Initially, a common application was to distill the function approximated by a powerful model/ensemble teacher into a single neural network student [3, 7]. Later, the idea was applied to distill powerful and easy-to-train large networks into small but harder-to-train networks [19] that can even outperform their teacher. Recently, distillation has been connected more systematically to information learning theory [15] and SVM+ [22], where an intelligent teacher provides privileged information to the student. Here we address dispensing with the teacher altogether, and allow an ensemble of students to teach each other in mutual distillation.

Other related ideas include dual learning [5], where two cross-lingual translation models teach each other interactively. However, this only applies to special translation problems where an unconditional within-language model is available to evaluate the quality of the predictions and ultimately provides the supervision that drives the learning process. In contrast, our mutual learning approach applies to general classification problems. While conventional wisdom about ensembles prioritises diversity [12], our mutual learning approach reduces diversity in the sense that all students become somewhat more similar by learning to mimic each other. However, our goal is not necessarily to produce a diverse ensemble, but to enable networks to find robust solutions that generalise well to testing data, which would otherwise be hard to find through conventional supervised learning.

Figure 1: Deep Mutual Learning (DML) schematic. Each network is trained with a supervised learning loss and a KLD-based mimicry loss to match the probability estimates of its peers.

We formulate the proposed DML approach with a cohort of two networks, Θ1 and Θ2 (see Fig. 1). Extension to more networks is straightforward (see Sec. 2.3). Given N samples X = \{x_i\}_{i=1}^{N} from M classes,
we denote the corresponding label set as Y = \{y_i\}_{i=1}^{N} with y_i \in \{1, 2, ..., M\}. The probability of class m for sample x_i given by a neural network Θ1 is computed as

p_1^m(x_i) = \frac{\exp(z_1^m)}{\sum_{m=1}^{M} \exp(z_1^m)},    (1)

where the logit z_1^m is the output of the "softmax" layer in Θ1.

For multi-class classification, the objective function for training the network Θ1 is defined as the cross-entropy error between the predicted values and the correct labels,

L_{C_1} = - \sum_{i=1}^{N} \sum_{m=1}^{M} I(y_i, m) \log\big(p_1^m(x_i)\big),    (2)

with the indicator function I defined as I(y_i, m) = 1 if y_i = m, and 0 otherwise.

The conventional supervised loss trains the network to predict the correct labels for the training instances. To improve the generalisation performance of Θ1 on testing instances, we use another peer network Θ2 to provide training experience in the form of its posterior probability p_2. To measure the match between the two networks' predictions p_1 and p_2, we adopt the Kullback-Leibler (KL) divergence. The KL distance from p_1 to p_2 is computed as

D_{KL}(p_2 \| p_1) = \sum_{i=1}^{N} \sum_{m=1}^{M} p_2^m(x_i) \log \frac{p_2^m(x_i)}{p_1^m(x_i)}.    (3)

The overall loss function L_{Θ_1} for network Θ1 is then defined as

L_{\Theta_1} = L_{C_1} + D_{KL}(p_2 \| p_1).    (4)

Similarly, the loss function L_{Θ_2} for network Θ2 is computed as

L_{\Theta_2} = L_{C_2} + D_{KL}(p_1 \| p_2).    (5)

In this way each network learns both to correctly predict the true labels of the training instances (supervised loss L_C) and to match the probability estimate of its peer (KL mimicry loss).

The mutual learning strategy is performed in each mini-batch based model update step and throughout the whole training process. At each iteration, we compute the predictions of the two models and update both networks' parameters according to the predictions of the other. The optimisation of Θ1 and Θ2 is conducted iteratively until convergence. The optimisation details are summarised in Algorithm 1 below.
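For concreteness, the per-mini-batch losses of Eqs. (1)-(5) can be sketched as follows. This is an illustrative re-expression in PyTorch (the paper's own implementation is in TensorFlow [1]); the function and variable names are ours, and per-batch means are used in place of the sums over all N samples, which only rescales the objective.

```python
# Minimal sketch of the two-network DML losses (Eqs. 1-5); illustrative only.
import torch
import torch.nn.functional as F

def dml_losses(logits1, logits2, labels):
    """Return (L_Theta1, L_Theta2) for one mini-batch.

    logits1, logits2: [batch, M] raw outputs of the two networks.
    labels:           [batch]    ground-truth class indices.
    """
    # Supervised cross-entropy losses L_C1 and L_C2 (Eq. 2).
    ce1 = F.cross_entropy(logits1, labels)
    ce2 = F.cross_entropy(logits2, labels)

    # Class posteriors p1, p2 via softmax (Eq. 1), kept in log space for KL.
    log_p1 = F.log_softmax(logits1, dim=1)
    log_p2 = F.log_softmax(logits2, dim=1)
    p1, p2 = log_p1.exp(), log_p2.exp()

    # KL mimicry terms (Eq. 3): for network 1 the target is p2, i.e. D_KL(p2 || p1).
    # The peer's posterior is detached, i.e. treated as a fixed target for this update,
    # matching the alternating updates of Algorithm 1. 'batchmean' averages over the
    # mini-batch rather than summing over all samples as in Eq. (3).
    kl_2_to_1 = F.kl_div(log_p1, p2.detach(), reduction='batchmean')
    kl_1_to_2 = F.kl_div(log_p2, p1.detach(), reduction='batchmean')

    # Overall losses (Eqs. 4 and 5).
    return ce1 + kl_2_to_1, ce2 + kl_1_to_2
```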
Algorithm 1: Deep Mutual Learning
Input: Training set X, label set Y, learning rates γ_{1,t} and γ_{2,t}.
Initialize: Models Θ1 and Θ2 to different initial conditions.
Repeat:
  t = t + 1
  Randomly sample a mini-batch of data x from X.
  Update the predictions p_1 and p_2 of x by (1) for the current mini-batch.
  Compute the stochastic gradient and update Θ1:
    \Theta_1 \leftarrow \Theta_1 - \gamma_{1,t} \frac{\partial L_{\Theta_1}}{\partial \Theta_1}    (6)
  Update the predictions p_1 and p_2 of x by (1) for the current mini-batch.
  Compute the stochastic gradient and update Θ2:
    \Theta_2 \leftarrow \Theta_2 - \gamma_{2,t} \frac{\partial L_{\Theta_2}}{\partial \Theta_2}    (7)
Until: convergence
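A compact sketch of Algorithm 1 in the same illustrative PyTorch style follows, reusing dml_losses from above. The data loader, optimiser choice and hyperparameter values here are placeholders rather than the exact settings of our experiments (see Implementation Details).

```python
# Sketch of Algorithm 1: alternating mini-batch updates for a two-network cohort.
# net1, net2 can be any classifiers returning logits; names are illustrative.
import torch

def train_dml(net1, net2, loader, lr1=0.1, lr2=0.1, epochs=200, device='cuda'):
    net1.to(device); net2.to(device)
    opt1 = torch.optim.SGD(net1.parameters(), lr=lr1, momentum=0.9, nesterov=True)
    opt2 = torch.optim.SGD(net2.parameters(), lr=lr2, momentum=0.9, nesterov=True)
    for _ in range(epochs):
        for x, y in loader:                      # randomly sampled mini-batch from X
            x, y = x.to(device), y.to(device)
            # Compute predictions p1, p2 by Eq. (1) and update Theta1 (Eq. 6).
            loss1, _ = dml_losses(net1(x), net2(x), y)
            opt1.zero_grad(); loss1.backward(); opt1.step()
            # Recompute predictions with the updated Theta1 and update Theta2 (Eq. 7).
            _, loss2 = dml_losses(net1(x), net2(x), y)
            opt2.zero_grad(); loss2.backward(); opt2.step()
```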
The proposed DML approach naturally extends to more networks in the student cohort. Given K networks Θ1, Θ2, ..., ΘK (K ≥ 2), the objective function for optimising Θk (1 ≤ k ≤ K) becomes

L_{\Theta_k} = L_{C_k} + \frac{1}{K-1} \sum_{l=1, l \neq k}^{K} D_{KL}(p_l \| p_k).    (8)

Equation (8) indicates that with K networks, DML for each student effectively takes the other K−1 networks in the cohort as K−1 teachers to provide learning experience. Equation (4) is now a special case of (8) with K = 2. Note that we have added the coefficient 1/(K−1) to make sure that the training is mainly directed by supervised learning of the true labels. The optimisation for DML with more than two networks is a straightforward extension of Algorithm 1. It can be distributed by learning each network on one device and passing the small probability vectors between devices.

With more than two networks, an interesting alternative learning strategy for DML is to take the ensemble of all the other K−1 networks as a single teacher to provide an averaged learning experience, which would be very similar to the distillation approach but performed at each mini-batch model update. The objective function of Θk can then be written as

L_{\Theta_k} = L_{C_k} + D_{KL}(p_{avg} \| p_k),  where  p_{avg} = \frac{1}{K-1} \sum_{l=1, l \neq k}^{K} p_l.    (9)

In our experiments (see Sec. 3.6), we find that this DML strategy with a single ensemble teacher, or DML_e, leads to worse performance than DML with K−1 teachers. This is because the model averaging step in Equation (9) makes the teacher's posterior probabilities more peaked at the true class, thus reducing the posterior entropy over all classes. It is therefore contradictory to one of the objectives of DML, which is to produce robust solutions with high posterior entropy.
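The difference between the per-peer mimicry of Eq. (8) and the ensemble-teacher variant of Eq. (9) can be sketched as follows; the supervised cross-entropy term L_{C_k} is added exactly as in the two-network case, and the function names are of our choosing.

```python
# Sketch of the mimicry terms for a K-network cohort: Eq. (8) (each peer as a
# separate teacher) versus Eq. (9) (ensemble of peers as one teacher, DML_e).
# logits_list holds the [batch, M] logits of all K networks for one mini-batch.
import torch
import torch.nn.functional as F

def mimicry_loss_dml(logits_list, k):
    """Average KL from each peer's posterior to network k's posterior (Eq. 8)."""
    log_pk = F.log_softmax(logits_list[k], dim=1)
    peers = [F.softmax(l, dim=1).detach()
             for i, l in enumerate(logits_list) if i != k]
    kls = [F.kl_div(log_pk, p, reduction='batchmean') for p in peers]
    return sum(kls) / len(kls)                       # the 1/(K-1) coefficient

def mimicry_loss_dml_e(logits_list, k):
    """KL from the averaged peer posterior p_avg to network k's posterior (Eq. 9)."""
    log_pk = F.log_softmax(logits_list[k], dim=1)
    peers = [F.softmax(l, dim=1).detach()
             for i, l in enumerate(logits_list) if i != k]
    p_avg = torch.stack(peers).mean(dim=0)           # more peaked at the true class
    return F.kl_div(log_pk, p_avg, reduction='batchmean')
```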
Two datasets are used in our experiments. The CIFAR-100 [11] dataset consists of 32×32 colour images drawn from 100 classes, which are split into 50,000 training and 10,000 test images. The Top-1 classification accuracy is reported. The Market-1501 [27] dataset is widely used for the person re-identification problem, which aims to associate people across different non-overlapping camera views. It contains 32,668 images of 1,501 identities captured from six camera views, with 751 identities for training and 750 identities for testing. Following state-of-the-art approaches to this problem [28], we train the network for 751-way classification and use the resulting feature of the last pooling layer as a representation for nearest-neighbour matching at test time. This is a more challenging dataset than CIFAR-100 because the task is instance recognition and thus more fine-grained, and because the dataset is smaller with more classes. For evaluation, the standard Cumulative Matching Characteristic (CMC) Rank-k accuracy and mean average precision (mAP) metrics [27] are used.
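As an illustration of this test protocol, a hypothetical sketch of feature extraction and gallery ranking is given below. The use of L2-normalised features and cosine similarity is an assumption made for the example, not a detail specified above.

```python
# Hypothetical sketch of Market-1501 test-time matching: the trained classifier's
# last pooling-layer output is used as a descriptor, and gallery images are ranked
# for each query by similarity in that feature space. Names are ours.
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(backbone, images):
    """backbone(images) is assumed to return the last pooling-layer activations [N, D]."""
    feats = backbone(images)
    return F.normalize(feats, dim=1)          # L2-normalise (assumed) for cosine matching

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from best to worst match for one query."""
    sims = gallery_feats @ query_feat         # cosine similarity, shape [N_gallery]
    return torch.argsort(sims, descending=True)
```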
Implementation Details
We implement all networks and training procedures in TensorFlow [1] and conduct all experiments on an NVIDIA GeForce GTX 1080 GPU. For CIFAR-100, we follow the experimental settings of [25]. Specifically, we use SGD with Nesterov momentum and set the initial learning rate to 0.1, momentum to 0.9 and mini-batch size to 64. The learning rate is dropped by 0.1 every 60 epochs and we train for 200 epochs. The data augmentation includes horizontal flips and random crops from images padded by 4 pixels on each side, filling missing pixels with reflections of the original image. For Market-1501, we use the Adam optimiser [10] with a mini-batch size of 16. We train all models for 100,000 iterations. We also report results with and without pre-training on ImageNet.
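For illustration, the CIFAR-100 schedule and augmentation described above could be expressed as in the following sketch. It uses PyTorch/torchvision for brevity even though our implementation is in TensorFlow [1], and it reads "dropped by 0.1" as multiplication by a factor of 0.1; the scheduler is stepped once per epoch.

```python
# Sketch of the CIFAR-100 training configuration described above; illustrative only.
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4, padding_mode='reflect'),  # 4-pixel reflection padding
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def make_optimiser(net):
    # SGD with Nesterov momentum, initial lr 0.1, multiplied by 0.1 every 60 epochs,
    # 200 epochs in total.
    opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9, nesterov=True)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60, 120, 180], gamma=0.1)
    return opt, sched
```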
Model Size

The networks used in our experiments include compact networks of typical student size, ResNet-32 [6] and MobileNet [8], as well as large networks of typical teacher size, InceptionV1 [21] and the Wide ResNet WRN-28-10 [25]. Table 1 compares the number of parameters of all the networks on CIFAR-100.

Table 1: Number of parameters of the networks used on CIFAR-100 (ResNet-32, MobileNet, InceptionV1, WRN-28-10).
Table 2 compares the Top-1 accuracy on the CIFAR-100 dataset obtained by various architectures in a two-network DML cohort. From the table we can make the following observations: (i) All the different network combinations among ResNet-32, MobileNet and WRN-28-10 improve performance when learning in a cohort compared to learning independently, as indicated by the all-positive values in the "DML − Independent" columns. (ii) The networks with smaller capacity (ResNet-32 and MobileNet) generally benefit more from DML. (iii) Although WRN-28-10 is a much larger network than MobileNet or ResNet-32 (Table 1), it still benefits from being trained together with a smaller peer. (iv) Training a cohort of large networks (WRN-28-10) is still beneficial compared to learning alone. Thus, in contrast to the conventional wisdom of model distillation, we see that a large pre-trained teacher is not necessary to obtain benefits, and multiple large networks can still benefit from our distillation-like process.

Table 2: Top-1 accuracy (%) on the CIFAR-100 dataset. "DML − Independent" measures the difference in accuracy between the network learned with DML and the same network learned independently.

Net 1      | Net 2     | Independent Net 1 | Independent Net 2 | DML Net 1 | DML Net 2 | DML−Ind. Net 1 | DML−Ind. Net 2
ResNet-32  | ResNet-32 | 68.99 | 68.99 | 71.19 | 70.75 | 2.20 | 1.76
WRN-28-10  | ResNet-32 | 78.69 | 68.99 | 78.96 | 70.73 | 0.27 | 1.74
MobileNet  | ResNet-32 | 73.65 | 68.99 | 76.13 | 71.10 | 2.48 | 2.11
MobileNet  | MobileNet | 73.65 | 73.65 | 76.21 | 76.10 | 2.56 | 2.45
WRN-28-10  | MobileNet | 78.69 | 73.65 | 80.28 | 77.39 | 1.59 | 3.74
WRN-28-10  | WRN-28-10 | 78.69 | 78.69 | 80.28 | 80.08 | 1.59 | 1.39
Table 3 summarises the mAP (%) and Rank-1 accuracy (%) on Market-1501 with and without DML, as well as a comparison against existing state-of-the-art methods. Each MobileNet is trained in a two-network cohort and the averaged performance of the two networks in the cohort is reported. We can see that DML greatly improves the performance of MobileNet compared to independent learning, both with and without pre-training on ImageNet. It can also be seen that the proposed DML approach trained with two MobileNets significantly outperforms prior state-of-the-art deep learning methods.

Table 3: Comparative results on the Market-1501 dataset.

Method                       | ImageNet pretrain? | Single-Query mAP | Single-Query Rank-1 | Multi-Query mAP | Multi-Query Rank-1
CAN [14]                     | yes | 24.40 | 48.20 | -     | -
Gated S-CNN [23]             | no  | 39.55 | 65.88 | 48.45 | 76.04
Siamese LSTM [24]            | no  | -     | -     | 35.30 | 61.60
k-reciprocal re-ranking [28] | yes | 63.63 | 77.11 | -     | -
MobileNet                    | no  | 46.07 | 72.18 | 54.31 | 79.81
MobileNet + DML              | no  | 52.15 | 76.90 | 60.97 | 83.48
MobileNet                    | yes | 60.68 | 83.94 | 68.25 | 87.89
MobileNet + DML              | yes |       |       |       |
As our method is strongly related to model distillation, we next provide a focused comparison with distillation [7]. Table 4 compares our DML with model distillation, where the teacher network (Net 1) is pre-trained and provides fixed posterior targets for the student network (Net 2). As expected, the conventional distillation approach with a powerful pre-trained teacher does indeed improve the student's performance compared to learning the student independently ("1 distills 2" versus Net 2 Independent). However, the results show that not only is a pre-trained teacher unnecessary, but training both networks together with deep mutual learning provides a clear improvement compared to distillation ("1 distills 2" versus DML Net 2). This implies that in the process of mutual learning the network that would play the role of teacher actually becomes better than a pre-trained teacher, via learning from interactions with an a priori untrained student. Finally, we note that on Market-1501 training two compact MobileNets together provides a similar boost over independent learning to mutual learning with InceptionV1 and MobileNet: peer teaching of small networks can be highly effective. In contrast, using the same network as teacher in model distillation actually makes the student worse than independent learning (the last row: 1 distills 2 (45.16) vs. Net 2 Independent (46.07)).

Table 4: Comparison with distillation on CIFAR-100 (Top-1 accuracy (%)) and Market-1501 (mAP (%)).

Dataset     | Net 1       | Net 2     | Independent Net 1 | Independent Net 2 | 1 distills 2 (Net 2) | DML Net 1 | DML Net 2
CIFAR-100   | WRN-28-10   | ResNet-32 | 78.69 | 68.99 | 69.48 | 78.96 | 70.73
CIFAR-100   | MobileNet   | ResNet-32 | 73.65 | 68.99 | 69.12 | 76.13 | 71.10
Market-1501 | InceptionV1 | MobileNet | 65.26 | 46.07 | 49.11 | 65.34 | 52.87
Market-1501 | MobileNet   | MobileNet | 46.07 | 46.07 | 45.16 | 52.95 | 51.26
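For reference, a minimal sketch of the fixed-teacher distillation baseline ("1 distills 2") follows. The temperature and loss weighting are placeholders: [7] softens both posteriors with a temperature (and rescales gradients accordingly), and the exact values used in this comparison are not restated here.

```python
# Sketch of the distillation baseline: the pre-trained teacher's posteriors are
# frozen targets for the student; contrast with DML, where both networks learn.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=1.0, alpha=1.0):
    # Supervised cross-entropy on the true labels.
    ce = F.cross_entropy(student_logits, labels)
    # KL to the fixed teacher posterior (teacher_logits come from a frozen network).
    log_ps = F.log_softmax(student_logits / temperature, dim=1)
    p_t = F.softmax(teacher_logits.detach() / temperature, dim=1)
    kl = F.kl_div(log_ps, p_t, reduction='batchmean')
    # The T^2 gradient rescaling of [7] is omitted in this sketch.
    return ce + alpha * kl
```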
The prior experiments studied cohorts of two students. In this experiment, we study how DML scales with more students in the cohort. Figure 2(a) shows the results on Market-1501 with DML training of MobileNet cohorts of increasing size. The figure shows the average mAP as well as the standard deviation. From Fig. 2(a) we can see that the mAP of the average single network increases with the number of networks in the cohort under DML, and so does its gap to the independently trained networks. This demonstrates that the generalisation ability of students is enhanced when learning together with increasing numbers of peers. From the standard deviations we can also see that the results become more stable with an increasing number of networks in DML.

A common technique when training multiple networks is to group them as an ensemble and make a combined prediction. In Fig. 2(b) we use the same models as in Fig. 2(a) but make predictions based on the ensemble (matching based on the concatenated features of all members) instead of reporting the average prediction of each individual. From the results we can see that the ensemble prediction outperforms individual network predictions as expected (Fig. 2(b) vs. (a)). Moreover, the ensemble predictions also benefit from training multiple networks as a cohort (Fig. 2(b) DML ensemble vs. Independent ensemble). The ability of DML to improve model ensembles (Fig. 2) suggests that it may be a generally useful technique in applications where model ensembles are standard practice, as there is minimal additional cost if ensembles are already used.
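The ensemble prediction used in Fig. 2(b) can be sketched as follows; L2-normalising each member's feature before concatenation is an assumption made for the example.

```python
# Sketch of the cohort ensemble for re-ID matching: each member's descriptor is
# extracted and the concatenation is used as the ensemble feature.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_feature(backbones, images):
    feats = [F.normalize(b(images), dim=1) for b in backbones]  # one [N, D] per member
    return torch.cat(feats, dim=1)                              # [N, K * D]
```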
Figure 2: Performance (mAP (%)) on Market-1501 with different numbers of networks in the cohort: (a) DML vs. independent training; (b) DML ensemble vs. independent ensemble.
In this section we attempt to give some insight into how and why our deep mutual learning strategy works. There has been a wave of recent research on the question of why deep networks generalise [4, 26, 9], which has provided insights such as: while there are often many solutions (deep network parameter settings) that achieve zero training error, some of these generalise better than others because they lie in wide valleys rather than narrow crevices [4, 9], so that small perturbations do not change the prediction efficacy drastically; deep networks are better than might be expected at finding these good solutions [26]; and the tendency towards finding robust minima can be enhanced by biasing deep networks towards solutions with higher posterior entropy [4, 17].
Better Quality Solutions with More Robust Minima
With these insights in mind, we make some observations about the DML process. Firstly, we note that in our applications the networks fit the training data perfectly: training accuracy goes to 100% and the classification loss becomes minimal (Fig. 3(a)). However, as we saw earlier, DML performs better on test data. Therefore, rather than helping to find a better (deeper) minimum of the training loss, DML appears to be helping us to find a wider and more robust minimum that generalises better to test data. Inspired by [4, 9], we perform a simple test to analyse the robustness of the discovered minima on Market-1501 using MobileNet. For the DML and independent models, we compare the training loss of the learned models before and after adding independent Gaussian noise with variable standard deviation σ to each model parameter. We see that the depths of the two minima are the same (Fig. 3(a)), but after adding this perturbation the training loss of the independent model jumps up while the loss of the DML model increases much less. This suggests that the DML model has found a much wider minimum, which is expected to provide better generalisation performance [4, 17].
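This perturbation test can be sketched as follows; eval_training_loss is an assumed helper that averages the classification loss over the training set, and the comparison is against the unperturbed loss eval_training_loss(net).

```python
# Sketch of the minima-width probe: add i.i.d. Gaussian noise of standard deviation
# sigma to every parameter of a copy of the model and re-measure the training loss.
import copy
import torch

@torch.no_grad()
def perturbed_training_loss(net, eval_training_loss, sigma):
    noisy = copy.deepcopy(net)                    # leave the trained model untouched
    for p in noisy.parameters():
        p.add_(torch.randn_like(p) * sigma)       # independent Gaussian perturbation
    return eval_training_loss(noisy)
```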
How a Better Minima is Found

How does DML help to find such a better minimum? When asking each network to match its peers' probability estimates, mismatches where a given network predicts zero and its teacher/peer predicts non-zero are heavily penalised. Therefore the overall effect of DML is that, where each network independently would put a small mass on a small set of secondary class probabilities, all networks in DML tend to aggregate their predictions of secondary probabilities, and both (i) put more mass on the secondary probabilities altogether, and (ii) place non-zero mass on more distinct secondary probabilities. We illustrate this effect in Fig. 3(c) by comparing the probabilities assigned to the top-5 highest ranked classes by a ResNet-32 trained on CIFAR-100 with DML and by an independently trained ResNet-32. For each training sample, the top 5 classes are ranked according to the posterior probabilities produced by the model (Class 1 being the true class, Class 2 the second most probable class, and so on).
Figure 3: Analysis of why DML works: (a) training loss; (b) loss change given parameter noise; (c) posterior certainty comparison.

Here we can see that the assignment of mass to probabilities below the Top-1 decays much more quickly for independent learning than for DML. This can be quantified by the posterior entropy averaged over all training samples, which is 1.7099 for the DML-trained model versus 0.2602 for the independently trained model. Thus our method is connected to entropy regularisation-based approaches [4, 17] to finding wide minima, but through mutual probability matching on 'reasonable' alternatives rather than a blind high-entropy preference.
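The entropy statistic quoted above can be computed as in the following sketch, where logits denotes the model outputs for all N training samples.

```python
# Sketch of the mean posterior entropy: the entropy of the predicted class posterior,
# averaged over all training samples. logits has shape [N, M].
import torch
import torch.nn.functional as F

def mean_posterior_entropy(logits):
    log_p = F.log_softmax(logits, dim=1)
    entropy = -(log_p.exp() * log_p).sum(dim=1)   # per-sample entropy (in nats)
    return entropy.mean()
```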
DML with Ensemble Teacher
In our DML strategy, each student is taught by every other student in the cohort individually, regardless of how many students are in the cohort (Eq. (8)). In Sec. 2.3, an alternative DML strategy was discussed, in which each student is asked to match the predictions of the ensemble of all other students in the cohort (Eq. (9)). One might reasonably expect this approach to be better: as the ensemble prediction is better than individual predictions, it should provide a cleaner and stronger teaching signal, more like conventional distillation. In practice, the results of ensemble rather than peer teaching are worse (see Fig. 4(a)). Analysing the teaching signal shows that the ensemble target is much more sharply peaked on the true label than the peer targets, resulting in a larger prediction entropy for DML than for DML_e (see Fig. 4(b)). Thus, while the noise-averaging property of ensembling is effective for making a correct prediction, it is actually detrimental to providing a teaching signal in which the secondary class probabilities are the salient cue, and in which a high-entropy posterior leads to more robust solutions during model training.
Figure 4: Comparison of DML with each individual peer student as a teacher versus DML with the peer-student ensemble as the teacher (DML_e), for 5 MobileNets trained on Market-1501: (a) averaged mAP results; (b) prediction entropy.
Conclusion

We have proposed a simple and generally applicable approach to improving the performance of deep neural networks by training them in a cohort with peers and mutual distillation. With this approach we can obtain compact networks that perform better than those distilled from a strong but static teacher. One application of DML is thus to obtain compact, fast and effective networks. We also showed that this approach is promising for improving the performance of large powerful networks, and that the network cohort trained in this manner can be combined as an ensemble to further improve performance.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016. URL http://arxiv.org/abs/1603.04467.
[2] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, 2014.
[3] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.
[4] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In Proceedings of International Conference on Learning Representations, 2017.
[5] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
[8] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[9] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In Proceedings of International Conference on Learning Representations, 2017.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning Representations, 2015.
[11] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2009.
[12] Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003.
[13] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In Proceedings of International Conference on Learning Representations, 2017.
[14] Hao Liu, Jiashi Feng, Meibin Qi, Jianguo Jiang, and Shuicheng Yan. End-to-end comparative attention networks for person re-identification. CoRR, abs/1606.04404, 2016.
[15] David Lopez-Paz, Leon Bottou, Bernhard Scholkopf, and Vladimir Vapnik. Unifying distillation and privileged information. In Proceedings of International Conference on Learning Representations, 2016.
[16] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Proceedings of International Conference on Learning Representations, 2016.
[17] Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. In ICLR Workshops, 2017.
[18] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of European Conference on Computer Vision, 2016.
[19] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In Proceedings of International Conference on Learning Representations, 2015.
[20] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of International Conference on Learning Representations, 2016.
[21] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[22] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: Similarity control and knowledge transfer. 2015.
[23] Rahul Rama Varior, Mrinal Haloi, and Gang Wang. Gated siamese convolutional neural network architecture for human re-identification. In Proceedings of European Conference on Computer Vision, pages 791–808, 2016.
[24] Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, and Gang Wang. A siamese long short-term memory architecture for human re-identification. In Proceedings of European Conference on Computer Vision, pages 135–153, 2016.
[25] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of British Machine Vision Conference, 2016.
[26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of International Conference on Learning Representations, 2017.
[27] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of IEEE International Conference on Computer Vision, 2015.
[28] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.