Heterogeneous Knowledge Distillation using Information Flow Modeling
N. Passalis, M. Tzelepi and A. Tefas
Department of Informatics, Aristotle University of Thessaloniki, Greece
{passalis, mtzelepi, tefas}@csd.auth.gr

Abstract
Knowledge Distillation (KD) methods are capable of transferring the knowledge encoded in a large and complex teacher into a smaller and faster student. Early methods were usually limited to transferring the knowledge only between the last layers of the networks, while latter approaches were capable of performing multi-layer KD, further increasing the accuracy of the student. However, despite their improved performance, these methods still suffer from several limitations that restrict both their efficiency and flexibility. First, existing KD methods typically ignore that neural networks undergo through different learning phases during the training process, which often requires different types of supervision for each one. Furthermore, existing multi-layer KD methods are usually unable to effectively handle networks with significantly different architectures (heterogeneous KD). In this paper we propose a novel KD method that works by modeling the information flow through the various layers of the teacher model and then train a student model to mimic this information flow. The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process, as well as by designing and training an appropriate auxiliary teacher model that acts as a proxy model capable of "explaining" the way the teacher works to the student. The effectiveness of the proposed method is demonstrated using four image datasets and several different evaluation setups.
1. Introduction
Despite the tremendous success of Deep Learning (DL) in a wide range of domains [12], most DL methods suffer from a significant drawback: powerful hardware is needed for training and deploying DL models. This significantly hinders DL applications in resource-scarce environments, such as embedded and mobile devices, leading to the development of various methods for overcoming these limitations. Among the most prominent methods for this task
is knowledge distillation (KD) [9], which is also known as knowledge transfer (KT) [27]. These approaches aim to transfer the knowledge encoded in a large and complex neural network into a smaller and faster one. In this way, it is possible to increase the accuracy of the smaller model, compared to the same model trained without employing KD. Typically, the smaller model is called the student model, while the larger model is called the teacher model.

Figure 1. [Plot annotations: "Forming critical connections", "Critical connections formed", "Further fitting and compression"; α ≤ 100, α ≈ 1.9, α < 0.01.] Existing knowledge distillation approaches ignore the existence of critical learning periods when transferring the knowledge, even when multi-layer transfer approaches are used. However, as argued in [1], the information plasticity rapidly declines after the first few training epochs, reducing the effectiveness of knowledge distillation. On the other hand, the proposed method models the information flow in the teacher network and provides the appropriate supervision during the first few critical learning epochs in order to ensure that the necessary connections between successive layers of the networks will be formed. Note that even though this process initially slows down the convergence of the network slightly (epochs 1-8), it allows for rapidly increasing the rate of convergence after the critical learning period ends (epochs 10-25). The parameter α controls the relative importance of transferring the knowledge from the intermediate layers during the various learning phases, as described in detail in Section 3.

Early KD approaches focused on transferring the knowledge between the last layer of the teacher and student models [4, 9, 16, 23, 25, 28]. This allowed for providing richer training targets to the student model, which capture more information regarding the similarities between different samples, reducing overfitting and increasing the student's accuracy. Later methods further increased the efficiency of KD by modeling and transferring the knowledge encoded in the intermediate layers of the teacher [19, 27, 29]. These approaches usually attempt to implicitly model the way information gets transformed through the various layers of a network, providing additional hints to the student model regarding the way that the teacher model processes the information.

Even though these methods were indeed able to further increase the accuracy of models trained with KD, they also suffer from several limitations that restrict both their efficiency and flexibility. First, note that neural networks exhibit an evolving behavior, undergoing several different and distinct phases during the training process. For example, during the first few epochs critical connections are formed [1], defining almost permanently the future information flow paths of a network. After fixing these paths, the training process can only fine-tune them, while forming new paths is significantly less probable after the critical learning period ends [1]. After forming these critical connections, the fitting and compression (when applicable) phases follow [21, 20]. Despite this dynamic time-dependent behavior of neural networks, virtually all existing KD approaches ignore the phases that neural networks undergo during training.
This observation leads us to the first research question of this paper: Is a different type of supervision needed during the different learning phases of the student, and is it possible to use a stronger teacher to provide this supervision?
To this end, we propose a simple, yet effective way to exploit KD to train a student that mimics the information flow paths of the teacher, while also providing further evidence confirming the existence of critical learning periods during the training phase of a neural network, as originally described in [1]. Indeed, as also demonstrated in the ablation study shown in Fig. 1, providing the correct supervision during the critical learning period of a neural network can have a significant effect on the overall training process, increasing the accuracy of the student model. More information regarding this ablation study is provided in Section 4. It is worth noting that the additional supervision, which is employed to ensure that the student will form similar information paths to the teacher, actually slows down the learning process until the critical learning period is completed. However, after the information flow paths are formed, the rate of convergence is significantly accelerated compared to student networks that do not take into account the existence of critical learning periods.
Figure 2. [Plot. Per-layer nearest centroid classifier (NCC) accuracies: ResNet-18 teacher: 46.40%, 69.19%, 86.18%, 92.14%; CNN-1 student: 41.04%, 48.82%, 59.70%, 65.95%; CNN-1-A auxiliary teacher: 43.09%, 57.39%, 65.75%, 74.71%. The plot also reports the student's top-1 precision at layer 4 without intermediate layer supervision and when supervised by each teacher layer, with annotated regions marking representation collapse, over-regularization, correct layer matching and a positive regularization effect.] Examining the effect of transferring the knowledge from different layers of a teacher model into the third layer of the student model. Two different teachers are used, a strong teacher (ResNet-18, where each layer refers to a layer block) and an auxiliary teacher (CNN-1-A). The nearest centroid classifier (NCC) accuracy is reported for the representations extracted from each layer (in order to provide an intuitive measure of how each layer transforms the representations extracted from the input data). The final precision is reported for a student model trained by either not using intermediate layer supervision (upper black values) or by using different layers of the teacher (4 subsequent precision values). Several different phenomena are observed when the knowledge is transferred from different layers, while the proposed auxiliary teacher allows for achieving the highest precision and provides a straightforward way to match the layers between the models (the auxiliary teacher transforms the data representations in a way that is closer to the student model, as measured through the NCC accuracy).

Another limitation of existing KD approaches that employ multiple intermediate layers is their inability to handle heterogeneous multi-layer knowledge distillation, i.e., to transfer the knowledge between teachers and students with vastly different architectures. Existing methods almost exclusively use network architectures that provide a trivial one-to-one matching between the layers of the student and teacher, e.g., ResNets with the same number of blocks are often used, altering only the number of layers inside each residual block [27, 29]. Many of these approaches, such as [27], are even more restrictive, also requiring the layers of the teacher and student to have the same dimensionality. As a result, it is especially difficult to perform multi-layer KD between networks with vastly different architectures, since even if just one layer of the teacher model is incorrectly matched to a layer of the student model, the accuracy of the student can be significantly reduced, either due to over-regularizing the network or by forcing the representations of the student to be compressed too early. This behavior is demonstrated in Fig. 2, where the knowledge is transferred from the 3rd layer of two different teachers to various layers of the student. These findings lead us to the second research question of this paper:
Is it possible to handle heterogeneous KD in a structured way to avoid such phenomena?
To this end, in this work, we propose a simple, yet effective approach for training an auxiliary teacher model, which is closer to the architecture of the student model. This auxiliary teacher is responsible for explaining the way the larger teacher works to the student model. Indeed, this approach can significantly increase the accuracy of the student, as demonstrated both in Fig. 2, as well as in the rest of the experiments conducted in this paper. It is worth noting that during our initial experiments it was almost impossible to find a layer matching that would actually help us to improve the accuracy of the student model without first designing an appropriate auxiliary teacher model, highlighting the importance of using auxiliary teachers in heterogeneous KD scenarios, as also highlighted in [15].

The main contribution of this paper is proposing a KD method that works by modeling the information flow through the teacher model and then training a student model to mimic this information flow. However, as explained previously and experimentally demonstrated in this paper, this process is often very difficult, especially when there is no obvious layer matching between the teacher and student models, which can often process the information in vastly different ways. In fact, even a single layer mismatch, i.e., overly regularizing the network or forcing an early compression of the representation, can significantly reduce the accuracy of the student model. To overcome these limitations, the proposed method works by a) designing and training an appropriate auxiliary teacher model that allows for a direct and effective one-to-one matching between the layers of the student and teacher models, as well as b) employing a critical learning-aware KD scheme that ensures that critical connections will be formed, allowing for effectively mimicking the teacher's information flow instead of just learning a student that mimics the output of the teacher.

The effectiveness of the proposed method is demonstrated using several different tasks, ranging from metric learning and classification to mimicking handcrafted feature extractors for providing fast neural network-based implementations for low-power embedded hardware. The experimental evaluation also includes an extensive representation learning evaluation, given its increasing importance in many embedded DL and robotic applications and following the evaluation protocol of recently proposed KD methods [16, 28]. An open-source implementation of the proposed method is provided at https://github.com/passalis/pkth.

The rest of the paper is structured as follows. First, the related work is briefly discussed and compared to the proposed method in Section 2. Then, the proposed method is presented in Section 3, while the experimental evaluation is provided in Section 4. Finally, conclusions are drawn in Section 5.
2. Related Work
A large number of knowledge transfer methods which build upon the neural network distillation approach have been proposed [2, 4, 9, 23, 25]. These methods typically use a teacher model to generate soft-labels and then use these soft-labels for training a smaller student network. It is worth noting that several extensions to this approach have been proposed. For example, soft-labels can be used for pre-training a large network [22] and performing domain adaptation [25], while an embedding-based approach for transferring the knowledge was proposed in [17]. Also, online distillation methods, such as [3, 30], employ a co-training strategy, training both the student and teacher models simultaneously. However, none of these approaches take into account that deep neural networks transition through several learning phases, each one with different characteristics, which requires handling them in different ways. On the other hand, the proposed method models the information flow in the teacher model and then employs a weighting scheme that provides the appropriate supervision during the initial critical learning period of the student, ensuring that the critical connections and information paths formed in the teacher model will be transferred to the student.

Furthermore, several methods that support multi-layer KD have been proposed, such as using hints [19], the flow of solution procedure (FSP) matrix [27], attention transfer [29], or singular value decomposition to extract major features from each layer [13]. However, these approaches usually only target networks with compatible architectures, e.g., residual networks with the same number of residual blocks for both the teacher and student models. Also, it is not straightforward to use them to successfully transfer the knowledge between heterogeneous models, since even a slight layer mismatch can have a devastating effect on the student's accuracy, as demonstrated in Fig. 2. It is also worth noting that we were actually unable to effectively apply most of these methods for heterogeneous KD, since either they do not support transferring the knowledge between layers of different dimensionality, e.g., [27], or they are prone to over-regularization or representation collapse (as demonstrated in Fig. 2), reducing the overall performance of the student.

In contrast with the aforementioned approaches, the proposed method provides a way to perform heterogeneous multi-layer KD by appropriately designing and training an auxiliary network and exploiting the knowledge encoded by the earlier layers of this network. In this way, the proposed method provides an efficient way for handling any possible network architecture by employing an auxiliary network that is close to the architecture of the student model, regardless of the architecture of the teacher model. Using the proposed auxiliary network strategy ensures that the teacher model will transform the representations extracted from the data in a way compatible with the student model, allowing for a one-to-one matching between the intermediate layers of the networks. It is also worth noting that the use of a similar auxiliary network, which is used as an intermediate step for KD, was also proposed in [15]. However, in contrast with the proposed method, the auxiliary network used in [15] was employed for merely improving the performance of KD between the final classification layers, instead of designing an auxiliary network that can facilitate efficient multi-layer KD, as proposed in this paper.
Finally, to the best of our knowledge, in this work we propose the first architecture-agnostic probabilistic KD approach that works by modeling the information flow through the various layers of the teacher model using a hybrid kernel formulation, can support heterogeneous network architectures, and can effectively supervise the student model during its critical learning period.
3. Proposed Method
Let $\mathcal{T} = \{\mathbf{t}_1, \mathbf{t}_2, \dots, \mathbf{t}_N\}$ denote the transfer set that contains $N$ transfer samples and is used to transfer the knowledge from the teacher model to the student model. Note that the proposed method can also work in a purely unsupervised fashion and, as a result, unlabeled data samples can also be used for transferring the knowledge. Also, let $\mathbf{x}^{(l)}_i = f(\mathbf{t}_i, l)$ denote the representation extracted from the $l$-th layer of the teacher model $f(\cdot)$ and $\mathbf{y}^{(l)}_i = g(\mathbf{t}_i, l, \mathbf{W})$ denote the representation extracted from the $l$-th layer of the student model $g(\cdot)$. Note that the trainable parameters of the student model are denoted by $\mathbf{W}$. The proposed method aims to train the student model $g(\cdot)$, i.e., learn the appropriate parameters $\mathbf{W}$, in order to "mimic" the behavior of $f(\cdot)$ as closely as possible.

Furthermore, let $X^{(l)}$ denote the random variable that describes the representation extracted from the $l$-th layer of the teacher model and $Y^{(l)}$ the corresponding random variable for the student model. Also, let $Z$ denote the random variable that describes the training targets for the teacher model. In this work, the information flow of the teacher network is defined as the progression of mutual information between every layer representation of the network and the training targets, i.e., $I(X^{(l)}, Z)\ \forall l$. Note that even though the training targets are required for modeling the information flow, they are not actually needed during the KD process, as we will demonstrate later. Then, we can define the information flow vector that characterizes the way the network processes information as:

$$\boldsymbol{\omega}_t := \left[ I(X^{(1)}, Z), \dots, I(X^{(N_{L_t})}, Z) \right]^T \in \mathbb{R}^{N_{L_t}}, \quad (1)$$

where $N_{L_t}$ is the number of layers of the teacher model. Similarly, the information flow vector for the student model is defined as:

$$\boldsymbol{\omega}_s := \left[ I(Y^{(1)}, Z), \dots, I(Y^{(N_{L_s})}, Z) \right]^T \in \mathbb{R}^{N_{L_s}}, \quad (2)$$

where again $N_{L_s}$ is the number of layers of the student model. The proposed method works by minimizing the divergence between the information flow in the teacher and student models, i.e., $\mathcal{D}_F(\boldsymbol{\omega}_s, \boldsymbol{\omega}_t)$, where $\mathcal{D}_F(\cdot)$ is a metric used to measure the divergence between two, possibly heterogeneous, networks. To this end, the information flow divergence is defined as the sum of squared differences between each paired element of the information flow vectors:

$$\mathcal{D}_F(\boldsymbol{\omega}_s, \boldsymbol{\omega}_t) = \sum_{i=1}^{N_{L_s}} \left( [\boldsymbol{\omega}_s]_i - [\boldsymbol{\omega}_t]_{\kappa(i)} \right)^2, \quad (3)$$

where the layer of the teacher $\kappa(i)$ is chosen in order to minimize the divergence with the corresponding layer of the student:

$$\kappa(i) = \begin{cases} N_{L_t}, & \text{if } i = N_{L_s} \\ \arg\min_j \left( [\boldsymbol{\omega}_s]_i - [\boldsymbol{\omega}_t]_j \right)^2, & \text{otherwise} \end{cases} \quad (4)$$

and the notation $[\mathbf{x}]_i$ is used to refer to the $i$-th element of vector $\mathbf{x}$. This definition employs the optimal matching between the layers (considering the discriminative power of each layer), except for the final one, which corresponds to the task at hand. In this way, it allows for measuring the flow divergence between networks with different architectures. At the same time, it is also expected to minimize the impact of over-regularization and/or representation collapse phenomena, such as those demonstrated in Fig. 2, which often occur when there is a large divergence between the layers used for transferring the knowledge.
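To make the matching rule concrete, the following minimal NumPy sketch (an illustration under the definitions above, not the authors' implementation; the mutual information estimates are assumed to be given) computes $\kappa(i)$ and the flow divergence of Eqs. (3)-(4):

```python
import numpy as np

def layer_matching(omega_s: np.ndarray, omega_t: np.ndarray) -> list:
    """Match each student layer i to the teacher layer kappa(i) of Eq. (4):
    the final student layer is always paired with the final teacher layer,
    while every other layer is paired with the teacher layer whose
    (estimated) mutual information with the targets is closest."""
    kappa = []
    for i, mi_s in enumerate(omega_s):
        if i == len(omega_s) - 1:
            kappa.append(len(omega_t) - 1)  # final layers always match
        else:
            kappa.append(int(np.argmin((mi_s - omega_t) ** 2)))
    return kappa

def flow_divergence(omega_s: np.ndarray, omega_t: np.ndarray) -> float:
    """Information flow divergence of Eq. (3): sum of squared differences
    between each student entry and its matched teacher entry."""
    kappa = layer_matching(omega_s, omega_t)
    return float(np.sum((omega_s - omega_t[kappa]) ** 2))

# Toy example: a 5-layer teacher and a 3-layer student.
omega_t = np.array([0.9, 1.4, 2.0, 2.6, 3.1])  # I(X^(l), Z) estimates
omega_s = np.array([1.0, 2.1, 2.4])            # I(Y^(l), Z) estimates
print(layer_matching(omega_s, omega_t))        # [0, 2, 4]
print(flow_divergence(omega_s, omega_t))       # 0.51
```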
However, this also implies that for networks with vastly different architectures, or for networks not yet trained for the task at hand, the same layer of the teacher may be used for transferring the knowledge to multiple layers of the student model, leading to a significant loss of granularity during the KD process and to stability issues. In Subsection 3.2 we provide a simple, yet effective way to overcome this issue by using auxiliary teacher models. Note that more advanced methods, such as employing fuzzy assignments between different sets of layers, can also be used.

In order to effectively transfer the knowledge between two different networks, we have to provide an efficient way to calculate the mutual information, as well as to train the student model to match the mutual information between two layers of different networks. Recently, it has been demonstrated that when the Quadratic Mutual Information (QMI) [24] is used, it is possible to efficiently minimize the difference between the mutual information of a specific layer of the teacher and student by appropriately relaxing the optimization problem [16]. More specifically, the problem of matching the mutual information between two layers can be reduced to a simpler probability matching problem that involves only the pairwise interactions between the transfer samples. Therefore, to transfer the knowledge between a specific layer of the student and another layer of the teacher, it is adequate to minimize the divergence between the teacher's and student's conditional probability distributions, which can be estimated as [16]:

$$p^{(t, l_t)}_{i|j} = \frac{K(\mathbf{x}^{(l_t)}_i, \mathbf{x}^{(l_t)}_j)}{\sum_{i=1, i \neq j}^{N} K(\mathbf{x}^{(l_t)}_i, \mathbf{x}^{(l_t)}_j)} \in [0, 1], \quad (5)$$

and

$$p^{(s, l_s)}_{i|j} = \frac{K(\mathbf{y}^{(l_s)}_i, \mathbf{y}^{(l_s)}_j)}{\sum_{i=1, i \neq j}^{N} K(\mathbf{y}^{(l_s)}_i, \mathbf{y}^{(l_s)}_j)} \in [0, 1], \quad (6)$$

where $K(\cdot)$ is a kernel function and $l_t$ and $l_s$ refer to the teacher and student layers used for the transfer. These probabilities also express how probable it is for each sample to select each of its neighbors [14], modeling in this way the geometry of the feature space, while matching these two distributions also ensures that the mutual information between the models and a set of (possibly unknown) classes is maintained [16]. Note that the actual training labels are not required during this process and, as a result, the proposed method can work in a purely unsupervised fashion.

The kernel choice can have a significant effect on the quality of the KD, since it alters how the mutual information is estimated [16]. Apart from the well-known Gaussian kernel, which is however often hard to tune, other kernel choices include cosine-based kernels [16], e.g., $K_c(\mathbf{a}, \mathbf{b}) = \frac{1}{2}\left( \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} + 1 \right)$, where $\mathbf{a}$ and $\mathbf{b}$ are two vectors, and the T-student kernel, i.e., $K_T(\mathbf{a}, \mathbf{b}) = \frac{1}{1 + \|\mathbf{a} - \mathbf{b}\|^d}$, where $d$ is typically set to 1. Selecting the most appropriate kernel for the task at hand can lead to significant performance improvements, e.g., cosine-based kernels perform better for retrieval tasks, while using kernel ensembles, i.e., estimating the probability distribution using multiple kernels, can also improve the robustness of mutual information estimation.
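The following PyTorch sketch (an illustration under the definitions above; the function names are ours) computes the two kernels and the conditional probabilities of Eqs. (5)-(6) for a batch of representations:

```python
import torch

def cosine_kernel(z: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine kernel K_c(a, b) = 0.5 * (a^T b / (||a|| ||b||) + 1),
    bounded in [0, 1]."""
    z = torch.nn.functional.normalize(z, p=2, dim=1)
    return 0.5 * (z @ z.t() + 1.0)

def t_student_kernel(z: torch.Tensor, d: int = 1) -> torch.Tensor:
    """Pairwise T-student kernel K_T(a, b) = 1 / (1 + ||a - b||^d)."""
    return 1.0 / (1.0 + torch.cdist(z, z) ** d)

def conditional_probabilities(k: torch.Tensor) -> torch.Tensor:
    """Turn a pairwise kernel matrix into the conditional distribution of
    Eqs. (5)-(6): p_{i|j} = K(z_i, z_j) / sum over i' != j of K(z_i', z_j)."""
    k = k - torch.diag(torch.diag(k))      # exclude self-similarities
    return k / k.sum(dim=0, keepdim=True)  # normalize over each column j
```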
Therefore, in this paper a hybrid objective is used, which aims at minimizing the divergence calculated using both the cosine kernel, which ensures the good performance of the learned representation for retrieval tasks, and the T-student kernel, which experimentally demonstrated good performance for classification tasks:

$$\mathcal{L}^{(l_t, l_s)} = \mathcal{D}\left( \mathcal{P}^{(t,l_t)}_c \,\|\, \mathcal{P}^{(s,l_s)}_c \right) + \mathcal{D}\left( \mathcal{P}^{(t,l_t)}_T \,\|\, \mathcal{P}^{(s,l_s)}_T \right), \quad (7)$$

where $\mathcal{D}(\cdot)$ is a probability divergence metric and the notation $\mathcal{P}^{(t,l_t)}_c$ and $\mathcal{P}^{(t,l_t)}_T$ is used to denote the conditional probabilities of the teacher calculated using the cosine and T-student kernels, respectively. Again, the representations used for KD are extracted from the $l_t$-th/$l_s$-th layer. The student probability distributions are denoted similarly by $\mathcal{P}^{(s,l_s)}_c$ and $\mathcal{P}^{(s,l_s)}_T$. The divergence between these distributions can be calculated using a symmetric version of the Kullback-Leibler (KL) divergence, the Jeffreys divergence [10]:

$$\mathcal{D}\left( \mathcal{P}^{(t,l_t)} \,\|\, \mathcal{P}^{(s,l_s)} \right) = \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \left( p^{(t,l_t)}_{j|i} - p^{(s,l_s)}_{j|i} \right) \cdot \left( \log p^{(t,l_t)}_{j|i} - \log p^{(s,l_s)}_{j|i} \right), \quad (8)$$

which can be sampled at a finite number of points during the optimization, e.g., using batches of 64-128 samples. This batch-based strategy has been successfully employed in a number of different works [16, 28], without any significant effect on the optimization process.

Even though the flow divergence metric defined in (3) takes into account the way different networks process the information, it suffers from a significant drawback: if the teacher processes the information in a significantly different way compared to the student, then the same layer of the teacher model might be used for transferring the knowledge to multiple layers of the student model, leading to a significant loss in the granularity of the information flow used for KD. Furthermore, this problem can also arise even when the student model is capable of processing the information in a way compatible with the teacher, but has not yet been appropriately trained for the task at hand. To better understand this, note that the information flow divergence in (3) is calculated based on the estimated mutual information and not the actual learning capacity of each model. Therefore, directly using the flow divergence definition presented in (3) is not optimal for KD. It is worth noting that this issue is especially critical for every KD method that employs multiple layers, since, as we demonstrate in Section 4, if the layer pairs are not carefully selected, the accuracy of the student model is often lower compared to a model trained without using multi-layer transfer at all.

Unfortunately, due to the poor understanding of the way that neural networks transform the probability distribution of the input data, there is currently no way to select the most appropriate layers for transferring the knowledge a priori. This process can be especially difficult and tedious, especially when the architectures of the student and teacher differ a lot.
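Putting Eqs. (7) and (8) together, and building on the kernel helpers sketched above, the per-layer hybrid objective could be written as follows (again an illustration, not the authors' code; a small eps is added for numerical stability):

```python
def jeffreys_divergence(p_t: torch.Tensor, p_s: torch.Tensor,
                        eps: float = 1e-7) -> torch.Tensor:
    """Symmetric KL (Jeffreys) divergence of Eq. (8) between two
    conditional probability matrices (diagonal entries are excluded)."""
    mask = 1.0 - torch.eye(p_t.shape[0], device=p_t.device)
    diff = (p_t - p_s) * (torch.log(p_t + eps) - torch.log(p_s + eps))
    return (diff * mask).sum()

def hybrid_layer_loss(x_t: torch.Tensor, y_s: torch.Tensor) -> torch.Tensor:
    """Hybrid objective of Eq. (7): Jeffreys divergence under both the
    cosine and the T-student kernels for one teacher/student layer pair."""
    loss = 0.0
    for kernel in (cosine_kernel, t_student_kernel):
        p_t = conditional_probabilities(kernel(x_t))  # teacher (constant)
        p_s = conditional_probabilities(kernel(y_s))  # student (trainable)
        loss = loss + jeffreys_divergence(p_t, p_s)
    return loss
```

Selecting which teacher layer should drive each student layer remains the difficult part, as discussed next.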
To overcome this critical limitation, in this work we propose constructing an appropriate auxiliary proxy for the teacher model that allows for a direct matching between all the layers of the auxiliary model and the student model, as shown in Fig. 3.

Figure 3. First, the knowledge is transferred to an appropriate auxiliary teacher, which better facilitates the process of KD (Step 1: KD to auxiliary). Then, the proposed method minimizes the information flow divergence between the two models, taking into account the existence of critical learning periods (Step 2: critical period-aware information flow divergence minimization).

In this way, the proposed method employs an auxiliary network, which has a compatible architecture with the student model, to better facilitate the process of KD. A simple, yet effective approach for designing the auxiliary network is employed in this work: the auxiliary network follows the same architecture as the student model, but uses twice the neurons/convolutional filters per layer. Thus, the greater learning capacity of the auxiliary network ensures that enough knowledge will always be available to the auxiliary network (when compared to the student model), leading to better results compared to directly transferring the knowledge from the teacher model. Designing the most appropriate auxiliary network is an open research area and significantly better ways than the proposed one might exist. However, even this simple approach was adequate to significantly enhance the performance of KD and demonstrate the potential of information flow modeling, as further demonstrated in the ablation studies provided in Section 4. Also, note that a hierarchy of auxiliary teachers can be trained in this fashion, as also proposed in [15].

The final loss used to optimize the student model, when an auxiliary network is employed, is calculated as:

$$\mathcal{L} = \sum_{i=1}^{N_{L_s}} \alpha_i \mathcal{L}^{(i,i)}, \quad (9)$$

where $\alpha_i$ is a hyper-parameter that controls the relative weight of transferring the knowledge from the $i$-th layer of the teacher to the $i$-th layer of the student, and the loss $\mathcal{L}^{(i,i)}$ defined in (7) is calculated using the auxiliary teacher instead of the initial teacher. The value of $\alpha_i$ can be dynamically selected during the training process to ensure that the applied KD scheme takes into account the current learning state of the network, as further discussed in Subsection 3.3. Finally, stochastic gradient descent is employed to train the student model: $\Delta \mathbf{W} = -\eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$, where $\mathbf{W}$ is the matrix with the parameters of the student model and $\eta$ is the employed learning rate.

Neural networks transition through different learning phases during the training process, with the first few epochs being especially critical for the later behavior of the network [1]. Using a stronger teacher model provides the opportunity of guiding the student model during the initial critical learning period in order to form the appropriate connectivity between the layers, before the information plasticity declines. However, merely minimizing the information flow divergence does not ensure that the appropriate connections will be formed. To better understand this, we have to consider that the gradients back-propagated through the network depend both on the training target, as well as on the initialization of the network.
Therefore, for a randomly initialized student, the task of forming the appropriate connections between the intermediate layers might not facilitate the final task at hand (until reaching a certain critical point). This was clearly demonstrated in Fig. 1, where the convergence of the network was initially slower when the proposed method was used, until reaching the point at which the critical learning period ends and the convergence of the network is accelerated.

Therefore, in this work we propose using an appropriate weighting scheme for calculating the value of the hyper-parameter $\alpha_i$ during the training process. More specifically, during the critical learning period a significantly higher weight is given to matching the information flow for the earlier layers, ignoring the task at hand dictated by the final layer of the teacher, while this weight gradually decays to 0 as the training process progresses. Therefore, the parameter $\alpha_i$ is calculated as:

$$\alpha_i = \begin{cases} 1, & \text{if } i = N_{L_s} \\ \alpha_{init} \cdot \gamma^k, & \text{otherwise} \end{cases} \quad (10)$$

where $k$ is the current training epoch, $\gamma$ is a decay factor and $\alpha_{init}$ is the initial weight used for matching the information flow in the intermediate layers. The decay factor $\gamma$ was kept fixed for each dataset (see Appendix A.3), while $\alpha_{init}$ was set to 100 for all the experiments conducted in this paper (unless otherwise stated). Therefore, during the first few epochs (1-10) the final task at hand has a minimal impact on the optimization objective. However, as the training process progresses, the importance of matching the information flow for the intermediate layers gradually diminishes and the optimization switches to fine-tuning the network for the task at hand.
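A sketch of the resulting critical period-aware objective, combining Eqs. (9) and (10), is given below (an illustration, not the authors' code). Note that the default value of gamma here is only a placeholder, since the decay factor is tuned per dataset (e.g., 0.6 for the SUN Attribute dataset, as reported in Appendix A.3):

```python
def alpha_schedule(epoch: int, n_layers: int, alpha_init: float = 100.0,
                   gamma: float = 0.7) -> list:
    """Critical period-aware weights of Eq. (10): intermediate layers start
    with a large weight alpha_init that decays geometrically with the epoch,
    while the final layer keeps a constant weight of 1. The default gamma
    is an assumed placeholder, not the paper's value."""
    w = alpha_init * gamma ** epoch
    return [w] * (n_layers - 1) + [1.0]

def total_loss(teacher_feats: list, student_feats: list,
               epoch: int) -> torch.Tensor:
    """Total objective of Eq. (9): weighted sum of the per-layer hybrid
    losses between the auxiliary teacher and the student. The teacher
    representations are detached, since only the student is optimized."""
    alphas = alpha_schedule(epoch, len(student_feats))
    return sum(a * hybrid_layer_loss(x.detach(), y)
               for a, x, y in zip(alphas, teacher_feats, student_feats))
```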
4. Experimental Evaluation
The experimental evaluation of the proposed method is provided in this Section. The proposed method was evaluated using four different datasets (CIFAR-10 [11], STL-10 [6], CUB-200 [26] and SUN Attribute [18]) and compared to four competitive KD methods: neural network distillation [9], hint-based transfer [19], probabilistic knowledge transfer (PKT) [16] and metric knowledge transfer (abbreviated as MKT) [28]. A variety of different evaluation setups were used to evaluate various aspects of the proposed method. Please refer to the appendix for a detailed description of the employed networks and evaluation setups.

First, the proposed method was evaluated in a metric learning setup using the CIFAR-10 dataset (Table 1). The methods were evaluated under two different settings: a) using contrastive supervision (by adding a contrastive loss term to the loss function [8]), as well as b) using a purely unsupervised setting (cloning the responses of the powerful teacher model). The simple variants (Hint, MKT, PKT) refer to transferring the knowledge only from the penultimate layer of the teacher, while the "-H" variants refer to transferring the knowledge simultaneously from all the layers of the auxiliary model. The abbreviation "e" is used to refer to retrieval using the Euclidean similarity metric, while "c" is used to refer to retrieval using the cosine similarity.

Table 1. Metric Learning Evaluation: CIFAR-10

Method              | mAP (e) | mAP (c) | top-100 (e) | top-100 (c)
Baseline Models
Teacher (ResNet-18) | --.18   | 90.47   | 92.15       | 92.--
Aux. (CNN1-A)       | --.12   | 66.78   | 73.72       | 75.--
With Contrastive Supervision
Student (CNN1)      | --.69   | 48.72   | 57.46       | 58.--
Hint.               | --.56   | 48.--   | --.44       | 62.--
MKT                 | --.34   | 46.84   | 55.89       | 57.--
PKT                 | --.87   | 49.95   | 58.44       | 59.--
Hint-H              | --.24   | 47.46   | 58.97       | 61.--
MKT-H               | --.83   | 47.12   | 56.28       | 57.--
PKT-H               | --.69   | 50.09   | 58.71       | 60.--
Proposed            | --.55   | 50.--   | --.50       | 60.--
Without Contrastive Supervision
Student (CNN1)      | --.30   | 39.00   | 55.87       | 58.--
Distill.            | --.39   | 40.53   | 56.17       | 58.--
Hint.               | --.99   | 48.99   | 60.69       | 62.--
MKT                 | --.26   | 38.20   | 50.55       | 52.--
PKT                 | --.07   | 51.56   | 60.02       | 62.--
Hint-H              | --.65   | 46.46   | 58.51       | 60.--
MKT-H               | --.16   | 43.99   | 55.10       | 57.--
PKT-H               | --.05   | 51.73   | 60.39       | 63.--
Proposed            | --.20   | 53.06   | 61.54       | 64.--

Table 2. Classification Evaluation: CIFAR-10

Method   | Train Accuracy | Test Accuracy
Distill  | --.50          | 70.--
Hint.    | --.29          | 70.--
MKT      | --.73          | 69.--
PKT      | --.70          | 70.--
Hint-H   | --.93          | 69.--
MKT-H    | --.67          | 68.--
PKT-H    | --.43          | 71.--
Proposed | --.--          | --.--

First, note that using all the layers for distilling the knowledge provides small to no improvements in the retrieval precision, with the exception of the MKT method (when applied without any form of supervision). Actually, in some cases (e.g., when hint-based transfer is employed) the performance when multiple layers are used is worse. This behavior further confirms and highlights the difficulty of applying multi-layer KD methods between heterogeneous architectures. Also, using contrastive supervision seems to provide more consistent results for the competitive methods, especially for the MKT method. Using the proposed method leads to a significant increase in the mAP, as well as in the top-K precision. For example, mAP (c) increases by over 2.5% (relative increase) over the next best performing method (PKT-H). At the same time, note that the proposed method seems to lead to overall better results when there is no additional supervision. This is again linked to the existence of critical learning periods. As explained before, forming the appropriate information flow paths requires little to no supervision from the final layers when the network is randomly initialized (since forming these paths usually changes the way the network processes information, temporarily increasing the loss related to the final task at hand). Similar conclusions can also be drawn from the classification evaluation using the CIFAR-10 dataset, reported in Table 2. Again, the proposed method leads to a relative increase of about 0.7% over the next best-performing method.

Next, the proposed method was evaluated under a distribution shift setup using the STL-10 dataset (Table 3). For these experiments, the teacher model was trained using the CIFAR-10 dataset, but the KD was conducted using the unlabeled split of the STL dataset. Again, similar results as with the CIFAR-10 dataset are observed, with the proposed method outperforming the rest of the evaluated methods over all the evaluated metrics. Again, it is worth noting that directly transferring the knowledge between all the layers of the network often harms the retrieval precision for the competitive approaches.

Table 3. Metric Learning Evaluation: STL Distribution Shift

Method              | mAP (e) | mAP (c) | top-100 (e) | top-100 (c)
Teacher (ResNet-18) | --.40   | 61.20   | 66.75       | 69.--
Aux. (CNN1-A)       | --.89   | 48.48   | 53.54       | 56.--
Student (CNN1)      | --.60   | 33.04   | 39.08       | 41.--
Distill             | --.56   | 36.23   | 43.32       | 46.--
Hint.               | --.11   | 40.33   | 46.60       | 49.--
MKT                 | --.46   | 35.91   | 40.65       | 43.--
PKT                 | --.22   | 40.26   | 44.73       | 47.--
Hint-H              | --.56   | 37.85   | 43.83       | 46.--
MKT-H               | --.57   | 35.23   | 40.20       | 42.--
PKT-H               | --.56   | 39.77   | 44.76       | 47.--
Proposed            | --.11   | 40.35   | 48.44       | 50.--

This behavior is also confirmed using the more challenging CUB-200 dataset (Table 4), where the proposed method again outperforms the rest of the evaluated approaches, both for the retrieval evaluation, as well as for the classification evaluation. For the latter, a quite large improvement is observed, since the accuracy increases by over 1.5% over the next best performing method.

Table 4. Metric Learning and Classification Evaluation: CUB-200

Method   | mAP (e) | mAP (c) | top-10 (e) | top-10 (c) | Acc.
Teacher  | --.17   | 78.17   | 76.02      | 81.64      | 72.--
Aux.     | --.01   | 18.98   | 25.77      | 27.07      | 32.--
Student  | --.60   | 17.24   | 23.40      | 24.89      | 34.--
Distill  | --.40   | 18.55   | 24.82      | 26.57      | 35.--
Hint.    | --.34   | 15.98   | 22.31      | 23.41      | 28.--
MDS      | --.99   | 13.39   | 20.60      | 20.59      | 30.--
PKT      | --.36   | 18.57   | 24.68      | 26.70      | 34.--
Hint-H   | --.94   | 15.37   | 21.75      | 22.61      | 28.--
MDS-H    | --.83   | 15.39   | 21.27      | 22.76      | 32.--
PKT-H    | --.58   | 17.77   | 23.50      | 25.39      | 33.--
Proposed | --.70   | 19.01   | 25.41      | 27.67      | 36.--

Furthermore, we also conducted a HoG [7] cloning experiment, in which the knowledge was transferred from a handcrafted feature extractor to demonstrate the flexibility of the proposed approach. The same strategy as in the previous experiments was used, i.e., the knowledge was first transferred to an auxiliary model and then further distilled to the student model. It is worth noting that this setup has several emerging applications, as discussed in a variety of recent works [16, 5], since it allows for pre-training deep neural networks for domains for which it is difficult to acquire large annotated datasets, as well as providing a straightforward way to exploit the highly optimized deep learning libraries for embedded devices to provide neural network-based implementations of hand-crafted features. The evaluation results for this setup are reported in Table 5, confirming again that the proposed method outperforms the rest of the evaluated methods.

Table 5. HoG Cloning Network: SUN Dataset

Method   | mAP (c)       | top-1 (e)     | top-10 (c)
HoG      | --.-- ± --.20 | 62.-- ± --.10 | 47.-- ± --.--
Aux.     | --.-- ± --.09 | 55.-- ± --.03 | 42.-- ± --.--
Hint     | --.-- ± --.13 | 44.-- ± --.11 | 31.-- ± --.--
MDS      | --.-- ± --.79 | 43.-- ± --.87 | 31.-- ± --.--
PKT      | --.-- ± --.60 | 49.-- ± --.67 | 36.-- ± --.--
Proposed | --.-- ± --.62 | 51.-- ± --.74 | 38.-- ± --.--

Finally, several ablation studies have been conducted. First, in Fig. 1 we evaluated the effect of using the proposed weighting scheme that takes into account the existence of critical learning periods. The proposed scheme indeed leads to faster convergence over both single-layer KD using the PKT method, as well as over the multi-layer PKT-H method. To validate that the improved results arise from the higher weight given to the intermediate layers over the critical learning period, we used the same decaying scheme for the PKT-H method, but with the initial $\alpha_{init}$ set to 1 instead of 100. Next, we also demonstrated the impact of matching the correct layers in Fig. 2. Several interesting conclusions can be drawn from the results reported in Fig. 2. For example, note that over-regularization occurs when transferring the knowledge from a teacher layer that has lower MI with the targets (lower NCC accuracy). On the other hand, using a layer with slightly lower discriminative power (Layer 1 of ResNet-18) can have a slightly positive regularization effect. At the same time, using too discriminative layers (Layers 3 and 4 of ResNet-18) can lead to an early collapse of the representation, harming the precision of the student. The accuracy of the student increases only when the correct layers of the auxiliary teacher are matched to the student (Layers 2 and 3 of CNN-1-A).

Furthermore, we also evaluated the effect of using auxiliary models of different sizes on the precision of the student model trained with the proposed method. The evaluation results are provided in Table 6. Two different student models are used: CNN-1 (15k parameters) and CNN-1-L (6k parameters). As expected, the auxiliary models that are closer to the complexity of the student lead to improved performance compared both to the more complex and the less complex teachers. That is, when the CNN-1 model is used as student, the CNN-1-A teacher achieves the best results, while when the CNN-1-L is used as student, the weaker CNN-1 teacher achieves the highest precision. Note that as the complexity of the student increases, the efficiency of the KD process declines.

Table 6. Effect of using auxiliary networks of different sizes (CNN order in terms of parameters: CNN-1-H > CNN-1-A > CNN-1 > CNN-1-L)

Method            | mAP (e) | mAP (c) | top-100 (e) | top-100 (c)
CNN1-L → CNN1     | --.03   | 37.89   | 46.31       | 49.--
CNN1-A → CNN1     | --.20   | 53.06   | 61.54       | 64.--
CNN1-H → CNN1     | --.82   | 52.77   | 61.25       | 63.--
CNN1 → CNN1-L     | --.49   | 39.25   | 48.21       | 50.--
CNN-1-A → CNN-1-L | --.72   | 38.61   | 47.25       | 50.--
CNN-1-H → CNN-1-L | --.90   | 37.51   | 45.83       | 48.--
5. Conclusions
In this paper we presented a novel KD method that works by modeling the information flow through the various layers of the teacher model. The proposed method was able to overcome several limitations of existing KD approaches, especially when used for training very lightweight deep learning models with architectures that differ significantly from the teacher, by a) designing and training an appropriate auxiliary teacher model, and b) employing a critical learning-aware KD scheme that ensures that critical connections will be formed to effectively mimic the information flow paths of the auxiliary teacher.
Acknowledgment
This work was supported by the European Union's Horizon 2020 Research and Innovation Program (OpenDR) under Grant 871449. This publication reflects the authors' views only. The European Commission is not responsible for any use that may be made of the information it contains.

A. Appendix
A.1. Datasets and Evaluation Setups
The proposed method was evaluated using four different datasets: the CIFAR-10 [11] dataset, the STL-10 [6] dataset, the CUB-200 [26] dataset and the SUN Attribute [18] dataset. For CIFAR-10, the training split was used for training and transferring the knowledge to the student models, while for the retrieval evaluation the training split was also used to compile the database. Then, the test set was used to query the database and measure the retrieval performance of the various representations. For the STL-10 dataset we followed the same setup as for CIFAR-10, but we also used the provided unlabeled training split for transferring the knowledge to the student models. For CUB-200 we also followed the same setup; however, the experiments were conducted using the first 30 classes of the dataset, due to the significantly restricted learning capacity of the employed student models (recall that among the objectives of the paper is to evaluate the performance of KD approaches for ultra-lightweight network architectures and heterogeneous KD setups). Finally, images from the eight most common categories (for which at least 40 images exist) were used for training and evaluating the methods when the SUN Attribute dataset was employed, since a very small number of images exist for the rest of the categories. 80% of the extracted images were used for training the networks and building the database, while the remaining 20% were used to query the database. The evaluation process was repeated 5 times and the mean and standard deviation of the evaluated metrics are reported. For the SUN Attribute dataset, the knowledge was distilled from HoG features.

For the CIFAR-10 and STL datasets we used the supplied images without performing any resizing. However, the training dataset was augmented by randomly performing horizontal flipping and randomly cropping the images using padding of 4 pixels. A similar augmentation protocol was used for the CUB-200 dataset. However, the images of the CUB-200 dataset were first resized and then randomly cropped (a center crop of the same size was used during the evaluation process). Also, random rotation was used when training the models. Finally, the images of the SUN Attribute dataset were resized before feeding them into the network, following the protocol used in [16].

A.2. Network Architectures
The network architectures used for the conducted experiments are shown in Fig. 4. The CNN-1 family was used for the experiments conducted using the CIFAR-10 and STL datasets, the CNN-2 family was used for the experiments conducted using the CUB-200 dataset, while the CNN-3 family was used for the SUN Attribute dataset. The suffix "-A" is used to denote the model that was used as the auxiliary teacher. The auxiliary teacher was trained using the PKT method [16], by transferring the knowledge from the penultimate layer of a ResNet-18 teacher (for the CIFAR-10, STL and CUB-200 datasets) or from handcrafted features (for the SUN Attribute dataset). The ReLU activation function was used for all the layers, while batch normalization was used after each convolutional layer.

Figure 4. Network architectures used for the conducted experiments. The green model was used as the student for the conducted experiments (unless otherwise stated), while the red model was used as the auxiliary teacher. For experiments involving classification, an additional fully connected layer with $N_C$ (number of classes) neurons was added. The depicted architectures are:

- CNN-1: Conv2D (3 x 3, 8 filters), Conv2D (3 x 3, 16 filters), Conv2D (3 x 3, 32 filters), Fully Connected (64 neurons), Fully Connected ($N_C$ neurons)
- CNN-1-L: Conv2D (3 x 3, 4 filters), Conv2D (3 x 3, 8 filters), Conv2D (3 x 3, 16 filters), Fully Connected (64 neurons), Fully Connected ($N_C$ neurons)
- CNN-1-A: Conv2D (3 x 3, 16 filters), Conv2D (3 x 3, 32 filters), Conv2D (3 x 3, 64 filters), Fully Connected (128 neurons), Fully Connected ($N_C$ neurons)
- CNN-1-H: Conv2D (3 x 3, 32 filters), Conv2D (3 x 3, 64 filters), Conv2D (3 x 3, 128 filters), Fully Connected (128 neurons), Fully Connected ($N_C$ neurons)
- CNN-2: Conv2D (9 x 9, 8 filters, stride 2), Conv2D (5 x 5, 16 filters), Conv2D (5 x 5, 32 filters), Fully Connected (64 neurons)
- CNN-2-A: Conv2D (9 x 9, 32 filters, stride 2), Conv2D (5 x 5, 64 filters), Conv2D (5 x 5, 128 filters), Fully Connected (128 neurons)
- CNN-3: Conv2D (3 x 3, 4 filters), Conv2D (3 x 3, 4 filters), Fully Connected (32 neurons)
- CNN-3-A: Conv2D (3 x 3, 8 filters), Conv2D (3 x 3, 16 filters), Fully Connected (128 neurons)
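For concreteness, a minimal PyTorch sketch of the CNN-1 student is given below. The figure does not specify how spatial resolution is reduced between blocks, so max pooling and 32 x 32 inputs (CIFAR-10) are assumed here:

```python
import torch
import torch.nn as nn

class CNN1(nn.Module):
    """Sketch of the CNN-1 student from Fig. 4: three 3x3 conv layers
    (8/16/32 filters), each followed by batch normalization and ReLU,
    then a 64-neuron fully connected layer and an N_C-way classifier.
    The pooling between blocks is an assumption; the figure does not
    specify how spatial resolution is reduced."""
    def __init__(self, n_classes: int):
        super().__init__()
        def block(c_in: int, c_out: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2))
        self.features = nn.Sequential(block(3, 8), block(8, 16), block(16, 32))
        self.fc = nn.Linear(32 * 4 * 4, 64)  # assumes 32x32 inputs (CIFAR-10)
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.classifier(torch.relu(self.fc(h)))
```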
A.3. Training Hyper-parameters

For all the conducted experiments we used the Adam optimizer with the default training hyper-parameters. For the experiments conducted using the CIFAR-10 dataset, the optimization ran for 50 training epochs with a learning rate of 0.001 (batches of 128 samples were used) for all the evaluated methods. For the ablation results reported in Fig. 2 of the main manuscript, the optimization ran for 20 epochs. For the STL dataset, the optimization ran for 30 training epochs with a learning rate of 0.001 and a batch size equal to 128. For the CUB-200 dataset, the optimization ran for 100 training epochs, using a learning rate of 0.001 for the first 50 training epochs and 0.0001 for the subsequent 50 training epochs. Also, for the SUN Attribute dataset, the optimization ran for 20 training epochs. Furthermore, the decay factor γ was set to 0.6 for this dataset, due to the smaller number of training epochs. Finally, note that for the experiments conducted with contrastive supervision (CIFAR-10) we employed the contrastive loss with the margin set to 1, and the loss was combined with the KD loss after weighting it with 0.1. Also, for the classification experiments reported in Table 2, all the methods were also trained using a supervised classification term (cross-entropy loss). Finally, for all the experiments conducted using the distillation loss, a temperature of T = 2 was used.

References

[1] Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856, 2017.
[2] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9163-9171, 2019.
[3] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
[4] Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535-541, 2006.
[5] Zhenghua Chen, Le Zhang, Zhiguang Cao, and Jing Guo. Distilling the knowledge from handcrafted features for human activity recognition. IEEE Transactions on Industrial Informatics, 14(10):4334-4342, 2018.
[6] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Conference on Artificial Intelligence and Statistics, pages 215-223, 2011.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the Computer Society Conference on Computer Vision and Pattern Recognition, pages 886-893, 2005.
[8] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1735-1742, 2006.
[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In Proceedings of the Neural Information Processing Systems Deep Learning Workshop, 2014.
[10] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453-461, 1946.
[11] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, 2009.
[12] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[13] Seung Hyun Lee, Dae Ha Kim, and Byung Cheol Song. Self-supervised knowledge distillation using singular value decomposition. In European Conference on Computer Vision, pages 339-354. Springer, 2018.
[14] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[15] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393, 2019.
[16] Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision, pages 268-284, 2018.
[17] Nikolaos Passalis and Anastasios Tefas. Unsupervised knowledge transfer using similarity embeddings. IEEE Transactions on Neural Networks and Learning Systems, 30(3):946-950, 2018.
[18] Genevieve Patterson and James Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2751-2758, 2012.
[19] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In Proceedings of the International Conference on Learning Representations, 2015.
[20] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. In Proceedings of the International Conference on Learning Representations, 2018.
[21] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[22] Zhiyuan Tang, Dong Wang, Yiqiao Pan, and Zhiyong Zhang. Knowledge transfer pre-training. arXiv preprint arXiv:1506.02256, 2015.
[23] Zhiyuan Tang, Dong Wang, and Zhiyong Zhang. Recurrent neural network training with dark knowledge transfer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5900-5904, 2016.
[24] Kari Torkkola. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3(Mar):1415-1438, 2003.
[25] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068-4076, 2015.
[26] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. 2010.
[27] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7130-7138, 2017.
[28] Lu Yu, Vacit Oguz Yazici, Xialei Liu, Joost van de Weijer, Yongmei Cheng, and Arnau Ramisa. Learning metrics from teachers: Compact networks for image embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2907-2916, 2019.
[29] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
[30] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.