A Novel DNN Training Framework via Data Sampling and Multi-Task Optimization
Boyu Zhang, A. K. Qin, Hong Pan and Timos Sellis
Department of Computer Science and Software Engineering
Swinburne University of Technology
Melbourne, Australia
Email: [email protected], [email protected], [email protected], [email protected]

©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract—Conventional DNN training paradigms typically rely on one training set and one validation set, obtained by partitioning an annotated dataset available for training, namely the gross training set, in a certain way. The training set is used for training the model while the validation set is used to estimate the generalization performance of the trained model as the training proceeds, so as to avoid over-fitting. There exist two major issues in this training paradigm. Firstly, the validation set can hardly guarantee an unbiased estimate of the generalization performance due to its potential mismatch with the test data. Secondly, training a DNN corresponds to solving a complex optimization problem, which is prone to getting trapped in inferior local optima and thus leads to undesired training results. To address these issues, we propose a novel DNN training framework. It generates multiple pairs of training and validation sets from the gross training set via random splitting, trains a DNN model of a pre-specified network structure on each pair while allowing the useful knowledge (e.g., promising network parameters) obtained from one model training process to be transferred to other model training processes via multi-task optimization (a recently emerging optimization paradigm), and outputs the model, among all trained models, which achieves the overall best performance across the validation sets from all pairs. The knowledge transfer mechanism featured in this new framework can not only enhance training effectiveness by helping the model training process escape from local optima but also improve generalization performance via implicit regularization imposed on one model training process by the other model training processes. We implement the proposed framework, parallelize the implementation on a GPU cluster, and apply it to train several widely used DNN models. Experimental results on several classification datasets of different nature demonstrate the superiority of the proposed framework over the conventional training paradigm.
Index Terms—Multi-task optimization, MTO, training deep neural networks, data sampling.
I. INTRODUCTION
Deep neural networks (DNNs) have achieved performance breakthroughs in many real-world applications due to their powerful feature learning capabilities, which are typically characterized by sophisticated architectures involving a massive number of parameters. Training a DNN is equivalent to solving a highly complex non-convex optimization task which easily gets stuck in inferior local optima and accordingly leads to undesired training results.

A commonly employed way to train a DNN relies on an available training set, where a certain loss function defined on the training set is optimized with respect to the network parameters to derive optimal parameter values. This optimization process, a.k.a. the training process, is iterative and usually terminated by a pre-specified maximal number of training epochs. However, manually specifying the maximum number of training epochs is often too subjective, which increases the risk of over-fitting or under-fitting and accordingly results in undesired generalization performance. To address this issue, another training paradigm has gained much popularity nowadays, which partitions the original training set, namely the gross training set, into one training set and one validation set via a certain sampling method and utilizes the validation set to estimate the generalization performance of the trained model as the training process (based on the training set) proceeds [1], [2]. This may improve the generalization performance of the trained model. However, the validation set may not well represent the potential test data and thus becomes less effective at providing an unbiased estimate of generalization performance.

To address the above issues, we propose a novel DNN training framework which formulates multiple related training tasks by using a certain sampling method to generate multiple different pairs of training and validation sets from the gross training set and solves these related tasks simultaneously via a newly emerging multi-task optimization (MTO) technique that allows the useful knowledge (e.g., promising network parameters) obtained from one training task to be transferred to other training tasks. Specifically, this framework generates multiple pairs of training and validation sets from the gross training set via a specific sampling method, trains a DNN model of a pre-specified network structure on each pair while enabling the useful knowledge obtained from one training process to be shared with other training processes via MTO, and finally outputs the best one, among all the trained models, which achieves the overall best performance across the validation sets from all pairs. The knowledge transfer and sharing mechanism featured in the proposed framework can not only enhance training effectiveness by helping the training processes escape from local optima but also improve generalization via implicit regularization imposed on one training process by the other training processes. It is worth noting that the cross-validation technique [3] commonly used to improve generalization performance when training machine learning (ML) models is for tuning the hyper-parameters of the ML model instead of the model parameters per se. Therefore, it is irrelevant to this study, which focuses on learning the parameters (i.e., connection weights and biases) of a DNN model with pre-specified hyper-parameters. Another machine learning technique named ensemble learning also trains multiple models to make a prediction.
However, it aims at achieving the best performance by assembling all models in a certain way, while our proposed framework aims to train a single best model with the help of training other models.

We implement the proposed training framework, parallelize the implementation on a GPU cluster, and apply it to train three popular DNN models, i.e., DenseNet-121 [4], MobileNetV2 [5] and SqueezeNet [6]. Performance evaluation and comparison on three classification datasets of different nature demonstrate the superiority of the proposed training framework over the conventional training paradigm in terms of the classification accuracy obtained on the testing set.

In the following, we first introduce the background of this work in Section II, then describe the proposed framework and its implementation in detail in Section III, and finally discuss and analyze experimental results in Section IV, followed by concluding remarks and future work in Section V.

II. BACKGROUND
A. Training Deep Neural Networks
Gradient descent based optimization algorithms are widely used for training DNNs, e.g., in supervised learning problems. These algorithms normally use back propagation (BP) [7] to calculate the gradients of the DNN's parameters and accordingly update parameter values. Specifically, given a training set composed of multiple pairs of an input and its output (i.e., the label of the input), a loss function is formulated which measures the mismatch between the network's output w.r.t. an input and the actual output of that input, summed over all input-and-output pairs in the training set. Then, BP with stochastic gradient descent is applied to calculate the partial derivative of the loss function with respect to each parameter in the DNN from the last layer to the first layer. Next, parameter values are updated based on the calculated derivatives via a certain learning rule. This training process is iterated until a certain stopping criterion is met.

Training DNNs relies on an available training set which may be used in different ways, leading to different training paradigms. One common practice is to directly train a DNN on the gross training set (i.e., the original training set) and use a pre-specified maximum number of training epochs to terminate the training process. However, the subjective choice of the maximum number of training epochs is likely to make the trained model overfit the training set and thus lead to inferior generalization. Another popular training paradigm addresses this issue by partitioning the gross training set into one training set and one validation set via random splitting, using the validation set to estimate the generalization performance of the trained model as the training process proceeds, and accordingly stopping the training process if the generalization performance of the trained model cannot be much improved [8]–[12].

In ML, cross-validation is a widely used strategy to improve the generalization performance of a trained ML model. However, it is merely applied to tune the hyper-parameters of the model [13], e.g., the number of layers, the number of hidden neurons, and the learning rate in the context of DNN training. In this work, we focus on learning the parameters (i.e., connection weights and biases) of a DNN model with pre-specified hyper-parameters. Therefore, the cross-validation strategy is irrelevant to this study.
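To make the validation-monitored training paradigm above concrete, the following is a minimal PyTorch sketch, not the paper's exact implementation; the model, data loaders, and hyper-parameter values are illustrative placeholders.

```python
# Minimal sketch of validation-monitored training with early stopping
# (the paradigm of Fig. 1b). All names and values are illustrative.
import copy
import torch

def train_with_validation(model, loss_fn, opt, train_loader, val_loader,
                          max_epochs=100, patience=10):
    best_loss, best_state, stall = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:                 # one pass over D^t
            opt.zero_grad()
            loss_fn(model(x), y).backward()       # BP computes the gradients
            opt.step()                            # gradient-based update
        model.eval()
        with torch.no_grad():                     # validation loss on D^v
            val_loss = sum(loss_fn(model(x), y).item()
                           for x, y in val_loader) / len(val_loader)
        if val_loss < best_loss:                  # keep the best model so far
            best_loss, stall = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stall += 1
            if stall >= patience:                 # early stopping
                break
    model.load_state_dict(best_state)
    return model
```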
B. Multi-Task Optimization
MTO investigates how to effectively and efficiently tackle multiple optimization tasks concurrently via online knowledge transfer. This paradigm has been inspired by the well-established concepts of transfer learning [14] and multi-task learning (MTL) [15] in predictive analytics. Existing MTO techniques are mainly developed for Bayesian optimization [16]–[18] or evolutionary computation [19]–[22]. Swersky et al. [16] proposed multi-task Bayesian optimization (MTBO), which is based on the well-studied multi-task Gaussian process models. This work can transfer the knowledge gained from prior optimizations to new tasks in order to find optimal hyperparameter settings more efficiently, or optimize multiple tasks simultaneously when the goal is maximizing average performance, e.g., optimizing k-fold cross-validation. [23] proposed an MTO based evolutionary algorithm (EA) where tasks benefit from implicit knowledge transfer during the task solving, which can often lead to accelerated convergence for a variety of complex optimization functions. Besides EAs, other variants [24], [25] have also been developed to solve MTO problems. In [20], Feng et al. proposed an evolutionary multitasking algorithm with explicit knowledge transfer via a denoising autoencoder, which demonstrates higher efficacy than implicit knowledge transfer. [21] proposed an MTO based framework of generating feature subspaces for ensemble classification.

III. PROPOSED METHOD
A. Framework
The conventional way of training DNNs corresponds to minimizing a loss function which measures the mismatch between the output of the network w.r.t. an input and the actual output, which can be regarded as a single-task optimization (STO) problem. One common training paradigm can be defined as follows: given a training set $D = \{x_i, y_i\}_{i=1}^{n}$, where $x_i$ and $y_i$ refer to the input and the corresponding actual output, it trains a DNN $f(\cdot\,;\theta)$ until reaching the pre-specified maximum number of epochs (Fig. 1a) to minimize a particular loss function (1),

$\min_{\theta} J(\theta \mid D)$ (1)

where we define $J(\theta \mid D) = \frac{1}{n}\sum_{i=1}^{n} J(f(x_i;\theta), y_i)$. Once the training is completed, the trained DNN is evaluated on the testing set $D' = \{x'_j, y'_j\}_{j=1}^{k}$, which is invisible during training. Another popular training paradigm partitions the gross training set into one training set $D^t$ and one validation set $D^v$ via a certain sampling method and uses $D^v$ to estimate the generalization performance (evaluated by the validation loss $J(\theta \mid D^v)$) of the trained DNN during the training process (Fig. 1b). The training process is terminated if the validation loss cannot be much reduced, and the trained DNN with the minimum validation loss is regarded as the final trained model.

[Fig. 1: The existing and the proposed MTO based frameworks of training DNNs: (a) STO: training the DNN without a validation set; (b) STO: training the DNN with a validation set; (c) the proposed MTO based DNN training framework. Each task in the proposed framework obtains a pair of training and validation sets via a certain sampling method. It uses the training set for training and the validation set for monitoring the generalization ability of its trained DNNs. During the training process, the intermediate knowledge from each task is shared across all tasks to help their training, where the knowledge transfer is represented by the red dashed line. At the end of the training process, the model which achieves the overall best performance across all the validation sets from all pairs is selected as the final outcome.]

Unlike the above conventional training paradigms, our proposed training framework has two modules: related training tasks formulation and MTO (Fig. 1c). In the first module, we formulate multiple related training tasks $\{T_m\}_{m=1}^{M}$, where each task obtains a distinct pair of training and validation sets $\{D^t_m, D^v_m\}_{m=1}^{M}$ via a certain sampling method. Each task then uses $D^t_m$ to train one individual DNN model with a pre-specified network structure and uses $D^v_m$ for monitoring the generalization performance during the training process.

After that, the MTO module solves all tasks simultaneously and applies knowledge transfer across all tasks to help them find better model parameters which produce a lower validation loss on their associated validation sets during the training process. At last, the trained DNN which achieves the overall best performance across all the validation sets from all pairs is selected as the final trained model. The conventional STO based training method (with a validation set, Fig. 1b) can be regarded as a special case of our proposed framework when there is only a single training task.

In this framework, the training set in each task is different and, accordingly, the model learned in each task may contain useful knowledge (e.g., promising parameter values) which can be transferred and shared with other tasks to help their training processes escape from inferior local optima. Meanwhile, the validation sets in different tasks provide estimates of generalization from different perspectives. As a result, knowledge transfer and sharing across different tasks may impose implicit regularization on the training process of one task from the training processes of other tasks, aiming to produce a DNN with improved generalization which can perform well on all validation sets.
B. Implementation

1) Formulating Multiple Related Training Tasks: In this implementation, we formulate each training task as follows: firstly, we randomly split a ratio of samples from the gross training set as the validation set and keep the remaining as the training set to form the pair $\{D^t_m, D^v_m\}$; secondly, we use this pair to formulate a training task $T_m$ (2), which aims to train one individual DNN model with a pre-specified network structure via its pair of training and validation sets:

$T_m: \min_{\theta_m} J(\theta_m \mid D^t_m)$ (2)

During training, $D^v_m$ is used to estimate the change of the generalization ability of $\theta_m$ and also provides a way to evaluate whether the knowledge from other tasks is beneficial for improving the generalization ability of task $T_m$.

This process is repeated M times to formulate M training tasks $\{T_m\}_{m=1}^{M}$. These tasks are highly related since they are sampled from the same gross training set.
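As a concrete illustration of this task-formulation step, the sketch below generates M distinct training/validation pairs via random splitting; the 10% validation ratio matches the experimental setting reported later, while the function and dataset names are illustrative assumptions.

```python
# Sketch of formulating M related training tasks via random splitting
# (Section III-B1). Names are illustrative, not from the paper's code.
import torch
from torch.utils.data import Dataset, random_split

def formulate_tasks(gross_train_set: Dataset, M: int, val_ratio: float = 0.1):
    """Return M distinct (D^t_m, D^v_m) pairs drawn from the gross training set."""
    n = len(gross_train_set)
    n_val = int(n * val_ratio)
    pairs = []
    for m in range(M):
        # A fresh generator per task yields a distinct random split.
        g = torch.Generator().manual_seed(m)
        d_train, d_val = random_split(gross_train_set, [n - n_val, n_val],
                                      generator=g)
        pairs.append((d_train, d_val))
    return pairs
```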
2) Adaptive Multi-Task Optimization Based DNN Training Algorithm: We propose an adaptive MTO based DNN training algorithm (AMTO) which targets solving all tasks simultaneously and transferring the intermediate learned knowledge (which we define as the model parameters $\theta$) across all tasks to improve their training performance. To effectively transfer knowledge across tasks, especially when a large number of tasks are solved together, each formulated training task can (i) learn its relationships with other tasks so that knowledge transfer is more likely to occur between related tasks, and (ii) determine whether to accept the transferred knowledge based on whether it can help improve generalization performance during the training process.

Specifically, each task maintains a relationship list ($RL$) which records how it is related to other tasks. For $T_m$, its relationship list $RL_m$ is represented as (3),

$RL_m = [r_1, r_2, \ldots, r_M]$ (3)

where $r_j \in [-\infty, +\infty]$ represents the degree of relationship to $T_j$. Then we convert the elements of $RL_m$ into probabilities of acquiring knowledge from the corresponding tasks, which sum to one, via the softmax function (4):

$p_j = \frac{e^{r_j}}{\sum_{k=1, k \neq m}^{M} e^{r_k}}, \quad j \neq m$ (4)

Apparently, a higher $r_j$ represents a higher probability of acquiring knowledge from $T_j$.

At the beginning of the algorithm, all elements of $RL$ are initialized to zero. Then, for $T_m$, it selects $T_j$ at random with the probability generated from (4) and acquires the model parameters $\theta_j$ as $\bar{\theta}_m$. We name this operation knowledge reallocation.

$T'_m: \min_{\bar{\theta}_m} J(\bar{\theta}_m \mid D^t_m), \quad \bar{\theta}_m \leftarrow \theta_j$ (5)

After that, a temporary DNN training task $T'_m$ is formulated as (5) to evaluate whether training $\bar{\theta}_m$ on $D^t_m$ can achieve better generalization estimated on $D^v_m$ than $\theta_m$. Then the model parameters of all tasks, including $\{T_m\}_{m=1}^{M}$ and $\{T'_m\}_{m=1}^{M}$, are trained for $c$ iterations simultaneously via a gradient descent based method. Next, for each task $T_m$, we evaluate the validation losses $J(\theta_m \mid D^v_m)$ and $J(\bar{\theta}_m \mid D^v_m)$ on the corresponding validation set and substitute $\theta_m$ with $\bar{\theta}_m$ if the latter achieves a lower validation loss. In this way, task $T_m$ can actively accept the knowledge from other tasks if that knowledge helps to improve its generalization performance and decline it otherwise. We name this operation determining transfer. Meanwhile, the relationship list is updated. Specifically, $r_j$ is updated by (6):

$r_j \leftarrow r_j + \tanh\big(J(\theta_m \mid D^v_m) - J(\bar{\theta}_m \mid D^v_m)\big)$ (6)

In other words, $r_j$ increases if the transferred $\theta_j$ ($\bar{\theta}_m$), after training on $D^t_m$, achieves a lower validation loss on $D^v_m$ than $\theta_m$, and vice versa. After this operation, the algorithm goes back to the knowledge reallocation operation, or terminates if it reaches the maximum number of training iterations or if the validation loss of any task does not reduce for $p$ consecutive validations.

Through periodically performing knowledge reallocation and determining transfer, each task can share its learned knowledge with other tasks and can investigate whether the knowledge from other tasks is beneficial for improving its own generalization performance. In this process, once a task gets trapped in inferior local optima (i.e., unable to further reduce the validation loss), the knowledge from other tasks can potentially help it escape. Meanwhile, the transferred knowledge from different training tasks imposes implicit regularization on the trained DNNs, which improves the generalization performance.
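The per-task bookkeeping of equations (3)–(6) is simple to express in code. Below is a minimal sketch for one task $T_m$, assuming helpers `train_steps` (trains the given parameters for c iterations on $D^t_m$) and `val_loss` (evaluates $J(\cdot \mid D^v_m)$); these helpers and the NumPy representation of the relationship list are illustrative assumptions, not the paper's implementation.

```python
# Sketch of knowledge reallocation and determining transfer for one task
# T_m (eqs. (3)-(6)). train_steps and val_loss are assumed helpers.
import copy
import numpy as np

def softmax_over_others(rl: np.ndarray, m: int) -> np.ndarray:
    """Eq. (4): probability of acquiring knowledge from each other task."""
    e = np.exp(rl - rl.max())   # numerically stabilized exponentials
    e[m] = 0.0                  # a task never selects itself
    return e / e.sum()

def amto_round(m, thetas, rl_m, train_steps, val_loss, c=100):
    # Knowledge reallocation: sample a source task j with probability p_j.
    j = np.random.choice(len(rl_m), p=softmax_over_others(rl_m, m))
    theta_bar = copy.deepcopy(thetas[j])       # theta_bar_m <- theta_j, eq. (5)

    train_steps(thetas[m], c)                  # train theta_m on D^t_m
    train_steps(theta_bar, c)                  # train theta_bar_m on D^t_m

    loss_own = val_loss(thetas[m])             # J(theta_m | D^v_m)
    loss_bar = val_loss(theta_bar)             # J(theta_bar_m | D^v_m)
    rl_m[j] += np.tanh(loss_own - loss_bar)    # relationship update, eq. (6)
    if loss_bar < loss_own:                    # determining transfer
        thetas[m] = theta_bar
    return thetas, rl_m
```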
At the end of the training process, the DNN which achieves the highest harmonic accuracy ($A_{har}$) across all the validation sets $\{D^v_m\}_{m=1}^{M}$ is selected as the final output. Equation (7) defines $A_{har}$, where $A_m$ represents the accuracy evaluated on validation set $D^v_m$:

$A_{har} = \frac{M}{\sum_{m=1}^{M} 1/A_m}$ (7)

[Fig. 2: Computing architecture and the knowledge transfer process, showing one master GPU and one slave GPU per computing unit, together with the knowledge reallocation and determining transfer operations.]
Algorithm 1: Parallelized Adaptive Multitask Optimization based DNN Training Algorithm

Input: $D_{train}$: training set; $J$: loss function; $MaxIter$: maximum number of training iterations; $c$: number of training steps between two consecutive knowledge reallocation operations.
Output: Trained DNN parameters $\theta$.

parfor m = 1 → M do  ▷ initialization
    Generate a pair of training and validation sets $\{D^t_m, D^v_m\}$ from $D_{train}$ via random splitting.
    Formulate task $T_m$ which trains the DNN on $D^t_m$ and uses $D^v_m$ for estimating the generalization performance.
    Assign task $T_m$ to the master of $CU_m$.
    Copy the task $T_m$ to the slave of $CU_m$ and denote its model parameters as $\bar{\theta}_m$.
    Initialize all elements of $RL_m$ to zero.
end parfor
$Iter \leftarrow 0$.
while $Iter \times c < MaxIter$ and early stopping is not triggered do
    parfor m = 1 → M do
        Acquire $\theta_j$ according to (4).
        $\bar{\theta}_m \leftarrow \theta_j$, $m \neq j$.  ▷ knowledge reallocation
        Train both $\theta_m$ and $\bar{\theta}_m$ on $D^t_m$ for $c$ iterations.
        if $J(\bar{\theta}_m \mid D^v_m) < J(\theta_m \mid D^v_m)$ then
            $\theta_m \leftarrow \bar{\theta}_m$.  ▷ determining transfer
        end if
        Update $r_j$ by (6).
    end parfor
    $Iter \leftarrow Iter + 1$.
end while
Select the model from $\{\theta_m\}_{m=1}^{M}$ which achieves the highest harmonic accuracy across all validation sets $\{D^v_m\}_{m=1}^{M}$ as the final learned outcome.
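The final model-selection step of Algorithm 1, based on the harmonic accuracy of equation (7), can be sketched as follows; `accuracy` is an assumed helper returning the Top-1 accuracy (in (0, 1]) of parameters on a given validation set.

```python
# Sketch of the final model selection via harmonic accuracy, eq. (7).
# accuracy(theta, d_val) is an assumed helper, not the paper's code.

def harmonic_accuracy(theta, val_sets, accuracy):
    accs = [accuracy(theta, d_val) for d_val in val_sets]
    return len(accs) / sum(1.0 / a for a in accs)   # M / sum(1/A_m)

def select_final_model(thetas, val_sets, accuracy):
    # Each candidate is scored on *all* validation sets, so the winner
    # must generalize well across every split, not just its own.
    return max(thetas, key=lambda t: harmonic_accuracy(t, val_sets, accuracy))
```

The harmonic mean is dominated by the smallest per-split accuracy, so a model that does very well on its own validation set but poorly on the others is penalized.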
3) Parallelization of the Implementation on a GPU Cluster:
The algorithm is well-suited for parallelization to improve efficiency. We implement the algorithm on a GPU-enabled supercomputer called OzSTAR (https://supercomputing.swin.edu.au/). As demonstrated in Fig. 2, we define a Computing Unit (CU), consisting of two GPUs, as the basic unit to solve one training task $T_m$. The two GPUs act as master and slave respectively, where the slave solves the temporary training task $T'_m$ and the master solves $T_m$. During training, all formulated training tasks are solved simultaneously, with each CU solving one task. In this case, the efficiency is comparable to the conventional STO based training paradigm. We present the pseudocode of the parallelized AMTO in Algorithm 1.
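A minimal sketch of the parallel driver is shown below: each round, every computing unit trains its master parameters $\theta_m$ and slave parameters $\bar{\theta}_m$ for c iterations concurrently, after which knowledge reallocation and determining transfer run between rounds. The `train_steps` helper is an assumption carried over from the earlier sketches; the actual OzSTAR master/slave GPU mapping is not reproduced here.

```python
# Sketch of one parallel training round across all computing units.
# train_steps must be a picklable helper in this process-pool setting.
from concurrent.futures import ProcessPoolExecutor

def train_round(task_states, train_steps, c=100):
    """task_states: one (theta_m, theta_bar_m) pair per CU."""
    with ProcessPoolExecutor(max_workers=len(task_states)) as pool:
        futures = [pool.submit(train_steps, theta, theta_bar, c)
                   for theta, theta_bar in task_states]
        return [f.result() for f in futures]   # updated parameter pairs
```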
IV. EXPERIMENTS

This section evaluates the proposed AMTO method on three publicly available image classification datasets, aiming to demonstrate that:
• the proposed AMTO algorithm can achieve better generalization performance than the conventional STO;
• the performance of the proposed AMTO algorithm improves with the number of formulated tasks.
We will elaborate on the dataset details, experimental settings, and results with analysis.
A. Datasets

a) UCMerced:
This dataset [26] was manually extracted from the USGS National Map Urban Area Imagery collection for various urban areas around the country. The pixel resolution of this public domain imagery is 0.3 m. This dataset contains 21 land-use classes with 100 images per class, and each image has a size of 256×256 pixels. We randomly divide this dataset into a gross training set (80%) and a testing set (20%).

b) OxfordPets: This dataset [27] has around 7400 images covering 37 different breeds of pets. It has been pre-partitioned into a gross training set (50%) and a testing set (50%). The relatively small ratio of the training set increases the challenge of training a DNN with good generalization ability.

c) RSSCN7:
This dataset [28] contains 2800 remote sensing images from 7 typical land-use classes. There are 400 images per class collected from Google Earth, sampled at 4 different scales with 100 images per scale. Each image has a size of 400×400 pixels. This dataset is rather challenging due to the wide diversity of the scene images, which are captured in different seasons and under various weather conditions and are sampled at different scales. The same as UCMerced, this dataset is randomly partitioned into a gross training set (80%) and a testing set (20%).

We will further generate a training set and a validation set from the gross training set for training and use the testing set for testing.

[Fig. 3: The mean validation loss and Top-1 accuracy of 5 runs with different numbers of formulated tasks of AMTO; one panel per dataset and model combination (UCMerced, OxfordPets and RSSCN7 × DenseNet-121, MobileNetV2 and SqueezeNet), each plotting the validation loss and the accuracy (%).]
B. Experimental Setting
We compare our method with the conventional STO training paradigm (training one individual DNN with a validation set) on various popular DNN models, including SqueezeNet [6], MobileNetV2 [5], and DenseNet-121 [4]. In STO, we randomly split 10% of the data from the gross training set of each dataset for validation and keep the remaining for training. In MTO, we formulate each related training task with a distinct pair of validation and training sets generated from the gross training set, where the ratio for the validation set is 10% as well.

Since the datasets we use are small, we initialize these DNNs with the parameters pre-trained on ImageNet [29]. The training samples are augmented by random horizontal flipping and resized to 224×224×3 pixels to match the required input size of the DNN models. Each single task is solved by momentum SGD with Nesterov [30], where the initial learning rate is e− and the momentum is 0.9. The maximum number of training iterations is e and the mini-batch size is 64. The learning rate is dropped by 0.1 at e and e iterations. We apply early stopping in both STO and AMTO, and the training process is terminated if the validation loss of any task does not reduce after 10 consecutive validations. For the AMTO method, we formulate four training tasks and apply knowledge reallocation and determining transfer every 100 training iterations.

The experiments are executed for 5 runs on each dataset and DNN model, where STO and AMTO start from the same random seed in each run. We use the mean Top-1 accuracy on the testing set over the 5 runs as the metric to measure the generalization performance of the trained DNN. All the experiments are implemented with PyTorch and run on the HPC platform OzSTAR, where each node has two Nvidia Tesla P100 GPUs.
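For reference, the per-task solver described above corresponds to a standard PyTorch configuration along the following lines; the learning-rate value and milestone iterations are illustrative placeholders rather than the paper's exact settings, while the momentum of 0.9, Nesterov acceleration, and the 0.1 decay factor follow the text.

```python
# Sketch of the per-task solver configuration (Section IV-B).
# lr and milestones are placeholders; other values follow the text above.
import torch
from torchvision import models, transforms

model = models.squeezenet1_0(pretrained=True)     # ImageNet initialization

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),            # augmentation used above
    transforms.Resize((224, 224)),                # match the model input size
    transforms.ToTensor(),
])

optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-2,              # placeholder value
                            momentum=0.9,
                            nesterov=True)
# Drop the learning rate by 0.1 at chosen milestone iterations (placeholders).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[5000, 8000], gamma=0.1)
```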
C. Results and Analysis

1) Comparison With the Single Task Optimization: Training a DNN for image classification aims to find a DNN model with desirable generalization performance, which is measured by the testing performance after training. In this experiment, we compare our proposed AMTO method with the conventional STO in terms of Top-1 accuracy to verify its effectiveness in improving generalization ability.

Table I compares the mean Top-1 accuracy of 5 runs achieved by STO and AMTO on the three datasets with the three popular architectures. From the table, we can make the following observations: (i) the DNNs trained by AMTO perform better than those of STO in all cases, which demonstrates that AMTO is effective in training DNNs with better generalization performance; (ii) the networks with a smaller capacity (SqueezeNet and MobileNetV2) generally benefit more from AMTO. This is noteworthy as small networks often perform less desirably due to the trade-off with speed and size. Improving the performance of small networks can greatly enhance their applicability, e.g., on portable devices.
2) AMTO With Different Number of Formulated Tasks:
The prior experiments studied AMTO with four formulated tasks. We next investigate how AMTO scales with different numbers of formulated training tasks. Fig. 3 shows AMTO's mean validation loss, as well as the mean Top-1 accuracy, on the three datasets with 1, 2, 4 and 6 formulated tasks, where one formulated task corresponds to the conventional STO. From this figure, we can find that the mean validation loss of the target task reduces as the number of formulated tasks increases in AMTO in all cases. This demonstrates that the optimization ability of AMTO is enhanced as the number of formulated tasks increases.

On the other hand, the mean Top-1 accuracy on the testing set of the trained DNN is higher than that of STO in all cases, which verifies the effectiveness in improving the DNN's generalization performance. It is also noticeable that the mean Top-1 accuracy does not monotonically improve as the mean validation loss decreases. This phenomenon is reasonable since a distribution gap exists between the validation sets and the testing set, so that decreasing the validation loss does not guarantee improving the testing performance. Another possible reason is that randomness exists in the algorithm, especially in the knowledge reallocation step, which causes the fluctuation. Moreover, as the total number of training iterations is fixed, an increasing number of formulated tasks may lead to less chance of transferring useful knowledge across the tasks. To further improve the stability of AMTO, a more effective knowledge transfer method needs to be devised.

TABLE I: Comparison of the mean Top-1 accuracy (%) of 5 runs of STO and AMTO on the testing set of three datasets and three different DNN models.

UCMerced
Method   SqueezeNet   MobileNetV2   DenseNet-121
STO      93.48        97.24         97.76
AMTO     94.95        97.29         98.05
(gap)    1.47         0.05          0.29

OxfordPets
Method   SqueezeNet   MobileNetV2   DenseNet-121
STO      82.90        90.65         93.00
AMTO     84.34        91.34         93.19
(gap)    1.44         0.69          0.19

RSSCN7
Method   SqueezeNet   MobileNetV2   DenseNet-121
STO      93.71        96.04         96.79
AMTO     94.75        96.43         96.89
(gap)    1.04         0.39          0.10

V. CONCLUSIONS AND FUTURE WORK
We proposed a novel DNN training framework based on MTO which can not only enhance training effectiveness but also improve the generalization performance of the trained DNN model via knowledge transfer and sharing. We implemented the proposed framework, parallelized the implementation on a GPU cluster, and applied it to three popular DNN models. Performance evaluation and comparison demonstrated that the DNN models trained via the proposed framework achieve better generalization performance than those trained via the conventional training paradigm. In the future, we plan to explore more ways of formulating the related training tasks. Furthermore, we will perform an in-depth study on how the number of formulated training tasks influences the performance so as to devise a way to make the best use of multiple related training tasks. Moreover, we plan to further enhance the modules of related training tasks formulation and MTO in the proposed framework based on some of our previous works [31]–[33].

ACKNOWLEDGMENT
This work was performed on the OzSTAR national facility at Swinburne University of Technology. The OzSTAR program receives funding in part from the Astronomy National Collaborative Research Infrastructure Strategy (NCRIS) allocation provided by the Australian Government. This work was supported in part by the Australian Research Council (ARC) under Grants LP170100416, LP180100114 and DP200102611, the Research Grants Council of the Hong Kong SAR under Project CityU11202418, and the China Scholarship Council (CSC).
REFERENCES

[1] L. Prechelt, "Early stopping-but when?" in Neural Networks: Tricks of the Trade. Springer, 1998, pp. 55–69.
[2] ——, "Automatic early stopping using cross validation: quantifying the criteria," Neural Networks, vol. 11, no. 4, pp. 761–767, 1998.
[3] S. Arlot, A. Celisse et al., "A survey of cross-validation procedures for model selection," Statistics Surveys, vol. 4, pp. 40–79, 2010.
[4] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[5] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[7] H. Leung and S. Haykin, "The complex backpropagation algorithm," IEEE Transactions on Signal Processing, vol. 39, no. 9, pp. 2101–2104, 1991.
[8] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, "On optimization methods for deep learning," 2011.
[9] J. T. Springenberg, "Unsupervised and semi-supervised learning with categorical generative adversarial networks," arXiv preprint arXiv:1511.06390, 2015.
[10] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan et al., "Population based training of neural networks," arXiv preprint arXiv:1711.09846, 2017.
[11] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.
[12] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[13] T. Q. Huynh and R. Setiono, "Effective neural network pruning using cross-validation," in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, vol. 2. IEEE, 2005, pp. 972–977.
[14] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[15] R. Caruana, "Multitask learning," in Learning to Learn. Springer, 1998, pp. 95–133.
[16] K. Swersky, J. Snoek, and R. P. Adams, "Multi-task Bayesian optimization," in Advances in Neural Information Processing Systems, 2013, pp. 2004–2012.
[17] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag, "Collaborative hyperparameter tuning," in International Conference on Machine Learning, 2013, pp. 199–207.
[18] D. Yogatama and G. Mann, "Efficient transfer learning method for automatic hyperparameter tuning," in Artificial Intelligence and Statistics, 2014, pp. 1077–1085.
[19] A. Gupta, Y.-S. Ong, and L. Feng, "Multifactorial evolution: toward evolutionary multitasking," IEEE Transactions on Evolutionary Computation, vol. 20, no. 3, pp. 343–357, 2016.
[20] L. Feng, L. Zhou, J. Zhong, A. Gupta, Y.-S. Ong, K.-C. Tan, and A. Qin, "Evolutionary multitasking via explicit autoencoding," IEEE Transactions on Cybernetics, no. 99, pp. 1–14, 2018.
[21] B. Zhang, A. K. Qin, and T. Sellis, "Evolutionary feature subspaces generation for ensemble classification," in Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2018, pp. 577–584.
[22] A. Gupta and Y.-S. Ong, "Genetic transfer or population diversification? Deciphering the secret ingredients of evolutionary multitask optimization," IEEE, 2016, pp. 1–7.
[23] A. Gupta, Y.-S. Ong, and L. Feng, "Multifactorial evolution: toward evolutionary multitasking," IEEE Transactions on Evolutionary Computation, vol. 20, no. 3, pp. 343–357, 2015.
[24] L. Feng, W. Zhou, L. Zhou, S. Jiang, J. Zhong, B. Da, Z. Zhu, and Y. Wang, "An empirical study of multifactorial PSO and multifactorial DE," IEEE, 2017, pp. 921–928.
[25] J. Zhong, L. Feng, W. Cai, and Y.-S. Ong, "Multifactorial genetic programming for symbolic regression problems," IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–14, 2018.
[26] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2010, pp. 270–279.
[27] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "Cats and dogs," IEEE, 2012, pp. 3498–3505.
[28] Q. Zou, L. Ni, T. Zhang, and Q. Wang, "Deep learning based feature selection for remote sensing scene classification," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 11, pp. 2321–2325, 2015.
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," IEEE, 2009, pp. 248–255.
[30] Y. Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)," in Doklady AN USSR, vol. 269, 1983, pp. 543–547.
[31] A. K. Qin and P. N. Suganthan, "Initialization insensitive LVQ algorithm based on cost-function adaptation," Pattern Recognition, vol. 38, no. 5, pp. 773–776, 2005.
[32] M. Gong, Y. Wu, Q. Cai, W. Ma, A. K. Qin, Z. Wang, and L. Jiao, "Discrete particle swarm optimization for high-order graph matching," Information Sciences, vol. 328, pp. 158–171, 2016.
[33] L. Feng, L. Zhou, J. Zhong, A. Gupta, Y.-S. Ong, K.-C. Tan, and A. K. Qin, "Evolutionary multitasking via explicit autoencoding," IEEE Transactions on Cybernetics, no. 99, pp. 1–14, 2018.