Transfer Learning using Neural Ordinary Differential Equations
Rajath S
Student, Department of Computer Science, PES University
Bengaluru, [email protected]

Sumukh Aithal K
Student, Department of Computer Science, PES University
Bengaluru, [email protected]

Dr. S Natarajan
Professor, Department of Computer Science, PES University
Bengaluru, [email protected]
Abstract—We introduce a concept of using Neural Ordinary Differential Equations (NODE) for transfer learning. In this paper we use EfficientNets to explore transfer learning on the CIFAR-10 dataset, and we use NODE for fine-tuning our model. Using NODE for fine-tuning provides more stability during training and validation. These continuous-depth blocks also allow a trade-off between numerical precision and speed. We conclude that using Neural ODEs for transfer learning results in much more stable convergence of the loss function.
Index Terms—Transfer Learning, Neural Ordinary Differential Equations (NODE), Image Classification, EfficientNet
I. INTRODUCTION
Image classification is one of the fundamental tasks in computer vision, and there has been significant improvement in the accuracy of image classification models since the advent of CNNs. AlexNet [9] and GoogLeNet [17] showed that deeper and larger neural network models perform better at image classification. CNNs learn by feature extraction, and features extracted by one model trained on a particular dataset can be used by another model performing a similar task.

Given the enormous amount of resources used to train computer vision models, transfer learning is a very popular technique in deep learning. Transfer learning significantly improves training time and gives better results compared to conventional techniques. While most machine learning algorithms are designed to address single tasks, the development of algorithms that facilitate transfer learning is a topic of ongoing interest in the machine-learning community.

In this paper we study the concept of using NODE for transfer learning, with the EfficientNet model as the backbone and ImageNet weights as the pretrained weights. The intuition behind this is that the brain is also considered to be a continuous-time system.

The remaining structure of the paper is as follows. Section II contains the related work. The proposed method and details of the experiments are explained in Section III. Results and conclusions are presented in Sections IV and V respectively.

II. RELATED WORK
EfficientNets [19] are a family of models which can be systematically scaled up based on the resources available. This family of models achieves better accuracy than traditional ConvNets. In these models there is a principled way in which models are scaled up: a balance between width, depth and resolution is achieved by simply scaling them up with a constant ratio. These models can be scaled up based on the computational resources available. For instance, if there are n times more resources, then the depth can be increased by α^n, the width by β^n and the image size by γ^n, where α, β, γ are constant coefficients. EfficientNets transfer well to datasets like CIFAR-100 [8] and Flowers with fewer parameters. Tan et al. [19] have examined eight models, from EfficientNetB0 to EfficientNetB7, for their efficiency and performance.

Neural Ordinary Differential Equations (NODE) [3] are a family of neural networks where a discrete sequence of hidden layers need not be specified; instead, the derivative of the hidden state is parameterized using a neural network. Networks such as ResNets [6] and Recurrent Neural Networks [13] can be modelled as continuous transforms using NODE. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed.

Chen et al. [3] use adaptive step-size solvers to solve ODEs reliably. This solver uses the adjoint method [12], and the resulting network has a memory cost of O(1). A network with the same architecture, but where gradients are backpropagated directly through a Runge-Kutta integrator, is referred to as RK-Net and has O(L) memory cost, where L is the number of layers in the network. Here, the continuous dynamics of the hidden units are parameterized using an ordinary differential equation (ODE) specified by a neural network:

dh(t)/dt = f(h(t), t, θ),  t ∈ [0, T]   (1)

The input layer is h(0), and the output layer is defined as h(T), the solution of this ODE initial value problem at some time T (a minimal code sketch of this formulation is given at the end of this section).

Currently, as numerical instability is an issue with ODEs, augmented versions of Neural ODE networks have been developed [5] [20]. Among the many extensions of Neural ODEs, one approach allows evolution of the neural network parameters in a coupled ODE-based formulation. Augmented Neural ODEs have also been proposed which, in addition to being more expressive than traditional Neural ODEs, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs [20].

Stability while training deep neural networks is important, as it consistently offers improved robustness against a broader range of distortion strengths and types unseen during training, considerably smaller hyperparameter dependence and fewer potentially negative side effects compared to data augmentation [10]. Stability during training has also been achieved using different activation functions such as the bounded Rectified Linear Unit (ReLU), bounded leaky ReLU, and bounded bi-firing functions [11].

There are many domains where well-annotated data is not easy to obtain due to data acquisition expenses. Collection of data is complex and expensive, making it extremely difficult to build a large-scale, high-quality annotated dataset. Transfer learning is a solution to this problem as it relaxes the hypothesis that the training data must be independent and identically distributed with the test data [18].
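As flagged above, the following minimal sketch makes Eq. (1) concrete. It is not the authors' code: it assumes the torchdiffeq library released alongside [3], and the layer sizes, tolerances and integration interval are illustrative placeholders. It contrasts direct backpropagation through a Runge-Kutta integrator (RK-Net style, O(L) memory) with the adjoint sensitivity method (O(1) memory).

```python
# Minimal sketch of Eq. (1), assuming the torchdiffeq library that accompanies [3]
# (pip install torchdiffeq); dimensions, tolerances and time span are illustrative.
import torch
import torch.nn as nn
from torchdiffeq import odeint, odeint_adjoint

class ODEFunc(nn.Module):
    """f(h(t), t, theta): a small network giving the time derivative of the hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):
        return self.net(h)

func = ODEFunc(dim=32)
h0 = torch.randn(8, 32)               # h(0): a batch of initial hidden states
t = torch.tensor([0.0, 1.0])          # integrate from t = 0 to t = T = 1

# RK-Net style: gradients backpropagated directly through the integrator, O(L) memory.
hT_rk = odeint(func, h0, t, rtol=1e-3, atol=1e-3)[-1]

# Adjoint sensitivity method [12], O(1) memory cost in depth.
hT_adjoint = odeint_adjoint(func, h0, t, rtol=1e-3, atol=1e-3)[-1]
```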
III. PROPOSED METHOD

In the proposed method, EfficientNetB0 from the family of EfficientNet models is used as the base model. A NODE layer is added before the final layer for fine-tuning on the CIFAR-10 dataset [8]. Although it increases the time taken to train per epoch, the NODE block is added to gain stability in the training process. Two ODE solvers for NODE, the Runge-Kutta method [16] and the modern adjoint sensitivity method [15], are used and compared. The default implementation tf.contrib.integrate.odeint is used to solve the ordinary differential equation initial value problems; this default method uses the Runge-Kutta solver. The adjoint sensitivity method is also used for better memory efficiency. The adjoint method is computationally efficient and is numerically much more stable.

While using the Runge-Kutta (RK) method, the user-defined relative and absolute tolerance limits can be varied to achieve optimal performance. Setting both of these parameters as − provided optimal results. The tolerance limit is a parameter that trades off accuracy against computational cost. For the adjoint sensitivity method, the default parameters provided by Chen et al. [3] are used.

The proposed model comes in two variations: one with the RK solver, which is run for 200 epochs, and one with the modern adjoint sensitivity method, which is run for 160 epochs. In the proposed method, the EfficientNetB0 model is run for 200 epochs to obtain the desired accuracy as mentioned in the performance table. The NODE block is then added to the previous model and trained until the desired validation accuracy is observed. It is observed that the same validation accuracy is obtained after 100 epochs in the case of the RK solver and 160 epochs in the case of the adjoint sensitivity method.

For the initial model without NODE and the proposed model with the RK solver, the Adam optimizer [7] was used. For the proposed model with the adjoint method, stochastic gradient descent [2] performed better than the Adam optimizer. Using ImageNet [4] weights for training the model enables quicker convergence, as the features learnt by the pre-trained model are common to the image classification task.

The concept behind using NODE instead of the fully connected layer is that it potentially reduces the number of parameters, as the hidden blocks are now continuous functions of time. One can also tune the tolerance parameter for a speed/accuracy trade-off. A sketch of the resulting architecture is given below.

Fig. 1. Model Visualization
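The following is a hedged sketch of the architecture in Fig. 1, written in PyTorch with torchdiffeq (the library assumed for the adjoint variant; the Keras variant instead relies on tf.contrib.integrate.odeint). The backbone module, feature dimension and tolerance values are illustrative placeholders rather than the exact configuration reported in the tables.

```python
# Hedged sketch of the proposed model: pretrained backbone -> NODE block -> final FC layer.
# Assumptions: `backbone` is any module mapping images to (batch, feat_dim) features,
# e.g. EfficientNetB0 with ImageNet weights and its classification head removed;
# the tolerance values below are illustrative, not the exact values used in the paper.
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint

class ODEFunc(nn.Module):
    """f(h(t), t, theta): parameterizes the derivative of the feature vector."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t, h):
        return self.net(h)

class TransferNODE(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes=10, rtol=1e-3, atol=1e-3):
        super().__init__()
        self.backbone = backbone                     # pretrained feature extractor
        self.odefunc = ODEFunc(feat_dim)             # continuous-depth fine-tuning block
        self.fc = nn.Linear(feat_dim, num_classes)   # final layer (CIFAR-10: 10 classes)
        self.rtol, self.atol = rtol, atol
        self.register_buffer("t", torch.tensor([0.0, 1.0]))

    def forward(self, x):
        h0 = self.backbone(x)                        # h(0): backbone features
        hT = odeint_adjoint(self.odefunc, h0, self.t,
                            rtol=self.rtol, atol=self.atol)[-1]   # h(T)
        return self.fc(hT)
```

A training loop would then optimize this model with SGD [2] or Adam [7] exactly as for any other image classifier.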
IV. RESULTS AND CONCLUSION
It is observed that using NODE before the final layer guarantees better stability during training and validation. Both results are shown in Table I: the one with NODE at the end and the one without NODE (purely EfficientNetB0 with just a final fully connected layer). Results of both variations of the proposed model are shown. The model with EfficientNetB0 and a final fully connected layer is trained for 200 epochs. The proposed model with the RK solver is trained for 100 epochs, while the model with the adjoint solver is trained for 160 epochs. It is observed that in both variants the proposed model converges to the desired accuracy and loss much quicker than the model without NODE. Performance and stability are enhanced in the proposed model.

The first four figures (2, 3, 4 and 5) show the training accuracy, training loss, validation accuracy and validation loss respectively for the model trained with an EfficientNetB0 base and a fully connected final layer. It is observed that during training there is a lot of fluctuation and instability; the validation curves are also not stable. By just adding the NODE block before the final layer, the training process stabilizes.

Figures 6, 7, 8 and 9 depict the accuracy and loss graphs of the proposed model with the RK solver. Figure 6 depicts the training curve, where the model easily attains a training accuracy of about 98.5% in just 100 epochs. Figure 7 shows the steady decrease in the loss of the proposed model. Figure 8 depicts the validation curve, which is observed to be very stable in comparison to the previous model. Figure 9 shows the corresponding validation loss.

Figures 10, 11 and 12 show the training accuracy, training loss and validation accuracy of the proposed model with the adjoint sensitivity method. A training accuracy of 99.2% and a validation accuracy of 85.3% are observed. The adjoint sensitivity method provides even more stability than the Runge-Kutta solver. It also provides better validation accuracy and is quicker in convergence.

Table I shows the performance of the proposed model in comparison to a model with just EfficientNetB0, on the train, validation and test sets. Table III shows the parameters set for the proposed model with the RK solver. Table II shows the number of epochs both models were trained for and the corresponding time taken to train. We observe that although the proposed model takes more time per epoch, the total time taken to converge is much better.

The EfficientNetB0 model and the proposed model with the RK solver were both developed in Keras with a TensorFlow [1] backend. The proposed model with the adjoint sensitivity method was developed in PyTorch [14]. All the above models were trained on GTX 1080 Ti GPUs with 8GB memory.
Fig. 2. Training accuracy
Fig. 3. Training loss
Fig. 4. Validation accuracy
Fig. 5. Validation loss
Fig. 6. Training accuracy (proposed model with RK solver)
Fig. 7. Training loss (proposed model with RK solver)
Fig. 8. Validation accuracy (proposed model with RK solver)
Fig. 9. Validation loss (proposed model with RK solver)
Fig. 10. Training accuracy using adjoint solver (proposed model)
Fig. 11. Training loss using adjoint solver (proposed model)
Fig. 12. Validation accuracy using adjoint solver (proposed model)

TABLE I
PERFORMANCE

Model Name                 Training Accuracy   Validation Accuracy   Test Accuracy
EfficientNetB0             98.7%               84.5%                 81%
Proposed Model (RK)        98.5%               84.7%                 81%
Proposed Model (Adjoint)   99.2%               85.3%                 81%

TABLE II
TIME TAKEN

Model Name                 Epochs   Time Taken
EfficientNetB0             200      4.5 hours
Proposed Model (RK)        100      3.75 hours
Proposed Model (Adjoint)   160      3.25 hours

TABLE III
PARAMETERS SET

Model Name                 Absolute Tolerance   Relative Tolerance
Proposed Model (RK)        −                    −

V. FUTURE WORK
Using continuous-depth models for transfer learning gives more freedom to choose a trade-off between accuracy and time by optimizing the tolerance values. Other solvers can be explored and examined; studying different methods can provide suitable solvers with better numerical stability and precision. NODE can be used for transfer learning with different network backbones to see which networks perform better. By optimizing the model further, NODE can become a benchmark for transfer learning on most of the common datasets used in deep learning.
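As a hedged illustration of how such solver and tolerance choices could be explored (assuming the same torchdiffeq interface as in the earlier sketches; the solver names and tolerance values are illustrative), the integrator is swapped through keyword arguments only, leaving the rest of the model untouched:

```python
# Hedged sketch: swapping the ODE solver and tolerances with torchdiffeq.
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumption: torchdiffeq is the solver library

net = nn.Sequential(nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 32))
f = lambda t, h: net(h)                          # dh/dt = f(h(t), t, theta)
h0, t = torch.randn(4, 32), torch.tensor([0.0, 1.0])

# Adaptive Dormand-Prince solver; loose tolerances trade accuracy for speed.
hT_fast = odeint(f, h0, t, method="dopri5", rtol=1e-2, atol=1e-2)[-1]

# Fixed-step classical Runge-Kutta solver; no tolerance control.
hT_rk4 = odeint(f, h0, t, method="rk4")[-1]
```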
REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[3] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[8] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[10] Jan Laermann, Wojciech Samek, and Nils Strodthoff. Achieving generalizable robustness of deep neural networks by stability training. arXiv preprint arXiv:1906.00735, 2019.
[11] Shan Sung Liew, Mohamed Khalil-Hani, and Rabia Bakhteri. Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems. Neurocomputing, 216:718–734, 2016.
[12] Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for ODE based models using adjoint-state method. Computational Statistics, 32(4):1621–1643, 2017.
[13] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
[14] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[15] Lev Semenovich Pontryagin. Mathematical Theory of Optimal Processes. Routledge, 2018.
[16] Michael Schober, David K. Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta means. In Advances in Neural Information Processing Systems, pages 739–747, 2014.
[17] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[18] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018.
[19] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[20] Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, and Michael Mahoney. ANODEV2: A coupled neural ODE evolution framework. arXiv preprint arXiv:1906.04596, 2019.