Transfer Learning using Neural Ordinary Differential Equations
Rajath S
Student, Department of Computer Science, PES University
Bengaluru, [email protected]

Sumukh Aithal K
Student, Department of Computer Science, PES University
Bengaluru, [email protected]

Dr. S Natarajan
Professor, Department of Computer Science, PES University
Bengaluru, [email protected]
Abstract—We introduce a concept of using Neural Ordinary Differential Equations (NODE) for transfer learning. In this paper we use EfficientNets to explore transfer learning on the CIFAR-10 dataset, and we use NODE for fine-tuning our model. Using NODE for fine-tuning provides more stability during training and validation. These continuous-depth blocks also allow a trade-off between numerical precision and speed. We conclude that using Neural ODEs for transfer learning results in much more stable convergence of the loss function.
Index Terms—Transfer Learning, Neural Ordinary Differential Equations (NODE), Image Classification, EfficientNet
I. INTRODUCTION
Image classification is one of the fundamental tasks in computer vision, and there has been significant improvement in the accuracy of image classification models since the advent of CNNs. AlexNet [9] and GoogLeNet [17] showed that deeper and larger neural network models perform better at image classification. CNNs learn by feature extraction, and features extracted by one model trained on a particular dataset can be used by another model performing a similar task.

Given the enormous amount of resources used to train computer vision models, transfer learning is a very popular technique in deep learning. Transfer learning significantly improves training time and gives better results compared to conventional techniques. While most machine learning algorithms are designed to address single tasks, the development of algorithms that facilitate transfer learning is a topic of ongoing interest in the machine-learning community.

In this paper we study the concept of using NODE for transfer learning, with the EfficientNet model as the backbone and ImageNet weights as the pretrained weights. The intuition behind this is that the brain is also considered to be a continuous-time system.

The remaining structure of the paper is as follows. Section II contains the related work. The proposed method and details of the experiments are explained in Section III. Results and conclusions are presented in Sections IV and V respectively.

II. RELATED WORK
EfficientNets [19] are a family of models which can be systematically scaled up based on the resources available. This family of models achieves better accuracy than traditional ConvNets. In these models there is a principled way in which models are scaled up: a balance between width, depth and resolution is achieved by simply scaling them up with a constant ratio. These models can be scaled up based on the computational resources available. For instance, if there are n times more resources, then the depth can be increased by α^n, the width by β^n and the image size by γ^n, where α, β, γ are constant coefficients. EfficientNets transfer well to datasets like CIFAR-100 [8] and Flowers with fewer parameters. Tan et al. [19] have examined eight models, from EfficientNetB0 to EfficientNetB7, for their efficiency and performance.

Neural Ordinary Differential Equations (NODE) [3] are a family of neural networks where a discrete sequence of hidden layers need not be specified; instead, the derivative of the hidden state is parameterized using a neural network. Networks such as ResNets [6] and Recurrent Neural Networks [13] can be modelled as continuous transforms using NODE. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed.

Chen et al. [3] use adaptive step-size solvers to solve ODEs reliably. This solver uses the adjoint method [12], and the resulting network has a memory cost of O(1). A network with the same architecture, but where gradients are backpropagated directly through a Runge-Kutta integrator, is referred to as RK-Net and has O(L) memory cost, where L is the number of layers in the network. Here, the continuous dynamics of the hidden units are parameterized using an ordinary differential equation (ODE) specified by a neural network:

dh(t)/dt = f(h(t), t, θ),  t ∈ [0, T]   (1)

The input layer is h(0), and the output layer is defined as h(T), the solution of this ODE initial value problem at some time T (a minimal code sketch of this formulation is given at the end of this section).

Currently, as numerical instability is an issue with ODEs, augmented versions of Neural ODE networks have been developed [5] [20]. Among the many extensions of Neural ODEs, one approach allows evolution of the neural network parameters in a coupled ODE-based formulation. Augmented Neural ODEs have also been proposed which, in addition to being more expressive than traditional Neural ODEs, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs [20].

Stability while training deep neural networks is important, as it consistently offers improved robustness against a broader range of distortion strengths and types unseen during training, considerably smaller hyperparameter dependence and fewer potentially negative side effects compared to data augmentation [10]. Stability during training has also been achieved using different activation functions such as the bounded Rectified Linear Unit (ReLU), bounded leaky ReLU, and bounded bi-firing functions [11].

There are many domains where well-annotated data is not easy to obtain due to data acquisition expenses. Collection of data is complex and expensive, making it extremely difficult to build a large-scale, high-quality annotated dataset. Transfer learning is a solution to this problem as it relaxes the hypothesis that the training data must be independent and identically distributed with the test data [18].
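As flagged above, the following minimal sketch makes Eq. (1) concrete. It is not the authors' code: it assumes the torchdiffeq library released alongside [3], and the layer sizes, tolerances and integration interval are illustrative placeholders. It contrasts direct backpropagation through a Runge-Kutta integrator (RK-Net style, O(L) memory) with the adjoint sensitivity method (O(1) memory).

```python
# Minimal sketch of Eq. (1), assuming the torchdiffeq library that accompanies [3]
# (pip install torchdiffeq); dimensions, tolerances and time span are illustrative.
import torch
import torch.nn as nn
from torchdiffeq import odeint, odeint_adjoint

class ODEFunc(nn.Module):
    """f(h(t), t, theta): a small network giving the time derivative of the hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):
        return self.net(h)

func = ODEFunc(dim=32)
h0 = torch.randn(8, 32)               # h(0): a batch of initial hidden states
t = torch.tensor([0.0, 1.0])          # integrate from t = 0 to t = T = 1

# RK-Net style: gradients backpropagated directly through the integrator, O(L) memory.
hT_rk = odeint(func, h0, t, rtol=1e-3, atol=1e-3)[-1]

# Adjoint sensitivity method [12], O(1) memory cost in depth.
hT_adjoint = odeint_adjoint(func, h0, t, rtol=1e-3, atol=1e-3)[-1]
```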
III. PROPOSED METHOD

In the proposed method, EfficientNetB0 from the family of EfficientNet models is used as the base model. A NODE layer is added before the final layer for fine-tuning on the CIFAR-10 dataset [8]. Although it increases the time taken to train per epoch, the NODE block is added to gain stability in the training process. Two ODE solvers for NODE, the Runge-Kutta method [16] and the modern adjoint sensitivity method [15], are used and compared. The default implementation tf.contrib.integrate.odeint is used to solve the ordinary differential equation initial value problems; this default method uses the Runge-Kutta solver. The adjoint sensitivity method is also used for better memory efficiency. The adjoint method is computationally efficient and is numerically much more stable.

While using the Runge-Kutta (RK) method, the user-defined relative and absolute tolerance limits can be varied to achieve optimal performance. Setting both of these parameters as − provided optimal results. The tolerance limit is a parameter that trades off accuracy against computational cost. For the adjoint sensitivity method, the default parameters provided by Chen et al. [3] are used.

The proposed model comes in two variations: one with the RK solver, which is run for 200 epochs, and one with the modern adjoint sensitivity method, which is run for 160 epochs. In the proposed method, the EfficientNetB0 model is run for 200 epochs to obtain the desired accuracy as mentioned in the performance table. The NODE block is then added to the previous model and trained until the desired validation accuracy is observed. It is observed that the same validation accuracy is obtained after 100 epochs in the case of the RK solver and 160 epochs in the case of the adjoint sensitivity method.

For the initial model without NODE and the proposed model with the RK solver, the Adam optimizer [7] was used. For the proposed model with the adjoint method, stochastic gradient descent [2] performed better than the Adam optimizer. Using ImageNet [4] weights for training the model enables quicker convergence, as the features learnt by the pre-trained model are common to the image classification task.

The concept behind using NODE instead of the fully connected layer is that it potentially reduces the number of parameters, as the hidden blocks are now continuous functions of time. One can also tune the tolerance parameter for a speed/accuracy trade-off. A sketch of the resulting architecture is given below.

Fig. 1. Model Visualization
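The following is a hedged sketch of the architecture in Fig. 1, written in PyTorch with torchdiffeq (the library assumed for the adjoint variant; the Keras variant instead relies on tf.contrib.integrate.odeint). The backbone module, feature dimension and tolerance values are illustrative placeholders rather than the exact configuration reported in the tables.

```python
# Hedged sketch of the proposed model: pretrained backbone -> NODE block -> final FC layer.
# Assumptions: `backbone` is any module mapping images to (batch, feat_dim) features,
# e.g. EfficientNetB0 with ImageNet weights and its classification head removed;
# the tolerance values below are illustrative, not the exact values used in the paper.
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint

class ODEFunc(nn.Module):
    """f(h(t), t, theta): parameterizes the derivative of the feature vector."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t, h):
        return self.net(h)

class TransferNODE(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes=10, rtol=1e-3, atol=1e-3):
        super().__init__()
        self.backbone = backbone                     # pretrained feature extractor
        self.odefunc = ODEFunc(feat_dim)             # continuous-depth fine-tuning block
        self.fc = nn.Linear(feat_dim, num_classes)   # final layer (CIFAR-10: 10 classes)
        self.rtol, self.atol = rtol, atol
        self.register_buffer("t", torch.tensor([0.0, 1.0]))

    def forward(self, x):
        h0 = self.backbone(x)                        # h(0): backbone features
        hT = odeint_adjoint(self.odefunc, h0, self.t,
                            rtol=self.rtol, atol=self.atol)[-1]   # h(T)
        return self.fc(hT)
```

A training loop would then optimize this model with SGD [2] or Adam [7] exactly as for any other image classifier.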
IV. RESULTS AND CONCLUSION
It is observed that using NODE before the final layer guarantees better stability during training and validation. Both results are shown in Table I: the one with NODE at the end and the one without NODE (purely EfficientNetB0 with just a final fully connected layer). Results of both variations of the proposed model are shown. The model with EfficientNetB0 and a final fully connected layer is trained for 200 epochs. The proposed model with the RK solver is trained for 100 epochs, while the model with the adjoint solver is trained for 160 epochs. It is observed that in both variants the proposed model converges to the desired accuracy and loss much quicker than the model without NODE. Performance and stability are enhanced in the proposed model.

The first four figures (2, 3, 4 and 5) show the training accuracy, training loss, validation accuracy and validation loss respectively for the model trained with an EfficientNetB0 base and a fully connected final layer. It is observed that during training there is a lot of fluctuation and instability; the validation curves are also not stable. By just adding the NODE block before the final layer, the training process stabilizes.

Figures 6, 7, 8 and 9 depict the accuracy and loss graphs of the proposed model with the RK solver. Figure 6 depicts the training curve, where the model easily attains a training accuracy of about 98.5% in just 100 epochs. Figure 7 shows the steady decrease in the loss of the proposed model. Figure 8 depicts the validation curve, which is observed to be very stable in comparison to the previous model. Figure 9 shows the corresponding validation loss.

Figures 10, 11 and 12 show the training accuracy, training loss and validation accuracy of the proposed model with the adjoint sensitivity method. A training accuracy of 99.2% and a validation accuracy of 85.3% are observed. The adjoint sensitivity method provides even more stability than the Runge-Kutta solver. It also provides better validation accuracy and is quicker in convergence.

Table I shows the performance of the proposed model in comparison to a model with just EfficientNetB0, on the train, validation and test sets. Table III shows the parameters set for the proposed model with the RK solver. Table II shows the number of epochs both models were trained for and the corresponding time taken to train. We observe that although the proposed model takes more time per epoch, the total time taken to converge is much better.

The EfficientNetB0 model and the proposed model with the RK solver were both developed in Keras with a TensorFlow [1] backend. The proposed model with the adjoint sensitivity method was developed in PyTorch [14]. All the above models were trained on GTX 1080 Ti GPUs with 8GB memory.
Fig. 2. Training accuracy
Fig. 3. Training loss
Fig. 4. Validation accuracy
Fig. 5. Validation loss
Fig. 6. Training accuracy (proposed model with RK solver)
Fig. 7. Training loss (proposed model with RK solver)
Fig. 8. Validation accuracy (proposed model with RK solver)
Fig. 9. Validation loss (proposed model with RK solver)
Fig. 10. Training accuracy using adjoint solver (proposed model)
Fig. 11. Training loss using adjoint solver (proposed model)
Fig. 12. Validation accuracy using adjoint solver (proposed model)

TABLE I
PERFORMANCE

Model Name                 Training Accuracy   Validation Accuracy   Test Accuracy
EfficientNetB0             98.7%               84.5%                 81%
Proposed Model (RK)        98.5%               84.7%                 81%
Proposed Model (Adjoint)   99.2%               85.3%                 81%

TABLE II
TIME TAKEN

Model Name                 Epochs   Time Taken
EfficientNetB0             200      4.5 hours
Proposed Model (RK)        100      3.75 hours
Proposed Model (Adjoint)   160      3.25 hours

TABLE III
PARAMETERS SET

Model Name                 Absolute Tolerance   Relative Tolerance
Proposed Model (RK)        −                    −

V. FUTURE WORK
Using continuous-depth models for transfer learning gives more freedom to choose a trade-off between accuracy and time by optimizing the tolerance values. Other solvers can be explored and examined; studying different methods can provide suitable solvers with better numerical stability and precision. NODE can be used for transfer learning with different network backbones to see which networks perform better. By optimizing the model further, NODE can become a benchmark for transfer learning on most of the common datasets used in deep learning.
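As a hedged illustration of how such solver and tolerance choices could be explored (assuming the same torchdiffeq interface as in the earlier sketches; the solver names and tolerance values are illustrative), the integrator is swapped through keyword arguments only, leaving the rest of the model untouched:

```python
# Hedged sketch: swapping the ODE solver and tolerances with torchdiffeq.
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumption: torchdiffeq is the solver library

net = nn.Sequential(nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 32))
f = lambda t, h: net(h)                          # dh/dt = f(h(t), t, theta)
h0, t = torch.randn(4, 32), torch.tensor([0.0, 1.0])

# Adaptive Dormand-Prince solver; loose tolerances trade accuracy for speed.
hT_fast = odeint(f, h0, t, method="dopri5", rtol=1e-2, atol=1e-2)[-1]

# Fixed-step classical Runge-Kutta solver; no tolerance control.
hT_rk4 = odeint(f, h0, t, method="rk4")[-1]
```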
REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[3] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[8] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[10] Jan Laermann, Wojciech Samek, and Nils Strodthoff. Achieving generalizable robustness of deep neural networks by stability training. arXiv preprint arXiv:1906.00735, 2019.
[11] Shan Sung Liew, Mohamed Khalil-Hani, and Rabia Bakhteri. Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems. Neurocomputing, 216:718–734, 2016.
[12] Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for ODE based models using adjoint-state method. Computational Statistics, 32(4):1621–1643, 2017.
[13] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
[14] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[15] Lev Semenovich Pontryagin. Mathematical Theory of Optimal Processes. Routledge, 2018.
[16] Michael Schober, David K. Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta means. In Advances in Neural Information Processing Systems, pages 739–747, 2014.
[17] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[18] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018.
[19] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[20] Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, and Michael Mahoney. ANODEV2: A coupled neural ODE evolution framework. arXiv preprint arXiv:1906.04596, 2019.