TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning
Han Cai, Chuang Gan, Ligeng Zhu, Song Han
Massachusetts Institute of Technology, MIT-IBM Watson AI Lab
Abstract
We present Tiny-Transfer-Learning (TinyTL), an efficient on-device learning method for adapting pre-trained models to newly collected data on edge devices. Different from conventional transfer learning methods that fine-tune the full network or the last layer, TinyTL freezes the weights of the feature extractor and learns only the biases, and thus does not require storing the intermediate activations, which are the major memory bottleneck for on-device learning. To maintain the adaptation capacity without updating the weights, TinyTL introduces memory-efficient lite residual modules that refine the feature extractor by learning small residual feature maps in the middle of the network. Besides, instead of using the same feature extractor for all tasks, TinyTL adapts the architecture of the feature extractor to fit different target datasets while fixing the weights: TinyTL pre-trains a large super-net that contains many weight-shared sub-nets that can operate individually; each target dataset selects the sub-net that best matches it. This back-propagation-free discrete sub-net selection incurs no memory overhead. Extensive experiments show that TinyTL can reduce the training memory cost by an order of magnitude (up to 13.3x) without sacrificing accuracy compared to fine-tuning the full network.

Preprint. Under review.

1 Introduction

Intelligent edge devices with rich sensors (e.g., billions of mobile phones and IoT devices) have become ubiquitous in our daily lives. These devices keep collecting new and sensitive data through their sensors every day while being expected to provide high-quality and customized services without sacrificing privacy (see the EU data protection rules: https://ec.europa.eu/info/law/law-topic/data-protection_en). This requires AI systems to be able to continually adapt pre-trained models to newly collected data without leaking them to the cloud (i.e., on-device learning).

While plenty of efficient inference techniques have significantly reduced the parameter size and the computation FLOPs [2, 3, 18, 19, 21, 22, 40, 42, 47, 51], the size of the intermediate activations, which back-propagation requires, causes a huge training memory footprint (Figure 1, left), making it difficult to train on edge devices. First, edge devices are memory-constrained. For example, a Raspberry Pi 1 Model A only has 256MB of memory, which is sufficient for inference. However, as shown in Figure 1 (left, red line), the memory footprint of the training phase can easily exceed this limit, even with a lightweight neural network architecture (MobileNetV2 [40]). Furthermore, the memory is shared by various on-device applications (e.g., other deep learning models) and the operating system. A single application may only be allocated a small fraction of the total memory, which makes this challenge even more critical. Second, edge devices are energy-constrained. Under 45nm CMOS technology [19], a 32-bit off-chip DRAM access consumes 640 pJ, which is two orders of magnitude more than a 32-bit on-chip SRAM access (5 pJ) or a 32-bit float multiplication (3.7 pJ).
Figure 1: Left: The memory footprint required by training grows linearly w.r.t. the batch size and soon exceeds the limit of edge devices. Right: Memory cost comparison between ResNet-50 and MobileNetV2-1.4 under batch size 8. Recent advances in efficient model design only reduce the size of the parameters, but the activation size, the main bottleneck for training, does not improve much.

The large memory footprint required by training cannot fit into the limited on-chip SRAM. For instance, a TPU [27] has only 28MB of SRAM, far smaller than the training memory footprint of MobileNetV2, even with a batch size of 1 (Figure 1, left). This results in many costly DRAM accesses, which consume a large amount of energy and drain the battery of edge devices. In this work, we propose
Tiny-Transfer-Learning (TinyTL) to address these challenges. By analyzing the memory footprint during the backward pass, we notice that the intermediate activations (the main bottleneck) are only involved in updating the weights; updating the biases does not require them (Eq. 2). Inspired by this finding, we propose to freeze the weights of the pre-trained feature extractor to reduce the memory footprint (Figure 2b). To compensate for the capacity loss due to freezing the weights while keeping the memory overhead small, we introduce lite residual learning, which improves the model capacity by learning lite residual modules that refine the intermediate feature maps of the pre-trained feature extractor (Figure 2c). Meanwhile, it aggressively shrinks the resolution and width dimensions of the lite residual modules to keep the memory overhead small. We also empirically find that different transfer datasets require very different feature extractors, especially when the weights are frozen (Figure 3). Therefore, we introduce feature extractor adaptation, which updates the architecture of the feature extractor while fixing the weights to fit different target datasets (Figure 2d). Concretely, we select different sub-nets from a large pre-trained super-net. Different from conventional approaches that fix the architecture and update the weights in a continuous optimization space, our approach optimizes the feature extractor in a discrete space, which does not require any back-propagation and thus incurs no additional memory overhead. Extensive experiments on transfer learning datasets demonstrate that TinyTL achieves the same level of (or even higher) accuracy than fine-tuning the full network while reducing the training memory footprint by up to 13.3x. Our contributions can be summarized as follows:
• We propose TinyTL, a novel transfer learning method for memory-efficient on-device learning. To the best of our knowledge, this is the first work that tackles this challenging but critical problem.
• We systematically analyze the memory bottleneck of training and find that the heavy memory cost comes from updating the weights, not the biases (assuming ReLU activation).
• We propose two novel techniques (lite residual learning and feature extractor adaptation) to improve the model capacity while freezing the weights with little memory overhead.
• Extensive experiments on transfer learning tasks show that our method is highly memory-efficient and effective. It reduces the training memory footprint by up to 13.3x, making it possible to learn on memory-constrained edge devices (e.g., Raspberry Pi) without sacrificing accuracy.

2 Related Work

Efficient Inference Techniques.
Improving the inference efficiency of deep neural networks on resource-constrained edge devices has recently drawn extensive attention. Starting from [10, 14, 18, 19, 44], one line of research focuses on compressing pre-trained neural networks, including i) network pruning, which removes less-important units [12, 19] or channels [20, 33]; and ii) network quantization, which reduces the bitwidth of parameters [7, 18] or activations [26, 45]. However, these techniques cannot handle the training phase, as they rely on a well-trained model on the target task as the starting point. Another line of research focuses on lightweight neural architectures, obtained by either manual design [22, 23, 24, 40, 51] or neural architecture search [1, 4, 42, 47]. These lightweight neural networks provide highly competitive accuracy [2, 43] while significantly improving inference efficiency. However, concerning training memory efficiency, the key bottleneck is not solved: the training memory is dominated by activations, not parameters. For example, Figure 1 (right) shows the cost comparison between ResNet-50 and MobileNetV2-1.4. In terms of parameter size, MobileNetV2-1.4 is 4.3x smaller than ResNet-50. However, in terms of training activation size, MobileNetV2-1.4 is almost the same as ResNet-50 (only 1.1x smaller), leading to little reduction in the memory footprint.

Training Memory Cost Reduction.
Researchers have been seeking ways to reduce the training memory footprint. One typical approach is to re-compute discarded activations during the backward pass [6, 16]. This approach reduces memory usage at the cost of a large computation overhead, and is thus not preferred for edge devices. Layer-wise training [15] can also reduce the memory footprint compared to end-to-end training, but it cannot achieve the same level of accuracy. Another representative approach is activation pruning [32], which builds a dynamic sparse computation graph to prune activations during training. Similarly, [46] proposes to reduce the bitwidth of training activations by introducing new reduced-precision floating-point formats. Different from these techniques, which prune or quantize an existing network with a given architecture, we adapt the architecture to different datasets; our method is orthogonal to these techniques.
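As a point of reference, the re-computation idea above is available off the shelf in PyTorch; the sketch below shows it on a toy stack of linear layers (our own toy model, not from the paper):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Gradient checkpointing: intermediate activations inside each segment are
# discarded during forward and re-computed during backward, trading extra
# computation for a smaller training memory footprint.
model = torch.nn.Sequential(*(torch.nn.Linear(256, 256) for _ in range(8)))
x = torch.randn(4, 256, requires_grad=True)
y = checkpoint_sequential(model, 4, x)  # 4 segments: ~4x fewer stored activations
y.sum().backward()
```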
Transfer Learning.
Neural networks pre-trained on large-scale datasets (e.g., ImageNet [9]) are widely used as fixed feature extractors for transfer learning, where only the last layer is fine-tuned [5, 11, 13, 41]. This approach does not require storing the intermediate activations of the feature extractor, and thus is memory-efficient. However, its capacity is limited, resulting in poor accuracy, especially on datasets [30, 35] whose distribution is far from ImageNet (e.g., Inception-V3 achieves only 45.9% top1 accuracy on Aircraft [36]). Alternatively, fine-tuning the full network can achieve better accuracy [8, 29], but it requires a vast memory footprint and hence is not friendly for training on edge devices. Recently, [37] proposed to reduce the number of trainable parameters by only updating the parameters of the batch normalization (BN) [25] layers. Unfortunately, parameter-efficiency does not translate into memory-efficiency: this approach still requires a large amount of memory (e.g., 326MB under batch size 8) to store the input activations of the BN layers (Table 1). Additionally, its accuracy is still much worse than fine-tuning the full network (70.7% vs. 85.5%; Table 1). One can also partially fine-tune some layers, but how many layers to select remains ad hoc. This paper provides a systematic approach that adapts the feature extractor to different datasets and uses lite residual learning to save memory.
3 Tiny Transfer Learning

3.1 Understanding the Memory Footprint of Back-propagation

Without loss of generality, we consider a neural network $\mathcal{M}$ that consists of a sequence of layers:
$$\mathcal{M}(\cdot) = F_{w_n}(F_{w_{n-1}}(\cdots F_{w_2}(F_{w_1}(\cdot)) \cdots)), \quad (1)$$
where $w_i$ denotes the parameters of the $i$th layer. Let $a_i$ and $a_{i+1}$ be the input and output activations of the $i$th layer, respectively, and $\mathcal{L}$ be the loss. In the backward pass, given $\partial \mathcal{L} / \partial a_{i+1}$, there are two goals for the $i$th layer: computing $\partial \mathcal{L} / \partial a_i$ and $\partial \mathcal{L} / \partial w_i$.

Assuming the $i$th layer is a linear layer whose forward process is given as $a_{i+1} = a_i W + b$, its backward process under batch size 1 is
$$\frac{\partial \mathcal{L}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} \frac{\partial a_{i+1}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} W^T, \quad \frac{\partial \mathcal{L}}{\partial W} = a_i^T \frac{\partial \mathcal{L}}{\partial a_{i+1}}, \quad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial a_{i+1}}. \quad (2)$$
According to Eq. (2), the intermediate activations (i.e., $\{a_i\}$) that dominate the memory footprint are only required to compute the gradient of the weights (i.e., $\partial \mathcal{L} / \partial W$), not the biases. If we only update the biases, training memory can be greatly saved. This property also applies to convolution layers and normalization layers (e.g., batch normalization [25], group normalization [48], etc.), since they can be considered special types of linear layers.
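As a sanity check, the following sketch verifies Eq. (2) for a single linear layer in plain PyTorch (batch size 1): the weight gradient is the only term that consumes the stored input activation.

```python
import torch

# Eq. (2) for a linear layer a_{i+1} = a_i W + b:
# only dL/dW needs the stored input activation a_i.
a_i = torch.randn(1, 64)        # input activation
W = torch.randn(64, 32)
dy = torch.randn(1, 32)         # upstream gradient dL/da_{i+1}

da_i = dy @ W.t()               # dL/da_i = dL/da_{i+1} W^T   (a_i not needed)
dW = a_i.t() @ dy               # dL/dW   = a_i^T dL/da_{i+1} (a_i required)
db = dy.sum(dim=0)              # dL/db   = dL/da_{i+1}       (a_i not needed)
```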
Figure 2: TinyTL overview ("C" denotes the width and "R" denotes the resolution). Conventional transfer learning fixes the architecture of the feature extractor and relies on fine-tuning the weights to adapt the model (a), which requires a large amount of activation memory (in blue) for back-propagation. TinyTL reduces the memory usage by fixing the weights while: (b) fine-tuning only the biases; (c) exploiting lite residual learning to compensate for the capacity loss, using group convolution and avoiding the inverted bottleneck to achieve high arithmetic intensity and a small memory footprint; (d) adapting the feature extractor architecture to different downstream tasks, which can specialize a small feature extractor for an easy dataset (Flowers) and a large feature extractor for a difficult dataset (Aircraft). Their weights are shared from the same super-net, which is also parameter-efficient.

Regarding non-linear activation layers (e.g., ReLU, sigmoid, h-swish), sigmoid and h-swish require storing $a_i$ to compute $\partial \mathcal{L} / \partial a_i$, so they are not memory-efficient; activation layers built upon them, such as tanh and swish [39], are consequently not memory-efficient either. In contrast, ReLU and other ReLU-styled activation layers (e.g., LeakyReLU [49]) only require storing a binary mask representing whether each value is smaller than 0, which is 32x smaller than storing $a_i$. Detailed forward and backward processes of the activation layers are provided in Appendix D.
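A minimal PyTorch sketch of this observation is shown below, using a custom autograd function (note that PyTorch stores booleans as one byte per element, so a real implementation would additionally bit-pack the mask):

```python
import torch

class MemoryEfficientReLU(torch.autograd.Function):
    """ReLU that caches only a binary mask for backward, not the input."""

    @staticmethod
    def forward(ctx, x):
        mask = x > 0                  # all the backward pass needs
        ctx.save_for_backward(mask)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask
```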
3.2 Memory-Efficient Transfer Learning

Based on the memory footprint analysis, one possible solution for reducing the memory cost is to freeze the weights of the pre-trained feature extractor while only updating the biases (Figure 2b). However, only updating the biases provides limited adaptation capacity. In this section, we explore two techniques that improve the model capacity without updating the weights: i) lite residual modules that refine the intermediate feature maps (Figure 2c); ii) feature extractor adaptation that enables specialized feature extractors that best match different transfer datasets (Figure 2d).

3.2.1 Lite Residual Learning

Formally, a layer with frozen weights and learnable biases can be represented as:
$$a_{i+1} = F_W(a_i) + b. \quad (3)$$
To improve the model capacity while keeping a small memory footprint, we propose to add a lite residual module that generates a residual feature map to refine the output:
$$a_{i+1} = F_W(a_i) + b + F_{w_r}(a'_i), \quad a'_i = \text{reduce}(a_i), \quad (4)$$
where $a'_i = \text{reduce}(a_i)$ is the reduced activation. According to Eq. (2), learning these lite residual modules only requires storing the reduced activations $\{a'_i\}$ rather than the full activations $\{a_i\}$.

Implementation (Figure 2c).
We apply Eq. (4) to mobile inverted bottleneck blocks (MB-blocks) [40]. The key principle is to keep the activation size small. Following this principle, we explore two design dimensions for reducing the activation size (a code sketch follows the list):
• Width.
The widely-used inverted bottleneck requires a huge number of channels (6x) to compensate for the small capacity of a depthwise convolution, which is parameter-efficient but highly activation-inefficient. Even worse, converting 1x channels to 6x channels back and forth requires two 1x1 projection layers, which doubles the total activation to 12x. Depthwise convolution also has very low arithmetic intensity (its OPs/Byte is less than 4% of a 1x1 convolution's OPs/Byte with 256 channels), and is thus highly memory-inefficient with little data reuse. To address these limitations, our lite residual module employs a group convolution (g = 2) that has 300x higher arithmetic intensity than a depthwise convolution, providing a good trade-off between FLOPs and memory. This also removes the 1x1 projection layers, reducing the total channel number by 6 x 2 = 12x.
• Resolution. The activation size grows quadratically with the resolution. Therefore, we aggressively shrink the resolution in the lite residual module by employing a 2x2 average pooling to downsample the input feature map. The output of the lite residual module is then upsampled to match the size of the main branch's output feature map via bilinear upsampling. Combining the resolution and width optimizations, the activation of our lite residual module is 2 x 2 x 12 = 48x smaller than that of the inverted bottleneck.
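A minimal PyTorch sketch of the resulting lite residual branch is given below, assuming the settings above (2x2 average pooling, a group convolution with g = 2 and kernel size 5 as used in Section 4, a 1x1 convolution, and bilinear upsampling); module names are illustrative, not from a released codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiteResidualModule(nn.Module):
    """Refines the frozen main branch's output with a small residual map (Eq. 4)."""

    def __init__(self, channels: int, kernel_size: int = 5, groups: int = 2):
        super().__init__()
        self.pool = nn.AvgPool2d(2)  # R -> 0.5R: 4x fewer pixels
        self.group_conv = nn.Conv2d(channels, channels, kernel_size,
                                    padding=kernel_size // 2, groups=groups)
        self.pointwise = nn.Conv2d(channels, channels, 1)  # 1x1 conv after group conv

    def forward(self, x: torch.Tensor, main_out: torch.Tensor) -> torch.Tensor:
        r = self.pointwise(F.relu(self.group_conv(self.pool(x))))
        r = F.interpolate(r, size=main_out.shape[-2:], mode="bilinear",
                          align_corners=False)
        return main_out + r  # only the reduced activations need to be stored
```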
Figure 3: Transfer learning performance of various ImageNet pre-trained models with only the last linear layer trained. The relative accuracy order between different pre-trained models changes significantly between ImageNet and the transfer learning datasets. For example, our specialized feature extractors (red) consistently achieve the best results on the transfer datasets despite having weaker ImageNet accuracy. This suggests that we need to adapt the feature extractor to fit different transfer datasets instead of using the same one for all datasets.

3.2.2 Feature Extractor Adaptation

Conventional transfer learning chooses the feature extractor according to its pre-training accuracy (e.g., ImageNet accuracy) and uses the same one for all transfer tasks [8, 37]. However, we find this approach sub-optimal: different target tasks may need very different feature extractors, and high pre-training accuracy does not guarantee good transferability of the pre-trained weights. This is especially critical in our case, where the weights are frozen.

Figure 3 shows the top1 accuracy of various widely used ImageNet pre-trained models on three transfer datasets when only the last layer is learned, which reflects the transferability of their pre-trained weights. The relative order between different pre-trained models is not consistent with their ImageNet accuracy on any of the three datasets. This indicates that ImageNet accuracy is not a good proxy for transferability. Moreover, the same pre-trained model can rank very differently on different tasks: for instance, Inception-V3 gives poor accuracy on Flowers but provides top results on the other two datasets. Therefore, we need to specialize the feature extractor to best match the target dataset.
Implementation (Figure 2d).
Motivated by these observations, we propose to adapt the feature extractor for different transfer tasks. This is achieved by allowing a set of candidate weight operations in each layer instead of a single fixed weight operation:
$$\{\mathcal{M}(\cdot)\} = F_{\{w_n^1, \cdots, w_n^m\}}(\cdots F_{\{w_2^1, \cdots, w_2^m\}}(F_{\{w_1^1, \cdots, w_1^m\}}(\cdot)) \cdots). \quad (5)$$
This forms a discrete optimization space, allowing us to adapt the feature extractor to different target datasets without updating the weights. The detailed training flow is described as follows:
• Pre-training. The number of all possible weight operation combinations is exponentially large w.r.t. the depth, making it computationally infeasible to pre-train all of them independently. Therefore, we employ the weight sharing technique [4, 31, 50] to reduce the pre-training cost, where a single super-net is jointly optimized on the pre-training dataset (e.g., ImageNet) to support all
possible sub-nets (i.e., different combinations of weight operations). Different sub-nets can operate independently by selecting different parts of the super-net. For example, the centered weights of a full convolution kernel are taken to form a smaller convolution kernel; blocks are skipped to form a sub-net with lower depth; channels are skipped to reduce the width of a convolution operation. In our experiments, we use ImageNet as the pre-training dataset and employ progressive shrinking [2, 50] to train the super-net, using the same training settings suggested by [2].

Figure 4: Results under different resolutions. At the same level of accuracy, TinyTL provides an order of magnitude training memory saving compared to fine-tuning the full ResNet-50, making it possible to learn on-device on a Raspberry Pi 1.
• Fine-tuning the super-net.
We fine-tune the pre-trained super-net on the target transfer dataset, with the weights of the main branches (i.e., MB-blocks) frozen and the other parameters (i.e., biases, lite residual modules, classification head) updated via gradient descent. In this phase, we randomly sample one sub-net in each training step.
• Discrete operation search.
Based on the fine-tuned super-net, we collect 450 [sub-net, accuracy] pairs on the validation set (20% randomly sampled training data) and train an accuracy predictor on the collected data [2]. We then employ evolutionary search [17] on top of the accuracy predictor to find the sub-net (i.e., the combination of weight operations) that best matches the target transfer dataset (see the sketch after this list). No back-propagation through the super-net is required in this step, so it incurs no additional memory overhead.
• Final fine-tuning.
Finally, we fine-tune the searched model on the full training set, with the weights of the main branches frozen and the other parameters (i.e., biases, lite residual modules, classification head) updated, to obtain the final results.
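The discrete operation search can be sketched as follows. This is a minimal illustration under our own assumptions (a sub-net encoded as a list of per-block operation choices; `predict_acc` standing in for the trained accuracy predictor), not the exact released implementation:

```python
import random

def evolutionary_search(choices_per_block, predict_acc,
                        population=100, generations=30,
                        mutate_prob=0.1, parent_ratio=0.25):
    # Only forward passes through the predictor are needed; no back-propagation
    # through the super-net, hence no training memory overhead.
    def random_subnet():
        return [random.choice(ops) for ops in choices_per_block]

    def mutate(net):
        return [random.choice(ops) if random.random() < mutate_prob else op
                for op, ops in zip(net, choices_per_block)]

    pop = [random_subnet() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=predict_acc, reverse=True)          # rank by predicted accuracy
        parents = pop[: int(population * parent_ratio)]
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(population - len(parents))]
    return max(pop, key=predict_acc)
```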
4 Experiments

Following common practice [8, 29, 37], we evaluate TinyTL on three benchmark datasets, Cars [30], Flowers [38], and Aircraft [35], using ImageNet [9] as the pre-training dataset.
Model Architecture.
We build the super-net using the MobileNetV2 design space [4, 42], which contains five stages with gradually decreased resolution; each stage consists of a sequence of MB-blocks. At the stage level, it supports different depths (i.e., 2, 3, 4). At the block level, it supports different kernel sizes (i.e., 3, 5, 7) and different width expansion ratios (i.e., 3, 4, 6). For each MB-block, we insert a lite residual module as described in Section 3.2.1 and Figure 2c, with group number 2 and kernel size 5. We use the ReLU activation since it is more memory-efficient according to Section 3.1. The detailed architecture of the super-net is provided in Appendix C.
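For concreteness, the design space can be written down as a small configuration; the dictionary below is an illustrative encoding of the choices listed above (the names are ours), and the last lines show why the resulting space is exponentially large:

```python
# Illustrative encoding of the super-net design space (names are ours).
DESIGN_SPACE = {
    "stage_depths":  [2, 3, 4],              # blocks per stage
    "kernel_sizes":  [3, 5, 7],              # per MB-block
    "expand_ratios": [3, 4, 6],              # per MB-block
    "lite_residual": {"groups": 2, "kernel_size": 5},
}

# 3 kernels x 3 expand ratios = 9 choices per block; a stage with depth d has
# 9**d configurations, so the whole space grows exponentially with the depth.
per_stage = sum(9 ** d for d in DESIGN_SPACE["stage_depths"])  # 9^2 + 9^3 + 9^4
```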
Training Details. We freeze the weights of the feature extractor while allowing the biases to be updated during transfer learning. Both lite residual learning (LiteResidual; Section 3.2.1) and feature extractor adaptation (FeatureAdapt; Section 3.2.2) are applied in our experiments. For fine-tuning the pre-trained super-net, we use the Adam optimizer [28] with an initial learning rate of 4e-3 and cosine learning rate decay [34]. The model is trained on 80% randomly sampled training data for 50 epochs. For fine-tuning the searched model, we use the same training settings but on the full training data. Additionally, we apply 8-bit weight quantization [18] to the frozen weights to reduce the parameter size, which causes a negligible accuracy drop in our experiments. For all compared methods, we also assume 8-bit weight quantization is applied, when eligible, in calculating their training memory footprint. Details of the accuracy predictor are provided in Appendix E.

Table 1: Comparison with previous transfer learning methods on Cars, Flowers, and Aircraft (training memory is reported on Flowers under batch size 8; ∗ indicates our re-implemented results; "R" denotes the input image size). TinyTL reduces the training memory by 13.3x without sacrificing accuracy compared to fine-tuning the full Inception-V3.
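A minimal sketch of this fine-tuning setup is shown below; the name-matching rules are an assumption about how the modules might be labeled, not the released code:

```python
import torch

def tinytl_parameters(model: torch.nn.Module):
    """Select biases, lite residual modules, and the classification head for
    training; freeze everything else (hypothetical parameter-name convention)."""
    trainable = []
    for name, p in model.named_parameters():
        if name.endswith(".bias") or "lite_residual" in name or "classifier" in name:
            trainable.append(p)
        else:
            p.requires_grad = False      # frozen pre-trained weights
    return trainable

# optimizer = torch.optim.Adam(tinytl_parameters(model), lr=4e-3)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```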
Main Results. Table 1 reports the comparison between TinyTL and previous transfer learning methods, divided into three groups: i) fine-tuning the last linear layer [5, 11, 41] (referred to as Last); ii) fine-tuning the BN layers and the last linear layer [36] (referred to as BN+Last); iii) fine-tuning the full network [8, 29] (referred to as Full).

In the first group, we apply only FeatureAdapt to adapt the feature extractor while training only the parameters of the last linear layer, similar to Last. Compared to Last + Inception-V3, our model reduces the training memory cost while improving the top1 accuracy on Cars, Flowers, and Aircraft, showing that our specialized feature extractors fit different transfer datasets better than a fixed feature extractor. In the second group, we apply only LiteResidual to refine the intermediate feature maps, using Proxyless-Mobile [4] as the feature extractor. Compared to BN+Last with ResNet-50, our model improves the training memory efficiency while providing consistently better accuracy on all three datasets. Increasing the input image size from 224 to 320 further enlarges the accuracy improvement, which shows that learning lite residual modules and biases is not only more memory-efficient but also more effective than BN+Last. In the third group, we apply both FeatureAdapt and LiteResidual. Compared to Full + Inception-V3, TinyTL achieves the same level of accuracy while providing 13.3x training memory saving, reducing the training memory from 850MB to 64MB (the same level as learning only the last linear layer).

Figure 4 demonstrates the results under different input resolutions. With similar accuracy, TinyTL provides an order of magnitude memory reduction on Cars, Flowers, and Aircraft compared to fine-tuning the full ResNet-50. Remarkably, it moves the training memory cost from the out-of-memory region (red) to the feasible region (green) on a Raspberry Pi 1, making it possible to learn on-device without sacrificing accuracy.

4.1 Ablation Studies and Discussions

Comparison with Dynamic Activation Pruning.
The comparison between TinyTL and dynamic activation pruning [32] is summarized in Figure 5. TinyTL is more effective because it re-designs the transfer learning architecture (lite residual modules, feature extractor adaptation) rather than pruning an existing architecture. The transfer accuracy drops quickly when the pruning ratio increases beyond 50% (only 2x memory saving). In contrast, TinyTL achieves a much higher memory reduction without loss of accuracy.

Figure 5: Compared with dynamic activation pruning [32], TinyTL saves memory more effectively.
Figure 6: Left & Middle: ablation studies of TinyTL on Aircraft. Right: TinyTL reduces both the parameter size and the activation size, providing a more balanced cost composition than previous efficient inference techniques that focus on reducing the parameter size.

Effectiveness of LiteResidual.
Figure 6 (left) shows the results of TinyTL with and without LiteResidual (i.e., bias only) on Aircraft: we observe significant accuracy drops (up to 7.4%) when the lite residual modules are disabled.
Pre-trained Weight Matters, Not Only Architecture.
Figure 6 (middle) reports the performance of TinyTL when the searched feature extractor is retrained on ImageNet ("only arch"). The retrained feature extractor cannot reach the same accuracy as keeping both the pre-trained weights and the architecture. This suggests that not only the architecture of the feature extractor matters; the pre-trained weights also contribute substantially to the final performance.
Dataset statistics and specialized feature extractor statistics (data of Figure 7):

                      Flowers   Aircraft   Cars
Classes               102       100        196
Training samples      2,040     6,667      8,144
Blocks                16        18         20
Params (M)            5.0       5.3        7.4
MAC (M)               729       738        948
Figure 7: TinyTL can adapt the feature extractor’s architecture to different transfer datasets.
Adapt the Feature Extractor to Different Transfer Datasets.
Figure 7 reports the details of the transfer learning datasets and the corresponding feature extractors specialized for these datasets by TinyTL. For an easier dataset such as Flowers (fewer classes, fewer training samples), TinyTL
chooses a smaller feature extractor (fewer blocks, fewer parameters, less computation). For a more difficult dataset like Cars (more classes, more training samples), TinyTL chooses a larger, higher-capacity feature extractor.

Figure 8: On-device training cost on Flowers. Achieving the same accuracy, TinyTL requires 10x smaller memory cost (64MB vs. 644MB) and 18x smaller computation cost (491 TMAC vs. 8,919 TMAC) compared to fine-tuning the full MobileNetV2-1.4 [29].

Cost Details. As shown in Figure 6 (right), TinyTL reduces both the parameter size and the activation size, instead of only reducing the parameter size as previous efficient inference methods do, and hence provides a more balanced cost composition. The reported activation size is the peak activation size across the three on-device phases (Section 3.2.2): fine-tuning the super-net, discrete operation search, and final fine-tuning. Concretely, for each layer, we compute the size of the already stored activations (required by back-propagation), the size of the already stored binary masks (required by ReLU layers), and the size of the buffers (required by the forward process). The peak value of their sum across all layers is taken as the peak activation size.
The on-device training cost is summarized in Figure 8. TinyTL reduces the training memory by 10x and the training computation by 18x while achieving the same accuracy as fine-tuning the full MobileNetV2-1.4. The peak memory cost of TinyTL under resolution 256 is 64MB, and its total MAC is 491T. In contrast, fine-tuning the full network requires 644MB of memory and 8,919T of total MAC (20,000 steps with batch size 256 [29]; we report the memory cost under batch size 8 for consistency, which does not change the reduction ratio). TinyTL is thus not only much more memory-efficient but also much more computation-efficient.

5 Conclusion

We proposed Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning, which adapts pre-trained models to newly collected data on edge devices. Unlike previous transfer learning methods that fix the architecture and fine-tune the weights to fit different target datasets, TinyTL fixes the weights while adapting the architecture of the feature extractor and learning memory-efficient lite residual modules and biases. Extensive experiments on benchmark transfer learning datasets consistently show the effectiveness and memory-efficiency of TinyTL, paving the way for efficient on-device machine learning.
References

[1] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In
AAAI , 2018. 3[2] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train onenetwork and specialize it for efficient deployment. In
ICLR , 2020. 1, 3, 6[3] Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Kuan Wang, Tianzhe Wang, Ligeng Zhu, and SongHan. Automl for architecting efficient and specialized neural networks.
IEEE Micro , 2019. 1[4] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on targettask and hardware. In
ICLR , 2019. 3, 5, 6, 7[5] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devilin the details: Delving deep into convolutional nets. In
BMVC, 2014. 3, 7
[6] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016. 3
[7] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In
NeurIPS , 2015. 2[8] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grainedcategorization and domain-specific transfer learning. In
CVPR , 2018. 3, 5, 6, 7[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scalehierarchical image database. In
CVPR , 2009. 3, 6[10] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploitinglinear structure within convolutional networks for efficient evaluation. In
NeurIPS , 2014. 2[11] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and TrevorDarrell. Decaf: A deep convolutional activation feature for generic visual recognition. In
ICML ,2014. 3, 7[12] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainableneural networks. In
ICLR , 2019. 2[13] Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G Hauptmann. Devnet: Adeep event network for multimedia event detection and evidence recounting. In
CVPR , pages2568–2577, 2015. 3[14] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutionalnetworks using vector quantization. arXiv preprint arXiv:1412.6115 , 2014. 2[15] Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. Highway and residual networkslearn unrolled iterative estimation. arXiv preprint arXiv:1612.07771 , 2016. 3[16] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. In
NeurIPS , 2016. 3[17] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and JianSun. Single path one-shot neural architecture search with uniform sampling. arXiv preprintarXiv:1904.00420 , 2019. 6[18] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neuralnetworks with pruning, trained quantization and huffman coding. In
ICLR , 2016. 1, 2, 7[19] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections forefficient neural network. In
NeurIPS , 2015. 1, 2[20] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neuralnetworks. In
ICCV , 2017. 2[21] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan,Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3.In
ICCV , 2019. 1, 13[22] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, TobiasWeyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neuralnetworks for mobile vision applications. arXiv preprint arXiv:1704.04861 , 2017. 1, 3, 13[23] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: Anefficient densenet using learned group convolutions. In
CVPR , 2018. 3[24] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, andKurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mbmodel size. arXiv preprint arXiv:1602.07360 , 2016. 3[25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network trainingby reducing internal covariate shift. In
ICML , 2015. 3[26] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard,Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks forefficient integer-arithmetic-only inference. In
CVPR , 2018. 2[27] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, RaminderBajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performanceanalysis of a tensor processing unit. In
ISCA , 2017. 2[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980 , 2014. 6[29] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better?In
CVPR, 2019. 3, 6, 7, 9
[30] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In
Proceedings of the IEEE International Conference on ComputerVision Workshops , 2013. 3, 6[31] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In
ICLR , 2019. 5[32] Liu Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. Dynamicsparse graph for efficient deep learning. In
ICLR , 2019. 3, 7, 8[33] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang.Learning efficient convolutional networks through network slimming. In
ICCV , 2017. 2[34] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXivpreprint arXiv:1608.03983 , 2016. 6[35] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 , 2013. 3, 6[36] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K forthe price of 1: Parameter efficient multi-task and transfer learning. In
ICLR , 2019. 3, 7[37] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K forthe price of 1: Parameter-efficient multi-task and transfer learning. In
ICLR , 2019. 3, 5, 6, 7[38] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large num-ber of classes. In
Sixth Indian Conference on Computer Vision, Graphics & Image Processing ,2008. 6[39] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. In
ICLRWorkshop , 2018. 4[40] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.Mobilenetv2: Inverted residuals and linear bottlenecks. In
CVPR , 2018. 1, 3, 4, 12, 13[41] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn featuresoff-the-shelf: an astounding baseline for recognition. In
CVPR Workshops , 2014. 3, 7[42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, andQuoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In
CVPR , 2019. 1,3, 6[43] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neuralnetworks. In
ICML , 2019. 3[44] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networkson cpus. In
NeurIPS Deep Learning and Unsupervised Feature Learning Workshop , 2011. 2[45] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automatedquantization. In
CVPR , 2019. 2[46] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan.Training deep neural networks with 8-bit floating point numbers. In
NeurIPS , 2018. 3[47] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, YuandongTian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnetdesign via differentiable neural architecture search. In
CVPR , 2019. 1, 3[48] Yuxin Wu and Kaiming He. Group normalization. In
ECCV , 2018. 3[49] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activationsin convolutional network. arXiv preprint arXiv:1505.00853 , 2015. 4[50] Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, MingxingTan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. Bignas: Scaling up neuralarchitecture search with big single-stage models. arXiv preprint arXiv:2003.11142 , 2020. 5, 6[51] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficientconvolutional neural network for mobile devices. In
CVPR, 2018. 1, 3
A Detailed Architectures of Specialized Feature Extractors
(a) Feature extractor specialized for Flowers (16 MB-blocks):
3x3_SimMBConv6_O32_G1 → 7x7_MBConv6_O32 → 7x7_MBConv6_O32 → 3x3_MBConv3_O56 → 7x7_MBConv4_O56 → 3x3_MBConv4_O56 → 7x7_MBConv3_O56 → 3x3_MBConv3_O104 → 7x7_MBConv3_O104 → 3x3_MBConv6_O104 → 5x5_MBConv6_O128 → 5x5_MBConv4_O128 → 5x5_MBConv4_O128 → 7x7_MBConv6_O248 → 3x3_MBConv3_O248 → 3x3_MBConv4_O416

(b) Feature extractor specialized for Stanford Cars (20 MB-blocks):
3x3_SimMBConv6_O32_G1 → 7x7_MBConv6_O32 → 3x3_MBConv6_O32 → 3x3_MBConv4_O32 → 7x7_MBConv3_O56 → 7x7_MBConv4_O56 → 7x7_MBConv6_O56 → 5x5_MBConv6_O56 → 3x3_MBConv4_O104 → 5x5_MBConv4_O104 → 5x5_MBConv6_O104 → 7x7_MBConv6_O104 → 5x5_MBConv3_O128 → 3x3_MBConv6_O128 → 7x7_MBConv6_O128 → 3x3_MBConv6_O128 → 5x5_MBConv4_O248 → 3x3_MBConv4_O248 → 3x3_MBConv6_O248 → 3x3_MBConv6_O416

(c) Feature extractor specialized for Aircraft (18 MB-blocks):
3x3_SimMBConv4_O32_G1 → 3x3_MBConv3_O32 → 3x3_MBConv4_O32 → 7x7_MBConv3_O32 → 3x3_MBConv6_O56 → 7x7_MBConv6_O56 → 7x7_MBConv6_O56 → 3x3_MBConv6_O56 → 7x7_MBConv4_O104 → 3x3_MBConv6_O104 → 5x5_MBConv6_O104 → 3x3_MBConv3_O104 → 3x3_MBConv4_O128 → 5x5_MBConv6_O128 → 7x7_MBConv3_O128 → 3x3_MBConv4_O248 → 7x7_MBConv3_O248 → 3x3_MBConv4_O416

Each op is named kernel_MBConv{expand ratio}_O{output channels} (all use ReLU6), and every MB-block is paired with a lite residual module. Blocks that change the output width use no shortcut, while all other blocks use identity shortcuts.
Figure 9: Detailed architectures of the feature extractors on different transfer datasets. "LR" denotes the lite residual module (Section 3.2.1), while "MB4 7x7" denotes a mobile inverted bottleneck block [40] with expansion ratio 4 and kernel size 7. TinyTL adapts a higher-capacity feature extractor for the harder task (Cars).

B Details of the On-device Training Cost
The detailed training cost of the on-device learning phases is as follows:
• Fine-tuning the super-net.
We fine-tune the pre-trained super-net under resolution 224. The peak memory cost of this phase is 64MB, reached when the largest sub-net is sampled. Regarding the computation cost, the average MAC (forward & backward) of the sampled sub-nets is (802M + 2535M) / 2 = 1668.5M per sample, where 802M is the training MAC of the smallest sub-net and 2535M is the training MAC of the largest sub-net. (In this phase, the training MAC of a sampled sub-net is roughly 2x its inference MAC, rather than 3x, since we do not need to update the weights of the main branches.) Therefore, the total MAC of this phase is 1668.5M x 2040 x 0.8 x 50 = 136T (27.7% of 491T) on Flowers, where 2040 is the number of total training samples, 0.8 means the super-net is fine-tuned on 80% of the training samples (the remaining 20% is reserved for search), and 50 is the number of training epochs.
• Discrete operation search.
As discussed in Appendix E, the memory and computation overheads of the accuracy predictor are negligible. The primary memory and computation costs of this phase come from collecting the 450 [sub-net, accuracy] pairs required to train the accuracy predictor. This only involves the forward processes of the sampled sub-nets; no back-propagation is required, so the memory overhead of this phase is negligible compared to the super-net fine-tuning phase. The average MAC (forward only) of the sampled sub-nets is (352M + 1179M) / 2 = 765.5M per sample, where 352M is the inference MAC of the smallest sub-net and 1179M is the inference MAC of the largest sub-net. Therefore, the total MAC of this phase is 765.5M x (2040 x 0.2) x 450 = 141T (28.7% of 491T) on Flowers, where 2040 is the number of total training samples, 0.2 corresponds to the 20% of training data reserved as the validation set, and 450 is the number of evaluated sub-nets.
• Final fine-tuning.
To achieve the same accuracy as fine-tuning the full MobileNetV2-1.4, we use a resolution of 256. The memory cost of this phase is 63.9MB, and the total MAC is 2100M x 2040 x 1.0 x 50 = 214T (43.6% of 491T) on Flowers, where 2100M is the training MAC per sample, 2040 is the number of total training samples, 1.0 means the full training set is used, and 50 is the number of training epochs.
C Detailed Architecture of the Super-Net
Table 2: Detailed architecture of the super-net using the MobileNetV2 design space with lite residual modules (Section 3.2.1). "SepConv" denotes the separable convolution block [22], which consists of a depthwise-separable convolution layer and a 1x1 convolution layer. "MB-LiteResidual" denotes the mobile inverted bottleneck block [40] with a lite residual module (Section 3.2.1). Each MB-LiteResidual stage contains 2-4 blocks, and each block chooses its kernel size from {3, 5, 7} and its expansion ratio from {3, 4, 6}.

Operator                  Output channels   Stride (first block of stage)
3x3 Conv2d                40                2
SepConv                   24                1
MB-LiteResidual x(2-4)    32                2
MB-LiteResidual x(2-4)    56                2
MB-LiteResidual x(2-4)    104               2
MB-LiteResidual x(2-4)    128               1
MB-LiteResidual x(2-4)    248               2
MB-LiteResidual x(2-4)    416               1
1x1 Conv2d                -                 1
Avg-pool, Linear          -                 -

D Memory Footprint of Non-Linear Activation Layers