TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning
Han Cai, Chuang Gan, Ligeng Zhu, Song Han
Massachusetts Institute of Technology, MIT-IBM Watson AI Lab
Abstract
We present Tiny-Transfer-Learning (TinyTL), an efficient on-device learning method for adapting pre-trained models to newly collected data on edge devices. Different from conventional transfer learning methods that fine-tune the full network or the last layer, TinyTL freezes the weights of the feature extractor and learns only the biases, and thus does not require storing the intermediate activations, which are the major memory bottleneck for on-device learning. To maintain the adaptation capacity without updating the weights, TinyTL introduces memory-efficient lite residual modules that refine the feature extractor by learning small residual feature maps in the middle of the network. Besides, instead of using the same feature extractor for all tasks, TinyTL adapts the architecture of the feature extractor to fit different target datasets while fixing the weights: TinyTL pre-trains a large super-net that contains many weight-shared sub-nets that can operate individually; each target dataset selects the sub-net that best matches it. This back-propagation-free discrete sub-net selection incurs no memory overhead. Extensive experiments show that TinyTL can reduce the training memory cost by an order of magnitude (up to 13.3x) without sacrificing accuracy compared to fine-tuning the full network.

Preprint. Under review.

1 Introduction

Intelligent edge devices with rich sensors (e.g., billions of mobile phones and IoT devices) have become ubiquitous in our daily lives. These devices keep collecting new and sensitive data through their sensors every day while being expected to provide high-quality and customized services without sacrificing privacy (see the EU data protection rules: https://ec.europa.eu/info/law/law-topic/data-protection_en). This requires AI systems to be able to continually adapt pre-trained models to newly collected data without leaking them to the cloud (i.e., on-device learning).

While plenty of efficient inference techniques have significantly reduced the parameter size and the computation FLOPs [2, 3, 18, 19, 21, 22, 40, 42, 47, 51], the size of the intermediate activations, which back-propagation requires, causes a huge training memory footprint (Figure 1, left), making it difficult to train on edge devices. First, edge devices are memory-constrained. For example, a Raspberry Pi 1 Model A only has 256MB of memory, which is sufficient for inference. However, as shown in Figure 1 (left, red line), the memory footprint of the training phase can easily exceed this limit, even with a lightweight neural network architecture (MobileNetV2 [40]). Furthermore, the memory is shared by various on-device applications (e.g., other deep learning models) and the operating system. A single application may only be allocated a small fraction of the total memory, which makes this challenge even more critical. Second, edge devices are energy-constrained. Under 45nm CMOS technology [19], a 32-bit off-chip DRAM access consumes 640 pJ, which is two orders of magnitude more than a 32-bit on-chip SRAM access (5 pJ) or a 32-bit float multiplication (3.7 pJ).
Figure 1: Left: The memory footprint required by training grows linearly w.r.t. the batch size and soon exceeds the limit of edge devices. Right: Memory cost comparison between ResNet-50 and MobileNetV2-1.4 under batch size 8. Recent advances in efficient model design only reduce the size of the parameters, but the activation size, the main bottleneck for training, does not improve much.

The large memory footprint required by training cannot fit into the limited on-chip SRAM. For instance, a TPU [27] has only 28MB of SRAM, far smaller than the training memory footprint of MobileNetV2, even with a batch size of 1 (Figure 1, left). This results in many costly DRAM accesses, which consume a large amount of energy and drain the battery of edge devices. In this work, we propose
Tiny-Transfer-Learning (TinyTL) to address these challenges. By analyzing the memory footprint during the backward pass, we notice that the intermediate activations (the main bottleneck) are only involved in updating the weights; updating the biases does not require them (Eq. 2). Inspired by this finding, we propose to freeze the weights of the pre-trained feature extractor to reduce the memory footprint (Figure 2b). To compensate for the capacity loss due to freezing the weights while keeping the memory overhead small, we introduce lite residual learning, which improves the model capacity by learning lite residual modules that refine the intermediate feature maps of the pre-trained feature extractor (Figure 2c). Meanwhile, it aggressively shrinks the resolution and width dimensions of the lite residual modules to keep the memory overhead small. We also empirically find that different transfer datasets require very different feature extractors, especially when the weights are frozen (Figure 3). Therefore, we introduce feature extractor adaptation, which updates the architecture of the feature extractor while fixing the weights to fit different target datasets (Figure 2d). Concretely, we select different sub-nets from a large pre-trained super-net. Different from conventional approaches that fix the architecture and update the weights in a continuous optimization space, our approach optimizes the feature extractor in a discrete space, which does not require any back-propagation and thus incurs no additional memory overhead. Extensive experiments on transfer learning datasets demonstrate that TinyTL achieves the same level of (or even higher) accuracy than fine-tuning the full network while reducing the training memory footprint by up to 13.3x. Our contributions can be summarized as follows:
• We propose TinyTL, a novel transfer learning method for memory-efficient on-device learning. To the best of our knowledge, this is the first work that tackles this challenging but critical problem.
• We systematically analyze the memory bottleneck of training and find that the heavy memory cost comes from updating the weights, not the biases (assuming ReLU activation).
• We propose two novel techniques (lite residual learning and feature extractor adaptation) to improve the model capacity while freezing the weights with little memory overhead.
• Extensive experiments on transfer learning tasks show that our method is highly memory-efficient and effective. It reduces the training memory footprint by up to 13.3x, making it possible to learn on memory-constrained edge devices (e.g., Raspberry Pi) without sacrificing accuracy.

2 Related Work

Efficient Inference Techniques.
Improving the inference efficiency of deep neural networks on resource-constrained edge devices has recently drawn extensive attention. Starting from [10, 14, 18, 19, 44], one line of research focuses on compressing pre-trained neural networks, including i) network pruning, which removes less-important units [12, 19] or channels [20, 33]; and ii) network quantization, which reduces the bitwidth of parameters [7, 18] or activations [26, 45]. However, these techniques cannot handle the training phase, as they rely on a well-trained model on the target task as the starting point. Another line of research focuses on lightweight neural architectures, obtained by either manual design [22, 23, 24, 40, 51] or neural architecture search [1, 4, 42, 47]. These lightweight neural networks provide highly competitive accuracy [2, 43] while significantly improving inference efficiency. However, concerning training memory efficiency, the key bottleneck is not solved: the training memory is dominated by activations, not parameters. For example, Figure 1 (right) shows the cost comparison between ResNet-50 and MobileNetV2-1.4. In terms of parameter size, MobileNetV2-1.4 is 4.3x smaller than ResNet-50. However, in terms of training activation size, MobileNetV2-1.4 is almost the same as ResNet-50 (only 1.1x smaller), leading to little reduction in the memory footprint.

Training Memory Cost Reduction.
Researchers have been seeking ways to reduce the training memory footprint. One typical approach is to re-compute discarded activations during the backward pass [6, 16]. This approach reduces memory usage at the cost of a large computation overhead, and is thus not preferred for edge devices. Layer-wise training [15] can also reduce the memory footprint compared to end-to-end training, but it cannot achieve the same level of accuracy. Another representative approach is activation pruning [32], which builds a dynamic sparse computation graph to prune activations during training. Similarly, [46] proposes to reduce the bitwidth of training activations by introducing new reduced-precision floating-point formats. Different from these techniques, which prune or quantize an existing network with a given architecture, we adapt the architecture to different datasets; our method is orthogonal to these techniques.
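As a point of reference, the re-computation idea above is available off the shelf in PyTorch; the sketch below shows it on a toy stack of linear layers (our own toy model, not from the paper):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Gradient checkpointing: intermediate activations inside each segment are
# discarded during forward and re-computed during backward, trading extra
# computation for a smaller training memory footprint.
model = torch.nn.Sequential(*(torch.nn.Linear(256, 256) for _ in range(8)))
x = torch.randn(4, 256, requires_grad=True)
y = checkpoint_sequential(model, 4, x)  # 4 segments: ~4x fewer stored activations
y.sum().backward()
```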
Transfer Learning.
Neural networks pre-trained on large-scale datasets (e.g., ImageNet [9]) are widely used as fixed feature extractors for transfer learning, where only the last layer is fine-tuned [5, 11, 13, 41]. This approach does not require storing the intermediate activations of the feature extractor, and thus is memory-efficient. However, its capacity is limited, resulting in poor accuracy, especially on datasets [30, 35] whose distribution is far from ImageNet (e.g., Inception-V3 achieves only 45.9% top1 accuracy on Aircraft [36]). Alternatively, fine-tuning the full network can achieve better accuracy [8, 29], but it requires a vast memory footprint and hence is not friendly for training on edge devices. Recently, [37] proposed to reduce the number of trainable parameters by only updating the parameters of the batch normalization (BN) [25] layers. Unfortunately, parameter-efficiency does not translate into memory-efficiency: this approach still requires a large amount of memory (e.g., 326MB under batch size 8) to store the input activations of the BN layers (Table 1). Additionally, its accuracy is still much worse than fine-tuning the full network (70.7% vs. 85.5%; Table 1). One can also partially fine-tune some layers, but how many layers to select remains ad hoc. This paper provides a systematic approach that adapts the feature extractor to different datasets and uses lite residual learning to save memory.
3 Tiny Transfer Learning

3.1 Understanding the Memory Footprint of Back-propagation

Without loss of generality, we consider a neural network $\mathcal{M}$ that consists of a sequence of layers:
$$\mathcal{M}(\cdot) = F_{w_n}(F_{w_{n-1}}(\cdots F_{w_2}(F_{w_1}(\cdot)) \cdots)), \quad (1)$$
where $w_i$ denotes the parameters of the $i$th layer. Let $a_i$ and $a_{i+1}$ be the input and output activations of the $i$th layer, respectively, and $\mathcal{L}$ be the loss. In the backward pass, given $\partial \mathcal{L} / \partial a_{i+1}$, there are two goals for the $i$th layer: computing $\partial \mathcal{L} / \partial a_i$ and $\partial \mathcal{L} / \partial w_i$.

Assuming the $i$th layer is a linear layer whose forward process is given as $a_{i+1} = a_i W + b$, its backward process under batch size 1 is
$$\frac{\partial \mathcal{L}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} \frac{\partial a_{i+1}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} W^T, \quad \frac{\partial \mathcal{L}}{\partial W} = a_i^T \frac{\partial \mathcal{L}}{\partial a_{i+1}}, \quad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial a_{i+1}}. \quad (2)$$
According to Eq. (2), the intermediate activations (i.e., $\{a_i\}$) that dominate the memory footprint are only required to compute the gradient of the weights (i.e., $\partial \mathcal{L} / \partial W$), not the biases. If we only update the biases, training memory can be greatly saved. This property also applies to convolution layers and normalization layers (e.g., batch normalization [25], group normalization [48], etc.), since they can be considered special types of linear layers.
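As a sanity check, the following sketch verifies Eq. (2) for a single linear layer in plain PyTorch (batch size 1): the weight gradient is the only term that consumes the stored input activation.

```python
import torch

# Eq. (2) for a linear layer a_{i+1} = a_i W + b:
# only dL/dW needs the stored input activation a_i.
a_i = torch.randn(1, 64)        # input activation
W = torch.randn(64, 32)
dy = torch.randn(1, 32)         # upstream gradient dL/da_{i+1}

da_i = dy @ W.t()               # dL/da_i = dL/da_{i+1} W^T   (a_i not needed)
dW = a_i.t() @ dy               # dL/dW   = a_i^T dL/da_{i+1} (a_i required)
db = dy.sum(dim=0)              # dL/db   = dL/da_{i+1}       (a_i not needed)
```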
Figure 2: TinyTL overview ("C" denotes the width and "R" denotes the resolution). Conventional transfer learning fixes the architecture of the feature extractor and relies on fine-tuning the weights to adapt the model (a), which requires a large amount of activation memory (in blue) for back-propagation. TinyTL reduces the memory usage by fixing the weights while: (b) fine-tuning only the biases; (c) exploiting lite residual learning to compensate for the capacity loss, using group convolution and avoiding the inverted bottleneck to achieve high arithmetic intensity and a small memory footprint; (d) adapting the feature extractor architecture to different downstream tasks, which can specialize a small feature extractor for an easy dataset (Flowers) and a large feature extractor for a difficult dataset (Aircraft). Their weights are shared from the same super-net, which is also parameter-efficient.

Regarding non-linear activation layers (e.g., ReLU, sigmoid, h-swish), sigmoid and h-swish require storing $a_i$ to compute $\partial \mathcal{L} / \partial a_i$, so they are not memory-efficient; activation layers built upon them, such as tanh and swish [39], are consequently not memory-efficient either. In contrast, ReLU and other ReLU-styled activation layers (e.g., LeakyReLU [49]) only require storing a binary mask representing whether each value is smaller than 0, which is 32x smaller than storing $a_i$. Detailed forward and backward processes of the activation layers are provided in Appendix D.
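A minimal PyTorch sketch of this observation is shown below, using a custom autograd function (note that PyTorch stores booleans as one byte per element, so a real implementation would additionally bit-pack the mask):

```python
import torch

class MemoryEfficientReLU(torch.autograd.Function):
    """ReLU that caches only a binary mask for backward, not the input."""

    @staticmethod
    def forward(ctx, x):
        mask = x > 0                  # all the backward pass needs
        ctx.save_for_backward(mask)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask
```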
3.2 Memory-Efficient Transfer Learning

Based on the memory footprint analysis, one possible solution for reducing the memory cost is to freeze the weights of the pre-trained feature extractor while only updating the biases (Figure 2b). However, only updating the biases provides limited adaptation capacity. In this section, we explore two techniques that improve the model capacity without updating the weights: i) lite residual modules that refine the intermediate feature maps (Figure 2c); ii) feature extractor adaptation that enables specialized feature extractors that best match different transfer datasets (Figure 2d).

3.2.1 Lite Residual Learning

Formally, a layer with frozen weights and learnable biases can be represented as:
$$a_{i+1} = F_W(a_i) + b. \quad (3)$$
To improve the model capacity while keeping a small memory footprint, we propose to add a lite residual module that generates a residual feature map to refine the output:
$$a_{i+1} = F_W(a_i) + b + F_{w_r}(a'_i), \quad a'_i = \text{reduce}(a_i), \quad (4)$$
where $a'_i = \text{reduce}(a_i)$ is the reduced activation. According to Eq. (2), learning these lite residual modules only requires storing the reduced activations $\{a'_i\}$ rather than the full activations $\{a_i\}$.

Implementation (Figure 2c).
We apply Eq. (4) to mobile inverted bottleneck blocks (MB-blocks) [40]. The key principle is to keep the activation size small. Following this principle, we explore two design dimensions for reducing the activation size (a code sketch follows the list):
• Width.
The widely-used inverted bottleneck requires a huge number of channels (6x) to compensate for the small capacity of a depthwise convolution, which is parameter-efficient but highly activation-inefficient. Even worse, converting 1x channels to 6x channels back and forth requires two 1x1 projection layers, which doubles the total activation to 12x. Depthwise convolution also has very low arithmetic intensity (its OPs/Byte is less than 4% of a 1x1 convolution's OPs/Byte with 256 channels), and is thus highly memory-inefficient with little data reuse. To address these limitations, our lite residual module employs a group convolution (g = 2) that has 300x higher arithmetic intensity than a depthwise convolution, providing a good trade-off between FLOPs and memory. This also removes the 1x1 projection layers, reducing the total channel number by 6 x 2 = 12x.
• Resolution. The activation size grows quadratically with the resolution. Therefore, we aggressively shrink the resolution in the lite residual module by employing a 2x2 average pooling to downsample the input feature map. The output of the lite residual module is then upsampled to match the size of the main branch's output feature map via bilinear upsampling. Combining the resolution and width optimizations, the activation of our lite residual module is 2 x 2 x 12 = 48x smaller than that of the inverted bottleneck.
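A minimal PyTorch sketch of the resulting lite residual branch is given below, assuming the settings above (2x2 average pooling, a group convolution with g = 2 and kernel size 5 as used in Section 4, a 1x1 convolution, and bilinear upsampling); module names are illustrative, not from a released codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiteResidualModule(nn.Module):
    """Refines the frozen main branch's output with a small residual map (Eq. 4)."""

    def __init__(self, channels: int, kernel_size: int = 5, groups: int = 2):
        super().__init__()
        self.pool = nn.AvgPool2d(2)  # R -> 0.5R: 4x fewer pixels
        self.group_conv = nn.Conv2d(channels, channels, kernel_size,
                                    padding=kernel_size // 2, groups=groups)
        self.pointwise = nn.Conv2d(channels, channels, 1)  # 1x1 conv after group conv

    def forward(self, x: torch.Tensor, main_out: torch.Tensor) -> torch.Tensor:
        r = self.pointwise(F.relu(self.group_conv(self.pool(x))))
        r = F.interpolate(r, size=main_out.shape[-2:], mode="bilinear",
                          align_corners=False)
        return main_out + r  # only the reduced activations need to be stored
```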
Figure 3: Transfer learning performance of various ImageNet pre-trained models with only the last linear layer trained. The relative accuracy order between different pre-trained models changes significantly between ImageNet and the transfer learning datasets. For example, our specialized feature extractors (red) consistently achieve the best results on the transfer datasets despite having weaker ImageNet accuracy. This suggests that we need to adapt the feature extractor to fit different transfer datasets instead of using the same one for all datasets.

3.2.2 Feature Extractor Adaptation

Conventional transfer learning chooses the feature extractor according to its pre-training accuracy (e.g., ImageNet accuracy) and uses the same one for all transfer tasks [8, 37]. However, we find this approach sub-optimal: different target tasks may need very different feature extractors, and high pre-training accuracy does not guarantee good transferability of the pre-trained weights. This is especially critical in our case, where the weights are frozen.

Figure 3 shows the top1 accuracy of various widely used ImageNet pre-trained models on three transfer datasets when only the last layer is learned, which reflects the transferability of their pre-trained weights. The relative order between different pre-trained models is not consistent with their ImageNet accuracy on any of the three datasets. This indicates that ImageNet accuracy is not a good proxy for transferability. Moreover, the same pre-trained model can rank very differently on different tasks: for instance, Inception-V3 gives poor accuracy on Flowers but provides top results on the other two datasets. Therefore, we need to specialize the feature extractor to best match the target dataset.
Implementation (Figure 2d).
Motivated by these observations, we propose to adapt the feature extractor for different transfer tasks. This is achieved by allowing a set of candidate weight operations in each layer instead of a single fixed weight operation:
$$\{\mathcal{M}(\cdot)\} = F_{\{w_n^1, \cdots, w_n^m\}}(\cdots F_{\{w_2^1, \cdots, w_2^m\}}(F_{\{w_1^1, \cdots, w_1^m\}}(\cdot)) \cdots). \quad (5)$$
This forms a discrete optimization space, allowing us to adapt the feature extractor to different target datasets without updating the weights. The detailed training flow is described as follows:
• Pre-training. The number of all possible weight operation combinations is exponentially large w.r.t. the depth, making it computationally infeasible to pre-train all of them independently. Therefore, we employ the weight sharing technique [4, 31, 50] to reduce the pre-training cost, where a single super-net is jointly optimized on the pre-training dataset (e.g., ImageNet) to support all
possible sub-nets (i.e., different combinations of weight operations). Different sub-nets can operate independently by selecting different parts of the super-net. For example, the centered weights of a full convolution kernel are taken to form a smaller convolution kernel; blocks are skipped to form a sub-net with lower depth; channels are skipped to reduce the width of a convolution operation. In our experiments, we use ImageNet as the pre-training dataset and employ progressive shrinking [2, 50] to train the super-net, using the same training settings suggested by [2].

Figure 4: Results under different resolutions. At the same level of accuracy, TinyTL provides an order of magnitude training memory saving compared to fine-tuning the full ResNet-50, making it possible to learn on-device on a Raspberry Pi 1.
• Fine-tuning the super-net.
We fine-tune the pre-trained super-net on the target transfer dataset, with the weights of the main branches (i.e., MB-blocks) frozen and the other parameters (i.e., biases, lite residual modules, classification head) updated via gradient descent. In this phase, we randomly sample one sub-net in each training step.
• Discrete operation search.
Based on the fine-tuned super-net, we collect 450 [sub-net, accuracy] pairs on the validation set (20% randomly sampled training data) and train an accuracy predictor on the collected data [2]. We then employ evolutionary search [17] on top of the accuracy predictor to find the sub-net (i.e., the combination of weight operations) that best matches the target transfer dataset (see the sketch after this list). No back-propagation through the super-net is required in this step, so it incurs no additional memory overhead.
• Final fine-tuning.
Finally, we fine-tune the searched model on the full training set, with the weights of the main branches frozen and the other parameters (i.e., biases, lite residual modules, classification head) updated, to obtain the final results.
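The discrete operation search can be sketched as follows. This is a minimal illustration under our own assumptions (a sub-net encoded as a list of per-block operation choices; `predict_acc` standing in for the trained accuracy predictor), not the exact released implementation:

```python
import random

def evolutionary_search(choices_per_block, predict_acc,
                        population=100, generations=30,
                        mutate_prob=0.1, parent_ratio=0.25):
    # Only forward passes through the predictor are needed; no back-propagation
    # through the super-net, hence no training memory overhead.
    def random_subnet():
        return [random.choice(ops) for ops in choices_per_block]

    def mutate(net):
        return [random.choice(ops) if random.random() < mutate_prob else op
                for op, ops in zip(net, choices_per_block)]

    pop = [random_subnet() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=predict_acc, reverse=True)          # rank by predicted accuracy
        parents = pop[: int(population * parent_ratio)]
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(population - len(parents))]
    return max(pop, key=predict_acc)
```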
4 Experiments

Following common practice [8, 29, 37], we evaluate TinyTL on three benchmark datasets, Cars [30], Flowers [38], and Aircraft [35], using ImageNet [9] as the pre-training dataset.
Model Architecture.
We build the super-net using the MobileNetV2 design space [4, 42], which contains five stages with gradually decreased resolution; each stage consists of a sequence of MB-blocks. At the stage level, it supports different depths (i.e., 2, 3, 4). At the block level, it supports different kernel sizes (i.e., 3, 5, 7) and different width expansion ratios (i.e., 3, 4, 6). For each MB-block, we insert a lite residual module as described in Section 3.2.1 and Figure 2c, with group number 2 and kernel size 5. We use the ReLU activation since it is more memory-efficient according to Section 3.1. The detailed architecture of the super-net is provided in Appendix C.
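For concreteness, the design space can be written down as a small configuration; the dictionary below is an illustrative encoding of the choices listed above (the names are ours), and the last lines show why the resulting space is exponentially large:

```python
# Illustrative encoding of the super-net design space (names are ours).
DESIGN_SPACE = {
    "stage_depths":  [2, 3, 4],              # blocks per stage
    "kernel_sizes":  [3, 5, 7],              # per MB-block
    "expand_ratios": [3, 4, 6],              # per MB-block
    "lite_residual": {"groups": 2, "kernel_size": 5},
}

# 3 kernels x 3 expand ratios = 9 choices per block; a stage with depth d has
# 9**d configurations, so the whole space grows exponentially with the depth.
per_stage = sum(9 ** d for d in DESIGN_SPACE["stage_depths"])  # 9^2 + 9^3 + 9^4
```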
Training Details. We freeze the weights of the feature extractor while allowing the biases to be updated during transfer learning. Both lite residual learning (LiteResidual; Section 3.2.1) and feature extractor adaptation (FeatureAdapt; Section 3.2.2) are applied in our experiments. For fine-tuning the pre-trained super-net, we use the Adam optimizer [28] with an initial learning rate of 4e-3 and cosine learning rate decay [34]. The model is trained on 80% randomly sampled training data for 50 epochs. For fine-tuning the searched model, we use the same training settings but on the full training data. Additionally, we apply 8-bit weight quantization [18] to the frozen weights to reduce the parameter size, which causes a negligible accuracy drop in our experiments. For all compared methods, we also assume 8-bit weight quantization is applied, when eligible, in calculating their training memory footprint. Details of the accuracy predictor are provided in Appendix E.

Table 1: Comparison with previous transfer learning methods on Cars, Flowers, and Aircraft (training memory is reported on Flowers under batch size 8; ∗ indicates our re-implemented results; "R" denotes the input image size). TinyTL reduces the training memory by 13.3x without sacrificing accuracy compared to fine-tuning the full Inception-V3.
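A minimal sketch of this fine-tuning setup is shown below; the name-matching rules are an assumption about how the modules might be labeled, not the released code:

```python
import torch

def tinytl_parameters(model: torch.nn.Module):
    """Select biases, lite residual modules, and the classification head for
    training; freeze everything else (hypothetical parameter-name convention)."""
    trainable = []
    for name, p in model.named_parameters():
        if name.endswith(".bias") or "lite_residual" in name or "classifier" in name:
            trainable.append(p)
        else:
            p.requires_grad = False      # frozen pre-trained weights
    return trainable

# optimizer = torch.optim.Adam(tinytl_parameters(model), lr=4e-3)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```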
Main Results. Table 1 reports the comparison between TinyTL and previous transfer learning methods, divided into three groups: i) fine-tuning the last linear layer [5, 11, 41] (referred to as Last); ii) fine-tuning the BN layers and the last linear layer [36] (referred to as BN+Last); iii) fine-tuning the full network [8, 29] (referred to as Full).

In the first group, we apply only FeatureAdapt to adapt the feature extractor while training only the parameters of the last linear layer, similar to Last. Compared to Last + Inception-V3, our model reduces the training memory cost while improving the top1 accuracy on Cars, Flowers, and Aircraft, showing that our specialized feature extractors fit different transfer datasets better than a fixed feature extractor. In the second group, we apply only LiteResidual to refine the intermediate feature maps, using Proxyless-Mobile [4] as the feature extractor. Compared to BN+Last with ResNet-50, our model improves the training memory efficiency while providing consistently better accuracy on all three datasets. Increasing the input image size from 224 to 320 further enlarges the accuracy improvement, which shows that learning lite residual modules and biases is not only more memory-efficient but also more effective than BN+Last. In the third group, we apply both FeatureAdapt and LiteResidual. Compared to Full + Inception-V3, TinyTL achieves the same level of accuracy while providing 13.3x training memory saving, reducing the training memory from 850MB to 64MB (the same level as learning only the last linear layer).

Figure 4 demonstrates the results under different input resolutions. With similar accuracy, TinyTL provides an order of magnitude memory reduction on Cars, Flowers, and Aircraft compared to fine-tuning the full ResNet-50. Remarkably, it moves the training memory cost from the out-of-memory region (red) to the feasible region (green) on a Raspberry Pi 1, making it possible to learn on-device without sacrificing accuracy.

4.1 Ablation Studies and Discussions

Comparison with Dynamic Activation Pruning.
The comparison between TinyTL and dynamic activation pruning [32] is summarized in Figure 5. TinyTL is more effective because it re-designs the transfer learning architecture (lite residual modules, feature extractor adaptation) rather than pruning an existing architecture. The transfer accuracy drops quickly when the pruning ratio increases beyond 50% (only 2x memory saving). In contrast, TinyTL achieves a much higher memory reduction without loss of accuracy.

Figure 5: Compared with dynamic activation pruning [32], TinyTL saves memory more effectively.
Figure 6: Left & Middle: ablation studies of TinyTL on Aircraft. Right: TinyTL reduces both the parameter size and the activation size, providing a more balanced cost composition than previous efficient inference techniques that focus on reducing the parameter size.

Effectiveness of LiteResidual.
Figure 6 (left) shows the results of TinyTL with and without LiteResidual (i.e., bias only) on Aircraft: we observe significant accuracy drops (up to 7.4%) when the lite residual modules are disabled.
Pre-trained Weight Matters, Not Only Architecture.
Figure 6 (middle) reports the performance of TinyTL when the searched feature extractor is retrained on ImageNet ("only arch"). The retrained feature extractor cannot reach the same accuracy as keeping both the pre-trained weights and the architecture. This suggests that not only the architecture of the feature extractor matters; the pre-trained weights also contribute substantially to the final performance.
Dataset statistics and specialized feature extractor statistics (data of Figure 7):

                      Flowers   Aircraft   Cars
Classes               102       100        196
Training samples      2,040     6,667      8,144
Blocks                16        18         20
Params (M)            5.0       5.3        7.4
MAC (M)               729       738        948
Figure 7: TinyTL can adapt the feature extractor’s architecture to different transfer datasets.
Adapt the Feature Extractor to Different Transfer Datasets.
Figure 7 reports the details of the transfer learning datasets and the corresponding feature extractors specialized for these datasets by TinyTL. For an easier dataset such as Flowers (fewer classes, fewer training samples), TinyTL
chooses a smaller feature extractor (fewer blocks, fewer parameters, less computation). For a more difficult dataset like Cars (more classes, more training samples), TinyTL chooses a larger, higher-capacity feature extractor.

Figure 8: On-device training cost on Flowers. Achieving the same accuracy, TinyTL requires 10x smaller memory cost (64MB vs. 644MB) and 18x smaller computation cost (491 TMAC vs. 8,919 TMAC) compared to fine-tuning the full MobileNetV2-1.4 [29].

Cost Details. As shown in Figure 6 (right), TinyTL reduces both the parameter size and the activation size, instead of only reducing the parameter size as previous efficient inference methods do, and hence provides a more balanced cost composition. The reported activation size is the peak activation size across the three on-device phases (Section 3.2.2): fine-tuning the super-net, discrete operation search, and final fine-tuning. Concretely, for each layer, we compute the size of the already stored activations (required by back-propagation), the size of the already stored binary masks (required by ReLU layers), and the size of the buffers (required by the forward process). The peak value of their sum across all layers is taken as the peak activation size.
The on-device training cost is summarized in Figure 8. TinyTL reduces the training memory by 10x and the training computation by 18x while achieving the same accuracy as fine-tuning the full MobileNetV2-1.4. The peak memory cost of TinyTL under resolution 256 is 64MB, and its total MAC is 491T. In contrast, fine-tuning the full network requires 644MB of memory and 8,919T of total MAC (20,000 steps with batch size 256 [29]; we report the memory cost under batch size 8 for consistency, which does not change the reduction ratio). TinyTL is thus not only much more memory-efficient but also much more computation-efficient.

5 Conclusion

We proposed Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning, which adapts pre-trained models to newly collected data on edge devices. Unlike previous transfer learning methods that fix the architecture and fine-tune the weights to fit different target datasets, TinyTL fixes the weights while adapting the architecture of the feature extractor and learning memory-efficient lite residual modules and biases. Extensive experiments on benchmark transfer learning datasets consistently show the effectiveness and memory-efficiency of TinyTL, paving the way for efficient on-device machine learning.
References

[1] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In
AAAI , 2018. 3[2] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train onenetwork and specialize it for efficient deployment. In
ICLR , 2020. 1, 3, 6[3] Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Kuan Wang, Tianzhe Wang, Ligeng Zhu, and SongHan. Automl for architecting efficient and specialized neural networks.
IEEE Micro , 2019. 1[4] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on targettask and hardware. In
ICLR , 2019. 3, 5, 6, 7[5] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devilin the details: Delving deep into convolutional nets. In
BMVC, 2014. 3, 7
[6] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016. 3
[7] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In
NeurIPS , 2015. 2[8] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grainedcategorization and domain-specific transfer learning. In
CVPR , 2018. 3, 5, 6, 7[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scalehierarchical image database. In
CVPR , 2009. 3, 6[10] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploitinglinear structure within convolutional networks for efficient evaluation. In
NeurIPS , 2014. 2[11] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and TrevorDarrell. Decaf: A deep convolutional activation feature for generic visual recognition. In
ICML ,2014. 3, 7[12] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainableneural networks. In
ICLR , 2019. 2[13] Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G Hauptmann. Devnet: Adeep event network for multimedia event detection and evidence recounting. In
CVPR , pages2568–2577, 2015. 3[14] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutionalnetworks using vector quantization. arXiv preprint arXiv:1412.6115 , 2014. 2[15] Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. Highway and residual networkslearn unrolled iterative estimation. arXiv preprint arXiv:1612.07771 , 2016. 3[16] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. In
NeurIPS , 2016. 3[17] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and JianSun. Single path one-shot neural architecture search with uniform sampling. arXiv preprintarXiv:1904.00420 , 2019. 6[18] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neuralnetworks with pruning, trained quantization and huffman coding. In
ICLR , 2016. 1, 2, 7[19] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections forefficient neural network. In
NeurIPS , 2015. 1, 2[20] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neuralnetworks. In
ICCV , 2017. 2[21] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan,Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3.In
ICCV , 2019. 1, 13[22] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, TobiasWeyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neuralnetworks for mobile vision applications. arXiv preprint arXiv:1704.04861 , 2017. 1, 3, 13[23] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: Anefficient densenet using learned group convolutions. In
CVPR , 2018. 3[24] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, andKurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mbmodel size. arXiv preprint arXiv:1602.07360 , 2016. 3[25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network trainingby reducing internal covariate shift. In
ICML , 2015. 3[26] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard,Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks forefficient integer-arithmetic-only inference. In
CVPR , 2018. 2[27] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, RaminderBajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performanceanalysis of a tensor processing unit. In
ISCA , 2017. 2[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980 , 2014. 6[29] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better?In
CVPR, 2019. 3, 6, 7, 9
[30] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In
Proceedings of the IEEE International Conference on ComputerVision Workshops , 2013. 3, 6[31] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In
ICLR , 2019. 5[32] Liu Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. Dynamicsparse graph for efficient deep learning. In
ICLR , 2019. 3, 7, 8[33] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang.Learning efficient convolutional networks through network slimming. In
ICCV , 2017. 2[34] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXivpreprint arXiv:1608.03983 , 2016. 6[35] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 , 2013. 3, 6[36] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K forthe price of 1: Parameter efficient multi-task and transfer learning. In
ICLR , 2019. 3, 7[37] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K forthe price of 1: Parameter-efficient multi-task and transfer learning. In
ICLR , 2019. 3, 5, 6, 7[38] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large num-ber of classes. In
Sixth Indian Conference on Computer Vision, Graphics & Image Processing ,2008. 6[39] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. In
ICLRWorkshop , 2018. 4[40] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.Mobilenetv2: Inverted residuals and linear bottlenecks. In
CVPR , 2018. 1, 3, 4, 12, 13[41] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn featuresoff-the-shelf: an astounding baseline for recognition. In
CVPR Workshops , 2014. 3, 7[42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, andQuoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In
CVPR , 2019. 1,3, 6[43] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neuralnetworks. In
ICML , 2019. 3[44] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networkson cpus. In
NeurIPS Deep Learning and Unsupervised Feature Learning Workshop , 2011. 2[45] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automatedquantization. In
CVPR , 2019. 2[46] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan.Training deep neural networks with 8-bit floating point numbers. In
NeurIPS , 2018. 3[47] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, YuandongTian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnetdesign via differentiable neural architecture search. In
CVPR , 2019. 1, 3[48] Yuxin Wu and Kaiming He. Group normalization. In
ECCV , 2018. 3[49] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activationsin convolutional network. arXiv preprint arXiv:1505.00853 , 2015. 4[50] Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, MingxingTan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. Bignas: Scaling up neuralarchitecture search with big single-stage models. arXiv preprint arXiv:2003.11142 , 2020. 5, 6[51] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficientconvolutional neural network for mobile devices. In
CVPR, 2018. 1, 3
A Detailed Architectures of Specialized Feature Extractors
(a) Feature extractor specialized for Flowers (16 MB-blocks):
3x3_SimMBConv6_O32_G1 → 7x7_MBConv6_O32 → 7x7_MBConv6_O32 → 3x3_MBConv3_O56 → 7x7_MBConv4_O56 → 3x3_MBConv4_O56 → 7x7_MBConv3_O56 → 3x3_MBConv3_O104 → 7x7_MBConv3_O104 → 3x3_MBConv6_O104 → 5x5_MBConv6_O128 → 5x5_MBConv4_O128 → 5x5_MBConv4_O128 → 7x7_MBConv6_O248 → 3x3_MBConv3_O248 → 3x3_MBConv4_O416

(b) Feature extractor specialized for Stanford Cars (20 MB-blocks):
3x3_SimMBConv6_O32_G1 → 7x7_MBConv6_O32 → 3x3_MBConv6_O32 → 3x3_MBConv4_O32 → 7x7_MBConv3_O56 → 7x7_MBConv4_O56 → 7x7_MBConv6_O56 → 5x5_MBConv6_O56 → 3x3_MBConv4_O104 → 5x5_MBConv4_O104 → 5x5_MBConv6_O104 → 7x7_MBConv6_O104 → 5x5_MBConv3_O128 → 3x3_MBConv6_O128 → 7x7_MBConv6_O128 → 3x3_MBConv6_O128 → 5x5_MBConv4_O248 → 3x3_MBConv4_O248 → 3x3_MBConv6_O248 → 3x3_MBConv6_O416

(c) Feature extractor specialized for Aircraft (18 MB-blocks):
3x3_SimMBConv4_O32_G1 → 3x3_MBConv3_O32 → 3x3_MBConv4_O32 → 7x7_MBConv3_O32 → 3x3_MBConv6_O56 → 7x7_MBConv6_O56 → 7x7_MBConv6_O56 → 3x3_MBConv6_O56 → 7x7_MBConv4_O104 → 3x3_MBConv6_O104 → 5x5_MBConv6_O104 → 3x3_MBConv3_O104 → 3x3_MBConv4_O128 → 5x5_MBConv6_O128 → 7x7_MBConv3_O128 → 3x3_MBConv4_O248 → 7x7_MBConv3_O248 → 3x3_MBConv4_O416

Each op is named kernel_MBConv{expand ratio}_O{output channels} (all use ReLU6), and every MB-block is paired with a lite residual module. Blocks that change the output width use no shortcut, while all other blocks use identity shortcuts.
Figure 9: Detailed architectures of the feature extractors on different transfer datasets. "LR" denotes the lite residual module (Section 3.2.1), while "MB4 7x7" denotes a mobile inverted bottleneck block [40] with expansion ratio 4 and kernel size 7. TinyTL adapts a higher-capacity feature extractor for the harder task (Cars).

B Details of the On-device Training Cost
The detailed training cost of the on-device learning phases is as follows:
• Fine-tuning the super-net.
We fine-tune the pre-trained super-net under resolution 224. The peak memory cost of this phase is 64MB, reached when the largest sub-net is sampled. Regarding the computation cost, the average MAC (forward & backward) of the sampled sub-nets is (802M + 2535M) / 2 = 1668.5M per sample, where 802M is the training MAC of the smallest sub-net and 2535M is the training MAC of the largest sub-net. (In this phase, the training MAC of a sampled sub-net is roughly 2x its inference MAC, rather than 3x, since we do not need to update the weights of the main branches.) Therefore, the total MAC of this phase is 1668.5M x 2040 x 0.8 x 50 = 136T (27.7% of 491T) on Flowers, where 2040 is the number of total training samples, 0.8 means the super-net is fine-tuned on 80% of the training samples (the remaining 20% is reserved for search), and 50 is the number of training epochs.
• Discrete operation search.
As discussed in Appendix E, the memory and computation overheads of the accuracy predictor are negligible. The primary memory and computation costs of this phase come from collecting the 450 [sub-net, accuracy] pairs required to train the accuracy predictor. This only involves the forward processes of the sampled sub-nets; no back-propagation is required, so the memory overhead of this phase is negligible compared to the super-net fine-tuning phase. The average MAC (forward only) of the sampled sub-nets is (352M + 1179M) / 2 = 765.5M per sample, where 352M is the inference MAC of the smallest sub-net and 1179M is the inference MAC of the largest sub-net. Therefore, the total MAC of this phase is 765.5M x (2040 x 0.2) x 450 = 141T (28.7% of 491T) on Flowers, where 2040 is the number of total training samples, 0.2 corresponds to the 20% of training data reserved as the validation set, and 450 is the number of evaluated sub-nets.
• Final fine-tuning.
To achieve the same accuracy as fine-tuning the full MobileNetV2-1.4, we use a resolution of 256. The memory cost of this phase is 63.9MB, and the total MAC is 2100M x 2040 x 1.0 x 50 = 214T (43.6% of 491T) on Flowers, where 2100M is the training MAC per sample, 2040 is the number of total training samples, 1.0 means the full training set is used, and 50 is the number of training epochs.
C Detailed Architecture of the Super-Net
Table 2: Detailed architecture of the super-net using the MobileNetV2 design space with lite residual modules (Section 3.2.1). "SepConv" denotes the separable convolution block [22], which consists of a depthwise-separable convolution layer and a 1x1 convolution layer. "MB-LiteResidual" denotes the mobile inverted bottleneck block [40] with a lite residual module (Section 3.2.1). Each MB-LiteResidual stage contains 2-4 blocks, and each block chooses its kernel size from {3, 5, 7} and its expansion ratio from {3, 4, 6}.

Operator                  Output channels   Stride (first block of stage)
3x3 Conv2d                40                2
SepConv                   24                1
MB-LiteResidual x(2-4)    32                2
MB-LiteResidual x(2-4)    56                2
MB-LiteResidual x(2-4)    104               2
MB-LiteResidual x(2-4)    128               1
MB-LiteResidual x(2-4)    248               2
MB-LiteResidual x(2-4)    416               1
1x1 Conv2d                -                 1
Avg-pool, Linear          -                 -

D Memory Footprint of Non-Linear Activation Layers