EfficientNet-eLite: Extremely Lightweight and Efficient CNN Models for Edge Devices by Network Candidate Search
Ching-Chen Wang, Ching-Te Chiu, Jheng-Yi Chang
Department of Computer Science, National Tsing Hua University
[email protected], [email protected], [email protected]
Abstract
Embedding a Convolutional Neural Network (CNN) into edge devices for inference is a very challenging task because such lightweight hardware is not born to handle this heavyweight software, which is the common overhead of modern state-of-the-art CNN models. In this paper, aiming to reduce this overhead while trading away as little accuracy as possible, we propose Network Candidate Search (NCS), an alternative way to study the trade-off between resource usage and performance through grouping concepts and an elimination tournament. NCS also generalizes across any neural network. In our experiment, we collect candidate CNN models scaled down from EfficientNet-B0 in varied ways through width, depth, input resolution and compound scaling down, applying NCS to study the scaling-down trade-off. Meanwhile, a family of extremely lightweight EfficientNets is obtained, called EfficientNet-eLite. To further embrace CNN edge applications with Application-Specific Integrated Circuits (ASICs), we adjust the architectures of EfficientNet-eLite to build a more hardware-friendly version, EfficientNet-HF. Evaluated on the ImageNet dataset, both the proposed EfficientNet-eLite and EfficientNet-HF present better parameter usage and accuracy than previous state-of-the-art CNNs. Particularly, the smallest member of EfficientNet-eLite is more lightweight than the best and smallest existing MnasNet, with 1.46x fewer parameters and 0.56% higher accuracy. Code is available at https://github.com/Ching-Chen-Wang/EfficientNet-eLite
1. Introduction
In the recent decade, Convolutional Neural Networks (CNNs) have presented remarkable achievements on vision tasks such as action recognition [26], object detection [37][8], image segmentation [5] and so on [35][7][17]. However, CNN-based applications are still uncommon nowadays, which we
Figure 1. The performance of model size and Top-1 accuracy on ImageNet [6]. The proposed CNN family, EfficientNet-eLite, is more lightweight than the other state-of-the-art models with higher accuracy on the ImageNet dataset. Particularly, the smallest member of EfficientNet-eLite is more lightweight than the best and smallest existing MnasNet with 1.46x fewer parameters and 0.56% higher accuracy. As for the two versions of hardware-friendly models, r128 and r256 denote the input resolutions 128x128 and 256x256 respectively. More details about performance are provided in Table 5.

attribute to the hardware limitation and the software complexity. The former means the application is strictly limited by the hardware condition: enough RAM to store the heavy parameters and powerful computational ability to perform tons of operations are both necessary requirements of a CNN-based application. Besides, portability and some physical limitations also prevent its development. On the other hand, modern high-performing CNN models always feature intensive computation and parameters. Although mobile-size CNN models have been proposed recently [12] [25] [11], these models are still not lightweight enough for some edge devices such as low-power IoT (Internet of Things) devices or wearable devices such as smart glasses, watches and so on.

Figure 2. We make EfficientNet more lightweight by scaling down EfficientNet-B0 (a) through Resolution (b), Width (c), Depth (d) and Compound (e) scaling.

CNN accelerators have appeared to bridge the gap between CNN applications and edge devices [13] [32]. Nevertheless, apart from designing CNNs on general-purpose hardware, the design space of CNN models on an accelerator is sacrificed by some degrees of freedom and strictly restricted by the hardware specification. To be more specific, all types of operation in the CNN should be fully compatible according to the instruction set architecture (ISA).
In addition to the compatibility, the performance of the ASIC is another important design principle, involving chip area, energy efficiency, utilization and so on. In [13], state-of-the-art methods are highlighted for the co-design concepts of software and hardware, to build not only accurate but also hardware-friendly CNN structures.

In this study, the modern high-performing CNN models, EfficientNets [31], are selected as our backbone structure. We target building a more lightweight version and adjusting it toward a more hardware-friendly structure. First of all, we apply EfficientNet-B0 (the baseline model) to be scaled down among channels, depths and input resolution as a technique of model compression, along with the compound scaling method and the constant ratio proposed by [31], as shown in Figure 2. The thinner EfficientNets are collected into a candidate pool and used for studying the trade-off by the proposed Network Candidate Search (NCS). Although compound scaling up with fixed scaling coefficients is systematically analyzed in [31], the scaling-down principle is not

Figure 3. The overview of the proposed Network Candidate Search.

well understood. We believe that those three dimensions may not keep the scaling relationship of a constant ratio. Our study indicates that input information plays a more significant role than channels and depths when a model is scaled down.

Figure 3 provides the overview of NCS, which is composed of two major concepts, grouping and elimination tournament. The former investigates the question: what is the CNN's shape (i.e., the ratio or relationship of width, depth and input resolution) that can achieve better accuracy? The basic idea is that candidates (CNN models) with similar parameter usage and Flops are fairly divided into groups for comparison. By doing so, the models that stand out inside a group have a relatively better shape and a worthy scaling-down trade-off.
The latter mitigates the training cost to affordable GPU hours. Only the potential models in each group survive, and the others are gradually eliminated, stopping their training to release the burden on the GPU. As for determining the potential models, we adopt average accuracy as the criterion, which comes from observing the relationship between learning performance and final accuracy. Finally, the champion of the elimination tournament in each group is obtained, called EfficientNet-eLite (Extremely lightweight EfficientNet), which presents better parameter usage and accuracy than the previous state-of-the-art CNNs. Particularly, our EfficientNet-eLite 9 outperforms MnasNet with 1.46x fewer parameters and 0.56% higher accuracy on ImageNet.

Secondly, to go further in alleviating the difficulty of CNN inference on the edge, we provide hardware-friendly CNN models as candidates for NCS by considering design concepts of Application-Specific Integrated Circuits (ASICs). Finally, we obtain a family of relatively high-performing models, called EfficientNet-HF (Hardware-friendly EfficientNet), realizing that CNN models can be not only accurate but also hardware-friendly for ASICs.

The rest of this paper is organized as follows. Section 2 presents the related work, including the scaling methods that we apply for model compression, our baseline model EfficientNet [31], and some hardware-friendly designs for CNNs. Section 3 discusses the proposed Network Candidate Search in detail. The proposed family of hardware-friendly EfficientNets is introduced in Section 4. Experimental results and the conclusion are presented in Sections 5 and 6 respectively.
2. Related Work
Model scaling has been a very popular method for expanding the scale of CNN models to pursue better accuracy. The early CNN, LeNet-5 for recognizing handwritten digits, has only 7 layers and thousands of trainable parameters. With the progress of the Graphics Processing Unit (GPU), CNNs have evolved to solve more complicated problems by going bigger and bigger. AlexNet [18], a deeper network with 8 layers, is the widely known breakthrough of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competition, with about sixty million parameters. VGG16 [27], with 16 layers and a regular architecture, proposes a consecutive stack of two 3x3 convolution layers to replace a single 5x5 convolution layer, which makes its structure deeper. The ResNet [10] family wins the 2015 ImageNet challenge; the vanishing gradient problem is dealt with by ResNet to achieve a 152-layer architecture.

We categorize the common ways to adjust the model size into 4 segments, which are introduced respectively as follows.
Depth scaling :
It is widely believed that a deeper network should achieve better performance [10] [15] [28] [29]. Once the vanishing gradient problem is dealt with, the network can be made deeper to gain accuracy by tuning the repeat count of the basic building blocks. For example, by modifying the repetition of residual blocks, ResNet can be scaled up to 152 layers or scaled down to only 18 layers.
Width scaling :
Several state-of-the-art models commonly apply width scaling [12] [25]. The intuition is that a wider network has more filters to memorize the input patterns. Besides, wider networks tend to make training easier [39].
Resolution scaling :
Higher resolution can provide more detailed information for CNN models to tell apart the difference between very similar inputs. Therefore, accuracy can be improved. Beginning from an input size of 224 x 224, Inception-V4 applies 299 x 299 as its input size. A larger input size, 480 x 480, is used in [16].
Compound scaling :
With CNNs scaled up to reach the hardware limit, researchers are devoting themselves to finding more efficient ways of scaling. To be more specific, how the additional hardware resource should be effectively assigned
Table 1. Family of EfficientNet by scaling up.
EfficientNet             B0      B1      B2      B3      B4      ...
φ (Available resource)   0       -       1       2       3       ...
Depth                    D       1.1·D   1.2·D   1.4·D   1.8·D   d·D
Width                    W       W       1.1·W   1.2·W   1.4·W   w·W
Resolution               R       1.07·R  1.16·R  1.34·R  1.7·R   r·R
Parameters               5.3M    7.8M    9.2M    12M     19M     ...
Flops                    0.39B   0.70B   1.0B    1.8B    4.2B    ...
Figure 4. The overview of EfficientNet.

into the scaling dimensions becomes an important question. That is to say, the scaling trade-off becomes an active research topic targeting how to gain accuracy with less resource cost. The EfficientNet [31] authors conduct a series of experiments and observe that scaling through a single dimension quickly saturates the accuracy gain, while compound scaling through the three dimensions of depth, width and input resolution achieves better performance.
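As a minimal sketch of compound scaling (using the published EfficientNet constants α = 1.2, β = 1.1, γ = 1.15; the function name is ours):

```python
# Compound scaling sketch: depth, width, and resolution grow as
# alpha^phi, beta^phi, gamma^phi, chosen so alpha * beta^2 * gamma^2 ~ 2,
# i.e. each unit of phi roughly doubles the Flops.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # published EfficientNet constants

def compound_scale(depth, width, resolution, phi):
    """Scale a baseline (depth, width, resolution) by the compound rule."""
    return (depth * ALPHA ** phi,
            width * BETA ** phi,
            resolution * GAMMA ** phi)

d, w, r = compound_scale(1.0, 1.0, 224, phi=1)
print(round(d, 2), round(w, 2), round(r))  # 1.2 1.1 258
```

Scaling down, as done in this paper, uses the same ratios but with coefficients below 1.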
Figure 4 illustrates the overview of building up the family of EfficientNets, and we divide the state-of-the-art EfficientNet into two parts to discuss.
Neural Architecture Search :
Neural Architecture Search (NAS) is a technique based on reinforcement learning to automatically build up CNN models. Conventional CNN structures are made in a hand-crafted manner, which involves lots of architectural possibilities and relies significantly on human expertise. NAS mitigates the trial-and-error effort of constructing CNN networks, and modern CNN models designed by automated approaches tend to outperform manually designed ones [30].
Grid Search :
The goal of the grid search is to find a strategy to effectively assign available hardware resources into depth d, width w and input resolution r for expanding CNN models. In this case, the authors assume twice more resources are available, denoted φ = 1. Candidate scaling coefficients of depth α, width β and resolution γ determine how to allocate those resources. There are lots of combinations of α, β and γ. After the grid search, the authors find the best values for EfficientNet-B0 are α = 1.2, β = 1.1 and γ = 1.15. The EfficientNet family, shown in Table 1, can be obtained from d, w and r according to the amount of available resources φ.

Table 2. The structure of the original and the scaled-down EfficientNet. For each stage s with baseline Resolution H_s × W_s, Channels C_s and Repeat R_s, the scaled-down model uses Resolution H_s·r_k × W_s·r_k, Channels C_s·w_i and Repeat ceiling(R_s·d_j). The baseline resolutions per stage are 224, 112, 112, 56, 28, 14, 14, 7 and 7. The right side of the table lists the scaled-down repeat counts of each operator for the depth coefficients d_1 = 1.0, d_x = 0.9 or 0.8 (don't care), d_2 = 0.7, d_3 = 0.6 and d_4 = 0.5.

There are several hardware-friendly designs for CNN models. For compressing toward a small model size with low-precision data formats, quantization and dynamic fixed-point representation are manipulated in [13] [22] [23]. Even binary precision is adopted in [1]. As for abolishing redundant parameters, pruning and value decomposition are commonly used methodologies. On the other hand, reducing the computational overhead is another type of hardware-friendly principle. Depthwise separable convolution, introduced in MobileNet [12], is known for its impressive decrease in operations. Hardware-friendly activation functions [2][4] alleviate the challenge of fixed-point arithmetic.
Hardware-friendly CNN structure :
Under general-purpose hardware, modern CNNs are optimized for parameter usage and Flops. As for a dedicated CNN accelerator, a regular and modularized architecture is considered a hardware-friendly design. Taking the state-of-the-art embedded CNN in [32] as an example, we see that its structure is hardware-friendly for ASIC design in terms of the channels and the feature-map size of each layer. From the perspective of channels, the whole CNN is built upon 3x3 convolutions with 32-channel input and 32-channel output. Therefore, the accelerator can achieve high utilization when the processing element (PE) is designed for performing 32-channel operations in parallel. With feature-map sizes that are powers of two, the tiling technique and data partitioning are easier to apply when the SRAM is insufficient and the DRAM must be accessed. Besides, the division operation (average pooling) can be done by shifting rather than by a hardware division unit.
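As a toy illustration (our own example, not the accelerator in [32]): when the pooling window holds a power-of-two number of elements, the division in average pooling reduces to a right shift.

```python
def avg_pool_by_shift(window_sum, n_elems):
    """Average = sum >> log2(n) when the element count n is a power of two."""
    assert n_elems > 0 and n_elems & (n_elems - 1) == 0, "need power-of-two n"
    shift = n_elems.bit_length() - 1  # log2(n) for a power of two
    return window_sum >> shift

# Global average pooling over an 8x8 feature map (64 elements):
print(avg_pool_by_shift(6400, 64))  # 100, no divider unit required
```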
3. Network Candidate Search
The core concept of NCS is searching for outstanding models over candidate models that consume similar hardware cost. In this section, we start by defining candidates scaled down from EfficientNet-B0 [31] in varied ways. Secondly, similar candidates are grouped together for comparison. Thirdly, we introduce the criteria for elimination. Lastly, the algorithm of NCS is summarized.
The baseline CNN model is broken down into nine stages with their operators [30], listed in Table 2. Channels C_s, Resolution H_s × W_s and Depth (the sum of Repeats R_s) are denoted as the baseline specification for each stage s. We preserve all types of operation inside each Operator, focusing on scaling down channels w_i, depth d_j and input resolution r_k such that 0 < w_i, d_j, r_k ≤ 1, making the scaled-down model more lightweight with Channels C_s × w_i, Resolution H_s·r_k × W_s·r_k and Repeat R_s × d_j.

We define a candidate pool CP as a set, where each element symbolizes a CNN model determined by a combination of scaling coefficients w_i, d_j, r_k. The values i, j, k denote indices representing the magnitude of scaling.

CP = { Model(w_i, d_j, r_k) | ∀ i, j, k ∈ ℕ and 0 < w_i, d_j, r_k ≤ 1 }   (1)

Define scaling coefficient from depth:
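The depth rule developed below can be checked with a short sketch. It assumes EfficientNet-B0's per-stage repeats (1, 2, 2, 3, 3, 4, 1) and counts the stem convolution and the head as two fixed operators (our reading of the "Total operators" row in Table 3); after the ceiling, d = 0.9 and d = 0.8 collapse onto the same depth as d = 1.0:

```python
import math

# Total operator count after depth scaling: ceil(R_s * d) per stage,
# plus 2 for the fixed stem and head operators (an assumption that
# reproduces the "Total operators" row of Table 3).
B0_REPEATS = [1, 2, 2, 3, 3, 4, 1]

def total_ops(d):
    return 2 + sum(math.ceil(r * d) for r in B0_REPEATS)

for d in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):
    print(d, total_ops(d))  # 1.0/0.9/0.8 -> 18, 0.7 -> 17, 0.6 -> 15, 0.5 -> 12
```

The coefficients 0.9 and 0.8 are therefore "don't care" values: they leave the total depth unchanged, so the next accepted coefficient is d_2 = 0.7.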
We find that the depth coefficient is easier to determine than width and input resolution because the repeat count has fewer possibilities. For example, the original input resolution is 224 × 224, and the scaled-down resolution could take a great many values below it. Channel coefficients have lots of choices for the same reason. However, depth coefficients have just a few cases, which we can start defining from. The right side of Table 2 illustrates the scaled-down results of depth. In Table 2, d_x means "don't care": after the ceiling function of the repeat count, applying the coefficients 0.9 and 0.8 results in the same depth as d_1 = 1.0. Thus, we define d_2 = 0.7 and d_3, d_4, ..., d_j according to Equation 2.

d_j = 1.0, if j = 1;
d_j = d_{j-1} − 0.1·x, ∃ x ∈ ℕ s.t. Σ_s ceiling(R_s · d_{j-1}) > Σ_s ceiling(R_s · d_j), if j > 1   (2)

Define width and resolution coefficients:
EfficientNet [31] scales up the model from B0 to B7 by a set of constant ratios (w = 1.1, d = 1.2, r = 1.15), obtained by the grid search under a predefined resource budget. We already have the depth coefficients d_j. The idea is that we can use this set of constant ratios to scale down, calculating the corresponding w_i, r_k from the depth coefficients with Equation 3. Note that we use the total number of operators t_u instead of d_j because t_u is more representative of the depth coefficient in EfficientNet [31], and that we only consider i, u, k less than or equal to 4 due to the GPU resource limitation.

Table 3. Coefficients for scaling down EfficientNet.

Depth coefficient d_j        1.0   0.7     0.6    0.5    ...
Total operators t_u          18    17      15     12     ...
Width coefficient w_i        1.0   0.8666  0.701  0.514  ...
Resolution coefficient r_k   1.0   0.905   0.766  0.587  ...
Input resolution             224   203     172    132    ...

Instead of directly using only the compound-scaled models (i.e., the models whose width, depth and resolution indices match), flexibility is gained by collecting every combination of w_i, d_j, r_k into the candidate pool. In this way, we can investigate the shape of the CNN (i.e., the ratio or relationship of width, depth and input resolution) through different scaling-down strategies, studying the scaling-down trade-off along with the proposed grouping method, introduced in the following section.

Δw = w_{i+1}/w_i,  Δd = t_{u+1}/t_u,  Δr = r_{k+1}/r_k,  with Δw : Δd : Δr following the 1.1 : 1.2 : 1.15 constant ratio   (3)

The core idea of the grouping method is to study the different shapes of CNNs under similar kinds of resource consumption, specified as the parameter usage and Flops of a CNN model. Namely, CNN models with similar parameter usage and Flops are gathered in the same group, so that the model that stands out in its group can be considered a relatively good shape of CNN.
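A sketch of the resulting candidate pool, built from the Table 3 coefficients above (the Model stand-in dict and the ceiling-based rounding of the input size are our assumptions):

```python
import math
from itertools import product

WIDTH = [1.0, 0.8666, 0.701, 0.514]   # w_1..w_4 from Table 3
DEPTH = [1.0, 0.7, 0.6, 0.5]          # d_1..d_4 from Table 3
RESOL = [1.0, 0.905, 0.766, 0.587]    # r_1..r_4 from Table 3

# Equation 1: every combination of (w_i, d_j, r_k) defines one candidate.
candidate_pool = [{"w": w, "d": d, "r": r}
                  for w, d, r in product(WIDTH, DEPTH, RESOL)]
print(len(candidate_pool))                  # 64 candidates for i, j, k <= 4

# The "Input resolution" row of Table 3, from B0's 224x224 baseline:
print([math.ceil(224 * r) for r in RESOL])  # [224, 203, 172, 132]
```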
Besides, the combination of scaling coefficients of width, depth and input resolution that wins is the comparatively better strategy for scaling down.

To fairly compare the different shapes of CNN in the candidate pool, we keep the factors that may affect the performance of a CNN as equal as possible, including the training environment such as batch size, learning rate, data augmentation policy, optimization algorithm and so on. As for hardware resource usage, it is difficult to have candidates with exactly the same parameter usage and Flops. Additionally, those two measurements are not on the same scale, which makes grouping the candidates fairly more challenging. The statistics are as follows: over the candidate pool, the mean parameter usage is X̄_Para ≈ 3 million while the mean Flops is X̄_Flops ≈ 153 million; the standard deviation of parameter usage is σ_Para ≈ 1 million and that of Flops is σ_Flops ≈ 90 million.

As a result, we propose a grouping method based on the statistical distribution of both parameter usage and Flops. The benefit is that the statistical distribution is not affected by the scale. A Z-score is calculated for each model in the candidate pool. The Z-score is the offset of the data from the mean, measured in standard deviations, as denoted in Equation 4. Hence, a larger Z-score represents a relatively heavy resource cost and vice versa. The parameter usage and Flops are standardized to the same scale so that we can adopt Z_sum as the standard to classify the candidate models into groups.

Z_Para = (X_Para − X̄_Para) / σ_Para,
Z_Flops = (X_Flops − X̄_Flops) / σ_Flops,
Z_sum = Z_Para + Z_Flops   (4)
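Equation 4 can be sketched as follows; the four (parameters, Flops) pairs reuse the observation models of Table 4, and sorting by Z_sum is our illustration of the grouping order:

```python
from statistics import mean, pstdev

# Candidate models: name -> (parameter usage in M, Flops in M), from Table 4.
models = {"C1": (3.7, 296), "C2": (3.8, 384),
          "C3": (4.4, 314), "C4": (4.7, 362)}

params = [p for p, _ in models.values()]
flops = [f for _, f in models.values()]
mp, sp = mean(params), pstdev(params)
mf, sf = mean(flops), pstdev(flops)

# Z_sum = Z_Para + Z_Flops; a larger value means a relatively costlier model.
z_sum = {m: (p - mp) / sp + (f - mf) / sf for m, (p, f) in models.items()}
print(sorted(models, key=z_sum.get))  # cheapest to costliest: C1, C3, C2, C4
```

Because both measurements are standardized, neither the million-scale Flops nor the unit-scale parameter count dominates Z_sum.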
Expensive searching cost is a common issue of searching-based approaches. It is impractical to finish the training process of all candidate models. Therefore, several searching-based state-of-the-art methods apply an elimination strategy to alleviate the searching cost. The state-of-the-art MnasNet [30] adopts the accuracy at the fifth epoch to eliminate candidates while searching over 8,000 models. Among other state-of-the-art methods, PNAS [20] applies the accuracy of an early epoch, and AmoebaNet [24] likewise uses the accuracy of an early epoch, to speculate whether a candidate will be an outstanding CNN model.

In this section, our goal is to find criteria that seek out potential models at an acceptable searching cost. Intuitively, an outstanding model should show its talent during the early training phase: with a few epochs of training, it can be distinguished by higher accuracy. However, in our study, we realize that the candidates in each group have close accuracy during the early training phase, which may lead to mistakenly eliminating a promising model due to the unexpected result of a specific epoch. As a result, we propose using average accuracy, which accumulates the accuracy from the first epoch to the current epoch, as the criterion for elimination.

The idea comes from observing learning performance. First of all, we begin with k models that have similar hardware resource usage. Due to the GPU limitation, we set k = 4; their parameter usage and Flops are listed in Table 4. We are targeting finding clues

Figure 5. Steps to make up the hypothesis.

Table 4. Candidate models for observation.

Candidate   Parameter usage (M)   Flops (M)   Top-1 accuracy on ImageNet (final epoch)   Ranking
C1          3.7                   296         75.65                                      4
C2          3.8                   384         76.45                                      2
C3          4.4                   314         76.14                                      3
C4          4.7                   362         76.62                                      1

Figure 6. Accuracy curves for observation.

to know the ranking early (i.e., at low training cost) rather than waiting until the final epoch.
For example, in Table 4, we hope to stop training C1 and C3 as early as we can, because C2 and C4 are the relatively outstanding models in this group.
Observation :
The accuracy curves cross. (Before a certain epoch, C2 and C3 have close accuracy; C2 and C4 have crossing accuracy curves.)

Hypothesis:
Average accuracy is more representative of the final performance.

From the observation, we find that the performance at a specific epoch is sometimes not representative of the final performance. As a result, we suggest that average accuracy is a better criterion for predicting an outstanding model than the accuracy of a specific epoch. There are two kinds of behavior of the accuracy curves. In the first, the accuracy curve of one candidate is always higher than the other candidate's; both the average and the specific-epoch method work perfectly to predict the outstanding CNN model. In the other case, the accuracy curves cross; the average method is likely more accurate than the specific-epoch approach. A further benefit of average accuracy is that the criterion is not sensitive to the result of one specific epoch.

We evaluate this intuition by applying both criteria to the training results of the candidates. We divide the 350 epochs into 35 rounds, with 10 epochs per round as the unit of observation. For each round, we observe the relative performance of the candidate models, measuring to what extent each criterion matches the final performance; that is, whether the average accuracy or the specific accuracy better reflects the final ranking. In Equation 5,
Acc^{C_m}_spe(i) stands for the specific accuracy at the i-th epoch of candidate C_m. Rank_spe(r) denotes the sequence of candidate models ordered by their accuracy at the specific epoch of round r. P_spe measures how well the criterion predicts the final ranking.
Rank_spe(r) = (C_a, C_b, C_c, C_d),  if Acc^{C_a}_spe(10r) > ... > Acc^{C_d}_spe(10r)

Rank_spe(35) = (C_4, C_2, C_3, C_1)

P_spe = (1/35) Σ_{r=1}^{35} [Rank_spe(r) == Rank_spe(35)] = 22/35 ≈ 63%   (5)

In Equation 6,
Rank_avg(r) denotes the same kind of ordered sequence, but sorted by the average accuracy from the first epoch to the current round r. Using average accuracy matches the final ranking 94% of the time.

Acc^{C_m}_avg(1, k) = (1/k) Σ_{i=1}^{k} Acc^{C_m}_spe(i)

Rank_avg(r) = (C_a, C_b, C_c, C_d),  if Acc^{C_a}_avg(1, 10r) > ... > Acc^{C_d}_avg(1, 10r)

P_avg = (1/35) Σ_{r=1}^{35} [Rank_avg(r) == Rank_spe(35)] = 33/35 ≈ 94%   (6)

By this observation, the elimination criterion is based on the sorted average accuracy, calculated from the first epoch to the current epoch. Compared to the method based on specific accuracy, the searching cost remains the same except for the averaging operation and maintaining the history of the accuracy. Therefore, we adopt the average accuracy as the elimination criterion.

• STEP 1: Candidate initialization
  STEP 1.1: Defining candidate models
  STEP 1.2: Sorting candidates by resource usage
  STEP 1.3: Dividing candidates into groups
• STEP 2: Training a round r of all surviving candidates (we use e = 10 epochs as a round)
• STEP 3: Eliminating models
  STEP 3.1: Calculating the average accuracy
  STEP 3.2: Sorting average accuracy in ascending order
  STEP 3.3: Eliminating half of the candidates per group according to the sorted average accuracy
• STEP 4: Repeating STEPs 2 and 3 until the training is finished and one candidate remains in each group
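The steps above can be summarized in a short runnable sketch. Real training is replaced by a random stub, so the winner below is arbitrary; in the paper each round is e = 10 epochs on ImageNet, and the group assignment of STEP 1 uses the Z_sum grouping:

```python
import random

random.seed(0)  # deterministic stub in place of real training

def train_one_round(model):
    """Stand-in for one round (10 epochs) of training; returns accuracy."""
    return random.uniform(60.0, 77.0)

def run_tournament(group):
    history = {m: [] for m in group}       # accuracy history per candidate
    survivors = list(group)
    while len(survivors) > 1:              # STEP 4: repeat until one remains
        for m in survivors:                # STEP 2: train one round each
            history[m].append(train_one_round(m))
        # STEP 3: rank by average accuracy from the first epoch onward,
        # then eliminate the lower half of the group.
        avg = {m: sum(history[m]) / len(history[m]) for m in survivors}
        survivors.sort(key=avg.get, reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]

print(run_tournament(["C1", "C2", "C3", "C4"]))  # the group's champion
```

Because half of each group is dropped per round, the training budget shrinks geometrically instead of growing with the full pool size.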
4. Hardware friendly EfficientNet
In this section, applying the same scaling coefficients as in Table 3, we incorporate hardware-friendly designs into the scaled-down models. The hardware-friendly candidate pool can then be determined, using NCS, which generalizes to any kind of neural network, to select the outstanding models. The hardware-friendly designs can be categorized by channels and input resolution.
Two kinds of input resolution are provided, 128 × 128 and 256 × 256. The feature-map size for each operator then becomes a power-of-two number, which benefits the data partitioning problem. Besides, the division operation (average pooling) can be done by shifting rather than by a hardware division unit.

The basic idea is to make each channel count a power of two while distorting the model structure as little as possible. Denote the original channels C_s at stage s. The channel count by rounding up is defined as C_s^RU = 2^ceiling(log2 C_s), and by rounding down as C_s^RD = 2^floor(log2 C_s). Compound rounding C_s^CR, defined in Equation 7, combines rounding up and down for less adjustment of the channels, because we want to keep the original shape of EfficientNet [31].
Figure 7. The performance of Flops and Top-1 accuracy on ImageNet.

C_s^CR = C_s^RU, if C_s^RU − C_s < C_s − C_s^RD; else C_s^RD   (7)
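A sketch of the compound rounding of Equation 7 (the channel values in the example are illustrative MBConv-style counts, not the exact EfficientNet-HF configuration):

```python
import math

def compound_round(c):
    """Snap a channel count to the nearer power of two (down on ties)."""
    up = 2 ** math.ceil(math.log2(c))     # C_s^RU
    down = 2 ** math.floor(math.log2(c))  # C_s^RD
    return up if up - c < c - down else down

print([compound_round(c) for c in (16, 24, 40, 80, 112, 192, 320, 1280)])
# [16, 16, 32, 64, 128, 128, 256, 1024]
```

Counts already at a power of two (e.g., 16) stay unchanged, so only the stages that would break the PE's power-of-two parallelism are adjusted.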
5. Experimental Results
Our experiments are conducted on an NVIDIA 2080Ti GPU and an i7-9700K CPU with a PyTorch implementation. The training parameter settings are as follows, fundamentally following EfficientNet [31] except for some GPU equipment limitations. We adopt a batch size of 100 and 350 training epochs. Optimization is based on RMSProp. The data augmentation policy is adopted from [19].
The champion model of each group is taken as a member of EfficientNet-eLite or EfficientNet-HF, and we give larger ids to the more lightweight members. We find that the winner of each group has one of the two largest resolution coefficients. Because we group CNN models by similar parameter usage and Flops, we conclude that models with more input information tend to outperform the others under a similar resource usage condition when models are scaled down.

Figure 1 shows the model size against ImageNet Top-1 accuracy. At the same accuracy, our proposed models are smaller than the other state-of-the-art models. From the vertical perspective, at the same parameter usage, our proposed models have higher Top-1 accuracy on ImageNet than the other

Table 5. Comparison between state-of-the-art models and the proposed EfficientNet-eLite and EfficientNet-HF on ImageNet. Models with close Top-1 accuracy on ImageNet are blocked together and organized in ascending order. Note that we define the searching cost on the ImageNet dataset as counting from the start of elimination until only one candidate survives per group.
State-of-the-art models         Publication   Parameters (M)   Flops (M)   Top-1 Acc. (%)   Searching cost (GPU hours)
Mnas-small [30]                 CVPR 2019     1.9              65.1        64.9             40,000
MobileNet V3 small 0.75 [11]    ICCV 2019     2                —           —                —
GhostNet 0.5x [9]               CVPR 2020     2.6              —           —                —
EfficientNet-eLite member       -             2.18             127.99      70.18            90
EfficientNet-HF3                -             3.44             92.29       70.32            40
MobileNet [12]                  CVPR 2017     4.24             569         70.6             Manual
CondenseNet (G=C=8) [14]        CVPR 2018     2.9              274         71               Manual
MobileNet V2 [25]               CVPR 2018     3.4              300         72               Manual
ShuffleNet (g=3) [40]           CVPR 2017     5.2              524         73.7             Manual
GhostNet 1.0x [9]               CVPR 2020     5.2              —           —                —
GhostNet 1.3x [9]               CVPR 2020     7.3              —           —                —
EfficientNet-eLite member       -             4.74             362.62      76.62            -
EfficientNet-eLite member       -             5.33             385.81      76.89            -
EfficientNet-B0 [31]            ICML 2019     5.3              390         77.3             -

state-of-the-art models. Figure 7 shows the Flops against ImageNet Top-1 accuracy. Our proposed models outperform most of the state of the art with fewer floating-point operations and higher accuracy.
6. Conclusion
A family of extremely lightweight CNN models for edge devices is proposed. Particularly, EfficientNet-eLite 9 is more lightweight than the best and smallest existing model, Mnas-small [30]. We study the trade-off between hardware resources and accuracy with the novel Network Candidate Search, whose candidates are determined by scaling down EfficientNet. We find that scaling down width and depth tends to cause less accuracy drop than reducing the input information. Besides, the grouping and elimination concepts are introduced to effectively select potential structures while reducing searching cost at the same time. We propose using average accuracy to speculate on potential CNN models. To push state-of-the-art CNNs toward embedded edge applications, we also propose hardware-friendly CNN models using the same NCS methodology along with hardware-friendly adjustments. Finally, both families of models outperform the state-of-the-art CNN models with less parameter usage and higher accuracy on ImageNet.

References

[1] R. Andri, L. Cavigelli, D. Rossi, and L. Benini. YodaNN: An architecture for ultralow power binary-weight CNN acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(1):48-60, 2018.
[2] R. Avenash and P. Viswanath. Semantic segmentation of satellite images using a modified CNN with hard-swish activation function. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), pages 413-420. INSTICC, SciTePress, 2019.
[3] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.
[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, pages 3123-3131. Curran Associates, Inc., 2015.
[5] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[8] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
[9] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[11] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[13] Chao-Tsung Huang, Yu-Chun Ding, Huan-Ching Wang, Chi-Wen Weng, Kai-Ping Lin, Li-Wei Wang, and Li-De Chen. eCNN: A block-based and highly-parallel CNN accelerator for edge inference. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), pages 182-195, New York, NY, USA, 2019. Association for Computing Machinery.
[14] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[15] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth.
CoRR , abs/1603.09382, 2016.[16] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat,Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam,Quoc V Le, Yonghui Wu, and zhifeng Chen. Gpipe: Efficienttraining of giant neural networks using pipeline parallelism.In
Advances in Neural Information Processing Systems 32 ,pages 103–112. Curran Associates, Inc., 2019.[17] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic align-ments for generating image descriptions. In
Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition (CVPR) , June 2015.[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.Imagenet classification with deep convolutional neural net-works. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q.Weinberger, editors,
Advances in Neural Information Pro-cessing Systems 25 , pages 1097–1105. Curran Associates,Inc., 2012.[19] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, andSungwoong Kim. Fast autoaugment. In
Advances in NeuralInformation Processing Systems 32 , pages 6665–6675. Cur-ran Associates, Inc., 2019.[20] Chenxi Liu, Barret Zoph, Maxim Neumann, JonathonShlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, JonathanHuang, and Kevin Murphy. Progressive neural architecturesearch. In
Proceedings of the European Conference on Com-puter Vision (ECCV) , September 2018.[21] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS:Differentiable architecture search. In
International Confer-ence on Learning Representations , 2019.[22] B. Moons and M. Verhelst. A 0.32.6 tops/w precision-scalable processor for real-time large-scale convnets. In , pages 1–2, 2016.[23] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li,Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song,Yu Wang, and Huazhong Yang. Going deeper with embed-ded fpga platform for convolutional neural network. In
Pro-ceedings of the 2016 ACM/SIGDA International Symposiumon Field-Programmable Gate Arrays , FPGA 16, page 2635,New York, NY, USA, 2016. Association for Computing Ma-chinery.[24] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V.Le. Regularized evolution for image classifier architecturesearch.
CoRR , abs/1802.01548, 2018.[25] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh-moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted esiduals and linear bottlenecks. In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition(CVPR) , June 2018.[26] Karen Simonyan and Andrew Zisserman. Two-stream con-volutional networks for action recognition in videos.
CoRR ,abs/1406.2199, 2014.[27] Karen Simonyan and Andrew Zisserman. Very deep convo-lutional networks for large-scale image recognition. In arXiv1409.1556 , 2014.[28] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,Scott Reed, Dragomir Anguelov, Dumitru Erhan, VincentVanhoucke, and Andrew Rabinovich. Going deeper withconvolutions. In
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , June2015.[29] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, JonShlens, and Zbigniew Wojna. Rethinking the inception ar-chitecture for computer vision. In
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition(CVPR) , June 2016.[30] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan,Mark Sandler, Andrew Howard, and Quoc V. Le. Mnas-net: Platform-aware neural architecture search for mobile.In
Proceedings of the IEEE/CVF Conference on ComputerVision and Pattern Recognition (CVPR) , June 2019.[31] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinkingmodel scaling for convolutional neural networks. In
ICML ,pages 6105–6114, 2019.[32] C. Wang, C. Chiu, C. Huang, Y. Ding, and L. Wang. Fastand accurate embedded dcnn for rgb-d based sign languagerecognition. In
ICASSP 2020 - 2020 IEEE InternationalConference on Acoustics, Speech and Signal Processing(ICASSP) , pages 1568–1572, 2020.[33] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu. Eca-net:Efficient channel attention for deep convolutional neural net-works. In , pages 11531–11539, LosAlamitos, CA, USA, jun 2020. IEEE Computer Society.[34] Robert J. Wang, Xiang Li, and Charles X. Ling. Pelee: Areal-time object detection system on mobile devices. In S.Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,
Advances in Neural In-formation Processing Systems 31 , pages 1963–1972. CurranAssociates, Inc., 2018.[35] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid.Deepflow: Large displacement optical flow with deep match-ing. In , pages 1385–1392, 2013.[36] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang,Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, YangqingJia, and Kurt Keutzer. Fbnet: Hardware-aware efficient con-vnet design via differentiable neural architecture search. In
Proceedings of the IEEE/CVF Conference on Computer Vi-sion and Pattern Recognition (CVPR) , June 2019.[37] Bichen Wu, Forrest Iandola, Peter H. Jin, and Kurt Keutzer.Squeezedet: Unified, small, low power fully convolu-tional neural networks for real-time object detection for au-tonomous driving. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Work-shops , July 2017.[38] Shan You, Tao Huang, Mingmin Yang, Fei Wang, ChenQian, and Changshui Zhang. Greedynas: Towards fastone-shot nas with greedy supernet. In
Proceedings ofthe IEEE/CVF Conference on Computer Vision and PatternRecognition (CVPR) , June 2020.[39] Sergey Zagoruyko and Nikos Komodakis. Wide residual net-works.
CoRR , abs/1605.07146, 2016.[40] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun.Shufflenet: An extremely efficient convolutional neural net-work for mobile devices. In
Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR) ,June 2018.[41] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V.Le. Learning transferable architectures for scalable imagerecognition. In
Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , June 2018., June 2018.