Enabling Large Neural Networks on Tiny Microcontrollers with Swapping
Hongyu Miao
Purdue ECE
Felix Xiaozhu Lin
University of Virginia
Abstract
Running neural networks (NNs) on microcontroller units (MCUs) is becoming increasingly important, but is very difficult due to the tiny memory size of MCUs. Prior work proposes many algorithm-level techniques to reduce NN memory footprints, but all at the cost of sacrificing accuracy and generality, which disqualifies MCUs from many important use cases. We investigate a system solution for MCUs to execute NNs out-of-core: dynamically swapping NN data chunks between an MCU's tiny SRAM and its large, low-cost external flash. Out-of-core NNs on MCUs raise multiple concerns: execution slowdown, storage wear-out, energy consumption, and data security. We present a study showing that none is a showstopper; the key benefit – MCUs being able to run large NNs with full accuracy and generality – outweighs the overheads. Our findings suggest that MCUs can play a much greater role in edge intelligence.
With low cost and energy, MCUs are becoming ubiquitous platforms for neural networks (NNs), a paradigm dubbed tinyML [17]. Running NNs on MCUs, rather than sending raw data off-device, offers multiple advantages, notably tolerating poor networks and preserving data privacy. Use cases include detecting crop disease by classifying photos of leaves in farms [18]; monitoring traffic patterns by analyzing city images; and recognizing human voice commands at home [12, 49].

A top difficulty in tinyML is the MCU's memory limit. On one hand, an MCU often has tens to hundreds of KBs of SRAM as its readable/writeable main memory; it has no more than a few MBs of byte-addressable on-chip flash for read-only code and data (note: on-chip flash is different from external, non-byte-addressable flash such as SD cards, whose capacities reach GBs) [20]. On the other hand, state-of-the-art NNs achieve high accuracy and generality with their large memory footprints [42, 43]. An NN's memory footprint comprises its parameters, which are loaded during NN execution, as well as its feature maps, the intermediate and final results that are loaded and stored. For a state-of-the-art NN, the parameters range from several MBs to more than 100 MBs (even with parameters quantized) [14]; the feature maps can be as large as tens of MBs [14]. Although an MCU can process each NN layer in memory before loading the next layer, the per-layer parameters and feature maps can take up to 100 MB (e.g., VGG16 [41]). This exceeds the MCU memory size by up to two orders of magnitude. Such a memory gap is widening, as recent NNs grow larger [44] while commodity MCUs see slow, if any, growth in memory sizes due to cost and resource constraints [7].

Figure 1: Many popular NNs exceed the MCU memory size [25]. (Top-1 accuracy (%) vs. memory footprint (KB) for MobileNets-v2 (2018), ResNet (2016), VGG (2014), EfficientNet (2019), and pruned variants, shown against the MCU SRAM range.)

A popular approach to reducing NN memory footprints is to engineer the NNs themselves. Common techniques include model compression [29, 36, 50], parameter quantization [28], designing tiny NNs from scratch [47], and automating these procedures [37]. As a common tradeoff, however, these techniques give away model accuracy or generality to varying degrees. Unfortunately, in order for an NN to fit into the MCU memory, the NN either becomes substantially inaccurate (e.g., < 60% top-1 accuracy, as shown in Figure 1) or too specialized (e.g., it can only detect a few object classes [46]).

This disqualifies MCUs from key uses where high accuracy or generality is desired while delays can be tolerated: (1) NN inference on slowly changing signals, e.g., monitoring crop health by analyzing hourly photos [18] and traffic patterns by analyzing video frames every 20–30 minutes [46]; (2) profiling NNs on device: occasionally running a full-blown NN to estimate the accuracy of long-running smaller NNs [40]; (3) transfer learning: re-training NNs on MCUs with new data collected from deployment every hour or day [45].
A case for out-of-core NNs
Can an MCU execute NNs that far exceed its physical memory size? A proven wisdom is to swap tiles of NN layers between memory tiers as the NN is being executed [23]. In the case of tinyML, this means splitting one NN layer's working set into a series of data chunks, i.e., tiles, each small enough to fit the MCU memory; loading tiles from external storage (a micro SD card) to memory, computing on them, and writing results back to the storage for subsequent processing. While prior systems have swapped NN tiles between a server's CPU/GPU memories for training [32], applying the idea to MCUs for inference, in particular swapping between a small SRAM and a wimpy SD card, raises multiple concerns: loss of SD card durability, slowdown in NN execution, energy increase, and safety/security of out-of-core NN data.

Key observations
This paper investigates the feasibility of out-of-core NN execution on MCUs.

• Swapping overhead is only pronounced in certain layers. Only on layers with low arithmetic intensity, notably fully connected (FC) layers, is the swapping delay due to IO longer than that of computation; on layers with higher arithmetic intensity, e.g., convolution (Conv), the swapping delay is dwarfed by that of computation. The swapping overhead is further diminished by the MCU's relatively low CPU speed as compared to its IO speed.

• Swapping rate is throttled by computation, which limits the wear rate of SD cards. As a common NN structure, IO-bound layers such as FC are spaced out by compute-bound layers such as Conv. As a result, even with continuous NN executions, IO is only exercised intermittently.

• Read-most swapping IO. While writes of NN feature maps wear SD cards, reads of weights and input feature maps do not [2]. Fortunately, writes constitute only a small fraction of all the swapping IO traffic, as this paper will show.

• Hide swapping delays with parallelism at various granularities. Within a layer, the MCU can exploit tile parallelism by computing on a tile while transferring others to/from the storage. Between consecutive NN executions, such as on a sequence of video frames, the MCU can further exploit pipeline parallelism by overlapping the swapping IO for an earlier frame with the computation of a later frame.

• Modern MCU hardware. Recent SD cards already over-provision durability at low cost; e.g., a 64 GB SD card can last more than 10 years with 100 GB of daily writes (Section 3.4). As such, the MCU can trade the surplus durability as a system resource for accommodating large NNs. Modern MCUs incorporate rich specialized hardware, e.g., for DMA, hash, and crypto, which facilitates fast and secure data swapping.

• IO adds marginal energy to an already busy MCU. With an MCU already busy on computation, most of its hardware components are in high power states. Further activating the SD card for swapping increases the system energy only moderately.
Quantitative findings
We studied a diverse set of NNs – MobileNets [31], AlexNet [34], and VGG16 [41] – on a Cortex-M7 MCU with 128 KB of SRAM. Our findings are:

• Low to modest speed overhead. NNs dominated by compute-bound layers see negligible swapping overhead, both in per-frame delay and in throughput. Compared to running AlexNet on an "ideal" MCU with infinite memory, running it out-of-core with 128 KB of memory sees only 3.3% longer delay and almost identical throughput. NNs with more IO-bound layers, such as MobileNet, see a moderate delay increase (24.2%) but insignificant loss in throughput (2.5%), thanks to tile and pipeline parallelism.

• Low durability loss. Even with an MCU executing NNs continuously, the write traffic due to swapping is no more than a few hundred GBs per day, comparable to the SD card writes on a commodity surveillance camera. A 64 GB SD card can sustain such a write rate for 7.5 years before half of its cells are worn out.

• Modest increase in energy consumption. Our worst-case estimation shows that swapping increases system energy by less than 42% compared to running NNs with infinite memory.

• Out-of-core data can be secured with known mechanisms such as encryption and hash-based integrity protection. Specialized hardware on MCUs further reduces their overhead.
Contributions
This paper contributes:

• the first study of applying swapping to NNs on MCUs;

• an analysis of the IO behaviors of NN layers under swapping, characterizing performance, storage durability, energy, and data security, with new insights on extracting parallelism to hide swapping delays;

• a finding that an MCU of less than ten dollars with hundreds of KBs of SRAM can execute large NNs such as VGG16, expanding the scope of tinyML significantly.
MCU hardware
We assume the following hardware components: (1) a CPU with a clock rate from tens of MHz to a few hundred MHz, as exemplified by Arm Cortex-M3 and M7; (2) on-chip SRAM: from tens of KBs to several hundreds of KBs; (3) on-chip NOR flash: byte-addressable, read-only memory of no more than a few MBs; (4) cheap external storage, e.g., a micro SD card ranging from tens of GBs to a few hundred GBs; (5) a DMA engine for moving data between SRAM and external storage without CPU involvement; (6) optionally, on-chip accelerators for computing crypto and hash functions. Major vendors ship numerous MCU models meeting the above conditions. Examples include the STM32 MCU family from STMicroelectronics [10] and the LPC series from NXP Semiconductors [6]. They are priced at $1–$20 per unit.
NN workloads & metrics
We motivate our study by considering periodic NN inference on video/audio data, as a sequence of frames captured by MCUs at run time. To characterize inference speed, we consider both the inference delay of each frame and the throughput, i.e., the number of frames processed per second. MCU applications may be sensitive to either metric or both. For instance, keyword spotting is sensitive to inference delays [49], while car counting benefits from high throughput [46].
Figure 2: An example of out-of-core NN execution, showing Conv (compute-bound) and FC (IO-bound) layers. The timeline illustrates tile parallelism within a layer and pipeline parallelism across frames, e.g., reading/writing frame 0's FC tiles while computing on frame 1's Conv layers.
                                 MobileNet   AlexNet   VGG16
Number of compute-bound layers       14          5        13
Number of IO-bound layers            13          3         2
Size of feature maps (MB)            10          1        15
Size of weight parameters (MB)        4         62       138
Memory footprint (MB)                14         63       153

Table 1: A set of three NNs studied in this paper.
Out-of-core NN executions
We consider the following swapping strategy. An NN's parameters are pre-stored on the external flash. Given an input frame, the MCU executes the NN's layers in sequence. It processes a layer in tiles in case the layer's memory footprint exceeds the MCU's main memory: the MCU loads into main memory a tile of parameters and a tile of input feature maps, computes a tile of output feature maps in memory, and writes the output back to the external flash. Altogether, the input and output tiles must fit in the main memory simultaneously.

As shown in Figure 2, the MCU extracts CPU/IO parallelism to hide IO delays. (1) Tile parallelism within an NN layer: while computing an output tile Tile0, the MCU can pre-load from flash the input tiles for computing the next output tile Tile1; while writing the completed Tile0 back to flash, the MCU can compute Tile1 simultaneously. (2) Layer parallelism: in a similar fashion, the MCU can execute an earlier layer's computation simultaneously with a later layer's IO. (3) Pipeline parallelism across data frames: the MCU can execute compute-bound and IO-bound layers for different frames in parallel, as these layers exercise complementary resources, namely CPU and IO bandwidth. As shown in Figure 2, the MCU swaps frame 0's FC layer while computing on frame 1's Conv layer.
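To make the tile loop concrete, below is a minimal C sketch of out-of-core execution of one layer with tile parallelism. It is not the paper's implementation: sd_read_async/sd_write_async (non-blocking, DMA-backed SD transfers), dma_wait (block until outstanding transfers complete), and compute_tile (a per-tile kernel, e.g., wrapping a CMSIS-NN call) are hypothetical helpers, and the tile size is chosen so that three buffers fit a 128 KB SRAM.

#include <stdint.h>
#include <stddef.h>

#define TILE_BYTES   (40 * 1024)   /* 3 buffers x 40 KB fit a 128 KB SRAM */
#define SECTOR_BYTES 512
#define TILE_SECTORS (TILE_BYTES / SECTOR_BYTES)

/* Hypothetical DMA-backed SD-card helpers (vendor HALs differ). */
extern void sd_read_async(uint32_t lba, void *buf, size_t len);
extern void sd_write_async(uint32_t lba, const void *buf, size_t len);
extern void dma_wait(void);   /* block until outstanding transfers finish */

/* Hypothetical per-tile kernel, e.g., wrapping a CMSIS-NN conv call. */
extern void compute_tile(const uint8_t *in, uint8_t *out);

static uint8_t in_a[TILE_BYTES], in_b[TILE_BYTES]; /* double input buffers */
static uint8_t out_buf[TILE_BYTES];

/* Execute one layer out-of-core, overlapping tile IO with tile compute. */
void run_layer(uint32_t in_lba, uint32_t out_lba, int n_tiles)
{
    uint8_t *cur = in_a, *next = in_b;

    sd_read_async(in_lba, cur, TILE_BYTES);            /* prefetch tile 0 */
    for (int t = 0; t < n_tiles; t++) {
        dma_wait();            /* tile t loaded; prior write-back drained */
        if (t + 1 < n_tiles)   /* prefetch tile t+1 ...                   */
            sd_read_async(in_lba + (uint32_t)(t + 1) * TILE_SECTORS,
                          next, TILE_BYTES);
        compute_tile(cur, out_buf);                /* ... while computing */
        sd_write_async(out_lba + (uint32_t)t * TILE_SECTORS,
                       out_buf, TILE_BYTES);       /* write back tile t   */
        uint8_t *tmp = cur; cur = next; next = tmp; /* flip double buffers */
    }
    dma_wait();                                    /* drain the last write */
}

The dma_wait at the top of each iteration doubles as the write-back drain, so out_buf is never overwritten while a transfer is still in flight.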
We study three representative NNs whose memory footprints range from several MB to over a hundred MB (with quantization). As shown in Table 1, MobileNet has large feature maps but small weight parameters; AlexNet has small feature maps but large weight parameters; and VGG16 has a memory footprint up to 1000× larger than MCUs' SRAM sizes.

Layer          Compute (MOps)   IO traffic (MB)
block1_conv2        1849.69           6.46
block1_pool            3.21           4.01
block3_conv3        1849.69           2.19
block4_pool            0.40           0.50
block5_conv1         462.42           2.56
fc1                  102.76         102.79
fc2                   16.78          16.79

Table 2: Compute and IO traffic of representative VGG16 layers, which determine the normalized arithmetic intensity (N) under MCUs' common speed range (64–480 MOPS) and IO bandwidth range (10–40 MB/s).

To study the swapping overhead, we focus on a layer's swapping delay relative to its computation delay on typical MCUs. The rationale is that, as the MCU can perform swapping and computation in parallel, the longer of the two delays will be the layer's bottleneck.

In general, arithmetic intensity, as commonly used in HPC [8], characterizes a workload's compute/IO ratio. It is defined as W/Q, where Q is the amount of data to move in the memory hierarchy and W is the amount of arithmetic operations on the data. By factoring in an MCU's CPU speed (S_CPU) and IO bandwidth (S_IO), we define N = (W/S_CPU) / (Q/S_IO) as the normalized arithmetic intensity on an MCU. For a given layer, N > 1 means swapping incurs less delay than computation, i.e., a compute-bound layer; N < 1 means swapping incurs longer delay, i.e., an IO-bound layer.

On modern MCUs with simple CPU cores, S_CPU is primarily determined by the CPU clock rate; it ranges from 64 MOPS to 480 MOPS [13, 16]. S_IO is jointly determined by the MCU's DMA bandwidth and the SD card bandwidth, ranging from 10 MB/s to 40 MB/s as reported in the literature [5]. With these values, common NN layers fall into three distinct categories per their normalized arithmetic intensity (N).

(1) A majority of compute-bound layers (N >> 1). Notable examples are Conv layers, known for their high complexity. In the example of VGG16 (Table 2), N for the Conv layers far exceeds 1 even with a high CPU clock rate and slow IO. They often dominate an NN's execution time (51%–90%), as exemplified by the three NNs in Figure 3. On these layers, the computation delay overshadows the IO delay.

(2) Some IO-bound layers (N < 1). Examples include fully connected (FC) and depth-wise convolutional (DW) layers. These layers perform light computation over large volumes of feature maps and weight parameters. Of all layers in an NN, they are often minorities (e.g., 2 out of 21 in VGG16). With out-of-core execution, the IO delay exceeds the computation delay by up to 10× (e.g., fc1 in Table 2 and Figure 3b).

(3) Other layers with insignificant overheads, e.g., ReLU and max-pooling. These layers have low complexity and contribute a tiny fraction of the data to move and to compute (0.3%–0.9%) for an NN. As such, their swapping overhead is insignificant.
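As a sanity check of these categories, the snippet below (ours, not the paper's) evaluates N for two Table 2 layers at the corner that minimizes N: the fastest CPU (480 MOPS) paired with the slowest IO (10 MB/s).

#include <stdio.h>

/* Normalized arithmetic intensity N = (W / S_cpu) / (Q / S_io), following
 * the definition above. N > 1: compute-bound; N < 1: IO-bound. */
static double normalized_intensity(double w_mops, double q_mb,
                                   double s_cpu_mops, double s_io_mbps)
{
    double t_compute = w_mops / s_cpu_mops; /* seconds spent computing */
    double t_io      = q_mb / s_io_mbps;    /* seconds spent swapping  */
    return t_compute / t_io;
}

int main(void)
{
    /* Corner that minimizes N: fast CPU (480 MOPS), slow IO (10 MB/s). */
    printf("block1_conv2: N = %.1f\n",            /* ~6.0: compute-bound */
           normalized_intensity(1849.69, 6.46, 480.0, 10.0));
    printf("fc1:          N = %.3f\n",            /* ~0.021: IO-bound    */
           normalized_intensity(102.76, 102.79, 480.0, 10.0));
    return 0;
}

Even at this adversarial corner, the Conv layer stays firmly compute-bound while fc1 is deeply IO-bound, matching the categories above.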
Figure 3: Compute and IO delay of NN layers. MCU: ARM Cortex-M7 @ 216 MHz, memory buffer for each layer: 128 KB.a tiny fraction of data to move and to compute (0.3%-0.9%)for an NN. As such, their swapping overhead is insignificant.
Most NNs, with a small fraction of IO-bound layers, see negligible delay increase; NNs with more IO-bound layers see modest delay increase.
Within a compute-bound layer, the MCU can execute IO and computation for consecutive tiles simultaneously (as these tiles are independent), completely hiding the IO delay behind the much longer computation delay. Within an IO-bound layer, IO and computation for consecutive tiles can also proceed simultaneously, but the long IO delay cannot be fully hidden behind the relatively shorter computation delay. For the remaining layers, e.g., ReLU and pooling, both the IO and compute delays are insignificant.

As such, the delay increase of an NN due to swapping is mainly determined by the ratio of the IO-bound layers' unhidden IO delay to all layers' total computation delay. The increase is negligible for NNs with few IO-bound layers: as Table 1 shows for VGG16 and AlexNet, only 2 and 3 of their layers are IO-bound, leading to delay increases of only 3.3% and 3.6%. The increase is modest for NNs with more IO-bound layers: 13 of MobileNet's layers are IO-bound (Table 1), leading to a 24.2% delay increase. Overall, the increased delay due to swapping is negligible for most NNs and modest for some special NNs.
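Under the assumption that per-layer IO and compute overlap perfectly, the above estimate can be stated as (our formalization, in the notation used earlier):

slowdown ≈ (Σ_l max(T_comp^l, T_IO^l) − Σ_l T_comp^l) / Σ_l T_comp^l = Σ_{IO-bound l} (T_IO^l − T_comp^l) / Σ_l T_comp^l,

where T_comp^l and T_IO^l are layer l's computation and swapping delays; only IO-bound layers (those with T_IO^l > T_comp^l) contribute to the numerator.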
NNs see negligible throughput loss.
NNs with negligible delay increase also see negligible throughput loss when processing a stream of frames. For NNs with higher delay increases, fortunately, the MCU can reduce the throughput loss by exploiting pipeline parallelism across frames.

A common pattern in an NN is one or more compute-bound layers followed by one or more IO-bound layers, i.e., a pipeline with interleaved compute-bound and IO-bound stages. For instance, in AlexNet (Figure 3a), conv1–5 (a compute-bound stage) is followed by fc6–8 (an IO-bound stage). When executing an NN on a sequence of frames, the MCU can overlap the IO-bound and compute-bound stages of adjacent frames, hence hiding the IO delays that cannot be hidden at the layer/tile level within each frame. As shown in Figure 4, the MCU can swap for frame 0's FC layers while computing frame 1's Conv layers, leading to high MCU/IO utilization and throughput.

The throughput loss for a pair of compute-bound and IO-bound stages is zero if their compute delay is longer than their IO delay. As shown in Figure 3: (1) both AlexNet and VGG have one compute-bound stage followed by one IO-bound stage, and their compute delay is much longer than their IO delay (AlexNet: 20 s vs. 12 s; VGG: 602 s vs. 55 s), so swapping causes no throughput loss for them. (2) MobileNet has 13 pairs of compute/IO-bound layers. Only 2 of the 13 pairs (dw/pw-1/2) and two layers (conv1 and preds) suffer throughput loss (1.4%–93%), because their IO delay is longer than their compute delay, leading to a 2.5% overall throughput loss for MobileNet. Overall, the throughput loss due to swapping is negligible for NNs.
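A minimal C sketch of this cross-frame pipeline follows, assuming the NN splits into one compute-bound stage (e.g., conv1–5) and one IO-bound stage (e.g., fc6–8) whose IO is driven by DMA. The stage functions are hypothetical placeholders, not the paper's code.

/* Hypothetical stage functions for one NN split into two stages. */
extern void run_compute_stage(int frame);  /* conv layers: CPU-heavy        */
extern void start_io_stage(int frame);     /* fc layers: launch DMA IO      */
extern void finish_io_stage(int frame);    /* small FC matmuls + wait on IO */

/* Overlap frame f's FC swapping with frame f+1's conv compute. */
void run_pipeline(int n_frames)
{
    for (int f = 0; f < n_frames; f++) {
        run_compute_stage(f);       /* frame f convs run on the CPU...      */
        if (f > 0)
            finish_io_stage(f - 1); /* ...while DMA swapped frame f-1's FC  */
        start_io_stage(f);          /* now launch frame f's FC swapping     */
    }
    finish_io_stage(n_frames - 1);  /* drain the pipeline                   */
}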
The SD card sees negligible durability loss, and its lifetime could be years or tens of years with swapping.
The amount of data written to the SD card per frame is not large because NN layers are read-most, and the write frequency is low due to the long execution time on a slow MCU.
Modest write rate
For a given NN and SRAM size, the amount of data written to the SD card is determined by the frame rate (the reciprocal of the per-frame delay) and the amount of data written per frame (upper-bounded by the total size of all layers' output feature maps). The two are negatively correlated: (1) for large NNs, the frame rate is low but the amount of data written per frame is large; (2) for small NNs, the frame rate is high but the amount of data written per frame is small. Therefore, whether an NN is large or small, the data written per day is not large. For instance, swapping writes only 2.0/2.8 GB per day for VGG16/AlexNet. Even in the extreme case of MobileNet, which has a high frame rate and relatively large feature maps to write, swapping writes 123 GB per day.
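As a back-of-the-envelope formula (our notation, not the paper's): writes/day = B_frame × 86400 / T_frame, where B_frame is the data written back per frame (at most the total size of all layers' output feature maps) and T_frame is the per-frame delay in seconds. For MobileNet, if close to its 10 MB of feature maps (Table 1) is written per frame, 123 GB/day corresponds to roughly 12,300 frames per day, i.e., one frame every ~7 seconds of continuous execution.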
SD card has long lifetime even with swapping
An SD card is built of many cells, which have limited write cycles [4]. As capacities become larger [3], the durability budget keeps increasing. One study [9] wrote 24/7, as fast as possible, to 40 4-GB SD cards; 1, 20, and 40 of the 40 cards observed their first failures after 6.5 TB, 9 TB, and 12.5 TB of writes, respectively. Based on these results, the first cell on a 64 GB SD card is expected to fail only after running MobileNet, AlexNet, and VGG16 for 2.4–4.5, 104–200, and 145–280 years, respectively; 50% of the cells fail (at 10K write cycles per cell [1, 21]) only after running them for 7.5, 328, and 460 years.

Figure 4: AlexNet: tile parallelism within each layer for low delay, plus pipeline parallelism across frames for high throughput; frame 1's conv1–5 compute overlaps frame 0's fc6–8 IO.
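These lifetime figures can be checked with a simple wear estimate (our arithmetic, assuming ideal wear leveling and 10K cycles per cell): wearing out 50% of a 64 GB card's cells takes 64 GB × 10,000 × 0.5 = 320 TB of writes. At MobileNet's 123 GB/day, that is 320,000 GB / (123 GB/day) ≈ 2,600 days ≈ 7.1 years, in line with the 7.5 years above; at AlexNet's 2.8 GB/day and VGG16's 2.0 GB/day, the same budget lasts roughly 313 and 438 years, matching the 328 and 460 years above.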
Swapping adds modest energy consumption to an already busy MCU.
We estimate the worst-case energy overhead due to swapping. Our test platform is an STM32F746NG-Discovery board (ARM Cortex-M7 at 216 MHz; 340 KB SRAM) with an external power meter [22]. We run two benchmarks. (1) in-core emulates NN execution with an infinite amount of memory: it runs NN compute [35] for 1000 iterations. (2) out-of-core emulates NN execution with the most intensive IO traffic in parallel with the compute: it executes the same amount of compute with an IO thread repeatedly flushing data blocks to the SD card. Each data block is 100 KB (close to the tile size); the flush is asynchronous, using the MCU's DMA engine.

Our measurements show that the additional IO workload increases the system energy by 42%, from 0.07 Wh (in-core) to 0.10 Wh (out-of-core); the total execution time grows from 178 s to 213 s. Our observations are: (1) The actual energy overhead of out-of-core NNs is likely much lower: while the out-of-core benchmark keeps IO always busy, actual out-of-core NNs exercise IO intermittently (§3.1), because most NN layers are likely compute-bound. (2) We attribute the modest energy overhead to the incremental nature of system energy: when an MCU-based device is already busy executing compute, its most power-hungry hardware – cores, interconnect, SRAM, and regulators – is already activated; executing IO, which additionally activates the SD card and the MMC controller, adds to the energy but not much.
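These numbers decompose cleanly into power and time (our arithmetic from the measurements above): the average power is 0.07 Wh / 178 s ≈ 1.4 W in-core versus 0.10 Wh / 213 s ≈ 1.7 W out-of-core. Swapping thus raises average power by about 19% and execution time by about 20%; compounded, the two yield the 42% energy increase.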
Out-of-core NN data can be secured with known mechanisms at modest overhead.

Compared to storing NN data in on-chip SRAM, (temporarily) storing it off-chip is more vulnerable to physical attacks [15]: adversaries may learn or corrupt the data by tapping into the IO bus between the MCU and the SD card, or into the SD card itself. Fortunately, by encrypting NN data before swapping it out, the MCU can ensure the data's confidentiality and integrity; the overhead is linear in the data amount. Hardware crypto, such as for AES [19, 39], is already common on modern MCUs. Its computation overhead is comparable to (or even less than) the least intensive NN compute (e.g., FC layers).

Compared to SRAM, SD cards are less durable. Yet, it is known that an SD card rarely fails as a whole; rather, it sees a gradually increasing number of corrupted cells over time [11]. Cell corruption is often silent, i.e., a read simply returns a value different from what was last written. Fortunately, the MCU can detect such failures with hash-based integrity checking. With specialized hardware on MCUs, computing hashes is no more expensive than the least intensive NN compute [19]. Upon detecting bad cells, the MCU can recompute the most recent NN layer and thus recover the corrupted out-of-core data. A sketch of such a secure swap path is shown below.
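The C sketch below combines AES encryption before write-out with hash-based integrity checking on read-in. The hw_aes_ctr and hw_sha256 wrappers stand in for an MCU's crypto/hash accelerators (vendor HALs, e.g., STM32's CRYP/HASH peripherals, differ in detail), and sd_read/sd_write are blocking SD helpers; none of this is the paper's implementation.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_TILES 64

/* Hypothetical wrappers over the MCU's crypto/hash accelerators. The
 * caller must supply a fresh counter per tile for CTR-mode security. */
extern void hw_aes_ctr(uint8_t *buf, size_t len, const uint8_t key[16],
                       const uint8_t ctr[16]);   /* in-place AES-CTR */
extern void hw_sha256(const uint8_t *buf, size_t len, uint8_t digest[32]);
extern void sd_write(uint32_t lba, const void *buf, size_t len);
extern void sd_read(uint32_t lba, void *buf, size_t len);

static uint8_t digests[MAX_TILES][32];  /* per-tile digests stay in SRAM */

void swap_out(int tile, uint32_t lba, uint8_t *buf, size_t len,
              const uint8_t key[16], const uint8_t ctr[16])
{
    hw_aes_ctr(buf, len, key, ctr);      /* confidentiality                 */
    hw_sha256(buf, len, digests[tile]);  /* record digest of the ciphertext */
    sd_write(lba, buf, len);
}

/* Returns 0 on success; -1 if the tile was silently corrupted on the card,
 * in which case the caller recomputes the most recent NN layer. */
int swap_in(int tile, uint32_t lba, uint8_t *buf, size_t len,
            const uint8_t key[16], const uint8_t ctr[16])
{
    uint8_t d[32];
    sd_read(lba, buf, len);
    hw_sha256(buf, len, d);              /* detect silent cell corruption   */
    if (memcmp(d, digests[tile], 32) != 0)
        return -1;
    hw_aes_ctr(buf, len, key, ctr);      /* CTR decryption = encryption     */
    return 0;
}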
Implications for model compression
Our solution boosts design freedom in tinyML, where the memory limit was considered a primary motivation for model compression. With that limit removed, developers now have the choice of running large NNs without compression, retaining full model accuracy. Even when model compression is warranted, e.g., for faster NN execution, developers now have a wider selection of baseline NNs, including ones with memory footprints orders of magnitude larger than MCU memory.
Relation to prior work
Prior work enables out-of-core NN training with large batches on GPU/CPU memory systems [30, 32, 33, 38, 48], but it does not address the unique challenge on MCUs that even a single layer exceeds main memory during NN inference; nor has it been studied how swapping affects SD card lifetime, execution slowdown, energy consumption, and data security. Our study answers these questions and shows that swapping is feasible without much overhead.

TensorFlow Lite Micro [24] is a framework for running NN inference on embedded devices. CMSIS-NN [35] provides optimized NN kernels for ARM Cortex-M MCUs. SONIC [27] supports intermittent computing for NN inference on MCUs. However, none of them supports out-of-core NN inference on MCUs, as our swapping solution does. Our solution complements these systems and can be integrated into them easily.

Much prior work accelerates NN inference and training with parallel optimizations. IOS [26] exploits inter-operator scheduling to execute independent operators of multi-branch NNs in parallel on GPUs. However, IOS cannot benefit single-branch NNs, which are common, e.g., the MobileNet, AlexNet, and VGG we study in this paper. Furthermore, IOS assumes everything fits in GPU memory with no IO traffic, so it does not exploit the pipeline parallelism of overlapping IO-bound and compute-bound stages across consecutive frames, as we do for out-of-core swapping on MCUs.
Conclusions
This paper advocates enabling large NNs on tiny MCUs, without losing accuracy, by swapping data to an SD card. Our study shows that none of the concerns – SD card durability loss, execution slowdown, energy consumption, or data security – is a showstopper. We find that an MCU with hundreds of KBs of SRAM can execute NNs with memory footprints of a few hundred MBs (a 1000× gap). Out-of-core execution expands the scope of NN applications on MCUs.

References

[1] Everything you need to know about SLC, MLC, and TLC NAND flash. https://www.mydigitaldiscount.com/everything-you-need-to-know-about-slc-mlc-and-tlc-nand-flash.html.
[2] Flash memory. https://en.wikipedia.org/wiki/Flash_memory.
[3] History and evolution of memory cards. https://koofr.eu/blog/posts/history-and-evolution-of-memory-cards.
[4] Kingston flash memory guide. https://media.kingston.com/pdfs/MKF_283..pdf.
[5] microSD card benchmarks. https://www.pidramble.com/wiki/benchmarks/microsd-cards.
[6] NXP general purpose microcontrollers. https://www.nxp.com/products/processors-and-microcontrollers/arm-microcontrollers/general-purpose-mcus:GENERAL-PURPOSE-MCUS.
[7] The role of SRAMs in nextgen IoT and wearable embedded designs. https://www.embedded.com/the-role-of-srams-in-nextgen-iot-and-wearable-embedded-designs/.
[8] Roofline model. https://en.wikipedia.org/wiki/Roofline_model.
[9] SD card testing. https://support.embeddedarm.com/support/solutions/articles/22000202866-sd-card-testing.
[10] STM32 32-bit Arm Cortex MCUs. https://www.st.com/en/microcontrollers-microprocessors/stm32-32-bit-arm-cortex-mcus.html.
[11] Reliable SD-based block storage. https://support.embeddedarm.com/support/solutions/articles/22000202867-reliable-sd-based-block-storage, 2017.
[12] Amazon Echo. https://en.wikipedia.org/wiki/Amazon_Echo, 2020.
[13] Arm Cortex-M. https://en.wikipedia.org/wiki/ARM_Cortex-M, 2020.
[14] Estimating memory consumption of various convolutional neural networks. https://github.com/albanie/convnet-burden, 2020.
[15] The exploration and exploitation of an SD memory card. http://bunniefoo.com/bunnie/sdcard-30c3-pub.pdf, 2020.
[16] Floating point operations per second. https://en.wikipedia.org/wiki/FLOPS, 2020.
[17] An introduction to tinyML. https://towardsdatascience.com/an-introduction-to-tinyml-4617f314aa79, 2020.
[18] Nuru AI expansion: Supporting farmers to diagnose crop diseases. https://blog.plantwise.org/2020/03/13/nuru-ai-expansion-supporting-farmers-to-diagnose-crop-diseases/, 2020.
[19] Performance of state-of-the-art cryptography on ARM-based microprocessors. https://csrc.nist.gov/csrc/media/events/lightweight-cryptography-workshop-2015/documents/presentations/session7-vincent.pdf, 2020.
[20] STMicroelectronics STM32 family. https://en.wikipedia.org/wiki/STM32, 2020.
[21] Transcend Industrial Temp microSD 64 GB. https://cdn.transcend-info.com/products/images/modelpic/574/EN_USDC10I_PS_2020.pdf, 2020.
[22] USB C power meter tester. https://www.amazon.com/gp/product/B07X3HST7V/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=, 2020.
[23] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. Fused-layer CNN accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 22. IEEE Press, 2016.
[24] Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Shlomi Regev, et al. TensorFlow Lite Micro: Embedded machine learning on TinyML systems. arXiv preprint arXiv:2010.08678, 2020.
[25] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? In MLSys, 2020.
[26] Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, and Song Han. IOS: Inter-operator scheduler for CNN acceleration. arXiv preprint arXiv:2011.01302, 2020.
[27] Graham Gobieski, Brandon Lucia, and Nathan Beckmann. Intelligence beyond the edge: Inference on intermittent embedded systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 199–213. ACM, 2019.
[28] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
[29] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[30] Akio Hayakawa and Takuya Narihira. Out-of-core training for extremely large-scale neural networks with adaptive window-based scheduling. arXiv preprint arXiv:2010.14109, 2020.
[31] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[32] Chien-Chin Huang, Gu Jin, and Jinyang Li. SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1341–1355, 2020.
[33] Tian Jin and Seokin Hong. Split-CNN: Splitting window-based operations in convolutional neural networks for memory system optimization. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 835–847, 2019.
[34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[35] Liangzhen Lai, Naveen Suda, and Vikas Chandra. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601, 2018.
[36] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710, 2016.
[37] Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. MCUNet: Tiny deep learning on IoT devices. arXiv preprint arXiv:2007.10319, 2020.
[38] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1–13. IEEE, 2016.
[39] Peter Schwabe and Ko Stoffelen. All the AES you need on Cortex-M3 and M4. In International Conference on Selected Areas in Cryptography, pages 180–194. Springer, 2016.
[40] Haichen Shen, Seungyeop Han, Matthai Philipose, and Arvind Krishnamurthy. Fast video classification via adaptive cascading of deep models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3646–3654, 2017.
[41] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[42] Kevin Siu, Dylan Malone Stuart, Mostafa Mahmoud, and Andreas Moshovos. Memory requirements for convolutional neural network hardware accelerators. In IEEE International Symposium on Workload Characterization (IISWC), pages 111–121. IEEE, 2018.
[43] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[44] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[45] Mengwei Xu, Feng Qian, Qiaozhu Mei, Kang Huang, and Xuanzhe Liu. DeepType: On-device deep learning for input personalization service with minimal privacy concern. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(4):1–26, 2018.
[46] Mengwei Xu, Xiwen Zhang, Yunxin Liu, Gang Huang, Xuanzhe Liu, and Felix Xiaozhu Lin. Approximate query service on autonomous IoT cameras. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services, pages 191–205, 2020.
[47] Haojin Yang, Martin Fritzsche, Christian Bartz, and Christoph Meinel. BMXNet: An open-source binary neural network implementation based on MXNet. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1209–1212. ACM, 2017.
[48] Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, et al. Dynamic control flow in large-scale machine learning. In Proceedings of the Thirteenth EuroSys Conference, pages 1–15, 2018.
[49] Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. Hello Edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017.
[50] Michael Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.