Benchmarking TinyML Systems: Challenges and Direction
Colby R. Banbury, Vijay Janapa Reddi, Max Lam, William Fu, Amin Fazel, Jeremy Holleman, Xinyuan Huang, Robert Hurtado, David Kanter, Anton Lokhmotov, David Patterson, Danilo Pau, Jae-sun Seo, Jeff Sieracki, Urmish Thakker, Marian Verhelst, Poonam Yadav
Harvard University; Samsung Semiconductor, Inc.; Syntiant; University of North Carolina, Charlotte; Cisco Systems; California State Polytechnic University, Pomona; Real World Insights; dividiti; University of California, Berkeley; Google; STMicroelectronics, Italy; Arizona State University; Reality AI; Arm ML Research Lab; KU Leuven; Interuniversity Microelectronics Centre (IMEC); University of York. Correspondence to: Colby R. Banbury <[email protected]>.

Copyright 2020 by the authors.

Abstract
Recent advancements in ultra-low-power machine learning (TinyML) hardware promise to unlock an entirely new class of smart applications. However, continued progress is limited by the lack of a widely accepted benchmark for these systems. Benchmarking allows us to measure, and thereby systematically compare, evaluate, and improve, the performance of systems and is therefore fundamental to a field reaching maturity. In this position paper, we present the current landscape of TinyML and discuss the challenges and direction towards developing a fair and useful hardware benchmark for TinyML workloads. Furthermore, we present our four benchmarks and discuss our selection methodology. Our viewpoints reflect the collective thoughts of the TinyMLPerf working group, which is comprised of over 30 organizations.
1 Introduction
Machine learning (ML) inference on the edge is an increasingly attractive prospect due to its potential for increasing the energy efficiency (Fedorov et al., 2019), privacy, responsiveness (Zhang et al., 2017), and autonomy of edge devices. Thus far, the field of edge ML has predominantly focused on mobile inference, which has led to numerous advancements in machine learning models, such as exploiting pruning, sparsity, and quantization. But in recent years, there have been major strides in expanding the scope of edge systems. Interest is brewing in both academia (Fedorov et al., 2019; Zhang et al., 2017) and industry (Flamand et al., 2018; Warden, 2018a) towards expanding the scope of edge ML to microcontroller-class devices.

The goal of "TinyML" (tinyML Foundation, 2019) is to bring ML inference to ultra-low-power devices, typically under a milliwatt, and thereby break the traditional power barrier preventing widely distributed machine intelligence. By performing inference on-device, and near-sensor, TinyML enables greater responsiveness and privacy while avoiding the energy cost associated with wireless communication, which at this scale is far higher than that of compute (Warden, 2018b). Furthermore, the efficiency of TinyML enables a class of smart, battery-powered, always-on applications that can revolutionize the real-time collection and processing of data. This emerging field, which is the culmination of many innovations, is poised to only further accelerate its growth in the coming years.

To unlock the full potential of the field, hardware-software co-design is required. Specifically, TinyML models must be small enough to fit within the tight constraints of MCU-class devices (e.g., a few hundred kB of memory and limited onboard compute horsepower on the order of MHz processor clock speeds), thus limiting the size of the input and the number of layers (Zhang et al., 2017) or necessitating the use of lightweight, non-neural-network-based techniques (Kumar et al., 2017). TinyML tools are broadly defined as anything that enables the design, mapping, and deployment of TinyML algorithms, including aggressive quantization techniques (Wang et al., 2019), memory-aware neural architecture searches (Fedorov et al., 2019), frameworks (TensorFlow), and efficient inference libraries (Lai et al., 2018; Garofalo et al., 2019). Efforts in TinyML hardware include improving inference on the next generation of general-purpose MCUs (arm; Flamand et al., 2018), developing hardware specialized for low-power inference, and creating novel architectures intended only as inference engines for specific tasks (Moons et al., 2018).

The complexity and dynamism of the field obscure the measurement of progress and make design decisions intractable. To enable continued innovation, a fair and reliable method of comparison is needed. Since progress is often the result of increased hardware capability,
a reliable TinyML hardware benchmark is required.

In this paper, we discuss the challenges and opportunities associated with the development of a TinyML hardware benchmark. Our short paper is a call to action for establishing a common benchmark for TinyML workloads on emerging TinyML hardware to foster the development of TinyML applications. The points presented here reflect the ongoing effort of the TinyMLPerf working group, which is currently comprised of over 30 organizations and 75 members.

The rest of the paper is organized as follows. In Section 2, we discuss the application landscape of TinyML, including the existing use cases, models, and datasets. In Section 3, we describe the existing TinyML hardware solutions, including outlining improvements to general-purpose MCUs and the development of novel architectures. In Section 4, we discuss the inherent challenges of the field and how they complicate the development of a benchmark. In Section 5, we describe the existing benchmarks that relate to TinyML and identify the deficiencies that still need to be filled. In Section 6, we discuss the progress of the TinyMLPerf working group thus far and describe the four benchmarks. In Section 7, we conclude the paper and discuss future work.

2 Tiny Use Cases, Models & Datasets
In this section we attempt to summarize the field of TinyML by describing a set of representative use cases (Section 2.1), their relevant datasets (Section 2.2), and the model architectures commonly applied to these specific use cases (Section 2.3).

2.1 Use Cases
Despite the general lack of maturity within the field, there are a number of well-established TinyML use cases. We categorize the application landscape of TinyML by input type in Table 1, which in the context of TinyML systems plays a crucial role in the use case definition.

Audio wake words is already a fairly ubiquitous example of always-on ML inference. It is generally a speech classification problem that achieves very low-power inference by limiting the label space, often to two labels: "wake word" and "not wake word" (Zhang et al., 2017). Anomaly detection and predictive maintenance are commonly deployed on MCUs in factory settings, where audio, motor bearing, or IMU data can be used to detect faults in products or equipment.

Other deployed TinyML applications, like activity recognition from IMU data (Hassan et al., 2018), rely on low feature dimensionality to fit within the tight constraints of the platforms. Some use cases have been proven viable but have yet to reach end users because they are too new, like visual wake words (Chowdhery et al., 2019).

Many traditional ML use cases can be considered futuristic TinyML tasks. As ultra-low-power inference hardware continues to improve, the threshold of viability expands. Tasks like large-label-space image classification or object counting are well suited for low-power, always-on applications but are currently too compute- and memory-hungry for today's TinyML hardware.

Furthermore, TinyML has a significant role to play in future technology. For example, many of the fundamental features of augmented reality (AR) glasses are always-on and battery-powered. Due to tight real-time constraints, these devices cannot afford the latency of offloading computation to the cloud, an edge server, or even an accompanying mobile device. Thus, due to shared constraints, AR applications can benefit significantly from progress in the field of TinyML.
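To make the wake-word formulation concrete, here is a minimal sketch, assuming TensorFlow/Keras, of the depthwise-separable CNN (DS-CNN) style of model cited above (Zhang et al., 2017). The 49x10 MFCC input shape, filter counts, and layer count are illustrative choices, not a published configuration.

```python
# Minimal DS-CNN sketch for a two-label audio wake-word task.
# Shapes and filter counts are illustrative, not a published config.
import tensorflow as tf

inputs = tf.keras.Input(shape=(49, 10, 1))  # e.g., 49 MFCC frames x 10 coefficients
x = tf.keras.layers.Conv2D(64, (10, 4), strides=(2, 2),
                           padding="same", activation="relu")(inputs)
# Depthwise-separable block: per-channel spatial filter, then 1x1 mixing conv.
x = tf.keras.layers.DepthwiseConv2D((3, 3), padding="same", activation="relu")(x)
x = tf.keras.layers.Conv2D(64, (1, 1), activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)  # "wake word" / "not wake word"

model = tf.keras.Model(inputs, outputs)
model.summary()  # roughly 8k parameters, comfortably within a few hundred kB of flash
```

The depthwise-separable structure is what keeps the parameter and multiply counts small enough for MCU-class memory and compute budgets.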
2.2 Datasets

There are a number of open-source datasets that are relevant to TinyML use cases. Table 1 breaks them down by the type of data. Despite the availability of these datasets, the majority of deployed TinyML models are trained on much larger, proprietary datasets. The open-source datasets that are competitively large are not TinyML specific. The lack of large, TinyML-focused, open-source datasets slows the progress of academic research and limits the ability of a benchmark to represent real workloads accurately.
2.3 Models

Table 1 lists common model types for TinyML use cases. Although neural networks (NN) are a dominant force in traditional ML, it is common to use non-NN-based solutions, like decision trees (Kumar et al., 2017), for some TinyML use cases, due to their low compute and memory requirements.

Machine learning on MCU-class devices has only recently become feasible; therefore, the community has yet to produce models that have become as widely accepted as MobileNets have for mobile devices. This makes the task of selecting representative models challenging. However, immaturity also brings opportunity, as our decisions can help direct future progress. Selecting a subset of the currently available models, outlining the rules for quality versus accuracy trade-offs, and prescribing a measurement methodology that can be faithfully reproduced will encourage the community to develop new models, runtimes, and hardware that progressively outperform one another.
Table 1. Survey of TinyML Use Cases, Models, and Datasets

Audio
  Use cases: audio wake words, context recognition, control words, keyword detection
  Model types: DNN, CNN, RNN, LSTM
  Datasets: Speech Commands (Warden, 2018a); AudioSet (Gemmeke et al., 2017); ExtraSensory (Vaizman et al., 2017)

Image
  Use cases: visual wake words, object detection, image classification, gesture recognition, object counting, text recognition
  Model types: DNN, CNN, SVM, decision trees, KNN, linear
  Datasets: Visual Wake Words (Chowdhery et al., 2019); CIFAR10 (Krizhevsky et al., 2009b); MNIST (LeCun & Cortes, 2010); ImageNet (Deng et al., 2009); DVS128 Gesture (Amir et al., 2017)

Physiological/Behavioral Metrics
  Use cases: segmentation, forecasting, activity detection
  Model types: DNN, decision tree, SVM, linear
  Datasets: PhysioNet (Goldberger et al., 2000); HAR (Cramariuc, 2019); DSA (Altun et al., 2010); Opportunity (Roggen et al., 2010); UCI EMG (Lobov et al., 2018)

Industry Telemetry
  Use cases: sensing (light, temperature, etc.), anomaly detection, motor control, predictive maintenance
  Model types: DNN, decision tree, SVM, linear, naive Bayes
  Datasets: UCI Air Quality (De Vito et al., 2008); UCI Gas (Vergara et al., 2012); NASA's PCoE (Saxena & Goebel, 2008)
Figure 1. A logarithmic comparison of the active power consumption between TinyML systems and those supported by MLPerf. TinyML systems can have a power budget up to four orders of magnitude smaller than state-of-the-art MLPerf systems.
3 Tiny Hardware Constraints
TinyML hardware is defined by its ultra-low power consumption, which is often in the range of 1 mW and below. At the top of this range are efficient 32-bit MCUs, like those based on the Arm Cortex-M7 or RISC-V PULP processors, and at the bottom are novel ultra-low-power inference engines. Even the largest TinyML devices consume drastically less power than the smallest traditional ML devices. Figure 1 shows a logarithmic comparison of the active power consumption between TinyML devices and those currently supported by MLPerf (v0.5 inference results from the open and closed divisions). TinyML devices can have a power budget up to four orders of magnitude smaller than state-of-the-art MLPerf systems.

The advent of low-power, cheap 32-bit MCUs has revolutionized compute capability at the very edge. Cortex-M-based platforms are now regularly performing tasks that were previously infeasible at this scale, mostly due to support for single instruction, multiple data (SIMD) and digital signal processing (DSP) instructions. This fast vector math supports NN and highly efficient SVM implementations; it also accelerates many feature computations using 8-bit fixed-point arithmetic.

A feature of MCUs is the prevalence of on-chip SRAM and embedded flash. Thus, when models can fit within the tight on-chip memory constraints, they are free of the costly DRAM accesses that hamper traditional ML. Widespread adoption and dispersion of TinyML are reliant on the capability of these platforms.

Although general-purpose MCUs provide flexibility, the highest TinyML performance efficiency comes from specialized hardware. Novel architectures can achieve performance in the range of one microjoule per inference (Holleman, 2019). These specialized devices expand the boundaries of ML to the ultra-low-power end of TinyML processors.
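As a rough illustration of the arithmetic these instructions accelerate, the sketch below emulates an int8 multiply-accumulate dot product in NumPy, with the 32-bit accumulator that quantized NN and SVM kernels use; the quantization scales are placeholder values.

```python
# Emulation of the int8 MAC pattern behind SIMD/DSP-accelerated kernels.
# Scales are placeholders; on an MCU this loop maps to packed MAC instructions.
import numpy as np

def int8_dot(x_q, w_q, x_scale, w_scale):
    # Accumulate in int32 so the 8-bit products cannot overflow.
    acc = np.dot(x_q.astype(np.int32), w_q.astype(np.int32))
    return acc * (x_scale * w_scale)  # dequantize the result to real units

rng = np.random.default_rng(0)
x_q = rng.integers(-128, 128, size=64, dtype=np.int8)
w_q = rng.integers(-128, 128, size=64, dtype=np.int8)
print(int8_dot(x_q, w_q, x_scale=0.05, w_scale=0.01))
```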
4 Challenges
TinyML systems present a number of unique challenges to the design of a performance benchmark that can be used to systematically measure and quantify performance differences between systems. We discuss the four primary obstacles and postulate how they might be overcome.
4.1 Power

Low power consumption is one of the defining features of TinyML systems. Therefore, a useful benchmark should ostensibly profile the energy efficiency of each device. However, there are many challenges in fairly measuring energy consumption. Firstly, as illustrated in Figure 1, TinyML devices can consume drastically different amounts of power, which makes maintaining measurement accuracy across the range of devices difficult.

Secondly, it is difficult to determine what falls under the scope of the power measurement when data paths and pre-processing steps can vary significantly between devices. Other factors, like chip peripherals and underlying firmware, can impact the measurements. Unlike traditional high-power ML systems, TinyML systems do not have spare cores with which to load the System Under Test (SUT) with minimal overhead.
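As an illustration of one plausible approach, the sketch below integrates a sampled power trace over a fixed window and divides by the number of inferences performed; the trace, sampling rate, and window here are synthetic placeholders, and a real benchmark must still define what the window includes.

```python
# Sketch: energy per inference from a sampled power trace.
# `timestamps_s` and `power_w` stand in for readings from an external
# power monitor; all values below are synthetic placeholders.
import numpy as np

def energy_per_inference_j(timestamps_s, power_w, n_inferences):
    # Trapezoidal integration of power over time yields energy in joules.
    dt = np.diff(timestamps_s)
    energy_j = np.sum(0.5 * (power_w[1:] + power_w[:-1]) * dt)
    return energy_j / n_inferences

t = np.linspace(0.0, 1.0, 1000)   # 1 s window sampled at ~1 kHz
p = np.full_like(t, 800e-6)       # constant ~800 uW active power
print(f"{energy_per_inference_j(t, p, 10) * 1e6:.1f} uJ per inference")
```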
4.2 Memory

Due to their small size, TinyML systems often have tight memory constraints. While traditional ML systems like smartphones cope with resource constraints on the order of a few GB, TinyML systems typically cope with resources that are two orders of magnitude smaller.

Memory is one of the primary motivating factors for the creation of a TinyML-specific benchmark. Traditional ML benchmarks use inference models that have drastically higher peak memory requirements (on the order of gigabytes) than TinyML devices can provide. This also complicates the deployment of a benchmarking suite, as any overhead can significantly impact power consumption or even make the benchmark too big to fit. Individual benchmarks must also cover a wide range of devices; therefore, multiple levels of quantization and precision should be represented in the benchmarking suite. Finally, a variety of benchmarks should be chosen such that the diversity of the field is supported.
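The arithmetic behind these constraints is simple but worth making explicit. The sketch below estimates raw weight storage for a hypothetical 250k-parameter model at several precisions, showing why a benchmark suite must represent multiple quantization levels.

```python
# Back-of-the-envelope weight storage at different quantization levels.
# The 250k parameter count is a hypothetical, mid-sized TinyML model.
def model_size_kb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1024  # bits -> bytes -> kB

for bits in (32, 16, 8, 4):
    # fp32 (~977 kB) overflows a 256 kB part that int8 (~244 kB) nearly fits.
    print(f"{bits:2d}-bit weights: {model_size_kb(250_000, bits):7.1f} kB")
```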
4.3 Hardware Heterogeneity

Despite its nascency, TinyML systems are already diverse in their performance, power, and capabilities. Devices range from general-purpose MCUs to novel architectures, like event-based neural processors (Brainchip) or in-memory compute (Kim et al., 2019). This heterogeneity poses a number of challenges, as the system under test (SUT) will not necessarily include otherwise standard features, like a system clock or a debug interface. Furthermore, the task of normalizing performance results across heterogeneous implementations is a key challenge.

Today's state-of-the-art benchmarks are not designed to handle these challenges readily. They need careful re-engineering to be flexible enough to handle the extent of hardware heterogeneity that is commonplace in the TinyML ecosystem.
4.4 Software Heterogeneity

There are three distinct methods for deploying models onto TinyML systems: hand coding, code generation, and ML interpreters.

Hand coding often produces the best results, as it allows for low-level, application-specific optimizations; however, the task is time consuming, and the impact of the optimizations is often opaque to anyone but the original design team. Moreover, hand coding limits the ability to share knowledge and adopt new methods, which is detrimental to the rate of progress in TinyML. From a benchmarking perspective, hand-coded submissions will likely produce the best numerical results at the cost of reproducibility, comparability, and time.

Code generation methods produce well-optimized code without the significant effort of hand coding by abstracting and automating system-level optimizations. However, code generation does not address the issues with comparability, as each major vendor has its own set of proprietary tools and compilers, which also makes portability a challenge.

ML interpreters allow for significant portability, as their abstract structure is the same across platforms. TensorFlow Lite for Microcontrollers, a popular ML framework for TinyML, uses an interpreter to call individual kernels, like convolution, at run time. The framework is independent of the model architecture; therefore, new models can be easily swapped in. Additionally, the reference kernels can be individually optimized and changed to fit the platform. This method comes with a small overhead in binary size and performance. From a benchmarking perspective, this abstraction separates the impact of the model architecture on system-level performance, which makes results more generalizable.
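As a concrete example of this flow, the sketch below uses the public TensorFlow Lite converter API to produce a fully int8 flatbuffer of the kind the TensorFlow Lite for Microcontrollers interpreter executes on-device; the toy model and random calibration data are placeholders for a trained model and real samples.

```python
# Sketch: post-training int8 quantization with the TFLite converter.
# The model and calibration data below are placeholders.
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(49, 10, 1))
outputs = tf.keras.layers.Dense(2, activation="softmax")(
    tf.keras.layers.Flatten()(inputs))
model = tf.keras.Model(inputs, outputs)

def representative_data():
    for _ in range(100):  # calibration samples set the quantization ranges
        yield [np.random.rand(1, 49, 10, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # fully integer in/out
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:      # typically embedded as a C array
    f.write(tflite_model)
```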
Table 2. Existing Benchmarks

Benchmark             ML?   Power?   Tiny?
CoreMark              ×     √        √
MLMark                √     ×        ×
MLPerf Inference      √     √        ×
TinyML Requirements   √     √        √
A benchmark suite must balance optimality with portability, and comparability with representativeness. A TinyML benchmark should support many options for model deployment, but the impact of that choice on the results must be carefully evaluated.
5 Related Work
There are a number of ML-related hardware benchmarks; however, none accurately represents the performance of TinyML workloads on tiny hardware. Table 2 shows a sampling of the widely accepted industry benchmarks that are directly applicable to the discussion of TinyML systems.

EEMBC CoreMark (Gal-On & Levy) has become the standard performance benchmark for MCU-class devices due to its ease of implementation and use of real algorithms. Yet CoreMark does not profile full programs, nor does it accurately represent machine learning inference workloads.

EEMBC MLMark (Torelli & Bangale) addresses these issues by using actual ML inference workloads. However, the supported models are far too large for MCU-class devices and are not representative of TinyML workloads. They require far too much memory (GBs) and have significant run times. Additionally, while CoreMark supports power measurements with ULPMark-CM (EEMBC), MLMark does not, which is critical for a TinyML benchmark.

MLPerf, a community-driven benchmarking effort, has recently introduced a benchmarking suite for ML inference (Reddi et al., 2019) and has plans to add power measurements. However, much like MLMark, the current MLPerf inference benchmark precludes MCUs and other resource-constrained platforms due to a lack of small benchmarks and compatible implementations.

As Table 2 summarizes, there is a clear and distinct need for a TinyML benchmark that caters to the unique needs of ML workloads, makes power a first-class citizen, and prescribes a methodology that suits TinyML.
6 Benchmarks
To overcome these challenges, we adopt a set of principles for the development of a robust TinyML benchmarking suite and select a set of four benchmarks.
6.1 Open and Closed Divisions

As previously stated, TinyML is a diverse field; therefore, not all systems can be accommodated under strict rules. However, without strict rules, direct comparison of the hardware becomes more difficult. To address this issue, we adopt MLPerf's open and closed structure. More traditional TinyML solutions can submit to the closed division, where submissions must use a model that is considered equivalent to the reference model. TinyML systems that fall outside the bounds of the closed benchmark can submit results to the open division, which allows submissions to deviate as necessary from the closed reference. We believe this structure increases the inclusivity of the benchmarking suite while maintaining the comparability of the results.

Additionally, the open division allows submissions to demonstrate novel software optimizations. Software-based organizations can submit results using the reference platform while altering the model or inference engine to demonstrate the relative advantage of their unique solutions.
6.2 Use Cases

Our use case selection process prioritized diversity, feasibility, and industry relevance: diversity to ensure our benchmark suite covers as much of the field as possible, feasibility in terms of access to open-source datasets and models, and relevance to real-world applications.

The group has selected four use cases to target: audio wake words, visual wake words, image classification, and anomaly detection. Audio wake words refers to the common keyword-spotting task (e.g., "Alexa", "Ok Google", and "Hey Siri"). Visual wake words is a binary image classification task that indicates whether or not a person is visible in the image. The image classification use case targets image classification with a small label set. Anomaly detection is a broader use case that classifies time series data as "normal" or "abnormal". We specifically select audio anomaly detection as our use case due to the availability of a relevant dataset.

These use cases have been selected to represent the broad range of TinyML. They encompass three distinct input data types and range from relatively resource hungry (visual wake words) to lightweight (anomaly detection). Furthermore, the models traditionally used for these use cases are varied; therefore, the benchmarking suite can support a diverse set of ML techniques.
6.3 Datasets

The group has selected a dataset for each use case, as shown in Table 3. The datasets help specify the use cases, are used to train the reference models, and are sampled to create the test sets used during the measurement on device. Furthermore, the datasets can be used to train a new or modified model in the open division. We have selected datasets that are open, well known, and relevant to industry use cases.
Table 3. TinyMLPerf Benchmarking Suite

Use Case              Dataset                                              Model
Audio Wake Words      Speech Commands (Warden, 2018a)                      DS-CNN (Zhang et al., 2017)
Visual Wake Words     Visual Wake Words Dataset (Chowdhery et al., 2019)   DS-CNN (TFLM-Person-Detection)
Image Classification  CIFAR10 (Krizhevsky et al., 2009a)                   ResNet (He et al., 2016)
Anomaly Detection     ToyADMOS (Toy Car) (Koizumi et al., 2019)            Deep AutoEncoder (Koizumi et al., 2020)
6.4 Models

The group has selected four reference models. These reference models are the benchmark workloads in the closed division and act as a baseline for the open division. The DS-CNN described in Zhang et al. (2017) has been selected for audio wake words. The MobileNetV1 (Howard et al., 2017) used in the TensorFlow Lite for Microcontrollers person detection example (TFLM-Person-Detection) has been selected for visual wake words. An eight-layer ResNet model (He et al., 2016) has been selected for image classification. The baseline deep autoencoder from Task 2 of the DCASE2020 competition (Koizumi et al., 2020) has been selected for anomaly detection. The models were selected, based on industry input, to be representative of their respective use cases.
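For illustration, here is a minimal Keras sketch in the spirit of that fully connected autoencoder baseline; the 640-dimensional input (stacked log-mel frames) and hidden sizes are our assumptions, not the exact DCASE2020 configuration.

```python
# Fully connected autoencoder sketch for audio anomaly detection.
# Input dimension and hidden sizes are assumptions, not the DCASE2020 spec.
import tensorflow as tf

def build_autoencoder(input_dim=640):  # e.g., several stacked log-mel frames
    inputs = tf.keras.Input(shape=(input_dim,))
    h = inputs
    for units in (128, 128, 8, 128, 128):  # narrow bottleneck in the middle
        h = tf.keras.layers.Dense(units, activation="relu")(h)
    outputs = tf.keras.layers.Dense(input_dim)(h)  # reconstruct the input
    return tf.keras.Model(inputs, outputs)

model = build_autoencoder()
model.compile(optimizer="adam", loss="mse")  # trained on "normal" sounds only
# At test time the reconstruction error serves as the anomaly score:
# a frame the model cannot reconstruct well is flagged "abnormal".
```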
6.5 Metrics

The benchmarking suite will primarily measure inference latency, with the option to measure energy consumption. The scope of the measurements is determined by each benchmark. In the open division, the accuracy of the model must remain within a set threshold of the closed division model.
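A minimal latency-measurement loop might look like the sketch below; run_inference is a hypothetical stand-in for the device invocation, and the warm-up and run counts are arbitrary choices rather than prescribed benchmark values.

```python
# Sketch of a host-side latency measurement loop.
# `run_inference` is a hypothetical callable that triggers one on-device run.
import statistics
import time

def measure_latency_ms(run_inference, inputs, warmup=10, runs=100):
    for x in inputs[:warmup]:
        run_inference(x)  # warm-up runs excluded from the measurement
    samples = []
    for x in inputs[:runs]:
        start = time.perf_counter()
        run_inference(x)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)  # median resists scheduling outliers
```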
Perfection is often the enemy of good; therefore, to fill the community's need for comparability, our priority is to quickly establish a set of minimum viable benchmarks and iteratively address deficiencies. The benchmarking suite will continue to evolve to meet the needs of the community. We plan to accept result submissions in March of 2021.
7 Conclusion
In conclusion, TinyML is an important and rapidly evolving field that requires comparability amongst hardware innovations to enable continued progress and stability. In this paper, we reviewed the current landscape of TinyML, including highlighting the need for a hardware benchmark. Additionally, we analyzed the challenges associated with developing said benchmark and discussed a path forward. Finally, we have selected use cases, datasets, and models for our four benchmarks.

If you would like to contribute to the effort, join the working group here: https://groups.google.com/u/4/a/mlcommons.org/g/tiny
The benchmark suite is available here: https://github.com/mlcommons/tiny

References
Helium: Enhancing the capabilities of the smallest devices.

Altun, K., Barshan, B., and Tunçel, O. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43:3605–3620, 10 2010. doi: 10.1016/j.patcog.2010.04.019.

Amir, A., Taba, B., Berg, D., Melano, T., McKinstry, J., Nolfo, C. D., Nayak, T., Andreopoulos, A., Garreau, G., Mendoza, M., Kusnitz, J., Debole, M., Esser, S., Delbruck, T., Flickner, M., and Modha, D. A low power, fully event-based gesture recognition system. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7388–7397, July 2017. doi: 10.1109/CVPR.2017.781.

Brainchip. Akida neuromorphic system on chip.

Chowdhery, A., Warden, P., Shlens, J., Howard, A., and Rhodes, R. Visual wake words dataset. CoRR, abs/1906.05721, 2019. URL http://arxiv.org/abs/1906.05721.

Cramariuc, A.-C. P. I. M. B. Precis HAR, 2019. URL http://dx.doi.org/10.21227/mene-ck48.

De Vito, S., Massera, E., Piga, M., and Martinotto, L. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, 129:750–757, 02 2008. doi: 10.1016/j.snb.2007.09.060.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR09, 2009.

EEMBC. ULPMark - an EEMBC benchmark.

Fedorov, I., Adams, R. P., Mattina, M., and Whatmough, P. SpArSe: Sparse architecture search for CNNs on resource-constrained microcontrollers. In Advances in Neural Information Processing Systems 32, pp. 4978–4990. Curran Associates, Inc., 2019.

Flamand, E., Rossi, D., Conti, F., Loi, I., Pullini, A., Rotenberg, F., and Benini, L. GAP-8: A RISC-V SoC for AI at the edge of the IoT. In IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 1–4, July 2018. doi: 10.1109/ASAP.2018.8445101.

Gal-On, S. and Levy, M. Exploring CoreMark - a benchmark maximizing simplicity and efficacy. Technical report.

Garofalo, A., Rusci, M., Conti, F., Rossi, D., and Benini, L. PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 378(2164):20190155, Dec 2019. ISSN 1471-2962. doi: 10.1098/rsta.2019.0155. URL http://dx.doi.org/10.1098/rsta.2019.0155.

Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.

Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000. Circulation Electronic Pages: http://circ.ahajournals.org/content/101/23/e215.full; PMID: 1085218; doi: 10.1161/01.CIR.101.23.e215.

Hassan, M. M., Uddin, M. Z., Mohamed, A., and Almogren, A. A robust human activity recognition system using smartphone sensors and deep learning. Future Generation Computer Systems, 81:307–313, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Holleman, J. The speed and power advantage of a purpose-built neural compute engine, Jun 2019.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Kim, H., Chen, Q., Yoo, T., Kim, T. T.-H., and Kim, B. A 1-16b precision reconfigurable digital in-memory computing macro featuring column-MAC architecture and bit-serial computation. In ESSCIRC 2019 - IEEE 45th European Solid State Circuits Conference (ESSCIRC), pp. 345–348. IEEE, 2019.

Koizumi, Y., Saito, S., Uematsu, H., Harada, N., and Imoto, K. ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 313–317. IEEE, 2019.

Koizumi, Y., Kawaguchi, Y., Imoto, K., Nakamura, T., Nikaido, Y., Tanabe, R., Purohit, H., Suefusa, K., Endo, T., Yasuda, M., and Harada, N. Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring. In arXiv e-prints: 2006.05822, pp. 1–4, June 2020. URL https://arxiv.org/abs/2006.05822.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009a.

Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). 2009b.

Kumar, A., Goyal, S., and Varma, M. Resource-efficient machine learning in 2 KB RAM for the internet of things. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1935–1944, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/kumar17a.html.

Lai, L., Suda, N., and Chandra, V. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs, 2018.

LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Lobov, S., Krilova, N., Kastalskiy, I., Kazantsev, V., and Makarov, V. Latent factors limiting the performance of sEMG-interfaces. Sensors, 18:1122, 04 2018. doi: 10.3390/s18041122.

Moons, B., Bankman, D., Yang, L., Murmann, B., and Verhelst, M. BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS. In IEEE Custom Integrated Circuits Conference (CICC), pp. 1–4. IEEE, 2018.

Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., Chukka, R., Coleman, C., Davis, S., Deng, P., Diamos, G., Duke, J., Fick, D., Gardner, J. S., Hubara, I., Idgunji, S., Jablin, T. B., Jiao, J., John, T. S., Kanwar, P., Lee, D., Liao, J., Lokhmotov, A., Massa, F., Meng, P., Micikevicius, P., Osborne, C., Pekhimenko, G., Rajan, A. T. R., Sequeira, D., Sirasao, A., Sun, F., Tang, H., Thomson, M., Wei, F., Wu, E., Xu, L., Yamada, K., Yu, B., Yuan, G., Zhong, A., Zhang, P., and Zhou, Y. MLPerf inference benchmark, 2019.

Roggen, D., Calatroni, A., Rossi, M., Holleczek, T., Förster, K., Tröster, G., Lukowicz, P., Bannach, D., Pirkl, G., Ferscha, A., Doppler, J., Holzmann, C., Kurz, M., Holl, G., Chavarriaga, R., Sagha, H., Bayati, H., Creatura, M., and d. R. Millán, J. Collecting complex activity datasets in highly rich networked sensor environments. In Seventh International Conference on Networked Sensing Systems (INSS), pp. 233–240, June 2010. doi: 10.1109/INSS.2010.5573462.

Saxena, A. and Goebel, K. Turbofan engine degradation simulation data set, 2008. URL http://ti.arc.nasa.gov/project/prognostic-data-repository.

TensorFlow. TensorFlow Lite for Microcontrollers.

TFLM-Person-Detection. TensorFlow Lite for Microcontrollers person detection example. URL https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/person_detection.

tinyML Foundation. tinyML summit, 2019.

Torelli, P. and Bangale, M. Measuring inference performance of machine-learning frameworks on edge-class devices with the MLMark benchmark.

Vaizman, Y., Ellis, K., and Lanckriet, G. Recognizing detailed human context in the wild from smartphones and smartwatches. IEEE Pervasive Computing, 16(4):62–74, October 2017. ISSN 1558-2590. doi: 10.1109/MPRV.2017.3971131.

Vergara, A., Vembu, S., Ayhan, T., Ryan, M., Homer, M., and Huerta, R. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166–167:320–329, 05 2012. doi: 10.1016/j.snb.2012.01.074.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620, 2019.

Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition, 2018a.

Warden, P. Why the future of machine learning is tiny, 2018b. URL https://petewarden.com/2018/06/11/why-the-future-of-machine-learning-is-tiny/.