Mixed-Precision Quantization and Parallel Implementation of Multispectral Riemannian Classification for Brain--Machine Interfaces
Xiaying Wang∗, Tibor Schneider∗, Michael Hersche∗, Lukas Cavigelli†, Luca Benini∗‡
∗ETH Zürich, D-ITET, Switzerland; †Huawei Technologies, Zurich RC, Switzerland; ‡University of Bologna, DEI, Italy
Corresponding emails: {xiaywang, herschmi}@iis.ee.ethz.ch
Abstract—With Motor-Imagery (MI) Brain–Machine Interfaces (BMIs) we may control machines by merely thinking of performing a motor action. Practical use cases require a wearable solution where the classification of the brain signals is done locally near the sensor using machine learning models embedded on energy-efficient microcontroller units (MCUs), for assured privacy, user comfort, and long-term usage. In this work, we provide practical insights on the accuracy-cost trade-off for embedded BMI solutions. Our proposed Multispectral Riemannian Classifier reaches 75.1% accuracy on a 4-class MI task. We further scale down the model by quantizing it to mixed-precision representations with a minimal accuracy loss of 1%, which is still 3.2% more accurate than the state-of-the-art embedded convolutional neural network. We implement the model on a low-power MCU with parallel processing units, taking only 33.39 ms and consuming 1.304 mJ per classification.
Index Terms—brain–machine interface, edge computing, parallel computing, machine learning, deep learning, motor imagery.
I. INTRODUCTION
Motor-Imagery (MI) Brain–Machine Interfaces (BMIs) use Electroencephalography (EEG) signals recorded from the brain to decode a movement imagined by the subject. The decoded information can be used to control an external device, such as a drone [1] or a wheelchair [2], [3], or for stroke rehabilitation [4]. It is especially useful for individuals with physical disabilities to regain independence [4], [5]. However, the high variability across subjects and among different recording sessions poses big challenges to an accurate MI-BMI. Moreover, recording and labelling EEG data is expensive, time consuming, and prone to errors, resulting in scarce amounts of data available for training complex models with large numbers of parameters. In fact, many studies using Convolutional Neural Networks (CNNs) acknowledge that overfitting is the biggest issue for these types of models [6], [7], [8].

On the other hand, successful methods have been proposed to extract discriminative, domain-specific features from EEG signals. The well-known Common Spatial Patterns (CSP) method learns spatial filters that discern between different MI tasks [9]. An improved algorithm, called Filter-Bank CSP, that accounts for multiple frequency bands achieved better accuracy [10]. More recent studies have proposed Riemannian methods to extract more comprehensive features, also in the absence of labeled data [11]. The unsupervised feature calibration enables online adaptation of the classifier to combat the large inter-session variance in MI-BMIs [12]. So far, these methods are believed to be the most promising feature extractors for several kinds of BMI paradigms [13], [14], [15].

Traditional BMI systems adopt offline, remote processing of the sensor data, raising concerns over data privacy, latency,
high energy consumption, and battery lifetime. A promising solution is to bring the processing near the sensor, i.e., on the body of the user, using low-power, low-cost microcontroller units (MCUs), allowing the data to be processed locally [16]. However, these devices suffer from limited on-board resources in terms of memory and computational capabilities. Hence, researching compact yet accurate algorithms [7], [17] and designing low-power processors with high capabilities [18] has become an emerging trend. Most of the MI-BMI models, particularly CNNs, are too demanding for low-power MCUs [19]. TPCT [20] reached the state-of-the-art (SoA) accuracy of 88.87% on the BCI Competition IV-2a dataset [21]; the model consists of around 7.78 M parameters. Other similar CNNs reach 81.1% with 240 k parameters [22] or 75.8% with 155 k parameters [23]. A notable exception is EEGNet [17] with only a few thousand parameters, i.e., three orders of magnitude less demanding, but still achieving around 70% accuracy on 4-class MI classification. By virtue of its compactness, it has been successfully quantized with Q-EEGNet [24] and implemented on a low-power System-on-Chip (SoC) based on RISC-V called Mr. Wolf [18]. It has proven to be three orders of magnitude more energy efficient than an implementation on commercially available MCUs based on the ARM Cortex-M architecture [25], making it the SoA embedded CNN in terms of energy efficiency and compact model size, yet with accurate performance. Another effort for embedded BMI has been made by Belwafi et al. [26], implementing a CSP-based classifier on an FPGA device. The multispectral and multiscale Riemannian classifiers proposed in [27], [28] outperform both EEGNet and CSP-based models by around 5% and 2% higher accuracy, respectively. However, their proposed models are still very challenging for embedded deployment on low-power, resource-constrained MCUs due to their large memory footprint and high computational complexity.

For the first time in the literature, we propose an embedded MI-BMI based on a Riemannian classifier [27]. The main contributions of this paper are: (a) We tailor the model for better embedded deployment by reducing its size and complexity, i.e., the number of frequency bands and temporal windows, while at the same time keeping comparable classification accuracy by introducing regularization (75.1% ours vs. 75.5% [27]). (b) We further quantize the Multispectral Riemannian Classifier (MRC) from full precision (32-bit float) to a mixture of precisions with 8-, 16-, 32-bit fixed- and floating-point representations, to maximize efficiency on low-power MCUs by enabling the use of fixed-point SIMD instructions while maintaining a minimal accuracy loss. The quantization yields a 1% accuracy drop, which is still 3.2% more accurate than the embedded CNN-based EEGNet (74.1% ours vs. 70.9% [24]). (c) We efficiently implement the mixed-precision model on Mr. Wolf by exploiting the underlying hardware architecture, i.e., custom Instruction Set Architecture (ISA) extensions and concurrent execution on multiple cores, and measure the performance on-board. Experimental measurements show that the proposed model takes only 33.39 ms and consumes 1.304 mJ per inference. (d) Our work provides a practical accuracy-cost trade-off between MRC, a discriminative feature-based approach, and EEGNet as a CNN-based approach, supported by an actual implementation and measurement results.
Being the first embedded implementation of Riemannian covariance kernels and the most accurate embedded MI-BMI, it opens the path for other BMI paradigms deploying Riemannian methods, e.g., steady-state visual evoked potential [15] and P300 [29]. Finally, we release our code open-source at https://github.com/pulp-platform/multispectral-riemannian.

II. DESIGN AND QUANTIZATION
MRC [27] consists of a non-linear feature extraction applying the Riemannian covariance method [30] on multiple frequency bands and temporal windows, followed by a linear Support Vector Machine (SVM), as depicted in Fig. 1. First, the input data is filtered using f different Infinite Impulse Response (IIR) bandpass filters. Then, the covariance matrix is estimated and regularized with the parameter ρ. The next block, called Whitening, multiplies from the left and right with a reference matrix C_ref,k^{-1/2} that is computed for each frequency band k independently during training. Afterwards, the matrix logarithm is computed with the help of the Eigendecomposition (EVD). Then, the function vect(L_k) vectorizes the symmetric matrix L_k by concatenating the diagonal values and the upper right non-diagonal elements. To preserve the norm, the off-diagonal elements are scaled with √2. Finally, the SVM classifier predicts the MI class.

We quantize the feature extraction to a mixture of 8-, 16-, and full-precision 32-bit fixed- and floating-point representations and the SVM to 8-bit fixed-point, as summarized in Fig. 2. The decision on the precision depends on the trade-off between energy efficiency and accuracy preservation. With 8- or 16-bit fixed-point numbers, it is possible to exploit the Single Instruction, Multiple Data (SIMD) instructions. However, not all parts of the MRC can be quantized, due to numerical instability and significant accuracy loss.

Fig. 1: Multispectral Riemannian Classifier with 18 frequency bands and one time window.

Fig. 2: Quantized MRC of a single frequency band, showing the representation of each intermediate signal.
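To make the vectorization step concrete, the following minimal sketch (our illustration, not the authors' released code) flattens an n × n symmetric matrix into n(n+1)/2 features, diagonal first, with the off-diagonal entries scaled by √2 so that the Euclidean norm of the feature vector matches the Frobenius norm of the matrix:

#include <math.h>

/* Sketch of vect(L): concatenate the diagonal of an n x n symmetric
 * matrix with its upper-right triangle, scaling off-diagonal entries
 * by sqrt(2) to preserve the norm. Names and layout are illustrative. */
static void vect_sym(const float *L, float *out, int n)
{
    const float s = sqrtf(2.0f);
    int idx = 0;
    for (int i = 0; i < n; i++)
        out[idx++] = L[i * n + i];          /* diagonal values first */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            out[idx++] = s * L[i * n + j];  /* scaled upper triangle */
}

With n = 22 channels, this yields 253 features per band; 18 bands then give the 4554 features consumed by the SVM.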
1) IIR Bandpass Filters:
The input data X ∈ R^{N_ch × N_s}, with dimensions given by the number of EEG channels N_ch and the number of time samples N_s, is quantized to 8 bits. Each channel is filtered with f IIR bandpass filters. The filters can become unstable, especially with quantization: the internal accumulators can diverge, even if the output remains bounded. We implement the Direct-Form I structure defined in [31], since it does not experience numerical overflow in the internal signals, because all internal registers store either the input or the output of the filter [31]. A typical approach for quantizing an IIR filter is to express it as a cascade of Second-Order Sections (SOSs), each of which can be quantized with a different dynamic range, thus minimizing the effect of quantizing the filter coefficients on the impulse response. With 8-bit fixed-point quantization, the impact is significant, while with 12 bits these effects are minimal. Therefore, we choose 12 bits for the filter coefficients to prevent the overflows that would occur with 16 bits. We re-scale the intermediate results in between the SOSs to remain in the same dynamic range and accumulate them with 16-bit registers in order to use SIMD operations in the following iteration. All dynamic ranges for all sections are chosen independently and forced to be a power of two, to implement simple bit-shifts instead of expensive divisions.
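A minimal sketch of one quantized SOS step under this scheme follows; the struct layout, field names, and the per-section shift are our assumptions for illustration, while an actual deployment would map the MACs onto 2-way 16-bit SIMD instructions:

#include <stdint.h>

/* Sketch of one second-order section (SOS) in Direct-Form I with 12-bit
 * coefficients (stored in int16), 16-bit input/output and state, a 32-bit
 * accumulator, and a power-of-two rescale implemented as a bit-shift. */
typedef struct {
    int16_t b[3];   /* feed-forward coefficients, quantized to 12 bits */
    int16_t a[2];   /* feedback coefficients a1, a2 (a0 normalized)    */
    int16_t x[2];   /* previous two inputs                             */
    int16_t y[2];   /* previous two outputs                            */
    int     shift;  /* per-section power-of-two rescaling              */
} sos_t;

static int16_t sos_step(sos_t *s, int16_t x0)
{
    /* Direct-Form I keeps only filter inputs and outputs in its
     * registers, so no internal signal exceeds the output range. */
    int32_t acc = (int32_t)s->b[0] * x0
                + (int32_t)s->b[1] * s->x[0]
                + (int32_t)s->b[2] * s->x[1]
                - (int32_t)s->a[0] * s->y[0]
                - (int32_t)s->a[1] * s->y[1];
    int16_t y0 = (int16_t)(acc >> s->shift); /* rescale to 16-bit range */
    s->x[1] = s->x[0]; s->x[0] = x0;         /* update delay lines      */
    s->y[1] = s->y[0]; s->y[0] = y0;
    return y0;
}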
2) Covariance Matrix and Whitening:
Recall that the covariance matrix C ∈ R^{n×n}, in our case with n = N_ch, including regularization, is computed as

C = X X^T + ρ I,   (1)

and Whitening is defined as

W = C_ref^{-1/2} C C_ref^{-1/2},   (2)

with C_ref^{-1/2} being the reference matrix, computed by averaging the covariance matrices of all the training trials. For quantization, we define n_c and n_ref to be the number of bits used to represent C and C_ref^{-1/2}, respectively. Since we can exploit either 4- or 2-way SIMD operations, we test both n_c = n_ref = 8 and n_c = n_ref = 16. However, the former yields a significant accuracy drop, while the latter causes overflows. Hence, we reduce n_ref until training completes without overflow, resulting in n_ref = 11. Our experiments have shown that using n_c = 16 and n_ref = 11 yields similar accuracy to the full-precision version. Moreover, we force the scaling factor for the covariance matrix computation to be a power of two to exploit bit-shifts, while the dynamic range for the Whitening depends on the quantization of C and C_ref^{-1/2}. Finally, for the intermediate and final results of Eq. (2), we keep the full dynamic range with 32 bits, since the input to the matrix logarithm is very sensitive to quantization errors, as explained next.
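As an illustration of this mixed-precision scheme for Eq. (2), the sketch below (ours; function names and the rescaling shift are assumptions) computes the first product in 16 bits and the second at full 32-bit dynamic range:

#include <stdint.h>

/* Sketch of W = C_ref^{-1/2} C C_ref^{-1/2}: Cref holds the 11-bit
 * reference matrix in int16, C the 16-bit covariance. The first product
 * is rescaled back to 16 bits (2-way SIMD friendly on the target); the
 * second keeps 32-bit results, since the downstream matrix logarithm is
 * sensitive to quantization errors. */
static void whiten_q(const int16_t *Cref, const int16_t *C,
                     int16_t *tmp, int32_t *W, int n, int shift)
{
    for (int i = 0; i < n; i++)          /* tmp = Cref * C, 16-bit out */
        for (int j = 0; j < n; j++) {
            int32_t acc = 0;
            for (int k = 0; k < n; k++)
                acc += (int32_t)Cref[i * n + k] * C[k * n + j];
            tmp[i * n + j] = (int16_t)(acc >> shift);
        }
    for (int i = 0; i < n; i++)          /* W = tmp * Cref, 32-bit out */
        for (int j = 0; j < n; j++) {
            int32_t acc = 0;
            for (int k = 0; k < n; k++)
                acc += (int32_t)tmp[i * n + k] * Cref[k * n + j];
            W[i * n + j] = acc;          /* range fixed by n_c, n_ref  */
        }
}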
3) Matrix Logarithm:
The matrix logarithm of a square, positive definite matrix A ∈ R^{n×n} is defined in terms of its EVD as

logm(A) = Q^{-1} logm(D) Q,   (3)

where A = Q^{-1} D Q, and the logarithm of a diagonal matrix D is computed by applying the logarithm to its diagonal elements. The whitened covariance matrix W in MRC is dense and symmetric, allowing us to optimize the EVD. We first compute the tridiagonal decomposition to obtain a tridiagonal matrix T similar to the original one, i.e., the Eigenvalues are preserved. Then the EVD can be computed on T, requiring less computational effort. The final transformation is

W = Q_t^T T Q_t = Q_t^T Q_d^T D Q_d Q_t,   (4)

where Q_t is the orthogonal matrix of the tridiagonal transformation and Q_d the one of the EVD. Q_d Q_t is an orthogonal matrix containing the Eigenvectors of W. To compute the tridiagonal matrix, we use the Householder transformation [32]. The complexity of the transformation can be reduced by rearranging the operations and exploiting the sparsity of the vectors [32]. For computing the diagonal matrix D from the tridiagonal symmetric matrix T, we use the QR algorithm with implicit Wilkinson shift [33].

The matrix logarithm only exists if the matrix is positive definite, meaning that all the Eigenvalues are positive. In the full-precision MRC, the input of the matrix logarithm is always positive definite, while with quantization the Eigenvalues change and in some cases even become negative, making it impossible to compute a real logarithm. We address this issue by (a) making use of the entire 32-bit dynamic range for the inputs, and (b) clipping all Eigenvalues λ_k to max{λ_k, λ_min} by introducing a threshold λ_min, a small negative power of ten whose value is chosen based on the smallest Eigenvalue occurring while training the full-precision MRC. Moreover, both the Householder transformation and the QR algorithm are computed with 32-bit floating-point values. Finally, we convert the results back to 8-bit fixed-point format using the dynamic range learned during training.
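The clipping step itself is simple; a sketch follows, where lambda_min is a training-time constant passed in by the caller (the paper derives it from the smallest eigenvalue seen during full-precision training; no specific value is assumed here):

#include <math.h>

/* Sketch of logm(D) with eigenvalue clipping: each lambda_k is clipped
 * to max(lambda_k, lambda_min) so the logarithm stays real even when
 * quantization pushes an eigenvalue to or below zero. */
static void log_eigenvalues(float *lambda, int n, float lambda_min)
{
    for (int k = 0; k < n; k++) {
        float l = (lambda[k] > lambda_min) ? lambda[k] : lambda_min;
        lambda[k] = logf(l);
    }
}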
4) Support Vector Machine (SVM):
The final classifier in MRC is an SVM, which we train on the quantized features. The weights and biases are then quantized with bit-widths n_w = 8 and n_b = 32, respectively, by determining the dynamic ranges after training. We do not rescale the output of the SVM, because the prediction is made based on the largest output value, so only the relative magnitudes of the class scores matter. Hence, the weight vector can use the entire range available with 8 bits, reducing the quantization error.
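A minimal sketch of this classifier head follows (our illustration; on the target, the inner loop would use 4-way 8-bit SIMD dot products). Since no rescaling is applied, the raw accumulator scores are compared directly:

#include <stdint.h>

/* Sketch of the final linear SVM in fixed point: 8-bit weights, 32-bit
 * biases and accumulators, and an argmax over the unscaled scores. */
static int svm_predict(const int8_t *w, const int32_t *b,
                       const int8_t *feat, int n_classes, int n_feat)
{
    int best = 0;
    int32_t best_score = INT32_MIN;
    for (int c = 0; c < n_classes; c++) {
        int32_t acc = b[c];                       /* start from bias    */
        for (int k = 0; k < n_feat; k++)
            acc += (int32_t)w[c * n_feat + k] * feat[k];
        if (acc > best_score) { best_score = acc; best = c; }
    }
    return best;                                  /* predicted MI class */
}

For our configuration, n_feat = 4554 and n_classes = 4, so this step is negligible next to the feature extraction.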
III. IMPLEMENTATION

We implement the mixed-precision MRC on Mr. Wolf [18], which has a SoC domain and a compute cluster with 8 parallel RISC-V-based processors called RI5CY, or CV32E40P, implementing the RV32IMFC ISA with custom XpulpV2 extensions for Digital Signal Processing (DSP), e.g., SIMD instructions, hardware loops, and post-incrementing loads and stores [34]. The cluster cores share two Floating Point Units (FPUs) and 64 kB of L1 memory accessed via the Tightly Coupled Data Memory (TCDM) interconnect. More memory can be accessed via a Direct Memory Access (DMA) unit from the shared L2 memory (448 kB) in the SoC domain.

Our MRC implementation is divided into three main blocks, framed with blue, red, and green lines in Fig. 1, respectively: (a) computation of the frequency bands up to the Whitening: each frequency band, highlighted with a blue rectangle, is computed using 8 cores as described in the following paragraphs; (b) computation of the matrix logarithm and vectorization: every core computes one matrix logarithm followed by the vectorization, concurrently with the other cores, i.e., 8 matrix logarithms, colored with a red rectangle, are computed at the same time; (c) the SVM, computed with a single core and colored in green.
1) IIR Filter:
As described in Section II-1, we set the bit-width of the coefficients to n_a = n_b = 12, and the bit-width of the internal registers to n_i = 16. Each SOS contains three Multiply-Accumulates (MACs) for the forward accumulation and two MACs for the backward accumulation. This enables the usage of SIMD instructions with bit-width 16. We compute the filtered output of different EEG channels on separate cores of the cluster to utilize the concurrent capabilities of Mr. Wolf.
2) Covariance Matrix:
The computation of the covariance matrix is a matrix-matrix multiplication (MMM), as shown in Eq. (1), which results in a symmetric matrix. Therefore, we only compute the upper right triangle and copy the remaining elements. Since X_k, the filtered input data of band k, is packed to 8 bits, the implementation makes use of SIMD instructions to improve the performance significantly. The computation is implemented concurrently by splitting the upper right part of the output matrix among all processing units, as sketched below.
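The sketch below shows one possible split (our illustration; the authors' exact scheduling may differ): rows of the upper triangle are assigned to the cores round-robin, so that each core receives a mix of long and short rows:

#include <stdint.h>

/* Sketch of the concurrent covariance computation of Eq. (1): core
 * `core_id` of `n_cores` computes every n_cores-th row of the upper
 * triangle and mirrors it. The power-of-two scale is applied as a
 * right-shift; rho_q is rho expressed in the output scale. */
static void covmat_worker(const int8_t *X, int16_t *C, int n_ch, int n_s,
                          int shift, int16_t rho_q, int core_id, int n_cores)
{
    for (int i = core_id; i < n_ch; i += n_cores) {
        for (int j = i; j < n_ch; j++) {
            int32_t acc = 0;
            for (int t = 0; t < n_s; t++)   /* 4-way SIMD on the target */
                acc += (int32_t)X[i * n_s + t] * X[j * n_s + t];
            int16_t c = (int16_t)(acc >> shift);
            if (i == j) c += rho_q;         /* + rho on the diagonal    */
            C[i * n_ch + j] = c;
            C[j * n_ch + i] = c;            /* mirror to lower triangle */
        }
    }
}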
3) Whitening:
Whitening consists of two MMMs, as described in Eq. (2). Based on the quantization scheme described in Section II-2, the first multiplication is computed in 16 bits and the second in 32 bits. For the first multiplication, we use 2-way SIMD instructions. We use the concurrent implementation found in the DSP library for PULP [35], where each core computes a part of the matrix.
4) Matrix Logarithm:
For computing the EVD, we implement both the basic version of the Householder transformation and the improved version [32] for speedup analyses. The computation of the rotation matrix required for the Givens rotation [36] of each QR step is done exclusively with multiplications, divisions, and additions, without using expensive trigonometric functions [37]. For the parallel implementation, each core is assigned a frequency band and computes the Householder transformation and the QR algorithm.
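For illustration, one standard trigonometry-free construction of the rotation (in the spirit of [36], [37]; this exact routine is our sketch, not the paper's code) computes the cosine-sine pair directly from the two entries to be rotated, using only divisions, multiplications, and one square root instead of sin/cos/atan2:

#include <math.h>

/* Sketch of a Givens rotation pair (c, s) that zeroes b in (a, b)^T,
 * computed without trigonometric functions. The branch on magnitudes
 * keeps the intermediate ratio t bounded for numerical stability. */
static void givens(float a, float b, float *c, float *s)
{
    if (b == 0.0f) { *c = 1.0f; *s = 0.0f; return; }
    if (fabsf(b) > fabsf(a)) {
        float t = a / b;
        float u = sqrtf(1.0f + t * t);
        *s = 1.0f / u;
        *c = t * *s;
    } else {
        float t = b / a;
        float u = sqrtf(1.0f + t * t);
        *c = 1.0f / u;
        *s = t * *c;
    }
}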
5) Support Vector Machine (SVM):
The matrix-vector product of the SVM is computed using 8-bit SIMD instructions. We implement it on a single core, since it accounts for a negligible portion of the computation of the entire model.

IV. EXPERIMENTAL RESULTS AND DISCUSSION
We apply our methods on the BCI Competition IV-2a dataset [21] with 22 EEG channels and 4 MI classes from 9 different subjects. There are 288 trials for each of the training and testing sets. Each trial lasts 6 s and is sampled at 250 Hz.

Table I reports the classification accuracy of our proposed models compared to related work with different MRC configurations and EEGNet. MRC can be scaled to use more or fewer frequency bands and temporal windows. Hersche et al. [27] have shown that f = 43 frequency bands and a single temporal window t = 1 can already achieve comparable accuracy (74.8% on average) to the full MRC (75.5%) while requiring far fewer features. In this work, we use only one temporal window t = 1 of 3.5 s and further scale down the number of frequency bands. Our results show that with 2.4× fewer frequency bands, i.e., f = 18 bands of bandwidth 2 Hz between 4 and 40 Hz, our full-precision model achieves slightly higher accuracy by introducing the regularization with the hyperparameter ρ = 1. Compared to EEGNet, which is known to be a compact CNN for BMI applications [17], our full-precision MRC is 3.8% more accurate. Regarding the quantization, EEGNet can be quantized down to 8-bit precision for the entire network with Q-EEGNet [24] without significant loss in accuracy (0.4%). However, our proposed mixed-precision MRC is still 3.2% more accurate. The minimal loss in accuracy of 1% from full to mixed precision can be attributed mainly to the quantization at the input of the matrix logarithm.

Regarding the memory footprint, Q-EEGNet requires 68.15 kB, while our MRC implementation uses approximately 84 kB: 2 · 22 · 876 B for the 8-bit input and output of the IIR filters, 18 · 22 · (22+1)/2 values of W_k in 32 bits (reused for L_k), 18 · 22 · (22+1)/2 values of C_ref,k^{-1/2} in 16 bits, and 4554 · 4 SVM weights in 8 bits.

Table II reports the execution time of the compute blocks. The IIR filters reach a 7.26× parallel speedup. Here, each output sample requires 10 MACs, 3 shuffle operations, and 4 bit-shifts, resulting in a theoretical maximum of 5 MACs per cycle. The covariance matrix computation reaches 8.14 MACs per cycle with concurrent execution, yielding a speedup of 7.10× using 8 cores. The parallel speedup of the Whitening is 4.98×, due to the parallelization overhead that is more visible with smaller matrix sizes (here 22 × 22). The improved Householder transformation yields a 3.64× speedup on the computation of the matrix logarithm compared to the baseline, while the parallel speedup is 5.67× compared to the single-core computation and 20.64× compared to the baseline. 18 matrix logarithms are computed, distributed to the 8 cores on a first-come, first-served schedule, i.e., twice 8 matrix logarithms are computed on 8 cores, then the remaining 2 on two cores, as reflected in the power trace, framed with a red dash-dotted line. This workload unbalance contributes negatively to the parallel speedup. However, the performance would not increase significantly with a more balanced distribution, since the ideal speedup would be 6× with six parallel cores. Moreover, the maximal number of Floating Point Operations (FLOPs) per cycle is 2, of which we reach 1.69, limited by the iteratively computed divisions and square root operations. Finally, the SVM accounts for a minimal part of the execution with 0.15 ms, highlighted with a green frame in Fig. 3.

TABLE I: Classification accuracy (%) on 4-class MI.

             Q-EEGNet [24]  Q-EEGNet [24]  MRC [27]∤  MRC [27]♮  MRC (ours)⋄  MRC (ours)⋄
Precision    full           8-bit          full       full       full         mixed
t / f / ρ    -              -              -          1 / 43 / - 1 / 18 / 1   1 / 18 / 1
Accuracy     71.3           70.9           75.5       74.8       75.1         74.1
Std.         11.5           14.3           12.8       13.9       12.2         13.2
Fig. 3: End-to-end power measurement (power [mW] over time [ms]). The colors match the compute blocks in Fig. 1 explained in Sec. III.

For comparison, the embedded BMI in [26] consumes 0.7 W and takes around 0.4 s, more than an order of magnitude more in terms of both power consumption and execution time, or two orders of magnitude worse in terms of energy efficiency. We also compare to the Q-EEGNet implementation in [24], which is publicly available. We run both Q-EEGNet and MRC on Mr. Wolf at 100 MHz and 1.1 V. The former takes 13.64 ms consuming 0.678 mJ, while the runtime of MRC lies within the same order of magnitude with 33.39 ms, consuming 1.304 mJ. It is up to the user to decide on the trade-off between accuracy and cost depending on the application scenario.

V. CONCLUSION
This paper presents an improved MRC with reduced model size while keeping comparable accuracy (75.1% vs. 75.5% [27]), allowing accurate low-power embedded BMI. We further scale down the model by quantizing it and proposing a mixed-precision implementation, yielding a minimal accuracy loss of 1%, which is still 3.2% more accurate than the SoA embedded CNN for BMI, Q-EEGNet [24]. We propose a parallel implementation on a low-power MCU called Mr. Wolf, which takes only 33.39 ms and consumes 1.304 mJ. The higher accuracy compared to Q-EEGNet comes at the cost of a 2.4× longer execution time and a 1.9× higher energy consumption. However, it is still two orders of magnitude more energy efficient than other embedded solutions [26]. We provide an insight on the accuracy-cost trade-off for embedded BMI models with an actual implementation and measurements.

TABLE II: Computation time for MRC on Mr. Wolf at a frequency of 100 MHz and 1.1 V.

              baseline    improved EVD  concurrent  parallel speedup  ops/c∤
Filter        66.67 ms    66.67 ms      9.18 ms     7.26              3.77
Cov. matrix   34.80 ms    34.80 ms      4.90 ms     7.10              8.14
Whitening     24.29 ms    24.29 ms      4.88 ms     4.98              0.79
Matrix logm.  309.76 ms   85.18 ms      15.01 ms    5.67              1.69
SVM           0.15 ms     0.15 ms       0.15 ms     -                 1.25
Total         439.48 ms   206.93 ms     33.39 ms

MACs/cycle♮: 0.907   FLOPs/cycle⋄: 0.837   insn/cycle: 0.788
♮ Number of fixed-point MACs over number of cycles w/o matrix logarithms.
⋄ Number of FLOPs over number of cycles during matrix logarithms.
∤ MACs or FLOPs per cycle for the concurrent implementation, except SVM.
REFERENCES
[1] K. Koizumi, K. Ueda et al., "Development of a cognitive brain-machine interface based on a visual imagery method," 2018, pp. 1062–1065.
[2] Y. Yu, Z. Zhou et al., "Self-paced operation of a wheelchair based on a hybrid brain-computer interface combining motor imagery and P300 potential," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 12, pp. 2516–2526, Dec. 2017.
[3] M. Xiong, R. Hotter et al., "A low-cost, semi-autonomous wheelchair controlled by motor imagery and jaw muscle activation," IEEE, 2019, pp. 2180–2185.
[4] A. A. Frolov, O. Mokienko et al., "Post-stroke rehabilitation training with a motor-imagery-based brain-computer interface (BCI)-controlled hand exoskeleton: a randomized controlled multicenter trial," Frontiers in Neuroscience, vol. 11, p. 400, 2017.
[5] N. Kobayashi and M. Nakagawa, "BCI-based control of electric wheelchair using fractal characteristics of EEG," IEEJ Transactions on Electrical and Electronic Engineering, vol. 13, no. 12, pp. 1795–1803, 2018.
[6] J. León, J. J. Escobar et al., "Deep learning for EEG-based motor imagery classification: accuracy-cost trade-off," PLOS ONE, vol. 15, no. 6, pp. 1–30, Jun. 2020.
[7] H. Wu, Y. Niu et al., "A parallel multiscale filter bank convolutional neural networks for motor imagery EEG classification," Frontiers in Neuroscience, vol. 13, p. 1275, 2019.
[8] R. T. Schirrmeister, J. T. Springenberg et al., "Deep learning with convolutional neural networks for EEG decoding and visualization," Human Brain Mapping, vol. 38, no. 11, pp. 5391–5420, 2017.
[9] F. Lotte and C. Guan, "Regularizing common spatial patterns to improve BCI designs: unified theory and new algorithms," IEEE Transactions on Biomedical Engineering, vol. 58, no. 2, pp. 355–362, Feb. 2011.
[10] K. K. Ang, Z. Y. Chin et al., "Filter bank common spatial pattern (FBCSP) in brain-computer interface," IEEE, 2008, pp. 2390–2397.
[11] C. H. Nguyen and P. Artemiadis, "EEG feature descriptors and discriminant analysis under Riemannian manifold perspective," Neurocomputing, vol. 275, pp. 1871–1883, 2018.
[12] S. Kumar, F. Yger et al., "Towards adaptive classification using Riemannian geometry approaches in brain-computer interfaces," 2019, pp. 1–6.
[13] M. Congedo, A. Barachant et al., "Riemannian geometry for EEG-based brain-computer interfaces; a primer and a review," Brain-Computer Interfaces, vol. 4, pp. 1–20, Mar. 2017.
[14] F. Yger, M. Berar et al., "Riemannian approaches in brain-computer interfaces: a review," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 10, pp. 1753–1762, Oct. 2017.
[15] S. Chevallier, E. Kalunga et al., "Review of Riemannian distances and divergences, applied to SSVEP-based BCI," Neuroinformatics, Jun. 2020.
[16] X. Wang, M. Magno et al., "FANN-on-MCU: an open-source toolkit for energy-efficient neural network inference at the edge of the Internet of Things," IEEE Internet of Things Journal, 2020.
[17] V. J. Lawhern, A. J. Solon et al., "EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces," Journal of Neural Engineering, vol. 15, no. 5, p. 056013, 2018.
[18] A. Pullini, D. Rossi et al., "Mr. Wolf: an energy-precision scalable parallel ultra low power SoC for IoT edge processing," IEEE Journal of Solid-State Circuits, vol. 54, no. 7, pp. 1970–1981, 2019.
[19] T. Ingolfsson, M. Hersche et al., "EEG-TCNet: an accurate temporal convolutional network for embedded motor-imagery brain-machine interfaces," arXiv:2006.00622, May 2020.
[20] M.-A. Li, J.-F. Han et al., "A novel MI-EEG imaging with the location information of electrodes," IEEE Access, vol. 8, pp. 3197–3211, 2020.
[21] C. Brunner, R. Leeb et al., "BCI competition 2008 - Graz data set A," http://bnci-horizon-2020.eu/database/data-sets.
[22] Y. Zhao, S. Yao et al., "On the improvement of classifying EEG recordings using neural networks," in Proc. IEEE Big Data, Dec. 2017, pp. 1709–1711.
[23] H. Wu, Y. Niu et al., "A parallel multiscale filter bank convolutional neural networks for motor imagery EEG classification," Frontiers in Neuroscience, vol. 13, Nov. 2019.
[24] T. Schneider, X. Wang et al., "Q-EEGNet: an energy-efficient 8-bit quantized parallel EEGNet implementation for edge motor-imagery brain-machine interfaces," arXiv:2004.11690v1, Apr. 2020.
[25] X. Wang, M. Hersche et al., "An accurate EEGNet-based motor-imagery brain-computer interface for low-power edge computing," 2020, pp. 1–6.
[26] K. Belwafi, O. Romain et al., "An embedded implementation based on adaptive filter bank for brain-computer interface systems," Journal of Neuroscience Methods, 2018.
[27] M. Hersche, T. Rellstab et al., "Fast and accurate multiclass inference for MI-BCIs using large multiscale temporal and spectral features," IEEE, Sep. 2018, pp. 1690–1694.
[28] P. Yang, J. Wang et al., "MLP with Riemannian covariance for motor imagery based EEG analysis," IEEE Access, vol. 8, pp. 139974–139982, 2020.
[29] P. L. C. Rodrigues, C. Jutten et al., "Riemannian Procrustes analysis: transfer learning for brain-computer interfaces," IEEE Transactions on Biomedical Engineering, vol. 66, no. 8, pp. 2390–2401, 2019.
[30] F. Yger, M. Berar et al., "Riemannian approaches in brain-computer interfaces: a review," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 10, pp. 1753–1762, 2016.
[31] J. O. Smith, Introduction to Digital Filters with Audio Applications, online book, http://ccrma.stanford.edu/~jos/filters/, 2020.
[32] R. Burden and J. Faires, Numerical Analysis. Cengage Learning, 2004.
[33] J. H. Wilkinson, The Algebraic Eigenvalue Problem. Oxford Clarendon, 1965.
[34] M. Gautschi, P. D. Schiavone et al., "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," IEEE Transactions on VLSI Systems, vol. 25, no. 10, pp. 2700–2713, 2017.
[35] X. Wang, "DSP library for PULP," https://github.com/pulp-platform/pulp-dsp, 2019.
[36] W. Givens, "Numerical computation of the characteristic values of a real symmetric matrix," Oak Ridge National Lab., Tech. Rep., 1954.
[37] D. Bindel, J. Demmel et al., "On computing Givens rotations reliably and efficiently."