MaskedNet: The First Hardware Inference Engine Aiming Power Side-Channel Protection
IEEE COPYRIGHT NOTICE © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Accepted to be published in: Proceedings of the 2020 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), May 4-7, 2020, San Jose, CA, USA.

MaskedNet: The First Hardware Inference Engine Aiming Power Side-Channel Protection

Anuj Dubey∗, Rosario Cammarota†, Aydin Aysu∗
∗Department of Electrical and Computer Engineering, North Carolina State University, {aanujdu, aaysu}@ncsu.edu
†Intel AI, Privacy and Security, [email protected]
Abstract—Differential Power Analysis (DPA) has been an active area of research for the past two decades to study the attacks for extracting secret information from cryptographic implementations through power measurements and their defenses. The research on power side-channels has so far predominantly focused on analyzing implementations of ciphers such as AES, DES, RSA, and recently post-quantum cryptography primitives (e.g., lattices). Meanwhile, machine-learning applications are becoming ubiquitous with several scenarios where the Machine Learning Models are Intellectual Properties requiring confidentiality. Expanding side-channel analysis to Machine Learning Model extraction, however, is largely unexplored.

This paper expands the DPA framework to neural-network classifiers. First, it shows DPA attacks during inference to extract the secret model parameters such as weights and biases of a neural network. Second, it proposes the first countermeasures against these attacks by augmenting masking. The resulting design uses novel masked components such as masked adder trees for fully-connected layers and masked Rectifier Linear Units for activation functions. On a SAKURA-X FPGA board, experiments show that the first-order DPA attacks on the unprotected implementation can succeed with only 200 traces and our protection respectively increases the latency and area-cost by 2.8× and 2.3×.

Index Terms—Machine Learning, Neural Networks, Side-Channel Analysis, Masking.
I. INTRODUCTION
Since the seminal work on Differential Power Analysis (DPA) [1], there has been an extensive amount of research on power side-channel analysis of cryptographic systems. Such research efforts typically focus on new ways to break into various implementations of cryptographic algorithms and countermeasures to mitigate attacks. While cryptography is obviously an important target driving this research, it is not the only scenario where asset confidentiality is needed—secret keys in the case of cryptographic implementations.

In fact, Machine Learning (ML) is a critical new target with several motivating scenarios to keep the internal ML model secret. The ML models are considered trade secrets, e.g., in Machine-Learning-as-a-Service applications, due to the difficulty of training ML models and privacy concerns about the information embedded in the model coefficients such as weights and biases in the case of neural networks. If leaked, the model, including weights, biases and hyper-parameters, can violate data privacy and intellectual property rights. Moreover, knowing the ML classifier details makes it more susceptible to adversarial ML attacks [2], and especially to test-time evasion attacks [3], [4].
Fig. 1. Motivation of this work: DPA of the BNN hardware with 100k measurements. The green plot is the correlation trace for the correct 4-bit weight guess, which crosses the 99.99% confidence threshold revealing the significant leak. The blue plot is for the 2's complement of the correct guess, which is an expected false positive of the target signed multiplication. The other 14 guesses do not show correlation.

Finally, ML has also been touted to replace cryptographic primitives [5]—under this scenario, learning the ML classifier details would be equivalent to extracting secrets from cryptographic implementations.

In this work, we extend the side-channel analysis framework to ML models. Specifically, we apply power-based side-channel attacks on a hardware implementation of a neural network and propose the first side-channel countermeasure. Fig. 1 shows the vulnerability of a Binarized Neural Network (BNN)—an efficient network for IoT/edge devices with binary weights and activation values [6]. Following the DPA methodology [7], the adversary makes a hypothesis on 4 bits of the secret weight. For all these 16 possible weight values, the adversary computes the corresponding power activity on an intermediate computation, which depends on the known input and the secret weight. This process is repeated multiple times using random, known inputs. The correlation plots between the calculated power activities for the 16 guesses and the obtained power measurements reveal the value of the secret weight.

Fig. 1 shows that at the exact time instant where the targeted computation occurs, a significant information leakage exists between the power measurements and the correct key guess. The process can be repeated reusing the same power measurements to extract other weights. Hence, it shows that implementations of ML Intellectual Properties are also susceptible to side-channel attacks like ciphers.

Given this vulnerability, the objective of this paper is to examine it and propose the first countermeasure attempt for neural network inference against power-based side-channel attacks. A neural network inference is a sequence of repeated linear and non-linear operations, similar in essence to cryptographic algorithms, but has unique computations such as row-reduction (i.e., weighted summation) operations and activation functions. Unlike the attack scenario, the defense exhibits challenges due to the presence of these operations in neural networks, which introduce an additional and subtle type of leak. To address the vulnerability, we propose a countermeasure using the concepts of message blinding and secret sharing. This countermeasure style is called masking [8], which is an algorithm-level defense that can produce resilient designs independent of the implementation technology [9]. We tuned this countermeasure for neural networks in a cost-effective way and complement it with other techniques.

The main contributions of the paper include the following:
• We demonstrate attacks that can extract the secret weights of a BNN in a highly-parallelized hardware implementation.
• We formulate and implement the first power-based side-channel countermeasures for neural networks by adapting masking to the case of neural networks. This process reveals new challenges and solutions to mask unique neural network computations that do not occur in the cryptographic domain.
• We validate both the insecurity of the baseline design and the security of the masked design using power measurements of an actual FPGA hardware, and quantify the overheads of the proposed countermeasure.

We note that while there is prior work on theoretical attacks [10], [11], [12], [13], [14] and digital side-channels [15], [16], [17], [18] of neural networks, their physical side-channels are largely unexplored. Such research is needed because physical side-channels are orthogonal to these threats, fundamental to the Complementary Metal Oxide Semiconductor (CMOS) technology, and require extensive countermeasures as we have learned from the research on cryptographic implementations. A white paper was recently published on model extraction via physical side-channels [19] (after we submitted our work to HOST'20, that paper was published at USENIX'19 [20]). That work does not study mitigation techniques and focuses on 8-bit/32-bit microcontrollers. We further analyze attacks on parallelized hardware accelerators and investigate the first countermeasures.

II. THREAT MODEL AND RELATION TO PRIOR WORK
This work follows the typical DPA threat model [21]. The adversary has physical access to the target device or has a remote monitor [22], [23], [24] and obtains power measurements while the device processes secret information. We assume that the security of the system is not based on the secrecy of the software or hardware design. This includes the details of the neural network algorithm and its hardware implementation such as the data flow, parallelization and pipelining—in practice, those details are typically public but what remains a secret is the model parameters obtained via training.

Fig. 2. The adversary's goal is to extract the secret model parameters on the IoT/edge device during inference using side-channels.
For unknown implementations, the adversary can use prior techniques to locate certain operations, which work in the context of physical [25], [26] and digital side-channels [27], [28], covering both hardware and software realizations. The aspect of reverse engineering logic functions from the bitstream [29], [30] is independent of our case and is generic to any given system.

Fig. 2 outlines the system we consider. The adversary in our model targets the ML inference with the objective of learning the secret model parameters.
This is different than attacks on the training set [31] or the data privacy problem during inference [32]. We assume the training phase is trusted but the obtained model is then deployed to operate in an untrusted environment. Our attack is similar to a known-plaintext (input) attack and does not require knowing the inference output or confidence score, making it more powerful than theoretical ML extraction attacks [10], [11], [12], [13], [14].

Since edge/IoT devices are the primary target of DPA attacks (due to easy physical access), we focus on BNNs that are suitable for such constrained devices [6]. A BNN also allows realizing the entire neural network on the FPGA without having external memory access. Therefore, memory access pattern side-channel attacks on the network [15] cannot be mounted. We furthermore consider digitally-hardened accelerator designs that execute in constant time and constant flow with no shared resources, disabling timing-based or other control-flow identification attacks [16], [17], [18]. This makes the attacks we consider more potent than prior work.

III. BNN AND THE TARGET IMPLEMENTATION
The following subsections give a brief introduction to BNNs and discuss the details of the target hardware implementation.
A. Neural Network Classifiers
Neural networks consist of layers of neurons that take in an input vector and ultimately make a decision, e.g., for a classification problem. Neurons at each layer may have a different use, implementing linear or non-linear functions to make decisions, applying filters to process the data, or selecting specific patterns. Inspired by the human nervous system, the neurons in a neural network transmit information from one layer to the other, typically in a feed-forward manner.

Fig. 3. A simple neural network and a single neuron's function.

Fig. 3 shows a simple network with two fully-connected hidden layers and depicts the function of a neuron. In a feed-forward neural network, each neuron takes the results from its previous layer connections, computes a weighted summation (row-reduction), adds a certain bias, and finally applies a non-linear transformation to compute its activation output. The resulting activation value is used by the next layer's connected neurons in sequence. The connections between neurons can be strong, weak, or non-existent—the strength of these connections is called weight, which is a critical parameter for the network. The entire neural network model can be represented with these parameters, and with hyperparameters that are high-level parameters such as the number or type of the layers.

Neural networks have two phases: training and inference. During training, the network self-tunes its parameters for the specific classification problem at hand. This is achieved by feeding pre-classified inputs to the network together with their classification results, and by allowing the network to converge into acceptable parameters (based on some conditions) that can compute correct output values. During inference, the network uses those parameters to classify new (unlabeled) inputs.
B. BNNs
A BNN works with binary weights and activation values. This is our starting point as the implementations of such networks have similarities with the implementations of block ciphers. A BNN reduces the memory size and converts a floating point multiplication to a single-bit XNOR operation in the inference [33]. Therefore, such networks are suitable for constrained IoT nodes where some of the detection accuracy can be traded for efficiency. Several variants of this low-cost approach exist to build neural networks with reasonably high accuracy [33], [34], [6].

The following is the mathematical equation (1) for a typical neuron:

$a = f\left(\sum_i w_i x_i + b\right)$  (1)

where $a$ is the activation value, $w_i$ is the weight, $x_i$ is the activation of the previous layer and $b$ is the bias value for the node. $w_i$, $x_i$, and $a$ have binary values of 0 and 1, respectively representing the actual values of -1 and +1. The function $f(x)$ is the non-linear activation function (2), which in our case is defined as follows:

$f(x) = \begin{cases} 0, & \text{for } x \le 0 \\ 1, & \text{for } x > 0 \end{cases}$  (2)

Equations (1) and (2) show that the computation involves a summation of weighted products with binary weights, a bias offset, and an eventual binarization.

Fig. 4. Overview of the unprotected neural network inference. The adder tree first computes the weighted sum of input pixels. The activation function then binarizes the sum used by the next layer. Finally, the output layer returns the classification result by computing a maximum of the last layer activations.

We build the BNN inference hardware which performs Modified National Institute of Standards and Technology (MNIST) classification (handwritten digit recognition from 28-by-28 pixel images) using 3 feed-forward, fully-connected hidden layers of 1024 neurons. The implementation computes up to 1024 additions (i.e., the entire activation value of a neuron) in parallel.
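For concreteness, the following Python sketch (our illustration, not the paper's hardware) evaluates one fully-connected BNN layer following Eqs. (1) and (2); bits {0, 1} encode the values {-1, +1}, the first layer takes 8-bit pixels, and later layers take re-encoded binary activations:

```python
# Illustrative model of Eqs. (1)-(2), not the actual hardware: one
# fully-connected binarized layer. Bits {0,1} encode the values {-1,+1}.
import numpy as np

def bnn_layer(inputs, w_bits, biases):
    """inputs: signed int vector (pixels or +/-1 activations);
    w_bits: (n_out, n_in) 0/1 weight matrix; biases: integer vector."""
    w = 2 * w_bits.astype(np.int32) - 1            # map {0,1} -> {-1,+1}
    sums = w @ inputs.astype(np.int32) + biases    # weighted sum + bias, Eq. (1)
    return (sums > 0).astype(np.uint8)             # sign activation, Eq. (2)

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, 784)                 # a flattened 28x28 image
w1 = rng.integers(0, 2, (1024, 784), dtype=np.uint8)
b1 = rng.integers(-8, 9, 1024)
a1 = bnn_layer(pixels, w1, b1)                     # 1024 binary activations
a1_pm1 = 2 * a1.astype(np.int32) - 1               # re-encode for the next layer
```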
C. Unprotected Hardware Design

Fig. 4 illustrates the sequence of operations in the neural network. One fully-connected layer has two main steps: (1) calculating the weighted sums using an adder tree and (2) applying the non-linear activation function f. To classify the image, the hardware sends out the node number with the maximum sum in the output layer.

The input image pixels are first buffered or negated based on whether the weight is 1 or 0, respectively, and then fed to the adder tree shown in Fig. 5. We implemented a fully-pipelined adder tree of depth 10 as the hardware needs up to 1024 parallel additions to compute the sum. The activation function binarizes the final sum. After storing all the first layer activations, the hardware computes the second layer activations using the first layer activations. The output layer consists of 10 nodes, each representing a classification class (0-9). The node index with the maximum confidence score becomes the output of the neural network.

The hardware reuses the same adder tree for each layer's computation, similar to a prior architecture [35]. Hence, the hardware has a throughput of approximately 3000 cycles per image classification. The reuse is not directly feasible as the adder tree (Fig. 5) can only support 784 inputs (of 8 bits) but it receives 1024 outputs from each of the hidden layers. Therefore, the hardware converts the 1024 1-bit outputs from each hidden layer to 512 2-bit outputs using LUTs. These LUTs take the weights and activation values as input, and produce the corresponding 512 2-bit sums, which is within the limits of the adder tree. Adding the bias and applying the batch normalization are integrated into the adder tree computations. We adopt the batch-normalization-free approach [36]; the bias values are thus integers, unlike the binary weights.

Fig. 5. Adder tree used in the HW implementation. The figure shows the scenario where the 2nd stage registers (red) are targeted for DPA. This results in 16 possible key guesses corresponding to the 4 input pixels involved in the computation of each second stage register, grouped by the dotted blue line.
IV. AN EXAMPLE OF DPA ON BNN HARDWARE
This section describes the attack we performed on the BNN hardware implementation that is able to extract secret weights.

To carry out a power-based side-channel attack, the adversary has to primarily focus on the switching activity of the registers, as they have a significant power consumption compared to combinational logic (especially in FPGAs). The pipeline registers of the adder tree store the intermediate summations of the products of weights and input pixels. Therefore, the value in these registers is directly correlated to the secret—model weights in our case.

Fig. 5 shows an example attack. 4 possible values can be loaded in the output register [0] of stage-1: −x[0]−x[1], −x[0]+x[1], x[0]−x[1] and x[0]+x[1], corresponding to the weights of (0,0), (0,1), (1,0) and (1,1), respectively. These values will directly affect the computation and the corresponding result stored in stage-2 registers. Therefore, a DPA attack with known inputs (x_i) on stage-2 registers (storing w_i x_i accumulations) can reveal 4 bits of the secret weights (w_i). The attack can target any stage of the adder tree but the number of possible weight combinations grows exponentially with depth.

Since the adder tree is pipelined, we need to create a model based on the hamming distance of the previous cycle and current cycle summations in each register. To aid the attack, we developed a cycle-accurate hamming-distance simulator for the adder pipeline. It first computes the value in each register every cycle for all possible weight combinations given a fixed input. Next, it calculates the hamming distance of individual registers for each cycle using the previous and current cycle values. Finally, it adds the hamming distances of all the registers for a cycle to model the overall power dissipation for that cycle.

Fig. 6 illustrates the result of the attack on stage-2 registers. There is a strong correlation between the correct key guess and the power measurements, crossing the 99.99% confidence threshold after 45k measurements.

Fig. 6. Pearson correlation coefficient versus time and number of traces for DPA on weights. The lower plot shows a high correlation peak at the time of the target computation, for the correct weight guess denoted in green. The upper plot shows that approximately 40k traces are needed to get a correlation of 99.99% for the correct guess. The confidence intervals are shown in dotted lines. The blue plot denotes the 2's complement of the correct weight guess.

The false positive leak is due to signed multiplication and is caused by the additive inverse of the correct key, which is expected and thus does not affect the attack. Using this approach, the attacker can successively extract the value of weights and biases for all the nodes in all the layers, starting from the first node and layer. The bias, in our design, is added after computing the final sum in the 10th stage, before sending the result to the activation function. Therefore the adversary can attack this addition operation by creating a hypothesis for the bias. Another way to extract the bias is by attacking the activation function output, since the sign of the output correlates to the bias.
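The attack flow above can be condensed into a short software model. The sketch below (hypothetical helper names, ours) renders the hamming-distance CPA on one stage-2 register, simplified by assuming the targeted register starts from zero rather than from the previous cycle's value as in our full simulator:

```python
# Simplified sketch of the hamming-distance CPA on one stage-2 register.
import numpy as np

def predicted_hd(pixels, guess):
    """Hamming-distance hypothesis for one stage-2 register holding the
    signed sum of 4 pixels under a 4-bit weight guess (1 -> +1, 0 -> -1)."""
    w = [1 if (guess >> i) & 1 else -1 for i in range(4)]
    val = sum(wi * int(p) for wi, p in zip(w, pixels[:4]))
    return bin(val & 0xFFF).count("1")     # 12-bit register, assumed cleared

def cpa_recover_weights(traces, inputs):
    """traces: (n, samples) power traces; inputs: n known pixel vectors."""
    best_guess, best_peak = None, 0.0
    for guess in range(16):
        h = np.array([predicted_hd(x, guess) for x in inputs], float)
        hc = h - h.mean()
        tc = traces - traces.mean(axis=0)
        # Pearson correlation of the hypothesis with every time sample
        r = hc @ tc / (np.linalg.norm(hc) * np.linalg.norm(tc, axis=0))
        peak = np.max(np.abs(r))
        if peak > best_peak:
            best_guess, best_peak = guess, peak
    return best_guess, best_peak           # correct guess shows a distinct peak
```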
V. SIDE-CHANNEL COUNTERMEASURES

This section presents our novel countermeasure against side-channel attacks. The development of the countermeasure highlights unique challenges that arise for masking neural networks and describes the implementation of the entire neural network inference.

Masking works by making all intermediate computations independent of the secret key—i.e., rather than preventing the leak, by encumbering the adversary's capability to correlate it with the secret value. The advantage of masking is being implementation agnostic. Thus, it can be applied on any given circuit style (FPGA or ASIC) without manipulating back-end tools, but it requires algorithmic tuning, especially to mask unique non-linear computations.

The main idea of masking is to split the inputs of all key-dependent computations into two randomized shares: a one-time random mask and a one-time randomized masked value. These shares are then independently processed and are reconstituted at the final step when the final output is generated. This effectively thwarts first-order side-channel attacks probing a single intermediate computation. Higher-order attacks probing multiple computations [37]—masks and masked computations—can be further mitigated by splitting the inputs of key-dependent operations into more shares [38]. Our implementation is designed to be first-order secure but can likewise be extended for higher-order attacks.

Fig. 4 highlights that a typical neural network inference can be split into 3 distinct types of operations: (1) the adder tree computations, (2) the activation function and (3) the output layer max function. Our goal is to mask these functions to increase resilience against power side-channel attacks. These specific functions are unique to a neural network inference. Hence, we aim to construct novel masked architectures for them using the lessons learned from cryptographic side-channel research. We will explain our approach in a bottom-up fashion by describing the masking of individual components first, and the entire hardware architecture next.
A. Masking the Adder Tree
Using the approach in Fig. 5, the adversary can attack any stage of the adder tree to extract the secret weights. Therefore, the countermeasure has to break the correlation between the summations generated at each stage and the secret weights. We use the technique of message blinding to mask the input of the adder tree.

Blinding is a technique where the inputs are randomized before being sent to the target unit for computation. This prevents the adversary from knowing the actual inputs being processed, which is usually the basis for known-plaintext power-based side-channel attacks. Fig. 7 shows our approach, which uses this concept by splitting each of the 784 input pixels a_i into two arithmetic shares r_i and a_i − r_i, where each r_i is a unique 8-bit random number. These two shares are therefore independent of the input pixel value, as r_i is a fresh random number never reused again for the same node. The adder tree can operate on each share individually due to additive homomorphism—it generates the two final summations for each branch such that their combination (i.e., addition) will give the original sum. Since the adder tree is reused for all layers, the hardware simply repeats the masking process for subsequent hidden layers using fresh randomness in each layer.
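A minimal sketch of this blinding step (our illustration, not the RTL) shows the additive homomorphism that lets the two branches be summed independently and recombined only at the end:

```python
# Minimal model of the arithmetic blinding: each pixel a_i is split into
# fresh shares r_i and (a_i - r_i); the adder tree runs once per branch
# and only the final addition of the two branch sums reveals the total.
import numpy as np

rng = np.random.default_rng()

def split_shares(pixels):
    r = rng.integers(-128, 128, size=pixels.shape)   # fresh mask per pixel
    return r, pixels - r                             # shares r_i, a_i - r_i

def adder_tree_branch(values, weights_pm1):
    return int(np.dot(weights_pm1, values))          # one weighted-sum pass

pixels = rng.integers(0, 256, 784)
weights = 2 * rng.integers(0, 2, 784) - 1            # binary weights as +/-1
r, masked = split_shares(pixels)
total = adder_tree_branch(r, weights) + adder_tree_branch(masked, weights)
assert total == int(np.dot(weights, pixels))         # recombined sum is exact
```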
1) A Unique and Fundamental Challenge for Arithmetic Masking of Neural Networks:
The arithmetic masking extension to the adder tree is unfortunately non-trivial due to differences in the fundamental assumptions. Arithmetic masking aims at decorrelating a variable x by splitting it into two statistically independent shares: r and (x − r) mod k. The modulo operation in (x − r) mod k exists in most cryptographic implementations because most of them are based on finite fields. In a neural network, however, subtraction means computing the actual difference without any modulus. This introduces the notion of sign in numbers, which is absent in modulo arithmetic, and is the source of the problem.

Consider two uniformly distributed 8-bit unsigned numbers a and r. In a modulo subtraction, the result will be (a − r) mod 256, which is again an 8-bit unsigned number lying between 0 and 255. In an actual subtraction, however, the result will be (a − r), which is a 9-bit number with the MSB being the sign bit.

Fig. 7. Masking of the adder tree. Each input pixel depicted in orange is split into two arithmetic shares depicted by the green and blue nodes with unique random numbers (r_i's). The masked adder tree computes the branches sequentially.

TABLE I
PROBABILITY OF a − r BEING POSITIVE OR NEGATIVE

Scenario             Positive   Negative
a > 128 & r > 128    50%        50%
a > 128 & r < 128    100%       0%
a < 128 & r > 128    0%         100%
a < 128 & r < 128    50%        50%
Table I lists the four possible scenarios of arithmetic masking based on the magnitudes of the two unsigned 8-bit shares. In a perfect masking scheme, the probability of a − r being either positive or negative should be 50%, irrespective of the magnitude of the input a. Let's consider the case when a > 128, which has a probability of 50%. If r < 128, which also has a 50% probability, the resulting difference a − r is always positive. Else if r > 128, the value a − r can be positive or negative with equal probabilities due to the uniform distribution. Therefore, given a > 128, the probability of the arithmetic mask being positive is (50+25)% = 75% and of being negative is 25%. Table I lists the other case, when a < 128, which results in a similar correlation between a and a − r. This shows a clear information leak through the sign bit of arithmetic masks.

The discussed vulnerability does not happen in modulo arithmetic as there is no sign bit; the modulo operation wraps the result around if it is out of bounds, to obey the closure property. Evaluating the correlation of (a + r) instead of (a − r) yields similar results. Likewise, shifting the range of r based on a, to uniformly distribute (a − r) between -128 and 127, would not resolve the problem and further introduces a bias in both shares.
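The bias in Table I is easy to reproduce empirically; the short Monte-Carlo check below (our sketch) shows the sign of a − r tracking the magnitude of a, and the absence of such a bias under modulo-256 subtraction:

```python
# Monte-Carlo check of Table I: without a modulus, the sign of a - r is
# biased by a; with modulo-256 subtraction the MSB is uniform.
import numpy as np

rng = np.random.default_rng(1)
a = rng.integers(0, 256, 1_000_000)
r = rng.integers(0, 256, 1_000_000)

neg = (a - r) < 0
print(neg[a >= 128].mean())          # ~0.25: mostly positive when a is large
print(neg[a < 128].mean())           # ~0.75: mostly negative when a is small

msb = ((a - r) % 256) >= 128         # modular subtraction wraps around
print(msb[a >= 128].mean(), msb[a < 128].mean())   # both ~0.5: no leak
```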
2) Addressing the Vulnerability with Hiding:
The arithmetic masking scheme can be augmented to decorrelate the sign bit from the input. We used hiding to address this problem, and we used hiding just for the sign bit computation. Hiding techniques target a constant power consumption, irrespective of the inputs, which makes it harder for an attacker to correlate the intermediate variables. Power-equalized building blocks using techniques like Wave Differential Dynamic Logic (WDDL) [39] can achieve close to a constant power consumption to mitigate the vulnerability.

The differential part of WDDL circuits aims to make the power consumption constant throughout the operation, by generating the complementary outputs of each gate along with the original outputs. Differential logic makes it difficult for an attacker to distinguish between a 0 → 1 and a 1 → 0 transition; however, an attacker can still distinguish between a 0 → 0 and a 0 → 1 transition, or a 1 → 1 and a 1 → 0 transition. Therefore, differential logic alone is still susceptible to side-channel leakages, as the power activity is easily correlated to the input switching pattern. This vulnerability is reduced by dynamic logic, where all the gates are pre-charged to 0 before the actual computation.

We use WDDL gates to solve our problem of sign bit leakages, by modifying the adders to compute the sign bit in WDDL style. The following is the equation of the addition, when two 8-bit signed numbers $a$ and $b$, represented as $(a_7 a_6 \cdots a_0)$ and $(b_7 b_6 \cdots b_0)$, are added to give a 9-bit signed sum $s$ represented by $(s_8 s_7 \cdots s_0)$:

$s = a + b$  (3)

After sign-extending $a$ and $b$,

$\{s_8 s_7 s_6 \cdots s_0\} = \{a_7 a_7 a_6 \cdots a_0\} + \{b_7 b_7 b_6 \cdots b_0\}$  (4)

Performing a regular addition on the lower 8 bits of $a$ and $b$, and generating a carry $c$, the equation of $s_8$ becomes

$s_8 = a_7 \oplus b_7 \oplus c$  (5)

Expanding the above expression in terms of AND, OR and NOT operators results in:

$s_8 = (a_7 \cdot b_7 \cdot c) \mid (a_7 \cdot \bar{b}_7 \cdot \bar{c}) \mid (\bar{a}_7 \cdot b_7 \cdot \bar{c}) \mid (\bar{a}_7 \cdot \bar{b}_7 \cdot c)$  (6)

Representing the expression only in terms of NAND, so that we can replace all the NANDs by WDDL NAND gates, reveals:

$s_8 = \overline{\overline{(a_7 \cdot b_7 \cdot c)} \cdot \overline{(a_7 \cdot \bar{b}_7 \cdot \bar{c})} \cdot \overline{(\bar{a}_7 \cdot b_7 \cdot \bar{c})} \cdot \overline{(\bar{a}_7 \cdot \bar{b}_7 \cdot c)}}$  (7)

Fig. 8 depicts the circuit diagram for the above implementation. The WDDL technique is applied to the MSB computation by replacing each NAND function in Eq. (7) with WDDL NAND gates. The pipeline registers of the adder tree are replaced by Simple Dynamic Differential Logic (SDDL) registers [39]. Each WDDL adder outputs the actual sum $s$ and the complement of its MSB $s'_8$, which go as input to the WDDL adder in the next stage of the pipelined tree. Therefore, we construct a resilient adder tree mitigating the leakage in the sign bit.
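Equations (3)-(7) can be machine-checked; the exhaustive sketch below (ours) verifies that the four-NAND decomposition of the sign bit matches the MSB of the true 9-bit signed sum for all operand pairs:

```python
# Exhaustive check of Eqs. (3)-(7): s8 = a7 XOR b7 XOR c, expanded into the
# four 3-input NANDs (plus one 4-input NAND) that become WDDL gates.
def nand(*bits):
    return 0 if all(bits) else 1

def msb_via_nands(a_byte, b_byte):
    a7, b7 = (a_byte >> 7) & 1, (b_byte >> 7) & 1
    c = (a_byte + b_byte) >> 8                 # carry out of the low 8 bits
    na7, nb7, nc = 1 - a7, 1 - b7, 1 - c
    return nand(nand(a7, b7, c), nand(a7, nb7, nc),
                nand(na7, b7, nc), nand(na7, nb7, c))   # Eq. (7)

for a in range(-128, 128):
    for b in range(-128, 128):
        s = a + b                              # true 9-bit signed sum
        assert msb_via_nands(a & 0xFF, b & 0xFF) == (1 if s < 0 else 0)
```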
B. Masking the Activation Function

The binary sign function (Eq. 2) is the activation function of the BNN. This function generates +1 if the weighted sum is positive, and -1 if the sum is negative. In the unmasked implementation, the sign function receives the weighted sum of the 784 original input pixels, whereas in the masked implementation, it receives the two weighted sums corresponding to each masked share. So, the masked function has to compute the sign of the sum of two shares without actually adding them. Using the fact that the sign only depends on the MSB of the final sum, we propose a novel masked sign function that sequentially computes and propagates the masked carry bits in a ripple-carry fashion.
Fig. 8. Circuit diagram of the proposed adder with the MSB computed in WDDL style as described in Eqs. (3)-(7). Each of the 784 arithmetic shares (a, b, ...) are fed to these adders. All the bits except the MSB undergo a regular addition. The MSBs of the two operands along with the generated carry are fed to the Differential MSB Logic block, which computes the MSB and its complement by replacing the NAND gates in Eq. (7) by WDDL gates. The pipeline registers in the tree are replaced by SDDL registers. The NOR gates generate the pre-charge wave at the start of the logic cones. The regular and the WDDL-specific blocks are depicted in blue and green respectively.

Fig. 9 shows the details of our proposed masked sign function hardware. This circuit generates the first masked carry using a Look-up-Table (LUT) that takes in the LSB of both shares and a fresh random bit (r_i) to ensure the randomization of the intermediate state, similar in style to prior works on masked LUT designs [40]. The LUT computes the masked function with the random input and generates two outputs: one is the bypass of the random value (r_i) and the other is the masked output (r_i ⊕ f(x)) where f(x) is the carry output. The entire LUT function for each output can fit into a single LUT to reduce the effects of glitches [9]. To further reduce the impact of glitches, the hardware stores each LUT output in a flip-flop. These are validated empirically in Section VI.

The outputs of an LUT are sent to the next LUT in the chain and the next masked carry is computed accordingly. From the second LUT and onward, each LUT also has to take the masked carry and mask value generated by the prior step. The output r_o is simply the input r_i like a forward bypass, because the mask value is also needed for the next computation. This way the circuit processes all the bits of the shares and finally outputs the last carry bit, which decides the sign of the sum. Each LUT computation is masked by a fresh random number. More efficient designs may also be possible using a masked Kogge-Stone adder (e.g., by modifying the ones described in [41]).

Fig. 9 illustrates that the first LUT is a 3-bit-input, 2-bit-output LUT because there is no carry-in for the LSB, and all the subsequent LUTs have 5-bit inputs and 2-bit outputs since they also need the previous stage outputs as their inputs. After the final carry is generated, which is also the sign bit of the sum, the hardware can binarize the sum to 1 or 0 based on whether the sign bit is 0 or 1, respectively. This is the functionality of the final LUT, which is different from the usual masked carry generators in the earlier stages.

Fig. 9. Hardware design of the masked binarizer. It comprises a chain of LUTs (lut0-lut18) denoted in blue, computing the carry in ripple-carry fashion. Each LUT is masked by a fresh random number (r_i). The whole design is fully pipelined to maintain the original throughput by adding flip-flops (in green) at each stage.

The circuit has 19 LUTs in serial; each LUT output is registered for timing and side-channel resilience against glitches. This design, however, adds a latency of 19 cycles to compute each activation value, increasing the original latency. Therefore, instead of streaming each of the 19 bits on the top row of LUTs sequentially in Fig. 9, the entire 19-bit sum is registered in the first stage, and each bit is used sequentially throughout the 19 cycles.
This avoids the 19-cycle wait time for consecutive sums and brings the throughput back to 1 activation per cycle.
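The following Python model (hypothetical names, ours) captures the functional behavior of the masked binarizer: the carry ripples through the stages, each re-masked with a fresh random bit, so no register ever holds an unmasked carry; in hardware each stage is a single masked LUT, so the raw carry appears only inside the LUT's truth table:

```python
# Functional model of the masked binarizer: sign of x1 + x2 computed
# bit-serially with every stored carry XOR-masked by fresh randomness.
import secrets

def masked_sign_bin(x1, x2, bits=19):
    """Return 1 if the 19-bit two's-complement sum x1 + x2 has sign bit 0,
    else 0, mirroring the final binarizer LUT."""
    mask = (1 << bits) - 1
    x1, x2 = x1 & mask, x2 & mask
    c_m, m = 0, 0                        # masked carry and its mask
    for i in range(bits - 1):            # carries up to the MSB position
        a, b = (x1 >> i) & 1, (x2 >> i) & 1
        r = secrets.randbits(1)          # fresh random bit per LUT stage
        # in hardware this whole expression is one masked LUT, so the
        # unmasked carry (c_m ^ m) never reaches a register
        carry = (a & b) | ((a ^ b) & (c_m ^ m))
        c_m, m = carry ^ r, r            # store only the re-masked carry
    a, b = (x1 >> (bits - 1)) & 1, (x2 >> (bits - 1)) & 1
    sign = a ^ b ^ c_m ^ m               # MSB of the two's-complement sum
    return 0 if sign else 1              # final LUT: non-negative -> 1

# quick check against an unmasked reference
for _ in range(10_000):
    s = secrets.randbits(16) - (1 << 15)     # a plausible weighted sum
    r = secrets.randbits(18) - (1 << 17)     # arbitrary arithmetic mask
    assert masked_sign_bin(s - r, r) == (1 if s >= 0 else 0)
```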
C. Boolean to Arithmetic Share Conversion
Each layer generates 1024 pairs of Boolean shares, which requires two changes in the hardware. First, the adder tree supports 784 inputs, so it cannot directly process 1024 shares. Second, the activation values are in the form of two Boolean shares while the masking of the adder tree requires arithmetic shares, as discussed in Section V-A. Using the same strategy as the unmasked design, the hardware adds the 1024 1-bit shares pairwise to produce 512 2-bit shares before sending them to the adder tree. To resolve the Boolean to arithmetic conversion, the hardware can generate $R$ such that

$x = x_1 \oplus x_2$  (8)
$x = R + x_3$  (9)

Using masked LUTs, the hardware performs the signed addition of 1024 shares to 512 shares, and it also produces the arithmetic shares. The LUTs take in two consecutive activation values already multiplied by the corresponding weights, and a 2-bit signed random number, to generate the arithmetic shares. Since multiplication in a binary neural network translates to an XNOR operation [33], the hardware performs an XNOR operation using the activation value with its corresponding weight before sending it to the LUT. Since the activation value is in the form of 2 Boolean shares, the hardware performs the XNOR only on one of the shares, as formulated below (with $\odot$ denoting XNOR):

$a \odot w = b$  (10)
$a = a_1 \oplus a_2$  (11)
$(a_1 \oplus a_2) \odot w = (a_1 \odot w) \oplus a_2$  (12)

The LUTs have five inputs: two shares that are not XNORed, two shares that are XNORed, and a 2-bit signed random number. If the actual sum of the two consecutive nodes is $a_i$, then the LUT outputs $r_i$ and $a_i − r_i$. $r_i$ ranges from -2 to +1 since it is a 2-bit signed number, and the weighted sum of two nodes ranges from -2 to +2. Therefore, $a_i − r_i$ can range from -3 to +4 and should be 4 bits wide. The hardware has 512 LUTs that convert the 1024 pairs of Boolean shares to 512 pairs of arithmetic shares. After the conversion, the hardware reuses the same adder tree masking that was described in Section V-A. The arithmetic shares have a leakage in the MSB as discussed in Subsection V-A1, but because the same WDDL-style adder tree is reused, this is addressed for all layers.
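A software rendition of one conversion LUT (hypothetical helper names, ours) ties Eqs. (8)-(12) together; the Boolean shares are recombined only inside the table lookup, which also folds in the weight XNOR and the pairwise signed addition:

```python
# Model of one 5-input Boolean-to-arithmetic LUT: weight XNOR (Eq. 12),
# pairwise signed addition of two nodes, and arithmetic re-sharing with a
# fresh 2-bit signed random number, all inside a single lookup.
import secrets

def b2a_lut(a_shares, w_bits, r):
    """a_shares: Boolean share pairs of two activations; w_bits: their
    weights; r: signed 2-bit random in [-2, 1]. Returns (r, s - r)."""
    s = 0
    for (s1, s2), w in zip(a_shares, w_bits):
        xnor = 1 ^ (s1 ^ s2) ^ w          # recombined only inside the LUT
        s += 1 if xnor else -1            # XNOR multiply in {-1, +1}
    return r, s - r                       # arithmetic shares; s-r in [-3, +4]

r = secrets.randbits(2) - 2               # uniform over [-2, 1]
sh0, sh1 = b2a_lut([(1, 0), (1, 1)], [1, 1], r)
assert sh0 + sh1 == 0                     # activations +1 and -1 sum to 0
```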
D. Output Layer

In the output layer, for an unmasked design, the index of the node with the maximum confidence score is generated as the output of the neural network. In the masked case, however, the confidence values are split into two arithmetic shares, which by definition cannot be combined. Equations (13)-(15) formulate the masking operations of the output layer. Basically, we check if the sum of two numbers is greater than the sum of another two numbers, without looking at the terms of each sum at the same time. Therefore, instead of adding the two shares of the confidence values and comparing them, we subtract one share of a confidence value from another share of the other confidence value. This still solves the inequality, but looks at the shares of two different confidence scores.

$a_1 + a_2 \ge b_1 + b_2$  (13)
$a_1 - b_2 \ge b_1 - a_2$  (14)
$(a_1 - b_2) + (a_2 - b_1) \ge 0$  (15)

This simplifies the original problem to the previous problem of finding the sign of the sum of two numbers without combining them. Hence, in the final layer computation, the hardware reuses the masked carry generator explained in Section V-B.
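Reusing the masked sign routine sketched in Section V-B, the masked comparison and the resulting argmax can be modeled as follows (our sketch; masked_sign_bin is the function from the earlier snippet):

```python
# Masked comparison per Eqs. (13)-(15): shares of two different confidence
# scores are mixed, so no single operand reveals a score on its own.
def masked_geq(a1, a2, b1, b2):
    # 1 iff (a1 - b2) + (a2 - b1) >= 0, i.e. a1 + a2 >= b1 + b2
    return masked_sign_bin(a1 - b2, a2 - b1)

def masked_argmax(score_shares):
    """score_shares: list of (s1, s2) arithmetic share pairs, one per node."""
    best = 0
    for i in range(1, len(score_shares)):
        a1, a2 = score_shares[i]
        b1, b2 = score_shares[best]
        if masked_geq(a1, a2, b1, b2):
            best = i
    return best                           # classification output (0-9)
```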
E. The Entire Inference Engine

Fig. 10 illustrates all the components of the masked neural network. First, the network arithmetically masks the input pixel a_i using a fresh r_i generated by the PRNG. Next, the WDDL-style adder tree processes each of the masks (r_i) and the masked values (a_i − r_i) in two sequential phases. The demultiplexer at the adder tree output helps to buffer the first-phase final summations, and passes the second-phase summations to the masked activation function directly. The masked activation function produces the two Boolean shares of the actual activation using fresh randomness from the PRNG. The network XNORs one share with the weights and sends the second share directly to the Boolean to arithmetic converters. The converters produce 512 arithmetic shares from the 1024 Boolean shares using random numbers generated by the PRNG. The hardware feeds the arithmetic shares a_i − r_i and r_i to the adder tree and repeats the whole process for each layer. Finally, the output layer reuses the masked activation function to find the node with the maximum confidence from the arithmetic shares of the confidence values and computes the classification result output as described in Subsection V-D.

Fig. 10. Components of the masked BNN. Blocks in green represent the proposed masking blocks.

VI. LEAKAGE AND OVERHEAD EVALUATION
This section describes the measurement setup for our experiments, the evaluation methodology used to validate the security of the unprotected and protected implementations, and the corresponding results with overhead quantification.
A. Hardware Setup
Our evaluation setup used the SAKURA-X board [42], which includes a Xilinx Kintex-7 (XC7K160T-1FBGC) FPGA for processing and enables measuring the voltage drop on a 10 mΩ shunt resistance while making use of the on-board amplifiers to measure the FPGA power consumption. The clock frequency of the design was 24 MHz. We used the Picoscope 3206D oscilloscope to take measurements with the sampling frequency set to 250 MHz. To amplify the output of the Kintex-7 FPGA, we used a low-noise amplifier provided by Riscure (HD24248) along with the current probe setup. The experiment can also be conducted at a lower sampling rate by increasing the number of measurements [43].

B. Side-Channel Test Methodology
Our leakage evaluation methodology is built on the prior test efforts on cryptographic implementations [44], [25], [40]. We performed DPA on the 4 main operations of the inference engine as stated before, viz. the adder tree, the activation function, the Boolean to arithmetic share conversion, and the output layer max function. Pseudo Random Number Generators (PRNGs) produced the random numbers required for masking—any cryptographically-secure PRNG can be employed to this end.

We first show the first-order DPA weight recovery attacks on the masked implementation with the PRNG disabled. With the PRNG off, the design's security is equivalent to that of an unmasked design. We illustrate that such a design leaked information for all the operations, which ensured that our measurement setup and recovery code were sound. Next, we turned on the PRNG and performed the same attacks, which failed for all the operations. We also performed a second-order attack to validate the number of traces used in the first-order analysis. The power model was based on the hamming distance of registers, generated using our HD simulator for the neural network, and the tests used the Pearson correlation coefficient to compare the measurement data with the hypothesis. We leave it as an open problem to use more advanced tools like profiled (model-less) attacks, MCP-DPA attacks [45] for the masking parts, information theoretic tools (MI/PI/HI) [46], and more advanced high-dimensional attacks/filtering or post-processing.
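For reference, the core statistic of these tests can be sketched in a few lines (ours; the 99.99% threshold follows the standard Fisher z-transform used in DPA practice):

```python
# Pearson-correlation test statistic and its 99.99% confidence threshold.
import numpy as np

def pearson_trace(hypothesis, traces):
    """Correlate a hamming-distance hypothesis with every trace sample."""
    h = hypothesis - hypothesis.mean()
    t = traces - traces.mean(axis=0)
    return h @ t / (np.linalg.norm(h) * np.linalg.norm(t, axis=0))

def corr_threshold(n_traces, z=3.89):     # 3.89 sigma ~ 99.99% (two-sided)
    """|r| above this is significant for n_traces (inverse Fisher transform)."""
    return np.tanh(z / np.sqrt(n_traces - 3))
```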
C. Attacks with PRNG off
The PRNG generates the arithmetic shares for the adder tree and feeds the masked LUTs of the activation function and the Boolean to arithmetic converters. Turning off the PRNG unmasks all these operations, making a first-order attack successful at all these points. Fig. 11 shows the mean power plot on the top for orientation, which is followed below by the 3 attack plots with the PRNG disabled. We attack the second stage of the adder tree, the first node's activation result, and the first node of the output layer, respectively, shown in the next three plots. In all the plots, we observe a distinct correlation peak for the targeted variable corresponding to the correct weight and bias values. Fig. 12 shows the evolution of the correlation coefficient as the number of traces increases. The attack is successful at 200 traces with the PRNG off. This validates our claim on the vulnerability of the baseline, unprotected design.
D. First-order Attacks with PRNG on
We used the same attack vectors from the case of the PRNG off, but with the PRNG turned on this time. This armed all the countermeasures implemented for each operation. Fig. 11 shows that the distinct peaks seen in the PRNG-off plots do not appear in the PRNG-on plots for first-order attacks. Fig. 12 shows that the first-order attack is unsuccessful with 100k traces. This validates our claim on the resiliency of the masked, protected design.
E. Second-order Attacks with PRNG on
Fig. 11. Side-channel evaluation tests. First-order attacks on the unmasked design (PRNG off) show that it leaks information while the masked design (PRNG on) is secure. The second-order attack on the masked design succeeds and validates the number of measurements used in the first-order attacks.

Fig. 12. Evolution of the Pearson coefficient at the point of leak with the number of traces for first-order attacks when the PRNGs are off (left), PRNGs are on (middle), and for second-order attacks with PRNGs on (right). The first-order attack with PRNGs off succeeds around 200 traces but fails when PRNGs are on even with 100k traces, which shows that the design is masked successfully. The second-order attack becomes successful around 3.7k traces, which validates that we used a sufficient number of traces in the first-order attacks.

We also performed a second-order DPA on the activation function to demonstrate that we used a sufficient number of traces in the first-order attack. Again, we used the same attack vectors used in the first-order analysis experiments, but applied the attack on centered-squared traces. Fig. 11 shows that we observed a distinct correlation peak at the correct point in time. Fig. 12 shows the evolution of the correlation coefficient for the second-order attack. We can see that the attack is successful around 3.7k traces, which confirms that 100k traces are sufficient for a first-order attack.
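The only preprocessing change for the second-order test is the centered-square step; a one-line sketch (ours), after which the same Pearson test above applies:

```python
# Second-order preprocessing: center each sample, then square, so that the
# product of the two shares' leakages shows up in a first-order statistic.
def center_and_square(traces):
    return (traces - traces.mean(axis=0)) ** 2
```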
F. Attacks on Hiding
We applied a difference-of-means test with 100k traces to test the vulnerability of the hiding used for the MSB of the arithmetic shares. The partition is thus based on the binary value of the MSB. Fig. 13 shows the attack on the targeted clock cycle and quantifies that after 40k traces the adversary is able to distinguish the two cases with 99.99% confidence. Note that this number is significantly higher than the number of measurements required to succeed for the second-order attack. The number of traces for all the attacks is relatively low due to the very friendly scenario created for the adversary; the platform is low-noise. In a real-life setting, the noise would be much higher and consequently all attacks would require more traces.
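The test can be summarized as below (our sketch); traces are partitioned by the predicted sign-bit value and the gap between the group means is compared against its 99.99% confidence band:

```python
# Difference-of-means test on the WDDL sign bit: partition traces by the
# predicted MSB and flag samples where the mean gap exceeds ~3.89 sigma.
import numpy as np

def dom_test(traces, predicted_msb, z=3.89):
    g0 = traces[predicted_msb == 0]
    g1 = traces[predicted_msb == 1]
    diff = g1.mean(axis=0) - g0.mean(axis=0)
    # standard error of the difference of two independent sample means
    se = np.sqrt(g0.var(axis=0) / len(g0) + g1.var(axis=0) / len(g1))
    return diff, z * se                   # leak where |diff| > z * se
```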
G. Masking Overheads
Table II summarizes the area and latency overheads of masking in our case. As expected, due to the sequential processing of the two shares in the masked implementation, the inference latency is approximately doubled, from 3192 cycles for the baseline to 7248 cycles. Table II also compares the area utilization of the unmasked vs. masked implementations in terms of the various blocks present in the FPGA. The increase in the number of LUTs, flip-flops and BRAMs in the design is approximately 2.7×, 1.7× and 1.3×, respectively. The significant increase in the number of LUTs is mainly due to the masked LUTs used to mask the activation function and convert the Boolean shares of each layer to arithmetic shares. The increase in the flip-flop and BRAM utilization is caused by the additional storage structures of the masked implementation, such as the randomness buffered at the start to mask the various parts of the inference engine. Furthermore, the arithmetic masks are also stored in the first phase, to be sent together to the masked activation function later. Each layer also stores twice the number of activations in the form of two Boolean shares, increasing the memory overhead.

Fig. 13. The difference-of-means test on the WDDL-based sign-bit computation. The figure shows that the difference of means between the power traces corresponding to the MSB=0 and MSB=1 cases crosses the 99.99% confidence interval (represented by the dotted lines) around 40k traces.

TABLE II
AREA AND LATENCY COMPARISON OF UNMASKED VS. MASKED IMPLEMENTATIONS

Design Type   LUT/FF        BRAM/DSP   Cycles
Unmasked      20296/18733   81/0       3192
Masked        55508/33290   111/0      7248
VII. DISCUSSIONS
This section discusses orthogonal aspects together with the limitations of our approach and comments on how they can be complemented to improve our proposed effort.
A. Limitations of the Proposed Defense
Masking is difficult—after 20 years of AES masking, there is still an increasing number of publications (e.g., CHES'19 papers [47], [48], [49]) on better/more efficient masking. This, in part, is due to ever-evolving attacks [50]. The paper's focus is on the empirical evaluation of security. We provide a proof-of-concept which can be extended towards building more robust and efficient solutions. We emphasize the importance of theoretical proofs [51] and the need to conduct further research on adapting them to the machine learning framework.

We have addressed the leakage in the sign bit of the arithmetic share generation of the adder tree through hiding, for cost-effectiveness. This is the only part in our hardware design that is not masked and hence may be vulnerable due to composability issues or implementation imbalances (especially for sophisticated EM attacks [52]). We highlight this issue as an open problem, which may be addressed through extensions of gate-level masking. But such an implementation will incur significant overheads in addition to what we already show.

Our evaluations build on top of model-based approaches, which can be corroborated with more sophisticated attacks such as template-based [53], moments-correlating-based [45], deep-learning-based [54], or horizontal methods [55]. More research is needed to design efficient masking components for neural-network-specific computations, to extend first-order masks to higher-order, and to investigate the security against such side-channel attacks.
B. Comparison of Theoretical Attacks, Digital Side-Channels, and Physical Side-Channels
We argue that a direct comparison of the physical side-channels to digital and theoretical attacks' effectiveness (in terms of the number of queries) is currently unfair due to the immaturity of the model extraction field and due to different target algorithms. Analyzing and countering theoretical attacks improve drastically over time. This has already occurred in cryptography: algorithms have won [56]. Indeed, there has been no major breakthrough in the cryptanalysis of the encryption standards widely used today. But side-channel attacks are still commonplace and are even of growing importance. While digital side-channels are imminent, they are relatively easier to address in application-specific hardware accelerators/IPs that enforce constant-time and constant-flow behavior (as opposed to general-purpose architectures that execute software). For example, the hardware designs we provide in this work have no digital side-channel leaks. Physical side-channels, by contrast, are still observed in such hardware due to their data-dependent, low-level nature, and therefore require more involved mitigation techniques.
C. Scaling to other Neural Networks
The primary objective of this paper is to provide the first proof-of-concept of both power side-channel attacks and defenses of NNs in hardware. To this end, we have designed a neural network that encompasses all the basic features of a binarized neural network, like binarized weights and activations, and the commonly used sign function for non-linearity. When extended to other neural networks/datasets, like CIFAR-10, the proposed defenses will roughly scale linearly with the node and layer count and the bit-precision (size) of the neurons. To deploy the countermeasures on constrained devices, the area overheads can be traded off for throughput, or vice versa. Any algorithm, independent of its complexity, can be attacked with physical side-channels. But the attack success will depend on the parallelization level in hardware. In a sequential design, increasing the weight size (e.g., moving from one bit to 8 bits or floating point) may even improve the attack because there is more signal to correlate.

VIII. CONCLUSION
Physical side-channel leaks in neural networks call for a new line of side-channel analysis research because they open up a new avenue of designing countermeasures tailored for deep learning inference engines. In this paper, we provided the first effort in mitigating the side-channel leaks in neural networks. We primarily apply masking-style techniques and demonstrate new challenges together with opportunities that originate from the unique topological and arithmetic needs of neural networks. Given the variety in neural networks with no existing standard and the apparent, enduring struggle for masking, there is a critical need to heavily invest in securing deep learning frameworks.

IX. ACKNOWLEDGEMENTS
We thank the anonymous reviewers of HOST for their valuable feedback and Itamar Levi for helpful discussions. This project is supported in part by NSF under Grant No. 1850373 and SRC GRC Task 2908.001.
REFERENCES

[1] P. Kocher, J. Jaffe, and B. Jun, "Differential power analysis," in Advances in Cryptology — CRYPTO '99. Springer, 1999, pp. 388–397.
[2] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma, "Adversarial classification," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '04. New York, NY, USA: ACM, 2004, pp. 99–108. [Online]. Available: http://doi.acm.org/10.1145/1014052.1014066
[3] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in 2016 IEEE European Symposium on Security and Privacy (EuroS&P), March 2016, pp. 372–387.
[4] G. Wang, T. Wang, H. Zheng, and B. Y. Zhao, "Man vs. machine: Practical adversarial detection of malicious crowdsourcing workers," in 23rd USENIX Security Symposium (USENIX Security 14), 2014.
[5] I. Kanter, W. Kinzel, and E. Kanter, "Secure exchange of information by synchronization of neural networks," Europhysics Letters (EPL), vol. 57, no. 1, pp. 141–147, Jan 2002. [Online]. Available: https://doi.org/10.1209%2Fepl%2Fi2002-00552-9
[6] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
[7] E. Brier, C. Clavier, and F. Olivier, "Correlation power analysis with a leakage model," in International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 2004, pp. 16–29.
[8] J.-S. Coron and L. Goubin, "On Boolean and arithmetic masking against differential power analysis," in Cryptographic Hardware and Embedded Systems — CHES 2000, Ç. K. Koç and C. Paar, Eds. Berlin, Heidelberg: Springer, 2000, pp. 231–237.
[9] S. Nikova, C. Rechberger, and V. Rijmen, "Threshold implementations against side-channel attacks and glitches," in Information and Communications Security, P. Ning, S. Qing, and N. Li, Eds. Berlin, Heidelberg: Springer, 2006, pp. 529–545.
[10] D. Lowd and C. Meek, "Adversarial learning," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ser. KDD '05. New York, NY, USA: ACM, 2005, pp. 641–647. [Online]. Available: http://doi.acm.org/10.1145/1081870.1081950
[11] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction APIs," in Proceedings of the 25th USENIX Conference on Security Symposium, ser. SEC'16. Berkeley, CA, USA: USENIX Association, 2016, pp. 601–618. [Online]. Available: http://dl.acm.org/citation.cfm?id=3241094.3241142
[12] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, "Practical black-box attacks against machine learning," in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ser. ASIA CCS '17. New York, NY, USA: ACM, 2017, pp. 506–519. [Online]. Available: http://doi.acm.org/10.1145/3052973.3053009
[13] B. Wang and N. Z. Gong, "Stealing hyperparameters in machine learning," in 2018 IEEE Symposium on Security and Privacy (SP), May 2018, pp. 36–52.
[14] M. Juuti, S. Szyller, A. Dmitrenko, S. Marchal, and N. Asokan, "PRADA: Protecting against DNN model stealing attacks," CoRR, vol. abs/1805.02628, 2018. [Online]. Available: http://arxiv.org/abs/1805.02628
[15] W. Hua, Z. Zhang, and G. E. Suh, "Reverse engineering convolutional neural networks through side-channel information leaks," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1–6.
[16] S. Tople, K. Grover, S. Shinde, R. Bhagwan, and R. Ramjee, "Privado: Practical and secure DNN inference," CoRR, vol. abs/1810.00602, 2018. [Online]. Available: http://arxiv.org/abs/1810.00602
[17] M. Yan, C. W. Fletcher, and J. Torrellas, "Cache telepathy: Leveraging shared resource attacks to learn DNN architectures," CoRR, vol. abs/1808.04761, 2018. [Online]. Available: http://arxiv.org/abs/1808.04761
[18] H. Naghibijouybari, A. Neupane, Z. Qian, and N. Abu-Ghazaleh, "Rendered insecure: GPU side channel attacks are practical," in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '18. New York, NY, USA: ACM, 2018, pp. 2139–2153. [Online]. Available: http://doi.acm.org/10.1145/3243734.3243831
[19] L. Batina, S. Bhasin, D. Jap, and S. Picek, "CSI neural network: Using side-channels to recover your artificial neural network information," CoRR, vol. abs/1810.09076, 2018. [Online]. Available: http://arxiv.org/abs/1810.09076
[20] ——, "CSI NN: Reverse engineering of neural network architectures through electromagnetic side channel," in 28th USENIX Security Symposium (USENIX Security 19), 2019.
[21] P. Kocher, J. Jaffe, B. Jun, and P. Rohatgi, "Introduction to differential power analysis," Journal of Cryptographic Engineering, vol. 1, no. 1, pp. 5–27, Apr 2011. [Online]. Available: https://doi.org/10.1007/s13389-011-0006-y
[22] F. Schellenberg, D. R. E. Gnad, A. Moradi, and M. B. Tahoori, "An inside job: Remote power analysis attacks on FPGAs," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2018, pp. 1111–1116.
[23] M. Zhao and G. E. Suh, "FPGA-based remote power side-channel attacks," in 2018 IEEE Symposium on Security and Privacy (SP), 2018, pp. 839–854.
[24] C. Ramesh, S. B. Patil, S. N. Dhanuskodi, G. Provelengios, S. Pillement, D. Holcomb, and R. Tessier, "FPGA side channel attacks without physical access," in International Symposium on Field-Programmable Custom Computing Machines, Boulder, United States, Apr. 2018.
[25] J. Balasch, B. Gierlichs, O. Reparaz, and I. Verbauwhede, "DPA, bitslicing and masking at 1 GHz," in Cryptographic Hardware and Embedded Systems — CHES 2015. Berlin, Heidelberg: Springer, 2015, pp. 599–619.
[26] T. Eisenbarth, C. Paar, and B. Weghenkel, "Building a side channel based disassembler," Berlin, Heidelberg: Springer, 2010, pp. 78–99. [Online]. Available: https://doi.org/10.1007/978-3-642-17499-5_4
[27] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Cross-VM side channels and their use to extract private keys," in Proceedings of the 2012 ACM Conference on Computer and Communications Security, ser. CCS '12. New York, NY, USA: ACM, 2012, pp. 305–316.
[28] M. S. İnci, B. Gulmezoglu, G. Irazoqui, T. Eisenbarth, and B. Sunar, "Cache attacks enable bulk key recovery on the cloud," in Cryptographic Hardware and Embedded Systems — CHES 2016. Berlin, Heidelberg: Springer, 2016, pp. 368–388.
[29] J.-B. Note and É. Rannaud, "From the bitstream to the netlist," in Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, ser. FPGA '08. New York, NY, USA: ACM, 2008, p. 264. [Online]. Available: http://doi.acm.org/10.1145/1344671.1344729
[30] F. Benz, A. Seffrin, and S. A. Huss, "Bil: A tool-chain for bitstream reverse-engineering," in 22nd International Conference on Field Programmable Logic and Applications (FPL), Aug 2012, pp. 735–738.
[31] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in 2017 IEEE Symposium on Security and Privacy (SP), May 2017, pp. 3–18.
[32] L. Wei, B. Luo, Y. Li, Y. Liu, and Q. Xu, "I know what you see: Power side-channel attack on convolutional neural network accelerators," in Proceedings of the 34th Annual Computer Security Applications Conference, ser. ACSAC '18. New York, NY, USA: ACM, 2018, pp. 393–406. [Online]. Available: http://doi.acm.org/10.1145/3274694.3274696
[33] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 525–542.
[34] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 65–74. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021744
[35] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC," in 2016 International Conference on Field-Programmable Technology (FPT), Dec 2016, pp. 77–84.
[36] H. Yonekawa and H. Nakahara, "On-chip memory based binarized convolutional deep neural network applying batch normalization free technique on an FPGA," in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2017, pp. 98–105.
[37] T. S. Messerges, "Using second-order power analysis to attack DPA resistant software," in Cryptographic Hardware and Embedded Systems — CHES 2000, Ç. K. Koç and C. Paar, Eds. Berlin, Heidelberg: Springer, 2000, pp. 238–251.
[38] M.-L. Akkar and L. Goubin, "A generic protection against high-order differential power analysis," in Fast Software Encryption, T. Johansson, Ed. Berlin, Heidelberg: Springer, 2003, pp. 192–205.
[39] K. Tiri and I. Verbauwhede, "A logic level design methodology for a secure DPA resistant ASIC or FPGA implementation," in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, vol. 1, Feb 2004, pp. 246–251.
[40] O. Reparaz, S. Sinha Roy, F. Vercauteren, and I. Verbauwhede, "A masked ring-LWE implementation," in Cryptographic Hardware and Embedded Systems — CHES 2015. Berlin, Heidelberg: Springer, 2015, pp. 683–702.
[41] T. Schneider and A. Moradi, "Arithmetic addition over Boolean masking towards first- and second-order resistance in hardware," in Applied Cryptography and Network Security (ACNS). Springer, 2015.
[42] Y. Hori, T. Katashita, A. Sasaki, and A. Satoh, "SASEBO-GIII: A hardware security evaluation board equipped with a 28-nm FPGA," in The 1st IEEE Global Conference on Consumer Electronics 2012, Oct 2012, pp. 657–660.
[43] D. R. E. Gnad, J. Krautter, and M. B. Tahoori, "Leaky noise: New side-channel attack vectors in mixed-signal IoT devices," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 3, pp. 305–339, May 2019. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/8297
[44] O. Reparaz, R. de Clercq, S. S. Roy, F. Vercauteren, and I. Verbauwhede, "Additively homomorphic ring-LWE masking," in Post-Quantum Cryptography, T. Takagi, Ed. Cham: Springer International Publishing, 2016, pp. 233–244.
[45] A. Moradi and F.-X. Standaert, "Moments-correlating DPA," in Proceedings of the 2016 ACM Workshop on Theory of Implementation Security, ser. TIS '16. New York, NY, USA: ACM, 2016, pp. 5–15. [Online]. Available: http://doi.acm.org/10.1145/2996366.2996369
[46] B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel, "Mutual information analysis," in International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 2008, pp. 426–442.
[47] G. Cassiers and F.-X. Standaert, "Towards globally optimized masking: From low randomness to low noise rate," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 2, pp. 162–198, Feb. 2019. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/7389
[48] T. Sugawara, "3-share threshold implementation of AES S-box without fresh randomness," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 1, pp. 123–145, Nov. 2018. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/7336
[49] L. De Meyer, V. Arribas, S. Nikova, V. Nikov, and V. Rijmen, "M&M: Masks and macs against physical attacks," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 1, pp. 25–50, Nov. 2018. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/7333
[50] I. Levi, D. Bellizia, and F.-X. Standaert, "Reducing a masked implementation's effective security order with setup manipulations," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 2, pp. 293–317, Feb. 2019. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/7393
[51] T. Moos, A. Moradi, T. Schneider, and F.-X. Standaert, "Glitch-resistant masking revisited," IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 256–292, 2019.
[52] V. Immler, R. Specht, and F. Unterstein, "Your rails cannot hide from localized EM: How dual-rail logic fails on FPGAs," in International Conference on Cryptographic Hardware and Embedded Systems. Springer, 2017, pp. 403–424.
[53] S. Chari, J. R. Rao, and P. Rohatgi, "Template attacks," in International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 2002, pp. 13–28.
[54] H. Maghrebi, T. Portigliatti, and E. Prouff, "Breaking cryptographic implementations using deep learning techniques," in International Conference on Security, Privacy, and Applied Cryptography Engineering. Springer, 2016, pp. 3–26.
[55] C. Clavier, B. Feix, G. Gagnerot, M. Roussellet, and V. Verneuil, "Horizontal correlation analysis on exponentiation," in International Conference on Information and Communications Security. Springer, 2010, pp. 46–61.