SEALing Neural Network Models in Secure Deep Learning Accelerators
Pengfei Zuo*†, Yu Hua*, Ling Liang†, Xinfeng Xie†, Xing Hu†, Yuan Xie†
*Huazhong University of Science and Technology
†Scalable Energy-efficient Architecture Lab (SEAL), University of California, Santa Barbara
ABSTRACT
Deep learning (DL) accelerators are increasingly deployed on edge devices to support fast local inferences. However, they suffer from a new security problem, i.e., being vulnerable to physical access based attacks. An adversary can easily obtain the entire neural network (NN) model by physically snooping the GDDR (graphics double data rate) memory bus that connects the accelerator chip with DRAM memory. Therefore, memory encryption becomes important for DL accelerators on edge devices to improve the security of NN models. Nevertheless, we observe that traditional memory encryption solutions that have been efficiently used in CPU systems cause significant performance degradation when directly used in DL accelerators. The main reason is the big bandwidth gap between the GDDR memory bus and the encryption engine. To address this problem, our paper proposes SEAL, a Secure and Efficient Accelerator scheme for deep Learning. SEAL enhances the performance of the encrypted DL accelerator from two aspects, i.e., improving the data access bandwidth and the efficiency of memory encryption. Specifically, to improve the data access bandwidth, SEAL leverages a criticality-aware smart encryption scheme which identifies partial data that have no impact on the security of NN models and allows them to bypass the encryption engine, thus reducing the amount of data to be encrypted. To improve the efficiency of memory encryption, SEAL leverages a colocation mode encryption scheme that eliminates the memory accesses for encryption counters by co-locating data and their counters. Our experimental results demonstrate that, compared with traditional memory encryption solutions, SEAL achieves 1.×−.× IPC improvement and reduces the inference latency by 39%−60%, at a cost of only 5%−7% IPC for significant security improvement.
1. INTRODUCTION
Machine learning techniques, especially deep learning (DL), have made significant progress in the past few years, surpassing human performance in some application domains, such as image classification [25, 36, 67], speech recognition [12, 20, 76], and games [66]. With the increase of computing performance and storage capacity of edge devices, DL systems are increasingly expanding from the cloud to edge devices [19, 75], such as self-driving cars [31] and Internet-of-Things devices [40]. By employing DL accelerators, e.g., GPUs and NPUs, edge devices are able to carry out real-time local inferences based on current environments, without a high-latency connection to a remote control center. For example, over 99% of smartphones were equipped with a GPU by 2019 [63, 70]. The self-driving computer within Tesla cars [7] and the Google Edge TPU [64] also include a GPU.

In DL accelerators, neural network (NN) models are confidential information that needs to be protected, because NN models represent the Intellectual Property (IP) of model owners and should be kept confidential to preserve their competitive advantages. More importantly, the knowledge of an NN model can help an adversary carry out more powerful adversarial attacks [18, 69]. In adversarial attacks, an adversary is able to intentionally affect the outcome of the DL inference by modifying the input data with a slight perturbation that is imperceptible to humans. For example, by performing adversarial attacks, an adversary is able to manipulate self-driving cars [16] and trick the speaker recognition system in smartphones [6]. In general, if the adversary does not know the NN model, the success rate of the adversarial attack is low. With the knowledge of the NN model, the success rate is significantly improved [44, 53].

However, DL accelerators deployed on edge devices suffer from a new security issue compared with those deployed in the cloud.
The reason is that DL accelerators on edge devices are easier to physically access, and thus are vulnerable to physical access based attacks. The accelerator chip and DRAM themselves are usually well packaged and hence secure against physical access, but the memory bus connecting the accelerator and DRAM is not secure, due to being vulnerable to bus snooping attacks [27, 29, 30, 77]. Since the DL accelerator has to access the NN model stored in the DRAM memory through the GDDR memory bus during the inference, an adversary can easily obtain the entire NN model by inserting a bus snooper on the GDDR bus to intercept the data communicated between the DL accelerator chip and the DRAM memory. Therefore, memory encryption for the data transmitted between the DL accelerator chip and the DRAM memory is important.

There are two existing memory encryption models in secure CPU systems, direct encryption and counter mode encryption. Direct encryption encrypts all memory lines using the same global key. It has a low security level since the same data are always encrypted to the same ciphertext, leaving direct encryption vulnerable to dictionary and retry attacks [3]. Counter mode encryption [41] encrypts a memory line by using a global key in conjunction with its line address and a per-line counter. Thus the same plaintexts are encrypted to different ciphertexts, achieving a high security level. Counter mode encryption needs to maintain a counter cache on the CPU chip. When accessing a memory line, if its corresponding counter is found in the counter cache, its decryption latency is hidden in the memory read latency to improve the system performance.
The reason is that counter mode encryption generates a one-time pad (OTP) using the counter in parallel with the memory read, and decrypts the memory line by XORing the OTP with the data. Due to the benefit of hiding decryption latency, counter mode encryption only incurs about 5% performance overhead in CPU systems [77].

However, we observe that employing these memory encryption techniques in DL accelerators significantly decreases their performance. The IPC (instructions per cycle) of the DL accelerator is reduced by over 50% after using memory encryption, as evaluated in Section 2.4. Such a significant performance decrease is unacceptable for the latency-sensitive DL accelerators on edge devices that must carry out real-time inferences based on current environments, e.g., self-driving cars. The main reason is the big bandwidth gap between the GDDR memory bus and the encryption engine. For DL accelerators, e.g., GPUs, performance is highly bandwidth-bound, and hence GDDR memory is designed for GPUs to achieve high memory access bandwidth. The bandwidth of the GDDR memory bus is generally higher than 160 GB/s [49, 50, 51, 52]. However, state-of-the-art hardware encryption engines achieve only about 8 GB/s of bandwidth on average [15, 42, 45, 46, 62]. Even if we deploy one encryption engine in every memory controller, the big bandwidth gap remains. As a result, the high bandwidth of the GDDR memory bus is under-utilized and the encryption engine becomes the bandwidth bottleneck in secure DL accelerators.
Moreover, since the data access bandwidth is the performance bottleneck, counter mode encryption, which causes extra memory accesses for counters, exacerbates the performance loss on DL accelerators and even delivers worse performance than direct encryption.

To address these problems, our paper proposes SEAL, a Secure and Efficient Accelerator scheme for deep Learning, to enhance the security of DL accelerators on edge devices while delivering high performance. SEAL reduces the performance overhead of encryption by using a criticality-aware smart encryption (SE) scheme and a colocation mode encryption (ColoE) scheme. Specifically, to improve the data access bandwidth of DL accelerators, SEAL leverages the SE scheme to identify partial data having no impact on the security of NN models and allows them to bypass the encryption engine, lowering the amount of data to be encrypted without affecting security. To improve the efficiency of memory encryption, SEAL leverages the ColoE scheme, which co-locates the storage of each data line and its counter. ColoE has the same security level as traditional counter mode encryption while achieving higher performance in DL accelerators due to removing the extra memory accesses for counters. In summary, this paper makes the following contributions.

• Observations and Insights on Securing DL Accelerators.
We present the new security problem of DL accelerators on edge devices, i.e., being vulnerable to physical access based attacks. We observe that memory encryption that has been efficiently used in CPU systems causes significant (up to 50%) performance degradation when directly used in DL accelerators. By analyzing experimental results, we present the insight that the big bandwidth gap between the GDDR bus and the encryption engine is the main cause of the performance degradation.

(SEAL means we seal NN models in secure DL accelerators, and thus no one can snoop them.)

• Criticality-aware Smart Encryption for NN Models.
We propose a criticality-aware smart encryption (SE) scheme to allow partial data to bypass the encryption engine, improving the data access bandwidth in DL accelerators without any loss of security. The idea of the SE scheme is to measure the relative importance of weight parameters in the NN model. Based on the relative importance, the SE scheme does not encrypt the weight parameters with the lowest importance, and thus it is unnecessary to encrypt their corresponding channels in the input or output feature maps. Based on a quantitative security evaluation in terms of both IP protection and adversarial attacks [17, 18, 37, 55], we determine the percentage of encrypted data with which the SE scheme achieves the same security level as the full encryption scheme.

• Colocation Mode Encryption for DL Accelerators.
In order to improve the efficiency of memory encryption, we propose a colocation mode encryption (ColoE) scheme that stores the data and its counter in the same memory line, unlike traditional counter mode encryption, which stores them separately. Thus the ColoE scheme removes the extra memory accesses for counters to improve the system performance, and does not need a large on-chip counter cache, in contrast with traditional counter mode encryption. Due to the usage of counters for encryption, the ColoE scheme also has a higher security level than traditional direct encryption.

• Implementation and Evaluation.
We have implemented SEAL in GPGPU-Sim [5] and evaluated it using three classical CNN models, including VGG-16 [67], ResNet-18 [25], and ResNet-34 [25]. Experimental results show that, compared with traditional direct and counter mode encryption, SEAL achieves 1.×−.× IPC improvement and 39%−60% latency reduction. Compared with a baseline accelerator without memory encryption, SEAL improves the security with only a slight overhead (5%−7% IPC).
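The co-location idea in the ColoE bullet above can be sketched in a few lines: the data and its encryption counter share one memory line, so a single memory access returns both. The 64-byte line size and 8-byte counter width below are illustrative assumptions for this sketch, not the paper's exact layout.

```python
# Hedged sketch of ColoE's co-located line layout (assumed sizes, not the
# paper's design): one memory line holds the payload plus its counter.
import struct

LINE_BYTES = 64
COUNTER_BYTES = 8
DATA_BYTES = LINE_BYTES - COUNTER_BYTES  # 56 bytes of payload per line

def pack_line(data: bytes, counter: int) -> bytes:
    """Co-locate payload and its per-line counter in one memory line."""
    assert len(data) == DATA_BYTES
    return data + struct.pack("<Q", counter)

def unpack_line(line: bytes):
    """One read returns both the payload and the counter."""
    data = line[:DATA_BYTES]
    (counter,) = struct.unpack("<Q", line[DATA_BYTES:])
    return data, counter

line = pack_line(b"\xab" * DATA_BYTES, counter=7)
assert len(line) == LINE_BYTES
data, ctr = unpack_line(line)
assert data == b"\xab" * DATA_BYTES and ctr == 7
```

Because the counter travels with the data, no separate counter fetch (and no large on-chip counter cache) is needed, which is the bandwidth saving the bullet describes.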
2. BACKGROUND AND MOTIVATION

2.1 Deep Learning Accelerators
Deep learning (DL) [38] is widely used in current artificial intelligence applications, such as natural language processing, speech recognition, and computer vision. Achieving high accuracy and low processing latency in these applications requires complicated deep learning computation [25, 67]. Therefore, various DL hardware accelerators [9, 10, 11, 33, 51] are used to deliver high performance.

The GPU is the most widely used DL accelerator due to its compatibility with different algorithms and its high parallelism. The powerful parallel processing ability of GPUs is efficient and suitable for DL, with its large amount of parallel floating-point and matrix/vector multiplication computation. FPGAs are an alternative for implementing DL accelerators with high energy efficiency. Furthermore, various ASIC DL accelerators have been proposed to speed up specific machine learning algorithms, such as the TPU [33], the DianNao family [9, 11], and Eyeriss [10].

Figure 1: The generic DL accelerator architecture. (An accelerator chip containing an array of PEs, an on-chip cache, and a memory controller, connected to the DRAM memory through the GDDR bus.)
A generic hardware architecture for these GPU, FPGA, and ASIC DL accelerators is shown in Figure 1. The accelerator architecture consists of an array of processing elements (PEs, called cores in GPUs) and a data cache (also called a global buffer) on chip. Each PE has its own control logic and scratchpad, and communicates with the data cache through networks-on-chip (NoCs). As the size of the on-chip data cache is limited, the entire NN model and the intermediate data produced during DL inference are stored in the off-chip DRAM memory with large capacity. The accelerator accesses the DRAM through the high-bandwidth GDDR bus.
2.2 Physical Attacks on DL Accelerators

For DL applications, neural network (NN) models are critical data maintained in DL accelerators [29, 30]. However, DL accelerators deployed on edge devices risk leaking their NN models due to being vulnerable to physical access based attacks. Compared with devices deployed in the cloud, edge devices are easier to physically access. For example, a user can dismantle his/her own self-driving car to look into the internal computer system. Therefore, DL accelerators on edge devices are vulnerable to physical access based attacks, i.e., bus snooping [65, 77].
Threat Model:
Like existing threat models for hardware attacks on CPUs [65, 77] and accelerators [29, 30], we consider the on-chip components of accelerators and the DRAM itself to be secure. However, an adversary can insert a bus snooper or a memory scanner on the GDDR memory bus to obtain the data communicated between the accelerator chip and the off-chip DRAM [29, 30], and further steal the entire NN model.

Threat Purposes: We consider two purposes for which an adversary obtains NN models via bus snooping.
1) IP Stealing.
NN models are considered the Intellectual Property (IP) of model owners [30, 60, 72]. Model owners may invest a large amount of financial and material resources to train a sophisticated NN model. The adversary may be a business competitor of the model owners. The leakage of NN models incurs property loss for model owners and reduces their competitive advantages.
2) Adversarial Attacks.
The exposure of an NN model can significantly increase the risk that the model is attacked by adversarial attacks. In adversarial attacks, an adversary aims to apply an imperceptible non-random perturbation to the input data to change the prediction results of NN models [18, 69]. The perturbed input data are termed adversarial examples. If the adversary does not know the NN model, the adversarial attack is called a black-box attack. If the adversary knows the entire NN model, the adversarial attack is called a white-box attack. In black-box attacks, the attack success rate is low. In white-box attacks, the attack success rate significantly increases, since the adversary can generate high-quality adversarial examples by using the known model information [44, 53].

In order to protect the NN models in DL accelerators from bus snooping attacks, encrypting the data transmitted through the GDDR bus is important.¹ Existing memory encryption techniques [65, 77] are widely used in secure CPU systems to enable secure data transmission through the DDR bus of CPU memory. However, data security on the GDDR memory bus of DL accelerators is rarely touched by existing work. In the following, we first present memory encryption techniques for secure CPUs (§2.3) and then investigate whether the straightforward solutions that perform CPU memory encryption directly on DL accelerators are efficient (§2.4).

¹As we aim to protect the confidentiality of NN models, bus tampering attacks are not considered in our threat model; they can be defended against via Merkle tree based authentication techniques [68], which are orthogonal to our work.

Figure 2: Encryption and decryption operations in the direct encryption and counter mode encryption. ((a) Direct encryption: AES encrypts the plaintext to ciphertext (and decrypts the ciphertext to plaintext) with the key. (b) Counter mode encryption: AES takes the key, the address, and the counter to generate an OTP, which is XORed with the plaintext for encryption or with the ciphertext for decryption.)

2.3 Memory Encryption in Secure CPUs
In secure CPUs, the encryption engine of a block cipher algorithm (e.g., AES [13]) is added in the memory controller for encrypting and decrypting data. In general, there are two memory encryption models used for secure CPUs, direct encryption and counter mode encryption.

As shown in Figure 2a, in direct encryption, each cache line is encrypted by the AES encryption engine before being written back to the DRAM memory. After being read from the DRAM memory, each line is decrypted and then put into the last-level cache. However, direct encryption adds high decryption latency to the critical path of memory accesses in CPUs. Additionally, direct encryption encrypts all memory lines using the same global key, which has a low security level. Since the same data are always encrypted to the same ciphertext, direct encryption is vulnerable to dictionary and retry based attacks [3, 79].

As shown in Figure 2b, in counter mode encryption [41], a global key, the line address, and the per-line counter pass through the AES encryption engine to generate a one-time pad (OTP). The plaintext or ciphertext is then encrypted or decrypted by simply XORing it with its OTP. Each memory line in the off-chip DRAM has a counter. All counters are stored in the DRAM. Recently used counters are buffered in an on-chip counter cache managed by the memory controller. If the counter of a memory line to be read is in the counter cache, its decryption latency is hidden in the memory read latency, since the OTP is generated in parallel with the memory read. Only the XOR latency is added to the critical path, thus reducing the decryption latency.

Moreover, counter mode encryption provides a higher security level than direct encryption, since OTPs are never reused for data encryption, which keeps counter mode encryption secure from dictionary and retry based attacks. First, since the line address is used to generate the OTP, the data stored at different addresses are encrypted by different OTPs. Second, a per-line counter is used to generate the OTP, and the counter is incremented on each write, so data rewritten to the same address are encrypted by different OTPs. In general, counters are stored in plaintext, since the data cannot be decrypted by an adversary who knows the counter value but does not know the key [41, 79].

Figure 3: The IPC of GPUs with two straightforward memory encryption solutions. ("Baseline" means a baseline GPU without memory encryption. "Direct" means direct encryption. "Ctr-96" means counter mode encryption with a 96KB counter cache, where each memory controller has a 16KB (=96KB/6) counter cache. Panel (b) shows the counter cache hit rate for counter cache sizes of 24, 96, 384, and 1536KB.)
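The counter mode operations of Figure 2b can be sketched in software. In the sketch below, SHA-256 stands in for the AES block cipher so the example runs with only the standard library; this is an assumption for runnability, not the hardware design described above.

```python
# Sketch of counter mode encryption: the OTP is derived from a secret key,
# the line address, and a per-line counter, then XORed with the data.
# SHA-256 is a stand-in for the AES engine used in the actual hardware.
import hashlib

LINE_SIZE = 64  # bytes per memory line (a common cache-line size)

def make_otp(key: bytes, addr: int, counter: int) -> bytes:
    # In hardware this would be AES(key, addr || counter), computable in
    # parallel with the memory read since it does not depend on the data.
    seed = key + addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    otp = b""
    block = 0
    while len(otp) < LINE_SIZE:
        otp += hashlib.sha256(seed + block.to_bytes(4, "little")).digest()
        block += 1
    return otp[:LINE_SIZE]

def encrypt_line(key, addr, counter, data):
    otp = make_otp(key, addr, counter)
    return bytes(d ^ o for d, o in zip(data, otp))

decrypt_line = encrypt_line  # XOR with the same OTP decrypts

key = b"\x01" * 16
line = bytes(range(64))
ct0 = encrypt_line(key, 0x1000, 0, line)
ct1 = encrypt_line(key, 0x1000, 1, line)   # counter bumped on rewrite
ct2 = encrypt_line(key, 0x2000, 0, line)   # same data, different address
assert decrypt_line(key, 0x1000, 0, ct0) == line
assert ct0 != ct1 and ct0 != ct2  # same plaintext, different ciphertexts
```

The last assertion illustrates the security argument in the text: because the address and counter feed the OTP, identical plaintexts never map to identical ciphertexts, defeating dictionary and retry attacks.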
2.4 Straightforward Encryption on DL Accelerators

We consider two straightforward solutions, i.e., simply employing direct encryption and counter mode encryption in DL accelerators, to improve the security of NN models. Without loss of generality, in the rest of this paper, we analyze the GPU as a representative example of DL accelerators. However, the problems, insights, and solutions that we develop are also applicable to other DL accelerators.

We implement the two straightforward solutions in GPGPU-Sim [5]. Since the encryption engine increases the chip area and energy overhead, which also affects chip cooling [4, 45, 54], each memory controller generally includes one encryption engine [3, 43, 77, 79]. Thus the six memory controllers in the modeled GPU include six encryption engines. For counter mode encryption, we add an on-chip counter cache to buffer recently used counters. The detailed GPU configurations are shown in Section 4.1. We use the modeled GPU to execute matrix multiplication, the most common operation in DL algorithms. We evaluate the IPC (instructions per cycle) of the GPU with different encryption schemes and compare them with a baseline GPU without memory encryption, as shown in Figure 3a.

First, we observe that the GPU with memory encryption is significantly less efficient than the GPU without memory encryption. Memory encryption decreases the GPU IPC by 45%−54% for the matrix multiplication computation. Second, using the counter mode encryption scheme does not deliver higher performance than the direct encryption scheme on the GPU. With small counter cache sizes, i.e., 24KB, 96KB, and 384KB, the performance of the counter mode encryption scheme is even lower than that of the direct encryption scheme. With a large counter cache, i.e., 1536KB, the IPC of the GPU is improved by 15%. However, that counter cache size is double the L2 cache size in the modeled GPU, as shown in the configurations (Section 4.1), which is too large to be deployed on the GPU die.

Table 1: Bandwidth comparisons of the AES encryption engine and different buses [26, 45, 52].

  DDR bus     DDR3 (No. 800−…)   …∼… GB/s
  PCIe bus    ×16 links           16 GB/s
  AES engine  128-bit block       1.5∼19 GB/s
  GDDR bus    GDDR5               160∼336 GB/s
              GDDR5X              320∼484 GB/s

The reason that memory encryption significantly reduces the GPU performance is the big bandwidth gap between the GDDR memory bus and the encryption engine, as shown in Table 1. In CPU systems, memory encryption works well [65, 77] since the AES encryption engine has a bandwidth similar to those of the DDR memory bus and the PCIe bus of the CPU. However, GPU performance is highly bandwidth-bound, and hence GDDR memory is designed for GPUs to achieve high memory access bandwidth. The bandwidth of the GDDR memory bus is generally more than 160 GB/s [49, 50, 51, 52]. However, the state-of-the-art pipelined AES encryption engine in hardware achieves only about 8 GB/s of bandwidth on average [45]. Even if we deploy one encryption engine in every memory controller, the total encryption bandwidth is only 48 GB/s. As a result, the high bandwidth of the GDDR memory bus is under-utilized and the AES encryption engine becomes the bandwidth bottleneck in secure GPUs. A single AES engine usually occupies over 1 mm² of on-die area and consumes hundreds or thousands of mW of power, as shown in Table 2. As resources on the microprocessor die are very scarce, it is ruinously costly to integrate more encryption engines into the memory controllers on the GPU die [21]. Even though a GPU/CPU die usually has an area of 90−… mm², most of the area is occupied by cores and on-die memory, and less than 10% of the area is left for memory controllers [34, 59]. This is also the reason why Intel carefully designed the AES hardware implementation to reduce area and energy overheads for Software Guard Extensions (SGX) [21]. Like the design principle of Intel's SGX [21] and many previous works [3, 43, 77, 79], the goal of this paper is also to improve hardware security while keeping on-die overheads low. Moreover, since the data access bandwidth is the performance bottleneck, counter mode encryption incurs extra memory access requests for reading and writing counters compared with direct encryption, thus delivering low performance with small counter cache sizes.
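The bandwidth-gap argument above reduces to simple arithmetic, using only the figures quoted in the text: six memory controllers, roughly 8 GB/s per encryption engine, and a GDDR bus of at least 160 GB/s.

```python
# Back-of-envelope check of the bandwidth gap, using the numbers from the
# text: six engines at ~8 GB/s each against a >=160 GB/s GDDR bus.
NUM_ENGINES = 6        # one encryption engine per memory controller
AES_BW_GBPS = 8.0      # average per-engine encryption bandwidth (from text)
GDDR_BW_GBPS = 160.0   # lower bound of the GDDR5 bus bandwidth (from text)

total_enc_bw = NUM_ENGINES * AES_BW_GBPS    # aggregate encryption bandwidth
utilization = total_enc_bw / GDDR_BW_GBPS   # usable fraction of the GDDR bus

assert total_enc_bw == 48.0
# When every byte must pass through an engine, at most 30% of the GDDR
# bandwidth is usable -- the encryption engines are the bottleneck.
assert abs(utilization - 0.3) < 1e-12
```

This is why adding engines does not scale: closing the remaining 70% gap would require integrating many more AES engines, each costing over 1 mm² of scarce die area.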
3. THE SEAL DESIGN
We propose SEAL, a secure and efficient DL accelerator scheme, to enhance the security of NN models. SEAL improves the performance of secure DL accelerators by exploring and exploiting software and hardware co-design. In the software layer, to improve the data access bandwidth of DL accelerators, a criticality-aware smart encryption (SE) scheme (§3.1) is used to measure the relative importance of the weight parameters in the NN model. Only the relatively important weight parameters are processed by the AES encryption engine, and the remaining parameters bypass it, which reduces the amount of data to be encrypted without compromising the security. We quantitatively analyze and evaluate the security of the SE scheme in terms of both IP protection and adversarial attacks, and leverage the evaluation results to guide the parameter configuration of the SE scheme to obtain the maximum performance benefit and the highest security level (§3.4). In the hardware layer, to improve the efficiency of memory encryption, SEAL leverages a colocation mode encryption (ColoE) scheme (§3.2) to achieve the same security level as counter mode encryption while eliminating the extra memory accesses for counters. Moreover, we present the overall hardware architecture design to support SE and ColoE (§3.3).

Figure 4: The CNN architecture of VGG-16 as an example. (The CNN takes an image with 224×224 pixels and 3 channels as the input layer and outputs a one-dimensional vector. The inputs and outputs of each hidden layer are called feature maps (FMs), and the output FMs of one layer are the input FMs of the next layer. (CONV: 3×3, 64) indicates a convolution layer with 3×3 CONV kernels and 64 output channels. (FM: 224×224×64) means FMs with the size of 224×224×64. POOL indicates a pooling layer and FC indicates a fully connected layer.)

Table 2: Performance comparisons of different AES encryption engine implementations (counter mode).

                        Area (mm²)  Power (mW)  Latency (cycles)  Throughput (GB/s)
  Morioka et al. [46]   N/A         1920        10                1.5
  Mathew et al. [45]    1.1         125         20                6.6
  Ensilica [15]         1.4         N/A         11                8
  Sayilar et al. [62]   6.3         6207        20                16
  Liu et al. [42]       6.6         1580        152               19

3.1 Criticality-Aware Smart Encryption

In this subsection, we first use the convolutional neural network (CNN), a widely used neural network for DL, as an example to present the challenges of performing partial encryption on DL accelerators. We then present the proposed criticality-aware smart encryption scheme.
During the process of CNN inference, there are four kinds of data, i.e., data in the input layer, data in the output layer, weight parameters in hidden layers (i.e., the NN model data), and intermediate data (i.e., feature maps) produced by hidden layers, as shown in Figure 4. If we encrypt all the data during the CNN inference, the inference performance significantly decreases, as presented in Section 2.4. This is mainly because the bandwidth of the AES encryption engine is far lower than that of the GDDR memory bus, limiting the total data access bandwidth. If we encrypt only partial data to reduce the amount of data to be encrypted, the total data access bandwidth improves. Nevertheless, performing partial encryption is not easy, due to the following fundamental challenges.

Challenge 1: How to select appropriate data to be encrypted? Among the four kinds of data, the data in the input and output layers are usually known by the adversary. For example, for the DL accelerator in a self-driving car, the input data are the pictures of the current visual field taken by cameras, which can be obtained by the adversary. The output data are the current actions of the car, e.g., stop, turn left, or turn right, also known by the adversary. A simple partial encryption approach is to leave the data in the input and output layers unencrypted and encrypt the remaining data, including the weight parameters in the NN model and the intermediate data produced by hidden layers. However, the sizes of the data in the input and output layers are far smaller than that of the intermediate data, as shown in Figure 4. For example, the data in the input layer have the size of 224×224×3, while the first hidden layer alone produces feature maps with the size of 224×224×64. Therefore, this simple approach is inefficient for improving inference performance.

Moreover, among the data in the CNN inference, the weight parameters of the NN model have to be protected. Intuitively, we could encrypt only the weight parameters of the NN model and leave the remainder of the data unencrypted to reduce the encryption overhead. However, an adversary can calculate or speculate the weight parameters of the NN model via unencrypted feature maps. For example, a CONV layer computes on the input feature maps X with a kernel matrix ω to produce the output feature maps Y, i.e., Y = Xω. If X and Y are not encrypted, an adversary can easily compute the kernel matrix ω via the equation ω = X⁻¹Y, in which X⁻¹ is the inverse matrix of X. Therefore, it is important to protect the NN model data from being calculated or speculated from the unencrypted data.

Challenge 2: How to evaluate the impact of partial encryption on security? Intuitively, encrypting all data input to and produced during the NN inference has a high security level but causes significant performance degradation. Selectively leaving partial data unencrypted can improve the performance, but may compromise security. An adversary can directly compute encrypted weights via unencrypted feature maps, as discussed above. Moreover, existing fine-tuning techniques [39, 58] for NN models can also be used to speculate a complete NN model based on known partial weight parameters and the data in the input and output layers. Specifically, the adversary can fill the known partial weight parameters into the NN model and then use the data in the input and output layers to retrain a complete NN model. Hence, how to evaluate and quantify the impact of partial encryption on security is non-trivial for designing an efficient encryption scheme.
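The weight-recovery risk described in Challenge 1 can be demonstrated in a few lines; the matrices below are randomly generated purely for illustration.

```python
# Demonstration of the attack in Challenge 1: if the input feature maps X
# and the output Y = X @ w are both unencrypted and X is invertible, the
# secret kernel matrix w falls out as w = X^{-1} Y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))   # unencrypted input feature maps
w = rng.standard_normal((4, 3))   # secret kernel matrix (what we protect)
Y = X @ w                         # unencrypted output feature maps

w_recovered = np.linalg.inv(X) @ Y  # the adversary's computation
assert np.allclose(w_recovered, w)   # the "secret" weights are fully recovered
```

This is why encrypting only the weights, while leaving feature maps in plaintext, does not protect the model.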
To address these challenges, we propose a criticality-aware smart encryption (SE) scheme in SEAL, which aims to reduce the amount of encrypted data while preserving NN model security. The SE scheme quantitatively measures the relative importance of the weight parameters in each layer by calculating the sum of their absolute weights, i.e., the ℓ1-norm. The weight parameters with the smallest absolute values in each layer are considered the least important and hence are not encrypted. Thus it is unnecessary to encrypt the corresponding channels in the input or output feature maps of the unencrypted weight parameters. As a result, the amount of data to be encrypted is significantly reduced. The percentage of unencrypted weight parameters is determined based on the quantitative security evaluation in Section 3.4, to obtain the maximum performance benefit and the highest security level.

In deep neural networks, we apply the SE scheme in the CONV layers, since most layers in a CNN model are CONV layers, e.g., 13/16 for VGG-16, 17/18 for ResNet-18, and 33/34 for ResNet-34. The computation process of a CONV layer is shown in Figure 5. The weight parameters in a CONV layer are organized as a convolutional kernel matrix, and each convolutional kernel is a weight matrix, e.g., 3×3. The computation of a CONV layer transforms the input feature maps with the convolutional kernel matrix into the output feature maps. The convolutional kernel matrix has n_x kernel rows and n_y kernel columns. n_x is equal to the number of channels in the input feature maps. Each kernel row in the kernel matrix corresponds to a single input channel in the input feature maps, and this input channel is not involved in the convolution computation with other kernel rows, as shown in Figure 5. Similarly, n_y is equal to the number of channels in the output feature maps. Each kernel column in the kernel matrix corresponds to a single output channel in the output feature maps, as shown in Figure 5.

Relative Importance Measurement.
We first present our approach for relative importance measurement, as shown in Figure 5. We measure the relative importance of a kernel row in each layer by calculating the sum of its absolute weights, i.e., its ℓ1-norm. The sum of absolute weights in a row also represents the average magnitude of the kernel weights, which gives an expectation of the magnitude of the output feature map. Thus kernel rows with smaller sums of absolute weights tend to produce feature maps with weak activations, compared with the other kernel rows in the same layer [39]. Hence, these rows with small absolute-value sums have a lower impact on the output of the entire NN model compared with the rows with large absolute-value sums. Existing work [23, 39] on pruning NN models demonstrates that, even after completely eliminating the convolution computation that uses these weight parameters with small absolute values, the original accuracy of the NN model can be regained by retraining the networks. This observation indicates that these weight parameters with small absolute values are less important to the NN model and thus rarely affect the security of the NN model. We have confirmed this conjecture by performing IP protection and adversarial attack tests as presented in Section 3.4, whose results motivate us to propose the smart encryption (SE) scheme to reduce the encryption overhead in DL accelerators by only encrypting the weight parameters with large absolute values.

Figure 5: An example for the smart encryption scheme. (Green areas: encrypted data. Each grid in the kernel matrix is a kernel, e.g., 3*3.)
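As a concrete illustration of the row-ranking step, the following NumPy sketch computes the ℓ1-norm of each kernel row and picks the top fraction to encrypt. The array shapes, the 50% ratio, and the function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rank_kernel_rows(kernel, ratio=0.5):
    """Rank kernel rows (one per input channel) by their L1-norms
    and select the top `ratio` fraction as the rows to encrypt.

    kernel: array of shape (n_x, n_y, k, k) -- n_x kernel rows,
    n_y kernel columns, each entry a k x k kernel.
    """
    n_x = kernel.shape[0]
    # Sum of absolute weights per kernel row (the L1-norm).
    row_l1 = np.abs(kernel).sum(axis=(1, 2, 3))
    # Rows with the largest sums are the most important ones.
    order = np.argsort(row_l1)[::-1]
    n_enc = int(np.ceil(ratio * n_x))
    return set(order[:n_enc].tolist())

# Toy example: 4 input channels, 2 output channels, 3x3 kernels.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 2, 3, 3))
encrypted_rows = rank_kernel_rows(w, ratio=0.5)  # 2 rows selected
```

For each selected row index, the corresponding input channel of the feature maps would also be marked for encryption, as the SE scheme requires.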
Smart Encryption (SE).
After computing the sum of absolute weights in each row, the SE scheme sorts the kernel rows based on their sums. The SE scheme then encrypts the partial kernel rows with the largest sums. The percentage of encrypted kernel rows is determined by our quantitative security analysis in Section 3.4. However, the encrypted weight parameters in the SE scheme can be figured out if the input and output feature maps of this CONV layer are unencrypted, as discussed in Section 3.1.1. Therefore, for each encrypted row, the SE scheme also encrypts the one input channel in the input feature maps corresponding to the encrypted row, since each kernel row corresponds to a single input channel and does not involve the convolution computation with other input channels, as shown in Figure 5. In this way, the encrypted weight parameters cannot be figured out. For example, for the matrix multiplication Y = Xω, the input channel X1 and the weight row ω1 are encrypted. ω1 cannot be figured out even though the adversary knows Y. The data in the encrypted channel X1 is encrypted once being produced by the previous CONV layer. Hence, the plaintext in the encrypted channel X1 is never exposed to the memory bus.

Moreover, even when considering unencrypted data among multiple layers, the encrypted channels and weights cannot be figured out and hence remain secure. To prove this, we use a simple example with two sequential CONV layers, i.e., Y = Xω and Z = Yω′, as follows (entries marked with * are encrypted; rows are separated by semicolons):

X = [X1*, X2],  ω = [ω_r1; ω_r2] = [ω11*, ω12*; ω21, ω22],  Y = [Y1, Y2*],  ω′ = [ω′_r1; ω′_r2] = [ω′11, ω′12; ω′21*, ω′22*],  Z = [Z1*, Z2]    (1)

The feature maps X, Y, and Z each have 2 channels. Since there are 2 input and 2 output channels, the kernel matrices ω and ω′ each have 2 rows and 2 columns.
With a 50% encryption ratio, we assume the first row ω_r1 in ω is encrypted and the second row ω′_r2 in ω′ is encrypted. Based on the SE scheme, we should then encrypt the first channel X1 in X and the second channel Y2 in Y. Moreover, we assume Z1 is encrypted in Z. Thus, for the two sequential CONV layers, we have the following equations (encrypted data are marked with *):

X1*·ω11* + X2·ω21 = Y1
X1*·ω12* + X2·ω22 = Y2*    (2)

Y1·ω′11 + Y2*·ω′21* = Z1*
Y1·ω′12 + Y2*·ω′22* = Z2    (3)

As shown in Equations 2 and 3, encrypted input channels are never multiplied with unencrypted weight rows, and unencrypted input channels are never multiplied with encrypted
Figure 6:
The comparisons between counter and colocation mode encryption schemes.

weight rows. Thus, we can only obtain the product of two encrypted matrices, e.g., X1·ω11, but cannot figure out any single encrypted matrix from Equations 2 and 3. Therefore, the data in encrypted channels and weights are secure even when considering data among multiple layers.

In fact, the SE scheme can also be applied to FC layers, since each FC layer includes a kernel matrix like the CONV layer. Therefore, the proposed SE scheme can be applied to other deep neural networks, e.g., recurrent neural networks [12, 28], that are composed of many FC layers.

There are two existing memory encryption models, i.e., direct encryption and counter mode encryption, as discussed in Section 2.3. Direct encryption has a lower security level due to being vulnerable to dictionary and replay based attacks. Counter mode encryption enhances security by using counters for encryption but requires a large counter cache on chip to achieve a high cache hit rate. Based on previous works [3, 43, 77] on counter mode encryption, the size of the used counter cache is usually up to 1MB. We instead propose a colocation mode encryption (ColoE) scheme for DL accelerators that needs no on-chip counter cache. The ColoE scheme achieves the same security level while having higher performance on DL accelerators, compared with the traditional counter mode encryption. Unlike the traditional counter mode encryption that stores the data and their counters separately, as shown in Figure 6a, the ColoE scheme stores the data and its counter together, i.e., colocation. Like Intel's SGX [21, 22], we use the monolithic counter scheme rather than the split counter scheme [77] to avoid the overheads of intricate page re-encryption. The counter area is 8B for each memory line. Thus, a memory line for storing the encrypted data is 136B, including 128B of data and an 8B counter area, as shown in Figure 6b.
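The claim above, that the adversary can recover only the product of two encrypted factors but neither factor alone, can be checked on a toy scalar version of the two-layer example (all values are illustrative):

```python
# Toy scalar version of the first row of Y = X @ w.
X1, X2 = 3.0, 2.0        # channel X1 is encrypted, X2 is plaintext
w11, w21 = 5.0, 1.0      # w11 sits in the encrypted kernel row
Y1 = X1 * w11 + X2 * w21 # plaintext output channel seen on the bus

# The adversary observes X2, w21, and Y1, so it learns the product
# X1 * w11 = Y1 - X2 * w21, but not the factors: any rescaled pair
# (c * X1, w11 / c) produces identical observations.
c = 2.0
X1_alt, w11_alt = c * X1, w11 / c
Y1_alt = X1_alt * w11_alt + X2 * w21
assert abs(Y1 - Y1_alt) < 1e-9      # indistinguishable on the bus
assert (X1_alt, w11_alt) != (X1, w11)
```

One observed equation with two co-encrypted unknowns admits infinitely many consistent factorizations, which is exactly why the SE scheme encrypts the input channel together with its matching kernel row.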
When the data is evicted from the L2 cache, the ColoE scheme encrypts the data using its co-located counter, its memory address, and a key, and then
Figure 7:
A high-level overview of the SEAL architecture. (This figure shows one memory controller; the other controllers are the same as this one.)

writes it into the DRAM memory. When the data is read from the DRAM memory, the ColoE scheme decrypts the data and then sends it to the L2 cache. Unlike the traditional counter mode encryption that needs extra memory accesses to read/write counters, the ColoE scheme avoids these memory requests for counters by co-locating the data and their counters, thus improving the encryption performance in GPUs.
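The ColoE read/write path can be sketched as follows. This is a minimal software model, not the hardware design: it stands in SHA-256 for the AES keystream so the sketch stays standard-library-only, and the key, address width, and helper names are illustrative assumptions. What it preserves is the essential ColoE property: the counter travels with the data line, so decryption needs no separate counter fetch or counter cache.

```python
import hashlib

KEY = b"\x01" * 16   # illustrative device key
LINE = 128           # data bytes per memory line

def _keystream(key, addr, counter):
    # Stand-in for the counter-mode pad: hardware would compute
    # AES(key, addr || counter) blocks; SHA-256 models that here.
    pad = b""
    block = 0
    while len(pad) < LINE:
        pad += hashlib.sha256(
            key + addr.to_bytes(8, "little")
            + counter.to_bytes(8, "little") + block.to_bytes(4, "little")
        ).digest()
        block += 1
    return pad[:LINE]

def write_line(dram, addr, data, counter):
    """Encrypt on L2 eviction: bump the counter, XOR with the pad,
    and store the 8B counter next to the 128B ciphertext (136B)."""
    counter += 1
    pad = _keystream(KEY, addr, counter)
    cipher = bytes(d ^ p for d, p in zip(data, pad))
    dram[addr] = cipher + counter.to_bytes(8, "little")  # colocation
    return counter

def read_line(dram, addr):
    """Decrypt on fill: the counter arrives with the data line,
    so no separate counter access is issued."""
    line = dram[addr]
    cipher, counter = line[:LINE], int.from_bytes(line[LINE:], "little")
    pad = _keystream(KEY, addr, counter)
    return bytes(c ^ p for c, p in zip(cipher, pad))

dram = {}
plain = bytes(range(128))
write_line(dram, 0x1000, plain, counter=0)
assert read_line(dram, 0x1000) == plain
assert len(dram[0x1000]) == 136  # 128B data + 8B co-located counter
```

Because the counter is incremented on every eviction, re-writing the same plaintext to the same address yields a fresh ciphertext, which is what defeats dictionary and replay attacks.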
To support the proposed SE and ColoE schemes, SEAL is implemented via software and hardware co-design. The implementation and overall architecture of SEAL are shown in Figure 7.

To support the SE scheme, in the software layer, we expose a new programming primitive, emalloc(), to the high-level program in order to allow programmers to leverage the benefits of the smart encryption. The memory space allocated by emalloc() needs to be encrypted. The memory space allocated by the existing malloc() in current programming languages does not need to be encrypted. In the hardware layer, the counter area is 64 bits, while the counter used in the counter mode encryption only needs 56 bits, like the implementation in Intel's SGX [21, 22]. Thus, 8 bits in the counter area are not used. We use one bit in the counter area of each memory line as a flag to indicate whether the memory line is allocated by emalloc() or malloc(). Hence, memory controllers can distinguish the memory lines allocated by emalloc() or malloc() based on the flags. Memory lines allocated by malloc() bypass the AES engine.

To support the ColoE scheme, referring to the design of error-correcting code (ECC) DRAM [8, 14], we design the DRAM DIMM to include an extra chip without changing the DRAM burst mechanism. As shown in Figure 7, in a DRAM rank, there are 16 data chips and 1 counter chip (in ECC DRAM, this chip is used for storing ECC bits). For a memory line, the 128B of data is stored in the 16 data chips (8B per chip) and the 8B counter area is stored in the counter chip.

For the security analysis, we first discuss the case where an adversary does not know what NN architecture is used in the target DL accelerator. In this case, even though some NN model data are obtained by the bus snooping attack, it is difficult for the adversary to distinguish which data are used for a particular layer. In our proposed SE scheme, some data are encrypted, and hence it is even more difficult for the adversary to recover the NN model.
Therefore, we consider a strong
Figure 8:
The inference accuracy of substitute models.

attack model in which an adversary is able to figure out the NN architecture in the DL accelerator via side channel information [29, 30, 78], e.g., memory access patterns obtained from the memory bus, or device specifications [32]. In this case, the adversary can distinguish the data from different layers and know the locations in the NN model to which the encrypted and unencrypted data correspond. Under this strong attack model, we present the security analysis below. The security of NN models involves two aspects, IP stealing and adversarial attacks, as presented in Section 2.2.
In the security evaluation tests, we use three classical CNN models, including VGG-16 [67], ResNet-18 [25], and ResNet-34 [25], and train them on the widely used CIFAR-10 dataset [35]. The NN model stored in the target DL accelerator is called the victim model, and the NN model that the adversary extracts from the accelerator by using bus-snooping attacks is called the substitute model. Based on the fact that the adversary does not know the training dataset of the victim model, we isolate 90% of the training samples (45,000 images) in CIFAR-10 as the training dataset of the victim model [56]. The remaining 10% of the training samples (5,000 images) are used by the adversary. Based on the 5,000 images, the adversary uses Jacobian-based dataset augmentation [56] to generate an additional 40,000 images and then queries them in the target accelerator to obtain their corresponding labels. The generated image-label pairs are used as the training dataset of the adversary's substitute models. There are three kinds of substitute models that the adversary may obtain, as follows.

• White-box model.
If a DL accelerator is not equipped with memory encryption, the adversary can know the entire victim model, including all weight parameters and the NN architecture. Thus, we consider an NN model that is the same as the victim model as the white-box substitute model.

• Black-box model.
If we encrypt all the victim model data and intermediate data, the adversary knows the NN architecture but does not know any weight parameters. However, the adversary can feed his/her own images into the target DL accelerator and obtain the output labels. By using these image-label pairs, the adversary is able to retrain an NN model with the same architecture as the victim model. We consider the retrained NN model as the black-box substitute model.

• Smart encryption (SE) models.
In SEAL, we selectively encrypt the partial data that are critical; thus, the adversary knows the NN architecture and the partial weight parameters that are unencrypted. We perform full encryption on the first two CONV layers, the last CONV layer, and the last FC layers of a CNN model to prevent the adversary from calculating the weight parameters via the input and output layers, and perform the SE scheme on the remaining weight layers.
Figure 9:
The transferability of adversarial attacks for different substitute models.
However, by using the inputs and outputs of the target DL accelerator, the adversary is able to supplement the unknown part of the weight parameters by retraining the NN. Specifically, the adversary initializes an NN model with the known weight parameters and fills in random numbers following a standard normal distribution for the unknown weight parameters [24]. The adversary then keeps the known weight parameters unchanged and fine-tunes the unknown weight parameters by retraining the NN using the inputs and outputs of the target DL accelerator. Note that the attacker can know that the sums of the unknown weight rows must be larger than those of the known weight rows and can leverage this information during fine-tuning. However, in our experiments, we observe that the generated substitute models leveraging this information do not perform better, since limiting the sums of the unknown weight rows may hinder efficient parameter fine-tuning.
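The fine-tuning procedure above, keeping known weights frozen while retraining the unknown ones, can be sketched with a toy linear "layer" in place of a CNN. The layer shapes, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy victim "layer": y = W_true @ x. The adversary reads the
# small-magnitude weights from the bus (unencrypted) and must
# recover the large-magnitude, encrypted ones (unknown == True).
W_true = rng.normal(size=(4, 6))
unknown = np.abs(W_true) >= np.median(np.abs(W_true))

# Initialization: copy the known weights, fill the unknown ones
# with draws from a standard normal distribution.
W = np.where(unknown, rng.standard_normal(W_true.shape), W_true)

X = rng.normal(size=(6, 256))   # adversary's query inputs
Y = W_true @ X                  # outputs observed from the accelerator

for _ in range(500):
    grad = (W @ X - Y) @ X.T / X.shape[1]  # least-squares gradient
    grad[~unknown] = 0.0                   # known weights stay frozen
    W -= 0.1 * grad

# The frozen (known) weights are untouched by fine-tuning.
assert np.allclose(W[~unknown], W_true[~unknown])
```

In this convex toy setting the unknown weights are fully recoverable; the paper's point is that for real CNNs with the large-magnitude rows hidden, such fine-tuning yields substitute models no better than black-box retraining.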
One purpose of the attacks is to steal the IP of NN models. The adversary, who may be a business competitor, aims to reduce the competitive advantages of model owners. The efficiency of the stealing attacks depends on the inference accuracy of the extracted substitute models. In the stealing attack tests, we first generate the three kinds of substitute models, i.e., the white-box, black-box, and SE models, that the adversary may obtain as mentioned above. For the SE models, we vary the encryption ratio from 90% to 10%. The encryption ratio is defined as the ratio of encrypted weight parameters to all weight parameters in each layer. The encrypted weights have the largest absolute values in each layer, as presented in Section 3.1.2. We evaluate the inference accuracy of these substitute models using the test samples of the victim model. Figure 8 shows their inference accuracy. We observe that the white-box model has a very high accuracy, i.e., about 94%, due to being the same as the victim model. The black-box model significantly reduces the accuracy from 94% to 75%. This is because the adversary does not know any weights or training samples of the victim model, and the black-box model can only be trained from a blank model by using the adversary's training dataset. For the SE models, when the encryption ratio is only 20%, the accuracy significantly decreases by 14% on average (from 94% to 80%), since the weight parameters with the largest absolute values are encrypted in the SE models. When the encryption ratio is ≥
40%, the SE scheme achieves the same security level as the black-box model for IP protection.

3.4.3 Security on Adversarial Attacks
If the purpose is to attack the victim model, the adversary can use the extracted NN models to generate adversarial examples and then use these adversarial examples to perform adversarial attacks. In the adversarial attacks, the adversary aims to add the minimum perturbation to the input to mislead the victim model into producing a pre-assigned incorrect output [1, 37, 69]. In the adversarial attack tests, we use the three kinds of substitute models, i.e., the white-box, black-box, and SE models, to respectively generate 1,000 adversarial examples via the I-FGSM method [37]. Each batch of 1,000 adversarial examples has a 100% attack success rate against its corresponding substitute model. We then use these adversarial examples to attack the victim model and evaluate the transferability of the adversarial examples. The transferability is defined as the ratio of the adversarial examples that successfully attack the victim model to all adversarial examples, which is a widely used metric to evaluate the efficiency of substitute models for adversarial attacks [17, 44, 71, 80]. Figure 9 shows the transferability of the adversarial examples generated by the different substitute models. We observe that the black-box models have much lower transferability (about 20%) for the three CNN models compared with the white-box models, since the adversary with black-box models does not know any weight parameters or training samples of the victim model. For the SE models, when the encryption ratio is ≥
50% for the three CNN models, the transferability is close to, and even smaller than, that of the black-box models. The reason is that the unencrypted weight parameters in the SE scheme are relatively unimportant, because they have the smallest absolute weights in each layer. If the adversary keeps the unencrypted weight parameters unchanged and fine-tunes the remaining weight parameters, the unchanged, unimportant weight parameters may disturb the retrained model, thus producing smaller attack success rates than the black-box model. When the encryption ratio is below 50%, the transferability increases.
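The I-FGSM generation step used in these tests can be sketched on a toy linear classifier. The step size, iteration count, and ε-ball radius are illustrative; the paper applies I-FGSM [37] to CNN substitute models:

```python
import numpy as np

def i_fgsm(x, w, alpha=0.05, steps=10, eps=0.3):
    """Iterative FGSM against a linear score s(x) = w . x: repeatedly
    step along the sign of the gradient (which is just w here) to
    raise the score, clipping the total perturbation to an L-inf
    ball of radius eps around the original input."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(w)        # gradient-sign step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay in the eps-ball
    return x_adv

rng = np.random.default_rng(2)
w = rng.normal(size=8)   # substitute model's weights (known to attacker)
x = rng.normal(size=8)   # a clean input
x_adv = i_fgsm(x, w)

assert w @ x_adv > w @ x                        # score strictly raised
assert np.max(np.abs(x_adv - x)) <= 0.3 + 1e-9  # perturbation bounded
```

Transferability then asks whether `x_adv`, crafted against the substitute's weights, also flips the victim model's decision.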
4. PERFORMANCE EVALUATION

4.1 Methodology
We evaluate the performance of SEAL using GPGPU-Sim v3.2.2 [5], a cycle-level simulator for contemporary GPUs. We model the microarchitecture of the NVIDIA GeForce GTX 480 GPU [49] with 15 streaming multiprocessors (SMs), one of the default GPUs in GPGPU-Sim. The details of the GPU configuration are shown in Table 3. Although we perform our simulations on an NVIDIA Fermi GPU, our solution, which focuses on the accelerator memory system, is also applicable and generalizable to newer GPU architectures, including Maxwell, Pascal, and Volta, as well as other kinds of DL accelerators, as presented in Section 2.4. To implement SEAL, we add an AES encryption engine to every memory controller of the simulated GPU. We model a pipelined AES encryption engine with 128-bit blocks [4, 45, 54], in which the overall AES encryption latency for a cache line is 20 cycles and the bandwidth of an AES engine is 8GB/s. According to the bandwidth values summarized in Tables 1 and 2, we set mean bandwidth values for the AES engine and the GPU.

Table 3: Configurations of the simulated system.

GPU Core
  Number of SMs: 15
  Core clock: 700 MHz
  Number of warps per SM: 48
  Register file size per SM: 128KB (32768 registers)
  Register file cache size per SM: 16KB (4096 registers)
  Shared memory size per SM: 48KB

Cache and Memory
  Private L1 cache: 16KB, 4-way, 128B line, 1-cycle latency
  Shared L2 cache: 768KB, 8-way, 128B line, 10-cycle latency
  Memory model: GDDR5, 1848 MHz (3696 data rate), 384-bit bus width, 6 channels, FR-FCFS
  GDDR5 timing (ns): tCL = 12, tRP = 12, tRC = 40, tRAS = 28, tRCD = 12, tRRD = 6
Benchmarks.
We use three classical CNN models, including VGG-16 [67], ResNet-18 [25], and ResNet-34 [25]. In order to run these CNN models on GPGPU-Sim, we install PyTorch for GPGPU-Sim [57], an open-source modified version of PyTorch that enables GPGPU-Sim to use the cuDNN library [48].
Comparisons.
We compare SEAL with the following five schemes.

• Baseline: An insecure GPU without memory encryption.
• Direct: A straightforward solution using the direct encryption scheme as presented in Section 2.4.
• Counter: A straightforward solution using the counter mode encryption scheme as presented in Section 2.4. For the counter mode encryption scheme, we add an on-chip counter cache whose size is 1/16 (equal to the counter/data size ratio, i.e., 8B/128B) of the L2 cache size.
• Direct+SE: The direct encryption scheme with our proposed criticality-aware smart encryption (SE) scheme. We compare the performance of Direct and Direct+SE to show the benefit of the SE scheme.
• Counter+SE: The counter mode encryption scheme with our proposed SE scheme. We compare the performance of Counter and Counter+SE to show the benefit of the SE scheme. We also compare Counter+SE with SEAL to show the benefit of our proposed colocation mode encryption (ColoE) scheme.
We perform the SE scheme on CONV layers whose input and output feature maps are also the input and output of POOL layers. Different encryption schemes have different impacts on the performance of CONV and POOL layers. The default encryption ratio of the SE scheme is 50%, as presented in Section 3.4. To investigate the impact of different encryption schemes on the performance of different layers, we evaluate four typical CONV layers of VGG, in which the number of input and output channels is 64, 128, 256, and 512, respectively. We also evaluate five different POOL layers.

Figure 10 shows the relative IPCs of different encryption schemes when computing these CONV layers. We
Figure 10:
The IPC of different encryption schemes normalized to that of a baseline GPU for CONV layers.
Figure 11:
The IPC of different encryption schemes normalized to that of a baseline GPU for POOL layers.

observe that the Direct scheme and the Counter scheme reduce the GPU IPC by up to 40% compared with the baseline GPU without memory encryption. The reason is that memory encryption significantly reduces the data access bandwidth in GPUs, as discussed in Section 2.4. By comparing the performance between the Direct/Counter and the Direct+SE/Counter+SE schemes, we see that our proposed SE scheme significantly improves the memory encryption performance on GPUs by reducing the amount of encrypted data to improve the data access bandwidth without compromising security. The Direct+SE scheme has higher IPC performance than the Counter+SE scheme, since the counter mode encryption causes extra memory accesses for counters. However, the direct encryption has a lower security level than the counter mode encryption. SEAL leverages the ColoE scheme to achieve the same security level as counter mode encryption while delivering higher performance. Compared with the Counter+SE scheme, we observe that SEAL improves the IPC by up to 12% by using the ColoE scheme.

Figure 11 shows the relative IPCs of different encryption schemes when computing POOL layers. We observe that the Direct and Counter schemes reduce the IPC by up to 50% and perform worse than when computing CONV layers, since the computation of POOL layers is more bandwidth-bound than that of CONV layers. For the same reason, Direct+SE, Counter+SE, and SEAL also perform worse than when computing CONV layers. Nevertheless, for the entire neural network, the amount of computation in CONV layers is much larger than that in POOL layers.
We investigate the impact of different encryption ratios on the performance of SEAL when computing a CONV/POOL layer. We vary the encryption ratio from 100% to 0% with a 10% interval. A 100% encryption ratio means a full-encryption GPU. When the encryption ratio is 0%, the performance is the same as that of a baseline GPU without memory encryption. The experimental results are shown in Figure 12. Even slightly reducing the encryption ratio (e.g., by 20%) helps: less data passes through the AES encryption engine, which not only improves the data access bandwidth of the GPU but also reduces the competition for the use of the encryption engine. When the encryption ratio is reduced to 50%, the IPC is improved from 65% to 95% and from 54% to 87% for computing the CONV and POOL layers respectively, compared with a 100% encryption ratio.

Figure 12: The IPC of SEAL with different encryption ratios normalized to that of a baseline GPU.
We evaluate the IPC of the GPU with different encryption schemes when executing NN inference using VGG-16, ResNet-18, and ResNet-34, as shown in Figure 13. Traditional memory encryption solutions, including the Direct and Counter schemes, reduce the GPU IPC for executing NN inference by 30% or more, while SEAL delivers a significant IPC improvement over them. Moreover, SEAL achieves 93%−95% of the IPC of a baseline GPU without memory encryption, i.e., compromising only 5%−7% of the performance for the security improvement.
Figure 13:
The IPC normalized to that of a baseline GPU.
We evaluate the number of different kinds of memory accesses when using different encryption schemes, as shown in Figure 14. For the baseline GPU, all memory accesses, including reads and writes, come from unencrypted data.
Figure 14:
The number of memory accesses normalized to that of a baseline GPU.

For the Direct scheme, all memory accesses are from encrypted data and thus need to pass through the low-bandwidth AES encryption engine. Therefore, the Direct scheme significantly reduces the GPU performance compared with the Baseline, as shown in Figure 13. For the Counter scheme, all memory accesses from data are also from encrypted data. Moreover, the Counter scheme incurs 31%−35% more memory accesses for counters and thus has lower performance than the Direct scheme. Nevertheless, in the Direct and Counter schemes, the main performance bottleneck is the AES encryption engine rather than the DRAM. Hence, the extra memory accesses for counters in the Counter scheme do not incur much performance decrease compared with the Direct scheme. By using the SE scheme, the number of memory accesses from encrypted data is reduced by at least 39%.

We investigate the impact of different encryption schemes on the inference latency, as shown in Figure 15. Traditional memory encryption solutions, including the Direct and Counter schemes, increase the inference latency by 39% or more, while SEAL incurs only 5%−7% higher inference latency than the baseline GPU.
5. RELATED WORK
Model Extraction Attacks.
1) Algorithm layer.
There exist algorithm-layer approaches to extract NN model related information by exploiting the inputs and outputs of the
Figure 15:
The inference latency normalized to that of a baseline GPU.
DL inference. Tramèr et al. [72] assume that the confidence scores of the classification labels produced by DL systems and the NN model architecture are known, and demonstrate that some model parameters can be speculated. Oh et al. [53] assume the model architecture is unknown and propose a metamodel approach to extract information about the NN model architecture. Wang et al. [74] propose an approach to extract the hyperparameters of the NN model. The hyperparameters are usually used to balance between the regularization terms in the objective function and the loss function.
2) System and architecture layers. Existing works exploit information from the operating system and architecture layers to speculate the NN model related information. Naghibijouybari et al. [47] exploit side channel information in the operating system, such as memory allocation APIs, GPU performance counters, and timing measurements, to speculate the NN model related information, e.g., the number of neurons. Hua et al. [30] exploit side channel information in the DL accelerator architecture, e.g., the memory access pattern, to speculate the NN architecture related information.

The model extraction attacks mentioned above can obtain only a small part of the NN model related information. Compared with these model extraction attacks, the bus snooping attacks on DL accelerators that our paper focuses on are much more dangerous. This is because an adversary can obtain all the data of the entire NN model, including the weight parameters of each layer, via bus snooping attacks. Our paper proposes a secure and efficient solution, called SEAL, to defend against bus snooping attacks on DL accelerators.
Memory Encryption.
Obviously, software memory encryption, such as Graviton [73], cannot adequately defend against physical access based attacks [77], since the encryption software itself can be stored in the memory. Hardware memory encryption has been widely used in secure CPU systems [3, 27, 41, 61, 77, 79] to defend against physical access based attacks by adding a hardware encryption engine on the CPU chip. Memory encryption does not cause significant performance degradation in CPU systems, since the DDR memory bus for CPUs has a bandwidth similar to that of the encryption engine. However, memory encryption significantly decreases the performance of DL accelerators, e.g., GPUs, due to the large bandwidth gap between the GDDR memory bus and the encryption engine. Our proposed SEAL efficiently addresses this problem via criticality-aware smart encryption and colocation mode encryption.
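As a rough illustration of why counter-mode memory encryption [41] ties every data access to an extra piece of state, the following toy sketch mirrors the CTR-mode structure used by hardware engines. It uses SHA-256 as a stand-in PRF in place of AES, and the key value and 32-byte line size are invented for illustration: the one-time pad is derived from the key, the line address, and a per-line counter, so reading a line also requires its counter.

```python
import hashlib

KEY = b"\x00" * 16   # secret key held on-chip (placeholder value)
LINE_SIZE = 32       # bytes per memory line (illustrative)

def pad_for(addr: int, counter: int) -> bytes:
    """Derive a one-time pad from (key, line address, per-line counter).
    Real engines use AES; SHA-256 is only a stand-in PRF here."""
    seed = KEY + addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    return hashlib.sha256(seed).digest()[:LINE_SIZE]

def encrypt_line(data: bytes, addr: int, counter: int) -> bytes:
    """XOR the line with its pad; applying it twice decrypts."""
    return bytes(d ^ p for d, p in zip(data, pad_for(addr, counter)))

decrypt_line = encrypt_line  # XOR with the same pad is its own inverse

line = b"weights of layer 3".ljust(LINE_SIZE, b".")
ct = encrypt_line(line, addr=0x1000, counter=7)
assert decrypt_line(ct, addr=0x1000, counter=7) == line
# Decrypting with a stale counter yields garbage:
assert decrypt_line(ct, addr=0x1000, counter=8) != line
```

Because decryption with the wrong counter yields garbage, a conventional engine must fetch the counter for every line it reads, which is exactly the extra memory traffic that counter colocation targets.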
6. CONCLUSION
Our paper proposes SEAL to enhance the security of NN models in DL accelerators on edge devices. To reduce the performance overheads of memory encryption, SEAL leverages a criticality-aware smart encryption (SE) scheme and a colocation mode encryption (ColoE) scheme. The SE scheme improves the data access bandwidth of DL accelerators by identifying partial data that have no impact on the security of NN models and allowing them to bypass the encryption engine. The ColoE scheme improves the efficiency of memory encryption by co-locating data and their counters to reduce the memory accesses for counters. Our experimental results show that, compared with traditional memory encryption solutions, SEAL achieves 1.×–.× IPC improvement. Compared with a baseline accelerator without memory encryption, SEAL improves the security with a slight overhead (5%–7% IPC).
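The colocation idea summarized above can be sketched as a storage layout; the block and counter widths below are invented for illustration and are not the paper's actual parameters. Each memory block stores its counter alongside the associated data, so a single memory access returns both and no separate counter fetch is needed.

```python
BLOCK = 32            # bytes returned per memory access (illustrative)
CTR_BYTES = 8         # counter co-located in the same block (assumed width)
DATA_BYTES = BLOCK - CTR_BYTES

memory = {}           # addr -> raw block bytes (toy memory model)

def write_block(addr: int, data: bytes, counter: int) -> None:
    """Store the counter and the data together in one block."""
    assert len(data) == DATA_BYTES
    memory[addr] = counter.to_bytes(CTR_BYTES, "little") + data

def read_block(addr: int):
    """One access yields both the counter and the data, so decryption
    needs no extra memory access for the counter (the point of ColoE)."""
    raw = memory[addr]
    counter = int.from_bytes(raw[:CTR_BYTES], "little")
    return counter, raw[CTR_BYTES:]

write_block(0x40, b"x" * DATA_BYTES, counter=3)
ctr, data = read_block(0x40)
assert ctr == 3 and data == b"x" * DATA_BYTES
```

The trade-off in such a layout is that the co-located counter consumes part of each block's capacity in exchange for eliminating the separate counter access.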
REFERENCES
[1] S. Alfeld, X. Zhu, and P. Barford, "Data poisoning attacks against autoregressive models," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016.
[3] … in Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.
[4] A. Awad, Y. Wang, D. Shands, and Y. Solihin, "ObfusMem: A low-overhead access obfuscation for trusted memories," in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
[5] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proceedings of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[6] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, "Hidden voice commands," in Proceedings of the 25th USENIX Security Symposium (USENIX Security), 2016.
[7] Chanan Bos, "Tesla's new HW3 self-driving computer – it's a beast," June 2019, https://cleantechnica.com/2019/06/15/teslas-new-hw3-self-driving-computer-its-a-beast-cleantechnica-deep-dive/.
[8] C.-L. Chen and M. Hsiao, "Error-correcting codes for semiconductor memory applications: A state-of-the-art review," IBM Journal of Research and Development, vol. 28, no. 2, pp. 124–134, 1984.
[9] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[10] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[11] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
[12] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[13] J. Daemen and V. Rijmen, The Design of Rijndael: AES – the Advanced Encryption Standard. Springer Science & Business Media, 2013.
[14] T. J. Dell, "A white paper on the benefits of Chipkill-correct ECC for PC server main memory," IBM Microelectronics Division.
[16] … in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] I. Goodfellow, P. McDaniel, and N. Papernot, "Making machine learning robust against adversarial inputs," Communications of the ACM, vol. 61, no. 7, pp. 56–66, 2018.
[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NeurIPS), 2014.
[19] Google Corporation, "Edge TPU: Google's purpose-built ASIC designed to run inference at the edge," https://cloud.google.com/edge-tpu/, 2018.
[20] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[21] S. Gueron, "A memory encryption engine suitable for general purpose processors," Cryptology ePrint Archive, Report 2016/204, 2016, https://eprint.iacr.org/2016/204.
[22] S. Gueron, "Memory encryption for general-purpose processors," IEEE Security & Privacy, vol. 14, no. 6, pp. 54–62, 2016.
[23] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[26] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[27] M. Henson and S. Taylor, "Memory encryption: A survey of existing techniques," ACM Computing Surveys (CSUR), vol. 46, no. 4, pp. 53–79, 2014.
[28] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[29] X. Hu, L. Liang, S. Li, L. Deng, P. Zuo, Y. Ji, X. Xie, Y. Ding, C. Liu, T. Sherwood, and Y. Xie, "DeepSniffer: A DNN model extraction framework based on learning architectural hints," in Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020.
[30] W. Hua, Z. Zhang, and G. E. Suh, "Reverse engineering convolutional neural networks through side-channel information leaks," in Proceedings of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018.
[31] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue et al., "An empirical evaluation of deep learning on highway driving," arXiv preprint arXiv:1504.01716, 2015.
[32] Intel Corporation, "Intel® …
[33] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
[34] Khalid Moammer, "Nvidia GTX 1070 undressed, GP104 GPU gets first ever die shots – dissecting the heart of GeForce," 2016, https://wccftech.com/nvidia-gtx-1080-gp104-die-shot/.
[35] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2012.
[37] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," arXiv preprint arXiv:1607.02533, 2016.
[38] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[39] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[40] H. Li, K. Ota, and M. Dong, "Learning IoT in edge: Deep learning for the Internet of Things with edge computing," IEEE Network, vol. 32, no. 1, pp. 96–101, 2018.
[41] H. Lipmaa, P. Rogaway, and D. Wagner, "CTR-mode encryption," in Proceedings of the First NIST Workshop on Modes of Operation, 2000.
[42] B. Liu and B. M. Baas, "Parallel AES encryption engines for many-core processor arrays," IEEE Transactions on Computers, vol. 62, no. 3, pp. 536–547, 2011.
[43] S. Liu, A. Kolli, J. Ren, and S. Khan, "Crash consistency in encrypted non-volatile main memory systems," in Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.
[44] Y. Liu, X. Chen, C. Liu, and D. Song, "Delving into transferable adversarial examples and black-box attacks," in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[45] S. Mathew, F. Sheikh, A. Agarwal, M. Kounavis, S. Hsu, H. Kaul, M. Anders, and R. Krishnamurthy, "53 Gbps native GF(2^4)^2 composite-field AES-encrypt/decrypt accelerator for content-protection in 45nm high-performance microprocessors," in Proceedings of the 2010 IEEE Symposium on VLSI Circuits (VLSIC), 2010.
[46] S. Morioka and A. Satoh, "A 10-Gbps full-AES crypto design with a twisted BDD S-box architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 7, pp. 686–691, 2004.
[47] H. Naghibijouybari, A. Neupane, Z. Qian, and N. Abu-Ghazaleh, "Rendered insecure: GPU side channel attacks are practical," in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2018.
[53] … in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[54] OpenCores, "Tiny AES," http://opencores.org/project/, 2012.
[55] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, "Practical black-box attacks against machine learning," in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS), 2017.
[56] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, "Practical black-box attacks against machine learning," in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS), 2017.
[57] PyTorch for GPGPU-Sim, "Modified version of PyTorch able to work with changes to GPGPU-Sim," https://github.com/gpgpu-sim/pytorch-gpgpu-sim, 2018.
[58] F. Radenović, G. Tolias, and O. Chum, "CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples," in Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[60] … in Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2018.
[61] G. Saileshwar, P. Nair, P. Ramrakhyani, W. Elsasser, J. Joao, and M. Qureshi, "Morphable counters: Enabling compact integrity trees for low-overhead secure memories," in Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.
[62] G. Sayilar and D. Chiou, "Cryptoraptor: High throughput reconfigurable cryptographic processor," in …
[65] … in Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2004.
[66] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[67] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[68] G. E. Suh, D. Clarke, B. Gassend, M. v. Dijk, and S. Devadas, "Efficient memory integrity verification and encryption for secure processors," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2003, p. 339.
[69] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[71] … in Proceedings of the 2018 International Conference on Learning Representations (ICLR), 2018.
[72] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction APIs," in Proceedings of the 25th USENIX Security Symposium (USENIX Security), 2016.
[73] S. Volos, K. Vaswani, and R. Bruno, "Graviton: Trusted execution environments on GPUs," in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018.
[74] B. Wang and N. Z. Gong, "Stealing hyperparameters in machine learning," in Proceedings of the 2018 IEEE Symposium on Security and Privacy (SP), 2018.
[75] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia et al., "Machine learning at Facebook: Understanding inference at the edge," in Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019.
[76] W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "Toward human parity in conversational speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, 2017.
[77] C. Yan, D. Englender, M. Prvulovic, B. Rogers, and Y. Solihin, "Improving cost, performance, and security of memory encryption and authentication," in Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA), 2006.
[78] M. Yan, C. Fletcher, and J. Torrellas, "Cache telepathy: Leveraging shared resource attacks to learn DNN architectures," arXiv preprint arXiv:1808.04761, 2018.
[79] V. Young, P. J. Nair, and M. K. Qureshi, "DEUCE: Write-efficient encryption for non-volatile memories," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.
[80] Y. Zhao, I. Shumailov, R. Mullins, and R. Anderson, "To compress or not to compress: Understanding the interactions between adversarial attacks and neural network compression," in Proceedings of the Conference on Systems and Machine Learning (SysML), 2019.