Toward Taming the Overhead Monster for Data-Flow Integrity
arXiv preprint [cs.AR], February 22, 2021

Lang Feng*§, Jiayi Huang†, Jeff Huang¶, Jiang Hu§*
*School of Electronic Science and Engineering, Nanjing University
†Department of Electrical and Computer Engineering, University of California, Santa Barbara
§Department of Electrical & Computer Engineering, Texas A&M University
¶Department of Computer Science & Engineering, Texas A&M University
fl[email protected]; [email protected]; [email protected]; [email protected]

Abstract
Data-Flow Integrity (DFI) is a well-known approach to effectively detecting a wide range of software attacks. However, its real-world application has been quite limited so far because of the prohibitive performance overhead it incurs. Moreover, the overhead is enormously difficult to overcome without substantially lowering the DFI criterion. In this work, an analysis is performed to understand the main factors contributing to the overhead. Accordingly, a hardware-assisted parallel approach is proposed to tackle the overhead challenge. Simulations on the SPEC CPU 2006 benchmarks show that the proposed approach can completely verify the DFI defined in the original seminal work while reducing performance overhead by 4× on average.

Keywords: Data-Flow Integrity · Architecture · Security
1 Introduction

Data-Flow Integrity (DFI) is a regulation to ensure that data to be accessed are written by legitimate instructions [1]. As such, DFI verification can identify unwanted data modifications that are inconsistent with the programmer's intention. It can detect a wide variety of security attacks, including control data attacks such as Jump-Oriented Programming (JOP) [2] and Return-Oriented Programming (ROP) [3], and non-control data attacks such as Heartbleed [4] and the heap overflow attack on Nullhttpd [5]. As a large number of software attacks rely on data modifications, DFI is a single principle that is effective for many different attack scenarios, including future ones. In fact, its defense scope is a much larger superset of that of Control-Flow Integrity (CFI) [6], another well-known software security approach.

The concept of DFI was introduced in 2006 by the seminal work [1], and has received a lot of attention thereafter due to its potential as a powerful security measure. However, a complete DFI enforcement as in [1] incurs more than 100% performance overhead even though several optimization techniques have been applied. Indeed, the huge overhead seems inevitable, as every data access needs to be examined. Due to this intrinsic difficulty, there have been few follow-up works on DFI despite its widely recognized importance. This is in sharp contrast to CFI [6], which has many more published studies [7, 8, 9, 10, 11, 12].

The few later works on DFI [13, 14, 15, 16] reduce the overhead by enforcing partial DFI, whose criteria are substantially lower than the original DFI definition [1]. Hardware-Assisted Data-Flow Isolation (HDFI) [13] is one example. It partitions data into two regions, and only requires that data to be read and written be consistent in the same region. In other words, it reports a violation only when data intended for one region is actually written by an instruction for the other region.
Although its overhead is very small, the verification granularity is very coarse and may miss attacks that mingle different data within the same region. Consider the example in Figure 1, where input data are first written into u0 and u1 in lines 10 and 11. Later, the data are copied to buffers in lines 13-15. If there is a buffer overflow when executing line 10, i.e., the input data size exceeds 256, then the offset u0->off is modified unintentionally. Then, line 13 may copy user0's data to other users' buffers through the modified u0->off. Meanwhile, user1 can write to user2's buffer in line 14 in the same way. As HDFI partitions data into only two regions, one of the user pairs - (user0, user1), (user0, user2) or (user1, user2) - must share the same region. Consequently, the former user in a pair can attack the latter without being detected by HDFI. By contrast, a complete DFI [1] can isolate data among tens of thousands of regions, i.e., a resolution orders of magnitude higher than HDFI's. Therefore, the security price that HDFI pays for its overhead reduction can be very high.

1  struct vuln{
2    char data[256];
3    int off = 0;
4    int size = 0;
5  } *u0, *u1, *u2;
6  /* ============== */
7  char user0_buffer[256];
8  char user1_buffer[256];
9  char user2_buffer[256];
10 read_user_input(u0, user0_input);
11 read_user_input(u1, user1_input);
12 ...
13 memcpy(user0_buffer+u0->off, u0->data, u0->size);
14 memcpy(user1_buffer+u1->off, u1->data, u1->size);
15 memcpy(user2_buffer+u2->off, u2->data, u2->size);

Figure 1: An example of a vulnerability that HDFI cannot detect.

Verifying the complete DFI [1] with practically acceptable overhead is a huge challenge. Different from most existing overhead reduction techniques [13, 14, 15, 16], which rely on lowering the DFI criterion, we pursue a new approach that exploits additional hardware while the original DFI [1] is still completely verified. As hardware cost becomes increasingly affordable along with the progress of semiconductor technology, reducing performance overhead at the expense of extra hardware is a promising direction.

We first conduct an extensive performance analysis of DFI and, surprisingly, discover that the frequent DFI data access does not lead to frequent memory access, so memory access is not a bottleneck; instead, the DFI kernel computations usually contribute the most to the overhead.
We propose a parallel approach, where the kernel computations are performed on another processor core. However, a straightforward software-based parallel implementation still experiences huge overhead resulting from runtime information collection and communication with the other processor core. Therefore, we develop a new hardware technique to further trim down the overhead. This hardware-assisted parallel approach also includes new software instrumentation techniques, lossless data compression and runtime optimization techniques. For ease of deployment, we intend to minimize the dependence on computing infrastructure changes. Except for the necessary circuits and software instrumentation, our approach does not rely on new instructions or OS/compiler modifications.

Overall, the proposed approach reduces performance overhead from the 161% of [1] to an average of 36% on the same SPEC CPU 2006 benchmarks. As it is a complete DFI verification, it can detect a wide range of security attacks and cover cases that cannot be handled by the previous low-overhead methods [13, 14, 15, 16]. Our approach provides a solution with a security-overhead tradeoff that complements existing methods [13, 14, 15, 16]. A brief comparison with existing methods is summarized in Table 1. The contributions of this work are as follows.

• An overhead breakdown analysis is performed to understand the main performance bottlenecks in software DFI.
• This is the first hardware approach to complete DFI verification, to the best of our knowledge.
• Two variants of the proposed approach are investigated, one for Processing-In-Memory (PIM) and the other for Chip Multiprocessor (CMP).
• The tradeoff between DFI violation detection latency and performance overhead is studied.
• Our approach achieves about 4× overhead reduction, which is a major progress for complete DFI since 2006.
2 Background

Data-flow integrity requires that data loaded from memory can only have been stored by legitimate instructions that are consistent with the programmer's original intention [1]. Every instruction in a program is assigned a numerical identifier through automatic code instrumentation. The reaching definition of an instruction A is the latest instruction B that stores the data loaded by A, and is represented by the identifier of B. Each instruction that can load data from memory has its own Reaching Definition Set (RDS), which consists of all the allowed reaching definitions of this instruction. A static software analysis can be performed on a program to obtain the RDSs for all relevant instructions.
Table 1: Comparison between our work and others.
Method       Perf. Overhead  DFI Completeness  Approach  New Instr.  OS Change  Compiler Change  Instrumentation
SW DFI [1]   161%            Complete          SW        No          No         No               Yes
KENALI [14]  7-15%           Partial           SW        No          Yes        No               Yes
WIT [15]     7%              Partial           SW        No          No         No               Yes
CHERI [17]   5-20%           Partial           HW        Yes         Yes        Yes              No
TMDFI [16]   39%             Partial           HW        Yes         No         No               No
HDFI [13]    <2%             Partial           HW        Yes         Yes        Yes              No
Our work     39%             Complete          HW        No          No         No               Yes
In the example of Figure 2, "store x y" means storing variable x at address y, "load x y" loads the data at address y into variable x, and "jump label" is an unconditional branch to the location marked by label. If the identifier of each instruction is the same as its line number, the RDS of line 7's instruction is {1}. DFI requires that all instructions that load data from memory be consistent with their RDSs, i.e., when executing an instruction A that loads data from memory, the data should indeed have been most recently stored by one of the instructions in the RDS of A. Hence, the identifier of the latest instruction that stores a datum needs to be tracked for that datum. Such identifiers for all data form a Reaching Definition Table (RDT).

1 store x1 addr1
2 store x1 addr2
3 jump label
4 store x2 addr1
5 load x3 addr1
6 label:
7 load x4 addr1

Figure 2: A code example for illustrating DFI.

DFI is a superset of Control-Flow Integrity (CFI) [6], which only regulates instruction flow transitions toward target addresses conforming to the original design intention. Attackers have to modify control data, such as the target address of an indirect branch, to change a control flow. By protecting all data, DFI can also prevent all control-flow attacks. Additionally, DFI can protect non-control data that cannot be covered by CFI.

A general threat model for DFI is that the computer hardware and OS are secure, while attackers can manage to view the binary code and opportunistically modify some program data, e.g., through buffer overflow.
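The RDT bookkeeping described above can be sketched in C. This is a hypothetical miniature model we provide for illustration (the names rdt, dfi_on_store and dfi_on_load are ours, not from [1]): every store records its identifier in the RDT entry of its target address, and every load checks that the recorded identifier is in its RDS.

```c
#include <stddef.h>
#include <stdint.h>

/* Miniature model of DFI bookkeeping. rdt[addr] tracks the identifier
 * of the latest store to addr; a load passes the DFI check only if
 * that identifier is in the load's RDS. */
#define MEM_SLOTS 16

static uint16_t rdt[MEM_SLOTS]; /* Reaching Definition Table */

/* Executed for every store: record the storing instruction's identifier. */
void dfi_on_store(size_t addr, uint16_t id) { rdt[addr] = id; }

/* Executed for every load: verify the latest writer is in the load's RDS.
 * Returns 1 if the check passes, 0 on a DFI violation. */
int dfi_on_load(size_t addr, const uint16_t *rds, size_t rds_len) {
    for (size_t i = 0; i < rds_len; i++)
        if (rdt[addr] == rds[i])
            return 1;
    return 0;
}
```

Mirroring Figure 2, if instruction 1 stores to addr1 and the load at line 7 has RDS {1}, the check passes; if an unexpected instruction (say, identifier 4) had written addr1 last, the check would flag a violation.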
3 Related Work

The concept of Data-Flow Integrity (DFI) was proposed in the seminal work [1] in 2006, which also provides a software implementation and optimization techniques for overhead reduction. Although the DFI verification procedure is simple, its performance overhead is intrinsically huge, as the verification needs to be conducted for a tremendous amount of data.

The few later works [14, 15, 13, 17, 16] achieved much lower overhead by focusing on partial DFI. The work of [14] is restricted to only certain selected data of kernel software. One of its main contributions is its techniques for selecting the data to be protected, and its performance overhead is only 7-15%. While [1] protects both load and store instructions, the scope of Write Integrity Testing (WIT) [15] is restricted to store. It requires that each store instruction can only write to certain data objects, and each indirect call can only call certain functions. Although its overhead is at most 25%, it does not cover load instructions. Thus, an unsafe load instruction may read more bytes than the programmer intended, and consequently information leaks may occur; e.g., Heartbleed [4] is an attack that WIT would fail to detect.

Data isolation is another approach to protecting data with relatively low overhead. A hardware solution for data-flow isolation, called HDFI, is proposed in [13]. It designates two data regions, a sensitive one and a non-sensitive one. A 1-bit tag tells which region a datum belongs to. The instruction set is modified such that the tags can be read and set. Moreover, the processor hardware, operating system and compiler also need changes. If data belongs to one region, it cannot be written by an instruction for the other region. Although the isolation helps security, it cannot handle the case where load/store instructions for different data of the same region are mingled. Thus, its low overhead of <2% comes at the price of a very coarse-grained security resolution. To a certain degree, the original DFI [1]
can be regarded as data isolation among individual instructions: with 16 bits used for each instruction identifier, it is equivalent to isolation among up to 2^16 regions, compared with the only 2 regions of HDFI [13]. TMDFI [16] is a hardware DFI implementation whose tags are limited to 8 bits, i.e., at most 2^8 = 256 regions, which are much coarser grained than the resolution of our approach; a typical program, such as each benchmark in SPEC CPU 2006, needs far more identifiers than that.

4 Overhead Analysis

We analyze the source of the performance overhead of software DFI [1]. We call the program to be checked by DFI verification the user program. For a user program, whenever a store or load is executed, the RDT needs to be accessed and consequently data transfer with memory may be greatly increased. A memory access typically takes hundreds of clock cycles and can cause huge overhead. Thus, we first measured the cache hit rate to understand DFI's impact on memory accesses.
Figure 3: Cache hit rates of user programs with and without software DFI.

The cache hit rates of user programs without DFI verification and with software DFI are shown in Figure 3. One can see that the cache hit rates are usually greater than 95%, regardless of whether DFI verification is applied. This indicates that memory access is probably not a bottleneck.

Figure 4: Overhead breakdown of software DFI.

We further investigated the overhead breakdown of software DFI; the results are shown in Figure 4, where "RDT Search" represents the instrumentation execution that finds the RDT entry of the corresponding user load or store, "Bounds Check" means the check preventing illegal modification of the RDT, "Library Loop" is the additional loop in the instrumented wrapper of each library function, and "DFI Check" indicates the comparisons verifying that the identifier found by the RDT search is in the RDS of the corresponding user load.

According to Figure 4, most of the overhead comes from the DFI check. It also shows that RDT access itself (excluding the RDT search part) contributes little to the overhead. This confirms that the bottleneck is not memory access but the DFI check instructions. Specifically, many comparison and branch instructions are executed for each DFI check, which compares the identifier found in the RDT with each identifier in the RDS of the corresponding user load. Although this check computation is fairly simple, it is performed for a huge volume of data.
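To see why the DFI check dominates, consider an instrumented version of the check loop that counts the comparisons it executes. This sketch is ours (the function name and counting are for illustration only): each user load triggers a linear scan of its RDS, and with millions of loads these comparisons and branches add up.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the comparisons a software DFI check executes for one load.
 * rdt_id is the identifier found in the RDT for the load's address;
 * *passed is set to 1 if rdt_id is found in the RDS. */
size_t dfi_check_cost(uint16_t rdt_id, const uint16_t *rds, size_t rds_len,
                      int *passed) {
    size_t comparisons = 0;
    *passed = 0;
    for (size_t i = 0; i < rds_len; i++) {
        comparisons++;              /* one compare + branch per RDS entry */
        if (rds[i] == rdt_id) { *passed = 1; break; }
    }
    return comparisons;
}
```

A failing check (or a match at the end of the RDS) scans the whole set, so the per-check cost grows with RDS size even though each comparison is trivial.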
5 Proposed Approach

Our approach is to delegate DFI verification to another computing resource external to the main processor where the user program is executed. The delegated resource can be a processor core in a Chip Multiprocessor (CMP) or a Processing-In-Memory (PIM) processor [21]. The two options are similar in terms of overhead reduction. A non-essential yet non-trivial difference is that the PIM approach entails less data movement, as the RDSs and RDT reside in memory; thus, the PIM approach is more power-efficient. Moreover, the DFI kernel computation is simple: a PIM processor suffices, whereas a CMP core is usually an overkill. An approximate estimate [22, 23] indicates that the circuit area of a PIM processor is often <10% of that of the main processor under the same technology. We will use PIM as the platform to describe our approach, while the same idea is applicable to the CMP core option.

We first summarize the information required for DFI verification by PIM, and its location, as follows.

1. RDS (Reaching Definition Set) for all user load instructions in the program. This information does not change throughout program execution and can be loaded into PIM once at the beginning.
2. RDT (Reaching Definition Table). This information changes dynamically during program execution. It is maintained by the PIM processor, and therefore is local to DFI verification at PIM.
3. Target instruction information. A target instruction is an instruction in a user program to be verified for DFI. Mainly two types of instructions are involved: load instructions for which DFI verification is performed, and store instructions that affect the RDT. This information changes at runtime and needs to be transferred from the main processor to memory. It consists of the following components:
   • Instruction identifier.
   • Instruction type: either load or store.
   • Target address of the load or store.

The PIM processor undertakes most of the DFI verification components analyzed in Section 4, and can quickly access the RDSs and RDT in its vicinity. As such, what remains for the main processor is to collect target instruction information and send it to PIM. Although the information collection and transmission can be implemented in software in the same way as multithreading, our study shows that such a software approach still experiences huge or even worse performance overhead. Thus, we propose a hardware approach to minimize extra software execution at the main processor. Moreover, the hardware approach facilitates runtime application of the optimizations described in Section 7.4.

The overall flow of the proposed DFI verification is depicted in Figure 5, where green numbers indicate step IDs:

1. Static analysis is performed on a user program.
2. RDSs are obtained from the static analysis.
3. The code is instrumented automatically. The main instrumentation is to add store instructions, called DFI stores (in red font in Figure 5), after each target instruction so as to help collect its information.
4. The DFI checking program and RDSs are loaded onto the PIM processor before the user program execution starts on the main processor.
5. During program execution, dedicated hardware, called the info-collector in Figure 5, parses each DFI store, collects target instruction information accordingly, forms a DFI packet, and sends it to the PIM processor, where verification computations are performed or the RDT is updated.
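The target instruction information and its buffering in the packet FIFO memory can be sketched as follows. The field names, struct layout and FIFO depth here are our illustrative assumptions; the paper specifies only which pieces of information a packet carries, not a concrete layout.

```c
#include <stdint.h>

/* Sketch of a basic DFI packet: one load/store's runtime information. */
typedef struct {
    uint16_t id;      /* instruction identifier (16 bits suffice per [1]) */
    uint8_t  is_load; /* instruction type: 1 = load, 0 = store            */
    uint64_t addr;    /* target address of the load/store                 */
} dfi_packet;

/* The info-collector appends packets to a FIFO region that the PIM
 * processor drains in first-come-first-serve order. */
#define FIFO_DEPTH 64

typedef struct {
    dfi_packet slots[FIFO_DEPTH];
    unsigned head, tail;
} dfi_fifo;

int fifo_push(dfi_fifo *f, dfi_packet p) {
    unsigned next = (f->tail + 1) % FIFO_DEPTH;
    if (next == f->head) return 0;   /* full: producer must wait */
    f->slots[f->tail] = p;
    f->tail = next;
    return 1;
}

int fifo_pop(dfi_fifo *f, dfi_packet *out) {
    if (f->head == f->tail) return 0; /* empty: PIM core waits */
    *out = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    return 1;
}
```

The first-come-first-serve draining mirrors the packet FIFO memory described later: the main processor produces packets in program order, and the PIM side consumes them in the same order.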
Figure 5: The flow of PIM DFI verification.

6 Software Instrumentation

Instrumentation adds code to a user program in order to facilitate DFI verification. The software instrumentation in our approach helps not only extract the necessary information but also avoid changing the instruction set. The description here is based on C/C++ programs compiled by LLVM [24], with the static analysis performed by SVF [25]. However, our techniques are general and directly applicable to other software languages, compilers and static analysis tools.

Given a program's LLVM Intermediate Representation (IR), static analysis is performed to obtain its reaching definition sets (RDSs), which will be sent to PIM at the beginning of code execution. The instrumentation is performed automatically on the IR by a software tool that we developed. Then, the instrumented IR is further compiled into binary code. Although the instrumentation is inserted in the middle of a compilation flow, it does not require any changes to compiler code. In the absence of source code, such as for proprietary software, our method can still be applied by employing a binary code analysis tool and instrumenting the binary code.
The instrumentation mainly extracts the runtime information of target instructions - the load/store instructions in a user program related to DFI checking - and sends it to the PIM processor. The information includes the instruction identifier, instruction type and target address of the load/store. Instruction identifiers are automatically assigned by the instrumentation tool. An example of code instrumentation is shown by the red-font instructions in Figure 5. These instrumentation store instructions are called DFI stores; we overload their use with underlying semantics different from those of ordinary store instructions. Our key technique is to differentiate between an ordinary store and a DFI store without adding new instructions. The basic syntax of a DFI store is

store runtime_info dfi_global

where dfi_global is the address of a global variable declared at the beginning of a program, which serves as a signature to indicate a DFI store. The address of this global variable is set by writing a dummy value at the beginning of the program:

store dfi_dummy dfi_global

The info-collector (dotted box in Figure 5) checks whether a store instruction has the same target address as dfi_global. If so, the instruction is a DFI store.

Every store and load instruction in a user program, called a target instruction, is followed by a DFI store. The runtime_info contains the instruction type and identifier of the preceding target instruction. For example, in Figure 5, line 2 is an instrumentation instruction store "load, id=12", which tells the instruction type and identifier of the target instruction in line 1. To encode the instruction type and identifier, according to [1], 16 bits are sufficient for representing instruction identifiers in a large program. We use an additional bit to indicate the instruction type, where 0 means write and 1 means read. When the info-collector recognizes a DFI store, it extracts the target address of the preceding target instruction. The target address and the runtime_info form a DFI packet to be sent to PIM.

At the beginning of code execution, a memory space is dynamically allocated at the PIM processor for DFI verification. This includes the memory space for storing incoming packets, which is called the packet FIFO memory. The starting address of the packet FIFO memory is packet_mem_addr, which is also a dynamic value. We specify it by adding the following instruction at the beginning of each user program:

store packet_dummy packet_mem_addr

The packet_dummy is a dummy packet with a fixed value used to obtain the destination address for future DFI packets. The info-collector obtains packet_mem_addr by identifying the first store in the program that stores packet_dummy to an address; packet_mem_addr can only be assigned once for a program. Later during code execution, all DFI packets are sent to the FIFO memory based on packet_mem_addr. Please note that dfi_global and packet_mem_addr are generated by the automatic code instrumentation, and are not visible to security attackers.

An example of the instrumentation is shown in Figure 6, where lines 7 and 10 are the original instructions of the user program, while lines 2, 3, 4, 5, 8 and 11 are instrumentation. The identifiers of the instructions at lines 7 and 10 are in the parentheses (12 and 25). The data of a DFI store (lines 8 and 11 in Figure 6) has bit 16 for the instruction type and bits 15-0 for the instruction identifier.

1  /* ===== beginning of the program ====== */
2  (instructions for allocating FIFO memory)
3  (instructions for storing RDS to memory)
4  store dfi_dummy dfi_global
5  store packet_dummy packet_mem_addr
6  ...
7  store x1 addr1               //(12)
8  store (0<<16)+12 dfi_global
9  ...
10 load x2 addr2                //(25)
11 store (1<<16)+25 dfi_global

Figure 6: An example of code instrumentation.
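The runtime_info word of Figure 6 - bit 16 for the instruction type, bits 15-0 for the identifier - can be packed and unpacked as below. The helper names are ours; the bit layout is the one stated in the text.

```c
#include <stdint.h>

/* Pack the runtime_info word of a DFI store: bits 15-0 hold the
 * instruction identifier, bit 16 holds the type (0 = write, 1 = read). */
static inline uint32_t encode_runtime_info(int is_load, uint16_t id) {
    return ((uint32_t)(is_load ? 1 : 0) << 16) | id;
}

static inline uint16_t info_id(uint32_t word) {
    return (uint16_t)(word & 0xFFFFu);
}

static inline int info_is_load(uint32_t word) {
    return (int)((word >> 16) & 1u);
}
```

For the listing in Figure 6, line 8 would carry encode_runtime_info(0, 12) for the store with identifier 12, and line 11 would carry encode_runtime_info(1, 25) for the load with identifier 25.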
A software program often calls library functions, whose source code or IR is not directly accessible. However, instrumentation can still be performed to obtain the target instruction information, where the target instruction is a library function call. This is similar in spirit to the wrapper of [1], but our realization is quite different. As a library function call may in general involve a multi-byte data block, the instrumentation needs to keep track of the data-length besides the data address. Our approach is illustrated by the example in Figure 7.

1  store (indicators<<17)+7 dfi_global   // indicators and id=7
2  store (y1's addr) dfi_global          // load address
3  store (x1's addr) dfi_global          // store address
4  store 40 dfi_global                   // data-length in one store
5  memcpy(x1, y1, 40)                    //(7)
6  ...
7  store (indicators<<17)+15 dfi_global  // indicators and id=15
8  store (x2's addr) dfi_global          // store address
9  store 12 dfi_global                   // data-length, part 1
10 store 9 dfi_global                    // data-length, part 2
11 memset(x2, 3, (9<<32)+12)             //(15)

Figure 7: The instrumentation for library functions.

In this example, the target instructions are the function calls in lines 5 and 11, with their identifiers in parentheses. The instrumentation for each library function call includes multiple DFI store instructions, like lines 1-4 for the target instruction of line 5. The first DFI store keeps the corresponding identifier in its lower 16 bits. Its bits 17-20 are four binary indicators telling whether the target instruction is a library function call, whether the data-length needs 64 bits to represent, and whether the function loads/stores data. The info-collector parses these indicators and then takes corresponding actions. Additional DFI store instructions are added to send the other information. For example, lines 2 and 3 send the load and store addresses. Depending on whether the data-length is represented in 32 or 64 bits, the data-length is sent through a single DFI store or two of them. For example, line 4 sends the data-length in a single DFI store, while lines 9 and 10 send it in two DFI store instructions.
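The indicator word for library calls can be modeled as below. The paper says which four indicators occupy bits 17-20 but not their order, so the specific bit assignment here is our assumption, as are the names; the point is how the indicators determine how many follow-up DFI stores the info-collector should expect.

```c
#include <stdint.h>

/* Assumed bit assignment of the four indicators in bits 17-20. */
enum {
    DFI_LIB_CALL   = 1u << 17, /* target is a library function call */
    DFI_LEN_64BIT  = 1u << 18, /* data-length needs 64 bits         */
    DFI_LIB_LOADS  = 1u << 19, /* the function loads data           */
    DFI_LIB_STORES = 1u << 20, /* the function stores data          */
};

/* Bits 15-0 still carry the identifier of the target instruction. */
static inline uint32_t lib_info(uint32_t flags, uint16_t id) {
    return flags | id;
}

/* Number of follow-up DFI stores the info-collector should expect:
 * one per transferred address, plus one or two for the data-length. */
int lib_extra_stores(uint32_t word) {
    int n = 0;
    if (word & DFI_LIB_LOADS)  n += 1; /* load address  */
    if (word & DFI_LIB_STORES) n += 1; /* store address */
    n += (word & DFI_LEN_64BIT) ? 2 : 1;
    return n;
}
```

For the memcpy of Figure 7 (loads and stores, 32-bit length) this predicts three follow-up stores (lines 2-4); for the memset (stores only, 64-bit length) it also predicts three (lines 8-10).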
Function return addresses are stored on the stack and are vulnerable to security attacks such as Return-Oriented Programming (ROP) [3]. We treat their accesses as implicit load/store instructions and perform DFI checks accordingly. When a parent function parent_func() calls a child function child_func(), the return address is stored in the stack by an instruction parent_inst. When function child_func() returns, the return address is loaded by a return instruction child_inst. DFI ensures that the return address used by child_inst is the latest value stored by parent_inst. However, function return is not covered by some static analysis tools, such as SVF [26]. Thus, we develop a dedicated instrumentation technique different from that for ordinary load/store instructions. Although a similar idea was also proposed in [1], our instrumentation is quite different.

1 /* ===== beginning of the function ====== */
2 p_ret_addr = instruction_getting_ret_addr_pointer
3 store (1<<21)+(max_id+thread_id) dfi_global
4 store p_ret_addr dfi_global
5 ...
6 store (1<<21)+(1<<16)+(max_id+thread_id) dfi_global
7 store p_ret_addr dfi_global
8 return

Figure 8: Instrumentation for function return.

The instrumentation for function return is illustrated in Figure 8. At the beginning (line 2), the pointer to the return address, p_ret_addr, is obtained. For a C/C++ program, this can be realized by calling the built-in function __builtin_frame_address(0) and adding 4 to the returned result. We designate the identifier of the implicit store instruction (function call) parent_inst as the maximum identifier from the static analysis plus the thread ID (lines 3 and 6). This ensures that the identifier of parent_inst is unique. Bit 21 of the data in the DFI store in line 3 is set to 1, to inform the info-collector that this is for a function return. Then, the info-collector expects a subsequent DFI store for the pointer to the return address. The info-collector combines the instruction type (implicit load/store), the identifier and the pointer to form a DFI packet. At the end of the child function (lines 6 and 7), similar instrumentation instructions are added for the implicit load (function return).
For each load whose identifier is larger than the maximum identifierof static analysis, DFI requires the identifier of the latest store to be the same as the identifier of this load . Info-collector is the key hardware component to be added at the main processor. It detects DFI store instructions,collects runtime information of a target instruction, generates DFI packets and sends them to PIM. It can be realizedas a combinational circuit through synthesizing Verilog description. Its basic operations are depicted in Figure 9. (cid:39)(cid:68)(cid:87)(cid:68)(cid:3)(cid:53)(cid:72)(cid:79)(cid:68)(cid:92)(cid:38)(cid:75)(cid:72)(cid:70)(cid:78)(cid:3)(cid:44)(cid:81)(cid:71)(cid:76)(cid:70)(cid:68)(cid:87)(cid:82)(cid:85)(cid:86)(cid:3)(cid:76)(cid:81)(cid:3)(cid:39)(cid:41)(cid:44)(cid:3)(cid:54)(cid:87)(cid:82)(cid:85)(cid:72)(cid:47)(cid:76)(cid:69)(cid:85)(cid:68)(cid:85)(cid:92)(cid:3)(cid:41)(cid:88)(cid:81)(cid:70)(cid:87)(cid:76)(cid:82)(cid:81) (cid:53)(cid:72)(cid:74)(cid:88)(cid:79)(cid:68)(cid:85)(cid:3)(cid:54)(cid:87)(cid:82)(cid:85)(cid:72)(cid:18)(cid:47)(cid:82)(cid:68)(cid:71)(cid:3)(cid:57)(cid:72)(cid:85)(cid:76)(cid:73)(cid:76)(cid:70)(cid:68)(cid:87)(cid:76)(cid:82)(cid:81)(cid:53)(cid:72)(cid:87)(cid:88)(cid:85)(cid:81)(cid:3)(cid:36)(cid:71)(cid:71)(cid:85)(cid:72)(cid:86)(cid:86)(cid:51)(cid:85)(cid:82)(cid:87)(cid:72)(cid:70)(cid:87)(cid:76)(cid:82)(cid:81)(cid:60)(cid:72)(cid:86) (cid:60)(cid:72)(cid:86)(cid:39)(cid:41)(cid:44)(cid:3)(cid:54)(cid:87)(cid:82)(cid:85)(cid:72)(cid:34)(cid:37)(cid:68)(cid:86)(cid:76)(cid:70)(cid:3)(cid:51)(cid:68)(cid:70)(cid:78)(cid:72)(cid:87)(cid:37)(cid:68)(cid:86)(cid:76)(cid:70)(cid:3)(cid:51)(cid:68)(cid:70)(cid:78)(cid:72)(cid:87)(cid:47)(cid:76)(cid:69)(cid:85)(cid:68)(cid:85)(cid:92)(cid:3)(cid:51)(cid:68)(cid:70)(cid:78)(cid:72)(cid:87) (cid:49)(cid:82) 
Figure 9: Operations of the info-collector.

The info-collector acts only when a store instruction is executed. In step B of Figure 9, it checks whether dfi_global and packet_mem_addr have already been defined. If not, it proceeds to step C to capture dfi_global or packet_mem_addr. Please note that "store dfi_dummy dfi_global" and "store packet_dummy packet_mem_addr" are instrumented at the beginning of a program. Moreover, both dfi_dummy and packet_dummy have signature values that can be recognized by the info-collector. If they have already been defined, the info-collector
further checks whether the store is a DFI store. This is done by examining whether the target address is the same as that of dfi_global. If this store is a DFI store, the info-collector parses the indicators in the data part of the DFI store and tells whether it is to verify a load/store, a function return or a library function call. If this instrumentation is for a load/store instruction, the info-collector collects the instruction type and identifier from this DFI store instruction, and the target address from the previous instruction. These pieces of information form a basic packet (data' in Figure 5) to be sent to PIM, which stores the packet to the address of the allocated packet FIFO memory (addr' in Figure 5). If this DFI store is for return address protection (step H in Figure 9), the info-collector takes the identifier and instruction type from this DFI store, and extracts the pointer to the return address from the next DFI store. This information also forms a basic packet. If this DFI store is for a library function (step G), the indicators of this store tell whether the library function loads data, stores data or neither, and whether the data length needs to be encoded in 64 bits. Next, the info-collector continues to collect additional information from subsequent DFI store instructions and generates a library packet to be sent to PIM. If the store instruction is a part of the user program (step J), i.e., not a DFI store, its data is relayed to memory without any change and its target address is stored in a local register for future use.

A memory space is allocated to store DFI packets sent from the main processor. It is used as a packet FIFO to store and process the packets in a first-come-first-served manner.
In order to maintain the FIFO nature using a region of random access memory with low overhead, we develop circuit design techniques to maintain the head and tail pointers in hardware, where the head pointer is updated by PIM (the consumer) and the tail pointer is updated by the main processor (the producer). We omit the detailed description for brevity.
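A software model of this producer/consumer discipline is sketched below. The actual design keeps the pointers in hardware; the struct layout, names, and FIFO depth here are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the packet FIFO kept in a region of random-access
 * memory: the main processor (producer) advances the tail pointer and
 * the PIM processor (consumer) advances the head pointer. */
#define FIFO_SLOTS 8

typedef struct {
    uint32_t slots[FIFO_SLOTS];
    uint32_t head; /* updated by PIM (consumer) */
    uint32_t tail; /* updated by the main processor (producer) */
} packet_fifo;

/* Returns 0 when the FIFO is full (producer must stall). */
static int fifo_push(packet_fifo *f, uint32_t pkt)
{
    uint32_t next = (f->tail + 1) % FIFO_SLOTS;
    if (next == f->head)
        return 0;
    f->slots[f->tail] = pkt;
    f->tail = next;
    return 1;
}

/* Returns 0 when the FIFO is empty. */
static int fifo_pop(packet_fifo *f, uint32_t *pkt)
{
    if (f->head == f->tail)
        return 0;
    *pkt = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    return 1;
}
```

Because only the producer writes the tail and only the consumer writes the head, the two sides never update the same pointer, which is what makes the hardware implementation inexpensive.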
A main reason for the performance overhead of PIM-based DFI is transferring DFI packets to memory. Although each DFI packet has only a few bytes, the number of DFI packets is huge and the overall impact is significant. We propose to compress target addresses and identifiers by exploiting locality. The compression is realized in the info-collector hardware.

Consider the two C program examples in Figure 10. For example A, assume the starting memory address of aa is 0x8000; then the program stores data at 0x8000, 0x8004, 0x8008, and so on. Starting from i=1, each target address increases by 4 compared to the previous one. Thus, we only need to send the increment in 4 bits, which include 1 sign bit, instead of a 32-bit address. Example B in Figure 10 is similar, but has an address pattern of 0x8000, 0x8400, 0x8800, etc. Although the address increment 0x400 is relatively large and needs 11 bits to represent, the lower bits of the increment are all 0s. Thus, instead of using integer compression, we use a format similar to floating point number representation to further reduce the bitwidth of the address increment. This format consists of a sign bit, a significand and an exponent with base 16. To represent 0x400, the sign bit is 0, 3 bits of significand represent 4 and the exponent is 2. Overall, the bitwidth is 6, which is shorter than the 11-bit binary encoding. The full floating point representation contains 8 bits: 1 sign bit, 4 bits of significand and 3 bits of exponent (the power of 16). This representation can cover the range from −15 × 16^7 to 15 × 16^7. The info-collector calculates the difference between two target addresses. If the difference is within this range and the significand is within −
15 to 15, then the difference is represented by an 8-bit floating point number. Note that the difference is compressed only when it can be represented in this format with a base-16 exponent.

/* ======== Example A ========== */
int aa[1024];
for (int i = 0; i < 1024; i++)
    aa[i] = i;

/* ======== Example B ========== */
int bb[1024][1024];
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
        bb[j][i] = i + j;

Figure 10: Examples of address locality.

Identifiers can also be compressed based on their value locality. However, they rarely have patterns like example B, where the increment is at the middle bits of an address. Thus, the difference between two identifiers is represented by a binary number. Overall, a DFI packet can be compressed to 15 bits. Thus, we can pack two compressed packets into one word.
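The delta encoding described above can be prototyped in C. The exact bit layout below (sign in the top bit, then a 4-bit significand, then a 3-bit base-16 exponent) is our assumption for illustration; the text fixes only the field widths and the base-16 exponent.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the 8-bit delta format: 1 sign bit, 4-bit significand,
 * 3-bit exponent interpreted as a power of 16.
 * Returns 1 and fills *out when the delta is representable. */
static int compress_delta(int64_t delta, uint8_t *out)
{
    uint8_t sign = delta < 0;
    uint64_t mag = sign ? (uint64_t)(-delta) : (uint64_t)delta;
    for (uint8_t exp = 0; exp < 8; exp++) {
        uint64_t base = 1ull << (4 * exp); /* base == 16^exp */
        if (mag % base == 0 && mag / base <= 15) {
            *out = (uint8_t)((sign << 7) | ((mag / base) << 3) | exp);
            return 1;
        }
    }
    return 0; /* not representable: send the full value instead */
}

static int64_t decompress_delta(uint8_t enc)
{
    int64_t mag = (int64_t)((enc >> 3) & 0xF) << (4 * (enc & 0x7));
    return (enc & 0x80) ? -mag : mag;
}
```

For the increment 0x400 of example B, the encoder finds significand 4 and exponent 2 (4 × 16^2 = 0x400), matching the worked example in the text.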
We develop packet pruning techniques and a technique for increasing the opportunity of locality for data compression. These optimization techniques help reduce the amount of data sent to PIM and thereby further decrease performance overhead. Some pruning techniques described here are similar to those in [1]. However, the pruning techniques in [1] are offline, while our hardware approach allows pruning at runtime. As more information, such as the target address, is available at runtime, the opportunity for pruning is increased.

Similar to data transfer between memory and cache in cache lines, we pack multiple DFI packets into a block of hundreds of bytes before sending them to PIM. The packets in a block are organized in a transmission buffer, which is implemented as a register file. The optimizations are performed for packets in the buffer before they are sent out. Note that waiting for other packets to form a block increases DFI verification latency but does not increase performance overhead.

Consider two pairs of basic packets in the transmission buffer, (P1, P2) and (Q1, Q2). Each basic packet is for a load instruction, a store instruction, or a function return. Packet P1 (Q1) precedes P2 (Q2). The packets of each pair share the same target address and there is no other DFI packet for a store of the same target address between them. There are five optimization techniques, described using the packet pairs:

A: If P1 and P2 are for store instructions, and there is no other DFI packet for a load with the same target address between them, then packet P1 is redundant and can be pruned out without being sent to PIM.
B: If P1 and P2 are both for store instructions, and their identifiers are the same, then P2 can be pruned out.
C: If P1 and P2 are both for load instructions, and their identifiers are the same, then P2 can be pruned out.
D: P1/P2 are for a store/load of the same target address. After P1 and P2, if packets Q1 and Q2 are for a store/load of another same target address, and Q1/Q2 have the same identifiers as P1/P2, respectively, then Q1 and Q2 are redundant. This makes sure that the same store/load pair appears only once in the transmission buffer.
E: All basic packets in the transmission buffer are sorted according to their target addresses. If two packets have the same target address, their relative order remains unchanged. If there is a library packet, the basic packets before and after this library packet are sorted separately. After sorting, the target address difference between two adjacent packets is examined to find whether data compression can be performed. The sorting helps find opportunities for data compression. DFI verifications for loads/stores of different target addresses are independent of each other, and hence sorting does not affect DFI verification results.

Among the optimizations, A, B and C are similar to those in [1] except that they can be performed both offline and at runtime, while those in [1] are restricted to offline. Techniques D and E are newly developed in this work. After the optimizations are performed, a packet is compressed if possible. All five optimizations can be realized in circuits for runtime use in the main processor. We illustrate the circuit designs by using optimization C as an example.
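As a software reference for one of these techniques, the stable sorting step of optimization E can be modeled as follows. This is an illustrative model, not the register-file circuit; insertion sort is used here simply because it is stable, which preserves the relative order of packets with equal target addresses as the technique requires.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t target_addr;
    uint32_t identifier;
} pkt;

/* Optimization E (model): stable sort of basic packets by target
 * address so that adjacent packets tend to have small address deltas,
 * increasing the chance that the 8-bit delta compression applies. */
static void sort_packets(pkt *p, int n)
{
    for (int i = 1; i < n; i++) {
        pkt key = p[i];
        int j = i - 1;
        /* strict '>' keeps equal-address packets in original order */
        while (j >= 0 && p[j].target_addr > key.target_addr) {
            p[j + 1] = p[j];
            j--;
        }
        p[j + 1] = key;
    }
}
```

In hardware, the buffer holds at most a few hundred packets, so even a quadratic sorting network is affordable; stability is the essential property, since reordering two packets with the same target address could change verification results.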
Figure 11: Circuit for implementing optimization C.

The schematic of the combinational circuit implementation of optimization C is shown in Figure 11. Assume there are n basic packets in the transmission buffer, Pi represents the i-th packet, and Ri indicates whether the i-th packet is redundant
or not. Each square in Figure 11 is a Processing Element (PE) that computes whether a packet is redundant. In each column of Figure 11, a packet Pi is compared with all later packets Pj, j > i, attempting to find a redundant Pj to be pruned. If there are multiple packets that are redundant with respect to Pi, only the topmost one (with the smallest |j − i|) is asserted for pruning and the others can be pruned later in other columns to the right. The R signals in a row are ORed such that a packet in a row can potentially be pruned by any preceding packet organized in columns. For example, the PE labeled PaPb compares packets Pa and Pb. A necessary but insufficient condition for asserting R = TRUE is that Pa and Pb are both for loads with the same target address and identifier. The final result of R also depends on Din, which is a disable signal for the pruning. The value of R = TRUE when
Din == FALSE. Din is asserted in two scenarios: (1) there is a store at the same target address between the two load instructions of Pa and Pb, and thus the conditions for optimization C are not completely satisfied; (2) a redundant packet has already been found and no further pruning is needed in a column. For scenario (1), Dout = TRUE when Pa is for a load while Pb is for a store to the same target address. For scenario (2), Dout = TRUE when R = TRUE for the same PE.
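A software model of the pruning condition evaluated by this PE grid may help. The function below checks whether a later load packet j is redundant with respect to an earlier load packet i, including the intervening-store disable condition that Din propagates; the structure and names are illustrative, not the circuit interface.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { PKT_LOAD, PKT_STORE } pkt_type;

typedef struct {
    pkt_type type;
    uint32_t target_addr;
    uint32_t identifier;
} basic_packet;

/* Optimization C (model): packet j (j > i) is redundant with respect
 * to packet i when both are loads with the same target address and
 * identifier, and no store to that address sits between them (this is
 * the condition the Din disable signal enforces in the circuit). */
static int redundant_by_opt_c(const basic_packet *buf, int i, int j)
{
    if (buf[i].type != PKT_LOAD || buf[j].type != PKT_LOAD)
        return 0;
    if (buf[i].target_addr != buf[j].target_addr)
        return 0;
    if (buf[i].identifier != buf[j].identifier)
        return 0;
    for (int k = i + 1; k < j; k++)
        if (buf[k].type == PKT_STORE &&
            buf[k].target_addr == buf[i].target_addr)
            return 0;
    return 1;
}
```

The hardware evaluates all pairs in parallel rather than with this nested loop, but the asserted condition is the same: two identical load verifications with no store to the address in between would produce the same result, so the later one can be dropped.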
The DFI verification program is written in the C language, and its binary code is executed on the PIM processor. The RDT memory space is allocated by the instrumentation code. As in [1], all program data are organized in words, each of which requires one RDT entry. If the data memory for the user program has N bytes, there are N/4 RDT entries; with 2-byte identifiers, the RDT takes N/4 × 2 = N/2 bytes.

• Basic packet for store or load: The verification program extracts the instruction type, identifier α and target address β from the packet. If the instruction type is store, identifier α is stored at RDT entry β >> 2. If the instruction type is load, the verification program reads identifier γ from RDT entry β >> 2. Then, the program checks whether γ is in the RDS of α. If not, a DFI violation is reported. Finally, identifier α and target address β are saved in registers for future decompression of compressed packets.
• Compressed packet for store or load: The process is similar to handling basic packets except that decompression is performed.
•
Library packet:
The verification program extracts target address α if there is a load in the library function call, and target address β if there is a store. Then, the data length γ (in words) of the load and/or store and the identifier δ of this function are also extracted. If there is an address α, the verification program loads the identifiers ε0, ε1, ..., εγ−1 from RDT entries α >> 2, (α >> 2) + 1, ..., (α >> 2) + γ − 1, and checks whether each εi is in the RDS of identifier δ. If there is an address β, the program stores identifier δ to all the entries from β >> 2 to (β >> 2) + γ − 1.

We evaluate our approach and the proposed techniques using architecture simulations through SMCsim [27, 21], which is an extension to the gem5 simulator [28] for accommodating PIM. The main processor is an ARM Cortex-A15 with 2GHz frequency, 32KB L1 instruction cache, 64KB L1 data cache, 2MB L2 cache, and 512MB memory. A single PIM processor is used and operates at 2GHz frequency [29, 30]. 64MB memory is allocated for the RDT, which is sufficient for the testcases in our experiment. Other details of the PIM can be found in [21, 27]. Please note that the PIM configuration has little impact on the user program execution.
Our approach verifies the same DFI as defined in [1] and thus achieves similar security as [1], except that our approach is asynchronous monitoring [11, 31, 7], where detection of a DFI violation can trigger a system interrupt for further security measures, rather than synchronous enforcement like [1]. This difference is a tradeoff between security and service availability. Synchronization inevitably entails extra performance overhead, as DFI verification blocks user program execution.
Hardware-assisted Data-Flow Isolation (HDFI) [13] verifies partial DFI at a very coarse granularity. It uses a 1-bit tag to differentiate a sensitive data region and a non-sensitive data region, and only ensures that data in one region are not lastly written by an instruction for the other region. As such, it cannot detect attacks that mingle data within the same region. For the example of Figure 1, we exhaustively tested different tag schemes of HDFI, which are listed in the left three columns of Table 2. For each tag scheme, there is some overflow that cannot be detected by HDFI, as shown in column 4, while our approach detects all of them.

Table 2: Tag schemes of HDFI for the example of Figure 1 and the overflows that each scheme fails to detect.

TMDFI [16] employs an 8-bit tag and thus can differentiate data among 256 regions. Although this is a significant improvement over HDFI, its verification resolution is still far from enough in many applications. Figure 12 shows the numbers of identifiers needed for several benchmarks, which are hundreds or even thousands. Hence, the gap between the 256 regions of TMDFI [16] and the actual needs is large. By contrast, our approach can accommodate all identifiers in these benchmarks and achieve complete DFI with an overhead similar to TMDFI.
Figure 12: The number of identifiers of each benchmark.
RIPE [32, 33] is a well-known benchmark containing various control-flow attacks, and all control-flow attacks can be identified by DFI. RIPE was originally designed for the x86 architecture, and modification is required for execution on an ARM processor. We implemented 156 attacks of the benchmark for our system, including Return-Oriented Programming (ROP) [3] attacks and Jump-Oriented Programming (JOP) [2] attacks. In addition, we also prepared a RIPE program without any attack. Our DFI system successfully identifies all 156 attacks and raises no false alarm for the case without attacks.
Heartbleed (CVE-2014-0160) [4] is a vulnerability in the OpenSSL cryptography library. When a message, including a payload and the claimed length of the payload, is sent to a server, the server echoes back the message with the claimed length. However, it is not checked whether the actual payload length matches the claimed one. As such, an attacker may send a message whose actual payload length is smaller than the claimed one. Then, the server sends back not only the original payload but also some additional data, which might be sensitive private data, to fulfill the claimed length. Consequently, sensitive data is stolen by the attacker. We use the source code in [34] to simulate such an attack. This attack is successfully detected by our DFI verification, as the data to be loaded for sending back cannot be most recently written by an instruction not from the sender. An attack-free transaction, where the actual payload length conforms to the claimed one, is also tested, and no false alarm is raised by our approach.
Nullhttpd is an HTTP server that has a heap overflow vulnerability (CVE-2002-1496) [5]. If the server receives a POST request with a negative content length L, it should not process the request. However, the server continues to process the request and allocates a buffer whose size is computed from L, resulting in a heap overflow. When a load instruction attempts to access the data written by the overflow, it is found that the data is not written by any instruction in the RDS of the load instruction. An experiment is also conducted to confirm that our approach does not produce a false alarm in this context.

Performance overheads of the following methods are evaluated through simulations on the SPEC CPU 2006 benchmark [35].

• Software. This is the original software DFI by [1].
• HBM. This is similar to [1] except that High Bandwidth Memory [36, 37] is employed.
• CMP. This is a parallel approach, where DFI verification is performed in another core of a CMP, with two versions: the software version CMP-S (multithreading) and the hardware version CMP-H using our info-collector circuit.
• PIM. This is the proposed hardware-assisted parallel approach using PIM.

Our proposed approach has two variants: CMP-H and PIM. To ensure a fair comparison, each application was terminated at the same point in the simulations. The results are summarized in Table 3. As the static analysis tool failed in some applications, results are only shown for the successful runs.

Table 3: Performance overhead of DFI. † Computation time of optimizations and compression is neglected. ‡ Computation time of optimizations and compression is considered. § No DFI packet is sent to the memory.
Col.  Method                                   Transmit buf.  Runtime opt.  Avg. overhead
1     Software [1]                             -              -             161.4%
2     HBM                                      -              -             162.0%
3     CMP-S                                    -              -             426.9%
4     CMP-H †                                  2KB            All           37.0%
5     PIM (no compression or optimization) †   -              -             232.9%
6     PIM §                                    -              -             31.4%
7     PIM †                                    2KB            E             36.4%
8     PIM †                                    2KB            A,B,C,D       38.2%
9     PIM †                                    512B           All           37.2%
10    PIM ‡                                    512B           C,E           38.8%
11    PIM †                                    2KB            All           35.0%
12    PIM †                                    2KB            C,E           35.4%
13    PIM ‡                                    2KB            C,E           36.4%
On average, the performance overhead of software DFI [1] is 161%, as shown in column 1. Column 2 shows the result of software DFI using HBM, where the memory bandwidth is abundant and memory access latency is fairly low. One can see that using HBM brings almost no overhead reduction. This result confirms the analysis in Section 4. The results of the parallel approach using another CMP core are summarized in columns 3 and 4, for the software and our hardware version, respectively. Without dedicated hardware, the parallel approach actually increases the overhead due to the expensive communication in software. CMP-H reduces the overhead to 37%.

The PIM results are listed in columns 5-13, where "All" means all of the 5 optimization techniques are applied and "C, E" corresponds to the results where only the two most effective optimizations are employed. In column 5, the overhead is 233% although the offline optimization has been applied. This tells the importance of our hardware-based optimization and compression. In column 6, we dropped all DFI packets without sending them out, by simulating only instruction fetching but not execution of the instrumentation. This is not realistic for DFI, but serves to obtain a lower bound for the overhead, which is about 31%. Column 7 shows that the joint effect of data compression and optimization E is dramatic. Please note that optimization E is designed for increasing the chance of data compression. The setup for column 11 is very similar to column 4, except that one is by PIM and the other is by CMP. Examining the results of the two columns shows that their overhead reductions are similar. PIM is a little better, as it causes fewer cache contentions than CMP. Column 13 takes the two most important optimizations and considers the compression/optimization delay, showing an overhead of about 36%.
The effect of transmission buffer size on reducing performance overhead is plotted in Figure 13. It shows that an increase of buffer size from 0 quickly brings down the overhead. However, the reduction soon diminishes as the buffer size reaches 2K bytes, which is why we limit the buffer size to no more than 2K in our experiments.
Figure 13: Overhead vs. buffer size.
Figure 14: Effects of optimization techniques.
Figure 15: Detection latency vs. overhead for two benchmarks.

The effects of the five optimization techniques described in Section 7.4 on data reduction are evaluated separately, and the results are depicted in Figure 14. It shows that optimizations C and E always lead to more data reduction than the other techniques. For some benchmarks, optimization C can reduce data by over 80%, while optimization E reduces data by more than 60%. Optimization E is designed to facilitate compression, and one can observe that its average data reduction is 46%, which is also the average compression ratio. Ideally, the latency for detecting DFI violations needs to be minimized so that attackers have less time to complete damaging operations. In Figure 15, we show that the latency can be managed by a tradeoff with the overhead via varying the buffer size. The results also indicate that the PIM approach performs better for low overhead, while the CMP-H approach is slightly better for obtaining low latency.
The info-collector circuit is implemented by synthesizing Verilog using Synopsys Design Compiler and the ASAP 7nm cell library [38]. The info-collector with basic operation and compression costs only 2908 gates and less than 30ps circuit delay. Hence, its area and delay are negligible. We also implemented the circuits for optimizations C and E. The results with these implementations are in columns 10 and 13 of Table 3, where the gate counts of the info-collector with different buffer sizes are listed. The circuit overhead is dominated by the optimization part. The gate count of 754K is not trivial, but it is still a small fraction of a modern microprocessor, which often has hundreds of millions of gates. Moreover, our DFI can isolate data among 64K regions, and the hardware cost per region is no more than 12 gates. The works of CHERI [17] and HDFI [13] did not describe their hardware details. However, they can isolate only 2 regions, and their hardware cost is almost certainly more than 24 gates. Therefore, the hardware cost per region of our approach is less than that of CHERI and HDFI.
10 Conclusions and Future Research
Data-Flow Integrity (DFI) is potentially a very powerful security measure that can detect a large number of software attacks. However, it requires checking a large volume of data and thus intrinsically entails huge performance overhead. We propose a hardware-assisted parallel approach to address this challenge. This approach can reduce the overhead by more than 4× compared to the original software DFI while verifying complete DFI. In future research, we will study how to further reduce the performance overhead and detection latency.

References

[1] Miguel Castro, Manuel Costa, and Tim Harris. Securing Software by Enforcing Data-Flow Integrity.
Symposium on Operating Systems Design and Implementation, pages 147–160, 2006.
[2] Tyler Bletsch, Xuxian Jiang, Vince W. Freeh, and Zhenkai Liang. Jump-oriented Programming: A New Class of Code-reuse Attack.
ACM Symposium on Information, Computer and Communications Security, pages 30–40, 2011.
[3] Hovav Shacham. The Geometry of Innocent Flesh on the Bone: Return-into-libc Without Function Calls (on the x86). ACM Conference on Computer and Communications Security, pages 552–561, 2007.
[4] The Heartbleed Bug. http://heartbleed.com/.
[5] Null HTTPd Remote Heap Overflow Vulnerability.
[6] Martín Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti. Control-flow Integrity. ACM Conference on Computer and Communications Security, pages 340–353, 2005.
[7] Yongje Lee, Jinyong Lee, Ingoo Heo, Dongil Hwang, and Yunheung Paek. Using CoreSight PTM to Integrate CRA Monitoring IPs in an ARM-Based SoC. ACM Transactions on Design Automation of Electronic Systems, 22(3):52:1–52:25, 2017.
[8] Zonglin Guo, Ram Bhakta, and Ian G. Harris. Control-flow Checking for Intrusion Detection via a Real-time Debug Interface. International Conference on Smart Computing Workshops, pages 87–92, 2014.
[9] Xinyang Ge, Weidong Cui, and Trent Jaeger. GRIFFIN: Guarding Control Flows Using Intel Processor Trace. ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 585–598, 2017.
[10] Yutao Liu, Peitao Shi, Xinran Wang, Haibo Chen, Binyu Zang, and Haibing Guan. Transparent and Efficient CFI Enforcement with Intel Processor Trace. IEEE International Symposium on High Performance Computer Architecture, pages 529–540, 2017.
[11] Yubin Xia, Yutao Liu, Haibo Chen, and Binyu Zang. CFIMon: detecting violation of control flow integrity using performance counters. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks, pages 1–12, 2012.
[12] Lucas Davi, Alexandra Dmitrienko, Manuel Egele, Thomas Fischer, Thorsten Holz, Ralf Hund, Stefan Nürnberger, and Ahmad-Reza Sadeghi. MoCFI: A Framework to Mitigate Control-flow Attacks on Smartphones. Symposium on Network and Distributed System Security, 2012.
[13] Chengyu Song, Hyungon Moon, Monjur Alam, Insu Yun, Byoungyoung Lee, Taesoo Kim, Wenke Lee, and Yunheung Paek. HDFI: Hardware-Assisted Data-Flow Isolation. IEEE Symposium on Security and Privacy, pages 1–17, 2016.
[14] Chengyu Song, Byoungyoung Lee, Kangjie Lu, William R. Harris, Taesoo Kim, and Wenke Lee. Enforcing Kernel Security Invariants with Data Flow Integrity. Network and Distributed System Security Symposium, 2016.
[15] Periklis Akritidis, Cristian Cadar, Costin Raiciu, Manuel Costa, and Miguel Castro. Preventing Memory Error Exploits with WIT. IEEE Symposium on Security and Privacy, pages 263–277, 2008.
[16] Tong Liu, Gang Shi, Liwei Chen, Fei Zhang, Yaxuan Yang, and Jihu Zhang. TMDFI: Tagged Memory Assisted for Fine-Grained Data-Flow Integrity Towards Embedded Systems Against Software Exploitation. IEEE International Conference On Trust, Security And Privacy In Computing And Communications / IEEE International Conference On Big Data Science And Engineering, pages 545–550, 2018.
[17] Robert N. M. Watson, Jonathan Woodruff, Peter G. Neumann, Simon W. Moore, Jonathan Anderson, David Chisnall, Nirav Dave, Brooks Davis, Khilan Gudka, Ben Laurie, Steven J. Murdoch, Robert Norton, Michael Roe, Stacey Son, and Munraj Vadera. CHERI: A Hybrid Capability-System Architecture for Scalable Software Compartmentalization. IEEE Symposium on Security and Privacy, pages 20–37, 2015.
[18] Hong Hu, Shweta Shinde, Sendroiu Adrian, Zheng Leong Chua, Prateek Saxena, and Zhenkai Liang. Data-Oriented Programming: On the Expressiveness of Non-control Data Attacks. IEEE Symposium on Security and Privacy, pages 969–986, 2016.
[19] Jedidiah R. Crandall and Frederic T. Chong. Minos: Control Data Attack Prevention Orthogonal to Memory Model. IEEE/ACM International Symposium on Microarchitecture, pages 221–232, 2004.
[20] Ken Biba. Integrity Considerations for Secure Computer Systems. Defense Technical Information Center, page 68, 1977.
[21] Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. Design and Evaluation of a Processing-in-Memory Architecture for the Smart Memory Cube. International Conference on Architecture of Computing Systems, pages 19–31, 2016.
[22] Youngmin Shin, Hoi-Jin Lee, Ken Shin, Prashant Kenkare, Rajesh Kashyap, DongJoo Seo, Brian Millar, Yohan Kwon, Ravi Iyengar, Min-Su Kim, Ahsan Chowdhury, Sung-Il Bae, Inpyo Hong, Wookyeong Jeong, Aaron Lindner, Uk-Rae Cho, Keith Hawkins, Jae-Cheol Son, and Sung-Ho Park. 28nm high-K metal gate heterogeneous quad-core CPUs for high performance and energy-efficient mobile application processor. In
Proceedings of the IEEE International SoC Design Conference, 2013.
[23] Mario Drumond, Alexandros Daglis, Nooshin Mirzadeh, Dmitrii Ustiugov, Javier Picorel, Babak Falsafi, Boris Grot, and Dionisios Pnevmatikatos. The Mondrian Data Engine. In Proceedings of the ACM International Symposium on Computer Architecture, pages 639–651, 2017.
[24] LLVM. https://llvm.org/.
[25] Yulei Sui and Jingling Xue. SVF: Interprocedural Static Value-Flow Analysis in LLVM. In Proceedings of the 25th International Conference on Compiler Construction, CC 2016, pages 265–266, New York, NY, USA, 2016. ACM.
[26] SVF for Reaching Definition Analysis. https://github.tamu.edu/jyhuang/SVF.
[27] SMCsim. https://iis-git.ee.ethz.ch/erfan.azarkhish/SMCSim.
[28] The gem5 Simulator.
[29] Xu Yang, Yumin Hou, and Hu He. A Processing-in-Memory Architecture Programming Paradigm for Wireless Internet-of-Things Applications. Sensors, 19(1):140, 2019.
[30] Seth H. Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. NDC: Analyzing the Impact of 3D-stacked Memory+Logic Devices on MapReduce Workloads. IEEE International Symposium on Performance Analysis of Systems and Software, pages 190–200, 2014.
[31] Sanjeev Das, Yang Liu, Wei Zhang, and Mahintham Chandramohan. Semantics-based online malware detection towards efficient real-time protection against malware. IEEE Transactions on Information Forensics and Security, 11(2):289–302, February 2016.
[32] RIPE. https://github.com/johnwilander/RIPE.
[33] John Wilander, Nick Nikiforakis, Yves Younan, Mariam Kamkar, and Wouter Joosen. RIPE: Runtime Intrusion Prevention Evaluator. Computer Security Applications Conference, pages 41–50, 2011.
[34] The Source Code for Triggering Heartbleed Bug. https://github.com/mykter/afl-training/tree/master/challenges/heartbleed.
[35] SPEC CPU 2006 Benchmark.
[36] Dong Uk Lee, Kyung Whan Kim, Kwan Weon Kim, Kang Seol Lee, Sang Jin Byeon, Jae Hwan Kim, Jin Hee Cho, Jaejin Lee, and Jun Hyun Chun. A 1.2 V 8 Gb 8-Channel 128 GB/s High-Bandwidth Memory (HBM) Stacked DRAM With Effective I/O Test Circuits. IEEE Journal of Solid-State Circuits, 50(1):191–203, 2015.
[37] Hongshin Jun, Jinhee Cho, Kangseol Lee, Ho-Young Son, Kwiwook Kim, Hanho Jin, and Keith Kim. HBM (High Bandwidth Memory) DRAM Technology and Architecture. IEEE International Memory Workshop, pages 1–4, 2017.
[38] ASAP 7nm Predictive PDK. http://asap.asu.edu/asap/