Toward Taming the Overhead Monster for Data-Flow Integrity
arXiv preprint [cs.AR], February 22, 2021

Lang Feng*§, Jiayi Huang†, Jeff Huang¶, Jiang Hu§*
*School of Electronic Science and Engineering, Nanjing University
†Department of Electrical and Computer Engineering, University of California, Santa Barbara
§Department of Electrical & Computer Engineering, Texas A&M University
¶Department of Computer Science & Engineering, Texas A&M University
fl[email protected]; [email protected]; [email protected]; [email protected]

Abstract
Data-Flow Integrity (DFI) is a well-known approach to effectively detecting a wide range of software attacks. However, its real-world application has been quite limited so far because of the prohibitive performance overhead it incurs. Moreover, the overhead is enormously difficult to overcome without substantially lowering the DFI criterion. In this work, an analysis is performed to understand the main factors contributing to the overhead. Accordingly, a hardware-assisted parallel approach is proposed to tackle the overhead challenge. Simulations on the SPEC CPU 2006 benchmarks show that the proposed approach can completely verify the DFI defined in the original seminal work while reducing performance overhead by 4× on average.

Keywords: Data-Flow Integrity · Architecture · Security
1 Introduction

Data-Flow Integrity (DFI) is a regulation to ensure that data to be accessed are written by legitimate instructions [1]. As such, DFI verification can identify unwanted data modifications that are inconsistent with the programmer's intention. It can detect a wide variety of security attacks, including control data attacks such as Jump-Oriented Programming (JOP) [2] and Return-Oriented Programming (ROP) [3], and non-control data attacks such as Heartbleed [4] and the heap overflow attack on Nullhttpd [5]. As a large number of software attacks rely on data modifications, DFI is a single principle that is effective for many different attack scenarios, including future ones. In fact, its defense scope is a much larger superset of that of Control-Flow Integrity (CFI) [6], another well-known software security approach.

The concept of DFI was introduced in 2006 by the seminal work [1], and has received a lot of attention thereafter due to its potential as a powerful security measure. However, a complete DFI enforcement as in [1] incurs more than 100% performance overhead even though several optimization techniques have been applied. Indeed, the huge overhead seems inevitable, as every data access needs to be examined. Due to this intrinsic difficulty, there have been few follow-up works on DFI despite its widely recognized importance. This is in sharp contrast to CFI [6], which has many more published studies [7, 8, 9, 10, 11, 12].

The few later works on DFI [13, 14, 15, 16] reduce the overhead by enforcing partial DFI, whose criteria are substantially lower than the original DFI definition [1]. Hardware-Assisted Data-Flow Isolation (HDFI) [13] is one example. It partitions data into two regions, and only requires that data to be read and written be consistent in the same region. In other words, it reports a violation only when data intended for one region is actually written by an instruction for the other region.
Although its overhead is very small, the verification granularity is very coarse and may miss attacks that mingle different data within the same region. Consider the example in Figure 1, where input data are first written into u0 and u1 in lines 10 and 11. Later, the data are copied to buffers in lines 13-15. If there is a buffer overflow when executing line 10, i.e., the input data size exceeds 256, then the offset u0->off is modified unintentionally. Then, line 13 may copy user0's data to other users' buffers through the modified u0->off. Meanwhile, user1 can write to user2's buffer in line 14 in the same way. As HDFI partitions data into only two regions, one of the user pairs - (user0, user1), (user0, user2) or (user1, user2) - must share the same region. Consequently, the former user in a pair can attack the latter without being detected by HDFI. By contrast, a complete DFI [1] can isolate data among tens of thousands of regions, i.e., a resolution orders of magnitude higher than HDFI's. Therefore, the security price that HDFI pays for its overhead reduction can be very high.

1  struct vuln{
2    char data[256];
3    int off = 0;
4    int size = 0;
5  } *u0, *u1, *u2;
6  /* ============== */
7  char user0_buffer[256];
8  char user1_buffer[256];
9  char user2_buffer[256];
10 read_user_input(u0, user0_input);
11 read_user_input(u1, user1_input);
12 ...
13 memcpy(user0_buffer+u0->off, u0->data, u0->size);
14 memcpy(user1_buffer+u1->off, u1->data, u1->size);
15 memcpy(user2_buffer+u2->off, u2->data, u2->size);

Figure 1: An example of a vulnerability that HDFI cannot detect.

Verifying the complete DFI [1] with practically acceptable overhead is a huge challenge. Different from most existing overhead reduction techniques [13, 14, 15, 16], which rely on lowering the DFI criterion, we pursue a new approach that exploits additional hardware while the original DFI [1] is still completely verified. As hardware cost becomes increasingly affordable along with the progress of semiconductor technology, reducing performance overhead at the expense of extra hardware is a promising direction.

We first conduct an extensive performance analysis of DFI and, surprisingly, discover that the frequent DFI data access does not lead to frequent memory access, so memory access is not a bottleneck; instead, the DFI kernel computations usually contribute the most to the overhead.
We propose a parallel approach, where the kernel computations are performed on another processor core. However, a straightforward software-based parallel implementation still experiences huge overhead resulting from runtime information collection and communication with the other processor core. Therefore, we develop a new hardware technique to further trim down the overhead. This hardware-assisted parallel approach also includes new software instrumentation techniques, lossless data compression and runtime optimization techniques. For ease of deployment, we intend to minimize the dependence on computing infrastructure changes. Except for the necessary circuits and software instrumentation, our approach does not rely on new instructions or OS/compiler modifications.

Overall, the proposed approach reduces performance overhead from the 161% of [1] to an average of 36% on the same SPEC CPU 2006 benchmarks. As it is a complete DFI verification, it can detect a wide range of security attacks and cover cases that cannot be handled by the previous low-overhead methods [13, 14, 15, 16]. Our approach provides a solution with a security-overhead tradeoff that complements existing methods [13, 14, 15, 16]. A brief comparison with existing methods is summarized in Table 1. The contributions of this work are as follows.

• An overhead breakdown analysis is performed to understand the main performance bottlenecks in software DFI.
• This is the first hardware approach to complete DFI verification, to the best of our knowledge.
• Two variants of the proposed approach are investigated, one for Processing-In-Memory (PIM) and the other for Chip Multiprocessor (CMP).
• The tradeoff between DFI violation detection latency and performance overhead is studied.
• Our approach achieves about 4× overhead reduction, which is a major progress for complete DFI since 2006.
2 Background

Data-flow integrity requires that data loaded from memory can only have been stored by legitimate instructions that are consistent with the programmer's original intention [1]. Every instruction in a program is assigned a numerical identifier through automatic code instrumentation. The reaching definition of an instruction A is the latest instruction B that stores the data loaded by A, and is represented by the identifier of B. Each instruction that can load data from memory has its own Reaching Definition Set (RDS), which consists of all the allowed reaching definitions of this instruction. A static software analysis can be performed on a program to obtain the RDSs for all relevant instructions.
Table 1: Comparison between our work and others.
Method       Perf. Overhead  DFI Completeness  Approach  New Instr.  OS Change  Compiler Change  Instrumentation
SW DFI [1]   161%            Complete          SW        No          No         No               Yes
KENALI [14]  7-15%           Partial           SW        No          Yes        No               Yes
WIT [15]     7%              Partial           SW        No          No         No               Yes
CHERI [17]   5-20%           Partial           HW        Yes         Yes        Yes              No
TMDFI [16]   39%             Partial           HW        Yes         No         No               No
HDFI [13]    <2%             Partial           HW        Yes         Yes        Yes              No
Our work     39%             Complete          HW        No          No         No               Yes
In the example of Figure 2, "store x y" means storing variable x at address y, "load x y" loads the data at address y into variable x, and "jump label" is an unconditional branch to the location marked by label. If the identifier of each instruction is the same as its line number, the RDS of line 7's instruction is {1}. DFI requires that all instructions that load data from memory be consistent with their RDSs, i.e., when executing an instruction A that loads data from memory, the data should indeed have been most recently stored by one of the instructions in the RDS of A. Hence, the identifier of the latest instruction that stores a datum needs to be tracked for that datum. Such identifiers for all data form a Reaching Definition Table (RDT).

1 store x1 addr1
2 store x1 addr2
3 jump label
4 store x2 addr1
5 load x3 addr1
6 label:
7 load x4 addr1

Figure 2: A code example for illustrating DFI.

DFI is a superset of Control-Flow Integrity (CFI) [6], which only regulates instruction flow transitions toward target addresses conforming to the original design intention. Attackers have to modify control data, such as the target address of an indirect branch, to change a control flow. By protecting all data, DFI can also prevent all control-flow attacks. Additionally, DFI can protect non-control data that cannot be covered by CFI.

A general threat model for DFI is that the computer hardware and OS are secure, while attackers can manage to view the binary code and opportunistically modify some program data, e.g., through buffer overflow.
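The RDT bookkeeping described above can be sketched in C. This is a hypothetical miniature model we provide for illustration (the names rdt, dfi_on_store and dfi_on_load are ours, not from [1]): every store records its identifier in the RDT entry of its target address, and every load checks that the recorded identifier is in its RDS.

```c
#include <stddef.h>
#include <stdint.h>

/* Miniature model of DFI bookkeeping. rdt[addr] tracks the identifier
 * of the latest store to addr; a load passes the DFI check only if
 * that identifier is in the load's RDS. */
#define MEM_SLOTS 16

static uint16_t rdt[MEM_SLOTS]; /* Reaching Definition Table */

/* Executed for every store: record the storing instruction's identifier. */
void dfi_on_store(size_t addr, uint16_t id) { rdt[addr] = id; }

/* Executed for every load: verify the latest writer is in the load's RDS.
 * Returns 1 if the check passes, 0 on a DFI violation. */
int dfi_on_load(size_t addr, const uint16_t *rds, size_t rds_len) {
    for (size_t i = 0; i < rds_len; i++)
        if (rdt[addr] == rds[i])
            return 1;
    return 0;
}
```

Mirroring Figure 2, if instruction 1 stores to addr1 and the load at line 7 has RDS {1}, the check passes; if an unexpected instruction (say, identifier 4) had written addr1 last, the check would flag a violation.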
3 Related Work

The concept of Data-Flow Integrity (DFI) was proposed in the seminal work [1] in 2006, which also provides a software implementation and optimization techniques for overhead reduction. Although the DFI verification procedure is simple, its performance overhead is intrinsically huge, as the verification needs to be conducted for a tremendous amount of data.

The few later works [14, 15, 13, 17, 16] achieved much lower overhead by focusing on partial DFI. The work of [14] is restricted to only certain selected data of kernel software. One of its main contributions is its techniques for selecting the data to be protected, and its performance overhead is only 7-15%. While [1] protects both load and store instructions, the scope of Write Integrity Testing (WIT) [15] is restricted to store. It requires that each store instruction can only write to certain data objects, and each indirect call can only call certain functions. Although its overhead is at most 25%, it does not cover load instructions. Thus, an unsafe load instruction may read more bytes than the programmer intended, and consequently information leaks may occur; e.g., Heartbleed [4] is an attack that WIT would fail to detect.

Data isolation is another approach to protecting data with relatively low overhead. A hardware solution for data-flow isolation, called HDFI, is proposed in [13]. It designates two data regions, a sensitive one and a non-sensitive one. A 1-bit tag tells which region a datum belongs to. The instruction set is modified such that the tags can be read and set. Moreover, the processor hardware, operating system and compiler also need changes. If data belongs to one region, it cannot be written by an instruction for the other region. Although the isolation helps security, it cannot handle the case where load/store instructions for different data of the same region are mingled. Thus, its low overhead of <2% comes at the price of a very coarse-grained security resolution. To a certain degree, the original DFI [1]
can be regarded as data isolation among individual instructions: with 16 bits used for each instruction identifier, it is equivalent to isolation among up to 2^16 regions, compared with the only 2 regions of HDFI [13]. TMDFI [16] is a hardware DFI implementation whose tags are limited to 8 bits, i.e., at most 2^8 = 256 regions, which are much coarser grained than the resolution of our approach; a typical program, such as each benchmark in SPEC CPU 2006, needs far more identifiers than that.

4 Overhead Analysis

We analyze the source of the performance overhead of software DFI [1]. We call the program to be checked by DFI verification the user program. For a user program, whenever a store or load is executed, the RDT needs to be accessed and consequently data transfer with memory may be greatly increased. A memory access typically takes hundreds of clock cycles and can cause huge overhead. Thus, we first measured the cache hit rate to understand DFI's impact on memory accesses.
Figure 3: Cache hit rates of user programs with and without software DFI.

The cache hit rates of user programs without DFI verification and with software DFI are shown in Figure 3. One can see that the cache hit rates are usually greater than 95%, regardless of whether DFI verification is applied. This indicates that memory access is probably not a bottleneck.

Figure 4: Overhead breakdown of software DFI.

We further investigated the overhead breakdown of software DFI; the results are shown in Figure 4, where "RDT Search" represents the instrumentation execution that finds the RDT entry of the corresponding user load or store, "Bounds Check" means the check preventing illegal modification of the RDT, "Library Loop" is the additional loop in the instrumented wrapper of each library function, and "DFI Check" indicates the comparisons verifying that the identifier found by the RDT search is in the RDS of the corresponding user load.

According to Figure 4, most of the overhead comes from the DFI check. It also shows that RDT access itself (excluding the RDT search part) contributes little to the overhead. This confirms that the bottleneck is not memory access but the DFI check instructions. Specifically, many comparison and branch instructions are executed for each DFI check, which compares the identifier found in the RDT with each identifier in the RDS of the corresponding user load. Although this check computation is fairly simple, it is performed for a huge volume of data.
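To see why the DFI check dominates, consider an instrumented version of the check loop that counts the comparisons it executes. This sketch is ours (the function name and counting are for illustration only): each user load triggers a linear scan of its RDS, and with millions of loads these comparisons and branches add up.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the comparisons a software DFI check executes for one load.
 * rdt_id is the identifier found in the RDT for the load's address;
 * *passed is set to 1 if rdt_id is found in the RDS. */
size_t dfi_check_cost(uint16_t rdt_id, const uint16_t *rds, size_t rds_len,
                      int *passed) {
    size_t comparisons = 0;
    *passed = 0;
    for (size_t i = 0; i < rds_len; i++) {
        comparisons++;              /* one compare + branch per RDS entry */
        if (rds[i] == rdt_id) { *passed = 1; break; }
    }
    return comparisons;
}
```

A failing check (or a match at the end of the RDS) scans the whole set, so the per-check cost grows with RDS size even though each comparison is trivial.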
5 Proposed Approach

Our approach is to delegate DFI verification to another computing resource external to the main processor where the user program is executed. The delegated resource can be a processor core in a Chip Multiprocessor (CMP) or a Processing-In-Memory (PIM) processor [21]. The two options are similar in terms of overhead reduction. A non-essential yet non-trivial difference is that the PIM approach entails less data movement, as the RDSs and RDT reside in memory; thus, the PIM approach is more power-efficient. Moreover, the DFI kernel computation is simple: a PIM processor suffices, whereas a CMP core is usually an overkill. An approximate estimate [22, 23] indicates that the circuit area of a PIM processor is often <10% of that of the main processor under the same technology. We will use PIM as the platform to describe our approach, while the same idea is applicable to the CMP core option.

We first summarize the information required for DFI verification by PIM, and its location, as follows.

1. RDS (Reaching Definition Set) for all user load instructions in the program. This information does not change throughout program execution and can be loaded into PIM once at the beginning.
2. RDT (Reaching Definition Table). This information changes dynamically during program execution. It is maintained by the PIM processor, and therefore is local to DFI verification at PIM.
3. Target instruction information. A target instruction is an instruction in a user program to be verified for DFI. Mainly two types of instructions are involved: load instructions for which DFI verification is performed, and store instructions that affect the RDT. This information changes at runtime and needs to be transferred from the main processor to memory. It consists of the following components:
   • Instruction identifier.
   • Instruction type: either load or store.
   • Target address of the load or store.

The PIM processor undertakes most of the DFI verification components analyzed in Section 4, and can quickly access the RDSs and RDT in its vicinity. As such, what remains for the main processor is to collect target instruction information and send it to PIM. Although the information collection and transmission can be implemented in software in the same way as multithreading, our study shows that such a software approach still experiences huge or even worse performance overhead. Thus, we propose a hardware approach to minimize extra software execution at the main processor. Moreover, the hardware approach facilitates runtime application of the optimizations described in Section 7.4.

The overall flow of the proposed DFI verification is depicted in Figure 5, where green numbers indicate step IDs:

1. Static analysis is performed on a user program.
2. RDSs are obtained from the static analysis.
3. The code is instrumented automatically. The main instrumentation is to add store instructions, called DFI stores (in red font in Figure 5), after each target instruction so as to help collect its information.
4. The DFI checking program and RDSs are loaded onto the PIM processor before the user program execution starts on the main processor.
5. During program execution, dedicated hardware, called the info-collector in Figure 5, parses each DFI store, collects target instruction information accordingly, forms a DFI packet, and sends it to the PIM processor, where verification computations are performed or the RDT is updated.
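The target instruction information and its buffering in the packet FIFO memory can be sketched as follows. The field names, struct layout and FIFO depth here are our illustrative assumptions; the paper specifies only which pieces of information a packet carries, not a concrete layout.

```c
#include <stdint.h>

/* Sketch of a basic DFI packet: one load/store's runtime information. */
typedef struct {
    uint16_t id;      /* instruction identifier (16 bits suffice per [1]) */
    uint8_t  is_load; /* instruction type: 1 = load, 0 = store            */
    uint64_t addr;    /* target address of the load/store                 */
} dfi_packet;

/* The info-collector appends packets to a FIFO region that the PIM
 * processor drains in first-come-first-serve order. */
#define FIFO_DEPTH 64

typedef struct {
    dfi_packet slots[FIFO_DEPTH];
    unsigned head, tail;
} dfi_fifo;

int fifo_push(dfi_fifo *f, dfi_packet p) {
    unsigned next = (f->tail + 1) % FIFO_DEPTH;
    if (next == f->head) return 0;   /* full: producer must wait */
    f->slots[f->tail] = p;
    f->tail = next;
    return 1;
}

int fifo_pop(dfi_fifo *f, dfi_packet *out) {
    if (f->head == f->tail) return 0; /* empty: PIM core waits */
    *out = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    return 1;
}
```

The first-come-first-serve draining mirrors the packet FIFO memory described later: the main processor produces packets in program order, and the PIM side consumes them in the same order.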
Figure 5: The flow of PIM DFI verification.

6 Software Instrumentation

Instrumentation adds code to a user program in order to facilitate DFI verification. The software instrumentation in our approach helps not only extract the necessary information but also avoid changing the instruction set. The description here is based on C/C++ programs compiled by LLVM [24], with the static analysis performed by SVF [25]. However, our techniques are general and directly applicable to other software languages, compilers and static analysis tools.

Given a program's LLVM Intermediate Representation (IR), static analysis is performed to obtain its reaching definition sets (RDSs), which will be sent to PIM at the beginning of code execution. The instrumentation is performed automatically on the IR by a software tool that we developed. Then, the instrumented IR is further compiled into binary code. Although the instrumentation is inserted in the middle of a compilation flow, it does not require any changes to compiler code. In the absence of source code, such as for proprietary software, our method can still be applied by employing a binary code analysis tool and instrumenting the binary code.
The instrumentation mainly extracts the runtime information of target instructions - the load/store instructions in a user program related to DFI checking - and sends it to the PIM processor. The information includes the instruction identifier, instruction type and target address of the load/store. Instruction identifiers are automatically assigned by the instrumentation tool. An example of code instrumentation is shown by the red-font instructions in Figure 5. These instrumentation store instructions are called DFI stores; we overload their use with underlying semantics different from those of ordinary store instructions. Our key technique is to differentiate between an ordinary store and a DFI store without adding new instructions. The basic syntax of a DFI store is

store runtime_info dfi_global

where dfi_global is the address of a global variable declared at the beginning of a program, which serves as a signature to indicate a DFI store. The address of this global variable is set by writing a dummy value at the beginning of the program:

store dfi_dummy dfi_global

The info-collector (dotted box in Figure 5) checks whether a store instruction has the same target address as dfi_global. If so, the instruction is a DFI store.

Every store and load instruction in a user program, called a target instruction, is followed by a DFI store. The runtime_info contains the instruction type and identifier of the preceding target instruction. For example, in Figure 5, line 2 is an instrumentation instruction store "load, id=12", which tells the instruction type and identifier of the target instruction in line 1. To encode the instruction type and identifier, according to [1], 16 bits are sufficient for representing instruction identifiers in a large program. We use an additional bit to indicate the instruction type, where 0 means write and 1 means read. When the info-collector recognizes a DFI store, it extracts the target address of the preceding target instruction. The target address and the runtime_info form a DFI packet to be sent to PIM.

At the beginning of code execution, a memory space is dynamically allocated at the PIM processor for DFI verification. This includes the memory space for storing incoming packets, which is called the packet FIFO memory. The starting address of the packet FIFO memory is packet_mem_addr, which is also a dynamic value. We specify it by adding the following instruction at the beginning of each user program:

store packet_dummy packet_mem_addr

The packet_dummy is a dummy packet with a fixed value used to obtain the destination address for future DFI packets. The info-collector obtains packet_mem_addr by identifying the first store in the program that stores packet_dummy to an address; packet_mem_addr can only be assigned once for a program. Later during code execution, all DFI packets are sent to the FIFO memory based on packet_mem_addr. Please note that dfi_global and packet_mem_addr are generated by the automatic code instrumentation, and are not visible to security attackers.

An example of the instrumentation is shown in Figure 6, where lines 7 and 10 are the original instructions of the user program, while lines 2, 3, 4, 5, 8 and 11 are instrumentation. The identifiers of the instructions at lines 7 and 10 are in the parentheses (12 and 25). The data of a DFI store (lines 8 and 11 in Figure 6) has bit 16 for the instruction type and bits 15-0 for the instruction identifier.

1  /* ===== beginning of the program ====== */
2  (instructions for allocating FIFO memory)
3  (instructions for storing RDS to memory)
4  store dfi_dummy dfi_global
5  store packet_dummy packet_mem_addr
6  ...
7  store x1 addr1               //(12)
8  store (0<<16)+12 dfi_global
9  ...
10 load x2 addr2                //(25)
11 store (1<<16)+25 dfi_global

Figure 6: An example of code instrumentation.
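The runtime_info word of Figure 6 - bit 16 for the instruction type, bits 15-0 for the identifier - can be packed and unpacked as below. The helper names are ours; the bit layout is the one stated in the text.

```c
#include <stdint.h>

/* Pack the runtime_info word of a DFI store: bits 15-0 hold the
 * instruction identifier, bit 16 holds the type (0 = write, 1 = read). */
static inline uint32_t encode_runtime_info(int is_load, uint16_t id) {
    return ((uint32_t)(is_load ? 1 : 0) << 16) | id;
}

static inline uint16_t info_id(uint32_t word) {
    return (uint16_t)(word & 0xFFFFu);
}

static inline int info_is_load(uint32_t word) {
    return (int)((word >> 16) & 1u);
}
```

For the listing in Figure 6, line 8 would carry encode_runtime_info(0, 12) for the store with identifier 12, and line 11 would carry encode_runtime_info(1, 25) for the load with identifier 25.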
A software program often calls library functions, whose source code or IR is not directly accessible. However, instrumentation can still be performed to obtain the target instruction information, where the target instruction is a library function call. This is similar in spirit to the wrapper of [1], but our realization is quite different. As a library function call may in general involve a multi-byte data block, the instrumentation needs to keep track of the data-length besides the data address. Our approach is illustrated by the example in Figure 7.

1  store (indicators<<17)+7 dfi_global   // indicators and id=7
2  store (y1's addr) dfi_global          // load address
3  store (x1's addr) dfi_global          // store address
4  store 40 dfi_global                   // data-length in one store
5  memcpy(x1, y1, 40)                    //(7)
6  ...
7  store (indicators<<17)+15 dfi_global  // indicators and id=15
8  store (x2's addr) dfi_global          // store address
9  store 12 dfi_global                   // data-length, part 1
10 store 9 dfi_global                    // data-length, part 2
11 memset(x2, 3, (9<<32)+12)             //(15)

Figure 7: The instrumentation for library functions.

In this example, the target instructions are the function calls in lines 5 and 11, with their identifiers in parentheses. The instrumentation for each library function call includes multiple DFI store instructions, like lines 1-4 for the target instruction of line 5. The first DFI store keeps the corresponding identifier in its lower 16 bits. Its bits 17-20 are four binary indicators telling whether the target instruction is a library function call, whether the data-length needs 64 bits to represent, and whether the function loads/stores data. The info-collector parses these indicators and then takes corresponding actions. Additional DFI store instructions are added to send the other information. For example, lines 2 and 3 send the load and store addresses. Depending on whether the data-length is represented in 32 or 64 bits, the data-length is sent through a single DFI store or two of them. For example, line 4 sends the data-length in a single DFI store, while lines 9 and 10 send it in two DFI store instructions.
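The indicator word for library calls can be modeled as below. The paper says which four indicators occupy bits 17-20 but not their order, so the specific bit assignment here is our assumption, as are the names; the point is how the indicators determine how many follow-up DFI stores the info-collector should expect.

```c
#include <stdint.h>

/* Assumed bit assignment of the four indicators in bits 17-20. */
enum {
    DFI_LIB_CALL   = 1u << 17, /* target is a library function call */
    DFI_LEN_64BIT  = 1u << 18, /* data-length needs 64 bits         */
    DFI_LIB_LOADS  = 1u << 19, /* the function loads data           */
    DFI_LIB_STORES = 1u << 20, /* the function stores data          */
};

/* Bits 15-0 still carry the identifier of the target instruction. */
static inline uint32_t lib_info(uint32_t flags, uint16_t id) {
    return flags | id;
}

/* Number of follow-up DFI stores the info-collector should expect:
 * one per transferred address, plus one or two for the data-length. */
int lib_extra_stores(uint32_t word) {
    int n = 0;
    if (word & DFI_LIB_LOADS)  n += 1; /* load address  */
    if (word & DFI_LIB_STORES) n += 1; /* store address */
    n += (word & DFI_LEN_64BIT) ? 2 : 1;
    return n;
}
```

For the memcpy of Figure 7 (loads and stores, 32-bit length) this predicts three follow-up stores (lines 2-4); for the memset (stores only, 64-bit length) it also predicts three (lines 8-10).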
Function return addresses are stored on the stack and are vulnerable to security attacks such as Return-Oriented Programming (ROP) [3]. We treat their accesses as implicit load/store instructions and perform DFI checks accordingly. When a parent function parent_func() calls a child function child_func(), the return address is stored in the stack by an instruction parent_inst. When function child_func() returns, the return address is loaded by a return instruction child_inst. DFI ensures that the return address used by child_inst is the latest value stored by parent_inst. However, function return is not covered by some static analysis tools, such as SVF [26]. Thus, we develop a dedicated instrumentation technique different from that for ordinary load/store instructions. Although a similar idea was also proposed in [1], our instrumentation is quite different.

1 /* ===== beginning of the function ====== */
2 p_ret_addr = instruction_getting_ret_addr_pointer
3 store (1<<21)+(max_id+thread_id) dfi_global
4 store p_ret_addr dfi_global
5 ...
6 store (1<<21)+(1<<16)+(max_id+thread_id) dfi_global
7 store p_ret_addr dfi_global
8 return

Figure 8: Instrumentation for function return.

The instrumentation for function return is illustrated in Figure 8. At the beginning (line 2), the pointer to the return address, p_ret_addr, is obtained. For a C/C++ program, this can be realized by calling the built-in function __builtin_frame_address(0) and adding 4 to the returned result. We designate the identifier of the implicit store instruction (function call) parent_inst as the maximum identifier from the static analysis plus the thread ID (lines 3 and 6). This ensures that the identifier of parent_inst is unique. Bit 21 of the data in the DFI store in line 3 is set to 1, to inform the info-collector that this is for a function return. Then, the info-collector expects a subsequent DFI store for the pointer to the return address. The info-collector combines the instruction type (implicit load/store), the identifier and the pointer to form a DFI packet. At the end of the child function (lines 6 and 7), similar instrumentation instructions are added for the implicit load (function return).
For each load whose identifier is larger than the maximum identifierof static analysis, DFI requires the identifier of the latest store to be the same as the identifier of this load . Info-collector is the key hardware component to be added at the main processor. It detects DFI store instructions,collects runtime information of a target instruction, generates DFI packets and sends them to PIM. It can be realizedas a combinational circuit through synthesizing Verilog description. Its basic operations are depicted in Figure 9. (cid:39)(cid:68)(cid:87)(cid:68)(cid:3)(cid:53)(cid:72)(cid:79)(cid:68)(cid:92)(cid:38)(cid:75)(cid:72)(cid:70)(cid:78)(cid:3)(cid:44)(cid:81)(cid:71)(cid:76)(cid:70)(cid:68)(cid:87)(cid:82)(cid:85)(cid:86)(cid:3)(cid:76)(cid:81)(cid:3)(cid:39)(cid:41)(cid:44)(cid:3)(cid:54)(cid:87)(cid:82)(cid:85)(cid:72)(cid:47)(cid:76)(cid:69)(cid:85)(cid:68)(cid:85)(cid:92)(cid:3)(cid:41)(cid:88)(cid:81)(cid:70)(cid:87)(cid:76)(cid:82)(cid:81) (cid:53)(cid:72)(cid:74)(cid:88)(cid:79)(cid:68)(cid:85)(cid:3)(cid:54)(cid:87)(cid:82)(cid:85)(cid:72)(cid:18)(cid:47)(cid:82)(cid:68)(cid:71)(cid:3)(cid:57)(cid:72)(cid:85)(cid:76)(cid:73)(cid:76)(cid:70)(cid:68)(cid:87)(cid:76)(cid:82)(cid:81)(cid:53)(cid:72)(cid:87)(cid:88)(cid:85)(cid:81)(cid:3)(cid:36)(cid:71)(cid:71)(cid:85)(cid:72)(cid:86)(cid:86)(cid:51)(cid:85)(cid:82)(cid:87)(cid:72)(cid:70)(cid:87)(cid:76)(cid:82)(cid:81)(cid:60)(cid:72)(cid:86) (cid:60)(cid:72)(cid:86)(cid:39)(cid:41)(cid:44)(cid:3)(cid:54)(cid:87)(cid:82)(cid:85)(cid:72)(cid:34)(cid:37)(cid:68)(cid:86)(cid:76)(cid:70)(cid:3)(cid:51)(cid:68)(cid:70)(cid:78)(cid:72)(cid:87)(cid:37)(cid:68)(cid:86)(cid:76)(cid:70)(cid:3)(cid:51)(cid:68)(cid:70)(cid:78)(cid:72)(cid:87)(cid:47)(cid:76)(cid:69)(cid:85)(cid:68)(cid:85)(cid:92)(cid:3)(cid:51)(cid:68)(cid:70)(cid:78)(cid:72)(cid:87) (cid:49)(cid:82) 
Figure 9: Operations of the info-collector.

The info-collector acts only when a store instruction is executed. In step B of Figure 9, it checks whether dfi_global and packet_mem_addr have already been defined. If not, it proceeds to step C to capture dfi_global or packet_mem_addr. Please note that "store dfi_dummy dfi_global" and "store packet_dummy packet_mem_addr" are instrumented at the beginning of a program. Moreover, both dfi_dummy and packet_dummy have signature values that can be recognized by the info-collector. If they have already been defined, the info-collector
further checks whether the store is a DFI store. This is done by examining whether the target address is the same as that of dfi_global. If this store is a DFI store, the info-collector parses the indicators in the data part of the DFI store and tells whether it is to verify a load/store, a function return or a library function call. If this instrumentation is for a load/store instruction, the info-collector collects the instruction type and identifier from this DFI store instruction, and the target address from the previous instruction. These pieces of information form a basic packet (data' in Figure 5) to be sent to PIM, which stores the packet to the address of the allocated packet FIFO memory (addr' in Figure 5). If this DFI store is for return address protection (step H in Figure 9), the info-collector takes the identifier and instruction type from this DFI store, and extracts the pointer to the return address from the next DFI store. This information also forms a basic packet. If this DFI store is for a library function (step G), the indicators of this store tell whether the library function loads data, stores data or neither, and whether the data length needs to be encoded in 64 bits. Next, the info-collector continues to collect additional information from subsequent DFI store instructions and generates a library packet to be sent to PIM. If the store instruction is a part of the user program (step J), i.e., not a DFI store, its data is relayed to memory without any change and its target address is stored in a local register for future use.

A memory space is allocated to store DFI packets sent from the main processor. It is used as a packet FIFO to store and process the packets in a first-come-first-served manner.
In order to maintain the FIFO nature using a region of random access memory with low overhead, we develop circuit design techniques to maintain the head and tail pointers in hardware, where the head pointer is updated by PIM (the consumer) and the tail pointer is updated by the main processor (the producer). We omit the detailed description for brevity.
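A software model of this producer/consumer discipline is sketched below. The actual design keeps the pointers in hardware; the struct layout, names, and FIFO depth here are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the packet FIFO kept in a region of random-access
 * memory: the main processor (producer) advances the tail pointer and
 * the PIM processor (consumer) advances the head pointer. */
#define FIFO_SLOTS 8

typedef struct {
    uint32_t slots[FIFO_SLOTS];
    uint32_t head; /* updated by PIM (consumer) */
    uint32_t tail; /* updated by the main processor (producer) */
} packet_fifo;

/* Returns 0 when the FIFO is full (producer must stall). */
static int fifo_push(packet_fifo *f, uint32_t pkt)
{
    uint32_t next = (f->tail + 1) % FIFO_SLOTS;
    if (next == f->head)
        return 0;
    f->slots[f->tail] = pkt;
    f->tail = next;
    return 1;
}

/* Returns 0 when the FIFO is empty. */
static int fifo_pop(packet_fifo *f, uint32_t *pkt)
{
    if (f->head == f->tail)
        return 0;
    *pkt = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    return 1;
}
```

Because only the producer writes the tail and only the consumer writes the head, the two sides never update the same pointer, which is what makes the hardware implementation inexpensive.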
A main reason for the performance overhead of PIM-based DFI is transferring DFI packets to memory. Although each DFI packet has only a few bytes, the number of DFI packets is huge and the overall impact is significant. We propose to compress target addresses and identifiers by exploiting locality. The compression is realized in the info-collector hardware.

Consider the two C program examples in Figure 10. For example A, assume the starting memory address of aa is 0x8000; then the program stores data at 0x8000, 0x8004, 0x8008, and so on. Starting from i=1, each target address increases by 4 compared to the previous one. Thus, we only need to send the increment in 4 bits, which include 1 sign bit, instead of a 32-bit address. Example B in Figure 10 is similar, but has an address pattern of 0x8000, 0x8400, 0x8800, etc. Although the address increment 0x400 is relatively large and needs 11 bits to represent, the lower bits of the increment are all 0s. Thus, instead of using integer compression, we use a format similar to floating point number representation to further reduce the bitwidth of the address increment. This format consists of a sign bit, a significand and an exponent with base 16. To represent 0x400, the sign bit is 0, 3 bits of significand represent 4 and the exponent is 2. Overall, the bitwidth is 6, which is shorter than the 11-bit binary encoding. The full floating point representation contains 8 bits: 1 sign bit, 4 bits of significand and 3 bits of exponent (the power of 16). This representation can cover the range from −15 × 16^7 to 15 × 16^7. The info-collector calculates the difference between two target addresses. If the difference is within this range and the significand is within −
15 to 15, then the difference is represented by an 8-bit floating point number. Note that the difference is compressed only when it can be represented in this format with a base-16 exponent.

/* ======== Example A ========== */
int aa[1024];
for (int i = 0; i < 1024; i++)
    aa[i] = i;

/* ======== Example B ========== */
int bb[1024][1024];
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
        bb[j][i] = i + j;

Figure 10: Examples of address locality.

Identifiers can also be compressed based on their value locality. However, they rarely have patterns like example B, where the increment is at the middle bits of an address. Thus, the difference between two identifiers is represented by a binary number. Overall, a DFI packet can be compressed to 15 bits. Thus, we can pack two compressed packets into one word.
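The delta encoding described above can be prototyped in C. The exact bit layout below (sign in the top bit, then a 4-bit significand, then a 3-bit base-16 exponent) is our assumption for illustration; the text fixes only the field widths and the base-16 exponent.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the 8-bit delta format: 1 sign bit, 4-bit significand,
 * 3-bit exponent interpreted as a power of 16.
 * Returns 1 and fills *out when the delta is representable. */
static int compress_delta(int64_t delta, uint8_t *out)
{
    uint8_t sign = delta < 0;
    uint64_t mag = sign ? (uint64_t)(-delta) : (uint64_t)delta;
    for (uint8_t exp = 0; exp < 8; exp++) {
        uint64_t base = 1ull << (4 * exp); /* base == 16^exp */
        if (mag % base == 0 && mag / base <= 15) {
            *out = (uint8_t)((sign << 7) | ((mag / base) << 3) | exp);
            return 1;
        }
    }
    return 0; /* not representable: send the full value instead */
}

static int64_t decompress_delta(uint8_t enc)
{
    int64_t mag = (int64_t)((enc >> 3) & 0xF) << (4 * (enc & 0x7));
    return (enc & 0x80) ? -mag : mag;
}
```

For the increment 0x400 of example B, the encoder finds significand 4 and exponent 2 (4 × 16^2 = 0x400), matching the worked example in the text.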
We develop packet pruning techniques and a technique for increasing the opportunity of locality for data compression. These optimization techniques help reduce the amount of data sent to PIM and thereby further decrease performance overhead. Some pruning techniques described here are similar to those in [1]. However, the pruning techniques in [1] are offline, while our hardware approach allows pruning at runtime. As more information, such as the target address, is available at runtime, the opportunity for pruning is increased.

Similar to data transfer between memory and cache in cache lines, we pack multiple DFI packets into a block of hundreds of bytes before sending them to PIM. The packets in a block are organized in a transmission buffer, which is implemented as a register file. The optimizations are performed for packets in the buffer before they are sent out. Note that waiting for other packets to form a block increases DFI verification latency but does not increase performance overhead.

Consider two pairs of basic packets in the transmission buffer, (P1, P2) and (Q1, Q2). Each basic packet is for a load instruction, a store instruction, or a function return. Packet P1 (Q1) precedes P2 (Q2). The packets of each pair share the same target address and there is no other DFI packet for a store of the same target address between them. There are five optimization techniques, described using the packet pairs:

A: If P1 and P2 are for store instructions, and there is no other DFI packet for a load with the same target address between them, then packet P1 is redundant and can be pruned out without being sent to PIM.
B: If P1 and P2 are both for store instructions, and their identifiers are the same, then P2 can be pruned out.
C: If P1 and P2 are both for load instructions, and their identifiers are the same, then P2 can be pruned out.
D: P1/P2 are for a store/load of the same target address. After P1 and P2, if packets Q1 and Q2 are for a store/load of another same target address, and Q1/Q2 have the same identifiers as P1/P2, respectively, then Q1 and Q2 are redundant. This makes sure that the same store/load pair appears only once in the transmission buffer.
E: All basic packets in the transmission buffer are sorted according to their target addresses. If two packets have the same target address, their relative order remains unchanged. If there is a library packet, the basic packets before and after this library packet are sorted separately. After sorting, the target address difference between two adjacent packets is examined to find whether data compression can be performed. The sorting helps find opportunities for data compression. DFI verifications for loads/stores of different target addresses are independent of each other, and hence sorting does not affect DFI verification results.

Among the optimizations, A, B and C are similar to those in [1] except that they can be performed both offline and at runtime, while those in [1] are restricted to offline. Techniques D and E are newly developed in this work. After the optimizations are performed, a packet is compressed if possible. All five optimizations can be realized in circuits for runtime use in the main processor. We illustrate the circuit designs by using optimization C as an example.
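As a software reference for one of these techniques, the stable sorting step of optimization E can be modeled as follows. This is an illustrative model, not the register-file circuit; insertion sort is used here simply because it is stable, which preserves the relative order of packets with equal target addresses as the technique requires.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t target_addr;
    uint32_t identifier;
} pkt;

/* Optimization E (model): stable sort of basic packets by target
 * address so that adjacent packets tend to have small address deltas,
 * increasing the chance that the 8-bit delta compression applies. */
static void sort_packets(pkt *p, int n)
{
    for (int i = 1; i < n; i++) {
        pkt key = p[i];
        int j = i - 1;
        /* strict '>' keeps equal-address packets in original order */
        while (j >= 0 && p[j].target_addr > key.target_addr) {
            p[j + 1] = p[j];
            j--;
        }
        p[j + 1] = key;
    }
}
```

In hardware, the buffer holds at most a few hundred packets, so even a quadratic sorting network is affordable; stability is the essential property, since reordering two packets with the same target address could change verification results.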
Figure 11: Circuit for implementing optimization C.

The schematic of the combinational circuit implementation of optimization C is shown in Figure 11. Assume there are n basic packets in the transmission buffer, Pi represents the i-th packet, and Ri indicates whether the i-th packet is redundant
or not. Each square in Figure 11 is a Processing Element (PE) that computes whether a packet is redundant. In each column of Figure 11, a packet Pi is compared with all later packets Pj, j > i, attempting to find a redundant Pj to be pruned. If there are multiple packets that are redundant with respect to Pi, only the topmost one (with the smallest |j − i|) is asserted for pruning and the others can be pruned later in other columns to the right. The R signals in a row are ORed such that a packet in a row can potentially be pruned by any preceding packet organized in columns. For example, the PE labeled PaPb compares packets Pa and Pb. A necessary but insufficient condition for asserting R = TRUE is that Pa and Pb are both for loads with the same target address and identifier. The final result of R also depends on Din, which is a disable signal for the pruning. The value of R = TRUE when
Din == FALSE. Din is asserted in two scenarios: (1) there is a store at the same target address between the two load instructions of Pa and Pb, and thus the conditions for optimization C are not completely satisfied; (2) a redundant packet has already been found and no further pruning is needed in a column. For scenario (1), Dout = TRUE when Pa is for a load while Pb is for a store to the same target address. For scenario (2), Dout = TRUE when R = TRUE for the same PE.
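A software model of the pruning condition evaluated by this PE grid may help. The function below checks whether a later load packet j is redundant with respect to an earlier load packet i, including the intervening-store disable condition that Din propagates; the structure and names are illustrative, not the circuit interface.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { PKT_LOAD, PKT_STORE } pkt_type;

typedef struct {
    pkt_type type;
    uint32_t target_addr;
    uint32_t identifier;
} basic_packet;

/* Optimization C (model): packet j (j > i) is redundant with respect
 * to packet i when both are loads with the same target address and
 * identifier, and no store to that address sits between them (this is
 * the condition the Din disable signal enforces in the circuit). */
static int redundant_by_opt_c(const basic_packet *buf, int i, int j)
{
    if (buf[i].type != PKT_LOAD || buf[j].type != PKT_LOAD)
        return 0;
    if (buf[i].target_addr != buf[j].target_addr)
        return 0;
    if (buf[i].identifier != buf[j].identifier)
        return 0;
    for (int k = i + 1; k < j; k++)
        if (buf[k].type == PKT_STORE &&
            buf[k].target_addr == buf[i].target_addr)
            return 0;
    return 1;
}
```

The hardware evaluates all pairs in parallel rather than with this nested loop, but the asserted condition is the same: two identical load verifications with no store to the address in between would produce the same result, so the later one can be dropped.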
The DFI verification program is written in the C language, and its binary code is executed on the PIM processor. The RDT memory space is allocated by the instrumentation code. As in [1], all program data are organized in words, each of which requires one RDT entry. If the data memory for the user program has N bytes, there are N/4 RDT entries; with 2-byte identifiers, the RDT takes N/4 × 2 = N/2 bytes.

• Basic packet for store or load: The verification program extracts the instruction type, identifier α and target address β from the packet. If the instruction type is store, identifier α is stored at RDT entry β >> 2. If the instruction type is load, the verification program reads identifier γ from RDT entry β >> 2. Then, the program checks whether γ is in the RDS of α. If not, a DFI violation is reported. Finally, identifier α and target address β are saved in registers for future decompression of compressed packets.
• Compressed packet for store or load: The process is similar to handling basic packets except that decompression is performed.
•
Library packet:
The verification program extracts target address α if there is a load in the library function call, and target address β if there is a store. Then, the data length γ (in words) of the load and/or store and the identifier δ of this function are also extracted. If there is an address α, the verification program loads the identifiers ε0, ε1, ..., εγ−1 from RDT entries α >> 2, (α >> 2) + 1, ..., (α >> 2) + γ − 1, and checks whether each εi is in the RDS of identifier δ. If there is an address β, the program stores identifier δ to all the entries from β >> 2 to (β >> 2) + γ − 1.

We evaluate our approach and the proposed techniques using architecture simulations through SMCsim [27, 21], which is an extension to the gem5 simulator [28] for accommodating PIM. The main processor is an ARM Cortex-A15 with 2GHz frequency, 32KB L1 instruction cache, 64KB L1 data cache, 2MB L2 cache, and 512MB memory. A single PIM processor is used and operates at 2GHz frequency [29, 30]. 64MB memory is allocated for the RDT, which is sufficient for the testcases in our experiment. Other details of the PIM can be found in [21, 27]. Please note that the PIM configuration has little impact on the user program execution.
Our approach verifies the same DFI as defined in [1] and thus achieves similar security as [1], except that our approach is asynchronous monitoring [11, 31, 7], where detection of a DFI violation can trigger a system interrupt for further security measures, rather than synchronous enforcement like [1]. This difference is a tradeoff between security and service availability. Synchronization inevitably entails extra performance overhead, as DFI verification blocks user program execution.
Hardware-assisted Data-Flow Isolation (HDFI) [13] verifies partial DFI at a very coarse granularity. It uses a 1-bit tag to differentiate a sensitive data region and a non-sensitive data region, and only ensures that data in one region are not lastly written by an instruction for the other region. As such, it cannot detect attacks that mingle data within the same region. For the example of Figure 1, we exhaustively tested different tag schemes of HDFI, which are listed in the left three columns of Table 2. For each tag scheme, there is some overflow that cannot be detected by HDFI, as shown in column 4, while our approach detects all of them.

Table 2: Tag schemes of HDFI for the example of Figure 1 and the overflows that each scheme fails to detect.

TMDFI [16] employs an 8-bit tag and thus can differentiate data among 256 regions. Although this is a significant improvement over HDFI, its verification resolution is still far from enough in many applications. Figure 12 shows the numbers of identifiers needed for several benchmarks, which are hundreds or even thousands. Hence, the gap between the 256 regions of TMDFI [16] and the actual needs is large. By contrast, our approach can accommodate all identifiers in these benchmarks and achieve complete DFI with an overhead similar to TMDFI.
Figure 12: The number of identifiers of each benchmark.
RIPE [32, 33] is a well-known benchmark containing various control-flow attacks, and all control-flow attacks can be identified by DFI. RIPE was originally designed for the x86 architecture, and modification is required for execution on an ARM processor. We implemented 156 attacks of the benchmark for our system, including Return-Oriented Programming (ROP) [3] attacks and Jump-Oriented Programming (JOP) [2] attacks. In addition, we also prepared a RIPE program without any attack. Our DFI system successfully identifies all 156 attacks and raises no false alarm for the case without attacks.
Heartbleed (CVE-2014-0160) [4] is a vulnerability in the OpenSSL cryptography library. When a message, including a payload and the claimed length of the payload, is sent to a server, the server echoes back the message with the claimed length. However, it is not checked whether the actual payload length matches the claimed one. As such, an attacker may send a message whose actual payload length is smaller than the claimed one. Then, the server sends back not only the original payload but also some additional data, which might be sensitive private data, to fulfill the claimed length. Consequently, sensitive data is stolen by the attacker. We use the source code in [34] to simulate such an attack. This attack is successfully detected by our DFI verification, as the data to be loaded for sending back cannot be most recently written by an instruction not from the sender. An attack-free transaction, where the actual payload length conforms to the claimed one, is also tested, and no false alarm is raised by our approach.
Nullhttpd is an HTTP server that has a heap overflow vulnerability (CVE-2002-1496) [5]. If the server receives a POST request with a negative content length L, it should not process the request. However, the server continues to process the request and allocates a buffer whose size is computed from L, resulting in a heap overflow. When a load instruction attempts to access the data written by the overflow, it is found that the data is not written by any instruction in the RDS of the load instruction. An experiment is also conducted to confirm that our approach does not produce a false alarm in this context.

Performance overheads of the following methods are evaluated through simulations on the SPEC CPU 2006 benchmark [35].

• Software. This is the original software DFI by [1].
• HBM. This is similar to [1] except that High Bandwidth Memory [36, 37] is employed.
• CMP. This is a parallel approach, where DFI verification is performed in another core of a CMP, with two versions: the software version CMP-S (multithreading) and the hardware version CMP-H using our info-collector circuit.
• PIM. This is the proposed hardware-assisted parallel approach using PIM.

Our proposed approach has two variants: CMP-H and PIM. To ensure a fair comparison, each application was terminated at the same point in the simulations. The results are summarized in Table 3. As the static analysis tool failed in some applications, results are only shown for the successful runs.

Table 3: Performance overhead of DFI. † Computation time of optimizations and compression is neglected. ‡ Computation time of optimizations and compression is considered. § No DFI packet is sent to the memory.
Col.  Method                                   Transmit buf.  Runtime opt.  Avg. overhead
1     Software [1]                             -              -             161.4%
2     HBM                                      -              -             162.0%
3     CMP-S                                    -              -             426.9%
4     CMP-H †                                  2KB            All           37.0%
5     PIM (no compression or optimization) †   -              -             232.9%
6     PIM §                                    -              -             31.4%
7     PIM †                                    2KB            E             36.4%
8     PIM †                                    2KB            A,B,C,D       38.2%
9     PIM †                                    512B           All           37.2%
10    PIM ‡                                    512B           C,E           38.8%
11    PIM †                                    2KB            All           35.0%
12    PIM †                                    2KB            C,E           35.4%
13    PIM ‡                                    2KB            C,E           36.4%
On average, the performance overhead of software DFI [1] is 161%, as shown in column 1. Column 2 shows the result of software DFI using HBM, where the memory bandwidth is abundant and memory access latency is fairly low. One can see that using HBM brings almost no overhead reduction. This result confirms the analysis in Section 4. The results of the parallel approach using another CMP core are summarized in columns 3 and 4, for the software and our hardware version, respectively. Without dedicated hardware, the parallel approach actually increases the overhead due to the expensive communication in software. CMP-H reduces the overhead to 37%.

The PIM results are listed in columns 5-13, where "All" means all of the 5 optimization techniques are applied and "C, E" corresponds to the results where only the two most effective optimizations are employed. In column 5, the overhead is 233% although the offline optimization has been applied. This tells the importance of our hardware-based optimization and compression. In column 6, we dropped all DFI packets without sending them out, by simulating only instruction fetching but not execution of the instrumentation. This is not realistic for DFI, but serves to obtain a lower bound for the overhead, which is about 31%. Column 7 shows that the joint effect of data compression and optimization E is dramatic. Please note that optimization E is designed for increasing the chance of data compression. The setup for column 11 is very similar to column 4, except that one is by PIM and the other is by CMP. Examining the results of the two columns shows that their overhead reductions are similar. PIM is a little better, as it causes fewer cache contentions than CMP. Column 13 takes the two most important optimizations and considers the compression/optimization delay, showing an overhead of about 36%.
The effect of transmission buffer size on reducing performance overhead is plotted in Figure 13. It shows that an increase of buffer size from 0 quickly brings down the overhead. However, the reduction soon diminishes as the buffer size reaches 2K bytes, which is why we limit the buffer size to no more than 2K in our experiments.
Figure 13: Overhead vs. buffer size.
Figure 14: Effects of optimization techniques.
Figure 15: Detection latency vs. overhead for two benchmarks.

The effects of the five optimization techniques described in Section 7.4 on data reduction are evaluated separately, and the results are depicted in Figure 14. It shows that optimizations C and E always lead to more data reduction than the other techniques. For some benchmarks, optimization C can reduce data by over 80%, while optimization E reduces data by more than 60%. Optimization E is designed to facilitate compression, and one can observe that its average data reduction is 46%, which is also the average compression ratio. Ideally, the latency for detecting DFI violations needs to be minimized so that attackers have less time to complete damaging operations. In Figure 15, we show that the latency can be managed by a tradeoff with the overhead via varying the buffer size. The results also indicate that the PIM approach performs better for low overhead, while the CMP-H approach is slightly better for obtaining low latency.
The info-collector circuit is implemented by synthesizing Verilog using Synopsys Design Compiler and the ASAP 7nm cell library [38]. The info-collector with basic operation and compression costs only 2908 gates and less than 30ps circuit delay. Hence, its area and delay are negligible. We also implemented the circuits for optimizations C and E. The results with these implementations are in columns 10 and 13 of Table 3, where the gate counts of the info-collector with different buffer sizes are listed. The circuit overhead is dominated by the optimization part. The gate count of 754K is not trivial, but it is still a small fraction of a modern microprocessor, which often has hundreds of millions of gates. Moreover, our DFI can isolate data among 64K regions, and the hardware cost per region is no more than 12 gates. The works of CHERI [17] and HDFI [13] did not describe their hardware details. However, they can isolate only 2 regions, and their hardware cost is almost certainly more than 24 gates. Therefore, the hardware cost per region of our approach is less than that of CHERI and HDFI.
10 Conclusions and Future Research
Data-Flow Integrity (DFI) is potentially a very powerful security measure that can detect a large number of software attacks. However, it requires checking a large volume of data and thus intrinsically entails huge performance overhead. We propose a hardware-assisted parallel approach to address this challenge. This approach can reduce the overhead by more than 4× compared to the original software DFI while verifying complete DFI. In future research, we will study how to further reduce the performance overhead and detection latency.

References

[1] Miguel Castro, Manuel Costa, and Tim Harris. Securing Software by Enforcing Data-Flow Integrity.
Symposium on Operating Systems Design and Implementation, pages 147–160, 2006.
[2] Tyler Bletsch, Xuxian Jiang, Vince W. Freeh, and Zhenkai Liang. Jump-oriented Programming: A New Class of Code-reuse Attack.
ACM Symposium on Information, Computer and Communications Security, pages 30–40, 2011.
[3] Hovav Shacham. The Geometry of Innocent Flesh on the Bone: Return-into-libc Without Function Calls (on the x86). ACM Conference on Computer and Communications Security, pages 552–561, 2007.
[4] The Heartbleed Bug. http://heartbleed.com/.
[5] Null HTTPd Remote Heap Overflow Vulnerability.
[6] Martín Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti. Control-flow Integrity. ACM Conference on Computer and Communications Security, pages 340–353, 2005.
[7] Yongje Lee, Jinyong Lee, Ingoo Heo, Dongil Hwang, and Yunheung Paek. Using CoreSight PTM to Integrate CRA Monitoring IPs in an ARM-Based SoC. ACM Transactions on Design Automation of Electronic Systems, 22(3):52:1–52:25, 2017.
[8] Zonglin Guo, Ram Bhakta, and Ian G. Harris. Control-flow Checking for Intrusion Detection via a Real-time Debug Interface. International Conference on Smart Computing Workshops, pages 87–92, 2014.
[9] Xinyang Ge, Weidong Cui, and Trent Jaeger. GRIFFIN: Guarding Control Flows Using Intel Processor Trace. ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 585–598, 2017.
[10] Yutao Liu, Peitao Shi, Xinran Wang, Haibo Chen, Binyu Zang, and Haibing Guan. Transparent and Efficient CFI Enforcement with Intel Processor Trace. IEEE International Symposium on High Performance Computer Architecture, pages 529–540, 2017.
[11] Yubin Xia, Yutao Liu, Haibo Chen, and Binyu Zang. CFIMon: detecting violation of control flow integrity using performance counters. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks, pages 1–12, 2012.
[12] Lucas Davi, Alexandra Dmitrienko, Manuel Egele, Thomas Fischer, Thorsten Holz, Ralf Hund, Stefan Nürnberger, and Ahmad-Reza Sadeghi. MoCFI: A Framework to Mitigate Control-flow Attacks on Smartphones. Symposium on Network and Distributed System Security, 2012.
[13] Chengyu Song, Hyungon Moon, Monjur Alam, Insu Yun, Byoungyoung Lee, Taesoo Kim, Wenke Lee, and Yunheung Paek. HDFI: Hardware-Assisted Data-Flow Isolation. IEEE Symposium on Security and Privacy, pages 1–17, 2016.
[14] Chengyu Song, Byoungyoung Lee, Kangjie Lu, William R. Harris, Taesoo Kim, and Wenke Lee. Enforcing Kernel Security Invariants with Data Flow Integrity. Network and Distributed System Security Symposium, 2016.
[15] Periklis Akritidis, Cristian Cadar, Costin Raiciu, Manuel Costa, and Miguel Castro. Preventing Memory Error Exploits with WIT. IEEE Symposium on Security and Privacy, pages 263–277, 2008.
[16] Tong Liu, Gang Shi, Liwei Chen, Fei Zhang, Yaxuan Yang, and Jihu Zhang. TMDFI: Tagged Memory Assisted for Fine-Grained Data-Flow Integrity Towards Embedded Systems Against Software Exploitation. IEEE International Conference On Trust, Security And Privacy In Computing And Communications / IEEE International Conference On Big Data Science And Engineering, pages 545–550, 2018.
[17] Robert N. M. Watson, Jonathan Woodruff, Peter G. Neumann, Simon W. Moore, Jonathan Anderson, David Chisnall, Nirav Dave, Brooks Davis, Khilan Gudka, Ben Laurie, Steven J. Murdoch, Robert Norton, Michael Roe, Stacey Son, and Munraj Vadera. CHERI: A Hybrid Capability-System Architecture for Scalable Software Compartmentalization. IEEE Symposium on Security and Privacy, pages 20–37, 2015.
[18] Hong Hu, Shweta Shinde, Sendroiu Adrian, Zheng Leong Chua, Prateek Saxena, and Zhenkai Liang. Data-Oriented Programming: On the Expressiveness of Non-control Data Attacks. IEEE Symposium on Security and Privacy, pages 969–986, 2016.
[19] Jedidiah R. Crandall and Frederic T. Chong. Minos: Control Data Attack Prevention Orthogonal to Memory Model. IEEE/ACM International Symposium on Microarchitecture, pages 221–232, 2004.
[20] Ken Biba. Integrity Considerations for Secure Computer Systems. Defense Technical Information Center, page 68, 1977.
[21] Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. Design and Evaluation of a Processing-in-Memory Architecture for the Smart Memory Cube. International Conference on Architecture of Computing Systems, pages 19–31, 2016.
[22] Youngmin Shin, Hoi-Jin Lee, Ken Shin, Prashant Kenkare, Rajesh Kashyap, DongJoo Seo, Brian Millar, Yohan Kwon, Ravi Iyengar, Min-Su Kim, Ahsan Chowdhury, Sung-Il Bae, Inpyo Hong, Wookyeong Jeong, Aaron Lindner, Uk-Rae Cho, Keith Hawkins, Jae-Cheol Son, and Sung-Ho Park. 28nm high-K metal gate heterogeneous quad-core CPUs for high performance and energy-efficient mobile application processor. In
Proceedings of the IEEE International SoC Design Conference, 2013.
[23] Mario Drumond, Alexandros Daglis, Nooshin Mirzadeh, Dmitrii Ustiugov, Javier Picorel, Babak Falsafi, Boris Grot, and Dionisios Pnevmatikatos. The Mondrian Data Engine. In Proceedings of the ACM International Symposium on Computer Architecture, pages 639–651, 2017.
[24] LLVM. https://llvm.org/.
[25] Yulei Sui and Jingling Xue. SVF: Interprocedural Static Value-Flow Analysis in LLVM. In Proceedings of the 25th International Conference on Compiler Construction, CC 2016, pages 265–266, New York, NY, USA, 2016. ACM.
[26] SVF for Reaching Definition Analysis. https://github.tamu.edu/jyhuang/SVF.
[27] SMCsim. https://iis-git.ee.ethz.ch/erfan.azarkhish/SMCSim.
[28] The gem5 Simulator.
[29] Xu Yang, Yumin Hou, and Hu He. A Processing-in-Memory Architecture Programming Paradigm for Wireless Internet-of-Things Applications. Sensors, 19(1):140, 2019.
[30] Seth H. Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. NDC: Analyzing the Impact of 3D-stacked Memory+Logic Devices on MapReduce Workloads. IEEE International Symposium on Performance Analysis of Systems and Software, pages 190–200, 2014.
[31] Sanjeev Das, Yang Liu, Wei Zhang, and Mahintham Chandramohan. Semantics-based online malware detection towards efficient real-time protection against malware. IEEE Transactions on Information Forensics and Security, 11(2):289–302, February 2016.
[32] RIPE. https://github.com/johnwilander/RIPE.
[33] John Wilander, Nick Nikiforakis, Yves Younan, Mariam Kamkar, and Wouter Joosen. RIPE: Runtime Intrusion Prevention Evaluator. Computer Security Applications Conference, pages 41–50, 2011.
[34] The Source Code for Triggering Heartbleed Bug. https://github.com/mykter/afl-training/tree/master/challenges/heartbleed.
[35] SPEC CPU 2006 Benchmark.
[36] Dong Uk Lee, Kyung Whan Kim, Kwan Weon Kim, Kang Seol Lee, Sang Jin Byeon, Jae Hwan Kim, Jin Hee Cho, Jaejin Lee, and Jun Hyun Chun. A 1.2 V 8 Gb 8-Channel 128 GB/s High-Bandwidth Memory (HBM) Stacked DRAM With Effective I/O Test Circuits. IEEE Journal of Solid-State Circuits, 50(1):191–203, 2015.
[37] Hongshin Jun, Jinhee Cho, Kangseol Lee, Ho-Young Son, Kwiwook Kim, Hanho Jin, and Keith Kim. HBM (High Bandwidth Memory) DRAM Technology and Architecture. IEEE International Memory Workshop, pages 1–4, 2017.
[38] ASAP 7nm Predictive PDK. http://asap.asu.edu/asap/