Practical Byte-Granular Memory Blacklisting using Califorms
Hiroshi Sasaki, Columbia University ([email protected])
Miguel A. Arroyo, Columbia University ([email protected])
M. Tarek Ibn Ziad, Columbia University ([email protected])
Koustubha Bhat†, Vrije Universiteit Amsterdam ([email protected])
Kanad Sinha, Columbia University ([email protected])
Simha Sethumadhavan, Columbia University ([email protected])
Abstract
Recent rapid strides in memory safety tools and hardware have improved software quality and security. While coarse-grained memory safety has improved, achieving memory safety at the granularity of individual objects remains a challenge due to high performance overheads, which can be between ∼1.7×–2.2×. In this paper, we present a novel idea called Califorms, and associated program observations, to obtain a low overhead security solution for practical, byte-granular memory safety.

The idea we build on is called memory blacklisting, which prohibits a program from accessing certain memory regions based on program semantics. State-of-the-art hardware-supported memory blacklisting, while much faster than software blacklisting, creates memory fragmentation (of the order of a few bytes) for each use of the blacklisted location. In this paper, we observe that metadata used for blacklisting can be stored in dead spaces in a program's data memory, and that this metadata can be integrated into microarchitecture by changing the cache line format. Using these observations, the Califorms-based system proposed in this paper reduces the performance overheads of memory safety to ∼1.02×–1.16× (i.e., a 2–14% slowdown).

1 Introduction

With recent interest in microarchitecture side channels, it is important not to lose sight of more traditional software security threats. Security is a full-system property where both software and hardware have to be secure for a system to be secure. Historically, program memory safety violations have provided a significant opportunity for exploitation: for instance, a recent report from Microsoft revealed that the root cause of more than half of all exploits were software memory safety violations [1].

† Part of this work was carried out while the author was a visiting student at Columbia University.
In response to the severity of this threat, improvements in software checking tools, such as AddressSanitizer [2], and advances in the form of commercial hardware support for memory safety, such as Oracle's ADI [3] and Intel's MPX [4], have enabled programmers to detect and fix memory safety violations before deploying software. Current software- and hardware-supported solutions excel at providing coarse-grained memory safety, i.e., detecting memory accesses beyond arrays and malloc'd regions (struct and class instances). However, they are not suitable for fine-grained memory safety (i.e., detecting overflows within objects, such as fields within a struct, or members within a class) due to the high performance overheads and/or the need for making intrusive changes to the source code [5]. For instance, a recent work that aims to provide intra-object overflow protection functionality incurs a 2.2× performance overhead [6]. These overheads are problematic because they not only reduce the number of pre-deployment tests that can be performed, but also impede post-deployment continuous monitoring, which researchers have pointed out is necessary for detecting benign and malicious memory safety violations [7]. Thus, a low overhead memory safety solution that can enable continuous monitoring and provide complete program safety has been elusive.

The source of these overheads stems from how current designs store and use the metadata necessary for enforcing memory safety. In Intel MPX [4], Hardbound [8], CHERI [9, 10], and PUMP [11], the metadata is stored for each pointer, and each data or code memory access through a pointer performs checks using the metadata. Since C/C++ memory accesses tend to be highly pointer based, the performance and energy overheads of accessing metadata can be significant in such systems. Furthermore, the management of metadata, especially if it is stored in a disjoint manner from the pointer, can also create significant engineering complexity in terms of performance and usability.
This was evidenced by the fact that compilers like LLVM and GCC dropped support for Intel MPX in their mainline after an initial push to integrate it into the toolchain [4].

Our approach for reducing overheads is two-fold. First, instead of checking access bounds for each pointer access, we blacklist all memory locations that should never be accessed. In theory, this is a strictly weaker form of security than whitelisting, but we argue that in practice, blacklisting can be more practical because of its ease of deployment and low overheads. Informally, deployments apply whitelisting techniques partially to reduce overheads and be backward compatible, which reduces their security, while blacklisting techniques can be applied more broadly due to their low overheads. Additionally, blacklisting techniques complement defenses in existing systems better since they do not require intrusive changes.

Our second optimization is the novel metadata storage scheme. We observe that by using dead memory spaces in the program, we can store metadata needed for memory safety for free for nearly half of the program objects. These dead spaces occur because of language alignment requirements and are inserted by the compiler. When we cannot find a naturally occurring dead space, we manually insert a dead space. The overhead due to this dead space is smaller than traditional methods for storing metadata because of how we represent the metadata: our metadata is smaller (one byte) as opposed to multiple bytes with traditional whitelisting or blacklisting memory safety techniques.

A natural question is how the dead (more commonly referred to as padding) bytes can be distinguished from normal bytes in memory.
A straightforward scheme results in one bit of additional storage per byte to identify if a byte is a dead byte; this scheme results in a space overhead of 12.5%. We reduce this overhead to one bit per 64B cache line (0.2% overhead) without any loss of precision by only reformatting how data is stored in cache lines. Our technique, Califorms, uses one bit of additional storage to identify if the cache line associated with the memory contains any dead bytes. For califormed cache lines, i.e., lines which contain dead bytes, the actual data is stored following the "header", which indicates the location of dead bytes, as shown in Figure 1.

With this support, it is easy to describe how a Califorms-based system for memory safety works. The dead bytes, either naturally harvested or manually inserted, are used to indicate memory regions that should never be accessed by a program (i.e., blacklisting). If an attacker accesses these regions, we detect this rogue access without any additional metadata accesses, as our metadata resides inline.

Our experimental results on the SPEC CPU2006 benchmark suite indicate that the overheads of Califorms are quite low: software overheads range from 2 to 14% slowdown (or alternatively, 1.02× to 1.16× performance overhead) depending on the amount and location of padding bytes used. This provides the functionality for the user/customer to tune the security according to their performance requirements. Hardware induced overheads are also negligible, on average less than 1%. All of the software transformations are performed using the LLVM compiler framework using a front-end source-to-source transformation. These overheads
Figure 1. Califorms offers memory safety by detecting accesses to dead bytes in memory. Dead bytes are not stored beyond the L1 data cache and are identified using a special header in the L2 cache (and beyond), resulting in very low overhead. The conversion between these formats happens when lines are filled or spilled between the L1 and L2 caches. The absence of dead bytes results in the cache lines being stored in the same natural format across the memory system.

are substantially lower compared to the state-of-the-art software or hardware supported schemes (viz., 2.2× performance and 1.1× memory overheads for EffectiveSan [6], and 1.7× performance and 2.1× memory overheads for Intel MPX [4]).
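The storage overheads quoted for the two metadata schemes follow from simple arithmetic over the 64B (512-bit) line size used in the text; a quick sketch:

```c
#include <assert.h>

/* Metadata storage overheads as ratios:
 * - naive scheme: 1 bit per byte -> 64 bits per 512-bit cache line
 * - Califorms:    1 bit per entire 64B cache line                  */
static double naive_overhead(void)     { return 64.0 / 512.0; } /* 12.5%  */
static double califorms_overhead(void) { return 1.0 / 512.0; }  /* ~0.2%  */
```
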
One of the key ways in which we mitigate the overheads for fine-grained memory safety is by opportunistically harvesting padding bytes in programs to store metadata. So how often do these occur in programs? Before we answer that question, let us concretely understand padding bytes with an example. Consider the struct A defined in Listing 1(a). Let us say the compiler inserts a three-byte padding in between char c and int i as in Listing 1(b), because of the C language requirement that integers should be padded to their natural size (which we assume to be four bytes here). These types of paddings are not limited to C/C++ but also occur in many other languages and their runtime implementations. To obtain a quantitative estimate of the amount of padding, we developed a compiler pass to statically collect the padding size information. Figure 3 presents the histogram of struct densities for SPEC CPU2006 C and C++ benchmarks and the V8 JavaScript engine. Struct density is defined as the sum of the size of each field divided by the total size of the struct including the padding bytes (i.e., the smaller or sparser the struct density, the more padding bytes the struct has). The results reveal that 45.7% and 41.0% of structs within SPEC and V8, respectively, have at least one byte of padding. This is encouraging since even without introducing additional padding bytes (no memory overhead), we can offer protection for certain compound data types, restricting the remaining attack surface.

Naturally, one might inquire about the safety of the rest of the program. To offer protection for all defined compound data types (called the full strategy), we can insert random

    struct A {
      char c;
      int i;
      char buf[64];
      void (*fp)();
      double d;
    };
    (a) Original.

    struct A_opportunistic {
      char c;
      /* compiler inserts padding
       * bytes for alignment */
      char padding_bytes[3];
      int i;
      char buf[64];
      void (*fp)();
      double d;
    };
    (b) Opportunistic.

    struct A_full {
      /* we protect every field with
       * random security bytes */
      char security_bytes[2];
      char c;
      char security_bytes[1];
      int i;
      char security_bytes[3];
      char buf[64];
      char security_bytes[2];
      void (*fp)();
      char security_bytes[1];
      double d;
      char security_bytes[2];
    };
    (c) Full.

    struct A_intelligent {
      char c;
      int i;
      /* we protect boundaries
       * of arrays and pointers with
       * random security bytes */
      char security_bytes[3];
      char buf[64];
      char security_bytes[2];
      void (*fp)();
      char security_bytes[3];
      double d;
    };
    (d) Intelligent.

Listing 1. Example of three security byte harvesting strategies: (b) opportunistic uses the existing padding bytes as security bytes, (c) full protects every field within the struct with security bytes, and (d) intelligent surrounds arrays and pointers with security bytes.
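The padding in Listing 1(b) and the struct density metric can be observed directly with sizeof/offsetof; the sketch below assumes a typical 64-bit ABI (4-byte int, 8-byte pointers and doubles), matching the text's assumption:

```c
#include <assert.h>
#include <stddef.h>

/* Mirror of struct A from Listing 1(a); on a typical 64-bit ABI the
 * compiler inserts three padding bytes after `c` so `i` is 4-aligned. */
struct A {
    char c;
    int i;
    char buf[64];
    void (*fp)(void);
    double d;
};

/* Struct density as defined in the text: sum of field sizes divided by
 * the total struct size, padding included (density < 1 means padding). */
static double struct_density_A(void) {
    size_t fields = sizeof(char) + sizeof(int) + 64
                  + sizeof(void (*)(void)) + sizeof(double);
    return (double)fields / (double)sizeof(struct A);
}
```

On x86-64 Linux, for example, struct A occupies 88 bytes while its fields sum to 85, giving a density just under 1, i.e., three harvestable padding bytes.
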
Figure 3. Struct density histogram of (a) SPEC CPU2006 C and C++ benchmarks and (b) the V8 JavaScript engine. More than 40% of the structs have at least one padding byte.

Figure 4.
Average performance overhead with additional paddings (one byte to seven bytes) inserted for every field within structs (and classes) of SPEC CPU2006 C and C++ benchmarks.

sized padding bytes, also referred to as security bytes, between every field of a struct or member of a class as in Listing 1(c). Random sized security bytes are chosen to provide a probabilistic defense, as fixed sized security bytes can be jumped over by an attacker once s/he identifies the actual size (and the exact memory layout). Additionally, by carefully choosing the minimum and maximum sizes for insertion, we can keep the average security byte size small (such as two or three bytes). Intuitively, the higher the unpredictability (or randomness) there is within the memory layout, the higher the security level we can offer.

While the full strategy provides the widest coverage, not all of the security bytes provide the same security utility. For example, basic data types such as char and int cannot be easily overflowed past their bounds. The idea behind the intelligent insertion strategy is to prioritize insertion of security bytes into security-critical locations, as presented in Listing 1(d). We choose data types which are most prone to abuse by an attacker via overflow type accesses: (1) arrays and (2) data and function pointers. In the example in Listing 1(d), the array buf[64] and the function pointer fp are protected with random sized security bytes. While it is possible to utilize padding bytes present between other data types without incurring memory overheads, doing so would come at an additional performance overhead.

In comparison to opportunistic harvesting, the other more secure strategies (e.g., the full strategy) come at an additional performance overhead. We analyze the performance trend in order to decide how many security bytes can be reasonably inserted. For this purpose we developed an LLVM pass which pads every field of a struct with fixed size paddings. We measure the performance of SPEC CPU2006 benchmarks by varying the padding size from one byte to seven bytes. The detailed evaluation environment and methodology is described later in Section 8.

Figure 4 demonstrates the average slowdown when inserting additional bytes for harvesting. As expected, we can see the performance overheads increase as we increase the padding size, mainly due to ineffective cache usage. On average the slowdown is 3.0% for one byte and 7.6% for seven bytes of padding. The figure presents the ideal (lower bound) performance overhead when fully inserting security bytes into compound data types; the hardware and software modifications we introduce add additional overheads on top of these numbers. We strive to provide a mechanism that allows the user to tune the security level at the cost of performance, and thus explore several security byte insertion strategies to reduce the performance overhead in the paper.

The Califorms framework consists of multiple components we discuss in the following sections:

• Architecture Support.
An ISA extension of a machine instruction called CFORM that performs califorming (i.e., (un)setting security bytes) of cache lines, and a privileged Califorms exception which is raised upon misuse of security bytes (Section 4).

• Microarchitecture Design. New cache line formats that enable low cost access to the metadata — we propose different Califorms for the L1 cache vs. the L2 cache and beyond (Section 5).

• Software Design. Compiler, memory allocator and operating system extensions which insert the security bytes at compile time and manage the security bytes via the CFORM instruction at runtime (Section 6).

At compile time, each compound data type, a struct or a class, is examined and security bytes are added according to a user defined insertion policy, viz. opportunistic, full or intelligent, by a source-to-source translation pass. When we run the binary with security bytes, and compound data type instances are created in the heap dynamically, we use a new version of malloc that issues CFORM instructions to set the security bytes after the space is allocated. When the CFORM instruction is executed, the cache line format is transformed at the L1 cache controller (assuming a cache miss) and is inserted into the L1 data cache. Upon an L1 eviction, the L1 cache controller re-califorms the cache line to meet the Califorms of the L2 cache.

While we add additional metadata storage to the caches, we refrain from doing so for main memory and persistent storage to keep the changes local within the CPU core. When a califormed cache line is evicted from the last-level cache to main memory, we keep the cache line califormed and store the additional one metadata bit into spare ECC bits, similar to Oracle's ADI [3].* When a page is swapped out from main memory, the page fault handler stores the metadata for all the cache lines within the page into a reserved address space managed by the operating system; the metadata is reclaimed upon swap in. Therefore, our design keeps the cache line format califormed throughout the memory hierarchy. A califormed cache line is un-califormed only when the corresponding bytes cross a boundary where the califormed data cannot be understood by the other end, such as writing to I/O (e.g., pipe, filesystem or network socket). Finally, when an object is freed, the freed bytes are califormed and zeroed for offering temporal safety.

At runtime, when a rogue load or store accesses a califormed byte, the hardware returns a privileged, precise security exception to the next privilege level, which can take any appropriate action including terminating the program.

* ADI stores four bits of metadata per cache line for allocation granularity enforcement, while Califorms stores one bit for sub-allocation granularity enforcement.
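The allocator-side flow above can be modeled in software; the sketch below is illustrative only (a bump allocator plus a shadow bitmap standing in for the hardware metadata, with cform_model() playing the role of the CFORM instruction):

```c
#include <stdint.h>
#include <string.h>

/* Software model of the Califorms-aware malloc: after allocating, it
 * "issues CFORM" over the security byte locations. The per-byte
 * metadata is emulated with a shadow bitmap, not real cache lines. */
#define HEAP_BYTES 4096
static uint8_t heap[HEAP_BYTES];
static uint8_t shadow[HEAP_BYTES];   /* 1 = security byte (emulated) */
static size_t  brk_off;              /* trivial bump-pointer heap    */

/* Emulates one CFORM over a byte range: mark bytes as security bytes. */
static void cform_model(size_t off, size_t len) {
    memset(&shadow[off], 1, len);
}

/* Allocate `size` bytes; mark the offsets in sec[0..n) (relative to
 * the allocation) as security bytes, as the modified malloc would. */
static void *califorms_malloc(size_t size, const size_t *sec, size_t n) {
    size_t base = brk_off;
    brk_off += size;
    for (size_t i = 0; i < n; i++)
        cform_model(base + sec[i], 1);
    return &heap[base];
}

/* In this model, a blacklisted access is simply a shadow-bit hit. */
static int is_blacklisted(const void *p) {
    return shadow[(const uint8_t *)p - heap];
}
```
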
Table 1. K-map for the CFORM instruction. X represents "Don't Care".

    Initial state    R2 = X, R3 = Disallow    R2 = Unset, R3 = Allow    R2 = Set, R3 = Allow
    Regular byte     Regular byte             Exception                 Security byte
    Security byte    Security byte            Regular byte              Exception

The format of the instruction is "CFORM R1, R2, R3". The value in register R1 points to the starting (cache aligned) address in the virtual address space, denoting the start of the 64B chunk which fits in a single 64B cache line. Table 1 represents a K-map for the CFORM instruction. The value in register R2 indicates the attributes of said region represented in a bit vector format (1 to set and 0 to unset the security byte). The value in register R3 is a mask for the corresponding 64B region, where 1 allows and 0 disallows changing the state of the corresponding byte. The mask is used to perform partial updates of metadata within a cache line. We throw a privileged Califorms exception when the CFORM instruction tries to set a security byte at an existing security byte location, or to unset a security byte from a normal byte.
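The per-byte transition function of Table 1 can be written out as a small behavioral model (the enum and function names below are ours, not part of the ISA):

```c
#include <stdbool.h>

typedef enum { REGULAR, SECURITY } byte_state;
typedef enum { OK, CALIFORMS_EXCEPTION } cform_result;

/* Behavioral sketch of Table 1's K-map for one byte:
 * r2 = requested state (1 = set security byte, 0 = unset),
 * r3 = mask bit (1 = allow changing this byte, 0 = don't care). */
static cform_result cform_byte(byte_state *s, bool r2, bool r3) {
    if (!r3) return OK;                      /* masked: state unchanged */
    if (r2) {                                /* request: set            */
        if (*s == SECURITY) return CALIFORMS_EXCEPTION;
        *s = SECURITY;
    } else {                                 /* request: unset          */
        if (*s == REGULAR) return CALIFORMS_EXCEPTION;
        *s = REGULAR;
    }
    return OK;
}
```
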
The CFORM instruction is treated similarly to a store instruction in the processor pipeline, where it first fetches the corresponding cache line into the L1 data cache upon an L1 miss (assuming a write-allocate cache policy). Next, it manipulates the bits in the metadata storage to appropriately set or unset the security bytes.

When the hardware detects an access violation, it throws a privileged exception once the instruction becomes non-speculative. There are some library functions which violate the aforementioned security byte semantics, such as memcpy, so we need a way to suppress the exceptions. In order to whitelist such functions, we manipulate the exception mask registers and let the exception handler decide whether to suppress the exception or not. Although privileged exception handling is more expensive than handling user-level exceptions (because it requires a context switch to the kernel), we stick with the former to limit the attack surface. We rely on the fact that the exception itself is a rare event and would have negligible effect on performance.

* We also investigate the possibility of using a variant of the CFORM instruction which does not store the modified cache line into the L1 data cache, just like the non-temporal (or streaming) load/store instructions (e.g., MOVNTI, MOVNTQ, etc.) in Section 6.1.
Figure 5. Califorms-bitvector: L1 Califorms implementation using a bit vector that indicates whether each byte is a security byte. HW overhead of 8B per 64B cache line.
Figure 6. Pipeline diagram for the L1 cache hit operation. The shaded components correspond to Califorms.
The microarchitectural support for our technique aims to keep the common case fast: the L1 cache uses the straightforward scheme of having one bit of additional storage per byte. All califormed cache lines are transformed to the straightforward scheme at the L1 data cache controller so that typical loads and stores which hit in the L1 cache do not have to perform address calculations to figure out the location of the original data (which is required for the Califorms of the L2 cache and beyond). This design decision guarantees that for the common case the latencies will not be affected by the security functionality. Beyond the L1, the data is stored in the optimized califormed format, i.e., one bit of additional storage for the entire cache line. The transformation happens when the data is filled in or spilled from the L1 data cache (between the L1 and L2), and adds minimal latency to the L1 miss latency. For main memory, we store the additional bit per cache line in the DRAM ECC spare bits, thus completely removing any cycle time impact on DRAM access or modifications to the DIMM architecture.
To satisfy the L1 design goal, we consider a naive (but low latency) approach which uses a bit vector to identify which bytes are security bytes in a cache line. Each bit of the bit vector corresponds to a byte of the cache line and represents its state (normal byte or security byte). Figure 5 presents a schematic view of this implementation, califorms-bitvector. The bit vector requires 64 bits (8B) per 64B cache line, which adds a 12.5% storage overhead for just the L1-D caches (comparable to ECC overhead for reliability).
Figure 7. Califorms-sentinel, which stores a bit vector in security byte locations. HW overhead of 1 bit per 64B cache line.

Figure 6 shows the L1 data cache hit path modifications for Califorms. If a load accesses a califormed byte (which is determined by reading the bit vector), an exception is recorded to be processed when the load is ready to be committed. Meanwhile, the load returns a pre-determined value for the security byte (in our design the value 0, which is the value that the memory region is initialized to upon deallocation). The reason to return the pre-determined value is to avoid a speculative side channel attack to identify security byte locations, and is discussed in greater detail in Section 7. On store accesses to califormed bytes we report an exception before the store commits.
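The load behavior on the L1 hit path can be summarized as a small behavioral sketch (a software model, not the RTL of Figure 6):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the L1 hit path: the per-byte metadata bit vector is
 * checked alongside the data access. A load that touches a security
 * byte records an exception (raised at commit) and returns the
 * pre-determined value 0 to hide security byte locations. */
static uint8_t l1_load(const uint8_t line[64], uint64_t meta_bitvec,
                       unsigned idx, bool *exception) {
    if (meta_bitvec & (1ULL << idx)) {  /* security byte accessed */
        *exception = true;
        return 0;                        /* never leak real contents */
    }
    return line[idx];
}
```
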
For L2 and beyond, we take a different approach that allows us to recognize whether each byte is a security byte with fewer bits, as using the L1 metadata format throughout the system would increase the cache area overhead by 12.5%, which may not be acceptable. Figure 7 illustrates our proposed califorms-sentinel, which has a 1-bit (or 0.2%) metadata overhead per 64B cache line.

The key insight that enables these savings is the following observation: the number of addressable bytes in a cache line is less than what can be represented by a single byte (we only need six bits). For example, let us assume that there is (at least) one security byte in a 64B cache line. Considering byte granular protection, there are at most 63 unique values (bytes) that non-security bytes can have. Therefore, we are guaranteed to find a six bit pattern which is not present in any of the normal bytes' least (or most) significant six bits. We use this pattern as a sentinel value to represent the security bytes within the cache line.

If we store the six bit sentinel value as additional metadata, the overhead will be seven bits (six bits plus one bit to specify if the cache line is califormed) per cache line. Instead, we propose a new cache line format which stores the sentinel value within a security byte to reduce the metadata overhead
 1: Read the Califorms metadata for the evicted line and OR them
 2: if result is 0 then
 3:   Evict the line as is and set Califorms bit to 0
 4: else
 5:   Set Califorms bit to 1
 6:   Perform following operations on the cache line:
 7:     Scan least 6-bit of every byte to determine sentinel
 8:     Get locations of 1st 4 security bytes
 9:     Store data of 1st 4 bytes in locations obtained in 8:
10:     Fill the 1st 4 bytes based on Figure 7
11:     Use the sentinel to mark the remaining security bytes
12: end

Algorithm 1. Califorms conversion from the L1 cache (califorms-bitvector) to the L2 cache (califorms-sentinel).
 1: Read the Califorms bit for the inserted line
 2: if result is 0 then
 3:   Set the Califorms metadata bit vector to [0]
 4: else
 5:   Perform following operations on the cache line:
 6:     Check the least significant 2-bit of byte 0
 7:     Set the metadata of byte[Addr[0-3]] to 1 based on 6:
 8:     Set the metadata of byte[Addr[byte == sentinel]] to 1
 9:     Set the data of byte[0-3] to byte[Addr[0-3]]
10:     Set the new locations of byte[Addr[0-3]] to zero
11: end

Algorithm 2. Califorms conversion from the L2 cache (califorms-sentinel) to the L1 cache (califorms-bitvector).

down to one bit per cache line. The idea is to use four different formats depending on the number of security bytes in the cache line, as we explain below.

Califorms-sentinel stores the metadata into the first four bytes (at most) of the 64B cache line. Two bits of the 0th byte are used to specify the number of security bytes within the cache line: 00, 01, 10, and 11 represent one, two, three, and four or more security bytes, respectively. If there is only one security byte in the cache line, we use the remaining six bits of the 0th byte to specify the location of the security byte (and the original value of the 0th byte is stored in the security byte). Similarly, when there are two or three security bytes in the cache line, we use the bits of the 1st and 2nd bytes to locate them. The key observation is that we gain two bits per security byte, since we only need six bits to specify a location in the cache line. Therefore, when we have four security bytes, we can locate four addresses and still have six bits remaining in the first four bytes. These remaining six bits can be used to store a sentinel value, which allows us to have any number of additional security bytes.

Although the sentinel value depends on the actual values within the 64B cache line, it works naturally with a write-allocate L1 cache, which is the most commonly used cache allocation policy in modern microprocessors. The cache line format can be converted upon L1 cache eviction and insertion (califorms-bitvector to/from califorms-sentinel), and the sentinel value only needs to be found upon L1 cache eviction. Also, it is important to note that califorms-sentinel supports critical-word first delivery, since the security byte locations can be quickly retrieved by scanning only the first 4B of the first 16B flit.
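The two key califorms-sentinel operations, finding an unused 6-bit pattern and displacing byte 0 into the security byte slot, can be sketched in software (an illustrative model; the hardware performs this in the spill/fill logic, and the encode/decode below cover only the simplest single-security-byte case with a nonzero location):

```c
#include <stdint.h>

/* With at most 63 non-security bytes and 64 candidate 6-bit patterns,
 * an unused pattern (the sentinel) always exists. */
static int find_sentinel(const uint8_t line[64], const uint8_t is_sec[64]) {
    uint64_t used = 0;
    for (int i = 0; i < 64; i++)
        if (!is_sec[i])
            used |= 1ULL << (line[i] & 0x3F);   /* least 6 bits */
    for (int v = 0; v < 64; v++)
        if (!(used & (1ULL << v)))
            return v;
    return -1;   /* only reachable with zero security bytes */
}

/* One security byte at nonzero location `loc`: the low 2 bits of
 * byte 0 hold the count pattern 00, the next 6 bits hold `loc`, and
 * the displaced original byte 0 is stashed in the (dead) slot. */
static void encode_one(uint8_t line[64], unsigned loc) {
    line[loc] = line[0];               /* stash original byte 0 */
    line[0]   = (uint8_t)(loc << 2);   /* count = 00, location  */
}

static void decode_one(uint8_t line[64]) {
    unsigned loc = (line[0] >> 2) & 0x3F;
    line[0]   = line[loc];             /* restore byte 0        */
    line[loc] = 0;                     /* security bytes read 0 */
}
```
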
Algorithms 1 and 2 describe the high-level process used for converting from the L1 to the L2 Califorms format and vice versa.

Figure 8 shows the logic diagram for the spill module. The circled numbers refer to the corresponding steps in Algorithm 1. In the top-left corner, the Califorms metadata for the evicted line is ORed to construct the L2 cache (califorms-sentinel) metadata bit. The bottom-right square details the process of determining the sentinel. We scan the least 6 bits of every byte, decode them, and OR the outputs to construct a used-values vector. The used-values vector is then processed by a Find-index block to get the sentinel (line 7). The Find-index block takes a 64-bit input vector and searches for the index of the first zero. It is constructed using 64 shift blocks followed by a single comparator.

The top-right corner of Figure 8 shows the logic for getting the locations of the first four security bytes (line 8). It consists of four successive combinational Find-index blocks (each detecting one security byte) in our evaluated design. This logic can be easily pipelined into four stages, if needed, to completely hide the latency of the spill process in the pipeline. Finally, we store the data of the first four bytes in the locations obtained from the Find-index blocks and fill the same four bytes based on Figure 7.

Figure 9 shows the logic diagram for the fill module, as summarized in Algorithm 2. The blue (==) blocks are constructed using logic comparators. The Califorms bit of the L2 inserted line is used to control the value of the L1 cache (califorms-bitvector) metadata. The first two bits of the L2 inserted line are used as inputs for the comparators to decide on the metadata bits of the first four bytes as specified in Figure 7. Only if those two bits are 11 is the sentinel value read from the fourth byte and fed, with the least 6 bits of each byte, to 60 comparators simultaneously to set the rest of the L1 metadata bits. Such parallelization reduces the latency impact of the fill process. Since the
CFORM instruction updates the architectural state (writes values), it is functionally a store instruction and handled as such in the pipeline. However, there is a key difference: unlike a store instruction, the CFORM instruction should not forward its value to a younger load instruction whose address matches within the load/store queue (LSQ), but should instead return the value zero. This functionality is required to provide tamper-resistance against side-channel attacks. Additionally, upon an address match, both load and store instructions subsequent to an in-flight CFORM instruction are marked for a Califorms exception (which is thrown when the instruction is committed).

In order to detect an address match in the LSQ with a CFORM instruction, first a cache line address should be matched with all the younger instructions. Subsequently, upon a match, the value stored in the LSQ for the CFORM instruction, which contains the mask value indicating
Figure 8. Logic diagram for Califorms conversion from the L1 cache (califorms-bitvector) to the L2 cache (califorms-sentinel). The green Find-index blocks are constructed using 64 shift blocks followed by a single comparator. The circled numbers refer to the corresponding steps in Algorithm 1.
Figure 9. Logic diagram for Califorms conversion from the L2 cache (califorms-sentinel) to the L1 cache (califorms-bitvector), as described in Algorithm 2. The blue (==) blocks are constructed using logic comparators.

to-be-califormed bytes, is used to confirm the final match. To facilitate a match with a
CFORM instruction, each LSQ entry should be associated with a bit to indicate whether the entry contains a CFORM instruction. Detecting a complete match may take multiple cycles; however, a legitimate load/store instruction should never be forwarded a value from a CFORM instruction, and thus the store-to-load forwarding from a CFORM instruction is not on the critical path of the program (i.e., its latency should not affect performance, and we do not evaluate its effect in our evaluation). Alternately, if LSQ modifications are to be avoided, the CFORM instructions can be surrounded by memory-serializing instructions (i.e., ensure that CFORM instructions are the only in-flight memory instructions).
In the following, we describe the compiler support, memory allocator changes, and operating system changes needed to support Califorms.
We can consider two approaches to applying security bytes: (1) Dirty-before-use. Unallocated memory has no security bytes. We set security bytes upon allocation and unset them upon deallocation; or (2) Clean-before-use. Unallocated memory remains filled with security bytes at all times. We clear the security bytes (in legitimate data locations) upon allocation and set them upon deallocation. Ensuring temporal memory safety in the heap remains a non-trivial problem [1]. We therefore follow the clean-before-use approach in the heap, so that deallocated memory regions remain protected by califormed security bytes. Additionally, in order to provide temporal memory safety, we do not reallocate recently freed regions until the heap is sufficiently consumed (quarantining). Compared to the heap, the security benefits are limited for the stack, since temporal attacks on the stack (e.g., use-after-return attacks) are much rarer. Hence, we apply the dirty-before-use scheme on the stack. It is natural to use the non-temporal CFORM instruction when deallocating a memory region; a deallocated region is not meant to be used by the program, so polluting the L1 data cache with its lines is harmful and should be avoided. The use of non-temporal instructions, which should provide better performance, is not evaluated in this paper.
Our compiler-based instrumentation infers where to place security bytes within target objects, based on their type layout information. The compiler pass supports three insertion policies: the first, opportunistic, policy inserts security bytes into existing padding bytes within objects, while the other two modify object layouts to introduce randomly sized security byte spans following the full or intelligent strategies described in Section 2. The first policy aims at retaining interoperability with external code modules (e.g., shared libraries) by avoiding type layout modification. Where this is not a concern, the latter two policies offer stronger security coverage, exhibiting a tradeoff between security and performance. We need the following support in the operating system:

• Privileged Exceptions.
As the Califorms exception is privileged, the operating system needs to handle it properly, as with other privileged exceptions (e.g., page faults). We also assume the faulting address is passed in an existing register so that it can be used for reporting/investigation purposes. Additionally, for the sake of usability and backwards compatibility, we have to accommodate copying operations similar in nature to memcpy. For example, a simple struct-to-struct assignment could trigger this behavior, leading to a potential breakdown of califormed software. Hence, in order to maintain usability, we allow whitelisting functionality to suppress the exceptions. This is done by issuing a privileged store instruction that modifies the value of exception mask registers before entering and after exiting the relevant piece of code. We discuss the implications of this design choice in Section 7.

• Page Swaps.
As we have discussed in Section 3, data with security bytes is stored in main memory in a califormed format. When a page with califormed data is swapped out from main memory, the page fault handler needs to store the metadata for the entire page into a reserved address space managed by the operating system; the metadata is reclaimed upon swap-in. The kernel has enough address space in practice (the kernel's virtual address space is 128TB on 64-bit Linux with a 48-bit virtual address space) to store the metadata for all processes on the system, since the metadata for a 4KB page consumes only 8B.
For the security evaluation of this work, we assume a threat model comparable to that used in contemporary related work. We assume the victim program has one or more vulnerabilities that an attacker can exploit to gain arbitrary read and write capabilities in memory. Furthermore, we assume that the adversary has access to the source code of the program, and is therefore able to glean all source-level information and/or deterministic compilation results from it (e.g., find code gadgets within the program and determine non-califormed layouts of data structures). However, the adversary does not have access to the host binary (e.g., server-side applications). Finally, we assume that all hardware is trusted: it does not contain and/or is not subject to bugs arising from exploits such as physical or glitching attacks. Due to their recent rise in relevance, however, we keep side channel attacks within the purview of our threat model. Specifically, we accommodate attack vectors seeking to leak the location and value of security bytes.

• Metadata Tampering Attacks.
A key feature of Califorms as a metadata-based safety mechanism is the absence of programmer-visible metadata in the general case (apart from a metadata bit in the page information maintained by higher-privilege software). Beyond the implications for storage overhead, this also means that our technique is immune to attacks that explicitly aim to leak or tamper with the metadata to bypass the respective defense. This, in turn, implies a smaller attack surface so far as software maintenance of metadata is concerned.

• Bit-granularity Attacks.
Califorms's capability for fine-grained memory protection is the key enabler of intra-object overflow detection. However, our byte-granular mechanism is not sufficient to protect bit-fields without effectively turning them into char bytes. This should not be a major detraction, since security bytes can still be added around composites of bit-fields.

• Heterogeneous Architectural Attacks.
Califorms's hardware modifications affect the memory hierarchy. Hence, its protection is lost whenever one of its layers is bypassed (e.g., when heterogeneous architectures or DMA are used). Mitigating this requires that those mechanisms always respect the security byte semantics, by propagating security bytes along the respective memory structures and detecting accesses to them. If the algorithm used for califorming is also used by accelerators, then attacks through heterogeneous components can be averted as well.

• Side-Channel Attacks.
Our design takes multiple steps to be resilient to side channel attacks. Firstly, we purposefully avoid timing variances introduced by our hardware modifications in order to avoid timing-based side channel attacks. Additionally, to avoid speculative execution side channels a la Spectre [12], our design returns zero on a load to a security byte, preventing speculative disclosure of metadata. We augment this further by requiring that deallocated objects (heap or stack) be zeroed out in software [13]. This avoids the following attack scenario: suppose the attacker somehow knows that a padding location should contain a non-zero value (for instance, because the object allocated at the same location prior to the current object had non-zero values). While speculatively disclosing memory contents of the object, the attacker discovers that the padding location contains a zero instead, and can thus infer that the padding there contains a security byte. If deallocations are accompanied by zeroing, however, this assumption does not hold.

• Coverage-Based Attacks.
To caliform the padding bytes in an object, we need to know the precise type information of the allocated object. This is not always possible in C-style programs where void* allocations may be used. In these cases, the compiler may not be able to infer the correct type, in which case intra-object support may be skipped for such allocations. Similarly, our metadata insertion policies (viz., intelligent and full) require changes to type layouts. This means that interactions with external modules that have not been compiled with Califorms support may need (de)serialization to remain compatible. For an attacker, such points in execution may appear lucrative, because the inserted security bytes get stripped away during those short periods. We note, however, that the opportunistic policy can still remain in place to offer some protection. On the other hand, for interactions that remain oblivious to type layout modifications (e.g., passing a pointer to an object that remains opaque within the external module), our hardware-based implicit checks have the benefit of persistent tampering protection, even across binary module boundaries.

• Whitelisting Attacks.
Our concession of allowing whitelisting of certain functions was necessary to make Califorms usable in common environments without requiring significant source modifications. However, this also creates a vulnerability window wherein an adversary can piggyback on these functions to bypass our protection. To confine this vector, we keep the number of whitelisted functions as small as possible.

• Derandomization Attacks.
Since Califorms can be bypassed if an attacker can guess a security byte's location, it is crucial that security bytes be placed unpredictably. To carry out a guessing attack, the attacker first needs to obtain the virtual memory address of the object they want to corrupt, and then overwrite a certain number of bytes within that object. To learn the address of the object of interest, the attacker typically has to scan the process's memory: the probability of scanning without touching any of the security bytes is (1 − P/N)^O, where O is the number of allocated objects, N is the size of each object, and P is the number of security bytes within each object. With 10% padding (P/N = 0.1), by the time O reaches 250 the probability of a successful scan drops to roughly 4 × 10^−12. If the attacker can somehow reduce O to 1, which represents the ideal case for the attacker, the probability of guessing the element of interest is 1/7 for a single padding (since we insert security bytes one to seven bytes wide), compounding to (1/7)^n as the number of paddings to be guessed, n, increases.

The randomness is, however, introduced statically, akin to the randstruct plugin introduced in recent Linux kernels, which randomizes the structure layout of specified structures (though unlike Califorms it does not offer detection of rogue accesses) [14, 15]. The static nature of the technique may make it prone to brute force attacks like BROP [16], which repeatedly crash the program until the correct configuration is guessed. This could be prevented by having multiple versions of the same binary with different padding sizes, or simply by better logging where possible. Another mitigating factor is that BROP attacks require a specific type of program semantics, namely automatic restart-after-crash with the same memory layout. Applications with these semantics can, in our case, be modified to spawn with a different padding layout and still satisfy application-level requirements.

• Cache Access Latency Impact of Califorms.
Califorms adds additional state and operations to the L1 data cache and the interface between the L1 and L2 caches. The goal of this section is to evaluate the access latency impact of the additional state and operations described in Section 5. Qualitatively, the metadata area overhead of L1 Califorms is 12.5%, and access latency should not be impacted, since the metadata lookup can happen in parallel with the L1 tag access; the L1 to/from L2 califorms conversions should also be simple enough that their latency can be completely hidden. However, the metadata area overhead can increase the L1 tag access latency, and the conversions might add a little latency. Without loss of generality, we measure the access latency impact of adding the califorms-bitvector to a 32KB direct-mapped L1 cache, in the context of a typical energy-optimized tag, data, formatting L1 pipeline with multicycle fill/spill handling. For the implementation we use the 65nm TSMC core library, and generate the SRAM arrays with the ARM Artisan memory compiler. Table 2 summarizes the results for the L1 Califorms (califorms-bitvector).

Table 2. Area, delay and power overheads of Califorms (GE represents gate equivalent). L1 Califorms (califorms-bitvector) adds negligible delay and power overheads to the L1 cache access.

Design       | Area (GE)  | Delay (ns) | Power (mW) | L1 overheads: Area/Delay/Power (%) | Fill overheads: Area (GE)/Delay (ns)/Power (mW) | Spill overheads: Area (GE)/Delay (ns)/Power (mW)
Baseline     | 347,329.19 | 1.62       | 15.84      | — / — / —                          | — / — / —                                       | — / — / —
L1 Califorms | 412,263.87 | 1.65       | 16.17      | 18.69 / 1.85 / 2.12                | 8,957.16 / 1.43 / 0.18                          | 34,561.80 / 5.50 / 0.52

As expected, the overheads associated with the califorms-bitvector are minor in terms of delay (1.85%) and power consumption (2.12%). We found the SRAM area to be the dominant component of the total cache area (around 98%), where the overhead was 18.69%, higher than 12.5%. The results for the fill/spill modules are reported separately on the right-hand side of Table 2. The latency impact of the fill operation is within the access period of the L1 design; thus, the califorming operation can be folded completely within the pipeline stages responsible for bringing cache lines from L2 to L1.

The timing delay of the (less performance-sensitive) spill operation is larger than that of the fill operation (5.5 ns vs. 1.4 ns), as we use pure combinational logic to construct the califorms-sentinel format in one cycle, as shown in Figure 8. This cycle period can be reduced by dividing the operations of Algorithm 1 (lines 7 to 11) into two or more pipeline stages. For instance, getting the locations of the first four security bytes (line 8) consists of four successive combinational blocks (each detecting one security byte) in our evaluated design. This logic can easily be pipelined into four stages. Therefore we believe that the latency of both the fill and spill operations can be minimal (or completely hidden) in the pipeline.

• Performance with Additional Cache Access Latency.
Our results from the VLSI implementation imply that there will be no additional L2/L3 latency imposed by implementing Califorms. However, this might not hold depending on several implementation details (e.g., target clock frequency), so we pessimistically assume that L2/L3 accesses incur an additional one-cycle latency overhead. In order to evaluate the performance impact of the additional latency posed by Califorms, we perform detailed microarchitectural simulations. We use ZSim [17] as the processor simulator and use PinPoints [18] with Intel Pin [19] to select representative simulation regions of the SPEC CPU2006 benchmarks with ref inputs, compiled with Clang version 6.0.0 with "-O3 -fno-strict-aliasing" flags. We do not warm up the simulator upon executing each SimPoint region, but instead use a relatively large interval length of 500M instructions to avoid warmup issues. We set the MaxK used in SimPoint region selection to 30. (For some benchmark-input pairs we have seen discrepancies in the number of instructions measured by PinPoints vs. ZSim, and thus the appropriate SimPoint regions might not be simulated. Those inputs are foreman_ref_encoder_main for h264ref and pds-50 for soplex. Also, due to time constraints, we could not complete executing SimPoint for h264ref with the sss_encoder_main input and excluded it from the evaluation.)

Table 3 shows the parameters of the processor, an Intel Westmere-like out-of-order core which has been validated against a real system, whose performance and microarchitectural events commonly agree to within 10% [17]. We evaluate the performance when both L2 and L3 caches incur an additional latency of one cycle.

Table 3. Hardware configuration of the simulated system.

Core           | x86-64 Intel Westmere-like OoO core at 2.27GHz
L1 inst. cache | 32KB, 4-way, 3-cycle latency
L1 data cache  | 32KB, 8-way, 4-cycle latency
L2 cache       | 256KB, 8-way, 7-cycle latency
L3 cache       | 2MB, 16-way, 27-cycle latency
DRAM           | 8GB, DDR3-1333

Figure 10. Slowdown with additional one-cycle access latency for both L2 and L3 caches.

As shown in Figure 10, slowdowns range from 0.24% (hmmer) to 1.37% (xalancbmk). The average performance slowdown is 0.83%, which is negligible and well within the range of error when executed on real systems.
Our evaluations so far revealed that the hardware modifications required to implement Califorms add little or no performance overhead. Here, we evaluate the overheads incurred by the two software-based changes required to enable intra-object memory safety with Califorms: the effect of underutilized memory structures (e.g., caches) due to additional security bytes, and the additional work necessary to issue CFORM instructions (and the overhead of executing the instructions themselves).

• Evaluation Setup.
We run the experiments on an Intel Skylake-based Xeon Gold 6126 processor running at 2.6GHz with RHEL Linux 7.5 (kernel 3.10). We omit dealII and omnetpp since the shared libraries installed on RHEL are too old to execute these two Califorms-enabled binaries, and gcc since it fails when executed with the memory allocator with inter-object spatial and temporal memory safety support. The remaining 16 SPEC CPU2006 C/C++ benchmarks are compiled with our modified Clang version 6.0.0 with "-O3 -fno-strict-aliasing" flags. We use the ref inputs and run to completion. We run each benchmark-input pair five times and use the shortest execution time as its performance. For benchmarks with multiple ref inputs, the sum of the execution times of all the inputs is used as their execution time.

Figure 11. Slowdown of the opportunistic policy, and the full insertion policy with random sized security bytes (with and without CFORM instructions). The average slowdowns of the opportunistic and full insertion policies are 6.2% and 14.2%, respectively.

Figure 12. Slowdown of the intelligent insertion policy with random sized security bytes (with and without CFORM instructions). The average slowdown is 2.0%.

We estimate the performance impact of executing a CFORM instruction by emulating it with a dummy store instruction that writes some value to the corresponding cache line's padding byte. Since one CFORM instruction can caliform the entire cache line, issuing one dummy store instruction per to-be-califormed cache line suffices. In order to issue the dummy stores, we implement an LLVM pass that instruments the code to hook into memory allocations and deallocations. We then retrieve the type information to locate the padding bytes, calculate the number of dummy stores and the addresses they access, and finally emit them. Therefore, all the software overheads we need to pay to enable Califorms are accounted for in our evaluation.

For the random sized security bytes, we evaluate three variants: we fix the minimum size to one byte while varying the maximum size to three, five and seven bytes (i.e., on average the amounts of security bytes inserted are two, three and four bytes, respectively). In addition, in order to account for the randomness introduced by the compiler, we generate three different versions of binaries for the same setup (e.g., three versions of astar with random sized paddings of minimum one byte and maximum three bytes). The error bars in the figures represent the minimum and maximum execution times among 15 executions (three binaries × five runs), and the average of the execution times is represented as the bar. (To compute the average we use the arithmetic mean of the speedup, the execution time of the original system divided by that of the modified system; in other words, we are interested in a condition where the workloads are not fixed and all types of workloads are equally probable on the target system [20, 21].)

• Performance of the Opportunistic and Full Insertion Policies with
CFORM Instructions. Figure 11 presents the slowdown incurred by three sets of strategies: the full insertion policy (with random sized security bytes) without CFORM instructions, the opportunistic policy with CFORM instructions, and the full insertion policy with CFORM instructions. Since the first strategy does not execute CFORM instructions it does not offer any security coverage, but it is shown as a reference to showcase the performance breakdown of the third strategy (cache underutilization vs. executing CFORM instructions).

First, we focus on the three variants of the first strategy, shown in the three leftmost bars. We can see that different sizes of random sized security bytes do not make a large difference in terms of performance. The average slowdowns of the three variants for the policy without CFORM instructions are 5.5%, 5.6% and 6.5%, respectively. This is backed up by our results shown in Figure 4, where the average slowdowns with additional padding of two, three and four bytes range from 5.4% to 6.2%. Therefore, in order to achieve higher security coverage without losing performance, using random sized bytes of minimum one byte and maximum seven bytes is promising. When we focus on individual benchmarks, we can see that a few, including h264ref, mcf, milc and omnetpp, incur noticeable slowdowns (ranging from 15.4% to 24.3%).

Next, we examine the opportunistic policy with CFORM instructions, shown in the middle (fourth) bar. Since this strategy does not add any additional security bytes, the overheads are purely due to the work required to set up and execute CFORM instructions. The average slowdown of this policy is 7.9%. Some benchmarks encounter a slowdown of more than 10%, namely gobmk, h264ref and perlbench. The overheads are due to frequent allocations and deallocations made during program execution, where the programs have to calculate and execute CFORM instructions upon every event (since every compound data type will be/was califormed). For instance, perlbench is notorious for being malloc-intensive, and is reported as such elsewhere [2].

Lastly, the third strategy, the full insertion policy with
CFORM instructions, offers the highest security coverage in a Califorms-based system, with the highest average slowdown of 14.0% (with random sized security bytes of maximum seven bytes). Nearly half (seven out of 16) of the benchmarks encounter a slowdown of more than 10%, which might not be suitable for performance-critical environments, and thus the user might want to consider the following intelligent insertion policy.

• Performance of the Intelligent Insertion Policy with CFORM Instructions. Figure 12 shows the slowdowns of the intelligent insertion policy with random sized security bytes (with and without CFORM instructions, in the same spirit as Figure 11). First, we focus on the strategy without executing CFORM instructions (the three bars on the left). The performance trend is similar, in that the three variants with different random sizes show little performance difference; the average slowdown is 0.2% with random sized security bytes of maximum seven bytes. We can see that none of the programs incurs a slowdown greater than 5%. Finally, with CFORM instructions (the three bars on the right), gobmk and perlbench have slowdowns greater than 5% (16.1% for gobmk and 7.2% for perlbench). The average slowdown is 1.5%; considering its security coverage and performance overheads, the intelligent policy might be the most practical option for many environments.
Implementations of various safety mechanisms in hardware were very popular from the 70s through the 90s, introducing crucial legacy techniques such as capabilities, segmentation and virtual memory. Subsequently, the focus shifted towards scalability and performance until the last decade, when security saw a revival in interest. In this section, we focus only on the latter group of modern hardware-based security techniques, and compare them to Califorms. Previous hardware solutions in this domain can be broadly categorized into the following three classes: disjoint metadata whitelisting, cojoined metadata whitelisting, and inlined metadata blacklisting, as presented in Figure 13.

• Disjoint Metadata Whitelisting.
This class of techniques, also called base and bounds, attaches bounds metadata to every pointer, bounding the region of memory it can legitimately dereference (see Figure 13(a)). Hardbound [8] was the first hardware proposal to provide spatial memory safety using this mechanism. Intel MPX [4] is similar, but also introduces an explicit architectural interface (registers and instructions) for managing bounds information. Temporal safety was introduced to this scheme by storing additional "version" information along with the pointer metadata and verifying that no stale versions are ever retrieved [22, 23]. BOGO [24] adds temporal safety to MPX by invalidating all pointers to freed regions in MPX's lookup table. Introduced about 35 years ago in commercial chips like the Intel 432 and IBM System/38, capability-based architectures were revived by CHERI [9]. It has similar bounds-checking guarantees, in addition to other metadata fields pertaining to permissions, etc. PUMP [11], on the other hand, is a general-purpose framework for metadata propagation, and can be used for propagating pointer bounds.

Typically, per-pointer metadata is stored separately from the pointer in a shadow memory region, in order to maintain legacy pointer layout assumptions. Thus, although metadata storage overhead scales with the number of pointers in principle, techniques generally reserve a fixed chunk of memory for easy lookup. Owing to this disjoint nature, metadata access requires additional memory operations, which individual proposals seek to minimize with caching and other optimizations. Regardless, disjoint metadata introduces atomicity concerns, potentially resulting in false positives and negatives, or at the least complicating coherence designs (e.g., MPX is not thread-safe).
Explicit specification of bounds per pointer also allows bounds narrowing in principle, wherein pointer bounds can be tailored to protect individual elements within a composite memory object (for instance, when passing a pointer to an element to another function). However, commercial compilers do not support this feature for MPX due to the complexity of the compiler analyses required. Furthermore, compatibility issues with untreated modules (unprotected libraries, for instance) also introduce real-world deployability concerns for these techniques. For instance, MPX drops its bounds when protected pointers are modified by unprotected modules, while CHERI does not support such modules at all. MPX additionally makes bounds checking explicit, introducing a marginal computational overhead to bounds management as well.

• Cojoined Metadata Whitelisting.
Originally introduced in the IBM System/360 mainframes, this mechanism assigns a "color" to memory chunks as well as pointers. As such, the runtime check for access validity simply consists of comparing the colors of the pointer and the accessed memory (see Figure 13(b)). (A recent version of CHERI [10], however, manages to compress metadata to 128 bits and changes the pointer layout to store it with the pointer value, i.e., implementing base and bounds as cojoined metadata whitelisting, accordingly introducing instructions to manipulate it specifically.)
Figure 13. Three main classes of hardware solutions for memory safety: (a) disjoint metadata whitelisting, (b) cojoined metadata whitelisting, (c) inlined metadata blacklisting.

This technique is currently commercially deployed in SPARC ADI [3], which refactors unused higher-order bits in pointers to store the color. The color associated with memory is stored in the ECC bits while in memory, and in dedicated per-line metadata bits while in cache (when a page is swapped out, however, the color bits are copied into memory by the OS). Due to the latter feature, metadata storage does not occupy any additional memory in the program's address space. Additionally, since the metadata bits are acquired along with the concomitant data, extra memory operations are obviated. For the same reason, the scheme is also compatible with unprotected modules, since the checks are implicit as well. Temporal safety is trivially achieved by assigning a different color when memory regions are reused. However, intra-object protection or bounds narrowing is not supported, as there is no means of "overlapping" colors. Furthermore, protection also depends on the number of metadata bits employed, since this determines the number of colors that can be assigned. So, while color reuse allows ADI to scale and limit metadata storage overhead, it can also be exploited through this vector. Another disadvantage of this technique, specifically due to inlining metadata in pointers, is that it only supports 64-bit architectures; narrower pointers would not have enough spare bits to accommodate the color information. (ARM has a similar upcoming Memory Tagging [25] feature, whose implementation details are unclear as of this work.)

• Inlined Metadata Blacklisting. Another line of work, also referred to as tripwires, aims to detect overflows by simply blacklisting a patch of memory on either side of a buffer and flagging accesses to this patch (see Figure 13(c)). This is very similar to contemporary canary design [30], but there are a few critical differences. First, canaries only detect overwrites, not overreads. Second, hardware tripwires trigger instantaneously, whereas canaries need to be periodically checked for integrity, providing a window between the time of attack and the time of the check. Finally, unlike hardware tripwires, canary values can be leaked or tampered with, and thus mimicked.
Table 4. Security comparison against previous hardware techniques. * Achieved with bounds narrowing. † Although the hardware supports bounds narrowing, CHERI foregoes it since doing so compromises capability logic [28]. ‡ Execution compatible, but protection dropped when external modules modify the pointer. § Limited to 13 tags. ¶ Allocator should randomize allocation predictability.

Proposal               | Protection Granularity | Intra-Object | Binary Composability | Temporal Safety
Hardbound [8]          | Byte       | ✓* | ✗  | ✗
Watchdog [22]          | Byte       | ✓* | ✗  | ✓
WatchdogLite [23]      | Byte       | ✓* | ✗  | ✓
Intel MPX [4]          | Byte       | ✓* | ✗‡ | ✗
BOGO [24]              | Byte       | ✓* | ✗‡ | ✓
PUMP [11]              | Word       | ✗  | ✓  | ✓
CHERI [9]              | Byte       | ✗† | ✗  | ✗
CHERI concentrate [10] | Byte       | ✗† | ✗  | ✗
SPARC ADI [3]          | Cache line | ✗  | ✓  | ✓§
SafeMem [26]           | Cache line | ✗  | ✓  | ✗
REST [27]              | 8–64B      | ✗  | ✓  | ✓¶
Califorms              | Byte       | ✓  | ✓  | ✓¶

SafeMem [26] implements tripwires by repurposing ECC bits in memory to mark memory regions invalid, thus trading off reliability for security. On processors supporting speculative execution, however, it might be possible to speculatively fetch blacklisted lines into the cache without triggering a faulty memory exception. Unless these lines are flushed immediately afterwards, SafeMem's blacklisting feature can be trivially bypassed. Alternatively, REST [27] achieves the same by storing a predetermined large random number, in the form of a 64B token, in the memory to be blacklisted. Violations are detected by comparing cache lines with the token when they are fetched. REST provides temporal safety by quarantining freed memory and not reusing it for subsequent allocations. Compatibility with unprotected modules is easily achieved as well, since tokens are part of the program's address space and all accesses are implicitly checked. However, intra-object safety was not supported by REST, owing to the large memory overhead such heavy usage of tokens would entail.

Since it operates on the principle of detecting memory accesses to security bytes, which are in turn stored along with program data, Califorms belongs to the inlined metadata class of defenses. However, it differs from other works in this class in one key aspect: granularity. While both REST and SafeMem blacklist at the cache line granularity, Califorms does so at the byte granularity. It is this property that enables us to provide intra-object safety with negligible performance and memory overheads, unlike previous work in the area.
For inter-object spatial safety and temporal safety, we employ the same design principles as REST. Hence, our safety guarantees are a strict superset of those provided by previous schemes in this class (spatial safety by blacklisting and temporal safety by quarantining).

| Proposal | Metadata | Memory Overhead | Performance Overhead | Main Operations |
|---|---|---|---|---|
| Hardbound [8] | 0–2 words per ptr | ∝ # of ptrs | … | µ ops |
| Watchdog [22] | 4 words per ptr | ∝ # of ptrs | … | µ ops |
| WatchdogLite [23] | 4 words per ptr | ∝ # of ptrs | … | … |
| SafeMem [26] | … | ∝ blacklisted memory | … | … |
| REST [27] | 8–64B token | ∝ blacklisted memory | … | … |
| Califorms | Byte-granular security byte | ∝ blacklisted memory | … | Execute CFORM insns. |

Table 5. Performance comparison against previous hardware techniques.
| Proposal | Core | Caches/TLB | Memory | Software |
|---|---|---|---|---|
| Hardbound [8] | µ op injection & logic for ptr meta, extend reg file and data path to propagate ptr meta | Tag cache and its TLB | N/A | Compiler & allocator annotates ptr meta |
| Watchdog [22] | µ op injection & logic for ptr meta, extend reg file and data path to propagate ptr meta | Ptr lock cache | N/A | Compiler & allocator annotates ptr meta |
| WatchdogLite [23] | N/A | N/A | N/A | Compiler & allocator annotates ptrs, compiler inserts meta propagation and check insns |
| Intel MPX [4] | Unknown (closed platform [29], design likely similar to Hardbound) | | | Compiler & allocator annotates ptrs, compiler inserts meta propagation and check insns |
| BOGO [24] | Unknown (closed platform [29], design likely similar to Hardbound) | | | MPX mods + kernel mods for bounds page right management |
| PUMP [11] | Extend all data units by tag width, modify pipeline stages for tag checks, new miss handler | Rule cache | N/A | Compiler & allocator (un)sets memory, tag ptrs |
| CHERI [9] | Capability reg file, coprocessor integrated with pipeline | Capability caches | N/A | Compiler & allocator annotates ptrs, compiler inserts meta propagation and check insns |
| CHERI concentrate [10] | Modify pipeline to integrate ptr checks | N/A | N/A | Compiler & allocator annotates ptrs, compiler inserts meta propagation and check insns |
| SPARC ADI [3] | Unknown (closed platform) | | | Compiler & allocator (un)sets memory, tag ptrs |
| SafeMem [26] | N/A | N/A | Repurposes ECC bits | … |
| REST [27] | N/A | 1–8b per L1D line, 1 comparator | N/A | Compiler & allocator (un)sets tags, allocator randomizes allocation order/placement |
| Califorms | N/A | 8b per L1D line, 1b per L2/L3 line | Use unused ECC bits | Compiler & allocator mods to (un)set tags, compiler inserts intra-object spacing |

Table 6. Comparison of implementation complexity among previous hardware techniques.
Tables 4, 5, and 6 respectively summarize the security, performance, and implementation characteristics of the hardware based memory safety techniques discussed in this section. Califorms has the advantage of requiring simpler hardware modifications and being faster than disjoint metadata based whitelisting systems. The hardware savings mainly stem from the fact that our metadata resides with program data; it does not require explicit propagation while additionally obviating all lookup logic. This significantly reduces our design's implementation costs. Califorms also has lower performance and energy overheads since it neither requires multiple memory accesses, nor does it incur any significant checking costs. However, unlike them, Califorms can be bypassed if accesses to security bytes can be avoided (further discussed in Section 7). This safety-vs.-complexity tradeoff is critical to deployability and we argue that our design point is more practical. This is because designers have to contend with integrating these features into already complicated processor designs, without introducing additional bugs while also keeping the functionality of legacy software intact. This is a hard balance to strike [4].

On the other hand, ideal cojoined metadata mechanisms would have comparable slowdowns and similar compiler requirements. However, practical implementations like ADI exhibit some crucial differences from the ideal.

• It is limited to 64-bit architectures, which excludes a large portion of embedded and IoT processors that operate on 32-bit or narrower platforms.

• It has a finite number of colors since available tag bits are limited; ADI supports 13 colors with 4 tag bits. This is important because reusing colors proportionally reduces the safety guarantees of these systems in the event of a collision.

• It operates at the coarse granularity of the cache line width, and hence, is not practically applicable for intra-object safety.

In contrast, Califorms is agnostic of architecture width and is, hence, better suited for deployment over a more diverse device environment. In terms of safety, collision is not an issue for our design either. Hence, unlike cojoined metadata systems, our security does not scale inversely with the number of allocations in the program (see Section 7 for a detailed discussion). Finally, our fine-grained protection also makes us suitable for intra-object memory safety, which is a non-trivial threat in modern security [31].
10 Conclusion
Califorms is a hardware primitive which allows blacklisting a memory location at byte granularity with low area and performance overhead. A key observation behind Califorms is that a blacklisted region need not store useful data separately in most cases, since we can utilize byte-granular space, existing or added, present between object elements to store the metadata. This in-place, compact data structure also avoids additional operations for extraneously fetching the metadata, making it very performant in comparison. Further, by changing how data is stored within a cache line, we are able to reduce the hardware area overheads substantially. Subsequently, if the processor accesses a califormed byte (or a security byte), due to programming errors or malicious attempts, it reports a privileged exception.

To provide memory safety, we use Califorms to insert security bytes within data structures (e.g., between fields of a struct) upon memory allocation and clear them on deallocation. Notably, by doing so, Califorms can even detect intra-object overflows, which is one of the prominent open problems in memory safety, despite decades of research in this area. We also described the necessary compiler and software support for providing memory safety using Califorms. To the best of our knowledge, this is the first hardware primitive which makes in-place byte-granular blacklisting practical.
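To make the intra-object idea concrete, consider a hypothetical layout transformation (the field names and the amount of padding below are our own illustration; in the real system the modified compiler chooses where to inject security bytes):

```c
#include <stddef.h>

/* Hypothetical illustration of intra-object califorming: the hardened
   layout inserts security bytes between fields, so a sequential overflow
   out of `name` hits a blacklisted byte and traps before it can corrupt
   `is_admin`. */
struct user {                /* original layout */
    char name[8];
    int  is_admin;
};

struct user_califormed {     /* layout after hypothetical hardening */
    char name[8];
    char secbytes[4];        /* security bytes: any access traps */
    int  is_admin;
};
```

Normal field accesses never touch `secbytes`, so no extra check is executed on the hot path; only an out-of-bounds access lands on a califormed byte.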
References

[1] D. Weston and M. Miller. Windows 10 mitigation improvements. Black Hat USA, 2016.
[2] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitry Vyukov. AddressSanitizer: a fast address sanity checker. In USENIX ATC '12: Proceedings of the 2012 USENIX Annual Technical Conference, pages 28–28, June 2012.
[3] Hardware-assisted checking using Silicon Secured Memory (SSM). https://docs.oracle.com/cd/E37069_01/html/E37085/gphwb.html, 2015.
[4] Oleksii Oleksenko, Dmitrii Kuvaiskii, Pramod Bhatotia, Pascal Felber, and Christof Fetzer. Intel MPX explained: a cross-layer analysis of the Intel MPX system stack. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(2):28:1–28:30, 2018.
[5] Dokyung Song, Julian Lettner, Prabhu Rajasekaran, Yeoul Na, Stijn Volckaert, Per Larsen, and Michael Franz. SoK: sanitizing for security. In IEEE S&P '19: Proceedings of the 40th IEEE Symposium on Security and Privacy, May 2019.
[6] Gregory J Duck and Roland H C Yap. EffectiveSan: type and memory error detection using dynamically typed C/C++. In PLDI '18: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 181–195, June 2018.
[7] Kostya Serebryany, Evgenii Stepanov, Aleksey Shlyapnikov, Vlad Tsyrklevich, and Dmitry Vyukov. Memory tagging and how it improves C/C++ memory safety. arXiv.org, February 2018.
[8] Joe Devietti, Colin Blundell, Milo M K Martin, and Steve Zdancewic. HardBound: architectural support for spatial safety of the C programming language. In ASPLOS XIII: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 103–114, March 2008.
[9] Jonathan Woodruff, Robert N M Watson, David Chisnall, Simon W Moore, Jonathan Anderson, Brooks Davis, Ben Laurie, Peter G Neumann, Robert Norton, and Michael Roe. The CHERI capability model: revisiting RISC in an age of risk. In ISCA '14: Proceedings of the 41st International Symposium on Computer Architecture, pages 457–468, June 2014.
[10] Jonathan Woodruff, Alexandre Joannou, Hongyan Xia, Anthony Fox, Robert Norton, David Chisnall, Brooks Davis, Khilan Gudka, Nathaniel W Filardo, A Theodore Markettos, Michael Roe, Peter G Neumann, Robert Nicholas Maxwell Watson, and Simon Moore. CHERI concentrate: practical compressed capabilities. IEEE Transactions on Computers, pages 1–1, April 2019.
[11] Udit Dhawan, Catalin Hritcu, Raphael Rubin, Nikos Vasilakis, Silviu Chiricescu, Jonathan M Smith, Thomas F Knight, Jr, Benjamin C Pierce, and Andre DeHon. Architectural support for software-defined metadata processing. In ASPLOS '15: Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 487–502, March 2015.
[12] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre attacks: exploiting speculative execution. In IEEE S&P '19: Proceedings of the 40th IEEE Symposium on Security and Privacy, May 2019.
[13] Alyssa Milburn, Herbert Bos, and Cristiano Giuffrida. SafeInit: comprehensive and practical mitigation of uninitialized read vulnerabilities. In NDSS '17: Proceedings of the 2017 Network and Distributed System Security Symposium, pages 1–15, February 2017.
[14] Introduce struct layout randomization plugin. https://lkml.org/lkml/2017/5/26/558, May 2017.
[15] Randomizing structure layout. https://lwn.net/Articles/722293/, May 2017.
[16] Andrea Bittau, Adam Belay, Ali Mashtizadeh, David Mazières, and Dan Boneh. Hacking blind. In IEEE S&P '14: Proceedings of the 35th IEEE Symposium on Security and Privacy, pages 227–242, May 2014.
[17] Daniel Sanchez and Christos Kozyrakis. ZSim: fast and accurate microarchitectural simulation of thousand-core systems. In ISCA '13: Proceedings of the 40th International Symposium on Computer Architecture, pages 475–486, June 2013.
[18] Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun, and Anand Karunanidhi. Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation. In MICRO-37: Proceedings of the 37th IEEE/ACM International Symposium on Microarchitecture, pages 81–92, December 2004.
[19] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proceedings of the 26th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 190–200, June 2005.
[20] Lieven Eeckhout. Computer architecture performance evaluation methods. Morgan & Claypool Publishers, 1st edition, 2010.
[21] Lizy Kurian John. More on finding a single number to indicate overall performance of a benchmark suite. ACM SIGARCH Computer Architecture News, 32(1):3–8, March 2004.
[22] Santosh Nagarakatte, Milo M K Martin, and Steve Zdancewic. Watchdog: hardware for safe and secure manual memory management and full memory safety. In ISCA '12: Proceedings of the 39th International Symposium on Computer Architecture, pages 189–200, June 2012.
[23] Santosh Nagarakatte, Milo M K Martin, and Steve Zdancewic. WatchdogLite: hardware-accelerated compiler-based pointer checking. In CGO '14: Proceedings of the 12th IEEE/ACM International Symposium on Code Generation and Optimization, pages 175–184, February 2014.
[24] Tong Zhang, Dongyoon Lee, and Changhee Jung. BOGO: buy spatial memory safety, get temporal memory safety (almost) free. In ASPLOS '19: Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 631–644, April 2019.
[25] ARM A64 instruction set architecture for ARMv8-A architecture profile. https://static.docs.arm.com/ddi0596/a/DDI_0596_ARM_a64_instruction_set_architecture.pdf, 2018.
[26] Feng Qin, Shan Lu, and Yuanyuan Zhou. SafeMem: exploiting ECC-memory for detecting memory leaks and memory corruption during production runs. In HPCA '05: Proceedings of the IEEE 11th International Symposium on High Performance Computer Architecture, pages 291–302, February 2005.
[27] Kanad Sinha and Simha Sethumadhavan. Practical memory safety with REST. In ISCA '18: Proceedings of the 45th International Symposium on Computer Architecture, pages 600–611, June 2018.
[28] Brooks Davis, Khilan Gudka, Alexandre Joannou, Ben Laurie, A Theodore Markettos, J Edward Maste, Alfredo Mazzinghi, Edward Tomasz Napierala, Robert M Norton, Michael Roe, Peter Sewell, Robert N M Watson, Stacey Son, Jonathan Woodruff, Alexander Richardson, Peter G Neumann, Simon W Moore, John Baldwin, David Chisnall, James Clarke, and Nathaniel Wesley Filardo. CheriABI: enforcing valid pointer provenance and minimizing pointer privilege in the POSIX C run-time environment. In ASPLOS '19: Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 379–393, April 2019.
[29] Junjing Shi, Qin Long, Liming Gao, Michael A. Rothman, and Vincent J. Zimmer. Methods and apparatus to protect memory from buffer overflow and/or underflow, April 2018. International patent WO/2018/176339.
[30] Crispin Cowan, Calton Pu, Dave Maier, Heather Hintony, Jonathan Walpole, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, and Qian Zhang. StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks. In USENIX Security '98: Proceedings of the 7th USENIX Security Symposium, pages 1–15, January 1998.
[31] Kangjie Lu, Chengyu Song, Taesoo Kim, and Wenke Lee. UniSan: proactive kernel memory initialization to eliminate data leakages. In CCS '16: Proceedings of the 23rd ACM SIGSAC Conference on Computer and Communications Security, pages 920–932, October 2016.
Appendices

A Califorms Variants
Here we present two other variants of califorms-bitvector (designed for the L1 cache) which have less storage overhead (but additional complexity) compared to the one presented in Section 5.1.

• Califorms-4B. The first variant has 4B of additional storage per 64B cache line. This Califorms stores the bit vector within a security byte (illustrated in Figure 14). Since a single byte bit vector (which can be stored in one security byte) can represent the state for 8B of data, we divide the 64B cache line into eight 8B chunks. If there is at least one security byte within an 8B chunk, we use one of those bytes to store the bit vector which represents the state of the chunk. For each chunk, we need to add four additional bits of storage: one bit to represent whether the chunk is califormed (contains a security byte), and three bits to specify which byte within the chunk stores the bit vector. Therefore, the additional storage is 4B (4 bits × 8 chunks), or 6.25%, per 64B cache line. Figure 14 highlights chunk [0] being califormed, where the corresponding bit vector is stored in byte [1].

• Califorms-1B.
We can further reduce the metadata overhead by restricting where we store the bit vector within the chunk (illustrated in Figure 15). The idea is to always store the bit vector in a fixed location (the 0th byte in the figure, or the header byte; a similar idea is used in califorms-sentinel). If the 0th byte is a security byte this works without additional modification. However, if the 0th byte is not a security byte, we need to save its original value somewhere else so that we can retrieve it when required. For this purpose, we use one of the security bytes (the last security byte is chosen in the figure). This way we can eliminate the three bits of metadata per chunk that address the byte containing the bit vector. Therefore, the additional storage is 1B, or 1.56%, per 64B cache line. As in Figure 14, the figure highlights chunk [0] being califormed (where the corresponding bit vector is stored in the first byte) and the original value of byte [0] stored in the last security byte, byte [7], within the chunk.

Figure 14. Califorms-bitvector that stores a bit vector inside security byte locations. The additional metadata (4 bits per 8B) specifies if the corresponding chunk contains a security byte, and if it does, where in the chunk the bit vector is stored. HW overhead of 4B per 64B cache line.

Figure 15. Califorms-bitvector that stores a bit vector in the header (0th) byte of the chunk. If the header byte is normal data (not a security byte), its original value is stored in the last security byte. The additional metadata (1 bit per 8B) specifies if the corresponding chunk contains a security byte. HW overhead of 1B per 64B cache line.
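The califorms-4B encoding described above can be sketched as a software model. This is a functional illustration only: the struct layout, function name, and the choice of the first security byte as the vector holder are our own assumptions, not the hardware's exact implementation.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical software model of califorms-4B: for each 8B chunk of a
   64B line, if any byte is a security byte, one such (dead) byte is
   repurposed to hold the chunk's 8-bit vector, and 4 bits of per-line
   metadata record (califormed?, which byte holds the vector). */
typedef struct {
    bool    califormed;  /* 1 bit: chunk contains security byte(s) */
    uint8_t vec_byte;    /* 3 bits: index of the byte holding the vector */
} ChunkMeta;

void caliform_4b_encode(uint8_t data[64], const bool is_security[64],
                        ChunkMeta meta[8]) {
    for (int c = 0; c < 8; c++) {
        uint8_t vec = 0;
        int first_sec = -1;              /* first security byte in chunk */
        for (int b = 0; b < 8; b++) {
            if (is_security[c * 8 + b]) {
                vec |= (uint8_t)(1u << b);
                if (first_sec < 0) first_sec = b;
            }
        }
        meta[c].califormed = (first_sec >= 0);
        meta[c].vec_byte   = (uint8_t)(first_sec >= 0 ? first_sec : 0);
        if (first_sec >= 0)
            data[c * 8 + first_sec] = vec;  /* dead byte stores the vector */
    }
}
```

The califorms-1B variant would differ only in that the vector always lives in byte [0] of the chunk, with the displaced original value parked in the chunk's last security byte.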
We perform the same VLSI evaluation shown in Section 8.1 for the two additional califorms-bitvector designs introduced in this section. Table 7 presents the results. As we can see, califorms-bitvector with 4B and 1B overheads incurs 47% and 20% extra delay, respectively, upon an L1 hit compared to the califorms-bitvector with 8B overhead (49% and 22% additional delay compared to the baseline L1 data cache without Califorms). Also, both califorms-bitvector designs add almost the same overheads upon spill and fill operations (compared to the califorms-bitvector with 8B overhead): 9% delay and 30% energy for spill, and 34% delay and 17% energy for fill operations. Our evaluation reveals that califorms-bitvector with 1B overhead outperforms the one with 4B overhead both in terms of additional storage and access latency/energy. The reason is the design restriction of fixing the location of the header byte, which allows faster lookup of the bit vector in the security byte. Califorms-bitvector with 1B overhead can be a good alternative (to the one presented in Section 5) in domains where the area budget is tighter and/or performance is less critical, e.g., embedded or IoT systems.
B Handling SIMD/Vector Instructions
As we have discussed in Section 6.2, precise loads and stores (along with the whitelisting capability) allow us to detect access violations upon memory instructions. However, there are certain classes of instructions where issuing precise memory instructions may noticeably degrade performance. One such class is SIMD/vector instructions, where vector loads read a very wide (e.g., 512 bits for Intel AVX-512) word into the SIMD/vector register with a single instruction. For such instructions we can (1) operate the same way as with regular loads by issuing precise loads (e.g., by using vector gather instructions with appropriate masks), (2) issue wide vector loads as is and trigger an exception whenever the vector load touches a security byte; this may introduce false positives, but in reality data structures used by SIMD/vector instructions are unlikely to contain security bytes, or (3) add one bit per byte in the SIMD/vector registers so that we can propagate the security byte information, and trigger an exception whenever SIMD/vector instructions operate on a security byte. Investigating these alternatives is left for future work.
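Option (2) above can be modeled in a few lines; the function below is a hypothetical sketch (not hardware RTL) of deciding whether a wide load overlaps any security byte, given a per-line 64-bit mask of security-byte positions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of option (2): a wide vector load of `width` bytes starting at
   `offset` within a 64B line faults iff its span overlaps any security
   byte recorded in the line's 64-bit byte mask (bit i set = byte i is a
   security byte). */
bool wide_load_faults(uint64_t secbyte_mask, unsigned offset, unsigned width) {
    /* Build a mask with `width` one-bits starting at bit `offset`. */
    uint64_t span = (width >= 64) ? ~0ULL
                                  : (((1ULL << width) - 1) << offset);
    return (secbyte_mask & span) != 0;
}
```

Note that this check flags the whole span, which is exactly where the possible false positives of option (2) come from: a vector load that straddles a security byte faults even if the program never consumes that lane.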
Table 7. Area, delay and power overheads of the three L1 Califorms (GE represents gate equivalent). The top two rows are presented in Table 2 and are shown here again for reference. Califorms-bitvector with 4B and 1B overheads incurs 47% and 20% extra delay, respectively, upon an L1 hit compared to califorms-bitvector with 8B overhead. Also, the two Califorms add 9% delay and 30% energy upon spill and 34% delay and 17% energy upon fill.

| Design | Area (GE) | Delay (ns) | Power (mW) | L1 Area (%) | L1 Delay (%) | L1 Power (%) | Fill Area (GE) | Fill Delay (ns) | Fill Power (mW) | Spill Area (GE) | Spill Delay (ns) | Spill Power (mW) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 347,329.19 | 1.62 | 15.84 | — | — | — | — | — | — | — | — | — |
| Califorms-8B | 412,263.87 | 1.65 | 16.17 | 18.69 | 1.85 | 2.12 | 8,957.16 | 1.43 | 0.18 | 34,561.80 | 5.50 | 0.52 |
| Califorms-4B | 370,972.35 | 2.42 | 17.95 | 6.80 | 49.38 | 11.00 | 9,770.04 | 1.92 | 0.21 | 35,775.36 | 5.99 | 0.68 |
| Califorms-1B | 356,694.82 | 1.98 | 16.00 | 2.69 | 22.22 | 1.06 | 10,223.28 | 1.94 | 0.22 | 35,958.24 | 5.99 | 0.67 |