The Granularity Gap Problem: A Hurdle for Applying Approximate Memory to Complex Data Layout*
Soramichi Akiyama
The University of Tokyo, Tokyo, [email protected]
Ryota Shioya
The University of Tokyo, Tokyo, [email protected]
ABSTRACT
The main memory access latency has not much improved for more than two decades, while the CPU performance had been exponentially increasing until recently.
Approximate memory is a technique to reduce the DRAM access latency in return for losing data integrity. It is beneficial for applications that are robust to noisy input and intermediate data, such as artificial intelligence, multimedia processing, and graph processing. To obtain reasonable outputs from applications on approximate memory, it is crucial to protect critical data while accelerating accesses to non-critical data. We refer to the minimum size of a continuous memory region to which the same error rate is applied in approximate memory as the approximation granularity. A fundamental limitation of approximate memory is that the approximation granularity is as large as a few kilobytes. However, applications may have critical and non-critical data interleaved with a smaller granularity. For example, a data structure for graph nodes can have pointers (critical) to neighboring nodes and its score (non-critical, depending on the use-case). This data structure cannot be directly mapped to approximate memory due to the gap between the approximation granularity and the granularity of data criticality. We refer to this issue as the granularity gap problem. In this paper, we first show that many applications potentially suffer from this problem. Then we propose a framework to quantitatively evaluate the performance overhead of a possible method to avoid this problem using known techniques. The evaluation results show that the performance overhead is non-negligible compared to the expected benefit from approximate memory, suggesting that the granularity gap problem is a significant concern.
1 INTRODUCTION

The impact of main memory access latency on overall performance is much larger on a computer today than in the past. This is because the performance gap between the main memory and the CPU has kept enlarging. Figure 1 shows the single thread performance of server-class CPUs plotted over time. The figure shows an exponential growth of the single thread performance until recent years. In contrast, the access latency of DRAM, which the main memory consists of, has been almost the same for more than two decades. As shown in [8], the speedup of the major latency sources of DRAM over time is very marginal, especially when compared to the exponential growth of the CPU performance.

*This is an extended version of our conference paper published in the 12th ACM/SPEC International Conference on Performance Engineering (ICPE). Data provided in [38] under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
Figure 1: Trend of single thread performance over time (normalized to SPEC CPU 2006 score).

Because DRAM access latency occupies a substantial amount of a random memory access latency, there is a strong need to reduce the DRAM access latency to catch up with the CPU performance. (For example, a random memory access latency in the machine shown in Table 2 is around 82 ns, measured by Intel MLC, while the sum of the three major latency sources shown in [8] of the DRAM module in this machine is around 45 ns [5].)

Approximate memory is a technique to reduce the main memory latency by sacrificing its data integrity [23, 33, 37, 42]. Prior works have proven that DRAM modules used for main memory can be operated much more aggressively than defined in the specifications. Concretely, the access latency can be reduced by violating the timing constraints of DRAM internal operations at the cost of an increased bit-error rate [8, 11, 18, 26, 43, 48]. Prior works try not to expose the increased bit-error rate to applications by operating DRAM to the extent that the error rate is still small enough to cause zero bit-flips during applications' runtime. Approximate memory exploits the same idea more aggressively to reduce the memory access latency by leveraging the error robustness of applications themselves. It is expected to be beneficial for application domains such as deep learning, multimedia processing, graph processing, and big-data analytics because these applications are known to be robust to bit-flips to some extent [7, 10, 29, 30, 34].

To obtain reasonable outputs from applications on approximate memory, it is crucial to protect critical data while accelerating accesses to non-critical data. For example, suppose we want to accelerate a deep learning application on top of approximate memory. The matrices that express the weights of each layer are non-critical data because it is known that the accuracy of the trained model does not degrade much even when some bit-flips are injected into them [7, 16, 23, 29]. On the other hand, pointers from one layer of a network to another or the loop counter that counts the number of epochs are critical data because they must be protected from bit-flips. Therefore, we must control the error rate of memory regions depending on the criticality of the data stored in them.

A limitation of approximate memory is that the error rate can be controlled only with a granularity of a few kilobytes due to the internal structure of DRAM chips. We refer to the minimum size of continuous data to which the same error rate must be applied as the approximation granularity. The approximation granularity for a given DRAM module is decided by the row size of the module. A row is a sequence of data bits inside a DRAM module that are driven simultaneously with the same timing. Because the approximate memory we focus on is based on tweaking the timing of DRAM internal operations, the approximation granularity is equal to the row size. The row size of a DRAM module is in the range of 512 bytes to a few kilobytes. For example, the row size of a module from Micron [31] is 2 KB, meaning that the approximation granularity of this module is also 2 KB. This stems from the fundamental limitation of modern DRAM that many bits must be driven in parallel to catch up with requests coming from a fast CPU.

The large approximation granularity makes it difficult to gain benefit from approximate memory for applications that have critical and non-critical data interleaved with a smaller granularity (e.g., 8 bytes).
We refer to this problem as the granularity gap problem. It can happen when an application manages its data as an array of a data structure that has critical members (e.g., pointers) and non-critical members (e.g., numbers whose small divergence does not affect the application's result). For a concrete example, suppose an application that traverses an array of graph nodes, where each graph node has pointers to its neighboring nodes and a score that is robust to bit-flips. The non-critical data of this application cannot be stored in approximate memory due to the difference between the approximation granularity and the granularity of interleaving of critical and non-critical data.

In this paper, we show that the granularity gap problem is a significant concern in using approximate memory. Concretely, the contributions of this paper are summarized as follows:

(1) A source code analysis of widely used benchmarks to prove that many applications potentially suffer from the granularity gap problem, extended from our previous work [2].
(2) A discussion on the pros and cons of a memory layout conversion technique in the context of the granularity gap problem.
(3) A framework to quantitatively evaluate the negative performance impact of the memory layout conversion technique.
(4) Evaluation results of the negative performance impact on widely used benchmarks, which prove the significance of the granularity gap problem in using approximate memory.

The rest of the paper is structured as follows. Section 2 introduces the background knowledge of how DRAM and approximate memory work. Section 3 defines the granularity gap problem and the goal of this paper. Section 4 analyzes the source code of SPEC CPU 2006 and 2017 benchmarks. Section 5 explains a memory layout conversion technique to avoid the granularity gap problem, and points out why it is not sufficient. Section 6 describes our simulation framework and gives quantitative evidence that the granularity gap problem is a significant concern. Section 7 reviews related work and Section 8 concludes the paper.

Approximate memory is a new technology to mitigate the performance gap between main memory and CPUs. The main idea is to reduce the latency of main memory accesses at the cost of data integrity by exploiting design margins that exist in many DRAM chips today. The CPU may read slightly different data from what has been written to the main memory before. A design margin refers to the difference between a design parameter defined in the specification of a device and the actual value with which the device can be operated. In particular, we focus on the design margin in the timing of internal operations of DRAM. Even when some wait-time parameters are shortened compared to the specification, many DRAM chips can read stored data "almost" correctly, with a few bit-flips (errors) injected into the data [8, 21]. By controlling the timing of internal operations of DRAM, we can trade reduced main memory access latency for an increased bit-error rate.

Approximate memory attracts much research interest due to the ever-increasing performance gap between main memory and CPUs. Chang et al. [8] measure the relationship between error rates and latency reduction for a large number of commercial DRAM chips. Tovletoglou et al. [42] propose a holistic approach to guarantee the service level agreement of virtual machines running on approximate memory. Koppula et al.
[23] re-train deep learning models on approximate memory so that the models can adapt to errors. Our previous work [1] estimates the effect of approximate memory on realistic applications without simulation by counting the number of DRAM internal operations that incur errors.

Approximate memory is especially beneficial for machine learning, multimedia, and graph processing applications, all of which incur many memory accesses and are tolerant to noisy data. For example, Stazi et al. [40] show that allocating data in approximate memory for the x264 video encoder can yield acceptable results, and our previous work [1] shows that a graph-based search algorithm (mcf in SPEC 2006) can yield the same result as an error-free execution even when some bit-flips are injected. Regarding the performance improvement, Koppula et al. [23] show 8% speedup on average for training various DNN models on approximate memory, and Lee et al. [24] show that using Adaptive-Latency DRAM [25] for approximate memory gives 7% to 12% speedup on average for "32 benchmarks from Stream, SPEC CPU 2006, TPC and GUPS" (they do not show numbers for each benchmark, though). A performance improvement of a few to 10+ percent is important to these applications because they are typically executed in large scale data centers, where only a few % of relative efficiency improvement results in a huge absolute reduction of energy and/or runtime in total.
The design margin exploited to realize the approximate memory we focus on is in the timing constraints of DRAM internal operations. Although there are other types of approximate memory, such as approximate flash memory that leverages multiple levels of programming voltages [17] and approximate SRAM based on supply voltage scaling [7, 14, 46], we focus on approximate DRAM in this paper.

Figure 2: DRAM command sequence of normal memory (top) and approximate memory (bottom): In this example, tRCD is shortened to 7.5 ns and tREF is prolonged to 128 ms, both of which reduce the average latency.

A DRAM module is operated by the memory controller, which issues electric signals referred to as
DRAM commands. A DRAM command triggers an internal operation of DRAM such as resetting the voltages of wires to the reference voltage. A timing constraint refers to the interval between two DRAM commands, and we categorize them into two types:

• Type 1 specifies the interval that must pass before the next DRAM command is issued. These constraints are defined so that the internal operation of DRAM triggered by the previous command is guaranteed to finish before the next command.
• Type 2 specifies the interval within which the same command must be issued again. This is defined so that the electric charges inside DRAM do not leak too much, by periodically refreshing them.

The actual values of the timing constraints for each type of DRAM module (e.g., DDR4-2400) are specified by JEDEC, an organization that publishes DRAM-related specifications.

Relaxing a timing constraint means either shortening or prolonging an interval defined by the specifications (i.e., "violating" the specifications). It reduces the average access latency of DRAM because commands are served faster (by shortening the Type 1 constraints) and because the number of useful commands executed increases (by prolonging the Type 2 constraints). However, it increases the possibility that bit-flips are injected into the data because there is no guarantee that a DRAM module works flawlessly when the timing constraints are violated.

Figure 2 shows an example of a DRAM command sequence in normal memory and approximate memory. It shows four representative DRAM commands: refresh (REF), precharge (PRE), activation (ACT), and read (RD). In this example, a timing constraint called tRCD (Type 1) is shortened from 12.5 ns to 7.5 ns, and one called tREF (Type 2) is prolonged from 64 ms to 128 ms. tRCD is the interval that must pass between an ACT command and the following RD command, and it is around 11 – 13 ns depending on the DRAM module (e.g., 12.5 ns for DDR3-1600J [4]). Chang et al. [8] found that only a small portion of the data bits experience errors even when tRCD is shortened below this value. We explain how an ACT command and tRCD work inside DRAM in more detail in Section 2.3. tREF is another timing constraint that specifies the longest interval between two REF commands, which refresh DRAM cells to prevent them from losing stored data. Das et al. [13] and Zhang et al. [48] propose to prolong this interval because many DRAM cells can retain data for more than 64 ms in practice. Because prolonging tREF increases the amount of time during which more useful commands are served, it reduces the average DRAM access latency.

Figure 3: ACT command copies the value of the selected row to the sense amplifiers. Left: The BLs are reset to Vref in the reset state. Right: After 12.5 ns, the BLs are guaranteed to be either Vref+ or Vref-. Middle: If tRCD is reduced to 7.5 ns, the sense amplifiers sense unstable voltages of BLs (Vx and Vy), resulting in a few bit-flips but a shorter latency.
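To make the two types of constraints concrete, below is a minimal sketch in C of how a memory-controller model might check them. The function names and the code structure are ours (they are not taken from any real memory controller or from the works cited above); the timing values mirror the example in Figure 2.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative timing values; real values are defined per module by JEDEC. */
static const double TRCD_SPEC_NS   = 12.5;   /* Type 1: wait between ACT and RD (specification) */
static const double TRCD_APPROX_NS = 7.5;    /* Type 1: shortened -> lower latency, possible bit-flips */
static const double TREF_SPEC_MS   = 64.0;   /* Type 2: refresh period (specification) */
static const double TREF_APPROX_MS = 128.0;  /* Type 2: prolonged -> fewer refreshes, cells may leak */

/* Type 1 check: a RD may follow an ACT only after tRCD has elapsed. */
static bool can_issue_read(double now_ns, double last_act_ns, double trcd_ns) {
    return now_ns - last_act_ns >= trcd_ns;
}

/* Type 2 check: a row must be refreshed again before tREF expires. */
static bool refresh_overdue(double now_ms, double last_refresh_ms, double tref_ms) {
    return now_ms - last_refresh_ms >= tref_ms;
}

int main(void) {
    /* A RD issued 8 ns after ACT: allowed with the shortened tRCD, not with the spec value. */
    printf("spec tRCD allows RD:   %d\n", can_issue_read(8.0, 0.0, TRCD_SPEC_NS));
    printf("approx tRCD allows RD: %d\n", can_issue_read(8.0, 0.0, TRCD_APPROX_NS));
    /* A row last refreshed 100 ms ago: overdue under the spec tREF, not yet under the prolonged one. */
    printf("spec tREF overdue:     %d\n", refresh_overdue(100.0, 0.0, TREF_SPEC_MS));
    printf("approx tREF overdue:   %d\n", refresh_overdue(100.0, 0.0, TREF_APPROX_MS));
    return 0;
}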
The left-most side of Figure 3 shows the reset state of DRAM. The circles show an array of memory cells, where each row is connected by a wordline (WL) and each column is connected by a bitline (BL). Although a DRAM chip consists of a hierarchy of many of these arrays operated in parallel, we focus on one array here without loss of generality. A black cell has electric charge in it and a white cell is empty. A cell with charge in it represents a value of 1, and an empty cell represents a value of 0. In the reset state, the voltages of all the BLs are set to the reference value denoted as Vref in the figure.

An ACT command takes the target row number as its parameter (for example, the 2nd row from the top). The WL of the target row is enabled to connect the cells in the target row to the BLs. The voltages of the BLs connected to cells with charge start being pulled up, and the voltages of the BLs connected to empty cells start being pulled down. At the same time, the cells connected to the BLs enter an intermediate state (denoted by gray circles in the figure) because the capacitance of a BL is much larger than that of a cell. After tRCD (12.5 ns in the figure) has passed, the voltages of the BLs are guaranteed to be either Vref+ or Vref- as shown in the right-most side of the figure. Finally, the sense amplifiers sense the voltages of the BLs to fetch the values and buffer them.

Although the value of tRCD is strictly defined by JEDEC, real DRAM chips are known to have much design margin in this timing parameter. Previous work [8, 22] shows that many bits can be fetched correctly even when tRCD is shortened by a substantial amount. The middle of Figure 3 shows how the ACT command works when tRCD is shortened to 7.5 ns to reduce the memory access latency. Because 12.5 ns has not yet passed, the voltages of the BLs may not have reached Vref+ or Vref-, but are unstable values denoted as Vx and Vy in the figure (Vref+ > Vx > Vref > Vy > Vref-). When the sense amplifiers sense the voltages of the BLs at this point, they may fetch wrong values because the differences of Vx and Vy against Vref are not large enough to sense the values. The larger the differences of Vx and Vy against Vref are, the larger the probability of fetching correct values. This way, controlling the timing parameter serves as a knob for trading the access latency and the bit-error rate.
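This knob can be emulated in software. The sketch below is our own illustration (it is not part of any cited system): it injects random bit-flips into a buffer at a given bit-error rate, mimicking what a read from a row driven with a shortened tRCD might return.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Flip each bit of the buffer independently with probability ber (bit-error
 * rate). A higher ber corresponds to a more aggressively shortened tRCD. */
static void inject_bit_flips(uint8_t *buf, size_t len, double ber) {
    for (size_t i = 0; i < len; i++) {
        for (int b = 0; b < 8; b++) {
            if ((double)rand() / RAND_MAX < ber)
                buf[i] ^= (uint8_t)(1u << b);
        }
    }
}

int main(void) {
    uint8_t data[8] = {0};
    inject_bit_flips(data, sizeof(data), 1e-2); /* emulate one noisy read */
    for (size_t i = 0; i < sizeof(data); i++)
        printf("%02x ", data[i]);
    printf("\n");
    return 0;
}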
A limitation of approximate memory exploiting design margins in timing constraints is that the approximation granularity cannot be smaller than a few kilobytes. The approximation granularity refers to the minimum size of a continuous memory region to which the same error rate must be applied. This is because the same timing parameter is applied to an entire row as we describe in Section 2.3, and the size of a DRAM row (also known as a page [20]) is as large as a few kilobytes in modern DRAM modules. This stems from a fundamental constraint that many DRAM cells must be driven in parallel so that slow DRAM can catch up with the high rate of requests coming from the CPU. Therefore, the same limitation applies to DRAM commands other than ACT and their timing constraints as well. For example, refreshing DRAM cells is also done row by row (i.e., an entire row is refreshed at once) and thus prolonging tREF also affects an entire row at once [20].

We give two examples of the row size in real DRAM modules. A 32 Gb DRAM module from Micron [31] has 64 K rows, 16 banks, and 2 ranks. The row size of this module is calculated as:

    32 Gb / (64 K rows × 16 banks × 2 ranks) = 16 Kb = 2 KB    (1)

The row size of a 16 Gb DRAM module [39] can be confirmed by a similar calculation to equation (1), using its number of rows (2^17 = 128 K) and its number of banks (the banks per bank group multiplied by the number of bank groups); the result is again on the order of kilobytes. (The row and bank counts are taken from the right-most column of Table 3 in [31] and of the "16 Gb Addressing Table" on page 9 in [39], respectively.)

Even for applications that can tolerate noisy input and intermediate data, they have critical parts of data that must be protected from bit-flips. For example, deep learning is known to be robust to bit-flips [7, 16, 23, 29], but not all parts of the data are robust to them.
Figure 4: Mapping critical and non-critical parts of data into different DRAM rows to protect the former while reducing access latency to the latter.

struct node_t {
    int id;             // id of the node, critical
    struct node_t *r;   // pointer to the right child, critical
    struct node_t *l;   // pointer to the left child, critical
    double score;       // score of this node, non-critical
};
int size = 1000 * sizeof(struct node_t);
struct node_t *nodes = malloc(size);

Figure 5: Critical and non-critical data interleaved in a single C struct: it is not possible to protect the critical parts while storing the non-critical parts on approximate memory due to a large approximation granularity (e.g., 2 KB).
Pointers from one layer of a network to another or the loop counter that counts the number of epochs must be protected from bit-flips. Protecting critical parts of data requires two steps:

(1) Detecting which parts of data are critical and which parts are non-critical.
(2) Storing non-critical parts of data into approximate memory while storing the critical parts to normal memory.

For step (1), there has been much effort [1, 3, 30, 34, 45] and it is out of the scope of this work, so we assume that the discrimination of critical and non-critical data is given. For step (2), because the timing constraints are controlled per row, we must map the critical and non-critical parts of data into different DRAM rows running with different timing parameters: the timing parameter as defined in the specification and the one shortened for faster accesses.

Figure 4 depicts an example of mapping critical and non-critical data into different DRAM rows. In the figure, suppose the variables N and i are critical because the former decides the size of allocated memory and the latter is a loop counter, and the memory region pointed to by A is non-critical. N and i are mapped to the first row in the figure, which is applied normal timing parameters so that it yields no bit-flips. The data pointed to by A is mapped to the rows at the bottom, which are applied tweaked timing parameters so that they can be accessed faster. By mapping data of different criticality to different DRAM rows as in the figure, we can protect critical data while improving the access latency to non-critical data. (A minimal code sketch of this mapping appears at the end of this section.)

A challenge in using approximate memory is the gap between the approximation granularity and the granularity at which critical and non-critical data are interleaved. We call this problem the granularity gap problem. We say critical and non-critical data are interleaved when they co-locate inside one instance of a C struct or a C++ class. Figure 5 shows an example of interleaved critical and non-critical data. The data structure struct node_t contains both critical and non-critical data, and a pointer named nodes points to an array of struct node_t. To gain benefit from approximate memory for this code, we must protect the critical parts (id, r, and l) while storing the non-critical part (score) into approximate memory. This is not possible because the approximation granularity is as large as a few kilobytes (say 2 KB), while we need to enable or disable approximation with a granularity of 4 bytes to achieve it.

The granularity gap problem has been overlooked by the research community because it is not relevant to applications that have large chunks of non-critical data. For example, for deep learning applications, the non-critical data are matrices storing the weights of a network whose sizes range from a few kilobytes to hundreds of megabytes. In this case, we can store entire matrices into approximate memory and the approximation granularity is not an issue.

The goal of this paper is to prove the significance of the granularity gap problem with quantitative evidence. First, we show that there are many applications that potentially suffer from this problem. Second, more importantly, we show that avoiding this problem with a known technique has a negative performance impact that is large enough to almost cancel the benefit of approximate memory.
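To make step (2) concrete, the sketch below separates critical and non-critical allocations in the spirit of Figure 4. approx_malloc is a hypothetical allocator that we assume returns memory backed by DRAM rows driven with relaxed timing parameters; no such standard API exists, and the sketch only illustrates the intended interface.

#include <stdlib.h>

/* Hypothetical allocator: assumed to return memory backed by DRAM rows
 * operated with relaxed (approximate) timing parameters. The placeholder
 * body just forwards to malloc; a real system would map the region to
 * approximate rows. */
void *approx_malloc(size_t size) { return malloc(size); }

int main(void) {
    int N = 1 << 20;                                /* critical: decides the allocation size */
    double *A = approx_malloc(N * sizeof(double));  /* non-critical: tolerant to bit-flips */

    for (int i = 0; i < N; i++)                     /* i is critical: a loop counter */
        A[i] = 0.0;

    /* ... use A, accepting occasional bit-flips in its contents ... */
    free(A);
    return 0;
}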
To show that many real applications can potentially suffer from the granularity gap problem, we analyze the source code of widely used benchmarks in this section.
For a given application, we find whether the data structure that can obtain benefit from approximate memory has critical and non-critical data interleaved with a smaller granularity than the approximation granularity. Because approximate memory is the most effective when an application's data that incur many cache misses are stored on it, we focus our analysis on the data structure that incurs the largest number of cache misses within an application. We refer to such a data structure as the most cache-unfriendly data structure. After finding such a data structure, we analyze it to estimate whether the application potentially suffers from the granularity gap problem.

To find the most cache-unfriendly data structure of an application, we first measure the number of cache misses per instruction using Precise Event Based Sampling (PEBS) on Intel CPUs. PEBS is an enhancement of normal performance counters that uses designated hardware for sampling to reduce the skid between the time an event (e.g., a cache miss) occurs and the time it is recorded [6, 44]. The small skid enables pinpointing which instruction in an application binary causes many hardware events. We execute a benchmark with its sample dataset using linux perf, and the actual command line is 'perf record -e r20D1:pp -- benchmark'. The parameter r20D1:pp specifies a performance event whose event number is 0xD1 and whose umask value is 0x20, which "counts retired load instructions with at least one uop that missed in the L3 cache" (described in Table 19.3 of [19]). Note that the "L3 cache" is the last level in the cache hierarchy of the CPU we use (described in Table 2). The parameter benchmark is replaced by an actual command line to execute each benchmark.

Figure 6: Sample output of perf report: It shows an instruction, the offset of the instruction from the head of the binary, and the percentage of cache misses that it incurs (if any), from right to left. The C code, if (arc->ident > BASIC), corresponds to the assembly code below it.

After measuring the number of per-instruction cache misses, we find the data structure accessed by this instruction, which is the most cache-unfriendly data structure of this application. Due to the lack of off-the-shelf tools to disassemble an arbitrary binary into C/C++ source code, we rely on human knowledge and labor to do this. Figure 6 shows an output of perf report, executed after a measurement by perf record. The measurement is done for a benchmark called mcf in SPEC CPU 2006, and the details of the benchmarks we analyze are described in Section 4.2. Each line shows, from right to left, an instruction, the offset of the instruction from the head of the binary, and the percentage of cache misses it incurs (if any) against the overall cache misses in the measurement. The C code, if (arc->ident > BASIC), corresponds to the lines of assembly code below it. From the figure, we can see that the mov instruction shown incurs 48.58 % of all cache misses of this application. We can confirm that this instruction incurs the largest number of cache misses by checking that no other instruction incurs more than this percentage.

To find the data structure that the mov instruction accesses given Figure 6, we analyze the assembly code with the help of the debug information and the source code. In Figure 6, we can see a typical pattern of assembly code where a jump instruction (jle) follows a compare instruction (test, which is commonly used to compare a register with 0). Therefore, we can guess that the mov instruction copies a value to be compared with 0. From the C code corresponding to this block of assembly, we can see that the value compared with 0 should be arc->ident. This is confirmed by the fact that BASIC is a compile-time constant whose value is 0, and that the offset of the ident member inside arc is 0x18. As a conclusion, the mov instruction accesses the variable named arc, whose data type is struct arc. Note that the same methodology is applicable to a template function in C++ as well because there is an independent piece of assembly code for each instantiation of it (i.e., no type-ambiguity exists in assembly).
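To illustrate how a member offset pins down the accessed data structure, the following sketch mimics the situation above with a simplified arc layout. The member names and types are illustrative stand-ins (not copied from the SPEC sources), chosen so that ident lands at offset 0x18 on a typical LP64 platform.

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-in for mcf's arc: one 8-byte cost field and two 8-byte
 * pointers precede ident, so ident lands at offset 24 (0x18). */
struct arc {
    long         cost;   /* 8 bytes */
    struct node *tail;   /* 8 bytes */
    struct node *head;   /* 8 bytes */
    int          ident;  /* the member compared against BASIC (= 0) */
    /* ... remaining members omitted ... */
};

int main(void) {
    printf("offsetof(struct arc, ident) = 0x%zx\n", offsetof(struct arc, ident));
    return 0;
}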
Table 1 describes the benchmarks we analyze. Each line shows a benchmark's name, its domain, and the cache miss rate measured by the linux perf tool. From both SPEC CPU 2006 and 2017, we analyze the benchmarks listed in Table 1. (Although deepsjeng is named 'deep', it uses a classical tree search algorithm.)

Table 1: Analyzed Benchmarks

SPEC CPU 2006
Name        | Domain                    | Cache Miss Rate
milc        | quantum simulation        | 82.6%
sjeng       | game AI (chess)           | 74.5%
libquantum  | quantum computing         | 54.6%
lbm         | fluid dynamics            | 49.2%
omnetpp     | discrete event simulation | 47.9%
soplex      | linear programming        | 41.2%
gobmk       | game AI (go)              | 38.4%
gcc         | c compiler                | 36.8%
mcf         | optimization              | 33.7%
dealII      | finite element analysis   | 33.6%
namd        | molecular dynamics        | 21.0%

SPEC CPU 2017
Name        | Domain                    | Cache Miss Rate
deepsjeng_r | game AI (chess)           | 77.5%
nab_r       | molecular modeling        | 64.9%
omnetpp_r   | discrete event simulation | 56.1%
namd_r      | molecular dynamics        | 50.4%
lbm_r       | fluid dynamics            | 48.8%
x264_r      | video encoding            | 47.3%
mcf_r       | optimization              | 43.5%
gcc_r       | c compiler                | 36.6%
blender_r   | image processing          | 35.0%
xz_r        | data compression          | 31.6%
perlbench_r | perl interpreter          | 21.4%
Table 2: Experiment Environment

CPU    | Intel Xeon Silver 4108 (Skylake, 8 cores)
Memory | DDR4-2666, 96 GB

We use the dataset named ref for SPEC CPU 2006 and the one named refrate for SPEC CPU 2017. The LLC miss rate is measured using the linux perf tool with the following command: perf -e cache-misses,cache-references -- benchmark, where benchmark is replaced by an actual command for each benchmark.

Table 3: Results of Source Code Analysis (S: is a C struct or a C++ class, P: has a pointer, F: has a floating point number, I: has an integer)

SPEC CPU 2006
Benchmark  | Data Type                 | Flags
milc       | complex[]                 | ✓ ✓
sjeng      | QTType[]                  | ✓ ✓
libquantum | quantum_reg_node_struct[] | ✓ ✓ ✓
lbm        | double[]                  |
omnetpp    | cChannel                  | ✓ ✓ ✓
soplex     | Element[]                 | ✓ ✓
gobmk      | hashnode_t[]              | ✓ ✓ ✓
gcc        | rtx_def                   | ✓ ✓
mcf        | arc[]                     | ✓ ✓ ✓
dealII     | double[]                  |
namd       | CompAtom[]                | ✓ ✓ ✓

SPEC CPU 2017
Benchmark   | Data Type                | Flags
deepsjeng_r | ttentry_t[]              | ✓ ✓
nab_r       | INT_T[]                  |
omnetpp_r   | sVector                  | ✓ ✓ ✓ ✓
namd_r      | CompAtom[]               | ✓ ✓ ✓
lbm_r       | double[]                 |
x264_r      | uint8_t[]                |
mcf_r       | arc[]                    | ✓ ✓ ✓
gcc_r       | -                        |
blender_r   | VlakRen[]                | ✓ ✓ ✓ ✓
xz_r        | uint8_t[], uint32_t[]    |
perlbench_r | char[]                   |
Table 3 shows the analysis results. Each row shows a benchmark, its most cache-unfriendly data structure, and flags that represent the kinds of members that the data structure contains:

• S: the data is either a C struct or a C++ class.
• P: the data structure contains a pointer.
• F: the data structure contains a floating point number.
• I: the data structure contains an integer.

The data type column is denoted by [] if the data is managed as an array of that data type. We regard any type compatible with an integer (e.g., char, long) as an integer. If a class inherits other classes, we include the members of the parent classes as well, because an instance of a child class in memory contains all members of the parent classes. We exclude static members and member functions because they are not stored in the memory region allocated for each instance. We do not show the result for gcc_r because its cache misses are scattered across many instructions. Two data types are shown for xz_r because two instructions incur almost the same number of cache misses. For all the benchmarks, the instruction that incurs the largest number of cache misses existed in their own code and not in any standard C/C++ libraries.

The results show that many applications potentially suffer from the granularity gap problem. The most cache-unfriendly data structure is either a C struct or a C++ class in 9 out of 11 benchmarks in SPEC CPU 2006 and 5 out of 11 benchmarks in SPEC CPU 2017. Although there are only two benchmarks (omnetpp_r and blender_r) that have both a pointer and a floating point number in their most cache-unfriendly data structure, this does not mean that these two are the only benchmarks that suffer from the granularity gap problem. For example, the data type arc in mcf and mcf_r contains a pointer and an integer named cost, which represents the cost of a graph edge. Our previous work [1] shows that even if some bit-flips are injected into the member cost, mcf can yield the same result as an error-free execution. Therefore, we conclude that these 14 applications "potentially" suffer from the granularity gap problem.

The manual effort to find the data type accessed by a given instruction incurs a scalability issue and increases the chances of analysis errors. There are two error patterns stemming from the manual effort:

(1) Mis-identifying the variable in the source code that corresponds to a given memory access instruction.
(2) Mis-identifying the type of data that is stored in the identified variable in the source code.

Pattern (1) can happen when the application binary has complex data/control flows, for example with multiple levels of indirection (e.g., a->b->c), or when the binary does not look similar to the source code due to compiler optimizations. Pattern (2) can happen when the declared type of a source variable and the type of the actual data stored in it are different (i.e., polymorphism). Developing compiler support to reduce the possibilities of these errors is future work.

Another concern for our analysis arises when a member variable of a C struct or a C++ class is passed to a function by reference. For example in Figure 7, the same function (f) is called either by passing &s1.v or &s2.v as its argument. Finding the data type that the memory region pointed to by fp belongs to requires an investigation of stack traces and points-to analysis [41]. Although it seems more natural for a function to take a pointer to a whole struct, such as 'void g(struct S1 *sp)', this may appear in some cases, such as when a library function returns the result through a pointer.
However, we did not hit this case in any of the benchmarks in our experiment.

This section discusses the applicability of a memory layout conversion technique to avoid the granularity gap problem, and points out that it can degrade performance for some applications. We show in Section 6 that this performance overhead is as large as almost canceling the benefit of approximate memory in some cases.
An array of structures (AoS) can be converted into a structure of arrays (SoA) without changing the results of an application. Given an array of C struct instances, this technique converts the memory layout of an application so that each member of the C struct is stored as a distinct array. Figure 8 and Figure 9 show an example of this conversion done explicitly by hand.

struct S1 {
    double v;   // non-critical
    double vv;  // non-critical
} s1;
struct S2 {
    double v;   // non-critical
    int *p;     // critical
} s2;

void f(double *fp) { /* do something */ }

f(&s1.v); // (1): invoke f by passing s1.v by reference
f(&s2.v); // (2): invoke f by passing s2.v by reference

Figure 7: Calling the same function by passing members of different structs by reference. Identifying the data type that *fp belongs to requires stack traces and points-to analysis.

struct {
    double x;
    double y;
} points[N];

// calculate the center
double center_x = 0, center_y = 0;
for(i = 0; i < N; i++) {
    center_x += points[i].x;
    center_y += points[i].y;
}
center_x /= N;
center_y /= N;

Figure 8: Example of an array of structures. The data structure {x, y} constitutes an array of structures named "points".

struct {
    double x[N];
    double y[N];
} points;

// calculate the center
double center_x = 0, center_y = 0;
for(i = 0; i < N; i++) {
    center_x += points.x[i];
    center_y += points.y[i];
}
center_x /= N;
center_y /= N;

Figure 9: AoS to SoA conversion is applied to the code in Figure 8. Each member, x and y, of the struct is allocated a distinct array.

The code in Figure 8 calculates the center of N points (in some sense) that are stored in memory as an array of structures. Figure 9 shows the converted version of the code that does the same calculation. This version manages each member of the data structure, x and y, as a distinct array. Note that the code that accesses the data is also changed in Figure 9 (e.g., from points[i].x to points.x[i]).

Although the AoS to SoA conversion seems very difficult at a glance for realistic applications, existing research has proven it to be possible at compile time [12, 27, 49]. The main difficulty stems from the fact that a pointer can have an arbitrary address in C/C++. For example in Figure 8, if another pointer points somewhere inside a memory range pointed to by points, it is not easy to apply the conversion without changing the application's output. However, points-to analysis [41] solves this problem in almost linear time of the source code length.

Besides the technical difficulties that have been tackled by many researchers (e.g., how to apply it dynamically to programs without the source code, how to ensure safety in weakly typed languages), a fundamental limitation of the AoS to SoA conversion is that there is no method to precisely predict its effect on performance. Petrank et al. [36] show that predicting the number of cache misses that a given data layout generates for an arbitrary memory access pattern is NP-hard with regard to the number of data objects. This means that one must either do exhaustive experiments for the memory access patterns under interest or use heuristics to informally estimate the performance implication. This limitation leads us to do the former to evaluate its performance overhead in a later section.

Figure 10: The change of memory layout when the AoS to SoA conversion is applied to the code in Figure 5.

The AoS to SoA conversion enables using approximate memory even when critical and non-critical data are interleaved, by avoiding the granularity gap problem. Because each member of the converted data structure is stored in a distinct array, it can be mapped to a designated DRAM row that has the appropriate timing parameter for the criticality of that member.

Figure 10 depicts how we can selectively store the non-critical data of the code in Figure 5 to approximate memory by the AoS to SoA conversion. Gray boxes in the figure show critical data and white boxes show non-critical data. In the original code that manages the data as an AoS, it is not possible to selectively protect the critical data while accelerating accesses to the non-critical data because of the granularity gap problem (Figure 10 (a)). In the converted code that manages the data as a SoA, the non-critical data (score) forms a distinct array and it can be mapped directly to approximate memory, while the critical data (id, r, l) can be mapped to normal memory (Figure 10 (b)).
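For the data structure of Figure 5, the converted layout sketched in Figure 10 (b) could be expressed as below. This is our own simplified rendering: child pointers are replaced by array indices for brevity, and approx_malloc is the same hypothetical allocator assumed in Section 3, not an existing API.

#include <stdlib.h>

/* Hypothetical allocator assumed to back its region with approximate rows. */
void *approx_malloc(size_t size) { return malloc(size); /* placeholder */ }

/* SoA counterpart of struct node_t (Figure 5). Child pointers are replaced
 * by array indices here for simplicity. */
struct nodes_soa {
    int    *id;     /* critical     -> normal memory */
    int    *r;      /* critical     -> normal memory (index of right child) */
    int    *l;      /* critical     -> normal memory (index of left child) */
    double *score;  /* non-critical -> approximate memory */
};

struct nodes_soa alloc_nodes(size_t n) {
    struct nodes_soa s;
    s.id    = malloc(n * sizeof(int));
    s.r     = malloc(n * sizeof(int));
    s.l     = malloc(n * sizeof(int));
    s.score = approx_malloc(n * sizeof(double)); /* only this array is approximate */
    return s;
}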
The disadvantage of the AoS to SoA conversion is that it can degrade the performance due to an increased number of cache misses. In the code in Figure 5, it is highly possible that all the members of the same struct instance (that is, for any i, nodes[i].id, nodes[i].r, nodes[i].l, and nodes[i].score) share the same cache line. Thus, accessing two or more members of the same struct instance closely in time incurs at most 1 cache miss. However, if we apply the AoS to SoA conversion to the same code, members that are in the same struct instance in the original code do not share the same cache line. This might increase the number of cache misses and degrade the performance depending on the memory access pattern to the data to be converted.

// points to the first node
struct node_t *node = malloc(sizeof(struct node_t) * 1000);
while(/* until some condition is met */) {
    // do something, then traverse the next node
    if (node->score > threshold)
        node = node->l;
    else
        node = node->r;
}

Figure 11: A sample code accessing an AoS. The definition of node_t is the same as Figure 5. Applying the AoS to SoA conversion to it increases the number of cache misses.

For example, the code in Figure 11 decides which child (either right or left) of the current node to traverse next depending on the score of the current node, and its memory access pattern is unpredictable. When the AoS to SoA conversion is applied to this code, node->score is stored in a different cache line from node->l and node->r. Because the memory access pattern is unpredictable, an access to a new cache line incurs a cache miss every single time if the total amount of the data is large enough compared to the cache size. Therefore, applying the AoS to SoA conversion to this code increases the number of cache misses from 1 miss per while(...) iteration to 2 misses per iteration.

The negative performance impact of the AoS to SoA conversion (detailed in Section 5.3) is a serious concern if it cancels or even outweighs the benefit of approximate memory. However, to the best of our knowledge, there is no study on how the AoS to SoA conversion slows down applications, because research has focused on how to speed them up. This section introduces a new methodology to quantitatively analyze the slowdown caused by the AoS to SoA conversion, and shows that it is as large as almost canceling the benefit of approximate memory in the worst case.

In order to quantitatively analyze the slowdown and show its significance, we propose a method to estimate the effect of memory layout changes incurred by the AoS to SoA conversion. The main idea is to use a cycle accurate CPU simulator to run applications and reproduce the memory layout that the AoS to SoA conversion would generate inside the simulator. The use of a cycle accurate simulator has two advantages (more details are described in Section 6.3):

(1) It can quantitatively tell how much slowdown an application experiences due to memory layout conversion. This is important because the significance of the granularity gap problem is determined by how large or small the slowdown is relative to the benefit of approximate memory.
(2) It is more robust than actually applying the conversion because it does not require complex source code analysis.
6.2 Pseudo Conversion by CPU Simulator

Figure 12 shows how our simulator estimates the performance impact by reproducing the memory layout changes:

(1) The source code of the target application is annotated so that it prints the starting addresses and the sizes of memory regions that contain the most cache-unfriendly data structure. For benchmarks written in C, this is done by finding malloc calls whose return values are cast to the pointer type of that data structure. For benchmarks written in C++ that use the standard template libraries (STLs), this is done by replacing the memory allocator of the STLs.
(2) The target application is executed on a vanilla simulator to obtain the starting addresses and the sizes printed by the annotations added in step (1).
(3) The remap info, which decides which members of the struct are stored in distinct arrays, is defined. The remap info contains the size of each struct member and a boolean value that represents whether it is stored into a distinct array (we say that a member is remapped if this value is true).
(4) A simulation is started on our modified simulator with the information obtained in step (2) and step (3) (the starting addresses and sizes of memory regions and the remap info) passed as inputs.
(5) During the simulation, the target addresses of memory access instructions are investigated. If a target address points to a remapped member, it is converted to reproduce the memory layout that the AoS to SoA conversion would generate.

The address conversion is done when the front-end of the CPU inserts requests into the load store queue. The component that converts addresses is illustrated as the address remapper in Figure 12. This is because the border between the front-end and the load store queue is the place right after an accessed address is determined and right before it is used. Inside the front-end, the target address of a memory access instruction might not be ready, for example when the register that contains the address is an operand of a not-yet-committed instruction. Inside the load store queue, the address of a request is used to access the caches first before accessing the main memory. Therefore, we convert an address before it is referenced in the load store queue to maintain the cache consistency.

Three requests are passed from the front-end to the address remapper in Figure 12:

(1) an 8-byte read request,
(2) an 8-byte write request, and
(3) a 4-byte read request.

From the starting address of the memory region that contains the most cache-unfriendly data structure and the remap info, the address remapper can find that the first request reads the member p. Because the remap info specifies that p is not remapped, the request is passed as-is to the load store queue. The second request accesses the member v. Because the remap info specifies that it is remapped, its target address is converted into an unused address (shown in the figure). The third request accesses the member id. Although it is not remapped, its address is shifted by 8 bytes because the previous member v is remapped "away". As a result, the memory layout from the application's point of view is converted into the one shown in the figure: the member v forms a distinct array and the other members are packed as if there were no v in-between.
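The core of the address remapper can be sketched as follows. This is a simplified reconstruction written by us for illustration (it is not the actual simulator patch): given the region's start address, the per-member sizes, and the remap info of step (3), it redirects accesses to remapped members to separate arrays and packs the remaining members.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define MAX_MEMBERS 16

/* Remap info as described in step (3): per-member size and whether the
 * member is stored in a distinct array ("remapped"). array_base gives the
 * unused address range chosen for each remapped member. */
struct remap_info {
    size_t   num_members;
    size_t   size[MAX_MEMBERS];
    bool     remapped[MAX_MEMBERS];
    uint64_t array_base[MAX_MEMBERS];
};

/* Convert the target address of a memory access, reproducing the layout that
 * the AoS-to-SoA conversion would generate. Addresses outside the tracked
 * region are returned unchanged. */
uint64_t remap_address(uint64_t addr, uint64_t region_start, uint64_t region_size,
                       const struct remap_info *ri)
{
    if (addr < region_start || addr >= region_start + region_size)
        return addr;

    size_t struct_size = 0, packed_size = 0;
    for (size_t m = 0; m < ri->num_members; m++) {
        struct_size += ri->size[m];
        if (!ri->remapped[m])
            packed_size += ri->size[m];
    }

    uint64_t off   = addr - region_start;
    uint64_t index = off / struct_size;   /* which struct instance */
    uint64_t inner = off % struct_size;   /* offset inside that instance */

    size_t member_off = 0, packed_off = 0;
    for (size_t m = 0; m < ri->num_members; m++) {
        if (inner < member_off + ri->size[m]) {
            uint64_t delta = inner - member_off;
            if (ri->remapped[m])   /* redirected to its own distinct array */
                return ri->array_base[m] + index * ri->size[m] + delta;
            /* not remapped: packed as if the remapped members were absent */
            return region_start + index * packed_size + packed_off + delta;
        }
        member_off += ri->size[m];
        if (!ri->remapped[m])
            packed_off += ri->size[m];
    }
    return addr; /* unreachable for well-formed remap info */
}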
With regard to the effect of memory layout conversion, there are efforts to estimate how it speeds up applications [32, 47, 49] by investigating their memory access traces, without applying the memory layout changes themselves. They measure the access frequencies of struct members and the access affinities between them from a memory trace of the unmodified source code. Given these metrics, they suggest which members should be placed closer in memory and which members should be separated into a different memory region. However, we cannot directly leverage this method for our purpose because it only "suggests" better memory layouts but does not quantitatively estimate the performance impact of the suggested layouts. A difficulty when it is used for our purpose is that memory layout conversion has two effects of opposite directions: (1) slowdown caused by an increased number of cache misses due to the separation of members with strong affinities, and (2) speedup caused by the decreased size of members that are not separated into distinct arrays. For example, the size of the arc data structure in mcf is 72 bytes (9 members × 8 bytes each).

Earlier gcc versions supported structure reordering, which reorders the members of a C struct and requires the same type of analysis. However, this feature was removed because it "did not always work correctly" [15]. In contrast, because our method converts memory addresses inside a simulator at runtime, there is no difficulty in finding the address that a pointer contains.

A disadvantage of our method is that a cycle accurate simulation is needed for every single conversion pattern. This is not always possible because the number of memory layout conversion patterns increases exponentially with the number of members in the most cache-unfriendly data structure. On the other hand, if we could somehow estimate the slowdown only from access frequencies and affinities, we could estimate the slowdown of all conversion patterns at once, because the access frequencies and affinities can be obtained by one execution of a non-modified application.

Table 4 shows the simulated environment. The "Mem Ctrl Latency" row shows the length of time between the point when the CPU sends a request to the memory controller and the point when it receives the response. The memory access latency from the software point of view additionally contains the time it takes to miss the caches, which is 7.3 ns (= (2 + 20) cycles × 1/3 ns per cycle) and makes up a total of 82.3 ns.

Figure 12: Simulation framework to estimate the negative performance impact of the AoS to SoA conversion.

Table 4: Simulated Environment

ISA              | x86_64
Frequency        | 3 GHz
Issue Width      | 8
Reorder Buffer   | 192 entries
L1 cache         | 32 KB, 2 way, 32 MSHRs, 2 cycles/miss
L2 cache         | 2 MB, 8 way, 32 MSHRs, 20 cycles/miss
Mem Ctrl Latency | 75 ns

We use version 20.0.0.0 (the latest version as of May 2020) of gem5 and its SE mode. Besides simulating instructions, this mode emulates system calls by replacing them with calls to normal functions defined in the gem5 source code. It allows easy simulation because there is no need to run an entire OS on the simulator, and there should be no noticeable impact on the results as non OS-intensive workloads have few system call invocations. The benchmark binaries are compiled by gcc 8.3.0 (Debian 8.3.0-6).

For the evaluation, we first skip the initial phase of each benchmark using a simulation mode that only emulates a CPU using the AtomicSimpleCPU model. After the initialization phase, we simulate a fixed number of ticks using a mode that simulates an out-of-order CPU using the DerivO3CPU model. A tick is the notion of time in gem5 and it is 1/1000 ns in our configuration. We simulate 200 billion ticks after the initialization phase, which is equal to 0.2 seconds in the simulated world.
The initialization phase of each benchmark is determined by investigating the source code.

We evaluate three benchmarks from SPEC CPU 2017, namely mcf_r, deepsjeng_r, and namd_r. For each benchmark, we test every possible memory layout conversion pattern and compare the performance. Let N be the number of members in the most cache-unfriendly data structure in each benchmark; we test all 2^(N-1) cases of remapping. Note that remapping a given M members (and not remapping the rest) is equivalent to not remapping the M members (and remapping the rest) with regard to the memory layout. We exclude blender_r and omnetpp_r although their most cache-unfriendly data structures are C++ classes. This is because (1) the source code of blender_r does not have a clear separation between the initialization phase and the main computation, and (2) omnetpp_r has 21 members in its most cache-unfriendly data structure (sVector) and it is not possible to test 2^20 possibilities. The latter stems from a disadvantage of our method that we must conduct time-consuming simulations for different memory layout conversion patterns, as we describe in Section 6.1.

Figure 13: Evaluation result for mcf_r
Figure 14: Evaluation result for deepsjeng_r
Figure 15: Evaluation result for namd_r

Figure 13, Figure 14, and Figure 15 show the evaluation results of performance degradation for mcf_r, deepsjeng_r, and namd_r. Each bar corresponds to a memory layout conversion pattern and each graph has 2^(N-1) + 1 bars, where N is the number of members of the most cache-unfriendly data structure. The right-most bar shows the average of all patterns. The y values show the number of executed micro operations during simulation normalized to the value when memory layout conversion is not applied. The bars are ordered by their y values. Because we simulate a fixed number of ticks, lower bars show a larger negative performance impact.

mcf_r: The most cache-unfriendly data structure (arc) has 9 members and there are 2^8 = 256 memory layout conversion patterns. Among them, 229 patterns yield worse performance than the no conversion case (the y values are smaller than 1). The lowest performance is observed when the first three members are remapped to form distinct arrays, and its performance is 8.13 % slower than the no conversion case. The 9 members are all 8 bytes in size and thus this pattern makes the size of the non-remapped part 48 bytes. The average negative performance impact is 3.81 %.

deepsjeng_r: The most cache-unfriendly data structure (ttentry_t) "wraps" an array of length 4, and each member of the array is another C struct whose number of members is 5 (see Figure 16 for an illustration). We apply the same remapping policy to the same members of the inner data structure, resulting in 2^4 = 16 memory layout conversion patterns (if ttentry_t.array[0].d1 is remapped, ttentry_t.array[i].d1 is remapped for any i ∈ {1, 2, 3}). The lowest performance is observed when the first and the fourth members (d1 and d4 in Figure 16) are remapped, resulting in a 2.90 % slowdown.

struct inner_struct {
    type1 d1;
    type2 d2;
    type3 d3;
    type4 d4;
    type5 d5;
};
struct ttentry_t {
    struct inner_struct array[4];
};

Figure 16: The most cache-unfriendly data structure (ttentry_t) of deepsjeng_r. The details of inner_struct are not shown because SPEC CPU is non-free software.

namd_r: The most cache-unfriendly data structure (CompAtom) has 7 members and there are 2^6 = 64 memory layout conversion patterns. The negative performance impact to this application is negligible (0.1 % in the worst case).
On average, the performance is even improved by 0.4 %.

The negative performance impact of the memory layout conversion to avoid the granularity gap problem is not negligible compared to the benefit of approximate memory. For example, Kim et al. [21] report that the average speedup of SPEC CPU 2006 benchmarks when the timing constraints are violated is around 4 - 5 % (Figure 8 of [21]). Note that their system, Solar-DRAM, does not reduce the latency to the extent that bit-flips are visible to the applications. Even if we assume that the performance gain by approximate memory (allowing bit-flips to be visible to the application) is twice as large as Solar-DRAM, it is almost canceled in the worst case by the performance overhead due to the granularity gap problem (8 - 10 % speedup vs. 8.13 % slowdown). Another study by Tovletoglou et al. [42] reports that their system can save up to around 12.5 % of overall (CPU + memory) energy consumption for mcf in SPEC CPU 2006 (the bar labeled as "429" in Figure 7 (c) of [42]) by approximate memory (prolonging tREF). Prolonging tREF reduces not only the memory access latency but also the energy consumption of DRAM [13], which is another benefit of approximate memory besides performance. If we assume that the negative performance impact of memory layout conversion on mcf is similar to the one on mcf_r (they are quite similar, and their most cache-unfriendly data structures are the same), the 12.5 % gain is reduced by a non-negligible amount, because we observe up to 8.13 % performance overhead from memory layout conversion. Therefore, we conclude that the granularity gap problem is a significant issue and the research community needs efforts to solve it with low overhead, to expand the benefit of approximate memory to a wider range of applications.

7 RELATED WORK

To the best of our knowledge, we are the first to study the granularity gap problem. One of the reasons is that it is not relevant when we consider storing only large arrays of numbers, such as the weight matrices of a neural network, to approximate memory. However, as we point out in this paper, it is a significant problem for many realistic applications. Esmaeilzadeh et al. [14] mention this problem briefly, but they provide no further investigation.

Nguyen et al. [33] propose a method that partially mitigates the granularity gap problem. It transposes rows and columns of the data layout inside DRAM so that a chunk of data is stored across many rows that have different error rates. This enables protection of important bits (e.g., the sign bit of a floating point number) while aggressively approximating less important bits. This mechanism is effective for DNNs because they require the whole of a large weight matrix at once and the number of memory accesses does not increase regardless of the data layout. However, it is not effective in general cases where memory is accessed with smaller granularity.

Mapping data into memory regions with different error rates depending on its criticality is commonly proposed. Liu et al. [28] partition a DRAM bank into bins with a proper refresh interval and ones with a prolonged refresh interval. Each piece of data is stored into either type of bin depending on the criticality specified by the programmer. Although they do not discuss the minimum bin size, it cannot be smaller than a DRAM row as we discuss in this paper. Chen et al. [9] propose a memory controller that maps data into different DRAM banks with different error rates depending on the criticality of the data. Because this method is bank-based, the approximation granularity is limited to the bank size.
A typical DDR3/DDR4 DIMM module has 2 GB to 16 GB with either 8 or 16 banks, resulting in a typical bank size of 256 MB to 2 GB. Raha et al. [37] advance a previous work [28] by measuring each bin's error rate at a given prolonged refresh interval and assigning the bins to approximate data in ascending order of the error rate. They realize a bin size (or "page size" in their terminology) of 1 KB by measuring the average error rate per 1 KB. Although this approach could be further pursued to realize smaller page sizes, it still cannot control error rates per byte, as it just measures the rates and uses appropriate pages.

Our previous work [2] investigates the source code of SPEC CPU 2006 benchmarks and shows that there are many applications that potentially suffer from the granularity gap problem. Besides adding more data to further strengthen the source code analysis, the novelty of this paper is that we quantitatively analyze the slowdown caused by the granularity gap problem and show experimental results on several benchmarks to further establish its significance.

8 CONCLUSION

In this paper, we investigated the granularity gap problem of approximate memory. The problem arises due to the difference between the approximation granularity and the granularity of data criticality of realistic applications. Because the former is as large as a few kilobytes in realistic DRAM modules and the latter is often a few bytes, we cannot map the data of these applications directly on approximate memory. We analyzed the source code of SPEC CPU 2006 and 2017 benchmarks and found that 14 out of 22 benchmarks potentially suffer from this problem. In addition, we pointed out the applicability of a memory layout conversion technique to this problem and its negative performance impact. We proposed a simulation framework to quantitatively analyze the negative performance impact of this technique, and found that the performance can be degraded by up to 8.13 % in our tested cases. We conclude that the granularity gap problem is a significant issue and it requires more attention from the research community.

ACKNOWLEDGMENTS

This work was supported by JST, ACT-I Grant Number JPMJPR18U1, Japan.

REFERENCES

[1] Soramichi Akiyama. 2019. A Lightweight Method to Evaluate Effect of Approximate Memory with Hardware Performance Monitors. IEICE Transactions on Information and Systems E102-D, 12 (Dec. 2019), 2354–2365.
[2] Soramichi Akiyama. 2020. Assessing Impact of Data Partitioning for Approximate Memory in C/C++ Code. In The 10th Workshop on Systems for Post-Moore Architectures (SPMA). 1–7.
[3] Rizwan A. Ashraf, Roberto Gioiosa, Gokcen Kestor, Ronald F. DeMara, Chen-Yong Cher, and Pradip Bose. 2015. Understanding the Propagation of Transient Errors in HPC Applications. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 72:1–72:12.
[4] JEDEC SOLID STATE TECHNOLOGY ASSOCIATION. 2010. JEDEC STANDARD: DDR3 SDRAM Standard. JESD79-3F.
[5] JEDEC SOLID STATE TECHNOLOGY ASSOCIATION. 2013. JEDEC STANDARD: DDR4 SDRAM. JESD79-B.
[6] Denis Bakhvalov. 2018. Advanced profiling topics. PEBS and LBR. https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR.
[7] Nandhini Chandramoorthy, Karthik Swaminathan, Martin Cochet, Arun Paidimarri, Schuyler Eldridge, Raji V. Joshi, Matthew M. Ziegler, Alper Buyuktosunoglu, and Pradip Bose. 2019. Resilient Low Voltage Accelerators for High Energy Efficiency. In International Symposium on High Performance Computer Architecture (HPCA). 147–158.
[8] Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. 2016. Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization. In International Conference on Measurement and Modeling of Computer Science (SIGMETRICS). 323–336.
[9] Yuanchang Chen, Xinghua Yang, Fei Qiao, Jie Han, Qi Wei, and Huazhong Yang. 2016. A Multi-accuracy Level Approximate Memory Architecture Based on Data Significance Analysis. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 385–390.
[10] Zitao Chen, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben. 2019. BinFI: An Efficient Fault Injector for Safety-Critical Machine Learning Systems. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 69:1–69:23.
[11] J. Choi, W. Shin, J. Jang, J. Suh, Y. Kwon, Y. Moon, and L. Kim. 2015. Multiple Clone Row DRAM: A Low Latency and Area Optimized DRAM. In International Symposium on Computer Architecture (ISCA). 223–234.
[12] Stephen Curial, Peng Zhao, Jose Nelson Amaral, Yaoqing Gao, Shimin Cui, Raul Silvera, and Roch Archambault. 2008. MPADS: Memory-Pooling-Assisted Data Splitting. In International Symposium on Memory Management (ISMM). 101–110.
[13] Anup Das, Hasan Hassan, and Onur Mutlu. 2018. VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency. In Design Automation Conference (DAC). 1–6.
[14] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Architecture Support for Disciplined Approximate Programming. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 301–312.
[15] Free Software Foundation, Inc. 2019. GCC 4.8 Release Series Changes, New Features, and Fixes. https://gcc.gnu.org/gcc-4.8/changes.html.
[16] Kamyar Givaki, Behzad Salami, Reza Hojabr, S. M. Reza Tayaranian, Ahmad Khonsari, Dara Rahmati, Saeid Gorgin, Adrian Cristal, and Osman S. Unsal. 2020. On the Resilience of Deep Learning for Reduced-voltage FPGAs. In International Conference on Parallel, Distributed and Network-Based Processing (PDP). 110–117.
[17] Qing Guo, Karin Strauss, Luis Ceze, and Henrique S. Malvar. 2016. High-Density Image Storage Using Approximate Memory Cells. SIGARCH Comput. Archit. News 44, 2 (March 2016), 413–426.
[18] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and O. Mutlu. 2016. ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality. In International Symposium on High Performance Computer Architecture (HPCA). 581–593.
[19] Intel. 2018. Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide.
[20] Bruce Jacob, Spencer Ng, and David Wang. 2007. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[21] Jeremie S. Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu. 2018. Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines. In IEEE International Conference on Computer Design (ICCD). 282–291.
[22] Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, and Onur Mutlu. 2019. D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput. In International Symposium on High Performance Computer Architecture (HPCA). 582–595.
[23] Skanda Koppula, Lois Orosa, A. Giray Yağlıkçı, Roknoddin Azizi, Taha Shahroodi, Konstantinos Kanellopoulos, and Onur Mutlu. 2019. EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM. In International Symposium on Microarchitecture (Micro). 166–181.
[24] Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and Onur Mutlu. 2017. Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms. Proceedings of the ACM on Measurement and Analysis of Computing Systems, Article 26 (June 2017), 36 pages.
[25] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu. 2015. Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case. In International Symposium on High Performance Computer Architecture (HPCA). 489–501.
[26] Y. Lee, H. Kim, S. Hong, and S. Kim. 2017. Partial Row Activation for Low-Power DRAM System. In International Symposium on High Performance Computer Architecture (HPCA). 217–228.
[27] Jin Lin and Pen-Chung Yew. 2010. A Compiler Framework for General Memory Layout Optimizations Targeting Structures. In Workshop on Interaction between Compilers and Computer Architecture (INTERACT). 8:1–8:8.
[28] Song Liu, Karthik Pattabiraman, Thomas Moscibroda, and Benjamin G. Zorn. 2011. Flikker: Saving DRAM Refresh-power Through Critical Data Partitioning. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 213–224.
[29] Abdulrahman Mahmoud, Siva Kumar Sastry Hari, Christopher W. Fletcher, Sarita V. Adve, Charbel Sakr, Naresh Shanbhag, Pavlo Molchanov, Michael B. Sullivan, Timothy Tsai, and Stephen W. Keckler. 2020. HarDNN: Feature Map Vulnerability Evaluation in CNNs. arXiv:2002.09786. 1–14.
[30] Abdulrahman Mahmoud, Radha Venkatagiri, Khalique Ahmed, Sasa Misailovic, Darko Marinov, Christopher W. Fletcher, and Sarita V. Adve. 2019. Minotaur: Adapting Software Testing Techniques for Hardware Errors. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[33] Duy Thanh Nguyen, Nguyen Huy Hung, Hyun Kim, and Hyuk-Jae Lee. 2020. An Approximate Memory Architecture for Energy Saving in Deep Learning Applications. IEEE Trans. on Circuits and Systems I: Regular Papers (2020), 1–14.
[34] Bin Nie, Lishan Yang, Adwait Jog, and Evgenia Smirni. 2018. Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications. In International Symposium on Microarchitecture (Micro). 750–762.
[35] Yoshio Nishi and Blanka Magyari-Kope. 2019. Advances in Non-volatile Memory and Storage Technology, 2nd Edition. Woodhead Publishing.
[36] Erez Petrank and Dror Rawitz. 2002. The Hardness of Cache Conscious Data Placement. In Symposium on Principles of Programming Languages (POPL). 101–112.
[37] Arnab Raha, Soubhagya Sutar, Hrishikesh Jayakumar, and Vijay Raghunathan. 2017. Quality Configurable Approximate DRAM. IEEE Trans. Comput.
[41] Bjarne Steensgaard. 1996. Points-to Analysis in Almost Linear Time. In Symposium on Principles of Programming Languages (POPL). 32–41.
[42] Konstantinos Tovletoglou, Lev Mukhanov, Dimitrios S. Nikolopoulos, and Georgios Karakonstantis. 2020. HaRMony: Heterogeneous-Reliability Memory and QoS-Aware Energy Management on Virtualized Servers. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 575–590.
[43] Yaohua Wang, Arash Tavakkol, Lois Orosa, Saugata Ghose, Nika Mansouri Ghiasi, Minesh Patel, Jeremie S. Kim, Hasan Hassan, Mohammad Sadrosadati, and Onur Mutlu. 2018. Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration. In IEEE/ACM International Symposium on Microarchitecture (Micro). 298–311.
[44] Vincent M. Weaver. 2016. Advanced Hardware Profiling and Sampling (PEBS, IBS, etc.): Creating a New PAPI Sampling Interface. Technical Report UMAINE-VMW-TR-PEBS-IBS-SAMPLING-2016-08. University of Maine.
[45] J. Wei, A. Thomas, G. Li, and K. Pattabiraman. 2014. Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults. In International Conference on Dependable Systems and Networks (DSN). 375–382.
[46] Lita Yang and Boris Murmann. 2017. Approximate SRAM for Energy-Efficient, Privacy-Preserving Convolutional Neural Networks. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 689–694.
[47] Louis Ye, Mieszko Lis, and Alexandra Fedorova. 2019. A Unifying Abstraction for Data Structure Splicing. In International Symposium on Memory Systems (MEMSYS). 173–183.
[48] X. Zhang, Y. Zhang, B. R. Childers, and J. Yang. 2016. Restore Truncation for Performance Improvement in Future DRAM Systems. In International Symposium on High Performance Computer Architecture (HPCA). 543–554.
[49] Peng Zhao, Shimin Cui, Yaoqing Gao, Raúl Silvera, and José Nelson Amaral. 2007. Forma: A Framework for Safe Automatic Array Reshaping.