TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and Misses in Hybrid Memory Systems
Vinson Young†, Zeshan Chishti‡, and Moinuddin K. Qureshi†
†Georgia Institute of Technology ‡Intel Corporation
{vyoung,moin}@gatech.edu, [email protected]
ABSTRACT
This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is becoming a viable alternative to DRAM as it enables high-capacity and non-volatile main memory systems. However, 3D-XPoint has several characteristics that limit it from outright replacing DRAM: 4-8x slower reads, and even worse writes. As such, effective DRAM caching in front of 3D-XPoint is important to enable a high-capacity, low-latency, and high-write-bandwidth memory. There are currently two major approaches for DRAM cache design: (1) a Tag-Inside-Cacheline (TIC) organization that optimizes for hits, by storing the tag next to each line such that one access gets both tag and data, and (2) a Tag-Outside-Cacheline (TOC) organization that optimizes for misses, by storing tags from multiple data lines together in a tag-line such that one access to a tag-line gets information on several data-lines. Ideally, we would like to have the low hit-latency of TIC designs, and the low miss-bandwidth of TOC designs. To this end, we propose a TicToc organization that provisions both TIC and TOC to get the hit and miss benefits of both.

We find that naively combining both techniques actually performs worse than TIC individually, because one has to pay the bandwidth cost of maintaining both metadata. The main contribution of this work is developing architectural techniques to reduce the bandwidth cost of accessing and maintaining both TIC and TOC metadata. We find that most of the update bandwidth is due to maintaining the TOC dirty information. We propose a DRAM Cache Dirtiness Bit technique that carries DRAM cache dirty information to last-level caches, to help prune repeated dirty-bit updates for known dirty lines. We also propose a Preemptive Dirty Marking (PDM) technique that predicts which lines will be written and proactively marks the dirty bit at install time, to help avoid the initial dirty-bit update for dirty lines. To support PDM, we develop a novel PC-based Write Predictor to aid in marking only write-likely lines. Our evaluations on a 4GB DRAM cache in front of 3D-XPoint show that our TicToc organization enables 10% speedup over the baseline TIC, nearing the 14% speedup possible with an idealized DRAM cache design with 64MB of SRAM tags, while needing only 34KB SRAM.
1. INTRODUCTION
As memory systems scale, non-volatile memories or NVMs (such as 3D-XPoint [1]) are emerging as viable alternatives to DRAM. NVMs offer the advantages of higher bit density and the ability to retain data after power outages. However, NVMs also have significant limitations that prevent them from outright replacing DRAM in the memory hierarchy. For example, 3D-XPoint is reported to have 4-8x slower reads, and even slower writes, compared to DRAM [2]. As such, future systems are likely to utilize hybrid memory systems [3, 4, 5, 6] consisting of both DRAM and 3D-XPoint. We focus on the setup where DRAM is operated as a hardware-managed cache for 3D-XPoint based main memory, since such a setup enables applications to benefit from the lower latency and higher write-bandwidth of DRAM and the higher capacity of 3D-XPoint without relying on any software or OS support.

Recently, there have been many works [7, 8, 9, 10, 11, 12, 13] on architecting High Bandwidth Memory (HBM) [14] caches in front of traditional DRAM main memory [15]. These works target improving memory bandwidth by migrating data between DRAM and HBM, and servicing most data at the higher internal/bus bandwidth of HBM. These works are effective due to HBM having dedicated higher-bandwidth channels/interfaces compared to commodity DDRx DRAM. We would like to utilize the insights learned from these works to design effective DRAM caches in front of NVMs, such as 3D-XPoint. However, we note that there are significant differences in setup and goals for a DRAM+3D-XPoint hybrid memory as compared to an HBM+DDRx hybrid memory.

First, in a 3D-XPoint based hybrid memory, 3D-XPoint and DRAM will be sharing the same channel interfaces [16]. Second, DRAM caches in front of 3D-XPoint target reducing read latency and improving write bandwidth and endurance of 3D-XPoint, by servicing most data at the lower latency and higher write bandwidth of DRAM. An added complexity is that the DRAM cache and 3D-XPoint are likely to sit behind the same channel [17], as depicted in Figure 1(a). Such a set-up enables a balanced configuration where every channel has DRAM backing it. However, in such a channel-sharing set-up, bandwidth needed for maintaining DRAM cache state now comes directly at a cost to bus bandwidth available for memory. As such, there is a renewed need for bandwidth-efficient DRAM caches. We analyze prior DRAM caching approaches, highlight cases of bandwidth-inefficiency, and rigorously target the remaining bandwidth overheads to develop a bandwidth-efficient DRAM cache suitable for DRAM + 3D-XPoint systems.

Figure 1: (a) Channel-Sharing Hybrid Memory, and (b) Performance of hit-optimized Tag-Inside-Cacheline (TIC) [7], miss-optimized Tag-Outside-Cacheline (TOC) [8], and idealized Tag-In-SRAM, normalized to TIC.

We start with a baseline hit-optimized Tag-Inside-Cacheline (TIC) DRAM cache design [7, 11, 18]. A TIC design organizes its DRAM cache as a direct-mapped cache with tags stored inside each cacheline, such that one access can retrieve both tag and data. TIC has good hit-latency, as it can service cache hits in one DRAM access. However, TIC incurs bandwidth overhead on cache misses as it needs to probe the tag in DRAM in order to determine a miss. This approach of trading miss-bandwidth for hit-latency has been proven effective in situations where the cache has its own dedicated access channel, such as the HBM+DRAM hybrid memory in Intel's Knights Landing [11]. However, in a channel-sharing setup, the miss probe bandwidth directly consumes available main memory bandwidth, resulting in bandwidth inefficiency. As we show in Figure 1(b), there is a 14% performance gap between TIC and an idealized Tag-In-SRAM approach.
An alternative approach to DRAM cache design is a miss-optimized Tag-Outside-Cacheline (TOC) design [8, 12, 13]. A TOC design stores tags of multiple cachelines together in a tag-only-line, such that one access to a tag-line can obtain information for multiple cachelines at once. We can bring in these bundles of tags as needed, and cache them in a small tag cache (e.g., 32KB SRAM) [8]. If the tag cache has a high hit-rate, TOC can service most hits with one DRAM access, and misses to clean lines without a DRAM access. However, if the tag cache has a low hit-rate, TOC may need two accesses to service a hit, and one access to service a miss. As such, TOC consumes lower bandwidth on misses than TIC; however, it consumes higher bandwidth on hits due to separate tag and data reads. Overall, as shown in Figure 1(b), the TOC approach performs worse than TIC due to bandwidth overheads.

We notice that TIC is good for hits, while TOC is good for misses – one can perhaps combine both approaches to get both good hit and miss bandwidth. Fortunately, it is cheap to provision both metadata at once: TIC uses spare ECC bits [11], and TOC needs to dedicate only 1.5% of DRAM cache capacity to store metadata and a 32KB SRAM for a metadata cache [8]. To decide when to use TOC or TIC, one can employ a hit/miss predictor [7] that uses TIC for likely hits and TOC for likely misses. We call this proposal that provisions both TIC and TOC metadata TicToc. Unfortunately, we find that naively combining TIC and TOC actually leads to worse performance than TIC by itself. This is because maintaining and updating TOC metadata bits consumes significant DRAM bandwidth. In order for TicToc to be effective, we need ways to reduce TOC maintenance bandwidth.

TOC incurs bandwidth overheads in the following three cases: (i) tag-check on hits, (ii) tag-update on installs, and (iii) dirty-bit-update on writebacks. The hit overhead is easily mitigated by additionally storing TIC metadata in TicToc. Tag updates are generally inexpensive because they occur at miss time, and miss traffic usually has good spatial locality and therefore a high metadata-cache hit-rate. Dirty-bit updates, however, remain costly because they are carried out when dirty lines are evicted from an earlier level of cache. Such evictions have poor access locality and therefore low metadata-cache hit-rates. Hence, we identify dirty-bit updates as the most significant bandwidth overhead for TicToc.

To reduce dirty data tracking costs for TOC, we target the following two cases: the initial write to a cache line, and repeated writes to the same cache line. For repeated writes, we propose to store a DRAM Cache Dirtiness bit alongside the line in an earlier level of cache, to track the current dirty status of the line in the DRAM cache. On a writeback to the DRAM cache, we only need to update the TOC if the line in the DRAM cache has changed from clean to dirty. However, many workloads write to lines only once. For such workloads, we propose Preemptive Dirty Marking, which predicts likely-to-be-written cache lines and proactively marks those lines as dirty in the TOC at install time. This avoids needing to update dirty information at eviction time, thereby avoiding metadata-cache misses. We develop a PC-based Write Predictor that is 92% accurate for our Preemptive Dirty Marking.

Even after solving for hit and miss bandwidth, when data has poor reuse, installing lines and updating the TOC tag can become a major source of bandwidth overhead. To mitigate that problem, we develop a Write-Aware Bypassing technique that reduces install and tag-update bandwidth, without increasing writes to write-constrained 3D-XPoint.

Overall, our paper makes the following contributions:
Contribution-1:
This paper evaluates and rigorously targets the bandwidth overheads of prior DRAM-cache organizations. We find that we can combine two tag-storage methods with a TicToc organization to obtain both a good hit path and a good miss path. However, such an approach suffers significant bandwidth cost to maintain TOC dirty information on writes.
Contribution-2:
We develop two techniques to reduce the cost of tracking dirty information. DRAM Cache Dirtiness Bit targets reducing the cost of dirty-bit updates for repeated writes to the same location, via maintaining DRAM cache dirty information alongside the line in an earlier level of cache. And Preemptive Dirty Marking targets reducing the cost of the initial dirty-bit update to a location, via predicting which lines are likely to be written to (with our Signature-based Write Predictor) and preemptively setting the dirty-bit.
Contribution-3:
To reduce install bandwidth while not increasing 3D-XPoint write traffic, we develop a Write-Aware Bypass technique. This technique bypasses most clean lines by default to save install bandwidth. And, it installs most dirty and predicted write-likely lines (to amortize metadata updates) to buffer writes to write-constrained 3D-XPoint.

Overall, our proposed TicToc organization enables 10% speedup over the TIC baseline, nearing the 14% speedup of an idealized Tag-In-SRAM approach, while needing significantly less SRAM storage (34 KB vs. 64 MB).
Figure 2: DRAM cache organization and flow for (a) idealized Tag-In-SRAM, (b) hit-latency-optimized Tag-Inside-Cacheline (TIC) [7], and (c) miss-bandwidth-optimized Tag-Outside-Cacheline (TOC) [8].
2. BACKGROUND AND MOTIVATION
DRAM caches are important for enabling heterogeneous memory systems to have the effective latency and bandwidth of one memory technology, and the capacity of another; however, there are several challenges in designing DRAM caches. A DRAM cache design has to balance multiple goals. First, it should minimize the SRAM storage needed for DRAM cache maintenance. Second, it should minimize cache hit latency. Third, it should minimize miss latency. Fourth, it should provide a high hit-rate. Lastly, it should try to minimize total bandwidth costs for maintaining DRAM cache state.

It is desirable to organize DRAM caches at the granularity of a cache line to efficiently utilize cache capacity, and to minimize the consumption of main memory bandwidth [10]. A key challenge in designing a large line-granularity cache is deciding where to store the tag and dirty-bit metadata. For a moderately-sized 4GB DRAM cache with 64B lines, there would be 64 million lines. Even if each line's metadata required only 8 bits (6 tag, 1 dirty, 1 valid bit), this would result in 64MB of storage for metadata. Next, we discuss the various options for DRAM cache metadata management, and their implications for SRAM storage cost and bandwidth consumption.
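The metadata-storage arithmetic behind this figure is simply:

```latex
\[
\frac{4\,\mathrm{GB}}{64\,\mathrm{B/line}} = 2^{26} \approx 64\,\mathrm{M\ lines},
\qquad
2^{26}\,\mathrm{lines} \times 8\,\mathrm{bits/line} = 64\,\mathrm{MB\ of\ metadata}.
\]
```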
Table 1: Bandwidth of DRAM Cache Organizations – ρ denotes Metadata-Cache Miss Probability

Organization          SRAM      TIC      TOC
(SRAM Cost)           (>20MB)   (<1KB)   (~32KB)
Hit                   1         1        1 + ρ
Miss + Evict-Clean    0         1        ρ
Miss + Evict-Dirty    1         1        1 + ρ

A costly method to design high-performance DRAM caches is to simply maintain all of the tag and dirty bits in on-chip SRAM, and query the on-chip SRAM metadata to determine hit or miss, in a Tag-In-SRAM approach, shown in Figure 2(a). Assuming 1-byte metadata per cache line, such an approach would require 64MB of SRAM storage for a 4GB cache (>20MB with sectoring [10, 19]). Table 1 shows the DRAM bandwidth consumption for such an approach. SRAM metadata is queried first to determine hit or miss. A hit can be serviced with one DRAM access to data. A miss can be serviced without a DRAM access to data. However, in preparation for installing the newly accessed line, the cache would need to perform an eviction of the resident line. If the resident line were clean, the location could be directly overwritten. However, if the resident line were dirty, the resident line would need to be read before writeback to memory. Hence, a miss with eviction of a clean line costs 0 bandwidth, and a miss with eviction of a dirty line costs 1 bandwidth. Such a design represents the minimum DRAM bandwidth needed for DRAM cache maintenance, and an upper-bound for performance. We aim to achieve Tag-In-SRAM performance at low SRAM cost.
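To make Table 1 concrete, the following minimal sketch computes the expected DRAM accesses per request for each organization; the function and the illustrative access-mix parameters are ours, not part of the paper's evaluation.

```python
# Expected DRAM accesses per request, following Table 1.
# rho: metadata-cache miss probability (only the TOC path pays it).

def expected_accesses(org, hit, miss_clean, miss_dirty, rho=0.0):
    """Average DRAM accesses per request for one organization."""
    cost = {
        # (hit, miss + evict-clean, miss + evict-dirty)
        "SRAM": (1, 0, 1),               # metadata queried in SRAM for free
        "TIC":  (1, 1, 1),               # misses pay a probe that also fetches data
        "TOC":  (1 + rho, rho, 1 + rho), # rho extra accesses to fetch metadata
    }[org]
    return hit * cost[0] + miss_clean * cost[1] + miss_dirty * cost[2]

# Assumed mix: 70% hits, 20% clean misses, 10% dirty misses, rho = 0.3.
for org in ("SRAM", "TIC", "TOC"):
    print(org, expected_accesses(org, 0.7, 0.2, 0.1, rho=0.3))
```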
To reduce SRAM storage costs, one could store tags inside each line in DRAM [7, 11, 18] in a Tag-Inside-Cacheline (TIC) approach, shown in Figure 2(b). TIC optimizes for hit-latency by using a direct-mapped design and storing the tag inside each data-line such that one access can retrieve both tag and data. The direct-mapped organization enables the controller to know which location to access, without waiting for tags.

Table 1 shows the bandwidth of such an approach. Hits are serviced with one DRAM access that retrieves both tag and data: in case of a tag match, the attached data can be used to service the request. However, misses also need to access the tag in DRAM. As such, TIC is effective for hit-latency, but consumes extra bandwidth on misses. This approach of trading miss-bandwidth for hit-latency has been proven effective in commercial products [11], and, as such, we use the TIC organization [7] as our baseline.
Setup:
We store metadata alongside data in unused ECC bits, similar to Intel's Knights Landing [11]. TIC additionally employs a small hit-miss predictor to guide whether to access cache+memory in a parallel or serial manner (needs <1KB SRAM storage overhead). We additionally include bandwidth-reducing enhancements from Chou et al. [18], such as DCP to reduce writeback probes.
Another option with reduced SRAM storage costs is to store metadata lines in a separate area of DRAM and bring them in as needed, in a Tag-Outside-Cacheline (TOC) approach [8, 12, 13], shown in Figure 2(c). To determine hit or miss, TOC first accesses a metadata line to get tag+dirty information for the requested data line, then routes the request appropriately to the DRAM cache or to memory. Of note, each of these metadata lines actually stores tag+dirty information for several adjacent data lines. An enhanced design [8] proposes to cache the metadata lines in a small metadata cache to avoid repeated accesses to the same metadata line, which would amortize metadata lookup if there is spatial locality. Table 1 shows the bandwidth consumption of such an approach. In case of a metadata-cache hit, TOC performs similar to idealized Tag-In-SRAM. For a metadata-cache miss, TOC spends additional bandwidth to access the metadata. Overall, TOC has the potential for reducing miss bandwidth, but it can suffer from significant bandwidth overhead when the metadata-cache has a poor hit rate (due to poor spatial locality).
Setup:
We assume 1-byte metadata (6 tag, 1 dirty, 1 valid bits), with 64 tags stored in each metadata line. The metadata are stored in a separate part of DRAM, consisting of 64MB out of the 4GB DRAM capacity. Recently accessed metadata are stored in a 512-entry metadata cache, which requires 32KB of SRAM. Note that the metadata cache is sized to capture only spatial locality and not the working set of the DRAM cache, which would need megabytes of SRAM.
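Under the stated parameters (64B lines, direct-mapped 4GB cache, 64 one-byte entries per metadata line), the mapping from a physical address to its TOC metadata location could look like the sketch below; the exact address arithmetic is our assumption, not specified by the paper.

```python
LINE_SIZE = 64                        # bytes per data line / metadata line
ENTRIES_PER_MD_LINE = 64              # 64 one-byte entries packed per metadata line
NUM_SETS = (4 << 30) // LINE_SIZE     # 2^26 sets in the direct-mapped 4GB cache

def toc_location(phys_addr):
    """Return (metadata_line_index, byte_offset) holding this line's TOC entry."""
    set_idx = (phys_addr // LINE_SIZE) % NUM_SETS
    return set_idx // ENTRIES_PER_MD_LINE, set_idx % ENTRIES_PER_MD_LINE

# 2^26 sets / 64 entries = 2^20 metadata lines x 64B = 64MB, matching the
# 64MB-of-4GB figure above. 64 consecutive sets share one metadata line,
# which is why spatially-local misses amortize a single metadata fetch.
md_line, offset = toc_location(0x1234_5678)
```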
Optimizing for Latency:
In the case of a metadata miss in the metadata-cache, we want to avoid the latency of serialized tag + cache-data access, as well as the latency of serialized cache-data + memory access. We employ a direct-mapped organization and a hit-miss predictor [7] for latency and bandwidth considerations. If a hit is predicted, we access tag + cache-data in parallel to save latency (the direct-mapped organization dictates only one possible location for data), and serially access memory only if the prediction is wrong, to save memory bandwidth. If a miss is predicted, we access tag + memory in parallel for latency, and serially access cache-data only if the prediction is wrong, to save DRAM cache bandwidth.
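The control flow below sketches this predictor-guided policy; the callables stand in for the memory controller's request paths (the parallel issue is implicit in this sequential sketch), and the naming is ours.

```python
def service_request(addr, predict_hit, tag_matches, read_cache_data, read_memory):
    """Sketch of the latency-optimized flow on a metadata-cache miss."""
    if predict_hit(addr):
        # Predicted hit: tag and cache data are fetched in parallel (one
        # possible location, since direct-mapped); memory is touched only
        # on a mispredict, saving memory bandwidth.
        if tag_matches(addr):
            return read_cache_data(addr)
        return read_memory(addr)        # serialized, but rare with a good predictor
    # Predicted miss: tag probe and memory read issue in parallel; cache
    # data is read only on a mispredict, saving DRAM-cache bandwidth.
    mem_data = read_memory(addr)
    if tag_matches(addr):
        return read_cache_data(addr)    # serialized on a mispredicted hit
    return mem_data
```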
A TIC approach has good hit-latency, but suffers from extra miss bandwidth. Whereas a TOC approach has good miss bandwidth, but incurs extra hit bandwidth. Our key insight is that if one could use TIC for hits and TOC for misses, then one could potentially achieve both good hit and miss bandwidth.

We note that provisioning metadata for both TIC and TOC is relatively inexpensive: TIC simply uses spare ECC bits [11], and TOC needs to dedicate only ~1.5% of DRAM cache capacity to store metadata lines and employs 32KB SRAM for its metadata cache [8]. However, we need an effective design that can use TIC for hits and TOC for misses. We notice that we can use a hit/miss predictor [7] to help guide when to use TIC or TOC. For predicted hits, we can directly access the line with TIC. For predicted misses, we can consult the metadata-line / metadata-cache in TOC to help avoid miss probes. We call this proposal TicToc. Unfortunately, we find that naively combining both approaches actually leads to worse performance than TIC individually. This is because maintaining TOC tag and dirty bits consumes substantial bandwidth. To complete our design, we need to develop effective solutions to reduce maintenance bandwidth for TOC. We discuss methodology before the proposed design.
3. METHODOLOGY

3.1 Framework and Configuration
We use USIMM [20], an x86 simulator with a detailed memory system model. We extend USIMM to include a DRAM cache. Table 2 shows the configuration used in our study. We assume a four-level cache hierarchy (L1, L2, L3 being on-chip SRAM caches and L4 being an off-chip DRAM cache). All caches use a 64B line size. We model a virtual memory system to perform virtual to physical address translations. The baseline L4 is a 4GB DRAM cache [11], which is direct-mapped and places tags with data in unused ECC bits. The parameters of our DRAM cache are based on DDR4 DRAM technology [15]. The main memory is based on 3D-XPoint [1, 2, 21]: the read latency is ~6X, the write latency is ~24X that of DRAM, and there are 64 rowbuffers, each 256B in size.
Table 2: System Configuration
Processors               8 cores; 3.0GHz, 4-wide OoO
Last-Level Cache         8MB, 16-way

DRAM Cache
Capacity                 4GB
Bus Frequency            1000MHz (DDR 2GHz)
Configuration            1 channel, 64-bit bus, shared
Aggregate Bandwidth      16 GB/s, shared with Memory
tCAS-tRCD-tRP-tRAS       13-13-13-30 ns

Main Memory (3D XPoint)
Capacity                 64GB
Bus Frequency            1000MHz (DDR 2GHz)
Configuration            1 channel, 64-bit bus, shared
Aggregate Bandwidth      16 GB/s, shared with DRAM
tCAS-tRCD-tRP            4-80-0 ns
tRAS-tWR                 96-320 ns
We use a representative slice of 2-billion instructions selected by PinPoints [22], from benchmark suites that include SPEC 2006 [23] and GAP [24]. For SPEC, we pick a subset of high memory intensity workloads that have at least 2 L3 misses per thousand instructions (MPKI). The evaluations execute benchmarks in rate mode, where all eight cores execute the same benchmark. In addition to rate-mode workloads, we also evaluate 21 mixed workloads, which are created by randomly choosing 8 of the 17 SPEC workloads. Table 3 shows L3 miss rates and memory footprints for the 8-core rate-mode workloads in our study.

We perform timing simulation until each benchmark in a workload executes at least 2 billion instructions. We use weighted speedup to measure aggregate performance of the workload normalized to the baseline, and report the geometric mean for the average speedup across all 17 workloads (11 SPEC, 2 GAP, 4 MIX). We provide key performance results for an additional 17 SPEC-mixed workloads in Section 6.4.
Table 3: Workload Characteristics
Suite   Workload     L3 MPKI   Footprint
SPEC    mcf          101.14    13.4 GB
        lbm          49.3      3.2 GB
        soplex       35.3      1.8 GB
        libq         30.1      256 MB
        gems         29.1      6.4 GB
        omnet        29.0      1.2 GB
        wrf          10.4      1.1 GB
        gcc          7.6       1.5 GB
        xalanc       7.4       1.5 GB
        zeus         7.0       1.6 GB
        cactus       6.5       2.6 GB
GAP     cc twitter   116.8     9.3 GB
        pr twitter   126.6     15.3 GB
Figure 3: TicToc Metadata Organization queries the hit/miss predictor to use TIC metadata for hits and TOC metadata for misses. TicToc enables good hit latency, and good hit/miss bandwidth.
4. TICTOC DESIGN
DRAM caches need metadata to confirm whether a line is cache-resident or not (tag bits), and whether the resident line is the most up-to-date copy (dirty bit). Tag-Inside-Cacheline (TIC) organizations are optimized for hits, as one access gets both metadata and data, but can suffer for misses, as misses still need to access DRAM for metadata. In contrast, Tag-Outside-Cacheline (TOC) organizations are optimized for misses, as one metadata access gets residency and dirty information for multiple lines; however, such approaches suffer from needing to frequently query and update TOC metadata. Ideally, we want the hit-path of TIC and the miss-path of TOC, all without paying significant cost to access and maintain TOC metadata. This section is organized as follows: we describe how to provision and effectively use both TIC and TOC metadata in a TicToc organization, describe how to reduce TOC metadata maintenance costs, and show the effectiveness of our design.
Figure 3 shows the metadata organization of our TicToc design. TicToc provisions TIC metadata – tag-bits and a dirty-bit are stored inside the cacheline in unused ECC bits, similar to a commercial implementation [11]. In addition, TicToc provisions TOC metadata – metadata is stored in dedicated metadata lines corresponding to 1.5% of DRAM capacity, and cached as needed in a 32KB on-chip metadata cache. While provisioning both TIC and TOC metadata is relatively cheap, the complexity lies in utilizing TIC and TOC metadata appropriately to save on bandwidth for both hits and misses.
Figure 3 also shows the operation of TicToc. Ideally, we want to use TIC metadata for hits and TOC metadata for misses. Our key insight is that one can use hit/miss prediction [7, 25] to help guide when to use which metadata. Hit/miss predictors have primarily been used to hide the serialization latency that can occur from waiting on the last-level cache response before sending the memory access. They work by predicting which cache accesses are likely to miss, and sending both cache and memory requests in parallel to avoid serialization. We exploit an effective hit/miss predictor [7] to guide TicToc to use TIC metadata on a likely-hit and TOC metadata on a likely-miss. The common result: a hit is serviced in one cache access (TIC path), a miss with clean eviction goes directly to memory (TOC path), and a miss with dirty eviction goes to cache and memory (TIC path). An uncommon path of predict-hit actual-miss incurs serialization latency and bandwidth cost to access the cache before memory. The other uncommon path of predict-miss actual-hit incurs an extra memory access due to the parallel lookup of cache and memory.
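A minimal sketch of this metadata-selection flow follows; the naming is ours, the predictor is the PC-based hit/miss predictor of [7], and the request-path callables are stand-ins for the memory controller (in hardware, memory may be looked up in parallel on a predicted miss rather than after the TOC check).

```python
from dataclasses import dataclass

@dataclass
class TocEntry:
    valid: bool
    tag: int      # tag of the line currently resident in this set
    dirty: bool   # TOC dirty-bit for that resident line

NUM_SETS = 1 << 26            # direct-mapped 4GB cache of 64B lines

def line_tag(addr, line_size=64):
    return (addr // line_size) // NUM_SETS

def tictoc_lookup(addr, predict_hit, toc_entry_for, read_cache, read_memory):
    if predict_hit(addr):
        tag, data = read_cache(addr)        # TIC path: one access, tag + data
        return data if tag == line_tag(addr) else read_memory(addr)
    entry = toc_entry_for(addr)             # TOC path: metadata cache / line
    if entry.valid and entry.tag == line_tag(addr):
        _, data = read_cache(addr)          # mispredicted miss, actually a hit
        return data
    if entry.dirty:
        read_cache(addr)                    # dirty victim: read before writeback
    return read_memory(addr)                # clean victim: no miss probe needed
```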
To analyze the effectiveness of TicToc, Figure 4 and Figure 5 show the proportion of channel bandwidth being used for useful operations, install operations, and assorted maintenance operations, for the baseline TIC and the proposed TicToc. Useful operations include 3D-XPoint Read and Write, and DRAM Cache Hit and Writeback. Install operations refer to cache installs, which are important for improving hit-rate but cost bandwidth to write the line to DRAM. Lastly, maintenance operations refer to bandwidth-wasting operations used to confirm a line is not resident: miss probes for TIC, and accessing and updating TOC metadata for TOC.

Figure 4: Breakdown of bus bandwidth consumption for the TIC organization [7]. Workloads with low hit-rate waste significant bandwidth to confirm misses.
As expected, Figure 4 shows that TIC wastes bandwidth probing the DRAM cache to confirm misses. The proposed TicToc can utilize TOC to reduce such miss probes. However, Figure 5 shows that TicToc actually fares worse, due to needing bandwidth to maintain the TOC tag and TOC dirty-bit. TOC tag-updates happen when the workload misses on a line and installs it. A large fraction of misses occur when a workload is accessing many lines in a new page, so misses generally have good spatial locality. In such cases, metadata accesses/updates are amortized with the small metadata cache. TOC dirty-bit-updates, on the other hand, occur upon eviction of a dirty line from an earlier level of cache. Eviction generally has poor spatial and temporal locality, so updating this information often takes significant bandwidth to read and then update the TOC dirty-bit. We need effective methods that target reducing the cost of maintaining dirty information.

Figure 5: Breakdown of bus bandwidth consumption for the proposed TicToc organization. Write-heavy workloads waste significant bandwidth updating the TOC dirty-bit.
Figure 6: Bandwidth for a typical (a) write path and (b) miss+install path. TicToc+PDM adds a "Predicted-Dirty" state, where the TOC dirty-bit is installed as dirty but the TIC dirty-bit is installed as clean. Installing lines as Pred-Dirty can (a) save the TOC dirty-bit update, but (b) increase miss cost. Using Pred-Dirty only for write-likely lines can save bandwidth.
The main source of bandwidth overhead in TicToc is maintaining the dirty-bit in the TOC metadata. We need effective methods to reduce the cost of tracking dirty information. We explain the difficulty before describing our solution.
The dirty-bit update procedure starts upon the eviction of a dirty line from L3. First, we need to check the tag/dirty-bit line in the destination location of the DRAM cache to see if we can overwrite it (i.e., we must first evict a dirty tag-mismatched line). The common case is that the line evicted from L3 is resident in L4, as in Figure 6(a). Chou et al. [18] propose to eliminate the tag-check for this common case by maintaining a DRAM Cache Presence bit (DCP) alongside every line in L3. The DCP informs us that the same line is in both L3 and L4 – if the bit is set, the destination location has the same tag and can be directly overwritten (note that this optimization is included in our baseline). Second, the DRAM cache will then write the dirty line to the DRAM cache. Third, the cache will need to update any pertinent tag and dirty-bit metadata. The tag-update for TIC and TOC is uncommon, as typically L3-to-L4 writebacks will hit. The dirty-bit update for TIC is sent along with the L4 install, so it does not incur bandwidth overhead. However, Figure 6(a)[TOC,TicToc] shows that the dirty-bit update for TOC often needs to be separately queried and potentially updated. This TOC dirty-bit update is TicToc's main source of bandwidth overhead.

The overhead of dirty-bit updates comprises two parts: repeated TOC dirty-bit checks for already-dirty lines, and the initial TOC dirty-bit update to mark the clean-to-dirty transition. We target these two scenarios with two techniques.
Our insight is that if we also knew the dirty state of the corresponding line in the DRAM cache, we could avoid the need to check the TOC dirty-bit. Instead, we can check (and update) the TOC dirty-bit only if the dirty status changes.
DRAM Cache Dirtiness:
To enable this optimization, we propose to additionally store a DRAM Cache Dirtiness bit (DCD) alongside the DCP [18] next to each line in the L3 cache. The DCP stores the information that the current L3 line is also resident in L4. Meanwhile, the DCD additionally stores the dirty status of that L4 line. We set the DCD on a read of a dirty line from L4. On a DRAM cache write, we check both the DCP and DCD. If both DCD and DCP are set, we know the line is resident and already dirty in the TOC metadata – the tag and dirty-bit will be unchanged and we do not need to fetch the TOC. Hence, DCP reduces tag checks when the tag will not be modified, and DCD reduces dirty-bit checks when the dirty-bit will not be modified.

Figure 7 shows that DCD reduces the dirty-bit checks of many workloads that repeatedly write to the same lines (e.g., omnet, soplex). However, there are several workloads (e.g., zeusmp) that are write-heavy and write to most lines only once – we want to reduce dirty-bit updates for those workloads as well. For workloads that write once to lines, our insight is that if we can preemptively mark the dirty bit in the TOC at install time, we can avoid even the initial TOC clean-to-dirty update that would have occurred at L3 eviction time. We call this approach
Preemptive Dirty Marking (PDM).

Preemptive Dirty Marking:
Figure 7: Speedup of TOC, proposed TicToc, TicToc with DRAM Cache Dirtiness bit, TicToc with Preemptive Dirty Marking (PDM), and ideal Tag-In-SRAM, normalized to TIC. TicToc+PDM performs near ideal for most workloads.

Figure 6 shows the typical write and miss+install bandwidth for TicToc and for a version that preemptively marks the TOC dirty-bit. Figure 6(a)[TicToc] shows that a typical write path needs 4 accesses: a normal line would incur a clean install, a write, and a TOC dirty-bit read and write. Figure 6(a)[TicToc+PDM] shows that PDM can limit writes to 2 accesses. We add a new dirty state of "Predicted-Dirty," where the TOC dirty-bit is marked as dirty but the TIC dirty-bit is marked as clean. If we install lines as "Predicted-Dirty," the TOC dirty-bit is set at install time, and even the initial TOC clean-to-dirty update can be avoided.

However, while early marking can save bandwidth on writes, PDM incurs a different problem on the miss path. Figure 6(b)[TicToc] shows that a typical miss+install path needs 2 accesses: the TOC metadata informs residence and dirtiness, so miss+install can be accomplished with a memory read and a DRAM cache install. However, Figure 6(b)[TicToc+PDM] shows that PDM can increase miss+install to 3 accesses. For instance, if an otherwise clean line has been preemptively marked as dirty in the TOC dirty-bit, we would read the DRAM cache line in preparation for an eviction of a dirty line, thereby adding an extra DRAM read. Note that the Predicted-Dirty state does not cause extra memory writebacks, as the miss/wb probe will find the TIC dirty-bit, and write back only if the data is dirty. Thus, being aggressive in marking lines as "Predicted-Dirty" will save write bandwidth, but it can come at the cost of increasing miss bandwidth.

Ideally, we want to avoid write costs by installing write-likely lines as "Predicted-Dirty," and avoid increased miss costs by installing write-unlikely lines as clean. However, if we install a write-likely line as clean, it will pay the increased miss cost. Conversely, if we install a write-unlikely line as "Predicted-Dirty," it will pay the cost to update the TOC dirty bit. Hence, the performance of Preemptive Dirty Marking is contingent on good classification of write-likely and write-unlikely lines at install time, to avoid both TOC dirty-bit update and TIC miss probe bandwidth.
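A minimal sketch, under our own naming, of how the L3 dirty-eviction path combines DCP [18], the proposed DCD bit, and PDM's "Predicted-Dirty" install state:

```python
from dataclasses import dataclass

@dataclass
class L3Line:
    addr: int
    dcp: bool   # DRAM Cache Presence: same line is resident in L4 [18]
    dcd: bool   # DRAM Cache Dirtiness: that L4 copy is already dirty (proposed)

def on_l3_dirty_evict(line, toc_fetch, toc_update, write_cache_line):
    """Sketch of the writeback path; callables stand in for controller actions."""
    if line.dcp and line.dcd:
        write_cache_line(line)      # resident and known-dirty: data write only,
        return                      #   no TOC metadata traffic at all
    entry = toc_fetch(line.addr)    # may miss in the 32KB metadata cache
    write_cache_line(line)          # TIC tag + dirty-bit ride along in ECC bits
    if not entry.dirty:             # already True if PDM installed the line as
        entry.dirty = True          #   "Predicted-Dirty", so no update needed
        toc_update(entry)           # otherwise pay the clean-to-dirty update
```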
Figure 8: Signature-based Write Predictor learns which signatures correspond to an eventual write, to aid the PDM technique.

Write Predictor:
For accurate write-classification for PDM, we develop a Signature-based Write Predictor (SWP) to predict the likelihood that an incoming line will be written. SWP employs sampling PC-based prediction, inspired by SHiP [26, 27]. Figure 8 shows the structures and operation of SWP. SWP consists of write-behavior observation, learning, and prediction.

Observation is accomplished by maintaining a signature (the installing-PC in this case) and a written-to bit inside the metadata of each sampled line (10 bits of additional metadata for the 1% sampled lines, stored in TOC metadata). The signature is set at install time, and the written-to bit is updated on the first write to the line. On eviction of such a sampled line, we learn that this PC installed a line that was either written-to or never written-to in its lifetime in the cache.

Learning is then accomplished by storing the observed write-behavior into a PC-indexed table of saturating 3-bit counters. On eviction of a line that has the written-to bit set, the counter corresponding to the installing-PC is incremented. On eviction of a line that does not have the written-to bit set, the counter corresponding to the installing-PC is decremented. This counter table becomes a PC-indexed table of write-behavior.

Prediction is then simple – on install, the installing-PC is used to index into the counter table to provide a write-likely or write-unlikely prediction. If the counter is non-zero, this PC has seen write behavior and the incoming line should be installed in the "Predicted-Dirty" state to avoid the initial TOC clean-to-dirty update. If the counter is zero, then this PC has not seen much write behavior and the incoming line should be installed as clean to avoid miss/wb probes.
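A compact sketch of SWP follows. The counter geometry (saturating 3-bit counters in a 512-entry table) matches Section 6; the signature hash and the sampling interface are illustrative assumptions.

```python
NUM_CTRS = 512
counters = [0] * NUM_CTRS          # saturating 3-bit counters, PC-indexed

def signature(install_pc):
    # Illustrative hash; the paper derives the signature from the installing PC.
    return install_pc % NUM_CTRS

def predict_write_likely(install_pc):
    """Non-zero counter -> this PC has installed written-to lines before,
    so install the incoming line as "Predicted-Dirty"."""
    return counters[signature(install_pc)] > 0

def on_sampled_eviction(install_pc, was_written):
    """Train on evictions of the ~1% sampled lines, which carry a signature
    and a written-to bit in their TOC metadata."""
    s = signature(install_pc)
    if was_written:
        counters[s] = min(counters[s] + 1, 7)   # saturate at 3 bits
    else:
        counters[s] = max(counters[s] - 1, 0)
```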
Accuracy of Write Predictor:
The effectiveness of PDM is contingent on good classification of write-likely (dirty) lines to reduce dirty-bit update cost, and write-unlikely (clean) lines to reduce miss-probe cost. Figure 9 shows the fraction of lines that are predicted clean or dirty, and are actually clean or dirty. On average, SWP predicts clean and dirty with 92% accuracy, and enables PDM to save most dirty-update and miss-probe bandwidth.
Figure 9: Accuracy of Write Prediction (P=predicted, A=actual). Low PClean/ADirty and PDirty/AClean rates reflect accurate write-behavior prediction.
Performance:
Figure 7 shows the speedup of TOC, our TicToc, TicToc with DCD, TicToc with PDM, and idealized Tag-In-SRAM, normalized to the TIC approach. TOC performs poorly due to poor metadata-cache hit-rate, for a 30% slowdown. Our TicToc reduces hit bandwidth, for a 22% slowdown. Adding the DRAM Cache Dirtiness bit reduces dirty-bit tracking for repeated writes to the same lines, for 0% speedup. Adding Preemptive Dirty Marking reduces the initial dirty-bit update without incurring extra miss bandwidth, due to the accurate Write Predictor. Notably, TicToc+PDM achieves near idealized Tag-In-SRAM performance for most workloads, for 10% speedup, without including the worst-case mcf in the average. A few workloads see a performance gap to ideal. We analyze bandwidth consumption to gain insight into the problem.
Figure 10: Breakdown of bus bandwidth for dirty-optimized TicToc. Dirty-bit updates are greatly reduced.
Figure 11: Breakdown of bus bandwidth for dirty-optimized TicToc w/ Write-Aware Bypassing. Installs are mitigated.

Bandwidth:
Figure 10 shows the bandwidth breakdown of TicToc + dirty-bit optimizations. Overall, our approach eliminates nearly all of the TOC dirty-bit update bandwidth (decreasing its fraction from 10% to 0.8%) and frees up bandwidth for useful reads and writes. However, we note that installing lines and updating the TOC tag now becomes the primary source of DRAM cache bandwidth overhead. We target this overhead next.
5. REDUCING INSTALL BANDWIDTH WITH WRITE-AWARE BYPASS
When data has poor reuse, installing lines and updating TOC metadata wastes bandwidth. In fact, in such cases, employing a DRAM cache could actually hurt performance, as the line install and tag maintenance operations needlessly steal bus bandwidth from memory accesses. Figure 13 shows the performance of a setup without a DRAM cache, normalized to a setup with a TIC DRAM cache. We note that there are multiple workloads (e.g., pr twi and cc twi) for which "no DRAM cache" performs better than TIC. While one can avoid this degradation by disabling the DRAM cache at boot time, doing so would then hurt the cache-friendly workloads. Therefore, we need effective mechanisms to reduce the cost of unnecessary installs.
Insight – Write-Aware Bypassing:
Prior work has proposed cache bypassing [18, 28, 29] to avoid unnecessary installs. On an L3 miss, one can bypass the DRAM cache and install the line only in the L1/L2/L3 caches, thereby saving the DRAM cache install bandwidth. However, such bypassing must be done selectively and carefully, otherwise it may increase writes to 3D-XPoint memory, and degrade performance, endurance, and power.
Figure 12 shows our Write-Aware Bypassing policy. We start with the default 90%-bypass policy proposed in [18], which bypasses 90% of all installs. While such aggressive bypassing was shown to work well for an HBM+DDR hybrid memory [18], we note that it can increase write traffic to the write-constrained 3D-XPoint memory. To address this problem, we add write awareness to the bypass policy. We augment the default bypass policy with a write-allocate condition, which requires that dirty L3 evictions always install DRAM cache lines. Thus, the DRAM cache acts as a write buffer for 3D-XPoint memory. Unfortunately, the drawback of such an approach is that installing DRAM cache lines at the time of L3 evictions may result in significant tag-update costs. L3 evictions often have poor spatial locality, so TOC tag updates carried out at L3 eviction time exhibit poor metadata cache hit rates and incur extra DRAM accesses.

To amortize the TOC tag-update cost of our write-allocate policy, we propose Preemptive Write-Allocate, whereby we also always install write-likely lines (predicted with SWP). Preemptive Write-Allocate enables our write-allocate installs to happen at L3 miss time. Such installs have higher spatial locality, resulting in more metadata cache hits and more effective amortization of TOC metadata updates.
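The resulting install decision reduces to a few lines; the 90% bypass fraction follows [18], while the function naming and the random-sampling formulation are our sketch.

```python
import random

def should_install(is_writeback, predicted_write_likely):
    """Write-Aware Bypass install decision (Figure 12), sketched."""
    if is_writeback:
        return True                 # write-allocate: buffer dirty data in DRAM
    if predicted_write_likely:
        return True                 # preemptive write-allocate at L3 miss time,
                                    #   where TOC tag updates amortize well
    return random.random() > 0.90   # default 90%-bypass for clean fills [18]
```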
Figure 12: Write-Aware Bypass. Reduce install bandwidth by bypassing most write-unlikely lines. Reduce 3D-XPoint writes by installing write-likely lines.
Figure 13: Speedup of a no-DRAM-cache configuration, the proposed TicToc organization, adding 90%-bypass, adding Write-Allocate, and adding Preemptive Write-Allocate, relative to the TIC approach.
Bandwidth:
To understand the effectiveness of our install- and metadata-update-reducing optimizations, we show the bandwidth breakdown of our approach in Figure 11. Overall, we find that the install-reducing optimizations eliminate nearly all of the install bandwidth overheads and leave much more bandwidth for useful reads and writes. In total, the combination of our cache-bandwidth-reducing optimizations improves the fraction of bandwidth going to useful operations (servicing reads/writes) from 70% to 90% on average.
Performance:
Figure 13 shows the performance of TicToc with dirty-bit optimizations, TicToc with 90%-bypass, TicToc with 90%-bypass and write-allocate, and TicToc with 90%-bypass and preemptive write-allocate, relative to TIC.

TicToc with dirty-bit optimizations does well for most workloads, for an average 4.2% speedup, but can suffer for workloads with poor spatial locality and low hit-rate (e.g., mcf). TicToc with 90%-bypass reduces install and TOC tag-update cost to improve speedup to 16.7%. Notably, the performance degradation for mcf is substantially mitigated. TicToc with 90%-bypass and write-allocate enables effective write-buffering to improve speedup to 20.6%. Finally, TicToc with 90%-bypass and preemptive write-allocate further amortizes TOC metadata updates (e.g., useful for zeusmp and pr twi) to improve speedup to 23.2%.
Overall, our proposed techniques target all forms of DRAM cache maintenance bandwidth to achieve a bandwidth-efficient (>90% of channel bandwidth going to useful operations) and low-SRAM-storage (34KB) DRAM cache organization: TicToc improves hit and miss bandwidth, the DRAM Cache Dirtiness bit and Preemptive Dirty Marking reduce dirty-bit-tracking bandwidth, and Write-Aware Bypass reduces install and tag-tracking bandwidth. Our TicToc with dirty-bit and install bandwidth reducing optimizations enables 23.2% speedup at the cost of only 34KB of SRAM.
6. RESULTS AND DISCUSSION
In this section we present sensitivity studies and storage analysis. Due to space constraints, we limit results to TicToc with dirty-bit optimizations.
We analyze the SRAM storage requirements of our TicToc organization. TicToc requires structures from its component TIC and TOC organizations. Inheriting from TIC, we need ~1KB for PC-based hit/miss prediction [7], and 1 bit alongside each L3 line for the DRAM Cache Presence bit to avoid tag-checks for writes to resident lines [18]. Inheriting from TOC, we need 32KB for a metadata cache [8].

Specific to TicToc, to implement our dirty-bit optimizations, we need 1 bit alongside each L3 line for DRAM Cache Dirtiness, and ~1KB for our Signature-based Write Predictor (512 entries of 3-bit counters with 9-bit PC tags). Our bypassing optimizations do not require additional space. In total, TicToc needs 34KB of SRAM storage in the memory controller, plus 2 bits alongside each L3 line.
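The SRAM total is simply the sum of the Table 4 entries:

```latex
\[
\underbrace{1\,\mathrm{KB}}_{\text{hit/miss pred.}}
+ \underbrace{32\,\mathrm{KB}}_{\text{metadata cache}}
+ \underbrace{1\,\mathrm{KB}}_{\text{SWP}}
= 34\,\mathrm{KB},
\]
```

plus 2 bits (DCP + DCD) alongside each L3 line.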
Table 4: Storage Requirements of TicToc
TicToc Component                   SRAM Storage
Hit-Miss Predictor [7]             1 KB
DRAM Cache Presence [18]           1-bit / L3-line
Metadata Cache [8]                 32 KB
DRAM Cache Dirtiness               1-bit / L3-line
Signature-based Write Predictor    1 KB
TicToc (total)                     34 KB + 2-bits / L3-line
The largest SRAM component of our TicToc proposal is the TOC metadata cache. Table 5 shows the performance sensitivity of our TicToc organization to metadata-cache sizing. We show the average speedup of TicToc with dirty-bit optimizations, when employing metadata caches with sizes ranging from 8KB to 64KB. The dirty-bit tracking optimizations enable TOC approaches to be much more effective with small metadata-cache sizes, as the metadata caches do not need to be sized to handle writeback traffic that has poor spatial locality.
Table 5: Sensitivity to Metadata Cache Sizing
Num. Entries    TicToc    TicToc (no mcf)
128 (8KB)       -3.0%     +1.9%
256 (16KB)      +1.5%     +7.0%
512 (32KB)      +4.2%     +10.0%
DRAM and 3D-XPoint are likely to be behind the same channel to maximize the bandwidth out of each physical pin, as shown in Figure 1. Figure 14 shows the system performance of a channel-shared system (two channels of TIC DRAM cache + 3D-XPoint), normalized to the previously assumed dedicated-channel system (one channel of 2x TIC DRAM cache, and one channel of 2x 3D-XPoint). We find that channel-shared systems enable more balanced channel bandwidth usage due to each channel having a DRAM cache. For example, under a high DRAM cache hit-rate, a channel-shared system would be able to utilize all channels, whereas a dedicated-channel system would only be able to use the half of the channels employing DRAM caches. Such channel-shared approaches enable up to 40% speedup compared to the traditional dedicated-channel setups.

Figure 14: Speedup of Channel-Shared Hybrid Memory, over Dedicated-Channel Hybrid Memory. Channel-sharing enables up to 40% speedup.
To show the robustness of our proposal to multi-programmed workloads, we conduct evaluations over a larger set of 17 mixed-application workloads. Figure 15 shows that our dirty-optimized TicToc organization provides 11% speedup across the 17 mixes, with no workload experiencing slowdown.
Figure 15: Speedup of TicToc with dirty-bit optimizations, and idealized Tag-In-SRAM, on mixed workloads.
In order to quantify the remaining performance opportunity, we compare our TicToc DRAM cache + 3D-XPoint solution with an expensive DRAM-only solution having the same DRAM main memory capacity as the 3D-XPoint capacity in our setup. Note that this DRAM-only solution would cost substantially (4-8x) more than a hybrid DRAM+3D-XPoint memory. Figure 16 shows performance results normalized to TIC. TicToc's bandwidth-efficient DRAM caching enables 3D-XPoint to perform within 13% of the expensive DRAM-only solution.
Figure 16: Speedup of TicToc (dirty-opt, bypassing) and a DRAM-only solution, relative to TIC cache + 3D-XPoint.
Our TicToc implementation uses a direct-mapped organization to avoid the latency of serialized tag and data lookup (see the TOC explanation in Section 2.3). There have been works that can avoid the serialized tag lookup for associative TIC [30] and associative TOC [13] designs via scalable way-prediction methods. As such, we find associativity an orthogonal issue for our proposal; techniques such as [30] can be incorporated into our design.
7. RELATED WORK

7.1 Line-based DRAM Caches
In our work, we utilize and combine the two major types of line-granularity DRAM cache designs: Tag-Inside-Cacheline (TIC) and Tag-Outside-Cacheline (TOC) approaches.
TIC designs [7, 11, 18, 30, 31, 32] organize their cache as direct-mapped and store the tag inside the cacheline, such that one access can retrieve both tag and data. Such approaches are optimized for hits, but pay bandwidth to confirm misses [7]. BEAR [18] proposes several enhancements to reduce the bandwidth cost of cache maintenance: we include its DRAM Cache Presence, which targets reducing write probes, in our baseline TIC design; we compare with Bandwidth-Aware Bypass with 90%-bypass in Figure 13; but we do not include the Neighboring Tag Cache, as current implementations cannot obtain a neighboring tag for free [11]. Such hit-latency-optimized approaches have been proven effective in industrial application with the Intel Knights Landing product [11]; as such, we perform all of our experiments with BEAR as our baseline. We use BEAR as the TIC component of TicToc, and improve upon TIC miss-bandwidth inefficiencies to enable a scalable, bandwidth-efficient DRAM cache.
TOC designs [8, 9, 12, 33, 34] store tags in a separate area of the DRAM cache and fetch them as needed. The earliest forms of such caches were highly associative and would need a serial tag-then-data lookup [9]. Some enhancements used tag-prefetching [33] or way-prediction [34] to avoid this serialized tag lookup. Others used a direct-mapped organization [8, 12] to avoid the serialized tag lookup, with one employing a tag cache [8] to reduce the bandwidth of tag lookup as well. Figure 7 shows that TIMBER [8], a direct-mapped TOC design with a tag cache, performs well for misses but can perform poorly due to the high bandwidth cost to update metadata. We use TIMBER as the TOC component of TicToc, and improve upon TOC metadata-bandwidth inefficiencies to enable a scalable, bandwidth-efficient DRAM cache.
An alternate approach to designing DRAM caches is to use large-granularity caches to amortize tag and metadata overhead, in hardware [10, 13] or software [35, 36, 37, 38].
Hardware-only:
Hardware-based approaches store tags either in SRAM [10, 19] or in DRAM [13]. The Tag-In-SRAM proposals typically use sector caching [19] to reduce the overall tag requirements, and fit them all in megabytes of SRAM [10]. However, the storage for these approaches is still typically quite large. And they have the penalty of poor cache utilization when not all lines in a page are used. For comparison, we show what these cache organizations can achieve with the line-granularity Tag-In-SRAM organization in Figure 7. Our proposed TicToc achieves close to this upper-bound with much less SRAM storage (34KB vs. >20MB).

Alternatively, there are Tag-In-DRAM proposals that store metadata in DRAM, and fetch the tags as needed [13]. These approaches need to spend bandwidth to access and update metadata information in a separate area of the DRAM cache. As such, these approaches often have similar bandwidth overheads and performance to the TOC component of TicToc. And they have the penalty of poor cache utilization when not all lines in a page are used. For comparison, we show what these cache organizations can achieve with the line-granularity TOC organization in Figure 7. Our proposed TicToc, as well as the baseline TIC, outperforms such TOC organizations. Nonetheless, our dirty-tracking optimizations are general and can be applied to improve metadata update cost for such caches as well.
Software-supported:
Software-supported DRAM cache approaches maintain mapping and metadata information inside page tables [35, 36, 37, 38], and use various heuristics to determine when to install pages. The benefit of such approaches is that they do not need to pay additional bandwidth to access tags. The shortcomings of such approaches are two-fold. First, the migration granularity is fixed to the size of an OS page, which can cause overfetch problems, as well as poor DRAM cache utilization when not all of the page is useful. Second, such approaches require both hardware and software support, and can be difficult to deploy without cooperation between multiple vendors. We do not perform comparisons with such works as these approaches are out of scope (i.e., they break our design goal of OS-transparency).
Other hybrid memory approaches attempt to get the capacity of both memories, and instead initiate hardware-managed line or page swaps to enable most data to be serviced at the lower-latency or higher-bandwidth memory [39, 40, 41, 42, 43]. These approaches have various tracking overheads and effectiveness. However, we note there is a fundamental difference from caching. On eviction of an unmodified line/page, caches can simply drop the clean line/page – whereas swap-based approaches need to always write back the evicted line/page. Such swaps incur extra writes that could otherwise have been avoided. For our target DRAM + 3D-XPoint configuration, these extra swapping-induced writes would cost performance, endurance, and power when writing to write-constrained 3D-XPoint. The added capacity benefits (3-12%) obtained from such swapping are unlikely to make up the difference. Hence, we do not take a swapping-based / two-level memory approach in this work. We do not perform comparisons with such works as these approaches are out of scope (i.e., they break our design goal of write-efficiency).
Tracking the dirty-bit or most-recent-copy of a cacheline efficiently, with low SRAM storage costs, is a known difficult problem. Many works limit the number of lines that can be kept dirty [25, 44], to reduce the SRAM storage costs needed to track dirty lines. Other approaches are more extreme and make the cache clean-only by always writing through [45, 46]. However, in our work, we target a DRAM + 3D-XPoint system, which is often constrained by 3D-XPoint write bandwidth. Such mostly-clean caching techniques, which limit the fraction of the DRAM cache that can be dirty, hamper the ability of the DRAM cache to act as an effective write buffer for 3D-XPoint. This write limit can cause a corresponding degradation in performance, endurance, and power.

Our approach, on the other hand, does not impose any limitation on which lines of the DRAM cache can be kept dirty. Instead, we fundamentally target dirty-bit update cost with architectural techniques. Our DRAM Cache Dirtiness and Preemptive Dirty Marking techniques reduce over 90% of the bandwidth cost to track dirty information, while needing only 34KB of SRAM storage.
There has been a long line of work on hybrid DRAM + NVM systems [4, 5, 6]. These works typically try to use DRAM to hide the 4-8x read latency and poor write characteristics of NVM (e.g., low write bandwidth, high power consumption, low write endurance). Our work follows this line of research. We develop a scalable (low 34KB SRAM cost) and bandwidth-efficient DRAM cache design, and add an NVM-specific Write-Aware Bypassing that specifically targets hiding NVM's poor write-relative-to-read characteristics.
8. CONCLUSION
This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. Effective DRAM caching in front of 3D-XPoint is critical to enabling a memory system that has the apparent high capacity of 3D-XPoint, and the low latency and high write-bandwidth of DRAM. There are currently two major approaches for DRAM cache design: (1) a Tag-Inside-Cacheline (TIC) organization that optimizes for hits, by storing the tag next to each line such that one access gets both tag and data, and (2) a Tag-Outside-Cacheline (TOC) organization that optimizes for misses, by storing tags from multiple data lines together in a tag-line such that one access to a tag-line gets information on several data-lines. Ideally, we would like to have the low hit-latency of TIC designs, and the low miss-bandwidth of TOC designs. To this end, we propose a TicToc organization that provisions both TIC and TOC to get the hit and miss benefits of both. However, we find that naively combining both techniques actually performs worse than TIC individually, because one has to pay the bandwidth cost of maintaining both metadata. We find the majority of update bandwidth is due to maintaining the TOC dirty information. We propose a DRAM Cache Dirtiness Bit technique that helps prune repeated dirty-bit updates for known dirty lines. We also propose a Preemptive Dirty Marking (PDM) technique that predicts which lines will be written and proactively marks the dirty bit at install time, to help avoid even the initial dirty-bit update for dirty lines. To support PDM, we develop a novel PC-based
Write-Predictor to aid in markingonly write-likely lines. Our evaluations on a 4GB DRAMcache in front of 3D-XPoint show that our TicToc organiza-tion enables 10% speedup over the baseline TIC, nearing the14% speedup possible with an idealized DRAM cache designwith 64MB of SRAM tags, while needing only 34KB SRAM.11 . REFERENCES [1] Intel and Micron, “A revolutionary breakthrough in memorytechnology,” 2015.[2] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J.Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basicperformance measurements of the intel optane DC persistent memorymodule,”
[3] A. Ilkbahar, “Intel Optane DC persistent memory operating modes explained,” 2018. Accessed: 2019-03-20.
[4] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system using phase-change memory technology,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, (New York, NY, USA), pp. 24–33, ACM, 2009.
[5] G. Dhiman, R. Ayoub, and T. Rosing, “PDRAM: A hybrid PRAM and DRAM main memory system,” pp. 664–669, July 2009.
[6] A. Bivens, P. Dube, M. Franceschini, J. Karidis, L. Lastras, and M. Tsao, “Architectural design for next generation heterogeneous memory systems,” in Memory Workshop (IMW), 2010 IEEE International, pp. 1–4, IEEE, 2010.
[7] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design,” pp. 235–246, Dec 2012.
[8] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management,” IEEE Computer Architecture Letters, vol. 11, pp. 61–64, July 2012.
[9] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large die-stacked DRAM caches,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 454–464, ACM, 2011.
[10] D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked DRAM caches for servers: Hit ratio, latency, or bandwidth? Have it all with footprint cache,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 404–415, ACM, 2013.
[11] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights Landing: Second-generation Intel Xeon Phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016.
[12] J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient die-stacked DRAM caches,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 416–427, ACM, 2013.
[13] D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A scalable and effective die-stacked DRAM cache,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 25–37, IEEE, 2014.
[14] JEDEC Standard, “High bandwidth memory (HBM) DRAM,” JESD235, 2013.
[15] JEDEC, DDR4 SPEC (JESD79-4), 2013.
[16] ArsTechnica, “Intel’s crazy-fast 3D XPoint Optane memory heads for DDR slots (but with a catch),” 2018. Accessed: 2019-01-23.
[17] M. Arafa, B. Fahim, S. Kottapalli, A. Kumar, L. P. Looi, S. Mandava, A. Rudoff, I. M. Steiner, B. Valentine, G. Vedaraman, and S. Vora, “Cascade Lake: Next generation Intel Xeon scalable processor,” IEEE Micro, vol. 39, pp. 29–36, March 2019.
[18] C. Chou, A. Jaleel, and M. K. Qureshi, “BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 198–210, ACM, 2015.
[19] J. B. Rothman and A. J. Smith, “Sector cache design and performance,” in Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728), pp. 124–133, Aug 2000.
[20] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “USIMM: the Utah SImulated Memory Module,” University of Utah, Tech. Rep., 2012.
[21] Intel, “Fact sheet: New Intel architectures and technologies target expanded market opportunities,” 2018. Accessed: 2019-03-20.
[22] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation,” in Microarchitecture, 2004. MICRO-37 2004. 37th International Symposium on, pp. 81–92, Dec 2004.
[23] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, pp. 1–17, Sept. 2006.
[24] S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP benchmark suite,” CoRR, vol. abs/1508.03619, 2015.
[25] J. Sim, G. H. Loh, H. Kim, M. O’Connor, and M. Thottethodi, “A mostly-clean DRAM cache for effective hit speculation and self-balancing dispatch,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, pp. 247–257, IEEE, 2012.
[26] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer, “SHiP: Signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 430–441, ACM, 2011.
[27] V. Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “SHiP++: Enhancing signature-based hit predictor for improved cache performance,” in The 2nd Cache Replacement Championship (CRC-2 Workshop in ISCA 2017), 2017.
[28] M. Kharbutli and Y. Solihin, “Counter-based cache replacement and bypassing algorithms,” IEEE Trans. Comput., vol. 57, pp. 433–447, Apr. 2008.
[29] H. Gao and C. Wilkerson, “A dueling segmented LRU replacement algorithm with adaptive bypassing,” in JWAC 2010 - 1st JILP Workshop on Computer Architecture Competitions: Cache Replacement Championship, 2010.
[30] V. Young, C. Chou, A. Jaleel, and M. K. Qureshi, “ACCORD: Enabling associativity for gigascale DRAM caches by coordinating way-install and way-prediction,” pp. 328–339, June 2018.
[31] C. Chou, A. Jaleel, and M. K. Qureshi, “CANDY: Enabling coherent DRAM caches for multi-node systems,” pp. 1–13, Oct 2016.
[32] V. Young, P. J. Nair, and M. K. Qureshi, “DICE: Compressing DRAM caches for bandwidth and capacity,” in ISCA ’17, (New York, NY, USA), pp. 627–638, ACM, 2017.
[33] C.-C. Huang and V. Nagarajan, “ATCache: Reducing DRAM cache latency via a small SRAM tag cache,” in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pp. 51–60, ACM, 2014.
[34] Z. Wang, D. A. Jiménez, T. Zhang, G. H. Loh, and Y. Xie, “Building a low latency, highly associative DRAM cache with the buffered way predictor,” pp. 109–117, Oct 2016.
[35] Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A fully associative, tagless DRAM cache,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 211–222, ACM, 2015.
[36] H. Jang, Y. Lee, J. Kim, Y. Kim, J. Kim, J. Jeong, and J. W. Lee, “Efficient footprint caching for tagless DRAM caches,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 237–248, IEEE, 2016.
[37] G. H. Loh, N. Jayasena, J. Chung, S. K. Reinhardt, M. O’Connor, and K. McGrath, “Challenges in heterogeneous die-stacked and off-chip memory systems,” Feb 2012.
[38] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-efficient DRAM caching via software/hardware cooperation,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY, USA), pp. 1–14, ACM, 2017.
[39] C. Chou, A. Jaleel, and M. K. Qureshi, “CAMEO: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, (Washington, DC, USA), pp. 1–12, IEEE Computer Society, 2014.
[40] J. Sim, A. R. Alameldeen, Z. Chishti, C. Wilkerson, and H. Kim, “Transparent hardware management of stacked DRAM as part of memory,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, (Washington, DC, USA), pp. 13–24, IEEE Computer Society, 2014.
[41] J. H. Ryoo, M. R. Meswani, A. Prodromou, and L. K. John, “SILC-FM: Subblocked interleaved cache-like flat memory organization,” pp. 349–360, Feb 2017.
[42] A. Prodromou, M. Meswani, N. Jayasena, G. Loh, and D. M. Tullsen, “MemPod: A clustered architecture for efficient and scalable migration in flat address space multi-level memories,” pp. 433–444, Feb 2017.
[43] A. Kokolis, “PageSeer: Using page walks to trigger page swaps in hybrid memory systems,” pp. 596–608, 2019.
[44] C. Huang, R. Kumar, M. Elver, B. Grot, and V. Nagarajan, “C3D: Mitigating the NUMA bottleneck via coherent DRAM caches,” pp. 1–12, Oct 2016.
[45] I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt, “Cache coherence for GPU architectures,” 2013.
[46] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, and O. Villa, “Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems,” in MICRO ’18, October 2018.