Achieving Multi-Port Memory Performance on Single-Port Memory with Coding Techniques
Hardik Jain, Matthew Edwards, Ethan Elenberg, Ankit Singh Rawat, Sriram Vishwanath
Hardik Jain
Department of ECE, The University of Texas at Austin
Austin, United States
[email protected]

Matthew Edwards
GenXComm Inc.
Austin, United States
[email protected]

Ethan Elenberg
ASAPP Inc.
New York, United States
[email protected]

Ankit Singh Rawat
New York, United States
[email protected]

Sriram Vishwanath
Department of ECE, The University of Texas at Austin
Austin, United States
[email protected]
Abstract — Many performance critical systems today must rely on performance enhancements, such as multi-port memories, to keep up with the increasing demand of memory-access capacity. However, the large area footprints and complexity of existing multi-port memory designs limit their applicability. This paper explores a coding theoretic framework to address this problem. In particular, this paper introduces a framework to encode data across multiple single-port memory banks in order to algorithmically realize the functionality of multi-port memory. This paper proposes three code designs with significantly less storage overhead compared to the existing replication based emulations of multi-port memories. To further improve performance, we also demonstrate a memory controller design that utilizes redundancy across coded memory banks to more efficiently schedule read and write requests sent across multiple cores. Furthermore, guided by DRAM traces, the paper explores dynamic coding techniques to improve the efficiency of the coding based memory design. We then show significant performance improvements in critical word read and write latency in the proposed coded-memory design when compared to a traditional uncoded-memory design.
Keywords — DRAM, coding, memory controller, computer architecture, multi-port memory
I. INTRODUCTION
Loading and storing information to memory is an intrinsic part of any computer program. As illustrated in Figure 1, the past few decades have seen the performance gap between processors and memory grow. Even with the saturation and demise of Moore's law [1], [2], [3], processing power is expected to grow as multi-core architectures become more reliable [4]. The end-to-end performance of a program heavily depends on both processor and memory performance. Slower memory systems can bottleneck computational performance. This has motivated computer architects and researchers to explore strategies for shortening memory access latency, including sustained efforts towards enhancing the memory hierarchy [5].

(Work done primarily at The University of Texas at Austin.)

Despite these efforts, long-latency memory accesses
Fig. 1: The gap in performance, measured as the difference in the time between processor memory requests for a single processor and the latency of a DRAM access [6].

do occur when there is a miss in the last level cache (LLC). This triggers an access to shared memory, and the processor is stalled as it waits for the shared memory to return the requested information.

In multi-core systems, shared memory access conflicts between cores result in large access request queues. Figure 2 illustrates a general multi-core architecture. The bank queues are served every memory clock cycle, and the acknowledgement with data is sent back to the corresponding processor. In scenarios where multiple cores request access to memory locations in the same bank, the memory controller arbitrates them using bank queues. This contention between cores to access from the same bank is known as a bank conflict. As the number of bank conflicts increases, the resulting increase in memory access latency causes the multi-core system to slow.

We address the issue of increased latency by introducing a coded memory design. The main principle behind our memory design is to distribute accesses intended for a particular bank across multiple banks. We redundantly store encoded data, and we decode memory for highly requested memory banks using idle memory banks. This approach allows us to simultaneously serve multiple read requests intended for a particular bank. Figure 3 shows this with an example. Here, Bank 3 is

Fig. 2: General multi-core architecture with a shared memory. N processor cores share a memory consisting of M banks.
Fig. 3: Here the redundant memory in Bank 3 enables multiple read accesses to Bank 1 or 2. Given two read requests {a(i), a(j)} directed to Bank 1, we can resolve the bank conflict by reading a(i) directly from Bank 1 and acquiring a(j) with two reads from Bank 2 and Bank 3. b(j) and a(j)+b(j) are read from Bank 2 and Bank 3, and a(j) is recovered because a(j) = b(j) + (a(j) + b(j)).

redundant as its content is a function of the content stored on Banks 1 and 2. Such redundant banks are also referred to as parity banks. Assume that the information is arranged in L rows in the first two banks, represented by [a(1), ..., a(L)] and [b(1), ..., b(L)], respectively. Let + denote the XOR operation, and additionally assume that the memory controller is capable of performing simple decoding operations, i.e. recovering a(j) from b(j) and a(j)+b(j). Because the third bank stores L rows containing [a(1)+b(1), ..., a(L)+b(L)], this design allows us to simultaneously serve any two read requests in a single memory clock cycle.

Hybrid memory designs such as the one in Figure 3 must meet requirements beyond serving read requests. The presence of redundant parity banks raises a number of challenges while serving write requests. The memory overhead of redundant memory storage adds to the overall cost of such systems, so efforts must be made to minimize this overhead. Finally, the heavy memory access request rate possible in multi-core scenarios necessitates sophisticated scheduling strategies at the memory controller. In this paper we address these design challenges and evaluate potential solutions in a simulated memory environment.

Main contributions and organization:
In this paper we systematically address all key issues pertaining to a shared memory system that can simultaneously service multiple access requests in a multi-core setup. We present all the necessary background on the realization of multi-port memories using single-port memory banks, along with an account of relevant prior work, in Section II. We then present the main contributions of the paper, which we summarize below.

• We focus on the design of the storage space in Section III. In particular, we employ three specific coding schemes to redundantly store the information in memory banks. These coding schemes, which are based on the literature on distributed storage systems [7], [8], [9], [10], allow us to realize the functionality of multi-port memories from single-port memories while efficiently utilizing the storage space.

• We present a memory controller architecture for the proposed coding based memory system in Section IV. Among other issues, the memory controller design involves devising scheduling schemes for both read and write requests. This includes careful utilization of the redundancy present in the memory banks while maintaining the validity of information stored in them.

• Focusing on applications where memory traces might exhibit favorable access patterns, we explore dynamic coding techniques which improve the efficiency of our coding based memory design in Section IV-E.

• Finally, we conduct a detailed evaluation of the proposed designs of shared memory systems in Section V. We implement our memory designs by extending Ramulator, a DRAM simulator [11]. We use the gem5 simulator [12] to create memory traces of the PARSEC benchmarks [13], which are input to our extended version of Ramulator. We then observe the execution-time speedups our memory designs yield.

II. BACKGROUND AND RELATED WORK
A. Emulating multi-port memories
Multi-port memory systems are often considered to be essential for multi-core computation. Individual cores may request memory from the same bank simultaneously, and absent a multi-port memory system some cores will stall. Multi-port memory systems have significant design costs. The complex circuitry and area costs for multi-port bit-cells are significantly higher than those for single-port bit-cells [14], [15]. This motivates the exploration of algorithmic and systematic designs that emulate multi-port memories using single-ported memory banks [16], [17], [18], [19], [20]. Attempts have been made to emulate multi-port memory using replication based designs [21]; however, the resulting memory architectures are very large.
1) Read-only Support:
Replication-based designs are often proposed as a method for multi-port emulation. Suppose that a memory design is required to support only read requests, say r read requests per memory clock cycle. A simple solution is

Fig. 4: A 2-replication design which supports 2 read requests per bank. In this design, the data is partitioned between two banks, a = [a(1), ..., a(L)] and b = [b(1), ..., b(L)], and duplicated.
Fig. 5: A replication based design to support r = 2 read requests and w = 1 write request. Both collections of information elements a = [a(1), ..., a(L)] and b = [b(1), ..., b(L)] are replicated to obtain r · (w + 1) = 4 single-port memory banks. These banks are then partitioned into r = 2 disjoint groups. The pointer storage is required to ensure that no read request is served stale symbols. As shown in the illustration, the write request is served to two of the a banks to ensure that the fresh a(k) may be served during any future cycle.

storing r copies of each data element on r different single-port memory banks. In every memory clock cycle, the r read requests can be served in a straightforward manner by mapping all read requests to distinct memory banks (see Figure 4). This way, the r-replication design completely avoids bank conflicts for up to r read requests in a memory clock cycle.

Remark 1:
If we compare the memory design in Figure 4 with that of Figure 3, we notice that both designs can simultaneously serve two read requests without causing any bank conflicts. Note that the design in Figure 3 consumes less storage space, as it needs only three single-port memory banks while the design in Figure 4 requires four single-port memory banks. However, the access process for the design in Figure 3 involves some computation. This observation raises the notion that sophisticated coding schemes allow for storage efficient designs compared to replication based methods [22]. However, this comes at the expense of increased computation required for decoding.
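The trade-off in Remark 1 can be made concrete in a few lines. The sketch below (bank contents and helper names are illustrative, not part of the proposed design) serves two reads to the same logical bank using the three coded banks of Figure 3, where the second read costs one extra XOR:

```python
# Coded design of Figure 3: three banks emulate a two-port read memory.
# Bank 3 stores the XOR of Banks 1 and 2, row by row.
bank_a = [0b1010, 0b0111, 0b1100, 0b0001]          # Bank 1
bank_b = [0b0101, 0b1110, 0b0011, 0b1000]          # Bank 2
bank_p = [x ^ y for x, y in zip(bank_a, bank_b)]   # Bank 3: a(i) + b(i)

def serve_two_reads(i, j):
    """Serve a(i) and a(j) in one cycle: a(i) straight from Bank 1,
    a(j) decoded as b(j) XOR (a(j) + b(j)) from Banks 2 and 3."""
    return bank_a[i], bank_b[j] ^ bank_p[j]

assert serve_two_reads(0, 2) == (bank_a[0], bank_a[2])
```

A 2-replication design serves the same pair of reads with four banks and no XOR; the coded design trades one bank of storage for one XOR per decoded read.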
2) Read and Write Support:
A proper emulation of multi-port memory must be able to serve write requests. A challenge that arises from this requirement is tracking the state of memory. In replication-based designs where original data banks are duplicated, the service of write requests results in differences in state between the original and duplicate banks.

Replication-based solutions to the problems presented when supporting write requests involve creating yet more duplicate banks. A replication-based multi-port memory emulation that simultaneously supports r read requests and w write requests requires an r · (w + 1) replication scheme, where r · (w + 1) copies of each data element are stored on r · (w + 1) different single-port memory banks. We illustrate this scheme for r = 2 and w = 1 in Figure 5. As in previous illustrations, we have two groups of symbols a = [a(1), ..., a(L)] and b = [b(1), ..., b(L)]. We store 4 copies each of data elements a and b and partition the banks into r = 2 disjoint groups. Each group contains (w + 1) = 2 memory banks. An additional storage space, the pointer storage, is required to keep track of the state of the data in the banks.

B. Storage-efficient emulation of multi-port memories
As described in Section II-A, introducing redundancy to systems which use single-port memory banks allows such systems to emulate the behavior of multi-port banks. Emulating multi-port read and write systems is costly (cf. Section II-A2). A greater number of single-port memory banks are needed, and systems which redundantly store memory require tracking of the various versions of the data elements present in the memory banks. Furthermore, as write requests are served, the elements stored across redundant banks temporarily differ. This transient inconsistency between redundant storage complicates the process of arbitration.

We believe that the various tasks that arise in the presence of write requests and contribute to the computational overhead of the memory design, including synchronization among memory banks and complicated arbitration, can be better managed at the algorithmic level. Note that these tasks are performed by the memory controller. It is possible to mitigate the effect of these tasks on the memory system by relying on increasingly available computational resources while designing the memory controller. Additionally, we believe that the large storage overhead is a more fundamental issue that needs to be addressed before multi-port memory emulation is feasible. In particular, the large replication factor in a naive emulation creates such a large storage overhead that the resulting area requirements of such designs are impractical.

Another approach arises from the observation that some data banks are left unused during arbitration in individual memory cycles, while other data banks receive multiple requests. We encode the elements of the data banks using specific coding schemes to generate parity banks. Elements drawn from multiple data banks are encoded and stored in the parity banks. This approach allows us to utilize idle data banks to decode elements stored in the parity banks in service of multiple requests which target the same data bank. We recognize that this approach leads to increased complexity at the memory controller. However, we show that the increase in complexity can be kept within an acceptable level while ensuring storage-efficient emulation of multi-port memories.
C. Related work
Coding theory is a well-studied field which aims to mitigate the challenges of underlying mediums in information processing systems [22], [23]. The field has enabled both reliable communication across noisy channels and reliability in fault-prone storage units. Recently, we have witnessed intensive efforts towards the application of coding theoretic ideas to design large scale distributed storage systems [24], [25], [26]. In this domain, the issue of access efficiency has also received attention, especially the ability to support multiple simultaneous read accesses with small storage overhead [9], [10], [27], [28]. In this paper, we rely on such coding techniques to emulate multi-port memories using single-port memory banks. We note that the existing work on batch codes [9] focuses only on read requests, but the emulation of multi-port memory must also handle write requests.

Coding schemes with low update complexity that can be implemented at the speed memory systems require have also been studied [29], [30]. Our work is distinguished from the majority of the literature on coding for distributed storage because we consider the interplay between read and write requests and how this interplay affects memory access latency. The work which is closest to our solution for emulating a multi-port memory is by Iyer and Chuang [19], [20], where they also employ XOR based coding schemes to redundantly store information in an array of single-port memory banks. However, we note that our work significantly differs from [19], [20], as we specifically rely on different coding schemes arising under the framework of batch codes [9]. Additionally, due to the employment of distinct coding techniques, the design of the memory controller in our work also differs from that in [19], [20].

III. CODES TO IMPROVE ACCESSES
Introducing redundancy into a storage space comprised of single-port memory banks enables simultaneous memory access. In this section we propose memory designs that utilize coding schemes which are designed for access-efficiency. We first define some basic concepts with an illustrative example and then describe the coding schemes in detail.

A. Coding for memory banks
A coding scheme defines how memory is encoded to yield redundant storage. The memory structures which store the original memory elements are known as data banks. The elements of the data banks go through an encoding process which generates a number of parity banks. The parity banks contain elements constructed from elements drawn from two or more data banks. A linear encoding process such as XOR may be used to minimize computational complexity. The following
Fig. 6: This illustration is an example parity bank design.

example further clarifies these concepts and provides some necessary notation.
Example 1:
Consider a setup with two data banks a and b. We assume that each of the banks stores L · W binary data elements which are arranged in an L × W array. In particular, for i ∈ [L] := {1, ..., L}, a(i) and b(i) denote the i-th row of bank a and bank b, respectively. Moreover, for i ∈ [L] and j ∈ [W] := {1, ..., W}, we use a_{i,j} and b_{i,j} to denote the j-th element in the rows a(i) and b(i), respectively. Therefore, for i ∈ [L], we have

a(i) = (a_{i,1}, a_{i,2}, ..., a_{i,W}) ∈ {0, 1}^W
b(i) = (b_{i,1}, b_{i,2}, ..., b_{i,W}) ∈ {0, 1}^W.

Now, consider a linear coding scheme that produces a parity bank p with L'W bits arranged in an L' × W array such that for i ∈ [L'] := {1, ..., L'},

p(i) = (p_{i,1}, ..., p_{i,W}) = a(i) + b(i) := (a_{i,1} + b_{i,1}, a_{i,2} + b_{i,2}, ..., a_{i,W} + b_{i,W}).  (1)

Remark 2:
Figure 6 illustrates this coding scheme. Because the parity bank is based on those rows of the data banks that are indexed by the set [L'] ⊆ [L], we use the following concise notation to represent the encoding of the parity bank:

p = a([L']) + b([L']).

In general, we can use any subset S = {i_1, i_2, ..., i_{L'}} ⊆ [L] comprising L' rows of the data banks to generate the parity bank p. In this case, we have p = a(S) + b(S), or p(l) = a(i_l) + b(i_l) for l ∈ [L'].

Remark 3:
Note that we allow the data banks and parity banks to have different sizes, i.e. L ≠ L'. This freedom in memory design can be utilized to reduce the storage overhead of parity banks based on the underlying application. If the size of a parity bank is smaller than that of a data bank, i.e. L' < L, we refer to it as a shallow bank. Note that the size of shallow banks is a design choice which is controlled by the parameter 0 < α ≤ 1, where L' = αL. A small value of α corresponds to a small storage overhead. (It is possible to work with data elements over larger alphabets/finite fields. However, assuming data elements to be binary suffices for this paper, as we only work with coding schemes defined over the binary field.) The choice of a small α comes at the cost of limiting parity memory accesses to certain memory ranges. In Section IV-E we discuss techniques for choosing which regions of memory to encode. In scenarios where many memory accesses are localized to small regions of memory, shallow banks can support many parallel memory accesses for little storage overhead. For applications where memory access patterns are less concentrated, the robustness of the parity banks allows one to employ a design with α = 1.

1) Degraded reads and their locality: The redundant data generated by a coding scheme mitigates bank conflicts by supporting multiple read accesses to the original data elements. Consider the coding scheme illustrated in Figure 6 with a parity bank p = a([L']) + b([L']). In an uncoded memory system, simultaneous read requests for bank a, such as a(1) and a(5), result in a bank conflict. The introduction of p allows both read requests to be served. First, a(1) is served directly from bank a. Next, b(5) and p(5) are downloaded. Since a(5) = b(5) + p(5), a(5) is recovered by means of the memory in the parity bank. A read request which is served with the help of parity banks is called a degraded read.
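The encoding of (1) and the degraded read just described can be sketched in a few lines; the dimensions and contents below are illustrative:

```python
import random

L, L_prime, W = 8, 6, 4  # illustrative sizes with L' < L (a shallow parity)

# Data banks as L x W binary arrays; the parity bank is the row-wise XOR of
# the first L' rows, as in equation (1).
bank_a = [[random.randint(0, 1) for _ in range(W)] for _ in range(L)]
bank_b = [[random.randint(0, 1) for _ in range(W)] for _ in range(L)]
bank_p = [[x ^ y for x, y in zip(bank_a[i], bank_b[i])]
          for i in range(L_prime)]

# Degraded read of a(5) (row index 4 here): download b(5) and p(5), XOR them.
recovered_a5 = [x ^ y for x, y in zip(bank_b[4], bank_p[4])]
assert recovered_a5 == bank_a[4]
```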
Each degraded read has a parameter, its locality, which corresponds to the total number of banks used to serve it. Here, the degraded read for a(5) using banks b and p has locality 2.

B. Codes to emulate multi-port memory

We will now describe the code schemes proposed for the emulation of multi-port memories. Among a large set of possible coding schemes, we focus on three specific coding schemes for this task. We believe that these three coding schemes strike a good balance among various quantitative parameters, including storage overhead, the number of simultaneous read requests supported by the array of banks, and the locality associated with various degraded reads. Furthermore, these coding schemes respect the practical constraint of encoding across a small number of data banks. In particular, we focus on the setup with 8 data banks.

1) Code Scheme I: This code scheme is motivated by the concept of batch codes [9], which enable parallel access to content stored in a large scale distributed storage system. The code scheme involves 8 data banks {a, b, ..., h}, each of size L, and 12 shallow banks, each of size L' = αL. We partition the data banks into two groups of 4 banks. The underlying coding scheme produces 6 shallow parity banks by separately encoding the data banks from each of the two groups. Figure 7 shows the resulting memory banks. The storage overhead of this scheme is 12αL, which implies that the rate of the coding scheme is

8L / (8L + 12αL) = 2 / (2 + 3α).

(The information rate is a standard measure of the redundancy of a coding scheme, ranging from 0 to 1, where 1 corresponds to the most efficient utilization of storage space.)

We now analyze the number of simultaneous read requests that can be supported by this code scheme.

Best case analysis: This code scheme achieves maximum performance when sequential accesses to the coded regions are issued. During the best case access, we can achieve up to 10 parallel accesses to a particular coded region in one access cycle.
Consider the scenario when we receive accesses to the following rows:

{a(1), b(1), c(1), d(1), a(2), b(2), c(2), d(2), c(3), d(3)}.

Note that we can serve the read requests for the rows {a(1), b(1), c(1), d(1)} using the data bank a and the three parity banks storing {a(1)+b(1), b(1)+c(1), c(1)+d(1)}. The requests for {a(2), c(2), d(2)} can be served by downloading b(2) from the data bank b and {a(2)+d(2), b(2)+d(2), a(2)+c(2)} from their respective parity banks. Lastly, in the same memory clock cycle, we can serve the requests for {c(3), d(3)} using the data banks c and d.

Fig. 7: Pictured here is an illustration of code scheme I.

Worst case analysis: This code scheme (cf. Figure 7) may fail to utilize any parity banks depending on the requests waiting to be served. The worst case scenario for this code scheme is when there are non-sequential and non-consecutive accesses to the memory banks. Take for example a scenario where we only consider the first four banks of the code scheme. The following read requests are waiting to be served:

{a(1), a(2), b(8), b(9), c(10), c(11), d(14), d(15)}.

Because none of the requests share the same row index, we are unable to utilize the parity banks. The worst case number of reads per cycle is equal to the number of data banks.
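The chained decode in the best-case example above (recovering a(2), c(2), and d(2) after a direct read of b(2)) can be checked in a few lines; the row contents are illustrative:

```python
# Scheme I best-case decode: row 2 of banks a, c, d is recovered using one
# direct read of b(2) plus the pairwise parities a+d, b+d, a+c.
row = {"a": 0b0110, "b": 0b1011, "c": 0b0001, "d": 0b1101}  # illustrative
parity = {
    ("a", "c"): row["a"] ^ row["c"],
    ("a", "d"): row["a"] ^ row["d"],
    ("b", "d"): row["b"] ^ row["d"],
}

b2 = row["b"]                 # direct read from data bank b
d2 = b2 ^ parity[("b", "d")]  # b(2) + (b(2)+d(2)) = d(2)
a2 = d2 ^ parity[("a", "d")]  # d(2) + (a(2)+d(2)) = a(2)
c2 = a2 ^ parity[("a", "c")]  # a(2) + (a(2)+c(2)) = c(2)
assert (a2, c2, d2) == (row["a"], row["c"], row["d"])
```

Each of the four banks touched here (one data bank and three parity banks) is accessed exactly once, which is what allows all four row-2 requests to complete in a single memory clock cycle.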
2) Code Scheme II: Figure 8 illustrates the second code scheme explored in this paper. Again, the data banks {a, b, ..., h} are partitioned into two groups containing 4 data banks each. These two groups are then associated with two code regions. The first code region is similar to the previous code scheme, as it contains parity elements constructed from two data banks. The second code region contains data directly duplicated from single data banks. This code scheme further differs from the previous code scheme (cf. Figure 7) in terms of the size and arrangement of its parity banks. Even though L' = αL rows from each data bank are stored in a coded manner by generating parity elements, the parity banks are assumed to be storing more than L' rows.

For a specific choice of α, the storage overhead of this scheme is 20αL, which leads to a rate of

8L / (8L + 20αL) = 2 / (2 + 5α).

Note that this code scheme can support 5 read accesses per data bank in a single memory clock cycle, as opposed to the 4 read requests supported by the code scheme from Section III-B1. However, this is made possible at the cost of extra storage overhead. Next, we discuss the performance of this code scheme in terms of the number of simultaneous read requests that can be served in the best and worst case.

Fig. 8: Pictured here is an illustration of code scheme II.

Best case analysis: This code scheme achieves the best access performance when sequential accesses to the data banks are issued. In particular, this scheme can support up to 11 read requests in a single memory clock cycle. Consider the scenario where we receive read requests for the following rows of the data banks:

{a(1), b(1), c(1), d(1), a(2), b(2), c(2), d(2), a(3), b(3), c(3)}.

Here, we can serve {a(1), b(1), c(1), d(1)} using the data bank a with the parity banks storing the parity elements {a(1)+b(1), b(1)+c(1), c(1)+d(1)}. Similarly, we can serve the requests for the rows {a(2), b(2), d(2)} using the data bank b with the parity banks storing the parity elements {a(2)+d(2), b(2)+d(2)}. Lastly, the request for the rows c(2) and d(3) is served using the data banks c and d.

Worst case analysis: Similar to the worst case in Scheme I, this code scheme can enable 8 simultaneous accesses in a single memory clock cycle in the worst case. The worst case occurs when requests are non-sequential and non-consecutive.

3) Code Scheme III: The next code scheme we discuss has locality 3, so each degraded read is served using three banks (one parity bank and two data banks). This code scheme works with 9 data banks {a, b, ..., h, z} and generates 9 shallow parity banks. Figure 9 shows this scheme. The storage overhead of this scheme is 9αL, which corresponds to a rate of 1 / (1 + α). We note that this scheme possesses higher logical complexity as a result of its increased locality.

This scheme supports 4 simultaneous read accesses per bank per memory clock cycle, as demonstrated by the following example. Suppose rows {a(1), a(2), a(3), a(4)} are requested. a(1) can be served directly from a.
a(2) is served by means of a parity read and reads to banks b and c, a(3) is served by means of a parity read and reads to banks d and g, and a(4) is served by means of a parity read and reads to banks e and z.

Best case analysis: Following an analysis similar to that of code schemes I and II, the best case number of reads per cycle is equal to the number of data and parity banks.

Worst case analysis: Similar to code schemes I and II, the worst case number of reads per cycle is equal to the number of data banks.

Fig. 9: Pictured here is an illustration of code scheme III.

Remark 5: Note that the coding scheme in Figure 9 describes a system with 9 data banks. However, we have set out to construct a memory system with 8 data banks. It is straightforward to modify this code scheme to work with 8 data banks by simply omitting the final data bank from the encoding operation.

IV. MEMORY CONTROLLER DESIGN

The architecture of the memory controller is focused on exploiting the redundant storage in the coding schemes to serve memory requests faster than an uncoded scheme. The following three stages are illustrated in Figure 2:

• Core arbiter: Every clock cycle, the core arbiter receives up to one request from each core, which it stores in an internal queue. The core arbiter attempts to push these requests to the appropriate bank queue.
If it detects that the destination bank queue is full, the controller signals that the core is busy, which stalls the core.

• Bank queues: Each data bank has a corresponding read queue and write queue. The core arbiter sends memory requests to the bank queues until the queues are full. In our simulations, we use a bank queue depth of 10. There is also an additional queue which holds special requests such as memory refresh requests.

• Access scheduler: Every memory cycle, the access scheduler chooses to serve read requests or write requests, algorithmically determining which requests in the bank queues it will schedule. The scheduling algorithms the access scheduler uses are called pattern builders. Depending on the current memory cycle's request type, the access scheduler invokes either the read or write pattern builder.

We note that the core arbiter and bank queues should not differ much from those in a traditional uncoded memory controller. The access scheduler directly interacts with the memory banks, and therefore must be designed specifically for our coding schemes.

Fig. 10: Pictured here is an illustration of the access scheduler.

Fig. 11: Pictured here is a flowchart of the read pattern builder.

A. Code Status Table

The code status table keeps track of the validity of elements stored in the data and parity banks. When a write is served to a row in a data bank, any parity bank which is constructed from the data bank will contain invalid data in its corresponding row. Similarly, when the access scheduler serves a write to a parity bank, both the data bank which contains the memory address specified by the write request and any parity banks which utilize that data bank will contain invalid data.

Figure 10 depicts our implementation of the code status table.
It contains an entry for every row in each data bank, which can take one of three values, indicating that 1) the data in both the data bank and the parity banks is fresh, 2) the data bank contains the most recent data, or 3) one of the parity banks contains the most recent data.

B. Read pattern builder

The access scheduler uses the read pattern builder algorithm to determine which requests to serve using parity banks and which to serve with data banks. The read pattern builder selects which memory requests to serve and determines how requests served by parity banks will be decoded. The algorithm is designed to serve many read requests in a single memory cycle. Figure 11 shows our implementation of the read pattern builder.

Figure 12 shows an example read pattern constructed by our algorithm. The provided scenario is an example where the parity banks are used to their best effect, because each parity bank is used to serve an additional read request.

Fig. 12: Illustration of the algorithm to build a read request pattern to be served in a given memory cycle. All the read requests associated with the strike-through elements are scheduled to be served in a given memory cycle. The figure also shows the elements downloaded from all the memory banks in order to serve these read requests.

C. Write pattern builder

Parity banks allow the memory controller to serve additional write requests per cycle. When multiple writes target a single bank, the controller can commit some of them to parity banks. The access scheduler implements a write pattern builder algorithm to determine which write requests to schedule in a single memory cycle. Figure 13 illustrates our implementation of the write pattern builder.
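Both pattern builders rely on the linearity of the parity banks: each parity row is the sum of the corresponding rows of its constituent data banks, so an element in a busy bank can be recovered from the parity row and the remaining banks. The sketch below illustrates this decoding step under the assumption that "+" denotes bitwise XOR (common in coded-memory designs); the function names are hypothetical, not part of the paper's design.

```python
# Minimal sketch of parity-based read decoding. Assumption: "+" is
# bitwise XOR; encode_parity/decode_read are hypothetical names.

def encode_parity(*rows: int) -> int:
    """Parity row = sum (XOR) of the corresponding rows of the data banks."""
    p = 0
    for r in rows:
        p ^= r
    return p

def decode_read(parity_row: int, other_rows: list[int]) -> int:
    """Recover a busy bank's row from the parity row and the other data banks."""
    v = parity_row
    for r in other_rows:
        v ^= r
    return v

# Data banks a, b, c hold row 2; the parity bank holds a(2)+b(2)+c(2).
a2, b2, c2 = 0x1234, 0xABCD, 0x0F0F
p2 = encode_parity(a2, b2, c2)

# Bank a is busy serving another request; a(2) can still be served this
# cycle by reading the parity bank and banks b and c, then cancelling
# b(2) and c(2) out of the parity row.
assert decode_read(p2, [b2, c2]) == a2
```

This is why each parity bank can serve at most one extra read per cycle: the decode consumes one row from the parity bank plus one row from each of the other constituent data banks.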
Only when the write bank queues are nearly full does the access scheduler execute the write pattern builder algorithm.

Figure 14 shows an example write pattern produced by our algorithm. Parity banks increase the maximum number of write requests that can be served per cycle. Note that an element which is addressed to row n in a data bank can only be written to the corresponding row n in the parity banks. In this scenario, the write queues for every data bank are full. The controller takes two write requests from each queue and schedules one to the queue's target data bank and the other to a parity bank. The controller also updates the code status table.

Figure 14 also demonstrates how the code status table changes to reflect the freshness of the elements in the data and parity banks. Here, the 00 status indicates that all elements are updated. The 01 status indicates that the data banks contain fresh elements and the elements in the parity banks must be recoded. The 10 status indicates that the parity banks contain fresh elements, and that both data banks and parity banks must be updated.

Fig. 13: Pictured here is a flowchart of the write pattern builder.

Fig. 14: The behavior of the write pattern builder on a 4-bank memory system is demonstrated here.

D. ReCoding unit

After a write request has been served, the stale data in the parity (or data) banks must be replaced. The ReCoding unit accomplishes this with a queue of recoding requests. Every time a write is served, recoding requests are pushed onto the queue indicating which data and parity banks contain stale elements, as well as the bank which generated the recoding request. Requests also contain the current cycle number so that the ReCoding unit may prioritize older requests.

E. Dynamic Coding

To reduce the memory overhead α, the parity banks are designed to be smaller than the data banks.
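The three-valued bookkeeping of the code status table can be sketched as follows. The 00/01/10 labels come from the paper's Figure 14; the class, its methods, and the per-(bank, row) dictionary layout are our assumptions for illustration.

```python
# Sketch of the code status table (assumption: one status per (bank, row)).
# "00" = data and parity fresh; "01" = data bank fresh, parity stale;
# "10" = a parity bank holds the freshest copy, data bank stale.

FRESH, DATA_FRESH, PARITY_FRESH = "00", "01", "10"

class CodeStatusTable:
    def __init__(self, num_banks: int, rows_per_bank: int):
        # Every row of every data bank starts out consistent with its parities.
        self.status = {(b, r): FRESH
                       for b in range(num_banks)
                       for r in range(rows_per_bank)}

    def write_to_data_bank(self, bank: int, row: int) -> None:
        # Any parity row built from this data row is now stale.
        self.status[(bank, row)] = DATA_FRESH

    def write_to_parity_bank(self, bank: int, row: int) -> None:
        # The data bank (and any other parity built from it) is now stale.
        self.status[(bank, row)] = PARITY_FRESH

    def recoded(self, bank: int, row: int) -> None:
        # The ReCoding unit has refreshed all stale copies of this row.
        self.status[(bank, row)] = FRESH

table = CodeStatusTable(num_banks=4, rows_per_bank=8)
table.write_to_data_bank(0, 3)
assert table.status[(0, 3)] == DATA_FRESH   # parity for row 3 must be recoded
table.recoded(0, 3)
assert table.status[(0, 3)] == FRESH
```

The read pattern builder would consult this table before serving a read from a parity bank, since a parity row marked 01 cannot be used for decoding until it is recoded.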
The dynamic coding block maintains codes for the most heavily accessed memory subregions, so that the parity banks are utilized more often.

The dynamic coding block partitions each memory bank into ⌈1/r⌉ regions. The block can select up to α/r − 1 regions to be encoded in the parity banks; a single region is reserved to allow encoding of a new region. Every T cycles, the dynamic coding unit chooses the α/r − 1 regions with the greatest number of memory accesses and encodes these regions in the parity banks. If all the selected regions are already encoded, the unit does nothing. Otherwise, the unit begins encoding the most accessed region. Once the dynamic coding unit has finished encoding a new region, the region becomes available for use by the rest of the memory controller. A memory region of length r is reserved by the dynamic coding unit for constructing new encoded regions, and a memory region of length α − r is reserved for active encoded regions. If the memory ceiling α − r is reached when a new memory region is encoded, the unit evicts the least frequently used encoded region.

V. EXPERIMENTS

In this section, we discuss our method for evaluating the performance of the proposed memory system. We use the PARSEC v2.1 and v3.0 benchmark suites with the gem5 simulator [13], [12] to generate memory traces, then run the Ramulator DRAM simulator [11] to measure the performance of the proposed memory system, comparing the baseline Ramulator performance against a modified version which implements the proposed system.

A. Memory Trace Generation

The PARSEC benchmark suite was developed for chip multiprocessors and is composed of a diverse set of multithreaded applications [13]. These benchmarks allow us to observe how the proposed memory system performs in dense memory access scenarios.

The gem5 simulator [12] allows us to select the processor configuration used when generating the memory traces.
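The region-selection policy of the dynamic coding unit (Section IV-E) can be sketched as follows. All names are our assumptions, the encoding latency is abstracted away, and `max_active_regions` stands in for the paper's α/r − 1 budget.

```python
# Sketch of the dynamic coding unit's region-selection policy (hypothetical
# names). Every T cycles it picks the most-accessed regions, encodes the
# hottest region not yet encoded, and evicts the least frequently used
# encoded region when the memory ceiling is reached.

from collections import Counter

class DynamicCodingUnit:
    def __init__(self, max_active_regions: int):
        self.max_active = max_active_regions   # budget: alpha/r - 1 regions
        self.access_counts = Counter()         # per-region access tally
        self.encoded = set()                   # currently encoded regions

    def record_access(self, region: int) -> None:
        self.access_counts[region] += 1

    def rebalance(self) -> None:
        """Called every T cycles."""
        hottest = {r for r, _ in self.access_counts.most_common(self.max_active)}
        if hottest <= self.encoded:
            return                             # all hot regions already encoded
        # Begin encoding the most accessed region that is not yet encoded.
        new = max(hottest - self.encoded, key=self.access_counts.__getitem__)
        if len(self.encoded) >= self.max_active:
            # Memory ceiling reached: evict the least frequently used region.
            lfu = min(self.encoded, key=self.access_counts.__getitem__)
            self.encoded.discard(lfu)
        self.encoded.add(new)

unit = DynamicCodingUnit(max_active_regions=2)
for region, hits in [(0, 50), (1, 40), (2, 5)]:
    for _ in range(hits):
        unit.record_access(region)
unit.rebalance()   # encodes region 0 (hottest)
unit.rebalance()   # encodes region 1 (next hottest)
assert unit.encoded == {0, 1}
```

With two stable hot bands and a budget of two, this policy settles quickly; with a budget of one it would vacillate between the two bands, which mirrors the high switch counts the paper reports at low overhead.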
We use the same number of processors in all simulations. The PARSEC traces can be split into regions based on computation type. We focus on regions which feature parallel processing because they have the greatest likelihood of bank conflicts.

Many attributes affect the performance of our proposed memory system, most importantly the density of the traces, the overlap of memory accesses among processors, and the stationarity of heavily utilized memory regions. We find that memory access patterns that occupy consistent bands of sequential memory addresses benefit most from our proposed memory system. Figure 15 shows the access pattern of the dedup PARSEC benchmark.

We augment the PARSEC benchmarks in two ways to test our system in additional scenarios, shown in Figures 16 and 17, respectively. First, we split the given memory bands to simulate an increased number of bands in the system. Next, we introduce dynamic memory access patterns by adding a linear ramp to the previously static address locations.

Fig. 15: The memory access pattern of the dedup PARSEC benchmark.

Fig. 16: The vips benchmark after splitting the primary access bands into multiple additional bands.

Fig. 17: The vips benchmark after adding a ramp to the major memory bands.

B. Ramulator

We use the Ramulator DRAM simulator to compare the number of CPU cycles required to execute the PARSEC memory traces. First, we use the original Ramulator simulator to measure a baseline number of CPU cycles. Then we implement the proposed memory system and use the modified Ramulator (fixing all other configuration) to calculate improvements over the baseline. Our simulations vary the overhead parameter α.

C. Simulation Results

Given sufficient memory overhead, we see a consistent 25% reduction in CPU cycles over the baseline simulation, with Coding Scheme I generally performing best. The proposed memory system performs consistently across the PARSEC benchmarks, and the three proposed schemes yield similar results. Figure 18 shows the simulation results
for the dedup benchmark with the memory partition coefficient r fixed. The plot shows that the number of CPU cycles is reduced by roughly 25% once sufficient memory overhead α is used.

The figure also shows the number of memory region switches performed by the dynamic encoder. When α = 1, the number of switches is always zero, as expected, because the dynamic encoder never needs to switch regions. The performance remains consistent once α is sufficiently large: with this amount of overhead, the memory system finds and encodes the two heavily accessed memory bands in each of the PARSEC benchmarks. This is because ⌊α/r⌋ = 2, which means we can select two regions to encode. When α is too small, the number of coded region switches is very high because the memory system vacillates between the two most heavily accessed bands; with larger α, both of them can be encoded. We see a small number of switches at still higher overheads because the memory system encodes less heavily accessed memory bands, with little impact on the number of CPU cycles.

1) Augmented PARSEC: Results on the augmented PARSEC traces show that our system improves over the baseline to a lesser extent. Figure 19 shows that, for a large number of memory bands, we can achieve the same performance as before only by increasing the memory overhead or the memory partition coefficient. Figure 20 shows the results of the ramp augmentation. Here we see that our system struggles to adapt to a constantly changing primary access region.

Fig. 18: The simulation results for the dedup PARSEC benchmark. The line plot represents the number of CPU cycles executed and the bar plot represents the number of times the dynamic coding unit chooses to encode a new memory region.

VI. CONCLUSION

Our proposed design emulates multi-port memory using coding techniques with single-port memory, and it is able to speed up the execution time of the PARSEC benchmarks.
We are able to support multiple reads and writes in a single memory cycle, and compared to replication-based methods we use far less memory and therefore less power and chip area. However, the design of parity storage requires additional logic at the memory controller to encode/decode and to schedule reads/writes. Further iterations on our design may include using idle banks to prefetch symbols and improvements to the read and write pattern builders.

Fig. 19: The simulation results of the augmented vips trace pictured in Figure 16.

Fig. 20: The simulation results of the augmented vips trace pictured in Figure 17.

VII. ACKNOWLEDGEMENTS

This document is derived from previous conferences, in particular HPCA 2017. We thank Daniel A. Jimenez and Elvira Teran for their input.

REFERENCES

[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, Mar. 1995.
[2] M. M. Waldrop, "The chips are down for Moore's law," Nature, vol. 530, no. 7589, pp. 144–147, Feb. 2016.
[4] D. Geer, "Chip makers turn to multicore processors," Computer, vol. 38, no. 5, pp. 11–13, May 2005.
[5] W.-F. Lin, S. K. Reinhardt, and D. Burger, "Reducing DRAM latencies with an integrated memory hierarchy design," in Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA), Jan. 2001, pp. 301–312.
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2006.
[7] A. G. Dimakis, P. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4539–4551, Sept. 2010.
[8] P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin, "On the locality of codeword symbols," IEEE Transactions on Information Theory, vol. 58, no. 11, pp. 6925–6934, Nov. 2012.
[9] Y. Ishai, E. Kushilevitz, R. Ostrovsky, and A. Sahai, "Batch codes and their applications," in Proc. of the thirty-sixth annual ACM Symposium on Theory of Computing (STOC), 2004, pp.
262–271.
[10] A. S. Rawat, D. S. Papailiopoulos, A. G. Dimakis, and S. Vishwanath, "Locality and availability in distributed storage," IEEE Transactions on Information Theory, vol. 62, no. 8, pp. 4481–4493, Aug. 2016.
[11] Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A fast and extensible DRAM simulator," IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, Jan. 2016.
[12] M. Gebhart, J. Hestness, E. Fatehi, P. Gratz, and S. W. Keckler, "Running PARSEC 2.1 on M5," The University of Texas at Austin, Department of Computer Science, Tech. Rep., October 2009.
[13] C. Bienia and K. Li, "PARSEC 2.0: A new benchmark suite for chip-multiprocessors," in Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.
[14] T. Suzuki, H. Yamauchi, Y. Yamagami, K. Satomi, and H. Akamatsu, "A stable 2-port SRAM cell design against simultaneously read/write-disturbed accesses," IEEE Journal of Solid-State Circuits, vol. 43, no. 9, pp. 2109–2119, Sept. 2008.
[15] D. P. Wang, H. J. Lin, C. T. Chuang, and W. Hwang, "Low-power multiport SRAM with cross-point write word-lines, shared write bit-lines, and shared write row-access transistors," IEEE Transactions on Circuits and Systems II: Express Briefs.
[22] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes. Amsterdam: North-Holland, 1983.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
[24] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure coding in Windows Azure storage," in Proc. of the 2012 USENIX Annual Technical Conference (USENIX ATC 12). USENIX, 2012, pp. 15–26.
[25] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, "XORing elephants: Novel erasure codes for big data," in Proc.
of the 39th International Conference on Very Large Data Bases (VLDB), vol. 6, no. 5, June 2013, pp. 325–336.
[26] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, "A 'hitchhiker's' guide to fast and efficient data reconstruction in erasure-coded data centers," SIGCOMM Comput. Commun. Rev., vol. 44, no. 4, pp. 331–342, Aug. 2014.
[27] A. S. Rawat, Z. Song, A. G. Dimakis, and A. Gal, "Batch codes through dense graphs without short cycles," IEEE Transactions on Information Theory, vol. 62, no. 4, pp. 1592–1604, April 2016.
[28] Z. Wang, H. M. Kiah, Y. Cassuto, and J. Bruck, "Switch codes: Codes for fully parallel reconstruction," IEEE Transactions on Information Theory, vol. 63, no. 4, pp. 2061–2075, Apr. 2017.
[29] N. P. Anthapadmanabhan, E. Soljanin, and S. Vishwanath, "Update-efficient codes for erasure correction," in Proc. of the 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept. 2010, pp. 376–382.
[30] A. Mazumdar, V. Chandar, and G. W. Wornell, "Update-efficiency and local repairability limits for capacity approaching codes,"