Joseph Nuzman | Researchain

Publication

Featured research published by Joseph Nuzman.

International Conference on Parallel Architectures and Compilation Techniques | 2012

Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches

Mainak Chaudhuri; Jayesh Gaur; Nithiyanandan Bashyam; Sreenivas Subramoney; Joseph Nuzman

The replacement policies for last-level caches (LLCs) are usually designed based on the access information available locally at the LLC. These policies are inherently sub-optimal because they lack information about the activities in the inner levels of the hierarchy. This paper introduces cache hierarchy-aware replacement (CHAR) algorithms for inclusive LLCs (or L3 caches) and applies the same algorithms to implement efficient bypass techniques for exclusive LLCs in a three-level hierarchy. In a hierarchy with an inclusive LLC, these algorithms mine the L2 cache eviction stream and decide if a block evicted from the L2 cache should be made a victim candidate in the LLC based on the access pattern of the evicted block. Ours is the first proposal that explores the possibility of using a subset of L2 cache eviction hints to improve the replacement algorithms of an inclusive LLC. The CHAR algorithm classifies the blocks residing in the L2 cache based on their reuse patterns and dynamically estimates the reuse probability of each class of blocks to generate selective replacement hints to the LLC. Compared to the static re-reference interval prediction (SRRIP) policy, our proposal offers an average reduction of 10.9% in LLC misses and an average improvement of 3.8% in instructions retired per cycle (IPC) for twelve single-threaded applications. The corresponding reduction in LLC misses for one hundred 4-way multi-programmed workloads is 6.8%, leading to an average improvement of 3.9% in throughput. Finally, our proposal achieves an 11.1% reduction in LLC misses and a 4.2% reduction in parallel execution cycles for six 8-way threaded shared memory applications compared to the SRRIP policy. In a cache hierarchy with an exclusive LLC, our CHAR proposal offers an effective algorithm for selecting the subset of blocks (clean or dirty) evicted from the L2 cache that need not be written to the LLC and can be bypassed. Compared to the TC-AGE policy (analogue of SRRIP for exclusive LLC), our best exclusive LLC proposal improves average throughput by 3.2% while saving an average of 66.6% of data transactions from the L2 cache to the on-die interconnect for one hundred 4-way multi-programmed workloads. Compared to an inclusive LLC design with an identical hierarchy, this corresponds to an average throughput improvement of 8.2% with only 17% more data write transactions originating from the L2 cache.
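
The class-based hinting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the two-attribute classes (reuse seen in the L2, clean or dirty), the 0.125 threshold, and all names below are assumptions made for illustration. The sketch counts, per class, how many L2-evicted blocks are later recalled from the LLC, and flags a class as a victim candidate when its estimated reuse probability falls below the threshold.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def classify(l2_hits, dirty):
    """Hypothetical block classes: reuse observed while in L2, and clean/dirty."""
    return ("reused" if l2_hits > 0 else "no_reuse", "dirty" if dirty else "clean")


@dataclass
class ReuseEstimator:
    """Per-class estimate of how often L2-evicted blocks are recalled from the LLC."""
    threshold: float = 0.125  # assumed cut-off, not taken from the paper
    evicted: dict = field(default_factory=lambda: defaultdict(int))
    recalled: dict = field(default_factory=lambda: defaultdict(int))

    def record_eviction(self, block_class):
        # Called when a block of this class leaves the L2 cache.
        self.evicted[block_class] += 1

    def record_recall(self, block_class):
        # Called when a previously evicted block of this class hits in the LLC.
        self.recalled[block_class] += 1

    def reuse_probability(self, block_class):
        n = self.evicted[block_class]
        # With no observations yet, conservatively treat the class as live.
        return self.recalled[block_class] / n if n else 1.0

    def hint_is_dead(self, block_class):
        # True means: hint the LLC to treat the evicted block as a victim candidate.
        return self.reuse_probability(block_class) < self.threshold


est = ReuseEstimator()
streaming = classify(l2_hits=0, dirty=False)
hot = classify(l2_hits=3, dirty=False)
for _ in range(100):
    est.record_eviction(streaming)
    est.record_eviction(hot)
for _ in range(2):
    est.record_recall(streaming)   # 2 of 100 recalled
for _ in range(60):
    est.record_recall(hot)         # 60 of 100 recalled
print(est.hint_is_dead(streaming), est.hint_is_dead(hot))
```

In this toy run the rarely recalled class is hinted as dead and the frequently recalled class is not. In an exclusive hierarchy, the same decision could be read as "bypass the LLC write" rather than "mark as victim".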

High-Performance Computer Architecture | 2015

High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches

Aamer Jaleel; Joseph Nuzman; Adrian C. Moga; Simon C. Steely; Joel S. Emer

Increasing transistor density enables adding more on-die cache real estate. However, devoting more space to the shared last-level cache (LLC) causes the memory latency bottleneck to move from memory access latency to shared cache access latency. As such, applications whose working set is larger than the smaller caches spend a large fraction of their execution time on shared cache access latency. To address this problem, this paper investigates increasing the size of smaller private caches in the hierarchy as opposed to increasing the shared LLC. Doing so improves average cache access latency for workloads whose working set fits into the larger private cache while retaining the benefits of a shared LLC. The consequence of increasing the size of private caches is to relax inclusion and build exclusive hierarchies. Thus, for the same total caching capacity, an exclusive cache hierarchy provides better cache access latency. We observe that server workloads benefit tremendously from an exclusive hierarchy with large private caches. This is primarily because large private caches accommodate the large code working sets of server workloads. For a 16-core CMP, an exclusive cache hierarchy improves server workload performance by 5-12% compared to an equal-capacity inclusive cache hierarchy. The paper also presents directions for further research to maximize performance of exclusive cache hierarchies.
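
The "same total caching capacity" argument can be made concrete with a small capacity calculation. This is a sketch under stated assumptions: the cache sizes are hypothetical (only the 16-core count comes from the abstract), the L1 caches are ignored, and the inclusive LLC is assumed to be at least as large as the combined private caches.

```python
def unique_capacity_kb(cores, l2_kb_per_core, llc_kb, inclusive):
    """Distinct data the L2 + LLC levels can hold, in KB (L1 ignored)."""
    private_total = cores * l2_kb_per_core
    if inclusive:
        # Every L2 block is duplicated in the LLC, so the LLC bounds capacity.
        return llc_kb
    # Exclusive: a block lives in either an L2 or the LLC, never both.
    return private_total + llc_kb


# Hypothetical 16-core configurations with equal unique capacity.
inclusive_cfg = unique_capacity_kb(16, 256, 32 * 1024, inclusive=True)
exclusive_cfg = unique_capacity_kb(16, 1024, 16 * 1024, inclusive=False)
print(inclusive_cfg, exclusive_cfg)
```

Both configurations hold 32 MB of distinct data, but in the exclusive one each core has a 1 MB private cache instead of 256 KB, so a larger share of accesses can be served at private-cache latency.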

Archive | 2012

Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program to multiple parallel threads

David J. Sager; Ruchira Sasanka; Ron Gabor; Shlomo Raikin; Joseph Nuzman; Leeor Peled; Jason A. Domer; Ho-Seop Kim; Youfeng Wu; Koichi Yamada; Tin-Fook Ngai; Howard H. Chen; Jayaram Bobba; Jeffrey J. Cook; Osmar M. Shaikh; Suresh Srinivas

Archive | 2013

Method and apparatus for store durability and ordering in a persistent memory architecture

Subramanya R. Dulloor; Sanjay Kumar; Rajesh M. Sankaran; Gilbert Neiger; Richard Uhlig; Robert S. Chappell; Joseph Nuzman; Kai Cheng; Sailesh Kottapalli; Yen-Cheng Liu; Mohan J. Kumar; Raj K. Ramanujan; Glenn J. Hinton

Archive | 2012