Publication


Featured research published by Carole Jean Wu.


International Symposium on Microarchitecture | 2011

SHiP: signature-based hit predictor for high performance caching

Carole Jean Wu; Aamer Jaleel; William C. Hasenplaugh; Margaret Martonosi; Simon C. Steely; Joel S. Emer

The shared last-level caches in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can significantly improve cache performance. A fundamental challenge, however, is how to best predict the re-reference pattern of an incoming cache line. This paper shows that cache performance can be improved by correlating the re-reference behavior of a cache line with a unique signature. We investigate the use of memory region, program counter, and instruction sequence history based signatures. We also propose a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature. Overall, we find that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals. On average, SHiP improves sequential and multiprogrammed application performance by roughly 10% and 12% over LRU replacement, respectively. Compared to recent replacement policy proposals such as Seg-LRU and SDBP, SHiP nearly doubles the performance gains while requiring less hardware overhead.
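The signature mechanism is easy to sketch. Below is a minimal, hypothetical illustration of a SHiP-style predictor using a PC-based signature: a Signature History Counter Table (SHCT) of saturating counters is trained on hits and on evictions of never-reused lines, and consulted at insertion to pick a re-reference prediction (expressed here in RRIP terms). The table size, counter width, and RRPV values are illustrative assumptions, not the paper's exact configuration.

```cpp
// Sketch of SHiP-PC: a Signature History Counter Table (SHCT) of saturating
// counters learns, per program-counter signature, whether lines inserted by
// that signature tend to be re-referenced. All sizes are illustrative.
#include <array>
#include <cstdint>
#include <iostream>

constexpr int kShctBits = 14;                  // 16K-entry SHCT (illustrative)
constexpr uint8_t kCounterMax = 3;             // 2-bit saturating counters

struct SHiP {
    std::array<uint8_t, 1 << kShctBits> shct{};

    static uint32_t signature(uint64_t pc) {   // hash the PC into an index
        return (uint32_t)((pc ^ (pc >> kShctBits)) & ((1u << kShctBits) - 1));
    }
    // On a cache hit, the signature that inserted the line showed reuse.
    void train_hit(uint64_t insert_pc) {
        uint8_t& c = shct[signature(insert_pc)];
        if (c < kCounterMax) ++c;
    }
    // On evicting a line that was never re-referenced, penalize its signature.
    void train_evict_no_reuse(uint64_t insert_pc) {
        uint8_t& c = shct[signature(insert_pc)];
        if (c > 0) --c;
    }
    // At insertion: counter==0 predicts "no reuse" -> distant re-reference
    // prediction (RRPV=3 in RRIP terms); otherwise an intermediate one.
    int insertion_rrpv(uint64_t pc) const {
        return shct[signature(pc)] == 0 ? 3 : 2;
    }
};

int main() {
    SHiP ship;
    uint64_t pc = 0x400a10;
    ship.train_evict_no_reuse(pc);             // a line from pc died unused
    std::cout << "predicted RRPV: " << ship.insertion_rrpv(pc) << "\n"; // 3
}
```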


International Symposium on Microarchitecture | 2011

PACMan: prefetch-aware cache management for high performance caching

Carole Jean Wu; Aamer Jaleel; Margaret Martonosi; Simon C. Steely; Joel S. Emer

Hardware prefetching and last-level cache (LLC) management are two independent mechanisms to mitigate the growing latency to memory. However, the interaction between LLC management and hardware prefetching has received very little attention. This paper characterizes the performance of state-of-the-art LLC management policies in the presence and absence of hardware prefetching. Although prefetching improves performance by fetching useful data in advance, it can interact with LLC management policies to introduce application performance variability. This variability stems from the fact that current replacement policies treat prefetch and demand requests identically. In order to provide better and more predictable performance, we propose Prefetch-Aware Cache Management (PACMan). PACMan dynamically estimates and mitigates the degree of prefetch-induced cache interference by modifying the cache insertion and hit promotion policies to treat demand and prefetch requests differently. Across a variety of emerging workloads, we show that PACMan eliminates the performance variability in state-of-the-art replacement policies under the influence of prefetching. In fact, PACMan improves performance consistently across multimedia, games, server, and SPEC CPU2006 workloads by an average of 21.9% over the baseline LRU policy. For multiprogrammed workloads, on a 4-core CMP, PACMan improves performance by 21.5% on average.
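The key distinction PACMan draws can be shown in a few lines. The sketch below, which assumes an RRIP-managed cache, treats prefetch and demand requests differently at insertion and at hit promotion; the specific RRPV values are illustrative, not the paper's tuned policy.

```cpp
// Sketch of prefetch-aware insertion/promotion in an RRIP-managed cache:
// prefetches are inserted with a distant re-reference prediction, and a
// prefetch hit does not promote the line the way a demand hit does.
#include <iostream>

enum class Req { Demand, Prefetch };

// RRPV a newly inserted line receives (0 = near re-reference, 3 = distant).
int insertion_rrpv(Req r) {
    return r == Req::Prefetch ? 3 : 2;   // illustrative PACMan-style values
}

// RRPV update on a hit: only demand hits promote the line toward MRU.
int promotion_rrpv(Req r, int current_rrpv) {
    return r == Req::Demand ? 0 : current_rrpv;
}

int main() {
    int rrpv = insertion_rrpv(Req::Prefetch);      // 3: likely evicted soon
    rrpv = promotion_rrpv(Req::Prefetch, rrpv);    // prefetch hit: no change
    rrpv = promotion_rrpv(Req::Demand, rrpv);      // demand hit: promote
    std::cout << "final RRPV: " << rrpv << "\n";   // 0
}
```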


IEEE International Symposium on Workload Characterization | 2013

Performance, energy characterizations and architectural implications of an emerging mobile platform benchmark suite - MobileBench

Dhinakaran Pandiyan; Shin Ying Lee; Carole Jean Wu

In this paper, we explore key microarchitectural features of mobile computing platforms that are crucial to the performance of smart phone applications. We create and use a selection of representative smart phone applications, which we call MobileBench, to aid in this analysis. We also evaluate the effectiveness of the current memory subsystem on mobile platforms. Furthermore, by instrumenting the Android framework, we perform energy characterization for MobileBench on an existing Samsung Galaxy S III smart phone. Based on our energy analysis, we find that application cores on modern smart phones consume a significant amount of energy. This motivates our detailed performance analysis centered on the application cores. Based on our detailed performance studies, we reach several key findings. (i) Using a more sophisticated tournament branch predictor can improve branch prediction accuracy, but this does not translate to observable performance gain. (ii) Smart phone applications show distinct TLB capacity needs. Larger TLBs can improve performance by an average of 14%. (iii) The current L2 cache on most smart phone platforms experiences poor utilization because of the fast-changing memory requirements of smart phone applications. Using a more effective cache management scheme improves L2 cache utilization by as much as 29.3% and by an average of 12%. (iv) Smart phone applications are prefetching-friendly. Using a simple stride prefetcher can improve performance across MobileBench applications by an average of 14%. (v) Lastly, the memory bandwidth requirements of MobileBench applications are moderate and well under the current smart phone memory bandwidth capacity of 8.3 GB/s. With these insights into smart phone application characteristics, we hope to guide the design of future smart phone platforms toward lower power consumption through simpler architectures while achieving high performance.
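Finding (iv) refers to a simple stride prefetcher; a minimal per-PC stride-table sketch follows. The table size, confidence policy, and prefetch degree are illustrative assumptions, not the configuration evaluated in the paper.

```cpp
// Minimal per-PC stride prefetcher sketch: a table records, for each load
// PC, the last address and stride; a repeated stride triggers a prefetch of
// (addr + stride). All sizes and policies are illustrative.
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>

struct Entry { uint64_t last_addr = 0; int64_t stride = 0; int confidence = 0; };

struct StridePrefetcher {
    std::unordered_map<uint64_t, Entry> table;   // keyed by load PC

    // Observe a demand access; optionally return an address to prefetch.
    std::optional<uint64_t> access(uint64_t pc, uint64_t addr) {
        Entry& e = table[pc];
        int64_t s = (int64_t)addr - (int64_t)e.last_addr;
        if (s == e.stride && s != 0) {
            if (e.confidence < 2) ++e.confidence;   // stride repeated
        } else {
            e.stride = s;                           // retrain on a new stride
            e.confidence = 0;
        }
        e.last_addr = addr;
        if (e.confidence >= 1)                      // seen the stride twice
            return addr + e.stride;
        return std::nullopt;
    }
};

int main() {
    StridePrefetcher pf;
    uint64_t pc = 0x1000;
    for (uint64_t a : {0x8000ull, 0x8040ull, 0x8080ull, 0x80c0ull})
        if (auto p = pf.access(pc, a))
            std::cout << "prefetch 0x" << std::hex << *p << "\n";
}
```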


International Symposium on Computer Architecture | 2015

CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads

Shin Ying Lee; Akhil Arunkumar; Carole Jean Wu

The ubiquity of graphics processing unit (GPU) architectures has made them efficient alternatives to chip multiprocessors for parallel workloads. GPUs achieve superior performance by making use of massive multi-threading and fast context-switching to hide pipeline stalls and memory access latency. However, recent characterization results have shown that general purpose GPU (GPGPU) applications commonly encounter long stall latencies that cannot be easily hidden with the large number of concurrent threads/warps. This results in execution time disparity between different parallel warps, hurting the overall performance of GPUs - the warp criticality problem. To tackle the warp criticality problem, we propose a coordinated solution, criticality-aware warp acceleration (CAWA), that efficiently manages compute and memory resources to accelerate the critical warp execution. Specifically, we design (1) an instruction-based and stall-based criticality predictor to identify the critical warp in a thread-block, (2) a criticality-aware warp scheduler that preferentially allocates more time resources to the critical warp, and (3) a criticality-aware cache reuse predictor that assists critical warp acceleration by retaining latency-critical and useful cache blocks in the L1 data cache. CAWA aims to remove the significant execution time disparity in order to improve resource utilization for GPGPU workloads. Our evaluation results show that, under the proposed coordinated scheduler and cache prioritization management scheme, the performance of the GPGPU workloads can be improved by 23%, while other state-of-the-art schedulers, GTO and 2-level schedulers, improve performance by 16% and -2%, respectively.
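As a rough illustration of criticality prediction plus preferential scheduling, here is a hypothetical sketch: each warp accumulates a criticality estimate from its stall cycles and its instruction-count lag behind its thread-block peers, and the scheduler greedily issues the ready warp with the highest estimate. The weighting is an illustrative assumption, not the paper's predictor.

```cpp
// Sketch of criticality-aware warp scheduling: a warp's criticality grows
// with accumulated stall cycles and with how far it lags its peers; the
// scheduler greedily issues the most critical ready warp.
#include <algorithm>
#include <iostream>
#include <vector>

struct Warp {
    int id;
    long stall_cycles   = 0;   // cycles this warp spent stalled
    long insts_executed = 0;   // progress relative to thread-block peers
    bool ready          = false;
};

long criticality(const Warp& w, long max_insts_in_block) {
    // Warps that stalled more, or lag their peers, are more critical.
    return w.stall_cycles + (max_insts_in_block - w.insts_executed);
}

const Warp* pick_warp(const std::vector<Warp>& warps) {
    long max_insts = 0;
    for (const auto& w : warps)
        max_insts = std::max(max_insts, w.insts_executed);
    const Warp* best = nullptr;
    for (const auto& w : warps)
        if (w.ready &&
            (!best || criticality(w, max_insts) > criticality(*best, max_insts)))
            best = &w;
    return best;
}

int main() {
    std::vector<Warp> warps{{0, 120, 900, true},   // lags and stalls: critical
                            {1, 40, 1000, true},
                            {2, 300, 700, false}}; // critical but not ready
    if (const Warp* w = pick_warp(warps))
        std::cout << "issue warp " << w->id << "\n";   // warp 0
}
```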


International Symposium on Performance Analysis of Systems and Software | 2011

Characterization and dynamic mitigation of intra-application cache interference

Carole Jean Wu; Margaret Martonosi

Given the emerging dominance of CMPs, an important research problem concerns application memory performance in the face of deep memory hierarchies, where one or more caches are shared by several cores. In current systems, many factors can cause interference in the shared last-level cache (LLC). While predicting an application's memory performance is difficult enough in an idealized setup, it becomes even more complicated in real-machine environments in which interference can stem from operating system memory accesses, and even from an application's own prefetch requests and page table walks caused by TLB misses. This paper characterizes the degree to which intra-application interference factors such as page table walks and hardware prefetching influence performance. Using hardware performance counters on an Intel platform, we first characterize real-system LLC interference and show that application data memory references represent much less than half of the LLC misses, with hardware prefetching and page table walks causing considerable LLC interference. Based on these characterizations, we propose dynamic management methods to reduce intra-application interference. First, we evaluate a dynamic OS-reference-aware cache insertion policy that reduces interference and improves user IPCs by as much as 19% (5% on average). Second, to mitigate prefetch-induced LLC interference, we propose, implement, and evaluate an automatic prefetch manager that uses Intel PEBS capabilities to dynamically estimate prefetch-induced interference and accordingly adjust the aggressiveness of the hardware prefetchers as programs run. Overall, our characterizations are important in highlighting the challenges of intra-application interference, and our hardware and software proposals offer significant solutions for addressing them.
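The prefetch manager's control loop can be sketched independently of the measurement plumbing. The hypothetical fragment below estimates prefetch accuracy from sampled counters and steps an aggressiveness level up or down; the thresholds and level range are illustrative assumptions, and the actual system derives its interference estimate from Intel PEBS samples rather than the simple accuracy ratio used here.

```cpp
// Simplified feedback loop in the spirit of the paper's prefetch manager:
// periodically estimate how useful prefetches have been and step the
// hardware prefetcher's aggressiveness accordingly. Thresholds illustrative.
#include <algorithm>
#include <iostream>

struct Sample { long useful_prefetches; long total_prefetches; };

int adjust_aggressiveness(int level, const Sample& s,
                          int min_level = 0, int max_level = 4) {
    if (s.total_prefetches == 0) return level;
    double accuracy = (double)s.useful_prefetches / s.total_prefetches;
    if (accuracy < 0.40) level--;        // mostly pollution: throttle down
    else if (accuracy > 0.75) level++;   // mostly useful: more aggressive
    return std::clamp(level, min_level, max_level);
}

int main() {
    int level = 2;
    level = adjust_aggressiveness(level, {300, 1000});  // 30% useful -> 1
    level = adjust_aggressiveness(level, {900, 1000});  // 90% useful -> 2
    std::cout << "aggressiveness level: " << level << "\n";
}
```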


International Symposium on Performance Analysis of Systems and Software | 2015

A study of mobile device utilization

Cao Gao; Anthony Gutierrez; Madhav Rajan; Ronald G. Dreslinski; Trevor N. Mudge; Carole Jean Wu

Mobile devices are becoming more powerful and versatile than ever, calling for better embedded processors. Following the trend in desktop CPUs, microprocessor vendors are trying to meet such needs by increasing the number of cores in mobile device SoCs. However, increasing the core count does not translate proportionally into performance gains and power reduction. In the past, studies have shown that there exists little parallelism to be exploited by a multi-core processor in desktop platform applications, and many cores sit idle during runtime. In this paper, we investigate whether the same is true for current mobile applications. We analyze the behavior of a broad range of commonly used mobile applications on real devices. We measure their Thread Level Parallelism (TLP), the machine utilization over the non-idle runtime. Our results demonstrate that mobile applications utilize fewer than two cores on average, even with background applications running concurrently. We observe diminishing returns in TLP as the number of cores increases, and low TLP even in heavy-load scenarios. These studies suggest that having many powerful cores amounts to over-provisioning. Further analysis of TLP behavior and big-little core energy efficiency suggests that current mobile workloads can benefit from an architecture that has the flexibility to accommodate both high performance and good energy efficiency for different application phases.
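The TLP metric described in the abstract has a standard closed form: with c_i the fraction of total time during which exactly i cores are active, TLP = sum_i(i * c_i) / (1 - c_0), i.e., the average number of busy cores measured only over non-idle time. A small sketch, with an illustrative example distribution:

```cpp
// Thread Level Parallelism: average number of cores concurrently busy,
// normalized to non-idle time. c[i] is the fraction of time exactly i
// cores are active; c[0] is the fully idle fraction.
#include <iostream>
#include <vector>

double tlp(const std::vector<double>& c) {
    double num = 0.0;
    for (size_t i = 1; i < c.size(); ++i) num += i * c[i];
    return num / (1.0 - c[0]);              // divide out the idle fraction
}

int main() {
    // Example: idle 20%, 1 core 50%, 2 cores 25%, 3 cores 5% of the time.
    std::cout << "TLP = " << tlp({0.20, 0.50, 0.25, 0.05}) << "\n";  // 1.4375
}
```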


ACM Transactions on Architecture and Code Optimization | 2011

Adaptive timekeeping replacement: Fine-grained capacity management for shared CMP caches

Carole Jean Wu; Margaret Martonosi

In chip multiprocessors (CMPs), several high-performance cores typically compete for capacity in a shared last-level cache. This causes degraded and unpredictable memory performance for multiprogrammed and parallel workloads. In response, recent schemes apportion cache bandwidth and capacity in ways that offer better aggregate performance for the workloads. These schemes, however, focus primarily on relatively coarse-grained capacity management without concern for operating system process priority levels. In this work, we explore capacity management approaches that are both temporally and spatially more fine-grained than prior work. We also consider operating system priority levels as part of capacity management. We propose a capacity management mechanism based on timekeeping techniques that track the time interval since the last access to cached data. This Adaptive Timekeeping Replacement (ATR) scheme maintains aggregate cache occupancies that reflect the priority and footprint of each application. The key novelties of our work are (1) ATR offers a complete cache capacity management framework taking into account application priorities and memory characteristics, and (2) ATR's fine-grained cache capacity control is demonstrated to be effective and important in improving the performance of parallel workloads in addition to sequential ones. We evaluate our ideas using a full-system simulator and multiprogrammed workloads of both sequential and parallel applications. This is the first detailed study of shared cache capacity management considering thread behaviors in parallel applications. ATR outperforms an unmanaged system by as much as 1.63X and by an average of 1.19X. ATR's fine-grained temporal control is particularly important for parallel applications, which are expected to be increasingly prevalent in years to come.
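A minimal sketch of the timekeeping idea, with illustrative numbers: each line records its last-access time, each application gets a lifetime threshold reflecting its priority and footprint, and the victim is the line most overdue relative to its owner's lifetime. How the thresholds themselves adapt over time is omitted here.

```cpp
// Sketch of timekeeping-based victim selection: lines idle longer than
// their owning application's "lifetime" are preferred eviction victims,
// so high-priority applications (long lifetimes) retain more capacity.
#include <iostream>
#include <vector>

struct Line { int app_id; long last_access; };

int pick_victim(const std::vector<Line>& set,
                const std::vector<long>& lifetime, long now) {
    int victim = 0;
    long worst_overrun = -1;
    for (int i = 0; i < (int)set.size(); ++i) {
        // How far past its allotted lifetime has this line been idle?
        long overrun = (now - set[i].last_access) - lifetime[set[i].app_id];
        if (overrun > worst_overrun) { worst_overrun = overrun; victim = i; }
    }
    return victim;
}

int main() {
    // App 0 is high priority (long lifetime), app 1 low (short lifetime).
    std::vector<long> lifetime{10000, 2000};
    std::vector<Line> set{{0, 95000}, {1, 94000}, {0, 99000}, {1, 98500}};
    std::cout << "evict way " << pick_victim(set, lifetime, 100000) << "\n"; // 1
}
```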


Design Automation Conference | 2014

Quantitative Analysis of Control Flow Checking Mechanisms for Soft Errors

Aviral Shrivastava; Abhishek Rhisheekesan; Carole Jean Wu

Control Flow Checking (CFC) based techniques have gained a reputation of providing effective, yet low-overhead protection from soft errors. The basic idea is that if the control flow - or the sequence of instructions that are executed - is correct, then most probably the execution of the program is correct. Although researchers claim the effectiveness of the proposed CFC techniques, we argue that their evaluation has been inadequate and can even be wrong! Recently, the metric of vulnerability has been proposed to quantify the susceptibility of computation to soft errors. Armed with this comprehensive metric, we quantitatively evaluate the effectiveness of several existing CFC schemes, and obtain surprising results. Our results show that existing CFC techniques are not only ineffective in protecting computation from soft errors, but that they incur additional power and performance overheads. Software-only CFC protection schemes (CFCSS [14], CFCSS+NA [2], and CEDA [18]) increase system vulnerability by 18% to 21% with 17% to 38% performance overhead; the hybrid CFC protection technique CFEDC [4] also increases vulnerability by 5%; and while vulnerability remains almost the same for the hardware-only CFC protection technique CFCET [15], it incurs design cost, area, and power overheads due to the hardware modifications its implementation requires.
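For readers unfamiliar with the software-only schemes under evaluation, here is a minimal sketch of CFCSS-style signature checking: each basic block carries a compile-time signature, a runtime register G is XOR-updated on every block entry, and a mismatch with the block's expected signature flags an illegal control transfer. The signature values below are illustrative.

```cpp
// CFCSS-style control flow checking: for a block with signature s whose
// legal predecessor has signature s_pred, the compiler emits d = s_pred ^ s.
// At runtime, G ^= d should reproduce s; anything else is a CF violation.
#include <cstdint>
#include <iostream>

struct Block { uint32_t sig; uint32_t d; };   // d = predecessor_sig ^ sig

bool enter_block(uint32_t& G, const Block& b) {
    G ^= b.d;                  // runtime signature update
    return G == b.sig;         // arrived via a legal edge?
}

int main() {
    Block b2{0x9, 0x5 ^ 0x9};  // b2's legal predecessor has signature 0x5

    uint32_t G = 0x5;          // G holds the legal predecessor's signature
    std::cout << (enter_block(G, b2) ? "ok" : "CF error") << "\n";  // ok

    G = 0x7;                   // corrupted: control arrived from elsewhere
    std::cout << (enter_block(G, b2) ? "ok" : "CF error") << "\n";  // CF error
}
```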


IEEE International Symposium on Workload Characterization | 2014

Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms

Dhinakaran Pandiyan; Carole Jean Wu

In portable computing systems like smart phones, energy is a key but limited resource, and application cores have been shown to consume a significant part of it. To understand the characteristics of this energy consumption, in this paper, we focus our attention on the portion of energy that is spent moving data from the memory system to the application cores' internal registers. The primary motivation for this focus comes from the relatively higher energy cost of a data movement instruction compared to that of an arithmetic instruction. Another important factor is the distributed nature of computation among the different units in an SoC, which leads to higher data movement to and from the application cores. We perform a detailed investigation to quantify the impact of data movement on the overall energy consumption of a popular, commercially-available smart phone device. To aid this study, we design micro-benchmarks that generate desired data movement patterns between different levels of the memory hierarchy and measure the instantaneous power consumed by the device when running these micro-benchmarks. We make extensive use of hardware performance counters to validate the micro-benchmarks and to characterize the energy consumed in moving data. We take a step further and utilize this calculated energy cost of data movement to characterize the portion of energy that an application spends in moving data, for a wide range of popular smart phone workloads. We find that a considerable amount of total device energy is spent in data movement (an average of 35% of the total device energy). Our results also indicate a relatively high stalled-cycle energy consumption (an average of 23.5%) for current smart phones. To our knowledge, this is the first study that quantifies the amount of data movement energy for emerging smart phone applications running on a recent, commercial smart phone device. We hope this characterization study and the insights developed in the paper can inspire innovative designs in smart phone architectures with improved performance and energy efficiency.
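The micro-benchmark approach can be sketched as follows: a pointer chase over a working set sized to target a chosen level of the memory hierarchy makes every load a dependent access to that level; dividing (measured average power x runtime) by the access count then yields energy per access. The power measurement itself is external to this hypothetical code, and all sizes and counts are illustrative.

```cpp
// Pointer-chase micro-benchmark sketch: a random single-cycle permutation
// (Sattolo's algorithm) guarantees the chase visits every entry instead of
// getting stuck in a short cycle that would fit in a smaller cache.
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    constexpr size_t kEntries = 1 << 20;        // ~8 MB working set: past L2
    std::vector<size_t> next(kEntries);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937_64 rng{42};
    for (size_t i = kEntries - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);       // Sattolo: one big cycle
    }

    size_t idx = 0;
    const long kAccesses = 20'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < kAccesses; ++i)
        idx = next[idx];                        // dependent load: no overlap
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "ns/access: " << 1e9 * sec / kAccesses
              << " (sink " << idx << ")\n";     // print idx: defeat dead-code elim
}
```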


International Symposium on Performance Analysis of Systems and Software | 2014

Characterizing the latency hiding ability of GPUs

Shin Ying Lee; Carole Jean Wu

This paper demonstrates a latency profiling approach to characterize and evaluate the latency-hiding capability of modern GPU architectures. We find that the fast context-switching and massive multi-threading architecture can effectively hide much of the latency by swapping warps in and out. However, for certain GPGPU applications, such as bfs, performance is limited by other factors. In future work, we plan to use the latency profiling approach to further investigate the limits of GPUs and seek performance improvement opportunities.
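As context for why swapping warps hides latency, here is a standard Little's-law style approximation (not taken from the paper): to fully cover a stall of L cycles, the scheduler needs roughly L / I other ready warps, where I is the average number of cycles between a warp's issue opportunities.

```cpp
// Back-of-envelope latency hiding: warps needed to cover a stall of
// stall_cycles when each warp can issue once every cycles_between_issues.
#include <cmath>
#include <iostream>

int warps_needed(double stall_cycles, double cycles_between_issues) {
    return (int)std::ceil(stall_cycles / cycles_between_issues);
}

int main() {
    // E.g., a 400-cycle memory stall with a warp issuing every 4 cycles:
    std::cout << warps_needed(400, 4) << " warps\n";   // 100 warps
}
```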

Collaboration


Dive into Carole Jean Wu's collaborations.

Top Co-Authors

Shin Ying Lee, Arizona State University
Soochan Lee, Arizona State University
Hsing Min Chen, Arizona State University