Publications


Featured research published by Andrew J. Herdrich.


International Conference on Supercomputing | 2009

Rate-based QoS techniques for cache/memory in CMP platforms

Andrew J. Herdrich; Ramesh Illikkal; Ravi R. Iyer; Donald Newell; Vineet Chadha; Jaideep Moses

As we embrace the era of chip multi-processors (CMP), we are faced with two major architectural challenges: (i) QoS or performance management of disparate applications running on CPU cores contending for shared cache/memory resources and (ii) global/local power management techniques to stay within the overall platform constraints. The problem is exacerbated as the number of cores sharing the resources in a chip increases. In the past, researchers have proposed independent solutions for these two problems. In this paper, we show that rate-based techniques employed for power management can be adapted to address cache/memory QoS issues. The basic approach is to throttle down the processing rate of a core if it is running a low-priority task whose execution interferes with the performance of a high-priority task due to platform resource contention (i.e., cache or memory contention). We evaluate two rate-throttling mechanisms (clock modulation and frequency scaling) for effectively managing the interference between applications running on a CMP platform and delivering QoS/performance management. We show that clock modulation is much more applicable to cache/memory QoS than frequency scaling and that resource monitoring combined with rate control provides effective power-performance management in CMP platforms.
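
As a rough illustration of the clock-modulation knob the paper evaluates, the sketch below throttles one core through the IA32_CLOCK_MODULATION MSR (0x19A) using the Linux msr driver. This is not the paper's QoS controller, only the underlying throttling mechanism; the duty-cycle encoding follows the Intel SDM, and root privileges plus a loaded msr module are assumed.

```c
/* Minimal sketch: on-demand clock modulation of one core via
 * IA32_CLOCK_MODULATION (MSR 0x19A).  Bit 4 enables modulation and
 * bits 3:1 select the duty step in 1/8 increments; parts with the
 * extended encoding use 1/16 steps (see the Intel SDM).
 * Requires root and `modprobe msr`. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19A

static int wrmsr_on_cpu(int cpu, uint32_t msr, uint64_t val)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int ok = pwrite(fd, &val, sizeof(val), msr) == (ssize_t)sizeof(val);
    close(fd);
    return ok ? 0 : -1;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <cpu> <duty 1..7, 0 = off>\n", argv[0]);
        return 1;
    }
    int cpu  = atoi(argv[1]);
    int duty = atoi(argv[2]);   /* effective clock ~= duty/8 of nominal */
    uint64_t val = duty ? (1u << 4) | ((uint64_t)duty << 1) : 0;
    if (wrmsr_on_cpu(cpu, IA32_CLOCK_MODULATION, val)) {
        perror("wrmsr");
        return 1;
    }
    return 0;
}
```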


High-Performance Computer Architecture | 2016

Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family

Andrew J. Herdrich; Edwin Verplanke; Priya Autee; Ramesh Illikkal; Chris Gianos; Ronak Singhal; Ravi R. Iyer

Over the last decade, addressing quality of service (QoS) in multi-core server platforms has been a growing research topic. QoS techniques have been proposed to address shared resource contention between co-running applications or virtual machines in servers and thereby provide better isolation and performance determinism, and potentially improve overall throughput. One of the most important shared resources is cache space. Most proposals for addressing shared cache contention are based on simulations and analysis, and no commercial platforms were available that integrated such techniques into a practical solution. In this paper, we present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family). We describe two key technologies: (i) Cache Monitoring Technology (CMT), which enables monitoring of shared cache usage by different applications, and (ii) Cache Allocation Technology (CAT), which enables redistribution of shared cache space between applications to address contention. This is the first paper to describe these techniques as they moved from concept to reality, from early research to product implementation. We also present case studies highlighting the value of these techniques using example scenarios of multi-programmed workloads, virtualized platforms in datacenters, and communications platforms. Finally, we describe the initial software infrastructure and enablement that allow industry practitioners and researchers to take advantage of these technologies for their QoS needs.
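
CMT and CAT later became accessible to software through the Linux resctrl filesystem, which postdates the enabling described in the paper. The sketch below is a minimal illustration of both features through that interface, assuming resctrl is mounted at /sys/fs/resctrl, and using a placeholder PID and a model-specific cache-way mask.

```c
/* Minimal sketch: exercising CAT and CMT through Linux resctrl.
 * Assumes:  mount -t resctrl resctrl /sys/fs/resctrl
 * The 4-way mask (0xf) and PID 12345 are placeholders; the legal
 * mask width is model-specific (see /sys/fs/resctrl/info). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    if (!f || fputs(text, f) == EOF) { perror(path); exit(1); }
    fclose(f);
}

int main(void)
{
    /* CAT: create a class of service limited to 4 ways of L3 on
     * cache domain 0. */
    mkdir("/sys/fs/resctrl/lowprio", 0755);
    write_file("/sys/fs/resctrl/lowprio/schemata", "L3:0=f\n");

    /* Assign a (hypothetical) low-priority task to that class. */
    write_file("/sys/fs/resctrl/lowprio/tasks", "12345\n");

    /* CMT: read the group's current L3 occupancy in bytes. */
    FILE *f = fopen("/sys/fs/resctrl/lowprio/mon_data/mon_L3_00/llc_occupancy", "r");
    if (f) {
        unsigned long long bytes;
        if (fscanf(f, "%llu", &bytes) == 1)
            printf("L3 occupancy: %llu bytes\n", bytes);
        fclose(f);
    }
    return 0;
}
```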


International Conference on Parallel Architectures and Compilation Techniques | 2016

CAF: Core to Core Communication Acceleration Framework

Yipeng Wang; Ren Wang; Andrew J. Herdrich; James Tsai; Yan Solihin

As the number of cores in a multicore system increases, core-to-core (C2C) communication increasingly limits the performance scaling of workloads that share data frequently. The traditional way cores communicate is through shared memory. However, shared memory communication fundamentally involves coherence invalidations and cache misses, which cause large performance overheads and generate a high volume of network traffic. Many important workloads involve substantial C2C communication and are significantly affected by these costs, including pipelined packet processing, which is widely used in software-based networking solutions. In these workloads, threads run on different cores and pass packets from one core to another for different stages of processing using software queues. In this paper, we analyze the behavior and overheads of software queue management. Based on this analysis, we propose a novel C2C Communication Acceleration Framework (CAF) to optimize C2C communication. CAF offloads substantial communication burdens from cores and memory to a designated, efficient hardware device, the Queue Management Device (QMD), attached to the network-on-chip. CAF combines hardware and software optimizations to effectively reduce queue-induced communication overheads and improve overall system performance by 2-12× over traditional software queue implementations.
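
For context, the software-queue baseline the paper analyzes looks roughly like the single-producer/single-consumer ring below: the head and tail indices ping-pong between the producer's and consumer's caches, which is exactly the coherence traffic a hardware QMD would absorb. A minimal C11 sketch; names and sizes are illustrative, not from the paper.

```c
/* Minimal SPSC ring sketch.  The head/tail cache lines bounce between
 * cores on every enqueue/dequeue pair, the queue-induced overhead CAF
 * is designed to remove. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024                      /* power of two */

struct spsc_ring {
    _Alignas(64) _Atomic size_t head;       /* written by consumer */
    _Alignas(64) _Atomic size_t tail;       /* written by producer */
    _Alignas(64) void *slot[RING_SIZE];
};

static bool ring_enqueue(struct spsc_ring *r, void *pkt)
{
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == RING_SIZE)
        return false;                       /* full */
    r->slot[t & (RING_SIZE - 1)] = pkt;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

static void *ring_dequeue(struct spsc_ring *r)
{
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t)
        return NULL;                        /* empty */
    void *pkt = r->slot[h & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return pkt;
}

int main(void)
{
    static struct spsc_ring ring;           /* zero-initialized */
    int payload = 42;
    ring_enqueue(&ring, &payload);
    int *out = ring_dequeue(&ring);
    return out == &payload ? 0 : 1;
}
```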


International Conference on Computer Design | 2014

QoS management on heterogeneous architecture for parallel applications

Ying Zhang; Li Zhao; Ramesh Illikkal; Ravi R. Iyer; Andrew J. Herdrich; Lu Peng

Quality of service (QoS) management is widely employed to provide differentiated performance to programs with distinct priorities on conventional chip multi-processor (CMP) platforms. Recently, heterogeneous architectures integrating diverse processor cores on the same silicon have been proposed to better serve various application domains, and they are expected to be an important design paradigm for future processors. QoS management on emerging heterogeneous systems will therefore be of great significance. At the same time, parallel applications are becoming increasingly important in modern computing as a way to exploit thread-level parallelism on CMPs. However, given the diverse characteristics of thread synchronization, data sharing, and parallelization patterns, governing the execution of multiple parallel programs with different performance requirements becomes a complicated yet significant problem. In this paper, we study QoS management for parallel applications running on heterogeneous CMP systems. We comprehensively assess a series of task-to-core mapping policies on real heterogeneous hardware (QuickIA) by characterizing their impacts on the performance of individual applications. Our evaluation results show that the proposed QoS policies are effective in improving the performance of the program with the highest priority while striking a good tradeoff with system fairness.
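
On Linux, the simplest form of such a task-to-core mapping policy can be expressed with sched_setaffinity, as in the sketch below. Core IDs and the background PID are placeholders, and the paper's policies on QuickIA are more elaborate than this; treat it only as a starting point.

```c
/* Minimal sketch: map a high-priority task to the big cores of a
 * heterogeneous CMP.  Core IDs 0-3 are hypothetical; on real hardware
 * they must be discovered from the topology first. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin(pid_t pid, const int *cpus, int n)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n; i++)
        CPU_SET(cpus[i], &set);
    if (sched_setaffinity(pid, sizeof(set), &set)) {
        perror("sched_setaffinity");
        exit(1);
    }
}

int main(void)
{
    int big[]   = { 0, 1 };     /* hypothetical big-core IDs   */
    int small[] = { 2, 3 };     /* hypothetical small-core IDs */

    pin(0, big, 2);             /* pid 0 = this process: treat as high priority */
    /* A real manager would also pin each background pid to the small
     * cores, e.g. pin(background_pid, small, 2); */
    (void)small;
    return 0;
}
```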


IEEE Systems Journal | 2017

QoS Management on Heterogeneous Architecture for Multiprogrammed, Parallel, and Domain-Specific Applications

Ying Zhang; Li Zhao; Ramesh Illikkal; Ravi R. Iyer; Andrew J. Herdrich; Lu Peng

Quality-of-service (QoS) management is widely employed to provide differentiated performance to programs with distinct priorities on conventional chip-multiprocessor (CMP) platforms. Recently, heterogeneous architectures integrating diverse processor cores on the same silicon have been proposed to better serve various application domains, and they are expected to be an important design paradigm for future processors. QoS management on emerging heterogeneous systems will therefore be of great significance. Workloads on heterogeneous architectures can be multiprogrammed, parallel, and/or domain-specific depending on the form factor and device of interest, and the diverse characteristics of these three classes of workloads must be considered when managing QoS on heterogeneous architectures. For example, for parallel applications, given the diverse thread synchronization, data sharing, and parallelization patterns of representative parallel applications, governing the execution of multiple parallel programs with different performance requirements becomes a complicated yet significant problem. In this paper, we study QoS management for multiprogrammed, parallel, and domain-specific applications running on heterogeneous CMP systems. We comprehensively assess a series of task-to-core mapping policies on real heterogeneous hardware (QuickIA) by characterizing their impacts on the performance of individual applications. Our evaluation results show that the proposed QoS policies are effective in improving the performance of programs with the highest priority while striking a good tradeoff with system fairness.
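
Before any mapping policy can run, a QoS manager has to know which cores are big and which are small. One crude heuristic, sketched below, is to classify cores by their maximum cpufreq; the 3 GHz threshold and the assumption that the cpufreq sysfs interface is present are both illustrative.

```c
/* Minimal sketch: classify cores as "big" or "small" by maximum
 * cpufreq.  The cutoff is arbitrary and the scan stops at the first
 * missing CPU directory; a robust manager would parse the topology
 * properly. */
#include <stdio.h>

int main(void)
{
    char path[128];
    for (int cpu = 0; cpu < 64; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                          /* no such CPU: stop scanning */
        unsigned long khz;
        if (fscanf(f, "%lu", &khz) == 1)
            printf("cpu%d: %lu kHz (%s)\n", cpu, khz,
                   khz >= 3000000 ? "big" : "small");  /* 3 GHz: arbitrary */
        fclose(f);
    }
    return 0;
}
```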


International Conference on Parallel Processing | 2012

Exploiting semantics of virtual memory to improve the efficiency of the on-chip memory system

Bin Li; Zhen Fang; Li Zhao; Xiaowei Jiang; Lin Li; Andrew J. Herdrich; Ravishankar R. Iyer; Srihari Makineni

Different virtual memory regions (e.g., stack and heap) have different properties and characteristics. For example, stack data are thread-private by definition, while heap data can be shared between threads. Compared with heap memory, stack memory tends to concentrate a large number of accesses on a rather small number of pages. These facts have been largely ignored by designers. In this paper, we propose two novel designs that exploit stack memory's unique characteristics to optimize the on-chip memory system. The first design is Anticipatory Superpaging: automatically creating a superpage for stack memory at the first page fault in a potential superpage, increasing TLB reach and reducing TLB misses. It is transparent to applications and does not require the kernel to employ online analysis algorithms or page copying. The second design is Stack-Aware Cache Placement: stack accesses are routed to their local slices in a distributed shared cache, while non-stack accesses are still routed using cacheline interleaving. The primary benefit of this mechanism is reduced power consumption in the on-chip interconnect. Our simulations show that the first innovation reduces TLB misses by 10-20%, and the second reduces interconnect power consumption by over 14%.
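
The mechanism in the paper creates superpages for stacks automatically; the closest software analogue on stock Linux is requesting transparent huge pages for a hot region, as sketched below. This only illustrates the TLB-reach effect, not Anticipatory Superpaging itself.

```c
/* Minimal sketch: back a hot anonymous region with transparent huge
 * pages via madvise(MADV_HUGEPAGE) to extend TLB reach.  With 2 MiB
 * pages on x86-64, a 4 MiB region needs 2 TLB entries instead of the
 * 1024 needed with 4 KiB pages. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4UL << 20;                   /* 4 MiB region */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (madvise(p, len, MADV_HUGEPAGE))       /* advisory; not fatal */
        perror("madvise");

    memset(p, 0, len);                        /* touch to populate */
    munmap(p, len);
    return 0;
}
```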


Archive | 2011

Method, apparatus, and system for energy efficiency and energy conservation including dynamic C0-state cache resizing

Jaideep Moses; Rameshkumar G. Illikkal; Ravishankar Iyer; Jared E. Bendt; Sadagopan Srinivasan; Andrew J. Herdrich; Ashish V. Choubal; Avinash N. Ananthakrishnan; Vijay S.R. Degalahal


Archive | 2014

Methods and apparatuses for controlling thread contention

Andrew J. Herdrich; Ramesh Illikkal; Donald Newell; Ravishankar Iyer; Vineet Chadha


Archive | 2012

Thread migration support for architecturally different cores

Mishali Naik; Ganapati Srinivasa; Alon Naveh; Inder M. Sodhi; Paolo Narvaez; Eugene Gorbatov; Eliezer Weissmann; Andrew D. Henroid; Andrew J. Herdrich; Gaurav Khanna; Scott Hahn; Paul Brett; David A. Koufaty; Dheeraj R. Subbareddy; Abirami Prabhakaran


Archive | 2016

Power efficient processor architecture

Andrew J. Herdrich; Rameshkumar G. Illikkal; Ravishankar Iyer; Sadagopan Srinivasan; Jaideep Moses; Srihari Makineni
