Publication


Featured research published by Donald Newell.


measurement and modeling of computer systems | 2007

QoS policies and architecture for cache/memory in CMP platforms

Ravi R. Iyer; Li Zhao; Fei Guo; Ramesh Illikkal; Srihari Makineni; Donald Newell; Yan Solihin; Lisa R. Hsu; Steven K. Reinhardt

As we enter the era of CMP platforms with multiple threads/cores on the die, the diversity of the simultaneous workloads running on them is expected to increase. The rapid deployment of virtualization as a means to consolidate workloads onto a single platform is a prime example of this trend. In such scenarios, the quality of service (QoS) that each individual workload gets from the platform can vary widely depending on the behavior of the simultaneously running workloads. While the number of cores assigned to each workload can be controlled, there is no hardware or software support in today's platforms to control the allocation of platform resources such as cache space and memory bandwidth to individual workloads. In this paper, we propose a QoS-enabled memory architecture for CMP platforms that addresses this problem. The QoS-enabled memory architecture enables more cache resources (i.e., space) and memory resources (i.e., bandwidth) for high-priority applications based on guidance from the operating environment. The architecture also allows dynamic resource reassignment during run-time to further optimize the performance of the high-priority application with minimal degradation to low-priority applications. To achieve these goals, we describe the hardware/software support required in the platform as well as in the operating environment (OS and virtual machine monitor). Our evaluation framework consists of detailed platform simulation models and a QoS-enabled version of Linux. Based on evaluation experiments, we show the effectiveness of a QoS-enabled architecture and summarize key findings and trade-offs.
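
The cache-space part of this idea can be illustrated with a toy model. The sketch below is a hypothetical way-partitioning policy with made-up class names and quotas, not the paper's hardware design: a per-priority way quota bounds how much of a cache set a low-priority workload may occupy.

```python
# Hypothetical sketch of priority-guided cache way partitioning (illustrative only).

class WayPartitionedSet:
    def __init__(self, num_ways, quota):
        # quota: dict mapping priority class -> max ways that class may occupy
        self.num_ways = num_ways
        self.quota = quota
        self.lines = []  # list of (tag, prio); most recently used at the end

    def occupancy(self, prio):
        return sum(1 for _, p in self.lines if p == prio)

    def access(self, tag, prio):
        for i, (t, p) in enumerate(self.lines):
            if t == tag:
                self.lines.append(self.lines.pop(i))  # LRU update
                return True  # hit
        # miss: allocate, respecting the class quota
        if len(self.lines) < self.num_ways and self.occupancy(prio) < self.quota[prio]:
            self.lines.append((tag, prio))
        else:
            # evict this class's LRU line if the class is at quota, else the global LRU
            same_class = [i for i, (_, p) in enumerate(self.lines) if p == prio]
            victim = same_class[0] if self.occupancy(prio) >= self.quota[prio] else 0
            self.lines.pop(victim)
            self.lines.append((tag, prio))
        return False

# Example: a high-priority workload may use 6 of 8 ways, a low-priority one only 2.
s = WayPartitionedSet(8, {"high": 6, "low": 2})
for addr in range(10):
    s.access(addr, "low")
print(s.occupancy("low"))  # capped at 2
```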


IEEE Computer | 2004

TCP onloading for data center servers

G. Regnier; Srihari Makineni; R. Illikkal; Ravishankar Iyer; David B. Minturn; R. Huggahalli; Donald Newell; L. Cline; A. Foong

To meet the increasing networking needs of server workloads, servers are starting to offload packet processing to peripheral devices to achieve TCP/IP acceleration. Researchers at Intel Labs have experimented with alternative solutions that improve the server's ability to process TCP/IP packets efficiently and at very high rates.


high-performance computer architecture | 2010

CHOP: Adaptive filter-based DRAM caching for CMP server platforms

Xiaowei Jiang; Niti Madan; Li Zhao; Mike Upton; Ravishankar R. Iyer; Srihari Makineni; Donald Newell; Yan Solihin; Rajeev Balasubramonian

As manycore architectures enable a large number of cores on the die, a key challenge that emerges is the availability of memory bandwidth with conventional DRAM solutions. To address this challenge, integration of large DRAM caches that provide as much as 5× higher bandwidth and as low as 1/3rd the latency (compared to conventional DRAM) is very promising. However, organizing and implementing a large DRAM cache is challenging because of two primary tradeoffs: (a) DRAM caches at cache-line granularity require an on-chip tag area too large to be practical, and (b) DRAM caches at larger page granularity require too much bandwidth because the miss rate does not drop enough to offset the bandwidth increase. In this paper, we propose CHOP (Caching HOt Pages) in DRAM caches to address these challenges. We study several filter-based DRAM caching techniques: (a) a filter cache (CHOP-FC) that profiles pages and determines the hot subset of pages to allocate into the DRAM cache, (b) a memory-based filter cache (CHOP-MFC) that spills and fills filter state to improve the accuracy and reduce the size of the filter cache, and (c) an adaptive DRAM caching technique (CHOP-AFC) that determines when the filter cache should be enabled and disabled for DRAM caching. We conduct detailed simulations with server workloads to show that our filter-based DRAM caching techniques achieve the following: (a) on average over 30% performance improvement over previous solutions, (b) several orders of magnitude lower area overhead in tag space than cache-line-based DRAM caches, and (c) significantly lower memory bandwidth consumption than page-granular DRAM caches.
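
As a rough illustration of the filtering idea (not the paper's hardware), the sketch below profiles page accesses in a small table and promotes a page into the DRAM cache only once it crosses a hotness threshold; the threshold, table size, and eviction policy are assumptions.

```python
# Minimal sketch of a hot-page filter in the spirit of CHOP-FC (parameters assumed).

from collections import OrderedDict

class HotPageFilter:
    def __init__(self, capacity=1024, threshold=8):
        self.capacity = capacity
        self.threshold = threshold
        self.counters = OrderedDict()   # page -> access count (insertion-ordered)
        self.dram_cache = set()         # pages currently allocated in the DRAM cache

    def access(self, addr, page_size=4096):
        page = addr // page_size
        if page in self.dram_cache:
            return "dram-cache hit"
        count = self.counters.pop(page, 0) + 1
        if count >= self.threshold:
            self.dram_cache.add(page)   # promote hot page into the DRAM cache
            return "promoted to DRAM cache"
        self.counters[page] = count
        if len(self.counters) > self.capacity:
            self.counters.popitem(last=False)  # stop tracking the coldest page
        return "filtered (cold page)"

f = HotPageFilter(threshold=4)
for _ in range(5):
    result = f.access(0x1000)
print(result)  # "dram-cache hit": the page was promoted once it crossed the threshold
```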


virtual execution environments | 2008

Characterization & analysis of a server consolidation benchmark

Padma Apparao; Ravi R. Iyer; Xiaomin Zhang; Donald Newell; Tom J. Adelmeyer

Virtualization is already becoming ubiquitous in data centers for the consolidation of multiple workloads on a single platform. However, there are very few performance studies of server consolidation workloads in the literature. In this paper, our goal is to analyze the performance characteristics of a representative server consolidation workload. To address this goal, we have carried out extensive measurement and profiling experiments on a newly proposed consolidation workload (vConsolidate). vConsolidate consists of a compute-intensive workload, a web server, a mail server and a database application running simultaneously on a single platform. We start by studying the performance slowdown of each workload due to consolidation on a contemporary multi-core dual-processor Intel platform. We then look at architectural characteristics such as CPI (cycles per instruction) and L2 MPI (L2 misses per instruction), and analyze the benefits of larger caches for such a consolidated workload. We estimate the virtualization overheads for events such as context switches, interrupts and page faults and show how these impact the performance of the workload under consolidation. Finally, we also present the execution profile of the server consolidation workload and illustrate the life of each VM in the consolidated environment. We conclude by presenting an approach to developing a preliminary performance model based on this performance characterization.
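
For reference, the two metrics named here are simple ratios of hardware-counter values; the short sketch below computes them from made-up counts.

```python
# CPI (cycles per instruction) and L2 MPI (L2 misses per instruction) from
# hardware-counter readings. The counter values are illustrative.

def cpi(cycles, instructions):
    return cycles / instructions

def mpi(l2_misses, instructions):
    return l2_misses / instructions

cycles, instructions, l2_misses = 3_200_000_000, 1_600_000_000, 24_000_000
print(f"CPI    = {cpi(cycles, instructions):.2f}")     # 2.00
print(f"L2 MPI = {mpi(l2_misses, instructions):.4f}")  # 0.0150
```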


international conference on parallel architectures and compilation techniques | 2007

CacheScouts: Fine-Grain Monitoring of Shared Caches in CMP Platforms

Li Zhao; Ravi R. Iyer; Ramesh Illikkal; Jaideep Moses; Srihari Makineni; Donald Newell

As multi-core architectures flourish in the marketplace, multi-application workload scenarios (such as server consolidation) are growing rapidly. When running multiple applications simultaneously on a platform, it has been shown that contention for shared platform resources such as the last-level cache can severely degrade performance and quality of service (QoS). But today's platforms do not have the capability to monitor shared cache usage accurately and disambiguate its effects on the performance behavior of each individual application. In this paper, we investigate low-overhead mechanisms for fine-grain monitoring of the use of shared cache resources along three vectors: (a) occupancy - how much space is being used and by whom, (b) interference - how much contention is present and who is being affected, and (c) sharing - how threads are cooperating. We propose the CacheScouts monitoring architecture, consisting of novel tagging (software-guided monitoring IDs) and sampling mechanisms (set sampling), to achieve shared cache monitoring on a per-application basis at low overhead (<0.1%) and with very little loss of accuracy (<5%). We also present case studies to show how CacheScouts can be used by operating systems (OS) and virtual machine monitors (VMMs) for (a) characterizing execution profiles, (b) optimizing scheduling for performance management, (c) providing QoS and (d) metering for chargeback.
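
A rough sketch of those two mechanisms, with an illustrative cache model and sampling ratio rather than the actual CacheScouts hardware: lines carry a monitoring ID, only a subset of sets is inspected, and the sampled counts are scaled up to estimate per-application occupancy.

```python
# Illustrative set-sampled occupancy estimate over monitoring-ID-tagged cache lines.

import random
from collections import Counter

NUM_SETS, WAYS, SAMPLE_EVERY = 4096, 16, 32   # sample 1 of every 32 sets

# cache[set_index][way] = monitoring ID of the application that owns that line
cache = [[random.choice(["app_A", "app_B", "vmm"]) for _ in range(WAYS)]
         for _ in range(NUM_SETS)]

def estimate_occupancy(cache):
    sampled = Counter()
    for set_index in range(0, NUM_SETS, SAMPLE_EVERY):
        sampled.update(cache[set_index])
    # scale the sampled counts back up to the full cache
    return {mon_id: count * SAMPLE_EVERY for mon_id, count in sampled.items()}

print(estimate_occupancy(cache))  # estimated lines held by each monitoring ID
```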


ACM Sigarch Computer Architecture News | 2005

Exploring the cache design space for large scale CMPs

Lisa R. Hsu; Ravishankar R. Iyer; Srihari Makineni; Steven K. Reinhardt; Donald Newell

With the advent of dual-core chips in the marketplace, small-scale CMP (chip multiprocessor) architectures are becoming commonplace. We expect a continuing trend of increasing the number of cores on a die to maximize the performance/power efficiency of a single chip. We believe an era of large-scale CMPs (LCMPs) with several tens to hundreds of cores is on the way, but as of now architects have little understanding of how best to build a cache hierarchy given such a large number of cores/threads to support. With this in mind, our initial goals are to prune the cache design space for LCMPs by characterizing basic server workload behavior in such an environment. In this paper, we describe the range of methodologies that we are developing to overcome the challenges of exploring the cache design space for LCMP platforms. We then focus on employing a trace-driven approach to characterizing one key server workload (OLTP) in both a homogeneous and a heterogeneous workload environment. We study the effect of increasing threads (from 1 to 128) on a three-level cache hierarchy, with emphasis on the second- and third-level caches. We study the effect of varying sizes at these cache levels and show the effects of threads contending for cache space, the effects of prefetching instruction addresses, and the effects of inclusion. We make initial observations and conclusions about the factors on which LCMP cache hierarchy design decisions should be based and discuss future work.
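
The flavor of such a trace-driven sweep can be shown with a toy model: the sketch below replays one synthetic address stream against several last-level cache sizes and reports the miss rate for each. The trace and the fully associative LRU model are stand-ins for the paper's detailed simulation.

```python
# Toy trace-driven cache-size sweep (synthetic trace, fully associative LRU model).

from collections import OrderedDict

def miss_rate(trace, capacity_lines):
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)   # evict the LRU line
    return misses / len(trace)

trace = [(i * 7) % 50_000 for i in range(200_000)]  # synthetic cache-line stream
for mb in (1, 2, 4, 8):
    lines = mb * 1024 * 1024 // 64                  # 64 B cache lines
    print(f"{mb} MB LLC: miss rate {miss_rate(trace, lines):.3f}")
```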


international conference on computer design | 2007

Exploring DRAM cache architectures for CMP server platforms

Li Zhao; Ravi R. Iyer; Ramesh Illikkal; Donald Newell

As dual-core and quad-core processors arrive in the marketplace, the momentum behind CMP architectures continues to grow. As more and more cores/threads are placed on-die, the pressure on the memory subsystem is rapidly increasing. To address this issue, we explore DRAM cache architectures for CMP platforms. In this paper, we investigate the impact of introducing a low-latency, large-capacity and high-bandwidth DRAM-based cache between the last-level SRAM cache and the memory subsystem. We first show the potential benefits of large DRAM caches for key commercial server workloads. As the primary hurdle to achieving these benefits is the tag space overhead associated with large DRAM caches, we identify the most efficient DRAM cache organization and investigate various options. Our results show that the combination of 8-bit partial tags and 2-way sectoring achieves the highest performance improvement (20% to 70%) with the lowest tag space overhead (<25%).
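
The sketch below illustrates the two tag-reduction ideas named here, partial tags and sectoring, under assumed parameters (an 8-bit partial tag and 2-way sectoring). A partial-tag match can be a false positive, so the full tag stored with the data in DRAM must confirm the hit.

```python
# Illustrative partial-tag + sectored DRAM-cache lookup (layout and sizes assumed).

PARTIAL_TAG_BITS = 8
SECTORS_PER_PAGE = 2   # 2-way sectoring: a cached page is filled in two halves

def partial_tag(full_tag):
    return full_tag & ((1 << PARTIAL_TAG_BITS) - 1)

class DramCacheEntry:
    def __init__(self, full_tag):
        self.full_tag = full_tag                        # stored with the data in DRAM
        self.ptag = partial_tag(full_tag)               # small on-die tag
        self.sector_valid = [False] * SECTORS_PER_PAGE  # which halves are present

def lookup(entry, full_tag, sector):
    # On-die check uses only the 8-bit partial tag; a match may be a false positive.
    if entry is None or entry.ptag != partial_tag(full_tag):
        return "miss (filtered by partial tag, no DRAM access)"
    if entry.full_tag != full_tag:
        return "miss (partial-tag false positive, detected in DRAM)"
    return "hit" if entry.sector_valid[sector] else "sector miss (fetch half page)"

entry = DramCacheEntry(full_tag=0x1A2B)
entry.sector_valid[0] = True
print(lookup(entry, 0x1A2B, 0))   # hit
print(lookup(entry, 0x3C2B, 0))   # same low 8 bits -> false positive caught in DRAM
print(lookup(entry, 0x1A2C, 0))   # filtered on-die by the partial tag
```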


high-performance computer architecture | 2009

Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy

Niti Madan; Li Zhao; Naveen Muralimanohar; Aniruddha N. Udipi; Rajeev Balasubramonian; Ravishankar R. Iyer; Srihari Makineni; Donald Newell

Cache hierarchies in future many-core processors are expected to grow in size and contribute a large fraction of overall processor power and performance. In this paper, we postulate a 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data. We then propose a heterogeneous reconfigurable cache design that takes advantage of the high density of DRAM and the superior power/delay characteristics of SRAM to efficiently meet the working set demands of each individual core. Finally, we analyze the communication patterns for such a processor and show that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network. The above proposals are synergistic: each proposal is made more compelling because of its combination with the other innovations described in this paper. The proposed reconfigurable cache model improves performance by up to 19% along with 48% savings in network power.
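
The page-coloring piece can be sketched briefly: the low bits of the physical page number select the cache bank (the color), and the OS hands each core pages whose color matches the bank stacked above it. The bit layout and bank count below are assumptions for illustration.

```python
# Minimal sketch of OS-based page coloring for a stacked, banked cache (parameters assumed).

PAGE_SHIFT = 12        # 4 KB pages
NUM_BANKS = 16         # one stacked cache bank per core

def page_color(phys_addr):
    # the low bits of the physical page number select the bank (the "color")
    return (phys_addr >> PAGE_SHIFT) % NUM_BANKS

def allocate_page(core_id, free_pages):
    # pick a free physical page whose color matches the requesting core, so its
    # cache lines map to the bank above that core (minimal horizontal traffic)
    for page in free_pages:
        if page_color(page << PAGE_SHIFT) == core_id:
            free_pages.remove(page)
            return page
    return free_pages.pop(0)   # fall back to any page if no matching color is free

free_pages = list(range(1024))
page = allocate_page(core_id=5, free_pages=free_pages)
print(hex(page << PAGE_SHIFT), page_color(page << PAGE_SHIFT))  # color 5
```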


Journal of Parallel and Distributed Computing | 2011

CoQoS: Coordinating QoS-aware shared resources in NoC-based SoCs

Bin Li; Li Zhao; Ravi R. Iyer; Li-Shiuan Peh; Michael W. Leddige; Michael J. Espig; Seung Eun Lee; Donald Newell

Contention in performance-critical shared resources affects performance and quality-of-service (QoS) significantly. While this issue has been studied recently in CMP architectures, the same problem exists in SoC architectures where the challenge is even more severe due to the contention of shared resources between programmable cores and fixed-function IP blocks. In the SoC environment, efficient resource sharing and a guarantee of a certain level of QoS are highly desirable. Researchers have proposed different techniques to support QoS, but most existing works focus on only one individual resource. Coordinated management of multiple QoS-aware shared resources remains an open problem. In this paper, we propose a class-of-service based QoS architecture (CoQoS), which can jointly manage three performance-critical resources (cache, NoC, and memory) in a NoC-based SoC platform. We evaluate the interaction between the QoS-aware allocation of shared resources in a trace-driven platform simulator consisting of detailed NoC and cache/memory models. Our simulations show that the class-of-service based approach provides a low-cost flexible solution for SoCs. We show that assigning the same class-of-service to multiple resources is not as effective as tuning the class-of-service of each resource while observing the joint interactions. This demonstrates the importance of overall QoS support and the coordination of QoS-aware shared resources.
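
A minimal sketch of the class-of-service idea, with illustrative numbers and resource knobs rather than the CoQoS hardware tables: each agent (core or IP block) is mapped to a class, and each class maps to a share of every managed resource, tunable per resource.

```python
# Illustrative class-of-service table spanning cache, NoC, and memory (values assumed).

from dataclasses import dataclass

@dataclass
class ClassOfService:
    cache_ways: int        # share of the shared cache (allocation ways)
    noc_priority: int      # arbitration priority in the NoC (0 = highest)
    mem_bw_share: float    # fraction of memory bandwidth

# Per-resource tuning: the settings for each resource can be adjusted independently,
# reflecting the paper's point that joint tuning beats one uniform class assignment.
cos_table = {
    0: ClassOfService(cache_ways=12, noc_priority=0, mem_bw_share=0.50),  # latency-critical core
    1: ClassOfService(cache_ways=3,  noc_priority=1, mem_bw_share=0.30),  # best-effort cores
    2: ClassOfService(cache_ways=1,  noc_priority=2, mem_bw_share=0.20),  # streaming IP block
}

agent_to_cos = {"core0": 0, "core1": 1, "video_ip": 2}

def resource_shares(agent):
    return cos_table[agent_to_cos[agent]]

print(resource_shares("video_ip"))
```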


international conference on networks | 2004

An in-depth analysis of the impact of processor affinity on network performance

Annie P. Foong; Jason M. Fung; Donald Newell

Previous work has shown that, in general, performance can be improved by careful affinity of processes/threads to processors in an SMP system. We present a full experiment-based analysis of TCP performance under various affinity modes on SMP servers. Specifically, we made use of mechanisms and interfaces provided by the Redhat Linux-2.4.20 distribution. Best-case results (from ttcp bulk-data transfers) showed that interrupt affinity alone provided a throughput gain of up to 25%, and full process and interrupt affinity achieved gains of 29%. To understand the causes behind the gains, we have broken down the TCP stack into its fundamental logical blocks. This unique view allowed us to show exactly where and how affinity affects caching behavior and other architectural events. Where pertinent, we also point out places where affinity has no impact and explain why.
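
On current Linux systems the two affinity modes measured here can be reproduced with standard interfaces: process affinity via sched_setaffinity and interrupt affinity via /proc/irq/<N>/smp_affinity. The sketch below uses an illustrative IRQ number, and writing the IRQ mask requires root.

```python
# Pin a process and (optionally) a device interrupt to one CPU on Linux.

import os

def pin_process_to_cpu(cpu):
    # pin the calling process (pid 0 = self) to a single CPU
    os.sched_setaffinity(0, {cpu})

def pin_irq_to_cpu(irq, cpu):
    # route a device interrupt (e.g. the NIC's IRQ) to one CPU via a hex bitmask;
    # the IRQ number here is illustrative and writing this file requires root
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(f"{1 << cpu:x}")

if __name__ == "__main__":
    pin_process_to_cpu(1)            # run the benchmark process on CPU 1
    # pin_irq_to_cpu(57, 1)          # and steer the NIC interrupt to the same CPU
    print(os.sched_getaffinity(0))   # -> {1}
```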
