Publication


Featured research published by Don Newell.


Measurement and Modeling of Computer Systems | 2010

Modeling virtual machine performance: challenges and approaches

Omesh Tickoo; Ravi R. Iyer; Ramesh Illikkal; Don Newell

Data centers are increasingly employing virtualization and consolidation as a means to support a large number of disparate applications running simultaneously on server platforms. However, server platforms are still being designed and evaluated based on performance modeling of a single highly parallel application or a set of homogeneous workloads running simultaneously. Since most future datacenters are expected to employ server virtualization, this paper examines the challenges of modeling virtual machine (VM) performance on a datacenter server. Based on vConsolidate (a server virtualization benchmark) and the latest multi-core servers, we show that the VM modeling challenge requires addressing three key problems: (a) modeling the contention for visible resources (cores, memory capacity, I/O devices, etc.), (b) modeling the contention for invisible resources (shared microarchitectural resources such as shared cache and shared memory bandwidth) and (c) modeling the overheads of the virtual machine monitor (or hypervisor) implementation. We take a first step toward addressing this problem by describing a VM performance modeling approach and performing a detailed case study based on the vConsolidate benchmark. We conclude by outlining outstanding problems for future work.
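The three-part decomposition described in the abstract can be sketched as a simple multiplicative slowdown model. This is an illustrative toy, not the paper's model: the factor names and numeric values below are assumptions for the example.

```python
# Hypothetical sketch of a three-part VM slowdown model: each contention
# source contributes a slowdown factor >= 1.0, combined multiplicatively.
# Factor names and values are illustrative, not taken from the paper.

def vm_slowdown(visible, invisible, vmm_overhead):
    """Combine per-source slowdown factors (>= 1.0) into a total slowdown."""
    return visible * invisible * vmm_overhead

# Example: 10% core/memory contention, 15% shared-cache contention,
# 8% hypervisor overhead.
slowdown = vm_slowdown(1.10, 1.15, 1.08)

# Throughput a 1000 ops/s dedicated run would retain under consolidation.
consolidated_throughput = 1000 / slowdown
```

A real model would derive each factor from measurements (e.g. cache occupancy and VMM exit counts) rather than fixed constants, but the structure of combining the three sources is the point here.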


First International Workshop on Virtualization Technology in Distributed Computing (VTDC 2006) | 2006

Characterization of network processing overheads in Xen

Padma Apparao; Srihari Makineni; Don Newell

I/O virtualization techniques developed recently have led to significant changes in network processing. These techniques require network packets to go through additional layers of processing, which introduce significant overheads, so it is important to understand the performance implications of this additional processing on network (TCP/IP) processing. Our goals in this paper are to measure network I/O performance in a Xen virtualized environment and to provide a detailed architectural characterization of network processing, highlighting the major sources of overhead and their impact. We study two modes of I/O virtualization: 1) running the I/O service VM along with the guest on the same CPU, and 2) running the I/O service VM on a separate CPU. We measure TCP/IP processing performance in these two modes and compare it to that of native Linux. Our measurements show that both Rx and Tx performance suffer by more than 50% in the virtualized environment, and that path length increases by 3 to 4 times relative to native processing. Most of this overhead comes from the Xen VMM layer and Dom0 VM processing. Our data also shows that running the Dom0 VM on a separate CPU is more expensive than running both Dom0 and the guest VM on the same CPU. We provide a detailed characterization of this additional processing, which we hope will help the Xen community focus on the right areas for optimization.
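As a back-of-envelope illustration of the path-length result, the following sketch shows how a 3-4x instruction-count increase translates into the reported throughput drop. The absolute instruction count is an invented number; only the multiplier comes from the abstract.

```python
# Illustrative cost model for virtualized network processing, using the rough
# path-length ratio reported above (3-4x native). The absolute per-packet
# instruction count is a made-up assumption for the sketch.

NATIVE_INSTRUCTIONS_PER_PACKET = 10_000  # hypothetical native TCP/IP cost

def virtualized_cost(native_cost, multiplier=3.5):
    """Total and extra instructions per packet when the processing path
    length grows by `multiplier`x under virtualization."""
    total = native_cost * multiplier
    overhead = total - native_cost  # extra work added by the VMM and Dom0
    return total, overhead

total, overhead = virtualized_cost(NATIVE_INSTRUCTIONS_PER_PACKET)

# If the CPU is the bottleneck, throughput scales inversely with path length,
# consistent with the >50% Rx/Tx degradation the paper measures.
throughput_fraction = NATIVE_INSTRUCTIONS_PER_PACKET / total
```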


Computer Networks | 2009

VM3: Measuring, modeling and managing VM shared resources

Ravi R. Iyer; Ramesh Illikkal; Omesh Tickoo; Li Zhao; Padma Apparao; Don Newell

With cloud and utility computing models gaining significant momentum, data centers are increasingly employing virtualization and consolidation as a means to support a large number of disparate applications running simultaneously on a chip-multiprocessor (CMP) server. In such environments, contention for shared platform resources (CPU cores, shared cache space, shared memory bandwidth, etc.) can have a significant effect on each virtual machine's performance. In this paper, we investigate the shared resource contention problem for virtual machines by: (a) measuring the effects of shared platform resources on virtual machine performance, (b) proposing a model for estimating shared resource contention effects, and (c) proposing a transition from a virtual machine (VM) to a virtual platform architecture (VPA) that enables transparent shared resource management through architectural mechanisms for monitoring and enforcement. Our measurement and modeling experiments are based on a consolidation benchmark (vConsolidate) running on a state-of-the-art CMP server. Our virtual platform architecture experiments are based on detailed simulations of consolidation scenarios. Through detailed measurements and simulations, we show that shared resource contention affects virtual machine performance significantly and emphasize that virtual platform architectures are a must for future virtualized datacenters.


ACM SIGARCH Computer Architecture News | 2007

From chaos to QoS: case studies in CMP resource management

Fei Guo; Hari Kannan; Li Zhao; Ramesh Illikkal; Ravi R. Iyer; Don Newell; Yan Solihin; Christos Kozyrakis

As more and more cores are enabled on the die of future CMP platforms, we expect that several diverse workloads will run simultaneously on the platform. A key example of this trend is the growth of virtualization usage models. When multiple virtual machines, applications, or threads run simultaneously, the quality of service (QoS) that the platform provides to each individual thread is non-deterministic today. This occurs because the simultaneously running threads place very different demands on the shared resources (cache space, memory bandwidth, etc.) in the platform and in most cases contend with each other. In this paper, we first present case studies that show how this results in non-deterministic performance. Unlike the compute resources managed through scheduling, platform resource allocation to individual threads cannot be controlled today. In order to provide better determinism and QoS, we then examine resource management mechanisms and present QoS-aware architectures and execution environments. The main contribution of this paper is the architecture feasibility analysis through prototypes that allow experimentation with QoS-aware execution environments and architectural resources. We describe these QoS prototypes and then present preliminary case studies of multi-tasking and virtualization usage models sharing one critical CMP resource (the last-level cache). We then demonstrate how proper management of the cache resource can provide service differentiation and deterministic performance behavior when running disparate workloads in future CMP platforms.
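One common enforcement mechanism for last-level-cache QoS is way partitioning: each thread or VM may only evict within its own reserved ways, so a cache-hungry neighbor cannot destroy another's working set. The sketch below is a simplified single-set model of that idea; the partition sizes and the per-partition LRU policy are illustrative choices, not the paper's prototype.

```python
from collections import OrderedDict

class WayPartitionedSet:
    """One cache set whose ways are statically partitioned among threads:
    a simplified form of last-level-cache QoS enforcement. Partition sizes
    and the per-partition LRU policy here are illustrative assumptions."""

    def __init__(self, ways_per_thread):
        # ways_per_thread: {thread_id: number of ways reserved}
        self.quota = ways_per_thread
        # Per-thread OrderedDict acts as an LRU list: oldest tag first.
        self.lines = {t: OrderedDict() for t in ways_per_thread}

    def access(self, thread, tag):
        """Return True on hit. On a miss, evict only within the thread's
        own partition, never from a neighbor's ways."""
        part = self.lines[thread]
        if tag in part:                      # hit: refresh LRU position
            part.move_to_end(tag)
            return True
        if len(part) >= self.quota[thread]:  # miss: evict own LRU line
            part.popitem(last=False)
        part[tag] = True
        return False

# A streaming thread cannot evict its neighbor's lines:
s = WayPartitionedSet({"vm0": 2, "vm1": 6})
s.access("vm0", "a"); s.access("vm0", "b")
for i in range(100):                         # vm1 streams through many tags
    s.access("vm1", i)
assert s.access("vm0", "a")                  # vm0's working set is untouched
```

Without the partition (a single shared LRU list), vm1's 100-tag stream would have evicted vm0's lines, which is exactly the non-determinism the paper's case studies demonstrate.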


ACM SIGARCH Computer Architecture News | 2008

Towards hybrid last level caches for chip-multiprocessors

Li Zhao; Ravi R. Iyer; Mike Upton; Don Newell

As CMP platforms are widely adopted, more and more cores are integrated onto the die. To reduce off-chip memory accesses, the last-level cache is usually organized as a distributed shared cache. To avoid hot-spots, cache lines are interleaved across the distributed shared cache slices using a hash function. However, as the number of cores and cache slices in the platform increases, most data references go to remote cache slices, increasing access latency significantly. In this paper, we propose a hybrid last-level cache, in which each cache slice has some amount of private space and some amount of shared space. For workloads with no sharing, the goal is to provide more hits in the local slice while still keeping the overall miss rate low. For workloads with sufficient sharing, the goal is to allow more sharing in the last-level cache slice. We present hybrid last-level cache design options and study their hit/miss-rate behavior for a number of important server applications and multi-programmed workloads. Our simulation results on multi-programmed workloads based on SPEC CINT2000 as well as multithreaded workloads based on commercial server benchmarks (TPC-C, SPECjbb, SAP and TPC-E) show that this architecture is advantageous, especially since it can improve the local hit rate significantly while keeping the overall miss rate similar to that of the shared cache.
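The lookup order a hybrid slice implies can be sketched as: probe the private portion of the local slice first, then fall back to the hash-selected home slice of the shared portion. The slice count, hash, and structure names below are invented for illustration.

```python
# Sketch of a hybrid last-level-cache lookup: local private portion first,
# then the address-hashed shared slice. Sizes, the hash function, and all
# names are illustrative assumptions, not the paper's design.

NUM_SLICES = 4
LINE_SHIFT = 6  # 64-byte cache lines

def home_slice(addr):
    """Hash-interleave cache lines across shared slices to avoid hot-spots."""
    return (addr >> LINE_SHIFT) % NUM_SLICES

def lookup(core, addr, private, shared):
    """Return where the line was found: the requesting core's private
    portion, the shared home slice, or a miss routed to the home slice."""
    line = addr >> LINE_SHIFT
    if line in private[core]:               # fast local hit, no interconnect hop
        return ("local-private", core)
    slice_id = home_slice(addr)
    if line in shared[slice_id]:            # remote (or local) shared hit
        return ("shared", slice_id)
    return ("miss", slice_id)

private = [set() for _ in range(NUM_SLICES)]
shared = [set() for _ in range(NUM_SLICES)]
private[0].add(0x1000 >> LINE_SHIFT)        # core 0 keeps a hot line locally
assert lookup(0, 0x1000, private, shared) == ("local-private", 0)
```

The pay-off the abstract describes is visible in the control flow: private-working-set lines hit in the first probe without crossing the die, while shared lines still have a single, hash-determined home.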


ACM SIGARCH Computer Architecture News | 2008

Towards modeling & analysis of consolidated CMP servers

Padma Apparao; Ravi R. Iyer; Don Newell

As virtualization becomes ubiquitous in data centers, it becomes imperative that the definition of future multi-core platform architectures take into account the performance behavior and requirements of consolidated servers. However, performance analysis of commercial servers has traditionally been focused on individual parallel benchmarks running in dedicated mode. In this paper, we present an approach to developing a performance model for virtualized CMP servers potentially running heterogeneous workloads simultaneously. We show that a consolidation performance model can be developed by decomposing the problem into three constituent parts: (a) core interference due to consolidation, (b) cache interference due to consolidation and (c) virtualization overheads. Having laid out the consolidation framework, we then perform an initial case study with a new consolidation benchmark (vConsolidate). We present vConsolidate measurement characteristics on a Core 2 Duo-based server platform and then apply the performance model in order to predict the performance slowdown of each workload due to consolidation. We show that the model constructed is capable of achieving sufficient accuracy and discuss how to improve the accuracy in the future. Last but not least, we describe the extensions required to develop a complete generalized consolidation performance model.
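The cache-interference component of such a model is typically driven by a miss-rate-versus-capacity curve: as consolidation shrinks a workload's effective cache share, its miss rate and hence its runtime grow. The curve shape and every constant below are hypothetical, chosen only to show the structure of the estimate.

```python
# Toy instantiation of the cache-interference term of a consolidation model:
# estimate slowdown from the effective cache share a workload retains.
# The sqrt-law miss curve and all constants are hypothetical assumptions.

def miss_rate(cache_mb, base_miss=0.02, full_cache_mb=8.0):
    """Misses per instruction grow as the workload's cache share shrinks."""
    return base_miss * (full_cache_mb / cache_mb) ** 0.5

def cache_slowdown(share_mb, mpi_weight=50.0):
    """Slowdown factor relative to owning the full cache; mpi_weight
    converts extra misses per instruction into extra cycles."""
    extra_misses = miss_rate(share_mb) - miss_rate(8.0)
    return 1.0 + mpi_weight * extra_misses

# A workload consolidated down from 8 MB to 2 MB of effective cache:
slow = cache_slowdown(2.0)
```

A full consolidation model would combine this term with the core-interference and virtualization-overhead terms the abstract lists, and calibrate the curve from measured miss rates rather than assuming one.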


Measurement and Modeling of Computer Systems | 2009

Virtual platform architectures for resource metering in datacenters

Ravi R. Iyer; Ramesh Illikkal; Li Zhao; Don Newell; Jaideep Moses

With cloud and utility computing models gaining significant momentum, data centers are increasingly employing virtualization and consolidation as a means to support a large number of disparate applications running simultaneously on a CMP server. In such environments, it is important to meter the usage of resources by each datacenter application so that customers can be charged accordingly. In this paper, we describe a simple metering and chargeback model (pay-as-you-go) and describe a solution based on virtual platform architectures (VPA) to accurately meter visible as well as transparent resources.
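A minimal pay-as-you-go chargeback can be expressed as per-VM usage samples multiplied by per-unit rates. The resource names and rates below are invented for illustration; the point is that "transparent" resources like shared cache occupancy appear on the bill alongside visible ones like CPU time.

```python
# Minimal pay-as-you-go chargeback sketch in the spirit of VPA-based
# metering: per-VM usage of visible and transparent resources is rolled up
# into a bill. Resource names and rates are invented for illustration.

RATES = {"cpu_core_hours": 0.05, "cache_mb_hours": 0.01, "mem_bw_gb": 0.002}

def chargeback(usage):
    """usage: {vm: {resource: amount}} -> {vm: cost} using per-unit RATES."""
    return {vm: round(sum(RATES[r] * amt for r, amt in res.items()), 4)
            for vm, res in usage.items()}

bills = chargeback({
    "vm0": {"cpu_core_hours": 10, "cache_mb_hours": 512, "mem_bw_gb": 100},
    "vm1": {"cpu_core_hours": 10, "cache_mb_hours": 64,  "mem_bw_gb": 100},
})
# vm0 pays more than vm1 for identical CPU time because it occupied more
# shared cache -- the transparent resource the paper argues must be metered.
```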


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

HiPPAI: High Performance Portable Accelerator Interface for SoCs

Paul M. Stillwell; Vineet Chadha; Omesh Tickoo; Steven Zhang; Ramesh Illikkal; Ravishankar R. Iyer; Don Newell

Specialized hardware accelerators are enabling today's System-on-Chip (SoC) platforms to target various applications. In this paper we show that as these SoCs evolve in complexity and usage, the programming models for such platforms need to evolve beyond the traditional driver-oriented architecture. Using a test setup that employs a programmable FPGA-based accelerator to implement one of the critical computation functions of a mobile augmented reality workload, we describe the performance drawbacks that a conventional programming model brings to compute environments employing hardware accelerators. We show that these performance issues become more critical as interface latencies continue to improve over time with better hardware integration and efficient interconnect technologies. Under these usage scenarios, we show through measurements that the software overheads imposed by the current programming model, such as those associated with system calls, memory copies and memory address translations, account for a major part of the performance overhead. We then propose a novel High Performance Portable Accelerator Interface (HiPPAI) for SoC platforms using hardware accelerators to reduce the software overheads mentioned above. In addition, we position the new programming interface to allow for function portability between software and hardware function accelerators to reduce application development effort. Our proposed model relies on two major building blocks for performance improvement: a uniform virtual memory addressing model based on hardware IOMMU support, and direct user-mode access to accelerators. We demonstrate how these enhancements reduce the overheads of system calls and address translations at the user/kernel boundary in traditional software stacks and enable function portability.
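The overhead classes the abstract names (system calls, copies, translations) can be put into a back-of-envelope invocation-cost model to show why eliminating them matters as accelerators get faster. Every cycle count below is an illustrative assumption, not a measurement from the paper.

```python
# Back-of-envelope model of the overheads HiPPAI targets: per-invocation
# cost of a driver-style interface (syscall + copies + translation) versus
# direct user-mode access with IOMMU-backed shared virtual addressing.
# All cycle counts are illustrative assumptions.

def driver_model_cycles(payload_bytes, syscall=2000, copy_per_byte=1,
                        translation=500, accel_compute=10_000):
    """Driver path: syscall entry/exit, copy in and out, address translation."""
    return syscall + 2 * payload_bytes * copy_per_byte + translation + accel_compute

def hippai_model_cycles(payload_bytes, doorbell=200, accel_compute=10_000):
    """User-mode path: no syscall, no copies, no per-call translation --
    just a doorbell write plus the accelerator's own compute time."""
    return doorbell + accel_compute

# With a 4 KiB payload, software overhead dominates the driver path:
overhead_ratio = driver_model_cycles(4096) / hippai_model_cycles(4096)
```

Note that as `accel_compute` shrinks (faster accelerators, tighter integration), the ratio grows, which matches the abstract's observation that the software overheads become more critical as interface latencies improve.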


IEEE International Symposium on Workload Characterization | 2009

Performance characterization and optimization of mobile augmented reality on handheld platforms

Sadagopan Srinivasan; Zhen Fang; Ravi R. Iyer; Steven Zhang; Mike Espig; Don Newell; Daniel M. Cermak; Yi Wu; Igor Kozintsev; Horst W. Haussecker

The introduction of low power general purpose processors (like the Intel® Atom™ processor) expands the capability of handheld and mobile internet devices (MIDs) to include compelling visual computing applications. One rapidly emerging visual computing usage model is known as mobile augmented reality (MAR). In the MAR usage model, the user is able to point the handheld camera at an object (like a wine bottle) or a set of objects (like an outdoor scene of buildings or monuments) and the device automatically recognizes and displays information regarding the object(s). Achieving this on the handheld requires significant compute processing, resulting in a response time on the order of several seconds. In this paper, we analyze a MAR workload and identify the primary hotspot functions that incur a large fraction of the overall response time. We also present a detailed architectural characterization of the hotspot functions in terms of CPI, MPI, etc. We then implement and analyze the benefits of several software optimizations: (a) vectorization, (b) multi-threading, (c) cache conflict avoidance and (d) miscellaneous code optimizations that reduce the number of computations. We show that a 3X performance improvement in execution time can be achieved by implementing these optimizations. Overall, we believe our analysis provides a detailed understanding of the processing for a new domain of visual computing workloads (i.e. MAR) running on low power handheld compute platforms.
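Because the optimizations only accelerate the hotspot functions, the overall 3X gain is bounded by Amdahl's law. The hotspot fraction and per-hotspot speedup below are illustrative assumptions, chosen to show how a 3X whole-workload result can arise; they are not the paper's measured values.

```python
# Amdahl's-law view of hotspot optimization: only the optimized fraction of
# the response time benefits. The fraction and speedup are illustrative.

def amdahl_speedup(hot_fraction, hot_speedup):
    """Overall speedup when `hot_fraction` of runtime is sped up by
    `hot_speedup`x and the remainder is unchanged."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)

# If 80% of the MAR response time sits in vectorizable/threadable hotspots
# and those hotspots get 6x faster, the whole workload speeds up ~3x:
overall = amdahl_speedup(0.80, 6.0)
```

The formula also shows the diminishing-returns cliff: with the same 80% hotspot fraction, even an infinitely fast hotspot caps the overall speedup at 5x, which is why the paper attacks the hotspots with several complementary optimizations rather than one.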


Computing Frontiers | 2010

NCID: a non-inclusive cache, inclusive directory architecture for flexible and efficient cache hierarchies

Li Zhao; Ravi R. Iyer; Srihari Makineni; Don Newell; Liqun Cheng

Chip-multiprocessor (CMP) architectures employ multi-level cache hierarchies with private L2 caches per core and a shared L3 cache, as in Intel's Nehalem processor and AMD's Barcelona processor. When designing a multi-level cache hierarchy, one of the key design choices is the inclusion policy: inclusive, non-inclusive or exclusive. Each choice has its benefits and drawbacks. An inclusive cache hierarchy (like Nehalem's L3) has the benefit of allowing incoming snoops to be filtered at the L3 cache, but suffers from (a) reduced space efficiency due to replication between the L2 and L3 caches and (b) reduced flexibility, since it cannot bypass the L3 cache for transient or low-priority data. In an inclusive L2/L3 cache hierarchy, it also becomes difficult to flexibly reduce the L3 cache size (or increase the L2 cache size) for different product instantiations, because inclusion can start to affect performance (due to significant back-invalidates). In this paper, we present a novel approach to addressing the drawbacks of inclusive caches while retaining their positive feature of snoop filtering. We present NCID: a non-inclusive cache, inclusive directory architecture that allows data in the L3 to be non-inclusive or exclusive, but retains tag inclusion in the directory to support complete snoop filtering. We then describe and evaluate a range of NCID-based architecture options and policies. Our evaluation shows that NCID enables a flexible and efficient cache hierarchy for future CMP platforms and has the potential to improve performance significantly for several important server benchmarks.
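The split at the heart of NCID, an inclusive tag directory over a non-inclusive data array, can be sketched in a few lines. The structures and field names below are simplified inventions; only the separation of "who holds the tag" from "where the data lives" reflects the abstract.

```python
# Sketch of the NCID split: the L3 directory tracks tags inclusively for
# snoop filtering, while the data may live only in a private L2 (the data
# array is non-inclusive). Structures and names are simplified inventions.

class NCID:
    def __init__(self):
        self.dir_tags = {}   # tag -> set of cores holding the line (inclusive)
        self.l3_data = set() # tags whose data is actually cached in the L3

    def l2_fill(self, core, tag, bypass_l3=False):
        """On an L2 fill, the directory always learns the tag; the L3 data
        array may be bypassed for transient or low-priority lines."""
        self.dir_tags.setdefault(tag, set()).add(core)
        if not bypass_l3:
            self.l3_data.add(tag)

    def snoop(self, tag):
        """Incoming snoop: the inclusive directory filters it completely.
        An empty result means no core can hold the line."""
        return self.dir_tags.get(tag, set())

ncid = NCID()
ncid.l2_fill(core=0, tag=0xABC, bypass_l3=True)  # data only in core 0's L2
assert 0xABC not in ncid.l3_data                 # no L3 data replication
assert ncid.snoop(0xABC) == {0}                  # snoop still finds the owner
assert ncid.snoop(0xDEF) == set()                # filtered: snoop dropped
```

The example shows both advantages the abstract claims: the bypassed line consumes no L3 data space, yet the directory can still answer every snoop, so external coherence traffic never needs to probe the private L2s blindly.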
