Al Davis
University of Utah
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Al Davis.
high-performance computer architecture | 1999
John B. Carter; Wilson C. Hsieh; Leigh Stoller; Mark R. Swanson; Lixin Zhang; Erik Brunvand; Al Davis; Chen-Chi Kuo; Ravindra Kuramkote; Michael A. Parker; Lambert Schaelicke; Terry Tateyama
Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper we describe the design of the Impulse architecture, and show how an Impulse memory system can be used to improve the performance of memory-bound programs. For the NAS conjugate gradient benchmark, Impulse improves performance by 67%. Because it requires no modification to processor, cache, or bus designs, Impulse can be adopted in conventional systems. In addition to scientific applications, we expect that Impulse will benefit regularly strided memory-bound applications of commercial importance, such as database and multimedia programs.
international symposium on computer architecture | 2010
Aniruddha N. Udipi; Naveen Muralimanohar; Niladrish Chatterjee; Rajeev Balasubramonian; Al Davis; Norman P. Jouppi
DRAM vendors have traditionally optimized the cost-per-bit metric, often making design decisions that incur energy penalties. A prime example is the overfetch feature in DRAM, where a single request activates thousands of bit-lines in many DRAM chips, only to return a single cache line to the CPU. The focus on cost-per-bit is questionable in modern-day servers where operating costs can easily exceed the purchase cost. Modern technology trends are also placing very different demands on the memory system: (i)queuing delays are a significant component of memory access time, (ii) there is a high energy premium for the level of reliability expected for business-critical computing, and (iii) the memory access stream emerging from multi-core systems exhibits limited locality. All of these trends necessitate an overhaul of DRAM architecture, even if it means a slight compromise in the cost-per-bit metric. This paper examines three primary innovations. The first is a modification to DRAM chip microarchitecture that re tains the traditional DDRx SDRAMinterface. Selective Bit-line Activation (SBA) waits for both RAS (row address) and CAS (column address) signals to arrive before activating exactly those bitlines that provide the requested cache line. SBA reduces energy consumption while incurring slight area and performance penalties. The second innovation, Single Subarray Access (SSA), fundamentally re-organizes the layout of DRAM arrays and the mapping of data to these arrays so that an entire cache line is fetched from a single subarray. It requires a different interface to the memory controller, reduces dynamic and background energy (by about 6X), incurs a slight area penalty (4%), and can even lead to performance improvements (54% on average) by reducing queuing delays. The third innovation further penalizes the cost-per-bit metric by adding a checksum feature to each cache line. This checksum error-detection feature can then be used to build stronger RAID-like fault tolerance, including chipkill-level reliability. Such a technique is especially crucial for the SSA architecture where the entire cache line is localized to a single chip. This DRAM chip microarchitectural change leads to a dramatic reduction in the energy and storage overheads for reliability. The proposed architectures will also apply to other emerging memory technologies (such as resistive memories) and will be less disruptive to standards, interfaces, and the design flow if they can be incorporated into first-generation designs.
ieee international conference on high performance computing data and analytics | 2009
Jung Ho Ahn; Nathan L. Binkert; Al Davis; Moray McLaren; Robert Schreiber
In the push to achieve exascale performance, systems will grow to over 100,000 sockets, as growing cores-per-socket and improved single-core performance provide only part of the speedup needed. These systems will need affordable interconnect structures that scale to this level. To meet the need, we consider an extension of the hypercube and flattened butterfly topologies, the HyperX, and give an adaptive routing algorithm, DAL. HyperX takes advantage of high-radix switch components that integrated photonics will make available. Our main contributions include a formal descriptive framework, enabling a search method that finds optimal HyperX configurations; DAL; and a low cost packaging strategy for an exascale HyperX. Simulations show that HyperX can provide performance as good as a folded Clos, with fewer switches. We also describe a HyperX packaging scheme that reduces system cost. Our analysis of efficiency, performance, and packaging demonstrates that the HyperX is a strong competitor for exascale networks.
international conference on parallel architectures and compilation techniques | 2010
Manu Awasthi; David W. Nellans; Kshitij Sudan; Rajeev Balasubramonian; Al Davis
Modern processors such as Tileras Tile64, Intels Nehalem, and AMDs Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, at memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMDs HyperTransport™, or Intels Quick-Path Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular piece of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. To date, no prior work has examined the effects of data placement among multiple MCs in such systems. Future chip-multiprocessors are likely to comprise multiple MCs and an even larger number of cores. This trend will increase the memory access latency variation in these systems. Proper allocation of workload data to the appropriate MC will be important in reducing the latency of memory service requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of the physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. These policies yield average performance improvements of 17% for adaptive first-touch page-placement, and 35% for a dynamic page-migration policy.
international symposium on performance analysis of systems and software | 2014
Seth H. Pugsley; Jeffrey Jestes; Huihui Zhang; Rajeev Balasubramonian; Vijayalakshmi Srinivasan; Alper Buyuktosunoglu; Al Davis; Feifei Li
While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Microns Hybrid Memory Cube device have made it more practical to move computation near memory. However, the literature is missing a detailed analysis of a killer application that can leverage a Near Data Computing (NDC) architecture. This paper focuses on in-memory MapReduce workloads that are commercially important and are especially suitable for NDC because of their embarrassing parallelism and largely localized memory accesses. The NDC architecture incorporates several simple processing cores on a separate, non-memory die in a 3D-stacked memory package; these cores can perform Map operations with efficient memory access and without hitting the bandwidth wall. This paper describes and evaluates a number of key elements necessary in realizing efficient NDC operation: (i) low-EPI cores, (ii) long daisy chains of memory devices, (iii) the dynamic activation of cores and SerDes links. Compared to a baseline that is heavily optimized for MapReduce execution, the NDC design yields up to 15X reduction in execution time and 18X reduction in system energy.
high performance interconnects | 2008
Raymond G. Beausoleil; Jung Ho Ahn; Nathan L. Binkert; Al Davis; David A. Fattal; Marco Fiorentino; Norman P. Jouppi; Moray McLaren; Charles Santori; Robert Schreiber; Sean M. Spillane; D. Vantrease; Qianfan Palo Alto Xu
Silicon nanophotonics holds the promise of revolutionizing computing by enabling parallel architectures that combine unprecedented performance and ease of use with affordable power consumption. Here we describe the results of a detailed multiyear design study of dense wavelength division multiplexing (DWDM) on-chip and off-chip interconnects and the device technologies that could improve computing performance by a factor of 20 above industry projections over the next decade.
compilers, architecture, and synthesis for embedded systems | 2003
Binu K. Mathew; Al Davis; Zhen Fang
Accurate real-time speech recognition is not currently possible in the mobile embedded space where the need for natural voice interfaces is clearly important. The continuous nature of speech recognition coupled with an inherently large working set creates significant cache interference with other processes. Hence real-time recognition is problematic even on high-performance general-purpose platforms. This paper provides a detailed analysis of CMUs latest speech recognizer (Sphinx 3.2), identifies three distinct processing phases, and quantifies the architectural requirements for each phase. Several optimizations are then described which expose parallelism and drastically reduce the bandwidth and power requirements for real-time recognition. A special-purpose accelerator for the dominant Gaussiann probability phase is developed for a 0.25μ CMOS process which is then analyzed and compared with Sphinxs measured energy and performance on a 0.13μ 2.4 GHz Pentium 4 system. The results show an improvement in power consumption by a factor of 29 at equivalent processing throughput. However after normalizing for process, the special-purpose approach has twice the throughput, and consumes 104 times less energy than the general-purpose processor. The energy-delay product is a better comparison metric due to the inherent design trade-offs between energy consumption and performance. The energy-delay product of the special-purpose approach is 196 times better than the Pentium 4. These results provide strong evidence that real-time large vocabulary speech recognition can be done within a power budget commensurate with embedded processing using todays technology.
hawaii international conference on system sciences | 1993
Al Davis; B. Coates; Kenneth S. Stevens
The Post Office is an asynchronous, 300000 transistor, full-custom CMOS chip designed as the communication component for the Mayfly scalable parallel processor. Performance requirements led to the development of a design style which permits the design of sequential circuits operating under a restricted form of multiple input change signaling called burst-mode. The Post Office complexity forced the authors to develop a set of design tools capable of correctly synthesizing transistor circuits from state machine and equation specifications, and capable of verifying the correctness of the resultant circuitry using implementation specific timing assumptions. A case study of this design experience is provided. >
international symposium on computer architecture | 2011
Aniruddha N. Udipi; Naveen Muralimanohar; Rajeev Balasubramonian; Al Davis; Norman P. Jouppi
It is well-known that memory latency, energy, capacity, band-width, and scalability will be critical bottlenecks in future large-scale systems. This paper addresses these problems, focusing on the interface between the compute cores and memory, comprising the physical interconnect and the memory access protocol. For the physical interconnect, we study the prudent use of emerging silicon-photonic technology to reduce energy consumption and improve capacity scaling. We conclude that photonics are effective primarily to improve socket-edge bandwidth by breaking the pin barrier, and for use on heavily utilized links. For the access protocol, we propose a novel packet based interface that relinquishes most of the tight control that the memory controller holds in current systems and allows the memory modules to be more autonomous, improving flexibility and interoperability. The key enabler here is the introduction of a 3D-stacked interface die that allows both these optimizations without modifying commodity memory dies. The interface die handles all conversion between optics and electronics, as well as all low-level memory device control functionality. Communication beyond the interface die is fully electrical, with TSVs between dies and low-swing wires on-die. We show that such an approach results in substantially lowered energy consumption, reduced latency, better scalability to large capacities, and better support for heterogeneity and interoperability.
Archive | 1981
Uri Weiser; Al Davis
This paper presents an overview of an extension to a mathematically based methodology for mapping an algorithmic description into a concurrent implementation on silicon. The result of this methodology yields a systolic array [4]. The basic mathematical method was initially described by Cohen [1]. Extensions were made by Weiser and Davis [5]; Johnsson, Weiser, Cohen, and Davis [2]; Cohen and Johnsson [3]; and Weiser [6]. This approach focuses on the correspondence between equations defining a certain computation and networks which perform the computation. As the complexity of problems increases, a hierarchical approach can reduce the complexity by hiding detail and thus reduce the design complexity at each level. The purpose of this paper is to introduce a method for treating sets of data as wavefront entities in the equations of the mathematical methodology and in the graphical representation.