
Publication


Featured research published by Daniel A. Prener.


IBM Journal of Research and Development | 2015

Active Memory Cube: A processing-in-memory architecture for exascale systems

Ravi Nair; Samuel F. Antao; Carlo Bertolli; Pradip Bose; José R. Brunheroto; Tong Chen; Chen-Yong Cher; Carlos H. Andrade Costa; J. Doi; Constantinos Evangelinos; Bruce M. Fleischer; Thomas W. Fox; Diego S. Gallo; Leopold Grinberg; John A. Gunnels; Arpith C. Jacob; P. Jacob; Hans M. Jacobson; Tejas Karkhanis; Choon Young Kim; Jaime H. Moreno; John Kevin Patrick O'Brien; Martin Ohmacht; Yoonho Park; Daniel A. Prener; Bryan S. Rosenburg; Kyung Dong Ryu; Olivier Sallenave; Mauricio J. Serrano; Patrick Siegl

Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing 10^18 floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.


Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability | 2012

Programming with relaxed synchronization

Lakshminarayanan Renganarayana; Vijayalakshmi Srinivasan; Ravi Nair; Daniel A. Prener

Synchronization overhead is a major bottleneck in scaling parallel applications to a large number of cores, and this remains true despite the many synchronization-reduction techniques that have been proposed. Previously studied synchronization-reduction techniques tacitly assume that all synchronizations specified in a source program are essential to guarantee the quality of the results the program produces. Recently there have been proposals to relax the synchronizations in a parallel program and compute approximate results. A fundamental challenge in using relaxed synchronization is guaranteeing that the relaxed program always produces results of a specified quality. We propose a methodology that addresses this challenge. Using our methodology, programmers can systematically relax synchronization while always producing results of the same quality as the original (unrelaxed) program. We demonstrate significant speedups with our methodology on a variety of benchmarks (e.g., up to 15x on the KMeans benchmark, and up to 3x on an already highly tuned kernel from the Graph500 benchmark).


2016 IEEE International Conference on Rebooting Computing (ICRC) | 2016

Approximate computing: Challenges and opportunities

Ankur Agrawal; Jungwook Choi; Kailash Gopalakrishnan; Suyog Gupta; Ravi Nair; Jinwook Oh; Daniel A. Prener; Sunil Shukla; Vijayalakshmi Srinivasan; Zehra Sura

Approximate computing is gaining traction as a computing paradigm for data analytics and cognitive applications that aim to extract deep insight from vast quantities of data. In this paper, we demonstrate that multiple approximation techniques can be applied to applications in these domains and can be combined to compound their benefits. In assessing the potential of approximation in these applications, we took the liberty of changing multiple layers of the system stack: architecture, programming model, and algorithms. Across a set of applications spanning the domains of DSP, robotics, and machine learning, we show that hot loops in the applications can be perforated by an average of 50%, with a proportional reduction in execution time, while still producing acceptable quality of results. In addition, the width of the data used in the computation can be reduced to 10-16 bits from the currently common 32/64 bits, with potential for significant performance and energy benefits. For parallel applications, we reduced execution time by 50% using relaxed synchronization mechanisms. Finally, our results demonstrate that the benefits compound when these techniques are applied concurrently. Together, these results show that approximate computing is a widely applicable paradigm with the potential for compounded benefits from applying multiple techniques across the system stack. To exploit these benefits, it is essential to rethink multiple layers of the system stack to embrace approximation from the ground up and to design tightly integrated approximate accelerators. Doing so will enable moving applications into a world in which the architecture, the programming model, and even the algorithms used to implement an application are all fundamentally designed for approximate computing.


Archive | 2002

Method and system for transparent dynamic optimization in a multiprocessing environment

Ravi Nair; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peter Howland Oden; Daniel A. Prener


Archive | 1989

Method and apparatus for providing multiple condition code fields to allow pipelined instructions contention-free access to separate condition codes

Daniel A. Prener


Architectural Support for Programming Languages and Operating Systems | 2008

Computing, Approximately

Ravi Nair; Daniel A. Prener

Computation today brings with it an expectation of preciseness: preciseness in the definition of the architecture, preciseness in the implementation of the architecture, and preciseness in the program designed to solve problems of interest. But is such preciseness important when the program itself encodes an approximate solution to a problem and is not sacrosanct? Is it important when an instruction executed by a program does not need all the restrictions indicated by its architectural definition, and hence does not use all the hardware associated with executing the instruction? Is it important when it is perfectly acceptable for an implementation to generate a result for a set of instructions that is merely close to the one a precise implementation would produce? It is clear that the preciseness of today's computational model comes at a cost: a cost in the complexity of programming a solution, a cost in verifying complex behavioral specifications, and a cost in the energy expended beyond the minimum needed to solve the problem.
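The abstracts above question whether full preciseness is always necessary and describe techniques such as loop perforation and reduced-precision arithmetic. As a purely illustrative sketch, not code from any of the papers listed here, the following applies both ideas to a simple dot product; the function names and parameters (`dot_approx`, `skip`, `bits`) are invented for this example:

```python
import random

def dot_exact(xs, ys):
    """Full-precision dot product over all elements."""
    return sum(x * y for x, y in zip(xs, ys))

def dot_approx(xs, ys, skip=2, bits=10):
    """Approximate dot product (illustrative only):
    - loop perforation: visit only every `skip`-th element, then rescale
    - reduced precision: quantize inputs to `bits` fractional bits."""
    scale = 1 << bits
    quantize = lambda v: round(v * scale) / scale
    total = 0.0
    for i in range(0, len(xs), skip):          # perforated loop
        total += quantize(xs[i]) * quantize(ys[i])
    return total * skip                        # compensate for skipped work

random.seed(0)
xs = [random.random() for _ in range(10_000)]
ys = [random.random() for _ in range(10_000)]

exact = dot_exact(xs, ys)
approx = dot_approx(xs, ys)
rel_err = abs(approx - exact) / exact
print(f"exact={exact:.1f} approx={approx:.1f} relative error={rel_err:.3%}")
```

Perforating half the iterations halves the arithmetic while, on this well-behaved input, keeping the relative error on the order of one percent; real applications need the systematic quality guarantees that the papers above discuss.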


Archive | 2002

Method and system for efficient emulation of multiprocessor memory consistency

Ravi Nair; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peter Howland Oden; Daniel A. Prener


Archive | 2008

Apparatus and method for partitioning programs between a general purpose core and one or more accelerators

John Kevin Patrick O'Brien; Kathryn M. O'Brien; Daniel A. Prener


Archive | 2009

Method and system for multiprocessor emulation on a multiprocessor host system

Erik R. Altman; Ravi Nair; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peter Howland Oden; Daniel A. Prener; Sumedh W. Sathaye


Archive | 2002

Method and system for efficient emulation of multiprocessor address translation on a multiprocessor host

Erik R. Altman; Ravi Nair; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peter Howland Oden; Daniel A. Prener; Sumeda Wasudeo Sathaye
