Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ronald G. Dreslinski is active.

Publication


Featured research published by Ronald G. Dreslinski.


international symposium on microarchitecture | 2006

The M5 Simulator: Modeling Networked Systems

Nathan L. Binkert; Ronald G. Dreslinski; Lisa R. Hsu; Kevin T. Lim; Ali G. Saidi; Steven K. Reinhardt

The M5 simulator was developed specifically to enable research in TCP/IP networking. It provides the features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. M5's usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several academic and commercial groups.
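
As a rough illustration of the deterministic multi-system simulation the abstract highlights, the sketch below (plain Python, not M5's actual API; the Host and Simulator classes are hypothetical) drives two simulated hosts from a single global event queue, so every run replays identically.

```python
# Minimal sketch (not M5's API): deterministic simulation of two networked
# hosts driven by one global event queue. Because only simulated time is
# used, repeated runs are bit-for-bit identical.
import heapq

class Host:
    def __init__(self, name):
        self.name = name
        self.received = []

    def send(self, sim, dst, payload, link_latency):
        # Schedule delivery at a fixed simulated latency.
        sim.schedule(sim.now + link_latency, dst.receive, payload)

    def receive(self, payload):
        self.received.append(payload)

class Simulator:
    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = 0  # tie-breaker keeps event ordering deterministic

    def schedule(self, when, callback, *args):
        heapq.heappush(self._queue, (when, self._seq, callback, args))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, callback, args = heapq.heappop(self._queue)
            callback(*args)

if __name__ == "__main__":
    sim = Simulator()
    client, server = Host("client"), Host("server")
    sim.schedule(0, client.send, sim, server, "SYN", 10)
    sim.run()
    print(server.received)  # ['SYN'], delivered at simulated time 10
```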


Proceedings of the IEEE | 2010

Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits

Ronald G. Dreslinski; Michael Wieckowski; David T. Blaauw; Dennis Sylvester; Trevor N. Mudge

Power has become the primary design constraint for chip designers today. While Moore's law continues to provide additional transistors, power budgets have begun to prohibit those devices from actually being used. To reduce energy consumption, voltage scaling has proved a popular technique, with subthreshold design representing its endpoint. Although extremely energy efficient, subthreshold design has been relegated to niche markets due to its major performance penalties. This paper defines and explores near-threshold computing (NTC), a design space where the supply voltage is approximately equal to the threshold voltage of the transistors. This region retains much of the energy savings of subthreshold operation with more favorable performance and variability characteristics. This makes it applicable to a broad range of power-constrained computing segments, from sensors to high-performance servers. This paper explores the barriers to the widespread adoption of NTC and describes current work aimed at overcoming these obstacles.
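
The trade-off NTC exploits can be sketched with a first-order model: dynamic energy falls roughly with the square of the supply voltage, while delay rises steeply as the supply approaches the threshold. The constants below are illustrative assumptions, not figures from the paper.

```python
# Illustrative first-order model (assumed parameters): energy per operation
# scales ~ Vdd^2, delay grows sharply as Vdd approaches the threshold Vth.
VTH = 0.35    # assumed threshold voltage (V)
ALPHA = 1.3   # assumed velocity-saturation exponent

def relative_energy(vdd, vref=1.1):
    return (vdd / vref) ** 2

def relative_delay(vdd, vref=1.1):
    # Alpha-power-law style delay model: delay ~ Vdd / (Vdd - Vth)^alpha
    return (vdd / (vdd - VTH) ** ALPHA) / (vref / (vref - VTH) ** ALPHA)

for vdd in (1.1, 0.5, 0.4):  # nominal, near-threshold, closer to Vth
    print(f"Vdd={vdd:.2f}V  energy x{relative_energy(vdd):.2f}  "
          f"delay x{relative_delay(vdd):.1f}")
```

Under these assumed numbers, operating at 0.5 V cuts energy per operation by roughly 5x at the cost of a roughly 4x delay increase, which is the qualitative trade the abstract describes.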


architectural support for programming languages and operating systems | 2006

PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor

Taeho Kgil; Shaun D'Souza; Ali G. Saidi; Nathan L. Binkert; Ronald G. Dreslinski; Trevor N. Mudge; Steven K. Reinhardt; Krisztian Flautner

In this paper, we show how 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing. Our proposed architecture, PicoServer, employs 3D technology to bond one die containing several simple, slow processing cores to multiple DRAM dies sufficient for a primary memory. The 3D technology also enables wide, low-latency buses between processors and memory. These remove the need for an L2 cache, allowing its area to be re-allocated to additional simple cores. The additional cores allow the clock frequency to be lowered without impairing throughput. Lower clock frequency in turn reduces power and means that thermal constraints, a concern with 3D stacking, are easily satisfied. The PicoServer architecture specifically targets Tier 1 server applications, which exhibit a high degree of thread-level parallelism. An architecture targeted to efficient throughput is ideal for this application domain. We find that, for a similar logic die area, a 12-CPU system with 3D stacking and no L2 cache outperforms an 8-CPU system with a large on-chip L2 cache by about 14% while consuming 55% less power. In addition, we show that a PicoServer performs comparably to a Pentium 4-class machine while consuming only about 1/10 of the power, even when conservative assumptions are made about the power consumption of the PicoServer.
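
A back-of-the-envelope model of the core-count-versus-clock trade the abstract describes is sketched below; the core counts, frequencies, and voltages are assumptions for illustration, not the paper's simulated configurations.

```python
# Toy throughput/power model (assumed numbers): trading a large L2 for extra
# slow cores can keep throughput while cutting power, since dynamic power
# scales ~ V^2 * f and a lower clock also permits a lower supply voltage.
def config(cores, freq_ghz, vdd, ipc=1.0):
    throughput = cores * ipc * freq_ghz    # relative instructions per second
    power = cores * (vdd ** 2) * freq_ghz  # relative dynamic power
    return throughput, power

baseline = config(cores=8, freq_ghz=2.0, vdd=1.1)   # big-L2 style design
stacked  = config(cores=12, freq_ghz=1.4, vdd=0.9)  # more cores, lower clock

for name, (tput, pwr) in (("8-core ", baseline), ("12-core", stacked)):
    print(f"{name}: throughput={tput:.1f}  power={pwr:.1f}")
```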


international symposium on microarchitecture | 2012

Composite Cores: Pushing Heterogeneity Into a Core

Andrew Lukefahr; Shruti Padmanabha; Reetuparna Das; Faissal M. Sleiman; Ronald G. Dreslinski; Thomas F. Wenisch; Scott A. Mahlke

Heterogeneous multicore systems -- comprised of multiple cores with varying capabilities, performance, and energy characteristics -- have emerged as a promising approach to increasing energy efficiency. Such systems reduce energy consumption by identifying phase changes in an application and migrating execution to the most efficient core that meets its current performance requirements. However, due to the overhead of switching between cores, migration opportunities are limited to coarse-grained phases (hundreds of millions of instructions), reducing the potential to exploit energy-efficient cores. We propose Composite Cores, an architecture that reduces switching overheads by bringing the notion of heterogeneity within a single core. The proposed architecture pairs big and little compute µEngines that together can achieve high performance and energy efficiency. By sharing much of the architectural state between the µEngines, the switching overhead can be reduced to near zero, enabling fine-grained switching and increasing the opportunities to utilize the little µEngine without sacrificing performance. An intelligent controller switches between the µEngines to maximize energy efficiency while constraining performance loss to a configurable bound. We evaluate Composite Cores using cycle-accurate microarchitectural simulations and a detailed power model. Results show that, on average, the controller is able to map 25% of the execution to the little µEngine, achieving an 18% energy savings while limiting performance loss to 5%.
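
The controller's job can be illustrated with a small sketch: per quantum, run on the little µEngine only if the cumulative slowdown stays within the configured bound. The class and the per-quantum cycle estimates below are hypothetical, not the paper's actual predictor.

```python
# Minimal controller sketch (hypothetical names, not the paper's algorithm):
# each fine-grained quantum runs on the little uEngine only if doing so keeps
# cumulative slowdown within a configurable bound.
class CompositeCoreController:
    def __init__(self, max_slowdown=0.05):
        self.max_slowdown = max_slowdown
        self.big_cycles = 0.0     # cycles an all-big schedule would have used
        self.actual_cycles = 0.0  # cycles actually spent so far

    def choose_engine(self, predicted_big, predicted_little):
        """Pick 'little' when the projected total stays within the bound."""
        projected = self.actual_cycles + predicted_little
        budget = (self.big_cycles + predicted_big) * (1 + self.max_slowdown)
        return "little" if projected <= budget else "big"

    def commit(self, engine, actual_big, actual_little):
        self.big_cycles += actual_big
        self.actual_cycles += actual_little if engine == "little" else actual_big

# Example: quanta with a small big/little gap get mapped to the little engine
# while the running slowdown stays under 5%.
ctrl = CompositeCoreController(max_slowdown=0.05)
for big, little in [(100, 110), (100, 180), (100, 105), (100, 112)]:
    engine = ctrl.choose_engine(big, little)
    ctrl.commit(engine, big, little)
    print(engine, round(ctrl.actual_cycles / ctrl.big_cycles, 3))
```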


international symposium on computer architecture | 2013

Catnap: energy proportional multiple network-on-chip

Reetuparna Das; Satish Narayanasamy; Sudhir Satpathy; Ronald G. Dreslinski

Multiple networks have been used in several processor implementations to scale bandwidth and ensure protocol-level deadlock freedom for different message classes. In this paper, we observe that a multiple-network design is also attractive from a power perspective and can be leveraged to achieve energy proportionality through effective power gating. Unlike a single-network design, a multiple-network design is more amenable to power gating, as its subnetworks (subnets) can be power gated without compromising the connectivity of the network. To exploit this opportunity, we propose the Catnap architecture, which consists of synergistic subnet-selection and power-gating policies. Catnap maximizes the number of consecutive idle cycles in a router while avoiding performance loss due to overloading a subnet. We evaluate a 256-core processor with a concentrated mesh topology using synthetic traffic and 35 applications. We show that the average network power of a power-gating-optimized multiple-network design with four subnets can be 44% lower than that of a bandwidth-equivalent single-network design, for an average performance cost of about 5%.
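
The subnet-selection and power-gating policies can be sketched roughly as follows; the thresholds and class structure are assumptions for illustration, not the hardware policy evaluated in the paper.

```python
# Illustrative policy sketch (simplified): inject traffic into the
# lowest-numbered powered-on subnet until its load crosses a threshold, wake
# the next subnet on overload, and power-gate a subnet after it has been idle
# long enough to amortize the gating cost.
class CatnapLikeSelector:
    def __init__(self, num_subnets=4, load_threshold=0.7, idle_limit=100):
        self.num_subnets = num_subnets
        self.load_threshold = load_threshold
        self.idle_limit = idle_limit
        self.powered = [True] + [False] * (num_subnets - 1)
        self.idle_cycles = [0] * num_subnets

    def select(self, loads):
        """loads[i] in [0, 1]; return the subnet to use for a new packet."""
        for i in range(self.num_subnets):
            if self.powered[i] and loads[i] < self.load_threshold:
                return i
        # All powered subnets overloaded: wake the next gated one, if any.
        for i in range(self.num_subnets):
            if not self.powered[i]:
                self.powered[i] = True
                return i
        return 0  # everything on and busy; fall back to subnet 0

    def tick(self, loads):
        """Called once per cycle to power-gate long-idle subnets."""
        for i in range(1, self.num_subnets):  # keep subnet 0 always on
            if self.powered[i] and loads[i] == 0:
                self.idle_cycles[i] += 1
                if self.idle_cycles[i] >= self.idle_limit:
                    self.powered[i] = False
                    self.idle_cycles[i] = 0
            else:
                self.idle_cycles[i] = 0

sel = CatnapLikeSelector()
print(sel.select([0.8, 0.0, 0.0, 0.0]))  # subnet 0 overloaded -> wakes subnet 1
```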


international symposium on low power electronics and design | 2007

Energy efficient near-threshold chip multi-processing

Bo Zhai; Ronald G. Dreslinski; David T. Blaauw; Trevor N. Mudge; Dennis Sylvester

Subthreshold circuit design has become a popular approach for building energy-efficient digital circuits. One drawback is performance degradation due to the exponentially reduced driving current. This has limited subthreshold circuits to relatively low-performance applications such as sensor networks. To retain the excellent energy efficiency while reducing the performance loss, we propose applying subthreshold and near-threshold techniques to chip multiprocessors. We show that an architecture in which several slower cores are clustered together with a shared, faster L1 cache is optimal for energy efficiency, because processor cores and memory operate best at different supply and threshold voltages. In particular, SPLASH2 benchmarks show about a 53% energy improvement over the traditional CMP approach (about 70% over a single-core machine).
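
A toy model helps show why cores and the shared cache prefer different operating points: several near-threshold cores can approach one fast core's throughput at a fraction of the energy, provided the shared L1 runs at a higher voltage to keep up. All constants below are assumed, not taken from the paper.

```python
# Toy design-space sketch (assumed models and constants): compare one fast
# core against a cluster of slower near-threshold cores sharing a faster L1.
# Energy per op modeled ~ V^2, relative frequency modeled ~ (V - Vth).
VTH = 0.35

def freq(v):           # crude relative-frequency model
    return max(v - VTH, 0.0)

def energy_per_op(v):  # relative dynamic energy per operation
    return v ** 2

def cluster_metrics(n_cores, v_core, v_cache):
    throughput = n_cores * freq(v_core)  # cores limit throughput
    # Assume roughly one cache access per instruction; the shared L1 runs at
    # a higher voltage so it can serve the whole cluster.
    energy = energy_per_op(v_core) + 0.3 * energy_per_op(v_cache)
    return throughput, energy

single_tput, single_energy = cluster_metrics(1, 1.1, 1.1)
cluster_tput, cluster_energy = cluster_metrics(4, 0.55, 0.8)

print(f"single core   : throughput {single_tput:.2f}, energy/op {single_energy:.2f}")
print(f"4-core cluster: throughput {cluster_tput:.2f}, energy/op {cluster_energy:.2f}")
```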


international symposium on computer architecture | 2010

Evolution of thread-level parallelism in desktop applications

Geoffrey Blake; Ronald G. Dreslinski; Trevor N. Mudge; Krisztian Flautner

As the effective limits of frequency and instruction-level parallelism have been reached, the strategy of microprocessor vendors has changed to increasing the number of processing cores on a single chip each generation. The implicit expectation is that software developers will write their applications with concurrency in mind to take advantage of this sudden change in direction. In this study we analyze whether software developers for laptop/desktop machines have followed the recent hardware trends by creating software for chip multi-processing. We conduct a study of a wide range of applications on Microsoft Windows 7 and Apple's OS X Snow Leopard, measuring thread-level parallelism on a high-performance workstation and a low-power desktop. In addition, we explore graphics processing units (GPUs) and their impact on chip multi-processing. We compare our findings to a study done 10 years ago, which concluded that a second core was sufficient to improve system responsiveness. Our results on today's machines show that, 10 years later, 2-3 cores are, surprisingly, still more than adequate for most applications and that the GPU often remains under-utilized. However, in some application-specific domains an 8-core SMT system with a 240-core GPU can be effectively utilized. Overall, these studies suggest that many-core architectures are not a natural fit for current desktop/laptop applications.
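
The thread-level-parallelism metric typically used in such studies averages the number of runnable threads over non-idle time only, so background idleness does not dilute the result; a minimal sketch (with a hypothetical trace) is shown below.

```python
# Sketch of a thread-level-parallelism (TLP) metric: average number of
# runnable threads, computed only over the time the machine is not fully
# idle, so idle intervals do not dilute the number.
def tlp(time_with_n_threads):
    """time_with_n_threads[n] = total time exactly n threads were running."""
    busy = sum(t for n, t in time_with_n_threads.items() if n > 0)
    if busy == 0:
        return 0.0
    return sum(n * t for n, t in time_with_n_threads.items()) / busy

# Hypothetical trace: mostly 1-2 threads active, occasionally 4.
print(tlp({0: 40.0, 1: 30.0, 2: 20.0, 4: 10.0}))  # -> ~1.83
```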


international solid-state circuits conference | 2012

Centip3De: A 3930DMIPS/W configurable near-threshold 3D stacked system with 64 ARM Cortex-M3 cores

David Fick; Ronald G. Dreslinski; Bharan Giridhar; Gyouho Kim; Sangwon Seo; Matthew Fojtik; Sudhir Satpathy; Yoonmyung Lee; Daeyeon Kim; Nurrachman Liu; Michael Wieckowski; Gregory K. Chen; Trevor N. Mudge; Dennis Sylvester; David T. Blaauw

Recent high performance IC design has been dominated by power density constraints. 3D integration increases device density even further, and these devices will not be usable without viable strategies to reduce power consumption. This paper proposes the use of near-threshold computing (NTC) to address this issue in a stacked 3D system. In NTC, cores are operated near the threshold voltage (~200mV above Vth) to optimally balance power and performance [1]. In Centip3De, we operate cores at 650mV, as opposed to the wear-out limited supply voltage of 1.5V. This improves measured energy efficiency by 5.1×. The dramatically lower power consumption of NTC makes it an attractive match for 3D design, which has limited power dissipation capabilities, but also has improved innate power and performance compared to 2D design.
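
The reported efficiency gain is roughly consistent with quadratic energy scaling in supply voltage, as the quick check below shows.

```python
# Quick sanity check of the numbers in the abstract: if dynamic energy scales
# roughly with Vdd^2, dropping from 1.5 V to 0.65 V predicts about a 5.3x
# efficiency gain, in line with the measured 5.1x (leakage and other effects
# account for the gap).
v_nominal, v_ntc = 1.5, 0.65
print(f"predicted improvement: {(v_nominal / v_ntc) ** 2:.1f}x")  # ~5.3x
```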


architectural support for programming languages and operating systems | 2015

Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers

Johann Hauswald; Michael A. Laurenzano; Yunqi Zhang; Cheng Li; Austin Rovinski; Arjun Khurana; Ronald G. Dreslinski; Trevor N. Mudge; Vinicius Petrucci; Lingjia Tang; Jason Mars

As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures, spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of 7 benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 10x and 16x, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.
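
The TCO argument reduces to a simple trade: accelerated nodes cost more each, but far fewer are needed for the same query throughput. The sketch below uses assumed costs and speedups purely for illustration, not the paper's measured data.

```python
# Simplified TCO sketch (assumed costs and throughputs): for a fixed query
# throughput target, an accelerated server costs more per node but needs
# fewer nodes, which is how per-query speedups translate into savings.
def tco(nodes, capex_per_node, watts_per_node, years=3, usd_per_kwh=0.10):
    energy_cost = watts_per_node / 1000 * 24 * 365 * years * usd_per_kwh
    return nodes * (capex_per_node + energy_cost)

TARGET_QPS = 10_000
baseline_qps_per_node, accel_qps_per_node = 50, 500   # assumed 10x speedup

baseline_nodes = -(-TARGET_QPS // baseline_qps_per_node)  # ceiling division
accel_nodes = -(-TARGET_QPS // accel_qps_per_node)

baseline_tco = tco(baseline_nodes, capex_per_node=4000, watts_per_node=300)
accel_tco = tco(accel_nodes, capex_per_node=7000, watts_per_node=600)
print(f"CPU-only servers: {baseline_nodes} nodes, TCO ${baseline_tco:,.0f}")
print(f"GPU-accelerated : {accel_nodes} nodes, TCO ${accel_tco:,.0f}")
```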


international symposium on microarchitecture | 2009

Proactive transaction scheduling for contention management

Geoffrey Blake; Ronald G. Dreslinski; Trevor N. Mudge

Hardware transactional memory offers a promising, high-performance, easier-to-program alternative to lock-based synchronization for creating parallel programs. This is particularly important as hardware manufacturers continue to put more cores on die. But transactional memory still has one main drawback: contention. Contention is caused by multiple transactions trying to speculatively modify the same memory location concurrently, causing one or more transactions to abort and retry their execution. Contention serializes the execution, meaning high contention leads to very poor parallel performance. As more cores are added, contention worsens. To date, contention-manager designs have been primarily reactive in nature and limited to various forms of randomized backoff that effectively stall contending transactions when conflicts occur. While backoff-based managers have been popular due to their simplicity, at higher core counts our analysis on the STAMP benchmark suite shows that they perform poorly. In particular, small groups of transactions create hot spots of contention that lead to this poor performance. We show that these hot spots commonly consist of small sets of conflicts that occur in a predictable manner. To counter this challenge, we introduce a dynamic contention-management strategy that minimizes contention by using past history to identify when these hot spots will reoccur and proactively schedule affected transactions around them. The strategy predicts future contention and schedules to avoid it at runtime, without the need for programmer input. Our experiments show that our proactive scheduling technique outperforms a backoff-based policy on a 16-processor system by an average of 85%.
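
The idea can be sketched as a history-based scheduler: record which transaction pairs have conflicted before, and stall a transaction that is predicted to conflict with one already running rather than letting it abort. The class below is a hypothetical software illustration, not the paper's hardware design.

```python
# Minimal sketch of a history-based contention manager (hypothetical
# structure): remember which transaction pairs conflicted in the past and
# delay a transaction predicted to conflict with one already running.
from collections import defaultdict

class ProactiveScheduler:
    def __init__(self):
        self.conflict_history = defaultdict(set)  # tx id -> set of past rivals
        self.running = set()

    def record_conflict(self, tx_a, tx_b):
        """Called when two transactions abort due to a conflict."""
        self.conflict_history[tx_a].add(tx_b)
        self.conflict_history[tx_b].add(tx_a)

    def try_start(self, tx):
        """Start tx only if history predicts no conflict with running txs."""
        if self.conflict_history[tx] & self.running:
            return False  # predicted hot spot: stall instead of abort later
        self.running.add(tx)
        return True

    def finish(self, tx):
        self.running.discard(tx)

sched = ProactiveScheduler()
sched.record_conflict("tx_deposit", "tx_withdraw")  # learned from a past abort
sched.try_start("tx_deposit")           # starts normally
print(sched.try_start("tx_withdraw"))   # False: scheduled around the hot spot
```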

Collaboration


Dive into Ronald G. Dreslinski's collaboration.

Top Co-Authors

Dennis Sylvester

University of Michigan


David Fick

University of Michigan
