Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Daniel Wong is active.

Publication


Featured research published by Daniel Wong.


international symposium on microarchitecture | 2012

KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity

Daniel Wong; Murali Annavaram

Server energy proportionality has been improving over the past several years. Many components in a system, such as the CPU, memory, and disk, have been achieving good energy proportionality behavior. Using a wide range of server power data from published SPECpower results, we show that overall system energy proportionality has reached 80%. We present two novel metrics, linear deviation and proportionality gap, that provide insights into accurately quantifying energy proportionality. Using these metrics, we show that energy proportionality improvements are not uniform across server utilization levels. In particular, the energy proportionality of even a highly proportional server suffers significantly at non-zero but low utilizations. We propose to tackle the lack of energy proportionality at low utilization using server-level heterogeneity. We present KnightShift, a server-level heterogeneous server architecture that introduces an active low-power mode through the addition of a tightly coupled compute node called the Knight, enabling two energy-efficient operating regions. We evaluated KnightShift against a variety of real-world data center workloads using a combination of prototyping and simulation, showing up to 75% energy savings with tail latency bounded by the latency of the Knight and up to 14% improvement in performance per TCO dollar spent.
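The proportionality-gap idea can be illustrated with a short sketch. Here we assume the gap at a given utilization is the measured power minus an ideal linearly proportional power curve (zero watts at idle, peak power at full load), normalized by peak power; the exact definition and all power numbers below are illustrative assumptions, not the paper's data.

```python
# Sketch of the proportionality-gap metric. The definition and the power
# numbers here are assumptions for illustration, not the paper's data.

def ideal_power(util, p_peak):
    # An ideally energy-proportional server: power scales linearly from
    # 0 W at 0% utilization to peak power at 100% utilization.
    return util * p_peak

def proportionality_gap(util, measured_watts, p_peak):
    # How far measured power sits above the ideal line, normalized by peak.
    return (measured_watts - ideal_power(util, p_peak)) / p_peak

# Hypothetical SPECpower-style measurements: (utilization, average watts).
curve = [(0.0, 60.0), (0.1, 95.0), (0.5, 160.0), (1.0, 200.0)]
p_peak = 200.0

for util, watts in curve:
    print(f"util={util:4.0%}  gap={proportionality_gap(util, watts, p_peak):.3f}")
```

On this made-up curve the gap is largest at low but non-zero utilization (0.375 at 10% load), which is exactly the regime the abstract identifies as most problematic.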


international symposium on microarchitecture | 2013

Warped gates: gating aware scheduling and power gating for GPGPUs

Mohammad Abdel-Majeed; Daniel Wong; Murali Annavaram

With the widespread adoption of GPGPUs in varied application domains, new opportunities open up to improve GPGPU energy efficiency. Due to inherent application-level inefficiencies, GPGPU execution units experience significant idle time. In this work we propose to power gate idle execution units to eliminate leakage power, which is becoming a significant concern with technology scaling. We show that GPGPU execution units are idle for short windows of time, and conventional microprocessor power gating techniques cannot fully exploit these idle windows due to power gating overhead. Current warp schedulers greedily intersperse integer and floating point instructions, which limits power gating opportunities for any given execution unit type. To improve power gating opportunities in GPGPU execution units, we propose a Gating Aware Two-level warp scheduler (GATES) that issues clusters of instructions of the same type before switching to another instruction type. We also propose a new power gating scheme, called Blackout, that forces a power gated execution unit to sleep for at least the break-even time necessary to overcome the power gating overhead before returning to the active state. The combination of GATES and Blackout, which we call Warped Gates, can save 31.6% and 46.5% of integer and floating point unit static energy, respectively. The proposed solutions incur less than 1% performance and area overhead.
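A toy model of the Blackout policy: gate a unit after a short idle-detect window, then keep it asleep for at least the break-even time, even if new work arrives before then. All cycle counts are invented for illustration; the paper's actual thresholds differ.

```python
# Toy model of a break-even-time power gating ("Blackout"-style) policy.
# All cycle counts here are illustrative assumptions.

BREAK_EVEN_CYCLES = 14   # minimum sleep needed to amortize gating overhead
IDLE_DETECT_CYCLES = 5   # idle cycles observed before deciding to gate

def gated_cycles(idle_windows):
    """Cycles spent power gated, given a list of idle-window lengths.

    The unit gates after IDLE_DETECT_CYCLES of idleness and, once gated,
    stays asleep for at least BREAK_EVEN_CYCLES, delaying any early
    wake-up request (the source of Blackout's small performance cost).
    """
    total = 0
    for window in idle_windows:
        if window <= IDLE_DETECT_CYCLES:
            continue  # window too short: gating never triggered
        sleep = window - IDLE_DETECT_CYCLES
        # Forced sleep: stretch short sleeps up to the break-even time.
        total += max(sleep, BREAK_EVEN_CYCLES)
    return total

# Short window skipped, 10-cycle window stretched to break-even,
# long window slept through naturally.
print(gated_cycles([3, 10, 40]))
```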


high-performance computer architecture | 2014

Implications of high energy proportional servers on cluster-wide energy proportionality

Daniel Wong; Murali Annavaram

Cluster-level packing techniques have long been used to improve the energy proportionality of server clusters by masking the poor energy proportionality of individual servers. With the emergence of high energy proportional servers, we revisit whether cluster-level packing techniques are still the most effective way to achieve high cluster-wide energy proportionality. Our findings indicate that cluster-level packing techniques can eventually limit cluster-wide energy proportionality and it may be more beneficial to depend solely on server-level low power techniques. Server-level low power techniques generally require a high latency slack to be effective due to diminishing idle periods as server core count increases. In order for server-level low power techniques to be a viable alternative, the latency slack required for these techniques must be lowered. We found that server-level active low power modes offer the lowest latency slack, independent of server core count, and propose low power mode switching policies to meet the best-case latency slack under realistic conditions. By overcoming these major issues, we show that server-level low power modes can be a viable alternative to cluster-level packing techniques in providing high cluster-wide energy proportionality.


high-performance computer architecture | 2016

Approximating warps with intra-warp operand value similarity

Daniel Wong; Nam Sung Kim; Murali Annavaram

Value locality, the recurrence of a previously seen value, has enabled myriad optimization techniques in traditional processors. Value similarity relaxes the constraint of value locality by allowing values to differ in their least significant bits, where values are microarchitecturally near. With the end of Dennard scaling and the turn towards massively parallel accelerators, we revisit value similarity in the context of GPUs. We identify a form of value similarity called intra-warp operand value similarity, which is abundant in GPUs. We present Warp Approximation, which leverages intra-warp operand value similarity to trade off accuracy for energy. Warp Approximation dynamically identifies intra-warp operand value similarity in hardware and executes a single representative thread on behalf of all the active threads in a warp, thereby producing a representative value with approximate value locality. This representative value can then be stored compactly in the register file as a value-similar scalar, reducing read and write energy when dealing with approximate data. With Warp Approximation, we reduce execution unit energy by 37% and register file energy by 28%, and improve overall GPGPU energy efficiency by 26% with minimal quality degradation.
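Intra-warp operand value similarity can be sketched as a high-bits comparison: operands across a warp count as similar when they agree in all but the least significant bits, and one thread's value then represents the whole warp. The 32-bit operand width and the 8 ignored low bits below are assumptions for illustration, not the paper's parameters.

```python
# Sketch of intra-warp operand value similarity detection. The operand
# width and number of ignored low bits are illustrative assumptions.

IGNORED_LOW_BITS = 8  # threads may differ only in these low-order bits

def warp_is_value_similar(operands, ignored=IGNORED_LOW_BITS):
    """True if all active threads' operands share the same high-order bits."""
    high_bits = {v >> ignored for v in operands}
    return len(high_bits) == 1

def representative(operands, ignored=IGNORED_LOW_BITS):
    """One thread's value stands in for the whole warp (approximate)."""
    assert warp_is_value_similar(operands, ignored)
    return operands[0]

# Four active threads whose operands differ only in the low 8 bits:
warp = [0x12345601, 0x123456FF, 0x12345680, 0x12345642]
print(warp_is_value_similar(warp))  # similar: one scalar can represent all
```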


IEEE Micro | 2013

Scaling the Energy Proportionality Wall with KnightShift

Daniel Wong; Murali Annavaram

Measuring energy proportionality accurately and understanding the reasons for disproportionality are critical first steps in designing future energy-efficient servers. This article presents two metrics, linear deviation and proportionality gap, that let system designers analyze and understand server energy consumption at various utilization levels. An analysis of published SPECpower results shows that energy proportionality improvements are not uniform across various server utilization levels. Even highly energy proportional servers suffer significantly at nonzero but low utilizations. To address the lack of energy proportionality at low utilization, the authors present KnightShift, a server-level heterogeneous server providing an active low-power mode. KnightShift is tightly coupled with a low-power compute node called Knight. Knight responds to low-utilization requests whereas the primary server responds only to high-utilization requests, enabling two energy-efficient operating regions. The authors evaluate KnightShift against a variety of real-world datacenter workloads using a combination of prototyping and simulation.


international symposium on low power electronics and design | 2016

DynSleep: Fine-grained Power Management for a Latency-Critical Data Center Application

Chih-Hsun Chou; Daniel Wong; Laxmi N. Bhuyan

Servers running in datacenters are commonly kept underutilized to meet stringent latency targets. Due to the poor energy proportionality of commodity servers, this low utilization results in wasteful power consumption that costs millions of dollars. Applying dynamic power management to datacenter workloads is challenging, especially when tail latency requirements often fall at the sub-millisecond level. The fundamental issue is randomness due to unpredictable request arrival times and request service times. Prior techniques applied per-core DVFS to gain fine-grained control for slowing down request processing without violating the tail latency target. However, most commodity servers only support per-core DFS, which greatly limits the potential energy savings. In this paper, we propose DynSleep, a fine-grained power management scheme for datacenter workloads that uses per-core sleep states (C-states). DynSleep dynamically postpones the processing of some requests, creating longer idle periods that allow the use of deeper C-states to save energy. We design and implement DynSleep with Memcached, a popular key-value store application used in datacenters. The experimental results show that DynSleep achieves up to 65% core power savings, 27% better than the per-core DVFS power management scheme, while still satisfying the tail latency constraint. To the best of our knowledge, this is the first work to analyze and develop power management techniques with CPU C-states in latency-critical datacenter workloads.
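The core decision in such a scheme is to pick the deepest C-state whose wake-up latency still fits the idle-time budget allowed by the tail-latency target. A minimal sketch, with C-state names, wake-up latencies, and relative power numbers as illustrative placeholders, not measurements from the paper:

```python
# Toy DynSleep-style decision: given how long a core may stay idle before
# the tail-latency budget forces it to resume processing queued requests,
# pick the deepest C-state whose wake-up latency still fits. All numbers
# (states, latencies in microseconds, relative power) are assumptions.

C_STATES = [  # (name, wake_up_latency_us, relative_power)
    ("C1", 2, 0.30),
    ("C3", 50, 0.10),
    ("C6", 200, 0.02),
]

def pick_c_state(idle_budget_us):
    """Deepest (lowest-power) state whose wake-up latency fits the budget."""
    feasible = [s for s in C_STATES if s[1] <= idle_budget_us]
    if not feasible:
        return None  # budget too tight even for the shallowest state
    return min(feasible, key=lambda s: s[2])

print(pick_c_state(300))  # long merged idle period: deep sleep pays off
print(pick_c_state(10))   # tight budget: only a shallow state fits
```

Postponing requests, as DynSleep does, grows the idle budget and therefore shifts this choice toward the deeper, lower-power states.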


ieee international symposium on workload characterization | 2015

A Retrospective Look Back on the Road Towards Energy Proportionality

Daniel Wong; Julia Chen; Murali Annavaram

In this paper, we take a retrospective look back at the road taken towards improving energy proportionality, in order to find out where we are currently and how we got here. Through statistical regression of published SPECpower results, we were able to identify and quantify the sources of past energy proportionality (EP) improvements.


international conference on supercomputing | 2016

Origami: Folding Warps for Energy Efficient GPUs

Mohammad Abdel-Majeed; Daniel Wong; Justin Kuang; Murali Annavaram

Graphics processing units (GPUs) are increasingly used to run a wide range of general purpose applications. Due to wide variation in application parallelism and inherent application-level inefficiencies, GPUs experience significant idle periods. In this work, we first show that significant fine-grain pipeline bubbles exist regardless of warp scheduling policy or workload. We propose to convert these bubbles into energy saving opportunities using Origami. Origami consists of two components: Warp Folding and the Origami scheduler. With Warp Folding, warps are split into two half-warps which are issued in succession. Warp Folding leaves half of the execution lanes idle, which is then exploited to improve energy efficiency through power gating. The Origami scheduler is a new warp scheduler that is cognizant of the Warp Folding process and tries to further extend the sleep times of idle execution lanes. By combining the two techniques, Origami can save 49% and 46% of the leakage energy in the integer and floating point pipelines, respectively. These savings are better than or at least on par with Warped-Gates, a prior technique that power gates the entire cluster of execution lanes. But unlike Warped-Gates, Origami achieves these energy savings without forcing idleness on execution lanes, which leads to performance losses. Hence, Origami incurs virtually no performance overhead.
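Warp Folding itself is simple to sketch: a 32-thread warp is split into two half-warps issued back-to-back, so half of the execution lanes sit idle each issue cycle and become power-gating candidates. The sketch below assumes the conventional 32-thread warp width; everything else is illustrative.

```python
# Sketch of Warp Folding: one 32-thread warp becomes two half-warps issued
# in succession, so only half of the execution lanes are active per issue
# and the other half can be power gated. The 32-thread warp width is the
# conventional one; the rest is an illustrative simplification.

WARP_SIZE = 32

def fold_warp(threads):
    """Split one warp's threads into two half-warps issued back-to-back."""
    assert len(threads) == WARP_SIZE
    half = WARP_SIZE // 2
    return threads[:half], threads[half:]

warp = list(range(WARP_SIZE))
first_issue, second_issue = fold_warp(warp)
print(len(first_issue), len(second_issue))  # half the lanes idle each issue
```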


international symposium on low power electronics and design | 2018

Load-Triggered Warp Approximation on GPU

Zhenhong Liu; Daniel Wong; Nam Sung Kim


international parallel and distributed processing symposium | 2018

Joint Server and Network Energy Saving in Data Centers for Latency-Sensitive Applications

Liang Zhou; Chih-Hsun Chou; Laxmi N. Bhuyan; K. K. Ramakrishnan; Daniel Wong

Collaboration


Dive into Daniel Wong's collaborations.

Top Co-Authors

Murali Annavaram (University of Southern California)

Chih-Hsun Chou (University of California)

Mohammad Abdel-Majeed (University of Southern California)

Jainwei Chen (University of Southern California)

Julia Chen (University of Southern California)

Lakshmi Kumar Dabbiru (University of Southern California)

Michel Dubois (University of Southern California)