Philo Juang | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Philo Juang is active.

Explore More

Publication

Featured researches published by Philo Juang.

architectural support for programming languages and operating systems | 2004

Formal online methods for voltage/frequency control in multiple clock domain microprocessors

Qiang Wu; Philo Juang; Margaret Martonosi; Douglas W. Clark

Multiple Clock Domain (MCD) processors are a promising future alternative to todays fully synchronous designs. Dynamic Voltage and Frequency Scaling (DVFS) in an MCD processor has the extra flexibility to adjust the voltage and frequency in each domain independently. Most existing DVFS approaches are profile-based offline schemes which are mainly suitable for applications whose execution char-acteristics are constrained and repeatable. While some work has been published about online DVFS schemes, the prior approaches are typically heuristic-based. In this paper, we present an effective online DVFS scheme for an MCD processor which takes a formal analytic approach, is driven by dynamic workloads, and is suitable for all applications. In our approach, we model an MCD processor as a queue-domain network and the online DVFS as a feedback control problem with issue queue occupancies as feedback signals. A dynamic stochastic queuing model is first proposed and linearized through an accu-rate linearization technique. A controller is then designed and verified by stability analysis. Finally we evaluate our DVFS scheme through a cycle-accurate simulation with a broad set of applications selected from MediaBench and SPEC2000 benchmark suites. Compared to the best-known prior approach, which is heuristic-based, the proposed online DVFS scheme is substantially more effective due to its automatic regulation ability. For example, we have achieved a 2-3 fold increase in efficiency in terms of energy-delay product improvement. In addition, our control theoretic technique is more resilient, requires less tuning effort, and has better scalability as compared to prior online DVFS schemes.We believe that the techniques and methodology described in this paper can be generalized for energy control in processors other than MCD, such as tiled stream processors.

international symposium on microarchitecture | 2005

Formal control techniques for power-performance management

Qiang Wu; Philo Juang; Margaret Martonosi; Li-Shiuan Peh; Douglas W. Clark

These techniques determine when to speed up a processor to reach performance targets and when to slow it down to save energy. They use dynamic voltage and frequency scaling to balance speed and avoid worst case frequency limitations for both multiple-clock-domain and chip multiprocessors.

high-performance computer architecture | 2005

Voltage and frequency control with adaptive reaction time in multiple-clock-domain processors

Qiang Wu; Philo Juang; Margaret Martonosi; Douglas W. Clark

Dynamic voltage and frequency scaling (DVFS) is a widely used method for energy-efficient computing. In this paper, we present a new intra-task online DVFS scheme for multiple clock domain (MCD) processors. Most existing online DVFS schemes for MCD processors use a fixed time interval between possible voltage/frequency changes. The downside to this approach is that the interval boundaries are predetermined and independent of workload changes. Thus, they can be late in responding to large, severe activity swings. In this work, we propose an alternative online DVFS scheme in which the reaction time is self-tuned and adaptive to application and work-load changes. In addition to designing such a scheme, we model the proposed DVFS control and use the derived model in a formal stability analysis. The obtained analytical insight is then used to guide and improve the design in terms of stability margin and control effectiveness. We evaluate our DVFS scheme through cycle-accurate simulation over a wide set of MediaBench and SPEC2000 benchmarks. Compared to the best-known prior fixed-interval DVFS schemes for MCD processors, the proposed DVFS scheme has a simpler decision process, which leads to smaller and cheaper hardware. Our scheme has achieved significant energy savings over all studied benchmarks (19% energy savings with 3% performance degradation on average, which is close to the best results from existing fixed-interval DVFS schemes). For a group of applications with fast workload variations, our scheme outperforms existing fixed-interval DVFS schemes significantly due to its adaptive nature. Overall, we feel the proposed adaptive online DVFS scheme is an effective and promising alternative to existing fixed-interval DVFS schemes. Designers may choose the new scheme for processors with limited hardware budget, or if the anticipated work-load behavior is variable. In addition, the modeling and analysis techniques in this work serve as examples of using stability analysis in other aspects of high-performance CPU design and control.

international conference on computer design | 2002

Applying decay strategies to branch predictors for leakage energy savings

Zhigang Hu; Philo Juang; Kevin Skadron; Douglas W. Clark; Margaret Martonosi

With technology advancing toward deep submicron, leakage energy is of increasing concern, especially for large onchip array structures such as caches and branch predictors. Recent work has suggested that even larger branch predictors can and should be used in order to improve microprocessor performance. A further consideration is that the branch predictor is a thermal hot spot, thus further increasing its leakage. For these reasons, it is natural to consider applying decay techniques-already shown to reduce leakage energy for caches-to branch-prediction structures. Due to the structural difference between caches and branch predictors, applying decay techniques to branch predictors is not straightforward. This paper explores the strategies for exploiting spatial and temporal locality to make decay effective for bimodal, gshare, and hybrid predictors, as well as the branch target buffer Overall, this paper demonstrates that decay techniques apply more broadly than just to caches, but that careful policy and implementation make the difference between success and failure in building decay-based branch predictors. Multi-component hybrid predictors offer especially interesting implementation tradeoffs for decay.

international symposium on low power electronics and design | 2002

Managing leakage for transient data: decay and quasi-static 4T memory cells

Zhigang Hu; Philo Juang; Phil Diodato; Stefanos Kaxiras; Kevin Skadron; Margaret Martonosi; Douglas W. Clark

Much of on-chip storage is devoted to transient, often short-lived, data. Despite this, virtually all on-chip array structures use six-transistor (6T) static RAM cells that store data indefinitely. In this paper we propose the use of quasi-static four-transistor (4T) RAM cells. Quasi-static 4T cells provide both energy and area savings. These cells have no connection to Vdd and thus inherently provide decay functionality: values are refreshed upon access but discharge over time without use. This makes 4T cells uniquely well-suited for predictive structures like branch predictors and BTBs where data integrity is not essential. We use quantitative evaluations (both circuit-level and cycle-level) to explore the design space and quantify the opportunities. Overall, 4T-based branch predictors offer 12-33% area savings and 60-80% leakage savings with minimal performance impact. More broadly, this paper suggests a new view of how to support transient data in power-aware processors.

ACM Sigarch Computer Architecture News | 2005

Hardware-modulated parallelism in chip multiprocessors

Julia Chen; Philo Juang; Kevin Ko; Gilberto Contreras; David A. Penry; Ram Rangan; Adam Stoler; Li-Shiuan Peh; Margaret Martonosi

Chip multi-processors (CMPs) already have widespread commercial availability, and technology roadmaps project enough on-chip transistors to replicate tens or hundreds of current processor cores. How will we express parallelism, partition applications, and schedule/place/migrate threads on these highly-parallel CMPs?This paper presents and evaluates a new approach to highly-parallel CMPs, advocating a new hardware-software contract. The software layer is encouraged to expose large amounts of multi-granular, heterogeneous parallelism. The hardware, meanwhile, is designed to offer low-overhead, low-area support for orchestrating and modulating this parallelism on CMPs at runtime. Specifically, our proposed CMP architecture consists of architectural and ISA support targeting thread creation, scheduling and context-switching, designed to facilitate effective hardware run-time mapping of threads to cores at low overheads.Dynamic modulation of parallelism provides the ability to respond to run-time variability that arises from dataset changes, memory system effects and power spikes and lulls, to name a few. It also naturally provides a long-term CMP platform with performance portability and tolerance to frequency and reliability variations across multiple CMP generations. Our simulations of a range of applications possessing do-all, streaming and recursive parallellism show speedups of 4-11.5X and energy-delay-product savings of 3.8X, on average, on a 16-core vs. a 1-core system. This is achieved with modest amounts of hardware support that allows for low overheads in thread creation, scheduling and context-switching. In particular, our simulations motivated the need for hardware support, showing that the large thread management overheads of current run-time software systems can lead to up to 6.5X slowdown. The difficulties faced in static scheduling were shown in our simulations with a static scheduling algorithm, fed with oracle profiled inputs suffering up to 107% slowdown compared to NDPs hardware scheduler, due to its inability to handle memory system variabilities. More broadly, we feel that the ideas presented here show promise for scaling to the systems expected in ten years, where the advantages of high transistor counts may be dampened by difficulties in circuit variations and reliability. These issues will make dynamic scheduling and adaptation mandatory; our proposals represent a first step towards that direction.

ACM Transactions on Architecture and Code Optimization | 2004

Implementing branch-predictor decay using quasi-static memory cells

Philo Juang; Kevin Skadron; Margaret Martonosi; Zhigang Hu; Douglas W. Clark; Philip W. Diodato; Stefanos Kaxiras

With semiconductor technology advancing toward deep submicron, leakage energy is of increasing concern, especially for large on-chip array structures such as caches and branch predictors. Recent work has suggested that larger, aggressive branch predictors can and should be used in order to improve microprocessor performance. A further consideration is that more aggressive branch predictors, especially multiported predictors for multiple branch prediction, may be thermal hot spots, thus further increasing leakage. Moreover, as the branch predictor holds state that is transient and predictive, elements can be discarded without adverse effect. For these reasons, it is natural to consider applying decay techniques---already shown to reduce leakage energy for caches---to branch-prediction structures.Due to the structural difference between caches and branch predictors, applying decay techniques to branch predictors is not straightforward. This paper explores the strategies for exploiting spatial and temporal locality to make decay effective for bimodal, gshare, and hybrid predictors, as well as the branch target buffer (BTB). Furthermore, the predictive behavior of branch predictors steers them towards decay based not on state-preserving, static storage cells, but rather quasi-static, dynamic storage cells. This paper will examine the results of implementing decaying branch-predictor structures with dynamic---appropriately, decaying---cells rather than the standard static SRAM cell.Overall, this paper demonstrates that decay techniques can apply to more than just caches, with the branch predictor and BTB as an example. We show decay can either be implemented at the architectural level, or with a wholesale replacement of static storage cells with quasi-static storage cells, which naturally implement decay. More importantly, decay techniques can be applied and should be applied to other such transient and/or predictive structures.

IEEE Computer Architecture Letters | 2002

Implementing Decay Techniques using 4T Quasi-Static Memory Cells

Philo Juang; Phil Diodato; Stefanos Kaxiras; Kevin Skadron; Zhigang Hu; Margaret Martonosi; Douglas W. Clark

This paper proposes the use of four-transistor (4T) cacheand branch predictor array cell designs to address increasingworries regarding leakage power dissipation. While 4T designslose state when infrequently accessed, they have very lowleakage, smaller area, and no capacitive loads to switch. Thisshort paper gives an overview of 4T implementation issues anda preliminary evaluation of leakage-energy savings that showsimprovements of 60-80%

architectural support for programming languages and operating systems | 2002