Hsien-Hsin S. Lee | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Hsien-Hsin S. Lee is active.

Explore More

Publication

Featured researches published by Hsien-Hsin S. Lee.

international symposium on microarchitecture | 2007

Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs

Mrinmoy Ghosh; Hsien-Hsin S. Lee

DRAMs require periodic refresh for preserving data stored in them. The refresh interval for DRAMs depends on the vendor and the design technology they use. For each refresh in a DRAM row, the stored information in each cell is read out and then written back to itself as each DRAM bit read is self-destructive. The refresh process is inevitable for maintaining data correctness, unfortunately, at the expense of power and bandwidth overhead. The future trend to integrate layers of 3D die-stacked DRAMs on top of a processor further exacerbates the situation as accesses to these DRAMs will be more frequent and hiding refresh cycles in the available slack becomes increasingly difficult. Moreover, due to the implication of temperature increase, the refresh interval of 3D die-stacked DRAMs will become shorter than those of conventional ones. This paper proposes an innovative scheme to alleviate the energy consumed in DRAMs. By employing a time-out counter for each memory row of a DRAM module, all the unnecessary periodic refresh operations can be eliminated. The basic concept behind our scheme is that a DRAM row that was recently read or written to by the processor (or other devices that share the same DRAM) does not need to be refreshed again by the periodic refresh operation, thereby eliminating excessive refreshes and the energy dissipated. Based on this concept, we propose a low-cost technique in the memory controller for DRAM power reduction. The simulation results show that our technique can reduce up to 86% of all refresh operations and 59.3% on the average for a 2 GB DRAM. This in turn results in a 52.6% energy savings for refresh operations. The overall energy saving in the DRAM is up to 25.7% with an average of 12.13% obtained for SPLASH-2, SPECint2000, and Biobench benchmark programs simulated on a 2 GB DRAM. For a 64 MB 3D DRAM, the energy saving is up to 21% and 9.37% on an average when the refresh rate is 64 ms. For a faster 32 ms refresh rate the maximum and average savings are 12% and 6.8% respectively.

international symposium on computer architecture | 2010

Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping

Nak Hee Seong; Dong Hyuk Woo; Hsien-Hsin S. Lee

Phase change memory (PCM) is an emerging memory technology for future computing systems. Compared to other non-volatile memory alternatives, PCM is more matured to production, and has a faster read latency and potentially higher storage density. The main roadblock precluding PCM from being used, in particular, in the main memory hierarchy, is its limited write endurance. To address this issue, recent studies proposed to either reduce PCMs write frequency or use wear-leveling to evenly distribute writes. Although these techniques can extend the lifetime of PCM, most of them will not prevent deliberately designed malicious codes from wearing it out quickly. Furthermore, all the prior techniques did not consider the circumstances of a compromised OS and its security implication to the overall PCM design. A compromised OS will allow adversaries to manipulate processes and exploit side channels to accelerate wear-out. In this paper, we argue that a PCM design not only has to consider normal wear-out under normal application behavior, most importantly, it must take the worst-case scenario into account with the presence of malicious exploits and a compromised OS to address the durability and security issues simultaneously. In this paper, we propose a novel, low-cost hardware mechanism called Security Refresh to avoid information leak by constantly migrating their physical locations inside the PCM, obfuscating the actual data placement from users and system software. It uses a dynamic randomized address mapping scheme that swaps data using random keys upon each refresh due. The hardware overhead is tiny without using any table. The best lifetime we can achieve under the worst-case malicious attack is more than six years. Also, our scheme incurs around 1% performance degradation for normal program operations.

high-performance computer architecture | 2010

An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

Dong Hyuk Woo; Nak Hee Seong; Dean L. Lewis; Hsien-Hsin S. Lee

Memory bandwidth has become a major performance bottleneck as more and more cores are integrated onto a single die, demanding more and more data from the system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon-vias (TSVs) of 3D chip stacks can provide still more bandwidth. In this paper, we contest that we need to re-architect our memory hierarchy, including the L2 cache and DRAM interface, so that it can take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, careful enough to avoid compromising the DRAM density because of TSV placement. Moreover, we propose an efficient mechanism to manage the false sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, the SMART-3D architecture achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower the energy consumption in the L2 cache and 3D DRAM for it reduces the total number of row buffer misses.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2007

Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs

Michael B. Healy; Mario Vittes; Mongkol Ekpanyapong; Chinnakrishnan S. Ballapuram; Sung Kyu Lim; Hsien-Hsin S. Lee; Gabriel H. Loh

This paper presents the first multiobjective microarchitectural floorplanning algorithm for high-performance processors implemented in two-dimensional (2-D) and three-dimensional (3-D) ICs. The floorplanner takes a microarchitectural netlist and determines the dimension as well as the placement of the functional modules into single- or multiple-device layers while simultaneously achieving high performance and thermal reliability. The traditional design objectives such as area and wirelength are also considered. The 3-D floorplanning algorithm considers the following 3-D-specific issues: vertical overlap optimization and bonding-aware layer partitioning. The hybrid floorplanning approach combines linear programming and simulated annealing, which is shown to be very effective in obtaining high-quality solutions in a short runtime under multiobjective goals. This paper provides comprehensive experimental results on making tradeoffs among performance, thermal, area, and wirelength for both 2-D and 3-D ICs

international conference on parallel architectures and compilation techniques | 2004

Architectural Support for High Speed Protection of Memory Integrity and Confidentiality in Multiprocessor Systems

Weidong Shi; Hsien-Hsin S. Lee; Mrinmoy Ghosh; Chenghuai Lu

Recently there is a growing effort in both the architecture and the security community to create a hardware solution for authenticating system memory. As shown in the previous work, hardware-based memory authentication becomes a vital component for creating future trusted computing environments and digital rights protection. Almost all these prior work have focused on authenticating memory exclusively owned by a single processing element. However, in todays computing platforms, memory is often shared by multiple processing elements that support a shared system memory with a snooping cache coherence protocol. Authenticating shared memory is a new challenge to memory protection. In this paper, we present a secure and fast architecture for authenticating shared memory. In terms of incorporating memory authentication into the processor pipeline, we propose a new scheme called authentication speculative execution. Unlike the prior approaches, our scheme does not compromise security for performance. The novel ASE scheme is not only secure as it is combined with a onetime-pad (OTP) based memory encryption but also efficient to tolerate authentication latency by executing unauthenticated instructions speculatively. Results using modified RSIM running SPLASH2 benchmark show only 5% overhead in performance on dual and quad processor platforms. Furthermore, ASE shows 80% better performance on average over conservative nonspeculative execution based authentication schemes. The scheme is of practical use for both multiprocessor systems and uni-processor systems where memory is shared by one main processor and other co-processors on the system bus.

acm symposium on parallel algorithms and architectures | 2008

Kicking the tires of software transactional memory: why the going gets tough

Richard M. Yoo; Yang Ni; Adam Welc; Bratin Saha; Ali-Reza Adl-Tabatabai; Hsien-Hsin S. Lee

Transactional Memory (TM) promises to simplify concurrent programming, which has been notoriously difficult but crucial in realizing the performance benefit of multi-core processors. Software Transaction Memory (STM), in particular, represents a body of important TM technologies since it provides a mechanism to run transactional programs when hardware TM support is not available, or when hardware TM resources are exhausted. Nonetheless, most previous researches on STMs were constrained to executing trivial, small-scale workloads. The assumption was that the same techniques applied to small-scale workloads could readily be applied to real-life, large-scale workloads. However, by executing several nontrivial workloads such as particle dynamics simulation and game physics engine on a state of the art STM, we noticed that this assumption does not hold. Specifically, we identified four major performance bottlenecks that were unique to the case of executing large-scale workloads on an STM: false conflicts, over-instrumentation, privatization-safety cost, and poor amortization. We believe that these bottlenecks would be common for any STM targeting real-world applications. In this paper, we describe those identified bottlenecks in detail, and we propose novel solutions to alleviate the issues. We also thoroughly validate these approaches with experimental results on real machines.

international symposium on computer architecture | 2005

High Efficiency Counter Mode Security Architecture via Prediction and Precomputation

Weidong Shi; Hsien-Hsin S. Lee; Mrinmoy Ghosh; Chenghuai Lu; Alexandra Boldyreva

Encrypting data in unprotected memory has gained much interest lately for digital rights protection and security reasons. Counter mode is a well-known encryption scheme. It is a symmetric-key encryption scheme based on any block cipher, e.g. AES. The schemes encryption algorithm uses a block cipher, a secret key and a counter (or a sequence number) to generate an encryption pad which is XORed with the data stored in memory. Like other memory encryption schemes, this method suffers from the inherent latency of decrypting encrypted data when loading them into the on-chip cache. In this paper, we present a novel technique to hide the latency overhead of decrypting counter mode encrypted memory by predicting the sequence number and pre-computing the encryption pad that we call one-time-pad or OTP. In contrast to the prior techniques of sequence number caching, our mechanism solves the latency issue by using idle decryption engine cycles to speculatively predict and pre-compute OTPs before the corresponding sequence number is loaded. This technique incurs very little area overhead. In addition, a novel adaptive OTP prediction technique is also presented to further improve our regular OTP prediction and precomputation mechanism. This adaptive scheme is not only able to predict encryption pads associated with static and infrequently updated cache lines but also those frequently updated ones as well. Experimental results using SPEC2000 benchmark show an 82% prediction rate. Moreover, we also explore several optimization techniques for improving the prediction accuracy. Two specific techniques, two-level prediction and context-based prediction are presented and evaluated.

2009 IEEE International Conference on 3D System Integration | 2009

Architectural evaluation of 3D stacked RRAM caches

Dean L. Lewis; Hsien-Hsin S. Lee

The first memristor, originally theorized by Dr. Leon Chua in 1971, was identiffed by a team at HP Labs in 2008. This new fundamental circuit element is unique in that its resistance changes as current passes through it, giving the device a memory of the past system state. The immediately obvious application of such a device is in a non-volatile memory, wherein high- and low-resistance states are used to store binary values. A memory array of memristors forms what is called a resistive RAM or RRAM. In this paper, we survey the memristors that have been produced by a number of different research teams and present a point-by-point comparison between DRAM and this new RRAM, based on both existent and expected near-term memristor devices. In particular, we consider the case of a die-stacked 3D memory that is integrated onto a logic die and evaluate which memory is best suited for the job. While still suffering a few shortcomings, RRAM proves itself a very interesting design alternative to well-established DRAM technologies.

international symposium on computer architecture | 2013

Tri-level-cell phase change memory: toward an efficient and reliable memory system

Nak Hee Seong; Sungkap Yeo; Hsien-Hsin S. Lee

There are several emerging memory technologies looming on the horizon to compensate the physical scaling challenges of DRAM. Phase change memory (PCM) is one such candidate proposed for being part of the main memory in computing systems. One salient feature of PCM is its multi-level-cell (MLC) property, which can be used to multiply the memory capacity at the cell level. However, due to the nature of PCM that the value written to the cell can drift over time, PCM is prone to a unique type of soft errors, posing a great challenge for their practical deployment. This paper first quantitatively studied the current art for MLC PCM in dealing with the resistance drift problem and showed that the previously proposed techniques such as scrubbing or error correction mechanisms have significant reliability challenges to overcome. We then propose tri-level-cell PCM and demonstrate its ability to achieving 105 x lower soft error rate than four-level-cell PCM and 1.33 x higher information density than single-level-cell PCM. According to our findings, the tri-level-cell PCM shows 36.4% performance improvement over the four-level-cell PCM while achieving the soft error rate of DRAM.

international symposium on computer architecture | 2006

An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors

Weidong Shi; Hsien-Hsin S. Lee; Laura Falk; Mrinmoy Ghosh

This paper presents a high-availability system architecture called INDRA an INtegrated framework for Dependable and Revivable Architecture that enhances a multicore processor (or CMP) with novel security and fault recovery mechanisms. INDRA represents the first effort to create remote attack immune, self-healing network services using the emerging multicore processors. By exploring the property of a tightly-coupled multicore system, INDRA pioneers several concepts. It creates a hardware insulation, establishes finegrained fault monitoring, exploits monitoring/backup concurrency, and facilitates fast recovery services with minimal performance impact. In addition, INDRAs fault/exploit monitoring is implemented in software rather than in hardware logic, thereby providing better flexibility and upgradability. To provide efficient service recovery and thus improve service availability, we propose a novel delta state backup and recovery on-demand mechanism in INDRA that substantially outperforms conventional checkpointing schemes. We demonstrate and evaluate INDRAs capability and performance using real network services and a cycle-level architecture simulator. As indicated by our performance results, INDRA is highly effective in establishing a more dependable system with high service availability using emerging multicore processors.

Explore More