Publication


Featured research published by Weng-Fai Wong.


Symposium on Operating Systems Principles | 2009

Automatically patching errors in deployed software

Jeff H. Perkins; Sunghun Kim; Samuel Larsen; Saman P. Amarasinghe; Jonathan Bachrach; Michael Carbin; Carlos Pacheco; Frank Sherwood; Stelios Sidiroglou; Greg Sullivan; Weng-Fai Wong; Yoav Zibin; Michael D. Ernst; Martin C. Rinard

We present ClearView, a system for automatically patching errors in deployed software. ClearView works on stripped Windows x86 binaries without any need for source code, debugging information, or other external information, and without human intervention. ClearView (1) observes normal executions to learn invariants that characterize the application's normal behavior, (2) uses error detectors to distinguish normal executions from erroneous executions, (3) identifies violations of learned invariants that occur during erroneous executions, (4) generates candidate repair patches that enforce selected invariants by changing the state or flow of control to make the invariant true, and (5) observes the continued execution of patched applications to select the most successful patch. ClearView is designed to correct errors in software with high availability requirements. Aspects of ClearView that make it particularly appropriate for this context include its ability to generate patches without human intervention, apply and remove patches to and from running applications without requiring restarts or otherwise perturbing the execution, and identify and discard ineffective or damaging patches by evaluating the continued behavior of patched applications. ClearView was evaluated in a Red Team exercise designed to test its ability to survive attacks that exploit security vulnerabilities. A hostile external Red Team developed ten code-injection exploits and used these exploits to repeatedly attack an application protected by ClearView. ClearView detected and blocked all of the attacks. For seven of the ten exploits, ClearView automatically generated patches that corrected the error, enabling the application to survive the attacks and continue to successfully process subsequent inputs. Finally, the Red Team attempted to make ClearView apply an undesirable patch, but ClearView's patch evaluation mechanism enabled it to identify and discard both ineffective and damaging patches.
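The patching loop described above can be sketched in miniature. The sketch below is a toy analogue with hypothetical names (learn_invariants, violated, candidate_patches); ClearView itself operates on stripped x86 binaries via binary instrumentation, not on Python dictionaries.

```python
# Toy analogue of the ClearView pipeline: learn range invariants from normal
# runs, detect violations in an erroneous run, and generate clamp patches.
# All names here are illustrative, not ClearView's actual components.

def learn_invariants(normal_runs):
    """Learn simple range invariants (lo <= var <= hi) from normal executions."""
    inv = {}
    for run in normal_runs:
        for var, val in run.items():
            lo, hi = inv.get(var, (val, val))
            inv[var] = (min(lo, val), max(hi, val))
    return inv

def violated(invariants, run):
    """Return the variables whose learned ranges the erroneous run violates."""
    return [v for v, (lo, hi) in invariants.items()
            if v in run and not lo <= run[v] <= hi]

def candidate_patches(invariants, violations):
    """One candidate patch per violated invariant: clamp the value into range."""
    return [(v, lambda val, lo=lo, hi=hi: max(lo, min(val, hi)))
            for v in violations
            for lo, hi in [invariants[v]]]

# Learn from two normal runs, then repair an erroneous one.
inv = learn_invariants([{"len": 10}, {"len": 60}])   # {'len': (10, 60)}
bad = {"len": 4096}                                   # e.g. attacker-controlled
for var, patch in candidate_patches(inv, violated(inv, bad)):
    bad[var] = patch(bad[var])
print(bad)                                            # {'len': 60}
```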


International Symposium on Microarchitecture | 2011

Multi retention level STT-RAM cache designs with a dynamic refresh scheme

Zhenyu Sun; Xiuyuan Bi; Hai Helen Li; Weng-Fai Wong; Zhong-Liang Ong; Xiaochun Zhu; Wenqing Wu

Spin-transfer torque random access memory (STT-RAM) has received increasing attention because of its attractive features: good scalability, zero standby power, non-volatility, and radiation hardness. The use of STT-RAM technology in last-level on-chip caches has been proposed because it minimizes cache leakage power as technology scales down. Furthermore, the cell area of STT-RAM is only 1/9 ∼ 1/3 that of SRAM. This allows for a much larger cache with the same die footprint, improving overall system performance by reducing cache misses. However, deploying STT-RAM technology in L1 caches is challenging because of the long and power-hungry write operations. In this paper, we propose both L1 and lower-level cache designs that use STT-RAM. In particular, our designs use STT-RAM cells with different data retention times and write performance, made possible by different magnetic tunneling junction (MTJ) designs. For the fast STT-RAM bits with reduced data retention time, a counter-controlled dynamic refresh scheme is proposed to maintain data validity. Our dynamic scheme saves more than 80% of refresh energy compared to the simple refresh scheme proposed in previous works. An L1 cache built with ultra-low-retention STT-RAM coupled with our proposed dynamic refresh scheme achieves a 9.2% performance improvement and saves up to 30% of the total energy when compared to one that uses traditional SRAM. For lower-level caches with relatively large capacity, we propose a data migration scheme that moves data between portions of the cache with different retention characteristics so as to maximize the performance and power benefits. Our experiments show that, on average, our proposed multi-retention-level STT-RAM cache reduces total energy by 30 ∼ 70% compared to previous works, while improving IPC performance for both 2-level and 3-level cache hierarchies.
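The counter-controlled refresh idea can be illustrated with a minimal sketch, assuming a retention time measured in refresh-clock ticks (the constant and class below are hypothetical, not the paper's design):

```python
# Minimal sketch of counter-controlled dynamic refresh: each low-retention
# line carries a counter that a write resets; only lines whose counter hits
# the retention limit get refreshed, instead of refreshing every line each
# period as a simple scheme would.

RETENTION_TICKS = 4          # assumed retention time, in refresh-clock ticks

class RefreshedCache:
    def __init__(self, num_lines):
        self.counters = [0] * num_lines
        self.refreshes = 0

    def write(self, line):
        self.counters[line] = 0            # a write restores the stored state

    def tick(self):
        """Advance the refresh clock by one period."""
        for line, c in enumerate(self.counters):
            if c + 1 >= RETENTION_TICKS:
                self.refreshes += 1        # refresh only lines about to expire
                self.counters[line] = 0
            else:
                self.counters[line] = c + 1

cache = RefreshedCache(num_lines=4)
for t in range(8):
    cache.write(t % 2)                     # lines 0 and 1 are written often
    cache.tick()
print(cache.refreshes)                     # 4: only idle lines 2 and 3 refresh
```

Frequently written lines never need an explicit refresh, which is where the savings over a blanket refresh scheme come from.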


IEEE Transactions on Computers | 1994

Fast hardware-based algorithms for elementary function computations using rectangular multipliers

Weng-Fai Wong; E. Goto

As the name suggests, elementary functions play a vital role in scientific computations. Yet due to their inherent nature, they are a considerable computing task in themselves. Not surprisingly, since the dawn of computing, the goal of speeding up elementary function computation has been pursued. This paper describes new hardware-based algorithms for the computation of the common elementary functions, namely division, logarithm, reciprocal square root, arc tangent, sine, and cosine. These algorithms exploit microscopic parallelism using specialized hardware, with heavy use of truncation based on detailed accuracy analysis. The contribution of this work lies in the fact that these algorithms are very fast and yet accurate. If we let the time to perform an IEEE Standard 754 double-precision floating-point multiplication be τ×, our algorithms achieve roughly 3.68τ×, 4.56τ×, 5.25τ×, 3.69τ×, 7.06τ×, and 6.5τ× for division, logarithm, square root, exponential, arc tangent, and complex exponential (sine and cosine), respectively. The trade-off is the need for tables and some specialized hardware; the total amount of table storage required, however, is less than 128 Kbytes. We discuss the hardware, algorithmic, and accuracy aspects of these algorithms.


Local Computer Networks | 2005

Sensor grid: integration of wireless sensor networks and the grid

Hock Beng Lim; Yong Meng Teo; Protik Mukherjee; Weng-Fai Wong; Simon See

Wireless sensor networks have emerged as an exciting technology for a wide range of important applications that acquire and process information from the physical world. Grid computing has evolved as a standards-based approach to coordinated resource sharing. Sensor grids combine these two promising technologies by extending the grid computing paradigm to the sharing of sensor resources in wireless sensor networks. There are several issues and challenges in the design of sensor grids. In this paper, we propose a sensor grid architecture, called the scalable proxy-based architecture for sensor grids (SPRING), to address these design issues. We also developed a sensor grid testbed to study these design issues and to refine our architecture.
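As a rough illustration of the proxy idea, the sketch below shows a stand-in proxy that buffers sensor readings and answers aggregated queries on behalf of grid clients; the class and its methods are hypothetical and not part of SPRING:

```python
# Illustrative sensor grid proxy (hypothetical API, not the SPRING code):
# sensor nodes push cheap appends, while grid clients query the proxy, so
# resource-poor sensors are shielded from grid traffic and aggregation cost.

from collections import defaultdict

class SensorProxy:
    def __init__(self):
        self.readings = defaultdict(list)    # sensor id -> buffered samples

    def push(self, sensor_id, value):
        """Called by sensor nodes: append-only, minimal work on the node."""
        self.readings[sensor_id].append(value)

    def query(self, sensor_id, reducer=max):
        """Called by grid clients: aggregation happens at the proxy."""
        samples = self.readings.get(sensor_id, [])
        return reducer(samples) if samples else None

proxy = SensorProxy()
for v in (21.5, 22.0, 21.8):
    proxy.push("temp-07", v)
print(proxy.query("temp-07"))                                     # 22.0
print(proxy.query("temp-07", reducer=lambda s: sum(s) / len(s)))  # mean
```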


International Conference on Computer Aided Design | 2004

Configuration bitstream compression for dynamically reconfigurable FPGAs

Ju Hwa Pan; Tulika Mitra; Weng-Fai Wong

Field-programmable gate arrays (FPGAs) offer the possibility of dynamic reconfiguration. The key advantages of dynamic reconfiguration are the ability to rapidly adapt to dynamic changes and better utilization of the programmable hardware resources across multiple applications. However, with the advent of multi-million-gate-equivalent FPGAs, configuration time is increasingly becoming a concern. High reconfiguration cost can potentially wipe out any gains from dynamic reconfiguration. One solution to this problem is to exploit the high levels of redundancy in the configuration bitstream through compression. In this paper, we propose a novel configuration compression technique that exploits redundancies both within a configuration's bitstream and between the bitstreams of multiple configurations. By maximizing reuse, our results show that the proposed technique performs 26.5-75.8% better than previously proposed techniques. To the best of our knowledge, ours is the first work to perform inter-configuration compression.
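One plausible way to picture inter-configuration reuse (a greedy stand-in, not the paper's actual algorithm) is LZ-style coding where matches may come from the previously loaded configuration's bitstream:

```python
# Hedged sketch of inter-configuration compression: encode a new bitstream as
# copies out of the previous configuration plus literals. Greedy and O(n*m);
# illustrative only, not the technique proposed in the paper.

def compress(new, prev, min_match=4):
    out, i = [], 0
    while i < len(new):
        best_len, best_off = 0, 0
        for off in range(len(prev)):          # search the previous bitstream
            l = 0
            while (off + l < len(prev) and i + l < len(new)
                   and prev[off + l] == new[i + l]):
                l += 1
            if l > best_len:
                best_len, best_off = l, off
        if best_len >= min_match:
            out.append(("copy", best_off, best_len))  # reuse old configuration
            i += best_len
        else:
            out.append(("lit", new[i]))
            i += 1
    return out

prev = b"\x00\xff\xaa\xaa\xaa\xaa\x10\x20"
new  = b"\xaa\xaa\xaa\xaa\x10\x21"
print(compress(new, prev))    # [('copy', 2, 5), ('lit', 33)]
```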


IEEE Transactions on Computers | 1995

Fast evaluation of the elementary functions in single precision

Weng-Fai Wong; Eiichi Goto

In this paper we introduce a new method for the fast evaluation of the elementary functions in single precision, based on the evaluation of truncated Taylor series using a difference method. We assume the availability of large and fast (at least for read purposes) memory. We call this method the ATA (Add-Table lookup-Add) method. As the name implies, the hardware required for the method consists of adders (both two- and multi-operand) and fast tables. For IEEE single-precision numbers, our initial estimates indicate that we can calculate the basic elementary functions, namely reciprocal, square root, logarithm, exponential, trigonometric, and inverse trigonometric functions, within the latency of two to four floating-point multiplies.
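A software analogue of the table-lookup-plus-correction idea: split the argument into high and low parts, look up the function of the high part, and correct with a truncated Taylor series of the low part. The sketch below keeps a final multiply for clarity, whereas the ATA hardware replaces multiplies with multi-operand adders:

```python
# Software analogue of table-based elementary function evaluation
# (not the ATA design itself: ATA uses only adders and tables).

import math

BITS = 8
TABLE = [math.exp(i / (1 << BITS)) for i in range(1 << BITS)]  # exp(high part)

def exp_approx(x):
    """Approximate exp(x) for x in [0, 1) via table lookup + truncated Taylor."""
    hi_idx = int(x * (1 << BITS))          # high-order bits index the table
    lo = x - hi_idx / (1 << BITS)          # low-order residue, 0 <= lo < 2**-8
    # exp(x) = exp(hi) * exp(lo), with exp(lo) ~ 1 + lo + lo**2/2 (truncated)
    return TABLE[hi_idx] * (1.0 + lo + 0.5 * lo * lo)

x = 0.73
print(exp_approx(x), math.exp(x))          # agree to roughly 8 digits
```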


Virtual Execution Environments | 2011

Dynamic cache contention detection in multi-threaded applications

Qin Zhao; David Koh; Syed Raza; Derek Bruening; Weng-Fai Wong; Saman P. Amarasinghe

In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy. In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and the parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and the application's behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach --- a 5x slowdown on average relative to native execution --- is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications by up to a factor of 12x. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
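The key insight lends itself to a compact sketch: given an access trace and the line size, a shadow map from cache lines to (thread, offset) pairs suffices to separate true from false sharing. The trace format below is hypothetical; the actual tool gathers accesses via dynamic instrumentation:

```python
# Sketch of detecting true vs. false sharing from an access trace alone,
# without simulating the memory hierarchy (hypothetical trace format).

from collections import defaultdict

LINE = 64     # assumed cache line size in bytes

def classify_sharing(trace):
    """trace: iterable of (thread_id, byte_address) memory accesses."""
    lines = defaultdict(set)               # line number -> {(thread, offset)}
    for tid, addr in trace:
        lines[addr // LINE].add((tid, addr % LINE))
    report = {}
    for line, accesses in lines.items():
        threads = {t for t, _ in accesses}
        if len(threads) < 2:
            continue                       # private line: no contention
        per_thread = [{o for t, o in accesses if t == tid} for tid in threads]
        overlap = set.intersection(*per_thread)
        report[line] = "true sharing" if overlap else "false sharing"
    return report

trace = [(0, 0x1000), (1, 0x1000),         # two threads touch the same byte
         (0, 0x2000), (1, 0x2020)]         # same line, disjoint bytes
print(classify_sharing(trace))   # {64: 'true sharing', 128: 'false sharing'}
```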


International Symposium on Low Power Electronics and Design | 2011

Processor caches built using multi-level spin-transfer torque RAM cells

Yiran Chen; Weng-Fai Wong; Hai Li; Cheng-Kok Koh

It has been predicted that a processor's caches could occupy as much as 90% of chip area a few technology nodes from the current one. In this paper, we study the use of multi-level spin-transfer torque RAM (STT-RAM) cells in the design of processor caches. Compared to traditional SRAM caches, a multi-level cell (MLC) STT-RAM cache design is denser, faster, and consumes less energy. However, a number of critical issues remain to be solved before MLC STT-RAM technology can be deployed in processor caches. In this paper, we offer solutions to the issue of bit encoding and tackle the write endurance problem; the latter has been neglected in previous works on STT-RAM caches. We propose a set remapping scheme that can potentially prolong the lifetime of an MLC STT-RAM cache by 80× on average. Furthermore, a method for recovering the performance that may be lost in some applications due to set remapping is introduced.
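A generic way to picture set remapping for endurance (an XOR-remap stand-in, not the paper's exact scheme) is to periodically change the mapping from logical to physical sets so that a write-hot set does not wear out a single physical location:

```python
# Hedged sketch of set remapping for write-endurance leveling: logical set
# indices are XOR-remapped with a key that rotates every REMAP_INTERVAL
# writes, spreading a hot set's writes across physical sets. (Real designs
# must also migrate the affected cache contents on each remap.)

NUM_SETS = 8
REMAP_INTERVAL = 100          # writes between key changes (assumed)

class RemappedCache:
    def __init__(self):
        self.key = 0
        self.writes_since_remap = 0
        self.wear = [0] * NUM_SETS         # per-physical-set write counts

    def physical_set(self, logical_set):
        return (logical_set ^ self.key) % NUM_SETS

    def write(self, logical_set):
        self.wear[self.physical_set(logical_set)] += 1
        self.writes_since_remap += 1
        if self.writes_since_remap >= REMAP_INTERVAL:
            self.key = (self.key + 1) % NUM_SETS   # rotate the remap key
            self.writes_since_remap = 0

cache = RemappedCache()
for _ in range(800):
    cache.write(3)             # a single write-hot set in the access stream
print(cache.wear)              # [100, 100, ...]: wear spread over all 8 sets
```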


ACM Journal on Emerging Technologies in Computing Systems | 2013

On-chip caches built on multilevel spin-transfer torque RAM cells and its optimizations

Yiran Chen; Weng-Fai Wong; Hai Li; Cheng-Kok Koh; Yaojun Zhang; Wujie Wen

It has been predicted that a processor's caches could occupy as much as 90% of chip area a few technology nodes from the current ones. In this article, we investigate the use of multilevel spin-transfer torque RAM (STT-RAM) cells in the design of processor caches. We start by examining the access (read and write) scheme for multilevel cell (MLC) STT-RAM from a circuit design perspective, detailing the read and write circuits. Compared to traditional SRAM caches, an MLC STT-RAM cache design is denser, faster, and requires less energy. However, a number of critical architecture-level issues remain to be solved before MLC STT-RAM technology can be deployed in processor caches. We offer solutions to the issue of bit encoding and tackle the write endurance problem; in particular, the latter has been neglected in previous works on STT-RAM caches. We propose a set remapping scheme that can potentially prolong the lifetime of an MLC STT-RAM cache by 80× on average. Furthermore, a method for recovering the performance that may be lost in some applications due to set remapping is proposed. The impact of process variations in the MLC STT-RAM cell on the robustness of the memory hierarchy is also discussed, together with various enhancement techniques, namely ECC and design redundancy.
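As a small illustration of the ECC enhancement mentioned above, a single-error-correcting Hamming(7,4) code can mask an isolated bit error in a cell; the code below is the textbook construction, not one the article prescribes:

```python
# Minimal Hamming(7,4) single-error-correcting code, the textbook form of the
# kind of ECC that can mask an isolated process-variation-induced bit error.

def encode(d1, d2, d3, d4):
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """Correct up to one flipped bit, then return the four data bits."""
    syndrome = 0
    for pos, bit in enumerate(c, start=1):
        if bit:
            syndrome ^= pos     # XOR of set positions; 0 for a valid codeword
    if syndrome:
        c[syndrome - 1] ^= 1    # a nonzero syndrome is the error position
    return [c[2], c[4], c[5], c[6]]

word = encode(1, 0, 1, 1)
word[3] ^= 1                    # inject a single-bit error
print(decode(word))             # [1, 0, 1, 1] recovered
```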


ACM SIGARCH Computer Architecture News | 2005

Dynamic memory optimization using pool allocation and prefetching

Qin Zhao; Rodric M. Rabbah; Weng-Fai Wong

Heap memory allocation plays an important role in modern applications. Conventional heap allocators, however, generally ignore the underlying memory hierarchy of the system, favoring instead low runtime overhead and fast response times. Unfortunately, with little concern for the memory hierarchy, the data layout may exhibit poor spatial locality and degrade cache performance. In this paper, we describe a dynamic heap allocation scheme called pool allocation. The strategy aims to improve cache performance by inspecting memory allocation requests and allocating memory from appropriate heap pools as dictated by the requesting context. The advantages are twofold. First, by pooling together data with a common context, we expect to improve spatial locality, as data fetched into the caches will contain fewer items from different contexts. If the allocation patterns closely match the traversal patterns, the end result is faster memory performance. Second, by pooling heap objects, we expect access patterns to exhibit more regularity, thus creating more opportunities for data prefetching. Our dynamic memory optimizer exploits the increased regularity to insert prefetch instructions at runtime. The optimizations are implemented in DynamoRIO, a dynamic optimization framework. We evaluate the work using various benchmarks, and measure a 17% speedup over gcc -O3 on an Athlon MP, and a 13% speedup on a Pentium 4.
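The context-directed pooling idea can be sketched as an allocator keyed by allocation site; the chunk size and symbolic addresses below are hypothetical simplifications:

```python
# Sketch of context-keyed pool allocation: requests from the same allocation
# site share a pool, so objects traversed together tend to sit together.
# Symbolic addresses and the chunk size are illustrative simplifications.

from collections import defaultdict

POOL_CHUNK = 4096     # bytes reserved per pool extension (assumed)

class PoolAllocator:
    def __init__(self):
        self.pools = defaultdict(lambda: {"chunk": 0, "free": 0, "offset": 0})

    def alloc(self, site, size):
        """site: an allocation-context key, e.g. the call-site address."""
        pool = self.pools[site]
        if pool["free"] < size:
            pool["chunk"] += 1             # grow this site's pool by one chunk
            pool["free"] = POOL_CHUNK
            pool["offset"] = 0
        addr = (site, pool["chunk"], pool["offset"])   # symbolic "address"
        pool["offset"] += size
        pool["free"] -= size
        return addr

allocator = PoolAllocator()
# List nodes allocated at one site land contiguously in that site's pool:
nodes = [allocator.alloc(site=0x4005A0, size=24) for _ in range(3)]
print(nodes)          # offsets 0, 24, 48 within the same chunk
```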

Collaboration


Dive into Weng-Fai Wong's collaborations. Top co-authors:

Chung-Kwong Yuen, National University of Singapore
Andrei Hagiescu, National University of Singapore
Yongxin Zhu, National University of Singapore
Beng Chin Ooi, National University of Singapore
Tulika Mitra, National University of Singapore