Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Hung-Wei Tseng is active.

Publication


Featured research published by Hung-Wei Tseng.


Design Automation Conference | 2011

Understanding the impact of power loss on flash memory

Hung-Wei Tseng; Laura M. Grupp; Steven Swanson

Flash memory is quickly becoming a common component in computer systems ranging from music players to mission-critical server systems. As flash plays a more important role, data integrity in flash memories becomes a critical question. This paper examines one aspect of that data integrity by measuring the types of errors that occur when power fails during a flash memory operation. Our findings demonstrate that power failure can lead to several non-intuitive behaviors. We find that increasing the time before power failure does not always reduce error rates and that a power failure during a program operation can corrupt data that a previous, successful program operation wrote to the device. Our data also show that interrupted program operations leave data more susceptible to read disturb and increase the probability that the programmed data will decay over time. Finally, we show that incomplete erase operations make future program operations to the same block unreliable.


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2008

Energy-Aware Flash Memory Management in Virtual Memory System

Han-Lin Li; Chia-Lin Yang; Hung-Wei Tseng

The traditional virtual memory system was designed decades ago assuming a magnetic disk as the secondary storage. Recently, flash memory has become a popular storage alternative for many portable devices, thanks to continuing improvements in its capacity and reliability and its much lower power consumption than mechanical hard drives. The characteristics of flash memory are quite different from those of a magnetic disk. Therefore, in this paper, we revisit virtual memory system design considering the limitations imposed by flash memory. In particular, we focus on energy efficiency, since power is a first-order design consideration for embedded systems. Due to the write-once nature of flash memory, frequent writes incur frequent garbage collection, thereby introducing significant energy overhead. We therefore propose three methods to reduce writes to flash memory. The HotCache scheme adds an SRAM cache to buffer frequent writes. The subpaging technique partitions a page into subunits so that only dirty subpages are written to flash memory. The duplication-aware garbage collection method exploits data redundancy between the main memory and flash memory to reduce writes incurred by garbage collection. We also identify one type of data locality that is inherent in accesses to flash memory in the virtual memory system: intrapage locality. Intrapage locality needs to be carefully maintained for data allocation in flash memory; destroying it causes noticeable increases in energy consumption. Experimental results show that the combined subpaging, HotCache, and duplication-aware garbage collection techniques reduce energy by an average of 42.2%.
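
Of the three techniques, subpaging is the most direct to sketch in code. Below is a minimal illustration, assuming a 4 KB page split into 512-byte subpages with a per-subpage dirty bitmap; the names, sizes, and the flash_program hook are illustrative, not taken from the paper.

    /* Subpaging sketch: track dirty subpages and write back only those. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE    4096
    #define SUBPAGE_SIZE 512
    #define SUBPAGES     (PAGE_SIZE / SUBPAGE_SIZE)

    struct vm_page {
        uint8_t data[PAGE_SIZE];
        uint8_t dirty;                /* bitmap: one bit per subpage */
    };

    /* A write marks only the subpage containing offset `off` dirty. */
    static void page_write(struct vm_page *p, size_t off, uint8_t byte)
    {
        p->data[off] = byte;
        p->dirty |= 1u << (off / SUBPAGE_SIZE);
    }

    /* Hypothetical driver hook; a real system would program flash here. */
    static void flash_program(size_t subpage, const uint8_t *buf)
    {
        (void)buf;
        printf("programming subpage %zu (%d bytes)\n", subpage, SUBPAGE_SIZE);
    }

    /* Eviction writes back dirty subpages only, cutting flash writes
     * and, in turn, garbage-collection energy. */
    static void page_evict(struct vm_page *p)
    {
        for (size_t i = 0; i < SUBPAGES; i++)
            if (p->dirty & (1u << i))
                flash_program(i, p->data + i * SUBPAGE_SIZE);
        p->dirty = 0;
    }

    int main(void)
    {
        struct vm_page p;
        memset(&p, 0, sizeof p);
        page_write(&p, 10, 0xAB);     /* dirties subpage 0 */
        page_write(&p, 4000, 0xCD);   /* dirties subpage 7 */
        page_evict(&p);               /* programs 2 of 8 subpages */
        return 0;
    }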


High-Performance Computer Architecture | 2011

Data-triggered threads: Eliminating redundant computation

Hung-Wei Tseng; Dean M. Tullsen

This paper introduces the concept of data-triggered threads. Unlike threads in conventional parallel programming models, these threads are initiated by a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation. This paper focuses primarily on the latter. It is shown that 78% of all loads fetch redundant data, leading to a high incidence of redundant computation. By expressing computation through data-triggered threads, that computation is executed once when the data changes and is skipped whenever the data does not change. On the C SPEC benchmarks, the model shows performance speedups of up to 5.9X, averaging 46%.
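
The redundancy-elimination half of the model can be illustrated without the proposed architecture support. A minimal sketch, assuming an inline changed-data check guards the computation (the paper instead initiates a thread on a store to a marked memory location):

    /* Re-run a computation only when its input actually changed;
     * otherwise reuse the cached result. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t expensive(uint64_t x)     /* stand-in computation */
    {
        uint64_t acc = 0;
        for (uint64_t i = 0; i < 1000000; i++)
            acc += (x ^ i) * 2654435761u;
        return acc;
    }

    int main(void)
    {
        uint64_t input = 42;
        uint64_t last_input = ~input;         /* force the first run */
        uint64_t cached = 0;

        for (int iter = 0; iter < 3; iter++) {
            if (input != last_input) {        /* data changed: trigger */
                cached = expensive(input);
                last_input = input;
                puts("recomputed");
            } else {                          /* unchanged: skip */
                puts("skipped redundant computation");
            }
        }
        printf("result = %llu\n", (unsigned long long)cached);
        return 0;
    }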


ACM Transactions on Architecture and Code Optimization | 2004

Tolerating memory latency through push prefetching for pointer-intensive applications

Chia-Lin Yang; Alvin R. Lebeck; Hung-Wei Tseng; Chien-Hao Lee

Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of their irregular memory access patterns and the pointer-chasing problem. In this paper, we propose a cooperative hardware/software prefetching framework, the push architecture, which is designed specifically for linked data structures (LDS). The push architecture exploits program structure for future address generation instead of relying on past address history. It identifies the load instructions that traverse an LDS and uses a prefetch engine to execute them ahead of the CPU. This allows the prefetch engine to successfully generate future addresses. To overcome the serial nature of LDS address generation, the push architecture employs a novel data movement model. It attaches a prefetch engine to each level of the memory hierarchy and pushes, rather than pulls, data to the CPU. This push model decouples the pointer dereference from the transfer of the current node up to the processor; thus a series of pointer dereferences becomes a pipelined process rather than a serial one. Simulation results show that the push architecture can eliminate up to 100% of memory stall time on a suite of pointer-intensive applications, reducing overall execution time by an average of 15%.
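
For contrast with the push model, the sketch below shows conventional pull-style software prefetching over a linked list, using the GCC/Clang __builtin_prefetch hint. It hides at most one node of latency per step, which is exactly the serialization that the push architecture's pipelined data movement removes.

    /* Pull-style prefetch during a linked-list traversal. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { long val; struct node *next; };

    static long sum_list(struct node *n)
    {
        long s = 0;
        while (n) {
            if (n->next)
                __builtin_prefetch(n->next, 0, 1);  /* hint next node */
            s += n->val;
            n = n->next;
        }
        return s;
    }

    int main(void)
    {
        struct node *head = NULL;
        for (long i = 0; i < 100; i++) {            /* build a list */
            struct node *n = malloc(sizeof *n);
            n->val = i;
            n->next = head;
            head = n;
        }
        printf("sum = %ld\n", sum_list(head));
        while (head) {                              /* release the list */
            struct node *next = head->next;
            free(head);
            head = next;
        }
        return 0;
    }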


International Symposium on Low Power Electronics and Design | 2006

An energy-efficient virtual memory system with flash memory as the secondary storage

Hung-Wei Tseng; Han-Lin Li; Chia-Lin Yang

The traditional virtual memory system was designed decades ago assuming a magnetic disk as the secondary storage. Recently, flash memory has become a popular storage alternative for many portable devices, thanks to continuing improvements in its capacity and reliability and its much lower power consumption than mechanical hard drives. NAND flash memory is organized into blocks, and each block contains a set of pages. The characteristics of flash memory are quite different from those of a magnetic disk. Therefore, in this paper, we revisit virtual memory system design considering the limitations imposed by flash memory. In particular, we study the effects of the subpaging technique and storage cache management. In a traditional virtual memory system, a full page is written back to the secondary storage on a page fault. We find that this can result in unnecessary writes, thereby wasting energy. The subpaging technique, which partitions a page into subunits so that only dirty subpages are written to flash memory, benefits energy efficiency. For storage cache management, unlike traditional disk cache management, care needs to be taken to guarantee that the flash pages of a main memory page are replaced from the cache in sequence. Experimental results show that the combined subpaging and caching techniques reduce energy by an average of 35.6%.


International Symposium on Computer Architecture | 2016

Morpheus: creating application objects efficiently for heterogeneous computing

Hung-Wei Tseng; Qianchen Zhao; Yuxiao Zhou; Mark Gahagan; Steven Swanson

In high-performance computing systems, object deserialization can become a surprisingly important bottleneck: in our tests, a set of general-purpose, highly parallelized applications spends 64% of its total execution time deserializing data into objects. This paper presents the Morpheus model, which allows applications to move such computation to a storage device. We use this model to deserialize data into application objects inside storage devices, rather than in the host CPU. Using the Morpheus model for object deserialization avoids unnecessary system overheads, frees up scarce CPU and main memory resources for compute-intensive workloads, saves I/O bandwidth, and reduces power consumption. In heterogeneous, co-processor-equipped systems, Morpheus allows application objects to be sent directly from a storage device to a co-processor (e.g., a GPU) by peer-to-peer transfer, further improving application performance and reducing CPU and main memory utilization. This paper implements Morpheus-SSD, an SSD supporting the Morpheus model. Morpheus-SSD improves the performance of object deserialization by 1.66×, reduces power consumption by 7%, uses 42% less energy, and speeds up total execution time by 1.32×. By using NVMe-P2P, which realizes peer-to-peer communication between Morpheus-SSD and a GPU, Morpheus-SSD can speed up total execution time by 1.39× in a heterogeneous computing platform.
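
The sketch below is a hedged illustration of the offload pattern, not the paper's actual interface: the host describes the record layout in a hypothetical command structure, and a stand-in function plays the role of the in-device deserializer.

    /* Morpheus-style offload sketch: host describes the layout, the
     * "device" materializes application objects from raw bytes. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct deser_cmd {            /* hypothetical device command */
        uint64_t src_lba;         /* where the serialized data lives */
        uint32_t record_bytes;    /* size of one serialized record */
        uint32_t count;           /* records to materialize */
    };

    struct point { int32_t x, y; };  /* the application object */

    /* Stand-in for the in-device deserializer: parse fixed-size records
     * into objects, as SSD firmware would before DMA-ing them out. */
    static void device_deserialize(const uint8_t *raw, struct deser_cmd c,
                                   struct point *out)
    {
        for (uint32_t i = 0; i < c.count; i++) {
            memcpy(&out[i].x, raw + i * c.record_bytes,     4);
            memcpy(&out[i].y, raw + i * c.record_bytes + 4, 4);
        }
    }

    int main(void)
    {
        uint8_t raw[16];
        int32_t vals[4] = {1, 2, 3, 4};
        memcpy(raw, vals, sizeof vals);   /* fake serialized input */

        struct point objs[2];
        struct deser_cmd cmd = { .src_lba = 0, .record_bytes = 8, .count = 2 };
        device_deserialize(raw, cmd, objs);
        printf("(%d,%d) (%d,%d)\n", objs[0].x, objs[0].y, objs[1].x, objs[1].y);
        return 0;
    }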


Conference on Object-Oriented Programming Systems, Languages, and Applications | 2012

Software data-triggered threads

Hung-Wei Tseng; Dean M. Tullsen

The data-triggered threads (DTT) programming and execution model can increase parallelism and eliminate redundant computation. However, the initial proposal requires significant architectural support, which prevents existing applications and architectures from taking advantage of the model. This work proposes a pure software solution that supports the DTT model without any hardware support. This research uses a prototype compiler and runtime libraries running on top of existing machines. Several enhancements to the initial software implementation are presented, which further improve performance. The software runtime system improves the performance of serial C SPEC benchmarks by 15% on a Nehalem processor, and by over 7X across the full suite of single-thread applications. It is shown that the DTT model can work in conjunction with traditional parallelism, providing up to 64X speedup over parallel applications that exploit only traditional parallelism.
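
A minimal sketch of the software flavor of the model, assuming the runtime fires a support thread only when a store actually changes the tracked value; the dtt_store helper is hypothetical, and the prototype's compiler instrumentation and thread management are elided.

    /* Fire a support thread on a changing store; skip it otherwise. */
    #include <pthread.h>
    #include <stdio.h>

    static int tracked = 42;      /* data whose change triggers the thread */
    static int last_seen;
    static long result;

    static void *support_thread(void *arg)
    {
        (void)arg;
        long s = 0;
        for (int i = 0; i < tracked; i++)    /* stand-in computation */
            s += i;
        result = s;
        return NULL;
    }

    /* Called after a store to `tracked`; spawns a thread only on change. */
    static int dtt_store(int new_val, pthread_t *tid)
    {
        tracked = new_val;
        if (new_val == last_seen)
            return 0;                        /* redundant store: skip */
        last_seen = new_val;
        pthread_create(tid, NULL, support_thread, NULL);
        return 1;
    }

    int main(void)
    {
        pthread_t tid;
        last_seen = ~tracked;                /* force the first trigger */
        if (dtt_store(42, &tid))
            pthread_join(tid, NULL);
        printf("result = %ld\n", result);
        if (!dtt_store(42, &tid))            /* same value: reuse result */
            printf("skipped, reused %ld\n", result);
        return 0;
    }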


Very Large Data Bases | 2016

HippogriffDB: balancing I/O and GPU bandwidth in big data analytics

Jing Li; Hung-Wei Tseng; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson

As data sets grow and conventional processor performance scaling slows, data analytics is moving toward heterogeneous architectures that incorporate hardware accelerators (notably GPUs) to continue scaling performance. However, existing GPU-based databases fail to handle big data applications efficiently: their execution model suffers from scalability limitations on GPUs, whose memory capacity is limited, and they fail to consider the discrepancy between fast GPUs and slow storage, which can counteract the benefit of GPU accelerators. In this paper, we propose HippogriffDB, an efficient, scalable GPU-accelerated OLAP system. It tackles the bandwidth discrepancy using compression and an optimized data transfer path. HippogriffDB stores tables in a compressed format and uses the GPU for decompression, trading GPU cycles for improved I/O bandwidth. To improve data transfer efficiency, HippogriffDB introduces a peer-to-peer, multi-threaded data transfer mechanism that directly transfers data from the SSD to the GPU. HippogriffDB adopts a query-over-block execution model that provides scalability using a stream-based approach; the model improves kernel efficiency with operator fusion and a double-buffering mechanism. We have implemented HippogriffDB using an NVMe SSD that talks directly to a commercial GPU. Results on two popular benchmarks demonstrate its scalability and efficiency: HippogriffDB outperforms existing GPU-based databases (YDB) and in-memory data analytics (MonetDB) by one to two orders of magnitude.
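
The query-over-block streaming with double buffering can be sketched as follows; the SSD read and the decompress-plus-query step are stand-in functions, and in the real system the read would be an asynchronous peer-to-peer transfer that genuinely overlaps with GPU work.

    /* Double-buffered block streaming: fetch block i+1 while block i
     * is being processed. */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK   4096
    #define NBLOCKS 4

    static char storage[NBLOCKS][BLOCK];       /* fake SSD contents */

    static void read_block(int i, char *dst)   /* stand-in SSD read */
    {
        memcpy(dst, storage[i], BLOCK);
    }

    static long process_block(const char *buf) /* stand-in decompress+query */
    {
        long s = 0;
        for (int i = 0; i < BLOCK; i++)
            s += (unsigned char)buf[i];
        return s;
    }

    int main(void)
    {
        static char buf[2][BLOCK];
        long total = 0;

        for (int i = 0; i < NBLOCKS; i++)      /* fill the fake SSD */
            memset(storage[i], i + 1, BLOCK);

        read_block(0, buf[0]);                 /* prime the pipeline */
        for (int i = 0; i < NBLOCKS; i++) {
            if (i + 1 < NBLOCKS)               /* overlaps with compute
                                                  when the read is async */
                read_block(i + 1, buf[(i + 1) & 1]);
            total += process_block(buf[i & 1]);
        }
        printf("aggregate = %ld\n", total);
        return 0;
    }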


Integrating Technology into Computer Science Education | 2013

Evaluating student understanding of core concepts in computer architecture

Leo Porter; Saturnino Garcia; Hung-Wei Tseng; Daniel Zingaro

Many studies have demonstrated that students tend to learn less than instructors expect in CS1. In light of these studies, a natural question is: to what extent do these results hold for subsequent, upper-division computer science courses? In this paper we describe our work in creating high-level concept questions for an upper-division computer architecture course. The questions were designed and agreed upon by subject-matter and teaching experts to measure desired minimum proficiency of students post-course. These questions were administered to four separate computer architecture courses at two different institutions: a large public university and a small liberal arts college. Our results show that students in these courses were indeed not learning as much as the instructors expected, performing poorly overall: the per-question average was only 56%, with many questions showing no statistically significant improvement from pre-course to post-course. While these results follow the trend from CS1 courses, they are still somewhat surprising given that the courses studied were taught using research-based pedagogy that is known to be effective across the CS curriculum. We discuss implications of our findings and offer possible future directions of this work.


IEEE Transactions on Circuits and Systems for Video Technology | 2005

Software-controlled cache architecture for energy efficiency

Chia-Lin Yang; Hung-Wei Tseng; Chia-Chiang Ho; Ja-Ling Wu

Power consumption is an important design issue in current multimedia embedded systems. Data caches consume a significant portion of total processor power for multimedia applications because these applications are data intensive. In an integrated multimedia system, the cache architecture cannot be tuned specifically for a single application, so a significant amount of cache energy is wasted. In this paper, we propose a software-controlled cache architecture that improves the energy efficiency of the shared cache in an integrated multimedia system on an application-specific basis. Data types in an application are allocated to different cache regions, and on each access, only the allocated cache regions need to be activated. We test the effectiveness of the software-controlled cache on the MPEG-2 software decoder. The results show up to 40% cache energy reduction on an ARM-like cache architecture without sacrificing performance.
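
A hedged sketch of the mechanism, assuming cache regions correspond to ways and each data type carries a software-assigned way mask; real hardware would gate the energy of the unselected ways, while this code only models the bookkeeping.

    /* Per-data-type way masks: a lookup activates only the allocated ways. */
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4

    enum data_type { DT_LUMA, DT_CHROMA, DT_MISC };

    /* Masks that software might set up for an MPEG-2 decoder's data types. */
    static const uint8_t way_mask[] = {
        [DT_LUMA]   = 0x3,   /* ways 0-1 */
        [DT_CHROMA] = 0x4,   /* way 2 */
        [DT_MISC]   = 0x8,   /* way 3 */
    };

    /* Ways activated per access; dynamic cache energy scales with this. */
    static int ways_activated(enum data_type t)
    {
        int n = 0;
        for (int w = 0; w < WAYS; w++)
            if (way_mask[t] & (1u << w))
                n++;
        return n;
    }

    int main(void)
    {
        printf("luma access activates %d/%d ways\n",
               ways_activated(DT_LUMA), WAYS);    /* 2 of 4 */
        printf("chroma access activates %d/%d ways\n",
               ways_activated(DT_CHROMA), WAYS);  /* 1 of 4 */
        return 0;
    }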

Collaboration


Dive into Hung-Wei Tseng's collaborations.

Top Co-Authors

Steven Swanson (University of California)
Chia-Lin Yang (National Taiwan University)
Jing Li (Case Western Reserve University)
Mark Gahagan (University of California)
Yanqin Jin (University of California)
Han-Lin Li (National Taiwan University)
Laura M. Grupp (University of California)
Yang Liu (University of California)