Feihui Li

Pennsylvania State University

Publication

Featured research published by Feihui Li.

International Symposium on Computer Architecture | 2006

Design and Management of 3D Chip Multiprocessors Using Network-in-Memory

Feihui Li; Chrysostomos Nicopoulos; Thomas Richardson; Yuan Xie; Vijaykrishnan Narayanan; Mahmut T. Kandemir

Long interconnects are becoming an increasingly important problem from both power and performance perspectives. This motivates designers to adopt on-chip network-based communication infrastructures and three-dimensional (3D) designs where multiple device layers are stacked together. Considering the current trend towards increasing use of chip multiprocessing, it is timely to consider 3D chip multiprocessor design and memory networking issues, especially in the context of data management in large L2 caches. The overall goal of this paper is to study the challenges for L2 design and management in 3D chip multiprocessors. Our first contribution is to propose a router architecture and a topology design that make use of a network architecture embedded into the L2 cache memory. Our second contribution is to demonstrate, through extensive experiments, that a 3D L2 memory architecture generates much better results than conventional two-dimensional (2D) designs under different numbers of layers and vertical (inter-wafer) connections. In particular, our experiments show that a 3D architecture with no dynamic data migration generates better performance than a 2D architecture that employs data migration. This also helps reduce power consumption in L2 due to a reduced number of data movements.

Design Automation Conference | 2008

Application mapping for chip multiprocessors

Guangyu Chen; Feihui Li; Seung Woo Son; Mahmut T. Kandemir

The problem addressed in this paper is that of automatically mapping an application onto a network-on-chip (NoC) based chip multiprocessor (CMP) architecture in a locality-aware fashion. The proposed compiler approach has four major steps: task scheduling, processor mapping, data mapping, and packet routing. In the first step, the application code is parallelized and the resulting parallel threads are assigned to virtual processors. The second step implements a virtual processor-to-physical processor mapping. The goal of this mapping is to ensure that the threads that are expected to communicate frequently with each other are assigned to neighboring processors as much as possible. In the third step, data elements are mapped to memories attached to CMP nodes. The main objective of this mapping is to place a given data item into a node which is close to the nodes that access it. The last step of our approach determines the paths (between memories and processors) for data to travel in an energy-efficient manner. In this paper, we describe the compiler algorithms we implemented in detail and present an experimental evaluation of the framework. In our evaluation, we test our entire framework as well as the impact of omitting some of its steps. This experimental analysis clearly shows that the proposed framework reduces energy consumption of our applications significantly (27.41% on average over a pure performance-oriented application mapping strategy) as a result of improved locality of data accesses.
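
The processor-mapping step described above can be illustrated with a small sketch. This is not the paper's compiler algorithm: it assumes a 2D mesh, uses traffic volume times Manhattan distance as a stand-in cost, and searches exhaustively, which is only feasible for a handful of threads. The names `mapping_cost` and `best_mapping` are hypothetical.

```python
from itertools import permutations

def manhattan(a, b):
    """Hop count between two mesh nodes given as (row, col) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mapping_cost(mapping, traffic, coords):
    """Sum of traffic volume times hop distance for one thread-to-node mapping.

    mapping[t] is the index into coords of the node that runs thread t.
    traffic maps (src_thread, dst_thread) to an assumed communication volume.
    """
    return sum(
        volume * manhattan(coords[mapping[src]], coords[mapping[dst]])
        for (src, dst), volume in traffic.items()
    )

def best_mapping(num_threads, traffic, coords):
    """Exhaustively pick the mapping with the lowest cost (illustration only)."""
    best = None
    for candidate in permutations(range(len(coords)), num_threads):
        cost = mapping_cost(candidate, traffic, coords)
        if best is None or cost < best[0]:
            best = (cost, candidate)
    return best
```

Under this cost, threads that exchange the most data end up on adjacent nodes, which is the locality goal the abstract states; a real compiler would need a heuristic rather than exhaustive search.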

IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

A novel migration-based NUCA design for chip multiprocessors

Mahmut T. Kandemir; Feihui Li; Mary Jane Irwin; Seung Woo Son

Chip Multiprocessors (CMPs) and Non-Uniform Cache Architectures (NUCAs) represent two emerging trends in computer architecture. Targeting future CMP-based systems with NUCA-type L2 caches, this paper proposes and evaluates a novel data migration algorithm for parallel applications. The goal of this migration scheme is to determine a suitable location for each data block within a large L2 space at any given point during execution. A unique characteristic of the proposed scheme is that it models the problem of optimal data placement in the L2 cache space as a two-dimensional post office placement problem, presents a practical architectural implementation of this model, and gives a detailed evaluation of the proposed implementation. In our experimental evaluation, we also compare our approach to a previously proposed NUCA management scheme using applications from the SPEComp suite, OLTP, SPECjbb, and SPECweb. These experiments show that our migration approach generates about 35% improvement, on average, in average L2 access latency over the previous migration scheme, and these L2 latency savings translate, on average, to 9.5% improvement in IPC (instructions per cycle). We also observed during our experiments that both the careful initial placement of data (which itself triggers migrations within the L2 space) and subsequent migrations (due to inter-processor data sharing) play an important role in achieving our performance improvements.
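
As a rough illustration of the two-dimensional post office placement idea, the sketch below picks a location for one data block from the positions of the cores that access it. The abstract does not state the distance metric or the hardware mechanism, so this assumes Manhattan distance on a mesh and access counts as weights; under that assumption the best location is the weighted median of each coordinate taken separately. The function names are hypothetical.

```python
def weighted_median(values_weights):
    """Return a value m minimizing sum(w * |v - m|) over (value, weight) pairs."""
    ordered = sorted(values_weights)
    total = sum(weight for _, weight in ordered)
    running = 0
    for value, weight in ordered:
        running += weight
        if 2 * running >= total:
            return value
    raise ValueError("no accesses recorded")

def best_bank(accesses):
    """accesses: list of ((x, y), count) for the cores touching one block.

    Manhattan distance separates by axis, so each coordinate is solved
    independently with a weighted median.
    """
    xs = [(x, count) for (x, _), count in accesses]
    ys = [(y, count) for (_, y), count in accesses]
    return weighted_median(xs), weighted_median(ys)

def placement_cost(bank, accesses):
    """Total access-weighted Manhattan distance from every core to the bank."""
    bx, by = bank
    return sum(count * (abs(x - bx) + abs(y - by)) for (x, y), count in accesses)
```

A block accessed mostly by one core is pulled to that core's position, while a block shared evenly by several cores settles between them.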

Design, Automation, and Test in Europe | 2005

Compiler-Directed Instruction Duplication for Soft Error Detection

Jie S. Hu; Feihui Li; Vijay Degalahal; Mahmut T. Kandemir; Narayanan Vijaykrishnan; Mary Jane Irwin

We experiment with compiler-directed instruction duplication to detect soft errors in VLIW datapaths. In the proposed approach, the compiler determines the instruction schedule by balancing the permissible performance degradation with the required degree of duplication. Our experimental results show that our algorithms allow the designer to perform tradeoff analysis between performance and reliability.

ACM Transactions on Embedded Computing Systems | 2009

Compiler-assisted soft error detection under performance and energy constraints in embedded systems

Jie S. Hu; Feihui Li; Vijay Degalahal; Mahmut T. Kandemir; Narayanan Vijaykrishnan; Mary Jane Irwin

Soft errors induced by terrestrial radiation are becoming a significant concern in architectures designed in newer technologies. If left undetected, these errors can result in catastrophic consequences or costly maintenance problems in different embedded applications. In this article, we focus on utilizing the compiler's help in duplicating instructions for error detection in VLIW datapaths. The instruction duplication mechanism is further supported by a hardware enhancement for efficient result verification, which avoids the need for additional comparison instructions. In the proposed approach, the compiler determines the instruction schedule by balancing the permissible performance degradation and the energy constraint with the required degree of duplication. Our experimental results show that our algorithms allow the designer to perform trade-off analysis between performance, reliability, and energy consumption.

Languages, Compilers, and Tools for Embedded Systems | 2006

Compiler-directed thermal management for VLIW functional units

Madhu Mutyam; Feihui Li; Vijaykrishnan Narayanan; Mahmut T. Kandemir; Mary Jane Irwin

As processors, memories, and other components of today's embedded systems are pushed to higher performance in more enclosed spaces, processor thermal management is quickly becoming a limiting design factor. While previous proposals mostly approached this thermal management problem from circuit and architecture angles, software can also play an important role in identifying and eliminating thermal hotspots, as it is the main factor that shapes the order and frequency of accesses to different hardware components in the chip. This is particularly true for compiler-scheduled Very Long Instruction Word (VLIW) datapaths. In this paper, we focus on a compiler-based approach to make the thermal profile more balanced in the integer functional units of VLIW architectures. For balanced thermal behavior and peak temperature minimization, we propose techniques based on load balancing across the integer functional units with or without rotation of functional unit usage. As leakage power is exponentially dependent on temperature and temperature is dependent on total power (i.e., switching and leakage), our techniques also consider leakage power optimization by tuning IPC (instructions issued per cycle). By taking a code that is already scheduled for maximum performance as input, our scheduling strategies modify this performance-oriented schedule for balanced thermal behavior with negligible performance degradation. We simulate our scheduling strategies using a framework that consists of the Trimaran infrastructure, a power model, and HotSpot. Our experimental results using several benchmark programs reveal that the peak temperature can be reduced through compiler scheduling.
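
The effect of rotating functional unit usage can be sketched with a toy model. This is an assumption-laden illustration, not the paper's scheduler: it counts how often each integer unit is used as a crude proxy for heat, assumes every cycle issues no more operations than there are units, and uses the hypothetical name `unit_usage`.

```python
def unit_usage(issue_widths, num_units, rotate):
    """Count per-unit activity for a sequence of cycles.

    issue_widths[c] is the number of integer operations issued in cycle c
    (assumed to be at most num_units). Without rotation, operations always
    fill units starting from unit 0. With rotation, each cycle starts at the
    unit after the last one used in the previous cycle.
    """
    usage = [0] * num_units
    start = 0
    for width in issue_widths:
        for slot in range(width):
            usage[(start + slot) % num_units] += 1
        if rotate:
            start = (start + width) % num_units
    return usage
```

With a low-IPC schedule, the fixed assignment concentrates activity on the first few units, while rotation spreads the same number of operations more evenly, which is the intuition behind balancing the thermal profile.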

Symposium on Principles of Programming Languages | 2006

Compiler-directed channel allocation for saving power in on-chip networks

Guangyu Chen; Feihui Li; Mahmut T. Kandemir

Increasing complexity in the communication patterns of embedded applications parallelized over multiple processing units makes it difficult to continue using the traditional bus-based on-chip communication techniques. The main contribution of this paper is to demonstrate the importance of compiler technology in reducing power consumption of applications designed for emerging multiprocessor, NoC (Network-on-Chip) based embedded systems. Specifically, we propose and evaluate a compiler-directed approach to NoC power management in the context of array-intensive applications, used frequently in embedded image/video processing. The unique characteristic of the compiler-based approach proposed in this paper is that it increases the idle periods of communication channels by reusing the same set of channels for as many communication messages as possible. The unused channels in this case take better advantage of the underlying power-saving mechanism employed by the network architecture. However, this channel reuse optimization should be applied with care, as it can hurt performance if two or more simultaneous communications are mapped onto the same set of channels. Therefore, the problem addressed in this paper is that of reducing the number of channels used to implement a set of communications without increasing the communication latency significantly. To test the effectiveness of our approach, we implemented it within an optimizing compiler and performed experiments using twelve application codes and a network simulation environment. Our experiments show that the proposed compiler-based approach is very successful in practice and works well under both hardware-based and software-based channel turn-off schemes.

Design, Automation, and Test in Europe | 2006

Dynamic partitioning of processing and memory resources in embedded MPSoC architectures

Liping Xue; Ozcan Ozturk; Feihui Li; Mahmut T. Kandemir; Ibrahim Kolcu

Current trends indicate that multiprocessor-system-on-chip (MPSoC) architectures are being increasingly used in building complex embedded systems. While circuit/architectural support for MPSoC-based systems is making significant strides, programming these devices and providing suitable software support (e.g., compilers and operating systems) seem to be a tougher problem. This is because either programmers or compilers will have to make code explicitly parallel to run on these systems. An additional difficulty occurs when multiple applications use an MPSoC at the same time, because MPSoC resources should be partitioned across these applications carefully. This paper explores a proactive resource partitioning scheme for parallel applications simultaneously exercising the same MPSoC system. The proposed approach has two major components. The first component is an offline preprocessing of applications, which gives us an estimated profile for each application. Each application to be executed on our MPSoC is profiled and annotated with the profile information. The second component of our approach is an online resource partitioning, which partitions both the processing cores (i.e., computation resources) and on-chip memory space (i.e., storage resource) among simultaneously executing applications. Our experimental evaluation with this partitioner shows that it generates much better results than conventional operating system based resource management. The results also reveal that both memory partitioning and processor partitioning are very important for obtaining the best results.

Programming Language Design and Implementation | 2006

Reducing NoC energy consumption through compiler-directed channel voltage scaling

Guangyu Chen; Feihui Li; Mahmut T. Kandemir; Mary Jane Irwin

While scalable NoC (Network-on-Chip) based communication architectures have clear advantages over long point-to-point communication channels, their power consumption can be very high. In contrast to most of the existing hardware-based efforts on NoC power optimization, this paper proposes a compiler-directed approach where the compiler decides the appropriate voltage/frequency levels to be used for each communication channel in the NoC. Our approach builds and operates on a novel graph-based representation of a parallel program and has been implemented within an optimizing compiler and tested using 12 embedded benchmarks. Our experiments indicate that the proposed approach behaves better, from both performance and power perspectives, than a hardware-based scheme, and the energy savings it achieves are very close to the savings that could be obtained from an optimal but hypothetical voltage/frequency scaling scheme.

International Conference on Computer Aided Design | 2005

Improving scratch-pad memory reliability through compiler-guided data block duplication

Feihui Li; Guilin Chen; Mahmut T. Kandemir; Ibrahim Kolcu

Recent trends in embedded computing indicate an increasing use of scratch-pad memories (SPMs) as on-chip storage for instructions and data. An important characteristic of these memory components is that they are managed by software instead of hardware. Ever-scaling process technology and the employment of several power-saving techniques in embedded systems (e.g., voltage scaling) make these systems particularly vulnerable to soft errors and other transient errors. Therefore, it is very important in practice to consider the impact of soft errors in SPMs. While it is possible to employ classical memory protection mechanisms such as parity checks and ECC, each of these has its drawbacks. Specifically, a pure parity-based protection cannot correct any errors, and ECC can be overkill in the normal operation state when no soft error is experienced. This paper proposes an alternative approach to protect SPMs against soft errors. The proposed approach is based on data block duplication under compiler control. More specifically, an optimizing compiler duplicates data blocks within the SPM and protects each data block by parity if such duplication does not hurt performance. The goal of this scheme is to provide only parity protection for data blocks (and reduce the overheads at runtime when no error occurs) but correct errors using the duplicate (when an error occurs in the primary copy), provided that the duplicate is not corrupted.
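
The parity-plus-duplicate idea can be sketched in software as follows. This is a minimal model under stated assumptions, not the paper's compiler or hardware design: it keeps one even-parity bit per block, stores an optional duplicate, and falls back to the duplicate only when the primary fails its parity check. The class name `ProtectedBlock` and its fields are hypothetical.

```python
def parity(block):
    """Even-parity bit computed over every bit of the block."""
    bit = 0
    for byte in block:
        bit ^= bin(byte).count("1") & 1
    return bit

class ProtectedBlock:
    """A data block with parity, plus an optional duplicate used for recovery."""

    def __init__(self, data, duplicate):
        self.primary = bytearray(data)
        self.primary_parity = parity(data)
        # The duplicate is assumed to exist only where the compiler decided
        # that duplication would not hurt performance.
        self.copy = bytearray(data) if duplicate else None
        self.copy_parity = parity(data) if duplicate else None

    def read(self):
        # Common case: the primary passes its parity check, so no extra work.
        if parity(self.primary) == self.primary_parity:
            return bytes(self.primary)
        # Error detected: recover from the duplicate if it is intact.
        if self.copy is not None and parity(self.copy) == self.copy_parity:
            self.primary[:] = self.copy
            return bytes(self.primary)
        raise RuntimeError("uncorrectable soft error")
```

A single parity bit detects only an odd number of flipped bits, so this model shares that limitation; the point is that the error-free path costs one parity check, while correction is paid for only when an error is detected.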
