Naifeng Jing | Researchain

Publication

Featured researches published by Naifeng Jing.

International Symposium on Computer Architecture | 2013

An energy-efficient and scalable eDRAM-based register file architecture for GPGPU

Naifeng Jing; Yao Shen; Yao Lu; Shrikanth Ganapathy; Zhigang Mao; Minyi Guo; Ramon Canal; Xiaoyao Liang

The heavily-threaded data processing demands of streaming multiprocessors (SM) in a GPGPU require a large register file (RF). The fast-increasing size of the RF makes the area cost and power consumption unaffordable for traditional SRAM designs in future technologies. In this paper, we propose to use embedded-DRAM (eDRAM) as an alternative in future GPGPUs. Compared with SRAM, eDRAM provides higher density and lower leakage power. However, the limited data retention time in eDRAM poses new challenges. Periodic refresh operations are needed to maintain data integrity. This is exacerbated by the scaling of eDRAM density, process variations and temperature. Unlike conventional CPUs, which make use of multi-ported RFs, most of the RFs in modern GPGPUs are heavily banked but not multi-ported, to reduce the hardware cost. This provides a unique opportunity to hide the refresh overhead. We propose two different eDRAM implementations based on 3T1D and 1T1C memory cells. To mitigate the impact of periodic refresh, we propose two novel refresh solutions using bank bubble and bank walk-through. In addition, for the 1T1C RF, we design an interleaved bank organization together with an intelligent warp scheduling strategy to reduce the impact of the destructive reads. The analysis shows that our schemes present better energy efficiency, scalability and variation tolerance than traditional SRAM-based designs.

International Symposium on Low Power Electronics and Design | 2013

Compiler assisted dynamic register file in GPGPU

Naifeng Jing; Haopeng Liu; Yao Lu; Xiaoyao Liang

The large Register File (RF) in General Purpose Graphic Processing Units (GPGPUs) demands tremendous chip area and energy consumption. For a sustainable growth of the size of RF in future GPGPUs, emerging on-chip memory technologies such as embedded-DRAM (eDRAM) have been proposed to replace the conventional SRAM for higher density and lower leakage but with the possible penalty from the periodic refresh operations. This paper explicitly shows that the refresh penalty can be effectively mitigated by leveraging the uniqueness of GPGPU operations. A compiler assisted refresh rescheduling policy can greatly reduce the refresh overhead for maintaining the correctness of the RF operations. The proposed scheme adequately exploits the features in both architecture and compilation, and delivers comparable performance to the SRAM counterpart. At the same time, the energy savings via the removal of large SRAM leakage well compensate for the additional refresh energy. This study promotes the eDRAM-based RF as a promising alternative that enables larger capacity and better power efficiency for future GPGPUs.

Field-Programmable Technology | 2012

Heterogeneous configuration memory scrubbing for soft error mitigation in FPGAs

Ju-Yueh Lee; Cheng-Ru Chang; Naifeng Jing; Juexiao Su; Shi-Jie Wen; Rich Wong; Lei He

In this paper, we present HCS - Heterogeneous CRAM Scrubbing - for FPGAs. By utilizing stochastic fault modeling for SEUs in CRAM, we present a quantitative estimate of system MTTF improvement through CRAM scrubbing. HCS then leverages the fact that different SEUs have unequal effects on the circuit system operation, and thus the CRAM bits can be scrubbed at different rates based on the sensitivity of the bits to the circuit system failures. To maximize the improvement on system MTTF for a given circuit system, we present a dynamic programming algorithm which solves the problem efficiently and effectively. Through a detailed system-level case study of an H.264/AVC decoder implemented on a Xilinx Virtex-5 FPGA, we show an estimated 60% MTTF improvement by HCS over the existing homogeneous CRAM scrubbing method, while adding virtually no area, performance or power overhead to the system.

Field-Programmable Logic and Applications | 2011

Quantitative SEU Fault Evaluation for SRAM-Based FPGA Architectures and Synthesis Algorithms

Naifeng Jing; Ju-Yueh Lee; Zhe Feng; Weifeng He; Zhigang Mao; Shi-Jie Wen; R. Wong; Lei He

This paper studies the SEU (Single Event Upset) fault for SRAM-based FPGAs. Considering detailed fault behavior on various circuit elements in a post-layout FPGA application, we develop a simulation-based SEU evaluation tool that quantifies fault contribution for each configuration bit. Using this tool and MCNC benchmark circuits, we study the fault characteristics of FPGA circuits and architectures. We show that interconnects not only contribute the lion's share of functional failures, but also have a higher failure rate per configuration bit than LUTs. In particular, multiplexers in local interconnects have the highest failure rate per bit. We find that tuning LUT and cluster sizes helps to reduce the rate (up to 38% in our experiments). In addition, we evaluate two recent fault mitigation algorithms, IPD and IPF, which reduce LUT faults by an average of 74% and 15% respectively. But when interconnects are taken into account, the reduction via IPD, which considers only LUT faults, is merely 6% at the chip level. Yet the reduction via IPF, which implicitly considers interconnect faults, is still around 15%. Therefore, synthesis algorithms should be evaluated with interconnect faults, and future algorithms should be developed with explicit consideration of interconnect faults.

International Conference on Computer Aided Design | 2011

Mitigating FPGA interconnect soft errors by in-place LUT inversion

Naifeng Jing; Ju-Yueh Lee; Weifeng He; Zhigang Mao; Lei He

Modern SRAM-based FPGAs (Field Programmable Gate Arrays) use multiplexer-based unidirectional routing, and SRAM configuration cells in these multiplexers contribute the majority of soft errors in FPGAs. In this paper, we formulate an In-Place inVersion (IPV) on LUT (Look-Up Table) logic polarities to reduce the Soft Error Rate (SER) at chip level, and reveal the locality and NP-hardness of the IPV problem. We then develop an exact algorithm based on binary integer linear programming (ILP) and also a heuristic based on simulated annealing (SA), both enabled by the locality. We report results for the 10 largest MCNC combinational benchmarks synthesized by ABC and then placed and routed by VPR. The results show that IPV obtains close to 4× chip level SER reduction on average and SA is highly effective, obtaining the same SER reduction as ILP does. A recent work, IPD, has the largest LUT level SER reduction of 2.7× in the literature, but its chip level SER reduction is merely 7% due to the dominance of interconnects. In contrast, SA-based IPV obtains nearly 4× chip level SER reduction and runs 30× faster. Furthermore, combining IPV and IPD leads to a chip level SER reduction of 5.3×. IPV does not change placement and routing, and does not affect design closure. To the best of our knowledge, our work is the first in-depth study on SER reduction for modern multiplexer-based FPGA routing by in-place logic re-synthesis.
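
The abstract does not spell out the transformation itself, so the following is only a minimal sketch of the generic logic identity that makes an in-place polarity change possible: if a driver LUT's output is inverted and the sink LUT's truth table is re-indexed to expect the complement on that pin, the network's function is unchanged while the wire between them carries the opposite value. The function names, the truth-table encoding (address bit i is input i) and the two-LUT network are illustrative assumptions, not taken from the paper, and the selection of which LUTs to invert (the ILP and SA steps) is not modeled.

```python
from itertools import product

def invert_output(lut):
    """Return the truth table with its output polarity flipped."""
    return [1 - bit for bit in lut]

def absorb_inverted_input(lut, pin):
    """Re-program a sink LUT so that it expects the complement on input `pin`."""
    return [lut[idx ^ (1 << pin)] for idx in range(len(lut))]

def evaluate(lut, inputs):
    """Look up a LUT; inputs[i] drives address bit i (assumed encoding)."""
    idx = sum(bit << i for i, bit in enumerate(inputs))
    return lut[idx]

def network(driver, sink, a, b, c):
    """Hypothetical two-LUT network: sink(driver(a, b), c)."""
    return evaluate(sink, (evaluate(driver, (a, b)), c))

AND2 = [0, 0, 0, 1]  # driver: a AND b
XOR2 = [0, 1, 1, 0]  # sink: wire XOR c

# Invert the driver in place and compensate in the sink.
NAND2 = invert_output(AND2)
XNOR2 = absorb_inverted_input(XOR2, pin=0)

# Exhaustively check that the network function is preserved.
equivalent = all(
    network(AND2, XOR2, a, b, c) == network(NAND2, XNOR2, a, b, c)
    for a, b, c in product((0, 1), repeat=3)
)
print(equivalent)
```

Because only LUT contents change, a rewrite of this kind leaves the netlist topology untouched, which is consistent with the abstract's statement that placement and routing are not changed.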

IEEE Transactions on Computers | 2016

Energy-Efficient eDRAM-Based On-Chip Storage Architecture for GPGPUs

Naifeng Jing; Li Jiang; Tao Zhang; Chao Li; Fengfeng Fan; Xiaoyao Liang

In a typical GPGPU, the on-chip storage is critical to the massive parallelism and is desired to be large. However, the fast-increasing size of the on-chip storage based on traditional SRAM cells, such as register file (RF), shared memory and first level data (L1D) cache, makes the area cost and energy consumption unsustainable for future GPGPUs. In this paper, we first propose to use embedded-DRAM (eDRAM) as an alternative for the on-chip storage. Compared to conventional SRAM, eDRAM enables higher density and lower leakage power, but suffers from limited data retention time. Periodic refresh is a viable approach to maintain data integrity, but it degrades performance and increases energy consumption as eDRAM cells scale into deep sub-micron technology nodes. To recover the performance loss, we exploit the features in the GPGPU architecture and propose various novel refresh schemes to mitigate the refresh penalty. To improve the energy efficiency, we apply lightweight compiler techniques and runtime monitoring for selective refreshing that intelligently eliminates unnecessary refreshes. The evaluation of our proposed refresh schemes demonstrates that, compared to conventional SRAM-based designs, our eDRAM-based on-chip storage exhibits comparable performance but lower energy consumption and smaller silicon area, enabling sustainable on-chip storage scaling for even higher parallelism in future GPGPUs.

ACM Transactions on Design Automation of Electronic Systems | 2013

SEU fault evaluation and characteristics for SRAM-based FPGA architectures and synthesis algorithms

Naifeng Jing; Ju-Yueh Lee; Zhe Feng; Weifeng He; Zhigang Mao; Lei He

Reliability has become an increasingly important concern for SRAM-based field programmable gate arrays (FPGAs). Targeting SEU (single event upset) in SRAM-based FPGAs, this article first develops an SEU evaluation framework that can quantify the failure sensitivity for each configuration bit during design time. This framework considers detailed fault behavior and logic masking on a post-layout FPGA application and performs logic simulation on various circuit elements for fault evaluation. Applying this framework on MCNC benchmark circuits, we first characterize SEUs with respect to different FPGA circuits and architectures, for example, bidirectional routing and unidirectional routing. We show that in both routing architectures, interconnects not only contribute the lion's share of the SEU-induced functional failures, but also present higher failure rates per configuration bit than LUTs. In particular, local interconnect multiplexers in logic blocks have the highest failure rate per configuration bit. Then, we evaluate three recently proposed SEU mitigation algorithms, IPD, IPF, and IPV, which are all logic resynthesis-based with little or no overhead on placement and routing. Different fault mitigating capabilities at the chip level are revealed, demonstrating that algorithms with explicit consideration for interconnect significantly mitigate the SEU at the chip level; for example, IPV achieves a 61% failure rate reduction on average, against about 15% for IPF. In addition, the combination of the three algorithms delivers over 70% failure rate reduction on average at the chip level. The experiments also reveal that in order to improve fault tolerance at the chip level, it is necessary for future fault mitigation algorithms to consider not only LUT or interconnect faults, but also their interactions. We envision that our framework can be used to provide more useful insights for more robust FPGA circuits, architectures, and better synthesis algorithms.

International Symposium on Low Power Electronics and Design | 2015

Bank stealing for conflict mitigation in GPGPU Register File

Naifeng Jing; Shuang Chen; Shunning Jiang; Li Jiang; Chao Li; Xiaoyao Liang

Modern General Purpose Graphic Processing Units (GPGPUs) demand a large Register File (RF), which is typically organized into multiple banks to support the massive parallelism. Although heavy banking benefits RF throughput, its associated area and energy costs with diminishing performance gains greatly limit future RF scaling. In this paper, we propose an improved RF design with a bank stealing technique, which enables a high RF throughput with compact area. By deeply investigating the GPGPU microarchitecture, we identify the deficiency in the state-of-the-art RF designs as the bank conflict problem, while the majority of conflicts can be eliminated by leveraging the fact that the highly-banked RF oftentimes experiences under-utilization. This is especially true in GPGPUs, where multiple ready warps are available at the scheduling stage with their operands to be wisely coordinated. Our lightweight bank stealing technique can opportunistically fill the idle banks for better operand service, and the average GPGPU performance can be improved under a smaller energy budget with significant area saving, which makes it promising for sustainable RF scaling.
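
To make the observation above concrete, here is a toy sketch of how bank conflicts and idle banks can coexist in a banked register file. It assumes single-ported banks, a bank count of 4, and a simple `reg % NUM_BANKS` mapping; all of these, and the function names, are illustrative assumptions rather than details from the paper, and the bank stealing mechanism itself is not modeled.

```python
from collections import Counter

NUM_BANKS = 4  # assumed number of single-ported banks

def bank_of(reg):
    """Assumed mapping from a register index to a bank."""
    return reg % NUM_BANKS

def read_cycles(operands):
    """Cycles to read all operands when each bank serves one read per cycle."""
    per_bank = Counter(bank_of(reg) for reg in operands)
    return max(per_bank.values(), default=0)

def idle_banks(operands):
    """Number of banks that receive no read for this set of operands."""
    return NUM_BANKS - len({bank_of(reg) for reg in operands})

conflicting = [0, 4, 8]  # all three operands map to bank 0
spread = [0, 5, 10]      # operands map to banks 0, 1 and 2

print(read_cycles(conflicting), idle_banks(conflicting))
print(read_cycles(spread), idle_banks(spread))
```

In the conflicting case the reads serialize over three cycles while three of the four banks sit idle, which is the kind of under-utilization the abstract says can be exploited.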

International SoC Design Conference | 2012

Contention and energy aware mapping for real-time applications on Network-on-Chip

Bingjing Ge; Naifeng Jing; Weifeng He; Zhigang Mao

Real-time constraints pose a new challenge when mapping real-time applications onto a Network-on-Chip. To address this problem, we first propose a new task graph description to enable both computation mapping and communication scheduling. Based on the proposed graph, we then propose a contention and energy aware mapping algorithm to eliminate communication conflicts and reduce energy cost, thus delivering higher throughput and lower energy consumption on communication links. In the experiments, we show that different real-time constraints significantly impact the on-chip network, and that our algorithm is able to find a better mapping and scheduling solution for a given real-time application and on-chip network structure. For example, compared to traditional mapping without considering timing constraints, our algorithm reduces the energy by up to 44% on average. It also improves the throughput of the system by up to 25%.

International Symposium on Microarchitecture | 2016

Cache-emulated register file: an integrated on-chip memory architecture for high performance GPGPUs

Naifeng Jing; Jianfei Wang; Fengfeng Fan; Wenkang Yu; Li Jiang; Chao Li; Xiaoyao Liang

The on-chip memory design is critical to the GPGPU performance because it serves between the massive threads and the huge external memory as a low-latency and high-throughput data communication point. However, the existing on-chip memory hierarchy is inherited from the conventional CPU architecture and is oftentimes sub-optimal to the SIMT (single instruction, multiple threads) execution. In this study, we surpass the traditional memory hierarchy design and reform the on-chip memory into an integrated architecture with the cache-emulated register file (RF) capability tailored for high performance GPGPU computing. With the lightweight support from ISA, compiler and the modified microarchitecture, this integrated architecture can dynamically emulate a variable-sized RF and a cache in a uniform way. Evaluation results demonstrate that this novel architecture can deliver better performance and energy efficiency with smaller on-chip memory size. For example, it can gain an average of 50% performance improvement for the cache-sensitive applications.
