Yanjing Li | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yanjing Li is active.

Explore More

Publication

Featured researches published by Yanjing Li.

design, automation, and test in europe | 2008

CASP: concurrent autonomous chip self-test using stored test patterns

Yanjing Li; Samy Makar; Subhasish Mitra

CASP, concurrent autonomous chip self-test using stored test patterns, is a special kind of self-test where a system tests itself concurrently during normal operation without any downtime visible to the end-user. CASP consists of two ideas: 1. Storage of very thorough test patterns in non-volatile memory; and, 2. Architectural and system-level support for autonomous testing of one or more cores in a multi-core system using stored patterns, concurrently with normal system operation, without bringing down the entire system. CASP enables design of robust systems with built-in features for circuit failure prediction, error detection, self-diagnosis and self-repair. Such systems are necessary to overcome major reliability challenges in scaled-CMOS technologies. Implementation of CASP in the OpenSPARC Tl multi-core processor demonstrates its effectiveness and practicality.

Journal of Mechanical Design | 2008

Diagonal Quadratic Approximation for Parallelization of Analytical Target Cascading

Yanjing Li; Zhaosong Lu; Jeremy J. Michalek

Analytical Target Cascading (ATC) is an effective decomposition approach used for engineering design optimization problems that have hierarchical structures. With ATC, the overall system is split into subsystems, which are solved separately and coordinated via target/response consistency constraints. As parallel computing becomes more common, it is desirable to have separable subproblems in ATC so that each subproblem can be solved concurrently to increase computational throughput. In this paper, we first examine existing ATC methods, providing an alternative to existing nested coordination schemes by using the block coordinate descent method (BCD). Then we apply diagonal quadratic approximation (DQA) by linearizing the cross term of the augmented Lagrangian function to create separable subproblems. Local and global convergence proofs are described for this method. To further reduce overall computational cost, we introduce the truncated DQA (TDQA) method that limits the number of inner loop iterations of DQA. These two new methods are empirically compared to existing methods using test problems from the literature. Results show that computational cost of nested loop methods is reduced by using BCD and generally the computational cost of the truncated methods, TDQA and ALAD, are superior to other nested loop methods with lower overall computational cost than the best previously reported results.© 2007 ASME

IEEE Design & Test of Computers | 2009

Overcoming Early-Life Failure and Aging for Robust Systems

Yanjing Li; Young Moon Kim; Evelyn Mintarno; Donald S. Gardner; Subhasish Mitra

The prospect of system failure has increased because of device and chip-level effects in the late CMOS era. In this article, the authors present novel system-level architecture and design innovations to cope with these lifetime reliability challenges. At nanometer-scale geometries, several hardware failure mechanisms, which were largely benign in the past, are becoming visible at the system level. Moreover, recent studies indicate that, depending on the application, hardware failures can be significant contributors to overall system failure rates.Design of robust systems ensuring required hardware reliability, although nontrivial, is achievable but at high costs. Concurrent error detection during system operation is an extremely important aspect of such systems.Hardware reliability challenges arise from three major sources: early-life failures (also called infant mortality), radiation-induced soft errors, and circuit aging. Several techniques, such as Built-in Soft-Error Resilience (BISER), can be effectively used for correcting radiation-induced transient (soft) errors. Focus on early-life failures (ELF) and circuit aging was discussed. These techniques utilize specific characteristics of reliability mechanisms without incurring the high costs of traditional concurrent error detection.

international test conference | 2010

QED: Quick Error Detection tests for effective post-silicon validation

Ted Hong; Yanjing Li; Sung-Boem Park; Diana Mui; David Lin; Ziyad Abdel Kaleq; Nagib Hakim; Helia Naeimi; Donald S. Gardner; Subhasish Mitra

Long error detection latency, the time elapsed between the occurrence of an error caused by a bug and its manifestation as a system-level failure, is a major challenge in post-silicon validation of robust systems. In this paper, we present a new technique called Quick Error Detection (QED), which transforms existing post-silicon validation tests into new validation tests that significantly reduce error detection latency. QED transformations allow flexible tradeoffs between error detection latency, coverage, and complexity, and can be implemented in software with little or no hardware changes. Results obtained from hardware experiments on quad-core Intel® Core™ i7 hardware platforms and from simulations on a multi-core MIPS processor design demonstrate that: 1. QED significantly improves error detection latencies by six orders of magnitude, i.e., from billions of cycles to a few thousand cycles or less. 2. QED transformations do not degrade the coverage of validation tests as estimated empirically by measuring the maximum operating frequencies over a wide range of operating voltage points. 3. QED tests improve coverage by detecting errors that escape the original non-QED tests.

IEEE Journal on Emerging and Selected Topics in Circuits and Systems | 2011

Robust System Design to Overcome CMOS Reliability Challenges

Subhasish Mitra; Kevin Brelsford; Young Moon Kim; Hsiao-Heng Kelin Lee; Yanjing Li

Todays mainstream electronic systems typically assume that transistors and interconnects operate correctly over their useful lifetime. With enormous complexity and significantly increased vulnerability to failures compared to the past, future system designs cannot rely on such assumptions. For coming generations of silicon technologies, several causes of hardware reliability failures, largely benign in the past, are becoming significant at the system level. Robust system design is essential to ensure that future systems perform correctly despite rising complexity and increasing disturbances. This paper describes three techniques that can enable a sea change in robust system design through cost-effective tolerance and prediction of failures in hardware during system operation: 1) efficient soft error resilience; 2) circuit failure prediction; and 3) effective on-line self-test and diagnostics. The need for global optimization across multiple abstraction layers is also demonstrated.

international test conference | 2008

VAST: Virtualization-Assisted Concurrent Autonomous Self-Test

Hiroaki Inoue; Yanjing Li; Subhasish Mitra

Virtualization-assisted concurrent, autonomous self-test, or VAST, enables a multi-/many-core system to test itself, concurrently during normal operation, without any user-visible downtime. Such on-line self-test is required for large-scale robust systems with built-in support for circuit failure prediction, failure detection, diagnosis, and self-healing. The main idea behind VAST is hardware and software co-design of on-line self-test features in a multi-/many-core system through integration of: 1. multi-/many-core architecture, 2. virtualization software, and, 3. special self-test techniques such as BIST (built-in self-test) or CASP (concurrent autonomous chip self-test using stored patterns). As a result, optimized trade-offs in system design complexity, system performance and power impact, and test thoroughness are possible. Experimental results from an actual multi-core system demonstrate that: 1. VAST is practical and effective; and, 2. Special VAST-supported self-test policies enable extremely thorough on-line self-test with very small performance impact.

international conference on computer aided design | 2009

Operating system scheduling for efficient online self-test in robust systems

Yanjing Li; Onur Mutlu; Subhasish Mitra

Very thorough online self-test is essential for overcoming major reliability challenges such as early-life failures and transistor aging in advanced technologies. This paper demonstrates the need for operating system (OS) support to efficiently orchestrate online self-test in future robust systems. Experimental data from an actual dual quad-core system demonstrate that, without software support, online self-test can significantly degrade performance of soft real-time and computation-intensive applications (by up to 190%), and can result in perceptible delays for interactive applications. To mitigate these problems, we develop OS scheduling techniques that are aware of online self-test, and schedule/migrate tasks in multi-core systems by taking into account the unavailability of one or more cores undergoing online self-test. These techniques eliminate any performance degradation and perceptible delays in soft real-time and interactive applications (otherwise introduced by online self-test), and significantly reduce the impact of online self-test on the performance of computation-intensive applications. Our techniques require minor modifications to existing OS schedulers, thereby enabling practical and efficient online self-test in real systems.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014

Effective Post-Silicon Validation of System-on-Chips Using Quick Error Detection

David Lin; Ted Hong; Yanjing Li; Eswaran S; Sharad Kumar; Farzan Fallah; Nagib Hakim; Donald S. Gardner; Subhasish Mitra

This paper presents the Quick Error Detection (QED) technique for systematically creating families of post-silicon validation tests that quickly detect bugs inside processor cores and uncore components (cache controllers, memory controllers, and on-chip interconnection networks) of multicore system on chips (SoCs). Such quick detection is essential because long error detection latency, the time elapsed between the occurrence of an error due to a bug and its manifestation as an observable failure, severely limits the effectiveness of traditional post-silicon validation approaches. QED can be implemented completely in software, without any hardware modification. Hence, it is readily applicable to existing designs. Results using multiple hardware platforms, including the Intel® Core™ i7 SoC, and a state-of-the-art commercial multicore SoC, along with simulation results using an OpenSPARC T2-like multicore SoC with bug scenarios from commercial multicore SoCs demonstrate: 1) error detection latencies of post-silicon validation tests can be very long, up to billions of clock cycles, especially for bugs inside uncore components; 2) QED shortens error detection latencies by up to nine orders of magnitude to only a few hundred cycles for most bug scenarios; and 3) QED enables up to a fourfold increase in bug coverage.

vlsi test symposium | 2010

Concurrent autonomous self-test for uncore components in system-on-chips

Yanjing Li; Onur Mutlu; Donald S. Gardner; Subhashish Mitra

Concurrent autonomous self-test, or online self-test, allows a system to test itself, concurrently during normal operation, with no system downtime visible to the end-user. Online self-test is important for overcoming major reliability challenges such as early-life failures and circuit aging in future System-on-Chips (SoCs). To ensure required levels of overall reliability of SoCs, it is essential to apply online self-test to uncore components, e.g., cache controllers, DRAM controllers, and I/O controllers, in addition to processor cores. This is because uncore components can account for a significant portion of the overall logic area of a multi-core SoC. In this paper, we present an efficient online self-test technique for uncore components in SoCs. We achieve extremely high test coverage by storing high-quality test patterns in off-chip non-volatile storage. However, a simple technique that stalls the uncore-component-under-test can result in significant system performance degradation or even visible system unresponsiveness. Our new techniques overcome these challenges and enable cost-effective online self-test of uncore components through three special hardware features: 1. resource reallocation and sharing (RRS); 2. no-performance-impact testing; and, 3. smart backups. Implementation of online self-test for uncore components of the open-source OpenSPARC T2 multi-core SoC, using a combination of these three techniques, achieves high test coverage at < 1% area impact, < 1% power impact, and < 3% system-level performance impact. These results demonstrate the effectiveness and practicality of our techniques.

international test conference | 2013

Self-repair of uncore components in robust system-on-chips: An OpenSPARC T2 case study

Yanjing Li; Eric Cheng; Samy Makar; Subhasish Mitra

Self-repair replaces/bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, little attention has been paid to self-repair of uncore components (e.g., cache controllers, memory controllers, and I/O controllers) that occupy significant portions of multi-core SoCs. In this paper, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques, using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are: 1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 16%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact. 2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance. 3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault. Our self-repair techniques also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.

Explore More