Ilya Wagner
Intel
Publications
Featured research published by Ilya Wagner.
Archive | 2010
Ilya Wagner; Valeria Bertacco
The purpose of this book is to survey the state of the art and evolving directions in post-silicon and runtime verification. The authors start by giving an overview of the state of the art in verification, particularly current post-silicon methodologies in use in the industry, both for the domain of processor pipeline design and for memory subsystems. They then dive into the presentation of several new post-silicon verification solutions aimed at boosting the verification coverage of modern processors, dedicating several chapters to this topic. The presentation of runtime verification solutions follows a similar approach. This is an area of processor design that is still in its early stages of exploration and that holds the promise of accomplishing the ultimate goal of achieving complete correctness guarantees for microprocessor-based computation. The authors conclude the book with a look towards the future of late-stage verification and its growing role in the processor life-cycle.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2015
Biruk Mammo; Valeria Bertacco; Andrew DeOrio; Ilya Wagner
Shared-memory chip-multiprocessor (CMP) architectures define memory consistency models that establish the ordering rules for memory operations from multiple threads. Validating the correctness of a CMP's implementation of its memory consistency model requires extensive monitoring and analysis of memory accesses while multiple threads are executing on the CMP. In this paper, we present a low-overhead solution for observing, recording, and analyzing shared-memory interactions for use in an emulation and/or post-silicon validation environment. Our approach leverages portions of the CMP's own data caches, augmented only by a small amount of hardware logic, to log information relevant to memory accesses. After transferring this information to a central memory location, we deploy our own analysis algorithm to detect any possible memory consistency violations. We build on the property that a violation corresponds to a cycle in an appropriately defined graph representing memory interactions. The solution we propose allows a designer to choose where to run the analysis algorithm: 1) on the CMP itself; 2) on a separate processor residing on the validation platform; or 3) off-line on a separate host machine. Our experimental results show an 83% bug detection rate, in our testbed CMP, over three distinct memory consistency models, namely: relaxed-memory order, total-store order, and sequential consistency. Finally, note that our solution can be disabled in the final product, leading to zero performance overhead and a per-core area overhead that is smaller than the size of a physical integer register file in a modern processor.
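The central check described here, modeling observed memory interactions as a directed graph and flagging any cycle as a consistency violation, can be illustrated with a brief sketch. This is a minimal, hypothetical model, not the paper's implementation: the node and edge names are invented, and edges stand in for ordering constraints such as program order and reads-from.

```python
# Minimal sketch of consistency checking as cycle detection (illustrative,
# not the paper's implementation). Nodes are memory operations; a directed
# edge u -> v means "u must be ordered before v" (e.g., program order or
# reads-from, under the consistency model being checked).

def has_ordering_cycle(edges, nodes):
    """Return True if the ordering graph contains a cycle
    (i.e., a memory consistency violation was observed)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on DFS stack / done
    color = {n: WHITE for n in nodes}

    def dfs(u):
        color[u] = GRAY
        for v in edges.get(u, ()):
            if color[v] == GRAY:      # back edge: cycle found
                return True
            if color[v] == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in nodes)

# Example: interactions from two threads that form a cycle.
ops = ["W1(x)", "R1(y)", "W2(y)", "R2(x)"]
order = {
    "W1(x)": ["R1(y)"],   # program order, thread 1
    "W2(y)": ["R2(x)"],   # program order, thread 2
    "R1(y)": ["W2(y)"],   # R1 read y before W2's write
    "R2(x)": ["W1(x)"],   # R2 read x before W1's write
}
assert has_ordering_cycle(order, ops)  # violation under sequential consistency
```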
Design, Automation and Test in Europe | 2011
Ilya Wagner; Shih-Lien Lu
Modern systems on chip (SoCs) are rapidly becoming complex high-performance computational devices, featuring multiple general-purpose processor cores and a variety of functional IP blocks, communicating with each other through on-die fabric. While modular SoC design provides power savings and simplifies the development process, it also leaves significant room for a special type of hardware bug, interaction errors, to slip through pre- and post-silicon verification. Consequently, hard-to-fix silicon escapes may be discovered late in the production schedule or even after a market release, potentially causing costly delays or recalls. In this work we propose a unified error detection and recovery framework that incorporates programmable features into the on-die fabric of an SoC, so triggers of escaped interaction bugs can be detected at runtime. Furthermore, upon detection, our solution locks the interface of an IP for a programmed time period, thus altering the interactions between accesses and bypassing the bug in a manner transparent to software. For classes of errors that cannot be circumvented by this in-hardware technique, our framework is programmed to propagate the error detection to the software layer. Our experiments demonstrate that the proposed framework is capable of detecting a range of interaction errors with less than 0.01% performance penalty and 0.45% area overhead.
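A rough software model may clarify the detect-and-lock idea: a programmable trigger watches fabric transactions for a bug signature, and on a match the target IP's interface is held locked for a programmed number of cycles so the offending access pattern cannot recur. This is a toy sketch; the IP names, trigger encoding, and fields below are invented for illustration and are not the paper's hardware interface.

```python
# Toy model of a programmable detect-and-lock mechanism on an SoC fabric
# (illustrative only; signal names and trigger encoding are invented).

from dataclasses import dataclass

@dataclass
class Trigger:
    src: str          # initiator IP that characterizes the bug
    dst: str          # target IP whose interface will be locked
    lock_cycles: int  # programmed lock duration

class FabricMonitor:
    def __init__(self, triggers):
        self.triggers = triggers
        self.locked_until = {}   # dst IP -> cycle when its lock expires

    def on_transaction(self, cycle, src, dst):
        """Return True if the transaction may proceed this cycle."""
        if cycle < self.locked_until.get(dst, 0):
            return False  # interface locked: access deferred, altering
                          # the interaction that triggers the bug
        for t in self.triggers:
            if t.src == src and t.dst == dst:
                # Bug signature seen: lock the target IP's interface.
                self.locked_until[dst] = cycle + t.lock_cycles
        return True

mon = FabricMonitor([Trigger("DMA", "MEM_CTRL", lock_cycles=8)])
assert mon.on_transaction(0, "DMA", "MEM_CTRL")      # matches, arms lock
assert not mon.on_transaction(4, "CPU", "MEM_CTRL")  # blocked while locked
assert mon.on_transaction(9, "CPU", "MEM_CTRL")      # lock expired
```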
Archive | 2011
Ilya Wagner; Valeria Bertacco
With this chapter we begin our discussion of the functional correctness solutions that can be pursued past the release of a new microprocessor design, when the device is already shipped and installed in an end-customer's system. Error detection at this stage of the processor's life cycle entails monitoring its behavior dynamically, by observing the internal state of the device with dedicated hardware components residing on the silicon die. In addition to error detection, runtime validation solutions, also called in-the-field solutions, must include an effective recovery and error bypass algorithm, to ensure minimal performance loss and forward progress of the system even in the presence of bugs. To make the case for dynamic validation in this chapter, we first discuss the type, criticality, and number of escaped design errors reported in several processor products. We then overview two major classes of runtime solutions, checker-based approaches and patching-based ones; the main difference between these two techniques lies in the underlying error detection mechanism. Checker-based solutions focus on verifying high-level system invariants, usually specified at design time and then mapped to dedicated hardware components. Patching techniques address bugs of which the manufacturer becomes aware after product release and provide programmable means to describe these bugs so that the system can later identify their occurrence at runtime. We then contrast these two frameworks in terms of error coverage, usage flow, and performance overhead, and present in detail some of the most popular academic and industrial solutions known today for each of the two classes.
Archive | 2011
Ilya Wagner; Valeria Bertacco
Verification remains an integral and crucial phase of the modern microprocessor design and manufacturing process. Unfortunately, with soaring design complexities and decreasing time-to-market windows, today's verification approaches are incapable of fully validating a processor design before its release to the public. Increasingly, post-silicon validation is deployed to detect complex functional bugs, in addition to exposing electrical and manufacturing defects. This is due to the significantly higher execution performance offered by post-silicon methods, compared to pre-silicon approaches. We begin this chapter with an overview of traditional post-silicon validation techniques, as reported by the industry. We pay special attention to error detection and debugging methodologies discussed in the literature and identify several crucial drawbacks in traditional post-silicon techniques. In particular, we show how the performance of architectural simulators, used to determine the correctness of post-silicon tests, has become a bottleneck in current methodologies. We then discuss in detail a novel solution to address this issue, called Reversi. Reversi generates random programs in such a way that their correct final state is known at generation time, thus completely eliminating the need for architectural simulation. At the end of the chapter, we demonstrate experimentally that Reversi generates tests exposing more bugs faster, and can speed up post-silicon validation by 20 times, when compared to traditional flows.
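The key idea, emitting random tests whose final architectural state is known by construction, can be conveyed with a toy generator. One simple way to achieve this (a sketch of the general principle, not Reversi's actual algorithm) is to pair every randomized operation with its inverse, so that a correct machine must return every register to its initial value; the toy ISA and register names below are invented.

```python
# Toy illustration of a generation-time-known final state: each random
# operation is later undone by its inverse, so a correct processor must
# end with every register at its initial value. No architectural
# simulation is needed to know the expected result.

import random

INVERSE = {"add": "sub", "sub": "add", "xor": "xor"}  # xor is self-inverse

def generate_test(num_ops, regs, seed=0):
    rng = random.Random(seed)
    forward, backward = [], []
    for _ in range(num_ops):
        op = rng.choice(list(INVERSE))
        reg = rng.choice(regs)
        imm = rng.randrange(1, 256)
        forward.append((op, reg, imm))
        backward.append((INVERSE[op], reg, imm))
    # Undo operations in reverse (LIFO) order.
    return forward + list(reversed(backward))

def execute(test, state):
    # Reference semantics for the toy ISA (32-bit wraparound arithmetic).
    for op, reg, imm in test:
        if op == "add":
            state[reg] = (state[reg] + imm) & 0xFFFFFFFF
        elif op == "sub":
            state[reg] = (state[reg] - imm) & 0xFFFFFFFF
        elif op == "xor":
            state[reg] ^= imm
    return state

init = {"r1": 10, "r2": 99}
final = execute(generate_test(100, ["r1", "r2"]), dict(init))
assert final == init  # any deviation on real silicon exposes a bug
```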
Archive | 2011
Ilya Wagner; Valeria Bertacco
Today, the number of functional errors escaping design verification and released into final silicon is growing, due to the increasing complexity of microprocessor systems and the shrinking production schedules of their development. The recent, widespread adoption of multi-core processor architectures is exacerbating the problem, due to the variable communication delays of their memory subsystems, making them even more prone to subtle and devastating bugs. This deteriorating situation calls for high-efficiency, high-coverage results in functional validation, results that could be achieved by leveraging the performance of post-silicon validation, that is, all the verification activity that surrounds and applies to prototype silicon hardware. The orders-of-magnitude faster testing in post-silicon enables designers to achieve much higher coverage before market release, but only if the limitations of this technology concerning bug diagnosis and internal node observability can be overcome. In this chapter we demonstrate the full performance of post-silicon validation through Dacota, a new high-coverage solution for the validation of memory consistency models in multi-cores. When activated, Dacota reconfigures a portion of the cache storage to log the activity of memory operations using a compact data-coloring scheme. Logs are periodically aggregated and checked by a distributed software algorithm running in-situ on the processor cores to verify the correct ordering of the memory operations observed. When a design is ready for customer shipment, Dacota can be deactivated, releasing all the cache storage to mainstream data and instruction caching. The only remaining mark of Dacota is a small silicon area footprint, less than 0.01% of the area for the open-source multi-core design used in our evaluation (three orders of magnitude smaller than previous solutions). We found in our experimental analysis that Dacota is effective in exposing a variety of memory subsystem bugs, and that it delivers high design coverage capabilities at a 26% performance slowdown for real-world applications, incurred only when Dacota is active during post-silicon validation.
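A simplified model may help convey the logging side of such a scheme: stores stamp data with a small, incrementing color, loads record the color they observed, and the aggregated logs yield "writer happened-before reader" edges that the checker then tests for cycles (as in the cycle-detection sketch above). The log format and field names below are invented for illustration; Dacota's actual coloring scheme and log layout differ.

```python
# Simplified model of color-based access logging (illustrative only; not
# Dacota's actual format). Stores tag data with a per-write color; loads
# record which write (core + color) they observed. Aggregated logs give
# ordering edges for a subsequent cycle check.

class ColoredMemory:
    def __init__(self):
        self.mem = {}    # addr -> (value, writer_core, color)
        self.logs = {}   # core -> list of log records

    def store(self, core, addr, value, color):
        self.mem[addr] = (value, core, color)
        self.logs.setdefault(core, []).append(("W", addr, color))

    def load(self, core, addr):
        value, writer, color = self.mem.get(addr, (0, None, 0))
        # Record which write this load observed.
        self.logs.setdefault(core, []).append(("R", addr, writer, color))
        return value

    def ordering_edges(self):
        """Aggregate logs into (writer_op -> reader_op) edges."""
        edges = []
        for core, recs in self.logs.items():
            for rec in recs:
                if rec[0] == "R" and rec[2] is not None:
                    _, addr, writer, color = rec
                    edges.append(((writer, "W", addr, color),
                                  (core, "R", addr, color)))
        return edges

m = ColoredMemory()
m.store(core=0, addr=0x40, value=7, color=1)
assert m.load(core=1, addr=0x40) == 7
print(m.ordering_edges())  # [((0, 'W', 64, 1), (1, 'R', 64, 1))]
```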
Archive | 2011
Ilya Wagner; Valeria Bertacco
This chapter describes in detail field-repairable control logic (FRCL), a solution that we recently developed [WBA06]. FRCL is a patching-based runtime verification technique that relies on an on-die programmable matcher. The matcher compares the state of the processor to patterns describing known bugs. In particular, FRCL targets control bugs in microprocessor cores, which, as our analysis in Section 6.1 shows, dominate the landscape of escaped errors in commercial products. To detect such control bugs, the matcher is engineered to monitor multiple critical signals in a core's control logic block. The patterns stored in the matcher are developed by the manufacturer after an escaped error is detected and diagnosed; patterns are then distributed to end-users via patches, such as BIOS updates. When a buggy situation is detected, the matcher recovers from it with a pipeline flush and invokes a degraded mode of operation. In this mode, the complexity of the processor is greatly reduced, sacrificing some performance features, but allowing the complete functional correctness of the system to be formally proven. Once the buggy situation is bypassed, the processor resumes normal high-performance operation. We analyze different aspects of FRCL operation and describe a methodology for automatic selection of the signals to be monitored by the matcher. Finally, we extend the field-repairable control logic framework with semantic guardians: hardware circuits encoding all control states of the design that have been verified prior to its release. With the help of the guardians, the processor can be guaranteed to always operate in a verified state (in either normal or degraded mode), thus enabling trusted computation.
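The matcher at the heart of this approach can be pictured as a ternary match over the monitored control signals: each pattern is a value/care-mask pair distributed via a patch, and a hit triggers the pipeline flush and degraded mode. The bit assignments and encoding below are a hypothetical software sketch, not FRCL's actual hardware format.

```python
# Sketch of a ternary (value/mask) bug-pattern matcher over monitored
# control signals, modeled in software (the hardware encoding differs).
# A pattern matches when every "care" bit of the state equals the
# pattern's value bit; mask bits set to 0 are don't-cares.

def matches(state, value, mask):
    """True if `state` hits the pattern (value, mask)."""
    return (state & mask) == (value & mask)

# Hypothetical 8-bit control-state snapshot: say bit 7 = branch
# mispredict, bit 3 = store-buffer full; all other bits don't-care.
BUG_PATTERNS = [
    (0b10001000, 0b10001000),   # bug triggers when both are asserted
]

def check_state(state):
    for value, mask in BUG_PATTERNS:
        if matches(state, value, mask):
            return "flush-and-degrade"   # recover, enter degraded mode
    return "normal"

assert check_state(0b10101010) == "flush-and-degrade"
assert check_state(0b10000010) == "normal"
```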
Archive | 2011
Ilya Wagner; Valeria Bertacco
Over the past four decades, microprocessors have come to be a vital and inseparable part of the modern world, becoming the digital brain of numerous electronic devices and gadgets that make today's lifestyle possible. Processors are capable of performing computation at astonishingly high speeds and are extremely integrated, occupying only a few square centimeters of silicon die. However, this computational power comes at a price: the task of verifying a modern microprocessor and guaranteeing the correctness of its operation is increasingly challenging, even for the most established processor vendors. To deliver ever higher performance to end-users, processor manufacturers are forced to design progressively more complex circuits and employ immense verification teams to eliminate critical design bugs in a timely manner. Unfortunately, too often even the size of these teams doesn't seem to matter, as schedules continue to slip and microprocessors find their way to the marketplace with design errors. In this chapter we overview the life-cycle of a microprocessor, discuss the challenges of verifying these devices, and show examples of hardware errors that have escaped into production silicon because of insufficient validation, along with their impact.
Archive | 2011
Ilya Wagner; Valeria Bertacco
In this chapter we take the reader through a typical microprocessor's life-cycle, from its first high-level specification to a finished product deployed in an end-user's system, and overview the verification techniques that are applied at each step of this flow. We first discuss pre-silicon verification, the process of validating a model of the processor at various levels of abstraction, from an architectural specification to a gate-level netlist. Throughout the pre-silicon phase, two main families of techniques are commonly used: formal methods and simulation-based solutions. While the former provide mathematical guarantees of design correctness, the latter are significantly more scalable and, consequently, more commonly used in the industry today. After the first few prototypes of a processor are manufactured, validation enters the post-silicon domain, where tests can run on the actual silicon hardware. The raw performance of in-hardware execution is one of the major advantages of post-silicon validation, while lack of internal observability and limited debuggability are its main drawbacks. To alleviate this, designers often augment their creations with special features for silicon state acquisition, which we review here. After an arduous process of pre- and post-silicon validation, the device is released to the market and finds its way into a final system. Yet, it may still contain subtle bugs, which could not be exposed earlier by designers due to very compressed production timelines. To combat these escaped errors, vendors and researchers in industry and academia have begun investigating alternative dynamic verification techniques: with minimal impact on the processor's performance, these solutions monitor its health and invoke specialized correction mechanisms when errors manifest at runtime. As we show in this chapter, all three phases of verification, pre-silicon, post-silicon, and runtime, have their unique advantages and limitations, which must be taken into account by design houses to attain sufficient verification coverage within their time and cost budgets and to avoid major catastrophes caused by releasing faulty processor products to the commercial market.
Archive | 2011
Ilya Wagner; Valeria Bertacco
In this chapter we shift the focus of our discussion to multi-core processors and issues specific to their runtime verification. In Chapter 4 we overviewed features of modern multi-core designs and described the growing challenge of their verification. As we pointed out, this arduous task is exacerbated by the increasing complexity of the shared-memory communication subsystem and the need to verify two of its major system-wide properties: cache coherence and memory consistency. In this chapter we present two runtime solutions designed specifically for this purpose. The first technique, called Dynamic Verification of Memory Consistency (DVMC), designed by Meixner et al., is a checker-based solution which employs multiple distributed monitors to validate different aspects of communication at runtime. The second solution, Caspar, was developed by us as a patching approach that uses on-die matchers, programmed with patterns describing known bugs, to identify errors. Moreover, to be effective at runtime, both solutions include not only a detection but also a recovery mechanism, so bugs can be sidestepped and forward progress can be maintained. Thus, as part of our discussion, we also overview the recovery techniques used in both DVMC and Caspar.