Craig B. Zilles
University of Illinois at Urbana–Champaign
Publications
Featured research published by Craig B. Zilles.
international symposium on computer architecture | 2001
Craig B. Zilles; Gurindar S. Sohi
A relatively small set of static instructions has significant leverage on program execution performance. These problem instructions contribute a disproportionate number of cache misses and branch mispredictions because their behavior cannot be accurately anticipated using existing prefetching or branch prediction mechanisms. The behavior of many problem instructions can be predicted by executing a small code fragment called a speculative slice. If a speculative slice is executed before the corresponding problem instructions are fetched, then the problem instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups of up to 43 percent over an aggressive baseline machine. To benefit from branch predictions generated by speculative slices, the predictions must be bound to specific dynamic branch instances. We present a technique that invalidates predictions when it can be determined (by monitoring the program's execution path) that they will not be used. This enables the remaining predictions to be correctly correlated.
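To make the idea concrete, the following is a minimal C sketch, not taken from the paper: it assumes a pointer-chasing loop whose dependent load misses frequently, and uses the GCC/Clang intrinsic __builtin_prefetch to stand in for the prefetches a speculative slice would issue ahead of the main computation.

/* Hypothetical sketch: a "problem" load inside a list traversal, and a
 * hand-extracted speculative slice that runs ahead of the main loop to
 * warm the cache before the problem load is fetched. */
#include <stddef.h>

struct node { struct node *next; int key; int payload[15]; };

/* Main computation: n->payload[0] is the problem load that misses often. */
long sum_payloads(struct node *head) {
    long sum = 0;
    for (struct node *n = head; n != NULL; n = n->next)
        sum += n->payload[0];          /* frequent cache miss */
    return sum;
}

/* Speculative slice: only the instructions needed to compute the problem
 * load's address (the pointer chase), issued as prefetches.  Run early
 * enough, it tolerates the memory latency for the loads in sum_payloads(). */
void speculative_slice(struct node *head) {
    for (struct node *n = head; n != NULL; n = n->next)
        __builtin_prefetch(&n->payload[0], 0 /* read */, 1 /* low locality */);
}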
interactive 3d graphics and games | 1995
K. Salisbury; David L. Brock; Thomas H. Massie; N. Swarup; Craig B. Zilles
Haptic rendering is the process of computing and generating forces in response to user interactions with virtual objects. Recent efforts by our team at MIT's AI Laboratory have resulted in the development of haptic interface devices and algorithms for generating the forces of interaction with virtual objects. This paper focuses on the software techniques needed to generate sensations of contact interaction and material properties. In particular, the techniques we describe are appropriate for use with the Phantom haptic interface, a force-generating display device developed in our laboratory. We also briefly describe a technique for representing and rendering the feel of arbitrary polyhedral shapes and address issues related to rendering the feel of non-homogeneous materials. A number of demonstrations of simple haptic tasks that combine our rendering techniques are also described.
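As a rough illustration of force generation (a minimal penalty-based sketch under simplifying assumptions, not the paper's specific technique), the code below maps the haptic probe's penetration depth into a plane onto a Hooke's-law restoring force along the surface normal.

/* Minimal penalty-based sketch: when the probe penetrates a plane, push back
 * along the unit surface normal with a spring force proportional to depth. */
#include <math.h>

typedef struct { double x, y, z; } vec3;

static double dot(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

/* Plane with unit normal n and offset d: points with dot(n, p) >= d are
 * outside the object.  k is the virtual surface stiffness in N/m. */
vec3 contact_force(vec3 probe, vec3 n, double d, double k) {
    double depth = d - dot(n, probe);          /* > 0 means penetration */
    vec3 f = {0.0, 0.0, 0.0};
    if (depth > 0.0) {
        f.x = k * depth * n.x;                 /* Hooke's-law restoring force */
        f.y = k * depth * n.y;
        f.z = k * depth * n.z;
    }
    return f;
}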
international symposium on microarchitecture | 2002
Craig B. Zilles; Gurindar S. Sohi
Master/Slave Speculative Parallelization (MSSP) is an execution paradigm for improving the execution rate of sequential programs by parallelizing them speculatively for execution on a multiprocessor. In MSSP, one processor - the master - executes an approximate version of the program to compute selected values that the full program's execution is expected to compute. The master's results are checked by slave processors that execute the original program. This validation is parallelized by cutting the program's execution into tasks. Each slave uses its predicted inputs (as computed by the master) to validate the input predictions of the next task, inductively validating the entire execution. The performance of MSSP is largely determined by the execution rate of the approximate program. Since approximate code has no correctness requirements (in essence, it is a software value predictor), it can be optimized more effectively than traditionally generated code. It is free to sacrifice correctness in the uncommon case to maximize performance in the common case. A simulation-based evaluation of an initial MSSP implementation achieves speedups of up to 1.7 (harmonic mean 1.25) on the SPEC2000 integer benchmarks. Performance is currently limited by the effectiveness with which our current automated infrastructure approximates programs, which can likely be improved significantly.
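The inductive validation step can be sketched as follows (names and structure are illustrative assumptions, not the paper's implementation): a slave re-executes its task exactly from the master's predicted live-in state and compares its live-out state against the master's predicted live-in for the next task.

#include <stdbool.h>
#include <string.h>

#define NREGS 32

typedef struct { long regs[NREGS]; } arch_state;   /* task live-in/live-out */

/* The original (correct) program for one task: reads `in`, writes `out`. */
typedef void (*task_fn)(const arch_state *in, arch_state *out);

/* Inductive check: task i's exact live-out must equal the master's predicted
 * live-in for task i+1.  On a mismatch, a real implementation would squash
 * tasks i+1 onward and restart the master from the last verified state. */
bool validate_task(task_fn run_task_exact,
                   const arch_state *predicted_in,         /* from master */
                   const arch_state *next_predicted_in) {  /* from master */
    arch_state exact_out;
    run_task_exact(predicted_in, &exact_out);
    return memcmp(&exact_out, next_predicted_in, sizeof exact_out) == 0;
}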
international symposium on computer architecture | 2000
Craig B. Zilles; Gurindar S. Sohi
For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fraction of such performance-degrading events. This paper analyzes the dynamic instruction stream leading up to these performance-degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance-degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices. This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance-degrading instruction.
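The core backward-slice walk can be illustrated with the sketch below (an assumption for illustration, not the paper's tooling): it tracks only register dataflow, whereas the paper additionally handles memory and control dependences via memory dependence prediction and control independence.

#include <stdbool.h>
#include <stddef.h>

#define NREGS 64

typedef struct {
    int dest;          /* destination register, -1 if none */
    int src[2];        /* source registers, -1 if unused   */
} dyn_insn;

/* trace[0..len-1] is the dynamic instruction stream; `target` is the index of
 * the problem load/branch.  Sets in_slice[i] for slice members and returns
 * the slice size. */
size_t backward_slice(const dyn_insn *trace, size_t len, size_t target,
                      bool *in_slice) {
    bool live[NREGS] = { false };      /* registers the slice still needs */
    size_t count = 0;

    for (size_t i = 0; i < len; i++) in_slice[i] = false;

    /* Seed the live-set with the target instruction's own sources. */
    in_slice[target] = true;
    count++;
    for (int s = 0; s < 2; s++)
        if (trace[target].src[s] >= 0) live[trace[target].src[s]] = true;

    for (size_t i = target; i-- > 0; ) {          /* walk the trace backwards */
        int d = trace[i].dest;
        if (d >= 0 && live[d]) {                  /* produces a needed value */
            in_slice[i] = true;
            count++;
            live[d] = false;                      /* satisfied by this producer */
            for (int s = 0; s < 2; s++)
                if (trace[i].src[s] >= 0) live[trace[i].src[s]] = true;
        }
    }
    return count;
}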
high performance computer architecture | 2001
Craig B. Zilles; Gurindar S. Sohi
Aggressive program optimization requires accurate profile information, but such accuracy requires many samples to be collected. We explore a novel profiling architecture that reduces the overhead of collecting each sample by including a programmable co-processor that analyzes a stream of profile samples generated by a microprocessor. From this stream of samples, the co-processor can detect correlations between instructions (e.g., memory dependence profiling) as well as those between different dynamic instances of the same instruction (e.g., value profiling). The profiler's programmable nature allows a broad range of data to be extracted, post-processed, and formatted, and provides the flexibility to tailor the profiling application to the program under test. Because the co-processor is specialized for profiling, it can execute profiling applications more efficiently than a general-purpose processor. The co-processor should not significantly impact the cost or performance of the main processor because it can be implemented using a small number of transistors at the chip's periphery. We demonstrate the proposed design through a detailed evaluation of load value profiling. Our implementation quickly and accurately estimates the value invariance of loads, with time overhead roughly proportional to the size of the instruction working set of the program. This algorithm demonstrates a number of general techniques for profiling, including estimating the completeness of a profile, focusing profiling on particular instructions, and managing profiling resources.
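A minimal sketch of load value-invariance profiling follows (an illustrative assumption about the algorithm, not the co-processor's actual firmware): for each static load, count how often the loaded value matches the previously observed value, yielding an invariance estimate.

#include <stdint.h>

#define TABLE_SIZE 4096                 /* direct-mapped by load PC */

typedef struct {
    uint64_t pc, last_value;
    uint64_t samples, matches;
} value_profile_entry;

static value_profile_entry table[TABLE_SIZE];

/* Called for each (load PC, loaded value) sample streamed from the processor. */
void profile_load(uint64_t pc, uint64_t value) {
    value_profile_entry *e = &table[(pc >> 2) % TABLE_SIZE];
    if (e->pc != pc) {                  /* conflict or first use: reset entry */
        e->pc = pc;
        e->last_value = value;
        e->samples = 0;
        e->matches = 0;
    }
    e->samples++;
    if (value == e->last_value) e->matches++;
    e->last_value = value;
}

/* Invariance estimate in [0,1]: fraction of samples repeating the last value. */
double invariance(uint64_t pc) {
    value_profile_entry *e = &table[(pc >> 2) % TABLE_SIZE];
    return (e->pc == pc && e->samples) ? (double)e->matches / e->samples : 0.0;
}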
high-performance computer architecture | 2009
Brian Greskamp; Lu Wan; Ulya R. Karpuzcu; Jeffrey J. Cook; Josep Torrellas; Deming Chen; Craig B. Zilles
Several recent processor designs have proposed to enhance performance by increasing the clock frequency to the point where timing faults occur, and by adding error-correcting support to guarantee correctness. However, such Timing Speculation (TS) proposals are limited in that they assume traditional design methodologies that are suboptimal under TS. In this paper, we present a new approach where the processor itself is designed from the ground up for TS. The idea is to identify and optimize the most frequently-exercised critical paths in the design, at the expense of the majority of the static critical paths, which are allowed to suffer timing errors. Our approach and design optimization algorithm are called BlueShift. We also introduce two techniques that, when applied under BlueShift, improve processor performance: On-demand Selective Biasing (OSB) and Path Constraint Tuning (PCT). Our evaluation with modules from the OpenSPARC T1 processor shows that, compared to conventional TS, BlueShift with OSB speeds up applications by an average of 8% while increasing the processor power by an average of 12%. Moreover, compared to a high-performance TS design, BlueShift with PCT speeds up applications by an average of 6% with an average processor power overhead of 23%, providing a way to speed up logic modules that is orthogonal to voltage scaling.
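The path-selection idea can be sketched as follows (a hedged illustration, not BlueShift's actual algorithm): rank timing paths by how often they are exercised with a delay exceeding the sped-up clock period, then target the worst offenders for OSB/PCT-style optimization while allowing the remaining static critical paths to rely on the error-correction support.

#include <stdlib.h>

typedef struct {
    int    id;
    double delay_ns;        /* static path delay                      */
    double activations;     /* per-cycle probability the path toggles */
} timing_path;

static double period_ns;    /* target (over-clocked) cycle time */

/* Expected timing faults per cycle contributed by a path. */
static double fault_rate(const timing_path *p) {
    return (p->delay_ns > period_ns) ? p->activations : 0.0;
}

static int by_fault_rate_desc(const void *a, const void *b) {
    const timing_path *pa = a, *pb = b;
    double fa = fault_rate(pa), fb = fault_rate(pb);
    return (fb > fa) - (fb < fa);
}

/* Sort so the most frequently-faulting paths come first; the leading entries
 * are the candidates to speed up. */
void select_paths_to_optimize(timing_path *paths, size_t n, double target_ns) {
    period_ns = target_ns;
    qsort(paths, n, sizeof *paths, by_fault_rate_desc);
}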
international symposium on computer architecture | 2007
Naveen Neelakantam; Ravi Rajwar; Suresh Srinivas; Uma Srinivasan; Craig B. Zilles
Speculative compiler optimizations are effective both in improving single-thread performance and in reducing power consumption, but their implementation introduces significant complexity, which can limit their adoption, limit their optimization scope, and negatively impact the reliability of the compilers that implement them. To eliminate much of this complexity, as well as increase the effectiveness of these optimizations, we propose that microprocessors provide architecturally-visible hardware primitives for atomic execution. These primitives provide to the compiler the ability to optimize the program's hot path in isolation, allowing the use of non-speculative formulations of optimization passes to perform speculative optimizations. Atomic execution guarantees that if a speculation invariant does not hold, the speculative updates are discarded, the register state is restored, and control is transferred to a non-speculative version of the code, thereby relieving the compiler of the responsibility of generating compensation code. We demonstrate the benefit of hardware atomicity in the context of a Java virtual machine. We find incorporating the notion of atomic regions into an existing compiler intermediate representation to be natural, requiring roughly 3,000 lines of code (~3% of a JVM's optimizing compiler), most of which were for region formation. Its incorporation creates new opportunities for existing optimization passes, as well as greatly simplifying the implementation of additional optimizations (e.g., partial inlining, partial loop unrolling, and speculative lock elision). These optimizations reduce dynamic instruction count by 11% on average and result in a 10-15% average speedup, relative to a baseline compiler with a similar degree of inlining.
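The shape of code compiled with atomic regions can be sketched in C as below. The atomic_begin/atomic_abort/atomic_commit macros are a software analogy built on setjmp/longjmp (an assumption for illustration, not the proposed ISA primitives): they only restore control flow, whereas the hardware primitives also checkpoint register state and discard speculative memory updates.

#include <setjmp.h>

static jmp_buf checkpoint;
#define atomic_begin()  setjmp(checkpoint)      /* 0 on speculative entry */
#define atomic_abort()  longjmp(checkpoint, 1)  /* roll back to the begin */
#define atomic_commit() ((void)0)               /* nothing to undo in this model */

/* General-purpose (cold) version of the operation. */
static int dispatch(int kind, int arg) {
    switch (kind) {
    case 0:  return arg * 2 + 1;    /* the overwhelmingly common case */
    default: return arg - kind;     /* rare cases */
    }
}
/* What the compiler would emit for the hot path: case 0 specialized, no switch. */
static int hot_case(int arg) { return arg * 2 + 1; }

int optimized_dispatch(int kind, int arg) {
    if (atomic_begin() == 0) {
        /* Hot path optimized under the assumption kind == 0; no compensation
         * code is needed because an abort restores the checkpointed state. */
        if (kind != 0)
            atomic_abort();         /* speculation invariant failed */
        int r = hot_case(arg);
        atomic_commit();
        return r;
    }
    /* Non-speculative fallback, reached after an abort. */
    return dispatch(kind, arg);
}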
technical symposium on computer science education | 2008
Kenneth Goldman; Paul Gross; Cinda Heeren; Geoffrey L. Herman; Lisa C. Kaczmarczyk; Michael C. Loui; Craig B. Zilles
A Delphi process is a structured multi-step process that uses a group of experts to achieve a consensus opinion. We present the results of three Delphi processes to identify topics that are important and difficult in each of three introductory computing subjects: discrete math, programming fundamentals, and logic design. The topic rankings can be used both to guide the coverage of standardized tests of student learning (i.e., concept inventories) and by instructors to identify which topics merit emphasis.
technical symposium on computer science education | 2010
Geoffrey L. Herman; Michael C. Loui; Craig B. Zilles
A concept inventory (CI) is a standardized assessment tool that evaluates how well a student's conceptual framework matches the accepted conceptual framework of a discipline. In this paper, we present our process for creating and evaluating the alpha version of a CI to assess student understanding of digital logic. We have checked the validity and reliability of the CI through an alpha administration, follow-up interviews with students, analysis of administration results, and expert feedback. So far, the feedback on the digital logic concept inventory is positive and promising.
IBM Journal of Research and Development | 2006
Lee Baugh; Craig B. Zilles
Because they are based on large, content-addressable memories, load-store queues (LSQs) present implementation challenges in superscalar processors. In this paper, we propose an alternate LSQ organization that separates the time-critical forwarding functionality from the process of checking that loads received their correct values. Two main techniques are exploited: First, the store-forwarding logic is accessed only by those loads and stores that are likely to be involved in forwarding, and second, the checking structure is banked by address. The result of these techniques is that the LSQ can be implemented by a collection of small, low-bandwidth structures yielding an estimated three to five times reduction in LSQ dynamic power.
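The address-banked checking idea can be illustrated with the sketch below (an assumption for illustration, not the paper's design): loads record themselves in a small bank selected by their address, so a committing store searches only one small structure for younger loads to the same address.

#include <stdint.h>
#include <stdbool.h>

#define NBANKS        8        /* checking-structure banks */
#define BANK_ENTRIES  16       /* small per-bank capacity  */

typedef struct {
    uint64_t addr;
    uint64_t seq;              /* program-order sequence number */
    bool     valid;
} check_entry;

static check_entry check_bank[NBANKS][BANK_ENTRIES];

/* Bank selection by address: checks for a given address only ever search one
 * small, low-bandwidth bank instead of one large CAM. */
static inline unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr >> 6) % NBANKS);
}

/* Record an executed load for later checking (entries would be freed when the
 * load commits).  Returns false if the bank is full; a real design would
 * stall or replay. */
bool record_load(uint64_t addr, uint64_t seq) {
    check_entry *bank = check_bank[bank_of(addr)];
    for (int i = 0; i < BANK_ENTRIES; i++) {
        if (!bank[i].valid) {
            bank[i] = (check_entry){ addr, seq, true };
            return true;
        }
    }
    return false;
}

/* When a store reaches commit, search only its bank for younger loads to the
 * same address; a hit means the load got a stale value and must replay. */
bool store_check(uint64_t addr, uint64_t store_seq) {
    check_entry *bank = check_bank[bank_of(addr)];
    for (int i = 0; i < BANK_ENTRIES; i++)
        if (bank[i].valid && bank[i].addr == addr && bank[i].seq > store_seq)
            return true;       /* ordering violation detected */
    return false;
}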