Andreas Moshovos
Northwestern University
Publications
Featured research published by Andreas Moshovos.
international symposium on computer architecture | 1997
Andreas Moshovos; Scott E. Breach; T. N. Vijaykumar; Gurindar S. Sohi
Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. Modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict whether the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.
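The selective-speculation idea above can be sketched in software. The following is a minimal Python model, not the paper's hardware design: a per-load-PC table counts past mis-speculations and, once a (made-up) threshold is reached, names the store the load should synchronize with instead of speculating blindly.

```python
class DependencePredictor:
    """Toy memory dependence predictor: speculate a load freely until it
    has mis-speculated `threshold` times, then synchronize it with the
    store that caused the violation."""

    def __init__(self, threshold=2):
        self.misses = {}     # load PC -> count of past mis-speculations
        self.sync_with = {}  # load PC -> store PC to synchronize with
        self.threshold = threshold

    def should_speculate(self, load_pc):
        # Blind speculation only while the load's history looks clean.
        return self.misses.get(load_pc, 0) < self.threshold

    def record_misspeculation(self, load_pc, store_pc):
        self.misses[load_pc] = self.misses.get(load_pc, 0) + 1
        self.sync_with[load_pc] = store_pc

    def record_success(self, load_pc):
        # Decay the counter so the predictor can re-enable speculation.
        if self.misses.get(load_pc, 0) > 0:
            self.misses[load_pc] -= 1
```

A load at PC 0x40 that repeatedly conflicts with a store at PC 0x30 would, after two violations, be held back and synchronized with that store rather than speculated.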
architectural support for programming languages and operating systems | 1998
Amir Roth; Andreas Moshovos; Gurindar S. Sohi
We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By identifying producer-consumer pairs, we construct a compact internal representation for the associated structure and its traversal. To achieve a prefetching effect, a small prefetch engine speculatively traverses this representation ahead of the executing program. Dependence-based prefetching achieves speedups of up to 25% on a suite of pointer-intensive programs.
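The core producer-consumer relation is that the value one load returns becomes the address the next load uses. A minimal Python sketch of the prefetch engine's run-ahead traversal, using a plain dict as memory (addresses and layout are illustrative, not from the paper):

```python
def prefetch_ahead(memory, start, depth=3):
    """Speculatively walk a linked structure: each value fetched is
    itself the address of the next fetch, which is exactly the
    producer-consumer load pair the predictor identified.
    Returns the list of addresses that would be prefetched."""
    prefetched = []
    addr = start
    for _ in range(depth):
        nxt = memory.get(addr)   # loaded value = next node's address
        if nxt is None:          # end of the list (or unknown memory)
            break
        prefetched.append(nxt)
        addr = nxt
    return prefetched

# A linked list laid out in "memory": node 0x100 -> 0x200 -> 0x300.
memory = {0x100: 0x200, 0x200: 0x300}
```

Running `prefetch_ahead(memory, 0x100)` walks two links ahead of the program, touching 0x200 and 0x300 before the executing code asks for them.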
international symposium on computer architecture | 2000
Zhi Alex Ye; Andreas Moshovos; Scott Hauck; Prithviraj Banerjee
Reconfigurable hardware has the potential for significant performance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically-scheduled superscalar processor. Chimaera is capable of performing 9-input/1-output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for execution in the RFU. Chimaera is capable of: (1) collapsing a set of instructions into RFU operations, (2) converting control-flow into RFU operations, and (3) supporting a more powerful fine-grain data-parallel model than that supported by current multimedia extension instruction sets (for integer operations). Using a set of multimedia and communication applications we show that even with simple optimizations, the Chimaera C compiler is able to map 22% of all instructions to the RFU on the average. A variety of computations are mapped into RFU operations ranging from as simple as add/sub-shift pairs to operations of more than 10 instructions including several branches. Timing experiments demonstrate that for a 4-way out-of-order superscalar processor Chimaera results in average performance improvements of 21%, assuming a very aggressive core processor design (most pessimistic RFU latency model) and communication overheads from and to the RFU.
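The Chimaera C compiler maps dataflow subgraphs onto the RFU; as a loose illustration of the simplest case it reports (collapsing add/sub-shift pairs), here is a toy peephole pass in Python. The instruction encoding and the `rfu_shl_add` pseudo-op are invented for this sketch.

```python
def collapse(instrs):
    """Fuse a shift whose result feeds the following add into a single
    pseudo-RFU operation. Instructions are tuples: (op, dest, src...)."""
    out, i = [], 0
    while i < len(instrs):
        a = instrs[i]
        b = instrs[i + 1] if i + 1 < len(instrs) else None
        if (b is not None and a[0] == "shl" and b[0] == "add"
                and a[1] in b[2:]):          # shl's dest is an add source
            out.append(("rfu_shl_add", a, b))  # collapse into one RFU op
            i += 2
        else:
            out.append(a)
            i += 1
    return out
```

For example, `shl t0, r1, 2` followed by `add r2, t0, r3` becomes one RFU operation, while an unrelated `mul` passes through unchanged.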
international symposium on microarchitecture | 1997
Andreas Moshovos; Gurindar S. Sohi
We revisit memory hierarchy design viewing memory as an inter-operation communication agent. This perspective leads to the development of novel methods of performing inter-operation memory communication. We use data dependence prediction to identify and link dependent loads and stores so that they can communicate speculatively without incurring the overhead of address calculation, disambiguation and data cache access. We also use data dependence prediction to convert DEF-store-load-USE chains within the instruction window into DEF-USE chains prior to address calculation and disambiguation. We use true and output data dependence status prediction to introduce and manage a small storage structure called the transient value cache (TVC). The TVC captures memory values that are short-lived. It also captures recently stored values that are likely to be accessed soon. Accesses that are serviced by the TVC do not have to be serviced by other parts of the memory hierarchy, e.g., the data cache. The first two techniques are aimed at reducing the effective communication latency whereas the last technique is aimed at reducing data cache bandwidth requirements. Experimental analysis of the proposed techniques shows that: the proposed speculative communication methods correctly handle a large fraction of memory dependences; and a large number of the loads and stores do not have to ever reach the data cache when the TVC is in place.
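The transient value cache can be modeled as a tiny FIFO buffer of recent store values. This Python sketch captures only the bandwidth-filtering behavior (hits never reach the data cache); the capacity and replacement policy are assumptions, not the paper's design.

```python
from collections import OrderedDict

class TransientValueCache:
    """Toy TVC: a small FIFO of recently stored values. A load that hits
    here is serviced without touching the data cache; evicted values are
    assumed to be written back to the rest of the hierarchy."""

    def __init__(self, capacity=4):
        self.entries = OrderedDict()  # address -> value, oldest first
        self.capacity = capacity

    def store(self, addr, value):
        if addr in self.entries:
            del self.entries[addr]    # refresh the entry's position
        self.entries[addr] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict oldest entry

    def load(self, addr):
        return self.entries.get(addr)  # None means TVC miss
```

Short-lived values (stored, read once, then overwritten) live and die inside this buffer, which is how the TVC shields the data cache from their traffic.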
international conference on supercomputing | 1999
Amir Roth; Andreas Moshovos; Gurindar S. Sohi
We introduce dependence-based pre-computation as a complement to history-based target prediction schemes. We present pre-computation in the context of virtual function calls (v-calls), a class of control transfers that is becoming increasingly important and has resisted conventional prediction. Our proposed technique dynamically identifies the sequence of operations that computes a v-call’s target. When the first instruction in such a sequence is encountered, a small execution engine speculatively and aggressively pre-executes the rest. The pre-computed target is stored and subsequently used when a prediction needs to be made. We show that a common v-call instruction sequence can be exploited to implement pre-computation using a previously proposed prefetching mechanism and minimal additional hardware. In a suite of C++ programs, dependence-based pre-computation eliminates 46% of the mispredictions incurred by a simple BTB and 24% of those associated with a path-based two-level predictor.
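The sequence that computes a v-call target is short and fixed: load the object's vtable pointer, then load the method slot. A minimal Python model of the pre-execution step, with memory as a dict and all addresses and offsets invented for illustration:

```python
def precompute_target(memory, obj_addr, vtable_off=0, slot_off=8):
    """Pre-execute the v-call address chain: object -> vtable -> target.
    In the paper this runs on a small engine ahead of the main pipeline;
    the result is cached and used when a target prediction is needed."""
    vtable = memory[obj_addr + vtable_off]  # first load: vtable pointer
    return memory[vtable + slot_off]        # second load: method slot

# Hypothetical layout: object at 0x1000 points to a vtable at 0x2000,
# whose slot at offset 8 holds the call target 0x4242.
memory = {0x1000: 0x2000, 0x2008: 0x4242}
```

When the engine sees the first load of this chain for object 0x1000, it can pre-compute the target 0x4242 well before the indirect call is fetched.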
international symposium on microarchitecture | 1999
Andreas Moshovos; Gurindar S. Sohi
We revisit memory hierarchy design viewing memory as an inter-operation communication mechanism. We show how dynamically collected information about inter-operation memory communication can be used to improve memory latency. We propose two techniques: (1) Speculative Memory Cloaking, and (2) Speculative Memory Bypassing. In the first technique, we use memory dependence prediction to speculatively identify dependent loads and stores early in the pipeline. These instructions may then communicate prior to address calculation and disambiguation via a fast communication mechanism. In the second technique, we use memory dependence prediction to speculatively transform DEF-store-load-USE dependence chains within the instruction window into DEF-USE ones. As a result, dependent stores and loads are taken off the communication path resulting in further reduction in communication latency. Experimental analysis shows that our methods, on the average, correctly handle 40% (integer) and 19% (floating point) of all memory loads. Moreover, our techniques result in performance improvements of 4.28% (integer) and 3.20% (floating point) over a highly aggressive, dynamically scheduled processor implementing naive memory dependence speculation. We also study the value and address locality characteristics of the values our methods correctly handle. We demonstrate that our methods are orthogonal to both address and value prediction.
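Speculative memory cloaking can be sketched as a synonym table shared by a predicted store/load pair: the store deposits its value under a synonym, and the load reads it out before address calculation, with the normal memory access serving only as verification. A toy Python model (the synonym naming and table structure are assumptions):

```python
class CloakingTable:
    """Toy speculative memory cloaking: a load predicted to depend on a
    store obtains the store's value through a shared synonym, bypassing
    address calculation, disambiguation, and the data cache."""

    def __init__(self):
        self.synonym_of = {}  # load PC -> synonym id (store PC here)
        self.values = {}      # synonym id -> most recent stored value

    def learn(self, store_pc, load_pc):
        # A verified dependence links the pair through one synonym.
        self.synonym_of[load_pc] = store_pc

    def on_store(self, store_pc, value):
        self.values[store_pc] = value

    def speculative_value(self, load_pc):
        syn = self.synonym_of.get(load_pc)
        return self.values.get(syn) if syn is not None else None
```

Memory bypassing goes one step further: once the pair is linked, the store's source value is forwarded straight to the load's consumers, taking the store and load off the communication path entirely.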
international symposium on microarchitecture | 1999
Andreas Moshovos; Gurindar S. Sohi
We identify that typical programs exhibit highly regular read-after-read (RAR) memory dependence streams. We exploit this regularity by introducing read-after-read (RAR) memory dependence prediction. We also present two RAR memory dependence prediction-based memory latency reduction techniques. In the first technique, a load can obtain a value by simply naming a preceding load with which a RAR dependence is predicted. The second technique speculatively converts a series of LOAD_1-USE_1, ..., LOAD_N-USE_N chains into a single LOAD_1-USE_1...USE_N producer/consumer graph. Our techniques can be implemented as surgical extensions to the recently proposed read-after-write (RAW) dependence prediction based speculative memory cloaking and speculative memory bypassing. On average, our techniques provide correct values for an additional 20% (integer codes) and 30% (floating-point codes) of all loads. Moreover, a combined RAW- and RAR-based cloaking/bypassing mechanism improves performance by 6.44% (integer) and 4.66% (floating-point) even when naive memory dependence speculation is used. The original RAW-based cloaking/bypassing mechanism yields improvements of 4.28% (integer) and 3.20% (floating-point).
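Learning a RAR dependence only requires noticing that two loads read the same address: the later load can then name the earlier one as its value producer. A minimal Python sketch of the learning step (table organization is assumed, not the paper's):

```python
class RARPredictor:
    """Toy RAR dependence learner: remember the last load to read each
    address; when another load reads the same address, link the pair so
    the consumer can later obtain its value by naming the producer."""

    def __init__(self):
        self.last_reader = {}  # address -> (load PC, value) of last read
        self.pairs = {}        # consumer load PC -> producer load PC

    def observe_load(self, pc, addr, value):
        if addr in self.last_reader:
            producer_pc, _ = self.last_reader[addr]
            self.pairs[pc] = producer_pc  # RAR dependence learned
        self.last_reader[addr] = (pc, value)

    def predicted_producer(self, pc):
        return self.pairs.get(pc)  # None: no RAR dependence predicted
```

Two loads at PCs 0x10 and 0x20 reading address 0xA0 become a producer/consumer pair, so the next instance of the load at 0x20 can take its value from the load at 0x10 without a memory access.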
high performance computer architecture | 2000
Andreas Moshovos; Gurindar S. Sohi
We consider a variety of dynamic, hardware-based methods for exploiting load/store parallelism, including mechanisms that use memory dependence speculation. While previous work has also investigated such methods, this has been done primarily for split, distributed window processor models. We focus on centralized continuous-window processor models (the common configuration today). We confirm that exploiting load/store parallelism can greatly improve performance. Moreover, we show that much of this performance potential can be captured if addresses of the memory locations accessed by both loads and stores can be used to schedule loads. However, using addresses to schedule load execution may not always be an option due to complexity, latency, and cost considerations. For this reason, we also consider configurations that use just memory dependence speculation to guide load execution. We consider a variety of methods and show that speculation/synchronization can be used to effectively exploit virtually all load/store parallelism. We demonstrate that this technique is competitive to or better than the one that uses addresses for scheduling loads. We conclude by discussing why our findings differ, in part, from those reported for split, distributed window processor models.
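The speculation/synchronization policy above can be sketched as a small table: a load with no predicted dependence executes immediately, while one predicted to conflict waits until the named store signals completion. A toy Python model (structure and interface are illustrative):

```python
class SyncTable:
    """Toy speculation/synchronization: loads predicted independent run
    at once; loads predicted to depend on a store wait for that store."""

    def __init__(self):
        self.completed_stores = set()
        self.waiting = {}  # store PC -> load PCs waiting on it

    def load_ready(self, load_pc, predicted_store_pc):
        if predicted_store_pc is None:       # no predicted dependence:
            return True                      # speculate immediately
        if predicted_store_pc in self.completed_stores:
            return True                      # producer already done
        self.waiting.setdefault(predicted_store_pc, []).append(load_pc)
        return False                         # hold the load back

    def store_complete(self, store_pc):
        self.completed_stores.add(store_pc)
        return self.waiting.pop(store_pc, [])  # loads now released
```

Because only the mispredicting loads ever stall, this scheme approaches the load/store parallelism of address-based scheduling without needing addresses early.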
Archive | 1998
Andreas Moshovos; Gurindar S. Sohi
Archive | 1996
Andreas Moshovos; Scott E. Breach; T. N. Vijaykumar; Gurindar S. Sohi