Publication


Featured research published by Jared Stark.


High-Performance Computer Architecture | 2003

Runahead execution: an alternative to very large instruction windows for out-of-order processors

Onur Mutlu; Jared Stark; Chris Wilkerson; Yale N. Patt

Today's high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large in terms of both design complexity and power consumption. And the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations, allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel® Pentium® processor, having a 128-entry instruction window, adding runahead execution improves the IPC (instructions per cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.
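
To make the control flow concrete, the following is a minimal, purely illustrative Python sketch of runahead execution under simplified assumptions: a toy cache of block addresses, a fixed miss latency, and no modeling of invalid results or the instruction window. It is not the machine model evaluated in the paper; it only shows how running ahead under a blocking miss turns later misses into overlapped prefetches.

```python
# Illustrative toy model of runahead execution (not the paper's simulator).
# An "instruction" is ("load", addr) or ("alu",). A load that misses the
# toy cache stalls for MISS_LATENCY cycles before it can complete.

MISS_LATENCY = 100

def run(trace, cache, runahead=True):
    cycles = 0
    for i, op in enumerate(trace):
        if op[0] == "load" and op[1] not in cache:
            cycles += MISS_LATENCY              # the blocking miss is serviced
            if runahead:
                # Checkpoint state (nothing to save in this toy model) and run
                # ahead for a bounded number of instructions under the shadow
                # of the miss, issuing prefetches for any further misses found.
                for future in trace[i + 1 : i + 64]:
                    if future[0] == "load" and future[1] not in cache:
                        cache.add(future[1])    # overlapped prefetch
                # On miss return, the checkpoint is restored and execution
                # re-starts from the blocking load; runahead results are discarded.
            cache.add(op[1])
        cycles += 1
    return cycles

trace = [("load", a) for a in (0, 64, 128, 192)] + [("alu",)] * 10
print("no runahead:", run(list(trace), set(), runahead=False))
print("runahead:   ", run(list(trace), set()))
```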


International Symposium on Microarchitecture | 2000

On pipelining dynamic instruction scheduling logic

Jared Stark; Mary D. Brown; Yale N. Patt

A machine's performance is the product of its IPC (Instructions Per Cycle) and clock frequency. Recently, S. Palacharla et al. (1997) warned that the dynamic instruction scheduling logic for current machines performs an atomic operation: either you sacrifice IPC by pipelining this logic, thereby eliminating its ability to execute dependent instructions in consecutive cycles, or you sacrifice clock frequency by not pipelining it, performing this atomic operation in a single long cycle. Both alternatives are unacceptable for high performance. The paper offers a third, acceptable alternative: pipelined scheduling with speculative wakeup. This technique pipelines the scheduling logic without eliminating its ability to execute dependent instructions in consecutive cycles. With this technique, you sacrifice little IPC, and no clock frequency. Our results show that on the SPECint95 benchmarks, a machine using this technique has an average IPC that is 13% greater than the IPC of a baseline machine that pipelines the scheduling logic but sacrifices the ability to execute dependent instructions in consecutive cycles, and within 2% of the IPC of a conventional machine that uses single-cycle scheduling logic.


International Symposium on Microarchitecture | 2001

Select-free instruction scheduling logic

Mary D. Brown; Jared Stark; Yale N. Patt

Pipelining allows processors to exploit parallelism. Unfortunately, critical loops, pieces of logic that must evaluate in a single cycle to meet IPC (Instructions Per Cycle) goals, prevent deeper pipelining. In today's processors, one of these loops is the instruction scheduling (wakeup and select) logic [10]. This paper describes a technique that pipelines this loop by breaking it into two smaller loops: a critical, single-cycle loop for wakeup, and a noncritical, potentially multi-cycle loop for select. For the 12 SPECint2000 benchmarks, a machine with two-cycle select logic (i.e., three-cycle scheduling logic) using this technique has an average IPC 15% greater than a machine with three-cycle pipelined conventional scheduling logic, and an IPC within 3% of a machine of the same pipeline depth and one-cycle (ideal) scheduling logic. Since select accounts for more than half the scheduling latency [10], this technique could significantly increase clock frequency while having minimal impact on IPC.


IEEE Computer | 1997

One billion transistors, one uniprocessor, one chip

Yale N. Patt; Sanjay Jeram Patel; Marius Evers; Daniel Holmes Friendly; Jared Stark

Billion-transistor processors will be much as they are today, just bigger, faster and wider (issuing more instructions at once). The authors describe the key problems (instruction supply, data memory supply and an implementable execution core) that prevent current superscalar computers from scaling up to 16- or 32-instruction issue. They propose using out-of-order fetching, multi-hybrid branch predictors and trace caches to improve the instruction supply. They predict that replicated first-level caches, huge on-chip caches and data value speculation will enhance the data supply. To provide a high-speed, implementable execution core that is capable of sustaining the necessary instruction throughput, they advocate a large, out-of-order-issue instruction window (2,000 instructions), clustered (separated) banks of functional units and hierarchical scheduling of ready instructions. They contend that the current uniprocessor model can provide sufficient performance and use a billion transistors effectively without changing the programming model or discarding software compatibility.


International Conference on Supercomputing | 2002

Bloom filtering cache misses for accurate data speculation and prefetching

Jih-Kwon Peir; Shih-Chang Lai; Shih-Lien Lu; Jared Stark; Konrad K. Lai

A processor must know a load instruction's latency to schedule the load's dependent instructions at the correct time. Unfortunately, modern processors do not know this latency until well after the dependent instructions should have been scheduled to avoid pipeline bubbles between themselves and the load. One solution to this problem is to predict the load's latency by predicting whether the load will hit or miss in the data cache. Existing cache hit/miss predictors, however, can only correctly predict about 50% of cache misses. This paper introduces a new hit/miss predictor that uses a Bloom Filter to identify cache misses early in the pipeline. This early identification of cache misses allows the processor to more accurately schedule instructions that are dependent on loads and to more precisely prefetch data into the cache. Simulations using a modified SimpleScalar model show that the proposed Bloom Filter is nearly perfect, with a prediction accuracy greater than 99% for the SPECint2000 benchmarks. IPC (Instructions Per Cycle) performance improved by 19% over a processor that delayed the scheduling of instructions dependent on a load until the load latency was known, and by 6% and 7% over processors that, respectively, always predicted a load would hit the cache and used a counter-based hit/miss predictor. This IPC reaches 99.7% of the IPC of a processor with perfect scheduling.
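
As a rough illustration of the idea (not the specific Bloom Filter organization evaluated in the paper), the sketch below keeps a small counting Bloom filter over the block addresses currently resident in the data cache. A load is predicted to hit only if every hashed counter is non-zero; counters are incremented on fills and decremented on evictions, so a resident block is never mispredicted as a miss, and aliasing can only cause occasional false hit predictions. The hash functions and table size are arbitrary choices for illustration.

```python
# Minimal counting Bloom filter used as a cache hit/miss predictor (sketch).

class BloomHitMissPredictor:
    def __init__(self, size=1024):
        self.size = size
        self.counters = [0] * size

    def _indices(self, block_addr):
        # Two cheap hashes of the block address (illustrative, not from the paper).
        return (block_addr % self.size,
                (block_addr * 2654435761 >> 7) % self.size)

    def on_fill(self, block_addr):          # a cache line is allocated
        for i in self._indices(block_addr):
            self.counters[i] += 1

    def on_evict(self, block_addr):         # a cache line is evicted
        for i in self._indices(block_addr):
            self.counters[i] -= 1

    def predict_hit(self, block_addr):
        # Predict hit only if every counter is non-zero; a resident block can
        # therefore never be mispredicted as a miss.
        return all(self.counters[i] > 0 for i in self._indices(block_addr))

pred = BloomHitMissPredictor()
pred.on_fill(0x1000 >> 6)                   # block for address 0x1000
print(pred.predict_hit(0x1000 >> 6))        # True  (resident, schedule dependents early)
print(pred.predict_hit(0x2000 >> 6))        # False (predicted miss, schedule late)
```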


IEEE Micro | 2003

Runahead execution: An effective alternative to large instruction windows

Onur Mutlu; Jared Stark; Chris Wilkerson; Yale N. Patt

An instruction window that can tolerate latencies to DRAM memory is prohibitively complex and power hungry. To avoid having to build such large windows, runahead execution uses otherwise-idle clock cycles to achieve an average 22 percent performance improvement for processors with instruction windows of contemporary sizes. This technique incurs only a small hardware cost and does not significantly increase the processor's complexity.


Architectural Support for Programming Languages and Operating Systems | 1998

Variable length path branch prediction

Jared Stark; Marius Evers; Yale N. Patt

Accurate branch prediction is required to achieve high performance in deeply pipelined, wide-issue processors. Recent studies have shown that conditional and indirect (or computed) branch targets can be accurately predicted by recording the path, which consists of the target addresses of recent branches, leading up to the branch. In current path-based branch predictors, the N most recent target addresses are hashed together to form an index into a table, where N is some fixed integer. The indexed table entry is used to make a prediction for the current branch. This paper introduces a new branch predictor in which the value of N is allowed to vary. By constructing the index into the table using the last N target addresses, and using profiling information to select the proper value of N for each branch, extremely accurate branch prediction is achieved. For the SPECint95 gcc benchmark, this new predictor has a conditional branch misprediction rate of 4.3% given a 4K-byte hardware budget. For comparison, the gshare predictor, a predictor known for its high accuracy, has a conditional branch misprediction rate of 8.8% given the same hardware budget. For the indirect branches in gcc, the new predictor achieves a misprediction rate of 27.7% when given a hardware budget of 512 bytes, whereas the best competing predictor achieves a misprediction rate of 44.2% when given the same hardware budget.
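
A simplified Python sketch of the indexing idea follows: keep a short history of recent branch target addresses, look up a profiled path length N for each static branch, hash the last N targets together with the branch PC, and use the result to index a table of two-bit counters. The hash function, table size, default N, and the per-branch profile table are illustrative assumptions, not the paper's design.

```python
from collections import deque

TABLE_SIZE = 4096                      # 2-bit counter entries (illustrative)

class VariableLengthPathPredictor:
    def __init__(self, profiled_n):
        self.table = [2] * TABLE_SIZE  # 2-bit counters, initialized weakly taken
        self.path = deque(maxlen=16)   # target addresses of recent branches
        self.profiled_n = profiled_n   # per-branch path length from profiling

    def _index(self, pc):
        n = self.profiled_n.get(pc, 4) # default length is an assumption
        idx = pc
        for target in list(self.path)[-n:]:
            idx = (idx << 3) ^ target  # simple fold of the last N targets
        return idx % TABLE_SIZE

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2      # predict taken if counter >= 2

    def update(self, pc, taken, target):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.path.append(target)       # record the path actually followed

# Hypothetical profile: the branch at 0x400a10 predicts best with the last 8 targets.
bp = VariableLengthPathPredictor({0x400a10: 8})
print(bp.predict(0x400a10))
bp.update(0x400a10, taken=True, target=0x400a40)
```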


International Symposium on Microarchitecture | 2005

Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution

Hyesoon Kim; Onur Mutlu; Jared Stark; Yale N. Patt

Predicated execution has been used to reduce the number of branch mispredictions by eliminating hard-to-predict branches. However, the additional instruction overhead and additional data dependencies due to predicated execution sometimes offset the performance advantage of having fewer mispredictions. We propose a mechanism in which the compiler generates code that can be executed either as predicated code or non-predicated code (i.e., code with normal conditional branches). The hardware decides whether the predicated code or the non-predicated code is executed based on a run-time confidence estimation of the branch's prediction. The code generated by the compiler is the same as predicated code, except the predicated conditional branches are NOT removed; they are left intact in the program code. These conditional branches are called wish branches. The goal of wish branches is to use predicated execution for hard-to-predict dynamic branches and branch prediction for easy-to-predict dynamic branches, thereby obtaining the best of both worlds. We also introduce a class of wish branches, called wish loops, which utilize predication to reduce the misprediction penalty for hard-to-predict backward (loop) branches. We describe the semantics, types, and operation of wish branches along with the software and hardware support required to generate and utilize them. Our results show that wish branches decrease the average execution time of a subset of SPEC INT 2000 benchmarks by 14.2% compared to traditional conditional branches and by 13.3% compared to the best-performing predicated code binary.
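
The run-time decision can be pictured with the toy sketch below: for each fetched wish branch, a confidence estimate determines whether to follow the branch predictor (treating the wish branch as a normal conditional branch and fetching only the predicted path) or to fall back to the predicated version of the code (fetching both paths and resolving with predicates). The confidence threshold and the string return values are placeholder assumptions; the actual ISA and hardware support are described in the paper.

```python
# Toy model of the fetch-time decision for a wish branch (illustrative only).

CONFIDENCE_THRESHOLD = 12          # arbitrary threshold for "high confidence"

def handle_wish_branch(confidence, predict_taken):
    """Decide how to execute a wish branch given a confidence estimate."""
    if confidence >= CONFIDENCE_THRESHOLD:
        # High confidence: act like a normal conditional branch and fetch
        # only the predicted path; a misprediction flushes as usual.
        return "branch: taken path" if predict_taken else "branch: fall-through path"
    # Low confidence: execute the predicated version the compiler already
    # generated, fetching both paths and discarding results via predicates.
    return "predicated: both paths"

print(handle_wish_branch(confidence=15, predict_taken=True))   # branch mode
print(handle_wish_branch(confidence=3, predict_taken=True))    # predicated mode
```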


International Conference on Parallel Architectures and Compilation Techniques | 1996

The effects of mispredicted-path execution on branch prediction structures

Stéphan Jourdan; Tse-Hao Hsing; Jared Stark; Yale N. Patt

Branch prediction accuracies determined using trace-driven simulation do not include the effects of executing branches along a mispredicted path. However, branches along a mispredicted path will pollute the branch prediction structures if no recovery mechanisms are provided. Without recovery mechanisms, prediction rates will suffer. In this paper, we determine the appropriateness of recovery mechanisms for the four structures of the Two-Level Adaptive Branch Predictor: the Branch Target Buffer (BTB), the Branch History Register (BHR), the Pattern History Tables (PHTs), and the Return Address Stack (RAS). We then propose cost-effective recovery mechanisms for these branch prediction structures. For five benchmarks from the SPECint92 suite we show that performance is not affected if recovery mechanisms are not provided for the BTB and the PHTs. On the other hand, without any recovery mechanisms for the BHR and RAS, performance drops by an average of and 29%.
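
One natural recovery mechanism for the BHR is to checkpoint it alongside each predicted branch and restore the checkpoint when a misprediction is detected; the RAS can be handled analogously by saving its top-of-stack pointer. The brief sketch below illustrates that generic checkpointing idea under those assumptions; it is not necessarily the exact cost-reduced mechanisms proposed in the paper.

```python
# Checkpoint/restore of a global Branch History Register (BHR) across
# speculative branches. Simplified illustration, not the paper's design.

BHR_BITS = 12

class SpeculativeBHR:
    def __init__(self):
        self.bhr = 0
        self.checkpoints = {}              # branch tag -> BHR value before update

    def predict_and_update(self, tag, predicted_taken):
        self.checkpoints[tag] = self.bhr   # save state for possible recovery
        self.bhr = ((self.bhr << 1) | int(predicted_taken)) & ((1 << BHR_BITS) - 1)

    def resolve(self, tag, actual_taken, predicted_taken):
        saved = self.checkpoints.pop(tag)
        if actual_taken != predicted_taken:
            # Misprediction: rebuild the BHR from the checkpoint using the
            # actual outcome, discarding pollution from the wrong path.
            self.bhr = ((saved << 1) | int(actual_taken)) & ((1 << BHR_BITS) - 1)

bhr = SpeculativeBHR()
bhr.predict_and_update(tag=1, predicted_taken=True)
bhr.resolve(tag=1, actual_taken=False, predicted_taken=True)
print(bin(bhr.bhr))                        # history reflects the actual outcome
```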


International Symposium on Microarchitecture | 1997

Reducing the performance impact of instruction cache misses by writing instructions into the reservation stations out-of-order

Jared Stark; Paul Racunas; Yale N. Patt

In conventional processors, each instruction cache fetch brings in a group of instructions. Upon encountering an instruction cache miss, the processor will wait until the instruction cache miss is serviced before continuing to fetch any new instructions. The paper presents a new technique, called out-of-order issue, which allows the processor to temporarily ignore the instructions associated with the instruction cache miss. The processor attempts to fetch the instructions that follow the group of instructions associated with the miss. These instructions are then decoded and written into the processor's reservation stations. Later, after the instruction cache miss has been serviced, the instructions associated with the miss are decoded and written into the reservation stations. (We use the term issue to indicate the act of writing instructions into the reservation stations. With this technique, instructions are not written into the reservation stations in program order; hence, the term out-of-order issue.) We introduce the concept of out-of-order issue, describe its implementation, and present some initial data showing the performance gains possible with out-of-order issue.
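
The sketch below gives a toy version of the fetch/issue policy described above: when a fetch group misses in the instruction cache, the front end records the gap, continues decoding and issuing the following groups into the reservation stations, and issues the missing group once its cache fill completes. The data structures, timing, and per-group granularity are placeholder assumptions for illustration.

```python
# Toy front end illustrating out-of-order issue into reservation stations.
# Each fetch group is (group_id, hits_icache); groups that miss are deferred.

def fetch_and_issue(fetch_groups, miss_latency=3):
    reservation_stations = []      # issue order (may differ from program order)
    pending = []                   # (ready_cycle, group_id) for groups that missed
    cycle = 0
    for group_id, hits in fetch_groups:
        cycle += 1
        # First, issue any previously missed group whose fill has completed.
        for ready, gid in list(pending):
            if ready <= cycle:
                reservation_stations.append(gid)
                pending.remove((ready, gid))
        if hits:
            reservation_stations.append(group_id)             # issue immediately
        else:
            pending.append((cycle + miss_latency, group_id))  # skip past the miss
    for _, gid in sorted(pending):                            # drain remaining fills
        reservation_stations.append(gid)
    return reservation_stations

groups = [(0, True), (1, False), (2, True), (3, True), (4, True), (5, True)]
print(fetch_and_issue(groups))     # [0, 2, 3, 1, 4, 5]: group 1 issues after its fill
```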

Collaboration


Dive into Jared Stark's collaborations.

Top Co-Authors

Yale N. Patt

University of Texas at Austin

Mary D. Brown

University of Texas at Austin

Hyesoon Kim

Georgia Institute of Technology
