Matthew K. Farrens
University of California, Davis
Publications
Featured research published by Matthew K. Farrens.
international symposium on computer architecture | 2000
Mark Oskin; Frederic T. Chong; Matthew K. Farrens
As microprocessors continue to evolve, many optimizations reach a point of diminishing returns. We introduce HLS, a hybrid processor simulator which uses statistical models and symbolic execution to evaluate design alternatives. This simulation methodology allows for quick and accurate contour maps to be generated of the performance space spanned by design parameters. We validate the accuracy of HLS through correlation with existing cycle-by-cycle simulation techniques and current generation hardware. We demonstrate the power of HLS by exploring design spaces defined by two parameters: code properties and value prediction. These examples illustrate how HLS can be used to set design goals and individual component performance targets.
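To make the statistical-simulation idea concrete, here is a minimal sketch (a toy model of the general approach, not HLS itself): a single code property, the probability that an instruction depends on its predecessor, drives a trivial two-wide in-order issue model that estimates IPC. The name estimate_ipc and the parameter p_dep are illustrative assumptions, not identifiers from the paper.

import random

def estimate_ipc(p_dep, n=100_000, seed=1):
    """Two-wide in-order issue: an instruction pair issues together only if
    the second instruction does not depend on the first."""
    rng = random.Random(seed)
    cycles = issued = 0
    while issued < n:
        if issued + 1 < n and rng.random() > p_dep:
            issued += 2     # independent pair issues in one cycle
        else:
            issued += 1     # a dependence (or stream end) forces serial issue
        cycles += 1
    return n / cycles

for p in (0.0, 0.25, 0.5, 1.0):
    print(f"P(dependent) = {p:.2f} -> estimated IPC = {estimate_ipc(p):.2f}")

Sweeping such statistical parameters, rather than re-running a detailed trace-driven simulation for every point, is what makes contour maps of a design space cheap to generate.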
international symposium on computer architecture | 1991
Matthew K. Farrens; Arvin Park
When address reference streams exhibit high degrees of spatial and temporal locality, many of the higher order address lines carry redundant information. By caching the higher order portions of address references in a set of dynamically allocated base registers, it becomes possible to transmit small register indices between the processor and memory instead of the high order address bits themselves. Trace driven simulations indicate that this technique can significantly reduce processor-to-memory address bus width without an appreciable loss in performance, thereby increasing available processor bandwidth. Our results imply that as much as 25% of the available I/O bandwidth of a processor is used less than 1% of the time.
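A minimal sketch of the caching mechanism described above, assuming an LRU-managed register file and illustrative parameters (16 base registers, 8 low-order bits sent verbatim); simulate_base_register_cache is a hypothetical name, not code from the paper.

from collections import OrderedDict

def simulate_base_register_cache(addresses, num_regs=16, low_bits=8):
    """Return the fraction of references whose high-order bits hit in the
    base-register cache (LRU replacement assumed)."""
    regs = OrderedDict()              # high-order bits -> allocated base register
    hits = 0
    for addr in addresses:
        high = addr >> low_bits
        if high in regs:
            hits += 1
            regs.move_to_end(high)            # refresh LRU position
        else:
            if len(regs) >= num_regs:
                regs.popitem(last=False)      # reclaim the least recently used register
            regs[high] = True
    return hits / len(addresses)

# A strided stream with strong spatial locality hits almost every time.
trace = [0x1000 + 4 * i for i in range(1000)]
print(f"index hit rate: {simulate_base_register_cache(trace):.1%}")

On a hit, only the register index and the low-order bits need to cross the processor-memory interface; a miss requires sending the full address and allocating a register at both ends.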
international symposium on computer architecture | 1991
Matthew K. Farrens; Andrew R. Pleszkun
Deeply pipelined processors have relatively low issue rates due to dependencies between instructions. In this paper we examine the possibility of interleaving a second stream of instructions into the pipeline, which would issue instructions during the cycles the first stream was unable to. Such an interleaving has the potential to significantly increase the throughput of a processor without seriously impairing the execution of either process. We propose a dynamic interleaving of at most two instruction streams, which share the pipelined functional units of a machine. To support the interleaving of two instruction streams, a number of interleaving policies are described and discussed. Finally, the amount of improvement in processor throughput is evaluated by simulating the interleaving policies for several machine variants.
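The core idea, filling issue slots that the primary stream cannot use, can be illustrated with a toy model. The fixed stall probability and the strictly prioritized policy below are assumptions made for illustration; the paper itself evaluates several interleaving policies with a detailed simulator.

import random

def issue_rate(streams, cycles=100_000, stall_prob=0.4, seed=0):
    """Each cycle the primary stream gets the single issue slot; if it is
    stalled, the next stream is tried. Returns instructions issued per cycle."""
    rng = random.Random(seed)
    issued = 0
    for _ in range(cycles):
        for _stream in range(streams):
            if rng.random() > stall_prob:    # this stream can issue this cycle
                issued += 1
                break                        # only one issue slot per cycle
    return issued / cycles

print(f"one stream : {issue_rate(1):.2f} instructions per cycle")
print(f"two streams: {issue_rate(2):.2f} instructions per cycle")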
international conference on supercomputing | 1998
Jude A. Rivers; Edward S. Tam; Gary S. Tyson; Edward S. Davidson; Matthew K. Farrens
As microprocessor speeds continue to outgrow memory subsystem speeds, minimizing the average data access time grows in importance. As current data caches are often poorly and inefficiently managed, a good management technique can improve the average data access time. This paper presents a comparative evaluation of two approaches that utilize reuse information for more efficiently managing the first-level cache. While one approach is based on the effective address of the data being referenced, the other uses the program counter of the memory instruction generating the reference. Our evaluations show that using effective address reuse information performs better than using program counter reuse information. In addition, we show that the victim cache performs best for multi-lateral caches with a direct-mapped main cache and high L2 cache latency, while the NTS (effective-address-based) approach performs better as the L2 latency decreases or the associativity of the main cache increases.
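The following is a rough sketch of the general placement idea behind reuse-guided multi-lateral caching, not the exact NTS algorithm from the paper: reuse is tracked per effective-address block, and blocks whose previous residency showed no reuse are steered into a small bypass buffer rather than the main cache. The class name, structure sizes, and LRU policy are all assumptions made for illustration.

from collections import OrderedDict

class ReuseGuidedCache:
    """Toy multi-lateral cache: blocks whose previous residency showed no
    reuse are steered into a small bypass buffer instead of the main cache."""

    def __init__(self, main_blocks=64, bypass_blocks=8):
        self.main_blocks = main_blocks
        self.bypass_blocks = bypass_blocks
        self.main = OrderedDict()    # block address -> "reused this residency" flag
        self.bypass = OrderedDict()
        self.reuse_hint = {}         # block address -> was it reused last time?

    def access(self, block):
        for cache in (self.main, self.bypass):
            if block in cache:
                cache[block] = True          # reuse observed during this residency
                cache.move_to_end(block)     # LRU update
                return "hit"
        # Miss: placement is guided by the block's last observed reuse behavior.
        if self.reuse_hint.get(block, True):
            target, limit = self.main, self.main_blocks
        else:
            target, limit = self.bypass, self.bypass_blocks
        if len(target) >= limit:
            victim, was_reused = target.popitem(last=False)   # evict LRU block
            self.reuse_hint[victim] = was_reused              # remember its behavior
        target[block] = False
        return "miss"

cache = ReuseGuidedCache(main_blocks=4, bypass_blocks=2)
stream = [0, 1, 0, 1] * 3 + list(range(100, 110)) + [0, 1]   # reused blocks plus a streaming burst
print(sum(cache.access(b) == "hit" for b in stream), "hits out of", len(stream))

A program-counter-based variant would key reuse_hint on the address of the memory instruction rather than on the data block; the paper compares the two forms of reuse information.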
international symposium on microarchitecture | 1992
Gary S. Tyson; Matthew K. Farrens; Andrew R. Pleszkun
This paper describes a single chip Multiple Instruction Stream Computer (MISC) capable of extracting instruction level parallelism from a broad spectrum of programs. The MISC architecture uses multiple asynchronous processing elements to separate a program into streams that can be executed in parallel, and integrates a conflict-free message passing system into the lowest level of the processor design to facilitate low latency intra-MISC communication. This approach allows for increased machine parallelism with minimal code expansion, and provides an alternative approach to single instruction stream multi-issue machines such as SuperScalar and VLIW. The MISC processor, a direct descendant of the PIPE project [CGKP87, GHLP85], will exploit both the instruction and data parallelism available in a task by combining the capabilities of traditional data parallel architectures with those found in machines designed to exploit instruction level parallelism. Unlike the two-processor PIPE design, the MISC system is capable of balancing the processor load of instructions performing memory access and execute operations among four processors. The characteristics of the MISC design also allow the introduction of a number of new and unique instructions, like the Sentinel and Vector instructions described later in this paper. As its name indicates, MISC is composed of multiple Processing Elements (PEs) which cooperate in the execution of a task.
international symposium on computer architecture | 1994
Matthew K. Farrens; Gary S. Tyson; Andrew R. Pleszkun
This paper presents a trace-driven simulation-based study of a wide range of cache configurations and processor counts. This study was undertaken in an attempt to help answer the question of how best to allocate large numbers of transistors, a question that is rapidly increasing in importance as transistor densities continue to climb. At what point does continuing to increase the size of the on-chip first level cache cease to provide sufficient increases in hit rate and become prohibitively difficult to access in a single cycle? In order to compare different configurations, the concept of an Equivalent Cache Transistor is presented. Results indicate that the access time of the first-level data cache is more important than the size. In addition, it appears that once approximately 15 million transistors become available, a two processor configuration is preferable to a single processor with correspondingly larger caches.
international symposium on microarchitecture | 1990
Arvin Park; Matthew K. Farrens
The paper presents a technique to reduce processor-to-memory address bandwidth by exploiting temporal and spatial locality in address reference streams. Higher order portions of address words are cached in base registers at both the processor and memory. This makes it possible to transmit small register indexes between processor and memory instead of the high order address bits themselves. Trace driven simulations indicate that base register caching reduces processor-to-memory address bandwidth up to 60% without appreciable loss in performance.
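As a back-of-the-envelope illustration of where a reduction of roughly this size can come from (the concrete widths below are assumptions, not figures from the paper):

# Illustrative widths only; none of these numbers are taken from the paper.
address_bits = 32      # full address width (assumed)
low_bits     = 8       # low-order bits still sent verbatim (assumed)
num_regs     = 16      # base registers at each end (assumed)
index_bits   = num_regs.bit_length() - 1    # 4 bits suffice to name a register

compressed = index_bits + low_bits
print(f"{address_bits}-bit address -> {compressed} bits on a register hit "
      f"({1 - compressed / address_bits:.0%} narrower)")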
high performance computer architecture | 2000
Michael Haungs; Phil Sallee; Matthew K. Farrens
Recent studies have shown significantly improved branch prediction through the use of branch classification. By separating static branches into groups, or classes, with similar dynamic behavior, predictors may be selected that are best suited for each class. Previous methods have classified branches according to taken rate (or bias). We propose a new metric for branch classification: branch transition rate, which is defined as the number of times a branch changes direction between taken and not taken during execution. We show that transition rate is a more appropriate indicator of branch behavior than taken rate for determining predictor performance. When both metrics are combined, an even clearer picture of dynamic branch behavior emerges, in which expected predictor performance for a branch is closely correlated with its combined taken and transition rate class. Using this classification, a small group of branches is identified for which two-level predictors are ineffective.
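Both metrics are easy to compute from a branch outcome trace. The sketch below (the classify helper and the two synthetic traces are illustrative, not from the paper) shows why transition rate separates branches that taken rate alone cannot: two branches with identical 50% bias can flip direction on every execution or only once.

def classify(outcomes):
    """Compute (taken rate, transition rate) from a list of branch outcomes,
    where True means the branch was taken."""
    taken_rate = sum(outcomes) / len(outcomes)
    transitions = sum(prev != cur for prev, cur in zip(outcomes, outcomes[1:]))
    transition_rate = transitions / (len(outcomes) - 1)
    return taken_rate, transition_rate

# Two branches with identical 50% bias but very different behavior:
alternating = [i % 2 == 0 for i in range(100)]   # T N T N ... flips every execution
phased      = [i < 50 for i in range(100)]       # T ... T N ... N, flips only once
print(classify(alternating))   # (0.5, 1.0)   -> a simple bimodal predictor does poorly here
print(classify(phased))        # (0.5, ~0.01) -> the same predictor does very well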
IEEE ACM Transactions on Networking | 2014
Paul T. Congdon; Prasant Mohapatra; Matthew K. Farrens; Venkatesh Akella
The Ethernet switch is a primary building block for today's enterprise networks and data centers. As network technologies converge upon a single Ethernet fabric, there is ongoing pressure to improve the performance and efficiency of the switch while maintaining flexibility and a rich set of packet processing features. The OpenFlow architecture aims to provide flexibility and programmable packet processing to meet these converging needs. Of the many ways to create an OpenFlow switch, a popular choice is to make heavy use of ternary content addressable memories (TCAMs). Unfortunately, TCAMs can consume a considerable amount of power and, when used to match flows in an OpenFlow switch, put a bound on switch latency. In this paper, we propose enhancing an OpenFlow Ethernet switch with per-port packet prediction circuitry in order to simultaneously reduce latency and power consumption without sacrificing the rich policy-based forwarding enabled by the OpenFlow architecture. Packet prediction exploits the temporal locality in network communications to predict the flow classification of incoming packets. When predictions are correct, latency can be reduced, and significant power savings can be achieved from bypassing the full lookup process. Simulation studies using actual network traces indicate that correct prediction rates of 97% are achievable using only a small amount of prediction circuitry per port. These studies also show that prediction circuitry can help reduce the power consumed by a lookup process that includes a TCAM by 92% and simultaneously reduce the latency of a cut-through switch by 66%.
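A heavily simplified sketch of the prediction idea follows: each input port remembers the flow of its most recent packet, and a matching next packet can reuse the cached forwarding decision and skip the full TCAM lookup. The single-entry predictor, the simulate_prediction helper, and the toy trace are assumptions for illustration, not the circuitry evaluated in the paper.

def simulate_prediction(packets):
    """packets: iterable of (in_port, flow_key) pairs. Returns the fraction
    of packets whose flow matched the port's most recently seen flow."""
    last_flow = {}          # in_port -> flow key of its most recent packet
    correct = total = 0
    for port, flow in packets:
        total += 1
        if last_flow.get(port) == flow:
            correct += 1    # prediction hit: the full lookup could be bypassed
        last_flow[port] = flow
    return correct / total

# Packet trains (bursts from the same flow) make the predictor effective.
trace = [(0, "flowA")] * 9 + [(0, "flowB")] + [(0, "flowA")] * 9
print(f"correct prediction rate: {simulate_prediction(trace):.0%}")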
IEEE Computer | 1991
Matthew K. Farrens; Andrew R. Pleszkun
The PIPE (parallel instruction with pipelined execution) processor, which is the result of a research project initiated to investigate high-performance computer architectures for VLSI implementation, is described. The lessons learned from the implementation are discussed. The most important result was the discovery that supporting architectural queues does not complicate the instruction issue logic and frees the processor clock rate from external memory speed influences. It was also found that the decision to support an instruction set with two instruction sizes and to allow consecutive two-parcel instruction issues profoundly affected the instruction fetch logic design. Other significant results concerned the issue logic, barrel shifter, cache control logic, and branch count.