Matthew K. Farrens
University of California, Davis
Publications
Featured research published by Matthew K. Farrens.
international symposium on computer architecture | 2000
Mark Oskin; Frederic T. Chong; Matthew K. Farrens
As microprocessors continue to evolve, many optimizations reach a point of diminishing returns. We introduce HLS, a hybrid processor simulator which uses statistical models and symbolic execution to evaluate design alternatives. This simulation methodology allows for quick and accurate contour maps to be generated of the performance space spanned by design parameters. We validate the accuracy of HLS through correlation with existing cycle-by-cycle simulation techniques and current generation hardware. We demonstrate the power of HLS by exploring design spaces defined by two parameters: code properties and value prediction. These examples illustrate how HLS can be used to set design goals and individual component performance targets.
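To make the statistical-simulation idea concrete, here is a minimal sketch (a toy model of the general approach, not HLS itself): a single code property, the probability that an instruction depends on its predecessor, drives a trivial two-wide in-order issue model that estimates IPC. The name estimate_ipc and the parameter p_dep are illustrative assumptions, not identifiers from the paper.

import random

def estimate_ipc(p_dep, n=100_000, seed=1):
    """Two-wide in-order issue: an instruction pair issues together only if
    the second instruction does not depend on the first."""
    rng = random.Random(seed)
    cycles = issued = 0
    while issued < n:
        if issued + 1 < n and rng.random() > p_dep:
            issued += 2     # independent pair issues in one cycle
        else:
            issued += 1     # a dependence (or stream end) forces serial issue
        cycles += 1
    return n / cycles

for p in (0.0, 0.25, 0.5, 1.0):
    print(f"P(dependent) = {p:.2f} -> estimated IPC = {estimate_ipc(p):.2f}")

Sweeping such statistical parameters, rather than re-running a detailed trace-driven simulation for every point, is what makes contour maps of a design space cheap to generate.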
international symposium on computer architecture | 1991
Matthew K. Farrens; Arvin Park
When address reference streams exhibit high degrees of spatial and temporal locality, many of the higher order address lines carry redundant information. By caching the higher order portions of address references in a set of dynamically allocated base registers, it becomes possible to transmit small register indices between the processor and memory instead of the high order address bits themselves. Trace driven simulations indicate that this technique can significantly reduce processor-to-memory address bus width without an appreciable loss in performance, thereby increasing available processor bandwidth. Our results imply that as much as 25% of the available I/O bandwidth of a processor is used less than 1% of the time.
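A minimal sketch of the caching mechanism described above, assuming an LRU-managed register file and illustrative parameters (16 base registers, 8 low-order bits sent verbatim); simulate_base_register_cache is a hypothetical name, not code from the paper.

from collections import OrderedDict

def simulate_base_register_cache(addresses, num_regs=16, low_bits=8):
    """Return the fraction of references whose high-order bits hit in the
    base-register cache (LRU replacement assumed)."""
    regs = OrderedDict()              # high-order bits -> allocated base register
    hits = 0
    for addr in addresses:
        high = addr >> low_bits
        if high in regs:
            hits += 1
            regs.move_to_end(high)            # refresh LRU position
        else:
            if len(regs) >= num_regs:
                regs.popitem(last=False)      # reclaim the least recently used register
            regs[high] = True
    return hits / len(addresses)

# A strided stream with strong spatial locality hits almost every time.
trace = [0x1000 + 4 * i for i in range(1000)]
print(f"index hit rate: {simulate_base_register_cache(trace):.1%}")

On a hit, only the register index and the low-order bits need to cross the processor-memory interface; a miss requires sending the full address and allocating a register at both ends.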
international symposium on computer architecture | 1991
Matthew K. Farrens; Andrew R. Pleszkun
Deeply pipelined processors have relatively low issue rates due to dependencies between instructions. In this paper we examine the possibility of interleaving a second stream of instructions into the pipeline, which would issue instructions during the cycles the first stream was unable to. Such an interleaving has the potential to significantly increase the throughput of a processor without seriously impairing the execution of either process. We propose a dynamic interleaving of at most two instruction streams, which share the pipelined functional units of a machine. To support the interleaving of two instruction streams, a number of interleaving policies are described and discussed. Finally, the amount of improvement in processor throughput is evaluated by simulating the interleaving policies for several machine variants.
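The core idea, filling issue slots that the primary stream cannot use, can be illustrated with a toy model. The fixed stall probability and the strictly prioritized policy below are assumptions made for illustration; the paper itself evaluates several interleaving policies with a detailed simulator.

import random

def issue_rate(streams, cycles=100_000, stall_prob=0.4, seed=0):
    """Each cycle the primary stream gets the single issue slot; if it is
    stalled, the next stream is tried. Returns instructions issued per cycle."""
    rng = random.Random(seed)
    issued = 0
    for _ in range(cycles):
        for _stream in range(streams):
            if rng.random() > stall_prob:    # this stream can issue this cycle
                issued += 1
                break                        # only one issue slot per cycle
    return issued / cycles

print(f"one stream : {issue_rate(1):.2f} instructions per cycle")
print(f"two streams: {issue_rate(2):.2f} instructions per cycle")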
international conference on supercomputing | 1998
Jude A. Rivers; Edward S. Tam; Gary S. Tyson; Edward S. Davidson; Matthew K. Farrens
As microprocessor speeds continue to outgrow memory subsystem speeds, minimizing the average data access time grows in importance. As current data caches are often poorly and inefficiently managed, a good management technique can improve the average data access time. This paper presents a comparative evaluation of two approaches that utilize reuse information for more efficiently managing the first-level cache. While one approach is based on the effective address of the data being referenced, the other uses the program counter of the memory instruction generating the reference. Our evaluations show that using effective address reuse information performs better than using program counter reuse information. In addition, we show that the victim cache performs best for multi-lateral caches with a direct-mapped main cache and high L2 cache latency, while the NTS (effective-address-based) approach performs better as the L2 latency decreases or the associativity of the main cache increases.
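The following is a rough sketch of the general placement idea behind reuse-guided multi-lateral caching, not the exact NTS algorithm from the paper: reuse is tracked per effective-address block, and blocks whose previous residency showed no reuse are steered into a small bypass buffer rather than the main cache. The class name, structure sizes, and LRU policy are all assumptions made for illustration.

from collections import OrderedDict

class ReuseGuidedCache:
    """Toy multi-lateral cache: blocks whose previous residency showed no
    reuse are steered into a small bypass buffer instead of the main cache."""

    def __init__(self, main_blocks=64, bypass_blocks=8):
        self.main_blocks = main_blocks
        self.bypass_blocks = bypass_blocks
        self.main = OrderedDict()    # block address -> "reused this residency" flag
        self.bypass = OrderedDict()
        self.reuse_hint = {}         # block address -> was it reused last time?

    def access(self, block):
        for cache in (self.main, self.bypass):
            if block in cache:
                cache[block] = True          # reuse observed during this residency
                cache.move_to_end(block)     # LRU update
                return "hit"
        # Miss: placement is guided by the block's last observed reuse behavior.
        if self.reuse_hint.get(block, True):
            target, limit = self.main, self.main_blocks
        else:
            target, limit = self.bypass, self.bypass_blocks
        if len(target) >= limit:
            victim, was_reused = target.popitem(last=False)   # evict LRU block
            self.reuse_hint[victim] = was_reused              # remember its behavior
        target[block] = False
        return "miss"

cache = ReuseGuidedCache(main_blocks=4, bypass_blocks=2)
stream = [0, 1, 0, 1] * 3 + list(range(100, 110)) + [0, 1]   # reused blocks plus a streaming burst
print(sum(cache.access(b) == "hit" for b in stream), "hits out of", len(stream))

A program-counter-based variant would key reuse_hint on the address of the memory instruction rather than on the data block; the paper compares the two forms of reuse information.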
international symposium on microarchitecture | 1992
Gary S. Tyson; Matthew K. Farrens; Andrew R. Pleszkun
This paper describes a single chip Multiple Instruction Stream Computer (MISC) capable of extracting instruction level parallelism from a broad spectrum of programs. The MISC architecture uses multiple asynchronous processing elements to separate a program into streams that can be executed in parallel, and integrates a conflict-free message passing system into the lowest level of the processor design to facilitate low latency intra-MISC communication. This approach allows for increased machine parallelism with minimal code expansion, and provides an alternative approach to single instruction stream multi-issue machines such as SuperScalar and VLIW. The MISC processor, a direct descendant of the PIPE project [CGKP87, GHLP85], will exploit both the instruction and data parallelism available in a task by combining the capabilities of traditional data parallel architectures with those found in machines designed to exploit instruction level parallelism. Unlike the two-processor PIPE design, the MISC system is capable of balancing the processor load of instructions performing memory access and execute operations among four processors. The characteristics of the MISC design also allow the introduction of a number of new and unique instructions, like the Sentinel and Vector instructions described later in this paper. As its name indicates, MISC is composed of multiple Processing Elements (PEs) which cooperate in the execution of a task.
international symposium on computer architecture | 1994
Matthew K. Farrens; Gary S. Tyson; Andrew R. Pleszkun
This paper presents a trace-driven simulation-based study of a wide range of cache configurations and processor counts. This study was undertaken in an attempt to help answer the question of how best to allocate large numbers of transistors, a question that is rapidly increasing in importance as transistor densities continue to climb. At what point does continuing to increase the size of the on-chip first level cache cease to provide sufficient increases in hit rate and become prohibitively difficult to access in a single cycle? In order to compare different configurations, the concept of an Equivalent Cache Transistor is presented. Results indicate that the access time of the first-level data cache is more important than the size. In addition, it appears that once approximately 15 million transistors become available, a two processor configuration is preferable to a single processor with correspondingly larger caches.
international symposium on microarchitecture | 1990
Arvin Park; Matthew K. Farrens
The paper presents a technique to reduce processor-to-memory address bandwidth by exploiting temporal and spatial locality in address reference streams. Higher order portions of address words are cached in base registers at both the processor and memory. This makes it possible to transmit small register indexes between processor and memory instead of the high order address bits themselves. Trace driven simulations indicate that base register caching reduces processor-to-memory address bandwidth up to 60% without appreciable loss in performance.
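As a back-of-the-envelope illustration of where a reduction of roughly this size can come from (the concrete widths below are assumptions, not figures from the paper):

# Illustrative widths only; none of these numbers are taken from the paper.
address_bits = 32      # full address width (assumed)
low_bits     = 8       # low-order bits still sent verbatim (assumed)
num_regs     = 16      # base registers at each end (assumed)
index_bits   = num_regs.bit_length() - 1    # 4 bits suffice to name a register

compressed = index_bits + low_bits
print(f"{address_bits}-bit address -> {compressed} bits on a register hit "
      f"({1 - compressed / address_bits:.0%} narrower)")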
high performance computer architecture | 2000
Michael Haungs; Phil Sallee; Matthew K. Farrens
Recent studies have shown significantly improved branch prediction through the use of branch classification. By separating static branches into groups, or classes, with similar dynamic behavior, predictors may be selected that are best suited for each class. Previous methods have classified branches according to taken rate (or bias). We propose a new metric for branch classification: branch transition rate, which is defined as the number of times a branch changes direction between taken and not taken during execution. We show that transition rate is a more appropriate indicator of branch behavior than taken rate for determining predictor performance. When both metrics are combined, an even clearer picture of dynamic branch behavior emerges, in which expected predictor performance for a branch is closely correlated with its combined taken and transition rate class. Using this classification, a small group of branches is identified for which two-level predictors are ineffective.
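Both metrics are easy to compute from a branch outcome trace. The sketch below (the classify helper and the two synthetic traces are illustrative, not from the paper) shows why transition rate separates branches that taken rate alone cannot: two branches with identical 50% bias can flip direction on every execution or only once.

def classify(outcomes):
    """Compute (taken rate, transition rate) from a list of branch outcomes,
    where True means the branch was taken."""
    taken_rate = sum(outcomes) / len(outcomes)
    transitions = sum(prev != cur for prev, cur in zip(outcomes, outcomes[1:]))
    transition_rate = transitions / (len(outcomes) - 1)
    return taken_rate, transition_rate

# Two branches with identical 50% bias but very different behavior:
alternating = [i % 2 == 0 for i in range(100)]   # T N T N ... flips every execution
phased      = [i < 50 for i in range(100)]       # T ... T N ... N, flips only once
print(classify(alternating))   # (0.5, 1.0)   -> a simple bimodal predictor does poorly here
print(classify(phased))        # (0.5, ~0.01) -> the same predictor does very well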
IEEE ACM Transactions on Networking | 2014
Paul T. Congdon; Prasant Mohapatra; Matthew K. Farrens; Venkatesh Akella
The Ethernet switch is a primary building block for today's enterprise networks and data centers. As network technologies converge upon a single Ethernet fabric, there is ongoing pressure to improve the performance and efficiency of the switch while maintaining flexibility and a rich set of packet processing features. The OpenFlow architecture aims to provide flexibility and programmable packet processing to meet these converging needs. Of the many ways to create an OpenFlow switch, a popular choice is to make heavy use of ternary content addressable memories (TCAMs). Unfortunately, TCAMs can consume a considerable amount of power and, when used to match flows in an OpenFlow switch, put a bound on switch latency. In this paper, we propose enhancing an OpenFlow Ethernet switch with per-port packet prediction circuitry in order to simultaneously reduce latency and power consumption without sacrificing the rich policy-based forwarding enabled by the OpenFlow architecture. Packet prediction exploits the temporal locality in network communications to predict the flow classification of incoming packets. When predictions are correct, latency can be reduced, and significant power savings can be achieved from bypassing the full lookup process. Simulation studies using actual network traces indicate that correct prediction rates of 97% are achievable using only a small amount of prediction circuitry per port. These studies also show that prediction circuitry can help reduce the power consumed by a lookup process that includes a TCAM by 92% and simultaneously reduce the latency of a cut-through switch by 66%.
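A heavily simplified sketch of the prediction idea follows: each input port remembers the flow of its most recent packet, and a matching next packet can reuse the cached forwarding decision and skip the full TCAM lookup. The single-entry predictor, the simulate_prediction helper, and the toy trace are assumptions for illustration, not the circuitry evaluated in the paper.

def simulate_prediction(packets):
    """packets: iterable of (in_port, flow_key) pairs. Returns the fraction
    of packets whose flow matched the port's most recently seen flow."""
    last_flow = {}          # in_port -> flow key of its most recent packet
    correct = total = 0
    for port, flow in packets:
        total += 1
        if last_flow.get(port) == flow:
            correct += 1    # prediction hit: the full lookup could be bypassed
        last_flow[port] = flow
    return correct / total

# Packet trains (bursts from the same flow) make the predictor effective.
trace = [(0, "flowA")] * 9 + [(0, "flowB")] + [(0, "flowA")] * 9
print(f"correct prediction rate: {simulate_prediction(trace):.0%}")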
IEEE Computer | 1991
Matthew K. Farrens; Andrew R. Pleszkun
The PIPE (parallel instruction with pipelined execution) processor, which is the result of a research project initiated to investigate high-performance computer architectures for VLSI implementation, is described. The lessons learned from the implementation are discussed. The most important result was the discovery that supporting architectural queues does not complicate the instruction issue logic and frees the processor clock rate from external memory speed influences. It was also found that the decision to support an instruction set with two instruction sizes and to allow consecutive two-parcel instruction issues profoundly affected the instruction fetch logic design. Other significant results concerned the issue logic, barrel shifter, cache control logic, and branch count.