Publication


Featured research published by Timothy Sherwood.


architectural support for programming languages and operating systems | 2002

Automatically characterizing large scale program behavior

Timothy Sherwood; Erez Perelman; Greg Hamerly; Brad Calder

Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ramifications for many architectural and compiler techniques, from thread scheduling, to feedback-directed optimizations, to the way programs are simulated. However, in order to take advantage of time-varying behavior, we must first develop the analytical tools necessary to automatically and efficiently analyze program behavior over large sections of execution. Our goal is to develop automatic techniques that are capable of finding and exploiting the large scale behavior of programs (behavior seen over billions of instructions). The first step towards this goal is the development of a hardware-independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program. To this end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics, explore the large scale behavior of several programs, and develop a set of clustering-based algorithms capable of analyzing this behavior. We then demonstrate an application of this technology to automatically determine where to simulate in a program to help guide computer architecture research.
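
To make the clustering step concrete, the following is a minimal sketch, not the authors' implementation: it builds a normalized Basic Block Vector for each fixed-length interval of execution and groups the intervals with plain k-means. All names, the interval data format, and the choice of k are illustrative assumptions; the paper's actual algorithm adds refinements such as reducing the dimension of the vectors before clustering.

    import numpy as np
    from collections import Counter

    def basic_block_vector(block_trace, num_blocks):
        # Count how often each basic block executes during one interval,
        # then normalize so intervals of different lengths are comparable.
        counts = Counter(block_trace)
        vec = np.zeros(num_blocks)
        for block_id, count in counts.items():
            vec[block_id] = count
        return vec / vec.sum()

    def cluster_intervals(bbvs, k, iters=50, seed=0):
        # Plain k-means over the per-interval BBVs: each resulting
        # cluster is treated as one program phase.
        rng = np.random.default_rng(seed)
        X = np.asarray(bbvs)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return labels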


international conference on parallel architectures and compilation techniques | 2001

Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications

Timothy Sherwood; Erez Perelman; Brad Calder

Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To overcome this problem researchers choose a very small portion of a program's execution to evaluate their results, rather than simulating the entire program. In this paper we propose Basic Block Distribution Analysis as an automated approach for finding these small portions of the program to simulate that are representative of the entire program's execution. This approach is based upon using profiles of a program's code structure (basic blocks) to uniquely identify different phases of execution in the program. We show that the periodicity of the basic block frequency profile reflects the periodicity of detailed simulation across several different architectural metrics (e.g., IPC, branch miss rate, cache miss rate, value misprediction, address misprediction, and reorder buffer occupancy). Since basic block frequencies can be collected using very fast profiling tools, our approach provides a practical technique for finding the periodicity and simulation points in applications.
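
The underlying signal is the distance between the Basic Block Vectors of different intervals. Below is an illustrative sketch, with hypothetical names and at least a few dozen intervals assumed, that compares each interval against a reference interval and estimates a period from the autocorrelation of the resulting signal; the paper's analysis is more careful, but the comparison is of this flavor.

    import numpy as np

    def bbv_distance(a, b):
        # Manhattan distance between two normalized BBVs; a small value
        # means the two intervals executed roughly the same code.
        return np.abs(np.asarray(a) - np.asarray(b)).sum()

    def estimate_period(bbvs):
        # Compare every interval against a fixed reference interval and
        # find the lag at which the code signature repeats.
        signal = np.array([bbv_distance(bbvs[0], v) for v in bbvs])
        signal = signal - signal.mean()
        max_lag = len(signal) // 2
        autocorr = [signal[:-lag] @ signal[lag:] for lag in range(1, max_lag)]
        return 1 + int(np.argmax(autocorr))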


international symposium on computer architecture | 2003

Phase tracking and prediction

Timothy Sherwood; Suleyman Sair; Brad Calder

In a single second a modern processor can execute billions of instructions. Obtaining a bird's-eye view of the behavior of a program at these speeds can be a difficult task when all that is available is cycle-by-cycle examination. In many programs, behavior is anything but steady state, and understanding the patterns of behavior, at run-time, can unlock a multitude of optimization opportunities. In this paper, we present a unified profiling architecture that can efficiently capture, classify, and predict phase-based program behavior on the largest of time scales. By examining the proportion of instructions that were executed from different sections of code, we can find generic phases that correspond to changes in behavior across many metrics. By classifying phases generically, we avoid the need to identify phases for each optimization, and enable a unified prediction scheme that can forecast future behavior. Our analysis shows that our design can capture phases that account for over 80% of execution using less than 500 bytes of on-chip memory.
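
A rough software model of the classification half of the idea follows; the paper describes a hardware structure, and everything here, from the bucket count to the matching threshold, is an illustrative assumption. The prediction half, which the paper handles with a Markov-style predictor over phase IDs, is omitted.

    class PhaseTracker:
        # Software model of an on-chip phase table: hash each interval's
        # executed-code footprint into a small signature, match it against
        # the signatures of previously seen phases, and create a new phase
        # ID when nothing matches.

        def __init__(self, buckets=32, threshold=0.5):
            self.buckets = buckets
            self.threshold = threshold
            self.phases = []  # one representative signature per phase ID

        def signature(self, branch_footprint):
            # branch_footprint: (branch PC, instructions executed) pairs
            sig = [0.0] * self.buckets
            for pc, weight in branch_footprint:
                sig[hash(pc) % self.buckets] += weight
            total = sum(sig) or 1.0
            return [s / total for s in sig]

        def classify(self, branch_footprint):
            sig = self.signature(branch_footprint)
            for phase_id, ref in enumerate(self.phases):
                if sum(abs(a - b) for a, b in zip(sig, ref)) < self.threshold:
                    return phase_id
            self.phases.append(sig)
            return len(self.phases) - 1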


international conference on computer communications | 2004

Deterministic memory-efficient string matching algorithms for intrusion detection

Nathan Tuck; Timothy Sherwood; Brad Calder; George Varghese

Intrusion detection systems (IDSs) have become widely recognized as powerful tools for identifying, deterring and deflecting malicious attacks over the network. Essential to almost every intrusion detection system is the ability to search through packets and identify content that matches known attacks. Space- and time-efficient string matching algorithms are therefore important for identifying these packets at line rate. We examine string matching algorithms and their use for intrusion detection; in particular, we focus our efforts on providing worst-case performance that is amenable to hardware implementation. We contribute modifications to the Aho-Corasick string-matching algorithm that drastically reduce the amount of memory required and improve its performance on hardware implementations. We also show that these modifications do not drastically affect software performance on commodity processors, and therefore may be worth considering in these cases as well.
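
Part of the memory reduction comes from replacing each state's 256 next-state pointers with a 256-bit bitmap plus a packed pointer array. A minimal sketch of such a compressed node, with hypothetical names, follows; the paper applies further optimizations, such as path compression, on top of this.

    class CompressedNode:
        # One Aho-Corasick state stored as a 256-bit bitmap plus a packed
        # list of next-state IDs, instead of 256 full next-state pointers.

        def __init__(self, transitions, failure):
            # transitions: dict mapping byte value (0-255) -> next state ID
            self.bitmap = 0
            self.targets = []
            for byte in sorted(transitions):
                self.bitmap |= 1 << byte
                self.targets.append(transitions[byte])
            self.failure = failure  # state to fall back to on a miss

        def next_state(self, byte):
            if not (self.bitmap >> byte) & 1:
                return None  # no transition: the caller follows self.failure
            # popcount of the bitmap bits below `byte` gives the index of
            # the stored target for this byte
            index = bin(self.bitmap & ((1 << byte) - 1)).count("1")
            return self.targets[index]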


international symposium on computer architecture | 2005

A High Throughput String Matching Architecture for Intrusion Detection and Prevention

Lin Tan; Timothy Sherwood

Network intrusion detection and prevention systems have emerged as one of the most effective ways of providing security to those connected to the network, and at the heart of almost every modern intrusion detection system is a string matching algorithm. String matching is one of the most critical elements because it allows for the system to make decisions based not just on the headers, but the actual content flowing through the network. Unfortunately, checking every byte of every packet to see if it matches one of a set of ten thousand strings becomes a computationally intensive task as network speeds grow into the tens, and eventually hundreds, of gigabits/second. To keep up with these speeds a specialized device is required, one that can maintain tight bounds on worst case performance, that can be updated with new rules without interrupting operation, and one that is efficient enough that it could be included on chip with existing network chips or even into wireless devices. We have developed an approach that relies on a special purpose architecture that executes novel string matching algorithms specially optimized for implementation in our design. We show how the problem can be solved by converting the large database of strings into many tiny state machines, each of which searches for a portion of the rules and a portion of the bits of each rule. Through the careful co-design and optimization of our architecture with a new string matching algorithm we show that it is possible to build a system that is 10 times more efficient than the currently best known approaches.
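
The bit-split construction can be sketched as a subset construction that projects the byte-level automaton onto one input bit at a time. The code below assumes a full byte-level transition table (delta) and per-state pattern sets (accepts), both hypothetical names, and builds the binary machine for one bit position.

    def bit_split(delta, accepts, bit):
        # Project a byte-level Aho-Corasick automaton onto one input bit
        # using a subset construction: each state of the bit-machine is
        # the set of byte-level states consistent with the bits seen so
        # far. delta: per-state list of 256 next-state IDs; accepts:
        # per-state set of matched pattern IDs.
        start = frozenset([0])
        ids = {start: 0}
        trans, match = [[0, 0]], [set()]
        work = [start]
        while work:
            group = work.pop()
            gid = ids[group]
            match[gid] = set().union(*(accepts[s] for s in group))
            for b in (0, 1):
                nxt = frozenset(delta[s][c] for s in group
                                for c in range(256) if (c >> bit) & 1 == b)
                if nxt not in ids:
                    ids[nxt] = len(trans)
                    trans.append([0, 0])
                    match.append(set())
                    work.append(nxt)
                trans[gid][b] = ids[nxt]
        return trans, match

Each input byte advances all eight bit-machines by one transition; a pattern is reported as matched only when it appears in every machine's partial-match set for its current state.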


international symposium on microarchitecture | 2003

Discovering and exploiting program phases

Timothy Sherwood; Erez Perelman; Greg Hamerly; Suleyman Sair; Brad Calder

Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the largest of scales (that is, over the program's complete execution). During one part of the execution, a program can be completely memory bound; in another, it can repeatedly stall on branch mispredicts. Average statistics gathered about a program might not accurately picture where the real problems lie. This realization has ramifications for many architecture and compiler techniques, from how to best schedule threads on a multithreaded machine, to feedback-directed optimizations, power management, and the simulation and test of architectures. Taking advantage of time-varying behavior requires a set of automated analytic tools and hardware techniques that can discover similarities and changes in program behavior on the largest of time scales. The challenge in building such tools is that during a program's lifetime it can execute billions or trillions of instructions. How can high-level behavior be extracted from this sea of instructions? Some programs change behavior drastically, switching between periods of high and low performance, yet system design and optimization typically focus on average system behavior. We argue that instead of assuming average behavior, it is now time to model and optimize phase-based program behavior.


measurement and modeling of computer systems | 2003

Using SimPoint for accurate and efficient simulation

Erez Perelman; Greg Hamerly; Michael Van Biesbrouck; Timothy Sherwood; Brad Calder

Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of a single industry standard benchmark at this level of detail takes on the order of months to complete. This problem is exacerbated by the fact that properly performing an architectural evaluation requires multiple benchmarks to be evaluated across many separate runs. To address this issue we recently created a tool called SimPoint that automatically finds a small set of Simulation Points to represent the complete execution of a program for efficient and accurate simulation. In this paper we describe how to use the SimPoint tool, and introduce an improved SimPoint algorithm designed to significantly reduce the simulation time required when the simulation environment relies upon fast-forwarding.
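
A minimal sketch of how the clusters turn into simulation points follows; the names are hypothetical and the SimPoint tool itself makes these choices with additional heuristics. One representative interval is picked per cluster, and its measured metric is weighted by the cluster's share of execution.

    import numpy as np

    def choose_simpoints(bbvs, labels, k):
        # For each phase, pick the interval closest to the cluster
        # centroid; the cluster's share of all intervals becomes that
        # simulation point's weight.
        X = np.asarray(bbvs)
        points, weights = [], []
        for j in range(k):
            members = np.flatnonzero(labels == j)
            centroid = X[members].mean(axis=0)
            best = members[np.argmin(np.linalg.norm(X[members] - centroid, axis=1))]
            points.append(int(best))
            weights.append(len(members) / len(X))
        return points, weights

    def estimate_metric(per_point_values, weights):
        # The whole-program estimate is the weighted average of the
        # metric measured at each simulation point.
        return sum(v * w for v, w in zip(per_point_values, weights))

Only the chosen intervals are then simulated in detail; estimate_metric combines the per-interval measurements into a whole-program estimate.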


design automation conference | 2006

A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy

Gian Luca Loi; Banit Agrawal; Navin Srivastava; Sheng-Chih Lin; Timothy Sherwood; Kaustav Banerjee

Three-dimensional (3D) integrated circuits have emerged as promising candidates to overcome the interconnect bottlenecks of nanometer scale designs. While they offer several other advantages, it is expected that the benefits from this technology can potentially be offset by thermal considerations which impact chip performance and reliability. The work presented in this paper is the first attempt to study the performance benefits of 3D technology under the influence of such thermal constraints. Using a processor-cache-memory system and carefully chosen applications encompassing different memory behaviors, the performance of a 3D architecture is compared with a conventional planar (2D) design. It is found that the substantial increase in memory bus frequency and bus width contributes to a significant reduction in execution time with a 3D design. It is also found that increasing the clock frequency translates into larger gains in system performance with 3D designs than with planar 2D designs in memory-intensive applications. The thermal profile of the vertically stacked chip is generated taking into account the highly temperature-sensitive leakage power dissipation. The maximum operating frequency allowed by the temperature constraint is shown to be lower for 3D than for 2D designs. In spite of these constraints, it is shown that the 3D system registers a large performance improvement for memory-intensive applications.


architectural support for programming languages and operating systems | 2009

Complete information flow tracking from the gates up

Mohit Tiwari; Hassan M. G. Wassel; Bita Mazloom; Shashidhar Mysore; Frederic T. Chong; Timothy Sherwood

For many mission-critical tasks, tight guarantees on the flow of information are desirable, for example, when handling important cryptographic keys or sensitive financial data. We present a novel architecture capable of tracking all information flow within the machine, including all explicit data transfers and all implicit flows (those subtly devious flows caused by not performing conditional operations). While the problem is impossible to solve in the general case, we have created a machine that avoids the general-purpose programmability that leads to this impossibility result, yet is still programmable enough to handle a variety of critical operations such as public-key encryption and authentication. Through the application of our novel gate-level information flow tracking method, we show how all flows of information can be precisely tracked. From this foundation, we then describe how a class of architectures can be constructed, from the gates up, to completely capture all information flows and we measure the impact of doing so on the hardware implementation, the ISA, and the programmer.
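
At the heart of the method is shadow logic attached to each gate. For a two-input AND gate, a tainted input can affect the output only if the other input does not already force the output's value. A small illustrative sketch of that rule, in software rather than the paper's hardware formulation:

    def glift_and(a, ta, b, tb):
        # Shadow logic for a two-input AND gate (a, b are data bits;
        # ta, tb their taint bits). The output is tainted only when a
        # tainted input can actually change it: an untainted 0 forces
        # the output to 0, so taint on the other input cannot propagate.
        out = a & b
        out_taint = (ta & tb) | (ta & b) | (tb & a)
        return out, out_taint

    def glift_or(a, ta, b, tb):
        # Dual rule for OR: an untainted 1 forces the output to 1.
        out = a | b
        out_taint = (ta & tb) | (ta & (b ^ 1)) | (tb & (a ^ 1))
        return out, out_taint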


international conference on supercomputing | 1999

Reducing cache misses using hardware and software page placement

Timothy Sherwood; Brad Calder; Joel S. Emer

As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction and data cache performance for virtually indexed caches by mapping code and data with temporal locality to different cache blocks. In this paper we examine the performance of compiler and hardware approaches for reordering pages in physically addressed caches to eliminate cache misses. The software approach provides a color mapping at compile-time for code and data pages, which can then be used by the operating system to guide its allocation of physical pages. The hardware approach works by adding a page remap field to the TLB, which is used to allow a page to be remapped to a different color in the physically indexed cache while keeping the same physical page in memory. The results show that software page placement provided a 28% speedup and hardware page placement provided a 21% speedup on average for a superscalar processor. For a 4 processor single-chip multiprocessor, the miss rate was reduced from 8.7% down to 7.2% on average.
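
A simplified sketch of the color computation and the software-side allocation policy follows; all names, the parameter defaults, and the fallback behavior are illustrative assumptions rather than the paper's implementation.

    def page_color(physical_page, cache_size, page_size=4096, associativity=1):
        # A page's color is the group of cache sets it maps to; pages of
        # the same color compete for the same blocks in a physically
        # indexed cache.
        num_colors = cache_size // (associativity * page_size)
        return physical_page % num_colors

    def allocate_colored(free_pages, wanted_color, cache_size):
        # OS side of software page placement: prefer a free physical page
        # whose color matches the compile-time hint, else take any page.
        for i, page in enumerate(free_pages):
            if page_color(page, cache_size) == wanted_color:
                return free_pages.pop(i)
        return free_pages.pop(0)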

Collaboration


Dive into Timothy Sherwood's collaborations.

Top Co-Authors

Ryan Kastner, University of California
Ted Huffmire, Naval Postgraduate School
Brad Calder, University of California
Mohit Tiwari, University of Texas at Austin
Thuy D. Nguyen, Naval Postgraduate School
Jason Oberg, University of California