Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Matthew J. Thazhuthaveetil is active.

Publication


Featured research published by Matthew J. Thazhuthaveetil.


high-performance computer architecture | 2006

Construction and use of linear regression models for processor performance analysis

P. J. Joseph; Kapil Vaswani; Matthew J. Thazhuthaveetil

Processor architects face the challenging task of evaluating a large design space consisting of several interacting parameters and optimizations. To assist architects in making crucial design decisions, we build linear regression models that relate processor performance to micro-architectural parameters, using simulation-based experiments. We obtain good approximate models through an iterative process in which Akaike's information criterion is used to extract a good linear model from a small set of simulations, and limited further simulation is guided by the model using D-optimal experimental designs. The iterative process is repeated until the desired error bounds are achieved. We used this procedure to establish the relationship of the CPI performance response to 26 key micro-architectural parameters using a detailed cycle-by-cycle superscalar processor simulator. The resulting models provide a significance ordering on all micro-architectural parameters and their interactions, and explain the performance variations of micro-architectural techniques.
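
A minimal sketch of the AIC step, with a synthetic CPI response standing in for real simulator runs (the data, parameter count, and names below are illustrative, not the paper's): candidate linear models are scored with Gaussian AIC and the best-scoring parameter subset is kept.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for simulation results: CPI as a function of a few
# micro-architectural parameters (purely illustrative).
n = 40
X_full = rng.uniform(0, 1, size=(n, 4))         # 4 candidate parameters
cpi = 1.0 + 2.0 * X_full[:, 0] - 1.5 * X_full[:, 1] + rng.normal(0, 0.05, n)

def aic_of(X, y):
    """Gaussian AIC of an OLS fit: n*ln(RSS/n) + 2k (lower is better)."""
    X1 = np.column_stack([np.ones(len(y)), X])  # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * X1.shape[1]

# Score every candidate subset of parameters and keep the AIC winner.
best = min(
    (cols for r in range(1, 5) for cols in combinations(range(4), r)),
    key=lambda cols: aic_of(X_full[:, cols], cpi),
)
print("parameters selected by AIC:", best)      # expect (0, 1)
```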


international symposium on microarchitecture | 2006

A Predictive Performance Model for Superscalar Processors

P. J. Joseph; Kapil Vaswani; Matthew J. Thazhuthaveetil

Designing and optimizing high-performance microprocessors is an increasingly difficult task due to the size and complexity of the processor design space, the high cost of detailed simulation, and several constraints that a processor design must satisfy. In this paper, we propose the use of empirical non-linear modeling techniques to assist processor architects in making design decisions and resolving complex trade-offs. We propose a procedure for building accurate non-linear models that consists of the following steps: (i) selection of a small set of representative design points spread across the processor design space using Latin hypercube sampling, (ii) obtaining performance measures at the selected design points using detailed simulation, (iii) building non-linear models for performance using the function approximation capabilities of radial basis function networks, and (iv) validating the models using an independently and randomly generated set of design points. We evaluate our model building procedure by constructing non-linear performance models for programs from the SPEC CPU2000 benchmark suite with a microarchitectural design space that consists of 9 key parameters. Our results show that the models, built using a relatively small number of simulations, achieve high prediction accuracy (only 2.8% error in CPI estimates on average) across a large processor design space. Our models can potentially replace detailed simulation for common tasks such as the analysis of key microarchitectural trends or searches for optimal processor design points.
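
The four-step procedure maps naturally onto SciPy primitives. The sketch below is illustrative only: the 3-parameter design space and the toy CPI response are invented, and SciPy's LatinHypercube sampler and RBFInterpolator stand in for the paper's sampling scheme and radial basis function network.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.stats import qmc

# (i) Latin hypercube sample of a toy 3-parameter design space.
sampler = qmc.LatinHypercube(d=3, seed=1)
train = sampler.random(n=60)

# (ii) Stand-in for detailed simulation: a made-up CPI response surface.
def simulate_cpi(points):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return 1.0 + np.sin(3 * x) + 0.5 * y * z

# (iii) Fit a radial-basis-function model to the sampled responses.
model = RBFInterpolator(train, simulate_cpi(train), kernel="thin_plate_spline")

# (iv) Validate on an independent, randomly generated set of design points.
test = np.random.default_rng(1).uniform(0, 1, size=(200, 3))
err = np.abs(model(test) - simulate_cpi(test)) / simulate_cpi(test)
print(f"mean CPI prediction error: {err.mean():.2%}")
```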


architectural support for programming languages and operating systems | 2013

Improving GPGPU concurrency with elastic kernels

Sreepathi Pai; Matthew J. Thazhuthaveetil; R. Govindarajan

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for the programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite, we find that concurrent execution is often no better than serialized execution. We identify the lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.
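
The core transformation can be sketched in plain Python (the paper operates on CUDA kernels; everything here, including the function names, is a hypothetical model): a kernel written for a logical grid is wrapped in a grid-stride loop over a fixed physical grid, so that a concurrency policy, rather than the program, decides how much of the GPU the kernel occupies.

```python
# Illustrative Python model of the elastic-kernel idea (names hypothetical).
# A kernel written for `logical_blocks` blocks is driven by a *fixed* grid
# of `physical_blocks` workers, each striding over logical block ids, so a
# scheduler can cap the resources any one kernel occupies.

def elastic_launch(kernel, logical_blocks, physical_blocks, *args):
    for physical_id in range(physical_blocks):    # one worker per "CTA"
        logical_id = physical_id
        while logical_id < logical_blocks:        # grid-stride loop
            kernel(logical_id, *args)             # original kernel body
            logical_id += physical_blocks

def scale_kernel(block_id, data, factor):
    data[block_id] *= factor                      # toy per-block work

data = list(range(8))
elastic_launch(scale_kernel, 8, 2, data, 3)       # policy picks 2 workers
print(data)                                       # [0, 3, 6, ..., 21]
```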


symposium on code generation and optimization | 2005

A Programmable Hardware Path Profiler

Kapil Vaswani; Matthew J. Thazhuthaveetil; Y. N. Srikant

For aggressive path-based program optimizations to be profitable in cost-sensitive environments, accurate path profiles must be available at low overheads. In this paper, we propose a low-overhead, non-intrusive hardware path profiling scheme that can be programmed to detect several types of paths, including acyclic intra-procedural paths, paths for a whole program path, and extended paths. The profiler consists of a path stack, which detects paths and generates a sequence of path descriptors using branch information from the processor pipeline, and a hot path table that collects a profile of hot paths for later use by a program optimizer. With assistance from the processor's event detection logic, our profiler can track a host of architectural metrics along paths, enabling context-sensitive performance monitoring and bottleneck analysis. We illustrate the utility of our scheme by associating paths with a power metric that estimates the power consumed in the cache hierarchy by instructions along the path. Experiments using programs from the SPEC CPU2000 benchmark suite show that our path profiler, occupying 7KB of hardware real estate, collects accurate path profiles (average overlap of 88% with a perfect profile) at negligible execution time overheads (0.6% on average).
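
A software model of the two structures conveys the data flow (an illustrative sketch, not the paper's hardware): a path stack accumulates retired-branch outcomes into a path descriptor, and a terminating backward branch deposits the descriptor in a hot path table.

```python
from collections import Counter

class PathProfiler:
    """Software model of the profiler: a path stack plus a hot path table."""

    def __init__(self):
        self.stack = []            # path stack: branch outcomes so far
        self.hot = Counter()       # hot path table: descriptor -> count

    def retire_branch(self, pc, target, taken):
        self.stack.append((pc, taken))
        if target <= pc:           # backward branch ends an acyclic path
            self.hot[tuple(self.stack)] += 1
            self.stack = []

prof = PathProfiler()
# Replay a toy branch trace: a loop whose body branch alternates direction.
for i in range(10):
    prof.retire_branch(pc=0x40, target=0x60, taken=(i % 2 == 0))
    prof.retire_branch(pc=0x80, target=0x40, taken=True)   # loop back-edge

for path, count in prof.hot.most_common():
    print(count, path)             # two hot paths, 5 occurrences each
```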


international conference on parallel architectures and compilation techniques | 2012

Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme

Sreepathi Pai; R. Govindarajan; Matthew J. Thazhuthaveetil

Exploiting the performance potential of GPUs requires managing the data transfers to and from them efficiently, which is an error-prone and tedious task. In this paper, we develop a software coherence mechanism to fully automate all data transfers between the CPU and the GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potentially stale accesses and uses a runtime to initiate transfers as necessary. This allows us to avoid the redundant transfers exhibited by all other existing automatic memory management proposals. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite, it achieves (i) a 1.06× speedup over hand-tuned manual memory management, and (ii) a 1.29× speedup over another recently proposed compiler-runtime automatic memory management system. Compared to other existing runtime-only and compiler-only proposals, it also transfers 2.2× to 13.3× less data on average.
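
The runtime half of such a scheme reduces to per-buffer validity tracking. The miniature below is hypothetical (transfers are simulated with prints): the compiler would insert the acquire() calls before potentially stale accesses, and the runtime transfers only when the requested copy is actually out of date, which is how redundant transfers are avoided.

```python
class CoherentBuffer:
    """Tracks where a buffer's freshest copy lives (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.valid = {"cpu": True, "gpu": False}   # fresh copies by device

    def acquire(self, device, will_write):
        if not self.valid[device]:                 # stale copy: transfer it
            print(f"copy {self.name} -> {device}")
            self.valid[device] = True
        if will_write:                             # a write stales the rest
            for d in self.valid:
                self.valid[d] = (d == device)

buf = CoherentBuffer("A")
buf.acquire("gpu", will_write=True)     # copy A -> gpu; CPU copy now stale
buf.acquire("gpu", will_write=True)     # no transfer: GPU copy is fresh
buf.acquire("cpu", will_write=False)    # copy A -> cpu
buf.acquire("cpu", will_write=False)    # no transfer: redundancy avoided
```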


languages, compilers, and tools for embedded systems | 2009

Synergistic execution of stream programs on multicores with accelerators

Abhishek Udupa; R. Govindarajan; Matthew J. Thazhuthaveetil

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multicore architectures. StreamIt graphs describe task, data and pipeline parallelism, which can be exploited on accelerators such as Graphics Processing Units (GPUs) or the CellBE, which support abundant parallelism in hardware. In this paper, we describe a novel method to orchestrate the execution of a StreamIt program on a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as an integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work partitioning between the CPU and the GPU, which provides solutions that are within 9.05% of the optimal solution on average across the benchmark suite. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between the CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.84X, with a maximum of 51.96X, over single-threaded CPU execution across the StreamIt benchmarks. This is an 18.9% improvement over a partitioning strategy that maps onto the CPU only the filters that cannot be executed on the GPU -- the filters with state that is persistent across firings.
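
A toy instance shows the shape of the optimization problem (all numbers invented, with brute force standing in for the ILP solver): each filter has a CPU and a GPU execution time, edges cut by the partition pay a transfer cost, and the steady-state pipelined objective is the bottleneck resource.

```python
from itertools import product

cpu_time = [4, 6, 3, 8]        # per-filter execution time on a CPU core
gpu_time = [9, 1, 1, 2]        # per-filter execution time on the GPU
edges = [(0, 1, 5), (1, 2, 5), (2, 3, 5)]   # (src, dst, transfer cost)

def makespan(assign):
    """Steady-state bottleneck for one assignment of filters to devices."""
    cpu = sum(t for t, a in zip(cpu_time, assign) if a == "cpu")
    gpu = sum(t for t, a in zip(gpu_time, assign) if a == "gpu")
    xfer = sum(c for s, d, c in edges if assign[s] != assign[d])
    return max(cpu, gpu, xfer)

best = min(product(["cpu", "gpu"], repeat=4), key=makespan)
print(best, makespan(best))    # filter 0 on the CPU, the rest on the GPU
```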


Image and Vision Computing | 1991

Parallel Hough transform algorithm performance

Matthew J. Thazhuthaveetil; Anish V. Shah

The Hough transform (HT) is a computationally intensive transform useful for detecting lines in digital images. Efforts have been made to parallelize the HT, focusing on partitioning the work for a multiprocessor system. Variations on two partitioning schemes are compared, and a breakdown of the various computation overheads is presented and analyzed.
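
The two partitioning schemes can be contrasted in a few lines of Python (workers are simulated sequentially; the image and sizes are toy values): image partitioning splits the edge pixels across workers with private accumulators and a final reduction, while parameter partitioning gives each worker a slice of the theta range and needs no reduction.

```python
import numpy as np

H, W, P = 64, 64, 4                               # image size, worker count
thetas = np.deg2rad(np.arange(0, 180))
max_rho = int(np.hypot(H, W))
ys, xs = np.nonzero(np.eye(H, W, dtype=bool))     # toy edge image: a diagonal

def vote(points, theta_idx, acc):
    for y, x in points:
        for t in theta_idx:
            rho = int(round(x * np.cos(thetas[t]) + y * np.sin(thetas[t])))
            acc[rho + max_rho, t] += 1

# Scheme 1: image partitioning -- split pixels, private accumulators, reduce.
accs = [np.zeros((2 * max_rho + 1, len(thetas)), int) for _ in range(P)]
for p in range(P):
    vote(list(zip(ys[p::P], xs[p::P])), range(len(thetas)), accs[p])
img_part = sum(accs)                              # the reduction step

# Scheme 2: parameter partitioning -- every worker sees all pixels but owns
# a theta slice, so no reduction (and no accumulator sharing) is needed.
par_part = np.zeros_like(img_part)
for p in range(P):
    vote(list(zip(ys, xs)), range(p, len(thetas), P), par_part)

assert (img_part == par_part).all()               # same transform either way
print("peak votes:", img_part.max())              # the diagonal line wins
```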


conference on high performance computing (supercomputing) | 1990

A write update cache coherence protocol for MIN-based multiprocessors with accessibility-based split caches

Mazin S. Algudady; Chita R. Das; Matthew J. Thazhuthaveetil

The authors present a cache coherence protocol for MIN-based multiprocessors with two distinct private caches: private-block caches, containing information private to a processor, and shared-block caches, containing data accessible by all processors. The protocol utilizes a coherence control bus (snooping) connecting all shared-block cache controllers. Timing problems due to variable transit delays through the MIN are dealt with by introducing transient states in the protocol. Assuming homogeneity of all nodes, a single-node queuing model is developed to analyze the system performance. This model is solved using the mean-value-analysis technique, with protocol state probabilities and a few communication delays as input parameters. System performance measures are verified through simulation.
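
A toy model of the write-update idea, omitting the MIN, the transient states, and the private-block caches (a simplification, not the paper's full protocol): all shared-block cache controllers snoop one bus, and a write broadcasts the new value so sharers update their copies in place rather than invalidating them.

```python
class SharedBlockCache:
    """Write-update snooping on a shared bus (illustrative miniature)."""

    def __init__(self, bus):
        self.lines = {}                  # address -> cached value
        bus.append(self)                 # attach to the coherence bus

    def read(self, addr, memory):
        if addr not in self.lines:       # miss: fetch and become a sharer
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value, bus, memory):
        memory[addr] = value
        for cache in bus:                # broadcast: every sharer updates
            if addr in cache.lines:
                cache.lines[addr] = value
        self.lines[addr] = value

bus, memory = [], {0x10: 7}
c0, c1 = SharedBlockCache(bus), SharedBlockCache(bus)
c0.read(0x10, memory); c1.read(0x10, memory)      # both become sharers
c0.write(0x10, 42, bus, memory)                   # update, don't invalidate
print(c1.read(0x10, memory))                      # 42: sharer stayed fresh
```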


parallel computing | 2008

Impact of message compression on the scalability of an atmospheric modeling application on clusters

V. Santhosh Kumar; R. Nanjundiah; Matthew J. Thazhuthaveetil; R. Govindarajan

In this paper, we study the scalability of an atmospheric modeling application on a cluster with commercially available off-the-shelf interconnects. It is found that interconnects with large latency and low bandwidth are major bottlenecks for performance scalability. Latency response curves show that, for large messages, latency is extremely sensitive to message size; decreasing the message size could therefore reduce the latency and hence improve scalability. We propose both lossless and lossy (i.e., with loss of some information) compression schemes to reduce message sizes. These compression techniques are investigated for the Community Atmospheric Model (CAM), a large-scale parallel application used for global climate simulation, on an IBM Power5 cluster with a Gigabit interconnect. This is a floating-point intensive application which involves both point-to-point and collective all-to-all communication of large messages (>128 KB). The floating-point data that constitute the messages in CAM achieve 14.8% compression when lossless compression is employed, and the speedup improves by about 18% on 32 processors. We further evaluate three lossy compression schemes with very low overheads (0.15%), studying the acceptability criteria for their information loss using a perturbation growth test procedure. The lossy compression schemes achieve a message size reduction of 66.2% and an execution time speedup of up to 20.78 on 32 processors.
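
The lossless half of the idea is easy to demonstrate (the array below is synthetic, and zlib stands in for whatever compressor the paper used): compress the bytes of a floating-point message before sending it, trading a little CPU time for a smaller transfer. A precision-truncating cast serves here as a crude stand-in for the paper's lossy schemes.

```python
import zlib

import numpy as np

# Synthetic "message": a smooth floating-point field, 2 MB of float64.
rng = np.random.default_rng(2)
field = np.cumsum(rng.normal(0, 1e-3, 256 * 1024))

raw = field.tobytes()
packed = zlib.compress(raw, level=1)       # low level keeps CPU cost small
print(f"lossless: {len(raw)} B -> {len(packed)} B "
      f"({1 - len(packed) / len(raw):.1%} smaller)")

# Crude lossy stand-in: halve the precision before compressing.
lossy = zlib.compress(field.astype(np.float32).tobytes(), level=1)
print(f"lossy:    {len(raw)} B -> {len(lossy)} B")
```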


Advances in Computers | 1995

Cache Coherence in Multiprocessors: A Survey

Mazin S. Yousif; Matthew J. Thazhuthaveetil; Chita R. Das

Shared-memory multiprocessor systems use private (per-processor) caches to enhance system performance by reducing average memory access time. In-cache modification of shared data in such systems leads to a data inconsistency problem referred to as the cache coherence problem. A solution to the cache coherence problem must ensure that any read access to shared data is satisfied with the most recent version of that data item. Both hardware-based and software-assisted solutions have been developed, reported in the literature, and implemented in multiprocessors. This paper surveys the impact of cache coherence on multiprocessor architecture design. First, general hardware approaches to dealing with cache coherence in shared-memory multiprocessors are presented. The approaches presented depend on the interconnection medium, as follows: bus-based, multistage interconnection network (MIN)-based, and crossbar-based. The possibility of implementing protocols in hypercubes is also discussed. Software solutions to cache coherence are also dealt with. Coherence requirements and correctness of protocols are then described. Finally, a performance analysis summary is included.

Collaboration


Dive into Matthew J. Thazhuthaveetil's collaborations.

Top Co-Authors

R. Govindarajan, Indian Institute of Science
Chita R. Das, Pennsylvania State University
Sreepathi Pai, University of Texas at Austin
Mazin S. Algudady, Pennsylvania State University
Y. N. Srikant, Indian Institute of Science
Andrew R. Pleszkun, University of Wisconsin-Madison
Jeffrey T. Kreulen, Pennsylvania State University
P. J. Joseph, Freescale Semiconductor