Publication


Featured research published by Matthew A. Watkins.


international symposium on microarchitecture | 2006

Leveraging Optical Technology in Future Bus-based Chip Multiprocessors

Nevin Kirman; Meyrem Kirman; Rajeev K. Dokania; Jose F. Martinez; Alyssa B. Apsel; Matthew A. Watkins; David H. Albonesi

Although silicon optical technology is still in its formative stages, and its more near-term application is chip-to-chip communication, rapid advances have been made in the development of on-chip optical interconnects. In this paper, we investigate the integration of CMOS-compatible optical technology into on-chip cache-coherent buses in future CMPs. While not exhaustive, our investigation yields a hierarchical opto-electrical system that exploits the advantages of optical technology while abiding by projected limitations. Our evaluation shows that, for the applications considered, significant performance improvements can be achieved with an opto-electrical bus compared to an aggressive all-electrical bus of similar power and area. This performance improvement depends largely on an application's bandwidth demand and on the number of implemented wavelengths per optical waveguide. We also present a number of critical areas for future work that we discovered in the course of our research.
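
To make the dependence on wavelengths per waveguide concrete, the short sketch below estimates aggregate bandwidth for a wavelength-division-multiplexed bus. The function name and all parameter values are illustrative assumptions, not figures from the paper.

    # Back-of-the-envelope estimate of aggregate bandwidth for a WDM optical bus.
    # Every number here is a placeholder assumption, not data from the paper.

    def optical_bus_bandwidth_gbps(waveguides, wavelengths_per_waveguide,
                                   gbps_per_wavelength):
        """Aggregate bandwidth = waveguides x wavelengths x per-wavelength rate."""
        return waveguides * wavelengths_per_waveguide * gbps_per_wavelength

    for wavelengths in (4, 16, 64):  # sweep wavelengths per waveguide
        bw = optical_bus_bandwidth_gbps(waveguides=8,
                                        wavelengths_per_waveguide=wavelengths,
                                        gbps_per_wavelength=10)
        print(f"{wavelengths:3d} wavelengths/waveguide -> {bw} Gb/s aggregate")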


international symposium on microarchitecture | 2007

On-Chip Optical Technology in Future Bus-Based Multicore Designs

Nevin Kirman; Meyrem Kirman; Rajeev K. Dokania; Jose F. Martinez; Alyssa B. Apsel; Matthew A. Watkins; David H. Albonesi

This work investigates the integration of CMOS-compatible optical technology into on-chip coherent buses for future CMPs. The analysis results in a hierarchical opto-electrical bus that exploits the advantages of optical technology while abiding by projected limitations. This bus achieves significant performance improvements for high-bandwidth applications relative to a state-of-the-art fully electrical bus.


field-programmable logic and applications | 2008

Shared reconfigurable architectures for CMPs

Matthew A. Watkins; Mark J. Cianchetti; David H. Albonesi

This paper investigates reconfigurable architectures suitable for chip multiprocessors (CMPs). Prior research has established that augmenting a conventional processor with reconfigurable logic can dramatically improve the performance of certain application classes, but this comes at non-trivial power and area costs. Given substantial observed differences in fabric usage across time and space, we propose that pools of programmable logic should be shared among multiple cores. While a common shared pool is more compact and power efficient, fabric conflicts may lead to large performance losses relative to per-core private fabrics. We identify characteristics of past reconfigurable fabric designs that are particularly amenable to fabric sharing, and we then propose spatially and temporally shared fabrics in a CMP. The sharing policies that we devise incur negligible performance loss compared to private fabrics, while cutting the area and peak power of the fabric by 4X.
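
As a rough illustration of temporal sharing, the sketch below shows one possible arbiter that grants a single shared fabric to requesting cores in first-come, first-served order, so a conflict only delays a requester rather than requiring a private per-core fabric. The class and its policy are assumptions made for illustration, not the sharing policies devised in the paper.

    from collections import deque

    # Minimal sketch of temporally sharing one reconfigurable fabric among cores.
    # The arbiter and its policy are illustrative assumptions.
    class FabricArbiter:
        def __init__(self):
            self.owner = None        # core currently holding the fabric
            self.waiting = deque()   # cores queued behind a fabric conflict

        def request(self, core_id):
            """A core asks for the fabric; grant it immediately if idle."""
            if self.owner is None:
                self.owner = core_id
                return True
            self.waiting.append(core_id)
            return False

        def release(self, core_id):
            """The owning core finishes; pass the fabric to the next waiter."""
            assert core_id == self.owner
            self.owner = self.waiting.popleft() if self.waiting else None
            return self.owner

    arbiter = FabricArbiter()
    print(arbiter.request(0))  # True: fabric was idle, core 0 owns it
    print(arbiter.request(1))  # False: core 1 waits (a fabric conflict)
    print(arbiter.release(0))  # 1: fabric passes to core 1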


international conference on parallel architectures and compilation techniques | 2010

Dynamically managed multithreaded reconfigurable architectures for chip multiprocessors

Matthew A. Watkins; David H. Albonesi

Prior work has demonstrated that reconfigurable logic can significantly benefit certain applications. However, reconfigurable architectures have traditionally suffered from high area overhead and limited application coverage. We present a dynamically managed multithreaded reconfigurable architecture consisting of multiple clusters of shared reconfigurable fabrics that greatly reduces the area overhead of reconfigurability while still offering the same power efficiency and performance benefits. Like other shared SMT and CMP resources, the dynamic partitioning of the reconfigurable resource among sharing threads, along with the co-scheduling of threads among different reconfigurable clusters, must be intelligently managed for the full benefits of the shared fabrics to be realized. We propose a number of sophisticated dynamic management approaches, including the application of machine learning, multithreaded phase-based management, and stability detection. Overall, we show that, with our dynamic management policies, multithreaded reconfigurable fabrics can achieve better energy × delay², at far less area and power, than providing each core with a much larger private fabric. Moreover, our approach achieves dramatically higher performance and energy efficiency for particular workloads compared to what can be ideally achieved by allocating the fabric area to additional cores.
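
The energy × delay² metric favors designs that save energy and power without sacrificing performance. The sketch below shows, with made-up numbers, how a runtime manager might use that metric to score candidate fabric partitions between two sharing threads; the partitions and measurements are hypothetical, not results from the paper.

    # Sketch: choose the fabric partition with the lowest energy x delay^2 score.
    # Candidate partitions and their measurements are made-up examples.

    def ed2(energy_joules, delay_seconds):
        """Energy-delay-squared product: lower is better."""
        return energy_joules * delay_seconds ** 2

    candidates = {
        # fabric share given to (thread A, thread B): (energy in J, delay in s)
        (8, 0): (1.9, 0.012),
        (4, 4): (2.0, 0.009),
        (2, 6): (2.1, 0.010),
    }

    best = min(candidates, key=lambda split: ed2(*candidates[split]))
    print("best partition:", best, "ED^2 =", ed2(*candidates[best]))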


international symposium on microarchitecture | 2011

ReMAP: A Reconfigurable Architecture for Chip Multiprocessors

Matthew A. Watkins; David H. Albonesi

ReMAP is a reconfigurable architecture for accelerating and parallelizing applications within a heterogeneous chip multiprocessor (CMP). Clusters of cores share a common reconfigurable fabric adaptable for individual thread computation or fine-grained communication with integrated computation. ReMAP demonstrates significantly higher performance and energy efficiency than hard-wired, communication-only mechanisms, and than allocating the fabric area to additional or more powerful cores.


high-performance computer architecture | 2016

Software transparent dynamic binary translation for coarse-grain reconfigurable architectures

Matthew A. Watkins; Tony Nowatzki; Anthony Carno

The end of Dennard scaling has forced architects to focus on designing for execution efficiency. Coarse-grained reconfigurable architectures (CGRAs) are a class of architectures that provide a configurable grouping of functional units, aiming to bridge the gap between the power and performance of custom hardware and the flexibility of software. Despite their potential benefits, CGRAs face a major adoption challenge because they do not execute a standard instruction stream. Dynamic translation for CGRAs has the potential to solve this problem, but it faces non-trivial challenges: existing attempts either do not achieve the full power and performance potential CGRAs offer or suffer from excessive translation time. In this work we propose DORA, a Dynamic Optimizer for Reconfigurable Architectures, which achieves substantial (2X) power and performance improvements while having low hardware and insertion overhead and benefiting the current execution. In addition to traditional optimizations, DORA leverages dynamic register information to perform optimizations not available to compilers, achieving performance similar to or better than CGRA-targeted compiled code.
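
A common building block of any dynamic binary translator is detecting hot code regions with execution counters and paying the translation cost only for regions that prove hot. The sketch below illustrates that general idea; the threshold, the translate_to_cgra stub, and the dispatch flow are assumptions, not DORA's actual design.

    from collections import defaultdict

    # Illustrative hot-region detector for a dynamic binary translator: count
    # executions of each region and translate it for the CGRA only once it
    # proves hot. The threshold and translation stub are assumptions.
    HOT_THRESHOLD = 1000
    exec_counts = defaultdict(int)
    translated = {}  # region start PC -> CGRA configuration

    def translate_to_cgra(region_pc):
        return f"cgra-config-for-{region_pc:#x}"  # stand-in for real translation

    def on_region_entry(region_pc):
        if region_pc in translated:
            return translated[region_pc]       # dispatch to the CGRA version
        exec_counts[region_pc] += 1
        if exec_counts[region_pc] >= HOT_THRESHOLD:
            translated[region_pc] = translate_to_cgra(region_pc)
        return None                            # keep running the original code

    for _ in range(HOT_THRESHOLD):
        on_region_entry(0x400a10)
    print(on_region_entry(0x400a10))  # now served by the translated configuration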


high performance embedded architectures and compilers | 2008

Revisiting Cache Block Superloading

Matthew A. Watkins; Sally A. McKee; Lambert Schaelicke

Technological advances and increasingly complex and dynamic application behavior argue for revisiting mechanisms that adapt logical cache block size to application characteristics. This approach to bridging the processor/memory performance gap has been studied before, but mostly via trace-driven simulation, looking only at L1 caches. Given changes in hardware/software technology, we revisit the general approach: we propose a transparent, phase-adaptive, low-complexity mechanism for L2 superloading and evaluate it on a full-system simulator for 23 SPEC CPU2000 codes. Targeting the L2 benefits both instruction and data fetches. We investigate cache blocks of 32-512B, confirming that no fixed size performs well for all applications: performance differs by 5-49% between the best and worst fixed block sizes. Our scheme obtains performance similar to the best static block size for each application. In a few cases, we minimally decrease performance compared to the best static size, but the best size varies per application and rarely matches that of real hardware. We generally improve performance over the best static choices, by up to 10%. Phase adaptability particularly benefits multiprogrammed workloads with conflicting locality characteristics, yielding performance gains of 5-20%. Our approach also outperforms next-line and delta prefetching.
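
One way to picture phase-adaptive superloading is a selector that, at each phase boundary, switches the logical block size to whichever candidate size currently shows the fewest estimated misses. The sketch below illustrates only that selection step; the miss counts are made-up assumptions, and this is not the mechanism proposed in the paper.

    # Sketch of picking an L2 logical block (superload) size at a phase boundary
    # from per-size miss estimates. Sizes match the 32-512B range studied; the
    # miss counts are invented for illustration.
    CANDIDATE_SIZES = (32, 64, 128, 256, 512)  # bytes

    def pick_block_size(misses_by_size):
        """Return the candidate size with the fewest estimated misses this phase."""
        return min(CANDIDATE_SIZES, key=lambda size: misses_by_size[size])

    phase_misses = {32: 9000, 64: 6200, 128: 4100, 256: 4600, 512: 7800}
    print("next-phase logical block size:", pick_block_size(phase_misses), "bytes")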


international conference on parallel architectures and compilation techniques | 2007

A Phase-Adaptive Approach to Increasing Cache Performance

Matthew A. Watkins; Sally A. McKee; Lambert Schaelicke

Technological advances, along with more complex and dynamic application behavior, argue for revisiting mechanisms that adapt logical cache block size to application characteristics. This approach to bridging the processor/memory performance gap has been studied in the past, but most studies used trace-driven simulation and only looked at L1 caches. Given the changes in hardware and software since these seminal studies, we revisit the general approach: we present a transparent, phase-adaptive mechanism for L2 cache block superloading with minimal hardware complexity, evaluating it on a full-system simulator with 23 SPEC CPU2000 applications run to completion using training inputs.


international symposium on performance analysis of systems and software | 2017

Characterization of GPGPU workloads on a multidimensional heterogeneous processor

Matthew A. Watkins; Philip Bedoukian

Systems with multiple forms of heterogeneity, including functional, performance, and dynamic heterogeneity, are now commercially available. The use and tuning of any one of these options can impact the others, so it is important to understand their interactions. This work characterizes the power and performance implications of multiple dimensions of heterogeneity on a commercial multidimensional heterogeneous processor, using commonly evaluated GPGPU workloads.


international symposium on performance analysis of systems and software | 2018

Characterizing a Commercial Multidimensional Heterogeneous Processor Under GPGPU Workloads

Matthew A. Watkins; Philip Bedoukian

Collaboration


Dive into Matthew A. Watkins's collaboration.

Top Co-Authors

Sally A. McKee

Chalmers University of Technology
