
Publication


Featured research published by Shannon K. Kuntz.


international conference on supercomputing | 1999

Microservers: a new memory semantics for massively parallel computing

Jay B. Brockman; Peter M. Kogge; Thomas L. Sterling; Vincent W. Freeh; Shannon K. Kuntz

The semantics of memory (a large state which can only be read or changed a small piece at a time) has remained virtually untouched since von Neumann, and its effects (latency and bandwidth) have proved to be the major stumbling block for high performance computing. This paper suggests a new model, termed “microservers,” that exploits “Processing-In-Memory” VLSI technology, and that can reduce latency and memory traffic, increase inherent opportunities for concurrency, and support a variety of highly concurrent programming paradigms. Application of this model is then discussed in the framework of several ongoing supercomputing programs, particularly the HTMT petaflops project.
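A minimal sketch of the contrast the abstract draws, assuming a message-based interface to the memory: instead of pulling a word across the memory interface, modifying it, and writing it back, a small request is shipped to the memory that holds the data and executed there. All names below (msg_t, pim_side_increment, the method encoding) are illustrative, not the paper's API.

```c
/* Illustrative sketch of the microserver idea: move a small request to the
 * data rather than moving the data to the processor. Names are hypothetical. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t target_addr;   /* object the request operates on       */
    uint32_t method;        /* small operation to run at the memory */
    uint32_t operand;       /* immediate argument                   */
} msg_t;

/* Conventional model: latency is paid on both the read and the write. */
static uint64_t host_side_increment(volatile uint64_t *counter, uint32_t delta)
{
    uint64_t v = *counter;      /* read  (one full round trip)   */
    v += delta;
    *counter = v;               /* write (a second round trip)   */
    return v;
}

/* Microserver-style model: one small message, computation at the memory. */
static void pim_side_increment(msg_t *queue, uint64_t addr, uint32_t delta)
{
    msg_t m = { .target_addr = addr, .method = 1 /* "add" */, .operand = delta };
    memcpy(queue, &m, sizeof m);   /* stand-in for enqueuing to the memory macro */
}

int main(void)
{
    uint64_t counter = 0;
    msg_t queue[1];
    host_side_increment(&counter, 5);
    pim_side_increment(queue, (uint64_t)(uintptr_t)&counter, 5);
    return 0;
}
```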


international symposium on computer architecture | 2004

A low cost, multithreaded processing-in-memory system

Jay B. Brockman; Shyamkumar Thoziyoor; Shannon K. Kuntz; Peter M. Kogge

This paper discusses die cost vs. performance tradeoffs for a PIM system that could serve as the memory system of a host processor. For an increase of less than twice the cost of a commodity DRAM part, it is possible to realize a performance speedup of nearly a factor of 4 on irregular applications. This cost efficiency derives from developing a custom multithreaded processor architecture and implementation style that is well suited for embedding in a memory. Specifically, it takes advantage of the memory's low latency and high row bandwidth both to simplify processor design (reducing area) and to improve processing throughput. To support our claims of cost and performance, we have used simulation and analysis of existing chips, and have also designed and fully implemented a prototype chip, PIM Lite.


international parallel and distributed processing symposium | 2007

A Heterogeneous Lightweight Multithreaded Architecture

Sheng Li; Amit Kashyap; Shannon K. Kuntz; Jay B. Brockman; Peter M. Kogge; Paul L. Springer; Gary L. Block

Programs with irregular patterns of dynamic data structures and/or complicated control structures such as recursion are notoriously difficult to parallelize efficiently. For some highly irregular applications, such as a SAT solver, it has been nearly impossible to obtain significant parallel speedups on conventional SMP systems over serial implementations. Lightweight multithreading, as found in the Cray MTA and the upcoming XMT (Eldorado), has been demonstrated as an effective approach to attacking these problems. In this paper, we describe a heterogeneous lightweight multithreaded architecture that extends ideas found in the Cray machines to support larger numbers of threads while reducing the cost of thread management and synchronization.
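The Cray MTA machines cited above synchronize threads through full/empty bits attached to memory words; the sketch below models that semantics in software to make the idea concrete. The sync_word_t struct and spin loops are stand-ins (on the real hardware the tag travels with the word and a blocked thread simply stalls), and this is not the specific mechanism evaluated in the paper.

```c
/* Software model of MTA-style full/empty synchronization, for illustration only. */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic int full;   /* models the hardware full/empty tag: 0 empty, 1 full */
    uint64_t    value;
} sync_word_t;

/* writeef: wait until the word is empty, write the value, mark it full. */
static void writeef(sync_word_t *w, uint64_t v)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&w->full, &expected, -1))
        expected = 0;              /* spin here; real hardware just stalls the thread */
    w->value = v;
    atomic_store(&w->full, 1);
}

/* readfe: wait until the word is full, read the value, mark it empty. */
static uint64_t readfe(sync_word_t *w)
{
    int expected = 1;
    while (!atomic_compare_exchange_weak(&w->full, &expected, -1))
        expected = 1;
    uint64_t v = w->value;
    atomic_store(&w->full, 0);
    return v;
}
```

Because every word carries its own synchronization state, producer/consumer handoffs like this need no separate locks, which is what keeps per-thread management and synchronization cheap enough to run very large numbers of threads.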


international parallel and distributed processing symposium | 2008

Memory model effects on application performance for a lightweight multithreaded architecture

Sheng Li; Shannon K. Kuntz; Peter M. Kogge; Jay B. Brockman

In this paper, we evaluate the effects of a partitioned global address space (PGAS) versus a flat, randomized distributed global address space (DGAS) in the context of a lightweight multithreaded parallel architecture. We also execute our benchmarks on the Cray MTA-2, a multithreaded architecture with a DGAS mapping. Key results demonstrate that distributing data under the PGAS mapping increases locality, effectively reducing the memory latency and the number of threads needed to achieve a given level of performance. In contrast, the DGAS mapping provides a simpler programming model by eliminating the need to distribute data and, assuming sufficient application parallelism, can achieve similar performance by leveraging large numbers of threads to hide the longer latencies.
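To make the two mappings concrete, here is a minimal sketch of how a word address might be assigned to a node under each model. The node count, block size, and hash below are assumptions for illustration, not the mappings used in the paper or by the Cray MTA-2.

```c
/* Illustrative address-to-node mappings: block-distributed (PGAS-style)
 * versus flat, randomized (DGAS-style). Constants are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define NODES       64u
#define BLOCK_WORDS 4096u   /* contiguous words per node under the PGAS mapping */

/* PGAS: consecutive addresses stay on one node, so data placed near the
 * threads that use it is accessed locally. */
static unsigned pgas_node(uint64_t word_addr)
{
    return (unsigned)((word_addr / BLOCK_WORDS) % NODES);
}

/* DGAS: a simple mixing hash scatters consecutive addresses across nodes,
 * trading locality for uniform load at the cost of longer average latency. */
static unsigned dgas_node(uint64_t word_addr)
{
    uint64_t x = word_addr * 0x9E3779B97F4A7C15ull;  /* Fibonacci hashing */
    return (unsigned)(x >> 58) % NODES;
}

int main(void)
{
    for (uint64_t a = 0; a < 8; a++)
        printf("addr %llu -> PGAS node %u, DGAS node %u\n",
               (unsigned long long)a, pgas_node(a), dgas_node(a));
    return 0;
}
```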


international parallel and distributed processing symposium | 2001

A microserver view of HTMT

Lilia Yerosheva; Shannon K. Kuntz; Jay B. Brockman; Peter M. Kogge

Hybrid technology multithreaded architecture (HTMT) is an ambitious new architecture combining cutting-edge technologies to reach petaflop performance sooner than current technology trends allow. It is a massively parallel architecture with multithreaded hardware and a multi-level memory hierarchy. Microservers provide a new perspective for viewing this memory hierarchy whereby memory is actively involved in process execution. This paper discusses the microserver memory semantics and initial HTMT execution models used to analyze applications at each level of the system hierarchy. To do this, we studied several applications to model the control and data flow within the HTMT hierarchy and developed pseudo-code for the user-level functions needed to express application concurrency and parallelism.


irregular applications: architectures and algorithms | 2016

Highly scalable near memory processing with migrating threads on the emu system architecture

Timothy J. Dysart; Peter M. Kogge; Martin M. Deneroff; Eric Bovell; Preston Briggs; Jay B. Brockman; Kenneth Jacobsen; Yujen Juan; Shannon K. Kuntz; Richard Lethin; Janice O. McMahon; Chandra Pawar; Martin Perrigo; Sarah Rucker; John Ruttenberg; Max Ruttenberg; Steve Stein

There is growing evidence that current architectures handle cache-unfriendly applications such as sparse math operations, data analytics, and graph algorithms poorly. This is due, in part, to the irregular memory access patterns these applications exhibit, and to how remote memory accesses are handled. This paper introduces a new, highly scalable PGAS memory-centric system architecture in which migrating threads travel to the data they access, so that scaling both memory capacities and the number of cores can be largely invisible to the programmer. The first implementation of this architecture, built with FPGAs, is discussed in detail. A comparison of key parameters with a variety of today's systems of differing architectures indicates the potential advantages, and early projections of performance against several well-documented kernels translate these advantages into comparative numbers. Future implementations of this architecture may expand the performance advantages through the application of current state-of-the-art silicon technology.
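A rough sketch of the migrating-thread idea, under the assumption that a thread's context is small enough to ship cheaply: instead of issuing remote loads, execution hops to the node that owns each element and accesses it locally. The types and the owner() mapping are hypothetical, and on the Emu hardware the migration is managed transparently by hardware rather than by an explicit call.

```c
/* Illustrative sketch of a migrating-thread access pattern. Not the Emu API. */
#include <stdint.h>
#include <stddef.h>

#define NODES 8u

typedef struct { unsigned node; uint64_t acc; } thread_ctx_t;

static unsigned owner(size_t word_index)            /* which node holds this word */
{
    return (unsigned)(word_index % NODES);
}

static void migrate_to(thread_ctx_t *t, unsigned n) /* stand-in for a hardware hop */
{
    t->node = n;
}

/* Sum a scattered set of array elements. Under a migrating-thread model the
 * only network traffic is the small thread context moving between owners;
 * each data reference itself is a local access once the thread arrives. */
uint64_t gather_sum(thread_ctx_t *t, const uint64_t *data, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned home = owner(idx[i]);
        if (home != t->node)
            migrate_to(t, home);        /* move execution to the data */
        t->acc += data[idx[i]];         /* local access at the owning node */
    }
    return t->acc;
}
```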


irregular applications: architectures and algorithms | 2017

A Case for Migrating Execution for Irregular Applications

Peter M. Kogge; Shannon K. Kuntz

Modern supercomputers have millions of cores, each capable of executing one or more threads of program execution. In these computers the site of execution for program threads rarely, if ever, changes from the node in which they were born. This paper discusses the advantages that may accrue when thread states migrate freely from node to node, especially when migration is managed by hardware without requiring software intervention. Emphasis is on supporting the growing classes of algorithms where there is significant sparsity, irregularity, and lack of locality in the memory reference patterns. Evidence is drawn from reformulation of several kernels into a migrating thread context approximating that of an emerging architecture with such capabilities.


PPSC | 2001

Petaflop Computing for Protein Folding.

Shannon K. Kuntz; Richard C. Murphy; Michael Niemier; Jesús A. Izaguirre; Peter Kogge


IEEE Transactions on Parallel and Distributed Systems | 2011

Lightweight Chip Multi-Threading (LCMT): Maximizing Fine-Grained Parallelism On-Chip

Sheng Li; Shannon K. Kuntz; Jay B. Brockman; Peter M. Kogge


Archive | 2009

An application-driven approach to evaluation of a lightweight multithreaded architecture

Jay B. Brockman; Shannon K. Kuntz

Collaboration


Dive into Shannon K. Kuntz's collaborations.

Top Co-Authors

Peter M. Kogge
University of Notre Dame

Amit Kashyap
University of Notre Dame

Gary L. Block
Jet Propulsion Laboratory