Kyle Bruce Wheeler | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Kyle Bruce Wheeler is active.

Explore More

Publication

Featured researches published by Kyle Bruce Wheeler.

international parallel and distributed processing symposium | 2008

Qthreads: An API for programming with millions of lightweight threads

Kyle Bruce Wheeler; Richard C. Murphy; Douglas Thain

Large scale hardware-supported multithreading, an attractive means of increasing computational power, benefits significantly from low per-thread costs. Hardware support for lightweight threads is a developing area of research. Each architecture with such support provides a unique interface, hindering development for them and comparisons between them. A portable abstraction that provides basic lightweight thread control and synchronization primitives is needed. Such an abstraction would assist in exploring both the architectural needs of large scale threading and the semantic power of existing languages. Managing thread resources is a problem that must be addressed if massive parallelism is to be popularized. The qthread abstraction enables development of large-scale multithreading applications on commodity architectures. This paper introduces the qthread API and its Unix implementation, discusses resource management, and presents performance results from the HPCCG benchmark.

computing frontiers | 2005

SPANIDS: a scalable network intrusion detection loadbalancer

Lambert Schaelicke; Kyle Bruce Wheeler; Curt Freeland

Network intrusion detection systems (NIDS) are becoming an increasingly important security measure. With rapidly increasing network speeds, the capacity of the NIDS sensor can limit the ability of the system to detect intrusions. The SPANIDS parallel NIDS architecture overcomes this limitation by distributing network traffic load over an array of sensor nodes. Based on a custom hardware load balancer and cost-effective off-the-shelf sensors, the system employs novel stateless load balancing heuristics to thwart scalability limitations. It also uses dynamic feedback from the sensor nodes to adapt to changes in network traffic. This paper describes the overall system architecture, discusses some of the critical design decisions and presents experimental results that demonstrate the performance advantage of this approach

international parallel and distributed processing symposium | 2009

Implementing a portable Multi-threaded Graph Library: The MTGL on Qthreads

Brian W. Barrett; Jonathan W. Berry; Richard C. Murphy; Kyle Bruce Wheeler

Developing multi-threaded graph algorithms, even when using the MTGL infrastructure, provides a number of challenges, including discovering appropriate levels of parallelism, preventing memory hot spotting, and eliminating accidental synchronization. In this paper, we have demonstrated that using the combination of Qthreads and MTGL with commodity processors enables the development and testing of algorithms without the expense and complexity of a Cray XMT. While achievable performance is lower for both the Opteron and Niagara platform, performance issues are similar. While we believe it is possible to port Qthreads to the Cray XMT, this work is still on-going. Therefore, porting work still must be done to move algorithm implementations between commodity processors and the XMT. Although it is likely that the Qthreads-version of an algorithm will not be as optimized as a natively implemented version of the algorithm, such a performance impact may be an acceptable trade-off for ease of implementation.

Archive | 2012

The Portals 4.0 network programming interface.

Ronald B. Brightwell; Kevin Pedretti; Kyle Bruce Wheeler; Karl Scott Hemmert; Rolf Riesen; Keith D. Underwood; Arthur B. Maccabe; Trammell Hudson

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandias Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities. 3

Archive | 2012

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale.

Matthew L. Curry; Kurt Brian Ferreira; Kevin Pedretti; Vitus J. Leung; Kenneth Moreland; Gerald Fredrick Lofstead; Ann C. Gentile; Ruth Klundt; H. Lee Ward; James H. Laros; Karl Scott Hemmert; Nathan D. Fabian; Michael J. Levenhagen; Ronald B. Brightwell; Richard Frederick Barrett; Kyle Bruce Wheeler; Suzanne M. Kelly; Arun F. Rodrigues; James M. Brandt; David C. Thompson; John P. VanDyke; Ron A. Oldfield; Thomas Tucker

This report documents thirteen of Sandias contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467-Demonstration of a Legacy Applications Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE RD in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight LWKs, virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications-Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

international conference on cluster computing | 2006

Fine-Grained Message Pipelining for Improved MPI Performance

Arun F. Rodrigues; Kyle Bruce Wheeler; Peter M. Kogge; Keith D. Underwood

By its nature, MPI leads to coarse grained communications. This is because all current MPI implementations deliver two orders of magnitude more bandwidth for large message sizes (kilobytes) than small message sizes (bytes). This translates into applications that bundle their small communications into larger communications whenever possible. In modern implementations, this sacrifice in the granularity of communication translates directly into a sacrifice in the granularity of synchronization. MPI requires that the entire message arrive before any of the data can be delivered to the application, because message completion is the only synchronization semantic the network can expose to the processor. This paper explores the implications of providing synchronization between the network and the processor at the memory word level using a mechanism such as Full/Empty Bits. This enables the application to begin computing as soon as the data for the first memory referenced has arrived without having to wait for all of the data in the message

Archive | 2012

A Comparative Critical Analysis of Modern Task-Parallel Runtimes

Kyle Bruce Wheeler; Dylan Stark; Richard C. Murphy

The rise in node-level parallelism has increased interest in task-based parallel runtimes for a wide array of application areas. Applications have a wide variety of task spawning patterns which frequently change during the course of application execution, based on the algorithm or solver kernel in use. Task scheduling and load balance regimes, however, are often highly optimized for specific patterns. This paper uses four basic task spawning patterns to quantify the impact of specific scheduling policy decisions on execution time. We compare the behavior of six publicly available tasking runtimes: Intel Cilk, Intel Threading Building Blocks (TBB), Intel OpenMP, GCC OpenMP, Qthreads, and High Performance ParalleX (HPX). With the exception of Qthreads, the runtimes prove to have schedulers that are highly sensitive to application structure. No runtime is able to provide the best performance in all cases, and those that do provide the best performance in some cases, unfortunately, provide extremely poor performance when application structure does not match the scheduler’s assumptions.

Concurrency and Computation: Practice and Experience | 2010