David A. Penry
Brigham Young University
Publications
Featured research published by David A. Penry.
Operating Systems Review | 2009
David A. Penry
Commodity microprocessors with tens to hundreds of processor cores will require the widespread deployment of parallel programs. This deployment will be hindered by the architectural and environmental diversity introduced by multicore processors. To overcome this diversity, the operating system must change its interactions with the program runtime, and parallel runtime systems must be developed that can automatically adapt programs to the architecture and usage environment.
Computing Frontiers | 2010
David A. Penry; Daniel Richins; Tyler S. Harris; David Greenland; Koy D. Rehme
Runtime parallel optimization has been suggested as a means to overcome the difficulties of parallel programming. For runtime parallel optimization to be effective, parallelism and locality that are expressed in the programming model need to be communicated to the runtime system. We suggest that the compiler should expose this information to the runtime using a representation that is independent of the programming model. Such a representation allows a single runtime environment to support many different models and architectures and to perform automatic parallelization optimization.
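The idea of a programming-model-independent representation can be sketched as follows. This is an illustrative toy, not the paper's actual interface; all names here (ParallelRegion, affinity, run) are our own assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ParallelRegion:
    """A programming-model-independent description of parallelism:
    the compiler emits this whether the source program used OpenMP,
    TBB, or another model."""
    tasks: list                                   # independent units of work
    affinity: dict = field(default_factory=dict)  # task index -> locality hint

def run(region, num_workers):
    """A single generic runtime schedules any ParallelRegion, using the
    locality hints for placement (trivial round-robin fallback here)."""
    results = []
    for i, task in enumerate(region.tasks):
        worker = region.affinity.get(i, i % num_workers)
        results.append((worker, task()))
    return results
```

Because every model lowers to the same structure, the runtime can apply one set of placement and adaptation policies across all of them.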
International Symposium on Circuits and Systems | 2010
Zhuo Ruan; David A. Penry
Pure software simulators are too slow to simulate modern complex computer architectures and systems. Hybrid software/hardware simulators have been proposed to accelerate architecture simulation. However, the design of the hardware portions and hardware/software interface of the simulator is time-consuming, making it difficult to modify and improve these simulators. We here describe the Simulation Partitioning Research Infrastructure (SPRI), an infrastructure which partitions the software architectural model under user guidance and automatically synthesizes hybrid simulators. We also present a case study using SPRI to investigate the performance limitations and bottlenecks of the generated hybrid simulators.
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2011
David A. Penry; Kurtis D. Cahill
Functional simulators find widespread use as sub-systems within microarchitectural simulators. The speed of functional simulators is strongly influenced by the implementation style of the functional simulator, e.g. interpreted vs. binary-translated simulation. Speed is also strongly influenced by the level of detail of the interface the functional simulator presents to the rest of the timing simulator. This level of detail may change during design space exploration, requiring corresponding changes to the interface and the simulator. However, for many implementation styles, changing the interface is difficult. As a result, architects may choose either implementation styles which are more malleable or interfaces with more detail than is necessary. In either case, simulation speed is traded for simulator design time. We show that this tradeoff is unnecessary if an orthogonal-specification design principle is practiced: specify how a simulator is to be implemented separately from what it is implementing and then synthesize a simulator from the combined specifications. We show that the use of an Architectural Description Language (ADL) with constructs for implementation style specification makes it possible to synthesize interfaces with different implementation styles with reasonable effort.
International Conference on Computer Design | 2011
Tyler S. Harris; Zhuo Ruan; David A. Penry
Computer designers rely upon near-cycle-accurate microarchitectural simulation to explore the design space of new systems. Unfortunately, such simulators are becoming increasingly slow as systems become more complex. Hybrid simulators which offload some of the simulation work onto FPGAs can increase the speed; however, such simulators must be automatically synthesized or the time to design them becomes prohibitive. Furthermore, FPGA implementations of simulators may require multiple FPGA clock cycles to implement behavior that takes place within one simulated clock cycle, making correct arbitrary composition of simulator components impossible and limiting the amount of hardware concurrency which can be achieved. Latency-Insensitive Bounded Dataflow Networks (LI-BDNs) have been suggested as a means to permit composition of simulator components in FPGAs. However, previous work has required that LI-BDNs be created manually. This paper introduces techniques for automated synthesis of LI-BDNs from the processes of a SystemC microarchitectural model. We demonstrate that LI-BDNs can be successfully synthesized. We also introduce a technique for reducing the overhead of LI-BDNs when the latency-insensitive property is unnecessary, resulting in up to a 60% reduction in FPGA resource requirements.
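The LI-BDN firing discipline can be illustrated with a small software sketch. The class and method names below are our own, not SPRI's, and the bound on queue depth is an assumed parameter; the point is only the firing rule, which makes composed nodes tolerant of timing differences:

```python
from collections import deque

class LIProcess:
    """Sketch of one node of a latency-insensitive bounded dataflow
    network: it fires only when every input queue holds a token and
    every downstream queue has space, so nodes compose correctly even
    when one simulated cycle takes several FPGA cycles."""
    def __init__(self, func, n_inputs, bound=2):
        self.func = func
        self.inputs = [deque() for _ in range(n_inputs)]
        self.outputs = []          # downstream input queues, set by connect()
        self.bound = bound

    def connect(self, downstream, port):
        self.outputs.append(downstream.inputs[port])

    def try_fire(self):
        if any(len(q) == 0 for q in self.inputs):
            return False           # a required input token is missing
        if any(len(q) >= self.bound for q in self.outputs):
            return False           # a downstream buffer is full
        args = [q.popleft() for q in self.inputs]
        result = self.func(*args)
        for q in self.outputs:
            q.append(result)
        return True
```

The overhead-reduction technique mentioned in the abstract corresponds, in this picture, to removing the queueing machinery when a node's timing is statically known.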
International Conference on Computer Design | 2010
Zhuo Ruan; Kurtis D. Cahill; David A. Penry
Structural modeling serves as an efficient method for creating detailed microarchitectural models of complex microprocessors. High-level language constructs such as templates and object polymorphism are used to achieve a high degree of code reuse, thereby reducing development time. However, these modeling frameworks are currently too slow to evaluate future designs of multicore microprocessors. The synthesis of portions of these models into hardware to form hybrid simulators promises to improve their speed substantially. Unfortunately, the high-level language constructs used in structural simulation frameworks are not typically synthesizable. One factor which limits their synthesis is that it is very difficult to determine statically exactly what code and data to synthesize. We propose an elaboration-time synthesis method for SystemC-based microarchitectural simulators. As part of the runtime environment of our infrastructure, the synthesis tool extracts architectural information after elaboration, binds dynamic information to a low-level intermediate representation (IR), and synthesizes the IR to VHDL. We show that this approach permits the synthesis of high-level language constructs which could not be easily synthesized before.
International Journal of Parallel Programming | 2013
David A. Penry; Kurtis D. Cahill
Functional simulators find widespread use as subsystems within microarchitectural simulators. The speed of a functional simulator is strongly influenced by its implementation style, e.g. interpreted versus binary-translated simulation. Speed is also strongly influenced by the level of detail of the interface the functional simulator presents to the rest of the timing simulator. This level of detail may change during design space exploration, requiring corresponding changes to the interface and the simulator. However, for many implementation styles, changing the interface is difficult. As a result, architects may choose either implementation styles which are more malleable or interfaces with more detail than is necessary. In either case, simulation speed is traded for simulator design time. Such a tradeoff has become particularly unfortunate as multicore processor designs proliferate and multi-threaded benchmarks must be simulated. We show that this tradeoff is unnecessary if an orthogonal-specification design principle is practiced: specify how a simulator is to be implemented separately from what it is implementing and then synthesize a simulator from the combined specifications. We show that the use of an architectural description language with constructs for implementation style specification makes it possible to synthesize interfaces with different implementation styles with reasonable effort.
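The orthogonal-specification principle can be sketched in a few lines. In this toy model (our own construction, not the paper's ADL), the "what" is a table of instruction semantics written once, and each "how" is a separate implementation style that reuses that table unchanged:

```python
def _add(regs, a, b, d):
    regs[d] = regs[a] + regs[b]

def _sub(regs, a, b, d):
    regs[d] = regs[a] - regs[b]

# "What": instruction semantics, specified once.
SEMANTICS = {"add": _add, "sub": _sub}

def interpret(program, regs):
    """One "how": a plain interpreter that dispatches per instruction."""
    for op, *operands in program:
        SEMANTICS[op](regs, *operands)
    return regs

def translate(program):
    """Another "how": pre-bind each instruction to its handler up front
    (a stand-in for binary translation); the semantics are untouched."""
    ops = [(SEMANTICS[op], operands) for op, *operands in program]
    def run(regs):
        for fn, operands in ops:
            fn(regs, *operands)
        return regs
    return run
```

Switching implementation styles changes no semantic code, which is the property that lets a synthesizer regenerate the simulator interface as the design space exploration evolves.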
International Conference on Computer Design | 2012
Zhuo Ruan; David A. Penry
Computer designers rely upon near-cycle-accurate microarchitectural simulators to explore the design space of new systems. Hybrid simulators which offload simulation work onto FPGAs overcome the speed limitations of software-only simulators as systems become more complex; however, such simulators must be automatically synthesized or the time to design them becomes prohibitive. The performance of a hybrid simulator is significantly affected by how the interface between software and hardware is constructed. We characterize the design space of interfaces for synthesized structural hybrid microarchitectural simulators, provide implementations for several such interfaces, and determine the tradeoffs involved in choosing an efficient design candidate.
International Journal of Parallel Programming | 2017
Matthew B. Ashcraft; Alexander Lemon; David A. Penry; Quinn Snell
Accelerators such as GPUs, FPGAs, and many-core processors can provide significant performance improvements, but their effectiveness is dependent upon the skill of programmers to manage their complex architectures. One area of difficulty is determining which data to transfer on and off of the accelerator and when. Poorly placed data transfers can result in overheads that completely dwarf the benefits of using accelerators. To know what data to transfer, and when, the programmer must understand the data-flow of the transferred memory locations throughout the program, and how the accelerator region fits into the program as a whole. We argue that compilers should take on the responsibility of data transfer scheduling, thereby reducing the demands on the programmer, and resulting in improved program performance and program efficiency from the reduction in the number of bytes transferred. We show that by performing whole-program transfer scheduling on accelerator data transfers we are able to automatically eliminate up to 99% of the bytes transferred to and from the accelerator compared to transferring all data immediately before and after kernel execution for all data involved. The analysis and optimization are language- and accelerator-agnostic, but for our examples and testing they have been implemented into an OpenMP to LLVM-IR to CUDA workflow.
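The core of such an optimization can be sketched as a simple staleness analysis over a whole program. This is a toy model under strong assumptions we are adding for illustration (buffers are named, and the host does not modify device-resident data between kernels), not the paper's actual algorithm:

```python
def schedule_transfers(kernels):
    """kernels: list of (reads, writes) buffer-name sets, in program order.
    Returns (h2d, d2h): per-kernel sets of buffers to copy host->device
    before each kernel and device->host after it. A buffer is copied in
    only when its device copy is stale, and copied out only once, after
    its last writer, instead of around every kernel."""
    on_device, h2d = set(), []
    for reads, writes in kernels:
        h2d.append(reads - on_device)    # copy in only stale inputs
        on_device |= reads | writes      # device copy is now current
    d2h, scheduled = [set() for _ in kernels], set()
    for i in range(len(kernels) - 1, -1, -1):
        _, writes = kernels[i]
        d2h[i] = writes - scheduled      # copy out after the last writer only
        scheduled |= writes
    return h2d, d2h
```

A naive scheme would move every buffer in and out around every kernel; here, data that stays on the device between kernels is never re-transferred, which is the source of the large byte-count reductions the abstract reports.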
International Conference on Computer Design | 2015
David A. Penry
Computer designers rely upon near-cycle-accurate microarchitectural simulators to explore the design space of new systems. Hybrid simulators which offload simulation work onto FPGAs (also known as FAME simulators) can overcome the speed limitations of software-only simulators. However, such simulators must be automatically synthesized or the time to design them becomes prohibitive. Previous work has shown that synthesized simulators should use a latency-insensitive design style in the hardware and a concurrent interface with the software. We show that the performance of the interface in such a simulator can be improved significantly by scheduling all communication between hardware and software. Scheduling reduces the amount of hardware/software communication and reduces software overhead. Scheduling is made possible by exploiting the properties of the latency-insensitive design technique recommended in previous work. We observe speedups of up to 1.54× versus the previous interface for a multi-core simulator.
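The effect of scheduling boundary communication can be illustrated abstractly. In this toy model (our own names and structure, not the paper's interface), a static schedule lists the signals crossing the hardware/software boundary, so each simulated cycle costs exactly one batched round trip instead of one exchange per signal:

```python
def run_scheduled(cycles, hw_step, sw_step, schedule):
    """Drive a hardware model and a software model in lockstep, exchanging
    one batched message per direction per simulated cycle, according to a
    static schedule (to_hw, to_sw) of signal names that cross the boundary."""
    to_hw, to_sw = schedule
    hw_state, sw_state = {}, {}
    for _ in range(cycles):
        msg = {name: sw_state.get(name, 0) for name in to_hw}    # one batched send
        hw_state = hw_step(hw_state, msg)
        reply = {name: hw_state.get(name, 0) for name in to_sw}  # one batched reply
        sw_state = sw_step(sw_state, reply)
    return sw_state
```

Batching is sound here precisely because latency-insensitive hardware tolerates tokens arriving together rather than at exact cycle offsets, which mirrors the property the abstract says the scheduling exploits.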