Gary L. Vondran
Hewlett-Packard
Publications
Featured research published by Gary L. Vondran.
ACM Symposium on Parallel Algorithms and Architectures | 1996
Jaspal Subhlok; Gary L. Vondran
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The performance parameters of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general and realistic model of inter-task communication, and addresses the entire problem of mapping, which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.
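For orientation, the two performance measures can be stated formally. This is a hedged paraphrase, not the paper's own notation: suppose a mapping clusters the chain into modules M_1, ..., M_m, assigns p_s processors to module M_s, and replicates it r_s times. Then

% Latency: one data set traverses every module in order, so stage times add.
% Throughput: the pipeline runs at the rate of its slowest stage, and
% replicating a module r_s times multiplies that stage's rate.
\[
  L \;=\; \sum_{s=1}^{m} t_s(p_s),
  \qquad
  T \;=\; \min_{1 \le s \le m} \frac{r_s}{t_s(p_s)},
\]

where t_s(p_s) is the time for module M_s to process one data set on p_s processors (including inter-task communication). The algorithm described here minimizes L subject to a lower bound on T.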
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1995
Jaspal Subhlok; Gary L. Vondran
Many applications in a variety of domains including digital signal processing, image processing and computer vision are composed of a sequence of tasks that act on a stream of input data sets in a pipelined manner. Recent research has established that these applications are best mapped to a massively parallel machine by dividing the tasks into modules and assigning a subset of the available processors to each module. This paper addresses the problem of optimally mapping such applications onto a massively parallel machine. We formulate the problem of optimizing throughput in task pipelines and present two new solution algorithms. The formulation uses a general and realistic model for inter-task communication, takes memory constraints into account, and addresses the entire problem of mapping which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The first algorithm is based on dynamic programming and finds the optimal mapping of k tasks onto P processors in O(P⁴k²) time. We also present a heuristic algorithm that is linear in the number of processors and establish with theoretical and practical results that the solutions obtained are optimal in practical situations. The entire framework is implemented as an automatic mapping tool for the Fx parallelizing compiler for High Performance Fortran. We present experimental results that demonstrate the importance of choosing a good mapping and show that the methods presented yield efficient mappings and predict optimal performance accurately.
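To make the chain-partitioning structure concrete, here is a minimal dynamic-programming sketch in Python. It is an illustration of the recurrence shape, not the paper's O(P⁴k²) algorithm: it ignores replication, memory constraints, and the communication model, and rate(i, j, p) is a hypothetical user-supplied cost model.

from functools import lru_cache

def max_throughput(k, P, rate):
    """Maximize pipeline throughput for a chain of k tasks on P processors.

    rate(i, j, p) -- hypothetical cost model: processing rate (data sets
    per unit time) of a module containing tasks i..j (0-based, inclusive)
    run data-parallel on p processors.

    The pipeline runs at the rate of its slowest module, so we maximize
    the minimum module rate over all contiguous clusterings and
    processor assignments."""

    @lru_cache(maxsize=None)
    def best(j, p):
        # Best throughput for the prefix of tasks 0..j using at most p processors.
        if p <= 0:
            return 0.0
        result = 0.0
        for i in range(j + 1):            # tasks i..j form the last module
            for q in range(1, p + 1):     # q processors for that module
                last = rate(i, j, q)
                t = last if i == 0 else min(best(i - 1, p - q), last)
                result = max(result, t)
        return result

    return best(k - 1, P)

# Toy cost model: task t does work[t] units; a module's rate scales
# linearly with its processor count (perfect data parallelism).
work = [4.0, 2.0, 6.0]
ideal_rate = lambda i, j, p: p / sum(work[i:j + 1])
print(max_throughput(3, 8, ideal_rate))   # -> 0.666... (8 processors / 12 work units)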
Journal of Parallel and Distributed Computing | 2000
Jaspal Subhlok; Gary L. Vondran
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The performance parameters for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of optimizing throughput with a latency constraint. The problem formulation uses a general and realistic model of intertask communication and addresses the entire problem of mapping, which includes clustering tasks into modules, assigning processors to modules, and possibly replicating modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.
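In the same hedged spirit as the throughput sketch above, the latency-optimization problem admits a similar recurrence. Again, time(i, j, p) is a hypothetical cost model, and the sketch omits the replication and communication refinements the paper treats.

import math
from functools import lru_cache

def min_latency(k, P, time, T_min):
    """Minimize pipeline latency for a chain of k tasks on P processors,
    subject to every module sustaining throughput at least T_min.

    time(i, j, p) -- hypothetical cost model: time for a module of tasks
    i..j (0-based, inclusive) to process one data set on p processors.
    Latency is the sum of module times along the chain; a module is
    feasible only if its rate 1/time meets the throughput bound."""

    @lru_cache(maxsize=None)
    def best(j, p):
        # Minimum latency for the prefix of tasks 0..j using at most p processors.
        if p <= 0:
            return math.inf
        lat = math.inf
        for i in range(j + 1):            # tasks i..j form the last module
            for q in range(1, p + 1):
                t = time(i, j, q)
                if t <= 0 or 1.0 / t < T_min:   # violates throughput constraint
                    continue
                prev = 0.0 if i == 0 else best(i - 1, p - q)
                lat = min(lat, prev + t)
        return lat

    return best(k - 1, P)

Sweeping T_min and re-solving gives one simple route to the converse problem the abstract mentions (optimizing throughput under a latency bound), though the paper discusses its own approach.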
Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems | 2012
Gary L. Vondran; Kostas Makris; Dimosthenis Fragopoulos; C. Papadas; Niru Kumari
As the number of processors per die increases, chip thermal hotspots become concentrated within smaller and smaller areas. Furthermore, these hotspots can change as processors are dynamically throttled or taken in and out of sleep mode based upon load and overall thermal budgets. Current cooling solutions (e.g., heatsinks, heat pipes, and even liquid cooling) extract heat at the chip level but cannot independently control temperature at the hotspot level. The presented solution utilizes inkjet heads to deliver a precise coolant flow rate independently to each chip location, maintaining a very high heat transfer rate via sustained liquid-to-vapor phase change. The result is a 10-100x improvement in thermal extraction rates over existing cooling solutions, achieving heat transfer rates as high as 4.5 kW/cm². Additionally, because each hotspot is maintained independently, large temperature gradients across the entire chip surface are eliminated, making it possible to operate chips at higher operating points. This paper presents a heat sink prototype based on the inkjet-assisted spray cooling technology. The heat sink utilizes an air-cooled vapor chamber to condense and recirculate the evaporated liquid, achieving a fully closed system within the vapor chamber enclosure. The design of the prototyped solution is presented.
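As a rough sanity check on the quoted flux (a back-of-the-envelope estimate assuming water as the coolant, which the abstract does not specify), the coolant mass flux needed to carry 4.5 kW/cm² purely by vaporization follows from an energy balance:

% Energy balance for pure liquid-to-vapor phase change: q'' = m'' * h_fg.
% With q'' = 4.5 kW/cm^2 and water's latent heat h_fg ~ 2257 J/g:
\[
  \dot{m}'' \;=\; \frac{q''}{h_{fg}}
            \;=\; \frac{4500\,\mathrm{W/cm^2}}{2257\,\mathrm{J/g}}
            \;\approx\; 2.0\,\mathrm{g\,s^{-1}\,cm^{-2}},
\]

i.e., roughly 2 mL of water per second per square centimeter of hotspot must be delivered and fully evaporated to sustain that flux.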
Proceedings of SPIE | 2011
Patrick J. Chase; Gary L. Vondran
Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3x faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored, including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune them based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per-multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations, that are faster than any previously published solution.
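As context for the algorithm being tuned, below is a minimal scalar sketch of standard 3D tetrahedral interpolation in Python/NumPy, using the classic six-tetrahedron decomposition of each lattice cell. It is an illustration only: the paper's GPU and SSE implementations, table layouts, and 4D variant are not reproduced, and the function name and LUT layout are our own.

import numpy as np

def tetra_interp_3d(lut, rgb):
    """Tetrahedral interpolation in a sparse 3D lookup table.

    lut: (N, N, N, C) array of output values at grid nodes.
    rgb: input coordinates in [0, 1]^3."""
    n = lut.shape[0] - 1
    x = np.clip(np.asarray(rgb, dtype=np.float64), 0.0, 1.0) * n
    i = np.minimum(x.astype(int), n - 1)      # cell origin
    f = x - i                                 # fractional position in cell
    r, g, b = f

    def node(dr, dg, db):
        # Grid node at the given corner of the enclosing cell.
        return lut[i[0] + dr, i[1] + dg, i[2] + db]

    # Pick one of six tetrahedra by ordering the fractional coordinates,
    # then blend along the corresponding path of cube vertices.
    if r >= g >= b:
        return (1 - r) * node(0, 0, 0) + (r - g) * node(1, 0, 0) \
             + (g - b) * node(1, 1, 0) + b * node(1, 1, 1)
    elif r >= b >= g:
        return (1 - r) * node(0, 0, 0) + (r - b) * node(1, 0, 0) \
             + (b - g) * node(1, 0, 1) + g * node(1, 1, 1)
    elif b >= r >= g:
        return (1 - b) * node(0, 0, 0) + (b - r) * node(0, 0, 1) \
             + (r - g) * node(1, 0, 1) + g * node(1, 1, 1)
    elif g >= r >= b:
        return (1 - g) * node(0, 0, 0) + (g - r) * node(0, 1, 0) \
             + (r - b) * node(1, 1, 0) + b * node(1, 1, 1)
    elif g >= b >= r:
        return (1 - g) * node(0, 0, 0) + (g - b) * node(0, 1, 0) \
             + (b - r) * node(0, 1, 1) + r * node(1, 1, 1)
    else:  # b >= g >= r
        return (1 - b) * node(0, 0, 0) + (b - g) * node(0, 0, 1) \
             + (g - r) * node(0, 1, 1) + r * node(1, 1, 1)

# Toy check: a LUT holding the identity mapping reproduces its input,
# since tetrahedral interpolation is exact for linear functions.
N = 5
grid = np.linspace(0.0, 1.0, N)
lut = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1)
print(tetra_interp_3d(lut, (0.2, 0.7, 0.4)))   # ~ [0.2, 0.7, 0.4]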
Electronic Imaging | 2006
Gary L. Vondran; Hui Chao; Dirk Beyer; Parag Joshi; Brian Atkins; Pere Obrador
Archive | 2005
Tong Zhang; C. Brian Atkins; Gary L. Vondran; Mei Chen; Charles A. Untulis; Stephen Philip Cheatle; Dominic Lee
Archive | 2000
Gary L. Vondran
Running a targeted campaign involves coordination and management across numerous organizations and complex process flows. Everything from market analytics on customer databases, through acquiring content and images, composing the materials, meeting the sponsoring enterprise's brand standards, and driving production and fulfillment, to evaluating results is currently performed by experienced, highly trained staff. Presented is a solution that not only brings together technologies that automate each process, but also automates the entire flow, so that a novice user can easily run a successful campaign from the desktop. This paper presents the technologies, structure, and process flows used to bring this system together, and highlights how the complexity of running a targeted campaign is hidden from the user while still providing the benefits of a professionally managed campaign.
Archive | 1997
Gary L. Vondran; James R. Nottingham; Douglas Heins
Archive | 1997
Scott C. Clouthier; Douglas Heins; Brian E. Hoffmann; James R. Nottingham; Gary L. Vondran