Publication


Featured research published by Burton J. Smith.


ieee international conference on high performance computing data and analytics | 2008

High performance discrete Fourier transforms on graphics processors

Naga K. Govindaraju; Brandon Lloyd; Burton J. Smith; John L. Manferdelli

We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed-radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed-radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4x over CUFFT and an 8-40x improvement over MKL for large sizes.
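As a rough illustration of the non-power-of-two path the abstract mentions, the sketch below implements Bluestein's algorithm in plain NumPy: the length-N DFT is re-expressed as a convolution and evaluated with power-of-two FFTs. This is only a CPU-side sketch of the textbook algorithm, not the authors' CUDA implementation; the chirp-exponent reduction modulo 2N is an assumption standing in for the paper's modular-arithmetic accuracy trick.

```python
import numpy as np

def bluestein_dft(x):
    """Length-n DFT via Bluestein's algorithm: turn the DFT into a linear
    convolution, then evaluate it with power-of-two FFTs."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    k = np.arange(n)
    # Chirp exp(-i*pi*k^2/n); reducing k^2 mod 2n keeps the argument small.
    chirp = np.exp(-1j * np.pi * ((k * k) % (2 * n)) / n)
    a = x * chirp
    # Convolution length: a power of two at least 2n - 1.
    m = 1 << (2 * n - 1).bit_length()
    b = np.zeros(m, dtype=complex)
    b[:n] = np.conj(chirp)
    b[m - n + 1:] = np.conj(chirp[1:])[::-1]   # mirror for negative lags
    conv = np.fft.ifft(np.fft.fft(a, m) * np.fft.fft(b))[:n]
    return chirp * conv

# Check against NumPy's FFT for a non-power-of-two size.
x = np.random.rand(13) + 1j * np.random.rand(13)
print(np.allclose(bluestein_dft(x), np.fft.fft(x)))   # True
```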


international conference on supercomputing | 1992

Exploiting heterogeneous parallelism on a multithreaded multiprocessor

Gail A. Alverson; Robert Alverson; David Callahan; Brian D. Koblenz; Allan Porterfield; Burton J. Smith

This paper describes an integrated architecture, compiler, runtime, and operating system solution to exploiting heterogeneous parallelism. The architecture is a pipelined multi-threaded multiprocessor, enabling the execution of very fine (multiple operations within an instruction) to very coarse (multiple jobs) parallel activities. The compiler and runtime focus on managing parallelism within a job, while the operating system focuses on managing parallelism across jobs. By considering the entire system in the design, we were able to smoothly interface its four components. While each component is primarily responsible for managing its own level of parallel activity, feedback mechanisms between components enable resource allocation and usage to be dynamically updated. This dynamic adaptation to changing requirements and available resources fosters both high utilization of the machine and the efficient expression and execution of parallelism.
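The feedback loop between runtime and operating system that the abstract describes can be pictured with a toy allocator: each job's runtime reports how much parallelism it can currently use, and the OS redistributes processors to match. Everything here, the function name, the numbers, and the round-based policy, is an illustrative assumption, not the paper's mechanism.

```python
def redistribute(demands, total_procs):
    """Toy OS-side reallocation: hand out processors in rounds, never giving a
    job more than the parallelism its runtime reported."""
    alloc = {name: 0 for name in demands}
    remaining, wants = total_procs, dict(demands)
    while remaining and wants:
        share = max(1, remaining // len(wants))
        for name in list(wants):
            give = min(share, wants[name], remaining)
            alloc[name] += give
            remaining -= give
            wants[name] -= give
            if wants[name] == 0:
                del wants[name]
            if remaining == 0:
                break
    return alloc

# Two feedback rounds: job A's usable parallelism collapses, job B's grows,
# and the allocation follows.
print(redistribute({"A": 48, "B": 16}, total_procs=64))  # {'A': 48, 'B': 16}
print(redistribute({"A": 4,  "B": 60}, total_procs=64))  # {'A': 4, 'B': 60}
```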


conference on high performance computing (supercomputing) | 2007

Efficient gather and scatter operations on graphics processors

Bingsheng He; Naga K. Govindaraju; Qiong Luo; Burton J. Smith

Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs). With superior computing power and high memory bandwidth, GPUs have become a commodity multiprocessor platform for general-purpose high-performance computing. However, due to the random access nature of gather and scatter, a naive implementation of the two operations suffers from a low utilization of the memory bandwidth and consequently a long, unhidden memory latency. Additionally, the architectural details of the GPUs, in particular, the memory hierarchy design, are unclear to the programmers. Therefore, we design multi-pass gather and scatter operations to improve their data access locality, and develop a performance model to help understand and optimize these two operations. We have evaluated our algorithms in sorting, hashing, and sparse matrix-vector multiplication in comparison with their optimized CPU counterparts. Our results show that these optimizations yield a 2-4X improvement on the GPU bandwidth utilization and a 30-50% improvement on the response time. Overall, our optimized GPU implementations are 2-7X faster than their optimized CPU counterparts.
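The multi-pass idea in the abstract, trading extra passes over the input for better locality of the writes, can be sketched on the CPU with NumPy. Partitioning the output range into contiguous regions and the pass count below are illustrative assumptions, not the paper's GPU kernels or its performance model.

```python
import numpy as np

def multipass_scatter(values, dest_idx, out_size, num_passes=4):
    """Scatter in several passes, each pass writing only the elements whose
    destination falls in one contiguous region of the output, so writes stay
    localized instead of striding across the whole array."""
    out = np.zeros(out_size, dtype=values.dtype)
    bounds = np.linspace(0, out_size, num_passes + 1).astype(int)
    for p in range(num_passes):
        lo, hi = bounds[p], bounds[p + 1]
        mask = (dest_idx >= lo) & (dest_idx < hi)   # elements bound for this region
        out[dest_idx[mask]] = values[mask]
    return out

# Check against a single-pass scatter.
vals = np.arange(1000, dtype=np.float32)
idx = np.random.permutation(1000)
single = np.zeros(1000, dtype=np.float32)
single[idx] = vals
print(np.array_equal(multipass_scatter(vals, idx, 1000), single))   # True
```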


international symposium on microarchitecture | 2007

Transactional Memory: An Overview

Tim Harris; Adrián Cristal; Osman S. Unsal; Eduard Ayguadé; Fabrizio Gagliardi; Burton J. Smith; Mateo Valero

Writing applications that benefit from the massive computational power of future multicore chip multiprocessors will not be an easy task for mainstream programmers accustomed to sequential algorithms rather than parallel ones. This article presents a survey of transactional memory, a mechanism that promises to enable scalable performance while freeing programmers from some of the burden of modifying their parallel code.
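A minimal sketch of what the surveyed mechanism looks like to the programmer, assuming a simple software-transactional-memory design with buffered writes, version-number validation, and retry on conflict. The class and function names are invented for illustration; real hardware and software systems differ considerably.

```python
import threading

class TVar:
    """A transactional variable: a value plus a version number."""
    def __init__(self, value):
        self.value, self.version = value, 0

_commit_lock = threading.Lock()

def atomically(txn):
    """Run txn(read, write) optimistically: buffer writes, validate at commit
    that nothing read has changed, and retry the whole transaction on conflict."""
    while True:
        read_set, write_set = {}, {}

        def read(tv):
            if tv in write_set:
                return write_set[tv]
            if tv not in read_set:
                with _commit_lock:                  # consistent value/version snapshot
                    read_set[tv] = (tv.value, tv.version)
            return read_set[tv][0]

        def write(tv, val):
            write_set[tv] = val

        result = txn(read, write)
        with _commit_lock:
            if all(tv.version == ver for tv, (_, ver) in read_set.items()):
                for tv, val in write_set.items():
                    tv.value, tv.version = val, tv.version + 1
                return result
        # Another transaction committed in the meantime; retry.

# Usage: an atomic transfer with no explicit locking in application code.
a, b = TVar(100), TVar(0)
atomically(lambda read, write: (write(a, read(a) - 10), write(b, read(b) + 10)))
print(a.value, b.value)   # 90 10
```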


OMICS: A Journal of Integrative Biology | 2011

Policy and data-intensive scientific discovery in the beginning of the 21st century.

Arnold L. Smith; Magdalena Balazinska; Chaitan Baru; Mark Gomelsky; Michael McLennan; Lynn Rose; Burton J. Smith; Elizabeth Stewart; Eugene Kolker

Recent developments in our ability to capture, curate, and analyze data, the field of data-intensive science (DIS), have indeed made these interesting and challenging times for scientific practice as well as policy making in real time. We are confronted with immense datasets that challenge our ability to pool, transfer, analyze, or interpret scientific observations. We have more data available than ever before, yet more questions to be answered as well, and no clear path to answer them. We are excited by the potential for science-based solutions to humankind's problems, yet stymied by the limitations of our current cyberinfrastructure and existing public policies. Importantly, DIS signals a transformation of the hypothesis-driven tradition of science (first hypothesize, then experiment) to one typified by a "first experiment, then hypothesize" mode of discovery. Another hallmark of DIS is that it amasses data that are public goods (i.e., creates a commons) that can further be creatively mined for various applications in different sectors. As such, this calls for a science policy vision that is long term. We herein reflect on how best to approach policy making at this critical inflection point, when DIS applications are being diversified in agriculture, ecology, marine biology, and environmental research internationally. This article outlines the key policy issues and gaps that emerged from the multidisciplinary discussions at the NSF-funded DIS workshop held at the Seattle Children's Research Institute in Seattle on September 19-20, 2010.


symposium on cloud computing | 2017

Selecting the best VM across multiple public clouds: a data-driven performance modeling approach

Neeraja J. Yadwadkar; Bharath Hariharan; Joseph E. Gonzalez; Burton J. Smith; Randy H. Katz

Users of cloud services are presented with a bewildering choice of VM types and the choice of VM can have significant implications on performance and cost. In this paper we address the fundamental problem of accurately and economically choosing the best VM for a given workload and user goals. To address the problem of optimal VM selection, we present PARIS, a data-driven system that uses a novel hybrid offline and online data collection and modeling framework to provide accurate performance estimates with minimal data collection. PARIS is able to predict workload performance for different user-specified metrics, and resulting costs for a wide range of VM types and workloads across multiple cloud providers. When compared to sophisticated baselines, including collaborative filtering and a linear interpolation model using measured workload performance on two VM types, PARIS produces significantly better estimates of performance. For instance, it reduces runtime prediction error by a factor of 4 for some workloads on both AWS and Azure. The increased accuracy translates into a 45% reduction in user cost while maintaining performance.
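The data-driven core of the approach, fingerprinting a workload on a small set of reference VMs offline and then predicting its performance on other VM types, can be sketched as a regression problem. Everything below is an assumption for illustration (synthetic data, a random-forest model, a two-VM fingerprint); it is not the PARIS system or its dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic training set: each row is one (workload, target VM) pair.
fingerprint = rng.uniform(1, 100, size=(n, 2))   # workload runtime on 2 reference VMs
vm_spec = rng.uniform(1, 64, size=(n, 3))        # target VM: vCPUs, memory GB, disk MB/s
runtime = 5 + fingerprint.mean(axis=1) * 8 / vm_spec[:, 0] + rng.normal(0, 0.5, n)

X = np.hstack([fingerprint, vm_spec])
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:400], runtime[:400])                # offline modeling phase

pred = model.predict(X[400:])                    # online: cheap fingerprint -> prediction
rel_err = np.abs(pred - runtime[400:]) / runtime[400:]
print(f"median relative runtime-prediction error: {np.median(rel_err):.1%}")
```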


international parallel and distributed processing symposium | 2007

TCPP Presentation and Invited Speech: Reinventing Computing

Burton J. Smith

The many-core inflection point presents a new challenge for our industry, namely general-purpose parallel computing. Unless this challenge is met, the continued growth and importance of computing itself and of the businesses engaged in it are at risk. We must make parallel programming easier and more generally applicable than it is now, and build hardware and software that will execute arbitrary parallel programs on whatever scale of system the user has. The changes needed to accomplish this are significant and affect computer architecture, the entire software development tool chain, and the army of application developers that will rely on those tools to develop parallel applications. This talk will point out a few of the hard problems that face us and some prospects for addressing them.


international parallel and distributed processing symposium | 2011

IPDPS 2011 Wednesday 25th Year Panel: What's ahead?

Per Stenström; Doug Burger; Wen-mei W. Hwu; Vipin Kumar; Kunle Olukotun; David A. Padua; Burton J. Smith

Summary form only given, as follows. Parallel computing has become ubiquitous, spanning challenging computational problems from science through business-driven computing to mobile computing; its scope has widened dramatically over the last decade. This panel will debate and speculate on how the parallel computing landscape is expected to change in the years to come. Areas of focus will include: (1) Computing platforms: How will we be able to maintain the performance growth of the past, and what will be the major challenges in the next 10 years and beyond? What technical barriers are anticipated, and what disruptive technologies are around the corner? (2) Software: How will software infrastructures evolve to meet performance requirements in the next 10 years and beyond? How will we ever be able to hide parallelism obstacles from the masses, and what is the road forward towards that? (3) Algorithms: What will be the major computational problems to tackle in the next 10 years and beyond? What are the most challenging algorithmic problems to solve? (4) Applications: What will be the next wave of grand challenge problems to focus on in the next 10 years and beyond? What will be the major performance-driving applications in the general and mobile computing domains? A record of the panel discussion was not made available for publication as part of the conference proceedings.


international parallel and distributed processing symposium | 2010

Operating system resource management

Burton J. Smith

Resource management is the dynamic allocation and de-allocation by an operating system of processor cores, memory pages, and various types of bandwidth to computations that compete for those resources. The objective is to allocate resources so as to optimize responsiveness subject to the finite resources available. Historically, resource management solutions have been relatively unsystematic, and now the very assumptions underlying the traditional strategies fail to hold. First, applications increasingly differ in their ability to exploit resources, especially processor cores. Second, application responsiveness is approximately two-valued for Quality-Of-Service (QOS) applications, depending on whether deadlines are met. Third, power and battery energy have become constrained. This talk will propose a scheme for addressing the operating system resource management problem.
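To make the problem statement concrete, here is a toy core allocator along the lines the abstract sketches: QoS jobs, whose responsiveness is effectively two-valued, first get just enough cores to meet their deadlines, and leftover cores go to jobs in proportion to how well they can exploit them. The job model, numbers, and policy are illustrative assumptions, not the scheme proposed in the talk.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    work: float                   # total work, in core-seconds
    deadline: Optional[float]     # seconds; None for best-effort jobs
    scalability: float            # how well the job exploits additional cores (0..1)

def allocate_cores(jobs, total_cores):
    alloc = {j.name: 0 for j in jobs}
    free = total_cores
    # 1. QoS jobs: responsiveness is two-valued, so give just enough to meet the deadline.
    for j in jobs:
        if j.deadline is not None:
            need = min(math.ceil(j.work / j.deadline), free)
            alloc[j.name] = need
            free -= need
    # 2. Best-effort jobs: split the remainder by scalability (rounding may idle a core or two).
    best_effort = [j for j in jobs if j.deadline is None]
    total_s = sum(j.scalability for j in best_effort)
    for j in best_effort:
        alloc[j.name] += int(free * j.scalability / total_s) if total_s else 0
    return alloc

jobs = [Job("video", work=8, deadline=2, scalability=1.0),
        Job("batch", work=100, deadline=None, scalability=0.9),
        Job("build", work=50, deadline=None, scalability=0.6)]
print(allocate_cores(jobs, total_cores=16))   # {'video': 4, 'batch': 7, 'build': 4}
```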


conference on high performance computing (supercomputing) | 2005

System Balance and Fast Clocks

Burton J. Smith

As clock rates have risen over the years, nearly all aspects of computer implementation from programming model (got caches? got cores?) to component technology have been forced to adapt. The falling cost of transistors has enabled some of this, but does not always help. For example, we have now reached clock rates even in CMOS where skin effect in copper-based transmission lines limits the global bandwidth of large-scale systems so strongly that optical interconnect looks like the only way to retain balance.
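The skin-effect limit the abstract points to can be made concrete with the textbook skin-depth formula delta = sqrt(rho / (pi * f * mu)); the short sketch below evaluates it for copper. The constants are standard handbook values, not figures from the talk, and real transmission-line loss also involves dielectric and geometry effects this ignores.

```python
import math

RHO_CU = 1.68e-8        # resistivity of copper, ohm*m
MU_0 = 4e-7 * math.pi   # permeability of free space, H/m

def skin_depth_m(freq_hz):
    """Depth at which current density falls to 1/e of its surface value."""
    return math.sqrt(RHO_CU / (math.pi * freq_hz * MU_0))

# As signalling rates rise, current crowds into a thinner surface layer, so the
# effective resistance of a copper trace grows roughly as sqrt(frequency).
for f in (1e8, 1e9, 1e10):
    print(f"{f/1e9:5.1f} GHz: skin depth ~ {skin_depth_m(f)*1e6:.2f} um")
```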
