Publication


Featured research published by Bryan Catanzaro.


international parallel and distributed processing symposium | 2014

Nitro: A Framework for Adaptive Code Variant Tuning

Saurav Muralidharan; Manu Shantharam; Mary W. Hall; Michael Garland; Bryan Catanzaro

Autotuning systems intelligently navigate a search space of possible implementations of a computation to find the implementation(s) that best meet a specific optimization criterion, usually performance. This paper describes Nitro, a programmer-directed autotuning framework that facilitates tuning of code variants, or alternative implementations of the same computation. Nitro provides a library interface that permits programmers to express code variants along with meta-information that aids the system in selecting among the set of variants at run time. Machine learning is employed to build a model through training on this meta-information, so that when a new input is presented, Nitro can consult the model to select the appropriate variant. In experiments with five real-world irregular GPU benchmarks from sparse numerical methods, graph computations, and sorting, Nitro-tuned variants achieve over 93% of the performance of variants selected through exhaustive search. Further, we describe optimizations and heuristics in Nitro that substantially reduce training time and other overheads.
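
The mechanism the abstract describes can be sketched in a few lines: variants are registered together with a feature function, every variant is timed on training inputs offline, and a learned classifier picks a variant for each new input at run time. The sketch below is illustrative only; the names (VariantTuner, register_variant) and the use of scikit-learn's SVC are assumptions, not Nitro's actual C++ interface.

```python
# Illustrative sketch only -- hypothetical API, not Nitro's C++ interface.
from sklearn.svm import SVC

class VariantTuner:
    def __init__(self, feature_fn):
        self.feature_fn = feature_fn  # maps an input to a feature vector
        self.variants = []            # alternative implementations of the computation
        self.model = None

    def register_variant(self, fn):
        self.variants.append(fn)

    def train(self, training_inputs, measure):
        # Offline: time every variant on each training input and learn a
        # classifier from input features to the index of the fastest variant.
        X, y = [], []
        for inp in training_inputs:
            times = [measure(variant, inp) for variant in self.variants]
            X.append(self.feature_fn(inp))
            y.append(min(range(len(times)), key=times.__getitem__))
        self.model = SVC().fit(X, y)
        return self

    def __call__(self, inp):
        # Run time: consult the model to pick a variant for this input.
        idx = self.model.predict([self.feature_fn(inp)])[0]
        return self.variants[idx](inp)
```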


acm sigplan symposium on principles and practice of parallel programming | 2014

A decomposition for in-place matrix transposition

Bryan Catanzaro; Alexander Keller; Michael Garland

We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields median throughput of 34.3 GB/s, and a maximum throughput of 51 GB/s. Because of the simple structure of this algorithm, it is particularly suited for implementation using SIMD instructions to transpose the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using this algorithm to cooperatively perform accesses to Arrays of Structures, we measure 180 GB/s throughput on the K20c, which is up to 45 times faster than compiler-generated Array of Structures accesses. In this paper, we explain the algorithm, prove its correctness and complexity, and explain how it can be instantiated efficiently for solving various transpose problems on both CPUs and GPUs.
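
For contrast with the paper's decomposition, here is a minimal sketch of the traditional cycle-following approach the abstract refers to, for an m-by-n row-major matrix stored in a flat array. It is inherently sequential, which is exactly the limitation the paper's row/column decomposition removes; the function name and bookkeeping are illustrative, not taken from the paper.

```python
def transpose_inplace(a, m, n):
    """Cycle-following in-place transpose (the traditional baseline).

    `a` holds an m x n matrix in row-major order as a flat list and is
    rearranged into its n x m transpose.  Each cycle of the permutation
    is followed sequentially, which is why this is hard to parallelize.
    """
    size = m * n
    if size <= 1:
        return a
    visited = bytearray(size)          # bookkeeping, kept simple for clarity
    for start in range(1, size - 1):   # indices 0 and size-1 are fixed points
        if visited[start]:
            continue
        i, val = start, a[start]
        while True:
            j = (i * m) % (size - 1)   # destination index of the element at i
            a[j], val = val, a[j]
            visited[j] = 1
            i = j
            if i == start:
                break
    return a

# A 2 x 3 matrix [[0, 1, 2], [3, 4, 5]] becomes its 3 x 2 transpose.
assert transpose_inplace([0, 1, 2, 3, 4, 5], 2, 3) == [0, 3, 1, 4, 2, 5]
```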


acm sigplan symposium on principles and practice of parallel programming | 2015

A collection-oriented programming model for performance portability

Saurav Muralidharan; Michael Garland; Bryan Catanzaro; Albert Sidelnik; Mary W. Hall

This paper describes Surge, a collection-oriented programming model that enables programmers to compose parallel computations using nested high-level data collections and operators. Surge exposes a code generation interface, decoupled from the core computation, that enables programmers and autotuners to easily generate multiple implementations of the same computation on various parallel architectures such as multi-core CPUs and GPUs. By decoupling computations from architecture-specific implementations, programmers can target multiple architectures more easily and generate a search space that facilitates optimization and customization for specific architectures. We express four real-world benchmarks from domains such as sparse linear algebra and machine learning in Surge and, from the same performance-portable specification, generate OpenMP and CUDA C++ implementations. Surge generates efficient, scalable code that achieves up to a 1.32x speedup over handcrafted, well-optimized CUDA code.
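
A small sketch of the central idea, decoupling a collection-level computation from how it is executed, might look as follows in Python, with serial and multiprocessing backends standing in for the OpenMP and CUDA C++ code generators. The names and interface here are hypothetical; Surge itself is a C++ model that generates code rather than dispatching at run time.

```python
# Hypothetical sketch: the computation is specified once over nested
# collections, and a pluggable backend decides how it is executed.
from multiprocessing import Pool

def serial_backend(fn, collection):
    return [fn(item) for item in collection]

def parallel_backend(fn, collection):
    with Pool() as pool:
        return pool.map(fn, collection)

def spmv_row(args):
    # One CSR row of a sparse matrix-vector product: dot the row's
    # nonzero values with the corresponding entries of the dense vector.
    values, cols, x = args
    return sum(v * x[c] for v, c in zip(values, cols))

def spmv(csr_rows, x, backend=serial_backend):
    # The outer map over rows is expressed independently of how (or where)
    # it runs; the backend plays the role of Surge's code generators.
    return backend(spmv_row, [(vals, cols, x) for vals, cols in csr_rows])

if __name__ == "__main__":
    rows = [([1.0, 2.0], [0, 2]), ([3.0], [1])]   # CSR-style (values, columns)
    x = [1.0, 1.0, 1.0]
    print(spmv(rows, x, serial_backend))    # [3.0, 3.0]
    print(spmv(rows, x, parallel_backend))  # [3.0, 3.0]
```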


european conference on computer vision | 2018

SDC-Net: Video prediction using spatially-displaced convolution

Fitsum A. Reda; Guilin Liu; Kevin J. Shih; Robert Kirby; Jon Barker; David Tarjan; Andrew J. Tao; Bryan Catanzaro

We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows. Previous approaches rely either on resampling past frames, guided by a learned future optical flow, or on direct generation of pixels. Resampling based on flow is insufficient because it cannot deal with disocclusions, while generative models currently lead to blurry results. Recent approaches synthesize a pixel by convolving input patches with a predicted kernel; however, their memory requirement increases with kernel size. Here, we present a spatially-displaced convolution (SDC) module for video frame prediction. We learn a motion vector and a kernel for each pixel and synthesize a pixel by applying the kernel at a displaced location in the source image, defined by the predicted motion vector. Our approach inherits the merits of both vector-based and kernel-based approaches, while ameliorating their respective disadvantages. We train our model on 428K unlabelled 1080p video game frames. Our approach produces state-of-the-art results, achieving an SSIM score of 0.904 on high-definition YouTube-8M videos and 0.918 on Caltech Pedestrian videos. Our model handles large motion effectively and synthesizes crisp frames with consistent motion.
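
The SDC sampling step described above can be sketched directly: each output pixel applies its own predicted K-by-K kernel at the source location displaced by its predicted motion vector. The sketch below is a simplified, assumed formulation (integer displacements, NumPy loops, hypothetical function name); the actual model predicts the flows and kernels with a deep network and samples at sub-pixel locations.

```python
import numpy as np

def sdc_synthesize(src, flow, kernels):
    """Simplified spatially-displaced convolution sampling (illustrative).

    src     : (H, W) source frame
    flow    : (H, W, 2) per-pixel motion vectors (dy, dx); rounded to
              integers here, whereas the paper samples sub-pixel locations
    kernels : (H, W, K, K) per-pixel kernels predicted by the network
    Returns the synthesized (H, W) frame.
    """
    H, W = src.shape
    K = kernels.shape[-1]
    r = K // 2
    padded = np.pad(src, r, mode="edge")
    out = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            dy, dx = np.rint(flow[y, x]).astype(int)
            cy = min(max(y + dy, 0), H - 1)   # displaced sample location,
            cx = min(max(x + dx, 0), W - 1)   # clamped to the image
            patch = padded[cy:cy + K, cx:cx + K]      # K x K window around it
            out[y, x] = np.sum(patch * kernels[y, x])
    return out

# With zero motion and a centered delta kernel, the source frame is reproduced.
H, W, K = 4, 5, 3
src = np.arange(H * W, dtype=float).reshape(H, W)
flow = np.zeros((H, W, 2))
kernels = np.zeros((H, W, K, K))
kernels[..., K // 2, K // 2] = 1.0
assert np.allclose(sdc_synthesize(src, flow, kernels), src)
```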


international conference on machine learning | 2013

Deep learning with COTS HPC systems

Adam Coates; Brody Huval; Tao Wang; David J. Wu; Bryan Catanzaro; Andrew Y. Ng


arXiv: Computation and Language | 2014

Deep Speech: Scaling up end-to-end speech recognition

Awni Y. Hannun; Carl Case; Jared Casper; Bryan Catanzaro; Greg Diamos; Erich Elsen; Ryan Prenger; Sanjeev Satheesh; Shubho Sengupta; Adam Coates; Andrew Y. Ng


international conference on machine learning | 2016

Deep Speech 2: end-to-end speech recognition in English and Mandarin

Dario Amodei; Sundaram Ananthanarayanan; Rishita Anubhai; Jingliang Bai; Eric Battenberg; Carl Case; Jared Casper; Bryan Catanzaro; Qiang Cheng; Guoliang Chen; Jie Chen; Jingdong Chen; Zhijie Chen; Mike Chrzanowski; Adam Coates; Greg Diamos; Ke Ding; Niandong Du; Erich Elsen; Jesse Engel; Weiwei Fang; Linxi Fan; Christopher Fougner; Liang Gao; Caixia Gong; Awni Y. Hannun; Tony Han; Lappi Vaino Johannes; Bing Jiang; Cai Ju


arXiv: Neural and Evolutionary Computing | 2014

cuDNN: Efficient Primitives for Deep Learning

Sharan Chetlur; Cliff Woolley; Philippe Vandermersch; Jonathan Cohen; John Tran; Bryan Catanzaro; Evan Shelhamer


computer vision and pattern recognition | 2018

High-Resolution Image Synthesis and Semantic Manipulation With Conditional GANs

Ting-Chun Wang; Ming-Yu Liu; Jun-Yan Zhu; Andrew J. Tao; Jan Kautz; Bryan Catanzaro


international conference on learning representations | 2017

DSD: Dense-Sparse-Dense Training for Deep Neural Networks

Song Han; Jeff Pool; Sharan Narang; Huizi Mao; Enhao Gong; Shijian Tang; Erich Elsen; Peter Vajda; Manohar Paluri; John Tran; Bryan Catanzaro; William J. Dally
