Jason Clemons
NVIDIA
Publications
Featured research published by Jason Clemons.
international symposium on microarchitecture | 2016
Minsoo Rhu; Natalia Gimelshein; Jason Clemons; Arslan Zulfiqar; Stephen W. Keckler
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers researchers' flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can be used simultaneously for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in the memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with an 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN.
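As a rough illustration of the offload/prefetch idea in the abstract above, the C++ sketch below moves one layer's feature map between GPU memory and pinned host memory using the CUDA runtime API. The `Layer` struct and the blocking synchronization are simplifications for clarity, not vDNN's implementation; the actual memory manager overlaps transfers with the forward and backward computation.

```cpp
// Minimal sketch of vDNN-style feature-map offload/prefetch using the CUDA
// runtime API. Illustrative only: the real vDNN manager is integrated with
// the framework's layer schedule and overlaps copies with computation.
#include <cuda_runtime.h>
#include <cstddef>

struct Layer {
    void*  dev_fmap  = nullptr;  // feature map in GPU memory
    void*  host_fmap = nullptr;  // pinned CPU staging buffer
    size_t bytes     = 0;
};

// Forward pass: once layer i's output has been consumed by layer i+1,
// copy it to pinned host memory and release the GPU buffer.
void offload(Layer& l, cudaStream_t copy_stream) {
    cudaMallocHost(&l.host_fmap, l.bytes);  // pinned, so the async copy can DMA
    cudaMemcpyAsync(l.host_fmap, l.dev_fmap, l.bytes,
                    cudaMemcpyDeviceToHost, copy_stream);
    cudaStreamSynchronize(copy_stream);     // simplification: vDNN overlaps this wait
    cudaFree(l.dev_fmap);
    l.dev_fmap = nullptr;
}

// Backward pass: re-materialize the feature map before the layer's gradient
// computation, so the copy can overlap backprop through later layers.
void prefetch(Layer& l, cudaStream_t copy_stream) {
    cudaMalloc(&l.dev_fmap, l.bytes);
    cudaMemcpyAsync(l.dev_fmap, l.host_fmap, l.bytes,
                    cudaMemcpyHostToDevice, copy_stream);
}
```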
international symposium on microarchitecture | 2016
Jason Clemons; Chih-Chi Cheng; Iuri Frosio; Daniel R. Johnson; Stephen W. Keckler
From self-driving cars to high dynamic range (HDR) imaging, the demand for image-based applications is growing quickly. In mobile systems, these applications place particular strain on performance and energy efficiency. Because traditional memory systems are optimized for 1D memory access, they are unable to efficiently exploit the multi-dimensional locality characteristics of image-based applications, which often operate on sub-regions of 2D and 3D image data. We have developed a new Patch Memory System (PMEM) tailored to application domains that process 2D and 3D data streams. PMEM supports efficient multi-dimensional addressing, automatic handling of image boundaries, and efficient caching and prefetching of image data. In addition to an optimized cache, PMEM includes hardware for offloading structured address calculations from processing units. We improve average energy-delay by 26% compared to EVA, a memory system for computer vision applications. Compared to a traditional cache, our results show that PMEM can reduce processor energy by 34% for a selection of computer vision and image processing applications, leading to system performance improvements of up to 32% and energy-delay product improvements of 48-86% on the applications in this study.
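The structured address arithmetic that PMEM offloads can be pictured with a small software model: a clamp-to-edge 2D access plus a patch-shaped read pattern. The names below (`Image2D`, `sum3x3`) are hypothetical illustrations of the access pattern, not the PMEM hardware interface.

```cpp
// Software model of the structured 2D address calculation a patch memory
// system can offload: boundary clamping plus row-major linearization.
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct Image2D {
    const uint8_t* data;
    int width, height;

    // Clamp out-of-range coordinates to the border, then linearize.
    // In PMEM-style hardware this arithmetic runs on dedicated address
    // generators instead of the processing unit's ALUs.
    uint8_t at(int x, int y) const {
        x = std::clamp(x, 0, width  - 1);
        y = std::clamp(y, 0, height - 1);
        return data[static_cast<size_t>(y) * width + x];
    }
};

// A 3x3 patch read: each access is a (dx, dy) offset from the patch anchor,
// the multi-dimensional locality pattern PMEM caches and prefetches.
int sum3x3(const Image2D& img, int cx, int cy) {
    int s = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            s += img.at(cx + dx, cy + dy);
    return s;
}
```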
design automation conference | 2016
Injoon Hong; Iuri Frosio; Jason Clemons; Brucek Khailany; Rangharajan Venkatesan; Stephen W. Keckler
Superpixel generation is a common preprocessing step in vision processing aimed at dividing an image into non-overlapping regions. Simple Linear Iterative Clustering (SLIC) is a commonly used superpixel algorithm that offers a good balance between performance and accuracy. However, the algorithm's high computational and memory-bandwidth requirements result in performance and energy efficiency that fall short of the requirements of real-time embedded applications. In this work, we explore the design of an energy-efficient superpixel accelerator for real-time computer vision applications. We propose a novel algorithm, Subsampled SLIC (S-SLIC), that uses pixel subsampling to reduce the memory bandwidth by 1.8x. We integrate S-SLIC into an energy-efficient superpixel accelerator and perform an in-depth design space exploration to optimize the design. We completed a detailed design in a 16nm FinFET technology using commercially available EDA tools for high-level synthesis to map the design automatically from a C-based representation to a gate-level implementation. The proposed S-SLIC accelerator achieves real-time performance (30 frames per second) with 250x better energy efficiency than an optimized SLIC software implementation running on a mobile GPU.
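For intuition, the sketch below pairs the standard SLIC distance metric with an assignment loop that visits every other pixel in each dimension; the stride-2 scheme and all names are assumptions chosen to illustrate how subsampling cuts the inner loop's memory traffic, not the paper's exact S-SLIC algorithm.

```cpp
// SLIC distance metric plus a subsampled assignment step, as an assumed
// illustration of S-SLIC-style pixel subsampling (names are not the paper's).
#include <cmath>
#include <vector>

struct Pixel  { float l, a, b; };              // CIELAB color
struct Center { float l, a, b, x, y; };

// Combined color + spatial distance used by SLIC; grid interval S and
// compactness m weight the two terms against each other.
float slic_distance(const Pixel& p, float px, float py,
                    const Center& c, float S, float m) {
    float dl = p.l - c.l, da = p.a - c.a, db = p.b - c.b;
    float dc = std::sqrt(dl * dl + da * da + db * db);
    float dx = px - c.x, dy = py - c.y;
    float ds = std::sqrt(dx * dx + dy * dy);
    return std::sqrt(dc * dc + (ds / S) * (ds / S) * (m * m));
}

// Assignment step over the 2S x 2S window around center k, visiting only
// every other pixel in x and y; skipped pixels can later be labeled from
// sampled neighbors. The stride is what reduces memory bandwidth.
void assign_subsampled(const std::vector<Pixel>& img, int width,
                       std::vector<int>& label, std::vector<float>& best,
                       const Center& c, int k, float S, float m,
                       int x0, int y0, int x1, int y1) {
    for (int y = y0; y < y1; y += 2)
        for (int x = x0; x < x1; x += 2) {
            int i = y * width + x;
            float d = slic_distance(img[i], float(x), float(y), c, S, m);
            if (d < best[i]) { best[i] = d; label[i] = k; }
        }
}
```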
design automation conference | 2018
Brucek Khailany; Evgeni Krimer; Rangharajan Venkatesan; Jason Clemons; Joel S. Emer; Matthew Fojtik; Alicia Klinefelter; Michael Pellauer; Nathaniel Ross Pinckney; Yakun Sophia Shao; Shreesha Srinath; Christopher Torng; Sam Likun Xi; Yanqing Zhang; Brian Zimmer
A high-productivity digital VLSI flow for designing complex SoCs is presented. The flow includes high-level synthesis tools, an object-oriented library of synthesizable SystemC and C++ components, and a modular VLSI physical design approach based on fine-grained globally asynchronous locally synchronous (GALS) clocking. The flow was demonstrated on a 16nm FinFET testchip targeting machine learning and computer vision.
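For a flavor of the kind of synthesizable SystemC component such a flow consumes, here is a minimal clocked accumulator written in an HLS-friendly style; it is a generic example, not a component from the paper's library.

```cpp
// Minimal synthesizable-style SystemC module: a clocked accumulator.
// Generic HLS-oriented example; not taken from the paper's component library.
#include <systemc.h>

SC_MODULE(Accumulator) {
    sc_in<bool>         clk, rst;
    sc_in<sc_uint<16>>  din;
    sc_out<sc_uint<32>> sum;

    void run() {
        sc_uint<32> acc = 0;
        sum.write(0);
        wait();                    // reset behavior: clear the accumulator
        while (true) {
            acc += din.read();     // one add per clock; HLS maps this to a
            sum.write(acc);        // registered datapath
            wait();
        }
    }

    SC_CTOR(Accumulator) {
        SC_CTHREAD(run, clk.pos());
        reset_signal_is(rst, true);
    }
};
```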
international conference on learning representations | 2017
Mohammad Babaeizadeh; Iuri Frosio; Stephen Tyree; Jason Clemons; Jan Kautz
arXiv: Learning | 2018
Maohua Zhu; Jason Clemons; Jeff Pool; Minsoo Rhu; Stephen W. Keckler; Yuan Xie
IEEE Micro | 2018
Hsien-Hsin Sean Lee; Jason Clemons