Debbie Marr
Intel
Publication
Featured research published by Debbie Marr.
International Conference on Acoustics, Speech, and Signal Processing | 2017
Ganesh Venkatesh; Eriko Nurvitadhi; Debbie Marr
We explore techniques to significantly improve the compute efficiency and performance of deep convolutional networks without impacting their accuracy. To improve compute efficiency, we focus on achieving high accuracy with extremely low-precision (2-bit) weight networks, and to accelerate execution time, we aggressively skip operations on zero values. We achieve the highest reported accuracy of 76.6% Top-1/93% Top-5 on the ImageNet object classification challenge with a low-precision network while reducing the compute requirement by ∼3× compared to a full-precision network that achieves similar accuracy. Furthermore, to fully exploit the benefits of our low-precision networks, we build a deep learning accelerator core, DLAC, that can achieve up to 1 TFLOP/mm² equivalent for single-precision floating-point operations (∼2 TFLOP/mm² for half-precision), which is ∼5× better than the Linear Algebra Core [16] and ∼4× better than a previous deep learning accelerator proposal [8].
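A minimal NumPy sketch of the two ideas in this abstract, ternary (2-bit representable) weight quantization and skipping work on zero values; the threshold and function names are illustrative assumptions, not the paper's training recipe or the DLAC design.

```python
import numpy as np

def quantize_ternary(w, threshold=0.05):
    """Map full-precision weights to {-1, 0, +1} (2-bit representable).
    The fixed threshold is illustrative, not the paper's method."""
    q = np.zeros_like(w, dtype=np.int8)
    q[w > threshold] = 1
    q[w < -threshold] = -1
    return q

def dot_skip_zeros(q_weights, activations):
    """Dot product that skips work wherever the quantized weight is zero,
    mimicking the zero-skipping idea in hardware."""
    acc = 0.0
    for w, a in zip(q_weights, activations):
        if w == 0:                      # zero-valued weight: no work issued
            continue
        acc += a if w == 1 else -a      # +/-1 weights need no multiplier
    return acc

# Tiny usage example: compare against the full-precision dot product
rng = np.random.default_rng(0)
w = rng.normal(size=8).astype(np.float32)
x = rng.normal(size=8).astype(np.float32)
print(dot_skip_zeros(quantize_ternary(w), x), np.dot(w, x))
```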
Field Programmable Logic and Applications | 2016
Eriko Nurvitadhi; Jaewoong Sim; David Sheffield; Asit K. Mishra; Srivatsan Krishnan; Debbie Marr
Recurrent neural networks (RNNs) provide state-of-the-art accuracy for analytics on sequential datasets (e.g., language modeling). This paper studies a state-of-the-art RNN variant, the Gated Recurrent Unit (GRU). We first propose a memoization optimization that avoids 3 of the 6 dense matrix-vector multiplications (SGEMVs) that account for the majority of the computation in the GRU. Then, we study opportunities to accelerate the remaining SGEMVs using FPGAs, in comparison to a 14-nm ASIC, a GPU, and a multi-core CPU. Results show that the FPGA provides superior performance/Watt over the CPU and GPU because the FPGA's on-chip BRAMs, hard DSPs, and reconfigurable fabric allow fine-grained parallelism to be extracted efficiently from the small/medium-size matrices used by the GRU. Moreover, newer FPGAs with more DSPs, more on-chip BRAMs, and higher frequency have the potential to narrow the FPGA-ASIC efficiency gap.
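For reference, a single GRU step written so that the six dense matrix-vector products (SGEMVs) mentioned above are explicit. This is the textbook GRU formulation, not the paper's memoization scheme, and the matrix sizes are illustrative only.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step; each '@' below is one of the six SGEMVs."""
    z = sigmoid(Wz @ x + Uz @ h)              # SGEMVs 1 and 2: update gate
    r = sigmoid(Wr @ x + Ur @ h)              # SGEMVs 3 and 4: reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # SGEMVs 5 and 6: candidate state
    return (1.0 - z) * h + z * h_cand

# Usage with hidden size 4 and input size 3 (illustrative)
rng = np.random.default_rng(1)
Wz, Wr, Wh = (rng.normal(size=(4, 3)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(4, 4)) for _ in range(3))
h_next = gru_step(rng.normal(size=3), np.zeros(4), Wz, Uz, Wr, Ur, Wh, Uh)
```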
Field-Programmable Technology | 2016
Eriko Nurvitadhi; David Sheffield; Jaewoong Sim; Asit K. Mishra; Ganesh Venkatesh; Debbie Marr
Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-the-art accuracy. Binarized neural networks (BNNs) are a recently proposed, optimized variant of DNNs. BNNs constrain network weights and/or neuron values to either +1 or −1, which can be represented in a single bit. This leads to dramatic improvements in algorithmic efficiency, due to reduced memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration. We first propose a BNN hardware accelerator design. Then, we implement the proposed accelerator on an Arria 10 FPGA as well as a 14-nm ASIC, and compare them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU. Our evaluation shows that the FPGA provides superior efficiency over the CPU and GPU. Even though the CPU and GPU offer high peak theoretical performance, they are not as efficiently utilized, since BNNs rely on binarized bit-level operations that are better suited for custom hardware. Finally, even though the ASIC is still more efficient, the FPGA can provide orders-of-magnitude efficiency improvements over software, without locking into a fixed ASIC solution.
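A minimal sketch of the bit-level arithmetic BNNs enable: the dot product of two ±1 vectors computed with XNOR and popcount instead of multiplies and adds. This is the generic trick the abstract alludes to, not the paper's accelerator design; the bit-packing convention is an assumption.

```python
def binarize_to_bits(values):
    """Pack a list of +1/-1 values into an integer bitmask (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +/-1 vectors via XNOR and popcount:
    matches = popcount(~(a ^ b)) over n bits; dot = 2*matches - n."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")
    return 2 * matches - n

# Usage: agrees with the straightforward +/-1 dot product
a = [+1, -1, -1, +1]
b = [+1, +1, -1, -1]
assert binary_dot(binarize_to_bits(a), binarize_to_bits(b), 4) == sum(x * y for x, y in zip(a, b))
```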
Compilers, Architecture, and Synthesis for Embedded Systems | 2015
Eriko Nurvitadhi; Asit K. Mishra; Debbie Marr
Sparse matrix-vector multiplication (SpMV) is a linear algebra construct commonly found in machine learning (ML) algorithms, such as the support vector machine (SVM). We profiled a popular SVM software package (libSVM) on an energy-efficient microserver and a high-performance server for real-world ML datasets, and observed that SpMV dominates runtime. We propose a novel SpMV algorithm tailored for ML and a hardware accelerator architecture based on this algorithm. Our evaluations show that the proposed algorithm and hardware accelerator achieve significant efficiency improvements over the conventional SpMV algorithm used in libSVM.
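For context, a plain CSR sparse matrix-vector multiply, the conventional baseline the abstract refers to; the paper's ML-tailored SpMV algorithm is a variant of this and is not reproduced here.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Standard CSR sparse matrix-vector multiply, y = A @ x.
    values/col_idx hold the nonzeros; row_ptr[i]:row_ptr[i+1] spans row i."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example:  [[1, 0, 2],
#                [0, 3, 0],
#                [4, 0, 5]]
values, col_idx, row_ptr = [1, 2, 3, 4, 5], [0, 2, 1, 0, 2], [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```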
Design, Automation, and Test in Europe | 2016
Eriko Nurvitadhi; Asit K. Mishra; Yu Wang; Ganesh Venkatesh; Debbie Marr
The rapid growth of the Internet has led to web applications that produce large, unstructured, sparse datasets (e.g., texts, ratings). Machine learning (ML) algorithms are the basis for many important analytics workloads that extract knowledge from these datasets. This paper characterizes such workloads on a high-end server for real-world datasets and shows that a set of sparse matrix operations dominates runtime. Further, these operations run inefficiently due to low compute-per-byte ratios and challenging thread-scaling behavior. We therefore propose a hardware accelerator to perform these operations with extreme efficiency. Simulations and RTL synthesis targeting a 14-nm ASIC demonstrate significant performance and performance/Watt improvements over conventional processors, with only a small area overhead.
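A back-of-the-envelope illustration of the "low compute-per-byte" point for a CSR-style sparse kernel; the byte counts assume 8-byte values and 4-byte column indices and ignore vector traffic, so the exact figure is an assumption rather than a number from the paper.

```python
# Arithmetic intensity of CSR SpMV under illustrative assumptions.
flops_per_nonzero = 2                      # one multiply + one add
bytes_per_nonzero = 8 + 4                  # matrix value + column index
intensity = flops_per_nonzero / bytes_per_nonzero
print(f"~{intensity:.2f} FLOPs per byte")  # ~0.17 -> heavily memory-bound
```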
IEEE Computer Architecture Letters | 2016
Milad Hashemi; Debbie Marr; Doug Carmean; Yale N. Patt
The performance of user-facing applications is critical to client platforms. Many of these applications are event-driven and exhibit “bursty” behavior: the application is generally idle but generates bursts of activity in response to human interaction. We study one example of a bursty application, the web browser, and derive two important insights: (1) activity bursts contain false parallelism, bringing many cores out of a deep sleep to inefficiently render a single webpage, and (2) these bursts are highly compute driven, and thus scale nearly linearly with frequency. We show average performance gains/energy reductions of 14%/17%, respectively, on real hardware by statically moving threads from multiple cores to a single core. We then propose dynamic, hardware-driven thread migration and scheduling enhancements that detect these bursts, leading to further benefits.
Field Programmable Gate Arrays | 2018
Duncan J. M. Moss; Srivatsan Krishnan; Eriko Nurvitadhi; Piotr Ratuszniak; Chris N. Johnson; Jaewoong Sim; Asit K. Mishra; Debbie Marr; Suchit Subhaschandra; Philip Heng Wai Leong
General matrix-matrix multiplication (GEMM) is the cornerstone of a wide gamut of applications in high-performance computing (HPC), scientific computing (SC) and, more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that supports both traditional single-precision floating-point and reduced-precision workloads. Our framework supports arbitrary-size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software, and (2) a highly customizable hardware template. The API provides both compile-time and runtime options for controlling key aspects of the hardware template, including dynamic precision switching; interleaving and block-size control; and fused deep-learning-specific operations. The framework currently supports single-precision floating point (FP32); 16-, 8-, 4-, and 2-bit integer and fixed point (INT16, INT8, INT4, INT2); and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, and BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet, and ResNet), we illustrate that reduced-precision representations such as binary achieve the best performance, and that HARPv2 enables fine-grained partitioning of computations over both the Xeon and the FPGA. We observe up to a 50× improvement in execution time compared to single-precision floating point, and find that runtime configuration options can improve the efficiency of certain layers in AlexNet by up to 4×, achieving an overall 1.3× improvement over the entire network.
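A minimal sketch of the reduced-precision GEMM idea the abstract describes: quantize FP32 operands to INT8, accumulate in INT32, then rescale. This is not the HARPv2 framework's API or quantization scheme; the symmetric per-tensor scaling and the function name are assumptions for illustration.

```python
import numpy as np

def gemm_int8(A_fp32, B_fp32, scale_a=127.0, scale_b=127.0):
    """Illustrative reduced-precision GEMM: INT8 inputs, INT32 accumulation,
    rescaled back to FP32. Per-tensor symmetric scaling is an assumption."""
    A_q = np.clip(np.round(A_fp32 * scale_a), -127, 127).astype(np.int8)
    B_q = np.clip(np.round(B_fp32 * scale_b), -127, 127).astype(np.int8)
    C_i32 = A_q.astype(np.int32) @ B_q.astype(np.int32)   # INT32 accumulation
    return C_i32.astype(np.float32) / (scale_a * scale_b)

# Usage: error versus the FP32 reference stays small for well-scaled inputs
rng = np.random.default_rng(2)
A, B = rng.uniform(-1, 1, (4, 8)), rng.uniform(-1, 1, (8, 3))
print(np.max(np.abs(gemm_int8(A, B) - A @ B)))
```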
Field Programmable Logic and Applications | 2017
Duncan J. M. Moss; Eriko Nurvitadhi; Jaewoong Sim; Asit K. Mishra; Debbie Marr; Suchit Subhaschandra; Philip Heng Wai Leong
Convolutional neural networks (CNNs) are deployed in a wide range of image recognition, scene segmentation and object detection applications. Achieving state-of-the-art accuracy in CNNs often results in large models and complex topologies that require significant compute resources to complete in a timely manner. Binarised neural networks (BNNs) have been proposed as an optimised variant of CNNs, which constrain the weights and activations to +1 or −1 and thus offer compact models and lower computational complexity per operation. This paper presents a high-performance BNN accelerator on the Intel® Xeon+FPGA™ platform. The proposed accelerator is designed to take advantage of the Xeon+FPGA system such that a specialised FPGA architecture can be targeted at the most compute-intensive parts of the BNN whilst other parts of the topology are handled by the Xeon CPU. The implementation is evaluated by comparing the raw compute performance and energy efficiency of key layers in standard CNN topologies against an Nvidia Titan X Pascal GPU and other published FPGA BNN accelerators. The results show that our single-package integrated Arria™ 10 FPGA accelerator, coupled with a high-end Xeon CPU, can offer comparable performance and better energy efficiency than a high-end discrete Titan X GPU card. In addition, our solution delivers the best performance compared to previous BNN FPGA implementations.
Asia and South Pacific Design Automation Conference | 2017
Asit K. Mishra; Eriko Nurvitadhi; Ganesh Venkatesh; Jonathan Pearce; Debbie Marr
Text analytics applications using machine learning techniques have grown in importance with the ever-increasing amount of data generated by web-scale applications, social media, and digital repositories. Apart from being large, these data are often unstructured and highly sparse. The performance of these applications on current systems is hampered by hard-to-predict branches and a low compute-per-byte ratio. This paper proposes a set of fine-grained accelerators that improve the performance and energy envelope of these applications by an order of magnitude.
Field Programmable Gate Arrays | 2017
Eriko Nurvitadhi; Ganesh Venkatesh; Jaewoong Sim; Debbie Marr; Randy Renfu Huang; Jason Ong Gee Hock; Yeong Tat Liew; Krishnan Srivatsan; Duncan Moss; Suchit Subhaschandra; Guy Boudoukh