Jorge Albericio
University of Toronto
Publications
Featured research published by Jorge Albericio.
international symposium on computer architecture | 2016
Jorge Albericio; Patrick Judd; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger; Andreas Moshovos
This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual, as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently and thus to skip over the ineffectual computations. A co-designed data storage format encodes the computation-elimination decisions, taking them off the critical path while avoiding control divergence in the data-parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and keeps its data lanes busy. By loosening the ineffectual-computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that, by removing zero-valued operand multiplications alone, CNV improves performance over a state-of-the-art accelerator by 1.24× to 1.55×, and by 1.37× on average, without any loss in accuracy. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvement increases to 1.52× without any loss in accuracy under a broader ineffectual-identification policy. Further improvements are demonstrated with a loss in accuracy.
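The zero-skipping idea described in the CNV abstract can be illustrated with a minimal Python sketch (function and variable names are illustrative, not from the paper): a dot product that skips, and counts, multiplications whose activation operand is zero.

```python
def effectual_dot(activations, weights):
    """Dot product that skips multiplications with a zero activation,
    mimicking CNV's elimination of ineffectual operations (sketch only)."""
    total = 0
    skipped = 0
    for a, w in zip(activations, weights):
        if a == 0:          # ineffectual: the product cannot change the sum
            skipped += 1
            continue
        total += a * w
    return total, skipped

# With many zero activations, most multiplies are skipped:
acts = [0, 3, 0, 0, 5, 0, 2, 0]
wts  = [1, 2, 3, 4, 5, 6, 7, 8]
result, skipped = effectual_dot(acts, wts)
# result == 3*2 + 5*5 + 2*7 == 45; skipped == 5 of 8 multiplies
```

In hardware, of course, the skipping must happen without serializing the lanes, which is what CNV's hierarchical units and co-designed storage format provide.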
international symposium on microarchitecture | 2015
Joshua San Miguel; Jorge Albericio; Andreas Moshovos; Natalie D. Enright Jerger
Modern processors contain large last-level caches (LLCs) that consume substantial energy and area yet are imperative for high performance. Cache designs have improved dramatically by considering reference locality. Data values are also a source of optimization. Compression and deduplication exploit data values to use cache storage more efficiently, resulting in smaller caches without sacrificing performance. In multi-megabyte LLCs, many identical or similar values may be cached across multiple blocks simultaneously. This redundancy effectively wastes cache capacity. We observe that a large fraction of cache values exhibit approximate similarity: values across cache blocks are not identical but are similar. Coupled with approximate computing, which observes that some applications can tolerate error or inexactness, we leverage approximate similarity to design a novel LLC architecture: the Doppelganger cache. The Doppelganger cache associates the tags of multiple similar blocks with a single data-array entry to reduce the amount of data stored. Our design achieves 1.55×, 2.55× and 1.41× reductions in LLC area, dynamic energy and leakage energy without harming performance or incurring high application error.
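A loose Python sketch of the approximate-similarity idea behind the Doppelganger cache, assuming a simple quantization-based similarity map (the paper's actual map and thresholds differ; all names here are hypothetical):

```python
def approx_key(block, tolerance=16):
    # Quantize each value into buckets so that blocks with nearby values
    # tend to map to the same key (a stand-in for the paper's similarity map).
    return tuple(v // tolerance for v in block)

class DoppelgangerLikeCache:
    """Sketch: several tags can share one data-array entry for similar blocks."""
    def __init__(self):
        self.tags = {}   # block address -> data key   (tag array)
        self.data = {}   # data key -> representative block (data array)

    def insert(self, addr, block):
        key = approx_key(block)
        self.tags[addr] = key
        self.data.setdefault(key, block)   # keep only the first similar block

    def read(self, addr):
        # Reads may return the representative, i.e. an approximate value.
        return self.data[self.tags[addr]]

cache = DoppelgangerLikeCache()
cache.insert(0x100, [100, 200, 300])
cache.insert(0x200, [101, 198, 302])   # similar block: shares the same entry
# Two tags are tracked, but only one data entry is stored.
```

The storage saving comes from the data array holding one entry per similarity class rather than one per block, at the cost of approximate reads.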
international symposium on microarchitecture | 2013
Jorge Albericio; Pablo Ibáñez; Víctor Viñals; José M. Llabería
Over recent years, a growing body of research has shown that a considerable portion of the shared last-level cache (SLLC) is dead, meaning that the corresponding cache lines are stored but will not receive any further hits before being replaced. Conversely, most hits observed by the SLLC come from a small subset of already reused lines. In this paper, we propose the reuse cache, a decoupled tag/data SLLC designed to store only the data of lines that have been reused. Thus, the size of the data array can be dramatically reduced. Specifically, we (i) introduce a selective data allocation policy to exploit reuse locality and keep reused data in the SLLC, (ii) tune the data allocation with a suitable replacement policy and coherence protocol, and finally, (iii) explore different ways of organizing the data/tag arrays and study the performance sensitivity to the size of the resulting structures. The ability of a reuse cache to maintain performance with decreasing sizes is investigated in the experimental part of this work, by simulating multiprogrammed and multithreaded workloads on an eight-core chip multiprocessor. As an example, we show that a reuse cache with a tag array equivalent to a conventional 4 MB cache and only a 1 MB data array performs as well as a conventional 8 MB cache, while requiring only 16.7% of the storage capacity.
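The selective data allocation policy can be sketched in a few lines of Python, assuming data is allocated only once a tag has been touched at least twice (a simplification of the paper's policy; replacement and coherence are omitted, and all names are illustrative):

```python
class ReuseCacheSketch:
    """Decoupled tag/data sketch: a line's data enters the (small) data
    array only after its tag has seen a second access, i.e. reuse."""
    def __init__(self):
        self.tags = {}   # addr -> access count (tag array, large)
        self.data = {}   # addr -> payload      (data array, small)

    def access(self, addr, fetch):
        self.tags[addr] = self.tags.get(addr, 0) + 1
        if addr in self.data:
            return self.data[addr], "hit"
        payload = fetch(addr)                 # serviced from memory on a miss
        if self.tags[addr] >= 2:              # reused: now worth caching
            self.data[addr] = payload
        return payload, "miss"

mem = lambda a: a * 10                        # stand-in for main memory
c = ReuseCacheSketch()
c.access(1, mem)          # first touch: tag only, no data allocated
c.access(1, mem)          # reuse detected: data allocated on this miss
v, status = c.access(1, mem)
# v == 10, status == "hit"; single-use lines never occupy the data array
```

Because dead, never-reused lines consume only a tag entry, the data array can shrink drastically, which is the effect the abstract quantifies.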
international symposium on microarchitecture | 2017
Jorge Albericio; Alberto Delmas; Patrick Judd; Sayeh Sharify; Gerard O'Leary; Roman Genov; Andreas Moshovos
Deep Neural Networks expose a high degree of parallelism, making them amenable to highly data parallel architectures. However, data-parallel architectures often accept inefficiency in individual computations for the sake of overall efficiency. We show that on average, activation values of convolutional layers during inference in modern Deep Convolutional Neural Networks (CNNs) contain 92% zero bits. Processing these zero bits entails ineffectual computations that could be skipped. We propose Pragmatic (PRA), a massively data-parallel architecture that eliminates most of the ineffectual computations on-the-fly, improving performance and energy efficiency compared to state-of-the-art high-performance accelerators [5]. The idea behind PRA is deceptively simple: use serial-parallel shift-and-add multiplication while skipping the zero bits of the serial input. However, a straightforward implementation based on shift-and-add multiplication yields unacceptable area, power and memory access overheads compared to a conventional bit-parallel design. PRA incorporates a set of design decisions to yield a practical, area and energy efficient design. Measurements demonstrate that for convolutional layers, PRA is 4.31× faster than DaDianNao [5] (DaDN) using a 16-bit fixed-point representation. While PRA requires 1.68× more area than DaDN, the performance gains yield a 1.70× increase in energy efficiency in a 65nm technology. With 8-bit quantized activations, PRA is 2.25
IEEE Computer Architecture Letters | 2017
Patrick Judd; Jorge Albericio; Andreas Moshovos
international symposium on microarchitecture | 2015
André Seznec; Joshua San Miguel; Jorge Albericio
international symposium on microarchitecture | 2014
Jorge Albericio; Joshua San Miguel; Natalie D. Enright Jerger; Andreas Moshovos
international symposium on microarchitecture | 2016
Joshua San Miguel; Jorge Albericio; Natalie D. Enright Jerger; Aamer Jaleel
international conference on embedded computer systems architectures modeling and simulation | 2014
G. Narancic; Patrick Judd; D. Wu; Islam Atta; M. Elnacouzi; Jason Zebchuk; Jorge Albericio; N. Enright Jerger; Andreas Moshovos; K. Kutulakos; S. Gadelrab
IEEE Micro | 2018
Andreas Moshovos; Jorge Albericio; Patrick Judd; Alberto Delmas Lascorz; Sayeh Sharify; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger
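The core mechanism in the Pragmatic abstract, serial shift-and-add multiplication that skips the zero bits of one operand, can be sketched in Python (a software illustration of the arithmetic, not the hardware design; names are illustrative):

```python
def zero_bit_skipping_multiply(a, w):
    """Shift-and-add multiplication that processes only the 1 bits of `a`,
    as in Pragmatic's zero-bit skipping: one step per effectual bit."""
    total = 0
    steps = 0
    while a:
        lsb = a & -a          # isolate the lowest set bit (a power of two)
        total += w * lsb      # equivalent to shifting w by that bit's position
        a ^= lsb              # clear the bit and continue with the rest
        steps += 1            # one serial step per *one* bit only
    return total, steps

# 0b1001 has only two 1 bits, so only two shift-and-add steps are needed:
product, steps = zero_bit_skipping_multiply(0b1001, 7)
# product == 9 * 7 == 63; steps == 2 (a bit-serial design would take 4)
```

Since the abstract reports that activations average 92% zero bits, stepping only over the one bits is where the claimed speedup comes from; the paper's contribution is making this practical in parallel hardware.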