Natalie D. Enright Jerger
University of Toronto
Publications
Featured research published by Natalie D. Enright Jerger.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2009
Radu Marculescu; Umit Y. Ogras; Li-Shiuan Peh; Natalie D. Enright Jerger; Yatin Hoskote
To alleviate the complex communication problems that arise as the number of on-chip components increases, network-on-chip (NoC) architectures have been recently proposed to replace global interconnects. In this paper, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: Application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation. Motivation, problem description, proposed approaches, and open issues are discussed for each problem from system, microarchitecture, and circuit perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective.
International Symposium on Microarchitecture | 2009
Mitchell Hayenga; Natalie D. Enright Jerger; Mikko H. Lipasti
As technology scaling drives the number of processor cores upward, current on-chip routers consume substantial portions of chip area and power budgets. Since existing research has greatly reduced router latency overheads and capitalized on available on-chip bandwidth, power constraints dominate interconnection network design. Recent research has proposed bufferless routers as a means to alleviate these constraints, but to date all designs exhibit poor operational frequency, throughput, or latency. In this paper, we propose an efficient bufferless router which lowers average packet latency by 17.6% and dynamic energy by 18.3% over existing bufferless on-chip network designs. In order to maintain the energy and area benefits of bufferless routers while delivering ultra-low latencies, our router utilizes an opportunistic processor-side buffering technique and an energy-efficient circuit-switched network for delivering negative acknowledgments for dropped packets.
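A minimal sketch of the drop-and-NACK idea behind bufferless routing, not the paper's router design: routers keep no packet buffers, so on contention a packet is dropped and a negative acknowledgment (NACK) is returned to the sender, which retransmits from a processor-side buffer. The function name, the single drop probability, and the retry loop are all illustrative assumptions.

```python
import random
from collections import deque

def send_with_nacks(packets, drop_prob=0.2, seed=0):
    """Deliver every packet, retransmitting whenever the network 'drops' one."""
    rng = random.Random(seed)
    retransmit_buffer = deque(packets)     # processor-side buffering of in-flight packets
    delivered, attempts = [], 0
    while retransmit_buffer:
        pkt = retransmit_buffer.popleft()
        attempts += 1
        if rng.random() < drop_prob:       # contention: packet dropped, NACK returned
            retransmit_buffer.append(pkt)  # keep the sender-side copy and retry later
        else:
            delivered.append(pkt)          # delivery confirmed: copy can be freed
    return delivered, attempts

if __name__ == "__main__":
    done, tries = send_with_nacks(list(range(10)))
    print(f"delivered {len(done)} packets in {tries} injection attempts")
```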
International Symposium on Computer Architecture | 2011
Sheng Ma; Natalie D. Enright Jerger; Zhiying Wang
With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive routing algorithms limits performance due to poor network congestion avoidance. Globally adaptive routing algorithms attack this issue by introducing a congestion propagation network to obtain network status information beyond neighboring nodes. However, they may suffer from intra- and inter-application interference during output port selection for consolidated workloads, coupling the behavior of otherwise independent applications and negatively affecting performance. To address these two issues, we propose Destination-Based Adaptive Routing (DBAR). We design a novel low-cost congestion propagation network that leverages both local and non-local network information for more accurate congestion estimates. Thus, DBAR offers effective adaptivity for congestion beyond neighboring nodes. More importantly, by integrating the destination into the selection function, DBAR mitigates intra- and inter-application interference and offers dynamic isolation among regions. Experimental results show that DBAR can offer better performance than the best baseline algorithm for all measured configurations; it is well suited for workload consolidation. The wiring overhead of DBAR is low and DBAR provides improvement in the energy-delay product for medium and high injection rates.
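A minimal sketch of a destination-aware output-port selection function in the spirit of DBAR, not the paper's exact algorithm: among the admissible output ports, pick the one whose estimated congestion along the remaining path toward the destination is lowest, rather than looking only at the neighboring router. The port names, the congestion estimates, and the equal weighting of local and non-local information are assumptions.

```python
def select_port(candidates, congestion_to_dest, local_occupancy):
    """
    candidates         : admissible output ports, e.g. ["east", "north"]
    congestion_to_dest : port -> congestion aggregated over routers toward the
                         destination (non-local information)
    local_occupancy    : port -> buffer occupancy at the adjacent router (local information)
    """
    def cost(port):
        # 50/50 weighting of local vs. non-local congestion is an illustrative choice.
        return 0.5 * local_occupancy[port] + 0.5 * congestion_to_dest[port]
    return min(candidates, key=cost)

if __name__ == "__main__":
    print(select_port(["east", "north"],
                      congestion_to_dest={"east": 0.8, "north": 0.3},
                      local_occupancy={"east": 0.2, "north": 0.4}))  # -> "north"
```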
International Symposium on Computer Architecture | 2009
Dennis Abts; Natalie D. Enright Jerger; John Kim; Dan Gibson; Mikko H. Lipasti
In the near term, Moore's law will continue to provide an increasing number of transistors and therefore an increasing number of on-chip cores. Limited pin bandwidth prevents the integration of a large number of memory controllers on-chip. With many cores and few memory controllers, where to locate the memory controllers in the on-chip interconnection fabric becomes an important and as yet unexplored question. In this paper we show how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric and lower the variance in reference latency. This in turn provides predictable performance for memory-intensive applications regardless of the processing core on which a thread is scheduled. We explore the design space of on-chip fabrics to find optimal memory controller placement relative to different topologies (i.e., mesh and torus), routing algorithms, and workloads.
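A small illustrative experiment, not the paper's methodology, showing why controller placement matters: compare a corner placement against a spread-out placement on a mesh by the mean and variance of hop distance from each core to its nearest memory controller. The 8x8 mesh size and both candidate placements are assumptions made for the example.

```python
from statistics import mean, pvariance

def hops_to_nearest(mesh_dim, controllers):
    """Manhattan hop distance from every core to its closest memory controller."""
    dists = []
    for x in range(mesh_dim):
        for y in range(mesh_dim):
            dists.append(min(abs(x - cx) + abs(y - cy) for cx, cy in controllers))
    return dists

if __name__ == "__main__":
    corner = [(0, 0), (0, 7), (7, 0), (7, 7)]   # controllers at the corners
    spread = [(1, 3), (3, 6), (4, 1), (6, 4)]   # controllers spread over the mesh
    for name, placement in [("corner", corner), ("spread", spread)]:
        d = hops_to_nearest(8, placement)
        print(f"{name}: mean hops {mean(d):.2f}, variance {pvariance(d):.2f}")
```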
IEEE Computer Architecture Letters | 2007
Natalie D. Enright Jerger; Mikko H. Lipasti; Li-Shiuan Peh
Circuit-switched networks can significantly lower the communication latency between processor cores, when compared to packet-switched networks, since once circuits are set up, communication latency approaches pure interconnect delay. However, if circuits are not frequently reused, the long setup time and poorer interconnect utilization can hurt overall performance. To combat this problem, we propose a hybrid router design which intermingles packet-switched flits with circuit-switched flits. Additionally, we co-design a prediction-based coherence protocol that leverages the existence of circuits to optimize pair-wise sharing between cores. The protocol allows pair-wise sharers to communicate directly with each other via circuits and drives up circuit reuse. Circuit-switched coherence provides overall system performance improvements of up to 17% with an average improvement of 10% and reduces network latency by up to 30%.
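A minimal sketch of the circuit-reuse intuition, with assumptions throughout and no claim to match the paper's protocol: a table of established circuits keyed by (source, destination). The first message between a pair travels packet-switched and pays a setup cost, while later messages reuse the circuit and pay something close to pure wire delay. The cycle counts are illustrative placeholders.

```python
class CircuitTable:
    def __init__(self, setup_latency=20, circuit_latency=4, packet_latency=12):
        self.circuits = set()                   # established (src, dst) circuits
        self.setup_latency = setup_latency      # illustrative cycle counts
        self.circuit_latency = circuit_latency
        self.packet_latency = packet_latency

    def send(self, src, dst):
        pair = (src, dst)
        if pair in self.circuits:               # reuse: latency near pure interconnect delay
            return self.circuit_latency
        self.circuits.add(pair)                 # set up a circuit for future reuse
        return self.packet_latency + self.setup_latency

if __name__ == "__main__":
    net = CircuitTable()
    print([net.send(0, 5) for _ in range(3)])   # first send pays setup, later sends reuse
```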
Synthesis Lectures on Computer Architecture | 2009
Li-Shiuan Peh; Natalie D. Enright Jerger
With the ability to integrate a large number of cores on a single chip, research into on-chip networks to facilitate communication becomes increasingly important. On-chip networks seek to provide a scalable and high-bandwidth communication substrate for multi-core and many-core architectures. High bandwidth and low latency within the on-chip network must be achieved while fitting within tight area and power budgets. In this lecture, we examine various fundamental aspects of on-chip network design and provide the reader with an overview of the current state-of-the-art research in this field. Table of Contents: Introduction / Interface with System Architecture / Topology / Routing / Flow Control / Router Microarchitecture / Conclusions
IEEE International Symposium on Workload Characterization | 2007
Natalie D. Enright Jerger; Dana Vantrease; Mikko H. Lipasti
While chip multiprocessors with ten or more cores will be feasible within a few years, the search for applications that fully exploit their attributes continues. In the meantime, one sure-fire application for such machines will be to serve as consolidation platforms for sets of workloads that previously occupied multiple discrete systems. Such server consolidation scenarios will simplify system administration and lead to savings in power, cost, and physical infrastructure. This paper studies the behavior of server consolidation workloads, focusing particularly on sharing of caches across a variety of configurations. Noteworthy interactions emerge within a workload, and notably across workloads, when multiple server workloads are scheduled on the same chip. These workloads present an interesting design point and will help designers better evaluate trade-offs as we push forward into the many-core era.
International Symposium on Computer Architecture | 2016
Jorge Albericio; Patrick Judd; Tayler H. Hetherington; Tor M. Aamodt; Natalie D. Enright Jerger; Andreas Moshovos
This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently, enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions, taking them off the critical path while avoiding control divergence in the data-parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.
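An illustrative sketch of the core observation, not the accelerator's actual dataflow or storage format: a multiply with a zero input is ineffectual and can be skipped. Here activations are encoded as (value, position) pairs so the inner product only touches non-zero activations; the encoding helpers are hypothetical names introduced for the example.

```python
def encode_nonzero(activations):
    """Keep only non-zero activations together with their original positions."""
    return [(a, i) for i, a in enumerate(activations) if a != 0.0]

def sparse_dot(encoded_activations, weights):
    """Multiply-accumulate over the non-zero activations only."""
    return sum(a * weights[i] for a, i in encoded_activations)

if __name__ == "__main__":
    acts = [0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.5, 0.0]
    wts  = [0.2, 0.4, 0.1, 0.3, 0.5, 0.9, 0.8, 0.7]
    enc = encode_nonzero(acts)
    print(len(enc), "of", len(acts), "multiplies performed; result =", sparse_dot(enc, wts))
```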
High Performance Computer Architecture | 2012
Sheng Ma; Natalie D. Enright Jerger; Zhiying Wang
Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock-avoidance theories require a conservative VC re-allocation scheme: a VC can only be re-allocated when it is empty, which limits performance. We propose a novel VC re-allocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that the majority of packets in NoCs are short. We prove that WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. WPF is an important extension of previous deadlock-avoidance theories. Second, to efficiently utilize WPF in VC-limited networks, we design a novel fully adaptive routing algorithm which maintains packet adaptivity without significant hardware cost. Compared with conservative VC re-allocation, WPF achieves an average 88.9% saturation throughput improvement on synthetic traffic patterns and an average 21.3% and maximum 37.8% speedup for PARSEC applications with heavy network loads. Our design also offers higher performance than several partially adaptive and deterministic routing algorithms.
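A minimal sketch contrasting the two re-allocation conditions described above; buffer depths and packet lengths are assumptions. Conservative re-allocation grants a VC only when it is completely empty, while whole packet forwarding (WPF) grants a non-empty VC as long as the entire incoming packet fits in the remaining free slots.

```python
def can_reallocate_conservative(vc_occupancy, vc_depth, packet_len):
    """Conservative scheme: the VC must be completely empty."""
    return vc_occupancy == 0

def can_reallocate_wpf(vc_occupancy, vc_depth, packet_len):
    """WPF: the whole packet must fit in the VC's free slots."""
    return vc_depth - vc_occupancy >= packet_len

if __name__ == "__main__":
    # A 5-flit-deep VC currently holding 2 flits; a short 3-flit packet requests it.
    print(can_reallocate_conservative(2, 5, 3))   # False: must wait for the VC to drain
    print(can_reallocate_wpf(2, 5, 3))            # True: the short packet fits entirely
```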
International Symposium on Microarchitecture | 2014
Joshua San Miguel; Mario Badr; Natalie D. Enright Jerger
Approximate computing explores opportunities that emerge when applications can tolerate error or inexactness. These applications, which range from multimedia processing to machine learning, operate on inherently noisy and imprecise data. We can trade off some loss in output value integrity for improved processor performance and energy efficiency. As memory accesses consume substantial latency and energy, we explore load value approximation, a microarchitectural technique to learn value patterns and generate approximations for the data. The processor uses these approximate data values to continue executing without incurring the high cost of accessing memory, removing load instructions from the critical path. Load value approximation can also prevent approximated loads from accessing memory, resulting in energy savings. On a range of PARSEC workloads, we observe up to 28.6% speedup (8.5% on average) and 44.1% energy savings (12.6% on average), while maintaining low output error. By exploiting the approximate nature of applications, we draw closer to the ideal latency and energy of accessing memory.
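A minimal sketch of a load value approximator, illustrative rather than the paper's predictor design: a small table indexed by load PC tracks the last value and stride seen, and supplies a guessed value so execution can continue instead of waiting on memory. The table organization and the last-value-plus-stride policy are assumptions.

```python
class LoadValueApproximator:
    def __init__(self):
        self.table = {}                        # pc -> (last_value, stride)

    def approximate(self, pc):
        """Return a guessed value for the load at this PC instead of stalling."""
        last, stride = self.table.get(pc, (0, 0))
        return last + stride

    def train(self, pc, actual_value):
        """Update the entry once the true value eventually arrives from memory."""
        last, _ = self.table.get(pc, (actual_value, 0))
        self.table[pc] = (actual_value, actual_value - last)

if __name__ == "__main__":
    lva = LoadValueApproximator()
    for true_val in [10, 20, 30, 40]:          # a strided sequence of loaded values
        guess = lva.approximate(pc=0x400)
        lva.train(pc=0x400, actual_value=true_val)
        print(f"approximated {guess}, actual {true_val}")
```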