Avi Mendelson
Technion – Israel Institute of Technology
Publications
Featured research published by Avi Mendelson.
Proceedings of the IEEE | 2001
Ronny Ronen; Avi Mendelson; Konrad K. Lai; Shih-Lien Lu; Fred J. Pollack; John Paul Shen
In the past several decades, the world of computers and especially that of microprocessors has witnessed phenomenal advances. Computers have exhibited ever-increasing performance and decreasing costs, making them more affordable and, in turn, accelerating additional software and hardware development that fueled this process even more. The technology that enabled this exponential growth is a combination of advancements in process technology, microarchitecture, architecture, and design and development tools. While the pace of this progress has been quite impressive over the last two decades, it has become harder and harder to keep up this pace. New process technology requires more expensive megafabs, and new performance levels require larger dies, higher power consumption, and enormous design and validation effort. Furthermore, as CMOS technology continues to advance, microprocessor design is exposed to a new set of challenges. In the near future, microarchitecture has to consider and explicitly manage the limits of semiconductor technology, such as wire delays, power dissipation, and soft errors. In this paper we describe the role of microarchitecture in the computer world, present the challenges ahead of us, and highlight areas where microarchitecture can help address these challenges.
International Symposium on Microarchitecture | 1997
Freddy Gabbay; Avi Mendelson
This paper explores the possibility of using program profiling to enhance the efficiency of value prediction. Value prediction attempts to eliminate true-data dependencies by predicting the outcome values of instructions at run-time and executing true-data dependent instructions based on that prediction. So far, all published papers in this area have examined hardware-only value prediction mechanisms. In order to enhance the efficiency of value prediction, we propose to employ program profiling to collect information that describes the tendency of instructions in a program to be value-predictable. The compiler, acting as a mediator, can pass this information to the value-prediction hardware mechanisms. The hardware can exploit such information to reduce mispredictions, better utilize the prediction table resources, and distinguish between different value-predictability patterns, while still benefiting from the advantages of value prediction in increasing instruction-level parallelism. We show that our new method outperforms the hardware-only mechanisms in most of the examined benchmarks.
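As a rough illustration of the general idea (not the paper's actual mechanism), the sketch below shows a last-value predictor in which a compiler-supplied, profile-derived hint gates table allocation, so entries are spent only on instructions the profile found to be value-predictable. The table organization, confidence scheme, and hint encoding are assumptions made for the example.

```python
# Minimal sketch of profile-guided last-value prediction (illustrative only;
# the table layout and hint encoding are assumptions, not the paper's design).

class LastValuePredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}                  # index -> (last_value, confidence)

    def predict(self, pc):
        """Return a predicted outcome value for this instruction, or None."""
        entry = self.table.get(pc % self.entries)
        if entry and entry[1] >= 2:      # require saturated-enough confidence
            return entry[0]
        return None

    def update(self, pc, actual_value, profile_predictable):
        """Train the predictor; allocate entries only for instructions the
        profile marked as value-predictable, saving table space."""
        idx = pc % self.entries
        entry = self.table.get(idx)
        if entry is None:
            if profile_predictable:      # compiler-passed hint gates allocation
                self.table[idx] = (actual_value, 1)
            return
        last, conf = entry
        if last == actual_value:
            conf = min(conf + 1, 3)
        else:
            last, conf = actual_value, max(conf - 1, 0)
        self.table[idx] = (last, conf)
```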
IEEE Computer Architecture Letters | 2009
Zvika Guz; Evgeny Bolotin; Idit Keidar; Avinoam Kolodny; Avi Mendelson; Uri C. Weiser
We study the tradeoffs between many-core machines like Intel's Larrabee and many-thread machines like Nvidia and AMD GPGPUs. We define a unified model describing a superposition of the two architectures, and use it to identify operation zones for which each machine is more suitable. Moreover, we identify an intermediate zone in which both machines deliver inferior performance. We study the shape of this "performance valley" and provide insights on how it can be avoided.
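The sketch below is a simplified analytic model in the spirit of such a unified machine model; the cache model and every numeric parameter are illustrative assumptions, not the paper's coefficients. Sweeping the thread count exposes a region where per-thread cache pressure has grown but there are not yet enough threads to hide memory latency, i.e., a performance valley.

```python
# Simplified analytic sketch of a combined many-core / many-thread model
# (all parameter values and the cache model are assumptions for illustration).

def total_ipc(n_threads, n_pe=64, cpi_exe=1.0, t_mem=200.0,
              cache_mb=16.0, ws_mb=0.5, miss_min=0.001, miss_max=0.05):
    # Miss rate grows once the threads' combined working sets overflow the cache.
    fit = min(1.0, cache_mb / (n_threads * ws_mb))
    miss = miss_min + (miss_max - miss_min) * (1.0 - fit)

    # A single thread's throughput when it stalls on every miss.
    per_thread_ipc = 1.0 / (cpi_exe + miss * t_mem)

    # Total throughput is capped by the execution resources; more threads
    # both hide memory latency and raise cache pressure.
    return min(n_pe / cpi_exe, n_threads * per_thread_ipc)


if __name__ == "__main__":
    for n in (8, 32, 64, 128, 512, 1024):
        print(n, round(total_ipc(n), 1))
    # Throughput climbs, dips once the cache is oversubscribed (the "valley"),
    # and recovers only when enough threads are available to hide the misses.
```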
Programming Language Design and Implementation | 2009
Bratin Saha; Xiaocheng Zhou; Hu Chen; Ying Gao; Shoumeng Yan; Mohan Rajagopalan; Jesse Fang; Peinan Zhang; Ronny Ronen; Avi Mendelson
The client computing platform is moving towards a heterogeneous architecture consisting of a combination of cores focused on scalar performance and a set of throughput-oriented cores. The throughput-oriented cores (e.g. a GPU) may be connected over both coherent and non-coherent interconnects, and may have different ISAs. This paper describes a programming model for such heterogeneous platforms. We discuss the language constructs, runtime implementation, and memory model for such a programming environment. We implemented this programming environment in an x86 heterogeneous platform simulator. We ported a number of workloads to our programming environment, and present the performance of our programming environment on these workloads.
International Conference on Parallel Architectures and Compilation Techniques | 2011
Carlos Villavieja; Vasileios Karakostas; Lluis Vilanova; Yoav Etsion; Alex Ramirez; Avi Mendelson; Nacho Navarro; Adrián Cristal; Osman S. Unsal
Translation Lookaside Buffers (TLBs) are ubiquitously used in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to the forefront. TLBs are kept coherent at the software level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherency mechanism. First, we show that both TLB shootdown cost and frequency increase with the number of processors, and project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, thereby eliminating the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
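The following sketch illustrates the general idea of a shared TLB directory that narrows an invalidation to the cores that may actually hold the stale translation, instead of broadcasting IPIs to every core; the structure and interfaces are simplified assumptions, not the exact hardware proposed in the paper.

```python
# Minimal sketch of a shared TLB directory (simplified assumption, not the
# paper's hardware): track which cores may cache each translation so that a
# page-table update only invalidates those cores' TLB entries.

class TLBDirectory:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.sharers = {}            # virtual page number -> set of core ids

    def record_fill(self, vpn, core_id):
        """Called when a core's TLB caches a translation for this page."""
        self.sharers.setdefault(vpn, set()).add(core_id)

    def invalidate(self, vpn):
        """Called when the OS changes the page-table entry for this page.
        Returns only the cores whose TLBs must drop the stale translation,
        instead of interrupting every core in the system."""
        return self.sharers.pop(vpn, set())


if __name__ == "__main__":
    directory = TLBDirectory(num_cores=16)
    directory.record_fill(vpn=0x4000, core_id=3)
    directory.record_fill(vpn=0x4000, core_id=7)
    # Only cores 3 and 7 need a lightweight invalidation, not an IPI broadcast.
    print(directory.invalidate(0x4000))      # {3, 7}
```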
International Symposium on Computer Architecture | 1998
Freddy Gabbay; Avi Mendelson
Value prediction attempts to eliminate true-data dependencies by dynamically predicting the outcome values of instructions and executing true-data dependent instructions based on that prediction. In this paper we attempt to understand the limitations of using this paradigm in realistic machines. We show that the instruction-fetch bandwidth and the issue rate have a very significant impact on the efficiency of value prediction. In addition, we study how recent techniques to improve the instruction-fetch rate affect the efficiency of value prediction and its hardware organization.
IEEE Computer Architecture Letters | 2003
Aviad Cohen; Lev Finkelstein; Avi Mendelson; Ronny Ronen; Dmitry Rudoy
In this paper we focus on dynamic thermal management (DTM) strategies that use dynamic voltage scaling (DVS) for power control. We perform a theoretical analysis targeted at estimating the optimal strategy, and show two facts: (1) when there is a gap between the initial and the limit temperatures, it is best to start with a high (though not necessarily maximal) frequency and decrease it exponentially until the limit temperature is reached; (2) when close to the limit temperature, the best strategy is to stay there. We use the patterns exhibited by the optimal strategy to analyze some existing DTM techniques.
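As a qualitative illustration (not the paper's derivation), the toy simulation below uses a lumped thermal model and a greedy DVS controller: it runs at a high frequency while thermal headroom remains, backs off as the limit temperature is approached, and then holds the highest thermally sustainable frequency at the limit. The thermal constants and the cubic power law are assumptions chosen for the example, and the greedy schedule only approximates the optimal exponential one described in the paper.

```python
# Toy dynamic-thermal-management simulation (all constants and the lumped
# thermal model are assumptions for illustration, not values from the paper).
# Each step, a greedy controller picks the highest frequency that keeps the
# next-step temperature under the limit.

def simulate(t_limit=80.0, t_ambient=45.0, steps=2000, dt=0.01,
             r_th=1.0, c_th=2.0, f_max=1.0, f_min=0.2, k_power=60.0):
    temp = t_ambient
    trace = []
    for _ in range(steps):
        # Power scales roughly as f^3 under combined voltage/frequency scaling.
        f = f_max
        while f > f_min:
            next_temp = temp + dt * (k_power * f ** 3
                                     - (temp - t_ambient) / r_th) / c_th
            if next_temp <= t_limit:
                break
            f -= 0.01
        temp += dt * (k_power * f ** 3 - (temp - t_ambient) / r_th) / c_th
        trace.append((round(f, 2), round(temp, 1)))
    return trace


if __name__ == "__main__":
    trace = simulate()
    for step in (0, 150, 200, 1999):
        print(step, trace[step])
    # Frequency starts at f_max, drops as the limit temperature is reached,
    # and then settles at the highest thermally sustainable frequency.
```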
Microprocessors and Microsystems | 2014
Roberto Giorgi; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Rahulkumar Gayatri; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Joshua Landwehr; Nhat Minh Lê; Feng Li; Mikel Luján; Avi Mendelson; Laurent Morin; Nacho Navarro; Tomasz Patejko; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Ian Watson; Sebastian Weis; Stéphane Zuckerman; Mateo Valero
The improvements in semiconductor technology are gradually enabling extreme-scale systems such as teradevices (i.e., chips composed of 1000 billion transistors), most likely by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses these challenges at once by leveraging dataflow principles. This paper presents an overview of the research carried out by the TERAFLUX partners and some preliminary results. Our platform comprises 1000+ general-purpose cores per chip in order to properly explore the above challenges. An architectural template has been proposed and applications have been ported to the platform. Programming models, compilation tools, and reliability techniques have been developed. The evaluation is carried out using modifications of the HP Labs COTSon simulator.
International Symposium on Computer Architecture | 2004
Roni Rosner; Yoav Almog; Micha Moffie; Naftali Schwartz; Avi Mendelson
We present the PARROT concept, which seeks to achieve higher performance with reduced energy consumption through gradual optimization of frequently executed code traces. The PARROT microarchitectural framework integrates trace caching, dynamic optimizations, and pipeline decoupling. We employ a selective approach, applying complex mechanisms only to the most frequently used traces to maximize the performance gain at any given power constraint, thus attaining finer control of the tradeoffs between performance and power awareness. We show that a PARROT-based microarchitecture can improve the performance of aggressively designed processors by providing the means to improve the utilization of their more elaborate resources. At the same time, rigorous selection of traces prior to storage and optimization provides the key to attenuating increases in the power budget. For resource-constrained designs, PARROT-based architectures deliver better performance (up to an average 16% increase in IPC) at a comparable energy level, whereas the conventional path to a similar performance improvement consumes an average 70% more energy. Meanwhile, for those designs which can tolerate a higher power budget, PARROT gracefully scales up to use additional execution resources in a uniformly efficient manner. In particular, a PARROT-style doubly-wide machine delivers an average 45% IPC improvement while actually improving the cubic-MIPS-per-Watt power-awareness metric by over 50%.
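A minimal software sketch of the selective principle (placeholders only, not PARROT's actual trace-cache policy): traces are counted, and the costly optimization step is applied only once a trace has proven hot, so its energy cost is amortized over many future executions.

```python
# Illustrative sketch of selective trace optimization (the counters, threshold,
# and "optimize" step are placeholders, not the PARROT hardware's policy).

from collections import Counter

HOT_THRESHOLD = 64          # executions before a trace is considered hot

class TraceCache:
    def __init__(self):
        self.exec_counts = Counter()     # trace id -> times executed
        self.optimized = {}              # trace id -> optimized trace

    def on_trace_execute(self, trace_id, raw_trace):
        self.exec_counts[trace_id] += 1
        if (trace_id not in self.optimized and
                self.exec_counts[trace_id] >= HOT_THRESHOLD):
            # Spend optimization effort only on frequently used traces, so the
            # cost is amortized over many future executions.
            self.optimized[trace_id] = self.optimize(raw_trace)
        return self.optimized.get(trace_id, raw_trace)

    @staticmethod
    def optimize(raw_trace):
        # Placeholder for dynamic optimizations applied to the stored trace
        # (here: trivially dropping no-ops).
        return [op for op in raw_trace if op != "nop"]
```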
Dependable Systems and Networks | 2000
Avi Mendelson; Neeraj Suri
As VLSI geometry continues to shrink and the level of integration increases, it is expected that the probability of faults, particularly transient faults, will increase in future microprocessors. So far, fault tolerance has chiefly been considered for special-purpose or safety-critical systems, but future technology will likely require integrating fault tolerance techniques into commercial systems. Such systems require low-cost solutions that are transparent to the system operation and do not degrade overall performance. This paper introduces a new superscalar architecture, termed O3RS, that aims to incorporate simple fault tolerance mechanisms as part of the basic architecture.
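As a conceptual illustration of transparent fault tolerance through redundant execution (a generic sketch of the idea, not the O3RS microarchitecture itself): each operation is issued twice and allowed to retire only if the two results agree; a mismatch triggers re-execution.

```python
# Conceptual sketch of time-redundant execution with a result check before
# commit (a generic illustration, not the O3RS design).

import random

def execute(op, a, b, fault_rate=0.0):
    result = {"add": a + b, "sub": a - b, "mul": a * b}[op]
    if random.random() < fault_rate:          # model a transient (soft) error
        result ^= 1 << random.randrange(32)   # flip a random result bit
    return result

def execute_with_recovery(op, a, b, fault_rate=1e-3):
    """Issue the operation twice; commit only if both copies agree,
    otherwise replay (detection plus recovery by re-execution)."""
    while True:
        first = execute(op, a, b, fault_rate)
        second = execute(op, a, b, fault_rate)
        if first == second:
            return first            # safe to retire

if __name__ == "__main__":
    print(execute_with_recovery("add", 40, 2))   # 42, even with injected faults
```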