Gadi Haber
Intel
Publications
Featured research published by Gadi Haber.
High Performance Embedded Architectures and Compilers | 2008
Roy Levin; Ilan Newman; Gadi Haber
Edge profiling is a very common means for providing feedback on program behavior that can be used statically by an optimizer to produce highly optimized binaries. However, collecting a full edge profile carries a significant runtime overhead. This overhead creates additional problems for real-time applications, as it may prevent the system from meeting runtime deadlines and thus alter its behavior. In this paper we show how a low-overhead sampling technique can be used to collect an inaccurate profile, which is later used to approximate the full edge profile using a novel technique based on the Minimum Cost Circulation Problem. The outcome is a machine-independent profile gathering scheme that incurs a slowdown of only 2%-3% on the training run and produces an optimized binary whose performance is only 0.6% below that of a fully optimized one.
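The central step, recovering full edge counts from sparse samples by restoring flow conservation at every basic block, can be illustrated with a small sketch. The paper formulates this repair as a Minimum Cost Circulation Problem; the toy CFG, sampled counts, and proportional fix-up below are illustrative assumptions only, not the paper's algorithm.

```python
# Illustrative sketch (not the paper's algorithm): repair sampled edge counts
# so that flow conservation holds at every basic block. The paper casts this
# repair as a Minimum Cost Circulation Problem; here a simple iterative
# proportional fix-up stands in for it. The CFG and counts are made up.

sampled = {
    ("entry", "A"): 95,
    ("A", "B"): 60, ("A", "C"): 30,   # sampled branch split
    ("B", "D"): 58, ("C", "D"): 33,
    ("D", "exit"): 90,
}

def neighbors(edges):
    preds, succs = {}, {}
    for (u, v) in edges:
        succs.setdefault(u, []).append(v)
        preds.setdefault(v, []).append(u)
    return preds, succs

def repair(counts, entry="entry", exit="exit", iters=50):
    counts = dict(counts)
    preds, succs = neighbors(counts)
    nodes = {u for u, _ in counts} | {v for _, v in counts}
    for _ in range(iters):
        for n in nodes:
            if n in (entry, exit):
                continue
            inflow = sum(counts[(p, n)] for p in preds.get(n, []))
            outflow = sum(counts[(n, s)] for s in succs.get(n, []))
            if outflow == 0:
                continue
            # Scale outgoing edges to match incoming flow while preserving
            # the sampled branch ratios.
            scale = inflow / outflow
            for s in succs.get(n, []):
                counts[(n, s)] *= scale
    return counts

for edge, count in repair(sampled).items():
    print(edge, round(count, 1))
```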
Symposium on Code Generation and Optimization | 2003
Gadi Haber; Moshe Klausner; Vadim Eisenberg; Bilha Mendelson; Maxim Gurevich
Memory access has proven to be one of the bottlenecks in modern architectures. Improving memory locality and reducing the number of memory accesses can help relieve this bottleneck. We present a method for link-time, profile-based optimization that reorders the global data of the program and modifies its code accordingly. The proposed optimization reorders the entire global data of the program according to a representative execution rate of each instruction (or basic block) in the code. The data reordering is done in a way that enables the replacement of frequently executed Load instructions, which reference the global data, with fast Add Immediate instructions. In addition, it tries to improve the global data locality and to reduce the total size of the global data area. The optimization was implemented in FDPR (Feedback Directed Program Restructuring), a post-link optimizer that is part of the IBM AIX operating system for the IBM pSeries servers. Our results on SPECint2000 show a significant improvement of up to 11% (average 3%) in execution time, along with up to 97.9% (average 83%) reduction in memory references to the global variables via the global data access mechanism of the program.
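As a rough illustration of the data-reordering idea, the sketch below orders made-up global variables by profiled access frequency so that the hottest ones fall within an assumed 16-bit immediate displacement window, where an address load can be replaced by a single Add Immediate. All names, sizes, and counts are invented for the example; the actual FDPR transformation works at the binary level.

```python
# Illustrative sketch (details assumed, not from the paper): order global
# variables by profiled access frequency so that the hottest ones land
# within the displacement reachable from a base register. For such globals,
# a Load of the address from a pointer table can be replaced by a single
# Add Immediate off the base register.

IMM_RANGE = 32 * 1024  # bytes reachable with a 16-bit signed immediate (assumed)

# (name, size_in_bytes, profiled_access_count) -- made-up example data
globals_info = [
    ("err_msg_table", 4096, 3),
    ("hot_counter", 8, 120000),
    ("lookup_tbl", 1024, 45000),
    ("config_block", 512, 10),
    ("ring_buffer", 8192, 70000),
]

def reorder(gvars):
    # Hottest per byte first, so as many hot accesses as possible land
    # inside the immediate-reachable window.
    ranked = sorted(gvars, key=lambda g: g[2] / g[1], reverse=True)
    layout, offset = [], 0
    for name, size, count in ranked:
        layout.append((name, offset, size, count))
        offset += size
    return layout

for name, offset, size, count in reorder(globals_info):
    fast = "add-immediate" if offset + size <= IMM_RANGE else "load from pointer table"
    print(f"{name:14s} offset={offset:6d} accesses={count:6d} -> {fast}")
```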
Proceedings of the 1st international forum on Next-generation multicore/manycore technologies | 2008
Duc Vianney; Gadi Haber; Andre Heilper; Marcel Zalmanovici
Porting, optimizing, and tuning code has become a challenging task in multicore/many-core environments. It requires a different set of performance visualization tools to handle the complexity of the many cores and the volume of performance data in order to find opportunities for optimization. This paper discusses the performance visualization tools available for Cell/B.E. under the IBM Software Development Kit (SDK) for Multicore Acceleration Version 3.0. It also presents a methodology for porting, optimizing, and tuning Cell applications using those tools. The paper starts with a simple scalar program example, which can also be found in the IBM tutorial for Cell programming, and then describes all the steps needed to make it fully tuned and scaled for the Cell Broadband Engine.
Embedded Software | 2004
Daniel Citron; Gadi Haber; Roy Levin
Constraints on the memory size of embedded systems require reducing the image size of executing programs. Common techniques include code compression and reduced instruction sets. We propose a novel technique that eliminates large portions of the executable image without compromising execution time (due to decompression) or code generation (due to reduced instruction sets). Frozen code and data portions are identified using profiling techniques and removed from the loadable image. They are replaced with branches to code stubs that load them in the unlikely case that they are accessed. The executable is kept in a runnable state. Analysis of the frozen portions reveals that most are error and uncommon-input handlers. Only a small fraction of the code (less than 1%) that was identified as frozen during a training run is also accessed with production datasets. The technique was applied to three benchmark suites (SPEC CINT2000, SPEC CFP2000, and MediaBench) and results in image size reductions of up to 73%, 92%, and 85% per suite; the average reductions are 59%, 79%, and 78%, respectively.
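A minimal sketch of the selection step, under assumed profile data and a made-up stub size: functions never executed during the training run are marked frozen, dropped from the image, and charged a small stub that would fault them in on a rare access.

```python
# Illustrative sketch (assumed data and policy, not the paper's exact
# algorithm): code regions never executed during a training run are marked
# "frozen", removed from the loadable image, and replaced by a small stub
# that would load them on the rare access.

STUB_SIZE = 16  # bytes per stub left in place of a frozen region (assumed)

# (function_name, size_in_bytes, training_run_execution_count) -- made up
profile = [
    ("main_loop", 2400, 9_000_000),
    ("parse_input", 1800, 12_000),
    ("report_oom_error", 950, 0),
    ("print_usage", 1200, 0),
    ("legacy_format_reader", 6100, 0),
]

def freeze(funcs):
    kept, frozen = [], []
    for name, size, count in funcs:
        (kept if count > 0 else frozen).append((name, size))
    return kept, frozen

kept, frozen = freeze(profile)
orig = sum(size for _, size, _ in profile)
new = sum(size for _, size in kept) + STUB_SIZE * len(frozen)
print("frozen regions:", [name for name, _ in frozen])
print(f"image size: {orig} -> {new} bytes "
      f"({100 * (orig - new) / orig:.1f}% reduction)")
```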
Journal of Systems and Software | 1996
Yosi Ben-Asher; Gadi Haber
The problem of producing efficient parallel programs against different possible execution orders, or schedulings, is addressed. Our method emphasizes experimental modification of the machine parameters, using a parallel simulator, so that by a proper choice of the simulator parameters the user can detect potentially harmful schedulings. Two types of time statistics, named ideal and effective times, are defined. We show that the gap between them can be used to detect whether the current scheduling is a possible cause of performance degradation. This allows the user to make a more “compact” search of the huge space of all possible schedulings. This search for “bad” schedulings is done by allowing the user to change the simulation parameters, in particular the context-switch delay and the scheduling policy. The main objective of the Simparc simulator is to help the programmer develop efficient parallel programs independent of the hardware that will be used. Therefore, the article includes a set of typical experiments with the simulator, covering the main causes of bad schedulings. Altogether, the simulator and the experiments form a methodology for detecting possible sources of performance degradation.
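The ideal-versus-effective gap can be sketched as follows, with the definitions simplified and the task durations, schedule, and context-switch delay invented for the example: the ideal time assumes perfect balance and zero overhead, the effective time reflects a concrete scheduling, and a large ratio between them flags the scheduling as a likely bottleneck.

```python
# Illustrative sketch (definitions assumed): compare an "ideal" execution
# time (perfect load balance, zero scheduling overhead) with the "effective"
# time produced by a concrete scheduling that includes a context-switch
# delay. A large gap flags the scheduling as a likely cause of slowdown.

def ideal_time(task_durations, n_procs):
    # Perfect balance, no overhead.
    return sum(task_durations) / n_procs

def effective_time(schedule, cs_delay):
    # schedule: list of per-processor task-duration lists, with one
    # context switch charged between consecutive tasks on a processor.
    def proc_time(tasks):
        return sum(tasks) + cs_delay * max(len(tasks) - 1, 0)
    return max(proc_time(tasks) for tasks in schedule)

tasks = [10, 10, 10, 10, 40]              # made-up task durations
bad_schedule = [[40, 10, 10], [10, 10]]   # load-imbalanced assignment
gap = effective_time(bad_schedule, cs_delay=2) / ideal_time(tasks, n_procs=2)
print(f"effective/ideal = {gap:.2f}"
      + ("  <- scheduling is a likely bottleneck" if gap > 1.2 else ""))
```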
Automated Software Engineering | 2012
Yosi Ben Asher; Tomer Gal; Gadi Haber; Marcel Zalmanovici
Object Inlining (OI) is a known optimization in object-oriented programming in which referenced objects of class B are inlined into their referencing objects of class A by making all fields and methods of class B part of class A. The optimization eliminates the new operations for B-type objects in class A and at the same time replaces all indirect accesses from A to fields of B with direct accesses. To the best of our knowledge, in spite of the significant performance potential of the OI optimization, reported performance measurements have been relatively moderate. This is because an aggressive OI optimization requires complex analysis and code transformations to overcome problems such as multiple references to the inlinable object, object references that escape their object scope, etc. To extract the full potential of OI, we propose a two-stage process. The first stage includes automatic analysis of the source code that informs the user, via comments in the IDE, about code transformations that are needed in order to enable or to maximize the potential of the OI optimization. In the second stage, the OI optimization is applied automatically on the source code as a code refactoring operation or, preferably, as part of the compilation process prior to the javac run. We show that this half-automated technique helps extract the full potential of OI. The proposed OI refactoring process also determines the order in which the object inlinings are applied and enables us to apply inlinings of objects created inside a method, thus reaching better performance gains. In this work we also include an evaluation of the effects of the OI optimization on multithreaded applications running on multicore machines. The comments and the OI transformation were implemented in the Eclipse JDT (Java Development Tools) plugin. The system was then applied on the SPECjbb2000 source code along with profiling data collected by the Eclipse TPTP plugin. The proposed system achieved a 46% improvement in performance.
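A minimal sketch of the transformation itself (the paper targets Java source through an Eclipse JDT refactoring; the classes below are made up and written in Python for consistency with the other sketches): the fields of the referenced class B become fields of the referencing class A, removing both the extra allocation and the indirection on every access.

```python
# Illustrative sketch of the Object Inlining idea (made-up example classes).
# Fields of the referenced class B are folded into the referencing class A,
# removing the extra allocation and the A.b.field indirection on each access.

# Before: A holds a reference to a separately allocated B.
class B:
    def __init__(self):
        self.x = 0
        self.y = 0

class A:
    def __init__(self):
        self.b = B()          # extra allocation + indirection on each access

    def step(self):
        self.b.x += 1
        self.b.y += self.b.x

# After object inlining: B's fields become fields of A directly.
class A_inlined:
    def __init__(self):
        self.b_x = 0          # was self.b.x
        self.b_y = 0          # was self.b.y

    def step(self):
        self.b_x += 1
        self.b_y += self.b_x

a, ai = A(), A_inlined()
for _ in range(3):
    a.step(); ai.step()
assert (a.b.x, a.b.y) == (ai.b_x, ai.b_y)
print("inlined version computes the same state:", (ai.b_x, ai.b_y))
```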
International Parallel and Distributed Processing Symposium | 2004
Yosi Ben-Asher; Daniel Citron; Gadi Haber
We consider the problem of compiling programs, written in a general high-level programming language, into hardware circuits executed by an FPGA (field programmable gate array) unit. In particular, we consider the problem of synthesizing nested loops that frequently access array elements stored in an external memory (outside the FPGA). We propose an aggressive compilation scheme, based on loop unrolling and code flattening techniques, in which array references from/to the external memory are overlapped with uninterrupted hardware evaluation of the synthesized loop circuit. We implement a restricted programming language called DOL based on the proposed compilation scheme, and our experimental results provide preliminary evidence that aggressive compilation can be used to compile large code segments into circuits, including overlapping of hardware operations and memory references.
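The flattening-plus-overlap idea can be sketched in software terms, with the array, sizes, and unroll factor assumed for the example: the nested loop is flattened into a single index space and unrolled so that a block of external-memory references is issued together, which in the synthesized circuit would overlap with evaluating the loop body.

```python
# Illustrative sketch (assumed example, not the DOL compiler itself): flatten
# a nested loop over an array held in "external memory" and unroll it by a
# fixed factor, so a block of array references is issued together and could
# overlap with evaluating the loop body in hardware.

N, M, UNROLL = 4, 6, 4
ext_mem = [[i * M + j for j in range(M)] for i in range(N)]  # "external" array

# Original nested loop, for reference.
total_ref = sum(ext_mem[i][j] for i in range(N) for j in range(M))

# Flattened + unrolled version: one flat index space, processed in blocks.
flat_len = N * M
total = 0
for base in range(0, flat_len, UNROLL):
    # Issue a block of external-memory references up front; in hardware this
    # block transfer would overlap with evaluating the previous block.
    block = [ext_mem[k // M][k % M]
             for k in range(base, min(base + UNROLL, flat_len))]
    # Unrolled "loop body" over the prefetched block.
    for v in block:
        total += v

assert total == total_ref
print("flattened/unrolled result:", total)
```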
International Conference on Parallel Processing | 2012
Yosi Ben Asher; Eldar Fisher; Gadi Haber; Vladislav Tartakovsky
In this work we consider the problem of fast parallel evaluation of boolean circuits, namely evaluating a boolean circuit C with given leaf input values faster than its depth allows, which in practice means completing the evaluation in a number of iterations that is logarithmic in the depth. Finding a general parallel algorithm that can evaluate any circuit in log-depth iterations is known as the Circuit Value Problem (CVP). The CVP and its approximations are known to be P-complete, and therefore a heuristic solution that works in practice for all “real” computations is sought. In this work we propose a new algorithm, based on a two-player game, that can reduce the evaluation time of a boolean circuit C by up to min(h, log d, log co-d) iterations, where h is the maximal number of and-or alternations along any path in C and d (respectively co-d) is the algebraic degree (respectively co-degree) of C. This improves the theoretical bound of the MRK algorithm (Miller, Ramachandran and Kaltofen 86) for the case of parallel evaluation of boolean circuits. More importantly, we show via experiments that for circuits emanating from real programs, the proposed algorithm can practically evaluate circuits in log-depth iterations. Each iteration can be evaluated in parallel using a connectivity step, and although this step can be implemented using log-depth boolean circuits, we consider an optical switching realization based on Optical Ring Resonators (ORRs). Due to quantum effects, propagating a light beam through a sequence of ORRs can be done with zero latency, making ORRs ideal for implementing the connectivity step required by the proposed algorithm. In order to obtain the needed experiments, we extended the LLVM compiler to transform C code into boolean circuits and then simulated the optical evaluation of these circuits using the proposed two-player game. Our experiments indeed show that circuits emanating from real applications can be evaluated in log-depth iterations of the proposed algorithm and that the optical implementation is feasible.
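One ingredient of the stated bound, h, can be computed directly from the circuit. The sketch below uses an assumed dictionary representation of a small circuit and counts the maximal number of and/or alternations along any leaf-to-root path; it illustrates only this quantity, not the two-player evaluation game itself.

```python
# Illustrative sketch (circuit representation assumed): compute h, the
# maximal number of and/or alternations along any leaf-to-root path of a
# boolean circuit, which appears in the min(h, log d, log co-d) bound.

# Each gate: (operation, list of input gate names); leaves have op "leaf".
circuit = {
    "x0": ("leaf", []), "x1": ("leaf", []), "x2": ("leaf", []),
    "g1": ("and", ["x0", "x1"]),
    "g2": ("or",  ["g1", "x2"]),
    "g3": ("and", ["g2", "x0"]),
    "out": ("or", ["g3", "g1"]),
}

def alternations(node, memo=None):
    # Max number of and<->or alternations on any path from a leaf to node.
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    op, inputs = circuit[node]
    best = 0
    for child in inputs:
        child_alt = alternations(child, memo)
        child_op = circuit[child][0]
        extra = 1 if child_op in ("and", "or") and child_op != op else 0
        best = max(best, child_alt + extra)
    memo[node] = best
    return best

print("h =", alternations("out"))
```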
Haifa Experimental Systems Conference | 2010
Omer Y. Boehm; Gadi Haber; Helena Kosachevsky
Today's architectures exploit long pipelines to increase instruction-level parallelism by grouping sets of consecutive instructions and feeding them into the pipeline with the purpose of executing them in a single cycle. The IBM Power architecture executes programs by dispatching groups of instructions, where a dispatch group is fed as a whole into the pipeline to be executed in a single cycle. Such architectures, however, include many cases of pipeline delays that result from dependencies between the resources of separate groups. As a result, there is a need to optimize the code in order to help the architecture place all the instructions in a way that produces as few delays as possible. Optimizing the alignment and placement of the code is therefore crucial to the performance of the program on such architectures. We show that in some cases, without proper code alignment, performance can degrade by 40% due to the impact of code alignment on the grouping and pipeline delays. We present a new binary-level, profile-based code alignment algorithm for architectures that make use of group dispatching. We show a steady performance gain of about 2-3% for fully optimized code running on IBM POWER6, while completely eliminating the performance instability that can otherwise result in up to 40% variation in performance in the absence of the proposed code alignment algorithm. As the algorithm is based on gathered profiling and applies at the binary level, it can be used as part of existing dynamic compilers and enabled on top of the operating system at runtime.
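A toy sketch of the alignment decision, with the dispatch-group size, hotness threshold, and block layout all assumed for the example: hot basic blocks are padded with no-ops so that each starts on a group boundary and is not split across two dispatch groups.

```python
# Illustrative sketch (boundary size and policy assumed, not the paper's
# algorithm): insert no-op padding before hot basic blocks so that each one
# starts on a dispatch-group boundary.

GROUP_BYTES = 32      # assumed dispatch-group granularity
HOT_THRESHOLD = 1000  # profile count above which a block is worth aligning

# (block_name, size_in_bytes, profiled_execution_count) -- made-up layout
blocks = [
    ("prologue", 20, 5),
    ("loop_head", 24, 500_000),
    ("loop_body", 40, 500_000),
    ("cold_error_path", 36, 2),
]

addr = 0
for name, size, count in blocks:
    pad = 0
    if count >= HOT_THRESHOLD and addr % GROUP_BYTES:
        pad = GROUP_BYTES - addr % GROUP_BYTES   # nop bytes inserted
    addr += pad
    print(f"{name:16s} at {addr:4d} (+{pad} bytes of nop padding)")
    addr += size
```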
Haifa Verification Conference | 2007
Orna Raz; Moshe Klausner; Nitzan Peleg; Gadi Haber; Eitan Farchi; Shachar Fienblit; Yakov S. Filiarsky; Shay Gammer; Sergey Novikov
Code coverage is often defined as a measure of the degree to which the source code of a program has been tested [19]. Various metrics for measuring code coverage exist. The vast majority of these metrics require instrumenting the source code to produce coverage data. However, for certain coverage metrics, it is also possible to instrument object code to produce coverage data. Traditionally, such instrumentation has been considered inferior to source-level instrumentation because source code is the focus of code coverage. Our experience shows that object code instrumentation, specifically post-link instrumentation, can be very useful to users. Moreover, it not only alleviates certain side effects of source-level instrumentation, especially those related to compiler optimizations, but also lends itself to performance optimization that enables low-overhead instrumentation. Our experiments show an average of less than 1% overhead for instrumentation at the function level, and averages of 4.1% and 0.4% overhead for SPECint2000 and SPECfp2000, respectively, for instrumentation at the basic block level. This paper demonstrates the advantages of post-link coverage and describes effective methodology and technology for applying it.
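Conceptually, basic-block coverage instrumentation reduces to setting one bit per block on first execution and reporting the bitmap afterwards. The sketch below simulates this with a made-up function and block layout; the actual post-link instrumentation operates on the binary rather than on source.

```python
# Illustrative sketch (conceptual only, not the post-link instrumentation
# itself): each basic block sets one bit in a coverage bitmap when it first
# executes; coverage is then reported per block and per function.

# function -> list of basic block ids (made-up program structure)
functions = {
    "parse":   ["parse.0", "parse.1", "parse.err"],
    "compute": ["compute.0", "compute.loop", "compute.done"],
}

covered = set()

def bb_hook(block_id):
    # The instrumentation stub: one cheap "set bit" per block execution.
    covered.add(block_id)

def run_workload():
    # A made-up execution trace that never reaches the error path.
    for b in ["parse.0", "parse.1", "compute.0",
              "compute.loop", "compute.loop", "compute.done"]:
        bb_hook(b)

run_workload()
for fn, blocks in functions.items():
    hit = sum(1 for b in blocks if b in covered)
    print(f"{fn:8s} block coverage: {hit}/{len(blocks)}")
print("uncovered blocks:", sorted(set(sum(functions.values(), [])) - covered))
```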