Publication


Featured research published by Bilha Mendelson.


IEEE Computer | 1988

A data-driven VLSI array for arbitrary algorithms

Israel Koren; Bilha Mendelson; Irit Peled; Gabriel M. Silberman

The design of specialized processing array architectures, capable of executing any given arbitrary algorithm, is proposed. An approach is adopted in which the algorithm is first represented in the form of a dataflow graph and then mapped onto the specialized processor array. The processors in this array execute the operations included in the corresponding nodes (or subsets of nodes) of the dataflow graph, while regular interconnections of these elements serve as edges of the graph. To speed up the execution, the proposed array allows the generation of computation fronts and their cancellation at a later time, depending on the arriving data operands; thus it is called a data-driven array. The structure of the basic cell and its programming are examined. Some design details are presented for two selected blocks, the instruction memory and the flag array. A scheme for mapping a dataflow graph (program) onto a hexagonally connected array is described and analyzed. Two distinct performance measures, mapping efficiency and array utilization, are defined, and some performance results are discussed.
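The firing rule that makes such an array "data-driven" can be illustrated with a tiny interpreter: a node executes as soon as all of its operands have arrived, rather than in program order. This is a minimal sketch of the execution model only; the graph, operation set, and names are illustrative and not taken from the paper.

```python
# Minimal dataflow interpreter: a node fires when all its operands are
# available; each iteration of the loop fires one "computation front".
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def run_dataflow(nodes, inputs):
    """nodes: {name: (op, [operand names])}; inputs: {name: value}."""
    values = dict(inputs)
    pending = dict(nodes)
    while pending:
        # Every node whose operands are all present fires in this front.
        front = [n for n, (op, args) in pending.items()
                 if all(a in values for a in args)]
        if not front:
            raise ValueError("graph is not executable (cycle or missing input)")
        for n in front:
            op, args = pending.pop(n)
            values[n] = OPS[op](*(values[a] for a in args))
    return values

# (x + y) * (x + 1): both additions fire in the same front, then the multiply.
result = run_dataflow(
    {"s1": ("add", ["x", "y"]), "s2": ("add", ["x", "one"]),
     "p": ("mul", ["s1", "s2"])},
    {"x": 3, "y": 4, "one": 1})
print(result["p"])  # 28
```

In the hardware array, each such node (or a subset of nodes) is placed on a processing element, and the front-by-front firing is what exposes parallelism between independent operations.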


international conference on parallel architectures and compilation techniques | 2007

Detecting Change in Program Behavior for Adaptive Optimization

Nitzan Peleg; Bilha Mendelson

Feedback information has proven useful in guiding optimizations in compilers and post-link optimizers. Program performance behavior can change over time and may invalidate the feedback information. Low overhead monitoring can be used to detect such changes, using performance metrics such as CPI. On a loaded SMT system, where other threads are simultaneously active on the same CPU, the CPI shows large variability. We introduce an efficient monitoring method that is insensitive to other activities in the system and can be safely used to collect program behavior on a loaded SMT system. The overhead of this method is 0.58% on SPECint2000. We also introduce a novel transformation of the program behavior representation, which makes it insensitive to code optimizations and enables a comparison of the program behavior collected in different optimization cycles. This approach opens new opportunities and enables adaptive optimizations on modern SMT architectures.
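The core idea of comparing behavior across profiling cycles can be sketched as follows: basic-block execution counts are normalized to relative frequencies, so the comparison is insensitive to overall run length, and a distance over the two profiles flags a change. The threshold and the distance metric here are illustrative, not the paper's exact formulation.

```python
# Hedged sketch of behavior-change detection between two profiling cycles.
def normalize(profile):
    """Convert raw basic-block counts to relative frequencies."""
    total = sum(profile.values())
    return {bb: c / total for bb, c in profile.items()}

def behavior_changed(old, new, threshold=0.1):
    """Manhattan distance between normalized basic-block profiles."""
    p, q = normalize(old), normalize(new)
    blocks = set(p) | set(q)
    distance = sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in blocks)
    return distance > threshold

# Doubling all counts is not a change; a shift of hot blocks is.
stable  = behavior_changed({"bb1": 100, "bb2": 50}, {"bb1": 200, "bb2": 100})
shifted = behavior_changed({"bb1": 100, "bb2": 50}, {"bb1": 10, "bb2": 140})
print(stable, shifted)  # False True
```

Normalizing first is what makes the representation robust to run-length and load effects, in the same spirit as the paper's insensitivity to other system activity.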


symposium on code generation and optimization | 2003

Optimization opportunities created by global data reordering

Gadi Haber; Moshe Klausner; Vadim Eisenberg; Bilha Mendelson; Maxim Gurevich

Memory access has proven to be one of the bottlenecks in modern architectures. Improving memory locality and reducing the number of memory accesses can help relieve this bottleneck. We present a method for link-time profile-based optimization by reordering the global data of the program and modifying its code accordingly. The proposed optimization reorders the entire global data of the program according to a representative execution rate of each instruction (or basic block) in the code. The data reordering is done in a way that enables the replacement of frequently-executed Load instructions, which reference the global data, with fast Add Immediate instructions. In addition, it tries to improve the global data locality and to reduce the total size of the global data area. The optimization was implemented in FDPR (Feedback Directed Program Restructuring), a post-link optimizer that is part of the IBM AIX operating system for the IBM pSeries servers. Our results on SPECint2000 show a significant improvement of up to 11% (average 3%) in execution time, along with up to 97.9% (average 83%) reduction in memory references to the global variables via the global data access mechanism of the program.
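The reordering idea can be sketched simply: placing the hottest globals closest to the data-area base register means their addresses fit in a small signed immediate, so an address Load through the access table can be replaced by a single Add Immediate. This is a simplified model under assumed parameters (a 16-bit displacement, illustrative sizes and access counts), not FDPR's actual layout algorithm.

```python
# Hedged sketch: hottest globals first, so frequently referenced data lands
# within the immediate displacement reachable from the base register.
IMM_RANGE = 1 << 15  # signed 16-bit displacement (assumption for illustration)

def reorder_globals(globals_info):
    """globals_info: {name: (size_bytes, access_count)} -> {name: offset}."""
    order = sorted(globals_info, key=lambda g: -globals_info[g][1])
    offsets, cursor = {}, 0
    for g in order:
        offsets[g] = cursor
        cursor += globals_info[g][0]
    return offsets

def can_use_add_immediate(offset):
    """True if the global's address is computable with one Add Immediate."""
    return -IMM_RANGE <= offset < IMM_RANGE

layout = reorder_globals({"hot_counter": (8, 90000),
                          "big_table": (65536, 10),
                          "warm_flag": (4, 5000),
                          "cold_buf": (1024, 1)})
reachable = [g for g, off in layout.items() if can_use_add_immediate(off)]
print(reachable)  # ['hot_counter', 'warm_flag', 'big_table']
```

Here `cold_buf` ends up beyond the immediate range because the large, rarely accessed `big_table` precedes it, which is acceptable: only the hot accesses need the fast addressing.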


international symposium on performance analysis of systems and software | 2008

Trace-based Performance Analysis on Cell BE

Marina Biberstein; Uzi Shvadron; Javier Turek; Bilha Mendelson; Moon S. Chang

The transition to multicore architectures creates significant challenges for programming systems. Taking advantage of specialized processing cores such as those in the Cell BE processor and managing all the required data movement inside the processor cannot be done efficiently without help from the software infrastructure. Alongside new programming models and compiler support for multicores, programmers need performance evaluation and analysis tools. In this paper, we present tools that help analyze the performance of applications executing on the Cell platform. The performance debugging tool (PDT) provides a means for recording significant events during program execution, maintaining the sequential order of events, and preserving important runtime information such as core assignment and relative timing of events. The trace analyzer (TA) reads and visualizes the PDT traces. We describe the architecture of the PDT and present several important use cases demonstrating the usage of PDT and TA to understand the performance of several workloads. We also discuss the overhead of tracing and its impact on the benchmark execution and performance analysis.


empirical software engineering and measurement | 2013

Evaluating the FITTEST Automated Testing Tools: An Industrial Case Study

Cu D. Nguyen; Bilha Mendelson; Daniel Citron; Onn Shehory; Tanja E. J. Vos; Nelly Condori-Fernandez

This paper aims at evaluating a set of automated tools of the FITTEST EU project within an industrial case study. The case study was conducted at the IBM Research lab in Haifa, by a team responsible for building the testing environment for future development versions of an IBM system management product. The main function of that product is resource management in a networked environment. This case study has investigated whether current IBM Research testing practices could be improved or complemented by using some of the automated testing tools that were developed within the FITTEST EU project. Although the existing Test Suite from IBM Research (TSibm) that was selected for comparison is substantially smaller than the Test Suite generated by FITTEST (TSfittest), the effectiveness of TSfittest, measured by injected-fault coverage, is significantly higher (50% vs. 70%). With respect to efficiency, after normalizing execution times, we found that TSfittest runs faster (9.18 vs. 6.99), because TSfittest includes shorter tests. For testing the target product in the simulated environment at IBM Research, the FITTEST tools can increase the effectiveness of the current practice, and the automatically generated test cases can help identify the source of detected faults more efficiently. Moreover, the FITTEST tools have shown the ability to automate testing within a real industry case.


International Journal of Parallel Programming | 1995

Solutions and debugging for data consistency in multiprocessors with noncoherent caches

David Bernstein; Mauricio Breternitz; Ahmed Gheith; Bilha Mendelson

We analyze two important problems that arise in shared-memory multiprocessor systems. The stale data problem involves ensuring that data items in local memory of individual processors are current, independent of writes done by other processors. False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on software methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for the IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.
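The false-sharing condition described above lends itself to a simple trace-based check: a data block updated by two processors at different offsets is flagged, since naive write-back of either copy would lose the other's update. The block size and trace format below are illustrative assumptions, not the paper's detection technique.

```python
# Hedged sketch of false-sharing detection from a write trace.
BLOCK = 128  # cache block size in bytes (illustrative)

def find_false_sharing(writes):
    """writes: list of (processor_id, byte_address) -> set of block numbers."""
    per_block = {}
    for proc, addr in writes:
        per_block.setdefault(addr // BLOCK, {}) \
                 .setdefault(proc, set()).add(addr % BLOCK)
    flagged = set()
    for block, by_proc in per_block.items():
        if len(by_proc) < 2:
            continue  # only one processor writes this block
        offs = list(by_proc.values())
        # Two processors touching disjoint offsets of one block: false sharing.
        if any(a.isdisjoint(b) for i, a in enumerate(offs) for b in offs[i+1:]):
            flagged.add(block)
    return flagged

# Block 0: procs 0 and 1 write different words -> false sharing.
# Block 2: both write the same word -> a true sharing conflict, not flagged here.
trace = [(0, 0), (1, 64), (0, 256), (1, 256)]
print(find_false_sharing(trace))  # {0}
```

Distinguishing disjoint-offset updates from same-location updates is the key: the former can be silently fixed by combining writes, while the latter is an ordinary race.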


Journal of Parallel and Distributed Computing | 1992

Estimating the potential parallelism and pipelining of algorithms for data flow machines

Bilha Mendelson; Israel Koren

When porting an application to a parallel data driven machine is considered, the maximum achievable parallelism and pipelining need to be estimated. These estimates can be obtained in a hierarchical manner from a data flow graph representation of the given algorithm. A method for estimating these performance measures has been developed and is presented in this paper. Examples illustrating our method and comparing the estimated performance to simulation results are also included.
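One classical way to bound the achievable parallelism from a data flow graph is the ratio of total operation count ("work") to the longest dependence chain ("critical path"). The sketch below uses that standard work/span bound as an illustration of the kind of graph-level estimate involved; it is not the paper's hierarchical method, and unit node costs are an assumption.

```python
# Hedged sketch: parallelism bound = work / critical-path length,
# computed by memoized depth-first traversal of the dependence graph.
from functools import lru_cache

def parallelism_estimate(graph):
    """graph: {node: [predecessor nodes]}; every node costs one time unit."""
    @lru_cache(maxsize=None)
    def depth(node):
        preds = graph[node]
        return 1 + (max(map(depth, preds)) if preds else 0)
    work = len(graph)                       # total number of operations
    span = max(depth(n) for n in graph)     # longest dependence chain
    return work / span

# Four independent inputs feeding a two-level reduction tree: 7 ops, depth 3.
g = {"a": [], "b": [], "c": [], "d": [],
     "e": ["a", "b"], "f": ["c", "d"], "g": ["e", "f"]}
print(parallelism_estimate(g))  # 7/3, about 2.33
```

A simulation of the same graph on an unbounded machine would finish in 3 steps while executing 7 operations, which is exactly what this ratio summarizes.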


international conference on application specific array processors | 1993

Mapping algorithms onto a multiple-chip data-driven array

Bilha Mendelson; Israel Koren

Data-driven arrays provide high levels of parallelism and pipelining for algorithms with no internal regularity. Most of the methods previously developed for mapping algorithms onto processor arrays assumed an unbounded array (i.e., one in which there will always be a sufficient number of processing elements (PEs) for the mapping). Implementing such an array is not practical. A more practical approach would be to assign the PEs to chips and map the given algorithm onto the new array of chips. The authors describe a way to directly map algorithms onto a multiple-chip data-driven array, where each chip contains a limited number of PEs. There are two optimization steps in the mapping. The first is to produce an efficient mapping by minimizing the area (i.e., the number of PEs used) as well as optimizing the performance (pipeline period and latency) for the given algorithm, or finding a trade-off between area and performance. The second is to divide the unbounded array among several chips each containing a bounded number of PEs.
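The second step, dividing a mapped array among capacity-bounded chips, can be sketched with a greedy flood-fill that keeps connected PEs on the same chip to limit inter-chip edges. The heuristic below is illustrative, not the paper's partitioning algorithm.

```python
# Hedged sketch: partition PEs among chips of bounded capacity, preferring
# to place connected neighbors together so that few edges cross chips.
def partition_onto_chips(nodes, edges, chip_capacity):
    """Assign each PE (node) to a chip number; edges: list of (u, v) pairs."""
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    chip_of, next_chip = {}, 0
    for n in nodes:
        if n in chip_of:
            continue
        placed, frontier = 0, [n]
        # Flood-fill neighbors onto the current chip until it is full.
        while frontier and placed < chip_capacity:
            cur = frontier.pop()
            if cur in chip_of:
                continue
            chip_of[cur] = next_chip
            placed += 1
            frontier.extend(m for m in neighbors[cur] if m not in chip_of)
        next_chip += 1
    return chip_of

def inter_chip_edges(edges, chip_of):
    """Edges whose endpoints land on different chips (the communication cost)."""
    return sum(chip_of[u] != chip_of[v] for u, v in edges)

# A 6-PE chain split across chips of 3 PEs: only one edge crosses chips.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
assignment = partition_onto_chips(nodes, edges, chip_capacity=3)
print(inter_chip_edges(edges, assignment))  # 1
```

Minimizing crossing edges matters because inter-chip links are far slower than intra-chip interconnect, directly affecting the pipeline period of the mapped algorithm.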


Ibm Journal of Research and Development | 2009

Cell broadband engine processor performance optimization: tracing tools implementation and use

Marina Biberstein; Shiri Dori-Hacohen; Yuval Harel; Andre Heilper; Bilha Mendelson; Uzi Shvadron; Eran Treister; Javier Turek; Moon S. Chang

Optimizing performance on multicore processors is a daunting task because of the increased importance of such factors as thread communication, memory contention, and memory access latency. This paper presents two tools that programmers and performance analysts can use to understand application performance on the Cell Broadband Engine® (Cell/B.E.) processor: the Performance Debugging Tool (PDT) and the Trace Analyzer (TA). PDT traces user-space events, augmenting them with scheduling data from the operating system; those traces are then read, analyzed, and presented visually by the TA. This paper describes the implementation issues arising from the fact that a common low-overhead clock shared by all cores, essential for analysis and visualization, is not available on the Cell/B.E. processor. The TA employs an offline analysis to align the collected events to a common time base, using only thread-local timestamps, event order, and context switch information. We also discuss the overhead of tracing and its impact on execution and performance analysis. We illustrate the use of the PDT and TA by analyzing several significant Cell/B.E. processor workloads, including native code and higher-level abstractions offered by the Data Communication and Synchronization services. We show how trace analysis can help identify performance issues in these workloads and how it can be used by programmers to spot performance antipatterns (common programming practices leading to suboptimal performance).
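The alignment problem without a shared clock can be illustrated with a tiny example: for a pair of cores, a clock offset is estimated from cross-core event pairs whose ordering is known (a receive cannot precede its matching send), and local timestamps are shifted onto a common timeline. The event and trace format here are illustrative assumptions, not the PDT trace format or the TA's actual algorithm.

```python
# Hedged sketch of offline trace alignment from ordering constraints alone.
def align_traces(reference, other, pairs):
    """pairs: list of (send index in reference, recv index in other).
    Returns 'other' timestamps shifted so every recv follows its send."""
    # Smallest shift making recv_time + shift >= send_time for all pairs;
    # never shift backwards (a clock that started early stays as-is).
    shift = max(reference[s] - other[r] for s, r in pairs)
    return [t + max(shift, 0) for t in other]

core0 = [10, 20, 30]   # reference core's timeline
core1 = [2, 5, 9]      # this core's clock started later
aligned = align_traces(core0, core1, [(0, 0), (1, 2)])
print(aligned)  # [13, 16, 20]
```

With more cores, the same ordering constraints form a system that can be solved pairwise against one reference core; context-switch records supply additional known orderings when threads migrate.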


Archive | 1997

Method and apparatus for profile-based reordering of program portions in a computer program

Vita Bortnikov; Bilha Mendelson; Mark Novick; William Jon Schmidt; Inbal Shavit-Lottem
