Publication


Featured research published by Yosi Ben-Asher.


Journal of Parallel and Distributed Computing | 1995

Efficient self-simulation algorithms for reconfigurable arrays

Yosi Ben-Asher; Dan Gordon; Assaf Schuster

Perhaps the most basic question concerning a model for parallel computation is the self-simulation problem: given an algorithm designed for a large machine, can it be executed efficiently on a smaller one? In this work we give several positive answers to the self-simulation problem on dynamically reconfigurable meshes. We show that the simulation of a reconfiguring mesh by a smaller one can be carried out optimally, using standard methods, on meshes in which buses are established along rows or along columns. A novel technique achieves asymptotically optimal self-simulation on models that allow buses to switch between column and row edges, provided that a bus is a “linear” path of connected edges. Finally, for models in which a bus may be any sub-graph of the underlying mesh, efficient simulations are presented, paying an extra factor that is polylogarithmic in the size of the simulated mesh. Although the self-simulation algorithms are complex and require extensive bookkeeping, the required space is asymptotically optimal.
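The "standard methods" the abstract alludes to follow a simple tiling idea. The sketch below is a minimal illustration of that idea only, not the paper's algorithm: an N x N virtual mesh is simulated on a P x P machine by giving each physical processor an (N/P) x (N/P) block of virtual cells and replaying one virtual step per block, cell by cell. The step function itself is an invented toy stand-in.

```python
N, P = 8, 2          # virtual mesh size and physical mesh size (P divides N)

def step(grid):
    """One synchronous mesh step: each cell takes the max of itself and
    its 4-neighbours (a toy stand-in for a reconfigurable-mesh computation)."""
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(n):
        for j in range(n):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n:
                    new[i][j] = max(new[i][j], grid[a][b])
    return new

def simulate_on_smaller(grid, p):
    """Each of the p*p physical processors sequentially replays the step
    for its own (n/p) x (n/p) block of virtual cells."""
    n = len(grid)
    b = n // p
    new = [row[:] for row in grid]
    for bi in range(p):
        for bj in range(p):
            for i in range(bi * b, (bi + 1) * b):
                for j in range(bj * b, (bj + 1) * b):
                    best = grid[i][j]
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        a, c = i + di, j + dj
                        if 0 <= a < n and 0 <= c < n:
                            best = max(best, grid[a][c])
                    new[i][j] = best
    return new

grid = [[(i * N + j) % 13 for j in range(N)] for i in range(N)]
# The blocked replay produces exactly the same result as the full machine.
assert step(grid) == simulate_on_smaller(grid, P)
```

The slowdown of such a simulation is the block size, which is why it is optimal for row/column buses; the paper's contribution is achieving comparable bounds when buses may bend or branch.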


International Parallel and Distributed Processing Symposium | 2003

Heuristics for finding concurrent bugs

Yaniv Eytani; Eitan Farchi; Yosi Ben-Asher

This paper presents new heuristics that increase the probability of manifesting concurrent bugs. The heuristics are based on cross-run monitoring: a contended shared variable is chosen, and random context switching is performed at accesses to that variable. The relative strength of the new heuristics is analyzed. Compared with previous work, our heuristics increase the frequency of bug manifestation. In addition, the new heuristics were able to find bugs that previous methods did not discover.
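The core idea, forcing a context switch exactly at accesses to a contended shared variable, can be modeled deterministically. The sketch below is a toy model, not the paper's tool: each "thread" is a generator that yields right before completing its access to the shared variable, giving an explicit scheduler the chance to context-switch there.

```python
shared = {"x": 0}

def incrementer():
    """A non-atomic x += 1: read, then (after a possible switch) write."""
    tmp = shared["x"]          # read of the contended variable
    yield "after-read"         # potential context-switch point
    shared["x"] = tmp + 1      # write back

def run(schedule):
    """Run two incrementer threads under an explicit interleaving.
    `schedule` lists which thread to step next, e.g. [0, 0, 1, 1]."""
    shared["x"] = 0
    threads = [incrementer(), incrementer()]
    for tid in schedule:
        try:
            next(threads[tid])
        except StopIteration:
            pass
    return shared["x"]

# Serial schedule: no switch inside the read-modify-write -> correct result.
assert run([0, 0, 1, 1]) == 2
# Heuristic schedule: switching at the contended access makes both threads
# read 0, so one update is lost and the bug manifests.
assert run([0, 1, 0, 1]) == 1
```

In the real heuristics the switch is induced at runtime with random noise rather than an explicit schedule, but the effect targeted is the same lost-update interleaving.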


IEEE Transactions on Control Systems Technology | 2008

Distributed Decision and Control for Cooperative UAVs Using Ad Hoc Communication

Yosi Ben-Asher; Sharoni Feldman; Pini Gurfil; Moran Feldman

This study develops a novel distributed algorithm for task assignment (TA), coordination, and communication of multiple unmanned aerial vehicles (UAVs) engaging multiple targets and conceives an ad hoc routing algorithm for synchronization of target lists utilizing a distributed computing topology. Assuming limited communication bandwidth and range, coordination of UAV motion is achieved by implementing a simple behavioral flocking algorithm utilizing a tree topology for distributed flight coordination. Distributed TA is implemented by a relaxation process, wherein each node computes a temporary TA based on the union of the TAs of its neighbors in the tree. The computation of the temporary TAs at each node is based on weighted matching in the UAV-target distances graph. A randomized sampling mechanism is used to propagate TAs among different parts of the tree. Thus, changes in the location of the UAVs and targets do not pass through the root of the tree. Simulation experiments show that the combination of the flocking and the TA algorithms yields the best performance.
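One building block mentioned above is weighted matching on the UAV-target distance graph. The sketch below is a centralised, brute-force illustration of that single step (the coordinates and the matcher are invented; the paper computes TAs distributedly, per tree node, over the union of neighbours' TAs).

```python
from itertools import permutations

def dist(a, b):
    """Euclidean distance between two 2D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def assign(uavs, targets):
    """Return, for each UAV, the index of its assigned target,
    minimising the total UAV-target distance (brute force, small inputs)."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(targets)), len(uavs)):
        cost = sum(dist(uavs[i], targets[perm[i]]) for i in range(len(uavs)))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best

uavs = [(0, 0), (10, 0)]
targets = [(9, 1), (1, 1)]
assert assign(uavs, targets) == [1, 0]   # each UAV takes the nearer target
```

A production matcher would use the Hungarian algorithm instead of enumeration; the point here is only the objective being optimised at each node.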


International Conference on Hardware/Software Codesign and System Synthesis | 2010

Automatic memory partitioning: increasing memory parallelism via data structure partitioning

Yosi Ben-Asher; Nadav Rotem

In high-level synthesis, pipelined designs are often restricted by the number of memory banks available to the synthesis system. Using multiple memory banks can improve the performance of accelerated applications. Currently, programmers must manually assign data structures to specific memory banks on the accelerator. This paper presents Automatic Memory Partitioning, a method for automatically partitioning data structures into multiple memory banks for increased parallelism and performance. We use source code instrumentation to collect memory traces in order to detect linear memory access patterns. The memory traces are used to split data structures into disjoint memory regions and determine which segments may benefit from parallel memory access. Experiments show significant improvements in performance while using a minimal number of memory banks.
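The trace-analysis step described above can be sketched as follows. This is a hedged illustration (the names and the stride test are invented, and the real system works on instrumented source code): record per-structure address traces, flag structures whose accesses follow a constant stride, and give each linear structure its own memory bank so its accesses can proceed in parallel.

```python
def is_linear(trace):
    """True if consecutive addresses differ by one constant stride."""
    if len(trace) < 3:
        return False
    stride = trace[1] - trace[0]
    return all(b - a == stride for a, b in zip(trace, trace[1:]))

def partition(traces):
    """traces: {structure_name: [addresses]} -> {structure_name: bank_id}.
    Linear structures get private banks; the rest share bank 0."""
    banks, next_bank = {}, 1
    for name, trace in traces.items():
        if is_linear(trace):
            banks[name] = next_bank
            next_bank += 1
        else:
            banks[name] = 0
    return banks

traces = {
    "A": [1000, 1004, 1008, 1012],      # stride 4: linear access
    "B": [2000, 2008, 2016, 2024],      # stride 8: linear access
    "ptr": [3000, 3720, 3016, 3404],    # pointer chasing: irregular
}
assert partition(traces) == {"A": 1, "B": 2, "ptr": 0}
```

With A and B in separate banks, a loop reading both can issue their accesses in the same cycle, which is the parallelism gain the abstract refers to.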


Software: Practice and Experience | 1996

ParC: an extension of C for shared memory parallel processing

Yosi Ben-Asher; Dror G. Feitelson; Larry Rudolph

ParC is an extension of the C programming language with block-oriented parallel constructs that allow the programmer to express fine-grain parallelism in a shared-memory model. It is suitable for expressing parallel shared-memory algorithms and is also conducive to parallelizing sequential C programs. In addition, performance-enhancing transformations can be applied within the language, without resorting to low-level programming. The language includes closed constructs to create parallelism, as well as instructions to terminate parallel activities and to enforce synchronization. The parallel constructs are used to define the scope of shared variables and to delimit the sets of activities influenced by termination or synchronization instructions. The semantics of parallelism are discussed, especially the discrepancy between the limited number of physical processors and the potentially much larger number of parallel activities in a program.
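ParC itself is a C extension, but the semantics of its closed constructs can be mimicked in Python for illustration. The helper below is a rough analogue (not ParC syntax) of a `parfor`-style construct: it forks one activity per index, all sharing the enclosing scope's variables, and joins them before returning, which is the "closed construct" property, parallelism ending at the block's end.

```python
import threading

def parfor(n, body):
    """Fork n activities running body(i) in parallel, then join them all.
    The implicit join models ParC's closed-construct semantics."""
    threads = [threading.Thread(target=body, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # barrier at the end of the construct

result = [0] * 8                 # visible to all activities, as shared
parfor(8, lambda i: result.__setitem__(i, i * i))
assert result == [0, 1, 4, 9, 16, 25, 36, 49]
```

Because each activity writes a distinct slot, the example is race-free; expressing which variables are shared and where activities end is exactly what ParC's block scoping provides in the C setting.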


Workshop on I/O in Parallel and Distributed Systems | 2006

Producing scheduling that causes concurrent programs to fail

Yosi Ben-Asher; Yaniv Eytani; Eitan Farchi; Shmuel Ur

A noise maker is a tool that seeds a concurrent program with conditional synchronization primitives (such as yield()) in order to increase the likelihood that a bug manifests itself. This work explores the theory and practice of choosing where in the program to induce such thread switches at runtime. We introduce a novel fault model that classifies locations as “good”, “neutral”, or “bad”, based on the effect of a thread switch at the location. We validate our approach by experimenting with a set of programs taken from publicly available multi-threaded benchmarks. Our empirical evidence demonstrates that real-life behavior is similar to that derived from the model.
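The fault model can be demonstrated on a toy program (the program and labels below are invented, not the paper's benchmarks): for every location, force one context switch there, run to completion, and label the location “good” if the forced switch manifests the bug, or “neutral” if the outcome is unchanged.

```python
# Each thread performs x += 1 as three atomic steps: load, add, store.
LOAD, ADD, STORE = 0, 1, 2

def make_thread(state):
    local = {"tmp": 0}
    def step(op):
        if op == LOAD:
            local["tmp"] = state["x"]
        elif op == ADD:
            local["tmp"] += 1
        else:
            state["x"] = local["tmp"]
    return step

def run_with_switch_after(k):
    """Thread 0 runs its first k steps, then a forced switch lets
    thread 1 run entirely, then thread 0 finishes."""
    state = {"x": 0}
    t0, t1 = make_thread(state), make_thread(state)
    ops = [LOAD, ADD, STORE]
    for op in ops[:k]:
        t0(op)
    for op in ops:                # thread 1 runs through at the switch
        t1(op)
    for op in ops[k:]:
        t0(op)
    return state["x"]

labels = {k: "good" if run_with_switch_after(k) != 2 else "neutral"
          for k in range(4)}
# Switching inside thread 0's load/add/store loses an update (a noise
# maker should yield there); switching before or after it is harmless.
assert labels == {0: "neutral", 1: "good", 2: "good", 3: "neutral"}
```

A noise maker guided by such labels spends its yield() calls only at “good” locations, which is why location choice dominates bug-finding effectiveness.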


International Symposium on System-on-Chip | 2008

Synthesis for variable pipelined function units

Yosi Ben-Asher; Nadav Rotem

Usually, in high-level hardware synthesis, all functional units of the same type have a fixed, known “length” (number of stages), and the scheduler mainly determines when each unit is activated. We focus on scheduling techniques for the high-level synthesis of pipelined functional units where the number of stages of these operations is a free parameter of the synthesis. This problem is motivated by the ability to create pipelined functional units, such as multipliers, with different pipe lengths. These units have different characteristics in terms of parallelism level, frequency, latency, etc. This paper presents the variable pipeline scheduler (VPS). The ability to synthesize variable pipelined units expands the known scheduling problem of high-level synthesis to include a 2D search for a minimal number of instances and their desired number of stages. The proposed search procedure is based on algorithms that find a local minimum in a d-dimensional grid, thus avoiding the need to evaluate all possible points in the space. We have implemented a C language compiler for VPS. Our results demonstrate that using variable pipeline units can reduce overall resource usage and improve execution time.
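The grid-descent idea can be sketched as follows. This is a hedged illustration of the search strategy only, under an invented cost function standing in for "area plus schedule length": walk the (instances, stages) grid from a start point, moving to any cheaper axis-neighbour until none exists, so only a small fraction of the grid is ever evaluated.

```python
def local_min(cost, start, lo, hi):
    """Descend to a local minimum of `cost` on an integer grid bounded
    componentwise by lo and hi, evaluating only visited neighbours."""
    p = start
    while True:
        neighbours = []
        for d in range(len(p)):
            for delta in (-1, 1):
                q = list(p)
                q[d] += delta
                if lo[d] <= q[d] <= hi[d]:
                    neighbours.append(tuple(q))
        best = min(neighbours, key=cost)
        if cost(best) >= cost(p):
            return p              # no cheaper neighbour: local minimum
        p = best

# Toy cost: convex bowl with minimum at 3 instances, 5 pipeline stages.
cost = lambda p: (p[0] - 3) ** 2 + (p[1] - 5) ** 2
assert local_min(cost, (1, 1), (1, 1), (8, 8)) == (3, 5)
```

On a convex cost surface the walk reaches the global optimum; on real synthesis cost surfaces it finds a local minimum, which is the trade-off VPS accepts to avoid exhaustive evaluation.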


Journal of Systems Architecture | 2010

Finding the best compromise in compiling compound loops to Verilog

Yosi Ben-Asher; Nadav Rotem; Eddie Shochat

In this work we consider a special optimization problem involved in compiling compound loops (combining nested and consecutive sub-loops) to Verilog. Each sub-loop of the compound loop may require a different optimized hardware configuration (OHC) for optimized execution times. For example, one loop may require at least two memory ports and one multiplier for an optimized execution time, while another loop may require only one memory port but two multipliers, yet one OHC must be selected for both loops. The goal is to compute a minimal OHC which, based on the different heat levels (expected numbers of iterations) of the sub-loops, is a good compromise between the conflicting requirements of the sub-loops. Though synthesis of nested loops has been implemented in quite a few systems, this aspect has not been considered so far. We avoid time-consuming integer linear programming (ILP) techniques and instead use a fast space-exploration technique combined with an efficient variant of list scheduling. Another novel aspect of the proposed system is the observation that the real latencies of the hardware units should be treated as variables of the OHC rather than as fixed values, as is usually done in high-level synthesis systems. Experimental results show a significant improvement in the OHC without a significant increase in execution time due to the use of this search procedure.
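The compromise being described can be made concrete with a small model. The resource model and all numbers below are invented: each sub-loop needs some memory operations and multiplies per iteration, its schedule length under an OHC is bounded by its scarcest resource, and its weight is its heat level; the chosen OHC minimises heat-weighted time plus a hardware-area penalty.

```python
from itertools import product
from math import ceil

loops = [
    {"mem": 2, "mul": 1, "heat": 100},   # memory-hungry, very hot
    {"mem": 1, "mul": 2, "heat": 10},    # multiplier-hungry, cooler
]

def exec_time(loop, ports, muls):
    """Heat-weighted cycles: bounded by the scarcest resource."""
    return loop["heat"] * max(ceil(loop["mem"] / ports),
                              ceil(loop["mul"] / muls))

def best_ohc(area_weight):
    """Pick (memory ports, multipliers) minimising weighted time + area."""
    def score(cfg):
        ports, muls = cfg
        return (sum(exec_time(l, ports, muls) for l in loops)
                + area_weight * (ports + muls))
    return min(product(range(1, 4), repeat=2), key=score)

# Two memory ports pay for themselves on the hot loop, but a second
# multiplier does not, because the multiplier-bound loop is cold.
assert best_ohc(area_weight=20) == (2, 1)
```

The heat levels are what make this a compromise rather than a union of requirements: the cold loop simply runs slower on the shared OHC than it would on its own ideal configuration.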


ACM Transactions on Reconfigurable Technology and Systems | 2010

Reducing Memory Constraints in Modulo Scheduling Synthesis for FPGAs

Yosi Ben-Asher; Danny Meisler; Nadav Rotem

In high-level synthesis (HLS), the ability to extract parallelism in order to create small and fast circuits is the main advantage of HLS over software execution. Modulo scheduling (MS) is a technique in which a loop is parallelized by overlapping different parts of successive iterations. This ability to extract parallelism makes MS an attractive synthesis technique for loop acceleration. In this work we consider two problems involved in the use of MS that are central when targeting FPGAs. Current MS scheduling techniques sacrifice execution times in order to meet resource and delay constraints. Let “ideal” execution times be those that could have been obtained by MS had we ignored resource and delay constraints. Here we pose the opposite problem, which is more suitable for HLS: how to reduce resource constraints without sacrificing the ideal execution time. We focus on reducing the number of memory ports used by the MS synthesis, which we believe is a crucial resource for HLS. In addition, we consider the need for MS techniques fast enough to allow interactive synthesis times and repeated applications of MS to explore different ways of synthesizing the circuits. Current solutions for MS synthesis that can handle memory constraints are too slow to support interactive synthesis. We formalize the problem of reducing the number of parallel memory references in every row of the kernel in a novel combinatorial setting. The proposed technique is based on inserting dummy operations into the kernel and thereby performing modulo-shift operations so that the maximal number of parallel memory references in a row is reduced. Experimental results suggest improved execution times for the synthesized circuits. The synthesis takes only a few seconds, even for large loops.
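A stripped-down sketch of the balancing idea follows. Dependences are ignored here (the real technique must preserve them via the inserted dummy operations), and the greedy rule is illustrative: moving an operation one row down in the kernel (a modulo shift) lets us even out the number of parallel memory references per row without changing the initiation interval.

```python
def balance(rows, max_refs):
    """rows: per-row memory-reference counts in an MS kernel (len = II).
    Repeatedly modulo-shift one reference out of the fullest row into the
    next row until every row fits within `max_refs` memory ports."""
    rows = rows[:]
    total, ii = sum(rows), len(rows)
    if total > max_refs * ii:
        return None               # cannot fit without raising the II
    for _ in range(10 * total):   # generous bound on greedy steps
        worst = max(range(ii), key=lambda r: rows[r])
        if rows[worst] <= max_refs:
            return rows
        rows[worst] -= 1          # shift one memory op down one row
        rows[(worst + 1) % ii] += 1
    return rows

# A kernel with II = 4 whose rows issue 4,0,1,1 memory references is
# rebalanced so two memory ports suffice, at the same II.
assert balance([4, 0, 1, 1], max_refs=2) == [2, 2, 1, 1]
```

The feasibility test `total <= max_refs * II` captures the paper's premise: as long as the total memory traffic fits the ports times the II, rebalancing (rather than slowing the schedule) should be possible.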


ACM Transactions on Embedded Computing Systems | 2013

Using memory profile analysis for automatic synthesis of pointers code

Yosi Ben-Asher; Nadav Rotem

One of the main advantages of high-level synthesis (HLS) is the ability to synthesize circuits that can access multiple memory banks in parallel. Current HLS systems synthesize parallel memory references based on explicit array declarations in the source code. We consider the need to synthesize not only array references but also memory operations targeting pointers and dynamic data structures. This paper describes Automatic Memory Partitioning, a method for automatically synthesizing general data structures (arrays and pointers) into multiple memory banks for increased parallelism and performance. We use source code instrumentation to collect memory traces in order to detect linear memory access patterns. The memory traces are used to split data structures into disjoint memory regions and determine which segments may benefit from parallel memory access. We present an algorithm for allocating memory segments into multiple memory banks. Experiments show significant improvements in performance while conserving the number of memory banks.
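The final allocation step, placing memory segments into a fixed number of banks, can be sketched with plain greedy load balancing (the heuristic and numbers here are illustrative, not necessarily the paper's algorithm): place each segment, hottest first, on the currently least-loaded bank so parallel accesses spread across the available banks.

```python
def allocate(segments, n_banks):
    """segments: {name: access_count} -> {name: bank_id}.
    Greedy: hottest segment first, always onto the lightest bank."""
    load = [0] * n_banks
    placement = {}
    for name, hits in sorted(segments.items(), key=lambda kv: -kv[1]):
        bank = min(range(n_banks), key=lambda b: load[b])
        placement[name] = bank
        load[bank] += hits
    return placement

segments = {"A": 90, "B": 60, "C": 50, "ptr_heap": 10}
# A alone claims bank 0; B and C go to the lighter bank 1; the small
# heap segment then rejoins bank 0, keeping the loads close.
assert allocate(segments, n_banks=2) == {"A": 0, "B": 1, "C": 1, "ptr_heap": 0}
```

"Conserving the number of memory banks" in the abstract corresponds to fixing `n_banks` here and balancing within it, rather than opening a new bank per structure.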

Collaboration


Dive into Yosi Ben-Asher's collaborations.

Top Co-Authors


Assaf Schuster

Technion – Israel Institute of Technology


Moran Feldman

École Polytechnique Fédérale de Lausanne


Yaniv Eytani

University of Illinois at Urbana–Champaign
