Nadav Rotem | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Nadav Rotem is active.

Explore More

Publication

Featured researches published by Nadav Rotem.

international conference on hardware/software codesign and system synthesis | 2010

Automatic memory partitioning: increasing memory parallelism via data structure partitioning

Yosi Ben-Asher; Nadav Rotem

In high-level synthesis, pipelined designs are often restricted by the number of memory banks available to the synthesis system. Using multiple memory banks can improve the performance of accelerated applications. Currently, programmers must manually assign data structures to specific memory banks on the accelerator. This paper presents Automatic Memory Partitioning, a method for automatically partitioning data structures into multiple memory banks for increased parallelism and performance. We use source code instrumentation to collect memory traces in order to detect linear memory access patterns. The memory traces are used to split data structures into disjoint memory regions and determine which segments may benefit from parallel memory access. Experiments show significant improvements in performance while using a minimal number of memory banks.

international symposium on system-on-chip | 2008

Synthesis for variable pipelined function units

Yosi Ben-Asher; Nadav Rotem

Usually, in high level hardware synthesis, all functional units of the same type have a fixed known ldquolengthrdquo (number of stages) and the scheduler mainly determines when each unit is activated. We focus on scheduling techniques for the high-level synthesis of pipelined functional units where the number of stages of these operations is a free parameter of the synthesis. This problem is motivated by the ability to create pipelined functional units, such as multipliers, with different pipe lengths. These units have different characteristics in terms of parallelism level, frequency, latency, etc. In this paper presents the variable pipeline scheduler (VPS). The ability to synthesize variable pipelined units expands the known scheduling problem of high-level synthesis to include a 2D search for a minimal number of instances and their desired number of stages. The proposed search procedure is based on algorithms that find a local minima in a d-dimensional grid, thus avoiding the need to evaluate all possible points in the space. We have implemented a C language compiler for VPS. Our results demonstrate that using variable pipeline units can reduce the overall resource usage and improve the execution time.

field-programmable logic and applications | 2009

Binary Synthesis with multiple memory banks targeting array references

Yosi Ben Asher; Nadav Rotem

High-Level Synthesis (HLS) is the field of transforming a high-level programming language, such as C, into a register transfer level(RTL) description of the design. In HLS, Binary Synthesis[14] is a method for synthesizing existing compiled applications for which the source code is not available. One of the advantages of FPGAs over software is the availability of multiple memory banks. Until now, binary synthesis systems have not made use of the multiple memory banks on FPGAs. In our work, we decompile the binary executable into an intermediate representation, and we target architectures with multiple memory banks and multiple memory ports. We present methods for detecting memory regions and synthesis of the decompiled code. The proposed methods accelerate the execution time of applications which use multiple memory regions concurrently.

Journal of Systems Architecture | 2010

Finding the best compromise in compiling compound loops to Verilog

Yosi Ben-Asher; Nadav Rotem; Eddie Shochat

In this work we consider a special optimization problem involved with compiling compound loops (combining nested and consecutive sub-loops) to Verilog. Each sub-loop of the compound loop may require a different optimized hardware configuration (OHC) for optimized execution times. For example, one loop requires at least two memory ports and one multiplier for an optimized execution time, while another loop may require only one memory port but two multipliers, yet one OHC should be selected for both loops. The goal is to compute a minimal OHC which, based on the different heat levels (expected number of iterations) of the sub-loops, is a good compromise between all the conflicting requirements of each sub-loop. Though synthesis of nested loops has been implemented in quite a few systems this aspect has not been considered so far. We avoid the use of time consuming integer linear programming (ILP) techniques and instead use a fast space exploration technique combined with an efficient variant of list scheduling. Another novel aspect of the proposed system is the observation that the real latencies of the hardware units should be considered as variables of the OHC rather than fixed real values as is usually done in high-level synthesis systems. Experimental results show a significant improvement in the OHC without a significant increase in the execution time due to the use of this search procedure.

ACM Transactions on Reconfigurable Technology and Systems | 2010

Reducing Memory Constraints in Modulo Scheduling Synthesis for FPGAs

Yosi Ben-Asher; Danny Meisler; Nadav Rotem

In High-Level Synthesis (HLS), extracting parallelism in order to create small and fast circuits is the main advantage of HLS over software execution. Modulo Scheduling (MS) is a technique in which a loop is parallelized by overlapping different parts of successive iterations. This ability to extract parallelism makes MS an attractive synthesis technique for loop acceleration. In this work we consider two problems involved in the use of MS which are central when targeting FPGAs. Current MS scheduling techniques sacrifice execution times in order to meet resource and delay constraints. Let “ideal” execution times be the ones that could have been obtained by MS had we ignored resource and delay constraints. Here we pose the opposite problem, which is more suitable for HLS, namely, how to reduce resource constraints without sacrificing the ideal execution time. We focus on reducing the number of memory ports used by the MS synthesis, which we believe is a crucial resource for HLS. In addition to reducing the number of memory ports we consider the need to develop MS techniques that are fast enough to allow interactive synthesis times and repeated applications of the MS to explore different possibilities of synthesizing the circuits. Current solutions for MS synthesis that can handle memory constraints are too slow to support interactive synthesis. We formalize the problem of reducing the number of parallel memory references in every row of the kernel by a novel combinatorial setting. The proposed technique is based on inserting dummy operations in the kernel and by doing so, performing modulo-shift operations such that the maximal number of parallel memory references in a row is reduced. Experimental results suggest improved execution times for the synthesized circuit. The synthesis takes only a few seconds even for large-size loops.

ACM Transactions in Embedded Computing Systems | 2013

Using memory profile analysis for automatic synthesis of pointers code

Yosi Ben-Asher; Nadav Rotem

One of the main advantages of high-level synthesis (HLS) is the ability to synthesize circuits that can access multiple memory banks in parallel. Current HLS systems synthesize parallel memory references based on explicit array declarations in the source code. We consider the need to synthesize not only array references but also memory operations targeting pointers and dynamic data structures. This paper describes Automatic Memory Partitioning, a method for automatically synthesizing general data structures (arrays and pointers) into multiple memory banks for increased parallelism and performance. We use source code instrumentation to collect memory traces in order to detect linear memory access patterns. The memory traces are used to split data structures into disjoint memory regions and determine which segments may benefit from parallel memory access. We present an algorithm for allocating memory segments into multiple memory banks. Experiments show significant improvements in performance while conserving the number of memory banks.

Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference on | 2009

The effect of unrolling and inlining for Python bytecode optimizations

Yosi Ben Asher; Nadav Rotem

In this study, we consider bytecode optimizations for Python, a programming language which combines object-oriented concepts with features of scripting languages, such as dynamic dictionaries. Due to its design nature, Python is relatively slow compared to other languages. It operates through compiling the code into powerful bytecode instructions that are executed by an interpreter. Pythons speed is limited due to its interpreter design, and thus there is a significant need to optimize the language. In this paper, we discuss one possible approach and limitations in optimizing Python based on bytecode transformations. In the first stage of the proposed optimizer, the bytecode is expanded using function inline and loop unrolling. The second stage of transformations simplifies the bytecode by applying a complete set of data-flow optimizations, including constant propagation, algebraic simplifications, dead code elimination, copy propagation, common sub expressions elimination, loop invariant code motion and strength reduction. While these optimizations are known and their implementation mechanism (data flow analysis) is well developed, they have not been successfully implemented in Python due to its dynamic features which prevent their use. In this work we attempt to understand the dynamic features of Python and how these features affect and limit the implementation of these optimizations. In particular, we consider the significant effects of first unrolling and then inlining on the ability to apply the remaining optimizations. The results of our experiments indicate that these optimizations can indeed be implemented and dramatically improve execution times.

ACM Transactions on Architecture and Code Optimization | 2013

Hybrid type legalization for a sparse SIMD instruction set

Yosi Ben Asher; Nadav Rotem

SIMD vector units implement only a subset of the operations used by vectorizing compilers, and there are multiple conflicting techniques to legalize arbitrary vector types into register-sized data types. Traditionally, type legalization is performed using a set of predefined rules, regardless of the operations used in the program. This method is not suitable to sparse SIMD instruction sets and often prevents the vectorization of programs. In this work we introduce a new technique for type legalization, namely vector element promotion, as well as a hybrid method for combining multiple techniques of type legalization. Our hybrid type legalization method makes decisions based on the knowledge of the available instruction set as well as the operations used in the program. Our experimental results demonstrate that program-dependent hybrid type legalization improves the execution time of vector programs, outperforms the existing legalization method, and allows the vectorization of workloads which were not vectorized before.

IEEE Computer Architecture Letters | 2014

Block Unification IF-conversion for High Performance Architectures

Nadav Rotem; Yosi Ben Asher

Graphics Processing Units accelerate data-parallel graphic calculations using wide SIMD vector units. Compiling programs to use the GPUs SIMD architectures require converting multiple control flow paths into a single stream of instructions. IF-conversion is a compiler transformation, which converts control dependencies into data dependencies, and it is used by vectorizing compilers to eliminate control flow and enable efficient code generation. In this work we enhance the IF-conversion transformation by using a block unification method to improve the currently used block flattening method. Our experimental results demonstrate that our IF-conversion method is effective in reducing the number of predicated instructions and in boosting kernel execution speed.

Design Automation for Embedded Systems | 2011

Combining static and dynamic array detection for binary synthesis with multiple memory ports

Nadav Rotem; Yosi Ben Asher

In High-Level Synthesis, Binary Synthesis is a method for synthesizing compiled applications for which the source code is not available. One of the advantages of FPGAs over processors is the availability of multiple internal and external memory banks. Binary synthesis tools use multiple memory banks if they are able to recover data-structures from the binary. In this work we improve the recovery of data-structures by introducing dynamic memory analysis and combining it with improved static memory analysis. We show that many applications can only be synthesized using dynamic memory analysis. We present two FPGA based architectures for implementing the bound-checking and recovery for the synthesized code. Our experiments show that the proposed technique accelerates the execution of applications which use multiple memory banks concurrently. We demonstrate that many binary applications indeed benefit from this technique.

Explore More