
Publication


Featured research published by Dorit Nuzman.


Programming Language Design and Implementation (PLDI) | 2006

Auto-vectorization of interleaved data for SIMD

Dorit Nuzman; Ira Rosen; Ayal Zaks

Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data-reorganization manipulations. Computations on non-contiguous, and especially interleaved, data appear in important applications that can greatly benefit from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore an ambitious challenge for both programmers and vectorizing compilers. We demonstrate an automatic compilation scheme that supports effective vectorization in the presence of interleaved data with constant strides that are powers of 2, facilitating data reorganization. We show how our vectorization scheme applies to dominant SIMD architectures, and present experimental results on a wide range of key kernels, showing speedups in execution time of up to 3.7 for interleaving levels (strides) as high as 8.
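
As a concrete illustration (our own sketch, not code from the paper), consider a stride-2 kernel of the kind such a scheme targets: the real and imaginary parts of complex numbers are interleaved in memory, so the vectorizer must load 2*VF consecutive elements and de-interleave them with extract-even/extract-odd style shuffles before operating on whole vectors of real and imaginary parts.

    /* Illustrative stride-2 interleaved kernel (not from the paper). */
    void complex_mul(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++) {
            float ar = a[2*i], ai = a[2*i + 1];   /* stride-2 loads  */
            float br = b[2*i], bi = b[2*i + 1];
            c[2*i]     = ar * br - ai * bi;       /* stride-2 stores */
            c[2*i + 1] = ar * bi + ai * br;
        }
    }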


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2008

Outer-loop vectorization: revisited for short SIMD architectures

Dorit Nuzman; Ayal Zaks

Vectorization has been an important method of using data-level parallelism to accelerate scientific workloads on vector machines such as the Cray for the past three decades. In the last decade it has also proven useful for accelerating multimedia and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been directed at innermost loops, effectively executing their iterations concurrently as much as possible. Outer-loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer-loop vectorization has traditionally been performed by interchanging an outer loop with the innermost loop and then vectorizing it in the innermost position. A more direct unroll-and-jam approach can be used to vectorize an outer loop without involving loop interchange, which can be especially suitable for short SIMD architectures. In this paper we revisit the method of outer-loop vectorization, paying special attention to the properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost-loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves speedup factors of 3.13 and 2.77 on average across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost-loop vectorization, when running on Cell BE SPU and PowerPC970 processors, respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.
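
A hypothetical loop nest (ours, not from the paper's benchmark set) where this trade-off shows up: the inner loop is a short reduction whose vectorization would require a costly reduction epilogue, while the outer loop's iterations are independent, so direct outer-loop vectorization can compute VF outputs at once, conceptually by unroll-and-jam, without interchanging loops.

    /* Hypothetical FIR-style nest (not the paper's benchmark code). */
    void fir(const float *in, const float *coef, float *out, int n, int ntaps)
    {
        for (int i = 0; i < n; i++) {        /* outer loop: independent iterations */
            float acc = 0.0f;
            for (int t = 0; t < ntaps; t++)  /* inner loop: short reduction */
                acc += in[i + t] * coef[t];
            out[i] = acc;
        }
    }

Vectorizing the outer loop keeps each reduction scalar per vector lane, accumulating the VF independent sums out[i..i+VF-1] in a single vector register.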


Symposium on Code Generation and Optimization (CGO) | 2006

Multi-platform Auto-vectorization

Dorit Nuzman; Richard Henderson

The recent proliferation of the single instruction multiple data (SIMD) model has led to a wide variety of implementations. These have been incorporated into many platforms, from gaming machines and DSPs to general-purpose architectures. In this paper, we present an automatic vectorizer as implemented in GCC, the most multi-targetable compiler available today. We discuss the considerations involved in developing a multi-platform vectorization technology, and demonstrate how our vectorization scheme is suited to a variety of SIMD architectures. Experiments on four different SIMD platforms demonstrate that our automatic vectorization scheme is able to efficiently support individual platforms, achieving significant speedups on key kernels.
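
As a minimal illustration (our sketch; the flag spellings reflect the GCC 4.x era the paper describes and may vary between releases), the same target-independent vectorization pass handles a simple loop for whichever SIMD backend is selected:

    /* saxpy-style loop; `restrict` lets GCC prove the accesses independent.
     * powerpc: gcc -O2 -maltivec -ftree-vectorize saxpy.c
     * x86:     gcc -O2 -msse2    -ftree-vectorize saxpy.c
     * One vectorization pass emits AltiVec or SSE2 depending on the backend. */
    void saxpy(float a, const float *x, float *restrict y, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }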


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2009

Polyhedral-Model Guided Loop-Nest Auto-Vectorization

Konrad Trifunovic; Dorit Nuzman; Albert Cohen; Ayal Zaks; Ira Rosen

Optimizing compilers strive to construct efficient executables by applying sequences of transformations. Additional transformations are constantly being devised, with various mutual interactions among them, thereby exacerbating the notoriously difficult phase-ordering problem: that of deciding which transformations to apply and in which order. Fortunately, new infrastructures such as the polyhedral compilation framework host a variety of transformations, facilitating the efficient exploration and configuration of multiple transformation sequences. Many powerful optimizations, however, remain external to the polyhedral framework, with potential mutual interactions that need to be considered. In this paper we examine the interactions between loop transformations of the polyhedral compilation framework and subsequent vectorization optimizations targeting fine-grain SIMD data-level parallelism. Automatic vectorization involves many low-level, target-specific considerations and transformations, which currently exclude it from being part of the polyhedral framework. In order to consider potential interactions among polyhedral loop transformations and vectorization, we first model the performance impact of the different loop transformations and vectorization strategies, and then show how this cost model can be integrated seamlessly into the polyhedral representation. This predictive modelling then facilitates efficient exploration and educated decision making on how to best apply various polyhedral loop transformations while considering the subsequent effects of different vectorization schemes. Our work demonstrates the feasibility and benefit of tuning the polyhedral model in the context of vectorization. Experimental results confirm that our model makes accurate predictions, providing speedups of over 2x on average over traditional innermost-loop vectorization on PowerPC970 and Cell-SPU SIMD platforms.
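
A toy example (ours, not from the paper) of the interaction being modeled: with the loop order below, the innermost loop walks a column of a row-major array, so innermost-loop vectorization needs strided accesses; interchanging i and j, one of the polyhedral transformations considered, makes the innermost accesses unit-stride and changes which vectorization strategy the cost model should prefer.

    #define N 1024
    /* Toy nest: with j outer and i inner, the innermost loop accesses
     * a[i][j] for consecutive i, a stride of N floats in row-major layout. */
    void add_columns(float a[N][N], const float b[N][N])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += b[i][j];
    }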


Symposium on Code Generation and Optimization (CGO) | 2011

Vapor SIMD: Auto-vectorize once, run everywhere

Dorit Nuzman; Sergei Dyshel; Erven Rohou; Ira Rosen; Kevin Williams; David Yuste; Albert Cohen; Ayal Zaks

Just-in-Time (JIT) compiler technology offers portability while facilitating target- and context-specific specialization. Single-Instruction-Multiple-Data (SIMD) hardware is ubiquitous and markedly diverse, but can be difficult for JIT compilers to efficiently target due to resource and budget constraints. We present our design for a synergistic auto-vectorizing compilation scheme. The scheme is composed of an aggressive, generic offline stage coupled with a lightweight, target-specific online stage. Our method leverages the optimized intermediate results provided by the first stage across disparate SIMD architectures from different vendors, having distinct characteristics ranging from different vector sizes, memory alignment and access constraints, to special computational idioms. We demonstrate the effectiveness of our design using a set of kernels that exercise innermost loop, outer loop, as well as straight-line code vectorization, all automatically extracted by the common offline compilation stage. This results in performance comparable to that provided by specialized monolithic offline compilers. Our framework is implemented using open-source tools and standards, thereby promoting interoperability and extendibility.
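
The portable bytecode itself is not shown in the abstract; as a loose analogy (our illustration, not the project's actual format), GCC's generic vector extensions capture the same "express vector operations once, lower per target" idea: the backend maps each generic operation to SSE, AltiVec, and so on, or falls back to scalar code when no SIMD unit exists.

    /* Target-neutral vector code via GCC's generic vector extensions. */
    typedef float v4sf __attribute__((vector_size(16)));  /* 4 x float */

    v4sf madd(v4sf a, v4sf b, v4sf c)
    {
        return a * b + c;  /* one generic expression, many target encodings */
    }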


International Journal of Parallel Programming | 2011

ACOTES project: Advanced compiler technologies for embedded streaming

Eduard Ayguadé; Cédric Bastoul; Paul M. Carpenter; Zbigniew Chamski; Albert Cohen; Marco Cornero; Philippe Dumont; Marc Duranton; Mohammed Fellahi; Roger Ferrer; Razya Ladelsky; Menno Lindwer; Xavier Martorell; Cupertino Miranda; Dorit Nuzman; Andrea Ornstein; Antoniu Pop; Sebastian Pop; Louis-Noël Pouchet; Alex Ramirez; David Ródenas; Erven Rohou; Ira Rosen; Uzi Shvadron; Konrad Trifunovic; Ayal Zaks

Streaming applications are built of data-driven computational components that consume and produce unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions whose goal is to improve the programmer's productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another, more recent, approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative effort of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of the Advanced Compiler Technologies that we developed to support Embedded Streaming.


Computing Frontiers | 2008

Compiling for an indirect vector register architecture

Dorit Nuzman; Mircea Namolaru; Ayal Zaks; Jeff H. Derby

The iVMX architecture contains a novel vector register file of up to 4096 vector registers, accessed indirectly via a mapping mechanism, providing compatibility with the VMX architecture and potential for dramatic performance benefits [7]. The large number of vector registers and the unique indirection mechanism pose challenges for the compiler to use them efficiently: the indirection mechanism emphasizes spatial locality of registers and interaction among destination and source operands during register allocation, and the many vector registers call for aggressive automatic vectorization. This work is a first step in addressing the compilability of iVMX, following the presentation and validation of its architectural aspects [7]. In this paper we present several compilation approaches to deal with the mapping mechanism, and an outer-loop vectorization transformation developed to promote the use of many vector registers. We modified an existing register allocator to target all available registers and added a post-pass to rename live ranges considering spatial locality and interaction among operand types. An FIR filter is used to demonstrate the effectiveness of the techniques developed, compared to a version hand-optimized for iVMX. Initial results show that we can reduce the overhead of map management down to 29% of the total instruction count, compared to 22% obtained manually and 49% obtained using a naive scheme, while outperforming an equivalent VMX implementation by a factor of 2.
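
A conceptual model of the indirection (all names and constants below are our hypothetical illustration; the actual iVMX encoding is defined in the cited reference [7]): instructions carry short architectural register numbers, and a map translates each number into an index into the large physical file, so remapping exposes far more registers than the instruction encoding could name directly.

    #include <stdint.h>

    #define PHYS_REGS 4096   /* size of the iVMX vector register file      */
    #define ARCH_REGS 32     /* register numbers an instruction can encode */

    /* Hypothetical model of the map: arch register number -> physical register. */
    typedef struct { uint16_t base[ARCH_REGS]; } vreg_map;

    static inline uint16_t resolve(const vreg_map *m, uint8_t arch_reg)
    {
        return (uint16_t)(m->base[arch_reg] % PHYS_REGS);
    }

Register allocation then has to keep values that are used together in map entries that are loaded together, which is the spatial-locality pressure the abstract describes.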


High Performance Embedded Architectures and Compilers (HiPEAC) | 2011

Speculatively vectorized bytecode

Erven Rohou; Sergei Dyshel; Dorit Nuzman; Ira Rosen; Kevin Williams; Albert Cohen; Ayal Zaks

Diversity is a confirmed trend in computing systems, which present a complex and moving target to software developers. Virtual machines and just-in-time compilers have been proposed to mitigate the complexity of these systems. They do so by offering a single, stable abstract machine model, thereby hiding architectural details from programmers. SIMD capabilities are common among current and expected computing systems, and efficient exploitation of SIMD instructions has become crucial for the performance of many applications. Existing auto-vectorizers operate within traditional static optimizing compilers and use details about the target architecture when generating SIMD instructions. Unfortunately, auto-vectorizers are currently too complex to be included in a constrained Just-In-Time (JIT) environment. In this paper we propose Vapor SIMD: a speculative approach to effective just-in-time vectorization. Vapor SIMD first applies complex ahead-of-time techniques to vectorize source code and produce bytecode in a standard portable format. Advanced JIT compilers can then quickly tailor this bytecode to exploit the SIMD capabilities of appropriate platforms, yielding up to 14.7x and 11.8x speedups on x86 and PowerPC platforms (including JIT-compilation time). JIT compilers can also seamlessly revert to non-vector code in the absence of SIMD capabilities, or in the case of a third-party non-vectorizing JIT compiler, yielding 93% or more of the original performance.


ACM Transactions on Architecture and Code Optimization | 2013

JIT technology with C/C++: Feedback-directed dynamic recompilation for statically compiled languages

Dorit Nuzman; Revital Eres; Sergei Dyshel; Marcel Zalmanovici; José G. Castaños

The growing gap between the advanced capabilities of static compilers as reflected in benchmarking results and the actual performance that users experience in real-life scenarios makes client-side dynamic optimization technologies imperative for statically compiled languages. Dynamic optimization of software distributed in the form of a platform-agnostic Intermediate Representation (IR) has been very successful in the domain of managed languages, greatly improving upon interpreted code, especially when online profiling is used. However, can such feedback-directed, IR-based dynamic code generation be viable in the domain of statically compiled, rather than interpreted, languages? We show that fat binaries, which combine the IR together with the statically compiled executable, can provide a practical solution for software vendors, allowing their software to be dynamically optimized without the limitation of binary-level approaches, which lack the high-level IR of the program, and without the warm-up costs associated with the IR-only software distribution approach. We describe and evaluate the fat-binary-based runtime compilation approach using SPECint2006, demonstrating that the overheads it incurs are low enough to be successfully surmounted by dynamic optimization. Building on Java JIT technologies, our results already show improvements in common real-world usage scenarios, including very small workloads.
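
As a loose analogy for the fat-binary idea (our illustration; the paper's toolchain is its own), GCC can likewise embed its GIMPLE IR next to the machine code in an object file, so one artifact supports both immediate native execution and later IR-level re-optimization:

    /* Build with: gcc -O2 -flto -ffat-lto-objects -c hot.c
     * hot.o then carries both native code and GIMPLE IR.     */
    int hot_loop(const int *a, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];   /* candidate for feedback-directed recompilation */
        return s;
    }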


International Conference on Systems | 2011

Case studies in hardware XPath acceleration

Dorit Nuzman; David Maze; Matthias Nicola; Glenn A. Marcy; Victor Kaplansky; Sergei Dyshel; Alon Dayan

The rapid growth in the use of XML for electronic data exchange introduces new challenges for efficient processing of XML data. Applications that heavily use XML need to be able to quickly extract the relevant parts of the XML data, often using the XPath language for addressing parts of an XML document. High-speed execution of XPath requests and queries is therefore becoming a critical requirement in many application domains, including XML databases and event processing. This work explores the potential for accelerating XPath processing in these domains using specialized hardware. This in turn poses the challenge of integrating specialized hardware with general-purpose application code. We present the design decisions behind building an integration layer to bridge between applications and the hardware, and describe our implementation. We discuss the factors that affect the acceleration potential, and show that despite the transmission overheads associated with off-loading XPath processing to the specialized co-processor, significant speedups can be obtained, ranging from a modest 11% improvement in the event-processing domain to a speedup factor of over 6x in the healthcare domain.
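
For a sense of the workload being offloaded, here is our illustration using libxml2's software XPath evaluator; the paper replaces this evaluation step with a hardware co-processor behind an integration layer, and the file name and query below are hypothetical.

    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/xpath.h>

    int main(void)
    {
        xmlDocPtr doc = xmlReadFile("records.xml", NULL, 0);  /* hypothetical input */
        if (!doc) return 1;

        /* Evaluate an XPath query over the parsed document. */
        xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
        xmlXPathObjectPtr res =
            xmlXPathEvalExpression(BAD_CAST "//record[status='active']/id", ctx);

        if (res && res->nodesetval)
            printf("%d matching nodes\n", res->nodesetval->nodeNr);

        xmlXPathFreeObject(res);
        xmlXPathFreeContext(ctx);
        xmlFreeDoc(doc);
        return 0;
    }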

Collaboration


Dive into Dorit Nuzman's collaboration network.

Top Co-Authors

Roger Ferrer

Barcelona Supercomputing Center


Xavier Martorell

Polytechnic University of Catalonia
