
Publication


Featured research published by Alexandra Jimborean.


Symposium on Code Generation and Optimization | 2014

Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling

Alexandra Jimborean; Konstantinos Koukos; Vasileios Spiliopoulos; David Black-Schaffer; Stefanos Kaxiras

Traditional compiler approaches to optimize power efficiency aim to adjust voltage and frequency at runtime to match the code characteristics to the hardware (e.g., running memory-bound phases at a lower frequency). However, such approaches are constrained by three factors: (i) voltage-frequency transitions are too slow to be applied at instruction granularity, (ii) larger code regions are seldom unequivocally memory- or compute-bound, and (iii) the available voltage scaling range for future technologies is rapidly shrinking. These factors necessitate new approaches that address power efficiency at the code-generation level. This paper proposes one such approach to automatically generate power-efficient code using a decoupled access/execute (DAE) model. In DAE a program is split into tasks, where each task consists of two sufficiently coarse-grained phases to enable effective Dynamic Voltage-Frequency Scaling (DVFS): (i) the access phase, which prefetches data (heavily memory-bound), and (ii) the execute phase, which performs the actual computation (heavily compute-bound). Our contribution is a compiler methodology that automatically generates the access phases for a task-based programming system. Our approach handles both affine codes (through a polyhedral analysis) and non-affine codes (through optimized task skeletons). Our evaluation shows that the automatically generated versions improve the energy-delay product (EDP) by 25% on average compared to a coupled execution, without any performance degradation, and surpass the EDP savings of the corresponding hand-crafted tasks by 5%.
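As a rough illustration of the DAE idea (a sketch, not the authors' generated code; the DVFS hooks and function names are hypothetical), a task can be split into a memory-bound access phase that only prefetches and a compute-bound execute phase that runs on warmed caches:

```c
#include <stddef.h>

/* Placeholders for whatever mechanism the runtime uses to request a
 * per-phase frequency change; a real system would call a DVFS driver. */
static void dvfs_set_low(void)  { /* request low voltage/frequency  */ }
static void dvfs_set_high(void) { /* request high voltage/frequency */ }

/* Access phase: heavily memory-bound, so it can run at low frequency.
 * It only prefetches, one touch per 64-byte cache line of doubles. */
static void access_phase(const double *a, size_t n) {
    for (size_t i = 0; i < n; i += 8)
        __builtin_prefetch(&a[i], 0 /* read */, 3 /* keep in cache */);
}

/* Execute phase: pure computation on data that is now cache-resident,
 * so it benefits from running at high frequency. */
static void execute_phase(double *out, const double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * a[i] + 1.0;
}

void dae_task(double *out, const double *a, size_t n) {
    dvfs_set_low();
    access_phase(a, n);
    dvfs_set_high();
    execute_phase(out, a, n);
}
```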


International Journal of Parallel Programming | 2014

Dynamic and Speculative Polyhedral Parallelization Using Compiler-Generated Skeletons

Alexandra Jimborean; Philippe Clauss; Jean-François Dollinger; Vincent Loechner; Juan Manuel Martinez Caamaño

We propose a framework based on an original generation and use of algorithmic skeletons, dedicated to the speculative parallelization of scientific nested loop kernels, that applies polyhedral transformations to the target code at run-time in order to expose parallelism and data locality. Parallel code generation is achieved at almost no cost by using binary algorithmic skeletons that are generated at compile time, and that embed the original code together with operations that instantiate a polyhedral parallelizing transformation and verify the speculations on dependences. The skeletons are patched at run-time to generate the executable code. The run-time process includes a transformation selection guided by online profiling phases on short samples, using an instrumented version of the code. During this phase, the accessed memory addresses are used to compute dependence distance vectors on the fly, and are also interpolated to build a predictor of the forthcoming accesses. The interpolating functions and distance vectors are then employed for dependence analysis to select a parallelizing transformation that, if the prediction is correct, does not induce any rollback during execution. To keep the rollback overhead low, the code is executed in successive slices of the outermost original loop of the nest. Each slice can be either a parallel version that instantiates a skeleton, the sequential original version, or an instrumented version. Moreover, slicing the execution provides the opportunity to transform the code differently in order to adapt to the observed execution phases, by patching one of the pre-built skeletons differently. The framework has been implemented with extensions of the LLVM compiler and an x86-64 runtime system. Significant speed-ups are shown on a set of benchmarks that could not have been handled efficiently by a compiler.
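The slicing scheme the abstract describes can be pictured with a toy control loop (all names are illustrative; the real framework patches binary skeletons rather than calling C functions):

```c
#include <stdio.h>

enum version { INSTRUMENTED, SEQUENTIAL, PARALLEL_SKELETON };

/* Placeholder hooks; in the real framework these patch and launch the
 * compile-time binary skeletons, or run the instrumented/original code. */
static enum version pick_version(long slice)         { return slice == 0 ? INSTRUMENTED : PARALLEL_SKELETON; }
static int  run_parallel_chunk(long lo, long hi)     { printf("parallel  [%ld,%ld)\n", lo, hi); return 1; }
static void run_sequential_chunk(long lo, long hi)   { printf("sequential[%ld,%ld)\n", lo, hi); }
static void run_instrumented_chunk(long lo, long hi) { printf("profiled  [%ld,%ld)\n", lo, hi); }

/* The outermost original loop runs in slices; each slice independently
 * executes one of the three versions, so a misspeculation costs only a
 * re-execution of that slice, and the transformation can change between
 * slices as execution phases change. */
void run_loop_in_slices(long n, long chunk) {
    long slice = 0;
    for (long lo = 0; lo < n; lo += chunk, slice++) {
        long hi = lo + chunk < n ? lo + chunk : n;
        switch (pick_version(slice)) {
        case PARALLEL_SKELETON:
            if (!run_parallel_chunk(lo, hi))   /* speculation failed for  */
                run_sequential_chunk(lo, hi);  /* this slice only: redo it */
            break;
        case INSTRUMENTED: run_instrumented_chunk(lo, hi); break;
        default:           run_sequential_chunk(lo, hi);   break;
        }
    }
}
```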


Compiler Construction | 2012

VMAD: an advanced dynamic program analysis and instrumentation framework

Alexandra Jimborean; Luis Mastrangelo; Vincent Loechner; Philippe Clauss

VMAD (Virtual Machine for Advanced Dynamic analysis) is a platform for advanced profiling and analysis of programs, consisting of a static component and a runtime system. The runtime system is organized as a set of decoupled modules, dedicated to specific instrumenting or optimizing operations and dynamically loaded when required. The program binary files handled by VMAD are processed beforehand at compile time to include all the necessary data, instrumentation instructions, and callbacks to the runtime system. For this purpose, the LLVM compiler has been extended to automatically generate multiple versions of the code, each tailored to the targeted instrumentation or optimization strategies. The compiler chooses the most suitable intermediate representation for each version, depending on the information to be acquired and the optimizations to be applied. The control flow graph is adapted to include the new versions and to transfer control to and from the runtime system, which is in charge of orchestrating the execution flow. The strength of our system resides in its extensibility: one can add support for new profiling or optimization strategies independently of the existing modules. VMAD's potential is illustrated by several analysis and optimization applications dedicated to loop nests: instrumentation by sampling, dynamic dependence analysis, and adaptive version selection.
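A minimal sketch of the multi-versioning idea, with hypothetical names (VMAD's actual interfaces are internal to the compiler and runtime): the compiler emits several variants of a code region, and the runtime owns the decision of which variant runs next.

```c
/* Each variant of the region is compiled to its own function; the
 * control flow graph is rewritten so that entry goes through a
 * runtime-owned dispatcher instead of straight into the original code. */
typedef void (*loop_version_fn)(void *ctx);

struct versioned_loop {
    loop_version_fn original;      /* unmodified code                 */
    loop_version_fn instrumented;  /* with profiling callbacks        */
    loop_version_fn optimized;     /* speculatively optimized variant */
};

/* Toy dispatch policy: profile first, then switch to the optimized
 * version once the runtime has gathered enough information. */
void run_versioned(struct versioned_loop *v, void *ctx, int profiled_enough) {
    if (!profiled_enough)
        v->instrumented(ctx);
    else
        v->optimized(ctx);
}
```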


IEEE Transactions on Parallel and Distributed Systems | 2016

A Hybrid Static-Dynamic Classification for Dual-Consistency Cache Coherence

Alberto Ros; Alexandra Jimborean

Traditional cache coherence protocols manage all memory accesses equally and ensure the strongest memory model, namely sequential consistency. Recent cache coherence protocols based on self-invalidation advocate the weaker model of sequential consistency for data-race-free (SC for DRF), which enables powerful optimizations for race-free code. However, for racy code these cache coherence protocols provide sub-optimal performance compared to traditional protocols. This paper proposes SPEL++, a dual-consistency cache coherence protocol that supports two execution modes: a traditional sequentially consistent protocol and a protocol that provides weak consistency (SC for DRF). SPEL++ exploits a hybrid static-dynamic classification of memory accesses based on (i) a compile-time identification of extended data-race-free code regions in OpenMP applications and (ii) a runtime classification of accesses based on the operating system's memory page management. By executing racy code under the sequentially consistent protocol and race-free code under the protocol that provides SC for DRF, the result is an efficient execution of the applications while still providing sequential consistency. Compared to a traditional protocol, we show performance improvements of 19 to 38 percent and reductions in energy consumption of 47 to 53 percent, on average for different benchmark suites, on a 64-core chip multiprocessor.
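At the decision level, such a hybrid classification could be pictured as in the toy sketch below (purely illustrative; SPEL++'s actual classification is spread across the compiler, the OS page tables, and the coherence hardware):

```c
/* An access runs under the weak, self-invalidation-based protocol only
 * if it is known race-free: either the compiler placed it inside an
 * extended data-race-free (xDRF) region, or the OS classified its page
 * as private. Anything potentially racy keeps strong consistency. */
enum mode { SC_PROTOCOL, DRF_PROTOCOL };

struct access_info {
    int in_xdrf_region;   /* compile-time flag for OpenMP code regions */
    int page_is_private;  /* OS page-table-based runtime classification */
};

static enum mode classify(const struct access_info *a) {
    if (a->in_xdrf_region || a->page_is_private)
        return DRF_PROTOCOL;  /* race-free: the weak protocol is safe */
    return SC_PROTOCOL;       /* potentially racy: stay sequentially
                               * consistent */
}
```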


Compiler Construction | 2016

Multiversioned decoupled access-execute: the key to energy-efficient compilation of general-purpose programs

Konstantinos Koukos; Per Ekemark; Georgios Zacharopoulos; Vasileios Spiliopoulos; Stefanos Kaxiras; Alexandra Jimborean

Computer architecture design faces an era of great challenges in the attempt to simultaneously improve performance and energy efficiency. Previous hardware techniques for energy management are becoming severely limited, and thus compilers play an essential role in matching the software to the more restricted hardware capabilities. One promising approach is software decoupled access-execute (DAE), in which the compiler transforms the code into coarse-grained phases that are well-matched to the Dynamic Voltage and Frequency Scaling (DVFS) capabilities of the hardware. While this method has proved efficient for statically analyzable codes, general-purpose applications pose significant challenges due to pointer aliasing, complex control flow, and unknown runtime events. We propose a universal compile-time method to decouple general-purpose applications, using simple but efficient heuristics. Our solutions overcome the challenges of complex code and show that automatic decoupled execution significantly reduces the energy expenditure of irregular or memory-bound applications and even yields slight performance gains. Overall, our technique achieves energy-delay-product (EDP) improvements of over 20% on average (energy over 15% and performance over 5%) across 14 benchmarks from the SPEC CPU 2006 and Parboil benchmark suites, with peak EDP improvements surpassing 70%.
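One reason decoupling can survive pointer aliasing is that an access phase performs loads and prefetches only, so it cannot corrupt state even when its address computation is approximate. A minimal sketch on a hypothetical pointer-chasing kernel (shapes and names are illustrative, not the paper's generated code):

```c
/* Hypothetical irregular kernel split into DAE phases. The access phase
 * re-runs only the address computation and issues prefetches; since it
 * performs no stores, unknown aliasing cannot make it incorrect, and it
 * may even be safely approximate. */
struct node { struct node *next; double val; };

static void access_phase(const struct node *p) {
    for (; p; p = p->next)                 /* same traversal, loads only */
        __builtin_prefetch(p->next, 0, 3); /* warm the next node */
}

static double execute_phase(const struct node *p) {
    double sum = 0.0;
    for (; p; p = p->next)                 /* computes on warmed data */
        sum += p->val;
    return sum;
}

double dae_sum(const struct node *head) {
    access_phase(head);
    return execute_phase(head);
}
```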


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

Adapting the polyhedral model as a framework for efficient speculative parallelization

Alexandra Jimborean; Philippe Clauss; Benoît Pradelle; Luis Mastrangelo; Vincent Loechner

In this paper, we present a Thread-Level Speculation (TLS) framework whose main feature is the ability to speculatively parallelize a sequential loop nest in various ways by re-scheduling its iterations. The transformation to apply is selected at runtime with the goal of minimizing the number of rollbacks and maximizing performance. We perform code transformations by applying the polyhedral model, which we adapted for speculative and runtime code parallelization. For this purpose, we design a parallel code pattern that is patched by our runtime system according to the profiling information collected on some execution samples. Adaptability is ensured by considering chunks of code of various sizes, launched successively, each of which is parallelized in a different manner, or run sequentially, depending on the currently observed memory access behavior. We show on several benchmarks that our framework yields good performance on codes which could not be handled efficiently by previously proposed TLS systems.
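The chunked execution with rollback can be sketched as follows (a toy with hypothetical names; the real system additionally re-schedules the iterations inside each speculative chunk according to the selected polyhedral transformation):

```c
#include <string.h>

struct state { double *data; size_t bytes; };

/* Before each speculative chunk, the live state is checkpointed into a
 * shadow buffer; a misspeculation restores it and re-runs the chunk
 * sequentially, so the penalty is bounded by one chunk. */
static void checkpoint(const struct state *s, double *shadow) {
    memcpy(shadow, s->data, s->bytes);
}
static void rollback(struct state *s, const double *shadow) {
    memcpy(s->data, shadow, s->bytes);
}

/* Placeholder hooks for the patched parallel pattern and the original
 * sequential code; run_speculative_chunk returns 0 on misspeculation. */
extern int  run_speculative_chunk(struct state *s, long lo, long hi);
extern void run_sequential_chunk(struct state *s, long lo, long hi);

void execute(struct state *s, double *shadow, long n, long chunk) {
    for (long lo = 0; lo < n; lo += chunk) {
        long hi = lo + chunk < n ? lo + chunk : n;
        checkpoint(s, shadow);
        if (!run_speculative_chunk(s, lo, hi)) {
            rollback(s, shadow);            /* undo the failed chunk */
            run_sequential_chunk(s, lo, hi); /* and redo it safely */
        }
    }
}
```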


Runtime Verification | 2014

Speculative Program Parallelization with Scalable and Decentralized Runtime Verification

Aravind Sukumaran-Rajam; Juan Manuel Martinez Caamaño; Willy Wolff; Alexandra Jimborean; Philippe Clauss

Thread-Level Speculation (TLS) is a dynamic code parallelization technique proposed to keep software in step with advances in hardware, in particular to automatically parallelize programs so that they take advantage of multi-core processors. Being speculative, frameworks of this type unavoidably rely on verification systems that are similar to software transactional memory and that require voluminous inter-thread communication or centralized registration of the performed memory accesses. This high degree of communication runs against the basic principles of high-performance parallel computing, does not scale with an increasing number of processor cores, and yields weak performance. Moreover, TLS systems often apply one unique parallelization strategy, consisting of slicing a loop into several parallel speculative threads. Such a strategy also runs against these principles, since loops in the original serial code are not necessarily parallel and, as is well known, the parallel schedule must promote the data locality that is crucial for good performance. This situation calls for scalable and decentralized verification systems, and for new strategies that dynamically generate efficient parallel code resulting from advanced optimizing and parallelizing transformations. Such transformations require a more complex verification system that allows intra-thread iterations to be reordered. In this paper, we propose a verification system of this kind, based on a model built at runtime that predicts a linear memory behavior. This strategy is part of the Apollo speculative code parallelizer, which is based on an adaptation of the polyhedral model for dynamic usage.
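A minimal sketch of the decentralized check, assuming the runtime has fit a linear model addr(i) = a*i + b per speculative memory instruction (names and shapes are illustrative): each thread validates its own accesses locally and only touches shared state on a misprediction, so the common case involves no inter-thread communication at all.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Single shared flag, written only on the (rare) misprediction path. */
static atomic_bool misspeculated = false;

struct linear_model { intptr_t a, b; };  /* predicted addr = a*i + b */

/* Inserted by the code generator next to each speculative access:
 * compare the actual address against the model's prediction. */
static inline int verify_access(const struct linear_model *m,
                                long i, const void *actual) {
    if ((intptr_t)actual != m->a * i + m->b) {
        atomic_store_explicit(&misspeculated, true, memory_order_relaxed);
        return 0;   /* prediction broken: this chunk must roll back */
    }
    return 1;       /* prediction holds: purely thread-local work */
}
```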


International Symposium on Performance Analysis of Systems and Software | 2011

VMAD: A virtual machine for advanced dynamic analysis of programs

Alexandra Jimborean; Matthieu Herrmann; Vincent Loechner; Philippe Clauss

Runtime code analysis and optimization is becoming a main strategy for facing the ever-expanding and changing variety of processor architectures and execution environments that an application can encounter. Particularly with the advent of multicore processors, efficient program optimizations, such as adaptive and speculative parallelism, require accurate and advanced runtime analyses, which inevitably incur a time overhead that has to be minimized. In this paper, we present VMAD, a virtual machine (VM) that handles x86_64 binary files, which are especially tailored at compile time to include instructions and data for code instrumentation and for the VM. VMAD enables low-level profiling initiated by the programmer from the source code, through the insertion of a dedicated pragma delimiting the regions of interest. This approach gives the programmer a direct view of the actual execution behavior of the source code. To our knowledge, VMAD is the first proposal providing low-level instrumentation initiated from the source code with almost negligible runtime overhead.
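Usage amounts to a one-line annotation in the source (the pragma spelling below is hypothetical; only the mechanism of delimiting a region of interest is taken from the paper):

```c
void kernel(int n, int m, double a[n][m], double b[m][n], double c[n][m]) {
    /* Hypothetical spelling of the VMAD annotation: the pragma marks the
     * loop nest whose low-level behavior should be instrumented; an
     * unrecognized pragma is simply ignored by other compilers. */
    #pragma instrument_loopnest
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            c[i][j] = a[i][j] + b[j][i];
}
```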


International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2014

Software-controlled processor stalls for time and energy efficient data locality optimization

Philippe Clauss; Imèn Fassi; Alexandra Jimborean

Data locality optimization is a well-known goal when handling programs that must run as fast as possible or use a minimum amount of energy. However, the usual techniques never address the significant impact of the numerous stalled processor cycles that may occur when consecutive load and store instructions access the same memory location. We show that two versions of the same program may exhibit similar memory performance while performing very differently with regard to their execution times, because of the stalled processor cycles generated by many pipeline hazards. We propose a new programming structure called “xfor”, enabling explicit control of the way data locality is optimized in a program and thus of the amount of stalled processor cycles. We show the benefits of xfor regarding execution time and energy savings.
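A plain-C approximation of the effect of such control (illustrative only; this is not the actual xfor syntax): fusing two statement streams with an offset keeps b hot in cache for the second statement, while leaving enough distance for the first statement's store to retire, so the dependent load no longer collides with an in-flight store.

```c
/* Statement 1 reads b[i] a few iterations after statement 0 wrote it:
 * close enough for cache locality, far enough to avoid the load-store
 * pipeline hazard on the same address. OFF is a hypothetical offset. */
void fused_with_offset(long n, double *a, double *b, double *c) {
    enum { OFF = 4 };                 /* offset between the two bodies */
    for (long i = 0; i < n + OFF; i++) {
        if (i < n)
            b[i] = a[i] + 1.0;                    /* statement 0 */
        if (i >= OFF)
            c[i - OFF] = b[i - OFF] * 2.0;        /* statement 1 */
    }
}
```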


International Conference on Parallel Processing | 2013

Online dynamic dependence analysis for speculative polyhedral parallelization

Alexandra Jimborean; Philippe Clauss; Juan Manuel Martinez; Aravind Sukumaran-Rajam

We present a dynamic dependence analyzer whose goal is to compute dependences from instrumented execution samples of loop nests. The resulting information serves as a prediction of the execution behavior during the remaining iterations and can be used to select and apply a speculatively optimizing and parallelizing polyhedral transformation of the target sequential loop nest. Thus, a parallel lock-free version can be generated, which should not induce any rollback if the prediction is correct. The dependence analyzer computes distance vectors and linear functions interpolating the memory addresses accessed by each memory instruction, as well as the values of some scalars. Phases showing a changing memory behavior are detected thanks to a dynamic adjustment of the instrumentation frequency. The dependence analyzer is part of a larger framework dedicated to the speculative parallelization of loop nests, which has been implemented with extensions of the LLVM compiler and an x86-64 runtime system.
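In spirit, the interpolation and distance computation look like this toy sketch (hypothetical shapes; the real analyzer works on streams of samples per memory instruction and on multi-dimensional distance vectors):

```c
#include <stdint.h>

struct affine { intptr_t a, b; int valid; };  /* addr(i) = a*i + b */

/* Fit a linear address function from two sampled (iteration, address)
 * pairs; later samples must keep fitting it or the phase is declared
 * non-linear and re-instrumented. */
static struct affine fit(long i0, intptr_t addr0, long i1, intptr_t addr1) {
    struct affine f = {0, 0, 0};
    if (i1 != i0 && (addr1 - addr0) % (i1 - i0) == 0) {
        f.a = (addr1 - addr0) / (i1 - i0);
        f.b = addr0 - f.a * i0;
        f.valid = 1;
    }
    return f;
}

/* Constant dependence distance between a write w and a read r with the
 * same stride: w.a*i + w.b == r.a*(i+d) + r.b  =>  d = (w.b - r.b) / a. */
static int distance(const struct affine *w, const struct affine *r, long *d) {
    if (!w->valid || !r->valid || w->a != r->a || w->a == 0) return 0;
    if ((w->b - r->b) % w->a) return 0;   /* addresses never coincide */
    *d = (w->b - r->b) / w->a;
    return 1;
}
```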
