Antoniu Pop | Researchain

Publication

Featured research published by Antoniu Pop.

Microprocessors and Microsystems | 2014

TERAFLUX: Harnessing dataflow in next generation teradevices

Roberto Giorgi; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Rahulkumar Gayatri; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Joshua Landwehr; Nhat Minh Lê; Feng Li; Mikel Luján; Avi Mendelson; Laurent Morin; Nacho Navarro; Tomasz Patejko; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Ian Watson; Sebastian Weis; Stéphane Zuckerman; Mateo Valero

Improvements in semiconductor technologies are gradually enabling extreme-scale systems such as teradevices (i.e., chips composed of 1000 billion transistors), most likely by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses these challenges together by leveraging dataflow principles. This paper presents an overview of the research carried out by the TERAFLUX partners and some preliminary results. Our platform comprises 1000+ general-purpose cores per chip in order to properly explore the above challenges. An architectural template has been proposed and applications have been ported to the platform. Programming models, compilation tools, and reliability techniques have been developed. The evaluation is carried out by leveraging modifications of the HP-Labs COTSon simulator.

High Performance Embedded Architectures and Compilers | 2011

A stream-computing extension to OpenMP

Antoniu Pop; Albert Cohen

This paper introduces an extension to OpenMP 3.0 enabling stream programming with minimal, incremental additions that integrate seamlessly into the current specification. The stream programming model decomposes programs into tasks and makes the flow of data among them explicit, thus exposing data, task and pipeline parallelism. It helps programmers express concurrency and data locality properties, avoiding non-portable low-level code and early optimizations. We survey the diverse motivations and constraints converging towards the design of our simple yet powerful language extension, and we present experimental results of a prototype implementation in a public branch of GCC 4.5.
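The idea of decomposing a program into tasks joined by explicit data streams can be illustrated with a minimal sketch. It does not use the OpenMP extension described in the abstract, whose syntax is not shown here; Python threads and `queue.Queue` stand in for tasks and streams, and the names `stage`, `run_pipeline`, and `DONE` are hypothetical.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker


def stage(fn, inbox, outbox):
    """One pipeline task: read from its input stream, write to its output stream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # forward the end-of-stream marker downstream
            return
        outbox.put(fn(item))


def run_pipeline(values, fns):
    """Chain one thread per function; consecutive stages share a FIFO stream."""
    streams = [queue.Queue() for _ in range(len(fns) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, streams[i], streams[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for w in workers:
        w.start()
    for v in values:
        streams[0].put(v)
    streams[0].put(DONE)

    results = []
    while True:
        item = streams[-1].get()
        if item is DONE:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

Calling `run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])` runs the two stages concurrently on different items, which is the pipeline parallelism the abstract refers to; the data flow between tasks is visible in the queue wiring rather than hidden in shared variables.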

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2013

Correct and efficient work-stealing for weak memory models

Nhat Minh Lê; Antoniu Pop; Albert Cohen; Francesco Zappa Nardelli

Chase and Lev's concurrent deque is a key data structure in shared-memory parallel programming and plays an essential role in work-stealing schedulers. We provide the first correctness proof of an optimized implementation of Chase and Lev's deque on top of the POWER and ARM architectures: these provide very relaxed memory models, which we exploit to improve performance but which considerably complicate the reasoning. We also study an optimized x86 and a portable C11 implementation, conducting systematic experiments to evaluate the impact of memory barrier optimizations. Our results demonstrate the benefits of hand tuning the deque code when running on top of relaxed memory models.
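For readers unfamiliar with the data structure, the following is a minimal sketch of the deque's interface only: the owning worker pushes and takes at the bottom, while thieves steal from the top. It is not the paper's implementation. A single lock replaces the atomic operations and memory fences that are the actual subject of the paper, the capacity is fixed, and the names `WorkStealingDeque` and `EMPTY` are hypothetical.

```python
import threading

EMPTY = object()  # hypothetical sentinel returned when no item is available


class WorkStealingDeque:
    """Fixed-capacity, lock-based stand-in for a Chase-Lev style deque."""

    def __init__(self, capacity=64):
        self.buf = [None] * capacity  # circular buffer
        self.top = 0                  # index of the oldest item (thieves' end)
        self.bottom = 0               # index one past the newest item (owner's end)
        self.lock = threading.Lock()  # replaces all atomics and fences

    def push(self, item):
        """Owner only: add an item at the bottom."""
        with self.lock:
            if self.bottom - self.top >= len(self.buf):
                raise OverflowError("deque full; a real deque would grow its array")
            self.buf[self.bottom % len(self.buf)] = item
            self.bottom += 1

    def take(self):
        """Owner only: remove the most recently pushed item (LIFO)."""
        with self.lock:
            if self.bottom == self.top:
                return EMPTY
            self.bottom -= 1
            return self.buf[self.bottom % len(self.buf)]

    def steal(self):
        """Any thief: remove the oldest item (FIFO)."""
        with self.lock:
            if self.top == self.bottom:
                return EMPTY
            item = self.buf[self.top % len(self.buf)]
            self.top += 1
            return item
```

The owner works in LIFO order on its own end while thieves remove the oldest work from the other end, so the two only contend when the deque is nearly empty. That near-empty case is where a lock-free version needs carefully placed synchronization.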

Digital Systems Design | 2013

The TERAFLUX Project: Exploiting the DataFlow Paradigm in Next Generation Teradevices

Marco Solinas; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Feng Li; Mikel Luján; Laurent Morin; Avi Mendelson; Nacho Navarro; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Mateo Valero; Sebastian Weis; Ian Watson; Stéphane Zuckerman; Roberto Giorgi

Thanks to improvements in semiconductor technologies, extreme-scale systems such as teradevices (i.e., chips composed of 1000 billion transistors) will enable systems with 1000+ general-purpose cores per chip, probably by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses these challenges together by leveraging dataflow principles. This paper describes the project and provides an overview of the research carried out by the TERAFLUX consortium.

International Journal of Parallel Programming | 2011

ACOTES project: Advanced compiler technologies for embedded streaming

Eduard Ayguadé; Cédric Bastoul; Paul M. Carpenter; Zbigniew Chamski; Albert Cohen; Marco Cornero; Philippe Dumont; Marc Duranton; Mohammed Fellahi; Roger Ferrer; Razya Ladelsky; Menno Lindwer; Xavier Martorell; Cupertino Miranda; Dorit Nuzman; Andrea Ornstein; Antoniu Pop; Sebastian Pop; Louis-Noël Pouchet; Alex Ramirez; David Ródenas; Erven Rohou; Ira Rosen; Uzi Shvadron; Konrad Trifunovic; Ayal Zaks

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.

IEEE Micro | 2012

Automatic Extraction of Coarse-Grained Data-Flow Threads from Imperative Programs

Feng Li; Antoniu Pop; Albert Cohen

This article presents a general algorithm for transforming sequential imperative programs into parallel data-flow programs. The algorithm operates on a program dependence graph in static-single-assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion. A prototype based on GNU Compiler Collection (GCC) is applied to the automatic parallelization of recursive C programs.

Compilers, Architecture, and Synthesis for Embedded Systems | 2010

Erbium: a deterministic, concurrent intermediate representation to map data-flow tasks to scalable, persistent streaming processes

Cupertino Miranda; Antoniu Pop; Philippe Dumont; Albert Cohen; Marc Duranton

Tuning applications for multicore systems involves subtle concurrency concepts and target-dependent optimizations. This paper advocates for a streaming execution model, called ER, where persistent processes communicate and synchronize through a multi-consumer [...] processing applications, we demonstrate the scalability and efficiency advantages of streaming compared to data-driven scheduling. To exploit these benefits in compilers for parallel languages, we propose an intermediate representation enabling the compilation of data-flow tasks into streaming processes. This intermediate representation also facilitates the application of classical compiler optimizations to concurrent programs.

ACM Transactions on Architecture and Code Optimization | 2014

Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages

Andi Drebes; Karine Heydemann; Nathalie Drach; Antoniu Pop; Albert Cohen

We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about intertask communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption on the structure of programs or on the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler.

ACM Transactions on Architecture and Code Optimization | 2015

Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs

Martin Kong; Antoniu Pop; Louis-Noël Pouchet; R. Govindarajan; Albert Cohen; P. Sadayappan

Task-parallel languages are increasingly popular. Many of them provide expressive mechanisms for intertask synchronization. For example, OpenMP 4.0 will integrate data-driven execution semantics derived from the StarSs research language. Compared to the more restrictive data-parallel and fork-join concurrency models, the advanced features being introduced into task-parallel models in turn enable improved scalability through load balancing, memory latency hiding, mitigation of the pressure on memory bandwidth, and, as a side effect, reduced power consumption. In this article, we develop a systematic approach to compile loop nests into concurrent, dynamically constructed graphs of dependent tasks. We propose a simple and effective heuristic that selects the most profitable parallelization idiom for every dependence type and communication pattern. This heuristic enables the extraction of interband parallelism (cross-barrier parallelism) in a number of numerical computations that range from linear algebra to structured grids and image processing. The proposed static analysis and code generation alleviate the burden of a full-blown dependence resolver to track the readiness of tasks at runtime. We evaluate our approach and algorithms in the PPCG compiler, targeting OpenStream, a representative dataflow task-parallel language with explicit intertask dependences and a lightweight runtime. Experimental results demonstrate the effectiveness of the approach.

European Conference on Parallel Processing | 2015

Effective Barrier Synchronization on Intel Xeon Phi Coprocessor

Andrey Rodchenko; Andy Nisbet; Antoniu Pop; Mikel Luján

Barriers are a fundamental synchronization primitive, underpinning the parallel execution models of many modern shared-memory parallel programming languages such as OpenMP, OpenCL or Cilk, and are one of the main challenges to scaling. State-of-the-art barrier synchronization algorithms differ in tradeoffs between critical path length, communication traffic patterns and memory footprint. In this paper, we evaluate the efficiency of five such algorithms on the Intel Xeon Phi coprocessor. In addition, we present a novel hybrid barrier implementation that exploits the topology, the memory hierarchy and streaming stores of the Xeon Phi architecture to achieve a 3× lower overhead than the Intel OpenMP barrier implementation (ICC 14.0.0), thus outperforming, to the best of our knowledge, all other implementations, and which we evaluate on the CG and MG kernels from the NAS Parallel Benchmarks, the direct N-body simulation kernel and the EPCC barrier OpenMP microbenchmark. The optimized barriers presented in the paper are available at https://github.com/arodchen/cbarriers, released as free software.
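As background on what a barrier does, here is a minimal sketch of a textbook centralized sense-reversing barrier. It is not the hybrid barrier presented in the paper, and nothing here claims it is one of the five algorithms evaluated. Python threads and a `threading.Condition` stand in for hardware-level spinning, and the names `SenseBarrier` and `run_phases` are hypothetical.

```python
import threading


class SenseBarrier:
    """Centralized sense-reversing barrier for a fixed number of threads."""

    def __init__(self, n):
        self.n = n
        self.count = n        # threads still expected in the current phase
        self.sense = False    # flips each time a phase completes
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_sense = not self.sense   # value `sense` takes when this phase ends
            self.count -= 1
            if self.count == 0:
                # Last arrival: reset the counter and release everyone.
                self.count = self.n
                self.sense = my_sense
                self.cond.notify_all()
            else:
                while self.sense != my_sense:
                    self.cond.wait()


def run_phases(n_threads=4, n_phases=3):
    """After each barrier, record how many threads had arrived in that phase."""
    barrier = SenseBarrier(n_threads)
    arrived = [0] * n_phases
    seen = []
    lock = threading.Lock()

    def worker():
        for p in range(n_phases):
            with lock:
                arrived[p] += 1
            barrier.wait()
            with lock:
                seen.append(arrived[p])

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Every value returned by `run_phases` equals the thread count, because no thread passes the barrier before all threads have arrived. Flipping the sense flag lets the same barrier object be reused across phases without a separate reset step.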
