Sylvain Girbal
University of Paris-Sud
Publications
Featured research published by Sylvain Girbal.
International Journal of Parallel Programming | 2006
Sylvain Girbal; Nicolas Vasilache; Cédric Bastoul; Albert Cohen; David Parello; Marc Sigler; Olivier Temam
Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinatorial task is poorly achieved in general, resulting in weak scalability and disappointing sustained performance. We address this challenge by working on the program representation itself, using a semi-automatic optimization approach to demonstrate that current compilers often suffer from unnecessary constraints and intricacies that can be avoided in a semantically richer transformation framework. Technically, the purpose of this paper is threefold: (1) to show that syntactic code representations close to the operational semantics lead to rigid phase ordering and cumbersome expression of architecture-aware loop transformations, (2) to illustrate how complex transformation sequences may be needed to achieve significant performance benefits, (3) to facilitate the automatic search for program transformation sequences, improving on classical polyhedral representations to better support operations research strategies in a simpler, structured search space. The proposed framework relies on a unified polyhedral representation of loops and statements, using normalization rules to allow flexible and expressive transformation sequencing. This representation extends the scalability of polyhedral dependence analysis and makes it possible to delay the (automatic) legality checks until the end of a transformation sequence. Our work builds on algorithmic advances in polyhedral code generation and has been implemented in a modern research compiler.
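For readers unfamiliar with the setting, the toy C sketch below (not from the paper; the kernel and tile size are made up) illustrates the kind of transformation sequence the abstract refers to: a loop interchange composed with strip-mining, where the polyhedral view changes only the schedule while the iteration domain stays unchanged.

```c
/* Illustrative only: a composed loop-transformation sequence of the kind the
 * framework is designed to express. Kernel and tile size are assumptions. */
#include <stdio.h>

#define N    256
#define TILE 32
static double A[N][N], x[N], y[N];

/* Original code: iteration domain {(i,j) | 0 <= i,j < N}. */
static void matvec(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
}

/* After interchange + strip-mining of j. In a syntactic framework each step
 * rewrites the loop nest; in a polyhedral framework the domain is unchanged
 * and only the schedule (the visiting order of its points) is modified. */
static void matvec_transformed(void) {
    for (int jj = 0; jj < N; jj += TILE)
        for (int i = 0; i < N; i++)
            for (int j = jj; j < jj + TILE; j++)
                y[i] += A[i][j] * x[j];
}

int main(void) {
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++) A[i][j] = 1.0;
    }
    matvec();
    double ref = y[0];
    for (int i = 0; i < N; i++) y[i] = 0.0;
    matvec_transformed();
    printf("reference %.1f, transformed %.1f\n", ref, y[0]); /* both 256.0 */
    return 0;
}
```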
IEEE Computer Architecture Letters | 2007
David I. August; Jonathan Chang; Sylvain Girbal; Daniel Gracia-Perez; Gilles Mouchard; David A. Penry; Olivier Temam; Neil Vachharajani
Simulator development is already a huge burden for many academic and industry research groups; future complex or heterogeneous multi-cores, as well as the multiplicity of performance metrics and required functionality, will make matters worse. We present a new simulation environment, called UNISIM, which is designed to rationalize simulator development by making it possible and efficient to distribute the overall effort over multiple research groups, even without direct cooperation. UNISIM achieves this goal with a combination of modular software development, distributed communication protocols, multilevel abstract modeling, interoperability capabilities, a set of simulator services APIs, and an open library/repository for providing a consistent set of simulator modules.
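As a rough illustration of the modular, port-based composition style such environments rely on, here is a minimal sketch; the module names and interfaces are invented for this example and are not the UNISIM API.

```c
/* Minimal sketch of module-based simulator composition: modules exchange data
 * through explicit ports and are advanced by a common simulation loop. */
#include <stdio.h>

typedef struct module module_t;
struct module {
    const char *name;
    int out_valid;        /* output port: data produced this cycle */
    int out_data;
    module_t *consumer;   /* module wired to this output port */
    void (*cycle)(module_t *self, int in_valid, int in_data);
};

/* A producer module emitting an increasing value every cycle. */
static void producer_cycle(module_t *self, int in_valid, int in_data) {
    (void)in_valid; (void)in_data;
    self->out_data++;
    self->out_valid = 1;
}

/* A consumer module printing whatever arrives on its input port. */
static void consumer_cycle(module_t *self, int in_valid, int in_data) {
    if (in_valid)
        printf("%s received %d\n", self->name, in_data);
}

int main(void) {
    module_t sink = { "sink", 0, 0, NULL, consumer_cycle };
    module_t src  = { "src",  0, 0, &sink, producer_cycle };
    for (int cycle = 0; cycle < 3; cycle++) {   /* simulation loop */
        src.cycle(&src, 0, 0);
        src.consumer->cycle(src.consumer, src.out_valid, src.out_data);
    }
    return 0;
}
```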
International Conference on Supercomputing | 2005
Albert Cohen; Marc Sigler; Sylvain Girbal; Olivier Temam; David Parello; Nicolas Vasilache
Static compiler optimizations can hardly cope with the complex run-time behavior and hardware-component interplay of modern processor architectures. Multiple architectural phenomena occur and interact simultaneously, which requires the optimizer to combine multiple program transformations. Whether these transformations are selected through static analysis and models, runtime feedback, or both, the underlying infrastructure must have the ability to perform long and complex compositions of program transformations in a flexible manner. Existing compilers are ill-equipped to perform that task because of rigid phase ordering, fragile selection rules based on pattern matching, and cumbersome expression of loop transformations on syntax trees. Moreover, iterative optimization is emerging as a pragmatic and general means of selecting an optimization strategy via machine learning and operations research. Searching for the composition of dozens of complex, dependent, parameterized transformations is a challenge for iterative approaches. The purpose of this article is threefold: (1) to facilitate the automatic search for compositions of program transformations, introducing a richer framework which improves on classical polyhedral representations and is suitable for iterative optimization on a simpler, structured search space, (2) to illustrate, using several examples, that syntactic code representations close to the operational semantics hamper the composition of transformations, and (3) to show that complex compositions of transformations can be necessary to achieve significant performance benefits. The proposed framework relies on a unified polyhedral representation of loops and statements. The key is to clearly separate four types of actions associated with program transformations: modifications of the iteration domain, the schedule, the data layout, and the memory access functions. The framework is implemented within the Open64/ORC compiler, targeting native IA64, AMD64 and IA32 code generation, along with source-to-source optimization of Fortran90, C and C++.
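The separation into distinct components can be pictured with a small data-structure sketch; this is purely illustrative under assumed, simplified representations and is not the paper's implementation.

```c
/* Illustrative sketch: keeping the components the abstract lists separate, so
 * a transformation composes by editing one component without touching others. */
#include <stdio.h>

#define MAX_DIMS 4

typedef struct {
    /* Iteration domain: bounds 0 <= i_k < extent[k] per loop dimension. */
    int dims;
    int extent[MAX_DIMS];
    /* Schedule: here a simple permutation order[] of the loop dimensions
     * (a stand-in for a general affine schedule). */
    int order[MAX_DIMS];
    /* Data layout / access function: array stride per dimension. */
    int stride[MAX_DIMS];
} stmt_poly_t;

/* Loop interchange: only the schedule component changes. */
static void interchange(stmt_poly_t *s, int d1, int d2) {
    int tmp = s->order[d1];
    s->order[d1] = s->order[d2];
    s->order[d2] = tmp;
}

int main(void) {
    stmt_poly_t s = { 2, {128, 64}, {0, 1}, {64, 1} };
    interchange(&s, 0, 1);
    printf("new loop order: (%d, %d)\n", s.order[0], s.order[1]);
    return 0;
}
```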
High Performance Embedded Architectures and Compilers | 2012
Petar Radojković; Sylvain Girbal; Arnaud Grasset; Eduardo Quiñones; Sami Yehia; Francisco J. Cazorla
Commercial Off-The-Shelf (COTS) processors are now commonly used in real-time embedded systems. The characteristics of these processors fulfill system requirements in terms of time-to-market, low cost, and high performance-per-watt ratio. However, multithreaded (MT) processors are still not widely used in real-time systems because their timing analysis is too complex. In MT processors, simultaneously-running tasks share and compete for processor resources, so the timing analysis has to estimate the possible impact that inter-task interferences have on the execution time of the applications. In this paper, we propose a method that quantifies the slowdown that simultaneously-running tasks may experience due to collisions in shared processor resources. To that end, we designed benchmarks that stress specific processor resources and used them to (1) estimate an upper bound on the slowdown that simultaneously-running tasks may experience because of collisions in different shared processor resources, and (2) quantify the sensitivity of time-critical applications to collisions in these resources. We used the presented method to determine whether a given MT processor is a good candidate for systems with timing requirements. We also present a case study in which the method is used to analyze three multithreaded architectures exhibiting different configurations of resource sharing. Finally, we show that measuring the slowdown that real applications experience when running simultaneously with resource-stressing benchmarks is an important step in measurement-based timing analysis; this information is a basis for incremental verification of MT COTS architectures.
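The resource-stressing idea can be sketched as a micro-benchmark of the following kind; the buffer size and cache-line stride are assumptions chosen for illustration, not the parameters used in the paper.

```c
/* Sketch of a resource-stressing micro-benchmark: a loop that continuously
 * touches a buffer larger than the shared cache at cache-line stride, putting
 * maximal pressure on that resource. Sizes are assumptions (64 B lines,
 * 8 MiB buffer). */
#include <stdint.h>
#include <stdio.h>

#define LINE      64
#define BUF_BYTES (8u * 1024u * 1024u)

static volatile uint8_t buf[BUF_BYTES];

int main(void) {
    uint64_t sink = 0;
    for (int iter = 0; iter < 100; iter++)            /* run "forever" in practice */
        for (uint32_t i = 0; i < BUF_BYTES; i += LINE)
            sink += buf[i];                           /* roughly one miss per line */
    printf("%llu\n", (unsigned long long)sink);       /* keep the loop from being optimized away */
    return 0;
}
```

The slowdown of a time-critical application is then obtained by timing it in isolation and again while such a benchmark runs on the other hardware threads.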
Microprocessors and Microsystems | 2014
Roberto Giorgi; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Rahulkumar Gayatri; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Joshua Landwehr; Nhat Minh Lê; Feng Li; Mikel Luján; Avi Mendelson; Laurent Morin; Nacho Navarro; Tomasz Patejko; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Ian Watson; Sebastian Weis; Stéphane Zuckerman; Mateo Valero
The improvements in semiconductor technologies are gradually enabling extreme-scale systems such as teradevices (i.e., chips composed of 1,000 billion transistors), most likely by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses these challenges at once by leveraging dataflow principles. This paper presents an overview of the research carried out by the TERAFLUX partners and some preliminary results. Our platform comprises 1000+ general-purpose cores per chip in order to properly explore the above challenges. An architectural template has been proposed and applications have been ported to the platform. Programming models, compilation tools, and reliability techniques have been developed. The evaluation is carried out using modifications of the HP Labs COTSon simulator.
High-Performance Computer Architecture | 2009
Sami Yehia; Sylvain Girbal; Hugues Berry; Olivier Temam
While parallelism and multi-cores are receiving much attention as a major scalability path, customization is another, orthogonal and complementary, scalability path which can target programs or program sections that are not easily parallelizable. The key assets of customization are cost and power efficiency. The key limitation of customization is flexibility. However, we argue that there is no single perfect balance between efficiency and flexibility; each system vendor may want to strike a different balance. In this article, we present a method for achieving any desired balance between flexibility and efficiency by automatically combining any set of individual customization circuits into a larger compound circuit. This circuit is significantly more cost-efficient than the simple union of all target circuits, and is configurable to behave as any of the target circuits, while avoiding the routing and configuration cost overhead of FPGAs. The more individual circuits are included, the larger the number of applications which can potentially benefit from this compound customization circuit, realizing flexibility at a minimal cost. Moreover, we observe that the compound circuit cost does not increase in proportion to the number of target applications, owing to the wide range of common data-flow and control-flow patterns in programs. Currently, the target individual circuits correspond to loops, like most accelerators in embedded systems, but the aggregation method can accommodate circuits of any size. Using the UTDSP benchmarks and accelerators coupled with an embedded PowerPC 405 processor, we show that this approach can yield an average performance improvement of 2.97x, while the corresponding synthesized aggregate accelerator is three times smaller than the sum of the individual accelerators for each target benchmark.
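As a software analogy of the sharing idea (not the paper's hardware aggregation method), the sketch below lets one configurable datapath behave as either of two toy accelerators by reusing a common multiply-accumulate pattern; the accelerators and configuration word are invented.

```c
/* Software analogy only: two "accelerators" -- a dot product and a sum of
 * squares -- reuse one multiply-accumulate datapath selected by a
 * configuration word, instead of existing as two disjoint circuits. */
#include <stdio.h>

enum cfg { CFG_DOT, CFG_SUMSQ };

static int compound_mac(enum cfg c, const int *a, const int *b, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        int lhs = a[i];
        int rhs = (c == CFG_DOT) ? b[i] : a[i];  /* only the operand routing differs */
        acc += lhs * rhs;                        /* the MAC itself is shared */
    }
    return acc;
}

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("dot   = %d\n", compound_mac(CFG_DOT,   a, b, 4));
    printf("sumsq = %d\n", compound_mac(CFG_SUMSQ, a, b, 4));
    return 0;
}
```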
European Conference on Parallel Processing | 2004
Albert Cohen; Sylvain Girbal; Olivier Temam
We wish to extend the effectiveness of loop-restructuring compilers by improving the robustness of loop transformations and easing their composition in long sequences. We propose a formal and practical framework for program transformation. Our framework is well suited for iterative optimization techniques that search not only for the appropriate parameters of a given transformation, but also for the program transformations themselves, and especially for compositions of program transformations. This framework is based on a unified polyhedral representation of loops and statements, enabling the application of generalized control and data transformations without reference to a syntactic program representation. The key to our framework is to clearly separate the impact of each program transformation on three independent components: the iteration domain, the iteration schedule and the memory access functions. The composition of generalized transformations builds on normalization rules specific to each component of the representation. Our techniques have been implemented on top of Open64/ORC.
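A small sketch of why composition stays tractable in such a representation: with affine schedules, composing two loop transformations amounts to multiplying their schedule matrices. The 2x2 example below is invented for illustration and is not code from the paper.

```c
/* Illustrative only: composing two affine loop transformations as a product
 * of their schedule matrices (2x2 case, i.e. two loop dimensions). */
#include <stdio.h>

typedef struct { int m[2][2]; } sched_t;

static sched_t compose(sched_t outer, sched_t inner) {   /* outer applied after inner */
    sched_t r = {{{0, 0}, {0, 0}}};
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++)
                r.m[i][j] += outer.m[i][k] * inner.m[k][j];
    return r;
}

int main(void) {
    sched_t interchange = {{{0, 1}, {1, 0}}};   /* (i,j) -> (j,i)   */
    sched_t skew        = {{{1, 0}, {1, 1}}};   /* (i,j) -> (i,i+j) */
    sched_t seq = compose(skew, interchange);   /* interchange, then skew */
    printf("composed schedule: [[%d %d],[%d %d]]\n",
           seq.m[0][0], seq.m[0][1], seq.m[1][0], seq.m[1][1]);
    return 0;
}
```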
International Conference on Supercomputing | 2006
Nicolas Vasilache; Cédric Bastoul; Albert Cohen; Sylvain Girbal
The polyhedral model is a powerful framework to reason about high-level loop transformations. Yet the lack of scalable algorithms and tools has deterred actors from both academia and industry from putting this model to practical use. Indeed, for fundamental complexity reasons, its applicability has long been limited to simple kernels. Recent developments broke some generally accepted ideas about these limitations. In particular, new algorithms made it possible to compute the target code for full SPEC benchmarks, even though this code generation step was expected not to be scalable. Instancewise array dependence analysis computes a finite, intensional representation of the (statically unbounded) set of all dynamic dependences. This problem has always been considered non-scalable and/or overkill with respect to less expressive and faster dependence tests. On the contrary, this article presents experimental evidence of its applicability to full SPEC CPU2000 benchmarks. To make this possible, we revisit the characterization of data dependences, considering relations between time dimensions of the transformed space. Beyond algorithmic benefits, this naturally leads to a novel way of reasoning about violated dependences across arbitrary transformation sequences. Reasoning about violated dependences relieves the compiler designer from the cumbersome task of implementing specific legality checks for each single transformation. In the case of an invalid transformation, it also makes it possible to determine precisely which dependences are violated and need to be corrected. Identifying these violations can in turn enable automatic correction schemes that fix an illegal transformation sequence with minimal changes.
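To make the notion of a violated dependence concrete, the toy sketch below checks a candidate one-dimensional affine schedule against a flow dependence and reports every instance pair executed out of order; the statements and schedule coefficients are invented, not taken from the paper.

```c
/* Illustrative sketch (not the paper's algorithm): for a flow dependence
 *   S1: A[i] = ...   ->   S2: ... = A[i-1],
 * check whether a candidate schedule theta(S,i) = c[S]*i + d[S] keeps every
 * producer instance before its consumer; any pair executed out of order is a
 * violated dependence that would have to be corrected. */
#include <stdio.h>

#define N 16

static int theta(int c, int d, int i) { return c * i + d; }

int main(void) {
    /* candidate schedule: S1 at 1*i+0, S2 at 1*i-2 (deliberately illegal) */
    int c1 = 1, d1 = 0, c2 = 1, d2 = -2;
    int violations = 0;
    for (int i = 1; i < N; i++) {
        /* S2 at iteration i reads A[i-1], written by S1 at iteration i-1 */
        if (theta(c1, d1, i - 1) >= theta(c2, d2, i)) {
            printf("violated dependence: S1(%d) not before S2(%d)\n", i - 1, i);
            violations++;
        }
    }
    printf("%d violated instances\n", violations);
    return 0;
}
```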
Digital Systems Design | 2013
Marco Solinas; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Feng Li; Mikel Luján; Laurent Morin; Avi Mendelson; Nacho Navarro; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Mateo Valero; Sebastian Weis; Ian Watson; Stéphane Zuckerman; Roberto Giorgi
Thanks to improvements in semiconductor technologies, extreme-scale systems such as teradevices (i.e., chips composed of 1,000 billion transistors) will enable systems with 1000+ general-purpose cores per chip, probably by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses these challenges at once by leveraging dataflow principles. This paper describes the project and provides an overview of the research carried out by the TERAFLUX consortium.
Real-Time Networks and Systems | 2014
Angeliki Kritikakou; Christine Rochange; Madeleine Faugere; Claire Pagetti; Matthieu Roy; Sylvain Girbal; Daniel Gracia Pérez
When integrating mixed-critical systems on a multi/many-core, one challenge is to ensure predictability for high-criticality tasks and increased utilization for low-criticality tasks. In this paper, we address this problem when several high-criticality tasks with different deadlines, periods and offsets are concurrently executed on the system. We propose a distributed run-time WCET controller that works as follows: (1) locally, each critical task regularly checks whether the interferences due to the low-criticality tasks can still be tolerated, and otherwise requests their suspension; (2) globally, a master suspends and restarts the low-criticality tasks based on the requests received from the critical tasks. Our approach has been implemented as a software controller on a real multi-core COTS system and shows significant gains.
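A minimal sketch of the control logic described above, under invented names and thresholds (the paper's actual controller is not reproduced here): each critical task compares its slack against the worst-case time it still needs in isolation and asks the master to suspend or restart the low-criticality tasks accordingly.

```c
/* Minimal single-file sketch of the run-time WCET control idea; names,
 * numbers and thresholds are invented for illustration. */
#include <stdio.h>
#include <stdbool.h>

static bool low_crit_running = true;

/* Master side: acts on requests coming from the critical tasks. */
static void master_suspend_low_crit(void) { low_crit_running = false; puts("low-crit suspended"); }
static void master_restart_low_crit(void) { low_crit_running = true;  puts("low-crit restarted"); }

/* Local check executed at each observation point of a critical task. */
static void critical_task_checkpoint(int elapsed_us, int remaining_isolated_wcet_us,
                                     int deadline_us) {
    int slack = deadline_us - elapsed_us - remaining_isolated_wcet_us;
    if (slack <= 0 && low_crit_running)
        master_suspend_low_crit();     /* interference can no longer be tolerated */
    else if (slack > 0 && !low_crit_running)
        master_restart_low_crit();     /* enough margin again: resume low-crit load */
}

int main(void) {
    /* toy trace: elapsed time grows faster than expected because of interference */
    critical_task_checkpoint(100, 700, 1000);  /* slack 200: keep low-crit running */
    critical_task_checkpoint(450, 600, 1000);  /* slack -50: suspend low-crit      */
    critical_task_checkpoint(700, 250, 1000);  /* slack  50: restart low-crit      */
    return 0;
}
```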