Judit Planas
École Polytechnique Fédérale de Lausanne
Publications
Featured research published by Judit Planas.
Parallel Processing Letters | 2011
Alejandro Duran; Eduard Ayguadé; Rosa M. Badia; Jesús Labarta; Luis Martinell; Xavier Martorell; Judit Planas
In this paper, we present OmpSs, a programming model based on OpenMP and StarSs that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on different architectures: SMP, GPUs, and hybrid SMP/GPU environments, showing the wide applicability of the approach. The evaluation is done with six different benchmarks: Matrix Multiply, BlackScholes, Perlin Noise, Julia Set, PBPI, and FixedGrid. We compare the results obtained with the execution of the same benchmarks written in OpenCL or OpenMP on the same architectures. The results show that OmpSs greatly outperforms both environments. OmpSs offers a more flexible programming environment than traditional approaches for exploiting multiple accelerators, and the simplicity of its annotations increases programmer productivity.
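(A minimal sketch of the annotation style this paper describes, assuming OmpSs-like directive syntax; the kernel, array-section spellings, and sizes are illustrative, not taken from the paper.)

/* A CUDA kernel exposed as an OmpSs task: the runtime derives transfers and
 * dependences from the in/out clauses and launches it with the given ndrange. */
#pragma omp target device(cuda) ndrange(1, n, 128) copy_deps
#pragma omp task in(a[0;n], b[0;n]) out(c[0;n])
__global__ void vec_add(const float *a, const float *b, float *c, int n);

void compute(float *a, float *b, float *c, int n)
{
    vec_add(a, b, c, n);   /* looks like a plain call; becomes an asynchronous task */
    #pragma omp taskwait   /* wait until the task (and its data) completes */
}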
International Parallel and Distributed Processing Symposium | 2012
Javier Bueno; Judit Planas; Alejandro Duran; Rosa M. Badia; Xavier Martorell; Eduard Ayguadé; Jesús Labarta
Clusters of GPUs are emerging as a new computational scenario. Programming them requires the use of hybrid models that increase the complexity of the applications, reducing the productivity of programmers. We present the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially on a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves the data as needed between the different nodes and GPUs, minimizing the impact of communication through affinity scheduling, caching, and the overlap of communication with computation. We show several applications programmed with OmpSs and their performance with multiple GPUs in a local node and in remote nodes. The results show a good trade-off between performance and programmer effort.
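(A sketch of how the same annotations scale from one GPU to many, again assuming OmpSs-like syntax; the blocked multiply, block layout, and clause spellings are illustrative.)

#pragma omp target device(cuda) ndrange(2, bs, bs, 16, 16) copy_deps
#pragma omp task in(A[0;bs*bs], B[0;bs*bs]) inout(C[0;bs*bs])
__global__ void gemm_block(const double *A, const double *B, double *C, int bs);

void matmul(double **Ab, double **Bb, double **Cb, int nb, int bs)
{
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            for (int k = 0; k < nb; k++)
                /* Each call creates a task. The runtime places it on a local
                 * or remote GPU, keeps reused blocks cached on that device
                 * (affinity), and overlaps the transfers with computation. */
                gemm_block(Ab[i * nb + k], Bb[k * nb + j], Cb[i * nb + j], bs);
    #pragma omp taskwait
}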
International Parallel and Distributed Processing Symposium | 2013
Judit Planas; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta
As new heterogeneous systems and hardware accelerators appear, high-performance computers can reach a higher level of computational power. Nevertheless, this does not come for free: the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e., pieces of specific code targeted and optimized for a particular architecture) and how the system can choose among these versions at runtime to obtain the best performance achievable for the given application. From the results obtained in a multi-GPU system, we show that our proposal gives flexibility to application source code and can potentially increase application performance.
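(A sketch of the versioning extension described, assuming the OmpSs implements clause; function names and sizes are illustrative.)

#pragma omp target device(smp)
#pragma omp task in(x[0;n]) out(y[0;n])
void scale(const float *x, float *y, int n, float a)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}

/* A CUDA-specialized version of the same task: the runtime may pick either
 * implementation at run time, depending on which device performs best. */
#pragma omp target device(cuda) ndrange(1, n, 128) copy_deps implements(scale)
#pragma omp task in(x[0;n]) out(y[0;n])
__global__ void scale_cuda(const float *x, float *y, int n, float a);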
International Conference on Computational Science | 2015
Judit Planas; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta
Computational science has benefited in recent years from emerging accelerators that increase the performance of scientific simulations, but using these devices complicates the programming task. This paper presents AMA: a set of optimization techniques to efficiently manage multi-accelerator systems. AMA maximizes the overlap of computation and communication in a blocking-free way, so the time otherwise spent waiting for device operations can be used for other work. Implemented on top of a task-based framework, the experimental evaluation of AMA on a quad-GPU node shows that we reach the performance of a hand-tuned native CUDA code, with the advantage of fully hiding the device management. In addition, we obtain more than a 2x performance speed-up with respect to the original framework implementation.
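(A sketch of the blocking-free overlap idea as we read it, written against the standard CUDA runtime API; this is not AMA's actual code. Transfers and kernels are enqueued asynchronously and completion is polled with events, so the host thread never blocks on the device.)

#include <cuda_runtime.h>

/* Enqueue the transfer and the kernel without waiting for them. */
void submit_block(const float *h_in, float *d_in, size_t bytes,
                  cudaStream_t stream, cudaEvent_t done)
{
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    /* ... kernel launch on the same stream would go here ... */
    cudaEventRecord(done, stream);
}

/* cudaEventQuery returns immediately (cudaSuccess once the work has finished,
 * cudaErrorNotReady otherwise), so the caller can run other tasks meanwhile. */
int block_is_done(cudaEvent_t done)
{
    return cudaEventQuery(done) == cudaSuccess;
}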
IEEE International Conference on High Performance Computing, Data and Analytics | 2015
Judit Planas; Rosa M. Badia; Eduard Ayguadé; Jesús Labarta
High-performance computers can reach higher levels of computational power when combined with accelerators. Nevertheless, the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource management and work distribution. We present SSMART, a task-based scheduler that dynamically distributes work among the processing units of a heterogeneous system. Assuming that different specialized versions of tasks (i.e., pieces of specific code targeted and optimized for a particular architecture) are given, SSMART records statistics from previously executed tasks on each system device and dynamically adapts the workload distribution to achieve optimal performance. SSMART has been implemented on top of OmpSs, a programming model based on compiler directives. The results obtained on a multi-GPU system and a MIC+GPU system show that our proposal gives flexibility to applications and can potentially increase performance.
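(A sketch of the statistics-driven selection SSMART performs, with illustrative data structures rather than the actual implementation: keep a running mean of each task's execution time per device and prefer the idle device with the lowest recorded time, trying every device at least once.)

#include <float.h>

#define NDEV 4

typedef struct {
    double mean_time[NDEV];  /* running mean of execution time per device */
    long   count[NDEV];      /* samples recorded per device */
} task_stats;

void record(task_stats *s, int dev, double t)
{
    s->count[dev]++;
    s->mean_time[dev] += (t - s->mean_time[dev]) / s->count[dev];
}

int choose_device(const task_stats *s, const int idle[NDEV])
{
    int best = -1;
    double best_t = DBL_MAX;
    for (int d = 0; d < NDEV; d++) {
        /* Untried devices get priority so every version is eventually timed. */
        if (idle[d] && s->count[d] == 0)
            return d;
        if (idle[d] && s->mean_time[d] < best_t) {
            best = d;
            best_t = s->mean_time[d];
        }
    }
    return best;  /* -1 if no device is idle */
}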
International Parallel and Distributed Processing Symposium | 2016
Stefan Eilemann; Fabien Delalondre; Jon Bernard; Judit Planas; Felix Schuermann; John Biddiscombe; Costas Bekas; Alessandro Curioni; Bernard Metzler; Peter Kaltstein; Peter Morjan; Joachim Fenkes; Ralph Bellofatto; Lars Schneidenbach; T. J. Christopher Ward; Blake G. Fitch
Scientific workflows are often composed of compute-intensive simulations and data-intensive analysis and visualization, both equally important for productivity. High-performance computers run the compute-intensive phases efficiently, but data-intensive processing still receives less attention. Dense non-volatile memory integrated into supercomputers can help address this problem. In addition to density, it offers significantly finer-grained I/O than disk-based I/O systems. We present a way to exploit the fundamental capabilities of Storage-Class Memories (SCM), such as Flash, by using scalable key-value (KV) I/O methods instead of the traditional file I/O calls commonly used in HPC systems. Our objective is to enable higher performance for on-line and near-line storage for analysis and visualization of very high resolution, but correspondingly transient, simulation results. In this paper, we describe 1) the adaptation of a scalable key-value store to a BlueGene/Q system with integrated Flash memory, 2) a novel key-value aggregation module which implements coalesced, function-shipped calls between the clients and the servers, and 3) the refactoring of a scientific workflow to use application-relevant keys for fine-grained data subsets. The resulting implementation is analogous to function-shipping of POSIX I/O calls but shows an order-of-magnitude increase in read IOPS and a 2.5x increase in write IOPS (11 million read IOPS, 2.5 million write IOPS from 4,096 compute nodes) compared to a classical file system on the same machine. It represents an innovative approach to the integration of SCM within an HPC system at scale.
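(A sketch of the key-value I/O pattern described; the kv_put interface below is hypothetical, standing in for the scalable store and its aggregation module. The point is that application-relevant keys address fine-grained subsets directly, instead of encoding them as offsets into a shared file.)

#include <stdio.h>

/* Hypothetical KV call standing in for the scalable store's client API. */
int kv_put(const char *key, const void *val, size_t len);

void store_timestep(int step, int neuron, const float *voltages, size_t n)
{
    char key[64];
    /* The key says what the data is, not where it lives inside a file. */
    snprintf(key, sizeof key, "voltage/step=%d/neuron=%d", step, neuron);
    kv_put(key, voltages, n * sizeof *voltages);
}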
International Supercomputing Conference | 2017
Timothée Ewart; Judit Planas; Francesco Cremonesi; Kai Langen; Felix Schürmann; Fabien Delalondre
The increasing complexity and heterogeneity of extreme-scale systems makes the optimization of large-scale scientific applications particularly challenging. Efficiently leveraging these complex systems requires a great deal of technical expertise and a considerable amount of engineering time. The computational neuroscience community relies on a handful of frameworks to model the electrical activity of brain tissue at different scales. As the members of the Blue Brain Project actively contribute to a large part of those frameworks, it becomes mandatory to implement a strategy to reduce the overall development cost. Therefore, we present Neuromapp, a computational neuroscience mini-application framework. Neuromapp consists of a number of mini-apps (small standalone applications), each representing a single functionality of one of the large scientific frameworks. The collection of several mini-apps forms a skeleton that can reproduce the original workflow of the scientific application. Thus, it becomes easy to investigate single-component and workflow optimizations, new software and hardware systems, or future system designs. Solutions that prove successful can then be integrated into the large scientific applications, reducing the overall development and optimization effort.
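(A sketch of the mini-app idea with a hypothetical driver, not Neuromapp's actual interface: each mini-app is a small standalone entry point behind a common dispatcher, so a component can be benchmarked alone or chained with others to form a skeleton of the full workflow.)

#include <stdio.h>
#include <string.h>

typedef int (*miniapp_fn)(int argc, char **argv);

/* Stub mini-apps; each would isolate one functionality of a large framework. */
static int queueing_main(int argc, char **argv) { (void)argc; (void)argv; puts("event-queue mini-app"); return 0; }
static int kernel_main(int argc, char **argv)   { (void)argc; (void)argv; puts("compute-kernel mini-app"); return 0; }

static const struct { const char *name; miniapp_fn run; } apps[] = {
    { "queueing", queueing_main },
    { "kernel",   kernel_main   },
};

int main(int argc, char **argv)
{
    /* Dispatch to a single mini-app; running several in sequence reproduces
     * the original workflow as a skeleton. */
    for (size_t i = 0; i < sizeof apps / sizeof *apps; i++)
        if (argc > 1 && strcmp(argv[1], apps[i].name) == 0)
            return apps[i].run(argc - 1, argv + 1);
    fprintf(stderr, "usage: miniapp-driver <name> [args]\n");
    return 1;
}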
The Journal of Supercomputing | 2016
Adrián Castelló; Antonio J. Peña; Rafael Mayo; Judit Planas; Enrique S. Quintana-Ortí; Pavan Balaji
Directive-based programming models, such as OpenMP, OpenACC, and OmpSs, enable users to accelerate applications by using coprocessors with little effort. These devices offer significant computing power, but their use can introduce two problems: an increase in the total cost of ownership and their underutilization, because not all codes match their architecture. Remote accelerator virtualization frameworks address those problems. In particular, rCUDA provides transparent access to any graphics processing unit installed in a cluster, reducing the number of accelerators and increasing their utilization ratio. Joining these two technologies, directive-based programming models and rCUDA, is thus highly appealing. In this work, we study the integration of OmpSs and OpenACC with rCUDA, describing and analyzing several applications over three different hardware configurations that include two InfiniBand interconnects and three NVIDIA accelerators. Our evaluation reveals favorable performance results, showing low overhead and similar scaling factors when using remote accelerators instead of local devices.
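(A sketch of why the combination is transparent: a directive-annotated kernel like the OpenACC loop below compiles down to CUDA calls, which rCUDA can forward to a remote GPU, so switching between local and remote devices is a matter of runtime configuration rather than source changes. The code is illustrative, not from the paper.)

void saxpy(int n, float a, const float *x, float *y)
{
    /* The same source runs on a local GPU or an rCUDA-virtualized remote one. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}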
International Conference on Computational Science | 2018
Judit Planas; Fabien Delalondre; Felix Schürmann
Important progress in computational science has recently been made possible by the increasing computing power of high-performance systems. Following this trend, larger scientific studies, like brain tissue simulations, will continue to grow in the future. In addition to the challenges of conducting these experiments, we foresee an explosion in the amount of data generated and the consequent infeasibility of analyzing and understanding the results with current techniques.
International Parallel and Distributed Processing Symposium | 2016
Chris J. Newburn; Gaurav Bansal; Michael Wood; Luis Crivelli; Judit Planas; Alejandro Duran; Paulo Souza; Leonardo Borges; Piotr Luszczek; Stanimire Tomov; Jack J. Dongarra; Hartwig Anzt; Mark Gates; Azzam Haidar; Yulu Jia; Khairul Kabir; Ichitaro Yamazaki; Jesús Labarta