Harald Servat
Polytechnic University of Catalonia
Publications
Featured research published by Harald Servat.
International Parallel and Distributed Processing Symposium | 2010
Germán Llort; Juan Gonzalez; Harald Servat; Judit Gimenez; Jesús Labarta
With larger and larger systems constantly being deployed, trace-based performance analysis of parallel applications has become a challenging task. Even if the amount of performance data gathered per single process is small, traces rapidly become unmanageable when merging the information collected from all processes. In general, an efficient analysis of such a large volume of data requires a prior filtering step that directs the analyst's attention towards what is meaningful to understand the observed application behavior. Furthermore, the iterative nature of most scientific applications usually ends up producing repetitive information. Discarding irrelevant data aims at reducing both the size of traces and the time required to perform the analysis and deliver results. In this paper, we present an on-line analysis framework that relies on clustering techniques to intelligently select the most relevant information to understand how the application behaves, while keeping the volume of performance data at a reasonable size.
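The filtering idea above can be sketched in a few lines: group computation bursts by their performance metrics and keep one representative per group, so repetitive iterations collapse into a handful of clusters. This greedy distance-based grouping and the metric tuples are illustrative assumptions, not the paper's actual algorithm (which uses density-based clustering at scale).

```python
def cluster_bursts(bursts, radius=0.1):
    """Greedily cluster (ipc, l2_miss_ratio) tuples: a burst joins the
    first cluster whose representative is within `radius` per metric."""
    clusters = []  # each cluster is a list of bursts; element 0 is the representative
    for b in bursts:
        for c in clusters:
            if all(abs(x - y) <= radius for x, y in zip(b, c[0])):
                c.append(b)
                break
        else:
            clusters.append([b])
    return clusters

# Five bursts from an iterative code: three near-identical compute phases
# and two near-identical memory-bound phases collapse into two clusters.
bursts = [(1.0, 0.02), (1.01, 0.021), (0.4, 0.30), (0.99, 0.019), (0.41, 0.29)]
clusters = cluster_bursts(bursts)
print(len(clusters))  # 2 representatives instead of 5 raw bursts
```

Only the representatives need to be kept in the trace, which is the essence of discarding repetitive information while preserving the observed behavior.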
International Conference on Parallel Processing | 2009
Harald Servat; Germán Llort; Judit Gimenez; Jesús Labarta
Performance evaluation tools enable analysts to shed light on how applications behave, both from a general point of view and at concrete execution points, but they cannot provide detailed information beyond the monitored regions of code. Having the ability to determine when and which data have to be collected is crucial for a successful analysis. This is particularly true for trace-based tools, which can easily incur either unmanageably large traces or an information shortage. To mitigate the well-known resolution vs. usability trade-off, we present a procedure that obtains fine-grain performance information using coarse-grain sampling, projecting performance metrics scattered all over the execution into thoroughly detailed representative areas. This mechanism has been incorporated into the MPItrace tracing suite, greatly extending the amount of performance information gathered from statically instrumented points with further periodic samples collected beyond them. We have applied this solution to the analysis of two applications to introduce a novel performance analysis methodology based on the combination of instrumentation and sampling techniques.
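A minimal sketch of the projection idea: samples scattered across many repetitions of a region are folded, by their normalized offset, into a single synthetic representative region, reconstructing a fine-grain progression from coarse-grain sampling. The record layout and function name are assumptions for illustration; the real mechanism lives inside the MPItrace suite.

```python
def fold_samples(samples, boundaries):
    """Fold sampled (timestamp, value) pairs into one representative iteration.

    samples:    list of (timestamp, counter_value)
    boundaries: sorted start times of consecutive iterations (last entry closes
                the final iteration)
    Returns (relative_offset, value) pairs sorted by offset."""
    folded = []
    for t, v in samples:
        for i in range(len(boundaries) - 1):
            start, end = boundaries[i], boundaries[i + 1]
            if start <= t < end:
                folded.append(((t - start) / (end - start), v))
                break
    return sorted(folded)

boundaries = [0.0, 10.0, 20.0, 30.0]          # three identical iterations
samples = [(2.0, 5), (14.0, 9), (27.0, 17)]   # one sparse sample per iteration
print(fold_samples(samples, boundaries))
# three samples per folded iteration instead of one: finer detail "for free"
```

Three sparse samples, one per iteration, become three points within a single representative iteration, which is how coarse sampling yields fine-grain detail.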
IEEE International Conference on High Performance Computing, Data and Analytics | 2013
Bernd Mohr; Vladimir Voevodin; Judit Gimenez; Erik Hagersten; Andreas Knüpfer; Dmitry A. Nikitenko; Mats Nilsson; Harald Servat; Aamer Shah; Frank Winkler; Felix Wolf; Ilya Zhukov
To maximise the scientific output of a high-performance computing system, different stakeholders pursue different strategies. While individual application developers are trying to shorten the time to solution by optimising their codes, system administrators are tuning the configuration of the overall system to increase its throughput. Yet, the complexity of today’s machines with their strong interrelationship between application and system performance presents serious challenges to achieving these goals. The HOPSA project (HOlistic Performance System Analysis) therefore sets out to create an integrated diagnostic infrastructure for combined application and system-level tuning – with the former provided by the EU and the latter by the Russian project partners. Starting from system-wide basic performance screening of individual jobs, an automated workflow routes findings on potential bottlenecks either to application developers or system administrators with recommendations on how to identify their root cause using more powerful diagnostic tools. Developers can choose from a variety of mature performance-analysis tools developed by our consortium. Within this project, the tools will be further integrated and enhanced with respect to scalability, depth of analysis, and support for asynchronous tasking, a node-level paradigm playing an increasingly important role in hybrid programs on emerging hierarchical and heterogeneous systems.
IEEE International Conference on High Performance Computing, Data and Analytics | 2013
Germán Llort; Harald Servat; Juan Gonzalez; Judit Gimenez; Jesús Labarta
Understanding the behavior of a parallel application is crucial if we are to tune it to achieve its maximum performance. Yet the behavior the application exhibits may change over time and depend on the actual execution scenario: particular inputs and program settings, the number of processes used, or hardware-specific problems. So beyond the details of a single experiment, a far more interesting question arises: how does the application behavior respond to changes in the execution conditions? In this paper, we demonstrate that object-tracking concepts from computer vision have huge potential to be applied in the context of performance analysis. We leverage tracking techniques to analyze how the behavior of a parallel application evolves through multiple scenarios where the execution conditions change. This method provides comprehensible insights into the influence of different parameters on the application behavior, enabling us to identify the most relevant code regions and their performance trends.
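The tracking idea can be illustrated with a toy matcher: behavior clusters from one experiment are matched to those of the next by nearest centroid, so a code region's performance trend can be followed as conditions change. Simple nearest-neighbour matching stands in here for the paper's computer-vision tracking techniques; the metric tuples are invented.

```python
def track(prev_centroids, curr_centroids):
    """Map each current cluster index to the closest previous cluster index
    (squared Euclidean distance over the metric tuple)."""
    mapping = {}
    for j, c in enumerate(curr_centroids):
        best = min(range(len(prev_centroids)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(prev_centroids[i], c)))
        mapping[j] = best
    return mapping

# Two behavior clusters, e.g. (IPC, cache miss ratio), drift slightly
# between runs with different process counts; tracking keeps their identity.
run_a = [(1.0, 0.02), (0.4, 0.30)]
run_b = [(0.38, 0.33), (0.95, 0.03)]
print(track(run_a, run_b))  # cluster identities preserved across the two runs
```

Even though the clusters appear in a different order in the second run, the mapping lets the analyst say "this is still the same code region, and here is how its metrics moved."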
Bioinformatics | 2012
Carles Pons; Daniel Jiménez-González; Cecilia González-Álvarez; Harald Servat; Daniel Cabrera-Benítez; Xavier Aguilar; Juan Fernández-Recio
SUMMARY: The application of docking to large-scale experiments or the explicit treatment of protein flexibility are part of the new challenges in structural bioinformatics that will require large computer resources and more efficient algorithms. Highly optimized fast Fourier transform (FFT) approaches are broadly used in docking programs, but their optimal code implementation leaves hardware acceleration as the only option to significantly reduce the computational cost of these tools. In this work we present Cell-Dock, an FFT-based docking algorithm adapted to the Cell BE processor. We show that Cell-Dock runs faster than FTDock, with maximum speedups above 200×, while achieving results of similar quality.
AVAILABILITY AND IMPLEMENTATION: The source code is released under the GNU General Public License version 2 and can be downloaded from http://mmb.pcb.ub.es/~cpons/Cell-Dock.
CONTACT: [email protected] or [email protected]
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
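A 1-D analogy of the FFT trick at the heart of FFT-based docking codes such as FTDock and Cell-Dock: the overlap score of receptor vs. ligand at every translation is the circular cross-correlation of their discretized grids, computed in O(n log n) via the convolution theorem instead of O(n²) directly. The grids and values below are toy assumptions, not the real 3-D docking discretization or scoring.

```python
import numpy as np

def correlation_scores(receptor, ligand):
    """Circular cross-correlation via FFT: score[k] = sum_p receptor[p+k] * ligand[p]."""
    R = np.fft.fft(receptor)
    L = np.fft.fft(ligand)
    return np.real(np.fft.ifft(R * np.conj(L)))

# Two 1-D "shapes" on an 8-cell grid; the best-scoring shift aligns them.
receptor = np.array([0., 1., 1., 0., 0., 0., 0., 0.])
ligand   = np.array([0., 0., 0., 1., 1., 0., 0., 0.])
scores = correlation_scores(receptor, ligand)
best_shift = int(np.argmax(scores))
print(best_shift)  # circular translation giving maximal overlap
```

In 3-D, the same correlation is computed once per rotation of the ligand, which is why optimized FFT libraries and hardware acceleration dominate the runtime of such codes.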
IBM Journal of Research and Development | 2005
Xavier Martorell; Nils Smeds; Robert Walkup; José R. Brunheroto; George S. Almasi; John A. Gunnels; L DeRose; Jesús Labarta; F Escalé; Judit Gimenez; Harald Servat; José E. Moreira
Good performance monitoring is the basis of modern performance analysis tools for application optimization. We are providing a variety of such performance analysis tools for the new Blue Gene®/L supercomputer. Those tools can be divided into two categories: single-node performance tools and multinode performance tools. From a single-node perspective, we provide standard interfaces and libraries, such as PAPI and libHPM, that provide access to the hardware performance counters for applications running on the Blue Gene/L compute nodes. From a multinode perspective, we focus on tools that analyze Message Passing Interface (MPI) behavior. Those tools work by first collecting message-passing trace data when a program runs. The trace data is then used by graphical interface tools that analyze the behavior of applications. Using the current prototype tools, we demonstrate their usefulness and applicability with case studies of application optimization.
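A toy illustration of the multinode workflow described above: MPI trace records are collected while the program runs, then post-processed to report, for example, the time each rank spends inside MPI. The flat (rank, event, timestamp) record layout is an assumption for illustration; real Blue Gene/L traces are far richer.

```python
def mpi_time_per_rank(events):
    """Accumulate time between matching enter/exit MPI events per rank.

    events: list of (rank, 'enter'|'exit', timestamp), properly nested per rank."""
    totals, entered = {}, {}
    for rank, kind, ts in events:
        if kind == "enter":
            entered[rank] = ts
        else:
            totals[rank] = totals.get(rank, 0.0) + ts - entered.pop(rank)
    return totals

trace = [
    (0, "enter", 1.0), (0, "exit", 1.4),   # rank 0 spends 0.4 s in MPI
    (1, "enter", 1.1), (1, "exit", 2.0),   # rank 1 spends 0.9 s: likely waiting
    (0, "enter", 3.0), (0, "exit", 3.2),
]
print(mpi_time_per_rank(trace))
```

A large imbalance in this per-rank total is exactly the kind of pattern the graphical analysis tools make visible at a glance.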
International Conference on Parallel Architectures and Compilation Techniques | 2007
Harald Servat; Cecilia Gonzalez; Xavier Aguilar; Daniel Cabrera; Daniel A. Jiménez
We evaluate a well-known protein docking application from the bioinformatics field, Fourier Transform Docking (FTDock) (Gabb et al., 1997), on a blade with two 3.2 GHz Cell Broadband Engine (BE) processors (Kahle et al., 2005). FTDock takes a geometric complementarity approach to the protein docking problem and uses 3D FFTs to reduce the complexity of the algorithm. FTDock achieves a significant speedup when its most time-consuming functions are offloaded to the SPEs and vectorized. We show the evolution of the performance impact of offloading and vectorizing two functions of FTDock (CM and SC) on one SPU: the total execution time of FTDock when CM and SC run on the PPU (bar 1), when CM is offloaded (bar 2), when CM is also vectorized (bar 3), when SC is offloaded (bar 4), and when SC is also vectorized (bar 5). Parallelizing the functions that are not offloaded, for instance using OpenMP on the dual-thread PPE, helps to increase PPE pipeline utilization, system throughput, and the scalability of the application.
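The step-by-step gains of that experiment follow Amdahl-style reasoning, which can be modeled in a few lines: total runtime shrinks only in the fraction covered by each accelerated function. The fractions and per-function speedups below are invented for illustration, not the paper's measured numbers.

```python
def total_time(base, parts):
    """Amdahl-style runtime model.

    base:  runtime with everything on the PPE (seconds)
    parts: {function_name: (fraction_of_base, speedup_applied)}"""
    accelerated = sum(frac / speed for frac, speed in parts.values())
    serial = 1.0 - sum(frac for frac, _ in parts.values())
    return base * (serial + accelerated)

base = 100.0  # hypothetical baseline, all on the PPE
steps = [
    ("CM offloaded",                 {"CM": (0.5, 4.0)}),
    ("CM offloaded + vectorized",    {"CM": (0.5, 12.0)}),
    ("CM and SC off. + vectorized",  {"CM": (0.5, 12.0), "SC": (0.3, 10.0)}),
]
for label, parts in steps:
    print(f"{label}: {total_time(base, parts):.1f}s")
```

The model also makes the closing point of the abstract concrete: once CM and SC are accelerated, the remaining serial fraction dominates, so parallelizing the non-offloaded code on the PPE is the next win.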
International Conference on Parallel Processing | 2012
Harald Servat; Xavier Teruel; Germán Llort; Alejandro Duran; Judit Gimenez; Xavier Martorell; Eduard Ayguadé; Jesús Labarta
Parallelism has become increasingly commonplace with the advent of multicore processors. Although different parallel programming models have arisen to exploit the computing capabilities of such processors, developing applications that take advantage of them may not be easy. Worse, the performance achieved by the parallel version of an application may fall short of the developer's expectations as a result of suboptimal utilization of the resources offered by the processor. In this paper we present a fruitful synergy between a shared-memory parallel compiler and runtime and a performance extraction library. The objective of this work is not only to shorten the performance analysis life-cycle when parallelizing an application, but also to enrich the analysis of the parallel application by incorporating data known only to the compiler and runtime. Additionally, we present performance results obtained from executing an instrumented application and evaluate the overhead of the instrumentation.
Concurrency and Computation: Practice and Experience | 2016
Harald Servat; Germán Llort; Judit Gimenez; Jesús Labarta
On the road to Exascale computing, both performance and power must be tackled at multiple levels, from the whole system down to the processor. The processor itself is chiefly responsible for serial node performance and for most of the energy consumed by the system. Thus, it is important to have tools that simultaneously analyze both performance and energy efficiency at the processor level.
International Conference on Cluster Computing | 2017
Harald Servat; Antonio J. Peña; Germán Llort; Estanislao Mercadal; Hans-Christian Hoppe; Jesús Labarta
Multi-tiered memory systems, such as those based on Intel® Xeon Phi™ processors, are equipped with several memory tiers with different characteristics including, among others, capacity, access latency, bandwidth, energy consumption, and volatility. The proper distribution of the application data objects across the available memory layers is key to shortening the time-to-solution, but the way developers and end-users determine the most appropriate memory tier in which to place the application data objects has not been properly addressed to date. In this paper we present a novel methodology for building an extensible framework that automatically identifies and places the application's most relevant memory objects into the Intel Xeon Phi fast on-package memory. Our proposal works on top of in-production binaries by first exploring the application behavior and then substituting the dynamic memory allocations. This makes the proposal valuable even for end-users who cannot modify the application source code. We demonstrate the value of a framework based on our methodology for several relevant HPC applications, using different allocation strategies to help end-users improve performance with minimal intervention. The results of our evaluation reveal that our proposal is able to identify the key objects to promote into fast on-package memory in order to optimize performance, even surpassing hardware-based solutions.
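The placement decision such a framework automates can be sketched as a packing problem: given per-object reference counts obtained by profiling, greedily promote the objects with the highest references per byte until the fast on-package (MCDRAM) tier is full. Object names, sizes, and counts are hypothetical, and greedy ranking is only one of several plausible allocation strategies.

```python
def choose_promotions(objects, capacity_bytes):
    """Greedily pick objects for the fast tier by references-per-byte.

    objects: list of (name, size_bytes, ref_count)
    Returns the names of the promoted objects, in ranking order."""
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    promoted, used = [], 0
    for name, size, _ in ranked:
        if used + size <= capacity_bytes:
            promoted.append(name)
            used += size
    return promoted

objects = [
    ("grid",     8 << 30, 9_000_000),  # large, warm: too big for the fast tier
    ("halo",     1 << 20, 5_000_000),  # small, hot: best candidate
    ("lookup",   4 << 20, 4_000_000),  # small, hot
    ("scratch", 12 << 30,    10_000),  # large, cold
]
print(choose_promotions(objects, capacity_bytes=4 << 30))
```

In the actual framework this decision is applied by interposing on the dynamic memory allocations of the in-production binary, so no source changes are needed.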