Hugo Meyer
Barcelona Supercomputing Center
Publications
Featured research published by Hugo Meyer.
International Conference on Conceptual Structures | 2014
Hugo Meyer; Dolores Rexachs; Emilio Luque
With the growing scale of High Performance Computing applications comes an increase in the number of interruptions as a consequence of hardware failures. As parallel executions scale to hundreds of thousands of processes, fault tolerance is becoming an important matter. Uncoordinated fault tolerance protocols, such as message logging, seem to be the best option, since coordinated protocols might compromise application scalability. Considering that most of the overhead during failure-free executions is caused by message logging approaches, in this paper we propose a Hybrid Message Logging protocol. It combines the fast recovery of pessimistic receiver-based message logging with the low protection overhead of pessimistic sender-based message logging. Hybrid Message Logging aims to reduce the overhead introduced by pessimistic receiver-based approaches by allowing applications to continue normally before a received message is properly saved. To guarantee that no message is lost, a pessimistic sender-based log temporarily saves messages while the receiver fully saves its received messages. Experiments have shown overhead reductions of up to 43% compared with a pessimistic receiver-based logging approach.
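The message flow of such a hybrid protocol can be sketched as follows. This is a hypothetical toy model of the idea described in the abstract (not the authors' implementation): the sender keeps a temporary copy of each message until the receiver confirms that its own log has durably stored it, so the receiving application can proceed immediately without risking message loss.

```python
# Toy sketch of a hybrid sender/receiver message-logging protocol.
# All class and method names are illustrative, not from the paper.

class Sender:
    def __init__(self):
        self.temp_log = {}   # msg_id -> payload, kept until the receiver confirms
        self.next_id = 0

    def send(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.temp_log[msg_id] = payload   # pessimistic sender-side copy
        return msg_id, payload

    def on_logged_ack(self, msg_id):
        # Receiver has durably logged the message; the temporary copy can go.
        self.temp_log.pop(msg_id, None)

class Receiver:
    def __init__(self):
        self.delivered = []    # messages handed to the application immediately
        self.durable_log = []  # messages fully saved (may lag behind delivery)

    def receive(self, msg_id, payload):
        self.delivered.append(payload)   # application continues without waiting
        return msg_id                    # to be logged asynchronously

    def flush_log(self, msg_id, payload, sender):
        self.durable_log.append((msg_id, payload))
        sender.on_logged_ack(msg_id)     # sender may now drop its copy

s, r = Sender(), Receiver()
mid, data = s.send("state update")
r.receive(mid, data)
assert "state update" in r.delivered and mid in s.temp_log  # still protected by sender copy
r.flush_log(mid, data, s)
assert mid not in s.temp_log   # copy discarded once the receiver log is durable
```

The key property illustrated: at every instant each message exists either in the sender's temporary buffer or in the receiver's durable log, so a failure at any point loses nothing, yet the receiver never blocks on logging.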
International Conference on High Performance Computing and Simulation | 2015
Hugo Meyer; Jose Carlos Sancho; Wang Miao; Harm Dorren; Nicola Calabretta; Montse Farreras
This paper analyzes the performance impact of Optical Packet Switches (OPS) on parallel HPC applications. Because these devices cannot store light, when packets collide for access to the same output port of the switch, only one packet can proceed and the others are dropped. The analysis focuses on the negative impact of packet collisions in the OPS and the subsequent re-transmissions of dropped packets. To carry out this analysis we have developed a system simulator that mimics the behavior of real HPC application traffic and of optical network devices such as the OPS. Using real application traces, we have analyzed how message re-transmissions can affect parallel executions. In addition, we have developed a methodology that processes application traces and determines packet concurrency, i.e., the number of simultaneous packets that an application could transmit into the network. Results show that some applications can benefit from the advantages of OPS technology: among the applications analyzed, these are the ones showing less than 1% packet concurrency, whereas other applications could see their performance degraded by up to 65%. This impact depends mostly on application traffic behavior, which is successfully characterized by our proposed methodology.
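The packet-concurrency metric can be illustrated with a few lines of code. This is our own reading of the idea, not the authors' trace-processing tool: given the time interval each packet occupies on the network, concurrency is the fraction of packets that overlap in time with at least one other packet (and could therefore collide at an OPS output port).

```python
# Illustrative packet-concurrency computation over a list of (start, end)
# transmission intervals. Hypothetical sketch, not the paper's methodology code.

def packet_concurrency(intervals):
    """Fraction of packets whose interval overlaps some other packet's interval."""
    overlapping = 0
    for i, (s1, e1) in enumerate(intervals):
        if any(s1 < e2 and s2 < e1
               for j, (s2, e2) in enumerate(intervals) if j != i):
            overlapping += 1
    return overlapping / len(intervals)

# Three packets: the first two overlap in time, the third is isolated.
trace = [(0.0, 1.0), (0.5, 1.5), (3.0, 4.0)]
assert abs(packet_concurrency(trace) - 2 / 3) < 1e-9
```

An application whose trace yields a concurrency near zero sends packets that rarely coexist on the network, which matches the abstract's observation that such applications suffer almost no collision penalty on an OPS.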
Journal of Parallel and Distributed Computing | 2015
Marcela Castro-León; Hugo Meyer; Dolores Rexachs; Emilio Luque
The increasing failure rate in High Performance Computing encourages the investigation of fault tolerance mechanisms to guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault-tolerant model designed to be transparent both for applications and for message-passing libraries. The model detects failures in the communication socket caused by a faulty node; the affected processes are then recovered on a healthy node and the connections are reestablished without losing data. The Redundant Array of Distributed Independent Controllers architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery and masking. Decentralized algorithms allow the application to scale, which is a key property for current HPC systems. Three different rollback-recovery protocols are defined and discussed with the aim of offering alternatives that reduce overhead when multicore systems are used. A prototype has been implemented to carry out an exhaustive experimental evaluation with Master/Worker and Single Program Multiple Data execution models. Multiple workloads and an increasing number of processes have been taken into account to compare the above-mentioned protocols. The executions take place on two multicore Linux clusters with different socket communication libraries.

Highlights: A system-level fault-tolerant mechanism for message-passing applications. Fully decentralized and transparent, both for applications and for the communication library. Protection, detection and recovery functions implemented at the socket API level. Semi-coordinated vs. uncoordinated checkpoints: performance-based selection.
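The protect/detect/recover/mask cycle can be sketched in miniature. The following is a hypothetical toy model of the idea, not the RADIC prototype: a controller checkpoints process state, notices a broken channel to a node, restarts the affected process from its checkpoint on a healthy node, and hides the whole episode from the caller.

```python
# Minimal sketch of decentralized protection/detection/recovery/masking.
# Class and method names are illustrative, not from the paper.

class Controller:
    def __init__(self, nodes):
        self.nodes = nodes        # node name -> alive flag
        self.placement = {}       # process -> node it runs on
        self.checkpoints = {}     # process -> last saved state

    def protect(self, proc, node, state):
        """Uncoordinated checkpoint: save state and record placement."""
        self.placement[proc] = node
        self.checkpoints[proc] = state

    def send(self, proc, payload):
        """Masking: callers use this entry point and never observe the fault."""
        node = self.placement[proc]
        if not self.nodes[node]:          # detection: channel to node is down
            self.recover(proc)
        return self.placement[proc], payload

    def recover(self, proc):
        """Restart the process from its checkpoint on a healthy node."""
        healthy = next(n for n, up in self.nodes.items() if up)
        self.placement[proc] = healthy

ctl = Controller({"node0": True, "node1": True})
ctl.protect("worker", "node1", state={"iter": 10})
ctl.nodes["node1"] = False            # node1 crashes
dest, _ = ctl.send("worker", "task")  # send triggers detection + recovery
assert dest == "node0"                # message transparently rerouted
```

Because the fault is absorbed inside `send`, the application-level caller is unaware of the failure, which is the transparency property the paper claims at the socket API level.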
Trust, Security and Privacy in Computing and Communications | 2013
Hugo Meyer; Ronal Muresano; Dolores Rexachs; Emilio Luque
When running parallel applications on HPC clusters, the primary objectives are usually near-linear speedup, efficient resource utilization, scalability and successful completion. Application executions therefore face a multiobjective problem focused on improving performance while providing Fault Tolerance (FT) support; this combination is known as performability. The performance of Single Program Multiple Data (SPMD) applications written with a message-passing library (MPI) may be seriously affected by a message logging approach, because such applications are tightly coupled and communicate heavily. We have therefore proposed a novel method for SPMD applications that obtains the maximum speedup under a defined efficiency threshold while considering the impact of a fault tolerance strategy when executing on multicore clusters. The method is based on four phases: characterization, tile distribution, mapping and scheduling. Its aim is to manage the added overhead of FT techniques, which seriously affects MPI application performance; in particular, it hides the overheads of message logging by overlapping them with computation. The main objective is to determine the approximate number of computational cores and the ideal number of tiles that yield a suitable balance between speedup, efficiency and dependability. The results illustrate that we can find the maximum speedup under a defined efficiency using an FT strategy, with a small error rate of 5.4% in the worst case. Using our method, we can also determine the ideal problem size for a given number of computational cores (weak scalability) using FT, with an error of around 5.8%. Results also show that our message logging approach can be tuned to introduce a constant overhead percentage when scaling the problem size.
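The central trick, hiding logging overhead behind computation, reduces to a simple observation. The numbers and function below are our own illustration, not the paper's model: when logging and computation are overlapped, the visible per-iteration cost is the maximum of the two, rather than their sum.

```python
# Toy illustration of overlapping message-logging overhead with computation.
# Times are arbitrary illustrative units, not measurements from the paper.

def iteration_time(compute, logging, overlapped):
    """Visible cost of one SPMD iteration with/without overlap."""
    return max(compute, logging) if overlapped else compute + logging

# With enough computation per core (e.g. more tiles), logging is fully hidden.
assert iteration_time(compute=8.0, logging=3.0, overlapped=True) == 8.0
# Without overlap, the logging overhead is paid in full.
assert iteration_time(compute=8.0, logging=3.0, overlapped=False) == 11.0
```

This is why the method tunes the number of tiles per core: increasing the computation assigned to each core raises `compute` relative to `logging`, keeping the FT overhead off the critical path.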
Future Generation Computer Systems | 2017
Ronal Muresano; Hugo Meyer; Dolores Rexachs; Emilio Luque
Executing traditional Message Passing Interface (MPI) applications on multi-core clusters while balancing speed and computational efficiency is a difficult task for parallel programmers. For this reason, communications on multi-core clusters ought to be handled carefully in order to improve performance metrics such as efficiency, speedup, execution time and scalability. In this paper we focus on SPMD (Single Program Multiple Data) applications with high communication volume and synchronicity that are also static, local and regular. This work proposes a method for SPMD applications which manages the communication heterogeneity (different cache levels, RAM memory, network, etc.) of homogeneous multi-core computing platforms in order to improve application efficiency. The main objective is to find analytically the number of cores that yields the maximum speedup while computational efficiency is maintained above a defined threshold (strong scalability). The method also determines how the problem size must be increased in order to keep the execution time constant as the number of cores grows (weak scalability), considering the tradeoff between speed and efficiency. The methodology has been tested with different benchmarks and applications, achieving an average efficiency improvement of around 30.35% across the applications tested with different problem sizes and multi-core clusters. In addition, results show that the maximum speedup for a defined efficiency lies close to the values calculated with our analytical model, with an error rate lower than 5% for the applications tested.
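The strong-scalability decision above can be sketched with a stand-in performance model. Here a simple Amdahl-style formula replaces the paper's analytical model (which characterizes the platform's communication levels); the search itself, picking the largest core count whose efficiency stays above the threshold, is the part being illustrated.

```python
# Hedged sketch: choose the core count maximizing speedup subject to an
# efficiency threshold. The Amdahl-style cost model is our own stand-in,
# not the paper's communication-aware analytical model.

def best_cores(total_work, serial_frac, threshold, max_cores):
    best = 1
    for p in range(1, max_cores + 1):
        t1 = total_work
        tp = total_work * (serial_frac + (1 - serial_frac) / p)
        speedup = t1 / tp
        efficiency = speedup / p
        if efficiency >= threshold:
            best = p   # speedup grows with p, so keep the largest feasible p
    return best

# With 5% serial work and an 80% efficiency target, scaling stops at 6 cores:
# efficiency(p) = 1 / (0.05*p + 0.95) drops below 0.8 at p = 7.
p = best_cores(total_work=100.0, serial_frac=0.05, threshold=0.80, max_cores=64)
assert p == 6
```

Swapping the toy cost model for a measured, per-platform characterization (as the paper does) leaves the selection logic unchanged, which is what makes the approach portable across clusters.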
International Conference on Conceptual Structures | 2017
Hugo Meyer; Jose Carlos Sancho; Josue V. Quiroga; Ferad Zyulkyarov; Damian Roca; Mario Nemirovsky
Next generation data centers will likely be based on the emerging paradigm of disaggregated function-blocks-as-a-unit, departing from the current state of mainboard-as-a-unit. Multiple functional blocks, or bricks, such as compute, memory and peripherals will be spread throughout the system and interconnected via one or multiple high speed networks. A very large amount of memory will be available, distributed among multiple bricks. This new architecture brings various benefits that are desirable in today's data centers, such as fine-grained technology upgrade cycles, fine-grained resource allocation, and access to a larger amount of memory and accelerators. An analysis of the impact and benefits of memory disaggregation is presented in this paper. One of the biggest challenges when analyzing these architectures is that memory accesses must be modeled correctly in order to obtain accurate results; however, modeling every memory access would generate so much overhead that simulation becomes unfeasible for real data center applications. A model to represent and analyze memory disaggregation has been designed, and a statistics-based, queuing-based full-system simulator was developed to rapidly and accurately analyze application performance in disaggregated systems. With a mean error of 10%, simulation results point out that the network layers may introduce overheads that degrade application performance by up to 66%. Initial results also suggest that low memory access bandwidth may degrade application performance by up to 20%.
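A back-of-the-envelope version of the remote-memory penalty can make the reported numbers concrete. This is purely illustrative arithmetic (our own, not the paper's simulator): split application time into a memory-bound fraction and scale only that fraction by the remote-to-local access-latency ratio.

```python
# Illustrative first-order model of slowdown from disaggregated (remote) memory.
# Fractions and latencies below are made-up examples, not paper measurements.

def disaggregated_slowdown(mem_frac, local_ns, remote_ns):
    """Runtime multiplier when memory-bound time scales with the latency ratio."""
    return (1 - mem_frac) + mem_frac * (remote_ns / local_ns)

# If 20% of runtime is memory-bound and remote access is 4x slower than local,
# the application runs 1.6x slower overall.
assert abs(disaggregated_slowdown(0.2, 100, 400) - 1.6) < 1e-9
```

The paper's simulator refines exactly the term this toy model treats as a constant ratio, by queuing memory requests through the modeled network layers instead of scaling them uniformly.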
Future Generation Computer Systems | 2017
Hugo Meyer; Jose Carlos Sancho; Milica Mrdakovic; Wang Miao; Nicola Calabretta
Optical Packet Switches (OPS) could provide the low-latency transmissions needed in today's large data centers. OPS can deliver lower latency and higher bandwidth than traditional electrical switches, features needed by parallel High Performance Computing (HPC) applications. For this purpose, full optical network architectures for HPC systems, such as the Architecture-On-Demand (AoD) network infrastructure, have recently been designed. Although light-based transmission has advantages over electrical transmission, optical devices such as the OPS cannot store light. Therefore, when an optical packet collision occurs for access to the same output port of the OPS, only one packet can proceed and the others must be dropped, afterwards triggering a retransmission procedure. Packet retransmissions delay the actual transmission and also increase buffer utilization at the network interface cards (NICs) that handle retransmissions. In this paper we propose a technique that maps application processes to servers so as to reduce the number of simultaneous packets in the network (concurrency); it thereby significantly reduces optical collisions at the OPS while substantially reducing the resources needed at the NICs for retransmissions. Our proposed concurrency-aware mapping technique can reduce the extra buffer size utilization by up to 4.2 times and the execution time degradation by up to 2.6 times.
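One plausible shape for such a mapping is a greedy heuristic: co-locate the most heavily communicating process pairs on the same server so their traffic never enters the optical network at all. This sketch is our own hypothetical heuristic in the spirit of the abstract, not the authors' algorithm.

```python
# Hypothetical greedy concurrency-aware mapping: heaviest-talking pairs
# are placed on the same server when capacity allows, removing their
# packets from the optical network entirely.

def map_processes(traffic, n_servers, per_server):
    # traffic: {(p, q): bytes exchanged}; heaviest pairs are placed first.
    placement, load = {}, [0] * n_servers
    for (p, q), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        for proc in (p, q):
            if proc not in placement:
                partner = q if proc == p else p
                target = placement.get(partner)
                if target is None or load[target] >= per_server:
                    target = min(range(n_servers), key=lambda s: load[s])
                placement[proc] = target
                load[target] += 1
    return placement

# Two heavy pairs and one light cross edge; two servers with two slots each.
traffic = {("a", "b"): 900, ("c", "d"): 800, ("a", "c"): 10}
m = map_processes(traffic, n_servers=2, per_server=2)
assert m["a"] == m["b"] and m["c"] == m["d"]   # heavy pairs co-located
```

Only the light ("a", "c") traffic crosses servers here, so only those packets can ever collide at an OPS, which is the concurrency reduction the technique aims for.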
International Conference on Distributed Computing and Networking | 2016
Hugo Meyer; Jose Carlos Sancho; Milica Mrdakovic; Shuping Peng; Dimitra Simeonidou; Wang Miao; Nicola Calabretta
This paper analyzes methodologies that allow Architecture-On-Demand (AoD) based optical networks to scale properly. As data centers and HPC systems grow in size and complexity, optical networks seem to be the way to scale the bandwidth of current network infrastructures. To scale the number of servers connected to optical switches, Dense Wavelength Division Multiplexing (DWDM) is normally used to group several servers on one fiber. Using DWDM limits the number of servers per fiber to the number of wavelengths the fiber supports, and may also increase the number of packet collisions. Our proposal uses Time Division Multiplexing (TDM) to place multiple servers on each wavelength, allowing the system to scale to a larger number of servers per switch. Initial results show that TDM obtains performance similar to DWDM. For some of the applications, TDM can outperform DWDM by up to 2.4% in terms of execution time.
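The capacity argument behind the TDM proposal is simple multiplication. The figures below are our own illustration (the paper does not give these particular counts): DWDM caps servers per fiber at the wavelength count, while adding TDM multiplies that cap by the number of time slots per wavelength.

```python
# Capacity arithmetic for combining DWDM and TDM on one fiber.
# Wavelength and slot counts are illustrative, not from the paper.

def servers_per_fiber(wavelengths, tdm_slots=1):
    """Maximum servers sharing one fiber: one per (wavelength, time slot) pair."""
    return wavelengths * tdm_slots

assert servers_per_fiber(wavelengths=40) == 40                 # DWDM only
assert servers_per_fiber(wavelengths=40, tdm_slots=4) == 160   # DWDM + TDM
```

The tradeoff not captured here is that each server's share of a wavelength's bandwidth shrinks with the slot count, which is why the paper compares the two schemes on real application execution times rather than raw capacity.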
International Performance Computing and Communications Conference | 2018
Carlos Vega; Jose Fernando Zazo; Hugo Meyer; Ferad Zyulkyarov; Sergio López-Buedo; Javier Aracil
Traditional data centers are designed with a rigid architecture of fit-for-purpose servers that provision resources beyond the average workload in order to deal with occasional peaks of data. Heterogeneous data centers are pushing towards more cost-efficient architectures with better resource provisioning. In this paper we study the feasibility of using disaggregated architectures for data-intensive applications, in contrast to the monolithic approach of server-oriented architectures. In particular, we have tested a proactive network analysis system in which the workload demands are highly variable. In the context of the dReDBox disaggregated architecture, the results show that the overhead caused by using remote memory resources is significant, between 66% and 80%, but we have also observed that memory usage is one order of magnitude higher in the stress case than under average workloads. Dimensioning memory for the worst case in conventional systems will therefore result in a notable waste of resources. Finally, we found that, for the selected use case, parallelism is limited by memory. Using a disaggregated architecture will therefore allow for increased parallelism which, at the same time, will mitigate the overhead caused by remote memory.
Journal of Parallel and Distributed Computing | 2017
Hugo Meyer; Ronal Muresano; Marcela Castro-León; Dolores Rexachs; Emilio Luque
With the growing scale of HPC applications, there has been an increase in the number of interruptions as a consequence of hardware failures. The remarkable decrease of the Mean Time Between Failures (MTBF) in current systems encourages research into suitable fault tolerance solutions. Message logging combined with uncoordinated checkpointing constitutes a scalable rollback-recovery solution; however, message logging techniques are usually responsible for most of the overhead during failure-free executions. Taking this into consideration, this paper proposes the Hybrid Message Pessimistic Logging (HMPL), which combines the fast recovery of pessimistic receiver-based message logging with the low failure-free overhead of pessimistic sender-based message logging. The HMPL manages messages using a distributed controller and storage to avoid harming system scalability. Experiments show that the HMPL reduces overhead by 34% during failure-free executions and by 20% in faulty executions when compared with a pessimistic receiver-based message logging approach.

Highlights: A low-overhead Hybrid Message Pessimistic Logging (HMPL) protocol is presented. The HMPL focuses on providing fast recovery with low failure-free overhead. A temporary buffer in senders is used to reduce penalties on critical paths. A detailed comparison of the HMPL with a classic receiver-based logging approach is presented. Overhead reductions of up to 34% in failure-free and 20% in faulty executions.