Publication


Featured research published by Jérôme Vienne.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes

Hari Subramoni; Sreeram Potluri; Krishna Chaitanya Kandalla; Bill Barth; Jérôme Vienne; Jeff Keasler; Karen Tomko; Karl W. Schulz; Adam Moody; Dhabaleswar K. Panda

Over the last decade, InfiniBand has become an increasingly popular interconnect for deploying modern supercomputing systems. However, there exists no detection service that can discover the underlying network topology in a scalable manner and expose this information to runtime libraries and users of the high performance computing systems in a convenient way. In this paper, we design a novel and scalable method to detect the InfiniBand network topology by using Neighbor-Joining techniques (NJ). To the best of our knowledge, this is the first instance where the neighbor joining algorithm has been applied to solve the problem of detecting InfiniBand network topology. We also design a network-topology-aware MPI library that takes advantage of the network topology service. The library places processes taking part in the MPI job in a network-topology-aware manner with the dual aim of increasing intra-node communication and reducing the long distance inter-node communication across the InfiniBand fabric.
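
The paper's topology service API is not described in this abstract, so the following C sketch only illustrates the placement idea: ranks reported as sharing a leaf switch by a hypothetical query_leaf_switch_id() helper are grouped into a sub-communicator with MPI_Comm_split, so their heaviest traffic can be kept off long inter-switch paths. An MPI library would perform this reordering internally at job launch; the sketch surfaces the idea at the application level.

```c
/* Illustrative sketch only: query_leaf_switch_id() is a hypothetical stand-in
 * for whatever interface exposes the detected InfiniBand topology. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical: returns an integer identifying the leaf switch this host
 * hangs off, as discovered by a topology detection service. */
static int query_leaf_switch_id(void)
{
    /* Placeholder: derive a fake id from the host name for demonstration. */
    char name[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(name, &len);
    unsigned hash = 5381;
    for (int i = 0; i < len; i++)
        hash = hash * 33u + (unsigned char)name[i];
    return (int)(hash % 16);   /* pretend there are 16 leaf switches */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Group ranks that sit under the same leaf switch into one communicator,
     * so heavy neighbor traffic can stay off the long inter-switch paths. */
    int switch_id = query_leaf_switch_id();
    MPI_Comm switch_comm;
    MPI_Comm_split(MPI_COMM_WORLD, switch_id, world_rank, &switch_comm);

    int local_rank, local_size;
    MPI_Comm_rank(switch_comm, &local_rank);
    MPI_Comm_size(switch_comm, &local_size);
    printf("world %d/%d -> switch %d, local %d/%d\n",
           world_rank, world_size, switch_id, local_rank, local_size);

    MPI_Comm_free(&switch_comm);
    MPI_Finalize();
    return 0;
}
```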


International Conference on Parallel Processing | 2013

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Jérôme Vienne; Rob F. Van der Wijngaart; Lars Koesterke; Ilya Sharapov

NAS Parallel Benchmarks (NPB) are a set of applications commonly used to evaluate parallel systems. We use the NPB-OpenMP version to examine the performance of Intel's new Xeon Phi co-processor, focusing in particular on the many-core aspect of the Xeon Phi architecture. A first analysis studies scalability up to 244 threads on 61 cores and the impact of affinity settings on scaling. It also compares the performance characteristics of the Xeon Phi and traditional Xeon CPUs. The application of several well-established optimization techniques allows us to identify common bottlenecks that can specifically impede performance on the Xeon Phi but are not as severe on multi-core CPUs. We also find that many of the OpenMP-parallel loops are too short (in terms of the number of loop iterations) for balanced execution by 244 threads. New or redesigned benchmarks will be needed to accommodate the greatly increased number of cores and threads. Finally, we summarize our findings in a set of recommendations for performance optimization on the Xeon Phi.
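
As a minimal illustration of the load-balance point above (not code from NPB), the OpenMP loop below has fewer iterations than the 244 hardware threads a Xeon Phi exposes, so some threads inevitably idle; the comment shows the standard environment variables typically used to control thread count and placement.

```c
/* Minimal OpenMP sketch: a parallel loop with fewer iterations than threads
 * leaves cores idle, which hurts far more at 244 threads than at 16. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 200;          /* fewer iterations than 244 threads */
    double sum = 0.0;

    /* Thread count and placement are typically controlled from the environment:
     *   OMP_NUM_THREADS=244 OMP_PROC_BIND=close OMP_PLACES=threads ./a.out
     * (KMP_AFFINITY offers similar control with the Intel runtime.) */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < n; i++)
        sum += (double)i * i;

    printf("max threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}
```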


High Performance Interconnects | 2012

Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems

Jérôme Vienne; Jitong Chen; Md. Wasi-ur-Rahman; Nusrat Sharmin Islam; Hari Subramoni; Dhabaleswar K. Panda

Communication interfaces of high performance computing (HPC) systems and clouds have been continually evolving to meet the ever increasing communication demands placed on them by HPC applications and cloud computing middleware (e.g., Hadoop). PCIe interfaces can now deliver speeds up to 128 Gbps (Gen3), and high performance interconnects (10/40 GigE, InfiniBand 32 Gbps QDR, InfiniBand 54 Gbps FDR, 10/40 GigE RDMA over Converged Ethernet) are capable of delivering speeds from 10 to 54 Gbps. However, no previous study has demonstrated how much benefit an end user in the HPC / cloud computing domain can expect from utilizing newer generations of these interconnects over older ones, or how one type of interconnect (such as IB) performs in comparison to another (such as RoCE). In this paper we evaluate various high performance interconnects over the new PCIe Gen3 interface with HPC as well as cloud computing workloads. Our comprehensive analysis, done at different levels, provides a global view of the impact these modern interconnects have on the performance of HPC applications and cloud computing middleware. The results of our experiments show that the latest InfiniBand FDR interconnect gives the best performance for HPC as well as cloud computing applications.
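
The study relies on established benchmark suites; purely as a sketch of what a micro-benchmark-level measurement looks like, the following two-rank MPI ping-pong estimates point-to-point bandwidth over whichever interconnect the MPI library is using.

```c
/* Bare-bones ping-pong bandwidth probe (a sketch of the measurement idea,
 * not the suites used in the study). Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;
    const int msg = 1 << 20;                /* 1 MiB messages */
    char *buf = malloc(msg);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double secs = (t1 - t0) / (2.0 * iters);   /* one-way time per message */
        printf("~%.2f MB/s per direction\n", msg / secs / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```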


International Conference on Cluster Computing | 2011

Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters

Hari Subramoni; Krishna Chaitanya Kandalla; Jérôme Vienne; Sayantan Sur; Bill Barth; Karen Tomko; Robert T. McLay; Karl W. Schulz; Dhabaleswar K. Panda

It is an established fact that network topology can have an impact on the performance of scientific parallel applications. However, little work has been done to design an easy-to-use solution inside a communication library supporting a parallel programming model, where the complexity of making application performance agnostic to the network topology is hidden from the end user. Similarly, rapid improvements in networking technology and speed are resulting in many commodity clusters becoming heterogeneous with respect to networking speed. For example, switches and adapters belonging to different generations (SDR - 8 Gbps, DDR - 16 Gbps and QDR - 32 Gbps speeds in InfiniBand) are integrated into a single system. This poses an additional challenge: making the communication library aware of the performance implications of heterogeneous link speeds, so that it can perform optimizations that take link speed into account. In this paper, we propose a framework to automatically detect the topology and speed of an InfiniBand network and make them available to users through an easy-to-use interface. We also make design changes inside the MPI library to dynamically query this topology detection service and form a topology model of the underlying network. We have redesigned the broadcast algorithm to take this network topology information into account and dynamically adapt the communication pattern to best fit the characteristics of the underlying network. To the best of our knowledge, this is the first such work for InfiniBand clusters. Our experimental results show that, for large homogeneous systems and large message sizes, we get up to 14% improvement in the latency of the broadcast operation using our proposed network topology-aware scheme over the default scheme at the micro-benchmark level. At the application level, the proposed framework delivers up to 8% improvement in total application run-time, especially as job size scales up. The proposed network speed-aware algorithms allow micro-benchmarks run on the heterogeneous SDR-DDR InfiniBand cluster to perform on par with runs on the DDR-only portion of the cluster for small to medium sized messages. We also demonstrate that the network speed-aware algorithms perform 70% to 100% better than the naive algorithms when both are run on the heterogeneous SDR-DDR InfiniBand cluster.
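
As a hedged sketch of the general idea behind a hierarchical, topology-aware broadcast (the paper's actual algorithm also accounts for switch layout and link speeds), the code below broadcasts between one leader per node first and then within each node, using only standard MPI calls.

```c
/* Two-level broadcast sketch: inter-node stage among node leaders, then an
 * intra-node stage. A toy version of the hierarchical idea only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Intra-node communicator: ranks sharing memory (i.e., the same node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Inter-node communicator containing only the node leaders (node_rank 0). */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    int value = (rank == 0) ? 42 : -1;

    /* Stage 1: the root's value crosses the fabric only once per node. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(&value, 1, MPI_INT, 0, leader_comm);

    /* Stage 2: each leader fans the value out over fast intra-node paths. */
    MPI_Bcast(&value, 1, MPI_INT, 0, node_comm);

    printf("rank %d got %d\n", rank, value);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```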


Proceedings of the 22nd European MPI Users' Group Meeting | 2015

MPI Advisor: a Minimal Overhead Tool for MPI Library Performance Tuning

Esthela Gallardo; Jérôme Vienne; Leonardo Fialho; Patricia J. Teller; James C. Browne

A majority of parallel applications executed on HPC clusters use MPI for communication between processes. Most users treat MPI as a black box, executing their programs with the cluster's default settings. While the default settings perform adequately in many cases, it is well known that optimizing the MPI environment can significantly improve application performance. Although existing optimization tools are effective when used by performance experts, they require deep knowledge of MPI library behavior and of the underlying hardware architecture on which the application will be executed. Therefore, an easy-to-use tool that provides recommendations for configuring the MPI environment to optimize application performance is highly desirable. This paper addresses this need by presenting an easy-to-use methodology and tool, named MPI Advisor, that requires just a single execution of the input application to characterize its predominant communication behavior and determine the MPI configuration that may enhance its performance on the target combination of MPI library and hardware architecture. Currently, MPI Advisor provides recommendations that address the four most commonly occurring MPI-related performance bottlenecks, which are related to the choice of: 1) point-to-point protocol (eager vs. rendezvous), 2) collective communication algorithm, 3) MPI task-to-core mapping, and 4) InfiniBand transport protocol. The performance gains obtained by implementing the recommended optimizations in the case studies presented in this paper range from a few percent to more than 40%. Specifically, using this tool, we were able to improve the performance of HPCG with MVAPICH2 on four nodes of the Stampede cluster from 6.9 GFLOP/s to 10.1 GFLOP/s. Since the tool provides application-specific recommendations, it also informs the user about correct usage of MPI.
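
As an illustration of one kind of characterization such a tool performs, the sketch below intercepts point-to-point sends through the standard PMPI profiling interface and counts how many fall below an assumed eager threshold; ASSUMED_EAGER_THRESHOLD is a made-up constant here, since real thresholds are specific to the MPI library and interconnect. This is not MPI Advisor's implementation, only the underlying idea. Compiling this file into the application (or preloading it as a shared object) activates the wrappers without source changes.

```c
/* PMPI-based sketch: classify sends as eager-sized or rendezvous-sized
 * relative to an assumed threshold, then report the tally at finalize. */
#include <mpi.h>
#include <stdio.h>

#define ASSUMED_EAGER_THRESHOLD 17408   /* bytes; hypothetical, library-specific in reality */

static long eager_msgs = 0, rendezvous_msgs = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(type, &size);
    if ((long)count * size <= ASSUMED_EAGER_THRESHOLD)
        eager_msgs++;
    else
        rendezvous_msgs++;
    return PMPI_Send(buf, count, type, dest, tag, comm);   /* forward to the real call */
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: %ld eager-sized, %ld rendezvous-sized sends\n",
            rank, eager_msgs, rendezvous_msgs);
    return PMPI_Finalize();
}
```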


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

A Comparative Study of Application Performance and Scalability on the Intel Knights Landing Processor

Carlos Rosales; John Cazes; Kent Milfeld; Antonio Gómez-Iglesias; Lars Koesterke; Lei Huang; Jérôme Vienne

Intel Knights Landing represents a qualitative change in the Many Integrated Core architecture. It offers a self-hosted option and includes high-speed integrated memory together with a two-dimensional mesh that interconnects the cores. This leads to a number of possible runtime configurations with different characteristics and implications for application performance. This paper presents a study of the performance differences observed when using the three available MCDRAM configurations in combination with the three possible memory access, or cluster, modes. We analyze the effects that memory affinity and process pinning have on different applications. The Mantevo suite of mini-applications and the NAS Parallel Benchmarks are used to analyze the behavior of very different application kernels, from molecular dynamics to CFD mini-applications. Two full applications, the Weather Research and Forecasting (WRF) model and a Lattice Boltzmann Suite (LBS3D), are also analyzed in detail to complete the study and present scalability results for a variety of applications.
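
In flat MCDRAM mode, applications typically place bandwidth-critical data in high-bandwidth memory explicitly; assuming the memkind library and its hbwmalloc interface are available, a minimal sketch looks like the following. In cache mode no source changes are needed, and numactl can instead bind an unmodified binary to the MCDRAM NUMA node.

```c
/* Sketch of explicit MCDRAM placement in flat mode via memkind's hbwmalloc
 * interface (assumed installed). Compile with -lmemkind. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 24;

    if (hbw_check_available() != 0) {
        fprintf(stderr, "no high-bandwidth memory visible (cache mode or non-KNL host)\n");
        return 1;
    }

    /* Bandwidth-critical array goes to MCDRAM ... */
    double *fast = hbw_malloc(n * sizeof *fast);
    /* ... while capacity-bound data stays in DDR. */
    double *bulk = malloc(n * sizeof *bulk);
    if (!fast || !bulk) return 1;

    for (size_t i = 0; i < n; i++)
        fast[i] = bulk[i] = (double)i;

    printf("touched %zu elements in MCDRAM and DDR\n", n);
    hbw_free(fast);
    free(bulk);
    return 0;
}
```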


High Performance Interconnects | 2011

Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL

Krishna Chaitanya Kandalla; Hari Subramoni; Jérôme Vienne; S. Pai Raikar; Karen Tomko; Sayantan Sur; Dhabaleswar K. Panda

The upcoming MPI-3.0 standard is expected to include non-blocking collective operations. Non-blocking collectives offer a new MPI interface with which an application can decouple the initiation and completion of collective operations. However, to be effective, the MPI library must provide a high performance and scalable implementation. One of the major challenges in designing an effective non-blocking collective operation is to ensure progress of the operation while processors are busy with application-level computation. The recently introduced Mellanox ConnectX-2 InfiniBand adapters offer a task offload interface (CORE-Direct) that enables communication progress without requiring CPU cycles. In this paper, we present the design of a non-blocking broadcast operation (MPI_Ibcast) using the CORE-Direct offload interface. Our experimental evaluations show that our implementation delivers near-perfect overlap without penalizing the latency of the MPI_Ibcast operation. Since existing MPI implementations do not provide non-blocking collective communication, scientific applications have been modified to implement collectives on top of MPI point-to-point operations to achieve overlap. HPL is an example of an application use case for non-blocking collectives. We have explored the benefits of our proposed network-offload-based MPI_Ibcast implementation with HPL and observe that HPL can achieve its peak throughput with significantly smaller problem sizes, which also leads to an improvement in its run-time of up to 78% with 512 processors. We also observe that our proposed designs can minimize the impact of system noise on applications.
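
A minimal sketch of the overlap pattern that MPI_Ibcast (standardized in MPI-3) enables: initiate the broadcast, perform independent computation while the library or offload hardware progresses the operation, then complete it with MPI_Wait. This is not the paper's HPL integration, just the basic usage.

```c
/* Non-blocking broadcast with computation overlapped before completion. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *panel = malloc(n * sizeof *panel);
    if (rank == 0)
        for (int i = 0; i < n; i++) panel[i] = (double)i;

    /* Initiate the broadcast without blocking. */
    MPI_Request req;
    MPI_Ibcast(panel, n, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* Overlap: independent local computation proceeds while data moves. */
    double local = 0.0;
    for (int i = 0; i < n; i++)
        local += (double)i * 1e-6;

    /* Complete the collective before touching the broadcast buffer. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d: local=%f panel[last]=%f\n", rank, local, panel[n - 1]);
    free(panel);
    MPI_Finalize();
    return 0;
}
```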


Extreme Science and Engineering Discovery Environment | 2014

Benefits of Cross Memory Attach for MPI libraries on HPC Clusters

Jérôme Vienne

With the number of cores per node increasing in modern clusters, an efficient implementation of intra-node communication is critical for application performance. MPI libraries generally use shared memory mechanisms for communication inside the node; unfortunately, this approach has limitations for large messages. Linux kernel 3.2 introduced Cross Memory Attach (CMA), a mechanism to improve communication between MPI processes inside the same node. However, because this feature is not enabled by default in the MPI libraries that support it, HPC administrators may leave it disabled, which deprives users of its performance benefits. In this paper, we explain how to use CMA and present an evaluation of CMA using micro-benchmarks and the NAS Parallel Benchmarks (NPB), a set of applications commonly used to evaluate parallel systems. Our performance evaluation reveals that CMA outperforms shared memory for large messages. Micro-benchmark level evaluations show that CMA can enhance performance by as much as a factor of four. With NPB, we see up to 24.75% improvement in total execution time for FT and up to 24.08% for IS.
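
The CMA primitive itself is the Linux process_vm_readv()/process_vm_writev() system call pair (kernel 3.2 and later), which copies directly between two processes' address spaces and so avoids the intermediate shared-memory copy. MPI libraries use it internally; the stand-alone demo below, which uses fork() purely so both processes know the buffer's address, shows the call in isolation.

```c
/* Single-copy read of another process's memory with process_vm_readv(). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>

int main(void)
{
    char *buf = malloc(64);               /* same virtual address in the child after fork */
    strcpy(buf, "original");

    pid_t pid = fork();
    if (pid == 0) {                        /* child: overwrite its copy, then linger */
        strcpy(buf, "written by the child");
        sleep(2);
        _exit(0);
    }

    sleep(1);                              /* crude synchronization for the demo */

    char out[64] = {0};
    struct iovec local  = { .iov_base = out, .iov_len = sizeof out };
    struct iovec remote = { .iov_base = buf, .iov_len = sizeof out };

    /* Copy the child's buffer directly into ours, without shared memory. */
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    if (n < 0) perror("process_vm_readv");
    else printf("parent copy: '%s'  child copy: '%s'\n", buf, out);

    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}
```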


International Conference on Parallel Processing | 2012

A Scalable InfiniBand Network Topology-Aware Performance Analysis Tool for MPI

Hari Subramoni; Jérôme Vienne; Dhabaleswar K. Panda

Over the last decade, InfiniBand (IB) has become an increasingly popular interconnect for deploying modern supercomputing systems. As supercomputing systems grow in size and scale, the impact of IB network topology on the performance of high performance computing (HPC) applications also increases. Depending on the kind of network (fat tree, torus, or mesh), the number of network hops involved in data transfer varies. No tool currently exists that allows users of such large-scale clusters to analyze and visualize the communication pattern of HPC applications in a network topology-aware manner. In this paper, we take up this challenge and design a scalable, low-overhead InfiniBand Network Topology-Aware Performance Analysis Tool for MPI - INTAP-MPI. INTAP-MPI allows users to analyze and visualize the communication pattern of HPC applications on any IB network (fat tree, torus, or mesh). We integrate INTAP-MPI into the MVAPICH2 MPI library, allowing users of HPC clusters to seamlessly use it for analyzing their applications. Our experimental analysis shows that INTAP-MPI is able to profile and visualize the communication pattern of applications with very low memory and performance overhead at scale.
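
As a hedged sketch of the profiling side of such a tool (not INTAP-MPI's actual code), the PMPI wrappers below accumulate a per-destination byte count, yielding the communication matrix that a topology-aware analysis would then map onto switches and hop counts.

```c
/* PMPI-based communication-matrix sketch: bytes sent from this rank to each
 * peer, reported at finalize. The topology mapping itself is omitted. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static long long *bytes_to = NULL;   /* bytes this rank sent to each peer */
static int nprocs = 0;

int MPI_Init(int *argc, char ***argv)
{
    int ret = PMPI_Init(argc, argv);
    PMPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bytes_to = calloc(nprocs, sizeof *bytes_to);
    return ret;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(type, &size);
    if (comm == MPI_COMM_WORLD && dest >= 0 && dest < nprocs)
        bytes_to[dest] += (long long)count * size;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int p = 0; p < nprocs; p++)
        if (bytes_to[p] > 0)
            fprintf(stderr, "rank %d -> rank %d : %lld bytes\n", rank, p, bytes_to[p]);
    free(bytes_to);
    return PMPI_Finalize();
}
```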


International Conference on Cluster Computing | 2012

Can Network-Offload Based Non-blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?

Krishna Chaitanya Kandalla; Aydin Buluç; Hari Subramoni; Karen Tomko; Jérôme Vienne; Leonid Oliker; Dhabaleswar K. Panda

Graph-based computations are commonly used across various data intensive computing domains ranging from social networks to biological systems. On distributed memory systems, graph algorithms involve explicit communication between processes and often exhibit sparse, irregular behavior. Minimizing these communication overheads is critical to cater to the graph-theoretic analysis demands of emerging "big data" applications. In this paper, we explore the challenges associated with reducing the communication overheads of a popular 2D Breadth First Search (BFS) implementation in the CombBLAS library. This BFS algorithm relies on two common MPI collectives, MPI_Alltoallv and MPI_Allgatherv, to exchange data between processes, and they account for more than 20% of the overall run time. Re-designing parallel applications to take advantage of MPI-3 non-blocking collectives to achieve latency hiding is an active area of research. However, the 2D BFS algorithm in CombBLAS is not directly amenable to such a re-design through common overlap techniques such as double-buffering. In this paper, we propose to re-design the BFS algorithm to leverage MPI-3 non-blocking, neighborhood collective communication operations to achieve fine-grained computation/communication overlap. We also leverage the CORE-Direct network offload feature in the ConnectX-2 InfiniBand adapter from Mellanox to design highly efficient and scalable non-blocking, neighborhood Alltoallv and Allgatherv collective operations. Our experimental evaluations show that we can improve the communication overheads of the 2D BFS algorithm by up to 70% with 1,936 processes.
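
The MPI-3 building blocks named above can be exercised with a distributed graph topology plus a non-blocking neighborhood collective; in the sketch below a ring stands in for the sparse BFS communication graph, and the CORE-Direct offload path remains internal to the MPI library, invisible to user code.

```c
/* Distributed graph topology + non-blocking neighborhood all-to-all-v,
 * with local work overlapped before completion. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank talks only to its ring neighbors: a stand-in for a sparse graph. */
    int nbrs[2] = { (rank - 1 + size) % size, (rank + 1) % size };
    MPI_Comm graph_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, nbrs, MPI_UNWEIGHTED,
                                   2, nbrs, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &graph_comm);

    int sendbuf[2] = { rank, rank }, recvbuf[2] = { -1, -1 };
    int counts[2] = { 1, 1 }, displs[2] = { 0, 1 };

    /* Initiate the sparse exchange without blocking. */
    MPI_Request req;
    MPI_Ineighbor_alltoallv(sendbuf, counts, displs, MPI_INT,
                            recvbuf, counts, displs, MPI_INT,
                            graph_comm, &req);

    /* Overlap: local work on the current BFS frontier would go here. */
    double local = 0.0;
    for (int i = 0; i < 1000000; i++) local += i * 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("rank %d received %d and %d (local=%f)\n", rank, recvbuf[0], recvbuf[1], local);

    MPI_Comm_free(&graph_comm);
    MPI_Finalize();
    return 0;
}
```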

Collaboration


Dive into Jérôme Vienne's collaborations.

Top Co-Authors

Karen Tomko

Ohio Supercomputer Center


Rohan Garg

Northeastern University


Jiajun Cao

Northeastern University


Kapil Arya

Northeastern University
