Johannes Langguth
Simula Research Laboratory
Publications
Featured research published by Johannes Langguth.
Computers & Operations Research | 2013
Kamer Kaya; Johannes Langguth; Fredrik Manne; Bora Uçar
We investigate the push-relabel algorithm for solving the problem of finding a maximum cardinality matching in a bipartite graph in the context of the maximum transversal problem. We describe in detail an optimized yet easy-to-implement version of the algorithm and fine-tune its parameters. We also introduce new performance-enhancing techniques. On a wide range of real-world instances, we compare the push-relabel algorithm with state-of-the-art augmenting path-based algorithms and the recently proposed pseudoflow approach. We conclude that a carefully tuned push-relabel algorithm is competitive with all known augmenting path-based algorithms, and superior to the pseudoflow-based ones.
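For orientation, here is a minimal sketch of an augmenting-path matcher of the Kuhn/DFS variety, i.e. the algorithm family the push-relabel code is compared against. It is not the tuned push-relabel implementation from the paper, and all identifiers are illustrative.

```cpp
// Minimal augmenting-path (Kuhn-style DFS) bipartite matching baseline.
// adj[c] lists the rows adjacent to column c; rows and columns are 0-based.
#include <vector>
using std::vector;

static bool augment(int c, const vector<vector<int>>& adj,
                    vector<int>& match_row, vector<char>& visited) {
    for (int r : adj[c]) {
        if (visited[r]) continue;
        visited[r] = 1;
        // Row r is free, or the column currently holding it can be re-matched.
        if (match_row[r] == -1 || augment(match_row[r], adj, match_row, visited)) {
            match_row[r] = c;
            return true;
        }
    }
    return false;
}

// Returns the cardinality of a maximum matching.
int max_bipartite_matching(const vector<vector<int>>& adj, int num_rows) {
    vector<int> match_row(num_rows, -1);   // match_row[r] = column matched to row r
    int matched = 0;
    for (int c = 0; c < (int)adj.size(); ++c) {
        vector<char> visited(num_rows, 0); // rows visited in this augmentation attempt
        if (augment(c, adj, match_row, visited)) ++matched;
    }
    return matched;
}
```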
Parallel Computing | 2014
Johannes Langguth; Ariful Azad; Mahantesh Halappanavar; Fredrik Manne
We study multithreaded push-relabel based algorithms for computing maximum cardinality matching in bipartite graphs. Matching is a fundamental combinatorial problem with applications in a wide variety of problems in science and engineering. We are motivated by its use in the context of sparse linear solvers for computing the maximum transversal of a matrix. Other applications can be found in many fields such as bioinformatics (Azad et al., 2010) [4], scheduling (Timmer and Jess, 1995) [27], and chemical structure analysis (John, 1995) [14]. We implement and test our algorithms on several multi-socket multicore systems and compare their performance to state-of-the-art augmenting path-based serial and parallel algorithms using a test set comprised of a wide range of real-world instances. Building on several heuristics for enhancing performance, we demonstrate good scaling for the parallel push-relabel algorithm. We show that it is comparable to the best augmenting path-based algorithms for bipartite matching. To the best of our knowledge, this is the first extensive study of multithreaded push-relabel based algorithms. In addition to a direct impact on the applications using matching, the proposed algorithmic techniques can be extended to preflow-push based algorithms for computing maximum flow in graphs.
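As a rough illustration of the kind of cheap multithreaded heuristic such matching codes typically start from (a generic sketch, not the authors' implementation), the following pass greedily claims rows for columns with an atomic compare-and-swap; an exact algorithm then only needs to process the columns left unmatched.

```cpp
// Generic multithreaded greedy initial matching using atomic compare-and-swap.
// row_owner[r] must be initialized to -1 (row free) by the caller.
#include <atomic>
#include <vector>
#include <omp.h>

void greedy_initial_matching(const std::vector<std::vector<int>>& adj,   // adj[c]: rows of column c
                             std::vector<std::atomic<int>>& row_owner,   // -1 if row is free
                             std::vector<int>& col_match)                // -1 if column unmatched
{
    const int num_cols = (int)adj.size();
    #pragma omp parallel for schedule(dynamic, 64)
    for (int c = 0; c < num_cols; ++c) {
        for (int r : adj[c]) {
            int expected = -1;
            // Claim the first free row we find; the CAS keeps the claim race-free.
            if (row_owner[r].compare_exchange_strong(expected, c)) {
                col_match[c] = r;
                break;
            }
        }
    }
}
```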
International Conference on Algorithms and Architectures for Parallel Processing | 2015
Qiang Lan; Namit Gaur; Johannes Langguth; Xing Cai
We adopt a detailed human cardiac cell model, which has 10,000 calcium release units, in connection with simulating the electrical activity and calcium handling at the tissue scale. This is a computationally intensive problem requiring a combination of efficient numerical algorithms and parallel programming. To this end, we use a method based on binomial distributions to collectively study the stochastic state transitions of the 100 ryanodine receptors inside every calcium release unit, instead of individually following each ryanodine receptor. Moreover, the implementation of the parallel simulator incorporates optimizations in the form of code vectorization and removal of redundant calculations. Numerical experiments show very good parallel performance of the 3D simulator and demonstrate that various physiological behaviors are correctly reproduced. This work thus paves the way for high-fidelity 3D simulations of human ventricular tissues, with the ultimate goal of understanding the mechanisms of arrhythmia.
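The binomial idea can be sketched as follows; the transition probabilities and data layout are placeholders rather than the rates of the published cell model.

```cpp
// Sketch: instead of simulating each of the 100 ryanodine receptors (RyRs) in a
// calcium release unit (CRU) individually, draw the number of receptors that
// switch state in a time step from a binomial distribution.
#include <random>

struct CRU {
    int open = 0;      // RyRs currently open
    int closed = 100;  // RyRs currently closed (100 RyRs per CRU)
};

// One stochastic update of a single CRU; p_open and p_close are the per-RyR
// transition probabilities for the current time step (placeholder values).
void update_cru(CRU& cru, double p_open, double p_close, std::mt19937& rng) {
    std::binomial_distribution<int> opening(cru.closed, p_open); // closed -> open
    std::binomial_distribution<int> closing(cru.open, p_close);  // open -> closed
    int n_open  = opening(rng);
    int n_close = closing(rng);
    cru.open   += n_open - n_close;
    cru.closed += n_close - n_open;
}
```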
International Conference on Parallel and Distributed Systems | 2014
Johannes Langguth; Xing Cai
A recent trend in modern high-performance computing environments is the introduction of accelerators such as GPUs and Xeon Phi, i.e. specialized computing devices that are optimized for highly parallel applications and coexist with CPUs. In regular compute-intensive applications with predictable data access patterns, these devices often outperform traditional CPUs by far and thus relegate them to pure control functions instead of computations. For irregular applications, however, the gap in relative performance can be much smaller, and sometimes even reversed. Thus, maximizing overall performance in such systems requires making full use of all available computational resources. In this paper we study the attainable performance of the cell-centered finite volume method on 3D unstructured tetrahedral meshes using heterogeneous systems consisting of CPUs and multiple GPUs. Finite volume methods are widely used numerical strategies for solving partial differential equations. The advantages of using finite volumes include built-in support for conservation laws and suitability for unstructured meshes. Our focus lies in demonstrating how a workload distribution that maximizes overall performance can be derived from the actual performance attained by the different computing devices in the heterogeneous environment. We also highlight the dual role of partitioning software in reordering and partitioning the input mesh, thus giving rise to a new combined approach to partitioning.
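A minimal sketch of the underlying idea, assuming the split is made proportional to each device's individually benchmarked throughput (names and structure are illustrative, not taken from the paper's code):

```cpp
// Assign mesh cells to devices in proportion to measured throughput (cells/s).
#include <numeric>
#include <vector>

// throughput[i]: measured cells/s of device i (a CPU or one of the GPUs).
// Returns the number of cells assigned to each device; the counts sum to total_cells.
std::vector<long> split_workload(const std::vector<double>& throughput, long total_cells) {
    double total_rate = std::accumulate(throughput.begin(), throughput.end(), 0.0);
    std::vector<long> cells(throughput.size());
    long assigned = 0;
    for (size_t i = 0; i + 1 < throughput.size(); ++i) {
        cells[i] = static_cast<long>(total_cells * (throughput[i] / total_rate));
        assigned += cells[i];
    }
    cells.back() = total_cells - assigned;  // last device absorbs the rounding error
    return cells;
}
```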
Irregular Applications: Architectures and Algorithms | 2013
Johannes Langguth; Nan Wu; Jun Chai; Xing Cai
Finite volume methods are widely used numerical strategies for solving partial differential equations. This paper aims at obtaining a quantitative understanding of the achievable GPU performance of finite volume computations in the context of the cell-centered finite volume method on 3D unstructured tetrahedral meshes. By using an optimized implementation and a synthetic connectivity matrix that exhibits a perfect structure of equal-sized blocks lying on the main diagonal, we can closely relate the achievable computing performance to the size of these diagonal blocks. Moreover, we have derived a theoretical model for identifying characteristic levels of the attainable performance as a function of the GPU's key hardware parameters. A realistic upper limit of the performance can thus be accurately predicted. For real-world tetrahedral meshes, the key to high performance lies in a reordering of the tetrahedra, such that the resulting connectivity matrix resembles a block-diagonal form where the optimal size of the blocks depends on the GPU hardware. Performance can then be predicted accurately based on the success of the reordering. Numerical experiments confirm that the achieved performance is close to the practically attainable maximum and reaches 75% of the theoretical upper limit, independent of the actual tetrahedral mesh considered.
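The access pattern at issue can be illustrated with a scalar CPU sketch of the per-cell gather over a CSR-style neighbor structure (the paper's kernels are GPU code; the array names here are hypothetical). The indirect reads u[nbr[j]] are what a good cell ordering turns into nearly block-diagonal, cache- and coalescing-friendly accesses.

```cpp
// Cell-centered finite volume update: each tetrahedron gathers contributions
// from its (up to four) face neighbors stored in compressed sparse row form.
#include <vector>

void fv_update(const std::vector<int>&    row_ptr,  // CSR offsets, size num_cells + 1
               const std::vector<int>&    nbr,      // neighbor cell indices
               const std::vector<double>& coeff,    // per-face coefficients
               const std::vector<double>& diag,     // per-cell diagonal coefficients
               const std::vector<double>& u,        // current cell values
               std::vector<double>&       u_new)    // updated cell values
{
    const int num_cells = (int)row_ptr.size() - 1;
    for (int c = 0; c < num_cells; ++c) {
        double acc = diag[c] * u[c];
        for (int j = row_ptr[c]; j < row_ptr[c + 1]; ++j)
            acc += coeff[j] * u[nbr[j]];   // indirect, ordering-dependent access
        u_new[c] = acc;
    }
}
```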
International Journal of Parallel Programming | 2017
Johannes Langguth; Qiang Lan; Namit Gaur; Xing Cai
We investigate heterogeneous computing, which involves both multicore CPUs and manycore Xeon Phi coprocessors, as a new strategy for computational cardiology. In particular, 3D tissues of the human cardiac ventricle are studied with a physiologically realistic model that has 10,000 calcium release units per cell and 100 ryanodine receptors per release unit, together with tissue-scale simulations of the electrical activity and calcium handling. In order to attain resource-efficient use of heterogeneous computing systems that consist of both CPUs and Xeon Phis, we first direct the coding effort at ensuring good performance on the two types of compute devices individually. Although SIMD code vectorization is the main theme of performance programming, the actual implementation details differ considerably between CPU and Xeon Phi. Moreover, in addition to combined OpenMP+MPI programming, a suitable division of the cells between the CPUs and Xeon Phis is important for resource-efficient usage of an entire heterogeneous system. Numerical experiments show that good resource utilization is indeed achieved and that such a heterogeneous simulator paves the way for ultimately understanding the mechanisms of arrhythmia. The uncovered good programming practices can be used by computational scientists who want to adopt similar heterogeneous hardware platforms for a wide variety of applications.
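As a generic illustration of the vectorization theme (the update formula is a placeholder, a simple exponential relaxation toward a steady state, not the cardiac model), a flat structure-of-arrays loop that compilers for both CPU and Xeon Phi can vectorize might look like this:

```cpp
// Branch-free per-cell update over structure-of-arrays data, written so that
// the compiler can apply SIMD vectorization on CPU and Xeon Phi alike.
#include <cmath>
#include <vector>

void relax_cells(std::vector<double>& v,           // one state variable per cell
                 const std::vector<double>& v_inf,  // per-cell steady-state value
                 double tau, double dt)
{
    const double decay = std::exp(-dt / tau);       // loop-invariant factor
    const int n = (int)v.size();
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        v[i] = v_inf[i] + (v[i] - v_inf[i]) * decay;
}
```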
International Conference on Parallel and Distributed Systems | 2016
Johannes Langguth; Qiang Lan; Namit Gaur; Xing Cai; Mei Wen; Chunyuan Zhang
We develop a simulator for 3D tissue of the human cardiac ventricle with a physiologically realistic cell model and deploy it on the supercomputer Tianhe-2. In order to attain the full performance of the heterogeneous CPU-Xeon Phi design, we use carefully optimized codes for both devices and combine them to obtain suitable load balancing. Using a large number of nodes, we are able to perform tissue-scale simulations of the electrical activity and calcium handling in millions of cells, at a level of detail that tracks the states of trillions of ryanodine receptors. We can thus simulate spiral waves and other complex arrhythmogenic patterns which arise from calcium handling deficiencies in human cardiac ventricle tissue. Due to extensive code tuning and parallelization via OpenMP, MPI, and SCIF/COI, large-scale simulations of 10 heartbeats can be performed in a matter of hours. Test results indicate excellent scalability, thus paving the way for detailed whole-heart simulations on future generations of leadership-class supercomputers.
IEEE Micro | 2015
Johannes Langguth; Mohammed Sourouri; Glenn T. Lines; Scott B. Baden; Xing Cai
A recent trend in modern high-performance computing environments is the introduction of powerful, energy-efficient hardware accelerators such as GPUs and Xeon Phi coprocessors. These specialized computing devices coexist with CPUs and are optimized for highly parallel applications. In regular compute-intensive applications with predictable data access patterns, these devices often far outperform CPUs and thus relegate the latter to pure control functions instead of computations. For irregular applications, however, the performance gap can be much smaller and is sometimes even reversed. Thus, maximizing the overall performance on heterogeneous systems requires making full use of all available computational resources, including both accelerators and CPUs.
International Conference on High Performance Computing and Simulation | 2016
Jeremie Lagraviere; Johannes Langguth; Mohammed Sourouri; Phuong Hoai Ha; Xing Cai
Using large-scale multicore systems to obtain maximum performance and energy efficiency with manageable programmability is a major challenge. The partitioned global address space (PGAS) programming model enhances programmability by providing a global address space over large-scale computing systems. However, so far the performance and energy efficiency of the PGAS model on multicore-based parallel architectures have not been investigated thoroughly. In this paper we use a set of selected kernels from the well-known NAS Parallel Benchmarks to evaluate the performance and energy efficiency of the UPC programming language, which is a widely used implementation of the PGAS model. In addition, the MPI and OpenMP versions of the same parallel kernels are used for comparison with their UPC counterparts. The investigated hardware platforms are based on multicore CPUs, both within a single 16-core node and across multiple nodes involving up to 1024 physical cores. On the multi-node platform we used the High Definition Energy Efficiency Monitoring hardware measurement solution to measure energy. On the single-node system we used a hybrid measurement solution. To understand the observed performance differences, we use the Intel Performance Counter Monitor to quantify in detail the communication time, cache hit/miss ratio, and memory usage. Our experiments show that UPC is competitive with OpenMP and MPI on single and multiple nodes, with respect to both performance and energy efficiency.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2015
Md. Naim; Fredrik Manne; Mahantesh Halappanavar; Antonino Tumeo; Johannes Langguth
Matching is a fundamental graph problem with numerous applications in science and engineering. While algorithms for computing optimal matchings are difficult to parallelize, approximation algorithms generally compute high-quality solutions and are amenable to parallelization. In this paper, we present efficient implementations of the current best algorithm for half-approximate weighted matching, the Suitor algorithm, on the Nvidia Kepler K40 platform. We develop four variants of the algorithm that exploit hardware features to address key challenges for a GPU implementation. We also experiment with different combinations of work assigned to a warp. Using an exhaustive set of 269 inputs, we demonstrate that the new implementation outperforms the previous best GPU algorithm by 10 to 100x for over 100 instances, and by 100 to 1000x for 15 instances. We also demonstrate up to 20x speedup relative to 2 threads, and up to 5x relative to 16 threads, of the same algorithm on an Intel Xeon platform with 16 cores. The new algorithms and implementations provided in this paper will have a direct impact on several applications that repeatedly use matching as a key compute kernel. Further, the algorithm designs and insights provided in this paper will benefit other researchers implementing graph algorithms on modern GPU architectures.
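A sequential sketch of the Suitor algorithm, which the GPU variants in the paper parallelize, is given below; the tie-breaking rule and all identifiers are illustrative choices rather than the paper's implementation. Each vertex proposes to the heaviest neighbor whose standing offer it can beat, and a displaced suitor immediately re-proposes; mutual suitor pairs form the matching.

```cpp
// Sequential Suitor algorithm for half-approximate weighted matching.
#include <vector>

struct Edge { int to; double w; };

// adj[u]: neighbors of u with positive edge weights. After the run, u and v are
// matched iff suitor[u] == v and suitor[v] == u.
std::vector<int> suitor_matching(const std::vector<std::vector<Edge>>& adj) {
    const int n = (int)adj.size();
    std::vector<int> suitor(n, -1);   // current suitor of each vertex, -1 if none
    std::vector<double> ws(n, 0.0);   // weight offered by that suitor

    for (int start = 0; start < n; ++start) {
        int u = start;
        while (u != -1) {
            int best = -1;
            double best_w = 0.0;
            for (const Edge& e : adj[u]) {
                // u can become e.to's suitor only by beating the standing offer;
                // ties are broken in favor of the smaller vertex index.
                bool beats = e.w > ws[e.to] ||
                             (e.w == ws[e.to] && suitor[e.to] != -1 && u < suitor[e.to]);
                if (beats && (best == -1 || e.w > best_w ||
                              (e.w == best_w && e.to < best))) {
                    best = e.to;
                    best_w = e.w;
                }
            }
            if (best == -1) break;          // nobody left that u can propose to
            int displaced = suitor[best];   // previous suitor, if any, loses its spot
            suitor[best] = u;
            ws[best] = best_w;
            u = displaced;                  // displaced vertex (or -1) re-proposes
        }
    }
    return suitor;
}
```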