Mario Flajslik | Researchain

Publication

Featured research published by Mario Flajslik.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Mitigating MPI Message Matching Misery

Mario Flajslik; James Dinan; Keith D. Underwood

To satisfy MPI ordering semantics in the presence of wildcards, current implementations store posted receive operations and unexpected messages in linked lists. As applications scale up, communication patterns that scale with the number of processes or the number of threads per process can cause those linked lists to grow and become a performance problem. We propose new structures and matching algorithms to address these performance challenges. Our scheme utilizes a hash map that is extended with message ordering annotations to significantly reduce time spent searching for matches in the posted receive and the unexpected message structures. At the same time, we maintain the required MPI ordering semantics, even in the presence of wildcards. We evaluate our approach on several benchmarks and demonstrate a significant reduction in the number of unsuccessful match attempts in the MPI message processing engine, while at the same time incurring low space and time overheads.
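The abstract does not spell out the proposed data structures, so the following is only a minimal sketch of the general idea, under stated assumptions: posted receives are kept in hash buckets keyed by (source, tag), each entry carries a sequence number as its ordering annotation, and wildcard receives sit in a separate list. The names `PostedReceives`, `post`, `match`, and `ANY` are hypothetical and are not taken from the paper or from any MPI implementation.

```python
from collections import defaultdict, deque
from itertools import count

ANY = None  # hypothetical stand-in for MPI_ANY_SOURCE / MPI_ANY_TAG


class PostedReceives:
    """Toy posted-receive store: hash buckets plus an ordered wildcard list."""

    def __init__(self):
        self._seq = count()               # global posting order (the annotation)
        self._exact = defaultdict(deque)  # (source, tag) -> deque of (seq, handle)
        self._wild = []                   # (seq, source, tag, handle) in post order

    def post(self, source, tag, handle):
        seq = next(self._seq)
        if source is ANY or tag is ANY:
            self._wild.append((seq, source, tag, handle))
        else:
            self._exact[(source, tag)].append((seq, handle))

    def match(self, source, tag):
        """Return the handle of the earliest-posted matching receive, or None."""
        bucket = self._exact.get((source, tag))
        exact = bucket[0] if bucket else None
        # First wildcard entry (in post order) that accepts this message.
        wild_idx = next(
            (i for i, (_s, src, tg, _h) in enumerate(self._wild)
             if src in (ANY, source) and tg in (ANY, tag)),
            None,
        )
        wild = self._wild[wild_idx] if wild_idx is not None else None
        if exact is None and wild is None:
            return None
        # Compare sequence numbers so the oldest candidate wins.
        if wild is None or (exact is not None and exact[0] < wild[0]):
            bucket.popleft()
            return exact[1]
        del self._wild[wild_idx]
        return wild[3]
```

In this sketch an exact-match lookup touches one bucket instead of walking every posted receive, while the sequence-number comparison keeps a wildcard receive from being overtaken by a later exact receive. The wildcard list is still searched linearly here, which a real design would need to address.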

Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models | 2014

Contexts: A Mechanism for High Throughput Communication in OpenSHMEM

James Dinan; Mario Flajslik

This paper introduces a proposed extension to the OpenSHMEM parallel programming model, called communication contexts. Contexts introduce a new construct that allows a programmer to generate independent streams of communication operations. In hybrid executions where multiple threads execute within an OpenSHMEM process, contexts eliminate interference between threads, and enable the OpenSHMEM library to map operations generated by threads to private communication resource sets. By providing thread isolation, contexts eliminate synchronization overheads and enable each thread to drive a similar set of resources and achieve performance comparable to an OpenSHMEM process. In conventional, single-threaded execution, contexts provide greater control over ordering of operations and can improve communication and computation overlap. The contexts interface and its implementation for the Portals 4 network programming interface are described in detail. The implementation is evaluated using Mandelbrot set and integer sorting (IS) benchmarks. Contexts provide a 25% performance improvement for Mandelbrot by eliminating thread interference and enabling pipelining, and a 35% improvement was achieved for IS by enabling more effective communication/computation overlap.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2018

Megafly: A Topology for Exascale Systems

Mario Flajslik; Eric Borch; Michael Parker

In this paper we explore network topologies suitable for future exascale systems that need to support over fifty thousand endpoints. With the increased necessity to use optics at higher link speeds, some of the more traditional topologies, such as Tori and Fat-Trees, become prohibitively expensive at such large scale. We identify two cost efficient hierarchical topologies, one a canonical Dragonfly, and one a variant of the Dragonfly topology that we call Megafly. Megafly is an indirect hierarchical topology with high path diversity, flexible tapering options and an abundance of possible system design points. We describe and analyze the Megafly topology to understand its key features and advantages, when compared to the Dragonfly. Additionally, we define a Megafly tapering scheme that enables a good balance of system performance versus cost. Our evaluation shows that the Megafly topology achieves equal or better throughput than the Dragonfly on a variety of traffic patterns, while requiring only half of the virtual channels for deadlock-free routing. Megafly also provides better fairness, which is shown in the evaluation of synchronizing traffic patterns, such as neighbor exchanges. We also showcase the design flexibility and cost vs. performance trade-offs of Megafly in a mini case study that illustrates the challenges of building a high performance fabric topology.

High Performance Interconnects | 2017

Fast Networks and Slow Memories: A Mechanism for Mitigating Bandwidth Mismatches

Timo Schneider; James Dinan; Mario Flajslik; Keith D. Underwood; Torsten Hoefler

The advent of non-volatile memory (NVM) technologies has added an interesting nuance to the node level memory hierarchy. With modern 100 Gb/s networks, the NVM tier of storage can often be slower than the high performance network in the system; thus, a new challenge arises in the datacenter. Whereas prior efforts have studied the impacts of multiple sources targeting one node (i.e., incast) and have studied multiple flows causing congestion in inter-switch links, it is now possible for a single flow from a single source to overwhelm the bandwidth of a key portion of the memory hierarchy. This can subsequently spread to the switches and lead to congestion trees in a flow-controlled network or excessive packet drops without flow control. In this work we describe protocols which avoid overwhelming the receiver in the case of a source/sink rate mismatch. We design our protocols on top of Portals 4, which enables us to make use of network offload. Our protocol yields up to 4x higher throughput in a 5k node Dragonfly topology for a permutation traffic pattern in which only 1% of all nodes have a memory write-bandwidth limitation of 1/8th of the network bandwidth.

2015 9th International Conference on Partitioned Global Address Space Programming Models | 2015

On the Fence: An Offload Approach to Ordering One-Sided Communication

Mario Flajslik; James Dinan

Partitioned Global Address Space (PGAS) and one-sided communication models allow shared data to be transparently and asynchronously accessed by any process within a parallel computation. In order to ensure that updates are performed in the intended order, the programmer must either use potentially slower ordered communication, or perform operations that order unordered communication, such as a fence in the OpenSHMEM model. Often, implementations of such ordering mechanisms require blocking until pending operations have completed remotely, before allowing new operations to be issued. In this work, we present a new queuing technique for the implementation of one-sided communication ordering that is nonblocking and ensures asynchronous progress for pending communication operations. We describe an implementation of this approach using Portals triggered operations to offload queuing of communication operations across ordering boundaries. By eliminating blocking for ordered communication, this approach is able to provide automatic overlap of communication and computation. We demonstrate the benefit of this technique on several applications and measure performance improvements in the 10%-15% range from allowing computation to progress while ordered communication operations are pending.
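The nonblocking fence described above can be illustrated with a small simulation. This is a sketch under assumptions, not the paper's implementation: a counter stands in for a completion-counting event, and an operation issued after a fence is deferred until the counter reaches the number of operations issued before that fence, which loosely mirrors how a triggered operation fires at a threshold. The names `Counter`, `OrderedEndpoint`, `put`, `fence`, and `remote_ack` are hypothetical.

```python
class Counter:
    """Toy completion counter that fires deferred operations at thresholds."""

    def __init__(self):
        self.value = 0
        self._pending = []  # (threshold, operation) pairs not yet fired

    def trigger_at(self, threshold, operation):
        """Run `operation` now if the threshold is met, else defer it."""
        if self.value >= threshold:
            operation()
        else:
            self._pending.append((threshold, operation))

    def increment(self):
        self.value += 1
        ready = [op for t, op in self._pending if t <= self.value]
        self._pending = [(t, op) for t, op in self._pending if t > self.value]
        for op in ready:
            op()


class OrderedEndpoint:
    """Issues puts without blocking; a fence only records a threshold."""

    def __init__(self):
        self.completions = Counter()
        self.issued = 0      # operations handed to this endpoint so far
        self.fence_mark = 0  # completions required before new ops may start
        self.wire = []       # order in which operations reach the "network"

    def put(self, name):
        self.issued += 1
        self.completions.trigger_at(
            self.fence_mark, lambda: self.wire.append(name)
        )

    def fence(self):
        # Nonblocking: later operations wait for everything issued so far.
        self.fence_mark = self.issued

    def remote_ack(self):
        # Simulates one remote completion arriving.
        self.completions.increment()
```

A short usage example: after `put("a")`, `put("b")`, `fence()`, `put("c")`, all four calls return immediately, but `"c"` reaches `wire` only after two calls to `remote_ack()`. The caller is free to compute in the meantime, which is the overlap the abstract refers to.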

Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models | 2014

One-Sided Append: A New Communication Paradigm For PGAS Models

James Dinan; Mario Flajslik

One-sided append represents a new class of one-sided operations that can be used to aggregate messages from multiple communication sources into a single destination buffer. This new communication paradigm is analyzed in terms of its impact on the OpenSHMEM parallel programming model and applications. Implementation considerations are discussed and an accelerated implementation using the Portals 4 networking API is presented. Initial experimental results with the NAS integer sort benchmark indicate that this new operation can significantly improve the communication performance of such applications.
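To make the append semantics concrete, here is a minimal sketch that emulates an append with two existing one-sided primitives: an atomic fetch-and-add that reserves space at the tail of the destination buffer, followed by a put into the reserved range. This emulation is an assumption chosen for illustration; the paper's accelerated implementation uses Portals 4 and is not reproduced here. `AppendBuffer` and its methods are hypothetical names, and a lock stands in for the remote atomic.

```python
import threading


class AppendBuffer:
    """Toy destination buffer that many sources can append to concurrently."""

    def __init__(self, capacity):
        self.data = [None] * capacity
        self._tail = 0
        self._lock = threading.Lock()  # stands in for a remote atomic operation

    def _fetch_add(self, n):
        """Atomically reserve n slots and return the starting offset."""
        with self._lock:
            start = self._tail
            if start + n > len(self.data):
                raise OverflowError("append would exceed the buffer")
            self._tail = start + n
            return start

    def append(self, items):
        """Reserve space, then write the items into the reserved range."""
        start = self._fetch_add(len(items))
        self.data[start:start + len(items)] = items
        return start

    def contents(self):
        return self.data[:self._tail]
```

Because each source reserves a disjoint range before writing, messages from different sources end up packed back to back without overwriting one another, although their relative order depends on which reservation happens first.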