Misbah Mubarak
Argonne National Laboratory
Publications
Featured research published by Misbah Mubarak.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Misbah Mubarak; Christopher D. Carothers; Robert B. Ross; Philip H. Carns
A low-latency and low-diameter interconnection network will be an important component of future exascale architectures. The dragonfly network topology, a two-level directly connected network, is a candidate for exascale architectures because of its low diameter and reduced latency. To date, small-scale simulations with a few thousand nodes have been carried out to examine the dragonfly topology. However, future exascale machines will have millions of cores and up to 1 million nodes. In this paper, we focus on the modeling and simulation of large-scale dragonfly networks using the Rensselaer Optimistic Simulation System (ROSS). We validate the results of our model against the cycle-accurate simulator “booksim”. We also compare the performance of booksim and ROSS for the dragonfly network model at modest scales. We demonstrate the performance of ROSS on both the Blue Gene/P and Blue Gene/Q systems on a dragonfly model with up to 50 million nodes, showing a peak event rate of 1.33 billion events/second and a total of 872 billion committed events. The dragonfly network model for million-node configurations strongly scales when going from 1,024 to 65,536 MPI tasks on IBM Blue Gene/P and IBM Blue Gene/Q systems. We also explore a variety of ROSS tuning parameters to get optimal results with the dragonfly network model.
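To put the topology's scale into perspective, the sketch below sizes a balanced dragonfly using the standard rules from the original dragonfly design (a = 2p routers per group, h = p global channels per router, g = a*h + 1 groups); these formulas are general topology arithmetic, not code from the ROSS model described above.

```python
# Minimal sketch: balanced dragonfly sizing (standard design rules, used here
# for illustration; not taken from the paper's simulation code).

def dragonfly_size(p):
    """Return (routers_per_group, groups, total_nodes) for a balanced
    dragonfly with p terminals per router."""
    a = 2 * p            # routers per group
    h = p                # global channels per router
    g = a * h + 1        # groups: each group reaches every other group directly
    nodes = p * a * g    # total compute nodes (terminals)
    return a, g, nodes

if __name__ == "__main__":
    # A modest router radix already reaches million-node scale.
    for p in (4, 8, 16, 32):
        a, g, n = dragonfly_size(p)
        print(f"p={p:2d}: {a} routers/group, {g} groups, {n:,} nodes")
```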
IEEE Transactions on Parallel and Distributed Systems | 2017
Misbah Mubarak; Christopher D. Carothers; Robert B. Ross; Philip H. Carns
With the increasing complexity of today's high-performance computing (HPC) architectures, simulation has become an indispensable tool for exploring the design space of HPC systems, in particular networks. In order to make effective design decisions, simulations of these systems must possess the following properties: (1) have high accuracy and fidelity, (2) produce results in a timely manner, and (3) be able to analyze a broad range of network workloads. Most state-of-the-art HPC network simulation frameworks, however, are constrained in one or more of these areas. In this work, we present a simulation framework for modeling two important classes of networks used in today's IBM and Cray supercomputers: torus and dragonfly networks. We use the Co-Design of Multi-layer Exascale Storage Architecture (CODES) simulation framework to simulate these network topologies at flit-level detail using the Rensselaer Optimistic Simulation System (ROSS) for parallel discrete-event simulation. Our simulation framework meets all the requirements of a practical network simulation and can assist network designers in design space exploration. First, it uses validated and detailed flit-level network models to provide an accurate and high-fidelity network simulation. Second, instead of relying on serial time-stepped or traditional conservative discrete-event simulations that limit simulation scalability and efficiency, we use the optimistic event-scheduling capability of ROSS to achieve efficient and scalable HPC network simulations on today's high-performance cluster systems. Third, our models give network designers a choice in simulating a broad range of network workloads, including HPC application workloads using detailed network traces, an ability that is rarely offered in parallel with high-fidelity network simulations.
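At its core, any such framework advances a timestamped event queue; the toy loop below illustrates only that sequential core and is an illustrative sketch of our own — ROSS's optimistic scheduler additionally runs events speculatively and rolls them back on causality violations, which is not shown here.

```python
# Toy sequential discrete-event loop (illustration only; names are ours,
# not the ROSS/CODES API).
import heapq

class Simulator:
    def __init__(self):
        self.now = 0.0
        self.queue = []      # entries: (timestamp, seq, handler, payload)
        self._seq = 0        # tie-breaker for events with equal timestamps

    def schedule(self, delay, handler, payload=None):
        heapq.heappush(self.queue, (self.now + delay, self._seq, handler, payload))
        self._seq += 1

    def run(self, end_time):
        while self.queue and self.queue[0][0] <= end_time:
            self.now, _, handler, payload = heapq.heappop(self.queue)
            handler(self, payload)

def packet_arrival(sim, hop):
    print(f"t={sim.now:.1f}: packet reached hop {hop}")
    if hop < 3:                       # forward after a fixed link delay
        sim.schedule(1.5, packet_arrival, hop + 1)

sim = Simulator()
sim.schedule(0.0, packet_arrival, 0)
sim.run(end_time=10.0)
```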
Principles of Advanced Discrete Simulation | 2014
Misbah Mubarak; Christopher D. Carothers; Robert B. Ross; Philip H. Carns
A high-bandwidth, low-latency interconnect will be a critical component of future exascale systems. The torus network topology, which uses multidimensional network links to improve path diversity and exploit locality between nodes, is a potential candidate for exascale interconnects. The communication behavior of large-scale scientific applications running on future exascale networks is particularly important and analytical/algorithmic models alone cannot deduce it. Therefore, before building systems, it is important to explore the design space and performance of candidate exascale interconnects by using simulation. We improve upon previous work in this area and present a methodology for modeling and simulating a high-fidelity, validated, and scalable torus network topology at packet-chunk-level detail using the Rensselaer Optimistic Simulation System (ROSS). We execute various configurations of a 1.3 million node torus network model in order to examine the effect of torus dimensionality on network performance with relevant HPC traffic patterns. To the best of our knowledge, these are the largest torus network simulations carried out at such a detailed fidelity. In terms of simulation performance, a 1.3 million node, 9-D torus network model is shown to process a simulated exascale-class workload of nearest-neighbor traffic with 100 million message injections per second per node using 65,536 Blue Gene/Q cores in a simulation run-time of only 25 seconds. We also demonstrate that massive-scale simulations are a critical tool in exascale system design, since small-scale torus simulations are not always indicative of the network behavior at exascale size. The take-away message from this case study is that massively parallel simulation is a key enabler for effective extreme-scale network codesign.
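The role of dimensionality can be previewed with basic k-ary n-cube arithmetic: for a fixed node count, adding dimensions shrinks the wrap-around diameter. The sketch below uses only these textbook formulas and is not drawn from the paper's model.

```python
# Sketch: torus (k-ary n-cube) hop distances -- textbook formulas, used here
# for illustration only.

def torus_diameter(k, n):
    """Network diameter of a k-ary n-cube with wrap-around links."""
    return n * (k // 2)

def hops(a, b, k):
    """Minimal hop count between coordinate vectors a and b on a k-ary torus."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))

# Roughly one million nodes arranged as tori of increasing dimensionality:
for k, n in [(1024, 2), (101, 3), (32, 4), (10, 6), (4, 10)]:
    print(f"{k}-ary {n}-cube: {k**n:,} nodes, diameter {torus_diameter(k, n)}")
```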
European Conference on Parallel Processing | 2015
Bilge Acun; Nikhil Jain; Abhinav Bhatele; Misbah Mubarak; Christopher D. Carothers; Laxmikant V. Kalé
This paper presents a preliminary evaluation of TraceR, a trace replay tool built upon the ROSS-based CODES simulation framework. TraceR can be used for predicting network performance and understanding network behavior by simulating messaging on interconnection networks. It addresses two major shortcomings in current network simulators. First, it enables fast and scalable simulations of large-scale supercomputer networks. Second, it can simulate production HPC applications using BigSim's emulation framework. In addition to introducing TraceR, this paper studies the impact of input parameters on simulation performance. We also compare TraceR with other network simulators such as SST and BigSim, and demonstrate TraceR's scalability using various case studies.
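The essence of trace replay is to walk an application's message log and charge each message a modeled network cost; the fragment below illustrates that idea with a made-up trace format and a simple latency/bandwidth cost model, neither of which is TraceR's actual input format or timing model.

```python
# Illustrative trace replay with an assumed alpha/beta cost model.
ALPHA = 1e-6    # per-message latency in seconds (assumed)
BETA = 1e-9     # per-byte cost in seconds, i.e. roughly 1 GB/s (assumed)

trace = [        # (issue_time_s, src_rank, dst_rank, bytes) -- hypothetical records
    (0.000, 0, 1, 4096),
    (0.001, 1, 2, 65536),
    (0.002, 2, 0, 1048576),
]

finish = 0.0
for t, src, dst, nbytes in trace:
    cost = ALPHA + BETA * nbytes
    finish = max(finish, t + cost)
    print(f"rank {src} -> {dst}: {nbytes} B, modeled cost {cost * 1e6:.1f} us")
print(f"predicted completion time: {finish * 1e3:.3f} ms")
```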
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Xu Yang; John Jenkins; Misbah Mubarak; Robert B. Ross; Zhiling Lan
High-radix, low-diameter dragonfly networks will be a common choice in next-generation supercomputers. Preliminary studies show that random job placement with adaptive routing should be the rule of thumb to utilize such networks, since it uniformly distributes traffic and alleviates congestion. Nevertheless, in this work we find that while random job placement coupled with adaptive routing is good at load balancing network traffic, it cannot guarantee the best performance for every job. The performance improvement of communication-intensive applications comes at the expense of performance degradation of less intensive ones. We identify this bully behavior and validate its underlying causes with the help of detailed network simulation and real application traces. We further investigate a hybrid contiguous-noncontiguous job placement policy as an alternative. Initial experimentation shows that hybrid job placement aids in reducing the worst-case performance degradation for less communication-intensive applications while retaining the performance of communication-intensive ones.
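The two placement policies being contrasted reduce to a choice of which free nodes a job receives; the sketch below shows that choice in its most stripped-down form, with made-up node IDs and job sizes.

```python
# Sketch: contiguous vs. random job placement (illustration only).
import random

def contiguous_placement(free_nodes, job_size):
    """Assign a consecutive block: good locality, but a communication-heavy
    job concentrates its traffic on the links within its block."""
    return sorted(free_nodes)[:job_size]

def random_placement(free_nodes, job_size):
    """Spread the job uniformly: balances traffic across the machine, but
    every job now shares links with every other job."""
    return random.sample(sorted(free_nodes), job_size)

free = set(range(1024))
print("contiguous:", contiguous_placement(free, 8))
print("random:    ", random_placement(free, 8))
```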
Principles of Advanced Discrete Simulation | 2016
Noah Wolfe; Christopher D. Carothers; Misbah Mubarak; Robert B. Ross; Philip H. Carns
As supercomputers close in on exascale performance, the increased number of processors and processing power translates to an increased demand on the underlying network interconnect. The Slim Fly network topology, a new low-diameter and low-latency interconnection network, is gaining interest as one possible solution for next-generation supercomputing interconnect systems. In this paper, we present a high-fidelity Slim Fly flit-level model leveraging the Rensselaer Optimistic Simulation System (ROSS) and Co-Design of Exascale Storage (CODES) frameworks. We validate our Slim Fly model against the Slim Fly model results of Kathareios et al. at moderately sized network scales. We further scale the model up to an unprecedented 1 million compute nodes, and through visualization of network simulation metrics such as link bandwidth, packet latency, and port occupancy, we gain insight into the network behavior at the million-node scale. We also show linear strong scaling of the Slim Fly model on an Intel cluster, achieving a peak event rate of 36 million events per second using 128 MPI tasks to process 7 billion events. Detailed analysis of the underlying discrete-event simulation performance shows how the million-node Slim Fly model simulation executes in 198 seconds on the Intel cluster.
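As a quick consistency check of the reported simulation performance, the stated event count divided by the stated peak event rate lands close to the stated run time:

```python
# Back-of-the-envelope check using only figures quoted in the abstract above.
events = 7e9     # committed events
rate = 36e6      # peak events per second
print(f"{events / rate:.0f} s")   # ~194 s, consistent with the reported 198 s
```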
IEEE International Conference on High Performance Computing, Data, and Analytics | 2015
Shane Snyder; Philip H. Carns; Robert Latham; Misbah Mubarak; Robert B. Ross; Christopher D. Carothers; Babak Behzad; Huong Luu; Surendra Byna; Prabhat
Accurate analysis of HPC storage system designs is contingent on the use of I/O workloads that are truly representative of expected use. However, I/O analyses are generally bound to specific workload modeling techniques such as synthetic benchmarks or trace replay mechanisms, despite the fact that no single workload modeling technique is appropriate for all use cases. In this work, we present the design of IOWA, a novel I/O workload abstraction that allows arbitrary workload consumer components to obtain I/O workloads from a range of diverse input sources. Thus, researchers can choose specific I/O workload generators based on the resources they have available and the type of evaluation they wish to perform. As part of this research, we also outline the design of three distinct workload generation methods, based on I/O traces, synthetic I/O kernels, and I/O characterizations. We analyze and contrast each of these workload generation techniques in the context of storage system simulation models as well as production storage system measurements. We found that each generator mechanism offers varying levels of accuracy, flexibility, and breadth of use that should be considered before performing I/O analyses. We also recommend a set of best practices for HPC I/O workload modeling based on challenges that we encountered while performing our evaluation.
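The core of such an abstraction is that workload consumers see a single generator interface regardless of whether the operations come from a trace, a synthetic kernel, or a characterization; the sketch below illustrates that pattern with hypothetical class and method names, not IOWA's actual API.

```python
# Illustrative workload-abstraction interface (all names are hypothetical).
import random
from abc import ABC, abstractmethod

class WorkloadGenerator(ABC):
    @abstractmethod
    def next_op(self):
        """Return the next I/O operation, or None when the workload is exhausted."""

class TraceReplayGenerator(WorkloadGenerator):
    def __init__(self, records):
        self._it = iter(records)          # records captured from a real run
    def next_op(self):
        return next(self._it, None)

class SyntheticKernelGenerator(WorkloadGenerator):
    def __init__(self, n_ops, size=65536):
        self._remaining, self._size = n_ops, size
    def next_op(self):
        if self._remaining == 0:
            return None
        self._remaining -= 1
        return ("write", random.randrange(0, 1 << 30), self._size)

def consume(gen):
    """A consumer (e.g. a storage simulator) sees only the interface."""
    op = gen.next_op()
    while op is not None:
        print(op)
        op = gen.next_op()

consume(SyntheticKernelGenerator(n_ops=3))
```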
International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2014
Shane Snyder; Philip H. Carns; Jonathan Jenkins; Kevin Harms; Robert B. Ross; Misbah Mubarak; Christopher D. Carothers
Fault response strategies are crucial to maintaining performance and availability in HPC storage systems, and the first responsibility of a successful fault response strategy is to detect failures and maintain an accurate view of group membership. This is a nontrivial problem given the unreliable nature of communication networks and other system components. As with many engineering problems, trade-offs must be made to account for the competing goals of fault detection efficiency and accuracy.
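The detection side of the problem can be reduced to a timeout over heartbeats, where the timeout choice embodies exactly the efficiency-versus-accuracy trade-off mentioned above; the toy below is our own illustration, not the paper's protocol.

```python
# Toy heartbeat-timeout failure detector (illustration only).

def suspected_failures(last_heartbeat, now, timeout):
    """last_heartbeat maps member -> time of its most recent heartbeat."""
    return {m for m, t in last_heartbeat.items() if now - t > timeout}

beats = {"node0": 9.8, "node1": 7.2, "node2": 3.1}
# Aggressive timeout: fast detection, but slow-yet-alive peers are misreported.
print(suspected_failures(beats, now=10.0, timeout=2.0))   # node1 and node2 suspected
# Conservative timeout: fewer false positives, slower detection.
print(suspected_failures(beats, now=10.0, timeout=5.0))   # only node2 suspected
```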
Winter Simulation Conference | 2014
Misbah Mubarak; Christopher D. Carothers; Robert B. Ross; Philip H. Carns
MPI collective operations are a critical and frequently used part of most MPI-based large-scale scientific applications. In previous work, we have enabled the Rensselaer Optimistic Simulation System (ROSS) to predict the performance of MPI point-to-point messaging on high-fidelity million-node network simulations of torus and dragonfly interconnects. The main contribution of this work is an extension of these torus and dragonfly network models to support MPI collective communication operations using the optimistic event scheduling capability of ROSS. We demonstrate that both small- and large-scale ROSS collective communication models can execute efficiently on massively parallel architectures. We validate the results of our collective communication model against measurements from IBM Blue Gene/Q and Cray XC30 platforms using a data-driven approach in our network simulations. We also perform experiments to explore the impact of tree degree on the performance of collective communication operations in large-scale network models.
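The effect of tree degree can be previewed with a simple cost model: if a parent sends to its k children one after another, a broadcast over N ranks takes roughly ceil(log_k N) levels of k sends each. The model below is an assumption for illustration, not the validated model from the paper.

```python
# Toy model: broadcast cost on a k-ary tree with serialized sends per parent.
import math

def bcast_steps(n_ranks, k):
    depth = math.ceil(math.log(n_ranks, k))   # levels in the k-ary tree
    return depth * k                          # serialized sends summed over levels

for k in (2, 4, 8, 16):
    print(f"tree degree {k:2d}: ~{bcast_steps(1_000_000, k)} message steps for 1M ranks")
```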
Scientific Programming | 2013
Misbah Mubarak; E. Seegyoung Seol; Qiukai Lu; Mark S. Shephard
Critical to the scalability of parallel adaptive simulations are parallel control functions, including load balancing, reduced inter-process communication, and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes, and this information must be transmitted efficiently to avoid parallel performance degradation when the neighbors reside on different processors. This article presents a parallel algorithm for creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation while minimizing inter-process communication. The key characteristics of the algorithm are: (1) it can create ghost copies of any permissible topological order in a 1D, 2D, or 3D mesh based on selected adjacencies; (2) it exploits neighborhood communication patterns during the ghost creation process, thus eliminating all-to-all communication; and (3) for applications that need neighbors of neighbors, it can create n ghost layers, up to the point where the whole partitioned mesh is ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to 32,768 cores. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.
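The ghosting idea can be illustrated on a trivially partitioned 1D mesh, where each part copies only its boundary entities to the adjacent parts, so communication stays neighborhood-local instead of all-to-all; this toy is illustrative only and not the article's mesh algorithm.

```python
# Toy one-layer ghosting on a 1D partitioned mesh (illustration only).

def add_ghost_layer(parts):
    """parts: one list of element values per part (rank); returns each part's
    local elements augmented with ghost copies of its neighbors' boundary
    elements."""
    ghosted = []
    for i, local in enumerate(parts):
        left = [parts[i - 1][-1]] if i > 0 else []               # ghost from left neighbor
        right = [parts[i + 1][0]] if i < len(parts) - 1 else []  # ghost from right neighbor
        ghosted.append(left + local + right)
    return ghosted

parts = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(add_ghost_layer(parts))   # [[0, 1, 2, 3], [2, 3, 4, 5, 6], [5, 6, 7, 8]]
```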