Martin Swany | Researchain

Publication

Featured research published by Martin Swany.

latin american web congress | 2003

Enabling network measurement portability through a hierarchy of characteristics

Bruce Lowekamp; Brian Tierney; Les Cottrell; R. E. Hughes-Jones; Thilo Kielmann; Martin Swany

Measurement and prediction of network resources are crucial so that adaptive applications can make use of grid environments. Although a large number of systems and tools have been developed to provide such measurement services, the diversity of grid resources and lack of central control prevent the development of a single monitoring system that can be deployed to answer every application's resource queries for connections between any pair of machines it can use. We propose a standard for representing network entities and measurements of their properties. Our standard enables the exchange of measurements and will allow applications to function even in environments without the particular measurement system for which they were developed. We present an overview of our measurement representation and evaluate its usefulness. We have used the characteristics hierarchy to store and exchange measurement data between several systems, and we discuss its usefulness in comparing the output of several measurement tools.

network aware data management | 2013

Efficient wide area data transfer protocols for 100 Gbps networks and beyond

Ezra Kissel; Martin Swany; Brian Tierney; Eric Pouyoul

Due to a number of recent technology developments, now is the right time to re-examine the use of TCP for very large data transfers. These developments include the deployment of 100 Gigabit per second (Gbps) network backbones, hosts that can easily manage data transfers of 40 Gbps and higher, the Science DMZ model, the availability of virtual circuit technology, and wide-area Remote Direct Memory Access (RDMA) protocols. In this paper we show that RDMA works well over wide-area virtual circuits, and uses much less CPU than TCP or UDP. We also characterize the limitations of RDMA in the presence of other traffic, including competing RDMA flows. We conclude that RDMA for Science DMZ to Science DMZ transfers of massive data is a viable and desirable option for high-performance data transfer.

international parallel and distributed processing symposium | 2016

Photon: Remote Memory Access Middleware for High-Performance Runtime Systems

Ezra Kissel; Martin Swany

We introduce the Photon RDMA middleware library that enables consistent remote memory access semantics over a number of network interconnect technologies. A primary goal of Photon is to expose a lightweight and flexible network abstraction that minimizes communication and message handling overheads for high-performance applications and runtime systems, in particular those that require the manipulation of objects within a global address space. Both one-sided and rendezvous communication models are supported and asynchronous network progress is exposed at a fine granularity. Photon implements a novel communication pattern called put-with-completion (PWC) that optimizes a completion notification path with variable size data for realizing active message-driven computation. The results of our performance evaluation show that our PWC model is comparable to, and often improves upon, existing one-sided RDMA libraries in message latency and throughput metrics.

european conference on parallel processing | 2014

Software Defined Multicasting for MPI Collective Operation Offloading with the NetFPGA

Omer Arap; Geoffrey Brown; Bryce Himebaugh; Martin Swany

Collective operations play a key role in the performance of many high performance computing applications and are central to the widely used Message Passing Interface (MPI) programming model. In this paper we explore the use of programmable networking devices to accelerate the implementation of collective operations by offloading functionality to the underlying network. In our work we utilize a networked FPGA in conjunction with commercial OpenFlow switches supporting multicast. The union of hardware configurable network interfaces with Software Defined Networking (SDN) provides a significant opportunity to improve the performance of MPI applications that rely heavily on collective operations. The programmable interfaces implement collective operations in hardware using OpenFlow-supported multicast. In our 8-node cluster, we observed up to 12% reduction in MPI_Allreduce latency in dynamic schemes employing SDN, and up to 22% reduction in static topologies. The results suggest more benefits if our approach is deployed in larger settings with low latency switches.

acm special interest group on data communication | 2015

Research challenges in future multi-domain network performance measurement and monitoring

Prasad Calyam; Martin Swany

The perfSONAR-based Multi-domain Network Performance Measurement and Monitoring Workshop was held on February 20-21, 2014 in Arlington, VA. The goal of the workshop was to review the state of the perfSONAR effort and catalyze future directions by cross-fertilizing ideas and distilling common themes among the diverse perfSONAR stakeholders, who include network operators and managers, end users, and network researchers. The timing and organization for the second workshop is significant because there is an increasing number of groups within NSF supported data-intensive computing and networking programs that are dealing with measurement, monitoring and troubleshooting of multi-domain issues. These groups are forming explicit measurement federations using perfSONAR to address a wide range of issues. In addition, the emergence and wide adoption of new paradigms such as software-defined networking are taking shape to aid in traffic management needs of scientific communities and network operators. Consequently, there are new challenges that need to be addressed for extensible and programmable instrumentation, measurement data analysis, visualization and middleware security features in perfSONAR. This report summarizes the workshop efforts to bring together diverse groups for delivering targeted short/long talks, sharing latest advances, and identifying gaps that exist in the community for solving end-to-end performance problems in an effective, scalable fashion.

international conference on communications | 2015

Using phoebus data transfer accelerator in cloud environments

Miao Zhang; Ezra Kissel; Martin Swany

The quality of data exchange in cloud computing applications relies on the connection performance between user clients and their cloud storage providers, and is often dependent on the wide area network (WAN) properties among data centers. For certain classes of applications, it can be crucial to provide an end-to-end solution that accelerates large data transfers and improves overall user experience. The development and deployment of WAN optimization technology has been investigated for improving application performance in heterogeneous, multi-domain environments. WAN optimization devices and services implement a number of approaches for performance improvement, and one key insight is that in contrast to traditional end-to-end TCP connections, middleboxes that segment and optimize transport-layer connections can improve the performance of wide area data transfers. In the context of dynamic cloud computing environments, there is an obvious target for implementations of WAN optimization as Network Function Virtualization (NFV), where the flexibility of virtualized cloud environments can be exploited. This paper describes recent developments in, and experimentation with, our Phoebus WAN accelerator framework. We introduce a software suite that includes new Phoebus clients that operate with the Phoebus Gateway network. We test and discuss virtualizing Phoebus Gateways to provide acceleration services in cloud data transfers. Use cases and performance evaluations are conducted on FutureGrid and Internet2 testbeds, and we demonstrate the effectiveness of a virtualized Phoebus deployment.

international parallel and distributed processing symposium | 2015

Adaptive Recursive Doubling Algorithm for Collective Communication

Omer Arap; Martin Swany; Geoffrey Brown; Bryce Himebaugh

Process arrival times at MPI collective operations differ significantly. Addressing this fact with special handling for popular collective communication algorithms can yield performance improvements. The recursive doubling algorithm is one of the most efficient techniques for implementing collectives in MPI, especially for short messages and when the number of participating processes is a power of two. In the recursive doubling algorithm, all the processes must complete a given step before the algorithm continues to the next step. In this paper, we present a recursive doubling algorithm that makes use of available data and removes the requirement for each process to arrive at each step before proceeding. Our approach makes use of the multicast feature of the underlying network and progress tagging of messages, describing the currently available partial results. Our approach could be implemented in any parallel execution environment that supports multicasting. Our prototype implementation is based upon a network interface card with an FPGA, the NetFPGA. The NetFPGA provides hardware-level programmability to offload processing and precise, controlled timing to account for packet and algorithm behavior, allowing classification of skew scenarios. Our algorithm provides up to 10% saving in synchronization delay in the presence of skew, up to 37% saving in number of messages generated, and up to 32% saving in reduction operations performed in MPI_Allreduce.
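
For readers unfamiliar with the baseline this abstract refers to, the following is a minimal single-process sketch of standard (synchronous) recursive doubling for an allreduce, assuming a power-of-two number of processes and a commutative reduction. It is not the authors' adaptive variant and does not model multicast, progress tagging, or the NetFPGA; the function name `recursive_doubling_allreduce` is hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def recursive_doubling_allreduce(
    values: List[int],
    op: Callable[[int, int], int] = lambda a, b: a + b,
) -> Tuple[List[int], int]:
    """Simulate a synchronous recursive doubling allreduce.

    values[r] is the initial contribution of rank r. Returns the final
    per-rank values and the number of exchange steps performed.
    Assumes len(values) is a power of two and op is commutative.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("number of processes must be a power of two")

    current = list(values)
    distance = 1
    steps = 0
    while distance < p:
        # In each step, rank r pairs with rank r XOR distance. Both sides
        # exchange their partial result and combine it. Building a new list
        # models the requirement that every rank finishes the step before
        # any rank moves on to the next one.
        current = [op(current[r], current[r ^ distance]) for r in range(p)]
        distance *= 2
        steps += 1
    return current, steps


if __name__ == "__main__":
    result, steps = recursive_doubling_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
    print(result, steps)
```

With eight ranks the simulation finishes in three steps, and every rank ends up holding the full sum. The per-step barrier implied by rebuilding the list is the synchronization requirement that the abstract says the adaptive algorithm removes.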

utility and cloud computing | 2017

Low Latency Stream Processing: Apache Heron with Infiniband & Intel Omni-Path

Supun Kamburugamuve; Karthik Ramasamy; Martin Swany; Geoffrey C. Fox

Worldwide data production is increasing both in volume and velocity, and with this acceleration, data needs to be processed in streaming settings as opposed to the traditional store and process model. Distributed streaming frameworks are designed to process such data in real time with reasonable time constraints. Apache Heron is a production-ready large-scale distributed stream processing framework. The network is of utmost importance to scale streaming applications to large numbers of nodes with a reasonable latency. High performance computing (HPC) clusters feature interconnects that can perform at higher levels than traditional Ethernet. In this paper the authors present their findings on integrating the Apache Heron distributed stream processing system with two high performance interconnects, Infiniband and Intel Omni-Path, and show that they can be utilized to improve performance of distributed streaming applications.

european conference on parallel processing | 2017

Accelerating the 3-D FFT Using a Heterogeneous FPGA Architecture

Matthew Anderson; Maciej Brodowicz; Martin Swany; Thomas Sterling

Future Exascale architectures will likely make extensive use of computing accelerators such as Field Programmable Gate Arrays (FPGAs) given that these accelerators are very power efficient. Oftentimes, these FPGAs are located at the network interface card (NIC) and switch level in order to accelerate network operations, incorporate contention-avoiding routing schemes, and perform computations directly on the NIC and bypass the arithmetic logic unit (ALU) of the CPU. This work explores just such a heterogeneous FPGA architecture in the context of two kernels that are driving applications in leadership machines: the 3-D Fast Fourier Transform (3-D FFT) and Asynchronous Multi-Tasking (AMT). The machine explored here is a DataVortex system which consists of conventional processors but with programmable logic incorporated in the memory architecture. The programmable logic controls the network, is incorporated both in the network interface cards and the network switches, and implements contention-avoiding network routing. Both the 3-D FFT and AMT kernels show compelling performance for deployment to FFT-driven applications in both molecular dynamics and density functional theory.

IEEE Micro | 2017

Offloading Collective Operations to Programmable Logic

Omer Arap; Lucas R.B. Brasilino; Ezra Kissel; Alexander Shroyer; Martin Swany

The authors describe their architecture and implementation for offloading collective operations to programmable logic in the communication substrate. Collective operations are widely used in parallel processing. Their design and implementation strategies affect the performance of many high-performance computing applications that utilize them. Collectives are central to the message passing interface (MPI) programming model. The programmable logic provided by field-programmable gate arrays (FPGAs) is a powerful option for creating task-specific logic to aid applications. The authors’ approach is applicable in scenarios where there is programmable logic in the communication pipeline and can be used to accelerate various network-based operations. In this article, the authors present a general collective offloading framework for use in applications using MPI. They evaluate their approach on the Xilinx Zynq system on a chip and an FPGA-based network interface card called the NetFPGA. Results are presented both from microbenchmarks and a benchmark scientific application using MPI.
