Vishal Ahuja
University of California, Davis
Publications
Featured research published by Vishal Ahuja.
IEEE MultiMedia | 2013
Amit Pande; Vishal Ahuja; Rajarajan Sivaraj; Eilwoo Baik; Prasant Mohapatra
Wireless network traffic is dominated by video and requires new ways to maximize the user experience and optimize networks to prevent saturation. The exploding number of subscribers in cellular networks has exponentially increased the volume and variety of multimedia content flowing across the network. This article details some challenges in the delivery of multimedia content over 4G networks for several application scenarios. To meet the increasing demand for video applications in cellular and wireless traffic, these challenges must be addressed efficiently.
Architectures for Networking and Communications Systems | 2012
Vishal Ahuja; Matthew K. Farrens; Dipak Ghosal
For a given TCP or UDP flow, protocol processing of incoming packets is performed on the core that receives the interrupt, while the user-space application which consumes the data may run on the same or a different core. If the cores are not the same, additional costs due to context switches, cache misses, and the movement of data between the caches of the cores may occur. The magnitude of this cost depends upon the processor affinity of the user-space process relative to the network stack. In this paper, we present a prototype implementation of a tool which enables the application processing and protocol processing to occur on cores which share the lowest cache level. The Cache-Aware Affinity Daemon (CAAD) analyzes the topology of the die and the NIC characteristics and conveys information to the sender which allows the entire end-to-end path for each new flow to be managed and controlled. This is done in a light-weight manner for both uni- and bi-directional flows. Measurements show that for bulk data transfers using commodity multi-core machines, the use of CAAD improves the overall TCP throughput by as much as 31%, and reduces the cache miss rate by as much as 37.5%. GridFTP combined with CAAD improves the download time for large file transfers by up to 18%.
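As a rough illustration of the cache-aware placement this paper describes, the sketch below shows how a Linux end-system's sysfs cache topology can be used to find the cores that share the last-level cache with the core servicing the NIC's receive interrupt, and to pin the consuming application onto one of them. This is a hypothetical re-creation of the idea, not the CAAD tool itself; the function names and the hard-coded interrupt core are illustrative.

```python
"""Minimal sketch of the cache-aware affinity idea (hypothetical
re-creation, not the authors' CAAD tool): find cores sharing the
last-level cache with the NIC's interrupt core and pin the consuming
application there."""
import glob
import os

def cores_sharing_llc(core: int) -> set[int]:
    """Cores sharing the highest-index (last-level) cache with `core`,
    read from the standard sysfs cache topology."""
    indices = sorted(glob.glob(f"/sys/devices/system/cpu/cpu{core}/cache/index*"))
    llc = indices[-1]                      # highest index == last-level cache
    with open(os.path.join(llc, "shared_cpu_list")) as f:
        spec = f.read().strip()            # e.g. "0-5" or "0,2,4"
    cores: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cores.update(range(lo, hi + 1))
        else:
            cores.add(int(part))
    return cores

def pin_app_near_irq(app_pid: int, irq_core: int) -> int:
    """Pin `app_pid` to a core that shares the LLC with `irq_core`,
    preferring a different core on the same cache domain."""
    candidates = cores_sharing_llc(irq_core) - {irq_core}
    target = min(candidates) if candidates else irq_core
    os.sched_setaffinity(app_pid, {target})
    return target

if __name__ == "__main__":
    # Example: assume the NIC's receive-queue interrupt is pinned to core 2
    # and this process is the data-consuming application.
    print("pinned to core", pin_app_near_irq(os.getpid(), irq_core=2))
```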
Network Aware Data Management | 2013
Vishal Ahuja; Mehmet Balman; Matthew K. Farrens; Dipak Ghosal; Eric Pouyoul; Brian Tierney
Multi-core end-systems use Receive Side Scaling (RSS) to parallelize protocol processing. RSS uses a hash function on the standard flow descriptors and an indirection table to assign incoming packets to receive queues which are pinned to specific cores. This ensures flow affinity, in that the interrupt processing of all packets belonging to a specific flow is handled by the same core. A key limitation of standard RSS is that it does not consider the application process that consumes the incoming data in determining the flow affinity. In this paper, we carry out a detailed experimental analysis of the performance impact of application affinity in a 40 Gbps testbed network with a dual hexa-core end-system. We show, contrary to conventional wisdom, that when the application process and the flow are affinitized to the same core, the performance (measured in terms of end-to-end TCP throughput) is significantly lower than the line rate. Near line rate performance is observed when the flow and the application process are affinitized to different cores belonging to the same socket. Furthermore, affinitizing the application and the flow to cores on different sockets results in significantly lower throughput than the line rate. These results arise due to the memory bottleneck, which is demonstrated using preliminary correlational data on the cache hit rate in the core that services the application process.
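To make the RSS mechanism concrete, the sketch below models how standard RSS picks a receive queue: a Toeplitz hash over the TCP/IPv4 4-tuple indexes an indirection table of queues, so every packet of a flow lands on the same queue and core while the placement of the consuming application plays no role. The key, table size, and field ordering here are example choices, not any specific NIC's configuration.

```python
"""Illustrative model of standard RSS queue selection (example key, table,
and field ordering; not a specific NIC's configuration)."""
import ipaddress
import struct

def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Standard Toeplitz hash: for every set bit of `data`, XOR in the
    32-bit window of `key` that starts at that bit position."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result

def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              key: bytes, table: list[int]) -> int:
    """Receive queue for a TCP/IPv4 flow: hash the 4-tuple, then index the
    indirection table with the low-order hash bits."""
    data = (ipaddress.IPv4Address(src_ip).packed +
            ipaddress.IPv4Address(dst_ip).packed +
            struct.pack("!HH", src_port, dst_port))
    return table[toeplitz_hash(key, data) % len(table)]

if __name__ == "__main__":
    key = bytes(range(40))                  # placeholder 40-byte RSS key
    table = [q % 12 for q in range(128)]    # 128-entry table over 12 queues
    # Every packet of this flow hashes identically -> same queue/core.
    print(rss_queue("10.0.0.1", "10.0.0.2", 5001, 34567, key, table))
```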
Network Aware Data Management | 2014
Vishal Ahuja; Matthew K. Farrens; Dipak Ghosal; Mehmet Balman; Eric Pouyoul; Brian Tierney
Network throughput is scaling up to higher data rates while end-system processors are scaling out to multiple cores. In order to optimize high-speed data transfer into multicore end-systems, techniques such as network adapter offloads and performance tuning have received a great deal of attention. Furthermore, several methods of multithreading the network receive process have been proposed. However, thus far attention has been focused on how to set the tuning parameters and which offloads to select for higher performance, and little has been done to understand why the settings do (or do not) work. In this paper, we build on previous research to track down the source(s) of the end-system bottleneck for high-speed TCP flows. For the purposes of this paper, we consider protocol processing efficiency to be the amount of system resources used (such as CPU and cache) per unit of achieved throughput (in Gbps). The amounts of the various system resources consumed are measured using low-level system event counters. Affinitization, or core binding, is the decision about which processor cores on an end system are responsible for interrupt, network, and application processing. We conclude that affinitization has a significant impact on protocol processing efficiency, and that the performance bottleneck of the network receive process changes drastically with three distinct affinitization scenarios.
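The three affinitization scenarios referred to here, flow and application on the same core, on different cores of the same socket, and on cores of different sockets, can be distinguished from the standard Linux topology files. The helper below is an illustration of that classification under those assumptions, not the instrumentation used in the paper.

```python
"""Classify an (interrupt core, application core) pairing into the three
affinitization scenarios discussed above, using Linux sysfs topology.
Illustrative helper, not the paper's instrumentation."""

def socket_of(core: int) -> int:
    """Physical package (socket) id of a core, read from sysfs."""
    path = f"/sys/devices/system/cpu/cpu{core}/topology/physical_package_id"
    with open(path) as f:
        return int(f.read())

def affinitization_scenario(irq_core: int, app_core: int) -> str:
    """Return which of the three scenarios a core pairing falls into."""
    if irq_core == app_core:
        return "same core"
    if socket_of(irq_core) == socket_of(app_core):
        return "same socket, different cores"
    return "different sockets"

if __name__ == "__main__":
    # Example on a dual-socket machine: flow interrupts handled on core 0,
    # application pinned to core 7.
    print(affinitization_scenario(0, 7))
```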
Future Generation Computer Systems | 2016
Vishal Ahuja; Matthew K. Farrens; Dipak Ghosal; Mehmet Balman; Eric Pouyoul; Brian Tierney
Network throughput is scaling up to higher data rates while end-system processors are scaling out to multiple cores. In order to optimize high-speed data transfer into multicore end-systems, techniques such as network adaptor offloads and performance tuning have received a great deal of attention. Furthermore, several methods of multi-threading the network receive process have been proposed. However, thus far attention has been focused on how to set the tuning parameters and which offloads to select for higher performance, and little has been done to understand why the various parameter settings do (or do not) work. In this paper, we build on previous research to track down the sources of the end-system bottleneck for high-speed TCP flows. We define protocol processing efficiency to be the amount of system resources (such as CPU and cache) used per unit of achieved throughput (in Gbps). The amounts of the various system resources consumed are measured using low-level system event counters. In a multicore end-system, affinitization, or core binding, is the decision regarding how the various tasks of the network receive process, including interrupt, network, and application processing, are assigned to the different processor cores. We conclude that affinitization has a significant impact on protocol processing efficiency, and that the performance bottleneck of the network receive process changes significantly with different affinitizations. Highlights: Affinity, or core binding, maps processes to cores in a multicore system. We characterized the effect of different receiving flow and application affinities. We used OProfile as an introspection tool to examine software bottlenecks. The location of the end-system bottleneck was dependent on the choice of affinity. There are multiple sources of end-system bottlenecks on commodity hardware.
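The efficiency metric defined here, resources consumed per Gbps of achieved throughput, can be reproduced in spirit with any low-level counter tool. The paper used OProfile; the sketch below is a hedged stand-in that wraps `perf stat` instead, and it assumes the receiver command, the counter events, and the measured throughput are supplied by the caller (the iperf3 example and the 9.4 Gbps figure are hypothetical).

```python
"""Sketch of the efficiency metric defined above: counter events consumed
per Gbps of achieved throughput. Uses `perf stat` as a stand-in for the
paper's OProfile-based measurement; command and throughput are assumptions."""
import subprocess

def efficiency(receiver_cmd: list[str], achieved_gbps: float,
               events: str = "cycles,cache-misses") -> dict[str, float]:
    """Run the receiver under `perf stat` and return each counter per Gbps."""
    proc = subprocess.run(["perf", "stat", "-x", ",", "-e", events, "--"]
                          + receiver_cmd,
                          capture_output=True, text=True)
    per_gbps: dict[str, float] = {}
    # perf's CSV mode writes one line per event to stderr: value,unit,event,...
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].replace(".", "").isdigit():
            per_gbps[fields[2]] = float(fields[0]) / achieved_gbps
    return per_gbps

if __name__ == "__main__":
    # Hypothetical example: an iperf3 receiver pinned to core 4, with the
    # throughput it reported (assumed here to be 9.4 Gbps).
    cmd = ["taskset", "-c", "4", "iperf3", "-s", "-1"]
    print(efficiency(cmd, achieved_gbps=9.4))
```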
Cluster Computing and the Grid | 2012
Vishal Ahuja; Dipak Ghosal; Matthew K. Farrens
Data centers are being deployed in a wide variety of environments (cloud computing, scientific, financial, defense, etc.). When geographically distributed, these data centers must transmit and receive growing volumes of data. In order to avoid congestion in the public internet, most use high-speed dedicated optical networks, which can be thought of as private highways for carrying data. In this work, we examined the impact of such high-speed network traffic on a commodity multicore machine, and identified a number of scenarios that cause packet loss and degraded throughput due to the end-system's inability to consume incoming data fast enough. We show that high-speed single-flow traffic nullifies the benefits of multicore systems and multiqueue NICs, and we propose an end-system aware flow bifurcation technique to optimize the data transfer time using rate-based protocols. Using introspective end-system modeling, we determine the optimal number of parallel flows required to utilize the available bandwidth, and the optimal rate for each of the flows. We compare our approach with GridFTP, which is a widely used data transfer protocol in computational grids, and show that our approach performs better (particularly when the end-system losses occur in the receive ring buffer).
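As a back-of-the-envelope illustration of the flow-bifurcation sizing described here (not the paper's introspective end-system model), the sketch below picks a number of parallel flows and a common per-flow rate so that the dedicated circuit is filled without any single flow exceeding what one receive core is estimated to be able to consume. All the numbers in the example are hypothetical.

```python
"""Toy version of flow-bifurcation sizing: given the circuit bandwidth and
an estimate of how fast one core can sink a single flow, choose the number
of parallel flows and a common per-flow rate. Not the paper's model."""
import math

def bifurcate(link_gbps: float, per_flow_sink_gbps: float,
              max_flows: int) -> tuple[int, float]:
    """Return (number of flows, rate per flow in Gbps)."""
    # Enough flows that each flow's share stays within one core's capacity.
    n = math.ceil(link_gbps / per_flow_sink_gbps)
    n = min(max(n, 1), max_flows)
    rate = min(link_gbps / n, per_flow_sink_gbps)
    return n, rate

if __name__ == "__main__":
    # Hypothetical numbers: a 10 Gbps circuit, a receiver whose single core
    # drains roughly 3 Gbps per flow, and at most 8 parallel flows.
    n, r = bifurcate(10.0, 3.0, 8)
    print(f"{n} flows at {r:.2f} Gbps each")
```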
Architectures for Networking and Communications Systems | 2014
Vishal Ahuja; Matthew K. Farrens; Dipak Ghosal; Mehmet Balman; Eric Pouyoul; Brian Tierney
Network throughput is scaling up to higher data rates while processors are scaling out to multiple cores. In order to optimize high-speed data transfer into multicore end-systems, network adapter offloads and performance tuning have received a great deal of attention. However, much of this attention is focused on how to set the tuning parameters and which offloads to select for higher performance, and not on why they do (or do not) work. In this study, we have attempted to address two issues that impact data transfer performance. The first is the impact of processor core affinity (or core binding), which determines which processor core or cores handle certain tasks in a network- or I/O-heavy application running on a multicore end-system. The second is the impact of Ethernet pause frames, which provide link-layer flow control in addition to the end-to-end flow control provided by TCP. The goal of our research is to delve deeper into why these tuning suggestions and this offload exist, and how they affect the end-to-end performance and efficiency of a single, large TCP flow.
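The link-layer flow control in question is configured per interface on Linux via `ethtool -a` (query) and `ethtool -A` (set). The small helper below shows that interaction; the interface name is an example, and enabling pause frames requires privileges and NIC support.

```python
"""Small helper around `ethtool` to inspect and enable Ethernet pause
frames (link-layer flow control) on a NIC, as studied above. The interface
name "eth0" is only an example."""
import subprocess

def pause_status(iface: str) -> str:
    """Return the current pause-frame settings reported by `ethtool -a`."""
    return subprocess.run(["ethtool", "-a", iface],
                          capture_output=True, text=True).stdout

def enable_pause(iface: str) -> None:
    """Turn on RX and TX pause frames (needs privileges and NIC support)."""
    subprocess.run(["ethtool", "-A", iface, "rx", "on", "tx", "on"],
                   check=True)

if __name__ == "__main__":
    print(pause_status("eth0"))   # example interface name
```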
High Performance Distributed Computing | 2011
Vishal Ahuja; Amitabha Banerjee; Matthew K. Farrens; Dipak Ghosal; Giuseppe Serazzi
The transmission capacity of today's high-speed networks is often greater than the capacity of an end-system (such as a server or a remote client) to consume the incoming data. The mismatch between the network and the end-system, which can be exacerbated by high end-system workloads, will result in incoming packets being dropped at different points in the packet receiving process. In particular, a packet may be dropped in the NIC, in the kernel ring buffer, and (for rate-based protocols) in the socket buffer. To provide reliable data transfers, these losses require retransmissions and, if the loss rate is high enough, result in longer download times. In this paper, we focus on UDP-like rate-based transport protocols, and address the question of how best to estimate the rate at which the end-system can consume data so as to minimize the overall transfer time of a file. We propose a novel queueing network model of the end-system, which consists of a model of the NIC, a model of the kernel ring buffer and the protocol processing, and a model of the socket buffer from which the application process reads the data. We show that using simple and approximate queueing models, we can accurately predict the effective end-system bottleneck rate that minimizes the file transfer time. We compare our protocol with PA-UDP, an end-system aware rate-based transport protocol, and show that our approach performs better, particularly when the packet losses in the NIC and/or the kernel ring buffer are high. We also compare our approach to TCP. Unlike our rate-based scheme, TCP invokes its congestion control algorithm when there are losses in the NIC and the ring buffer. With higher end-to-end delay, this results in significant performance degradation compared to our reliable end-system aware rate-based protocol.
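To give the flavor of the rate-selection idea, the toy calculation below models the NIC, the kernel ring buffer, and the socket buffer as a tandem of finite M/M/1/K queues, approximates their losses independently, and sweeps the sending rate to find the one that minimizes the transfer time once retransmissions of dropped packets are accounted for. This is a simplified stand-in for the paper's queueing network model, and every parameter in the example is hypothetical.

```python
"""Toy rate selection over a tandem of finite M/M/1/K queues standing in
for the NIC, ring buffer, and socket buffer. Simplified illustration, not
the paper's model; all parameters are hypothetical."""

def mm1k_loss(arrival: float, service: float, k: int) -> float:
    """Blocking probability of an M/M/1/K queue."""
    rho = arrival / service
    if abs(rho - 1.0) < 1e-9:
        return 1.0 / (k + 1)
    if rho < 1.0:
        num = rho**k                     # may underflow to 0.0, which is fine
        return (1 - rho) * num / (1 - rho * num)
    inv = rho**(-k)                      # rho > 1: avoid overflow of rho**k
    return (1 - rho) / (inv - rho)

def goodput(rate: float, stages: list[tuple[float, int]]) -> float:
    """Packets/s surviving all stages; each stage is (service rate, K)."""
    offered = rate
    for service, k in stages:
        offered *= 1.0 - mm1k_loss(offered, service, k)
    return offered

def best_rate(file_packets: float, stages: list[tuple[float, int]],
              candidates: list[float]) -> tuple[float, float]:
    """Sending rate minimizing transfer time when losses are retransmitted."""
    best = min(candidates, key=lambda r: file_packets / goodput(r, stages))
    return best, file_packets / goodput(best, stages)

if __name__ == "__main__":
    # Hypothetical end-system stages: NIC, ring buffer, socket buffer
    # (service rate in packets/s, buffer capacity in packets).
    stages = [(1.2e6, 512), (1.0e6, 1024), (0.9e6, 4096)]
    rates = [r * 1e5 for r in range(1, 16)]       # 0.1M .. 1.5M packets/s
    rate, t = best_rate(file_packets=1e7, stages=stages, candidates=rates)
    print(f"best rate {rate:.0f} pkts/s -> transfer time {t:.1f} s")
```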
ACM Computing Surveys | 2018
Vishal Ahuja; Matthew K. Farrens; Brian Tierney; Dipak Ghosal
The gap is widening between the processor clock speed of end-system architectures and network throughput capabilities. It is now physically possible to provide single-flow throughput at speeds of up to 100 Gbps, and 400 Gbps will soon be possible. Most current research into high-speed data networking focuses on managing expanding network capabilities within datacenter Local Area Networks (LANs) or efficiently multiplexing millions of relatively small flows through a Wide Area Network (WAN). However, datacenter hyper-convergence places high-throughput networking workloads on general-purpose hardware, and distributed High-Performance Computing (HPC) applications require time-sensitive, high-throughput end-to-end flows (also referred to as “elephant flows”) to occur over WANs. For these applications, the bottleneck is often the end-system and not the intervening network. Since the problem of the end-system bottleneck was uncovered, many techniques have been developed which address this mismatch with varying degrees of effectiveness. In this survey, we describe the most promising techniques, beginning with network architectures and NIC design, continuing with operating system and end-system architectures, and concluding with clean-slate protocol design.
Communications and Networking Symposium | 2016
Ross K. Gegan; Vishal Ahuja; John D. Owens; Dipak Ghosal
As line rates continue to grow, network security applications such as covert timing channel (CTC) detection must utilize new techniques for processing network flows in order to protect critical enterprise networks. GPU-based packet processing provides one means of scaling the detection of CTCs and other anomalies in network flows. In this paper, we implement a GPU-based detection tool capable of detecting model-based covert timing channels (MBCTCs). The GPU's ability to process a large number of packets in parallel enables more complex detection tests, such as the corrected conditional entropy (CCE) test, a modified version of the conditional entropy measurement which has a variety of applications outside of covert channel detection. In our experiments, we evaluate the CCE test's true- and false-positive detection rates, as well as the time required to perform the test on the GPU. Our results demonstrate that GPU packet processing can be applied successfully to perform real-time CTC detection at near 10 Gbps with high accuracy.
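For readers unfamiliar with the CCE statistic, the sketch below is a plain-Python reference of the corrected conditional entropy measure over binned inter-packet delays: for each pattern length m it takes the conditional entropy EN(m) - EN(m-1), adds the fraction of unique length-m patterns times the first-order entropy, and reports the minimum over m. The paper's implementation runs on the GPU; this CPU version, its equal-width binning, and the pattern-length limit are illustrative choices only.

```python
"""CPU reference sketch of the corrected conditional entropy (CCE) measure
used for covert timing channel detection. Illustrative only; the paper's
test runs on the GPU and its binning choices may differ."""
from collections import Counter
from math import log2

def entropy_of_patterns(symbols: list[int], m: int) -> tuple[float, float]:
    """Empirical entropy of length-m patterns and the fraction of patterns
    that occur exactly once ("unique" patterns)."""
    patterns = [tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1)]
    counts = Counter(patterns)
    total = len(patterns)
    ent = -sum((c / total) * log2(c / total) for c in counts.values())
    unique_frac = sum(1 for c in counts.values() if c == 1) / total
    return ent, unique_frac

def cce(inter_packet_delays: list[float], bins: int = 5,
        max_m: int = 10) -> float:
    """CCE statistic: min over m of [EN(m) - EN(m-1)] + unique_frac(m) * EN(1),
    computed over equal-width bins of the inter-packet delays."""
    lo, hi = min(inter_packet_delays), max(inter_packet_delays)
    width = (hi - lo) / bins or 1.0
    symbols = [min(int((d - lo) / width), bins - 1) for d in inter_packet_delays]
    en1, _ = entropy_of_patterns(symbols, 1)
    prev_en = 0.0
    best = float("inf")
    for m in range(1, max_m + 1):
        en_m, uniq = entropy_of_patterns(symbols, m)
        best = min(best, (en_m - prev_en) + uniq * en1)
        prev_en = en_m
    return best

if __name__ == "__main__":
    # A very regular (channel-like) timing sequence yields a low CCE;
    # jittery legitimate traffic yields a higher one. Hypothetical data:
    regular = [0.01, 0.02] * 200
    print(round(cce(regular), 3))
```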