Animesh Trivedi
IBM
Publications
Featured research published by Animesh Trivedi.
asia pacific workshop on systems | 2011
Animesh Trivedi; Bernard Metzler; Patrick Stuedi
Modern cloud computing infrastructures are steadily pushing the performance of their network stacks. At the hardware level, some cloud providers have already upgraded parts of their network to 10GbE. At the same time there is a continuous effort within the cloud community to improve the network performance inside the virtualization layers. The low-latency/high-throughput properties of these network interfaces are not only opening the cloud to HPC applications; they will also be welcomed by traditional large-scale web applications and data processing frameworks. However, as commodity networks get faster, the burden on the end hosts increases. Inefficient memory copying in socket-based networking takes up a significant fraction of the end-to-end latency and also creates serious CPU load on the host machine. Years ago, the supercomputing community developed RDMA network stacks such as InfiniBand that offer both low end-to-end latency and a low CPU footprint. While adopting RDMA in the commodity cloud environment is difficult (it is costly and requires special hardware), we argue in this paper that most of the benefits of RDMA can in fact be provided in software. To demonstrate our findings we have implemented and evaluated a prototype of a software-based RDMA stack. Compared to a socket/TCP approach (with TCP receive copy offload), our results show a significant reduction in end-to-end latency for messages larger than a modest 64 kB, and a reduction in CPU load (without TCP receive copy offload) for better efficiency while saturating the 10 Gbit/s link.
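The copy-avoidance argument can be made concrete with the standard libibverbs interface, which software RDMA stacks also expose. The sketch below is illustrative only, not the paper's prototype; the queue pair, protection domain, and the remote buffer coordinates (remote_addr, rkey) are assumed to have been established elsewhere, e.g., via librdmacm.

```c
/* Sketch: zero-copy transmit path with RDMA verbs (libibverbs), as opposed
 * to a socket send() that copies the buffer into kernel socket memory.
 * Assumes connection setup (qp, pd, remote_addr, rkey) was done elsewhere. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int send_zero_copy(struct ibv_pd *pd, struct ibv_qp *qp,
                   void *buf, size_t len,
                   uint64_t remote_addr, uint32_t rkey)
{
    /* One-time registration pins the buffer; afterwards the (soft-)RNIC
     * reads it directly, with no per-message copy on the host CPU. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,   /* one-sided, no receiver-side copy */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr); /* data moves without socket copies */
}
```

Because the data lands directly in the pre-registered remote buffer, the per-message copy of the socket path disappears on both ends, which is where the latency and CPU savings discussed above come from.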
symposium on cloud computing | 2014
Patrick Stuedi; Animesh Trivedi; Bernard Metzler; Jonas Pfefferle
Remote Procedure Call (RPC) has been the cornerstone of distributed systems since the early 80s. Recently, new classes of large-scale distributed systems running in data centers have been posing extra challenges for RPC systems in terms of scaling and latency. We find that existing RPC systems make very poor use of resources (CPU, memory, network) and are not ready to handle these upcoming workloads. In this paper we present DaRPC, an RPC framework which uses RDMA to implement a tight integration between RPC message processing and network processing in user space. DaRPC efficiently distributes computation, network resources and RPC resources across cores and memory to achieve a high aggregate throughput (2-3M ops/sec) at a very low per-request latency (10μs with iWARP). In our evaluation we show that DaRPC can boost the RPC performance of existing distributed systems in the cloud by more than an order of magnitude in both throughput and latency.
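The tight integration of RPC and network processing can be pictured as a per-core loop that owns its own queue pair and completion queue, polls completions in user space, and runs the RPC handler inline. The sketch below is a hypothetical illustration in C with libibverbs (DaRPC itself is a Java framework); connection setup, buffer registration, and the rpc_handle() function are assumed to exist elsewhere.

```c
/* Hypothetical per-core RPC dispatch loop in the spirit of DaRPC: each core
 * owns a QP/CQ pair, busy-polls completions in user space, runs the handler
 * inline, and posts the reply with a two-sided SEND. Illustrative only. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

struct rpc_ctx {
    struct ibv_qp *qp;       /* this core's queue pair       */
    struct ibv_cq *cq;       /* this core's completion queue */
    char          *req_buf;  /* registered request buffer    */
    char          *rsp_buf;  /* registered response buffer   */
    uint32_t       req_lkey, rsp_lkey;
    uint32_t       msg_size;
};

/* Application-supplied handler: fills rsp from req, returns reply length. */
size_t rpc_handle(const char *req, size_t req_len, char *rsp);

void rpc_core_loop(struct rpc_ctx *c)
{
    struct ibv_wc wc;
    for (;;) {
        /* Completions are consumed in user space; no syscall on this path. */
        if (ibv_poll_cq(c->cq, 1, &wc) <= 0)
            continue;
        if (wc.status != IBV_WC_SUCCESS || wc.opcode != IBV_WC_RECV)
            continue;                       /* skip send completions */

        size_t rsp_len = rpc_handle(c->req_buf, wc.byte_len, c->rsp_buf);

        struct ibv_sge sge = { .addr = (uint64_t)(uintptr_t)c->rsp_buf,
                               .length = (uint32_t)rsp_len,
                               .lkey = c->rsp_lkey };
        struct ibv_send_wr wr = { .wr_id = wc.wr_id, .sg_list = &sge,
                                  .num_sge = 1, .opcode = IBV_WR_SEND,
                                  .send_flags = IBV_SEND_SIGNALED };
        struct ibv_send_wr *bad;
        ibv_post_send(c->qp, &wr, &bad);    /* reply leaves from this core */

        /* Re-arm the receive slot for the next request. */
        struct ibv_sge rsge = { .addr = (uint64_t)(uintptr_t)c->req_buf,
                                .length = c->msg_size, .lkey = c->req_lkey };
        struct ibv_recv_wr rwr = { .wr_id = wc.wr_id,
                                   .sg_list = &rsge, .num_sge = 1 };
        struct ibv_recv_wr *rbad;
        ibv_post_recv(c->qp, &rwr, &rbad);
    }
}
```

Running one such loop per core, each with private QP/CQ resources, is one way to read the "distributes computation, network resources and RPC resources across cores" claim above.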
symposium on cloud computing | 2013
Patrick Stuedi; Bernard Metzler; Animesh Trivedi
Network latency has become increasingly important for data center applications. Accordingly, several efforts at both the hardware and the software level have been made to reduce latency in data centers. Limited attention, however, has been paid to the network latencies of distributed systems running inside an application container such as the Java Virtual Machine (JVM) or the .NET runtime. In this paper, we first highlight the latency overheads observed in several well-known Java-based distributed systems. We then present jVerbs, a networking framework for the JVM which achieves bare-metal latencies on the order of single-digit microseconds using Remote Direct Memory Access (RDMA). With jVerbs, applications map the network device directly into the JVM, cutting through both the application virtual machine and the operating system. In the paper, we discuss the design and implementation of jVerbs and demonstrate how it can be used to improve latencies in several popular distributed systems running in data centers.
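The latency win comes from the kernel-bypass data path that jVerbs exposes inside the JVM: a request is posted and its completion busy-polled entirely in user space, with no system call or data copy on the critical path. The sketch below shows that round trip in plain C with libibverbs for illustration only (it is not jVerbs code); the queue pair, registered buffer, a pre-posted receive for the echo, and an echoing peer are all assumed to be set up elsewhere.

```c
/* Sketch of the kernel-bypass round trip that jVerbs builds on, written
 * against plain libibverbs for illustration. Assumes: qp/cq and the
 * registered buffer exist, a receive WR for the echo is already posted,
 * and the peer echoes every message it receives. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <time.h>

double pingpong_once_us(struct ibv_qp *qp, struct ibv_cq *cq,
                        void *buf, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = { .addr = (uint64_t)(uintptr_t)buf,
                           .length = len, .lkey = lkey };
    struct ibv_send_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    struct ibv_wc wc;
    struct timespec t0, t1;
    int got_echo = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ibv_post_send(qp, &wr, &bad);   /* posted from user space; on a hardware
                                       RNIC this is a doorbell write, not a
                                       system call */

    /* Busy-poll until the echoed message arrives; again no kernel entry. */
    while (!got_echo)
        if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.status == IBV_WC_SUCCESS &&
            wc.opcode == IBV_WC_RECV)
            got_echo = 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) * 1e6 +
           (t1.tv_nsec - t0.tv_nsec) / 1e3;   /* round-trip time in µs */
}
```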
virtual execution environments | 2015
Jonas Pfefferle; Patrick Stuedi; Animesh Trivedi; Bernard Metzler; Ioannis Koltsidas; Thomas R. Gross
DMA-capable interconnects, providing ultra-low latency and high bandwidth, are increasingly being used in the context of distributed storage and data processing systems. However, the deployment of such systems in virtualized data centers is currently inhibited by the lack of a flexible and high-performance virtualization solution for RDMA network interfaces. In this work, we present a hybrid virtualization architecture which builds on the separation of control and data paths available in RDMA. With hybrid virtualization, RDMA control operations are virtualized with hypervisor involvement, while data operations are set up to bypass the hypervisor completely. We describe HyV (Hybrid Virtualization), a virtualization framework for RDMA devices implementing this hybrid architecture. In the paper, we provide a detailed evaluation of HyV for different RDMA technologies and operations. We further demonstrate the advantages of HyV in the context of a real distributed system by running RAMCloud on a set of HyV-enabled virtual machines deployed across a 6-node RDMA cluster. The performance results we obtained show that hybrid virtualization enables bare-metal RDMA performance inside virtual machines while retaining the flexibility typically associated with paravirtualization.
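From a guest application's point of view, the hybrid split is visible in which verbs it calls. The sketch below is ordinary libibverbs code, not HyV internals; the comments mark which calls fall on the control path that a HyV-style design routes through the hypervisor, and which fall on the data path that bypasses it. Error handling and connection establishment are elided.

```c
/* Sketch of guest-side verbs calls, annotated with the path each one takes
 * under a hybrid (HyV-style) split. Plain libibverbs, not HyV internals. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

struct guest_res {
    struct ibv_pd *pd;
    struct ibv_cq *cq;
    struct ibv_qp *qp;
    struct ibv_mr *mr;
    void *buf;
};

/* Control path: infrequent resource setup, virtualized with hypervisor
 * involvement. */
void guest_setup(struct ibv_context *ctx, struct guest_res *r)
{
    r->pd  = ibv_alloc_pd(ctx);                               /* control */
    r->buf = malloc(4096);
    r->mr  = ibv_reg_mr(r->pd, r->buf, 4096,                  /* control */
                        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    r->cq  = ibv_create_cq(ctx, 64, NULL, NULL, 0);           /* control */

    struct ibv_qp_init_attr attr = {
        .send_cq = r->cq, .recv_cq = r->cq, .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    r->qp = ibv_create_qp(r->pd, &attr);                      /* control */
    /* Connection establishment (also control path) is elided. */
}

/* Data path: per-IO operations mapped straight to the device; the
 * hypervisor is not involved. */
void guest_io(struct guest_res *r, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uint64_t)(uintptr_t)r->buf,
                           .length = 4096, .lkey = r->mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 7, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    struct ibv_send_wr *bad;
    ibv_post_send(r->qp, &wr, &bad);                          /* data, bypass */

    struct ibv_wc wc;
    while (ibv_poll_cq(r->cq, 1, &wc) == 0)                   /* data, bypass */
        ;
}
```

Because the frequent, latency-critical calls sit entirely on the bypassed data path, bare-metal RDMA performance inside the VM is plausible while setup flexibility stays with the hypervisor.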
conference on emerging network experiment and technology | 2013
Animesh Trivedi; Bernard Metzler; Patrick Stuedi; Thomas R. Gross
The performance of large-scale data-intensive applications running on thousands of machines depends considerably on the performance of the network. To deliver better application performance on rapidly evolving high-bandwidth, low-latency interconnects, researchers have proposed the use of network accelerator devices. However, despite the initial enthusiasm, translating network accelerators' capabilities into high application performance remains challenging. In this paper, we describe our experience with, and discuss issues we uncovered in, network acceleration using Remote Direct Memory Access (RDMA) capable network controllers (RNICs). RNICs offload complete packet processing onto the network controller and provide direct user-space access to the networking hardware. Our analysis shows that multiple (un)related factors significantly influence the performance gains seen by the end application. We identify factors that span the whole stack, ranging from low-level architectural issues (cache and DMA interaction, hardware prefetching) to high-level application parameters (buffer size, access pattern). We discuss the implications of our findings for application performance and for the future integration of network acceleration technology into systems.
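Buffer size is one of the application-level parameters named above, and its effect is typically exposed with a micro-benchmark of the kind sketched below (illustrative only, not the paper's harness). Connection setup, memory registration, and the remote buffer coordinates are assumed to be handled elsewhere.

```c
/* Sketch of a buffer-size sweep over one-sided RDMA WRITEs, the kind of
 * micro-benchmark used to show how an application parameter (message size)
 * changes what the RNIC delivers. Illustrative only; qp/cq, the registered
 * buffer, and the remote coordinates come from elsewhere. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

void sweep_sizes(struct ibv_qp *qp, struct ibv_cq *cq,
                 void *buf, uint32_t lkey,
                 uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_wc wc;
    for (uint32_t len = 4; len <= 64 * 1024; len *= 2) {
        struct ibv_sge sge = { .addr = (uint64_t)(uintptr_t)buf,
                               .length = len, .lkey = lkey };
        struct ibv_send_wr wr = { .wr_id = len, .sg_list = &sge, .num_sge = 1,
                                  .opcode = IBV_WR_RDMA_WRITE,
                                  .send_flags = IBV_SEND_SIGNALED };
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        struct ibv_send_wr *bad;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ibv_post_send(qp, &wr, &bad);
        while (ibv_poll_cq(cq, 1, &wc) == 0)   /* wait for local completion */
            ;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("%u bytes: %.2f us (single signaled WRITE)\n", len, us);
    }
}
```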
international conference on distributed computing systems | 2015
Animesh Trivedi; Patrick Stuedi; Bernard Metzler; Clemens Lutz; Martin L. Schmatz; Thomas R. Gross
Distributed DRAM stores have become an attractive option for providing fast data access to analytics applications. To accelerate the performance of these stores, researchers have proposed using RDMA technology. RDMA offers high-bandwidth and low-latency data access by carefully separating resource setup from IO operations, and by making IO operations fast through rich network semantics and offloading. Despite recent interest, leveraging the full potential of RDMA in a distributed environment remains a challenging task. In this paper, we present RDMA Store, or RStore, a DRAM-based data store that delivers high performance by extending RDMA's separation philosophy to a distributed setting. RStore achieves high aggregate bandwidth (705 Gb/s) and close-to-hardware latency on our 12-machine testbed. We developed a distributed graph processing framework and a key-value sorter using RStore's unique memory-like API. The graph processing framework, which relies on RStore for low-latency graph access, outperforms state-of-the-art systems by margins of 2.6--4.2× when calculating PageRank. The key-value sorter can sort 256 GB of data in 31.7 seconds, which is 8× better than Hadoop TeraSort in a similar setting.
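The abstract does not spell out RStore's memory-like API, so the sketch below only illustrates the separation principle it extends: registration and key exchange happen once at setup, after which each read of remote DRAM is a single one-sided RDMA READ with no further coordination and no remote CPU involvement. The struct and function names are illustrative, not RStore's API.

```c
/* Sketch of the setup/IO separation that RStore extends to a distributed
 * setting: expensive setup (registration, rkey exchange) happens once; each
 * remote-memory read afterwards is a single one-sided RDMA READ.
 * Names are illustrative, not RStore's API. */
#include <infiniband/verbs.h>
#include <stdint.h>

/* Filled in once during setup, e.g. after exchanging rkeys out of band. */
struct remote_region {
    struct ibv_qp *qp;
    struct ibv_cq *cq;
    uint64_t       remote_addr;   /* base of the remote DRAM region */
    uint32_t       rkey;          /* its steering key               */
};

/* IO path: read `len` bytes at `offset` of the remote region into a local
 * registered buffer; the remote CPU is not involved at all. */
int remote_read(struct remote_region *r, uint64_t offset,
                void *local_buf, uint32_t len, uint32_t local_lkey)
{
    struct ibv_sge sge = { .addr = (uint64_t)(uintptr_t)local_buf,
                           .length = len, .lkey = local_lkey };
    struct ibv_send_wr wr = { .wr_id = offset, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_READ,
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = r->remote_addr + offset;
    wr.wr.rdma.rkey        = r->rkey;

    struct ibv_send_wr *bad;
    if (ibv_post_send(r->qp, &wr, &bad))
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(r->cq, 1, &wc) == 0)   /* completion = data is local */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```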
acm international conference on systems and storage | 2017
Animesh Trivedi; Nikolas Ioannou; Bernard Metzler; Patrick Stuedi; Jonas Pfefferle; Ioannis Koltsidas; Kornilios Kourtis; Thomas R. Gross
During the past decade, network and storage devices have undergone rapid performance improvements, delivering ultra-low latency and several Gbps of bandwidth. Nevertheless, current network and storage stacks fail to deliver this hardware performance to applications, often due to the loss of IO efficiency from stalled CPU performance. While many efforts attempt to address this issue solely on either the network or the storage stack, achieving high performance for networked-storage applications requires a holistic approach that considers both. In this paper, we present FlashNet, a software IO stack that unifies high-performance network properties with flash storage access and management. FlashNet builds on RDMA principles and abstractions to provide a direct, asynchronous, end-to-end data path between a client and remote flash storage. The key insight behind FlashNet is to co-design the stack's components (an RDMA controller, a flash controller, and a file system) to enable cross-stack optimizations and maximize IO efficiency. In micro-benchmarks, FlashNet improves 4 kB network IOPS by 38.6% to 1.22M, decreases access latency by 43.5% to 50.4 µs, and prolongs flash lifetime by 1.6--5.9× for writes. We illustrate the capabilities of FlashNet by building a key-value store on it and by porting a distributed data store that uses RDMA. The use of FlashNet's RDMA API improves the performance of the key-value store by 2× and requires minimal changes for the ported data store to access remote flash devices.
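The asynchronous, end-to-end client path can be pictured as posting several one-sided reads against a remote region and harvesting their completions later, overlapping the server-side flash access with the network transfer. The sketch below is illustrative client-side code in plain libibverbs, not FlashNet's API; it assumes the server-side co-designed RDMA/flash controllers and file system back the target region with flash, and that qp/cq, registrations, and the batch size (within the send queue depth) are handled elsewhere.

```c
/* Sketch of an asynchronous batched read path of the kind FlashNet's RDMA
 * abstractions enable on the client. Illustrative only; the mapping of the
 * remote region onto flash is the server-side stack's job. */
#include <infiniband/verbs.h>
#include <stdint.h>

int read_batch_async(struct ibv_qp *qp, struct ibv_cq *cq,
                     void *local_buf, uint32_t local_lkey,
                     uint64_t remote_base, uint32_t rkey,
                     uint32_t block, int n)   /* n must fit the send queue */
{
    struct ibv_send_wr *bad;
    for (int i = 0; i < n; i++) {
        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)local_buf + (uint64_t)i * block,
            .length = block,
            .lkey   = local_lkey,
        };
        struct ibv_send_wr wr = { .wr_id = (uint64_t)i, .sg_list = &sge,
                                  .num_sge = 1, .opcode = IBV_WR_RDMA_READ,
                                  .send_flags = IBV_SEND_SIGNALED };
        wr.wr.rdma.remote_addr = remote_base + (uint64_t)i * block;
        wr.wr.rdma.rkey        = rkey;
        if (ibv_post_send(qp, &wr, &bad))
            return -1;
    }

    /* Harvest completions as they arrive; no per-block read() system call,
     * no blocking in the kernel. */
    int done = 0;
    struct ibv_wc wc[16];
    while (done < n) {
        int got = ibv_poll_cq(cq, 16, wc);
        for (int i = 0; i < got; i++)
            if (wc[i].status != IBV_WC_SUCCESS)
                return -1;
        done += got;
    }
    return 0;
}
```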
usenix annual technical conference | 2012
Patrick Stuedi; Animesh Trivedi; Bernard Metzler
Archive | 2011
Philip Werner Frey; Bernard Metzler; Animesh Trivedi
hot topics in operating systems | 2013
Animesh Trivedi; Patrick Stuedi; Bernard Metzler; Roman A. Pletka; Blake G. Fitch; Thomas R. Gross