
Publication


Featured research published by Karthikeyan Vaidyanathan.


International Symposium on Performance Analysis of Systems and Software | 2004

Sockets Direct Protocol over InfiniBand in clusters: is it beneficial?

Pavan Balaji; Sundeep Narravula; Karthikeyan Vaidyanathan; Savitha Krishnamoorthy; Jiesheng Wu; Dhabaleswar K. Panda

The Sockets Direct Protocol (SDP) has recently been proposed to enable sockets-based applications to take advantage of the enhanced features provided by the InfiniBand architecture. In this paper, we study the benefits and limitations of an implementation of SDP. We first analyze the performance of SDP based on a detailed suite of micro-benchmarks. Next, we evaluate it on two different real application domains: (1) a multi-tier data-center environment and (2) a Parallel Virtual File System (PVFS). Our micro-benchmark results show that SDP is able to provide up to 2.7 times better bandwidth than the native sockets implementation over InfiniBand (IPoIB) and significantly better latency for large message sizes. Our experimental results also show that SDP achieves considerably higher performance (improvement of up to 2.4 times) than IPoIB in the PVFS environment. In the data-center environment, SDP outperforms IPoIB for large file transfers in spite of currently being limited by a high connection setup time. However, this limitation is entirely implementation specific, and as InfiniBand software and hardware products are rapidly maturing, we expect it to be overcome soon. Based on this, we show that the projected performance of SDP, without the connection setup time, can outperform IPoIB for small message transfers as well.
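
A key property of SDP is that it preserves the sockets API, so a program like the following plain-sockets bandwidth micro-benchmark can run unchanged over SDP or IPoIB (on implementations that intercept sockets calls). This is a minimal sketch in the spirit of such micro-benchmarks, not the paper's actual test harness; the server address, port, and message sizes below are illustrative.

/* Minimal sockets bandwidth micro-benchmark (client side).
 * SDP's key property is that code like this runs unmodified;
 * host, port, and sizes are illustrative, not from the paper. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/time.h>

#define MSG_SIZE  (64 * 1024)   /* per-send message size */
#define NUM_MSGS  1024          /* number of messages to send */

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port   = htons(5001) };
    inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr); /* hypothetical server */
    if (fd < 0 || connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect");
        return 1;
    }

    char *buf = malloc(MSG_SIZE);
    memset(buf, 'x', MSG_SIZE);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < NUM_MSGS; i++) {
        size_t off = 0;
        while (off < MSG_SIZE) {            /* handle short sends */
            ssize_t n = send(fd, buf + off, MSG_SIZE - off, 0);
            if (n < 0) { perror("send"); return 1; }
            off += (size_t)n;
        }
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("bandwidth: %.2f MB/s\n", (double)MSG_SIZE * NUM_MSGS / 1e6 / secs);
    free(buf);
    close(fd);
    return 0;
}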


International Conference on Cluster Computing | 2006

Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-based Servers

Karthikeyan Vaidyanathan; Hyun-Wook Jin; Dhabaleswar K. Panda

Efficiently capturing resource usage in a shared server environment has been a critical research issue over the past several years. With the amount of resources used by each application becoming more and more divergent and unpredictable, solving this problem is becoming increasingly important. In the past, several researchers have proposed techniques that rely on coarse-grained monitoring of resources in order to avoid the overheads associated with fine-grained monitoring. In this paper, we propose a low-overhead, efficient fine-grained resource monitoring scheme using the advanced Remote Direct Memory Access (RDMA) operations provided by RDMA-enabled interconnects such as InfiniBand (IBA). We evaluate the relative benefits of our approach against traditional approaches in various environments (including micro-benchmarks as well as real applications such as an auction server based on the RUBiS benchmark and the Ganglia distributed monitoring tool). Our results indicate that our approach to fine-grained monitoring can significantly improve overall system utilization, resulting in up to a 25% improvement in the number of requests the cluster system can admit.
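
The core primitive behind such a scheme is a one-sided RDMA read: the monitoring node fetches a statistics buffer from the monitored server's registered memory without involving that server's CPU. Below is a minimal ibverbs sketch of that step, assuming the queue pair, completion queue, and memory registration are already set up and the remote address/rkey were exchanged out of band; the usage_record layout and function name are hypothetical, not from the paper.

/* Sketch: fetch a remote server's resource-usage record with one
 * RDMA read, so the monitored host's CPU is never involved.
 * Assumes qp/cq/mr setup and address exchange happened elsewhere. */
#include <stdint.h>
#include <infiniband/verbs.h>

struct usage_record {        /* illustrative layout of the stats buffer */
    uint64_t cpu_busy_ns;
    uint64_t mem_used_kb;
    uint64_t net_rx_bytes;
};

int read_remote_usage(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *local_mr, void *local_buf,
                      uint64_t remote_addr, uint32_t rkey)
{
    /* local_buf must lie within the memory registered as local_mr */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = sizeof(struct usage_record),
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id               = 1,
        .sg_list             = &sge,
        .num_sge             = 1,
        .opcode              = IBV_WR_RDMA_READ,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad = NULL;
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)   /* busy-poll for completion */
        ;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}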


Cluster Computing and the Grid | 2006

Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks

Sundeep Narravula; Hyun-Wook Jin; Karthikeyan Vaidyanathan; Dhabaleswar K. Panda

Caching has been a very important technique for improving the performance and scalability of web-serving data-centers. The research community has proposed cooperation among caching servers to achieve higher performance. These existing cooperative caching mechanisms often partially duplicate the cached data redundantly on multiple servers for higher performance (by optimizing the data-fetch costs for multiple similar requests). With the advent of RDMA-enabled interconnects, these basic data-fetch cost estimates have changed significantly. Further, the effective utilization of the vast resources available across multiple tiers in today's data-centers is of obvious interest. Hence, a systematic study of the various issues involved is of paramount importance. In this paper, we present several cooperative caching schemes designed to benefit from the above trends. In particular, we design schemes that take advantage of the RDMA capabilities of networks and the multitude of resources available in modern multi-tier data-centers. Our designs are implemented on InfiniBand-based clusters to work in conjunction with Apache-based servers. Our experimental results show that our schemes achieve a throughput improvement of up to 35% over the basic cooperative caching schemes and 180% over simple single-node caching schemes. Our experimental results also lead us to a new scheme which can deliver good performance in many scenarios.
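
One building block any such scheme needs is a deterministic mapping from a cache key to a "home" server, so peers can locate an entry (for example, via an RDMA read of that server's cache) rather than duplicating it everywhere. A toy sketch of such a mapping follows; the hash choice and server count are illustrative, not the paper's design.

/* Sketch: deterministic home-node selection for a cooperative cache.
 * Each URL hashes to exactly one home server, so peers can locate an
 * entry instead of redundantly caching it. Illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define NUM_SERVERS 8

/* FNV-1a: a simple, well-known string hash */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

static int home_node(const char *url)
{
    return (int)(fnv1a(url) % NUM_SERVERS);
}

int main(void)
{
    const char *urls[] = { "/index.html", "/img/logo.png", "/search?q=rdma" };
    for (int i = 0; i < 3; i++)
        printf("%-20s -> server %d\n", urls[i], home_node(urls[i]));
    return 0;
}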


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors

Jongsoo Park; Ganesh Bikshandi; Karthikeyan Vaidyanathan; Ping Tak Peter Tang; Pradeep Dubey; Daehyun Kim

This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, a valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5× what is achievable on the same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core, wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movement in node-local computations, exploiting caches. Our coordination of a low-communication algorithm with a massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.
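
The Segment-of-Interest FFT itself is beyond the scope of this abstract, but distributed 1D FFTs generally start from the classic Cooley-Tukey splitting of a size-N transform (N = N1 N2) into two phases of smaller FFTs joined by twiddle-factor multiplications:

$$
X[k_1 + N_1 k_2] \;=\; \sum_{n_2=0}^{N_2-1} \left[ \omega_N^{\,n_2 k_1} \left( \sum_{n_1=0}^{N_1-1} x[N_2 n_1 + n_2]\, \omega_{N_1}^{\,n_1 k_1} \right) \right] \omega_{N_2}^{\,n_2 k_2},
\qquad \omega_M = e^{-2\pi i / M}.
$$

The inner sums are N2 independent N1-point FFTs and the outer sums are N1 independent N2-point FFTs; on a cluster, the data reshuffle between the two phases is the all-to-all communication that low-communication algorithms such as SOI-FFT aim to reduce.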


International Conference on Cluster Computing | 2007

Efficient asynchronous memory copy operations on multi-core systems and I/OAT

Karthikeyan Vaidyanathan; Lei Chai; Wei Huang; Dhabaleswar K. Panda

Bulk memory copies incur large overheads such as CPU stalling (i.e., no overlap of computation with the memory copy operation), small register-size data movement, cache pollution, etc. The asynchronous copy engines introduced by Intel's I/O Acceleration Technology help alleviate these overheads by offloading memory copy operations onto several DMA channels. However, the startup overheads associated with these copy engines, such as pinning the application buffers, posting the descriptors, and checking for completion notifications, limit their overlap capability. In this paper, we propose two schemes that provide complete overlap of the memory copy operation with computation by dedicating the critical tasks to a single core in a multi-core system. In the first scheme, MCI (Multi-Core with I/OAT), we offload the memory copy operation to the copy engine and onload the startup overheads to the dedicated core. For systems without any hardware copy engine support, we propose a second scheme, MCNI (Multi-Core with No I/OAT), that onloads the memory copy operation itself to the dedicated core. We further propose a mechanism for application-transparent asynchronous memory copy operations using memory protection. We analyze our schemes in terms of overlap efficiency, performance, and associated overheads using several micro-benchmarks and applications. Our micro-benchmark results show that memory copy operations can be significantly overlapped (up to 100%) with computation using the MCI and MCNI schemes. Evaluation with MPI-based applications such as IS-B and PSTSWM-small using the MCNI scheme shows up to 4% and 5% improvement, respectively, compared to traditional implementations. Evaluations with data-centers using the MCI scheme show up to 37% improvement over the traditional implementation. Our evaluations of the gzip SPEC benchmark using application-transparent asynchronous memory copy show significant potential for such mechanisms in several application domains.
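
As a rough illustration of the MCNI idea, the sketch below onloads a bulk memcpy to a thread pinned to a dedicated core while the submitting thread overlaps computation and then polls a completion flag. This is a single-slot toy, assuming Linux and a C11-atomics-capable compiler; the core number, queue discipline, and sizes are illustrative, not the paper's implementation.

/* Sketch of MCNI-style onloading: memcpy runs on a thread pinned to a
 * dedicated core while the caller overlaps computation, then polls a
 * completion flag. Single-slot "queue" for brevity; illustrative only. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

struct copy_req {
    void *dst, *src;
    size_t len;
    atomic_int pending;   /* 1 = submitted, 0 = done */
};

static struct copy_req req;
static atomic_int have_work;

static void *copy_core(void *arg)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                 /* pin to (hypothetical) core 1 */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    for (;;) {
        while (!atomic_load(&have_work))   /* spin for a request */
            ;
        memcpy(req.dst, req.src, req.len);
        atomic_store(&req.pending, 0);     /* signal completion */
        atomic_store(&have_work, 0);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, copy_core, NULL);

    size_t len = 64 << 20;            /* 64 MB bulk copy */
    char *src = malloc(len), *dst = malloc(len);
    memset(src, 'a', len);

    req.dst = dst;                    /* submit the copy... */
    req.src = src;
    req.len = len;
    atomic_store(&req.pending, 1);
    atomic_store(&have_work, 1);

    double acc = 0;                   /* ...and overlap computation */
    for (int i = 0; i < 10000000; i++)
        acc += i * 0.5;

    while (atomic_load(&req.pending)) /* wait for copy completion */
        ;
    printf("copy done, acc=%f, dst[0]=%c\n", acc, dst[0]);
    return 0;
}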


Cluster Computing and the Grid | 2007

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

Sundeep Narravula; Amith R. Mamidala; Abhinav Vishnu; Karthikeyan Vaidyanathan; Dhabaleswar K. Panda

There has been a massive increase in the computing requirements of parallel applications. These parallel applications and supporting cluster services often need to share system-wide resources. The coordination of these applications is typically managed by a distributed lock manager, whose performance is extremely critical for application performance. Researchers have shown that the use of two-sided communication protocols, like TCP/IP (used by current-generation lock managers), can significantly impact the scalability of distributed lock managers. In addition, existing locking designs based on one-sided communication support locking in exclusive access mode only, and can pose significant scalability limitations on applications, such as cooperative/file-system caching, that need both shared and exclusive access modes. Hence, the utility of these existing designs in high-performance scenarios can be limited. In this paper, we present a novel protocol for distributed locking services, utilizing the advanced network-level one-sided atomic operations provided by InfiniBand. Our approach augments existing approaches by eliminating the need for two-sided communication protocols in the critical locking path. Further, we demonstrate that our approach provides significantly higher performance in scenarios needing both shared and exclusive mode access to resources. Our experimental results show a 39% improvement in basic locking latencies over traditional send/receive-based implementations. Further, we observe a significant improvement (up to 317% for 16 nodes) over existing RDMA-based distributed queuing schemes for shared-mode locking scenarios.
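
To make this concrete, one plausible encoding packs the lock state into a single 64-bit word: a shared-holder count in the low bits and an exclusive flag in the high bit, so a network-level fetch-add acquires a shared lock and a compare-and-swap acquires an exclusive one. The sketch below demonstrates these semantics locally with C11 atomics standing in for InfiniBand's remote atomics; the exact layout and retry/queuing behavior in the paper may differ.

/* Sketch: shared/exclusive locking on a single 64-bit word, mimicking
 * the semantics of network-level atomics (fetch-add for shared locks,
 * compare-and-swap for exclusive). Field layout is illustrative.
 * A real protocol would retry or queue on failure. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define EXCL_BIT   (1ULL << 63)   /* high bit: exclusive holder present */
#define SHARED_ONE 1ULL           /* low bits: count of shared holders */

static _Atomic uint64_t lock_word;

/* Shared lock: one fetch-add; succeeds if no exclusive holder. */
static int shared_lock(void)
{
    uint64_t prev = atomic_fetch_add(&lock_word, SHARED_ONE);
    if (prev & EXCL_BIT) {                        /* exclusive holder: */
        atomic_fetch_sub(&lock_word, SHARED_ONE); /* undo and fail */
        return 0;
    }
    return 1;
}

static void shared_unlock(void)
{
    atomic_fetch_sub(&lock_word, SHARED_ONE);
}

/* Exclusive lock: one compare-and-swap from "free" to "exclusive". */
static int excl_lock(void)
{
    uint64_t expected = 0;                        /* free: no holders */
    return atomic_compare_exchange_strong(&lock_word, &expected, EXCL_BIT);
}

static void excl_unlock(void)
{
    atomic_fetch_and(&lock_word, ~EXCL_BIT);
}

int main(void)
{
    printf("shared: %d\n", shared_lock());  /* 1: granted */
    printf("excl:   %d\n", excl_lock());    /* 0: reader present */
    shared_unlock();
    printf("excl:   %d\n", excl_lock());    /* 1: granted */
    excl_unlock();
    return 0;
}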


IEEE Transactions on Parallel and Distributed Systems | 2005

Communication and memory optimal parallel data cube construction

Ruoming Jin; Karthikeyan Vaidyanathan; Ge Yang; Gagan Agrawal

Data cube construction is a commonly used operation in data warehouses. Because of the volume of data stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. This paper addresses a number of algorithmic issues in parallel data cube construction. First, we present an aggregation tree for sequential (and parallel) data cube construction which has minimally bounded memory requirements. An aggregation tree is parameterized by the ordering of dimensions. We present a parallel algorithm based upon the aggregation tree. We analyze the interprocessor communication volume and construct a closed-form expression for it. We prove that the same ordering of the dimensions in the aggregation tree minimizes both the computational and communication requirements. We also describe a method for partitioning the initial array and prove that it minimizes the communication volume. Finally, for the cases where memory may be a bottleneck, we describe how tiling can help scale sequential and parallel data cube construction. Experimental results from an implementation of our algorithms on a cluster of workstations show the effectiveness of our algorithms and validate our theoretical results.
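
For context, a full data cube over d dimensions materializes all 2^d group-bys (the cube lattice); the aggregation tree imposes an order on this lattice to bound memory and communication. The toy sketch below merely enumerates the lattice by bitmask for d = 3, and is not the paper's algorithm.

/* Toy sketch: the 2^d group-bys of a data cube, enumerated by bitmask.
 * For d = 3 dimensions (A, B, C) this prints the group-bys from ABC
 * down to the grand total. This only shows the lattice; the paper's
 * aggregation tree chooses an ordering over it. */
#include <stdio.h>

int main(void)
{
    const char dims[] = "ABC";
    const int d = 3;

    for (int mask = (1 << d) - 1; mask >= 0; mask--) {
        printf("group-by { ");
        for (int i = 0; i < d; i++)
            if (mask & (1 << i))
                printf("%c ", dims[i]);
        printf("}%s\n", mask == 0 ? "  (grand total)" : "");
    }
    return 0;
}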


International Conference on Cluster Computing | 2005

Supporting iWARP Compatibility and Features for Regular Network Adapters

Pavan Balaji; Hyun-Wook Jin; Karthikeyan Vaidyanathan; Dhabaleswar K. Panda

With several recent initiatives in the protocol offloading technology present on network adapters, the user market is now distributed among various technology levels, including regular Ethernet network adapters, TCP Offload Engines (TOEs), and the recently introduced iWARP-capable networks. While iWARP-capable networks provide all the features of their predecessors (TOEs and regular Ethernet network adapters) together with a new, richer programming interface, they lack backward compatibility. In this respect, two important issues need to be considered. First, not all network adapters support iWARP; thus, software compatibility between regular network adapters (which have no offloaded protocol stack) and iWARP-capable network adapters needs to be achieved. Second, several applications on top of regular Ethernet as well as TOE-based adapters have been written with the sockets interface; rewriting such applications using the new iWARP interface is cumbersome and impractical. Thus, it is desirable to have an interface that provides a two-fold benefit: (i) it allows existing applications to run directly without any modifications, and (ii) it exposes the richer feature set of iWARP to applications, to be utilized with minimal modifications. In this paper, we design and implement a software stack to handle these issues. Specifically, (i) the software stack emulates the functionality of the iWARP stack in software to provide compatibility between regular Ethernet adapters and iWARP-capable networks, and (ii) it provides applications with an extended sockets interface that offers the traditional sockets functionality as well as functionality extended with the rich iWARP features.
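
One concrete piece of such an emulation is recreating message boundaries on top of TCP's byte stream, something hardware iWARP gets from its MPA framing layer. The sketch below uses a simple length-prefix format to illustrate the idea; it is an assumption-laden stand-in, not the paper's wire protocol or API.

/* Sketch: message framing over a stream socket, the kind of work a
 * software iWARP emulation must do that hardware iWARP (MPA) does in
 * the adapter. The length-prefix format here is illustrative only. */
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Read exactly len bytes (TCP may return short reads). */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send one framed message: 4-byte big-endian length, then payload.
 * A production version would also loop on short writes. */
int send_msg(int fd, const void *payload, uint32_t len)
{
    uint32_t hdr = htonl(len);
    if (write(fd, &hdr, sizeof hdr) != sizeof hdr)
        return -1;
    return write(fd, payload, len) == (ssize_t)len ? 0 : -1;
}

/* Receive one framed message into buf (capacity cap); returns length. */
int recv_msg(int fd, void *buf, uint32_t cap)
{
    uint32_t hdr;
    if (read_full(fd, &hdr, sizeof hdr) < 0)
        return -1;
    uint32_t len = ntohl(hdr);
    if (len > cap)
        return -1;    /* message too large for caller's buffer */
    return read_full(fd, buf, len) < 0 ? -1 : (int)len;
}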


International Parallel and Distributed Processing Symposium | 2007

Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study with I/OAT

Karthikeyan Vaidyanathan; Wei Huang; Lei Chai; Dhabaleswar K. Panda

Memory copies for bulk data transport incur large overheads due to CPU stalling, small register-size data movement, etc. Intel's I/O Acceleration Technology offers an asynchronous memory copy engine in kernel space which alleviates such overheads. In this paper, we propose a set of designs for asynchronous memory operations in user space, for both a single process (as an offloaded memcpy()) and IPC, using the copy engine. We analyze our designs based on overlap efficiency, performance, and cache utilization. Our micro-benchmark results show that using the copy engine to perform memory copies can achieve close to 87% overlap with computation. Further, the copy engine improves the copy latency of bulk memory data transfers by 50% and avoids cache pollution effects. With the emergence of multi-core architectures, support for asynchronous memory operations holds considerable promise for reducing the gap between memory and processor performance.
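
Overlap figures like the 87% above are typically derived from three timings. One common definition (an assumption here; the paper may define it differently) is

$$
\mathrm{overlap} \;=\; \frac{T_{\mathrm{copy}} + T_{\mathrm{comp}} - T_{\mathrm{overlapped}}}{\min(T_{\mathrm{copy}},\, T_{\mathrm{comp}})},
$$

where $T_{\mathrm{copy}}$ and $T_{\mathrm{comp}}$ are the standalone copy and computation times and $T_{\mathrm{overlapped}}$ is the elapsed time when the two run concurrently; perfect overlap gives $T_{\mathrm{overlapped}} = \max(T_{\mathrm{copy}}, T_{\mathrm{comp}})$ and hence an overlap of 1, while fully serialized execution gives 0.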


International Symposium on Performance Analysis of Systems and Software | 2007

Benefits of I/O Acceleration Technology (I/OAT) in Clusters

Karthikeyan Vaidyanathan; Dhabaleswar K. Panda

Packet processing in the TCP/IP stack at multi-gigabit data rates accounts for a significant portion of system overhead. Though there are several techniques to reduce the packet processing overhead on the sender side, the receiver side remains a bottleneck. I/O Acceleration Technology (I/OAT), developed by Intel, is a set of features particularly designed to reduce receiver-side packet processing overhead. This paper studies the benefits of I/OAT through extensive micro-benchmark evaluations as well as evaluations on two different application domains: (1) a multi-tier data-center environment and (2) a parallel virtual file system (PVFS). Our micro-benchmark evaluations show that I/OAT results in 38% lower overall CPU utilization in comparison with traditional communication. Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and throughput by 12%, respectively. In addition, I/OAT can sustain a large number of concurrent threads (up to a factor of four compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.

Collaboration


Dive into Karthikeyan Vaidyanathan's collaborations.

Top Co-Authors

Pavan Balaji

Argonne National Laboratory

Balint Joo

Thomas Jefferson National Accelerator Facility
