Publications


Featured research published by Sanidhya Kashyap.


European Conference on Computer Systems (EuroSys) | 2017

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Steffen Maass; Changwoo Min; Sanidhya Kashyap; Woon-Hak Kang; Mohan Kumar; Taesoo Kim

Processing a one-trillion-edge graph has recently been demonstrated by distributed graph engines running on clusters of tens to hundreds of nodes. In this paper, we employ a single heterogeneous machine with fast storage media (e.g., NVMe SSDs) and massively parallel coprocessors (e.g., Xeon Phi) to reach a similar scale. By fully exploiting the heterogeneous devices, we design a new graph processing engine for a single machine, named Mosaic. We propose a new locality-optimizing, space-efficient graph representation, Hilbert-ordered tiles, and a hybrid execution model that enables vertex-centric operations on fast host processors and edge-centric operations on massively parallel coprocessors. Our evaluation shows that for smaller graphs, Mosaic consistently outperforms other state-of-the-art out-of-core engines by 3.2x-58.6x and shows comparable performance to distributed graph engines. Furthermore, Mosaic can complete one iteration of the PageRank algorithm on a trillion-edge graph in 21 minutes, outperforming a distributed disk-based engine by 9.2x.
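
The tile layout hinges on Hilbert-curve ordering: tiles that are adjacent along the curve are also near each other in the 2-D adjacency matrix, so streaming tiles in curve order preserves locality. Below is a minimal sketch of the classic coordinate-to-index conversion such an ordering could use; the function name and the tile-grid setup are illustrative, not taken from Mosaic's code.

```c
#include <stdint.h>

/* Map tile coordinates (x, y) on an n-by-n grid (n a power of two) to
 * the tile's index d along the Hilbert curve. Sorting tiles by d gives
 * a locality-preserving, Hilbert-ordered layout. */
static uint64_t hilbert_xy2d(uint64_t n, uint64_t x, uint64_t y)
{
    uint64_t d = 0;
    for (uint64_t s = n / 2; s > 0; s /= 2) {
        uint64_t rx = (x & s) > 0;
        uint64_t ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        /* Rotate the quadrant so the curve stays continuous. */
        if (ry == 0) {
            if (rx == 1) {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            uint64_t t = x; x = y; y = t;
        }
    }
    return d;
}
```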


Symposium on Operating Systems Principles (SOSP) | 2015

Cross-checking semantic correctness: the case of finding file system bugs

Changwoo Min; Sanidhya Kashyap; Byoungyoung Lee; Chengyu Song; Taesoo Kim

Today, systems software is too complex to be bug-free. To find bugs in systems software, developers often rely on code checkers, like Linux's Sparse. However, the capability of existing tools used in commodity, large-scale systems is limited to finding only shallow bugs that tend to be introduced by simple programmer mistakes, and so do not require a deep understanding of code to find them. Unfortunately, the majority of bugs, as well as those that are difficult to find, are semantic ones, which violate high-level rules or invariants (e.g., missing a permission check). Thus, it is difficult for code checkers lacking an understanding of a programmer's true intention to reason about semantic correctness. To solve this problem, we present Juxta, a tool that automatically infers high-level semantics directly from source code. The key idea in Juxta is to compare and contrast multiple existing implementations that obey latent yet implicit high-level semantics. For example, the implementation of open() at the file system layer is expected to handle an out-of-space error from the disk in all file systems. We applied Juxta to 54 file systems in the stock Linux kernel (680K LoC), found 118 previously unknown semantic bugs (one bug per 5.8K LoC), and provided corresponding patches to 39 different file systems, including mature, popular ones like ext4, btrfs, XFS, and NFS. These semantic bugs are not easy to locate, as the ones found by Juxta have existed for 6.2 years on average. Not only do our empirical results look promising, but the design of Juxta is generic enough to be extended easily beyond file systems to any software that has multiple implementations, like Web browsers or protocols at the same layer of a network stack.
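
Juxta's core inference can be pictured as majority voting over implementations of the same interface: a property (say, handling ENOSPC) that most file systems exhibit but one omits is flagged as a likely semantic bug. The sketch below is a deliberately toy version of that idea; the feature matrix and threshold are invented for illustration and bear no relation to Juxta's actual static analysis.

```c
#include <stdio.h>

#define N_IMPL 5   /* implementations (file systems) compared */
#define N_PROP 3   /* candidate semantic properties (checks) */

/* has[i][p] = 1 if implementation i exhibits property p (toy data). */
static const int has[N_IMPL][N_PROP] = {
    {1, 1, 1}, {1, 1, 0}, {1, 1, 1}, {1, 0, 1}, {1, 1, 1},
};

int main(void)
{
    for (int p = 0; p < N_PROP; p++) {
        int count = 0;
        for (int i = 0; i < N_IMPL; i++)
            count += has[i][p];
        if (count * 2 <= N_IMPL)
            continue;  /* no majority: infer nothing for this property */
        /* Majority exhibits p: deviating implementations are suspects. */
        for (int i = 0; i < N_IMPL; i++)
            if (!has[i][p])
                printf("impl %d deviates on property %d (%d/%d agree)\n",
                       i, p, count, N_IMPL);
    }
    return 0;
}
```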


International Conference on Cloud Computing | 2014

RLC - A Reliable Approach to Fast and Efficient Live Migration of Virtual Machines in the Clouds

Sanidhya Kashyap; Jaspal Singh Dhillon; Suresh Purini

Today, IaaS cloud providers dynamically minimize the cost of data center operations while maintaining Service Level Agreements (SLAs). Currently, this is achieved through live migration, an advanced, state-of-the-art virtualization technology. However, existing migration techniques suffer from high network bandwidth utilization, large network data transfers, long migration times, and failure of the destination VM during migration. In this paper, we propose Reliable Lazy Copy (RLC), a fast, efficient, and reliable migration technique. RLC provides a highly efficient and less disruptive migration scheme by utilizing the three phases of process migration. To use network bandwidth effectively and reduce the total migration time, we introduce a learning phase that estimates the writable working set (WWS) prior to the migration, so that most pages are transferred only once. Our approach decreases the total data transfer by 1.16x-12.21x and the total migration time by 1.42x-9.84x against the existing approaches, thus providing fast, efficient, and reliable migration of VMs in the cloud.
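
The learning phase can be understood as repeated sampling of the guest's dirty-page bitmap: pages that keep reappearing as dirty form the writable working set and are deferred to the final copy round. The sketch below shows one plausible estimator under that reading; the names and the majority threshold are assumptions for illustration, and a real VMM would obtain the bitmaps from hardware dirty logging.

```c
#include <stdint.h>
#include <string.h>

#define NPAGES 1024   /* guest pages tracked (toy size) */
#define ROUNDS 8      /* dirty-bitmap samples taken before migration */

/* dirty[r][p] = 1 if page p was dirtied during sampling round r.
 * On return, wws[p] = 1 marks page p as part of the writable working
 * set, to be sent last so it is copied (almost) only once. */
static void estimate_wws(const uint8_t dirty[ROUNDS][NPAGES],
                         uint8_t wws[NPAGES])
{
    int hits[NPAGES];
    memset(hits, 0, sizeof hits);
    for (int r = 0; r < ROUNDS; r++)
        for (int p = 0; p < NPAGES; p++)
            hits[p] += dirty[r][p] ? 1 : 0;
    for (int p = 0; p < NPAGES; p++)
        wws[p] = (hits[p] * 2 > ROUNDS);  /* dirty in a majority of samples */
}
```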


Operating Systems Review | 2016

Opportunistic Spinlocks: Achieving Virtual Machine Scalability in the Clouds

Sanidhya Kashyap; Changwoo Min; Taesoo Kim

With the increasing demand for big-data processing and faster in-memory databases, cloud providers are moving towards large virtualized instances in addition to focusing on horizontal scalability. However, our experiments reveal that such instances in popular cloud services (e.g., 32 vCPUs with 208 GB on Google Compute Engine) do not achieve the desired scalability with increasing core count, even for a simple, embarrassingly parallel job (e.g., a Linux kernel compile). More seriously, the internal synchronization scheme (e.g., the paravirtualized ticket spinlock) of a virtualized instance on a machine with a higher core count (e.g., 80 cores) dramatically degrades its overall performance. Our finding is different from the previously well-known scalability problem (the lock contention problem) and occurs because of sophisticated optimization techniques implemented in the hypervisor, which we call the sleepy spinlock anomaly. To solve this problem, we design and implement oticket, a variant of the paravirtualized ticket spinlock that effectively scales virtualized instances in both undersubscribed and oversubscribed environments.
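
For context, a plain ticket spinlock hands out FIFO tickets, so if the vCPU holding the next ticket is descheduled by the hypervisor, every later waiter stalls behind it; that FIFO property is what interacts badly with hypervisor sleep/wake optimizations. The C11 sketch below shows only this baseline lock, not oticket itself, whose virtualization-aware waiting policy is beyond a few lines.

```c
#include <stdatomic.h>

/* A minimal ticket spinlock; oticket extends this idea with
 * virtualization-aware waiting, which is not reproduced here. */
struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed to enter */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);
    /* FIFO: waiters enter in ticket order, so a preempted (sleeping)
     * vCPU ahead of us stalls everyone behind it: the root of the
     * sleepy spinlock anomaly in oversubscribed VMs. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ; /* spin; a paravirtualized variant would yield or halt here */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```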


Asia-Pacific Workshop on Systems (APSys) | 2015

Scalability in the Clouds!: A Myth or Reality?

Sanidhya Kashyap; Changwoo Min; Taesoo Kim

With the increasing demand for big-data processing and faster in-memory databases, cloud providers are gearing towards large virtualized instances rather than horizontal scalability. However, our experiments reveal that such instances in popular cloud services (e.g., 32 vCPUs with 208 GB on Google Compute Engine) do not achieve the desired scalability with increasing core count, even for a simple, embarrassingly parallel job (e.g., a kernel compile). More seriously, the internal synchronization scheme (e.g., the paravirtualized ticket spinlock) of a virtualized instance on a machine with a higher core count (e.g., 80 cores) dramatically degrades its overall performance. Our finding is different from the previously well-known scalability problem (the lock contention problem) and occurs because of sophisticated optimization techniques implemented in the hypervisor, what we call the sleepy spinlock anomaly. To solve this problem, we design and implement oticket, a variant of the paravirtualized ticket spinlock that effectively scales virtualized instances in both undersubscribed and oversubscribed environments.


IEEE/ACM International Conference on Utility and Cloud Computing | 2013

Virtual Machine Coscheduling: A Game Theoretic Approach

Jaspal Singh Dhillon; Suresh Purini; Sanidhya Kashyap

When multiple virtual machines (VMs) are co-scheduled on the same physical machine, they may suffer performance degradation due to contention for shared resources like the last-level cache, hard disk, and network bandwidth. This can lead to service-level agreement violations and thereby customer dissatisfaction. The classical approach to the co-scheduling problem involves a central authority that decides a co-schedule by solving a constrained optimization problem with an objective function such as average performance degradation. In this paper, we use the theory of stable matchings to provide an alternative, game-theoretic perspective on the co-scheduling problem, wherein each VM selfishly tries to minimize its own performance degradation. We show that the co-scheduling problem can be formulated as a Stable Roommates Problem (SRP). Since certain instances of the SRP do not have any stable matching, we reduce the problem to the Stable Marriage Problem (SMP) via an initial approximation. Gale and Shapley proved that any instance of the SMP has a stable matching, which can be found in quadratic time. From a game-theoretic perspective, the SMP can be thought of as a matching game that always has a Nash equilibrium. There are distributed algorithms for both the SRP and the SMP, in which a VM agent need not reveal its preference list to any other VM; this allows each VM to have a private cost function. A principal advantage of this formulation is that it opens up the possibility of applying the rich theory of matching markets from game theory to various aspects of the VM co-scheduling problem, such as stability, coalitions, and privacy, from both theoretical and practical standpoints. We also propose a new workload characterization technique for a combination of compute- and memory-intensive workloads. The proposed technique uses a sentinel program and requires only two runs per workload for characterization. VMs can use this technique to decide their partner preference ranks in the SRP and SMP. The characterization technique has also been used in proposing two new centralized VM co-scheduling algorithms whose performance is close to that of the optimal Blossom algorithm.
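
Since the reduction lands on the Stable Marriage Problem, the quadratic-time Gale-Shapley algorithm the abstract refers to can be sketched directly: proposer VMs work down their preference lists, and an acceptor VM trades up whenever a better-ranked proposer arrives. The preference tables and array layout below are illustrative, not from the paper.

```c
#include <stdbool.h>
#include <string.h>

#define N 4  /* VMs per side after the SRP-to-SMP approximation */

/* prop_pref[p][k]: k-th choice of proposer p; acc_pref likewise.
 * Ranks encode each VM's estimated degradation with each partner. */
static int prop_pref[N][N], acc_pref[N][N];

/* On return, match_of_acceptor[a] is the proposer matched to a. */
static void gale_shapley(int match_of_acceptor[N])
{
    int rank_of[N][N];           /* rank_of[a][p]: a's rank of proposer p */
    int next[N] = {0};           /* next acceptor each proposer will try */
    bool free_p[N];
    memset(free_p, true, sizeof free_p);
    memset(match_of_acceptor, -1, N * sizeof(int));
    for (int a = 0; a < N; a++)
        for (int k = 0; k < N; k++)
            rank_of[a][acc_pref[a][k]] = k;

    for (;;) {
        int p = -1;              /* find any still-free proposer */
        for (int i = 0; i < N; i++)
            if (free_p[i] && next[i] < N) { p = i; break; }
        if (p < 0)
            break;               /* everyone is matched: stable */
        int a = prop_pref[p][next[p]++];
        int cur = match_of_acceptor[a];
        if (cur < 0) {           /* acceptor was free: tentatively engage */
            match_of_acceptor[a] = p;
            free_p[p] = false;
        } else if (rank_of[a][p] < rank_of[a][cur]) {
            match_of_acceptor[a] = p;   /* acceptor trades up */
            free_p[p] = false;
            free_p[cur] = true;         /* displaced proposer is free again */
        }
    }
}
```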


ACM Conference on Computer and Communications Security (CCS) | 2017

Designing New Operating Primitives to Improve Fuzzing Performance

Wen Xu; Sanidhya Kashyap; Changwoo Min; Taesoo Kim

Fuzzing is a software testing technique that finds bugs by repeatedly injecting mutated inputs into a target program. Known to be a highly practical approach, fuzzing is gaining more popularity than ever before. Current research on fuzzing has focused on producing inputs that are more likely to trigger a vulnerability. In this paper, we tackle another way to improve the performance of fuzzing: shortening the execution time of each iteration. We observe that AFL, a state-of-the-art fuzzer, slows down by 24x because of file system contention and the scalability of the fork() system call when it runs on 120 cores in parallel. Other fuzzers are expected to suffer from the same scalability bottlenecks, as they follow a similar design pattern. To improve fuzzing performance, we design and implement three new operating primitives specialized for fuzzing that solve these performance bottlenecks and achieve scalable performance on multi-core machines. Our experiments show that the proposed primitives speed up AFL and LibFuzzer by 6.1x to 28.9x and 1.1x to 735.7x, respectively, in the overall number of executions per second when targeting Google's fuzzer test suite with 120 cores. In addition, the primitives improve AFL's throughput by up to 7.7x with 30 cores, a more common setting in data centers. Our fuzzer-agnostic primitives can be easily applied to any fuzzer for fundamental performance improvements, and they directly benefit large-scale fuzzing and cloud-based fuzzing services.
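
The bottleneck is easiest to see in the conventional per-iteration pattern: each execution fork()s the target, and the target reads its input through the file system; both steps serialize on kernel-wide state at high core counts. The sketch below shows that baseline pattern (simplified; real fuzzers like AFL use a fork server), which the paper's new primitives are designed to replace.

```c
#include <sys/wait.h>
#include <unistd.h>

/* One fuzzing iteration in the conventional design: fork() a fresh copy
 * of the target, which then reads its test case from the file system.
 * At 120 cores, both fork() (shared kernel state) and the file I/O
 * (file system contention) become the dominant costs. */
static int fuzz_one(char *const argv[])
{
    pid_t pid = fork();       /* per-execution fork: scalability bottleneck */
    if (pid == 0) {
        execv(argv[0], argv); /* argv carries the input file path */
        _exit(127);           /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0); /* classify: crash, hang, or clean exit */
    return status;
}
```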


European Conference on Computer Systems (EuroSys) | 2018

Solros: a data-centric operating system architecture for heterogeneous computing

Changwoo Min; Woonhak Kang; Mohan Kumar; Sanidhya Kashyap; Steffen Maass; Heeseung Jo; Taesoo Kim

We propose Solros---a new operating system architecture for heterogeneous systems that comprise fast host processors, slow but massively parallel co-processors, and fast I/O devices. The general consensus is that fully driving such a hardware system requires tight integration among processors and I/O devices. Thus, in the Solros architecture, a co-processor OS (data-plane OS) delegates its services, specifically I/O stacks, to the host OS (control-plane OS). Our observation behind this design is that global coordination with system-wide knowledge (e.g., PCIe topology, the load of each co-processor) and the best use of heterogeneous processors are critical to achieving high performance. Hence, we fully harness these specialized processors by delegating complex I/O stacks to fast host processors, which leads to efficient global coordination at the level of the control-plane OS. We developed Solros with Xeon Phi co-processors and implemented three core OS services: transport, file system, and network services. Our experimental results show significant performance improvements compared with a stock Xeon Phi running the Linux kernel. For example, Solros improves the throughput of file system and network operations by 19x and 7x, respectively. Moreover, it improves the performance of two realistic applications: 19x for text indexing and 2x for image search.
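
Delegation of this kind needs a low-overhead transport between the data-plane OS on the co-processor and the control-plane OS on the host. The sketch below is a generic single-producer/single-consumer ring over shared memory that could serve as such a channel; the struct names and slot count are assumptions, and Solros's actual transport service is considerably more sophisticated.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 256  /* power of two; illustrative */

struct io_req { uint64_t op, arg0, arg1; };

struct ring {
    _Atomic uint32_t head;          /* advanced by consumer (host) */
    _Atomic uint32_t tail;          /* advanced by producer (co-processor) */
    struct io_req slot[RING_SLOTS];
};

/* Co-processor side: enqueue a request for the host to execute. */
static int ring_push(struct ring *r, struct io_req req)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return -1;                  /* full: caller backs off */
    r->slot[tail % RING_SLOTS] = req;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* Host side: dequeue and service with the full host I/O stack. */
static int ring_pop(struct ring *r, struct io_req *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return -1;                  /* empty */
    *out = r->slot[head % RING_SLOTS];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}
```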


European Conference on Computer Systems (EuroSys) | 2018

A scalable ordering primitive for multicore machines

Sanidhya Kashyap; Changwoo Min; Kangnyeon Kim; Taesoo Kim

Timestamping is an essential building block for designing concurrency control mechanisms and concurrent data structures. Various algorithms either employ physical timestamping, assuming they have access to synchronized clocks, or maintain a logical clock with the help of atomic instructions. Unfortunately, these approaches have two problems. First, hardware vendors do not guarantee that the available hardware clocks are exactly synchronized, which is difficult to achieve in practice. Second, atomic instructions are a deterrent to scalability because of cache-line contention. This paper addresses these problems by proposing and designing a scalable ordering primitive, called Ordo, that relies on invariant hardware clocks. Ordo not only enables the correct use of these clocks, by providing the notion of a global hardware clock, but also frees various logical-timestamp-based algorithms from the burden of a software logical clock, while simplifying their design. We use the Ordo primitive to redesign 1) a concurrent data structure library that we apply to the Linux kernel; 2) a synchronization mechanism for concurrent programming; 3) two database concurrency control mechanisms; and 4) a clock-based software transactional memory algorithm. Our evaluation shows that clocks can indeed be unsynchronized on two architectures (Intel and ARM) and that Ordo generally improves the efficiency of several algorithms by 1.2x-39.7x on various architectures.
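
The primitive boils down to exposing the invariant per-core clock together with a measured uncertainty window (call it ORDO_BOUNDARY): two timestamps are ordered only when they differ by more than the window, and a timestamp strictly greater than an old one is obtained by waiting the window out. The x86 sketch below captures that interface; the boundary value is a placeholder (a real system measures it per machine), and the use of __rdtscp assumes an invariant TSC.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp; assumes x86 with an invariant TSC */

/* Uncertainty window in cycles; placeholder, measured per machine. */
static const uint64_t ORDO_BOUNDARY = 1000;

static uint64_t ordo_get_time(void)
{
    unsigned int aux;
    return __rdtscp(&aux);  /* read the invariant hardware clock */
}

/* Return a timestamp guaranteed greater than t on every core, by
 * waiting out the uncertainty window. */
static uint64_t ordo_new_time(uint64_t t)
{
    uint64_t now;
    while ((now = ordo_get_time()) <= t + ORDO_BOUNDARY)
        ;
    return now;
}

/* Order two timestamps only when they differ by more than the
 * uncertainty window; 0 means "cannot tell". */
static int ordo_cmp_clock(uint64_t a, uint64_t b)
{
    if (a > b + ORDO_BOUNDARY) return 1;
    if (b > a + ORDO_BOUNDARY) return -1;
    return 0;
}
```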


Architectural Support for Programming Languages and Operating Systems (ASPLOS) | 2018

LATR: Lazy Translation Coherence

Mohan Kumar; Steffen Maass; Sanidhya Kashyap; Ján Veselý; Zi Yan; Taesoo Kim; Abhishek Bhattacharjee; Tushar Krishna

We propose LATR (lazy TLB coherence), a software-based TLB shootdown mechanism that can alleviate the overhead of the synchronous TLB shootdown mechanism in existing operating systems. By handling TLB coherence in a lazy fashion, LATR avoids the expensive IPIs required for delivering a shootdown signal to remote cores, as well as the performance overhead of the associated interrupt handlers. Therefore, virtual memory operations, such as free and page-migration operations, can benefit significantly from LATR's mechanism. For example, LATR improves the latency of munmap() by 70.8% on a 2-socket machine, a widely used configuration in modern data centers. Real-world, performance-critical applications such as web servers can also benefit from LATR: without any application-level changes, LATR improves Apache by 59.9% compared to Linux, and by 37.9% compared to ABIS, a highly optimized, state-of-the-art TLB coherence technique.
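
The lazy scheme can be pictured as replacing the synchronous IPI broadcast with per-core mailboxes: the unmapping core publishes the stale range and returns immediately, and each remote core drains its mailbox and flushes locally at a point it reaches anyway (e.g., its next scheduler tick). The sketch below illustrates only that shape; the names and the fixed-size mailbox are invented, overflow handling is elided, and LATR's real design additionally delays page reuse until every core has flushed.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NCPU   8
#define NSLOTS 16

struct latr_entry {
    uintptr_t start, end;  /* virtual range whose TLB entries are stale */
    atomic_int pending;    /* 1 until the owning core has flushed it */
};

/* One mailbox of shootdown entries per core. */
static struct latr_entry inbox[NCPU][NSLOTS];
static unsigned next_slot[NCPU];  /* writer cursor; overflow handling elided */

/* Unmapping core: publish the stale range to every core and return
 * immediately; no IPI, no waiting for remote acknowledgements. */
static void latr_publish(uintptr_t start, uintptr_t end)
{
    for (int cpu = 0; cpu < NCPU; cpu++) {
        struct latr_entry *e = &inbox[cpu][next_slot[cpu]++ % NSLOTS];
        e->start = start;
        e->end = end;
        atomic_store_explicit(&e->pending, 1, memory_order_release);
    }
    /* LATR also defers reusing the freed pages until every core has
     * drained, which bounds how long stale TLB entries can matter. */
}

/* Each core, at its next scheduler tick or context switch: drain the
 * mailbox and flush the local TLB for each pending range. */
static void latr_drain(int cpu)
{
    for (int i = 0; i < NSLOTS; i++) {
        struct latr_entry *e = &inbox[cpu][i];
        if (atomic_exchange_explicit(&e->pending, 0, memory_order_acq_rel)) {
            /* flush local TLB for [e->start, e->end), e.g., invlpg loop */
        }
    }
}
```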

Collaboration


Dive into Sanidhya Kashyap's collaborations.

Top Co-Authors

Taesoo Kim (Georgia Institute of Technology)
Changwoo Min (Georgia Institute of Technology)
Steffen Maass (Georgia Institute of Technology)
Mohan Kumar (Georgia Institute of Technology)
Byoungyoung Lee (Georgia Institute of Technology)
Suresh Purini (International Institute of Information Technology)
Chengyu Song (Georgia Institute of Technology)