Featured Research

Distributed Parallel And Cluster Computing

Byzantine Fault-Tolerance in Decentralized Optimization under Minimal Redundancy

This paper considers the problem of Byzantine fault-tolerance in multi-agent decentralized optimization. In this problem, each agent has a local cost function. The goal of a decentralized optimization algorithm is to allow the agents to cooperatively compute a common minimum point of their aggregate cost function. We consider the case where a certain number of agents may be Byzantine faulty. Such faulty agents may not follow a prescribed algorithm, and they may share arbitrary or incorrect information with other non-faulty agents. The presence of such Byzantine agents renders a typical decentralized optimization algorithm ineffective. We propose a decentralized optimization algorithm with provable exact fault-tolerance against a bounded number of Byzantine agents, provided the non-faulty agents satisfy a minimal redundancy condition.
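
The paper's actual algorithm and its redundancy condition are given in the full text; as a loose illustration of the general idea of fault-tolerant aggregation, the Python sketch below runs a toy decentralized gradient iteration in which every agent filters the estimates it receives with a coordinate-wise trimmed mean before stepping. The filter, the parameters, and all names are illustrative assumptions, not the authors' method.

```python
import numpy as np

def trimmed_mean(vectors, f):
    """Coordinate-wise trimmed mean: discard the f smallest and f largest
    values in each coordinate, then average the rest (an illustrative
    robust filter, not necessarily the paper's)."""
    arr = np.sort(np.stack(vectors), axis=0)
    return arr[f:len(vectors) - f].mean(axis=0)

def robust_step(estimates, gradients, f, lr=0.1):
    """One hypothetical fault-tolerant round: every agent aggregates the
    estimates it received with the robust filter, then takes a local
    gradient step from the aggregate."""
    agg = trimmed_mean(estimates, f)
    return [agg - lr * g for g in gradients]

# toy usage: 5 agents with quadratic costs (x - c_i)^2, the last one faulty
targets = [1.0, 2.0, 3.0, 4.0, 100.0]          # target 100.0 belongs to the liar
x = [np.zeros(1) for _ in targets]
for _ in range(200):
    grads = [2 * (xi - c) for xi, c in zip(x, targets)]
    grads[-1] = np.array([1e6])                # Byzantine agent sends garbage
    x = robust_step(x, grads, f=1)
print(x[0])    # non-faulty estimates stay bounded near the honest targets
```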

Read more
Distributed Parallel And Cluster Computing

Byzantine Fault-Tolerance in Peer-to-Peer Distributed Gradient-Descent

We consider the problem of Byzantine fault-tolerance in the peer-to-peer (P2P) distributed gradient-descent method -- a prominent algorithm for distributed optimization in a P2P system. In this problem, the system comprises multiple agents, and each agent has a local cost function. In the fault-free case, when all the agents are honest, the P2P distributed gradient-descent method allows all the agents to reach a consensus on a solution that minimizes their aggregate cost. However, we consider a scenario where a certain number of agents may be Byzantine faulty. Such faulty agents may not follow an algorithm correctly, and may share arbitrary, incorrect information to prevent other non-faulty agents from solving the optimization problem. In the presence of Byzantine faulty agents, a more reasonable goal is to allow all the non-faulty agents to reach a consensus on a solution that minimizes the aggregate cost of all the non-faulty agents. We refer to this fault-tolerance goal as f-resilience, where f is the maximum number of Byzantine faulty agents in a system of n agents, with f < n. Most prior work on fault-tolerance in P2P distributed optimization considers only approximate fault-tolerance wherein, unlike f-resilience, all the non-faulty agents compute a minimum point of a non-uniformly weighted aggregate of their cost functions. We propose a fault-tolerance mechanism that confers provable f-resilience to the P2P distributed gradient-descent method, provided the non-faulty agents satisfy the necessary condition of 2f-redundancy, defined later in the paper. Moreover, compared to prior work, our algorithm is applicable to a larger class of high-dimensional convex distributed optimization problems.
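
The precise filtering rule and the 2f-redundancy condition are defined in the paper itself; the toy sketch below only illustrates the flavor of gradient filtering, using a hypothetical rule that discards the f received gradients with the largest norms before averaging. All names and parameters are assumptions for illustration.

```python
import numpy as np

def drop_largest_norms(gradients, f):
    """Eliminate the f received gradients with the largest norm before
    averaging (an illustrative filtering rule, not necessarily the one
    analyzed in the paper)."""
    order = np.argsort([np.linalg.norm(g) for g in gradients])
    return [gradients[i] for i in order[:len(gradients) - f]]

def p2p_gradient_round(x, received_grads, f, lr=0.05):
    """One hypothetical round of fault-tolerant P2P gradient descent:
    filter the gradients shared by peers, then step on their average."""
    kept = drop_largest_norms(received_grads, f)
    return x - lr * np.mean(kept, axis=0)

# toy usage: n = 6 agents with quadratic costs (x - m_i)^2, agent 0 faulty
n, f = 6, 1
minima = np.linspace(-1.0, 1.0, n)        # honest agents' local minimizers
x = np.zeros(1)
for _ in range(300):
    grads = [2 * (x - m) for m in minima] # each honest agent's gradient
    grads[0] = np.array([1e9])            # Byzantine agent sends garbage
    x = p2p_gradient_round(x, grads, f)
print(x)   # close to the average of the honest minimizers (0.2 here)
```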

Read more
Distributed Parallel And Cluster Computing

Byzantine Generals in the Permissionless Setting

Consensus protocols have traditionally been studied in a setting where all participants are known to each other from the start of the protocol execution. In the parlance of the 'blockchain' literature, this is referred to as the permissioned setting. What differentiates Bitcoin from these previously studied protocols is that it operates in a permissionless setting, i.e. it is a protocol for establishing consensus over an unknown network of participants that anybody can join, with as many identities as they like in any role. The arrival of this new form of protocol brings with it many questions. Beyond Bitcoin, what can we prove about permissionless protocols in a general sense? How does recent work on permissionless protocols in the blockchain literature relate to the well-developed history of research on permissioned protocols in distributed computing? To answer these questions, we describe a formal framework for the analysis of both permissioned and permissionless systems. Our framework allows for "apples-to-apples" comparisons between different categories of protocols and, in turn, the development of theory to formally discuss their relative merits. A major benefit of the framework is that it facilitates the application of a rich history of proofs and techniques in distributed computing to problems in blockchain and the study of permissionless systems. Within our framework, we then address the questions above. We consider the Byzantine Generals Problem as a formalisation of the problem of reaching consensus, and address a programme of research that asks, "Under what adversarial conditions, and for what types of permissionless protocol, is consensus possible?" We prove a number of results for this programme, our main result being that deterministic consensus is not possible for decentralised permissionless protocols. To close, we give a list of eight open questions.

Read more
Distributed Parallel And Cluster Computing

C-Balancer: A System for Container Profiling and Scheduling

Linux containers have gained high popularity in recent times. This popularity is significantly due to the various advantages of containers over Virtual Machines (VMs): containers are lightweight, occupy less storage, boot quickly, are easy to deploy, and auto-scale faster. A key reason behind their popularity is that they suit the micro-service style of software development, where applications are designed as independently deployable services. Various container orchestration tools exist for deploying and managing containers in a cluster, the most prominent being Docker Swarm and Kubernetes. However, these tools do not address the effects of resource contention when multiple containers are deployed on a node, nor do they provide support for container migration in the event of an attack or increased resource contention. To address such issues, we propose C-Balancer, a scheduling framework for efficient placement of containers in a cluster environment. C-Balancer works by periodically profiling the containers and deciding the optimal container-to-node placement. Our proposed approach improves the performance of containers in terms of resource utilization and throughput. Experiments using a workload mix of the Stress-NG and iPerf benchmarks show that our approach achieves a maximum performance improvement of 58% for the workload mix. Our approach also reduces the variance in resource utilization across the cluster by 60% on average.
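
C-Balancer's own profiling and placement algorithm is described in the paper; the sketch below merely illustrates the placement problem with a simple greedy heuristic that assigns profiled containers to the currently least-loaded node. The container names and load numbers are hypothetical.

```python
def greedy_placement(container_loads, node_count):
    """Balance per-node load with a simple heuristic: sort containers by
    profiled load (descending) and always assign the next one to the
    currently least-loaded node. Only an illustration of the placement
    problem, not the scheduler described in the paper."""
    node_load = [0.0] * node_count
    placement = {}
    for name, load in sorted(container_loads.items(),
                             key=lambda kv: kv[1], reverse=True):
        target = min(range(node_count), key=lambda n: node_load[n])
        placement[name] = target
        node_load[target] += load
    return placement, node_load

# hypothetical profiled CPU loads (e.g. normalized cores used)
profiles = {"web": 0.8, "db": 1.5, "cache": 0.4, "batch": 1.1, "log": 0.2}
placement, loads = greedy_placement(profiles, node_count=2)
print(placement)   # {'db': 0, 'batch': 1, 'web': 1, 'cache': 0, 'log': 0}
print(loads)       # per-node load after placement
```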

Read more
Distributed Parallel And Cluster Computing

C-SAW: A Framework for Graph Sampling and Random Walk on GPUs

Many applications need to learn, mine, analyze, and visualize large-scale graphs. These graphs are often too large to be processed efficiently using conventional graph processing technologies. Recent literature suggests that graph sampling and random walks can be an efficient solution. In this paper, we propose, to the best of our knowledge, the first GPU-based framework for graph sampling and random walks. First, our framework provides a generic API which allows users to implement a wide range of sampling and random walk algorithms with ease. Second, offloading this framework onto the GPU, we introduce warp-centric parallel selection and two novel optimizations for collision migration. Third, to support graphs that exceed GPU memory capacity, we introduce efficient data transfer optimizations for out-of-memory and multi-GPU sampling, such as workload-aware scheduling and batched multi-instance sampling. Taken together, our framework consistently outperforms state-of-the-art projects while supporting a wide range of sampling and random walk algorithms.
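
C-SAW itself is a GPU framework with warp-centric selection; the single-threaded Python sketch below only illustrates the underlying primitive -- a random walk over a CSR graph -- that such a framework accelerates. The function name and the toy graph are illustrative.

```python
import random

def random_walk(row_ptr, col_idx, start, length, rng=random):
    """Unbiased random walk on a CSR graph: at each step pick a uniformly
    random neighbor of the current vertex. A CPU sketch of the primitive;
    C-SAW parallelizes the selection across GPU warps and batches many
    walkers at once."""
    walk = [start]
    v = start
    for _ in range(length):
        begin, end = row_ptr[v], row_ptr[v + 1]
        if begin == end:                 # dead end: no outgoing edges
            break
        v = col_idx[rng.randrange(begin, end)]
        walk.append(v)
    return walk

# toy CSR graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, 3 -> {}
row_ptr = [0, 2, 3, 4, 4]
col_idx = [1, 2, 2, 0]
print(random_walk(row_ptr, col_idx, start=0, length=5))
```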

Read more
Distributed Parallel And Cluster Computing

C-for-Metal: High Performance SIMD Programming on Intel GPUs

The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by the compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications, as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap, we introduce C-For-Metal (CM), an explicit SIMD programming framework designed to deliver close-to-the-metal performance on Intel GPUs. The CM programming language and its vector/matrix types provide an intuitive interface to exploit the underlying hardware features, allowing fine-grained register management, SIMD size control, and cross-lane data sharing. Experimental results show that CM applications from different domains outperform the best-known SIMT-based OpenCL implementations, achieving up to 2.7x speedup on the latest Intel GPU.
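
CM is a C/C++-style language with explicit vector and matrix types; the numpy sketch below is only a conceptual illustration of the difference between the SIMT view (write scalar code per element) and the explicit-SIMD view (the programmer chooses the lane width and operates on whole vectors). The width of 8 and the SAXPY example are arbitrary assumptions.

```python
import numpy as np

SIMD_WIDTH = 8   # hypothetical lane count chosen explicitly by the programmer

def saxpy_scalar(a, x, y):
    """SIMT-style view: express the computation for one element and let
    the compiler/hardware parallelize it implicitly."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_explicit_simd(a, x, y):
    """Explicit-SIMD view (conceptually closer to CM): the programmer
    picks the vector width and operates on whole groups of lanes."""
    out = np.empty_like(x)
    for i in range(0, len(x), SIMD_WIDTH):
        lane = slice(i, i + SIMD_WIDTH)
        out[lane] = a * x[lane] + y[lane]   # one vector op per lane group
    return out

x = np.arange(32, dtype=np.float32)
y = np.ones(32, dtype=np.float32)
assert np.allclose(saxpy_scalar(2.0, x, y), saxpy_explicit_simd(2.0, x, y))
```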

Read more
Distributed Parallel And Cluster Computing

CASH: A Credit Aware Scheduling for Public Cloud Platforms

The public cloud offers a myriad of services that allow tenants to process large-scale data in a flexible, easy, and cost-effective manner. Tenants generally use large-scale data processing frameworks such as MapReduce, Tez, and Spark to process their data. Tenants can configure their frameworks to schedule individual tasks themselves, or have a middleware cluster manager like YARN or Mesos arbitrate resource scheduling in their public-cloud cluster. Cluster managers need to be cognizant of workload requirements as well as the state of individual resources, such as CPU and disk, in the cluster. Cloud providers use a token bucket mechanism for individual hardware resources as an indicator of the quality of service each hardware resource can provide. In this paper, through our changes in YARN, Hadoop, and Tez, we show how middleware cluster managers can be made cognizant of the expected quality of service of individual hardware resources in the cluster. Our optimized cluster manager, with coarse-grained knowledge of task requirements and fine-grained knowledge of the expected quality of service of hardware resources in the cluster, makes near-optimal task placements. Our experiments show that CPU-credit-based instances like Amazon T3 are a viable, cost-effective option for running big data workloads. We also show that streaming SQL queries on a Hive warehouse can be accelerated by up to 31%, leading to public cloud cost savings of up to 22%.
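
The abstract refers to the token bucket (CPU credit) mechanism used by burstable cloud instances; the sketch below models a generic token bucket that a credit-aware scheduler could consult before placing a task. The fill rate, capacity, and class name are illustrative, not AWS's actual parameters or the paper's implementation.

```python
import time

class TokenBucket:
    """Generic token-bucket model of a burstable resource: credits accrue
    at a fixed rate up to a cap and are spent when the resource is used
    above its baseline. Parameters are illustrative only."""
    def __init__(self, fill_rate, capacity):
        self.fill_rate = fill_rate        # credits added per second
        self.capacity = capacity          # maximum credits that can accrue
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.fill_rate)
        self.last = now

    def try_consume(self, amount):
        """Spend `amount` credits if available; a credit-aware scheduler
        could check this state before placing a CPU-heavy task."""
        self._refill()
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False

bucket = TokenBucket(fill_rate=0.4, capacity=100)   # hypothetical numbers
if bucket.try_consume(10):
    print("enough CPU credits: schedule the burst-heavy task here")
else:
    print("low credits: place the task on another node")
```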

Read more
Distributed Parallel And Cluster Computing

CPU Scheduling in Data Centers Using Asynchronous Finite-Time Distributed Coordination Mechanisms

We propose an asynchronous iterative scheme that allows a set of interconnected nodes to distributively reach agreement to within a pre-specified bound in a finite number of steps. While this scheme could be adopted in a wide variety of applications, we discuss it within the context of task scheduling for data centers. In this context, the algorithm is guaranteed to approximately converge to the optimal scheduling plan, given the available resources, in a finite number of steps. Furthermore, being asynchronous, the proposed scheme is able to take into account the uncertainty introduced by straggler nodes or communication issues, in the form of latency variability, while still converging to the target objective. In addition, through extensive empirical evaluation in simulations, we show that the proposed method exhibits state-of-the-art performance.
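
The paper's scheme is asynchronous and has its own update and termination rules; the sketch below is only a simplified, synchronous stand-in showing how iterative averaging can reach agreement to within a pre-specified bound in finitely many steps. The gossip update, the ring topology, and the values are illustrative assumptions.

```python
import random

def approximate_agreement(values, neighbors, epsilon, max_steps=10_000):
    """Toy averaging iteration: each step, a random node averages its
    value with a randomly chosen neighbor, and the loop stops once all
    values lie within `epsilon` of each other. A simplified stand-in for
    the asynchronous finite-time scheme described in the paper."""
    values = list(values)
    for step in range(max_steps):
        if max(values) - min(values) <= epsilon:
            return values, step            # agreement within the bound
        i = random.randrange(len(values))
        j = random.choice(neighbors[i])
        avg = (values[i] + values[j]) / 2  # pairwise gossip update
        values[i] = values[j] = avg
    return values, max_steps

# toy usage: 4 nodes in a ring, initial "load estimates" to agree on
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
loads = [10.0, 2.0, 6.0, 4.0]
final, steps = approximate_agreement(loads, neighbors, epsilon=0.01)
print(final, steps)    # all values within 0.01 of the average (5.5)
```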

Read more
Distributed Parallel And Cluster Computing

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer's burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not checkpointed) within a single process's memory. This eliminates the high overhead of inter-process communication in earlier approaches, and has fewer limitations.
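
CRAC operates at the process level on real CUDA applications; the Python sketch below is only a loose analogy for its central idea of separating checkpointed application state from un-checkpointed external (driver/GPU) state that is re-established on restart. All class names are hypothetical.

```python
import pickle

class DeviceProxy:
    """Stand-in for a non-checkpointable external resource (e.g. a CUDA
    context held by the driver library). It is excluded from the
    checkpoint and re-created on restore -- a loose analogy for the split
    between checkpointed application state and the un-checkpointed
    library it talks to."""
    def __init__(self):
        self.handle = object()           # pretend driver/GPU handle

class Application:
    def __init__(self):
        self.iteration = 0               # checkpointed application state
        self.results = []
        self.device = DeviceProxy()      # external, not checkpointed

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["device"]              # drop the external resource
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.device = DeviceProxy()      # re-attach on restart

app = Application()
app.iteration, app.results = 42, [3.14]
blob = pickle.dumps(app)                 # "checkpoint"
restored = pickle.loads(blob)            # "restart"
print(restored.iteration, restored.results)   # 42 [3.14]
```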

Read more
Distributed Parallel And Cluster Computing

CTDGM: A Data Grouping Model Based on Cache Transaction for Unstructured Data Storage Systems

Cache prefetching technology has become the mainstream data access optimization strategy in data centers. However, the rapid growth of unstructured data generates massive numbers of pairwise access relationships, which can impose a heavy computational burden on existing prefetching models and severely degrade data access performance. We propose a cache-transaction-based data grouping model (CTDGM) to solve these problems by optimizing the feature representation method and grouping efficiency. First, we provide a definition of the cache transaction and propose a method for extracting the cache transaction feature (CTF). Second, we design a data chunking algorithm based on CTF and spatiotemporal locality to optimize the efficiency of relationship calculation. Third, we propose CTDGM by constructing a relation graph that partitions data into independent groups according to the strength of the data access relation. Experimental results show that, compared with state-of-the-art methods, our algorithm achieves an average cache hit rate increase of 12% on the MSR dataset with a small cache size (0.001% of all the data), which in turn reduces the number of data I/O accesses by 50% when the cache size is less than 0.008% of all the data.
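
CTDGM's cache transaction feature and grouping algorithm are defined in the paper; the sketch below only illustrates the general idea of relation-graph grouping, treating fixed-size windows of an access trace as stand-ins for cache transactions and grouping blocks whose co-access count passes a threshold. The window size, threshold, and trace are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def group_by_co_access(trace, window, threshold):
    """Group block IDs that are frequently accessed together. Each window
    of the trace is treated (very loosely) like one 'cache transaction';
    blocks co-occurring in at least `threshold` windows are linked, and
    connected components become groups. An illustration of relation-graph
    grouping, not CTDGM's actual feature extraction."""
    weight = defaultdict(int)
    for start in range(0, len(trace), window):
        blocks = set(trace[start:start + window])
        for a, b in combinations(sorted(blocks), 2):
            weight[(a, b)] += 1

    # union-find over edges whose co-access count passes the threshold
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (a, b), w in weight.items():
        if w >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for block in set(trace):
        groups[find(block)].add(block)
    return list(groups.values())

trace = [1, 2, 3, 1, 2, 3, 7, 8, 9, 7, 8, 9]
print(group_by_co_access(trace, window=3, threshold=2))   # [{1, 2, 3}, {7, 8, 9}]
```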

Read more
