Featured Research

Distributed Parallel And Cluster Computing

Approximate Majority With Catalytic Inputs

Population protocols are a class of algorithms for modeling distributed computation in networks of finite-state agents communicating through pairwise interactions. Their suitability for analyzing numerous chemical processes has motivated the adaptation of the original population protocol framework to better model these chemical systems. In this paper, we further the study of two such adaptations in the context of solving approximate majority: persistent-state agents (or catalysts) and spontaneous state changes (or leaks). Based on models considered in recent protocols for populations with persistent-state agents, we assume a population with n catalytic input agents and m worker agents, where the goal of the worker agents is to compute some predicate over the states of the catalytic inputs. We call this model the Catalytic Input (CI) model. For m = Θ(n), we show that computing the parity of the input population with high probability requires at least Ω(n^2) total interactions, demonstrating a strong separation between the CI model and the standard population protocol model. On the other hand, we show that the simple third-state dynamics of Angluin et al. for approximate majority in the standard model can be naturally adapted to the CI model: we present such a constant-state protocol for the CI model that solves approximate majority in O(n log n) total steps with high probability when the input margin is Ω(√(n log n)). We then show the robustness of third-state dynamics protocols to the transient leak events introduced by Alistarh et al. In both the original and CI models, these protocols successfully compute approximate majority with high probability in the presence of leaks occurring at each step with probability β ≤ O(√(n log n)/n), exhibiting a resilience to leaks similar to that of Byzantine agents in previous works.
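
The abstract does not spell out the adapted protocol, but the third-state dynamics of Angluin et al. that it builds on are standard: agents hold one of two opinions or a blank state, clashes blank the responder, and blanks adopt the opinion they meet. The sketch below simulates those basic dynamics on a well-mixed population; the catalytic inputs, worker/catalyst split, and leaks of the CI model are not modeled here.

```python
import random

def three_state_majority(n_a, n_b, max_steps=10_000_000):
    """Simulate the classic third-state dynamics of Angluin et al.

    States: 'A', 'B' (opinions) and '_' (blank).
    Interaction rules (initiator, responder):
      A,B -> A,_   B,A -> B,_    (clash: responder goes blank)
      A,_ -> A,A   B,_ -> B,B    (a blank adopts the opinion it meets)
    Returns the consensus value and the step at which it was reached.
    """
    pop = ['A'] * n_a + ['B'] * n_b
    for step in range(max_steps):
        i, j = random.sample(range(len(pop)), 2)   # uniform random interaction pair
        a, b = pop[i], pop[j]
        if a in 'AB' and b in 'AB' and a != b:
            pop[j] = '_'
        elif a in 'AB' and b == '_':
            pop[j] = a
        elif b in 'AB' and a == '_':
            pop[i] = b
        if len(set(pop)) == 1:                      # whole population agrees
            return pop[0], step
    return None, max_steps

print(three_state_majority(60, 40))   # with this margin, usually converges to ('A', ...)
```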

Distributed Parallel And Cluster Computing

Asynchronous Gathering in a Torus

We consider the gathering problem for asynchronous and oblivious robots that cannot communicate explicitly with each other but are endowed with visibility sensors that allow them to see the positions of the other robots. Most investigations of the gathering problem in the discrete universe have been carried out on ring-shaped networks, due to the number of symmetric configurations they admit. In this paper, we extend the study of the gathering problem to torus-shaped networks, assuming robots endowed with local weak multiplicity detection. That is, robots cannot distinguish nodes occupied by a single robot from nodes occupied by more than one robot, unless the node in question is their current node. As a consequence, solutions based on creating a single multiplicity node as a landmark for the gathering cannot be used. We present a deterministic algorithm that solves the gathering problem starting from any rigid configuration on an asymmetric, unoriented torus-shaped network.
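
The abstract defines the model rather than the algorithm's internals, so the sketch below only illustrates that model: a hypothetical snapshot a robot takes on a torus under local weak multiplicity detection, where occupancy is visible everywhere but multiplicity is known only at the robot's own node. Class and field names are invented for illustration.

```python
from collections import Counter

class TorusView:
    """Hypothetical snapshot a robot obtains on an n x m torus grid under
    local *weak* multiplicity detection: occupancy is visible everywhere,
    multiplicity only at the robot's own node."""

    def __init__(self, n, m, robot_positions, my_position):
        counts = Counter(robot_positions)
        self.occupied = set(counts)                     # nodes with at least one robot
        self.my_node_is_multiplicity = counts[my_position] > 1
        self.my_position = my_position
        self.n, self.m = n, m

    def neighbours(self, node):
        """4-neighbourhood with wrap-around (torus topology)."""
        x, y = node
        return [((x + 1) % self.n, y), ((x - 1) % self.n, y),
                (x, (y + 1) % self.m), (x, (y - 1) % self.m)]

# Example: three robots, two of them stacked on the same node.
positions = [(0, 0), (0, 0), (2, 3)]
view = TorusView(4, 5, positions, my_position=(0, 0))
print(view.occupied)                   # occupied nodes only; no counts for other nodes
print(view.my_node_is_multiplicity)    # True: the robot can tell only for its own node
```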

Distributed Parallel And Cluster Computing

Asynchronous Gossip in Smartphone Peer-to-Peer Networks

In this paper, we study gossip algorithms in communication models that describe the peer-to-peer networking functionality included in most standard smartphone operating systems. We begin by describing and analyzing a new synchronous gossip algorithm in this setting that features both a faster round complexity and simpler operation than the best-known existing solutions. We also prove a new lower bound on the rounds required to solve gossip, which resolves a minor open question by establishing that existing synchronous solutions are within logarithmic factors of optimal. We then adapt our synchronous algorithm to produce a novel gossip strategy for an asynchronous model that directly captures the interface of a standard smartphone peer-to-peer networking library (enabling algorithms described in this model to be easily implemented on real phones). Using new analysis techniques, we prove that this asynchronous strategy efficiently solves gossip. This is the first known efficient asynchronous information dissemination result for the smartphone peer-to-peer setting. We argue that our new strategy can be used to implement effective information-spreading subroutines in real-world smartphone peer-to-peer network applications, and that the analytical tools we developed to analyze it can be leveraged to produce other broadly useful algorithmic strategies for this increasingly important setting.
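
The paper's algorithms are not detailed in the abstract; the sketch below only shows the generic gossip primitive the setting is built around, devices repeatedly pairing up and merging what they know, as a toy synchronous simulation. The random perfect matching per round is an assumption of this sketch, not the smartphone pairing interface.

```python
import random

def gossip_rounds(num_devices, seed_rumors):
    """Toy synchronous gossip: in each round devices are matched into random
    pairs and the two endpoints merge their known-rumor sets.  Returns the
    number of rounds until every device knows every rumor.  (Illustrative
    only; the paper's algorithms and pairing model are more constrained.)"""
    known = [set() for _ in range(num_devices)]
    for device, rumor in seed_rumors:
        known[device].add(rumor)
    everything = set(r for _, r in seed_rumors)

    rounds = 0
    while any(k != everything for k in known):
        rounds += 1
        order = list(range(num_devices))
        random.shuffle(order)
        for a, b in zip(order[::2], order[1::2]):   # random pairing this round
            merged = known[a] | known[b]            # both endpoints learn the union
            known[a] = known[b] = merged
    return rounds

print(gossip_rounds(64, [(0, "r0"), (17, "r1"), (42, "r2")]))
```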

Distributed Parallel And Cluster Computing

Asynchronous Runtime with Distributed Manager for Task-based Programming Models

Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential code. The OpenMP 4.0 standard introduced the possibility of describing a set of data dependences per task, which the runtime uses to order task execution. This order is computed using shared graphs that all threads update under exclusive access, relying on synchronization mechanisms (locks) to keep the dependence management correct. Contention in the access to these structures becomes critical in many-core systems, because several threads may waste computation resources waiting for their turn. This paper proposes an asynchronous management of the runtime structures, such as task dependence graphs, suitable for the runtimes of task-based programming models. In this organization, threads request actions from the runtime instead of performing them directly. The requests are then handled by a distributed runtime manager (DDAST) that does not require dedicated resources; instead, the manager uses idle threads to modify the runtime structures. The paper also presents an implementation, analysis, and performance evaluation of this runtime organization. The performance results show that the proposed asynchronous organization achieves higher speedups than the original runtime for different benchmarks and different many-core architectures.
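
A rough sketch of the central idea, worker threads enqueue dependence-graph updates instead of locking the shared graph while an otherwise idle thread applies them, is shown below. The single request queue, the function names, and the single manager thread are simplifying assumptions of this sketch; the actual DDAST organization is more elaborate.

```python
import queue, threading

requests = queue.Queue()          # worker threads post runtime actions here
dependence_graph = {}             # task id -> set of tasks it must wait for

def submit_task(task_id, depends_on):
    """Called by worker threads: instead of taking a lock on the shared
    graph, just enqueue the action and keep computing."""
    requests.put(("add", task_id, depends_on))

def idle_thread_loop(stop):
    """An otherwise idle thread acts as the runtime manager: it applies
    queued requests to the dependence graph on behalf of everyone."""
    while not stop.is_set() or not requests.empty():
        try:
            op, task_id, deps = requests.get(timeout=0.1)
        except queue.Empty:
            continue
        if op == "add":
            dependence_graph[task_id] = set(deps)

stop = threading.Event()
manager = threading.Thread(target=idle_thread_loop, args=(stop,))
manager.start()

for t in range(8):                       # worker side: fire-and-forget submissions
    submit_task(t, depends_on=range(max(0, t - 2), t))

stop.set(); manager.join()
print(dependence_graph)
```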

Distributed Parallel And Cluster Computing

At-Scale Sparse Deep Neural Network Inference with Efficient GPU Implementation

This paper presents GPU performance optimization and scaling results for the inference models of the Sparse Deep Neural Network Challenge 2020. Demands for network quality have increased rapidly, pushing the size, and thus the memory requirements, of many neural networks beyond the capacity of available accelerators. Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks. However, there is room for improvement in implementing SpDNN operations on GPUs. This work presents optimized sparse matrix multiplication kernels fused with the ReLU function. The optimized kernels reuse input feature maps from shared memory and sparse weights from registers. For multi-GPU parallelism, our SpDNN implementation duplicates weights and statically partitions the feature maps across GPUs. Results for the challenge benchmarks show that the proposed kernel design and multi-GPU parallelization achieve up to 180 tera-edges per second of inference throughput. These results are up to 4.3x faster for a single GPU, and an order of magnitude faster at full scale, than those of the champion of the 2019 Sparse Deep Neural Network Graph Challenge for the same generation of NVIDIA V100 GPUs. Using the same implementation, we also show that single-GPU throughput on the NVIDIA A100 is 2.37x faster than on the V100.
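
At the level of the operation being fused, each layer is a sparse-times-dense multiply followed by ReLU; a rough NumPy/SciPy equivalent of that layer and of the "duplicate weights, partition feature maps" multi-GPU split is sketched below. The tiling of feature maps into shared memory and weights into registers, which is the paper's actual contribution, is not represented here, and the data orientation is an assumption of this sketch.

```python
import numpy as np
from scipy import sparse

def spdnn_layer(weights, features, bias=0.0):
    """One sparse-DNN inference layer: sparse weights times dense feature
    maps, immediately followed by ReLU.  The paper fuses these two steps
    into one GPU kernel; here SciPy does the SpMM and NumPy the activation."""
    z = weights @ features + bias      # SpMM: sparse W (neurons x neurons) x dense X
    return np.maximum(z, 0.0)          # ReLU, fused into the same kernel on the GPU

def partition_feature_maps(features, num_gpus):
    """Multi-GPU scheme from the abstract: weights are duplicated on every
    device and the feature maps are statically split across devices."""
    return np.array_split(features, num_gpus, axis=1)   # split the batch dimension

# Toy example: 1024 neurons, a batch of 8 inputs, ~1%-dense random weights.
w = sparse.random(1024, 1024, density=0.01, format="csr",
                  random_state=0, dtype=np.float32)
x = np.random.default_rng(0).random((1024, 8), dtype=np.float32)
print(spdnn_layer(w, x).shape, [c.shape for c in partition_feature_maps(x, 4)])
```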

Distributed Parallel And Cluster Computing

Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads

The last decade has witnessed growth in the computational requirements for training deep neural networks. Current approaches (e.g., data/model parallelism, pipeline parallelism) parallelize training tasks across multiple devices. However, these approaches always rely on specific deep learning frameworks and require elaborate manual design, which makes the resulting strategies difficult to maintain and share across different types of models. In this paper, we propose Auto-MAP, a framework for exploring distributed execution plans for DNN workloads that automatically discovers fast parallelization strategies through reinforcement learning at the IR level of deep learning models. Efficient exploration remains a major challenge for reinforcement learning. We leverage DQN with task-specific pruning strategies to efficiently explore a search space that includes optimized strategies. Our evaluation shows that Auto-MAP can find the optimal solution within two hours while achieving better throughput on several NLP and convolutional models.
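
The abstract describes a DQN-driven search with pruning over IR-level parallelization decisions. As a loose, toy illustration of that idea (tabular Q-learning instead of a DQN, invented operator names, a made-up cost model), one could write something like the following.

```python
import random
from collections import defaultdict

# Toy stand-in for the search problem: pick a parallelization strategy for
# each operator of a model so that a (made-up) cost estimate is minimized.
OPS = ["embed", "attention", "ffn", "softmax"]              # hypothetical IR-level ops
STRATEGIES = ["data", "model", "pipeline"]
BANNED = {("embed", "model"), ("softmax", "pipeline")}      # pretend these are invalid

def step_cost(op, strategy):
    """Made-up per-operator cost; a real system would query a cost model."""
    base = {"data": 2.0, "model": 1.5, "pipeline": 1.8}[strategy]
    return base + (0.4 if op in ("attention", "ffn") and strategy == "data" else 0.0)

def pruned(op):
    """Task-specific pruning: drop strategies known to be invalid for this op."""
    return [s for s in STRATEGIES if (op, s) not in BANNED]

Q = defaultdict(float)                      # Q[(op_index, strategy)] = expected cost
eps, alpha, gamma = 0.2, 0.5, 0.9
for _ in range(2000):                       # episodes of epsilon-greedy Q-learning
    for i, op in enumerate(OPS):
        actions = pruned(op)
        if random.random() < eps:
            a = random.choice(actions)                      # explore
        else:
            a = min(actions, key=lambda s: Q[(i, s)])       # exploit: lower cost is better
        cost = step_cost(op, a)
        future = 0.0 if i == len(OPS) - 1 else min(Q[(i + 1, s)] for s in pruned(OPS[i + 1]))
        Q[(i, a)] += alpha * (cost + gamma * future - Q[(i, a)])

plan = {op: min(pruned(op), key=lambda s: Q[(i, s)]) for i, op in enumerate(OPS)}
print("learned plan:", plan)
```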

Distributed Parallel And Cluster Computing

Automatic Horizontal Fusion for GPU Kernels

We present automatic horizontal fusion, a novel optimization technique that complements standard kernel fusion techniques for GPU programs. Unlike standard fusion, whose goal is to eliminate intermediate data round trips, our horizontal fusion technique aims to increase thread-level parallelism to hide instruction latencies. We also present HFuse, a new source-to-source CUDA compiler that implements automatic horizontal fusion. Our experimental results show that horizontal fusion can speed up kernel execution by 2.5%-60.8%. The results reveal that horizontal fusion is especially beneficial for fusing kernels with instructions that require different kinds of GPU resources (e.g., a memory-intensive kernel and a compute-intensive kernel).
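
Conceptually, horizontal fusion packs two independent kernels into one launch and splits the thread range between them, so memory-bound and compute-bound instructions coexist and can hide each other's latency. The plain-Python sketch below only mimics that thread-partitioning structure; real HFuse output is CUDA, and all names here are illustrative.

```python
import math

def kernel_memory_bound(tid, src, dst):
    """Stand-in for a memory-intensive kernel: copy/scale one element."""
    dst[tid] = 2.0 * src[tid]

def kernel_compute_bound(tid, out):
    """Stand-in for a compute-intensive kernel: a few transcendental ops."""
    out[tid] = math.sin(tid) ** 2 + math.cos(tid) ** 2

def fused_kernel(tid, split, src, dst, out):
    """Horizontally fused kernel: the first `split` threads execute the
    memory-bound body, the remaining threads execute the compute-bound body.
    On a GPU both bodies then live in the same thread blocks, so the
    scheduler can overlap their latencies."""
    if tid < split:
        kernel_memory_bound(tid, src, dst)
    else:
        kernel_compute_bound(tid - split, out)

n = 8
src, dst, out = [float(i) for i in range(n)], [0.0] * n, [0.0] * n
for tid in range(2 * n):            # a 'grid' of 2n threads covering both kernels
    fused_kernel(tid, split=n, src=src, dst=dst, out=out)
print(dst, out)
```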

Distributed Parallel And Cluster Computing

Avoiding Register Overflow in the Bakery Algorithm

Computer systems are designed to make resources available to users, and users may be interested in some resources more than others; therefore, a coordination scheme is required to satisfy the users' requirements. Such a scheme may implement certain policies, such as "never allocate more than X units of resource Z". One policy of particular interest is preventing users from accessing a single resource at the same time, which is called the problem of mutual exclusion. Resource management concerns the coordination and collaboration of users, and it is usually based on making a decision; in the case of mutual exclusion, that decision is about granting access to a resource. Therefore, mutual exclusion is useful for supporting resource access management. The first true solution to the mutual exclusion problem, the Bakery algorithm, does not rely on any lower-level mutual exclusion. We examine the problem of register overflow in real-world implementations of the Bakery algorithm and present a variant named Bakery++ that prevents overflows from ever happening. Bakery++ avoids overflows without allowing a process to write into other processes' memory, without using additional memory, and without complex arithmetic or redefining the operations and functions used in Bakery. Bakery++ is almost as simple as Bakery, and it is straightforward to implement in real systems. With Bakery++, there is no reason to keep implementing Bakery in real computers, because Bakery++ eliminates the possibility of overflows and is hence more practical than Bakery. Previous approaches to circumventing the problem of register overflow introduced new variables or redefined the operations or functions used in the original Bakery algorithm, whereas Bakery++ avoids overflows by using simple conditional statements. (The abstract does not end here.)
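
For context, the original Bakery algorithm that Bakery++ modifies is sketched below; the ticket assignment number[i] = 1 + max(number) is exactly where a fixed-width register can overflow under contention. The Bakery++ conditionals themselves are not reproduced, since the abstract does not spell them out, and this didactic simulation leans on CPython's GIL rather than the safe-register model Bakery actually assumes.

```python
import threading

N = 4                          # number of processes
choosing = [False] * N
number = [0] * N               # ticket registers: these grow without bound
counter = 0                    # shared resource protected by the lock

def lock(i):
    choosing[i] = True
    number[i] = 1 + max(number)        # unbounded growth: the overflow risk Bakery++ removes
    choosing[i] = False
    for j in range(N):
        while choosing[j]:
            pass                       # wait until j has finished picking its ticket
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass                       # smaller ticket (ties broken by id) goes first

def unlock(i):
    number[i] = 0

def worker(i):
    global counter
    for _ in range(200):
        lock(i)
        counter += 1                   # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                         # 800 if mutual exclusion held
```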

Distributed Parallel And Cluster Computing

BB: Booting Booster for Consumer Electronics with Modern OS

Unconventional computing platforms have spread widely and rapidly following smartphones and tablets: consumer electronics such as smart TVs and digital cameras. For such devices, fast booting is a critical requirement; waiting tens of seconds for a TV or a camera to boot up is not acceptable, unlike for a PC or smartphone. Moreover, the software platforms of these devices have become as rich as those of conventional computing devices in order to provide comparable services. As a result, the booting procedure, which must start every required OS service, hardware component, and application, the number of which keeps increasing, may take an unbearably long time for most consumers. To accelerate booting, this paper introduces Booting Booster (BB), which is used in all 2015 Samsung Smart TV models, which run the Linux-based Tizen OS. BB addresses the init scheme of Linux, which launches the initial user-space OS services and applications and manages the life cycles of all user processes, by identifying and isolating booting-critical tasks, deferring non-critical tasks, and enabling more tasks to execute in parallel. BB has been successfully deployed in Samsung Smart TV 2015 models, achieving a cold boot in 3.5 s (compared to 8.1 s with full commercial-grade optimizations but without BB) without the need for suspend-to-RAM or hibernation. After this successful deployment, we have released the source code via this http URL, and BB will be included in the open-source OS Tizen (this http URL).
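
The strategy described above, classify init tasks as booting-critical or deferrable, run the critical ones in parallel, and postpone the rest until after boot is declared complete, can be caricatured as follows. The task names, durations, and thread-pool scheduler are invented for illustration; BB itself reworks the Linux/Tizen init system, not Python code.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical init tasks: (name, seconds, booting_critical)
TASKS = [("display_driver", 0.3, True), ("av_pipeline", 0.4, True),
         ("home_screen_app", 0.5, True), ("sw_update_check", 0.6, False),
         ("usage_statistics", 0.2, False), ("voice_assistant", 0.4, False)]

def run(name, seconds):
    time.sleep(seconds)                      # stand-in for real service start-up
    return name

start = time.time()
critical = [(n, s) for n, s, crit in TASKS if crit]
deferred = [(n, s) for n, s, crit in TASKS if not crit]

with ThreadPoolExecutor() as pool:           # booting-critical tasks run in parallel
    list(pool.map(lambda t: run(*t), critical))
print(f"boot-complete after {time.time() - start:.2f}s")   # ~0.5s, not the 2.4s serial sum

with ThreadPoolExecutor() as pool:           # non-critical work is deferred until after 'boot'
    pool.map(lambda t: run(*t), deferred)
```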

Distributed Parallel And Cluster Computing

BSF-skeleton: A Template for Parallelization of Iterative Numerical Algorithms on Cluster Computing Systems

This article describes a method for creating applications for cluster computing systems using the parallel BSF-skeleton, which is based on the original BSF (Bulk Synchronous Farm) model of parallel computation developed earlier by the author. This model uses the master/slave paradigm. The main advantage of the BSF model is that it makes it possible to estimate the scalability of a parallel algorithm before implementing it. Another important feature of the BSF model is the representation of problem data in the form of lists, which greatly simplifies the logic of building applications. The BSF-skeleton is designed for creating parallel programs in C++ using the MPI library. Its scope is iterative numerical algorithms of high computational complexity. The BSF-skeleton has the following distinctive features: it completely encapsulates all aspects associated with parallelizing a program; it allows error-free compilation at all stages of application development; and it supports the OpenMP programming model and workflows.
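
The contract the abstract describes, problem data as a list, user-supplied map and reduce steps, and a master that iterates until convergence, can be conveyed in a few lines. The callback names below are guesses at the shape of such an interface; the real skeleton is a C++/MPI template, and its OpenMP support is not reflected here.

```python
def bsf_run(data_list, split, map_fn, reduce_fn, update, done, state, workers=4):
    """Minimal sequential sketch of a Bulk Synchronous Farm iteration:
    the master splits the list, the 'slaves' map their sublists, and the
    master reduces the partial results and updates the approximation
    until convergence."""
    while not done(state):
        sublists = split(data_list, workers)                 # master -> slaves
        partials = [map_fn(sub, state) for sub in sublists]  # slave work (parallel under MPI)
        combined = reduce_fn(partials)                       # slaves -> master
        state = update(state, combined)                      # master step
    return state

# Example: compute the average of a list with the farm structure above.
data = list(range(1, 101))
result = bsf_run(
    data,
    split=lambda xs, k: [xs[i::k] for i in range(k)],
    map_fn=lambda sub, s: (sum(sub), len(sub)),
    reduce_fn=lambda ps: (sum(p[0] for p in ps), sum(p[1] for p in ps)),
    update=lambda s, c: c[0] / c[1],
    done=lambda s: s is not None,     # one iteration suffices for this toy problem
    state=None,
)
print(result)    # 50.5
```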

