Featured Research

Distributed Parallel And Cluster Computing

An Ensemble Scheme for Proactive Data Allocation in Distributed Datasets

The advent of the Internet of Things (IoT) enables numerous devices to interact with their environment, collecting and processing data. Data are transferred upwards to the Cloud through the Edge Computing (EC) infrastructure. A high number of EC nodes become hosts of distributed datasets where various processing activities can be realized in close proximity to end users, limiting the latency in the provision of responses. In this paper, we focus on a model that proactively decides where collected data should be stored in order to maximize the accuracy of the datasets present at the EC infrastructure. We consider accuracy to be defined by the solidity of datasets, exposed as the statistical resemblance of data. We assess the similarity of the incoming data with the available datasets and select the most appropriate of them to store the new information. To relieve processing nodes of the burden of continuous, complicated statistical processing, we propose the use of synopses as the subject of the similarity process. The incoming data are matched against the available synopses based on an ensemble scheme; we then select the appropriate host to store them and update the corresponding synopsis. We provide a description of the problem and the formulation of our solution. Our experimental evaluation aims to reveal the performance of the proposed approach.
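The matching-and-allocation loop described above might be sketched roughly as follows; the synopsis statistics (mean and standard deviation), the distance-based similarity measure, and the single-voter ensemble are illustrative assumptions, not the paper's actual design:

```python
import math

def synopsis(values):
    """Compute a lightweight statistical synopsis (mean, std) of a dataset."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "std": math.sqrt(var)}

def similarity(batch, syn):
    """Score how well an incoming batch resembles a node's synopsis
    (higher is better); here: negative distance between summary stats."""
    b = synopsis(batch)
    return -(abs(b["mean"] - syn["mean"]) + abs(b["std"] - syn["std"]))

def select_host(batch, node_synopses, voters):
    """Ensemble decision: each voter (a similarity function) ranks the
    candidate nodes; the node with the most first-place votes wins."""
    votes = {}
    for vote_fn in voters:
        best = max(node_synopses, key=lambda n: vote_fn(batch, node_synopses[n]))
        votes[best] = votes.get(best, 0) + 1
    return max(votes, key=votes.get)

# Two edge nodes hosting datasets with clearly different statistics
synopses = {
    "node_a": synopsis([1.0, 1.2, 0.9, 1.1]),
    "node_b": synopsis([10.0, 9.5, 10.5, 10.2]),
}
host = select_host([1.05, 0.95, 1.1], synopses, voters=[similarity])
```

In the paper's setting, each voter would be a different matching model, and the winning EC node both stores the new data and refreshes its synopsis.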

Read more
Distributed Parallel And Cluster Computing

An In-Depth Analysis of the Slingshot Interconnect

The interconnect is one of the most critical components in large-scale computing systems, and its impact on the performance of applications grows with system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenter networks with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, and highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which allows it to be interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot provides these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion than on previous-generation networks.

Read more
Distributed Parallel And Cluster Computing

An Intelligent Scheme for Uncertainty Management of Data Synopses in Pervasive Computing Applications

Pervasive computing applications deal with the incorporation of intelligent components around end users to facilitate their activities. Such applications can be provided upon the vast infrastructures of the Internet of Things (IoT) and Edge Computing (EC). IoT devices collect ambient data and transfer them towards the EC and the Cloud for further processing. EC nodes can become hosts of distributed datasets where various processing activities take place. The future of EC involves numerous nodes interacting with IoT devices and with each other in a cooperative manner to realize the desired processing. A critical issue in concluding this cooperative approach is the exchange of data synopses, which keeps EC nodes informed about the data present in their peers. Such knowledge is useful for decision making related to the execution of processing activities. In this paper, we propose an uncertainty-driven model for the exchange of data synopses. We argue that EC nodes should delay the exchange of synopses, especially when no significant differences from historical values are present. Our mechanism adopts a Fuzzy Logic (FL) system to decide when there is a significant difference from the previously reported synopsis and, thus, whether the new one should be exchanged. Our scheme is capable of relieving the network of the numerous messages that would otherwise be sent even for low fluctuations in synopses. We analytically describe our model and evaluate it through a large set of experiments. Our experimental evaluation aims to assess the efficiency of the approach in eliminating unnecessary messages while keeping peer nodes immediately informed of significant statistical changes in the distributed datasets.
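A toy version of such an FL-based decision might look like the following; the triangular membership functions, thresholds, and scalar synopsis are invented for illustration and are far simpler than the paper's scheme:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def exchange_decision(prev_synopsis, new_synopsis, threshold=0.5):
    """Fuzzy decision: map the relative change between the previously
    reported synopsis and the new one to low/medium/high fuzzy sets,
    defuzzify to a send-score, and compare with a threshold."""
    delta = abs(new_synopsis - prev_synopsis) / max(abs(prev_synopsis), 1e-9)
    low = tri(delta, -0.2, 0.0, 0.2)   # negligible change -> hold back
    med = tri(delta, 0.1, 0.3, 0.5)    # moderate change
    high = tri(delta, 0.4, 1.0, 1.6)   # significant change -> send
    # Weighted-average defuzzification with rule outputs 0.0 / 0.5 / 1.0
    total = low + med + high
    score = (0.0 * low + 0.5 * med + 1.0 * high) / total if total else 1.0
    return score >= threshold

small_change = exchange_decision(prev_synopsis=10.0, new_synopsis=10.1)
big_change = exchange_decision(prev_synopsis=10.0, new_synopsis=16.0)
```

A node would report the new synopsis to its peers only when the decision fires, suppressing messages caused by low fluctuations.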

Read more
Distributed Parallel And Cluster Computing

An OpenMP translator for the GAP8 MPSoC

One of the barriers to the adoption of parallel computing is the inherent complexity of its programming. The Open Multi-Processing (OpenMP) Application Programming Interface (API) facilitates such implementations by providing high-abstraction-level directives. On another front, new architectures aimed at low energy consumption have been developed, such as the GreenWaves Technologies GAP8, a Multi-Processor System-on-Chip (MPSoC) based on the Parallel Ultra Low Power (PULP) Platform. The GAP8 has an 8-core cluster and a Fabric Controller (FC) master core. Parallel programming on the GAP8 is very promising on the efficiency side, but its recent development and the lack of a robust OS to handle thread and core scheduling complicate a simple implementation of the OpenMP API. This project implements a source-to-source translator that interprets a limited set of OpenMP directives and is capable of generating parallel microcontroller code that manipulates the cores directly. The preliminary results obtained in this work show a reduction in code size compared with the base implementation, demonstrating the effectiveness of the project in easing the programming of the GAP8. Further work is needed to implement more OpenMP directives.
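A minimal sketch of the source-to-source idea (written in Python for brevity): it outlines a `#pragma omp parallel for` loop into a per-core worker that processes a chunk of the iteration space. The generated runtime calls (`cluster_fork`, `MIN`) are invented placeholders, not actual GAP8 SDK primitives:

```python
import re

def translate(src, num_cores=8):
    """Toy source-to-source pass: rewrite '#pragma omp parallel for' over a
    'for (int i = 0; i < N; i++)' loop into an outlined worker that splits
    the iteration space across the cluster cores, plus a fork call."""
    out, workers, wid = [], [], 0
    lines = src.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].strip() == "#pragma omp parallel for":
            m = re.match(r"\s*for \(int (\w+) = 0; \1 < (\w+); \1\+\+\)(.*)",
                         lines[i + 1])
            var, bound, body = m.groups()
            workers.append(
                f"void worker_{wid}(int core_id) {{\n"
                f"  int chunk = ({bound} + {num_cores} - 1) / {num_cores};\n"
                f"  int lo = core_id * chunk, hi = MIN(lo + chunk, {bound});\n"
                f"  for (int {var} = lo; {var} < hi; {var}++){body}\n"
                f"}}"
            )
            out.append(f"cluster_fork({num_cores}, worker_{wid});")
            wid += 1
            i += 2  # consume the pragma and the loop line
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(workers + out)

c_src = "#pragma omp parallel for\nfor (int i = 0; i < n; i++) a[i] = 2 * b[i];"
translated = translate(c_src)
```

A real translator must also handle clauses (`private`, `reduction`), nested scopes, and synchronization on the FC, which this sketch ignores.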

Read more
Distributed Parallel And Cluster Computing

An SMDP-Based Approach to Thermal-Aware Task Scheduling in NoC-based MPSoC platforms

One efficient approach to controlling chip-wide thermal distribution in multi-core systems is the optimization of online assignments of tasks to processing cores. Online task assignment, however, faces several uncertainties in real-world systems and does not exhibit a deterministic nature. In this paper, we consider the operation of a thermal-aware task scheduler that dispatches tasks from an arrival queue and sets the voltage and frequency of the processing cores to optimize the mean temperature margin of the entire chip (i.e., the cores as well as the NoC routers). We model the decision process of the task scheduler as a semi-Markov decision process (SMDP). Then, to solve the formulated SMDP, we propose two reinforcement learning algorithms that are capable of computing the optimal task assignment policy without requiring statistical knowledge of the stochastic dynamics underlying the system states. The proposed algorithms also rely on function approximation techniques to handle the infinite length of the task queue as well as the continuous nature of temperature readings. Compared to related research, the simulation results show a nearly 6 Kelvin reduction in average system peak temperature and a 66-millisecond decrease in mean task service time.
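The model-free flavor of this approach can be illustrated with a toy semi-gradient Q-learning loop using linear function approximation; the features, the reward (the chosen core's temperature margin), and the omission of DVFS control are simplifying assumptions, not the paper's formulation:

```python
import random

random.seed(0)
CORES = 4

def features(queue_len, margins, core):
    """(state, action) features: bias, normalized queue length, and the
    chosen core's continuous temperature margin."""
    return [1.0, min(queue_len, 50) / 50.0, margins[core]]

def q_value(w, feats):
    return sum(wi * fi for wi, fi in zip(w, feats))

def choose_core(w, queue_len, margins, eps=0.1):
    """Epsilon-greedy assignment of the head-of-queue task to a core."""
    if random.random() < eps:
        return random.randrange(CORES)
    return max(range(CORES),
               key=lambda c: q_value(w, features(queue_len, margins, c)))

def td_update(w, feats, reward, next_best_q, alpha=0.05, gamma=0.95):
    """One semi-gradient Q-learning step: only sampled transitions are
    needed, no model of the system's stochastic dynamics."""
    err = reward + gamma * next_best_q - q_value(w, feats)
    return [wi + alpha * err * fi for wi, fi in zip(w, feats)]

# Tiny simulated interaction: reward = margin of the chosen core, so the
# learner should come to prefer the coolest core (largest margin).
w = [0.0, 0.0, 0.0]
margins = [0.1, 0.8, 0.3, 0.5]   # core 1 is the coolest
for _ in range(2000):
    c = choose_core(w, queue_len=10, margins=margins)
    f = features(10, margins, c)
    nb = max(q_value(w, features(10, margins, k)) for k in range(CORES))
    w = td_update(w, f, reward=margins[c], next_best_q=nb)
greedy = choose_core(w, 10, margins, eps=0.0)
```

The paper's algorithms additionally handle SMDP sojourn times and joint core/frequency actions, which this bandit-like sketch collapses away.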

Read more
Distributed Parallel And Cluster Computing

Analytics of Longitudinal System Monitoring Data for Performance Prediction

In recent years, several HPC facilities have started continuous monitoring of their systems and jobs to collect performance-related data for understanding performance and operational efficiency. Such data can be used to optimize the performance of individual jobs and the overall system by creating data-driven models that can predict the performance of pending jobs. In this paper, we model the performance of representative control jobs using longitudinal system-wide monitoring data to explore the causes of performance variability. Using machine learning, we are able to predict the performance of unseen jobs before they are executed, based on the current system state. We analyze these prediction models in great detail to identify the features that are dominant predictors of performance. We demonstrate that such models can be application-agnostic and can be used to predict the performance of applications that are not included in training.
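As a stand-in for the kind of data-driven model described, the sketch below predicts a pending job's runtime from the current system state with a k-nearest-neighbors lookup over historical monitoring records; the feature names and the choice of k-NN are illustrative, not the paper's actual models:

```python
import math

def knn_predict(history, state, k=3):
    """Predict the runtime of a pending job from the current system state
    using the k nearest historical (state, runtime) observations."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(history, key=lambda h: dist(h[0], state))[:k]
    return sum(runtime for _, runtime in nearest) / k

# Hypothetical features: (cpu_load, network_util, io_wait) -> runtime (s)
history = [
    ((0.2, 0.1, 0.0), 100.0),
    ((0.3, 0.2, 0.1), 110.0),
    ((0.9, 0.8, 0.5), 240.0),
    ((0.8, 0.9, 0.6), 260.0),
    ((0.1, 0.1, 0.0), 95.0),
]
pred = knn_predict(history, (0.85, 0.85, 0.55), k=2)
```

Because the features describe the system rather than the application, the same model can score jobs from applications absent from the training set, which is the application-agnostic property the abstract highlights.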

Read more
Distributed Parallel And Cluster Computing

Analyzing Performance Properties Collected by the PerSyst Scalable HPC Monitoring Tool

The ability to understand how a scientific application is executed on a large HPC system is of great importance in allocating resources within the HPC data center. In this paper, we describe how we used system performance data to identify execution patterns, possible code optimizations, and improvements to system monitoring. We also identify candidates for employing machine learning techniques to predict the performance of similar scientific codes.

Read more
Distributed Parallel And Cluster Computing

Analyzing and Mitigating Data Stalls in DNN Training

Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely used computer vision and audio Deep Neural Networks (DNNs), which typically involve complex data preprocessing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, GPU generation, etc., on servers that are part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and preprocessed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configurations show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data-loading library, DNN training time is reduced significantly (by as much as 5x on a single server).
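The idea behind measuring data stalls can be mimicked with a simulated training loop that separately accounts for time spent waiting on the input pipeline versus time spent in compute; this is a toy illustration, not DS-Analyzer's actual technique:

```python
import time

def run_epoch(loader, compute_time_per_batch):
    """Run one epoch, separately accumulating time spent waiting on data
    (fetch + preprocess) and time spent in the 'GPU' compute step."""
    data_stall = compute = 0.0
    t0 = time.perf_counter()
    for batch in loader:
        t1 = time.perf_counter()
        data_stall += t1 - t0                # waited on the input pipeline
        time.sleep(compute_time_per_batch)   # stand-in for the GPU step
        t0 = time.perf_counter()
        compute += t0 - t1
    return data_stall, compute

def slow_loader(n_batches, fetch_time):
    """Simulated input pipeline that stalls on every batch."""
    for i in range(n_batches):
        time.sleep(fetch_time)               # stand-in for fetch + preprocess
        yield i

stall, comp = run_epoch(slow_loader(5, fetch_time=0.02),
                        compute_time_per_batch=0.01)
```

The differential part of the real technique comes from rerunning with a fully cached pipeline (fetch time near zero) and subtracting, which isolates the stall component without instrumenting the framework internals.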

Read more
Distributed Parallel And Cluster Computing

Applying the Roofline model for Deep Learning performance optimizations

In this paper, we present a methodology for automatically creating Roofline models for Non-Uniform Memory Access (NUMA) systems, using Intel Xeon as an example. We then present an evaluation of highly efficient deep learning primitives as implemented in the Intel oneDNN library.
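The Roofline model itself reduces to a one-line formula: attainable performance is the minimum of the compute roof and the memory roof (bandwidth times arithmetic intensity). The peak numbers below are illustrative, not measured Xeon figures:

```python
def roofline(peak_gflops, mem_bw_gb_s, arithmetic_intensity):
    """Attainable performance (GFLOP/s) under the Roofline model:
    min(compute roof, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bw_gb_s * arithmetic_intensity)

# Illustrative machine: 3 TFLOP/s compute roof, 100 GB/s memory bandwidth
peak, bw = 3000.0, 100.0
ridge_point = peak / bw              # FLOPs/byte where the two roofs meet
low_ai = roofline(peak, bw, 5.0)     # memory-bound kernel
high_ai = roofline(peak, bw, 60.0)   # compute-bound kernel
```

On NUMA systems the memory roof depends on which sockets' memory a kernel touches, which is why per-domain bandwidth measurements are needed to build the model automatically.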

Read more
Distributed Parallel And Cluster Computing

Approximate Byzantine Fault-Tolerance in Distributed Optimization

This paper considers the problem of Byzantine fault-tolerance in distributed multi-agent optimization. In this problem, each agent has a local cost function, and in the fault-free case, the goal is to design a distributed algorithm that allows all the agents to find a minimum point of all the agents' aggregate cost function. We consider a scenario where some agents might be Byzantine faulty, which renders the original goal of computing a minimum point of all the agents' aggregate cost vacuous. A more reasonable objective for an algorithm in this scenario is to allow all the non-faulty agents to compute the minimum point of only the non-faulty agents' aggregate cost. Prior work shows that if there are up to f (out of n) Byzantine agents, then a minimum point of the non-faulty agents' aggregate cost can be computed exactly if and only if the non-faulty agents' costs satisfy a certain redundancy property called 2f-redundancy. However, 2f-redundancy is an ideal property that can be satisfied only in systems free from noise or uncertainties, which can make the goal of exact fault-tolerance unachievable in some applications. Thus, we introduce the notion of (f,ϵ)-resilience, a generalization of exact fault-tolerance wherein the objective is to find an approximate minimum point of the non-faulty aggregate cost, with ϵ accuracy. This approximate fault-tolerance can be achieved under a weaker condition that is easier to satisfy in practice than 2f-redundancy. We obtain necessary and sufficient conditions for achieving (f,ϵ)-resilience, characterizing the correlation between relaxation in redundancy and approximation in resilience. When the agents' cost functions are differentiable, we obtain conditions for the (f,ϵ)-resilience of the distributed gradient-descent method when equipped with robust gradient aggregation.
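One standard robust gradient-aggregation rule of the kind such analyses cover is the coordinate-wise trimmed mean, sketched below; this particular rule and the toy gradients are only an example, not the paper's specific construction:

```python
def trimmed_mean(grads, f):
    """Coordinate-wise trimmed mean: for each coordinate, drop the f
    largest and f smallest reported values, then average the rest."""
    d = len(grads[0])
    agg = []
    for j in range(d):
        col = sorted(g[j] for g in grads)
        kept = col[f:len(col) - f]
        agg.append(sum(kept) / len(kept))
    return agg

def robust_gd_step(x, grads_from_agents, f, lr=0.1):
    """One distributed gradient-descent step with robust aggregation
    in place of the plain average."""
    g = trimmed_mean(grads_from_agents, f)
    return [xi - lr * gi for xi, gi in zip(x, g)]

# 4 honest agents report similar gradients; 1 Byzantine agent sends junk.
grads = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0], [1e6, -1e6]]
g = trimmed_mean(grads, f=1)
```

With up to f Byzantine reports per coordinate, the trimming guarantees the aggregate stays within the range of honest values, which is the kind of property the (f,ϵ)-resilience conditions formalize.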

Read more
