Featured Research

Performance

Branch prediction related Optimizations for Multithreaded Processors

Major chip manufacturers have all introduced multithreaded processors. These processors run a variety of workloads, and efficient resource utilization is an important design aspect. Depending on the workload, mis-speculated execution can severely impact resource and power utilization. In general, compared to a uniprocessor, a multithreaded processor may tolerate mis-speculation better. However, there can still be phases where even a multithreaded processor's performance is impacted by branch-induced mis-speculation. In this paper I propose monitoring the branch predictor behavior of the hardware threads running on the multithreaded processor and feeding that information back to the thread arbiter/picker, which schedules the next thread to fetch instructions from. If a particular thread is going through a phase where it is consistently mispredicting its branches and its average branch misprediction stall is above a specific threshold, then that thread's priority for being picked is temporarily reduced. I provide a qualitative comparison of various solutions to the resource inefficiency caused by mis-speculated branches in multithreaded processors. This work can be extended with a quantitative evaluation.
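A minimal sketch of the proposed feedback loop, assuming hypothetical per-thread misprediction counters, stall estimates, and thresholds (none of these names or values come from the paper):

```python
# Illustrative sketch of a misprediction-aware thread picker.
# All thresholds, counters, and the penalty window are assumptions.

MISPREDICT_RATE_THRESHOLD = 0.15   # fraction of branches mispredicted
STALL_CYCLES_THRESHOLD = 20        # average stall cycles per misprediction
PENALTY_WINDOW = 1000              # cycles to keep a thread deprioritized

class ThreadState:
    def __init__(self, tid):
        self.tid = tid
        self.branches = 0
        self.mispredicts = 0
        self.total_stall_cycles = 0
        self.deprioritized_until = 0

    def mispredict_rate(self):
        return self.mispredicts / self.branches if self.branches else 0.0

    def avg_stall(self):
        return (self.total_stall_cycles / self.mispredicts
                if self.mispredicts else 0.0)

def pick_thread(threads, cycle):
    """Prefer threads not currently in a misprediction-heavy phase."""
    ready = [t for t in threads if cycle >= t.deprioritized_until]
    candidates = ready if ready else threads  # never starve all threads
    return min(candidates, key=lambda t: t.mispredict_rate())

def on_branch_resolved(thread, mispredicted, stall_cycles, cycle):
    """Feedback path: update counters and deprioritize bad phases."""
    thread.branches += 1
    if mispredicted:
        thread.mispredicts += 1
        thread.total_stall_cycles += stall_cycles
    if (thread.mispredict_rate() > MISPREDICT_RATE_THRESHOLD
            and thread.avg_stall() > STALL_CYCLES_THRESHOLD):
        thread.deprioritized_until = cycle + PENALTY_WINDOW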

Read more
Performance

Breakdown of a Benchmark Score Without Internal Analysis of Benchmarking Program

A breakdown of a benchmark score shows how much each aspect of system performance affects the score. Existing methods require internal analysis of the benchmarking program and suffer from the following problems: (1) they require a certain amount of labor for code analysis, profiling, simulation, and so on, and (2) they require the benchmarking program itself. In this paper, we present a method for breaking down a benchmark score without internal analysis of the benchmarking program. The method applies regression analysis to benchmark scores collected on a number of systems. Experimental results with 3 benchmarks on 15 Android smartphones showed that our method could break down those benchmark scores, although there is room for improvement in accuracy.
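A minimal sketch of the regression idea, assuming hypothetical per-aspect measurements (CPU, memory, storage) collected on several devices; the data and aspect names are illustrative, not the paper's:

```python
# Sketch: regress benchmark scores on per-aspect system measurements
# across many devices, then read the fit as the score breakdown.
import numpy as np

# Hypothetical data: rows are devices, columns are normalized aspect
# measurements (CPU throughput, memory bandwidth, storage IOPS).
aspects = np.array([
    [1.00, 0.80, 0.60],
    [0.70, 1.00, 0.50],
    [0.90, 0.60, 1.00],
    [0.60, 0.70, 0.80],
    [0.85, 0.95, 0.40],
])
scores = np.array([95.0, 82.0, 90.0, 74.0, 88.0])  # benchmark scores

# Least-squares fit: score ~ w_cpu*cpu + w_mem*mem + w_sto*storage
weights, *_ = np.linalg.lstsq(aspects, scores, rcond=None)

contributions = weights * aspects.mean(axis=0)
for name, c in zip(["cpu", "memory", "storage"], contributions):
    print(f"{name}: {c / contributions.sum():.1%} of the average score")
```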

Read more
Performance

Brewing Analytics Quality for Cloud Performance

Cloud computing has become increasingly popular, and many cloud deployment options are available. Testing cloud performance enables us to choose a deployment that matches our requirements. In this paper, we present an innovative process, implemented in software, to assess the quality of cloud performance data. The process combines performance data from multiple machines, spanning user experience data, workload performance metrics, and readily available system performance data. Furthermore, we discuss the major challenges of bringing raw data into tidy data formats to enable subsequent analysis, and describe how our process applies several layers of assessment to validate the quality of the data processing procedure. We present a case study to demonstrate the effectiveness of the proposed process, and conclude with several future research directions worth investigating.
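A minimal sketch of what layered quality assessment over tidied multi-machine data can look like; the column names, checks, and thresholds are assumptions for illustration, not the paper's process:

```python
# Sketch: combine raw per-machine metrics into tidy rows and run
# layered validity checks before analysis. Columns are illustrative.
import pandas as pd

def to_tidy(raw_frames):
    """Layer 1: reshape each machine's wide log into tidy rows."""
    tidy = []
    for machine, df in raw_frames.items():
        long = df.melt(id_vars=["timestamp"],
                       var_name="metric", value_name="value")
        long["machine"] = machine
        tidy.append(long)
    return pd.concat(tidy, ignore_index=True)

def validate(tidy):
    """Layer 2: assess the quality of the combined data set."""
    issues = []
    if tidy["value"].isna().mean() > 0.05:
        issues.append("more than 5% missing samples")
    spans = tidy.groupby("machine")["timestamp"].agg(["min", "max"])
    if spans["min"].nunique() > 1 or spans["max"].nunique() > 1:
        issues.append("machines cover different time windows")
    return issues

raw = {
    "host-a": pd.DataFrame({"timestamp": [0, 1], "cpu": [0.4, 0.5],
                            "latency_ms": [12.0, 15.0]}),
    "host-b": pd.DataFrame({"timestamp": [0, 1], "cpu": [0.6, 0.7],
                            "latency_ms": [20.0, 18.0]}),
}
tidy = to_tidy(raw)
print(validate(tidy) or "data passed all quality layers")
```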

Read more
Performance

Broadcast Strategies and Performance Evaluation of IEEE 802.15.4 in Wireless Body Area Networks (WBAN)

Rapid advances in sensors and ultra-low-power wireless communication have enabled a new generation of wireless sensor networks: Wireless Body Area Networks (WBAN). To the best of our knowledge, this paper is the first to address broadcast in WBAN. We first analyze several broadcast strategies inspired by the area of Delay Tolerant Networks (DTN). The proposed strategies are evaluated via the OMNeT++ simulator, which we enriched with realistic human body mobility models and channel models drawn from recent research on biomedical and health informatics. Contrary to common expectation, our results show that existing DTN research cannot be transposed to the WBAN area without significant modification: existing broadcast strategies for DTNs do not perform well under human body mobility. However, our extensive simulations give valuable insights and directions for designing efficient broadcast in WBAN. Furthermore, we propose a novel broadcast strategy that outperforms the existing ones in terms of end-to-end delay, network coverage, and energy consumption. Additionally, we investigated, as a question of independent interest, the ability of all the studied strategies to ensure the total-order delivery property when stressed with various packet rates. These investigations open new and challenging research directions.
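A toy sketch contrasting plain flooding with a probabilistic forwarding variant over random contact events; the contact model is a made-up stand-in for on-body mobility, not the paper's OMNeT++ models:

```python
# Sketch: epidemic vs. probabilistic broadcast over random contacts,
# an illustrative stand-in for time-varying on-body links.
import random

def broadcast(nodes, steps, forward_prob, contact_prob, seed=1):
    rng = random.Random(seed)
    informed = {0}                      # node 0 is the source
    delivery_time = {0: 0}
    for t in range(1, steps + 1):
        newly = set()
        for u in informed:
            for v in range(nodes):
                if v in informed or v in newly:
                    continue
                # A contact occurs with some probability; the holder
                # then forwards the packet with probability forward_prob.
                if rng.random() < contact_prob and rng.random() < forward_prob:
                    newly.add(v)
                    delivery_time[v] = t
        informed |= newly
    coverage = len(informed) / nodes
    return coverage, max(delivery_time.values())

for name, p in [("flooding", 1.0), ("probabilistic p=0.3", 0.3)]:
    cov, delay = broadcast(nodes=10, steps=50, forward_prob=p,
                           contact_prob=0.1)
    print(f"{name}: coverage={cov:.0%}, last delivery at step {delay}")
```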

Read more
Performance

CLTune: A Generic Auto-Tuner for OpenCL Kernels

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance over a generic, user-defined search space of possible parameter-value combinations. Example parameters include the OpenCL workgroup size, vector data-types, tile sizes, and loop unrolling factors. CLTune can be used in the following scenarios: 1) when there are too many tunable parameters to explore manually, 2) when performance portability across OpenCL devices is desired, or 3) when the optimal parameters change based on input argument values (e.g. matrix dimensions). The auto-tuner is generic, easy to use, open-source, and supports multiple search strategies including simulated annealing and particle swarm optimisation. CLTune is evaluated on two GPU case studies inspired by the recent successes in deep learning: 2D convolution and matrix-multiplication (GEMM). For 2D convolution, we demonstrate the need for auto-tuning by optimizing for different filter sizes, achieving performance on par with or better than the state-of-the-art. For matrix-multiplication, we use CLTune to explore a parameter space of more than two hundred thousand configurations, show the need for device-specific tuning, and outperform the clBLAS library on NVIDIA, AMD and Intel GPUs.
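The core loop of such a tuner can be pictured as a search over a parameter grid scored by measured kernel time. A minimal sketch of simulated annealing over such a space, where `measure` is a placeholder cost model standing in for an actual OpenCL compile-and-time run (not CLTune's API):

```python
# Sketch: simulated annealing over a kernel parameter space.
# `measure` is a stand-in for compiling and timing an OpenCL kernel.
import math
import random

space = {
    "workgroup_size": [32, 64, 128, 256],
    "vector_width": [1, 2, 4],
    "tile_size": [8, 16, 32],
    "unroll": [1, 2, 4, 8],
}

def measure(cfg):
    """Placeholder cost model; a real tuner times the compiled kernel."""
    return (abs(cfg["workgroup_size"] - 128) / 128
            + abs(cfg["tile_size"] - 16) / 16
            + 1.0 / cfg["vector_width"] + 0.1 * cfg["unroll"])

def neighbor(cfg, rng):
    """Perturb one randomly chosen parameter."""
    key = rng.choice(list(space))
    return {**cfg, key: rng.choice(space[key])}

rng = random.Random(0)
current = {k: rng.choice(v) for k, v in space.items()}
best, best_cost = current, measure(current)
cost, temp = best_cost, 1.0
for step in range(500):
    cand = neighbor(current, rng)
    cand_cost = measure(cand)
    # Accept improvements always; accept regressions with falling probability.
    if cand_cost < cost or rng.random() < math.exp((cost - cand_cost) / temp):
        current, cost = cand, cand_cost
        if cost < best_cost:
            best, best_cost = cand, cost
    temp *= 0.99
print("best configuration:", best)
```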

Read more
Performance

COCOA: Cold Start Aware Capacity Planning for Function-as-a-Service Platforms

Function-as-a-Service (FaaS) is increasingly popular in the software industry due to the implied cost savings in event-driven workloads and its synergy with DevOps. To size an on-premise FaaS platform, it is important to estimate the CPU and memory capacity required to serve the expected load. Given the service-level agreements, it is however challenging to account for the cold start issue during the sizing process. We have investigated the similarity of this problem to the hit rate improvement problem in TTL caches and concluded that solutions for TTL caches, although potentially applicable, lead to over-provisioning in FaaS. Thus, we propose a novel approach, COCOA, to solve this issue. COCOA uses a queueing-based approach to assess the effect of cold starts on FaaS response times. It also considers different memory consumption values depending on whether the function is idle or in execution. Using FaasSim, an event-driven FaaS simulator we have developed, we show that COCOA can reduce over-provisioning by over 70% for some workloads, while satisfying the service-level agreements.
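The cold start mechanics resemble a TTL cache: an invocation is "warm" if it arrives within the idle-expiry window of the previous one. A minimal sketch of that relationship under Poisson arrivals, where P(cold) = e^(-λT); this is the standard exponential inter-arrival argument, not COCOA's full queueing model:

```python
# Sketch: cold start probability under Poisson arrivals with an idle
# timeout, plus a simulation check. This is the basic TTL-style
# argument, not COCOA's queueing model.
import math
import random

def cold_start_prob(rate, idle_timeout):
    """P(inter-arrival > timeout) for exponential inter-arrivals."""
    return math.exp(-rate * idle_timeout)

def simulate(rate, idle_timeout, n=100_000, seed=0):
    rng = random.Random(seed)
    cold = sum(rng.expovariate(rate) > idle_timeout for _ in range(n))
    return cold / n

rate = 0.5            # invocations per minute
idle_timeout = 10.0   # minutes a container stays warm
print("analytical:", cold_start_prob(rate, idle_timeout))
print("simulated: ", simulate(rate, idle_timeout))
```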

Read more
Performance

Characterizing Deep Learning Training Workloads on Alibaba-PAI

Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads and, more importantly, the training performance given the underlying software framework and hardware configuration. In this paper, we characterize deep learning training workloads from the Platform of Artificial Intelligence (PAI) at Alibaba. We establish an analytical framework to investigate the detailed execution time breakdown of various workloads using different training architectures, in order to identify performance bottlenecks. Results show that weight/gradient communication during training takes almost 62% of total execution time on average across all our workloads. The computation part, involving both GPU computing and memory access, is not the biggest bottleneck based on the collective behavior of the workloads. We further evaluate attainable performance of the workloads on various potential software/hardware mappings, and explore implications for software architecture selection and hardware configuration. We identify that 60% of PS/Worker workloads can potentially be sped up when ported to the AllReduce architecture exploiting the high-speed NVLink GPU interconnect, and that on average a 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.
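A minimal sketch of the kind of execution-time breakdown used to flag communication as the bottleneck; the phase timings are made up, and the speedup estimate naively scales only the communication phase with bandwidth (a simplification, not the paper's analytical framework):

```python
# Sketch: per-iteration time breakdown and a naive bandwidth-scaling
# estimate of the speedup from a network upgrade. Numbers are illustrative.
phases_ms = {
    "gpu_compute": 120.0,
    "memory_access": 40.0,
    "weight_gradient_comm": 260.0,   # dominant, echoing the paper's finding
}

total = sum(phases_ms.values())
for phase, t in phases_ms.items():
    print(f"{phase}: {t / total:.0%} of iteration time")

# Naive estimate: communication time shrinks in proportion to bandwidth.
old_bw, new_bw = 25, 100  # Gbps
comm = phases_ms["weight_gradient_comm"]
new_total = total - comm + comm * old_bw / new_bw
print(f"estimated speedup: {total / new_total:.2f}x")
```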

Read more
Performance

Ciw: An open source discrete event simulation library

This paper introduces Ciw, an open source library for conducting discrete event simulations that has been developed in Python. The strengths of the library are illustrated in terms of best practice and reproducibility for computational research. An analysis of Ciw's performance and comparison to several alternative discrete event simulation frameworks is presented.
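For instance, a small M/M/3 queue can be specified and run in a few lines, following the library's documented usage (the rates and horizon here are arbitrary):

```python
# Sketch: an M/M/3 queue in Ciw, following the library's documented API.
import ciw

network = ciw.create_network(
    arrival_distributions=[ciw.dists.Exponential(0.2)],   # arrivals/min
    service_distributions=[ciw.dists.Exponential(0.1)],   # services/min
    number_of_servers=[3],
)

ciw.seed(1)  # fix the random seed for reproducibility
simulation = ciw.Simulation(network)
simulation.simulate_until_max_time(1440)  # one simulated day, in minutes

records = simulation.get_all_records()
waits = [r.waiting_time for r in records]
print("mean wait:", sum(waits) / len(waits))
```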

Read more
Performance

ClassyTune: A Performance Auto-Tuner for Systems in the Cloud

Performance tuning can improve system performance and thus enable a reduction in the cloud computing resources needed to support an application. Due to the ever-increasing number of parameters and the complexity of systems, there is a need to automate performance tuning for complicated systems in the cloud. State-of-the-art tuning methods adopt either an experience-driven or a data-driven approach. Data-driven tuning is attracting increasing attention, as it has wider applicability, but existing data-driven methods cannot fully address the challenges of sample scarcity and high dimensionality simultaneously. We present ClassyTune, a data-driven automatic configuration tuning tool for cloud systems. ClassyTune exploits the machine learning model of classification for auto-tuning. This exploitation enables the induction of more training samples without increasing the input dimension. Experiments on seven popular systems in the cloud show that ClassyTune can effectively tune system performance to up to seven times higher for high-dimensional configuration spaces, outperforming expert tuning and state-of-the-art auto-tuning solutions. We also describe a use case in which performance tuning enabled a 33% reduction in the computing resources needed to run an online stateless service.
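The classification trick can be sketched as follows: each ordered pair of sampled configurations becomes one training example whose label says which of the two performed better, so n measurements yield on the order of n² samples without adding input dimensions. A toy version with a hypothetical two-parameter system (the classifier choice and parameters are illustrative, not ClassyTune's):

```python
# Sketch of classification-based tuning: turn n measured configurations
# into pairwise "is the first faster?" samples. Parameters are hypothetical.
import itertools
import random

from sklearn.ensemble import RandomForestClassifier

rng = random.Random(0)

def run_system(cfg):
    """Stand-in for measuring real system performance (higher is better)."""
    buf, threads = cfg
    return -(buf - 64) ** 2 - (threads - 8) ** 2 + rng.gauss(0, 5)

# Measure a small sample of configurations.
configs = [(rng.choice(range(16, 129, 16)), rng.choice(range(1, 17)))
           for _ in range(20)]
perf = [run_system(c) for c in configs]

# Pairwise expansion: features are both configs, label = first is better.
X, y = [], []
for (i, a), (j, b) in itertools.permutations(enumerate(configs), 2):
    X.append(list(a) + list(b))
    y.append(int(perf[i] > perf[j]))

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Suggest the candidate predicted to beat the best measured config.
best_seen = configs[max(range(len(perf)), key=perf.__getitem__)]
candidates = [(b, t) for b in range(16, 129, 16) for t in range(1, 17)]
scores = model.predict_proba(
    [list(c) + list(best_seen) for c in candidates])[:, 1]
print("suggested config:", candidates[scores.argmax()])
```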

Read more
Performance

Cloud-based or On-device: An Empirical Study of Mobile Deep Inference

Modern mobile applications are benefiting significantly from advances in deep learning, e.g., implementing real-time image recognition and conversational systems. Given a trained deep learning model, applications usually need to perform a series of matrix operations on the input data in order to infer possible output values. Because of computational complexity and size constraints, these trained models are often hosted in the cloud. To utilize these cloud-based models, mobile apps have to send input data over the network. While cloud-based deep learning can provide reasonable response times for mobile apps, it restricts the use case scenarios, e.g., mobile apps need network access. With mobile-specific deep learning optimizations, it is now possible to employ on-device inference. However, because mobile hardware, such as the GPU and memory, can be very limited compared to its desktop counterpart, it is important to understand the feasibility of this new on-device deep learning inference architecture. In this paper, we empirically evaluate the inference performance of three Convolutional Neural Networks (CNNs) using a benchmark Android application we developed. Our measurement and analysis suggest that on-device inference can cost up to two orders of magnitude more in response time and energy than cloud-based inference, and that model loading and probability computation are the two performance bottlenecks for on-device deep inference.
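A minimal sketch of the measurement idea, comparing a local inference call against a cloud round trip; both functions and their delays are placeholders, not the paper's benchmark app or models:

```python
# Sketch: timing harness contrasting on-device and cloud-based inference.
# `local_infer` and `cloud_infer` are placeholders with fake delays.
import statistics
import time

def local_infer(image):
    """Stand-in for on-device CNN inference (model load + forward pass)."""
    time.sleep(0.05)                 # pretend forward pass
    return "label"

def cloud_infer(image):
    """Stand-in for serializing the input and calling a cloud model."""
    time.sleep(0.01)                 # pretend network + server time
    return "label"

def benchmark(fn, image, runs=20):
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(image)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

image = b"\x00" * (224 * 224 * 3)    # dummy 224x224 RGB input
print("on-device median latency:", benchmark(local_infer, image))
print("cloud     median latency:", benchmark(cloud_infer, image))
```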

Read more
