Featured Researches

Performance

A figure of merit for describing the performance of scaling of parallelization

With the spread of multi- and many-core processors, an increasingly common task is to re-implement source code originally written for a single processor so that it runs on multiple cores. Since this is a serious investment, it is important to decide how much the effort pays off, and whether the resulting implementation performs as well as it could. Amdahl's law provides a theoretical upper limit on the performance gain reachable through parallelizing the code, but it requires detailed architectural knowledge of the program, does not account for the housekeeping activity that parallelization requires, and cannot tell how well the actual parallelized implementation performs. The present paper suggests a quantitative measure for that purpose. This figure of merit is derived experimentally, from measured running times and the number of threads/cores. It can be used to quantify the parallelization technology in use, the connection between the computing units, the acceleration technology under the given conditions, the communication method within an SoC, or the performance of the software team/compiler.
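As an illustration of deriving such a figure of merit purely from measured running times and core counts, the well-known Karp-Flatt metric computes an experimentally determined serial fraction. This is a related classical measure, sketched here for context; it is not necessarily the metric the paper proposes:

```python
def karp_flatt(t_serial, t_parallel, p):
    """Experimentally determined serial fraction e (Karp-Flatt metric):
    e = (1/S - 1/p) / (1 - 1/p), where S = t_serial / t_parallel
    is the measured speedup on p cores."""
    speedup = t_serial / t_parallel
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

# Illustrative measurements: 100 s serial, 30 s on 4 cores.
e = karp_flatt(100.0, 30.0, 4)
```

Values near zero indicate near-perfect scaling; a serial fraction that grows with the core count is a symptom of parallelization overhead rather than inherently serial work.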

Performance

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance, parallel applications. In today's increasingly heterogeneous computing environment, this task must be accomplished efficiently across multiple architectures, including massively parallel coprocessors like GPUs. To address this challenge, we present an approach for constructing customizable, cross-machine performance models for GPU kernels, including a mechanism to automatically and symbolically gather performance-relevant kernel operation counts, a tool for formulating mathematical models using these counts, and a customizable parameterized collection of benchmark kernels used to calibrate models to GPUs in a black-box fashion. Our approach empowers a user to manage trade-offs between model accuracy, evaluation speed, and generalizability. A user can define a model and customize the calibration process, making it as simple or complex as desired, and as application-targeted or general as desired. To evaluate our approach, we demonstrate both linear and nonlinear models; each example models execution times for multiple variants of a particular computation: two matrix multiplication variants, four Discontinuous Galerkin (DG) differentiation operation variants, and two 2-D five-point finite difference stencil variants. For each variant, we present accuracy results on GPUs from multiple vendors and hardware generations. We view this customizable approach as a response to a central question in GPU performance modeling: how can we model GPU performance in a cost-explanatory fashion while maintaining accuracy, evaluation speed, portability, and ease of use, an attribute that we believe precludes manual collection of kernel or hardware statistics?
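A minimal sketch of the calibration idea, under the assumption of a two-term linear model (time = a*flops + b*bytes) fitted by least squares over the normal equations. The coefficients and measurements below are invented for illustration and are not taken from the paper's tooling:

```python
def calibrate(measurements):
    """Least-squares fit of t = a*flops + b*bytes via the 2x2 normal equations.
    measurements: list of (flops, bytes_moved, seconds) tuples."""
    sff = sfb = sbb = sft = sbt = 0.0
    for f, b, t in measurements:
        sff += f * f; sfb += f * b; sbb += b * b
        sft += f * t; sbt += b * t
    det = sff * sbb - sfb * sfb
    a = (sbb * sft - sfb * sbt) / det   # cost per flop
    b = (sff * sbt - sfb * sft) / det   # cost per byte moved
    return a, b

def predict(model, flops, bytes_moved):
    a, b = model
    return a * flops + b * bytes_moved

# Synthetic "measurements" from a hypothetical device doing
# 1 Gflop/s and 100 MB/s (illustrative numbers only).
A_TRUE, B_TRUE = 1e-9, 1e-8
data = [(f, b, A_TRUE * f + B_TRUE * b)
        for f, b in [(1e6, 4e5), (2e6, 1e6), (4e6, 1.2e6)]]
model = calibrate(data)
```

In the paper's setting the operation counts would come from the symbolic counting mechanism and the calibration runs from the benchmark kernel collection; the fit itself is the easy part.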

Performance

A model-driven approach for a new generation of adaptive libraries

Efficient high-performance libraries often expose multiple tunable parameters to provide highly optimized routines. These can range from simple loop unroll factors or vector sizes all the way to algorithmic changes, given that some implementations can be more suitable for certain devices by exploiting hardware characteristics such as local memories and vector units. Traditionally, such parameters and algorithmic choices are tuned and then hard-coded for a specific architecture and for certain characteristics of the inputs. However, emerging applications are often data-driven, so traditional approaches are not effective across the wide range of inputs and architectures used in practice. In this paper, we present a new adaptive framework for data-driven applications which uses a predictive model, trained on synthetic and real datasets, to select the optimal algorithmic parameters. We demonstrate the effectiveness of our approach on a BLAS library, and specifically on its matrix multiplication routine. We present experimental results for two GPU architectures and show significant performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an embedded ARM Mali GPU) compared to a traditionally optimized library.
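A toy sketch of model-driven variant selection: a nearest-neighbour rule in log-size space picks a kernel variant based on (hypothetical) offline benchmark results. The variant names and training shapes are invented for illustration; the paper trains a predictive model, not necessarily this rule:

```python
import math

# Hypothetical offline benchmark results: (M, N, K) shape -> fastest variant.
TRAINING = [
    ((128, 128, 128), "tiled_small"),
    ((256, 256, 256), "tiled_small"),
    ((2048, 2048, 2048), "blocked_large"),
    ((8192, 8192, 8192), "blocked_large"),
]

def select_variant(training, shape):
    """Return the variant of the nearest training shape (1-NN in log space,
    since matrix dimensions span several orders of magnitude)."""
    def dist(a, b):
        return sum((math.log(x) - math.log(y)) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda sample: dist(sample[0], shape))[1]
```

The point of the framework is that this decision happens at run time per input, instead of hard-coding one variant per architecture at build time.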

Performance

A note on integrating products of linear forms over the unit simplex

Integrating a product of linear forms over the unit simplex can be done in polynomial time if the number of variables n is fixed (V. Baldoni et al., 2011). In this note, we highlight that this problem is equivalent to obtaining the normalizing constant of state probabilities for a popular class of Markov processes used in queueing network theory. In light of this equivalence, we survey existing computational algorithms developed in queueing theory that can be used for exact integration. For example, under some regularity conditions, queueing theory algorithms can exactly integrate a product of linear forms of total degree N by solving N systems of linear equations.
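The integral itself is easy to approximate numerically, which is useful for sanity-checking exact algorithms. Below is a Monte Carlo sketch over the unit simplex, sampling uniformly via the standard Dirichlet construction; this is an illustrative baseline of ours, not one of the queueing-theory algorithms the note surveys:

```python
import math
import random

def sample_simplex(n, rng):
    """Uniform point in the unit simplex {x >= 0, sum(x) <= 1}:
    draw Dirichlet(1,...,1) on n+1 coordinates and drop the last."""
    e = [-math.log(1.0 - rng.random()) for _ in range(n + 1)]
    s = sum(e)
    return [v / s for v in e[:n]]

def mc_integral(linear_forms, n, samples=100000, seed=1):
    """Estimate the integral over the unit simplex of a product of
    linear forms, each given as (coeffs, const) meaning sum(c_i x_i) + const."""
    rng = random.Random(seed)
    vol = 1.0 / math.factorial(n)  # volume of the unit simplex in R^n
    total = 0.0
    for _ in range(samples):
        x = sample_simplex(n, rng)
        p = 1.0
        for coeffs, const in linear_forms:
            p *= sum(c * xi for c, xi in zip(coeffs, x)) + const
        total += p
    return vol * total / samples

# Known values over the triangle {x, y >= 0, x + y <= 1}:
# integral of x is 1/6, integral of x*y is 1/24.
est_x = mc_integral([((1.0, 0.0), 0.0)], 2)
est_xy = mc_integral([((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)], 2)
```

An exact method (such as the queueing algorithms referenced above) should agree with these estimates to within the Monte Carlo error.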

Performance

A refined and asymptotic analysis of optimal stopping problems of Bruss and Weber

The classical secretary problem has been generalized over the years in several directions. In this paper we confine our interest to those generalizations which have to do with the more general problem of stopping on a last observation of a specific kind. We follow Dendievel (where a bibliography can be found), who studies several types of such problems, mainly initiated by Bruss and Weber. Whether in discrete time or continuous time, whether all parameters are known or must be sequentially estimated, we shall call such problems simply "Bruss-Weber problems". Our contribution in the present paper is a refined analysis of several problems in this class and a study of the asymptotic behaviour of solutions. The problems we consider center around the following model. Let X_1, X_2, ..., X_n be a sequence of independent random variables which can take three values: {+1, -1, 0}. Let p := P(X_i = 1), p' := P(X_i = -1), q := P(X_i = 0), with p >= p' and p + p' + q = 1. The goal is to maximize the probability of stopping on a value +1 or -1 appearing for the last time in the sequence. Following a suggestion by Bruss, we have also analyzed an x-strategy with incomplete information: the cases p known, n unknown; then n known, p unknown; and finally n, p unknown are considered. We also present simulations of the corresponding complete selection algorithm.
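For intuition, the classical odds algorithm of Bruss solves the special case of stopping on the last success of a sequence of independent indicators. The sketch below simplifies the {+1, -1, 0} setting by treating any nonzero X_i as a success; the parameters are illustrative and this is not the paper's refined analysis:

```python
import random

def odds_threshold(p_success, n):
    """Bruss odds theorem: stop at the first success at or after index s*,
    where s* is the largest index from which the summed odds still reach 1."""
    r = p_success / (1.0 - p_success)
    total = 0.0
    for s in range(n, 0, -1):
        total += r
        if total >= 1.0:
            return s
    return 1

def win_probability(p_success, n, trials=100000, seed=7):
    """Empirical probability of stopping exactly on the last success."""
    rng = random.Random(seed)
    s_star = odds_threshold(p_success, n)
    wins = 0
    for _ in range(trials):
        xs = [rng.random() < p_success for _ in range(n)]
        stop = next((i for i in range(s_star - 1, n) if xs[i]), None)
        if stop is not None and not any(xs[stop + 1:]):
            wins += 1
    return wins / trials
```

With p_success = 0.15 and n = 20, the odds theorem gives s* = 15 and a theoretical win probability of 6 * 0.15 * 0.85^5, roughly 0.399, which the simulation should match closely.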

Performance

A refined mean field approximation of synchronous discrete-time population models

Mean field approximation is a popular method to study the behaviour of stochastic models composed of a large number of interacting objects. When the objects are asynchronous, the mean field approximation of a population model can be expressed as an ordinary differential equation. When the objects are (clock-)synchronous, the mean field approximation is a discrete-time dynamical system. We focus on the latter and study the accuracy of the mean field approximation in this setting. We extend a result that was shown for the continuous-time case and prove that expected performance indicators estimated by the mean field approximation are O(1/N)-accurate. We provide simple expressions to effectively compute the asymptotic error of the mean field approximation, for finite time-horizon and steady state, and we use this computed error to propose what we call a \emph{refined} mean field approximation. We show, using a few numerical examples, that this technique improves the quality of the approximation compared to the classical mean field approximation, especially for relatively small population sizes.
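A minimal example of the objects involved, using a simple synchronous SIS-style epidemic population model (our illustration, not taken from the paper): the deterministic mean-field map for the infected fraction alongside one stochastic trajectory of the finite-N system:

```python
import random

def mean_field_step(x, beta, gamma):
    """Deterministic mean-field recursion for the infected fraction x:
    infected recover w.p. gamma, susceptibles get infected w.p. beta*x."""
    return x * (1.0 - gamma) + (1.0 - x) * beta * x

def stochastic_step(k, n, beta, gamma, rng):
    """One synchronous step of the finite-N system: k infected out of n agents."""
    x = k / n
    recovered = sum(rng.random() < gamma for _ in range(k))
    newly_infected = sum(rng.random() < beta * x for _ in range(n - k))
    return k - recovered + newly_infected

BETA, GAMMA = 0.3, 0.1

# Mean field: iterate the deterministic map; it converges to x* = 1 - gamma/beta.
x = 0.5
for _ in range(200):
    x = mean_field_step(x, BETA, GAMMA)

# Finite system: one trajectory with N = 200 agents, hovering near x*
# with fluctuations that shrink as N grows.
rng = random.Random(0)
N, k = 200, 100
for _ in range(200):
    k = stochastic_step(k, N, BETA, GAMMA, rng)
```

The paper's contribution is quantifying the gap between these two objects (the O(1/N) bias) and correcting for it with a refined approximation.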

Performance

A review of analytical performance modeling and its role in computer engineering and science

This article is a review of analytical performance modeling for computer systems. It discusses the motivation for this area of research, examines key issues, introduces some ideas, illustrates how it is applied, and points out a role that it can play in developing Computer Science.
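As a concrete taste of what analytical performance modeling looks like, the textbook M/M/1 queue yields closed-form utilization, mean population, and mean response time; this classic example is ours, not drawn from the article:

```python
def mm1_metrics(arrival_rate, service_rate):
    """Closed-form metrics for an M/M/1 queue (Poisson arrivals,
    exponential service, one server)."""
    assert arrival_rate < service_rate, "unstable: arrivals outpace service"
    rho = arrival_rate / service_rate          # server utilization
    l = rho / (1.0 - rho)                      # mean number in system
    w = 1.0 / (service_rate - arrival_rate)    # mean time in system
    return rho, l, w

# Illustrative rates: 8 jobs/s arriving at a server handling 10 jobs/s.
rho, l, w = mm1_metrics(8.0, 10.0)
```

Note how Little's law (L = lambda * W) falls out of the formulas: 4 = 8 * 0.5 here. Such sanity checks across independent relations are a recurring theme in analytical modeling.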

Performance

A survey of sparse matrix-vector multiplication performance on large matrices

We contribute a third-party survey of sparse matrix-vector (SpMV) product performance on industrial-strength, large matrices using: (1) The SpMV implementations in Intel MKL, the Trilinos project (Tpetra subpackage), the CUSPARSE library, and the CUSP library, each running on modern architectures. (2) NVIDIA GPUs and Intel multi-core CPUs (supported by each software package). (3) The CSR, BSR, COO, HYB, and ELL matrix formats (supported by each software package).
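For reference, the CSR format mentioned above stores a sparse matrix as three flat arrays: nonzero values, their column indices, and per-row offsets into those arrays. A straightforward (unoptimized) SpMV kernel over that layout looks like this sketch:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x where A is stored in Compressed Sparse Row form:
    values[j] is a nonzero, col_idx[j] its column, and the nonzeros of
    row i occupy positions row_ptr[i] .. row_ptr[i+1]-1."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[j] * x[col_idx[j]]
    return y

# The 3x3 matrix [[4,0,1],[0,2,0],[3,0,5]] in CSR form:
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

The libraries surveyed differ mainly in how they reorganize this loop nest (and the other formats: BSR blocks rows, COO stores explicit (row, col, value) triples, ELL pads rows to uniform length, HYB mixes ELL and COO) to suit each architecture's memory system.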

Performance

AI Benchmark: All About Deep Learning on Smartphones in 2019

The performance of mobile AI accelerators has been evolving rapidly in the past two years, nearly doubling with each new generation of SoCs. The current 4th generation of mobile NPUs is already approaching the results of CUDA-compatible Nvidia graphics cards presented not long ago, which together with the increased capabilities of mobile deep learning frameworks makes it possible to run complex and deep AI models on mobile devices. In this paper, we evaluate the performance and compare the results of all chipsets from Qualcomm, HiSilicon, Samsung, MediaTek and Unisoc that are providing hardware acceleration for AI inference. We also discuss the recent changes in the Android ML pipeline and provide an overview of the deployment of deep learning models on mobile devices. All numerical results provided in this paper can be found and are regularly updated on the official project website: this http URL.

Performance

AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite

Domain-specific software and hardware co-design is encouraging, as it is much easier to achieve efficiency for fewer tasks. Agile domain-specific benchmarking speeds up the process, as it provides not only relevant design inputs but also relevant metrics and tools. Unfortunately, modern workloads like big data, AI, and Internet services dwarf traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges. This paper proposes an agile domain-specific benchmarking methodology. Together with seventeen industry partners, we identify ten important end-to-end application scenarios, from which sixteen representative AI tasks are distilled as the AI component benchmarks. We propose permutations of essential AI and non-AI component benchmarks as end-to-end benchmarks. An end-to-end benchmark is a distillation of the essential attributes of an industry-scale application. We design and implement a highly extensible, configurable, and flexible benchmark framework, on the basis of which we propose a guideline for building end-to-end benchmarks and present the first end-to-end Internet service AI benchmark. The preliminary evaluation shows the value of our benchmark suite, AIBench, against MLPerf and TailBench for hardware and software designers, micro-architectural researchers, and code developers. The specifications, source code, testbed, and results are publicly available from the web site \url{this http URL}.
