Anton Lokhmotov | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Anton Lokhmotov is active.

Explore More

Publication

Featured researches published by Anton Lokhmotov.

high performance embedded architectures and compilers | 2010

Automatically tuning sparse matrix-vector multiplication for GPU architectures

Alexander Monakov; Anton Lokhmotov; Arutyun Avetisyan

Graphics processors are increasingly used in scientific applications due to their high computational power, which comes from hardware with multiple-level parallelism and memory hierarchy. Sparse matrix computations frequently arise in scientific applications, for example, when solving PDEs on unstructured grids. However, traditional sparse matrix algorithms are difficult to efficiently parallelize for GPUs due to irregular patterns of memory references. In this paper we present a new storage format for sparse matrices that better employs locality, has low memory footprint and enables automatic specialization for various matrices and future devices via parameter tuning. Experimental evaluation demonstrates significant speedups compared to previously published results.

high performance embedded architectures and compilers | 2008

Deriving Efficient Data Movement from Decoupled Access/Execute Specifications

Lee W. Howes; Anton Lokhmotov; Alastair F. Donaldson; Paul H. J. Kelly

On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the execution schedule of a computation kernel, the compiler or run-time system can derive efficient data movement, even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled Access/Execute specifications, allowing for automatic communication optimisations such as software pipelining and data reuse. We demonstrate the ease and efficiency of programming the Cell Broadband Engine architecture using these classes by implementing a set of benchmarks, which exhibit data reuse and non-affine access functions, and by comparing these implementations against alternative implementations, which use hand-written DMA transfers and software-based caching.

international conference on parallel architectures and compilation techniques | 2015

PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming

Riyadh Baghdadi; Ulysse Beaugnon; Albert Cohen; Tobias Grosser; Michael Kruse; Chandan Reddy; Sven Verdoolaege; Adam Betts; Alastair F. Donaldson; Jeroen Ketema; Javed Absar; Sven Van Haastregt; Alexey Kravets; Anton Lokhmotov; Róbert Dávid; Elnar Hajiyev

Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously-defined subset of GNU C99-enriched with additional language constructs-that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries, and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code. To demonstrate the potential and performance portability of PENCIL and the PENCIL-to-OpenCL compiler, we consider a number of image processing kernels, a set of benchmarks from the Rodinia and SHOC suites, and DSL embedding scenarios for linear algebra (BLAS) and signal processing radar applications (SpearDE), and present experimental results for four GPU platforms: AMD Radeon HD 5670 and R9 285, NVIDIA GTX 470, and ARM Mali-T604.

international conference on parallel processing | 2011

Generating GPU code from a high-level representation for image processing kernels

Richard Membarth; Anton Lokhmotov; Jürgen Teich

We present a framework for representing image processing kernels based on decoupled access/execute metadata, which allow the programmer to specify both execution constraints and memory access pattern of a kernel. The framework performs source-to-source translation of kernels expressed in high-level framework-specific C++ classes into low-level CUDA or OpenCL code with effective device-dependent optimizations such as global memory padding for memory coalescing and optimal memory bandwidth utilization. We evaluate the framework on several image filters, comparing generated code against highly-optimized CPU and GPU versions in the popular OpenCV library.

languages, compilers, and tools for embedded systems | 2014

VOBLA: a vehicle for optimized basic linear algebra

Ulysse Beaugnon; Alexey Kravets; Sven Van Haastregt; Riyadh Baghdadi; David Tweed; Javed Absar; Anton Lokhmotov

We present VOBLA, a domain-specific language designed for programming linear algebra libraries. VOBLA is compiled to PENCIL, a domain independent intermediate language designed for efficient mapping to accelerator architectures such as GPGPUs. PENCIL is compiled to efficient, platform-specific OpenCL code using techniques based on the polyhedral model. This approach addresses both the programmer productivity and performance portability concerns associated with accelerator programming.n We demonstrate our approach by using VOBLA to implement a BLAS library. We have evaluated the performance of OpenCL code generated using our compilation flow on ARM Mali, AMD Radeon, and AMD Opteron platforms. The generated code is currently on average 1.9x slower than highly hand-optimized OpenCL code, but on average 8.1x faster than straightforward OpenCL code. Given that the VOBLA coding takes significantly less effort compared to hand-optimizing OpenCL code, we believe our approach leads to improved productivity and performance portability.

international conference on parallel processing | 2009

Towards metaprogramming for parallel systems on a chip

Lee W. Howes; Anton Lokhmotov; Alastair F. Donaldson; Paul H. J. Kelly

We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata which allow the programmer to specify both execution constraints and memory access pattern of a computation kernel.

european conference on parallel processing | 2009

Compile-Time and Run-Time Issues in an Auto-Parallelisation System for the Cell BE Processor

Alastair F. Donaldson; Paul Keir; Anton Lokhmotov

We describe compiler and run-time optimisations for effective auto-parallelisation of C++ programs on the Cell BE architecture. Auto-parallelisation is made easier by annotating sieve scopes , which abstract the read in, compute in parallel, write out processing paradigm. We show that the semantics of sieve scopes enables data movement optimisations, such as re-organising global memory reads to minimise DMA transfers and streaming reads from uniformly accessed arrays. We also describe run-time optimisations for committing side-effects to main memory. We provide experimental results showing the benefits of our optimisations, and compare the Sieve-Cell system with IBMs OpenMP implementation for Cell.

architectural support for programming languages and operating systems | 2018

Multi-objective autotuning of MobileNets across the full software/hardware stack

Anton Lokhmotov; Nikolay Chunosov; Flavio Vella; Grigori Fursin

We present a customizable Collective Knowledge workflow to study the execution time vs. accuracy trade-offs for the MobileNets CNN family. We use this workflow to evaluate MobileNets on Arm Cortex CPUs using TensorFlow and Arm Mali GPUs using several versions of the Arm Compute Library. Our optimizations for the Arm Bifrost GPU architecture reduce the execution time by 2--3 times, while lying on a Pareto-optimal frontier. We also highlight the challenge of maintaining the accuracy when deploying CNN models across diverse platforms. We make all the workflow components (models, programs, scripts, etc.) publicly available to encourage further exploration by the community.

arXiv: Programming Languages | 2013

PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs

Riyadh Baghdadi; Albert Cohen; Serge Guelton; Sven Verdoolaege; Jun Inoue; Tobias Grosser; Georgia Kouveli; Alexey Kravets; Anton Lokhmotov; Cedric Nugteren; Fraser Waters; Alastair F. Donaldson

Archive | 2015

PENCIL Language Specification

Riyadh Baghdadi; Albert Cohen; Tobias Grosser; Sven Verdoolaege; Anton Lokhmotov; Javed Absar; Sven Van Haastregt; Alexey Kravets; Alastair F. Donaldson

Explore More