Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mikhail Smelyanskiy is active.

Publication


Featured research published by Mikhail Smelyanskiy.


international symposium on computer architecture | 2010

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Victor W. Lee; Changkyu Kim; Jatin Chhugani; Michael E. Deisher; Daehyun Kim; Anthony D. Nguyen; Nadathur Satish; Mikhail Smelyanskiy; Srinivas Chennupaty; Per Hammarlund; Ronak Singhal; Pradeep Dubey

Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze which architectural features contribute to the performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
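
The paper's throughput kernels are not reproduced here, but the sketch below shows the general shape of such a kernel: a SAXPY-style loop in C whose iterations are independent and can be spread across cores and packed into SIMD lanes. The function name, OpenMP pragma, and types are illustrative choices, not taken from the paper.

```c
#include <stddef.h>

/* Illustrative throughput kernel (SAXPY): every iteration is independent,
 * so the loop parallelizes across cores and vectorizes across SIMD lanes.
 * This is a generic example, not one of the kernels evaluated in the paper. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```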


international conference on supercomputing | 2013

Efficient sparse matrix-vector multiplication on x86-based many-core processors

Xing Liu; Mikhail Smelyanskiy; Edmond Chow; Pradeep Dubey

Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks which may limit performance even before memory bandwidth: (a) low SIMD efficiency due to sparsity, (b) overhead due to irregular memory accesses, and (c) load imbalance due to non-uniform matrix structures. We describe an efficient implementation of SpMV on the Intel® Xeon Phi™ coprocessor, codenamed Knights Corner (KNC), that addresses the above challenges. Our implementation exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses. By using a specialized data structure with careful load balancing, we attain performance on average close to 90% of KNC's achievable memory bandwidth on a diverse set of sparse matrices. Furthermore, we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on dual Intel® Xeon® Processor E5-2680 and the NVIDIA Tesla K20X architecture.
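
The specialized data structure and load-balancing scheme from the paper are not reproduced here. For reference, the baseline CSR kernel that such optimizations start from looks like the sketch below, where the indirect load x[col[j]] is the source of the irregular accesses and low SIMD efficiency noted above; the array names are illustrative.

```c
#include <stddef.h>

/* Baseline CSR sparse matrix-vector multiply, y = A*x.
 * The indirect load x[col[j]] causes the irregular memory accesses and low
 * SIMD efficiency discussed above; the paper's specialized format and load
 * balancing are not shown in this sketch. */
void spmv_csr(size_t nrows,
              const size_t *rowptr,   /* nrows+1 entries */
              const int *col,         /* column index per nonzero */
              const double *val,      /* value per nonzero */
              const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[col[j]];
        y[i] = sum;
    }
}
```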


international parallel and distributed processing symposium | 2013

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor

Alexander Heinecke; Karthikeyan Vaidyanathan; Mikhail Smelyanskiy; Alexander Kobotov; Roman Dubtsov; Greg Henry; Aniruddha G. Shet; George Z. Chrysos; Pradeep Dubey

Dense linear algebra has traditionally been used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ coprocessor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency, the highest published coprocessor efficiency. Like the native version, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering a total performance of 107 TFLOPS.
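
The tuned KNC DGEMM from the paper is not shown here. As background, Linpack spends almost all of its time in DGEMM, and even the simple cache-blocked version sketched below (with an arbitrary block size, square row-major matrices, and no vectorization) illustrates the kernel whose efficiency determines the reported Linpack numbers.

```c
#include <stddef.h>

#define BS 64  /* illustrative block size, not the tuning used in the paper */

/* Naive cache-blocked DGEMM, C += A * B, for square n x n row-major matrices. */
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
    for (size_t kk = 0; kk < n; kk += BS)
    for (size_t jj = 0; jj < n; jj += BS)
        for (size_t i = ii; i < ii + BS && i < n; i++)
        for (size_t k = kk; k < kk + BS && k < n; k++) {
            double a = A[i * n + k];
            for (size_t j = jj; j < jj + BS && j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```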


international parallel and distributed processing symposium | 2013

Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi Coprocessors

Simon J. Pennycook; Christopher J. Hughes; Mikhail Smelyanskiy; Stephen A. Jarvis

We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia's miniMD benchmark, which we demonstrate using three SIMD widths (128-, 256-, and 512-bit). The applicability of these optimisations to wider SIMD is discussed, and we show that the conventional approach of exposing more parallelism through redundant computation is not necessarily best. In single precision, our optimised implementation is up to 5x faster than the original scalar code running on Intel® Xeon® processors with 256-bit SIMD, and adding a single Intel® Xeon Phi™ coprocessor provides up to an additional 2x performance increase. These results demonstrate: (i) the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware; and (ii) the considerable performance increase afforded by the use of Intel® Xeon Phi™ coprocessors for highly parallel workloads.
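
A minimal sketch of the kind of loop under discussion: a simplified Lennard-Jones force computation over a neighbour list, roughly in the style of miniMD, where the loads of scattered neighbour coordinates are the gathers whose SIMD cost the paper analyses. The data layout and parameter names are assumptions, not Sandia's actual code.

```c
#include <stddef.h>

/* Simplified Lennard-Jones force loop over a neighbour list.
 * The loads x[j], y[j], z[j] for scattered indices j are the gather
 * operations that limit SIMD efficiency; this is illustrative only. */
void lj_forces(size_t natoms, const size_t *neigh_ptr, const int *neigh,
               const float *x, const float *y, const float *z,
               float *fx, float *fy, float *fz,
               float cutsq, float sigma6, float epsilon)
{
    for (size_t i = 0; i < natoms; i++) {
        float fxi = 0.f, fyi = 0.f, fzi = 0.f;
        for (size_t k = neigh_ptr[i]; k < neigh_ptr[i + 1]; k++) {
            int j = neigh[k];                 /* scattered index -> gather */
            float dx = x[i] - x[j];
            float dy = y[i] - y[j];
            float dz = z[i] - z[j];
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < cutsq) {
                float inv2 = 1.0f / r2;
                float inv6 = inv2 * inv2 * inv2 * sigma6;
                float force = 48.0f * epsilon * inv6 * (inv6 - 0.5f) * inv2;
                fxi += force * dx;
                fyi += force * dy;
                fzi += force * dz;
            }
        }
        fx[i] += fxi;
        fy[i] += fyi;
        fz[i] += fzi;
    }
}
```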


IEEE Transactions on Visualization and Computer Graphics | 2009

Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures

Mikhail Smelyanskiy; David R. Holmes; Jatin Chhugani; Alan Larson; Doug Carmean; Dennis P. Hanson; Pradeep Dubey; Kurt E. Augustine; Daehyun Kim; Alan B. Kyker; Victor W. Lee; Anthony D. Nguyen; Larry Seiler; Richard A. Robb

Medical volumetric imaging requires high fidelity, high performance rendering algorithms. We motivate and analyze new volumetric rendering algorithms that are suited to modern parallel processing architectures. First, we describe the three major categories of volume rendering algorithms and confirm through an imaging scientist-guided evaluation that ray-casting is the most acceptable. We describe a thread- and data-parallel implementation of ray-casting that makes it amenable to key architectural trends of three modern commodity parallel architectures: multi-core, GPU, and an upcoming many-core Intel® architecture code-named Larrabee. We achieve more than an order of magnitude performance improvement on a number of large 3D medical datasets. We further describe a data compression scheme that significantly reduces data-transfer overhead. This allows our approach to scale well to large numbers of Larrabee cores.
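
The per-ray inner loop of a ray-caster is the part that parallelizes so naturally across these architectures. A minimal sketch of front-to-back compositing with early ray termination is shown below; sample() and transfer() are placeholder callbacks standing in for volume sampling and classification, not the paper's implementation.

```c
/* Per-ray front-to-back compositing loop at the core of ray-casting. */
typedef struct { float r, g, b, a; } rgba;

rgba cast_ray(const float *volume, int nsteps,
              rgba (*transfer)(float),            /* classification */
              float (*sample)(const float *, int)) /* volume sampling along the ray */
{
    rgba dst = {0, 0, 0, 0};
    for (int s = 0; s < nsteps; s++) {
        rgba src = transfer(sample(volume, s));   /* classify the sample */
        float w = (1.0f - dst.a) * src.a;         /* remaining transparency */
        dst.r += w * src.r;
        dst.g += w * src.g;
        dst.b += w * src.b;
        dst.a += w;
        if (dst.a > 0.99f)
            break;                                /* early ray termination */
    }
    return dst;
}
```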


high performance computer architecture | 2001

Stack value file: custom microarchitecture for the stack

Hsien-Hsin S. Lee; Mikhail Smelyanskiy; Chris J. Newburn; Gary S. Tyson

As processor performance increases, there is a corresponding increase in the demands on the memory system, including caches. Research papers have proposed partitioning the cache into instruction/data, temporal/non-temporal, and/or stack/non-stack regions. Each of these designs can improve performance by constructing two separate structures which can be probed in parallel while reducing contention. In this paper, we propose a new memory organization that partitions data references into stack and non-stack regions. Non-stack references are routed to a conventional cache. Stack references, on the other hand, are shown to have several characteristics that can be leveraged to improve performance using a less conventional storage organization. This paper enumerates those characteristics and proposes a new microarchitectural feature, the stack value file (SVF), which exploits them to improve instruction-level parallelism, reduce stack access latencies, reduce demand on the first-level cache, and reduce data bus traffic. Our results show that the SVF can improve execution performance by 29 to 65% while reducing overhead traffic for the stack region by many orders of magnitude over cache structures of the same size.
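
As a rough software analogy of the routing decision described above (the SVF itself is a hardware structure, not code), the sketch below classifies each load by whether its address falls in the current stack region and services it from a small SVF-like array instead of the data cache. All names, sizes, and the indexing scheme are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define SVF_WORDS 1024              /* illustrative capacity */

/* Toy model of stack/non-stack routing: references inside the current stack
 * region are serviced by a small stack value file (SVF); everything else
 * goes to the conventional data cache. */
typedef struct {
    uint64_t stack_base;            /* highest stack address (stack grows down) */
    uint64_t sp;                    /* current stack pointer */
    uint64_t svf[SVF_WORDS];        /* SVF storage, indexed by word offset */
} core_state;

static uint64_t dcache_load(uint64_t addr)
{
    (void)addr;
    return 0;                       /* stand-in for the conventional cache path */
}

static bool is_stack_ref(const core_state *c, uint64_t addr)
{
    return addr >= c->sp && addr < c->stack_base;
}

uint64_t load(core_state *c, uint64_t addr)
{
    if (is_stack_ref(c, addr)) {
        uint64_t idx = ((c->stack_base - addr) / 8) % SVF_WORDS;
        return c->svf[idx];         /* low-latency SVF access */
    }
    return dcache_load(addr);       /* non-stack reference: normal cache */
}
```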


ieee international conference on high performance computing data and analytics | 2012

Optimization of geometric multigrid for emerging multi- and manycore processors

Samuel Williams; Dhiraj D. Kalamkar; Amik Singh; Anand M. Deshpande; Brian Van Straalen; Mikhail Smelyanskiy; Ann S. Almgren; Pradeep Dubey; John Shalf; Leonid Oliker

Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this paper, we explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based Infiniband clusters, as well as the new Intel® Xeon Phi coprocessor (Knights Corner). Our work examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding, dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations.
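
For readers unfamiliar with the structure being optimized, a minimal recursive V-cycle for a 1D Poisson problem is sketched below. It shows the smooth / residual / restrict / recurse / prolong / correct phases that the paper measures, but the 3D operators, smoother choice, and blocking of the actual solvers are not reproduced; n is the number of interior points (grids of size 2^k - 1 work best).

```c
#include <stdlib.h>

/* Minimal recursive V-cycle for -u'' = f on a 1D grid with Gauss-Seidel
 * smoothing. Arrays hold n interior points plus two boundary entries
 * (indices 0 and n+1) fixed at zero. Illustrative only. */
static void smooth(int n, double *u, const double *f, double h2, int iters)
{
    for (int it = 0; it < iters; it++)
        for (int i = 1; i <= n; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h2 * f[i]);
}

void vcycle(int n, double *u, const double *f, double h)
{
    double h2 = h * h;
    smooth(n, u, f, h2, 2);                           /* pre-smoothing */
    if (n <= 1)
        return;                                       /* coarsest level */

    int nc = (n - 1) / 2;                             /* coarse interior points */
    double *r  = calloc(n + 2,  sizeof *r);           /* fine residual */
    double *fc = calloc(nc + 2, sizeof *fc);          /* restricted residual */
    double *ec = calloc(nc + 2, sizeof *ec);          /* coarse correction */

    for (int i = 1; i <= n; i++)                      /* r = f - A*u */
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / h2;
    for (int i = 1; i <= nc; i++)                     /* full-weighting restriction */
        fc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);

    vcycle(nc, ec, fc, 2.0 * h);                      /* coarse-grid solve */

    for (int i = 1; i <= nc; i++)                     /* prolongation: coincident points */
        u[2 * i] += ec[i];
    for (int i = 0; i <= nc; i++)                     /* prolongation: in-between points */
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1]);

    smooth(n, u, f, h2, 2);                           /* post-smoothing */
    free(r); free(fc); free(ec);
}
```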


ACM Transactions on Mathematical Software | 2016

The BLIS Framework: Experiments in Portability

Field G. Van Zee; Tyler M. Smith; Bryan Marker; Tze Meng Low; Robert A. van de Geijn; Francisco D. Igual; Mikhail Smelyanskiy; Xianyi Zhang; Michael Kistler; Vernon Austel; John A. Gunnels; Lee Killough

BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show, with very little effort, how the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementation, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.
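
The key structural idea behind BLIS is to isolate essentially all level-3 performance in a tiny microkernel that updates an MR x NR block of C, with portable loops around it. A minimal sketch of that structure follows; it omits the packing and cache-blocking layers of the real framework, assumes row-major storage with dimensions divisible by MR and NR, and is not BLIS code.

```c
#include <stddef.h>

#define MR 4
#define NR 4

/* Reference microkernel: C[MR x NR] += A[MR x k] * B[k x NR].
 * In a real BLIS build, only this routine is architecture-specific. */
static void microkernel(size_t k, const double *A, size_t lda,
                        const double *B, size_t ldb,
                        double *C, size_t ldc)
{
    double acc[MR][NR] = {{0}};
    for (size_t p = 0; p < k; p++)
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NR; j++)
                acc[i][j] += A[i * lda + p] * B[p * ldb + j];
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < NR; j++)
            C[i * ldc + j] += acc[i][j];
}

/* Portable loops "around" the microkernel (packing layers omitted).
 * A is m x k, B is k x n, C is m x n, all row-major. */
void gemm(size_t m, size_t n, size_t k,
          const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i += MR)
        for (size_t j = 0; j < n; j += NR)
            microkernel(k, A + i * k, k, B + j, n, C + i * n + j, n);
}
```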


ieee international conference on high performance computing data and analytics | 2014

Petascale high order dynamic rupture earthquake simulations on heterogeneous supercomputers

Alexander Heinecke; Alexander Breuer; Sebastian Rettenberger; Michael Bader; Alice-Agnes Gabriel; Christian Pelties; Arndt Bode; William L. Barth; Xiangke Liao; Karthikeyan Vaidyanathan; Mikhail Smelyanskiy; Pradeep Dubey

We present an end-to-end optimization of the innovative Arbitrary high-order DERivative Discontinuous Galerkin (ADER-DG) software SeisSol targeting Intel® Xeon Phi coprocessor platforms, achieving unprecedented earthquake model complexity through coupled simulation of full frictional sliding and seismic wave propagation. SeisSol exploits unstructured meshes to flexibly adapt for complicated geometries in realistic geological models. Seismic wave propagation is solved simultaneously with earthquake faulting in a multiphysical manner, leading to a heterogeneous solver structure. Our architecture-aware optimizations deliver up to 50% of peak performance, and introduce an efficient compute-communication overlapping scheme shadowing the multiphysics computations. SeisSol delivers near-optimal weak scaling, reaching 8.6 DP-PFLOPS on 8,192 nodes of the Tianhe-2 supercomputer. Our performance model projects reaching 18-20 DP-PFLOPS on the full Tianhe-2 machine. Of special relevance to modern civil engineering needs, our pioneering simulation of the 1992 Landers earthquake shows highly detailed rupture evolution and ground motion at frequencies up to 10 Hz.
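
The overlapping scheme itself is specific to SeisSol, but its general shape is the standard non-blocking MPI pattern sketched below: post halo exchanges, update the elements that need no remote data while the messages are in flight, then finish the boundary elements once the halos have arrived. The buffer layout, two-neighbour topology, and compute routines are placeholder assumptions, not SeisSol's implementation.

```c
#include <mpi.h>

/* Placeholder compute routines standing in for the element updates. */
void compute_interior(double *elements, int n);
void compute_boundary(double *elements, int n, const double *halo);

/* Generic compute/communication overlap: communication is "shadowed" by the
 * interior updates, which do not depend on remote data. */
void timestep(double *interior, int n_interior,
              double *boundary, int n_boundary,
              double *send_buf, double *recv_buf, int halo_len,
              int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    MPI_Irecv(recv_buf,            halo_len, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recv_buf + halo_len, halo_len, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(send_buf,            halo_len, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(send_buf + halo_len, halo_len, MPI_DOUBLE, right, 0, comm, &req[3]);

    compute_interior(interior, n_interior);       /* overlapped with communication */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);     /* halo data has arrived */
    compute_boundary(boundary, n_boundary, recv_buf);
}
```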


application specific systems architectures and processors | 2003

Systematic register bypass customization for application-specific processors

Kevin Fan; Nathan Clark; Michael L. Chu; K. V. Manjunath; Rajiv A. Ravindran; Mikhail Smelyanskiy; Scott A. Mahlke

Register bypass provides additional datapaths to eliminate data hazards in processor pipelines. The difficulty with register bypass is that the cost of the bypass network is substantial and grows rapidly as processor width or pipeline depth increases. For a single application, many of the bypass paths have extremely low utilization. Thus, there is an important opportunity in the design of application-specific processors to remove a large fraction of the bypass cost while maintaining performance comparable to a processor with full bypass. We propose a systematic design customization process along with a bypass-cognizant compiler scheduler. For the former, we employ iterative design space exploration wherein successive processor designs are selected based on bypass utilization statistics combined with the availability of redundant bypass paths. Compiler scheduling for sparse bypass processors is accomplished by prioritizing function unit choices for each operation prior to scheduling using global information. Results show that for a 5-issue customized VLIW processor, 70% of the bypass cost is eliminated while sacrificing only 10% performance.
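
A greedy sketch of the iterative exploration described above: repeatedly drop the least-utilized bypass path that still has a redundant route, re-evaluating performance after each removal. The data layout and the two evaluation callbacks are hypothetical stand-ins, not the paper's actual tooling.

```c
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    int from_fu, to_fu;     /* producer and consumer function units */
    double utilization;     /* fraction of cycles the path forwards a value */
    bool present;
} bypass_path;

/* Greedily remove low-utilization bypass paths while performance (as reported
 * by the caller-supplied evaluation callback) stays above a threshold. */
size_t prune_bypasses(bypass_path *paths, size_t npaths,
                      bool (*has_redundant_route)(const bypass_path *, size_t, size_t),
                      double (*perf_after_removal)(const bypass_path *, size_t),
                      double min_relative_perf)
{
    size_t removed = 0;
    for (;;) {
        size_t best = npaths;            /* candidate with lowest utilization */
        for (size_t i = 0; i < npaths; i++)
            if (paths[i].present &&
                has_redundant_route(paths, npaths, i) &&
                (best == npaths || paths[i].utilization < paths[best].utilization))
                best = i;
        if (best == npaths)
            break;                       /* nothing left that is safely removable */

        paths[best].present = false;     /* tentatively remove the path */
        if (perf_after_removal(paths, npaths) < min_relative_perf) {
            paths[best].present = true;  /* too costly: keep it and stop */
            break;
        }
        removed++;
    }
    return removed;
}
```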

Collaboration


Dive into Mikhail Smelyanskiy's collaborations.
