Stanislav G. Sedukhin
University of Aizu
Publication
Featured research published by Stanislav G. Sedukhin.
high performance computing and communications | 2011
Kazuya Matsumoto; Naohito Nakasato; Stanislav G. Sedukhin
This paper presents a blocked algorithm for the all-pairs shortest paths (APSP) problem on a hybrid CPU-GPU system. In the blocked APSP algorithm, the amount of data communication between CPU (host) memory and GPU memory is minimized. When the problem size (the number of vertices in the graph) is sufficiently large compared with the blocking factor, the blocked algorithm effectively requires only a CPU ⇌ GPU exchange of two block matrices for each block computation on the GPU. We also estimate the memory/communication bandwidth required to utilize the GPU efficiently. On a system containing an Intel Westmere CPU (Core i7 970) and an AMD Cypress GPU (Radeon HD 5870), our implementation of the blocked APSP algorithm achieves up to 1 TFlop/s in single precision.
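For context, a minimal NumPy sketch of the standard blocked Floyd-Warshall scheme that such blocked APSP algorithms build on; this host-only sketch omits the paper's CPU ⇌ GPU data exchange and tuning, and the function names are illustrative.

import numpy as np

def min_plus(C, A, B):
    # C := min(C, A (min,+) B): the tile "multiply-add" used by blocked APSP
    return np.minimum(C, np.min(A[:, :, None] + B[None, :, :], axis=1))

def fw_tile(T):
    # plain Floyd-Warshall within a single tile
    for m in range(T.shape[0]):
        T = np.minimum(T, T[:, m:m+1] + T[m:m+1, :])
    return T

def blocked_apsp(D, b):
    # D: n x n distance matrix (np.inf where there is no edge); b: blocking factor dividing n
    n = D.shape[0]
    nb = n // b
    for k in range(nb):
        K = slice(k * b, (k + 1) * b)
        D[K, K] = fw_tile(D[K, K])                          # 1) diagonal tile
        for j in range(nb):                                 # 2) row and column panels
            J = slice(j * b, (j + 1) * b)
            if j != k:
                D[K, J] = min_plus(D[K, J], D[K, K], D[K, J])
                D[J, K] = min_plus(D[J, K], D[J, K], D[K, K])
        for i in range(nb):                                 # 3) all remaining tiles
            I = slice(i * b, (i + 1) * b)
            for j in range(nb):
                J = slice(j * b, (j + 1) * b)
                if i != k and j != k:
                    D[I, J] = min_plus(D[I, J], D[I, K], D[K, J])
    return D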
2012 IEEE 6th International Symposium on Embedded Multicore SoCs | 2012
Kazuya Matsumoto; Naohito Nakasato; Stanislav G. Sedukhin
This paper presents results of an implementation of a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high-performance execution on GPUs from AMD. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of the peak) and 2646 GFlop/s (70%), respectively.
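As an illustration of the block-major idea only (not the generator's actual layout code), a short NumPy sketch that reorders a row-major matrix so that every b × b block is contiguous in memory:

import numpy as np

def to_block_major(A, b):
    # reorder an n x n row-major matrix so each b x b block is stored contiguously
    n = A.shape[0]
    assert n % b == 0
    nb = n // b
    blocks = A.reshape(nb, b, nb, b).swapaxes(1, 2)     # axes: [block_row, block_col, i, j]
    return np.ascontiguousarray(blocks).reshape(-1)     # flat block-major buffer

# element (i, j) of A ends up at flat offset ((i//b * (n//b) + j//b) * b + i % b) * b + j % b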
ieee international conference on high performance computing data and analytics | 2012
Kazuya Matsumoto; Naohito Nakasato; Stanislav G. Sedukhin
OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors, including CPUs, GPUs, and FPGAs. An auto-tuning technique also makes the performance of OpenCL programs portable across different processors. We have developed an auto-tuning system with a code generator for fast matrix multiply kernels in OpenCL. This paper presents a performance evaluation of DGEMM (double-precision general matrix multiply) and SGEMM (single-precision GEMM) implementations produced by the auto-tuning system. Evaluations are conducted on two AMD GPUs (Tahiti and Cayman), two NVIDIA GPUs (Kepler and Fermi), and two CPUs (Intel Sandy Bridge and AMD Bulldozer). Our GEMM implementations on the AMD GPUs show higher performance than the highly tuned vendor library, while the implementations on the NVIDIA GPUs are comparable to it.
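A toy Python sketch of the generate-benchmark-select loop behind such auto-tuning, with a blocked NumPy matmul standing in for a generated OpenCL kernel; the parameter grid and helper names are illustrative assumptions.

import time
import numpy as np

def autotune(param_grid, make_kernel, benchmark):
    # try every parameter set and keep the fastest kernel
    best = None
    for params in param_grid:
        t = benchmark(make_kernel(params))
        if best is None or t < best[1]:
            best = (params, t)
    return best

def make_blocked_gemm(b):
    # stand-in for a generated GEMM kernel with blocking factor b
    def gemm(A, B, C):
        n = A.shape[0]
        for i in range(0, n, b):
            for k in range(0, n, b):
                for j in range(0, n, b):
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return gemm

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)

def bench(gemm):
    C = np.zeros((n, n))
    t0 = time.perf_counter()
    gemm(A, B, C)
    return time.perf_counter() - t0

best_params, best_time = autotune([{"b": s} for s in (32, 64, 128, 256)],
                                  lambda p: make_blocked_gemm(p["b"]), bench)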
application specific systems architectures and processors | 1996
Shietung Peng; Stanislav G. Sedukhin; Igor S. Sedukhin
The design of parallel algorithms and architectures for solving linear systems using the two-step division-free Gaussian elimination method is considered. The two-step method improves on the ordinary single-step division-free method through its greater numerical stability. In spite of the rather complicated computations needed at each iteration of the two-step method, we first develop a regular iterative algorithm and then a two-dimensional array processor, by deriving a localized dependency graph of the algorithm and adopting a systematic approach to investigate the set of all admissible solutions and obtain the optimal architecture under linear scheduling. Because of the absence of division operations, the optimal array processor improves on previous systolic designs based on the widely used Gaussian elimination in terms of numerical stability and time-space complexity for VLSI implementation.
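For reference, the ordinary single-step division-free elimination update that the two-step method builds on (the paper's own two-step recurrences are not reproduced here) is

\[
  a_{ij}^{(k+1)} = a_{kk}^{(k)}\, a_{ij}^{(k)} - a_{ik}^{(k)}\, a_{kj}^{(k)}, \qquad i, j > k,
\]

which annihilates the entries below the pivot without performing any division.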
international conference on conceptual structures | 2011
Kazuya Matsumoto; Naohito Nakasato; Tomoya Sakai; Hideki Yahagi; Stanislav G. Sedukhin
This paper presents results of our study on double-precision general matrix-matrix multiplication (DGEMM) for GPU-equipped systems. We applied further optimization to the DGEMM stream kernel previously implemented for a Cypress GPU from AMD. We examined the effect of different memory access patterns on the performance of the DGEMM kernel by changing its layout function. The experimental results show that the GEMM kernel with the X-Morton layout function outperforms the kernels with the other layout functions in terms of performance and cache hit rate. Moreover, we implemented a DGEMM routine for large matrices whose data cannot all be allocated in GPU memory. Our DGEMM routine achieves up to 472 GFlop/s using one GPU and up to 921 GFlop/s using two GPUs.
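As background on Morton-family layouts (the X-Morton variant evaluated in the paper is not reproduced here), a small Python sketch of the classic Z-Morton index, which interleaves the bits of the block-row and block-column indices:

def z_morton(i, j, bits=16):
    # interleave the bits of i and j into a single Z-order (Morton) index
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)   # row bit -> odd position
        idx |= ((j >> b) & 1) << (2 * b)       # column bit -> even position
    return idx

# blocks are then visited in the order (0,0), (0,1), (1,0), (1,1), (0,2), ...
order = sorted(((i, j) for i in range(4) for j in range(4)), key=lambda p: z_morton(*p))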
international conference on model transformation | 2011
Abhijeet A. Ravankar; Stanislav G. Sedukhin
A new linear transform for scrambling images is proposed in this paper. The forward transform scrambles the image and the inverse transform unscrambles it. We define transformation matrices for both the scalar and the blocked cases. Recursive and non-recursive algorithms based on the new transform are also proposed. The degree of scrambling or unscrambling can be chosen by the user. The experimental results show that the positions of the pixels are strongly irregularized by the proposed transform. Unscrambling with a wrong key fails and results in an unintelligible image that cannot be recognized. We also show that the new transform provides a high level of image scrambling and is robust under common attacks and noise. The proposed linear transform is very simple, and its implementation as a triple-matrix multiplication is straightforward.
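To illustrate only the triple-matrix-product structure (the paper defines its own transformation matrices; the permutation matrices below are a stand-in, not the proposed transform), a short NumPy sketch:

import numpy as np

rng = np.random.default_rng(42)
n = 8
X = rng.integers(0, 256, size=(n, n)).astype(float)   # toy "image"

P = np.eye(n)[rng.permutation(n)]   # stand-in left transformation matrix (acts as the key)
Q = np.eye(n)[rng.permutation(n)]   # stand-in right transformation matrix

Y = P @ X @ Q                       # forward transform: scramble
X_rec = P.T @ Y @ Q.T               # inverse transform: unscramble (P and Q are orthogonal)
assert np.allclose(X_rec, X)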
international conference on parallel processing | 2010
Stanislav G. Sedukhin; Ahmed Zekri; Toshiaki Miyazaki
The two-dimensional (2D) forward/inverse discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete sine transform (DST), discrete Hartley transform (DHT), and discrete Walsh-Hadamard transform (DWHT) play a fundamental role in many practical applications. Due to the separability property, all these transforms can be uniquely defined as a triple matrix product with one matrix transposition. Based on a systematic approach to represent and schedule different forms of the n×n …
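As a concrete instance of the separability property (a generic illustration, not the paper's array-processor formulation), the 2D DFT of an n × n array X is the triple matrix product F X F^T with the DFT matrix F:

import numpy as np

n = 8
k = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(k, k) / n)   # n x n DFT matrix
X = np.random.rand(n, n)

Y = F @ X @ F.T                                # separable 2D transform as a triple matrix product
assert np.allclose(Y, np.fft.fft2(X))          # agrees with the direct 2D DFT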
international conference on networking and computing | 2010
Abhijeet A. Ravankar; Stanislav G. Sedukhin
great lakes symposium on vlsi | 1994
Stanislav G. Sedukhin
international parallel and distributed processing symposium | 2006
Ahmed Zekri; Stanislav G. Sedukhin