Vadim Voevodin | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Vadim Voevodin is active.

Explore More

Publication

Featured researches published by Vadim Voevodin.

international conference on parallel processing | 2015

An Approach for Ensuring Reliable Functioning of a Supercomputer Based on a Formal Model

Alexander Antonov; Dmitry A. Nikitenko; Pavel Shvets; Sergey Sobolev; Konstantin Stefanov; Vadim Voevodin; Vladimir Voevodin; Sergey Zhumatiy

In this article we describe the Octotron project intended to ensure reliability and sustainability of a supercomputer. Octotron is based on a formal model of computing system that describes system components and their interconnections in graph form. The model determines relations between data describing current supercomputer state (monitoring data) under which all components are functioning properly. Relations are given in form of rules, with the input of real monitoring data. If these relations are violated, Octotron registers the presence of abnormal situation and performs one of the predefined actions: notification of system administrators, logging, disabling or restarting faulty hardware or software components, etc. This paper describes the general structure of the model, augmented with details of its realization and evaluation at supercomputing center in Moscow State University.

parallel, distributed and network-based processing | 2016

A Study of the Dynamic Characteristics of Software Implementation as an Essential Part for a Universal Description of Algorithm Properties

Alexander Antonov; Vadim Voevodin; Vladimir Voevodin; Alexey Teplov

The AlgoWiki open encyclopedia of parallel algorithmic features enables the entire computing community to work together to describe the properties of a multitude of mathematical algorithms and their implementation for various software and hardware platforms. As part of the AlgoWiki project, a structure has been suggested for providing universal descriptions of algorithm properties. Along with the first part of the description, dedicated to machine-independent properties of the algorithms, it is extremely important to study and describe the dynamic characteristics of their software implementation. By studying fundamental algorithm properties such as execution time, performance, data locality, efficiency and scalability, we can give some estimate of the potential implementation quality for a given algorithm on a specific computer and lay the foundation for comparative analysis of various computing platforms with regards to the algorithms presented in AlgoWiki.

international conference on supercomputing | 2015

Efficiency of Exascale Supercomputer Centers and Supercomputing Education

Vladimir Voevodin; Vadim Voevodin

The efficient usage of all opportunities offered by modern computing systems represents a global challenge. To solve it efficiently we need to move in two directions simultaneously. Firstly, the higher educational system must be changed with a wide adoption of parallel computing technologies as the main idea across all curricula and courses. Secondly, it is necessary to develop software tools and systems to be able to reveal root causes of poor performance for applications as well as to evaluate efficiency of supercomputer centers on a large task flow. We try to combine both these two directions within supercomputer center of Moscow State University. In this article we will focus on the main idea of wide dissemination of supercomputing education for efficient usage of supercomputer systems today and in the nearest future as well as describe the results we have reached so far in this area.

international conference on algorithms and architectures for parallel processing | 2016

System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center

Dmitry A. Nikitenko; Konstantin Stefanov; Sergey Zhumatiy; Vadim Voevodin; Alexey Teplov; Pavel Shvets

The problem of effective resource utilization is very challenging nowadays, especially for HPC centers running top-level supercomputing facilities with high energy consumption and significant number of workgroups. The weakness of many system monitoring based approaches to efficiency study is the basic orientation on professionals and analysis of specific jobs with low availability for regular users. The proposed all-round performance analysis approach, covering single application performance, project-level and overall system resource utilization based on system monitoring data that promises to be an effective and low cost technique aimed at all types of HPC center users. Every user of HPC center can access details on any of his executed jobs to better understand application behavior and sequences of job runs including scalability study, helping in turn to perform appropriate optimizations and implement co-design techniques. Taking into consideration all levels (user, project manager, administrator), the approach aids to improve output of HPC centers.

NUMERICAL COMPUTATIONS: THEORY AND ALGORITHMS (NUMTA–2016): Proceedings of the 2nd International Conference “Numerical Computations: Theory and Algorithms” | 2016

Data mining method for anomaly detection in the supercomputer task flow

Vadim Voevodin; Vladimir Voevodin; Denis Shaikhislamov; Dmitry A. Nikitenko

The efficiency of most supercomputer applications is extremely low. At the same time, the user rarely even suspects that their applications may be wasting computing resources. Software tools need to be developed to help detect inefficient applications and report them to the users. We suggest an algorithm for detecting anomalies in the supercomputer’s task flow, based on a data mining methods. System monitoring is used to calculate integral characteristics for every job executed, and the data is used as input for our classification method based on the Random Forest algorithm. The proposed approach can currently classify the application as one of three classes – normal, suspicious and definitely anomalous. The proposed approach has been demonstrated on actual applications running on the “Lomonosov” supercomputer.

international conference on parallel processing | 2017

An Approach for Detecting Abnormal Parallel Applications Based on Time Series Analysis Methods

Denis Shaykhislamov; Vadim Voevodin

The low efficiency of parallel program execution is one of the most serious problems in high-performance computing area. There are many researches and software tools aimed at analyzing and improving the performance of a particular program, but the task of detecting such applications that need to be analyzed is still far from being solved.

international conference on parallel processing | 2017

Multidimensional Performance and Scalability Analysis for Diverse Applications Based on System Monitoring Data

Maya Neytcheva; Sverker Holmgren; Jonathan Bull; Ali Dorostkar; Anastasia Kruchinina; Dmitry A. Nikitenko; Nina Popova; Pavel Shvets; Alexey Teplov; Vadim Voevodin; Vladimir Voevodin

Multidimensional performance and scalability analysis for diverse applications based on system monitoring data

Russian Supercomputing Days | 2017

JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis

Dmitry A. Nikitenko; Alexander Antonov; Pavel Shvets; Sergey Sobolev; Konstantin Stefanov; Vadim Voevodin; Vladimir Voevodin; Sergey Zhumatiy

The efficiency of computing resources utilization by user applications can be analyzed in various ways. The JobDigest approach based on system monitoring was developed in Moscow State University and is currently used in everyday practice of the largest Russian supercomputing center of Moscow State University. The approach features application behavior analysis for every job run on HPC system providing: the set of dynamic application characteristics - time series of values representing utilization of CPU, memory, network, storage, etc. with diagrams and heat maps; the integral characteristics representing average utilization rates; job tagging and categorization with means of informing system administrators and managers on suspicious or abnormal applications. The paper describes the approach principles and workflow, it also demonstrates JobDigest use cases and positioning of the proposed techniques in the set of tools and methods that are used in the MSU HPC Center to ensure its 24/7 efficient and productive functioning.

Archive | 2018

Role-Dependent Resource Utilization Analysis for Large HPC Centers

Dmitry A. Nikitenko; Pavel Shvets; Vadim Voevodin; Sergey Zhumatiy

The resource utilization analysis of HPC systems can be performed in different ways. The method of analysis is selected depending primarily on the original focus of research. It can be a particular application and/or a series of application run analyses, a selected partition or a whole supercomputer system utilization study, a research on peculiarities of workgroup collaboration, and so on. The larger an HPC center is, the more diverse are the scenarios and user roles that arise. In this paper, we share the results of our research on possible roles and scenarios, as well as typical methods of resource utilization analysis for each role and scenario. The results obtained in this research have served as the basis for the development of appropriate modules in the Octoshell management system, which is used by all users of the largest HPC center in Russia, at Lomonosov Moscow State University.

Archive | 2018

What Do We Need to Know About Parallel Algorithms and Their Efficient Implementation

Vladimir Voevodin; Alexander Antonov; Vadim Voevodin

The computing world is changing and all devices—from mobile phones and personal computers to high-performance supercomputers—are becoming parallel. At the same time, the efficient usage of all the opportunities offered by modern computing systems represents a global challenge. Using full potential of parallel computing systems and distributed computing resources requires new knowledge, skills and abilities, where one of the main roles belongs to understanding key properties of parallel algorithms. What are these properties? What should be discovered and expressed explicitly in existing algorithms when a new parallel architecture appears? How to ensure efficient implementation of an algorithm on a particular parallel computing platform? All these as well as many other issues are addressed in this chapter. The idea that we use in our educational practice is to split a description of an algorithm into two parts. The first part describes algorithms and their properties. The second part is dedicated to describing particular aspects of their implementation on various computing platforms. This division is made intentionally to highlight the machine-independent properties of algorithms and to describe them separately from a number of issues related to the subsequent stages of programming and executing the resulting programs.

Explore More