Sergey Zhumatiy | Researchain

Publication

Featured research published by Sergey Zhumatiy.

International Conference on Cluster Computing | 2013

Capturing inter-application interference on clusters

Aamer Shah; Felix Wolf; Sergey Zhumatiy; Vladimir Voevodin

Cluster systems usually run several applications concurrently, often from different users, with individual applications competing for access to shared resources such as the file system or the network. Low application performance is therefore not always the result of inefficient program design, but may instead be caused by interference from outside. Knowing the difference is essential for an appropriate response. Unfortunately, traditional performance-analysis techniques always consider an application in isolation, without the ability to compare its performance to the overall performance conditions on the system when it was executed. In this paper, we present a novel approach to correlating the performance behavior of applications running side by side. To accomplish this, we divide the application runtime into fine-grained time slices whose boundaries are synchronized across the entire system. By mapping performance data related to shared resources onto these time slices, we are able to establish the simultaneity of their usage across jobs, which can be indicative of inter-application interference. Our experiments show that such interference effects, for which the developer is usually not to blame, can degrade application performance significantly.
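The time-slice idea in this abstract can be illustrated with a minimal Python sketch. It assumes each job reports timestamped samples of one shared-resource metric (for example, bytes written to the file system) against a common clock. The slice length, the sample format, the threshold, and the names `slice_index`, `bin_samples`, and `overlapping_slices` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

SLICE_SECONDS = 10  # assumed slice length; the paper's value is not given here


def slice_index(timestamp, slice_seconds=SLICE_SECONDS):
    """Map a system-wide timestamp to a globally aligned slice number."""
    return int(timestamp // slice_seconds)


def bin_samples(samples, slice_seconds=SLICE_SECONDS):
    """Sum one job's (timestamp, value) samples per time slice."""
    usage = defaultdict(float)
    for timestamp, value in samples:
        usage[slice_index(timestamp, slice_seconds)] += value
    return dict(usage)


def overlapping_slices(job_a, job_b, threshold):
    """Return the slices in which both jobs used the resource heavily."""
    a, b = bin_samples(job_a), bin_samples(job_b)
    return sorted(s for s in a.keys() & b.keys()
                  if a[s] >= threshold and b[s] >= threshold)


# Hypothetical samples: (seconds since a shared epoch, bytes written)
job_a = [(1, 50.0), (4, 70.0), (12, 5.0), (25, 110.0)]
job_b = [(3, 120.0), (15, 200.0), (27, 60.0), (29, 50.0)]
print(overlapping_slices(job_a, job_b, threshold=100.0))  # [0, 2]
```

Because slice boundaries depend only on the shared clock, samples from different jobs fall into comparable bins. Simultaneous heavy usage in a slice is only a hint of interference, not proof of it.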

International Conference on Algorithms and Architectures for Parallel Processing | 2016

System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center

Dmitry A. Nikitenko; Konstantin Stefanov; Sergey Zhumatiy; Vadim Voevodin; Alexey Teplov; Pavel Shvets

Effective resource utilization is a major challenge, especially for HPC centers running top-level supercomputing facilities with high energy consumption and a significant number of workgroups. The weakness of many system-monitoring-based approaches to efficiency studies is that they are oriented toward professionals and toward the analysis of specific jobs, and are hardly available to regular users. The proposed all-round performance analysis approach covers single-application performance, project-level utilization, and overall system resource utilization based on system monitoring data, and promises to be an effective, low-cost technique aimed at all types of HPC center users. Every user of an HPC center can access details on any of their executed jobs to better understand application behavior and sequences of job runs, including scalability studies, which in turn helps them perform appropriate optimizations and implement co-design techniques. By taking all levels into consideration (user, project manager, administrator), the approach helps improve the output of HPC centers.

Computing Frontiers | 2016

Resolving frontier problems of mastering large-scale supercomputer complexes

Dmitry A. Nikitenko; Vladimir Voevodin; Sergey Zhumatiy

Managing and administering large-scale HPC centers is a complicated problem. Using a number of independent tools to resolve its seemingly independent subproblems can become a bottleneck as the scale of systems, the number of hardware and software components, the variety of user applications and license types, the number of users and workgroups, and so on increase rapidly. The developed tool is designed to help resolve routine problems in managing and administering any supercomputer center, from a stand-alone system up to top-rank supercomputer centers that include a number of totally different HPC systems. The toolkit implements a flexibly configurable variety of essential tools in a single interface. It also features useful means of automating typical multi-step administration and management procedures. Another important design and implementation feature allows the toolkit to be installed and used without any significant changes to existing administration tools and system software. The developed tool is not integrated with the system software of the target machines; it runs on a remote server and executes scripts on HPC systems via SSH as a dedicated user with limited access permissions to perform certain actions. This greatly reduces the possibility of security issues and takes care of many fault-tolerance issues that are among the key challenges on the road to exascale. At the same time, this allows an administrator to perform any operation with the tools that suit the situation, whether ours or any other available tool. Evaluation of the developed system proved its practicality in an HPC center with several petaflop-level supercomputers and thousands of active researchers from a diversity of institutions working on several hundred applied projects.

Russian Supercomputing Days | 2017

JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis

Dmitry A. Nikitenko; Alexander Antonov; Pavel Shvets; Sergey Sobolev; Konstantin Stefanov; Vadim Voevodin; Vladimir Voevodin; Sergey Zhumatiy

The efficiency with which user applications utilize computing resources can be analyzed in various ways. The JobDigest approach, based on system monitoring, was developed at Moscow State University and is currently used in everyday practice at its supercomputing center, the largest in Russia. The approach features application behavior analysis for every job run on an HPC system, providing: a set of dynamic application characteristics, namely time series of values representing utilization of CPU, memory, network, storage, etc., with diagrams and heat maps; integral characteristics representing average utilization rates; and job tagging and categorization, with means of informing system administrators and managers about suspicious or abnormal applications. The paper describes the principles and workflow of the approach. It also demonstrates JobDigest use cases and the positioning of the proposed techniques within the set of tools and methods used in the MSU HPC Center to ensure its efficient and productive 24/7 operation.
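The relationship between dynamic characteristics (time series) and integral characteristics (averages) described in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not JobDigest code: the metric names, the `low-cpu` tag, the 10% threshold, and the functions `integral_characteristics` and `tag_job` are all hypothetical.

```python
from statistics import mean


def integral_characteristics(series):
    """Reduce each metric's time series to its average value."""
    return {metric: mean(values) for metric, values in series.items()}


def tag_job(averages, cpu_low=10.0):
    """Tag a job whose average CPU load is below an assumed threshold."""
    tags = []
    if averages.get("cpu_user_percent", 100.0) < cpu_low:
        tags.append("low-cpu")  # hypothetical tag name
    return tags


# Hypothetical monitoring samples for one job
job_series = {
    "cpu_user_percent": [5.0, 7.0, 6.0],
    "memory_used_gb": [2.0, 4.0, 6.0],
}
averages = integral_characteristics(job_series)
print(averages, tag_job(averages))
```

A simple threshold rule like this is only one possible way to flag suspicious jobs; the abstract does not specify how tags are assigned.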

Archive | 2018

Role-Dependent Resource Utilization Analysis for Large HPC Centers

Dmitry A. Nikitenko; Pavel Shvets; Vadim Voevodin; Sergey Zhumatiy

The resource utilization analysis of HPC systems can be performed in different ways. The method of analysis is selected depending primarily on the original focus of research. It can be an analysis of a particular application and/or a series of application runs, a utilization study of a selected partition or a whole supercomputer system, a study of the peculiarities of workgroup collaboration, and so on. The larger an HPC center is, the more diverse are the scenarios and user roles that arise. In this paper, we share the results of our research on possible roles and scenarios, as well as typical methods of resource utilization analysis for each role and scenario. The results obtained in this research have served as the basis for the development of appropriate modules in the Octoshell management system, which is used by all users of the largest HPC center in Russia, at Lomonosov Moscow State University.

International Conference on Parallel Computational Technologies | 2018

Machine Learning Techniques for Detecting Supercomputer Applications with Abnormal Behavior

Alexander Bezrukov; Mikhail Kokarev; Denis Shaykhislamov; Vadim Voevodin; Sergey Zhumatiy

There are different approaches that help to solve the issue of low efficiency of modern supercomputer usage. One of them is based on constant monitoring of a supercomputer job flow in order to promptly detect inefficient programs. The execution dynamics of such programs usually differs from the “normal” behavior of common programs; however, it is very difficult to establish exact criteria for determining abnormal behavior. Machine learning methods are therefore used in this study for detecting abnormal jobs. This paper deals with an important aspect of working with machine learning methods, namely data preparation. The solution proposed herein was evaluated on the Lomonosov-2 supercomputer.

Parallel Computing Technologies | 2016