Dmitry A. Nikitenko | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Dmitry A. Nikitenko is active.

Explore More

Publication

Featured researches published by Dmitry A. Nikitenko.

ieee international conference on high performance computing data and analytics | 2013

The HOPSA workflow and tools

Bernd Mohr; Vladimir Voevodin; Judit Gimenez; Erik Hagersten; Andreas Knüpfer; Dmitry A. Nikitenko; Mats Nilsson; Harald Servat; Aamer Shah; Frank Winkler; Felix Wolf; Ilya Zhukov

To maximise the scientific output of a high-performance computing system, different stakeholders pursue different strategies. While individual application developers are trying to shorten the time to solution by optimising their codes, system administrators are tuning the configuration of the overall system to increase its throughput. Yet, the complexity of today’s machines with their strong interrelationship between application and system performance presents serious challenges to achieving these goals. The HOPSA project (HOlistic Performance System Analysis) therefore sets out to create an integrated diagnostic infrastructure for combined application and system-level tuning – with the former provided by the EU and the latter by the Russian project partners. Starting from system-wide basic performance screening of individual jobs, an automated workflow routes findings on potential bottlenecks either to application developers or system administrators with recommendations on how to identify their root cause using more powerful diagnostic tools. Developers can choose from a variety of mature performance-analysis tools developed by our consortium. Within this project, the tools will be further integrated and enhanced with respect to scalability, depth of analysis, and support for asynchronous tasking, a node-level paradigm playing an increasingly important role in hybrid programs on emerging hierarchical and heterogeneous systems.

international conference on algorithms and architectures for parallel processing | 2016

System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center

Dmitry A. Nikitenko; Konstantin Stefanov; Sergey Zhumatiy; Vadim Voevodin; Alexey Teplov; Pavel Shvets

The problem of effective resource utilization is very challenging nowadays, especially for HPC centers running top-level supercomputing facilities with high energy consumption and significant number of workgroups. The weakness of many system monitoring based approaches to efficiency study is the basic orientation on professionals and analysis of specific jobs with low availability for regular users. The proposed all-round performance analysis approach, covering single application performance, project-level and overall system resource utilization based on system monitoring data that promises to be an effective and low cost technique aimed at all types of HPC center users. Every user of HPC center can access details on any of his executed jobs to better understand application behavior and sequences of job runs including scalability study, helping in turn to perform appropriate optimizations and implement co-design techniques. Taking into consideration all levels (user, project manager, administrator), the approach aids to improve output of HPC centers.

computing frontiers | 2016

Resolving frontier problems of mastering large-scale supercomputer complexes

Dmitry A. Nikitenko; Vladimir Voevodin; Sergey Zhumatiy

Managing and administering of large-scale HPC centers is a complicated problem. Using a number of independent tools for resolving its seemingly independent sub problems can become a bottleneck with rapidly increasing scale of systems, number of hardware and software components, variety of user applications and types of licenses, number of users and workgroups, and so on. The developed tool is designed to help resolving routine problems in mastering and administering of any supercomputer center from a scale of a stand-alone system up to the top-rank supercomputer centers that include a number of totally different HPC systems. The toolkit implements a flexibly configurable variety of essential tools in a single interface. It also features useful means of automation for typical administering and management multi-step procedures. Another important design and implementation feature allows installing and using the toolkit without any significant changes to existing administrating tools and system software. The developed tool is not integrated with target machines system software, it is run on a remote server and runs scripts on HPC systems via SSH as a dedicated user with limited access permissions to perform certain actions. This reduces possibility of security issues greatly and takes care of many fault tolerance issues that are in the line of the key challenges on the road to the Exascale. At the same time this allows administrator performing any operations with corresponding to the situation tools, whether using our tools or any other available tool. The approbation of the developed system proved its practicality in HPC center with some Petaflop-level supercomputers, thousands of active researchers from a diversity of institutions within several hundreds of applied projects.

NUMERICAL COMPUTATIONS: THEORY AND ALGORITHMS (NUMTA–2016): Proceedings of the 2nd International Conference “Numerical Computations: Theory and Algorithms” | 2016

Data mining method for anomaly detection in the supercomputer task flow

Vadim Voevodin; Vladimir Voevodin; Denis Shaikhislamov; Dmitry A. Nikitenko

The efficiency of most supercomputer applications is extremely low. At the same time, the user rarely even suspects that their applications may be wasting computing resources. Software tools need to be developed to help detect inefficient applications and report them to the users. We suggest an algorithm for detecting anomalies in the supercomputer’s task flow, based on a data mining methods. System monitoring is used to calculate integral characteristics for every job executed, and the data is used as input for our classification method based on the Random Forest algorithm. The proposed approach can currently classify the application as one of three classes – normal, suspicious and definitely anomalous. The proposed approach has been demonstrated on actual applications running on the “Lomonosov” supercomputer.

international conference on parallel processing | 2017

Multidimensional Performance and Scalability Analysis for Diverse Applications Based on System Monitoring Data

Maya Neytcheva; Sverker Holmgren; Jonathan Bull; Ali Dorostkar; Anastasia Kruchinina; Dmitry A. Nikitenko; Nina Popova; Pavel Shvets; Alexey Teplov; Vadim Voevodin; Vladimir Voevodin

Multidimensional performance and scalability analysis for diverse applications based on system monitoring data

Russian Supercomputing Days | 2017

JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis

Dmitry A. Nikitenko; Alexander Antonov; Pavel Shvets; Sergey Sobolev; Konstantin Stefanov; Vadim Voevodin; Vladimir Voevodin; Sergey Zhumatiy

The efficiency of computing resources utilization by user applications can be analyzed in various ways. The JobDigest approach based on system monitoring was developed in Moscow State University and is currently used in everyday practice of the largest Russian supercomputing center of Moscow State University. The approach features application behavior analysis for every job run on HPC system providing: the set of dynamic application characteristics - time series of values representing utilization of CPU, memory, network, storage, etc. with diagrams and heat maps; the integral characteristics representing average utilization rates; job tagging and categorization with means of informing system administrators and managers on suspicious or abnormal applications. The paper describes the approach principles and workflow, it also demonstrates JobDigest use cases and positioning of the proposed techniques in the set of tools and methods that are used in the MSU HPC Center to ensure its 24/7 efficient and productive functioning.

Archive | 2018

Role-Dependent Resource Utilization Analysis for Large HPC Centers

Dmitry A. Nikitenko; Pavel Shvets; Vadim Voevodin; Sergey Zhumatiy

The resource utilization analysis of HPC systems can be performed in different ways. The method of analysis is selected depending primarily on the original focus of research. It can be a particular application and/or a series of application run analyses, a selected partition or a whole supercomputer system utilization study, a research on peculiarities of workgroup collaboration, and so on. The larger an HPC center is, the more diverse are the scenarios and user roles that arise. In this paper, we share the results of our research on possible roles and scenarios, as well as typical methods of resource utilization analysis for each role and scenario. The results obtained in this research have served as the basis for the development of appropriate modules in the Octoshell management system, which is used by all users of the largest HPC center in Russia, at Lomonosov Moscow State University.

Russian Supercomputing Days | 2016

Techniques for Solving Large-Scale Graph Problems on Heterogeneous Platforms

Ilya Afanasyev; Alexander Daryin; Jack J. Dongarra; Dmitry A. Nikitenko; Alexey Teplov; Vladimir Voevodin

The paper introduces techniques for solving various large-scale graph problems on hybrid architectures. The proposed approach is illustrated on the computation of minimum spanning tree and shortest paths. We provide a precise mathematical description accompanied by the information structure of required algorithms. Efficient parallel implementations of several graph algorithms are proposed based on this analysis. Hybrid computations allow using all the available resources on both multi-core CPUs and GPUs. Our implementation uses out-of-core memory algorithms to handle graphs that don’t fit in the main memory. Experimental results confirm high performance and scalability of the proposed solutions. Moreover, the proposed approach can be applied to other graph processing problems, which have recently rapidly increased in demand.

parallel computing technologies | 2016