Amnon Shiloh | Researchain

Publication

Featured research published by Amnon Shiloh.

Software - Practice and Experience | 1985

A distributed load-balancing policy for a multicomputer

Amnon Barak; Amnon Shiloh

This paper deals with the organization of a distributed load‐balancing policy for a multicomputer system which consists of a cluster of independent computers that are interconnected by a local area communication network. We introduce three algorithms necessary to maintain load balancing in this system: the local load algorithm, used by each processor to monitor its own load; the exchange algorithm, for exchanging load information between the processors; and the process migration algorithm, which uses this information to dynamically migrate processes from overloaded to underloaded processors.
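
The interplay of the three algorithms named in this abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the policy from the paper: the `Node` class, the functions `exchange`, `migrate` and `balance`, the use of a process count as the load measure, the random choice of peer, and the `threshold` parameter are all hypothetical choices made for illustration.

```python
import random


class Node:
    """One computer in the cluster (hypothetical model)."""

    def __init__(self, node_id, processes):
        self.node_id = node_id
        self.processes = processes   # number of runnable processes
        self.known_loads = {}        # partial, possibly stale view: id -> load

    def local_load(self):
        # Local load algorithm: here simply the number of processes.
        return self.processes


def exchange(nodes, rng):
    # Exchange algorithm: every node reports its load to one random peer.
    if len(nodes) < 2:
        return
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        peer.known_loads[node.node_id] = node.local_load()


def migrate(nodes, threshold=1):
    # Migration algorithm: move one process to the least-loaded known peer
    # if the load difference exceeds the threshold. The candidate is picked
    # from the stale view, then its current load is checked (an assumption).
    by_id = {n.node_id: n for n in nodes}
    moved = 0
    for node in nodes:
        if not node.known_loads:
            continue
        target = by_id[min(node.known_loads, key=node.known_loads.get)]
        if node.local_load() - target.local_load() > threshold:
            node.processes -= 1
            target.processes += 1
            moved += 1
    return moved


def balance(loads, rounds=200, seed=0):
    rng = random.Random(seed)
    nodes = [Node(i, p) for i, p in enumerate(loads)]
    for _ in range(rounds):
        exchange(nodes, rng)
        migrate(nodes)
    return [n.processes for n in nodes]


if __name__ == "__main__":
    print(balance([8, 0, 0, 0]))
```

Because a process only moves when the load difference is larger than the threshold, the total number of processes is conserved and the gap between the most and least loaded node never grows.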

International Conference on Cluster Computing | 2010

A package for OpenCL based heterogeneous computing on clusters with many GPU devices

Amnon Barak; Tal Ben-Nun; Ely Levy; Amnon Shiloh

Heterogeneous systems provide new opportunities to increase the performance of parallel applications on clusters with CPU and GPU architectures. Currently, applications that utilize GPU devices run their device-executable code on local devices in their respective hosting-nodes. This paper presents a package for running OpenMP, C++ and unmodified OpenCL applications on clusters with many GPU devices. This Many GPUs Package (MGP) includes an implementation of the OpenCL specifications and extensions of the OpenMP API that allow applications on one hosting-node to transparently utilize cluster-wide devices (CPUs and/or GPUs). MGP provides means for reducing the complexity of programming and running parallel applications on clusters, including scheduling based on task dependencies and buffer management. The paper presents MGP and the performance of its internals.

Cluster Computing and the Grid | 2005

An organizational grid of federated MOSIX clusters

Amnon Barak; Amnon Shiloh; Lior Amar

MOSIX is a cluster management system that uses process migration to allow a Linux cluster to perform like a parallel computer. Recently it has been extended with new features that could make a grid of Linux clusters run as a cooperative system of federated clusters. It supports automatic workload distribution among connected clusters that belong to different owners, while still preserving the autonomy of each owner to disconnect its cluster from the grid at any time, without sacrificing migrated processes from other clusters. Other new features of MOSIX include grid-wide automatic resource discovery; a precedence scheme for local processes and among guest processes (from other clusters); flood control; a secure run-time environment (sandbox) which prevents guest processes from accessing local resources in a hosting system; and support of cluster partitions. The resulting grid management system is suitable for creating an intra-organizational high-performance computational grid, e.g., in an enterprise or on a campus. The paper presents enhanced and new features of MOSIX and their performance.

Cluster Computing | 2004

The MOSIX Direct File System Access Method for Supporting Scalable Cluster File Systems

Lior Amar; Amnon Barak; Amnon Shiloh

MOSIX is a cluster management system that supports preemptive process migration. This paper presents the MOSIX Direct File System Access (DFSA), a provision that can improve the performance of cluster file systems by allowing a migrated process to directly access files in its current location. This capability, when combined with an appropriate file system, could substantially increase the I/O performance and reduce the network congestion by migrating an I/O intensive process to a file server rather than the traditional way of bringing the file data to the process. DFSA is suitable for clusters that manage a pool of shared disks among multiple machines. With DFSA, it is possible to migrate parallel processes from a client node to file servers for parallel access to different files. Any consistent file system can be adjusted to work with DFSA. To test its performance, we developed the MOSIX File-System (MFS) which allows consistent parallel operations on different files. The paper describes DFSA and presents the performance of MFS with and without DFSA.

Grid Computing | 2008

Harnessing migrations in a market-based grid OS

Lior Amar; Jochen Stosser; Ely Levy; Amnon Shiloh; Amnon Barak; Dirk Neumann

Applying economic principles to grids is deemed promising to improve the overall value provided by such systems. End users can influence the allocation of resources by reporting valuations for these resources. Current market-based schedulers, however, are static, assume the availability of complete information about jobs (in particular with respect to processing times), and do not make use of the flexibility offered by advanced computing systems. In this paper, we present the implementation of economic resource allocation principles into MOSIX, a state-of-the-art management system for computing clusters and multi-cluster organizational grids. The system is designed so as to be able to work in large-scale settings with selfish agents. Facing incomplete information about jobs' characteristics, it dynamically allocates jobs to computing machines by leveraging preemption and job migration, two distinct features offered by MOSIX. We validate and showcase the behavior of our economic model by means of experiments in the real system.

Concurrency and Computation: Practice and Experience | 2015

Resilient gossip algorithms for collecting online management information in exascale clusters

Amnon Barak; Zvi Drezner; Ely Levy; Matthias Lieber; Amnon Shiloh

Management of forthcoming exascale clusters requires frequent collection of run‐time information about the nodes and the running applications. This paper presents a new paradigm for providing online information to the management system of scalable clusters, consisting of a large number of nodes and one or more masters that manage these nodes. We describe the details of resilient gossip algorithms for sharing local information within subsets of nodes and for sending global information to a master, which holds information on all the nodes. The presented algorithms are decentralized, scalable and resilient, working well even when some nodes fail, without needing any recovery protocol. The paper gives formal expressions for approximating the average ages of the local information at each node and the information collected by the master. It then shows that these results closely match the results of simulations and measurements on a real cluster. The paper also investigates the resilience of the algorithms and the impact on the average age when nodes or masters fail. The main outcome of this paper is that partitioning of large clusters can improve the quality of information available to the management system without increasing the number of messages per node.
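
The notion of "average age" of gossiped information can be made concrete with a toy simulation. The sketch below is a simplification under stated assumptions and does not reproduce the algorithms of the paper: masters, partitioning and any limit on message size are not modeled, each node sends its whole view to one uniformly random node per time step, and the names `gossip_round` and `average_age` are hypothetical.

```python
import random


def gossip_round(views, alive, rng):
    """One time step: age all entries, then each live node sends its view
    to one random node, which keeps the younger entry for every node."""
    n = len(views)
    for i in range(n):
        if alive[i]:
            views[i] = [age + 1 for age in views[i]]
            views[i][i] = 0          # a node's own information is always fresh
    snapshot = [list(v) for v in views]   # all messages are sent simultaneously
    for sender in range(n):
        if not alive[sender]:
            continue
        receiver = rng.randrange(n)
        if receiver == sender or not alive[receiver]:
            continue                 # message is lost; there is no recovery step
        views[receiver] = [min(a, b)
                           for a, b in zip(views[receiver], snapshot[sender])]


def average_age(n=32, rounds=100, failed=(), seed=1):
    """Average age, over live nodes, of what they know about live nodes."""
    rng = random.Random(seed)
    alive = [i not in failed for i in range(n)]
    views = [[float("inf")] * n for _ in range(n)]   # inf means "never heard of"
    for _ in range(rounds):
        gossip_round(views, alive, rng)
    live = [i for i in range(n) if alive[i]]
    ages = [views[i][j] for i in live for j in live]
    return sum(ages) / len(ages)


if __name__ == "__main__":
    print("no failures:  ", average_age())
    print("four failures:", average_age(failed=(3, 7, 11, 19)))
```

Messages addressed to a failed node are simply dropped, which mirrors the abstract's point that the approach needs no recovery protocol; comparing the two printed values gives a rough feel for how failures affect the average age in this toy model.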

International Workshop on Runtime and Operating Systems for Supercomputers | 2014

Overhead of a decentralized gossip algorithm on the performance of HPC applications

Ely Levy; Amnon Barak; Amnon Shiloh; Matthias Lieber; Carsten Weinhold; Hermann Härtig

Gossip algorithms can provide online information about the availability and the state of the resources in supercomputers. These algorithms require minimal computing and storage capabilities at each node and, when properly tuned, they are not expected to overload the nodes or the network that connects these nodes. These properties make gossip interesting for future exascale systems. This paper examines the overhead of a decentralized gossip algorithm on the performance of parallel MPI applications running on up to 8192 nodes of an IBM BlueGene/Q supercomputer. The applications that were used in the experiments include PTRANS and MPI-FFT from the HPCC benchmark suite as well as the coupled weather and cloud simulation model COSMO-SPECS+FD4. In most cases, no gossip overhead was observed when the gossip messages were sent at intervals of 256 ms or more. As expected, the overhead that is observed at higher rates is sensitive to the communication pattern of the application and the amount of gossip information being circulated.

EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface | 2012

Automatic resource-centric process migration for MPI

Amnon Barak; Alexander Margolin; Amnon Shiloh

Process migration refers to the ability to move a running process from one node and make it continue on another. The MPI standard prescribes support for process migration, but so far it was implemented mostly via checkpoint-restart. This paper presents an automatic and transparent process migration framework that can be used for MPI processes. This framework is advantageous when migration of individual processes for purposes such as load-balancing is more adequate than checkpointing the whole job. The paper describes this framework for process migration in clusters and multi-clusters, how it was tuned for Open MPI and the performance of migrated MPI processes.

Software for Exascale Computing | 2016

FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing

Carsten Weinhold; Adam Lackorzynski; Jan Bierbaum; Martin Küttler; Maksym Planeta; Hermann Härtig; Amnon Shiloh; Ely Levy; Tal Ben-Nun; Amnon Barak; Thomas Steinke; Thorsten Schütt; Jan Fajerski; Alexander Reinefeld; Matthias Lieber; Wolfgang E. Nagel

The FFMK project designs, builds and evaluates a system-software architecture to address the challenges expected in Exascale systems. In particular, these challenges include performance losses caused by the much larger impact of runtime variability within applications, hardware, and operating system (OS), as well as increased vulnerability to failures. The FFMK OS platform is built upon a multi-kernel architecture, which combines the L4Re microkernel and a virtualized Linux kernel into a noise-free, yet feature-rich execution environment. It further includes global, distributed platform management and system-level optimization services that transparently minimize checkpoint/restart overhead for applications. The project also researched algorithms to make collective operations fault-tolerant in the presence of failing nodes. In this paper, we describe the basic components, algorithms, and services we developed in Phase 2 of the project.

International Parallel and Distributed Processing Symposium | 2017

Corrected Gossip Algorithms for Fast Reliable Broadcast on Unreliable Systems

Torsten Hoefler; Amnon Barak; Amnon Shiloh; Zvi Drezner

Large-scale parallel programming environments and algorithms require efficient group-communication on computing systems with failing nodes. Existing reliable broadcast algorithms either cannot guarantee that all nodes are reached or are very expensive in terms of the number of messages and latency. This paper proposes Corrected-Gossip, a method that combines Monte Carlo style gossiping with a deterministic correction phase, to construct a Las Vegas style reliable broadcast that guarantees reaching all the nodes at low cost. We analyze the performance of this method both analytically and by simulations and show how it reduces the latency and network load compared to existing algorithms. Our method improves the latency by 20% and the network load by 53% compared to the fastest known algorithm on 4,096 nodes. We believe that the principle of corrected-gossip opens an avenue for many other reliable group communication operations.
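
The two-phase idea described in this abstract, a randomized gossip phase followed by a deterministic correction phase, can be sketched as follows. This is an illustrative simplification and not the algorithm from the paper: node failures are not modeled, the correction is assumed to walk a ring of node indices, the final message that reaches an already informed successor is not counted, and the function name `corrected_gossip` and its parameters are hypothetical.

```python
import random


def corrected_gossip(n, gossip_rounds, seed=0, root=0):
    """Broadcast from `root` to n nodes; returns (has_msg, messages_sent)."""
    rng = random.Random(seed)
    has_msg = [False] * n
    has_msg[root] = True
    messages = 0

    # Phase 1 (Monte Carlo style): every informed node pushes the message
    # to one uniformly random node per round. Some nodes may be missed.
    for _ in range(gossip_rounds):
        senders = [i for i in range(n) if has_msg[i]]
        for _sender in senders:
            has_msg[rng.randrange(n)] = True
            messages += 1

    # Phase 2 (deterministic correction): each node informed by gossip
    # forwards the message along the ring until it meets the next node
    # that gossip had already informed. Every gap is therefore filled.
    colored = list(has_msg)          # frozen result of the gossip phase
    for s in range(n):
        if not colored[s]:
            continue
        nxt = (s + 1) % n
        while not colored[nxt]:      # stops at s itself in the worst case
            has_msg[nxt] = True
            messages += 1
            nxt = (nxt + 1) % n

    return has_msg, messages


if __name__ == "__main__":
    for rounds in (0, 3, 6):
        reached, cost = corrected_gossip(64, rounds)
        print(rounds, all(reached), cost)
```

In this sketch the outcome is always a complete broadcast while the message count depends on the random gossip phase, which is the Las Vegas property the abstract refers to. With zero gossip rounds the correction phase degenerates into a plain ring broadcast from the root.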
