Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ozalp Babaoglu is active.

Publication


Featured research published by Ozalp Babaoglu.


ACM Transactions on Computer Systems | 2005

Gossip-based aggregation in large dynamic networks

Márk Jelasity; Alberto Montresor; Ozalp Babaoglu

As computer networks increase in size, become more heterogeneous and span greater geographic distances, applications must be designed to cope with the very large scale, poor reliability, and often, with the extreme dynamism of the underlying network. Aggregation is a key functional building block for such applications: it refers to a set of functions that provide components of a distributed system access to global information including network size, average load, average uptime, location and description of hotspots, and so on. Local access to global information is often very useful, if not indispensable for building applications that are robust and adaptive. For example, in an industrial control application, some aggregate value reaching a threshold may trigger the execution of certain actions; a distributed storage system will want to know the total available free space; load-balancing protocols may benefit from knowing the target average load so as to minimize the load they transfer. We propose a gossip-based protocol for computing aggregate values over network components in a fully decentralized fashion. The class of aggregate functions we can compute is very broad and includes many useful special cases such as counting, averages, sums, products, and extremal values. The protocol is suitable for extremely large and highly dynamic systems due to its proactive structure---all nodes receive the aggregate value continuously, thus being able to track any changes in the system. The protocol is also extremely lightweight, making it suitable for many distributed applications including peer-to-peer and grid computing systems. We demonstrate the efficiency and robustness of our gossip-based protocol both theoretically and experimentally under a variety of scenarios including node and communication failures.
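
To make the mechanism concrete, here is a minimal sketch of push-pull gossip averaging in the spirit of the protocol described above; the synchronous rounds, uniform peer sampling, and function names are simplifying assumptions for illustration, not the paper's actual implementation.

```python
import random

def gossip_average(values, rounds=30):
    """Simulate synchronous push-pull gossip averaging.

    In every round each node exchanges its current estimate with a
    randomly chosen peer and both adopt the midpoint; the global sum
    is preserved, so all estimates converge to the true average.
    """
    estimates = list(values)
    n = len(estimates)
    for _ in range(rounds):
        for i in range(n):
            j = random.randrange(n)                # uniform peer sampling (assumed)
            mid = (estimates[i] + estimates[j]) / 2.0
            estimates[i] = estimates[j] = mid      # push-pull: both sides update
    return estimates

# Example: 1000 nodes with random local loads learn the average load.
loads = [random.uniform(0.0, 100.0) for _ in range(1000)]
print(sum(loads) / len(loads))      # true average
print(gossip_average(loads)[0])     # any single node's converged estimate
```

Counting follows as a special case: if one node starts with value 1 and all others with 0, every estimate converges to 1/n, from which the network size n can be read off.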


International Conference on Distributed Computing Systems | 2002

Anthill: a framework for the development of agent-based peer-to-peer systems

Ozalp Babaoglu; Hein Meling; Alberto Montresor

Recent peer-to-peer (P2P) systems are characterized by decentralized control, large scale and extreme dynamism of their operating environment. As such, they can be seen as instances of complex adaptive systems (CAS) typically found in biological and social sciences. We describe Anthill, a framework to support the design, implementation and evaluation of P2P applications based on ideas such as multi-agent and evolutionary programming borrowed from CAS. An Anthill system consists of a dynamic network of peer nodes; societies of adaptive agents travel through this network, interacting with nodes and cooperating with other agents in order to solve complex problems. Anthill can be used to construct different classes of P2P services that exhibit resilience, adaptation and self-organization properties. We also describe preliminary experiences with Anthill in implementing a file sharing application.
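
As a toy illustration of the agent metaphor, the sketch below has an "ant" random-walk across a network of nests looking for a resource; the class and function names are hypothetical, and Anthill itself provides a much richer framework than this.

```python
import random

class Nest:
    """A peer node holding local resources and links to neighbor nests."""
    def __init__(self, name, resources=()):
        self.name = name
        self.resources = set(resources)
        self.neighbors = []

def ant_search(start, target, ttl=16):
    """An 'ant' agent random-walks across nests looking for a resource,
    returning the path it took on success or None when its TTL expires."""
    nest, path = start, [start.name]
    for _ in range(ttl):
        if target in nest.resources:
            return path
        if not nest.neighbors:
            return None
        nest = random.choice(nest.neighbors)   # hop to a random neighbor
        path.append(nest.name)
    return None

# Three nests in a line; an ant starting at a finds the file stored at c.
a, b, c = Nest("a"), Nest("b"), Nest("c", {"song.mp3"})
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
print(ant_search(a, "song.mp3"))   # e.g. ['a', 'b', 'c']
```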


ACM Transactions on Autonomous and Adaptive Systems | 2006

Design patterns from biology for distributed computing

Ozalp Babaoglu; Geoffrey Canright; Andreas Deutsch; Gianni A. Di Caro; Frederick Ducatelle; Luca Maria Gambardella; Niloy Ganguly; Márk Jelasity; Roberto Montemanni; Alberto Montresor; Tore Urnes

Recent developments in information technology have brought about important changes in distributed computing. New environments such as massively large-scale, wide-area computer networks and mobile ad hoc networks have emerged. Common characteristics of these environments include extreme dynamicity, unreliability, and large scale. Traditional approaches to designing distributed applications in these environments based on central control, small scale, or strong reliability assumptions are not suitable for exploiting their enormous potential. Based on the observation that living organisms can effectively organize large numbers of unreliable and dynamically-changing components (cells, molecules, individuals, etc.) into robust and adaptive structures, it has long been a research challenge to characterize the key ideas and mechanisms that make biological systems work and to apply them to distributed systems engineering. In this article we propose a conceptual framework that captures several basic biological processes in the form of a family of design patterns. Examples include plain diffusion, replication, chemotaxis, and stigmergy. We show through examples how to implement important functions for distributed computing based on these patterns. Using a common evaluation methodology, we show that our bio-inspired solutions have performance comparable to traditional, state-of-the-art solutions while they inherit desirable properties of biological systems including adaptivity and robustness.
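
As one example of such a pattern, here is a minimal sketch of plain diffusion on a static graph; the names and the edge-flow update rule are illustrative assumptions, not the article's formal pattern definition.

```python
def diffusion_step(values, neighbors, alpha=0.25):
    """One round of a plain-diffusion pattern: along every edge, a
    fraction of the difference between the two endpoint quantities
    flows toward the smaller side. The flows cancel in pairs, so the
    total is conserved while repeated rounds spread any initial
    concentration evenly across a connected graph."""
    new = dict(values)
    for node, ns in neighbors.items():
        for n in ns:
            new[node] += alpha * (values[n] - values[node])
    return new

# A 4-node path graph with all the load initially on node 0.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
values = {0: 100.0, 1: 0.0, 2: 0.0, 3: 0.0}
for _ in range(50):
    values = diffusion_step(values, neighbors)
print(values)   # every node approaches 25.0
```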


Computer Networks | 2009

T-Man: Gossip-based fast overlay topology construction

Márk Jelasity; Alberto Montresor; Ozalp Babaoglu

Large-scale overlay networks have become crucial ingredients of fully-decentralized applications and peer-to-peer systems. Depending on the task at hand, overlay networks are organized into different topologies, such as rings, trees, semantic and geographic proximity networks. We argue that the central role overlay networks play in decentralized application development requires a more systematic study and effort towards understanding the possibilities and limits of overlay network construction in its generality. Our contribution in this paper is a gossip protocol called T-Man that can build a wide range of overlay networks from scratch, relying only on minimal assumptions. The protocol is fast, robust, and very simple. It is also highly configurable as the desired topology itself is a parameter in the form of a ranking method that orders nodes according to preference for a base node to select them as neighbors. The paper presents extensive empirical analysis of the protocol along with theoretical analysis of certain aspects of its behavior. We also describe a practical application of T-Man for building Chord distributed hash table overlays efficiently from scratch.
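
A single node's step in this style of protocol might look like the following sketch; the synchronous setting, fixed view size, and function names are assumptions for illustration, not T-Man's precise specification.

```python
def tman_step(node_id, view, peer_view, rank, view_size=4):
    """One simplified T-Man step at one node: merge the local view with
    descriptors received from a peer, then keep only the entries the
    ranking function prefers as neighbors of this node. Repeating this
    at every node drives the overlay toward the topology `rank` encodes."""
    candidates = set(view) | set(peer_view)
    candidates.discard(node_id)
    return sorted(candidates, key=lambda other: rank(node_id, other))[:view_size]
```

In the full protocol the peer to exchange views with is itself picked as a highly ranked entry of the current view, which is part of what makes convergence fast.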


ESOA'05: Proceedings of the Third International Conference on Engineering Self-Organising Systems | 2005

T-Man: gossip-based overlay topology management

Márk Jelasity; Ozalp Babaoglu

Overlay topology plays an important role in P2P systems. Topology serves as a basis for achieving functions such as routing, searching and information dissemination, and it has a major impact on their efficiency, cost and robustness. Furthermore, the solution to problems such as sorting and clustering of nodes can also be interpreted as a topology. In this paper we propose a generic protocol, T-MAN, for constructing and maintaining a large class of topologies. In the proposed framework, a topology is defined with the help of a ranking function. The nodes participating in the protocol can use this ranking function to order any set of other nodes according to preference for choosing them as a neighbor. This simple abstraction makes it possible to control the self-organization process of topologies in a straightforward, intuitive and flexible manner. At the same time, the T-MAN protocol involves only local communication to increase the quality of the current set of neighbors of each node. We show that this bottom-up approach results in fast convergence and high robustness in dynamic environments. The protocol can be applied as a standalone solution as well as a component for recovery or bootstrapping of other protocols.
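
The "topology as a ranking function" idea is easy to make concrete; below is a sketch for a ring topology over a fixed identifier space (the space size and names are assumed for illustration).

```python
N = 2 ** 16   # size of the circular identifier space (assumed)

def ring_rank(base, other):
    """Ranking for a ring topology: a node prefers neighbors whose IDs
    are closest to its own on the circle, in either direction."""
    d = abs(base - other) % N
    return min(d, N - d)

# Node 10 prefers nearby IDs over distant ones:
print(sorted([200, 11, 9], key=lambda x: ring_rank(10, x)))   # [11, 9, 200]
```

Substituting a different ranking function, say Euclidean distance between node coordinates, yields a proximity network instead, with no change to the protocol itself.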


SIAM Journal on Computing | 1984

On the optimum checkpoint selection problem

Sam Toueg; Ozalp Babaoglu

We consider a model of computation consisting of a sequence of n tasks. In the absence of failures, each task i has a known completion time t_i. Checkpoints can be placed between any two consecutive tasks. At a checkpoint, the state of the computation is saved on a reliable storage medium. Establishing a checkpoint immediately before task i is known to cost s_i. This is the time spent in saving the state of the computation. When a failure is detected, the computation is restarted at the most recent checkpoint. Restarting the computation at checkpoint i requires restoring the state to the previously saved value. The time necessary for this action is given by r_i. We derive an O(n^3) algorithm to select out of the n - 1 potential checkpoint locations those that result in the smallest expected time to complete all the tasks. An O(n^2) algorithm is described for the reasonable case where s_i > s_j implies r_i ≥ r_j. These algorithms are applied to two models of failure. In the first one, each task i has a given probability p_i of completing without a failure, i.e., in time t_i. Furthermore, failures occur independently and are detected at the end of the task during which they occur. The second model admits a continuous-time failure mode where the failure intervals are independent and identically distributed random variables drawn from any given distribution. In this model, failures are detected immediately. In both models, the algorithm also gives the expected value of the overall completion time and we show how to derive all the other moments.

1. Introduction. A variety of hardware and software techniques have been proposed to increase the reliability of computing systems that are inherently unreliable. One such software technique is rollback-recovery. In this scheme, the program is checkpointed from time to time by saving its state on secondary storage and the computation is restarted at the most recent checkpoint after the detection of a failure [6]. Between the times when the failure is detected and the computation is restarted, the computation must be rolled back to the most recent checkpoint by restoring its state to the saved value. Obviously, rollback-recovery is an effective method only against transient failures. Examples of such failures are temporary hardware malfunctions, deadlocks due to resource contention, incorrect human interactions with the computation, and other external factors that can corrupt the computation's state. Persisting failures will block the computation no matter how many times it is rolled back. The ability to detect failures is an essential part of any fault-tolerance method, including rollback-recovery. Examples of such failure detection methods are integrity assertion checking [8] and fail-stop processors [10]. In the absence of checkpoints, the computation has to be restarted from the beginning whenever a failure is detected. It is clear that with respect to many objectives such as minimum completion time, minimum recovery overhead, maximum throughput, etc., the positioning of the checkpoints involves certain tradeoffs. A survey of an analytical framework for resolving some of these tradeoffs is presented by Chandy [1]. Young [11] and Chandy et al. [3] addressed the problem of finding the checkpoint interval so as to minimize the time lost due to recovery for a never-ending program subject to failures constituting a Poisson process. Gelenbe and Derochette [5] and Gelenbe [4] have generalized this result to allow the possibility of external requests arriving during the establishment of a checkpoint or rollback-recovery.
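
To illustrate the flavor of the optimization under the first failure model, here is a hedged dynamic-programming sketch; it simplifies the paper's analysis by assuming a failure is only detected at the end of its segment (so each failed attempt costs the whole segment), and all names are illustrative.

```python
def optimal_checkpoints(t, p, s, r):
    """Choose checkpoint positions minimizing expected completion time.

    Simplified version of the first failure model: task k takes time
    t[k] and succeeds with probability p[k]; a checkpoint placed just
    before task k costs s[k] to save and r[k] to restore. Each failed
    attempt is assumed to cost the full segment time plus the rollback
    (a simplification). The naive segment costing makes this O(n^3),
    matching the paper's general bound.
    """
    def seg(j, i, restore):
        # Expected time to push tasks j..i-1 past their checkpoint,
        # retrying the whole segment (plus `restore`) after each failure.
        T = sum(t[j:i])
        P = 1.0
        for k in range(j, i):
            P *= p[k]
        return T + (1.0 / P - 1.0) * (T + restore)

    n = len(t)
    best = [(0.0, ())]   # best[i] = (expected time for tasks 0..i-1, checkpoint spots)
    for i in range(1, n + 1):
        cands = [(seg(0, i, 0.0), ())]      # no checkpoint: restart from the beginning
        for j in range(1, i):               # last checkpoint sits just before task j
            e, cps = best[j]
            cands.append((e + s[j] + seg(j, i, r[j]), cps + (j,)))
        best.append(min(cands))
    return best[n]

# Four identical failure-prone tasks; a middle checkpoint pays off.
exp_time, spots = optimal_checkpoints(t=[10] * 4, p=[0.9] * 4, s=[1] * 4, r=[2] * 4)
print(exp_time, spots)
```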


Hawaii International Conference on System Sciences | 1995

RELACS: A communications infrastructure for constructing reliable applications in large-scale distributed systems

Ozalp Babaoglu; Renzo Davoli; Luigi-Alberto Giachini; M. Gray Baker

Distributed systems that span large geographic distances or manage large numbers of objects are already commonplace. In such systems, programming applications with even modest reliability requirements to run correctly and efficiently is a difficult task due to asynchrony and the possibility of complex failure scenarios. We describe the architecture of the RELACS communication subsystem that constitutes the microkernel of a layered approach to reliable computing in large-scale distributed systems. RELACS is designed to be highly portable and implements a very small number of abstractions and primitives that should be sufficient for building a variety of interesting higher-level paradigms.


Dependable Systems and Networks | 2001

Online reconfiguration in replicated databases based on group communication

Bettina Kemme; Alberto Bartoli; Ozalp Babaoglu

Over the past few years, many replica control protocols have been developed that take advantage of the ordering and reliability semantics of group communication primitives to simplify database system design and to improve performance. Although current solutions are able to mask site failures effectively, many of them are unable to cope with recovery of failed sites, merging of partitions, or joining of new sites. This paper addresses this important issue. It proposes efficient solutions for online system reconfiguration, providing new sites with a current state of the database without interrupting transaction processing in the rest of the system. Furthermore, the paper analyzes the impact of cascading reconfigurations, and argues that they can be handled in an elegant way by extended forms of group communication.
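
A toy sketch of non-blocking state transfer to a joining site follows; the totally ordered delivery of updates is assumed to come from the group communication layer, and the classes here are illustrative rather than the paper's protocol.

```python
class ReplicaSite:
    """Toy replica illustrating online reconfiguration: a joining site
    receives a snapshot while updates keep flowing, buffers the updates
    delivered during the transfer, and applies them afterwards."""
    def __init__(self, site_id):
        self.site_id = site_id
        self.db = {}
        self.joining = False
        self.buffer = []

    def deliver_update(self, key, value):
        # Updates arrive in the same total order at every site (the
        # guarantee provided by the group communication layer).
        if self.joining:
            self.buffer.append((key, value))   # don't touch the db mid-transfer
        else:
            self.db[key] = value

    def install_snapshot(self, snapshot):
        self.db = dict(snapshot)               # state copied from an existing member
        for key, value in self.buffer:         # catch up on buffered updates; for
            self.db[key] = value               # key-value writes in total order,
        self.buffer.clear()                    # reapplying one is harmless
        self.joining = False

# A new site joins while updates continue elsewhere:
donor = ReplicaSite("A")
donor.deliver_update("x", 1)
joiner = ReplicaSite("B")
joiner.joining = True                          # transfer in progress
for site in (donor, joiner):
    site.deliver_update("x", 2)                # update delivered during the transfer
joiner.install_snapshot(donor.db)
assert joiner.db == donor.db == {"x": 2}
```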


IEEE Transactions on Software Engineering | 1985

Streets of Byzantium: Network Architectures for Fast Reliable Broadcasts

Ozalp Babaoglu; Rogerio Drummond

A site broadcasting its local value to all other sites in a fault-prone environment is a fundamental paradigm in constructing reliable distributed systems. Time complexity lower bounds and network connectivity requirements for reliable broadcast protocols in point-to-point communication networks are well known. In this paper, we consider the reliable broadcast problem in distributed systems with broadcast networks (for example, Ethernets) as the basic communication architecture. We show how properties of such network architectures can be used to effectively restrict the externally visible behavior of faulty processors. We use these techniques to derive simple protocols that implement reliable broadcast in only two rounds, independent of the failure upper bounds.
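
The abstract's key observation is that a shared broadcast medium lets correct processes cross-check what everyone claims to have heard. The sketch below is an illustrative majority-of-echoes decision rule in that spirit, not the paper's actual protocol.

```python
from collections import Counter

def decide_from_echoes(echoes):
    """Round 2 of an illustrative echo broadcast: each process decides
    on the majority value among the echoes it received. On a broadcast
    network all correct processes receive the same echo set, so they
    all decide identically even if a minority of echoers lie."""
    heard = [v for v in echoes.values() if v is not None]
    value, _ = Counter(heard).most_common(1)[0]
    return value

# Round 1 (implicit): the sender broadcast "v"; correct receivers echo it.
echoes = {"p1": "v", "p2": "v", "p3": "forged", "p4": "v"}   # p3 is faulty
print(decide_from_echoes(echoes))   # -> "v"
```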


World of Wireless, Mobile and Multimedia Networks | 2011

Server consolidation in Clouds through gossiping

Moreno Marzolla; Ozalp Babaoglu; Fabio Panzieri

The success of Cloud computing, where computing power is treated as a utility, has resulted in the creation of many large datacenters that are very expensive to build and operate. In particular, the energy bill accounts for a significant fraction of the total operation costs. For this reason, significant attention is being devoted to energy conservation techniques, for example by taking advantage of the built-in power saving features of modern hardware. Cloud computing offers novel opportunities for achieving energy savings: Cloud systems rely on virtualization techniques to allocate computing resources on demand, and modern Virtual Machine (VM) monitors allow live migration of running VMs. Thus, energy conservation can be achieved through server consolidation, moving VM instances away from lightly loaded computing nodes so that they become empty and can be switched to low-power mode. In this paper we present V-MAN, a fully decentralized algorithm for consolidating VMs in large Cloud datacenters. V-MAN can operate on any arbitrary initial allocation of VMs on the Cloud, iteratively producing new allocations that quickly converge towards the one maximizing the number of idle hosts. V-MAN uses a simple gossip protocol to achieve efficiency, scalability and robustness to failures. Simulation experiments indicate that, starting from a random allocation, V-MAN produces an almost-optimal VM placement in just a few rounds; the protocol is intrinsically robust and can cope with computing nodes being added to or removed from the Cloud.
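
A minimal sketch in the spirit of V-MAN follows; the pairwise exchange rule, the uniform capacity model, and the names are assumptions for illustration, not the published algorithm.

```python
import random

def consolidation_round(hosts, capacity):
    """One gossip round of a simplified consolidation scheme: random
    pairs of hosts compare loads, and the host running fewer VMs
    migrates as many as fit onto its fuller peer. Repeating this drains
    lightly loaded hosts, which can then be switched to low-power mode."""
    order = list(range(len(hosts)))
    random.shuffle(order)
    for i in order:
        j = random.randrange(len(hosts))
        if i == j:
            continue
        src, dst = (i, j) if hosts[i] <= hosts[j] else (j, i)
        moved = min(hosts[src], capacity - hosts[dst])   # migrate what fits
        hosts[src] -= moved
        hosts[dst] += moved
    return hosts

# 20 hosts of capacity 8 with a scattered initial VM placement.
hosts = [random.randint(1, 4) for _ in range(20)]
for _ in range(5):
    consolidation_round(hosts, capacity=8)
print(hosts.count(0), "hosts are now idle")
```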

Collaboration


Dive into Ozalp Babaoglu's collaboration.

Top Co-Authors

Rogerio Drummond
State University of Campinas

Stefano Leonardi
Sapienza University of Rome

Lorenzo Alvisi
University of Texas at Austin