
Publication


Featured research published by Raoul Bhoedjang.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1999

MagPIe: MPI's collective communication operations for clustered wide area systems

Thilo Kielmann; Rutger F. H. Hofman; Henri E. Bal; Aske Plaat; Raoul Bhoedjang

Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms is collective operations, such as broadcast and reduce. We have developed MAGPIE, a library of collective communication operations optimized for wide area systems. MAGPIE's algorithms send the minimal amount of data over the slow wide area links and incur only a single wide area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, MAGPIE executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, MAGPIE's advantage increases for higher wide area latencies.
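The wide-area optimization the abstract describes can be illustrated with a toy latency model. Only the 10 ms wide-area figure comes from the abstract; the local-area latency and all function names below are illustrative assumptions, not the paper's actual algorithms:

```python
import math

WAN_LAT = 10e-3   # 10 ms wide-area latency (figure from the abstract)
LAN_LAT = 50e-6   # 50 us local-area latency (illustrative assumption)

def flat_binomial_bcast(n_total):
    """Topology-unaware binomial broadcast: about log2(n) steps, and in
    the worst case every step crosses a slow wide-area link."""
    steps = math.ceil(math.log2(n_total))
    return steps * WAN_LAT

def wide_area_bcast(n_total, clusters):
    """Wide-area-aware broadcast: the data crosses the slow links once
    (to one coordinator per cluster, overlapped), then each coordinator
    broadcasts locally over the fast network."""
    per_cluster = n_total // clusters
    local_steps = math.ceil(math.log2(per_cluster))
    return WAN_LAT + local_steps * LAN_LAT

# 64 nodes in 4 clusters: the topology-aware scheme pays WAN latency once
assert wide_area_bcast(64, 4) < flat_binomial_bcast(64)
```

The point of the sketch: the topology-aware scheme pays the slow link once, so its advantage grows with wide-area latency, which matches the abstract's final claim.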


IEEE Computer | 1998

User-level network interface protocols

Raoul Bhoedjang; Tim Rühl; Henri E. Bal

Modern high-speed local area networks offer great potential for communication-intensive applications, but their performance is limited by the use of traditional communication protocols, such as TCP/IP. In most cases, these protocols require that all network access go through the operating system, which adds significant overhead to both the transmission path (typically a system call and a data copy) and the receive path (typically an interrupt, a system call, and a data copy). To address this performance problem, several user-level communication architectures have been developed that remove the operating system from the critical communication path. The article describes six important issues to consider in designing communication protocols for user-level architectures. The issues discussed focus on the performance and semantics of a communication system: data transfer, address translation, protection, and control transfer mechanisms, as well as reliability and multicast. To provide a basis for analyzing these issues, the authors present a simple network interface protocol for Myricom's Myrinet network, which has a programmable network interface. Researchers can thus explore many protocol design options, and several groups have designed communication systems for Myrinet. The authors refer to 11 such systems, all of which differ significantly in how they resolve these design issues but all of which aim for high performance and provide a lean, low-level, and more or less generic communication facility.
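As a rough illustration of the data-transfer and control-transfer issues the article names, here is a toy shared ring buffer between host and network interface. This is purely hypothetical Python; real user-level protocols implement this in C over DMA-able memory and NI firmware:

```python
class RingBuffer:
    """Toy host/NI message ring: the host produces, the NI consumes,
    and neither side ever traps into the operating system."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0   # next slot the host writes
        self.tail = 0   # next slot the NI reads

    def send(self, msg):
        nxt = (self.head + 1) % len(self.slots)
        if nxt == self.tail:
            return False          # ring full: back-pressure, not a syscall
        self.slots[self.head] = msg
        self.head = nxt           # publishing the slot is the only handoff
        return True

    def poll(self):
        if self.tail == self.head:
            return None           # nothing pending
        msg = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        return msg

rb = RingBuffer(4)
assert rb.send("a") and rb.send("b") and rb.send("c")
assert not rb.send("d")           # only slots - 1 entries fit
assert rb.poll() == "a"
```

The design choice the sketch highlights: the sender learns about a full ring from a flag it can test in user space, so neither data transfer nor flow control requires crossing the kernel boundary.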


ACM Transactions on Computer Systems | 1998

Performance evaluation of the Orca shared-object system

Henri E. Bal; Raoul Bhoedjang; Rutger F. H. Hofman; Ceriel J. H. Jacobs; Koen Langendoen; Tim Rühl; M. Frans Kaashoek

Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for 10 parallel applications illustrate the trade-offs made in the design of Orca and show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the article compares the performance of Orca with that of a page-based DSM (TreadMarks) and another object-based DSM (CRL). It also analyzes the communication overhead of the DSMs for several applications. All performance measurements are done on a 32-node Pentium Pro cluster with Myrinet and Fast Ethernet networks. The results show that Orca programs send fewer messages and less data than the TreadMarks and CRL programs and obtain better speedups.
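The combination of write-updates and function shipping can be sketched as follows. This is a minimal model under assumed names, not Orca's actual implementation:

```python
class SharedObject:
    """Toy replicated object: reads are local, writes ship the operation
    (a function) to every replica and re-execute it there."""

    def __init__(self, value, replicas):
        self.replicas = [value] * replicas   # one copy per processor

    def read(self, proc):
        return self.replicas[proc]           # local access, no communication

    def write(self, op):
        # "function shipping": broadcast the operation itself rather than
        # the resulting data, and apply it at every replica
        self.replicas = [op(v) for v in self.replicas]

obj = SharedObject(0, replicas=4)
obj.write(lambda v: v + 10)
assert all(obj.read(p) == 10 for p in range(4))
```

The sketch shows why a low read/write ratio argues against replication, as the abstract notes: each write touches every replica, so replication only pays off when reads dominate.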


Operating Systems Review | 2000

The Distributed ASCI Supercomputer project

Henri E. Bal; Raoul Bhoedjang; Rutger F. H. Hofman; Ceriel J. H. Jacobs; Thilo Kielmann; Jason Maassen; Rob V. van Nieuwpoort; John W. Romein; Luc Renambot; Tim Rühl; Ronald Veldema; Kees Verstoep; Aline Baggio; G.C. Ballintijn; Ihor Kuz; Guillaume Pierre; Maarten van Steen; Andrew S. Tanenbaum; G. Doornbos; Desmond Germans; Hans J. W. Spoelder; Evert Jan Baerends; Stan J. A. van Gisbergen; Hamideh Afsermanesh; Dick Van Albada; Adam Belloum; David Dubbeldam; Z.W. Hendrikse; Bob Hertzberger; Alfons G. Hoekstra

The Distributed ASCI Supercomputer (DAS) is a homogeneous wide-area distributed system consisting of four cluster computers at different locations. DAS has been used for research on communication software, parallel languages and programming systems, schedulers, parallel applications, and distributed applications. The paper gives a preview of the most interesting research results obtained so far in the DAS project.


Proceedings of the 2001 Joint ACM-ISCOPE Conference on Java Grande | 2001

Runtime optimizations for a Java DSM implementation

Ronald Veldema; Rutger F. H. Hofman; Raoul Bhoedjang; Henri E. Bal

Jackal is a fine-grained distributed shared memory implementation of the Java programming language. Jackal implements Java's memory model and allows multithreaded Java programs to run unmodified on distributed-memory systems.

This paper focuses on Jackal's runtime system, which implements a multiple-writer, home-based consistency protocol. Protocol actions are triggered by software access checks that Jackal's compiler inserts before object and array references. We describe optimizations for Jackal's runtime system, which mainly consist of discovering opportunities to dispense with flushing of cached data. We give performance results for different runtime optimizations, and compare their impact with the impact of one compiler optimization. We find that our runtime optimizations are necessary for good Jackal performance, but only in conjunction with the Jackal compiler optimizations described in [24]. As a yardstick, we compare the performance of Java applications run on Jackal with the performance of equivalent applications that use a fast implementation of Java's Remote Method Invocation (RMI) instead of shared memory.
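The compiler-inserted access check the abstract mentions can be sketched like this. The sketch is illustrative Python with hypothetical names; Jackal inserts these checks at compile time as inlined machine-level code:

```python
home = {"obj1": [1, 2, 3]}   # the object's home node (hypothetical storage)
cache = {}                    # this node's cached copies
fetches = 0                   # counts home-node round trips

def access_check(ref):
    """Check inserted before each object/array reference: on a miss,
    fetch a local copy from the object's home node."""
    global fetches
    if ref not in cache:
        cache[ref] = list(home[ref])   # miss: one round trip to home
        fetches += 1
    return cache[ref]

access_check("obj1")
access_check("obj1")          # second check hits the local cache
assert fetches == 1
```

The runtime optimizations the paper describes then amount to keeping entries like `cache["obj1"]` valid longer, so that fewer checks miss and fewer cached copies need to be flushed.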


International Conference on Parallel Processing | 1998

Efficient multicast on Myrinet using link-level flow control

Raoul Bhoedjang; Tim Rühl; Henri E. Bal

This paper studies the implementation of efficient multicast protocols for Myrinet, a switched, wormhole-routed Gigabit-per-second network technology. Since Myrinet does not support multicasting in hardware, multicast services must be implemented in software. We present a new, efficient, and reliable software multicast protocol that uses the network interface to efficiently forward multicast traffic. The new protocol is constructed on top of reliable, flow-controlled channels between pairs of network interfaces. We describe the design of the protocol and make a detailed comparison with a previous multicast protocol. We show that our protocol is simpler and scales better than the previous protocol. This claim is supported by extensive performance measurements on a 64-node Myrinet cluster.
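The forwarding idea (each network interface passes the message on along a spanning tree, so the host CPUs are not involved) can be sketched as follows. The binary tree shape and all names are illustrative assumptions, not the paper's actual protocol:

```python
def multicast(root, n_nodes, deliver):
    """Deliver a message to all n_nodes nodes by forwarding it down a
    binary spanning tree rooted at `root`. Node ids are relabeled so the
    root maps to position 0 of the tree."""
    def forward(pos):
        deliver((pos + root) % n_nodes)          # this NI delivers locally
        for child in (2 * pos + 1, 2 * pos + 2):
            if child < n_nodes:
                forward(child)                   # in reality: the child's NI
    forward(0)

got = []
multicast(root=3, n_nodes=8, deliver=got.append)
assert sorted(got) == list(range(8)) and got[0] == 3
```

Building this on reliable, flow-controlled channels between interface pairs (as the abstract describes) means a forwarding NI never overruns a slow child, which is what keeps the protocol simple and reliable.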


Symposium on the Frontiers of Massively Parallel Computation | 1996

Integrating polling, interrupts, and thread management

Koen Langendoen; John W. Romein; Raoul Bhoedjang; Henri E. Bal

Many user-level communication systems receive network messages by polling the network adapter from user space. While polling avoids the overhead of interrupt-based mechanisms, it is not suited for all parallel applications. This paper describes a general-purpose, multithreaded, communication system that uses both polling and interrupts to receive messages. Users need not insert polls into their code; through a careful integration of the user-level communication software with a user-level thread scheduler, the system can automatically switch between polling and interrupts. We have evaluated the performance of this integrated system on Myrinet, using a synthetic benchmark and a number of applications that have very different communication requirements. We show that the integrated system achieves robust performance: in most cases, it performs as well as or better than systems that rely exclusively on interrupts or polling.
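The core policy (poll while there is useful work to interleave with, fall back to interrupts when every thread blocks) can be sketched as a scheduling decision. This is a simplification under assumed names, not the paper's actual scheduler:

```python
def receive_mode(runnable_threads, blocked_threads):
    """Pick the message-reception mechanism, as in a communication system
    integrated with a user-level thread scheduler."""
    if runnable_threads > 0:
        return "poll"        # cheap polls piggyback on thread switches
    if blocked_threads > 0:
        return "interrupt"   # nothing to run: sleep until the NI interrupts
    return "poll"            # fully idle: spin on the network

assert receive_mode(3, 0) == "poll"
assert receive_mode(0, 2) == "interrupt"
assert receive_mode(0, 0) == "poll"
```

The sketch shows why users need not insert polls by hand: the switch between mechanisms follows directly from scheduler state the runtime already tracks.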


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2001

Source-level global optimizations for fine-grain distributed shared memory systems

Ronald Veldema; Rutger F. H. Hofman; Raoul Bhoedjang; Ceriel J. H. Jacobs; Henri E. Bal

This paper describes and evaluates the use of aggressive static analysis in Jackal, a fine-grain Distributed Shared Memory (DSM) system for Java. Jackal uses an optimizing, source-level compiler rather than the binary rewriting techniques employed by most other fine-grain DSM systems. Source-level analysis makes existing access-check optimizations (e.g., access-check batching) more effective and enables two novel fine-grain DSM optimizations: object-graph aggregation and automatic computation migration.

The compiler detects situations where an access to a root object is followed by accesses to subobjects. Jackal attempts to aggregate all access checks on objects in such object graphs into a single check on the graph's root object. If this check fails, the entire graph is fetched. Object-graph aggregation can reduce the number of network roundtrips and, since it is an advanced form of access-check batching, improves sequential performance.

Computation migration (or function shipping) is used to optimize critical sections in which a single processor owns both the shared data that is accessed and the lock that protects the data. It is usually more efficient to execute such critical sections on the processor that holds the lock and the data than to incur multiple roundtrips for acquiring the lock, fetching the data, writing the data back, and releasing the lock. Jackal's compiler detects such critical sections and optimizes them by generating single-roundtrip computation-migration code rather than standard data-shipping code.

Jackal's optimizations improve both sequential and parallel application performance. On average, sequential execution times of instrumented, optimized programs are within 10% of those of uninstrumented programs. Application speedups usually improve significantly and several Jackal applications perform as well as hand-optimized message-passing programs.
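Object-graph aggregation can be illustrated with a toy model (all names hypothetical): a single check on the root transfers the whole reachable graph, so later subobject accesses need no checks or round trips of their own.

```python
class Node:
    def __init__(self, val, children=()):
        self.val, self.children = val, list(children)

# home-node storage and this node's cache (hypothetical)
home = {"root": Node(1, [Node(2), Node(3, [Node(4)])])}
cache, roundtrips = {}, 0

def check_and_fetch_graph(ref):
    """Single aggregated access check on the graph's root: a miss fetches
    the root together with its reachable subobjects in one round trip."""
    global roundtrips
    if ref not in cache:
        cache[ref] = home[ref]
        roundtrips += 1
    return cache[ref]

def graph_sum(n):
    # subobject accesses carry no checks: the root check covered them
    return n.val + sum(graph_sum(c) for c in n.children)

r = check_and_fetch_graph("root")
assert graph_sum(r) == 10 and roundtrips == 1
check_and_fetch_graph("root")       # later traversals hit the cache
assert roundtrips == 1
```

Without aggregation, the same traversal would perform one check (and, on cold caches, one round trip) per object, four in this example instead of one.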


IEEE Concurrency | 1997

Models for asynchronous message handling

Koen Langendoen; Raoul Bhoedjang; Henri E. Bal

By implementing three well-known models (active messages, single-threaded upcalls, and popup threads) on the same user-level communication architecture, the authors show that expressiveness need not be sacrificed for performance.
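The contrast between two of the three models can be sketched as follows (illustrative only): an active-message handler runs inline in the receive path, so it is fast but must not block, while a popup thread may block freely at the cost of thread creation.

```python
import threading
import queue

results = queue.Queue()

def handler(arg):
    """The message handler supplied by the application."""
    results.put(arg * 2)

def active_message(arg):
    handler(arg)            # runs inline in the receive path: cheap, no blocking allowed

def popup_thread(arg):
    # a fresh thread per message: the handler may block without
    # stalling message reception
    t = threading.Thread(target=handler, args=(arg,))
    t.start()
    t.join()

active_message(21)
popup_thread(5)
assert sorted([results.get(), results.get()]) == [10, 42]
```

Single-threaded upcalls, the third model, sit between the two: handlers run outside the receive path but share one dedicated thread, so they may be preempted yet must still avoid blocking indefinitely.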


Journal of Parallel and Distributed Computing | 1997

Performance of a High-Level Parallel Language on a High-Speed Network

Henri E. Bal; Raoul Bhoedjang; Rutger F. H. Hofman; Ceriel J. H. Jacobs; Koen Langendoen; Tim Rühl; Kees Verstoep

Clusters of workstations are often claimed to be a good platform for parallel processing, especially if a fast network is used to interconnect the workstations. Indeed, high performance can be obtained for low-level message passing primitives on modern networks like ATM and Myrinet. Most application programmers, however, want to use higher level communication primitives. Unfortunately, implementing such primitives efficiently on a modern network is a difficult task, because their software overhead, relative to the raw network speed, is much higher than on a traditional, slow network (such as Ethernet). In this paper we investigate the issues involved in implementing a high-level programming environment on a fast network. We have implemented a portable runtime system for an object-based language (Orca) on a collection of processors connected by a Myrinet network. Many performance optimizations were required in order to let application programmers benefit sufficiently from the faster network. In particular, we have optimized message handling, multicasting, buffer management, fragmentation, marshalling, and various other issues. The paper analyzes the impact of these optimizations on the performance of the basic language primitives as well as parallel applications.

Collaboration


Dive into Raoul Bhoedjang's collaborations.

Top Co-Authors


Henri E. Bal

VU University Amsterdam


Koen Langendoen

Delft University of Technology


Tim Rühl

VU University Amsterdam


Ronald Veldema

University of Erlangen-Nuremberg
