John W. Romein | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where John W. Romein is active.

Explore More

Publication

Featured researches published by John W. Romein.

Operating Systems Review | 2000

The distributed ASCI Supercomputer project

Henri E. Bal; Raoul Bhoedjang; Rutger F. H. Hofman; Ceriel J. H. Jacobs; Thilo Kielmann; Jason Maassen; Rob V. van Nieuwpoort; John W. Romein; Luc Renambot; Tim Rühl; Ronald Veldema; Kees Verstoep; Aline Baggio; G.C. Ballintijn; Ihor Kuz; Guillaume Pierre; Maarten van Steen; Andrew S. Tanenbaum; G. Doornbos; Desmond Germans; Hans J. W. Spoelder; Evert Jan Baerends; Stan J. A. van Gisbergen; Hamideh Afsermanesh; Dick Van Albada; Adam Belloum; David Dubbeldam; Z.W. Hendrikse; Bob Hertzberger; Alfons G. Hoekstra

The Distributed ASCI Supercomputer (DAS) is a homogeneous wide-area distributed system consisting of four cluster computers at different locations. DAS has been used for research on communication software, parallel languages and programming systems, schedulers, parallel applications, and distributed applications. The paper gives a preview of the most interesting research results obtained so far in the DAS project.

acm sigplan symposium on principles and practice of parallel programming | 2008

ZOID: I/O-forwarding infrastructure for petascale architectures

Kamil Iskra; John W. Romein; Kazutomo Yoshii; Peter H. Beckman

The ZeptoOS project is developing an open-source alternative to the proprietary software stacks available on contemporary massively parallel architectures. The aim is to enable computer science research on these architectures, enhance community collaboration, and foster innovation. In this paper, we introduce a component of ZeptoOS called ZOID---an I/O-forwarding infrastructure for architectures such as IBM Blue Gene that decouple file and socket I/O from the compute nodes, shipping those functions to dedicated I/O nodes. Through the use of optimized network protocols and data paths, as well as a multithreaded daemon running on I/O nodes, ZOID provides greater performance than does the stock infrastructure. We present a set of benchmark results that highlight the improvements. Crucially, the flexibility of our infrastructure is a vast improvement over the stock infrastructure, allowing users to forward data using custom-designed application interfaces, through an easy-to-use plug-in mechanism. This capability is used for real-time telescope data transfers, extensively discussed in the paper. Plug-in--specific threads implement prefetching of data obtained over sockets from an input cluster and merge results from individual compute nodes before sending them out, significantly reducing required network bandwidth. This approach allows a ZOID version of the application to handle a larger number of subbands per I/O node, or even to bypass the input cluster altogether, plugging the input from remote receiver stations directly into the I/O nodes. Using the resources more efficiently can result in considerable savings.

IEEE Computer | 2003

Solving awari with parallel retrograde analysis

John W. Romein; Henri E. Bal

A parallel search algorithm running on a large computer cluster solves a popular board game by computing the best moves from all reachable positions. The resulting databases contain scores for 889 billion positions.

symposium on frontiers of massively parallel computation | 1996

Integrating polling, interrupts, and thread management

Koen Langendoen; John W. Romein; Raoul Bhoedjang; Henri E. Bal

Many user-level communication systems receive network messages by polling the network adapter from user space. While polling avoids the overhead of interrupt-based mechanisms, it is not suited for all parallel applications. This paper describes a general-purpose, multithreaded, communication system that uses both polling and interrupts to receive messages. Users need not insert polls into their code; through a careful integration of the user-level communication software with a user-level thread scheduler, the system can automatically switch between polling and interrupts. We have evaluated the performance of this integrated system on Myrinet, using a synthetic benchmark and a number of applications that have very different communication requirements. We show that the integrated system achieves robust performance: in most cases, it performs as well as or better than systems that rely exclusively on interrupts or polling.

acm symposium on parallel algorithms and architectures | 2006

Astronomical real-time streaming signal processing on a Blue Gene/L supercomputer

John W. Romein; P. Chris Broekema; Ellen van Meijeren; Kjeld Van Der Schaaf; Walther H. Zwart

LOFAR is the first of a new generation of radio telescopes, that combines the signals from many thousands of simple, fixed antennas, rather than from expensive dishes. Its revolutionary design and unprecedented size enables observations in a frequency range that could hardly be observed before, and allows the study of a vast amount of new science cases.In this paper, we describe a novel approach to process realtime, streaming telescope data in software, using a supercomputer. The desire for a flexible and reconfigurable instrument demands a software solution, where traditionally customized hardware was used. This, and LOFARs exceptional real-time, streaming signalprocessing requirements compel the use of a supercomputer. We focus on the LOFAR CEntral Processing facility (CEP), that combines the signals of all LOFAR stations. CEP consists of a 12,288-core IBM Blue Gene/L supercomputer, embedded in several conventional clusters.We describe a highly optimized implementation that will do the bulk of the central signal processing on the Blue Gene/L, namely PolyPhase Filtering, Delay Compensation, and Correlation. Measurements show that we reach exceptionally high computational performance (up to 98% of the theoretical floating-point peak performance). We also discuss how we handle external I/O performance limitations into and out of the Blue Gene/L, to obtain sufficient bandwidth for LOFAR.

IEEE Computer | 2016

A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term

Henri E. Bal; Dick H. J. Epema; Cees de Laat; Rob V. van Nieuwpoort; John W. Romein; Frank J. Seinstra; Cees G. M. Snoek; Harry A. G. Wijshoff

The Dutch Advanced School for Computing and Imaging has built five generations of a 200-node distributed system over nearly two decades while remaining aligned with the shifting computer science research agenda. The system has supported years of award-winning research, underlining the benefits of investing in a smaller-scale, tailored design.

international conference on supercomputing | 2009

Using many-core hardware to correlate radio astronomy signals

Rob V. van Nieuwpoort; John W. Romein

A recent development in radio astronomy is to replace traditional dishes with many small antennas. The signals are combined to form one large, virtual telescope. The enormous data streams are cross-correlated to filter out noise. This is especially challenging, since the computational demands grow quadratically with the number of data streams. Moreover, the correlator is not only computationally intensive, but also very I/O intensive. The LOFAR telescope, for instance, will produce over 100 terabytes per day. The future SKA telescope will even require in the order of exaflops, and petabits/s of I/O. A recent trend is to correlate in software instead of dedicated hardware. This is done to increase flexibility and to reduce development efforts. Examples include e-VLBI and LOFAR. In this paper, we evaluate the correlator algorithm on multi-core CPUs and many-core architectures, such as NVIDIA and ATI GPUs, and the Cell/B.E. The correlator is a streaming, real-time application, and is much more I/O intensive than applications that are typically implemented on many-core hardware today. We compare with the LOFAR production correlator on an IBM Blue Gene/P supercomputer. We investigate performance, power efficiency, and programmability. We identify several important architectural problems which cause architectures to perform suboptimally. Our findings are applicable to data-intensive applications in general. The results show that the processing power and memory bandwidth of current GPUs are highly imbalanced for correlation purposes. While the production correlator on the Blue Gene/P achieves a superb 96% of the theoretical peak performance, this is only 14% on ATI GPUs, and 26% on NVIDIA GPUs. The Cell/B.E. processor, in contrast, achieves an excellent 92%. We found that the Cell/B.E. is also the most energy-efficient solution, it runs the correlator 5-7 times more energy efficiently than the Blue Gene/P. The research presented is an important pathfinder for next-generation telescopes.

IEEE Transactions on Parallel and Distributed Systems | 2002

A performance analysis of transposition-table-driven work scheduling in distributed search

John W. Romein; Henri E. Bal; Jonathan Schaeffer; Aske Plaat

This paper introduces a new scheduling algorithm for parallel single-agent search, transposition table driven work scheduling, that places the transposition table at the heart of the parallel work scheduling. The scheme results in less synchronization overhead, less processor idle time, and less redundant search effort. Measurements on a 128-processor parallel machine show that the scheme achieves nearly-optimal performance and scales well. The algorithm performs a factor of 2.0 to 13.7 times better than traditional work-stealing-based schemes.

acm sigplan symposium on principles and practice of parallel programming | 2010

The LOFAR correlator: implementation and performance analysis

John W. Romein; P. Chris Broekema; Jan David Mol; Rob V. van Nieuwpoort

LOFAR is the first of a new generation of radio telescopes.Rather than using expensive dishes, it forms a distributed sensor network that combines the signals from many thousands of simple antennas. Its revolutionary design allows observations in a frequency range that has hardly been studied before. Another novel feature of LOFAR is the elaborate use of software to process data, where traditional telescopes use customized hardware. This dramatically increases flexibility and substantially reduces costs, but the high processing and bandwidth requirements compel the use of a supercomputer. The antenna signals are centrally combined, filtered, optionally beam-formed, and correlated by an IBM Blue Gene/P. This paper describes the implementation of the so-called correlator. To meet the real-time requirements, the application is highly optimized, and reaches exceptionally high computational and I/O efficiencies. Additionally, we study the scalability of the system, and show that it scales well beyond the requirements. The optimizations allows us to use only half the planned amount of resources, and process 50% more telescope data, significantly improving the effectiveness of the entire telescope.

cluster computing and the grid | 2008

Experiences with Fine-Grained Distributed Supercomputing on a 10G Testbed

Kees Verstoep; Jason Maassen; Henri E. Bal; John W. Romein

This paper shows how lightpath-based networks can allow challenging, fine-grained parallel supercomputing applications to be run on a grid, using parallel retrograde analysis on DAS-3 as a case study. Detailed performance analysis shows that several problems arise that are not present on tightly-coupled systems like clusters. In particular, flow control, asynchronous communication, and host- level communication overheads become new obstacles. By optimizing these aspects, however, a 10 G grid can obtain high performance for this type of communication-intensive application. The class of large-scale distributed applications suitable for running on a grid is therefore larger than previously thought realistic.

Explore More