Publications


Featured research published by Rob F. Van der Wijngaart.


International Solid-State Circuits Conference | 2010

A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS

Jason Howard; Saurabh Dighe; Yatin Hoskote; Sriram R. Vangal; David Finan; Gregory Ruhl; David Jenkins; Howard Wilson; Nitin Borkar; Gerhard Schrom; Fabrice Paillet; Shailendra Jain; Tiju Jacob; Satish Yada; Sraven Marella; Praveen Salihundam; Vasantha Erraguntla; Michael Konow; Michael Riepen; Guido Droege; Joerg Lindemann; Matthias Gries; Thomas Apel; Kersten Henriss; Tor Lund-Larsen; Sebastian Steibl; Shekhar Borkar; Vivek De; Rob F. Van der Wijngaart; Timothy G. Mattson

Current developments in microprocessor design favor increased core counts over frequency scaling to improve processor performance and energy efficiency. Coupling this architectural trend with a message-passing protocol helps realize a data-center-on-a-die. The prototype chip (Figs. 5.7.1 and 5.7.7) described in this paper integrates 48 Pentium™-class IA-32 cores [1] on a 6×4 2D-mesh network of tiled core clusters with high-speed I/Os on the periphery. The chip contains 1.3B transistors. Each core has a private 256KB L2 cache (12MB total on-die) and is optimized to support a message-passing programming model whereby cores communicate through shared memory. A 16KB message-passing buffer (MPB) is present in every tile, giving a total of 384KB of on-die shared memory for increased performance. Power is kept at a minimum by transmitting dynamic, fine-grained voltage-change commands over the network to an on-die voltage-regulator controller (VRC). Further power savings are achieved through active frequency scaling at the tile granularity. Memory accesses are distributed over four on-die DDR3 controllers for an aggregate peak memory bandwidth of 21GB/s at 4× burst. Additionally, an 8-byte bidirectional system interface (SIF) provides 6.4GB/s of I/O bandwidth. The die area is 567mm² and is implemented in 45nm high-k metal-gate CMOS [2].
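A quick consistency check of the on-die memory totals quoted above, assuming the 6×4 mesh comprises 24 tiles holding two cores each:

```latex
% On-die memory totals implied by the figures in the abstract
\begin{align*}
\text{L2 cache:} \quad & 48\ \text{cores} \times 256\,\text{KB} = 12{,}288\,\text{KB} = 12\,\text{MB} \\
\text{MPB:}      \quad & 6 \times 4 = 24\ \text{tiles}, \qquad 24 \times 16\,\text{KB} = 384\,\text{KB}
\end{align*}
```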


IEEE International Conference on High Performance Computing, Data and Analytics | 2008

Programming the Intel 80-core network-on-a-chip terascale processor

Timothy G. Mattson; Rob F. Van der Wijngaart; Michael Frumkin

Intel's 80-core terascale processor was the first generally programmable microprocessor to break the teraflops barrier. The primary goal for the chip was to study power management and on-die communication technologies. When announced in 2007, it received a great deal of attention for running a stencil kernel at 1.0 TFLOPS in single precision while using only 97 watts. The literature about the chip, however, focused on the hardware, saying little about the software environment or the kernels used to evaluate the chip. This paper completes the literature on the 80-core terascale processor by fully defining the chip's software environment. We describe the instruction set, the programming environment, the kernels written for the chip, and our experiences programming this microprocessor. We close by discussing the lessons learned from this project and what it implies for future message-passing, network-on-a-chip processors.
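For readers unfamiliar with the pattern, a minimal 2D five-point stencil in C is sketched below. It is only an illustration of the kind of kernel mentioned above, not the code that ran on the 80-core chip; the grid size and coefficient are arbitrary.

```c
#include <stdio.h>

#define N 512  /* arbitrary grid size for illustration */

static float in[N][N], out[N][N];

int main(void) {
    /* initialize the grid with a simple pattern */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            in[i][j] = (float)(i + j);

    /* generic 5-point stencil: each interior point is updated from
       itself and its four nearest neighbors */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            out[i][j] = 0.2f * (in[i][j] + in[i-1][j] + in[i+1][j]
                                         + in[i][j-1] + in[i][j+1]);

    printf("out[1][1] = %f\n", out[1][1]);
    return 0;
}
```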


Operating Systems Review | 2011

Light-weight communications on Intel's single-chip cloud computer processor

Rob F. Van der Wijngaart; Timothy G. Mattson

Many-core chips are changing the way high-performance computing systems are built and programmed. As it becomes increasingly difficult to maintain cache coherence across many cores, manufacturers are exploring designs that do not feature any cache coherence between cores. Communications on such chips are naturally implemented using message passing, which makes them resemble clusters, but with an important difference: special hardware can be provided that supports very fast on-chip communications, reducing latency and increasing bandwidth. We present one such chip, the Single-Chip Cloud Computer (SCC), an experimental processor created by Intel Labs. We describe two communication libraries available on the SCC: RCCE and Rckmb. RCCE is a light-weight, minimal library for writing message-passing parallel applications. Rckmb provides the data link layer for running network services such as TCP/IP. Both utilize the SCC's non-cache-coherent shared memory for transferring data between cores without needing to go off-chip. In this paper we describe the design and implementation of RCCE and Rckmb. To compare their performance, we consider simple benchmarks run with RCCE and with MPI over TCP/IP.
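The sketch below shows a generic MPI ping-pong latency test of the sort commonly used for such comparisons. It is not taken from the paper; the message size and repetition count are arbitrary, and an analogous test could be written against RCCE's send and receive calls.

```c
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE 1024   /* arbitrary message size in bytes */
#define REPS     1000   /* arbitrary number of round trips */

int main(int argc, char **argv) {
    char buf[MSG_SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ranks 0 and 1 bounce a message back and forth; the average
       round-trip time gives a rough latency estimate */
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %g us\n", (t1 - t0) / REPS * 1e6);

    MPI_Finalize();
    return 0;
}
```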


Conference on High Performance Computing (Supercomputing) | 2003

Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations

Leonid Oliker; Andrew Canning; Jonathan Carter; John Shalf; David Skinner; Stephane Ethier; Rupak Biswas; Jahed Djomehri; Rob F. Van der Wijngaart

The growing gap between sustained and peak performance for scientific applications is a well-known problem in high-end computing. The recent development of parallel vector systems offers the potential to bridge this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of scientific computing areas. First, we present the performance of a microbenchmark suite that examines low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Results demonstrate that the SX-6 achieves high performance on a large fraction of our applications and often significantly outperforms the cache-based architectures. However, certain applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
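To make the vectorization point concrete, the hypothetical C fragment below contrasts a unit-stride, independent loop that a vector processor such as the SX-6 handles well with a loop-carried recurrence that resists vectorization. It is illustrative only and not drawn from the benchmarked codes.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Vectorizes well: iterations are independent and access memory with
   unit stride, which suits a vector pipeline. */
static void axpy(double *restrict y, const double *restrict x, double a) {
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

/* Resists vectorization as written: each iteration depends on the
   previous one (a loop-carried recurrence), so it runs serially unless
   the algorithm is restructured. */
static void prefix_sum(double *s, const double *x) {
    s[0] = x[0];
    for (int i = 1; i < N; i++)
        s[i] = s[i - 1] + x[i];
}

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = calloc(N, sizeof *y);
    double *s = malloc(N * sizeof *s);

    for (int i = 0; i < N; i++) x[i] = 1.0;
    axpy(y, x, 2.0);
    prefix_sum(s, x);
    printf("y[0] = %f, s[N-1] = %f\n", y[0], s[N - 1]);

    free(x); free(y); free(s);
    return 0;
}
```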


Journal of Parallel and Distributed Computing | 2006

Performance characteristics of the multi-zone NAS parallel benchmarks

Haoqiang Jin; Rob F. Van der Wijngaart

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS Parallel Benchmarks) Multi-Zone, extends the NPB suite and involves solving the application benchmarks LU, BT, and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared-memory multilevel programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes.
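The hybrid structure described above can be sketched as follows: zones are distributed across MPI ranks (coarse grain), and OpenMP threads share the loop work within each zone (fine grain). The zone count, the per-zone update, and the boundary exchange below are placeholders, not the actual NPB Multi-Zone code.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NZONES 16      /* placeholder number of zones      */
#define ZPTS   100000  /* placeholder points per zone      */
#define NSTEPS 10      /* placeholder number of time steps */

static double zone_data[NZONES][ZPTS];

int main(int argc, char **argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    for (int step = 0; step < NSTEPS; step++) {
        /* coarse grain: each rank updates the zones assigned to it */
        for (int z = rank; z < NZONES; z += nranks) {
            /* fine grain: OpenMP threads share the work within a zone */
            #pragma omp parallel for
            for (int i = 0; i < ZPTS; i++)
                zone_data[z][i] += 1.0;   /* stand-in for the real solver */
        }
        /* after each step, zones exchange boundary values; a global
           barrier stands in for the real point-to-point exchange */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == 0) printf("done after %d steps\n", NSTEPS);
    MPI_Finalize();
    return 0;
}
```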


Measurement and Modeling of Computer Systems | 2003

Benchmarks for grid computing: a review of ongoing efforts and future directions

Allan Snavely; Greg Chun; Henri Casanova; Rob F. Van der Wijngaart; Michael Frumkin

Grid architectures are collections of computational and data storage resources linked by communication channels for shared use. It is important to deploy measurement methods so that Grid applications and architectures can evolve guided by scientific principles. Engineering pursuits need agreed-upon metrics: a common language for communicating results, so that alternative implementations can be compared quantitatively. Users of systems need performance parameters that describe system capabilities so that they can develop and tune their applications. Architects need examples of how users will exercise their system to improve the design. The Grid community is building systems such as the TeraGrid [1] and the Information Power Grid [2] while applications that can fully benefit from such systems are also being developed. We conclude that the time to develop and deploy sets of Grid benchmarks is now. This article reviews fundamental principles, early efforts, and benefits of Grid benchmarks to the study and design of Grids.


International Conference on Parallel Processing | 2013

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Jérôme Vienne; Rob F. Van der Wijngaart; Lars Koesterke; Ilya Sharapov

The NAS Parallel Benchmarks (NPB) are a set of applications commonly used to evaluate parallel systems. We use the NPB-OpenMP version to examine the performance of Intel's new Xeon Phi coprocessor, focusing in particular on the many-core aspect of the Xeon Phi architecture. A first analysis studies scalability up to 244 threads on 61 cores and the impact of affinity settings on scaling. It also compares performance characteristics of the Xeon Phi and traditional Xeon CPUs. The application of several well-established optimization techniques allows us to identify common bottlenecks that can specifically impede performance on the Xeon Phi but are not as severe on multi-core CPUs. We also find that many of the OpenMP-parallel loops are too short (in terms of the number of loop iterations) for a balanced execution by 244 threads. New or redesigned benchmarks will be needed to accommodate the greatly increased number of cores and threads. We conclude by summarizing our findings in a set of recommendations for performance optimization on the Xeon Phi.
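To illustrate the short-loop observation, the hypothetical fragment below contrasts a loop with only a few hundred iterations, where 244 threads each receive at most one or two iterations and overhead dominates, with a long loop that amortizes the threading cost. The loop lengths are invented; thread placement on the Xeon Phi is typically controlled at run time through environment variables such as OMP_NUM_THREADS and KMP_AFFINITY.

```c
#include <omp.h>
#include <stdio.h>

#define SHORT_N 300        /* too few iterations for 244 threads      */
#define LONG_N  3000000    /* enough work to amortize thread overhead */

int main(void) {
    static double a[LONG_N];
    double sum = 0.0;

    /* Poorly balanced on a 244-thread Xeon Phi: roughly one iteration
       per thread, so fork/join and scheduling overhead dominate. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < SHORT_N; i++)
        sum += i * 0.5;

    /* Scales much better: each thread gets thousands of iterations. */
    #pragma omp parallel for
    for (int i = 0; i < LONG_N; i++)
        a[i] = i * 0.5;

    printf("sum = %f, a[0] = %f, max threads = %d\n",
           sum, a[0], omp_get_max_threads());
    return 0;
}
```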


Multiprocessor System-on-Chip | 2011

The Case for Message Passing on Many-Core Chips

Rakesh Kumar; Timothy G. Mattson; Gilles Pokam; Rob F. Van der Wijngaart

The debate over shared-memory vs. message-passing programming models has raged for decades, with cogent arguments on both sides. In this paper, we revisit this debate for many-core chips and argue that message-passing programming models are often more suitable than shared-memory models for addressing the problems presented by the many-core era.


IEEE High Performance Extreme Computing Conference | 2014

The Parallel Research Kernels

Rob F. Van der Wijngaart; Timothy G. Mattson

We present the Parallel Research Kernels: a collection of kernels supporting research on parallel computer systems. This set of kernels covers the most common patterns of communication, computation, and synchronization encountered in parallel HPC applications. By focusing on these kernels instead of specific workloads, one can design an effective parallel computer system without needing to make predictions about the nature of future workloads.
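As a flavor of the patterns such kernels isolate, the sketch below shows a simple OpenMP matrix transpose, a classic data-movement pattern in HPC codes. It is an illustrative stand-in, not code from the Parallel Research Kernels, and the matrix order is arbitrary.

```c
#include <omp.h>
#include <stdio.h>

#define ORDER 1024   /* arbitrary matrix order for illustration */

static double A[ORDER][ORDER], B[ORDER][ORDER];

int main(void) {
    /* fill A with a recognizable pattern */
    #pragma omp parallel for
    for (int i = 0; i < ORDER; i++)
        for (int j = 0; j < ORDER; j++)
            A[i][j] = i * ORDER + j;

    /* transpose: B = A^T; the strided reads of A exercise the memory
       system in the way transpose kernels are intended to */
    #pragma omp parallel for
    for (int i = 0; i < ORDER; i++)
        for (int j = 0; j < ORDER; j++)
            B[i][j] = A[j][i];

    printf("B[1][2] = %f (expect %f)\n", B[1][2], (double)(2 * ORDER + 1));
    return 0;
}
```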


IEEE International Conference on High Performance Computing, Data and Analytics | 2000

Parallel and Distributed Computational Fluid Dynamics: Experimental Results and Challenges

M. J. Djomehri; Rupak Biswas; Rob F. Van der Wijngaart; Maurice Yarrow

This paper describes several results of parallel and distributed computing using a production flow solver program. We present a coarse-grained parallelization based on clustering of discretization grids, combined with partitioning of large grids for load balancing. We assess its performance on tightly coupled distributed and distributed-shared-memory platforms using large-scale scientific problems. An experiment with this solver, adapted to a wide-area-network environment, is also presented.
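The clustering step described above amounts to assigning grids of different sizes to processors so that per-processor work is as even as possible. Below is a minimal greedy sketch of that idea; the grid sizes and processor count are invented, and the solver's partitioning of large grids is not shown.

```c
#include <stdio.h>

#define NGRIDS 8
#define NPROCS 3

int main(void) {
    /* made-up grid sizes (e.g., number of cells per grid) */
    long grid_size[NGRIDS] = { 90000, 75000, 60000, 40000,
                               30000, 25000, 20000, 10000 };
    long proc_load[NPROCS] = { 0 };
    int  owner[NGRIDS];

    /* greedy clustering: give each grid (largest first, as listed)
       to the currently least-loaded processor */
    for (int g = 0; g < NGRIDS; g++) {
        int best = 0;
        for (int p = 1; p < NPROCS; p++)
            if (proc_load[p] < proc_load[best]) best = p;
        owner[g] = best;
        proc_load[best] += grid_size[g];
    }

    for (int p = 0; p < NPROCS; p++)
        printf("processor %d: load %ld\n", p, proc_load[p]);
    for (int g = 0; g < NGRIDS; g++)
        printf("grid %d -> processor %d\n", g, owner[g]);
    return 0;
}
```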

Collaboration


Dive into Rob F. Van der Wijngaart's collaborations.

Top Co-Authors

Leonid Oliker
Lawrence Berkeley National Laboratory

Andrew Canning
Lawrence Berkeley National Laboratory

David Skinner
Lawrence Berkeley National Laboratory

John Shalf
Lawrence Berkeley National Laboratory

Jonathan Carter
Lawrence Berkeley National Laboratory

Allan Snavely
University of California