Publications


Featured research published by Erik Vermij.


International Conference on Computer Design | 2015

Analytic processor model for fast design-space exploration

Rik Jongerius; Giovanni Mariani; Andreea Anghel; Gero Dittmann; Erik Vermij; Henk Corporaal

In this paper, we propose an analytic model that takes as inputs a) a parametric, microarchitecture-independent characterization of the target workload, and b) a hardware configuration of the core and the memory hierarchy, and returns as output an estimate of processor-core performance. To validate our technique, we compare our performance estimates with measurements on an Intel® Xeon® system. The average error increases from 21% for a state-of-the-art simulator to 25% for our model, but we achieve a speedup of several orders of magnitude. Thus, the model enables fast design-space exploration and represents a first step towards an analytic exascale system model.
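
The abstract describes a model that maps a microarchitecture-independent workload profile plus a hardware configuration onto a performance estimate. The paper's actual equations are not reproduced here; the Python sketch below only illustrates the general shape of such an analytic model, and all parameter names and values (issue width, miss rates, latencies) are illustrative assumptions, not figures from the paper.

# Illustrative sketch of an analytic core-performance model (not the paper's model).
# Inputs: a microarchitecture-independent workload profile and a hardware configuration.
from dataclasses import dataclass

@dataclass
class Workload:
    instructions: float    # total dynamic instruction count
    l1_miss_rate: float    # misses per instruction served by L2
    l2_miss_rate: float    # misses per instruction served by DRAM

@dataclass
class Core:
    issue_width: int       # sustained instructions per cycle in the ideal case
    freq_ghz: float
    l2_latency_cycles: int
    dram_latency_cycles: int

def estimate_runtime_seconds(w: Workload, c: Core) -> float:
    # Ideal compute cycles limited by issue width.
    compute_cycles = w.instructions / c.issue_width
    # Stall cycles from the memory hierarchy, pessimistically assuming no overlap.
    stall_cycles = w.instructions * (w.l1_miss_rate * c.l2_latency_cycles +
                                     w.l2_miss_rate * c.dram_latency_cycles)
    return (compute_cycles + stall_cycles) / (c.freq_ghz * 1e9)

print(estimate_runtime_seconds(Workload(1e9, 0.02, 0.005),
                               Core(issue_width=4, freq_ghz=2.5,
                                    l2_latency_cycles=12, dram_latency_cycles=200)))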


International Conference on Supercomputing | 2014

Exascale Radio Astronomy: Can We Ride the Technology Wave?

Erik Vermij; Leandro Fiorin; Christoph Hagleitner; Koen Bertels

The Square Kilometre Array (SKA) will be the most sensitive radio telescope in the world. This unprecedented sensitivity will be achieved by combining and analyzing signals from 262,144 antennas and 350 dishes at a raw data rate of petabits per second. The processing pipeline to create useful astronomical data will require exa-operations per second, at a very limited power budget. We analyze the compute, memory, and bandwidth requirements for the key algorithms used in the SKA. By studying their implementation on existing platforms, we show that most algorithms have properties that map inefficiently onto current hardware, such as a low compute-to-bandwidth ratio and complex arithmetic. In addition, we estimate the power breakdown on CPUs and GPUs, analyze the cache behavior on CPUs, and discuss possible improvements. This work is complemented with an analysis of supercomputer trends, which demonstrates that current efforts to use commercial off-the-shelf accelerators result in a two to three times smaller improvement in compute capabilities and power efficiency than custom-built machines. We conclude that waiting for new technology to arrive will not give us the instruments currently planned in 2018: one or two orders of magnitude better power efficiency and compute capabilities are required. Novel hardware and system architectures, matched to the needs and features of this unique project, must be developed.
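
As a back-of-the-envelope illustration of the scale the abstract describes: the petabit-per-second data rate and the exa-operations-per-second figure come from the text, while the power budget below is an assumed placeholder, not a quoted SKA number.

# Back-of-the-envelope arithmetic for the scale discussed in the abstract.
raw_data_rate_bits = 1e15        # ~1 petabit per second of raw antenna data (from the text)
required_ops = 1e18              # ~1 exa-operation per second for the pipeline (from the text)
assumed_power_budget_w = 10e6    # hypothetical 10 MW budget, for illustration only

required_efficiency = required_ops / assumed_power_budget_w
print(f"required efficiency: {required_efficiency / 1e9:.0f} Gops/s per watt")
# A machine delivering a few Gops/W would need roughly two orders of magnitude
# improvement, which is the gap the abstract argues technology scaling alone will not close.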


IEEE International Conference on High Performance Computing, Data and Analytics | 2015

Challenges in exascale radio astronomy: Can the SKA ride the technology wave?

Erik Vermij; Leandro Fiorin; Rik Jongerius; Christoph Hagleitner; Koen Bertels

The Square Kilometre Array (SKA) will be the most sensitive radio telescope in the world. This unprecedented sensitivity will be achieved by combining and analyzing signals from 262,144 antennas and 350 dishes at a raw data rate of petabits per second. The processing pipeline to create useful astronomical data will require hundreds of peta-operations per second, at a very limited power budget. We analyze the compute, memory, and bandwidth requirements for the key algorithms used in the SKA. By studying their implementation on existing platforms, we show that most algorithms have properties that map inefficiently onto current hardware, such as a low compute-to-bandwidth ratio and complex arithmetic. In addition, we estimate the power breakdown on CPUs and GPUs, analyze the cache behavior on CPUs, and discuss possible improvements. This work is complemented with an analysis of supercomputer trends, which demonstrates that current efforts to use commercial off-the-shelf accelerators result in a two to three times smaller improvement in compute capabilities and power efficiency than custom-built machines. We conclude that waiting for new technology to arrive will not give us the instruments currently planned in 2018: one or two orders of magnitude better power efficiency and compute capabilities are required. Novel hardware and system architectures, matched to the needs and features of this unique project, must be developed.


Computing Frontiers | 2015

An energy-efficient custom architecture for the SKA1-low central signal processor

Leandro Fiorin; Erik Vermij; Jan van Lunteren; Rik Jongerius; Christoph Hagleitner

The Square Kilometre Array (SKA) will be the biggest radio telescope ever built, with unprecedented sensitivity, angular resolution, and survey speed. This paper explores the design of a custom architecture for the central signal processor (CSP) of SKA1-Low, the SKA's aperture-array instrument consisting of 131,072 antennas. SKA1-Low's antennas receive signals between 50 and 350 MHz. After digitization and preliminary processing, samples are moved to the CSP for further processing. In this work, we describe the challenges in building the CSP and present a first quantitative study of a custom hardware architecture for processing the main CSP algorithms. By taking advantage of emerging 3D-stacked-memory devices and by exploring the design space for a 14-nm implementation, we estimate a power consumption of 14.4 W for processing all channels of a sub-band and an application-level energy efficiency of up to 208 GFLOPS/W for our architecture.
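
Relating the two figures quoted in the abstract, and assuming the quoted efficiency applies to the same 14.4 W per-sub-band power figure, the implied sustained application-level throughput is roughly 3 TFLOPS per sub-band:

# Quick arithmetic on the abstract's quoted numbers (14.4 W, 208 GFLOPS/W).
power_per_subband_w = 14.4
efficiency_gflops_per_w = 208.0
sustained_gflops = power_per_subband_w * efficiency_gflops_per_w
print(f"~{sustained_gflops / 1000:.1f} TFLOPS sustained per sub-band")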


Computing Frontiers | 2016

An architecture for near-data processing systems

Erik Vermij; Christoph Hagleitner; Leandro Fiorin; Rik Jongerius; Jan van Lunteren; Koen Bertels

Near-data processing is a promising paradigm for addressing the bandwidth, latency, and energy limitations of today's computer systems. In this work, we introduce an architecture that enhances a contemporary multi-core CPU with new features for the seamless integration of near-data processing capabilities. Crucial aspects such as coherency, data placement, communication, address translation, and the programming model are discussed. The essential components, as well as a system simulator, are realized in hardware and software. Results for the important Graph500 benchmark show a 1.5x speedup when using the proposed architecture.
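
To illustrate why near-data execution can pay off for bandwidth-bound work such as Graph500, the sketch below uses a simple roofline-style estimate. It is not taken from the paper, and all bandwidths and throughputs are placeholder assumptions.

# Illustrative first-order model of when near-data execution wins (not from the paper).
def runtime_s(bytes_moved: float, flops: float, bw_gbs: float, gflops: float) -> float:
    # Roofline-style: the kernel is limited either by data movement or by compute.
    return max(bytes_moved / (bw_gbs * 1e9), flops / (gflops * 1e9))

bytes_moved, flops = 64e9, 8e9      # low compute-to-byte ratio, e.g. graph traversal
cpu = runtime_s(bytes_moved, flops, bw_gbs=60.0,  gflops=500.0)   # data pulled over the CPU memory bus
ndp = runtime_s(bytes_moved, flops, bw_gbs=240.0, gflops=100.0)   # aggregate bandwidth next to memory
print(f"CPU {cpu:.2f}s  NDP {ndp:.2f}s  speedup {cpu / ndp:.1f}x")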


IEEE Transactions on Computers | 2018

Analytic Multi-Core Processor Model for Fast Design-Space Exploration

Rik Jongerius; Andreea Anghel; Gero Dittmann; Giovanni Mariani; Erik Vermij; Henk Corporaal

Simulators help computer architects optimize system designs. The limited performance of simulators, even for systems of moderate size and level of detail, makes the approach infeasible for design-space exploration of future exascale systems. Analytic models, in contrast, offer very fast turn-around times. In this paper, we propose an analytic multi-core processor-performance model that takes as inputs a) a parametric, microarchitecture-independent characterization of the target workload, and b) a hardware configuration of the core and the memory hierarchy. The processor-performance model accounts for instruction-level parallelism (ILP) per instruction type, models single-instruction, multiple-data (SIMD) features, and considers cache and memory-bandwidth contention between cores. We validate our model by comparing its performance estimates with measurements from hardware performance counters on Intel Xeon and ARM Cortex-A15 systems. We estimate multi-core contention with a maximum error of 11.4 percent. The average single-thread error increases from 25 percent for a state-of-the-art simulator to 59 percent for our model, but the correlation remains 0.8, a high relative accuracy, while we achieve a speedup of several orders of magnitude. With a much higher capacity than simulators and more reliable insights than back-of-the-envelope calculations, the model makes automated design-space exploration of exascale systems possible, which we show using a real-world case study from radio astronomy.
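
The abstract mentions modeling memory-bandwidth contention between cores. As a minimal sketch of that idea, assuming the paper's formulation is more detailed, one can treat contention as a proportional slowdown once the aggregate per-core demand exceeds the available DRAM bandwidth; all numbers below are illustrative.

# Illustrative sketch of multi-core bandwidth contention (not the paper's model).
def contention_slowdown(per_core_demand_gbs: float, cores: int, dram_bw_gbs: float) -> float:
    aggregate = per_core_demand_gbs * cores
    return max(1.0, aggregate / dram_bw_gbs)   # 1.0 means no contention

for n in (1, 4, 8, 16):
    print(n, "cores -> slowdown", round(contention_slowdown(6.0, n, 50.0), 2))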


International Conference on Parallel Processing | 2017

Boosting the Efficiency of HPCG and Graph500 with Near-Data Processing

Erik Vermij; Leandro Fiorin; Christoph Hagleitner; Koen Bertels

HPCG and Graph500 can be regarded as the two most relevant benchmarks for high-performance computing systems. Existing supercomputer designs, however, tend to focus on floating-point peak performance, a metric of little relevance to these two benchmarks, leaving resources underutilized and resulting in little performance improvement for these benchmarks over time. In this work, we analyze the implementation of both benchmarks on a novel shared-memory near-data processing architecture. We study several aspects: (1) a system-parameter design-space exploration, (2) software optimizations, and (3) the exploitation of unique architectural features, such as user-enhanced coherence, as well as the exploitation of data locality for traffic between near-data processors. For the HPCG benchmark, we show a 2.5x application-level speedup with respect to a CPU and a 2.5x power-efficiency improvement with respect to a GPU. For the Graph500 benchmark, we show up to a 3.5x speedup with respect to a CPU. Furthermore, we show that, with many of the existing data-locality optimizations for this specific graph workload applied, local memory bandwidth is not the crucial parameter; a high-bandwidth and low-latency interconnect is arguably more important, shining a new light on the near-data processing characteristics most relevant for this type of heavily optimized graph processing.
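
A rough textbook-style estimate (not a figure from the paper) of why peak-FLOP-oriented machines leave resources idle on HPCG: the sparse matrix-vector product at its core moves on the order of 12 bytes per nonzero for only 2 floating-point operations, so attainable performance is set by memory bandwidth rather than peak compute. The machine balance below is a placeholder.

# Rough arithmetic intensity of the SpMV kernel at the heart of HPCG (CSR estimate).
bytes_per_nonzero = 8 + 4          # 8-byte matrix value + 4-byte column index
flops_per_nonzero = 2              # one multiply, one add
intensity = flops_per_nonzero / bytes_per_nonzero
peak_gflops, mem_bw_gbs = 1000.0, 100.0   # placeholder machine balance
attainable = min(peak_gflops, intensity * mem_bw_gbs)
print(f"intensity ~{intensity:.2f} flop/byte, attainable ~{attainable:.0f} GFLOPS "
      f"of {peak_gflops:.0f} GFLOPS peak")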


Computing Frontiers | 2017

Sorting big data on heterogeneous near-data processing systems

Erik Vermij; Leandro Fiorin; Christoph Hagleitner; Koen Bertels

Big-data workloads have recently gained considerable importance in many business and scientific applications, and sorting elements efficiently is a key operation in these workloads. In this work, we analyze the implementation of the mergesort algorithm on heterogeneous systems composed of CPUs and near-data processors located on the system memory channels. For configurations with an equal number of active CPU cores and near-data processors, our experiments show a speedup of up to 2.5x, as well as an energy-per-solution reduction of up to 2.5x.
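
The sketch below shows only the high-level structure such a heterogeneous mergesort could take, assuming the per-memory-channel processors sort their local partitions and the host performs the final merge; it is not the authors' implementation.

# Illustrative partition-then-merge structure for a heterogeneous mergesort.
import heapq, random

def sort_partitions_near_data(data, n_partitions):
    # Stand-in for work the near-data processors would do: sort each local partition.
    size = (len(data) + n_partitions - 1) // n_partitions
    return [sorted(data[i * size:(i + 1) * size]) for i in range(n_partitions)]

def host_merge(runs):
    # Final k-way merge performed on the host CPU.
    return list(heapq.merge(*runs))

data = [random.randint(0, 1_000_000) for _ in range(100_000)]
runs = sort_partitions_near_data(data, n_partitions=4)
assert host_merge(runs) == sorted(data)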


ACM Transactions on Architecture and Code Optimization | 2017

An Architecture for Integrated Near-Data Processors

Erik Vermij; Leandro Fiorin; Rik Jongerius; Christoph Hagleitner; Jan van Lunteren; Koen Bertels

To increase the performance of data-intensive applications, we present an extension to a CPU architecture that enables arbitrary near-data processing capabilities close to the main memory. This is realized by introducing a component attached to the CPU system bus and a component at the memory side. Together, they provide hardware-managed coherence and virtual-memory support to integrate the near-data processors into a shared-memory environment. We present an implementation of the components, as well as a system simulator providing detailed performance estimates. With a variety of synthetic workloads, we demonstrate the performance of the memory accesses, the mixed fine- and coarse-grained coherence mechanisms, and the near-data processor communication mechanism. Furthermore, we quantify the inevitable start-up penalty caused by coherence handling and data writeback, and argue that near-data processing workloads should access data several times to offset this penalty. A case study based on the Graph500 benchmark confirms the small overhead of the proposed coherence mechanisms and shows the ability to outperform a real CPU by a factor of two.
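
The argument that data should be accessed several times to offset the start-up penalty can be illustrated with a simple amortization calculation; the kernel speedup, pass time, and start-up cost below are assumed placeholders, not measurements from the paper.

# Illustrative amortization of the coherence/writeback start-up penalty.
def effective_speedup(kernel_speedup: float, passes: int,
                      pass_time_s: float, startup_s: float) -> float:
    baseline = passes * pass_time_s
    offloaded = startup_s + passes * pass_time_s / kernel_speedup
    return baseline / offloaded

for passes in (1, 2, 5, 10):
    print(passes, "passes ->", round(effective_speedup(2.0, passes, 1.0, 1.0), 2), "x")
# With a single pass the offload loses; the 2x kernel advantage only emerges
# once the data is reused across several passes.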


Automation, Robotics and Control Systems | 2016

Balancing High-Performance Parallelization and Accuracy in Canny Edge Detector

Valery Kritchallo; Billy Braithwaite; Erik Vermij; Koen Bertels; Zaid Al-Ars

We present a novel approach to trading off accuracy against the degree of parallelization for the Canny edge detector, a well-known image-processing algorithm. At the heart of our method is a single top-level image-slicing loop incorporated into the sequential algorithm to process image segments concurrently, a parallelization technique that allows breaks in computational continuity in order to achieve high performance. By using the fidelity slider, a new approximate-computing concept that we introduce, the user can exercise full control over the desired balance between output accuracy and parallel performance. The practical value and strong scalability of the presented method are demonstrated by extensive benchmarks performed on three evaluation platforms, showing speedups of up to 7x at 100% accuracy and up to 19x at 99% accuracy over the sequential version, as recorded on an Intel Xeon platform with 14 cores and 28 hardware threads.
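
A minimal sketch of the slice-and-process idea, assuming horizontal strips and a placeholder edge filter instead of a full Canny pipeline; cutting the image at strip boundaries is what introduces the small accuracy loss that the fidelity trade-off controls. This is not the authors' implementation.

# Minimal sketch: slice the image into horizontal strips and edge-detect them concurrently.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def detect_edges(strip: np.ndarray) -> np.ndarray:
    # Placeholder for a real Canny stage: simple gradient-magnitude threshold.
    gy, gx = np.gradient(strip.astype(float))
    return (np.hypot(gx, gy) > 50).astype(np.uint8)

def sliced_edge_detect(image: np.ndarray, n_slices: int) -> np.ndarray:
    strips = np.array_split(image, n_slices, axis=0)
    with ThreadPoolExecutor(max_workers=n_slices) as pool:
        return np.vstack(list(pool.map(detect_edges, strips)))

edges = sliced_edge_detect(np.random.randint(0, 256, (512, 512), dtype=np.uint8), n_slices=4)
print(edges.shape)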

Collaboration


Dive into Erik Vermij's collaborations.

Top Co-Authors

Koen Bertels, Delft University of Technology

Henk Corporaal, Eindhoven University of Technology

Valery Kritchallo, Delft University of Technology