Wim Heirman | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Wim Heirman is active.

Explore More

Publication

Featured researches published by Wim Heirman.

international conference on parallel architectures and compilation techniques | 2013

Fairness-aware scheduling on single-ISA heterogeneous multi-cores

Kenzo Van Craeynest; Shoaib Akram; Wim Heirman; Aamer Jaleel; Lieven Eeckhout

Single-ISA heterogeneous multi-cores consisting of small (e.g., in-order) and big (e.g., out-of-order) cores dramatically improve energy- and power-efficiency by scheduling workloads on the most appropriate core type. A significant body of recent work has focused on improving system throughput through scheduling. However, none of the prior work has looked into fairness. Yet, guaranteeing that all threads make equal progress on heterogeneous multi-cores is of utmost importance for both multi-threaded and multi-program workloads to improve performance and quality-of-service. Furthermore, modern operating systems affinitize workloads to cores (pinned scheduling) which dramatically affects fairness on heterogeneous multi-cores. In this paper, we propose fairness-aware scheduling for single-ISA heterogeneous multi-cores, and explore two flavors for doing so. Equal-time scheduling runs each thread or workload on each core type for an equal fraction of the time, whereas equal-progress scheduling strives at getting equal amounts of work done on each core type. Our experimental results demonstrate an average 14% (and up to 25%) performance improvement over pinned scheduling through fairness-aware scheduling for homogeneous multi-threaded workloads; equal-progress scheduling improves performance by 32% on average for heterogeneous multi-threaded workloads. Further, we report dramatic improvements in fairness over prior scheduling proposals for multi-program workloads, while achieving system throughput comparable to throughput-optimized scheduling, and an average 21% improvement in throughput over pinned scheduling.

international symposium on performance analysis of systems and software | 2013

Sampled simulation of multi-threaded applications

Trevor E. Carlson; Wim Heirman; Lieven Eeckhout

Sampling is a well-known workload reduction technique that allows one to speed up architectural simulation while accurately predicting performance. Previous sampling methods have been shown to accurately predict single-threaded application runtime based on its overall IPC. However, these previous approaches are unsuitable for general multi-threaded applications, for which IPC is not a good proxy for runtime. Additionally, we find that issues such as application periodicity and inter-thread synchronization play a significant role in determining how best to sample these applications. The proposed multi-threaded application sampling methodology is able to derive an effective sampling strategy for candidate applications using architecture-independent metrics. Using this methodology, large input sets can now be simulated which would otherwise be infeasible, allowing for more accurate conclusions to be made than from studies using scaled-down input sets. Through the use of the proposed methodology, we can simulate less than 10% of the total application runtime in detail. On the SPEComp, NPB and PARSEC benchmarks, running on an 8-core simulated system, we achieve an average absolute error of 3.5%.

IEEE Journal of Selected Topics in Quantum Electronics | 2006

Selective optical broadcast component for reconfigurable multiprocessor interconnects

I Artundo; Lieven Desmet; Wim Heirman; C Debaes; Joni Dambre; J. Van Campenhout; Hugo Thienpont

Recent advances in the development of optical interconnect technologies suggest the possible emergence of optical interconnects within distributed shared memory (DSM) machines in the near future. Moreover, current developments in wavelength tunable devices could soon allow for the fabrication of low-cost, adaptable interconnection networks with varying switching times. It is the objective of this paper to investigate whether such reconfigurable networks can boost the performance of the DSM machines further. In this respect, we propose a system concept of a passive optical broadcasting component to be used as the scalable key element in such a reconfigurable network. We briefly discuss the necessary opto-electronic components and the limitations they impose on network performance. We show through detailed full-system simulations of benchmark executions, that the proposed system architecture can provide a significant speedup for shared-memory machines, even when taking into account the limitations imposed by the opto-electronics and the optical broadcast component

international conference on parallel architectures and compilation techniques | 2012

Power-aware multi-core simulation for early design stage hardware/software co-optimization

Wim Heirman; Souradip Sarkar; Trevor E. Carlson; Ibrahim Hur; Lieven Eeckhout

Stringent performance targets and power constraints push designers towards building specialized workload-optimized systems across a broad spectrum of the computing arena, including supercomputing applications as exemplified by the IBM BlueGene and Intel MIC architectures. In this paper, we make the case for hardware/software co-design during early design stages of processors for scientific computing applications. Considering an important scientific kernel, namely stencil computation, we demonstrate that performance and energy-efficiency can be improved by a factor of 1.66× and 1.25×, respectively, by co-optimizing hardware and software. To enable hardware/software co-design in early stages of the design cycle, we propose a novel simulation infrastructure by combining high-abstraction performance simulation using Sniper with power modeling using McPAT and custom DRAM power models. Sniper/McPAT is fast — simulation speed is around 2 MIPS on an 8-core host machine — because it uses analytical modeling to abstract away core performance during multi-core simulation. We demonstrate Sniper/McPATs accuracy through validation against real hardware; we report average performance and power prediction errors of 22.1% and 8.3%, respectively, for a set of SPEComp benchmarks.

ACM Transactions on Design Automation of Electronic Systems | 2011

Dynamic data folding with parameterizable FPGA configurations

Karel Bruneel; Wim Heirman; Dirk Stroobandt

In many applications, subsequent data manipulations differ only in a small set of parameter values. Because of their reconfigurability, FPGAs (field programmable gate arrays) can be configured with a specialized circuit each time the parameter values change. This technique is called dynamic data folding. The specialized circuits are smaller and faster than their generic counterparts. However, the overhead involved in generating the configurations for the specialized circuits at runtime is very large when conventional tools are used, and this overhead will in many cases negate the benefit of using optimized configurations. This article introduces an automatic method for generating runtime parameterizable configurations from arbitrary Boolean circuits. These configurations, in which some of the configuration bits are expressed as a closed-form Boolean expression of a set of parameters, enable very fast run-time specialization, since specialization only involves evaluating these expressions. Our approach is validated on a ternary content-addressable memory (TCAM). We show that the specialized configurations, produced by our method use 2.82 times fewer LUTs than the generic configuration, and even 1.41 times fewer LUTs than the implementation generated by Xilinx Coregen. Moreover, while Coregen needs hand-crafted generators for each type of circuit, our toolflow can be applied to any VHDL design. Using our automatic and generally applicable method, run-time hardware optimization suddenly becomes feasible for a large class of applications.

system-level interconnect prediction | 2008

Rent's rule and parallel programs: characterizing network traffic behavior

Wim Heirman; Joni Dambre; Dirk Stroobandt; Jan Van Campenhout

In VLSI systems, Rents rule characterizes the locality of interconnect between different subsystems, and allows an efficient layout of the circuit on a chip. With rising complexities of both hardware and software, Systems-on-Chip are converging to multiprocessor architectures connected by a Network-on-Chip. Here, packets are routed instead of wires, and threads of a parallel program are distributed among processors. Still, Rents rule remains applicable, as it can now be used to describe the locality of network traffic. In this paper, we analyze network traffic on an on-chip network and observe the power-law relation between the size of clusters of network nodes and their external bandwidths. We then use the same techniques to study the time-varying behavior of the application, and derive the implications for future on-chip networks.

ieee international symposium on workload characterization | 2011

Using cycle stacks to understand scaling bottlenecks in multi-threaded workloads

Wim Heirman; Trevor E. Carlson; Shuai Che; Kevin Skadron; Lieven Eeckhout

This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cycle stack quantifies where the cycles have gone, and provides hints towards optimization opportunities. We make the case that this is particularly interesting for analyzing parallel performance: understanding how cycle components scale with increasing core counts and/or input data set sizes leads to insight with respect to scaling bottlenecks due to synchronization, load imbalance, poor memory performance, etc. We present several case studies illustrating the use of cycle stacks. As a subsequent step, we further extend the methodology to analyze sets of parallel workloads using statistical data analysis, and perform a workload characterization to understand behavioral differences across benchmark suites. We analyze the SPLASH-2, PARSEC and Rodinia benchmark suites and conclude that the three benchmark suites cover similar areas in the workload space. However, scaling behavior of these benchmarks towards larger input sets and/or higher core counts is highly dependent on the benchmark, the way in which the inputs have been scaled, and on the machine configuration.

international symposium on performance analysis of systems and software | 2014

BarrierPoint: Sampled simulation of multi-threaded applications

Trevor E. Carlson; Wim Heirman; Kenzo Van Craeynest; Lieven Eeckhout

Sampling is a well-known technique to speed up architectural simulation of long-running workloads while maintaining accurate performance predictions. A number of sampling techniques have recently been developed that extend well-known single-threaded techniques to allow sampled simulation of multi-threaded applications. Unfortunately, prior work is limited to non-synchronizing applications (e.g., server throughput workloads); requires the functional simulation of the entire application using a detailed cache hierarchy which limits the overall simulation speedup potential; leads to different units of work across different processor architectures which complicates performance analysis; or, requires massive machine resources to achieve reasonable simulation speedups. In this work, we propose BarrierPoint, a sampling methodology to accelerate simulation by leveraging globally synchronizing barriers in multi-threaded applications. BarrierPoint collects microarchitecture-independent code and data signatures to determine the most representative inter-barrier regions, called barrierpoints. BarrierPoint estimates total application execution time (and other performance metrics of interest) through detailed simulation of these barrierpoints only, leading to substantial simulation speedups. Barrierpoints can be simulated in parallel, use fewer simulation resources, and define fixed units of work to be used in performance comparisons across processor architectures. Our evaluation of BarrierPoint using NPB and Parsec benchmarks reports average simulation speedups of 24.7× (and up to 866.6×) with an average simulation error of 0.9% and 2.9% at most. On average, BarrierPoint reduces the number of simulation machine resources needed by 78×.

high performance interconnects | 2009

Low-Power Reconfigurable Network Architecture for On-Chip Photonic Interconnects

I Artundo; Wim Heirman; Mikel Loperena; Christof Debaes; Jan Van Campenhout; Hugo Thienpont

Photonic Networks-On-Chip have emerged as a viable solution for interconnecting multicore computer architectures in a power-efficient manner. Current architectures focus on large messages, however, which are not compatible with the coherence traffic found on chip multiprocessor networks. In this paper, we introduce a reconfigurable optical interconnect in which the topology is adapted automatically to the evolving traffic situation. This allows a large fraction of the (short) coherence messages to use the optical links, making our technique a better match for CMP networks when compared to existing solutions. We also evaluate the performance and power efficiency of our architecture using an assumed physical implementation based on ultra-low power optical switching devices and under realistic traffic load conditions.

high-performance computer architecture | 2014

Undersubscribed threading on clustered cache architectures

Wim Heirman; Trevor E. Carlson; Kenzo Van Craeynest; Ibrahim Hur; Aamer Jaleel; Lieven Eeckhout

Recent many-core processors such as Intels Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth; and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low-latency and ease of implementation in many-core processors. We then propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST) which dynamically matches an applications working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area tradeoff towards design points with more cores and less cache.

Explore More