Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Hans Vandierendonck is active.

Publication


Featured research published by Hans Vandierendonck.


International Conference on Parallel Architectures and Compilation Techniques | 2002

Workload design: selecting representative program-input pairs

Lieven Eeckhout; Hans Vandierendonck; K. De Bosschere

Having a representative workload of the target domain of a microprocessor is extremely important throughout its design. The composition of a workload involves two issues: (i) which benchmarks to select and (ii) which input data sets to select per benchmark. Unfortunately, it is impossible to select a huge number of benchmarks and respective input sets due to the large instruction counts per benchmark and due to limitations on the available simulation time. We use statistical data analysis techniques such as principal component analysis (PCA) and cluster analysis to efficiently explore the workload space. Within this workload space, different input data sets for a given benchmark can be displayed, a distance can be measured between program-input pairs that gives us an idea of their mutual behavioral differences, and representative input data sets can be selected for the given benchmark. This methodology is validated by showing that program-input pairs that are close to each other in this workload space indeed exhibit similar behavior. The final goal is to select a limited set of representative benchmark-input pairs that span the complete workload space. Besides workload composition, there are a number of other possible applications, namely gaining insight into the impact of input data sets on program behavior and profile-guided compiler optimizations.
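
As a rough illustration of this methodology, the sketch below runs PCA followed by k-means clustering over a feature matrix and picks one representative per cluster. The feature values, pair names, and cluster count are hypothetical placeholders, not data from the paper; it assumes NumPy and scikit-learn are available.

```python
# Minimal sketch of PCA + cluster analysis for workload selection.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per program-input pair, one column per
# program characteristic (e.g., instruction mix, branch behavior, locality).
rng = np.random.default_rng(0)
features = rng.random((40, 12))
pair_names = [f"pair_{i}" for i in range(40)]

# Standardize, then project into a low-dimensional workload space.
X = StandardScaler().fit_transform(features)
workload_space = PCA(n_components=4).fit_transform(X)

# Pairs that cluster together behave similarly, so one representative per
# cluster spans the workload space at a fraction of the simulation cost.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(workload_space)
for c in range(5):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(workload_space[members] - km.cluster_centers_[c], axis=1)
    print(f"cluster {c}: representative = {pair_names[members[np.argmin(dists)]]}")
```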


High Performance Computer Architecture | 2001

Differential FCM: increasing value prediction accuracy by improving table usage efficiency

Bart Goeman; Hans Vandierendonck; K. De Bosschere

Value prediction is a relatively new technique to increase the instruction-level parallelism (ILP) in future microprocessors. An important problem when designing a value predictor is efficiency: an accurate predictor requires huge prediction tables. This is especially the case for the finite context method (FCM) predictor, the most accurate one. In this paper, we show that the prediction accuracy of the FCM can be greatly improved by making the FCM predict strides instead of values. This new predictor is called the differential finite context method (DFCM) predictor. The DFCM predictor outperforms a similar FCM predictor by as much as 33%, depending on the prediction table size. If we take the additional storage into account, the difference is still 15% for realistic predictor sizes. We use several metrics to show that the key to this success is reduced aliasing in the level-2 table. We also show that the DFCM is superior to hybrid predictors based on FCM and stride predictors, since its prediction accuracy is higher than that of a hybrid predictor using a perfect meta-predictor.
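
A toy model of the idea is sketched below: where an FCM predictor stores recent values and predicts the next value directly, the DFCM stores recent strides and predicts the last value plus a predicted stride. The table size, hash, and history depth are illustrative choices, not the paper's configuration.

```python
# Toy sketch of a DFCM-style value predictor (illustrative sizes and hash).
class DFCMPredictor:
    def __init__(self, l2_bits=10, order=2):
        self.order = order
        self.l2_mask = (1 << l2_bits) - 1
        self.last_value = {}   # PC -> last observed value (level-1 state)
        self.history = {}      # PC -> tuple of recent strides (level-1 state)
        self.level2 = {}       # hashed stride history -> predicted next stride

    def _hash(self, hist):
        h = 0
        for s in hist:
            h = ((h << 3) ^ s) & self.l2_mask   # simple fold, illustrative only
        return h

    def predict(self, pc):
        hist = self.history.get(pc, (0,) * self.order)
        stride = self.level2.get(self._hash(hist), 0)
        # Predicting strides lets arithmetic sequences with different base
        # values share level-2 entries, reducing aliasing vs. raw values.
        return self.last_value.get(pc, 0) + stride

    def update(self, pc, value):
        stride = value - self.last_value.get(pc, 0)
        hist = self.history.get(pc, (0,) * self.order)
        self.level2[self._hash(hist)] = stride
        self.history[pc] = hist[1:] + (stride,)
        self.last_value[pc] = value

p = DFCMPredictor()
for v in [10, 20, 30, 40]:   # after warm-up, the stride of 10 is predicted
    print(p.predict(0x400), end=" ")   # prints: 0 10 20 40
    p.update(0x400, v)
```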


International Conference on Parallel Architectures and Compilation Techniques | 2010

The Paralax infrastructure: automatic parallelization with a helping hand

Hans Vandierendonck; Sean Rul; Koen De Bosschere

Speeding up sequential programs on multicores is a challenging problem that is in urgent need of a solution. Automatic parallelization of irregular pointer-intensive codes, exemplified by the SPECint codes, is a very hard problem. This paper shows that, with a helping hand, such auto-parallelization is possible and fruitful.


IEEE Computer | 2003

Designing computer architecture research workloads

Lieven Eeckhout; Hans Vandierendonck; K. De Bosschere

Although architectural simulators model microarchitectures at a high abstraction level, the increasing complexity of both the microarchitectures themselves and the applications that run on them makes simulator use extremely time-consuming. Simulators must execute huge numbers of instructions to create a workload representative of real applications, creating unreasonably long simulation times and stretching the time to market. Using reduced input sets instead of reference input sets helps to solve this problem. The authors have developed a methodology that reliably quantifies program behavior similarity to verify whether reduced input sets result in program behavior similar to that of the reference inputs.
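
One plausible way to quantify such similarity, sketched below with hypothetical numbers, is to compare normalized behavior profiles (e.g., instruction-mix fractions and miss rates) of the reduced and reference inputs; a small distance suggests the reduced input preserves the reference behavior. This is an illustration of the idea, not the authors' exact methodology.

```python
import numpy as np

def behavior_distance(profile_a, profile_b):
    """Distance between behavior profiles, where each profile is a vector of
    characteristics such as instruction-mix fractions and cache miss rates.
    Each characteristic is normalized so no single one dominates."""
    a = np.asarray(profile_a, dtype=float)
    b = np.asarray(profile_b, dtype=float)
    scale = np.maximum(np.abs(a), 1e-12)
    return float(np.linalg.norm((a - b) / scale))

reference = [0.42, 0.31, 0.015, 0.08]   # hypothetical reference-input profile
reduced = [0.44, 0.30, 0.017, 0.09]     # hypothetical reduced-input profile
print(behavior_distance(reference, reduced))  # small value -> similar behavior
```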


IEEE Computer Architecture Letters | 2011

Fairness Metrics for Multi-Threaded Processors

Hans Vandierendonck; André Seznec

Multi-threaded processors execute multiple threads concurrently in order to increase overall throughput. It is well documented that multi-threading affects per-thread performance but, more importantly, some threads are affected more than others. This is especially troublesome for multi-programmed workloads. Fairness metrics measure whether all threads are affected equally. However, defining equal treatment is not straightforward. Several fairness metrics for multi-threaded processors have been utilized in the literature, although there does not seem to be a consensus on which metric does the best job of measuring fairness. This paper reviews the prevalent fairness metrics and analyzes their main properties. Each metric strikes a different trade-off between fairness in the strict sense and throughput. We categorize the metrics with respect to this property. Based on experimental data for SMT processors, we suggest using the minimum fairness metric in order to balance fairness and throughput.
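
The sketch below computes a few metrics of this kind from per-thread IPC values. Exact definitions vary across the literature, so these formulas are one common reading, with "minimum fairness" taken as the relative progress of the worst-treated thread; the IPC numbers are hypothetical.

```python
from statistics import harmonic_mean

def fairness_metrics(ipc_mt, ipc_st):
    """Per-thread relative progress = IPC when co-running on the
    multi-threaded processor / IPC when running alone."""
    progress = [m / s for m, s in zip(ipc_mt, ipc_st)]
    return {
        "weighted_speedup": sum(progress),          # throughput-oriented
        "hmean_fairness": harmonic_mean(progress),  # balances both concerns
        "minimum_fairness": min(progress),          # strictly fairness-oriented
    }

# Two threads: alone they reach 2.0 and 1.0 IPC; co-running, 1.2 and 0.4.
print(fairness_metrics([1.2, 0.4], [2.0, 1.0]))
# minimum_fairness = 0.4 exposes that thread 1 is slowed far more than thread 0.
```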


IEEE Transactions on Computers | 2005

XOR-based hash functions

Hans Vandierendonck; K. De Bosschere

Bank conflicts can severely reduce the bandwidth of an interleaved multibank memory, and conflict misses increase the miss rate of a cache or a predictor. Both occurrences are manifestations of the same problem: objects which should be mapped to different indices are accidentally mapped to the same index. Suitably chosen hash functions can avoid conflicts in each of these situations by mapping the most frequently occurring patterns conflict-free. A particularly interesting class of hash functions is the XOR-based hash functions, which compute each set index bit as the exclusive-or of a subset of the address bits. When implementing an XOR-based hash function, it is extremely important to understand which patterns are mapped conflict-free and how a hash function can be constructed to map the most frequently occurring patterns without conflicts. To this end, this paper presents two ways to reason about hash functions: by their null space and by their column space. The null space helps to quickly determine whether a pattern is mapped conflict-free. The column space is more useful for other purposes, e.g., to reduce the fan-in of the XOR gates without introducing conflicts or to evaluate interbank dispersion in skewed-associative caches. Examples illustrate how these ideas can be applied to construct conflict-free hash functions.
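
Because an XOR-based hash is a linear map over GF(2), two addresses collide exactly when their bitwise XOR lies in the hash's null space. The sketch below encodes a hash as one address-bit mask per index bit and tests null-space membership; the masks and addresses are illustrative, not taken from the paper.

```python
def parity(x: int) -> int:
    return bin(x).count("1") & 1

def xor_hash(address: int, masks) -> int:
    """Index bit i is the XOR (parity) of the address bits selected by masks[i]."""
    index = 0
    for i, m in enumerate(masks):
        index |= parity(address & m) << i
    return index

def in_null_space(delta: int, masks) -> bool:
    """delta is in the null space iff it hashes to index 0; then a and
    a ^ delta collide for every address a. A pattern of addresses is
    conflict-free iff no pairwise XOR of its members is in the null space."""
    return all(parity(delta & m) == 0 for m in masks)

masks = [0b000011, 0b001100, 0b110000]  # 6 address bits -> 3 index bits
print(in_null_space(0b110011, masks))   # True: a and a^0b110011 always collide
print(in_null_space(0b000100, masks))   # False: addresses 4 apart never collide
```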


Parallel Computing | 2010

A profile-based tool for finding pipeline parallelism in sequential programs

Sean Rul; Hans Vandierendonck; Koen De Bosschere

Traditional static analysis fails to auto-parallelize programs with complex control and data flow. Furthermore, thread-level parallelism in such programs is often restricted to pipeline parallelism, which can be hard for a programmer to discover. In this paper we propose a tool that, based on profiling information, helps the programmer to discover parallelism. The programmer hand-picks the code transformations from among the proposed candidates, which are then applied by automatic code transformation techniques. This paper contributes to the literature by presenting a profiling tool for discovering thread-level parallelism. We track dependencies at the whole-data-structure level rather than at the element or byte level in order to limit the profiling overhead. We perform a thorough analysis of the needs and costs of this technique. Furthermore, we present and validate the observation that programs with complex control and data flow contain significant amounts of exploitable coarse-grain pipeline parallelism in their outer loops. This observation validates our approach to whole-data-structure dependencies. As state-of-the-art compilers focus on loops iterating over data structure members, it also explains why our approach finds coarse-grain pipeline parallelism in cases that have remained out of reach for state-of-the-art compilers. In cases where traditional compilation techniques do find parallelism, our approach discovers higher degrees of parallelism, yielding a 40% speedup over traditional compilation techniques. Moreover, we demonstrate real speedups on multiple hardware platforms.
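
The sketch below illustrates profiling at whole-data-structure granularity: instead of logging every memory address, it records only which function last wrote each named structure, and a read by another function adds a dependence edge. A chain of such edges around an outer loop is the signature of pipeline parallelism. The function and structure names are hypothetical.

```python
from collections import defaultdict

class DependenceProfiler:
    """Tracks producer->consumer edges at whole-data-structure granularity,
    which keeps profiling overhead far below element- or byte-level tracking."""

    def __init__(self):
        self.last_writer = {}           # data structure -> last writing function
        self.edges = defaultdict(int)   # (producer, consumer) -> occurrence count

    def on_write(self, func, ds):
        self.last_writer[ds] = func

    def on_read(self, func, ds):
        producer = self.last_writer.get(ds)
        if producer is not None and producer != func:
            self.edges[(producer, func)] += 1

prof = DependenceProfiler()
for _ in range(3):                      # hypothetical outer-loop iterations
    prof.on_write("read_block", "buffer")
    prof.on_read("compress", "buffer")
    prof.on_write("compress", "out_queue")
    prof.on_read("write_block", "out_queue")

# The chain read_block -> compress -> write_block suggests a pipeline.
print(dict(prof.edges))
```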


ACM SIGARCH Computer Architecture News | 2007

Function level parallelism driven by data dependencies

Sean Rul; Hans Vandierendonck; Koen De Bosschere

With the rise of chip multiprocessors (CMPs), the amount of parallel computing power will increase significantly in the near future. However, most programs are sequential in nature and have not been explicitly parallelized, so they cannot exploit these parallel resources. Automatic parallelization of sequential, non-regular codes is very hard, as illustrated by the lack of solutions after more than 30 years of research on the topic. The question remains whether there is parallelism in sequential programs that can be detected automatically and, if so, how much parallelism there is. In this paper, we propose a framework for extracting potential parallelism from programs. Applying this framework to sequential programs can teach us how much parallelism is present in a program, but also tells us what the most appropriate parallel construct for a program is, e.g., a pipeline, master/slave work distribution, etc. Our framework is profile-based, implying that it is not safe. It builds two new graph representations of the profile data: the interprocedural data flow graph and the data sharing graph. These graphs show the data flow between functions and the data structures facilitating this data flow, respectively. We apply our framework to the SPEC CPU2000 bzip2 benchmark, achieving a speedup of 3.74 on the compression part and a global speedup of 2.45 on a quad-processor system.
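
As a toy illustration of how such graphs can suggest a parallel construct, the classifier below inspects the shape of an interprocedural data flow graph: a chain of single-successor producers hints at a pipeline, while one producer fanning out to many independent consumers hints at master/slave work distribution. Both the rule and the example graphs are simplifications invented for illustration, not the paper's algorithm.

```python
def classify_construct(idfg):
    """idfg maps each producer function to the set of functions consuming
    its data. This deliberately simplified rule covers only two shapes."""
    fanouts = [len(consumers) for consumers in idfg.values()]
    if fanouts and all(f == 1 for f in fanouts):
        return "pipeline"
    if len(idfg) == 1 and fanouts[0] > 1:
        return "master/slave"
    return "unclassified"

print(classify_construct({"read": {"compress"}, "compress": {"write"}}))  # pipeline
print(classify_construct({"master": {"w1", "w2", "w3"}}))                 # master/slave
```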


High Performance Computer Architecture | 2000

A technique for high bandwidth and deterministic low latency load/store accesses to multiple cache banks

Henk Neefs; Hans Vandierendonck; K. De Bosschere

One of the problems in future processors will be the resource conflicts caused by several load/store units competing to access the same cache bank. The traditional approach to handling this case is to introduce buffers combined with a crossbar. This approach suffers from (i) the non-deterministic latency of a load/store and (ii) the extra latency caused by the crossbar and the buffer management. A deterministic latency is of the utmost importance for the forwarding mechanism of out-of-order processors because it enables back-to-back operation of instructions. We propose a technique that eliminates the buffers and crossbars from the critical path of load/store execution. This results in both a low and a deterministic latency. Our solution consists of predicting which bank is to be accessed; a penalty results only in the case of a misprediction.
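
The sketch below shows the flavor of the idea: a small PC-indexed table predicts the bank a load will access, so the access can be routed with a fixed, deterministic latency, and only a misprediction pays a penalty. Table size, bank-selection bits, and cycle counts are illustrative assumptions, not the paper's design.

```python
class BankPredictor:
    def __init__(self, banks=4, entries=1024):
        self.banks = banks
        self.table = [0] * entries        # last bank observed per entry

    def _index(self, pc):
        return (pc >> 2) % len(self.table)

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, actual_bank):
        self.table[self._index(pc)] = actual_bank

HIT_LATENCY, MISPREDICT_PENALTY = 1, 3    # illustrative cycle counts
pred = BankPredictor()
cycles = 0
for pc, addr in [(0x40, 0x1000), (0x40, 0x1100), (0x40, 0x1200), (0x40, 0x1040)]:
    actual = (addr >> 6) % pred.banks     # bank chosen by low address bits
    correct = pred.predict(pc) == actual
    cycles += HIT_LATENCY if correct else HIT_LATENCY + MISPREDICT_PENALTY
    pred.update(pc, actual)
print(cycles)  # three correct predictions at fixed latency, one penalty
```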


International Conference on Parallel Architectures and Compilation Techniques | 2011

A Unified Scheduler for Recursive and Task Dataflow Parallelism

Hans Vandierendonck; Georgios Tzenakis; Dimitrios S. Nikolopoulos

Task dataflow languages simplify the specification of parallel programs by dynamically detecting and enforcing dependencies between tasks. These languages are, however, often restricted to a single level of parallelism. This language design is reflected in the runtime system, where a master thread explicitly generates a task graph and worker threads execute ready tasks and wake up their dependents. Such an approach is incompatible with state-of-the-art schedulers such as the Cilk scheduler, which minimize the creation of idle tasks (the work-first principle) and place all task creation and scheduling off the critical path. This paper proposes an extension to the Cilk scheduler in order to reconcile task dependencies with the work-first principle. We discuss the impact of task dependencies on the properties of the Cilk scheduler. Furthermore, we propose a low-overhead ticket-based technique for dependency tracking and enforcement at the object level. Our scheduler also supports renaming of objects in order to increase task-level parallelism. Renaming is implemented using versioned objects, a new type of hyperobject. Experimental evaluation shows that the unified scheduler is as efficient as the Cilk scheduler when tasks have no dependencies. Moreover, the unified scheduler is more efficient than SMPSs, a particular implementation of a task dataflow language.
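
A much-simplified model of ticket-based, object-level dependency tracking is sketched below: each object hands out tickets in program order, a task is ready once every object it accesses is serving that task's ticket, and completing a task advances the counters to wake its dependents. Cilk-style work stealing, reader/writer distinctions, and the versioned-object renaming described in the abstract are all omitted from this toy model.

```python
class Obj:
    def __init__(self):
        self.next_ticket = 0   # next ticket to hand out (program order)
        self.serving = 0       # ticket currently allowed to access the object

    def acquire(self):
        ticket = self.next_ticket
        self.next_ticket += 1
        return ticket

class Task:
    def __init__(self, name, objs):
        self.name = name
        # Tickets are taken at task creation, so issue order fixes dependencies.
        self.tickets = [(o, o.acquire()) for o in objs]

    def ready(self):
        return all(o.serving == t for o, t in self.tickets)

    def complete(self):
        for o, _ in self.tickets:
            o.serving += 1     # wake up the next dependent on each object

a, b = Obj(), Obj()
pending = [Task("t1", [a]), Task("t2", [a, b]), Task("t3", [b])]
while pending:                 # naive executor standing in for the scheduler
    task = next(t for t in pending if t.ready())
    print("run", task.name)    # runs t1, then t2, then t3, in dependence order
    task.complete()
    pending.remove(task)
```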

Collaboration


Dive into Hans Vandierendonck's collaborations.

Top Co-Authors

Koen De Bosschere (Ghent University)
Jean-Didier Legat (Université catholique de Louvain)
Philippe Manet (Université catholique de Louvain)