Erik H. D'Hollander
Ghent University
Publications
Featured research published by Erik H. D'Hollander.
Journal of Systems Architecture | 2005
Kristof Beyls; Erik H. D'Hollander
One of the new extensions in EPIC architectures is cache hints. Two kinds of hints can be attached to each memory instruction: a source cache hint and a target cache hint. The source hint indicates the true latency of the instruction, which the compiler uses to improve the instruction schedule. The target hint indicates at which cache levels it is profitable to retain data, allowing cache replacement decisions to be improved at run time. A compile-time method is presented which calculates appropriate cache hints. Both kinds of hints are based on the locality of the instruction, measured by the reuse distance metric. Two alternative methods are discussed. The first profiles the reuse distance distribution and selects a static hint for each instruction. The second calculates the reuse distance analytically, which allows dynamic hints to be generated, i.e. the best hint for each memory access is calculated at run time. The implementation of the static hint scheme in the Open64 compiler for the Itanium processor shows a speedup of 10% on average on a set of pointer-intensive and regular loop-based programs. The analytical approach with dynamic hints was implemented in the FPT compiler and shows up to a 34% reduction in cache misses.
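To make the reuse distance metric concrete, the following minimal Python sketch computes the reuse distance of every access in an address trace. It illustrates the metric only, not the paper's profiling machinery; the function name and the simple quadratic implementation are ours.

```python
def reuse_distances(trace):
    """For each access, compute its reuse distance: the number of
    distinct addresses referenced since the previous access to the
    same address (infinity for first-time, 'cold' accesses)."""
    last_use = {}   # address -> index of its most recent access
    result = []
    for i, addr in enumerate(trace):
        if addr in last_use:
            # Distinct addresses touched strictly between use and reuse.
            result.append(len(set(trace[last_use[addr] + 1:i])))
        else:
            result.append(float('inf'))  # cold access
        last_use[addr] = i
    return result

# Example: the reuse of 'a' skips over two distinct addresses (b, c).
print(reuse_distances(['a', 'b', 'c', 'b', 'a']))
# [inf, inf, inf, 1, 2]
```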
IEEE Transactions on Parallel and Distributed Systems | 1992
Erik H. D'Hollander
A general method for the identification of the independent subsets in loops with constant dependence vectors is presented. It is shown that the dependence relation remains invariant under a unimodular transformation. Then a unimodular transformation is used to bring the dependence matrix into a form where the independent subsets are obtained by a direct and inexpensive partitioning algorithm. This leads to a procedure for the automatic conversion of a serial loop into a nest of parallel DO-ALL loops. Another unimodular transformation results in an algorithm to label the dependent iterations of an n-fold nested loop in O(n²) time. This provides a multithreaded dynamic scheduling scheme requiring only one fork and one join primitive.
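As a one-dimensional illustration of such partitioning (a simplification we chose, not the paper's general unimodular method): in a single loop whose constant dependence distances have greatest common divisor g, every dependence chain preserves the iteration number modulo g, so the g residue classes form independent subsets that may run as a DO-ALL.

```python
from math import gcd
from functools import reduce

def independent_subsets(n_iters, distances):
    """Partition iterations 0..n_iters-1 of a single loop with constant
    dependence distances into independent subsets.  Any chain of
    dependences i -> i+d keeps i mod g invariant, with g the gcd of
    all distances, so the g residue classes never depend on each
    other.  (Conservative for very short loops, where the classes may
    split further.)"""
    g = reduce(gcd, distances)
    return [list(range(r, n_iters, g)) for r in range(g)]

# A loop with dependence distances 4 and 6 splits into gcd(4, 6) = 2
# independent subsets: the even and the odd iterations.
print(independent_subsets(10, [4, 6]))
# [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```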
European Conference on Parallel Processing | 2002
Kristof Beyls; Erik H. D'Hollander
Modern instruction sets extend their load/store instructions with cache hints, as an additional means to bridge the processor-memory speed gap. Cache hints are used to specify the cache level at which the data is likely to be found, as well as the cache level where the data should be stored after accessing it. In order to improve a program's cache behavior, the cache hint is selected based on the data locality of the instruction. We represent the data locality of an instruction by its reuse distance distribution. The reuse distance is the amount of data addressed between two accesses to the same memory location. The distribution makes it possible to efficiently estimate the cache level where the data will be found, and to determine the level where the data should be stored to improve the hit rate. The Open64 EPIC compiler was extended with cache hint selection, resulting in speedups of up to 36% in numerical and 23% in non-numerical programs on an Itanium multiprocessor.
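A hedged sketch of the selection idea: given per-level capacities, an access is expected to still find its data in the first cache level that can hold more distinct lines than the access's reuse distance. The capacities below are illustrative placeholders, not the Itanium hierarchy, and the function is ours, not the Open64 implementation.

```python
def select_cache_hint(reuse_distance, level_capacities=(256, 8192, 262144)):
    """Pick the first cache level (1-based) whose capacity, in cache
    lines, exceeds the reuse distance: if fewer distinct lines were
    touched between use and reuse than the level can hold, the data
    is likely still resident there.  Capacities are illustrative."""
    for level, capacity in enumerate(level_capacities, start=1):
        if reuse_distance < capacity:
            return level          # e.g. 1 -> hit expected in L1
    return None                   # reuse too distant: expect a miss to memory

# A reuse distance of 1000 lines overflows a 256-line L1 but fits the
# 8192-line L2, so the hint targets level 2.
print(select_cache_hint(1000))    # 2
```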
IEEE Computer | 2009
Kristof Beyls; Erik H. D'Hollander
Refactoring for data locality opens a new avenue for performance-oriented program rewriting. SLO has broken down a large part of the complexity that software developers face when speeding up programs with numerous cache misses. Therefore, we consider SLO to belong to a new generation of program analyzers. Whereas existing cache profilers (generation 1.0) highlight problems such as cache misses, second-generation analyzers (such as SLO) highlight the place to fix problems. Improving data locality is also important in hardware-based applications. SLO was already used to optimize the frame rate and energy consumption in a wavelet decoder implemented on an FPGA. In another vein, the SLO concepts could be incorporated in interactive performance debuggers and profile-directed compilers. We believe that SLO will be useful in the optimization of many data-intensive applications.
High Performance Computing and Communications | 2006
Kristof Beyls; Erik H. D'Hollander
Due to the huge speed gaps in the memory hierarchy of modern computer architectures, it is important that programs maintain good data locality. Improving temporal locality implies reducing the distance between data reuses that are far apart. The best existing tools indicate locality bottlenecks by highlighting both the source location generating the use and the subsequent cache-missing reuse. Even with this knowledge of the bottleneck locations in the source code, it often remains hard to find an effective code refactoring that improves temporal locality, due to the unclear interaction of function calls and loop iterations occurring between use and reuse. The contributions in this paper are twofold. First, the locality analysis is enhanced to not only pinpoint the cache bottlenecks, but also to suggest code refactorings that may resolve them. The refactorings are found by analyzing the dynamic hierarchy of function calls and loops on the code path between reuses, called reuse paths. Second, reservoir sampling of the reuse paths results in a significant reduction of the execution time and memory requirements during profiling, enabling the analysis of realistic programs. An interactive GUI, called SLO (Suggestions for Locality Optimizations), has been used to explore the most appropriate refactorings in a number of SPEC2000 programs. After refactoring, the execution time of the selected programs was halved on average.
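The reservoir sampling step can be illustrated with the classic Algorithm R; this sketch samples generic items and is not SLO's implementation.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of
    unknown length using O(k) memory (classic Algorithm R).  Applied
    to reuse paths, this bounds profiling memory: every observed path
    has the same probability of ending up in the reservoir."""
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            # Replace a random reservoir slot with probability k/(n+1).
            j = random.randint(0, n)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 5 "reuse paths" out of a stream of 10000 without storing them all.
paths = (f"path-{i}" for i in range(10000))
print(reservoir_sample(paths, 5))
```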
IEEE Transactions on Biomedical Engineering | 1979
Erik H. D'Hollander; Guy A. Orban
An on-line spike recognition system allows separation of multiple spikes present on a single channel into up to six different classes. The learning phase is unsupervised and uses the data samples of the waveform as coordinates in a multidimensional feature space. Additional signal characteristics may improve the system performance in special cases. Using the well-known nearest neighbor technique, all possible cluster configurations are determined. From this analysis, the investigator selects the physiologically best suited cluster layout, primarily based on a curve showing the number of clusters versus the maximum distance between two neighboring spikes in the same cluster. This procedure is supported by visual examination of the spikes of each cluster. Statistics are calculated for inter- and intra-cluster distances, yielding confidence limits for the cluster bounds and estimates of the quality of separation. During the classification phase, a separate graphic display processor permits continuous control without delay. Each classified spike is projected over its cluster, identifying its mean waveform.
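The clusters-versus-distance curve can be reproduced in miniature with single-linkage clustering: for each distance threshold, spikes whose waveforms lie closer than the threshold are merged, and the surviving cluster count is recorded. The Python sketch below (our naming, with toy two-sample "waveforms") illustrates the idea; it is not the paper's implementation.

```python
import numpy as np

def clusters_vs_distance(spikes, thresholds):
    """For each distance threshold, count the clusters obtained by
    single-linkage: spikes closer than the threshold (Euclidean
    distance between their sample vectors) fall in the same cluster."""
    n = len(spikes)
    d = np.linalg.norm(spikes[:, None, :] - spikes[None, :, :], axis=2)
    counts = []
    for t in thresholds:
        parent = list(range(n))          # union-find over close pairs
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i in range(n):
            for j in range(i + 1, n):
                if d[i, j] < t:
                    parent[find(i)] = find(j)
        counts.append(len({find(i) for i in range(n)}))
    return counts

# Two tight groups of waveforms merge into one cluster once the
# threshold grows past the inter-group distance.
spikes = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(clusters_vs_distance(spikes, [0.5, 10.0]))   # [2, 1]
```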
High Performance Embedded Architectures and Compilers | 2007
Harald Devos; Kristof Beyls; Mark Christiaens; Jan Van Campenhout; Erik H. D'Hollander; Dirk Stroobandt
When implementing multimedia applications, solutions in dedicated hardware are chosen only when the required performance or energy-efficiency cannot be met with a software solution. The performance of a hardware design critically depends upon having high levels of parallelism and data locality. Often a long sequence of high-level transformations is needed to sufficiently increase the locality and parallelism. The effect of the transformations is known only after translating the high-level code into a specific design at the circuit level. When the constraints are not met, hardware designers need to redo the high-level loop transformations, and repeat all subsequent translation steps, which leads to long design times. We propose a method to reduce design time through the synergistic combination of techniques (a) to quickly pinpoint the loop transformations that increase locality; (b) to refactor loops in a polyhedral model and check whether a sequence of refactorings is legal; (c) to generate efficient structural VHDL from the optimized refactored algorithm. The implementation of these techniques in a tool suite results in a far shorter design time of hours instead of days or weeks. A 2D-inverse discrete wavelet transform was taken as a case study. The results outperform those of a commercial C-to-VHDL compiler, and compare favorably with existing published approaches.
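The legality check at the heart of step (b) can be illustrated compactly: a (unimodular) loop transformation is legal when every dependence distance vector remains lexicographically positive after transformation. The sketch below shows only this check, under our own naming; it is not the paper's tool suite.

```python
import numpy as np

def is_legal(T, dependence_vectors):
    """A unimodular loop transformation T is legal when every
    transformed dependence distance vector T @ d stays
    lexicographically positive, i.e. the source iteration still
    executes before the sink."""
    def lex_positive(v):
        for x in v:
            if x != 0:
                return x > 0
        return False
    T = np.asarray(T)
    return all(lex_positive(T @ np.asarray(d)) for d in dependence_vectors)

# Loop interchange swaps the two loop levels.  It is legal for the
# distance vector (1, 1) but not for (1, -1), which interchange turns
# into the lexicographically negative (-1, 1).
interchange = [[0, 1], [1, 0]]
print(is_legal(interchange, [(1, 1)]))    # True
print(is_legal(interchange, [(1, -1)]))   # False
```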
Proceedings Fifth International Conference on Information Visualisation | 2001
Yijun Yu; Kristof Beyls; Erik H. D'Hollander
The global cache miss ratio of a program does not reveal the time distribution of the memory reference patterns in detail. On the other hand, cache visualization is hampered by the huge number of memory references to display. Therefore, many visualizers focus on a snapshot of the cache contents instead of viewing all memory transactions. A cache visualizer is introduced which presents the integral cache behavior of a program in several complementary views: the density view of the cache misses shows the hot spots of the program; the reuse distance view shows the data locality and its effect on performance; the histogram view shows the periodic patterns that occur in the trace. In a number of experiments, the visualizer has been used to characterize and effectively improve the cache behavior and program performance.
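The histogram view can be approximated by binning reuse distances into power-of-two buckets, a common way of presenting such distributions; the sketch below is our illustration, not the visualizer's code.

```python
from collections import Counter
from math import inf, log2

def reuse_distance_histogram(distances):
    """Bin reuse distances into power-of-two buckets: bucket 2**k
    holds distances in [2**k, 2**(k+1)).  Cold misses (infinite
    distance) and immediate reuses (distance 0) get their own bins."""
    hist = Counter()
    for d in distances:
        if d == inf:
            hist['cold'] += 1
        elif d == 0:
            hist[0] += 1                   # immediate reuse
        else:
            hist[2 ** int(log2(d))] += 1   # bucket lower bound 2**k
    return dict(hist)

print(reuse_distance_histogram([1, 2, 3, 6, 100, inf]))
# {1: 1, 2: 2, 4: 1, 64: 1, 'cold': 1}
```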
Journal of Visual Languages and Computing | 2001
Yijun Yu; Erik H. D'Hollander
A 3D iteration space visualizer (ISV) is presented to analyze the parallelism in loops and to find loop transformations which enhance the parallelism. Using automatic program instrumentation, the iteration space dependency graph (ISDG) is constructed, which shows the exact data dependencies of arbitrarily nested loops. Various graphical operations such as rotation, zooming, clipping, coloring and filtering permit a detailed examination of the dependence relations. Furthermore, an animated dataflow execution shows the maximal parallelism, and the parallel loops are indicated automatically by an embedded data dependence analysis. In addition, the user may discover and indicate additional parallelism, for which a suitable unimodular loop transformation is calculated and verified. The ISV has been applied to parallelize algorithmic kernel programs, a computational fluid dynamics (CFD) simulation code, the detection of statement-level parallelism and loop variable privatization. The applications show that the visualizer is a versatile and easy-to-use tool for the high-performance application programmer.
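The ISDG construction can be sketched from first principles: record, per iteration, the locations written and read, and add an edge from the most recent writer of a location to any later iteration that touches it. The sketch below (our naming; flow and output dependences only) illustrates this on a[i][j] = a[i-1][j].

```python
def build_isdg(iterations):
    """Build iteration-space dependence edges from instrumented
    accesses.  `iterations` maps an iteration vector to the sets of
    locations it (writes, reads); an edge (i -> j) is added when j
    reads or overwrites a location last written by an earlier
    iteration i."""
    edges = set()
    last_writer = {}
    for it in sorted(iterations):                   # sequential order
        writes, reads = iterations[it]
        for loc in reads:
            if loc in last_writer:
                edges.add((last_writer[loc], it))   # flow dependence
        for loc in writes:
            if loc in last_writer:
                edges.add((last_writer[loc], it))   # output dependence
            last_writer[loc] = it
    return edges

# a[i][j] = a[i-1][j]: each iteration reads the value written one
# i-step earlier, giving dependence edges along the i axis.
iters = {(i, j): ({('a', i, j)}, {('a', i - 1, j)})
         for i in range(2) for j in range(2)}
print(sorted(build_isdg(iters)))
# [((0, 0), (1, 0)), ((0, 1), (1, 1))]
```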
Information Sciences | 1998
Erik H. D'Hollander; Fubo Zhang; Qi Wang
The Fortran Parallel Transformer (FPT) is a parallelization tool for Fortran-77 programs. It is used for the automatic parallelization of loops, program transformations, dependence analysis, performance tuning and code generation for various platforms. FPT is able to deal with GOTOs by restructuring ill-structured code using hammock graph transformations. In this way more parallelism becomes detectable. The X-window-based programming environment PEFPT extends FPT with interactive dependence analysis, the iteration space graph (ISG) and guided loop optimization. FPT contains a PVM (Parallel Virtual Machine) code generator which converts the parallel loops into PVM master and slave code for a network of workstations. This includes job scheduling, synchronization and optimized data communication. The productivity gain is about a factor of 10 in programming time, combined with a significant speedup of the execution.
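As an illustration of the generated master/slave shape, the sketch below distributes a parallel loop over workers using Python's multiprocessing rather than PVM; it mirrors the pattern (split the iteration space, farm out blocks, combine partial results) but is our analogy, not FPT output.

```python
from multiprocessing import Pool

def slave(chunk):
    """Slave: execute its block of loop iterations and return the
    partial result (here, a simple per-iteration computation)."""
    lo, hi = chunk
    return sum(i * i for i in range(lo, hi))

def master(n_iters, n_slaves):
    """Master: split the iteration space into blocks, send one block
    to each slave, and combine the partial results."""
    step = (n_iters + n_slaves - 1) // n_slaves
    chunks = [(lo, min(lo + step, n_iters)) for lo in range(0, n_iters, step)]
    with Pool(n_slaves) as pool:
        return sum(pool.map(slave, chunks))

if __name__ == '__main__':
    print(master(1000, 4))   # 332833500, same result as the serial loop
```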