Publication


Featured research published by Kristof Beyls.


Algorithmica | 2007

Counting Integer Points in Parametric Polytopes Using Barvinok's Rational Functions

Sven Verdoolaege; Rachid Seghir; Kristof Beyls; Vincent Loechner; Maurice Bruynooghe

Many compiler optimization techniques depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that the enumerator of such a set can be represented by an explicit function consisting of a set of quasi-polynomials, each associated with a chamber in the parameter space. Previously, interpolation was used to obtain these quasi-polynomials, but this technique has several disadvantages. Its worst-case computation time for a single quasi-polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such a quasi-polynomial (measured in bits needed to represent the quasi-polynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution. Our main contribution is a novel method for calculating the required quasi-polynomials analytically. It extends an existing method, based on Barvinok's decomposition, for counting the number of integer points in a non-parametric polytope. Our technique always produces a solution and computes polynomially-sized enumerators in polynomial time (for fixed dimensions).
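As a small illustration of what such an enumerator looks like (our example, not one from the paper), consider the one-dimensional parametric polytope $P_N = \{\, x \in \mathbb{R} : 0 \le 2x \le N \,\}$. Its integer points are $0, 1, \dots, \lfloor N/2 \rfloor$, so on the single chamber $N \ge 0$ the enumerator is the degree-1, period-2 quasi-polynomial

\[
  E(N) \;=\; \#\bigl(P_N \cap \mathbb{Z}\bigr) \;=\; \left\lfloor \tfrac{N}{2} \right\rfloor + 1
  \;=\; \tfrac{N}{2} + \begin{cases} 1 & \text{if } N \equiv 0 \pmod 2, \\ \tfrac{1}{2} & \text{if } N \equiv 1 \pmod 2. \end{cases}
\]

The method described in the paper computes such closed forms directly from the constraints instead of interpolating them from sample counts.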


Journal of Systems Architecture | 2005

Generating cache hints for improved program efficiency

Kristof Beyls; Erik H. D'Hollander

One of the new extensions in EPIC architectures is cache hints. Two kinds of hints can be attached to each memory instruction: a source cache hint and a target cache hint. The source hint indicates the true latency of the instruction, which is used by the compiler to improve the instruction schedule. The target hint indicates at which cache levels it is profitable to retain data, allowing cache replacement decisions to be improved at run time. A compile-time method is presented which calculates appropriate cache hints. Both kinds of hints are based on the locality of the instruction, measured by the reuse distance metric. Two alternative methods are discussed. The first one profiles the reuse distance distribution and selects a static hint for each instruction. The second method calculates the reuse distance analytically, which allows dynamic hints to be generated, i.e. the best hint for each memory access is computed at run time. The implementation of the static hint scheme in the Open64 compiler for the Itanium processor shows a speedup of 10% on average on a set of pointer-intensive and regular loop-based programs. The analytical approach with dynamic hints was implemented in the FPT compiler and shows up to 34% reduction in cache misses.
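The reuse distance of an access is the number of distinct memory locations touched since the previous access to the same location. The sketch below shows how that metric can be measured from an address trace; the trace contents and the naive quadratic scan are illustrative assumptions, not the profiling machinery used in the paper.

```c
#include <stdio.h>

#define TRACE_LEN 12

/* Hypothetical address trace: each entry is the location (a small integer id)
 * touched by one memory access. */
static const int trace[TRACE_LEN] = {1, 2, 3, 1, 2, 4, 1, 5, 2, 3, 4, 1};

/* Reuse distance of access i: the number of distinct locations accessed since
 * the previous access to trace[i], or -1 (infinite) for a cold access. */
static int reuse_distance(int i)
{
    int prev;
    for (prev = i - 1; prev >= 0 && trace[prev] != trace[i]; prev--)
        ;
    if (prev < 0)
        return -1;                      /* first use: infinite reuse distance */

    int distinct = 0;
    for (int j = prev + 1; j < i; j++) {
        int seen_before = 0;
        for (int k = prev + 1; k < j; k++)
            if (trace[k] == trace[j]) { seen_before = 1; break; }
        if (!seen_before)
            distinct++;
    }
    return distinct;
}

int main(void)
{
    for (int i = 0; i < TRACE_LEN; i++)
        printf("access %2d, location %d: reuse distance %d\n",
               i, trace[i], reuse_distance(i));
    return 0;
}
```

A production profiler would of course keep a hash map and a tree of recent accesses rather than rescanning the trace, but the definition of the metric is the same.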


Compilers, Architecture, and Synthesis for Embedded Systems | 2004

Analytical computation of Ehrhart polynomials: enabling more compiler analyses and optimizations

Sven Verdoolaege; Rachid Seghir; Kristof Beyls; Vincent Loechner; Maurice Bruynooghe

Many optimization techniques, including several targeted specifically at embedded systems, depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that this parametric count can be represented by a set of Ehrhart polynomials. Previously, interpolation was used to obtain these polynomials, but this technique has several disadvantages. Its worst-case computation time for a single Ehrhart polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such an Ehrhart polynomial (measured in bits needed to represent the polynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution. Our main contribution is a novel method for calculating Ehrhart polynomials analytically. It extends an existing method, based on Barvinok's decomposition, for counting the number of integer points in a non-parametric polytope. Our technique always produces a solution and computes polynomially-sized Ehrhart polynomials in polynomial time (for fixed dimensions).
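A compiler-flavoured example of such a count (ours, not the paper's) is the number of iterations of the triangular loop nest for i = 1..N, for j = 1..i, i.e. the number of integer points in the parametric triangle $\{(i,j) : 1 \le j \le i \le N\}$:

\[
  L(N) \;=\; \#\{\, (i,j) \in \mathbb{Z}^2 : 1 \le j \le i \le N \,\} \;=\; \tfrac{1}{2}N^2 + \tfrac{1}{2}N \qquad (N \ge 1).
\]

The degree equals the dimension of the polytope; when the polytope has rational (non-integer) vertices the coefficients become periodic in $N$, which is the general quasi-polynomial case handled by the paper.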


European Conference on Parallel Processing | 2002

Reuse Distance-Based Cache Hint Selection

Kristof Beyls; Erik H. D'Hollander

Modern instruction sets extend their load/store instructions with cache hints, as an additional means to bridge the processor-memory speed gap. Cache hints are used to specify the cache level at which the data is likely to be found, as well as the cache level where the data is stored after accessing it. In order to improve a program's cache behavior, the cache hint is selected based on the data locality of the instruction. We represent the data locality of an instruction by its reuse distance distribution. The reuse distance is the amount of data addressed between two accesses to the same memory location. The distribution makes it possible to efficiently estimate the cache level where the data will be found, and to determine the level where the data should be stored to improve the hit rate. Extending the Open64 EPIC compiler with cache hint selection resulted in speedups of up to 36% in numerical and 23% in non-numerical programs on an Itanium multiprocessor.
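A minimal sketch of the selection this implies is shown below: the reuse distance of an instruction is compared against per-level cache capacities, and the first level large enough to hold the data until its reuse becomes the hint. The capacities and hint names are illustrative assumptions, not the Itanium encodings or the selection algorithm from the paper.

```c
#include <stdio.h>

/* Assumed per-level cache capacities, expressed in cache lines
 * (illustrative values only, not a real Itanium configuration). */
static const long capacity[] = {512, 4096, 49152};      /* L1, L2, L3 */
static const int  num_levels = 3;

/* Pick the first cache level whose capacity exceeds the reuse distance, i.e.
 * the closest level that can still hold the data when it is reused.  A
 * negative reuse distance stands for "never reused" and maps past the last
 * level (bypass / non-temporal). */
static int hint_level(long reuse_dist)
{
    if (reuse_dist < 0)
        return num_levels;                   /* no temporal locality at all */
    for (int level = 0; level < num_levels; level++)
        if (reuse_dist < capacity[level])
            return level;
    return num_levels;
}

int main(void)
{
    const long sample_distances[] = {10, 900, 20000, -1};
    static const char *const names[] = {"L1", "L2", "L3", "bypass/NT"};

    for (int i = 0; i < 4; i++) {
        int level = hint_level(sample_distances[i]);
        printf("reuse distance %6ld -> keep data in %s\n",
               sample_distances[i], names[level]);
    }
    return 0;
}
```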


IEEE Computer | 2009

Refactoring for Data Locality

Kristof Beyls; Erik H. D'Hollander

Refactoring for data locality opens a new avenue for performance-oriented program rewriting. SLO has broken down a large part of the complexity that software developers face when speeding up programs with numerous cache misses. Therefore, we consider SLO to belong to a new generation of program analyzers. Whereas existing cache profilers (generation 1.0) highlight problems such as cache misses, second-generation analyzers (such as SLO) highlight the place to fix problems. Improving data locality is also important in hardware-based applications. SLO was already used to optimize the frame rate and energy consumption in a wavelet decoder implemented on an FPGA. In another vein, the SLO concepts could be incorporated in interactive performance debuggers and profile-directed compilers. We believe that SLO will be useful in the optimization of many data-intensive applications.
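A generic sketch of the kind of source-level change such a tool steers the developer toward (not an example taken from the paper): fusing two loops that sweep the same array moves the reuse of each element right next to its use, so the data is still cached at the second access.

```c
#include <stddef.h>

#define N (1 << 20)

/* Before: a[i] is reused only after the whole array has been swept once, so
 * for large N the second loop misses the cache on nearly every access. */
void scale_then_accumulate(double *a, double s, double *sum)
{
    for (size_t i = 0; i < N; i++)
        a[i] *= s;
    for (size_t i = 0; i < N; i++)
        *sum += a[i];
}

/* After: the loops are fused, so each a[i] is reused immediately after it is
 * written and the second access hits in the cache. */
void scale_and_accumulate(double *a, double s, double *sum)
{
    for (size_t i = 0; i < N; i++) {
        a[i] *= s;
        *sum += a[i];
    }
}
```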


Compiler Construction | 2005

Experiences with enumeration of integer projections of parametric polytopes

Sven Verdoolaege; Kristof Beyls; Maurice Bruynooghe; Francky Catthoor

Many compiler optimization techniques depend on the ability to calculate the number of integer values that satisfy a given set of linear constraints. This count (the enumerator of a parametric polytope) is a function of the symbolic parameters that may appear in the constraints. In an extended problem (the “integer projection” of a parametric polytope), some of the variables that appear in the constraints may be existentially quantified and then the enumerated set corresponds to the projection of the integer points in a parametric polytope. This paper shows how to reduce the enumeration of the integer projection of parametric polytopes to the enumeration of parametric polytopes. Two approaches are described and experimentally compared. Both can solve problems that were considered very difficult to solve analytically.
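A tiny example (ours, not one from the paper) shows why the existential quantifier matters. For $S_N = \{\, x \in \mathbb{Z} : \exists y \in \mathbb{Z},\ 0 \le y \le x \le N \,\}$, the projection onto $x$ is simply $\{0, 1, \dots, N\}$, so

\[
  \# S_N \;=\; N + 1 ,
\]

whereas counting the integer points $(x, y)$ of the underlying two-dimensional parametric polytope gives $\tfrac{(N+1)(N+2)}{2}$, because each value of $x$ has $x + 1$ witnesses $y$. Eliminating this multiplicity of witnesses is what makes enumerating integer projections harder than enumerating ordinary parametric polytopes, and it is what the reductions in the paper address.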


High Performance Computing and Communications | 2006

Discovery of locality-improving refactorings by reuse path analysis

Kristof Beyls; Erik H. D'Hollander

Due to the huge speed gaps in the memory hierarchy of modern computer architectures, it is important that programs maintain good data locality. Improving temporal locality implies reducing the distance of data reuses that are far apart. The best existing tools indicate locality bottlenecks by highlighting both the source locations generating the use and the subsequent cache-missing reuse. Even with this knowledge of the bottleneck locations in the source code, it often remains hard to find an effective code refactoring that improves temporal locality, due to the unclear interaction of function calls and loop iterations occurring between use and reuse. The contributions in this paper are two-fold. First, the locality analysis is enhanced not only to pinpoint the cache bottlenecks, but also to suggest code refactorings that may resolve them. The refactorings are found by analyzing the dynamic hierarchy of function calls and loops on the code path between reuses, called reuse paths. Secondly, reservoir sampling of the reuse paths results in a significant reduction of the execution time and memory requirements during profiling, enabling the analysis of realistic programs. An interactive GUI, called SLO (Suggestions for Locality Optimizations), has been used to explore the most appropriate refactorings in a number of SPEC2000 programs. After refactoring, the execution time of the selected programs was halved on average.
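The sampling step can be illustrated with standard reservoir sampling, which maintains a fixed-size uniform sample of a stream whose length is not known in advance. In the sketch below, plain integers stand in for reuse paths, and the reservoir size is an arbitrary illustrative choice rather than the profiler's actual setting.

```c
#include <stdio.h>
#include <stdlib.h>

#define RESERVOIR_SIZE 8

/* Keep a uniform random sample of RESERVOIR_SIZE items from a stream of
 * unknown length: item n (1-based) replaces a random reservoir slot with
 * probability RESERVOIR_SIZE / n. */
static long reservoir[RESERVOIR_SIZE];
static long seen = 0;

static void observe(long item)
{
    seen++;
    if (seen <= RESERVOIR_SIZE) {
        reservoir[seen - 1] = item;              /* fill the reservoir first */
    } else {
        long slot = rand() % seen;               /* roughly uniform in [0, seen) */
        if (slot < RESERVOIR_SIZE)
            reservoir[slot] = item;
    }
}

int main(void)
{
    srand(42);
    for (long item = 0; item < 100000; item++)   /* stand-in for reuse paths */
        observe(item);

    for (int i = 0; i < RESERVOIR_SIZE; i++)
        printf("sampled item: %ld\n", reservoir[i]);
    return 0;
}
```

Because the sample size is fixed, memory use during profiling no longer grows with the number of reuse paths observed, which is the point exploited in the paper.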


High Performance Embedded Architectures and Compilers | 2007

Finding and Applying Loop Transformations for Generating Optimized FPGA Implementations

Harald Devos; Kristof Beyls; Mark Christiaens; Jan Van Campenhout; Erik H. D'Hollander; Dirk Stroobandt

When implementing multimedia applications, solutions in dedicated hardware are chosen only when the required performance or energy-efficiency cannot be met with a software solution. The performance of a hardware design critically depends upon having high levels of parallelism and data locality. Often a long sequence of high-level transformations is needed to sufficiently increase the locality and parallelism. The effect of the transformations is known only after translating the high-level code into a specific design at the circuit level. When the constraints are not met, hardware designers need to redo the high-level loop transformations, and repeat all subsequent translation steps, which leads to long design times. We propose a method to reduce design time through the synergistic combination of techniques (a) to quickly pinpoint the loop transformations that increase locality; (b) to refactor loops in a polyhedral model and check whether a sequence of refactorings is legal; (c) to generate efficient structural VHDL from the optimized refactored algorithm. The implementation of these techniques in a tool suite results in a far shorter design time of hours instead of days or weeks. A 2D-inverse discrete wavelet transform was taken as a case study. The results outperform those of a commercial C-to-VHDL compiler, and compare favorably with existing published approaches.
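A generic illustration of the kind of high-level loop transformation involved (not the wavelet code from the paper): tiling regroups the iterations of a loop nest so that a small block of data is reused many times while it fits in on-chip memory. The transpose kernel and tile size below are assumptions chosen for brevity.

```c
#define N    1024
#define TILE 32            /* TILE divides N */

/* Original loop nest: dst is written column by column with stride N, so each
 * line of dst is touched again only after many accesses to other lines and
 * the working set is far larger than any on-chip buffer. */
void transpose(double dst[N][N], const double src[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Tiled version: the same iterations are regrouped into TILE x TILE blocks,
 * so at any time only a small block of src and dst is live and every fetched
 * line is fully used before the computation moves on. */
void transpose_tiled(double dst[N][N], const double src[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

Checking that such a regrouping preserves the original semantics is exactly the kind of legality question the polyhedral techniques in the paper answer automatically.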


Proceedings Fifth International Conference on Information Visualisation | 2001

Visualizing the impact of the cache on program execution

Yijun Yu; Kristof Beyls; Erik H. D'Hollander

The global cache miss ratio of a program does not reveal the time distribution of the memory reference patterns in detail. On the other hand, cache visualization is hampered by the huge amount of memory references to display. Therefore, many visualizers focus on a snapshot of the cache content, instead of viewing all memory transactions. A cache visualizer is introduced which presents the integral cache behavior of a program in several complementary views: the density view of the cache misses shows the hot spots of the program; the reuse distance view shows the data locality and its effect on performance; the histogram view shows the periodic patterns that occur in the trace. In a number of experiments, the visualizer has been used to characterize and effectively improve the cache behavior and program performance.
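The histogram view can be approximated in a few lines: bucketing measured reuse distances by powers of two gives the kind of summary such a visualizer plots. The bucket scheme and the sample distances below are illustrative assumptions, not the tool's implementation.

```c
#include <stdio.h>

#define NUM_BUCKETS 16

/* Bucket reuse distances by powers of two: bucket b holds distances in
 * [2^b, 2^(b+1)), with distances 0 and 1 counted in bucket 0. */
static long histogram[NUM_BUCKETS];

static void record(long reuse_dist)
{
    int bucket = 0;
    while (reuse_dist > 1 && bucket < NUM_BUCKETS - 1) {
        reuse_dist >>= 1;
        bucket++;
    }
    histogram[bucket]++;
}

int main(void)
{
    /* Sample reuse distances standing in for a measured trace. */
    const long samples[] = {0, 1, 3, 4, 7, 8, 100, 1000, 5000, 70000};

    for (int i = 0; i < 10; i++)
        record(samples[i]);

    for (int b = 0; b < NUM_BUCKETS; b++)
        if (histogram[b] > 0) {
            long lo = (b == 0) ? 0 : (1L << b);
            printf("reuse distance in [%ld, %ld): %ld accesses\n",
                   lo, 1L << (b + 1), histogram[b]);
        }
    return 0;
}
```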


International Conference on Computational Science | 2004

Platform-independent cache optimization by pinpointing low-locality reuse

Kristof Beyls; Erik H. D'Hollander

For many applications, cache misses are the primary performance bottleneck. Even though much research has been performed on automatically optimizing cache behavior at the hardware and the compiler level, many program executions remain dominated by cache misses. Therefore, we propose to let the programmer, who has the high-level overview of the program needed to resolve many cache problems, perform the optimization. In order to assist the programmer, a visualization of memory accesses with poor locality is developed. The aim is to indicate causes of cache misses independently of actual cache parameters such as associativity or size. In that way, the programmer is steered towards platform-independent locality optimizations. The visualization was applied to three programs from the SPEC2000 benchmarks. After optimizing the source code based on the visualization, an average speedup of 3.06 was obtained on different platforms with Athlon, Itanium and Alpha processors, indicating the feasibility of platform-independent cache optimizations.
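A classic instance of such a platform-independent optimization (a generic example, not one of the SPEC2000 cases from the paper) is interchanging loops so that a row-major C array is traversed along its rows rather than its columns; the improvement holds on any machine with caches, regardless of their size or associativity.

```c
#define N 2048

/* Column-major traversal of a row-major C array: consecutive iterations touch
 * elements that are N doubles apart, so nearly every access starts a new
 * cache line and locality is poor on any cache configuration. */
double sum_by_columns(const double a[N][N])
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

/* Interchanged loops: the array is swept row by row, so each cache line is
 * used completely before the traversal moves on. */
double sum_by_rows(const double a[N][N])
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```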

Collaboration


Dive into Kristof Beyls's collaborations.

Top Co-Authors

Maurice Bruynooghe

Katholieke Universiteit Leuven


Sven Verdoolaege

Katholieke Universiteit Leuven


Rachid Seghir

Centre national de la recherche scientifique
