Sean Rul | Researchain

Publication

Featured research published by Sean Rul.

International Conference on Parallel Architectures and Compilation Techniques | 2010

The Paralax infrastructure: automatic parallelization with a helping hand

Hans Vandierendonck; Sean Rul; Koen De Bosschere

Speeding up sequential programs on multicores is a challenging problem that is in urgent need of a solution. Automatic parallelization of irregular pointer-intensive codes, exemplified by the SPECint codes, is a very hard problem. This paper shows that, with a helping hand, such auto-parallelization is possible and fruitful.

Parallel Computing | 2010

A profile-based tool for finding pipeline parallelism in sequential programs

Sean Rul; Hans Vandierendonck; Koen De Bosschere

Traditional static analysis fails to auto-parallelize programs with complex control and data flow. Furthermore, thread-level parallelism in such programs is often restricted to pipeline parallelism, which can be hard for a programmer to discover. In this paper we propose a tool that, based on profiling information, helps the programmer to discover parallelism. The programmer hand-picks the code transformations from among the proposed candidates, which are then applied by automatic code transformation techniques. This paper contributes to the literature by presenting a profiling tool for discovering thread-level parallelism. We track dependencies at the whole-data structure level rather than at the element level or byte level in order to limit the profiling overhead. We perform a thorough analysis of the needs and costs of this technique. Furthermore, we present and validate the belief that programs with complex control and data flow contain significant amounts of exploitable coarse-grain pipeline parallelism in the program's outer loops. This observation validates our approach to whole-data structure dependencies. As state-of-the-art compilers focus on loops iterating over data structure members, this observation also explains why our approach finds coarse-grain pipeline parallelism in cases that have remained out of reach for state-of-the-art compilers. In cases where traditional compilation techniques do find parallelism, our approach discovers higher degrees of parallelism, yielding a 40% speedup over traditional compilation techniques. Moreover, we demonstrate real speedups on multiple hardware platforms.
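The idea of tracking dependencies per data structure rather than per byte can be illustrated with a minimal sketch. This is not the authors' tool: the class `StructureLevelProfiler`, its methods, the stage names, and the addresses are all hypothetical, chosen only to show how coarsening the granularity shrinks the bookkeeping to one entry per structure.

```python
from bisect import bisect_right


class StructureLevelProfiler:
    """Hypothetical sketch: record read-after-write dependencies per data structure."""

    def __init__(self):
        self._starts = []           # sorted start addresses of registered structures
        self._regions = []          # parallel list of (start, end, name)
        self._last_writer = {}      # structure name -> code region that last wrote it
        self.dependencies = set()   # (producer region, consumer region, structure)

    def register(self, name, start, size):
        # Keep both lists sorted by start address so lookups can use binary search.
        idx = bisect_right(self._starts, start)
        self._starts.insert(idx, start)
        self._regions.insert(idx, (start, start + size, name))

    def _owner(self, address):
        # Map a raw address to the structure that contains it, if any.
        idx = bisect_right(self._starts, address) - 1
        if idx >= 0:
            start, end, name = self._regions[idx]
            if start <= address < end:
                return name
        return None

    def write(self, region, address):
        name = self._owner(address)
        if name is not None:
            self._last_writer[name] = region

    def read(self, region, address):
        name = self._owner(address)
        producer = self._last_writer.get(name)
        if producer is not None and producer != region:
            self.dependencies.add((producer, region, name))


# Made-up trace of three code regions communicating through two buffers.
profiler = StructureLevelProfiler()
profiler.register("input_buffer", start=0x1000, size=256)
profiler.register("output_buffer", start=0x2000, size=256)
profiler.write("read_stage", 0x1000)
profiler.read("compress_stage", 0x1004)   # different byte, same structure
profiler.write("compress_stage", 0x2000)
profiler.read("write_stage", 0x20FF)
```

Note that the read at `0x1004` counts as a dependency even though that exact byte was never written. Whole-structure tracking is deliberately conservative in this way, trading precision for a much smaller table.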

ACM SIGARCH Computer Architecture News | 2007

Function level parallelism driven by data dependencies

Sean Rul; Hans Vandierendonck; Koen De Bosschere

With the rise of chip multiprocessors (CMPs), the amount of parallel computing power will increase significantly in the near future. However, most programs are sequential in nature and have not been explicitly parallelized, so they cannot exploit these parallel resources. Automatic parallelization of sequential, non-regular codes is very hard, as illustrated by the lack of solutions after more than 30 years of research on the topic. The question remains whether there is parallelism in sequential programs that can be detected automatically and, if so, how much parallelism there is. In this paper, we propose a framework for extracting potential parallelism from programs. Applying this framework to sequential programs can teach us how much parallelism is present in a program, and also tells us what the most appropriate parallel construct for a program is, e.g. a pipeline, master/slave work distribution, etc. Our framework is profile-based, implying that it is not safe. It builds two new graph representations of the profile data: the interprocedural data flow graph and the data sharing graph. These graphs show the data flow between functions and the data structures facilitating this data flow, respectively. We apply our framework to the SPECcpu2000 bzip2 benchmark, achieving a speedup of 3.74 for the compression part and a global speedup of 2.45 on a quad-processor system.
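As a rough illustration of the two graph representations named in the abstract, the sketch below derives both from a function-level access trace. The trace format, the function `build_graphs`, and the function and structure names are assumptions made for this example. The abstract does not specify how the graphs are encoded, so the dictionaries used here are one possible reading, not the framework's actual representation.

```python
from collections import defaultdict


def build_graphs(trace):
    """Build two graphs from (function, data_structure, access) events.

    access is "r" or "w". Hypothetical encoding:
    data_flow maps a producer function to {(consumer function, structure)};
    data_sharing maps a data structure to the set of functions touching it.
    """
    last_writer = {}
    data_flow = defaultdict(set)
    data_sharing = defaultdict(set)
    for function, structure, access in trace:
        data_sharing[structure].add(function)
        if access == "w":
            last_writer[structure] = function
        elif access == "r":
            producer = last_writer.get(structure)
            if producer is not None and producer != function:
                data_flow[producer].add((function, structure))
    return dict(data_flow), dict(data_sharing)


# Made-up trace: three functions handing data along through two structures.
trace = [
    ("read_block", "block", "w"),
    ("transform", "block", "r"),
    ("transform", "codes", "w"),
    ("emit", "codes", "r"),
]
data_flow, data_sharing = build_graphs(trace)
```

In this toy trace the data flow graph forms a chain (`read_block` to `transform` to `emit`), which is the kind of shape that would hint at a pipeline construct. Because the graphs come from one profiled run, they only reflect dependencies that occurred in that run, which is why a profile-based approach is not safe.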

High Performance Embedded Architectures and Compilers | 2008

Experiences with parallelizing a bio-informatics program on the Cell BE

Hans Vandierendonck; Sean Rul; Michiel Questier; Koen De Bosschere

The Cell Broadband Engine Architecture is a new heterogeneous multi-core architecture targeted at compute-intensive workloads. The architecture of the Cell BE has several features that are unique in high-performance general-purpose processors, such as static instruction scheduling, extensive support for vectorization, scratch pad memories, explicit programming of DMAs, mailbox communication, multiple processor cores, etc. It is necessary to make explicit use of these features to obtain high performance. Yet, little work reports on how to apply them and how much each of them contributes to performance. This paper presents our experiences with programming the Cell BE architecture. Our test application is Clustal W, a bio-informatics program for multiple sequence alignment. We report on how we apply the unique features of the Cell BE to Clustal W and how important each is to obtain high performance. By making extensive use of vectorization and by parallelizing the application across all cores, we speed up the pairwise alignment phase of Clustal W by a factor of 51.2 over PPU (superscalar) execution. The progressive alignment phase is sped up by a factor of 5.7 over PPU execution, resulting in an overall speedup of 9.1.
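The relationship between per-phase speedups and the overall speedup follows the usual Amdahl-style composition: the new run time is the sum of each phase's original time share divided by that phase's speedup. The sketch below states that formula and inverts it for a two-phase program. Both function names are hypothetical, and the two-phase model is an assumption made for illustration: the abstract does not give the time breakdown, and the program may spend time in phases other than the two it reports.

```python
def overall_speedup(phases):
    """Overall speedup for phases given as (fraction_of_original_time, speedup)."""
    total = sum(fraction for fraction, _ in phases)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return 1.0 / sum(fraction / speedup for fraction, speedup in phases)


def fraction_for_two_phases(speedup_a, speedup_b, speedup_total):
    """Share of original time spent in phase a, ASSUMING only phases a and b exist."""
    return (1.0 / speedup_total - 1.0 / speedup_b) / (1.0 / speedup_a - 1.0 / speedup_b)


# Under the two-phase assumption, recover the time share that would make the
# reported phase speedups (51.2 and 5.7) consistent with the overall 9.1.
share = fraction_for_two_phases(51.2, 5.7, 9.1)
check = overall_speedup([(share, 51.2), (1.0 - share, 5.7)])
```

The formula also shows why the overall figure sits much closer to 5.7 than to 51.2: once the pairwise phase is accelerated this strongly, the remaining run time is dominated by the more modestly accelerated phase.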

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Extracting coarse-grain parallelism in general-purpose programs

Sean Rul; Hans Vandierendonck; Koen De Bosschere

While the chip multiprocessor (CMP) has quickly become the predominant processor architecture, its continuing success largely depends on the parallelizability of complex programs. In the early 1990s great successes were achieved in extracting parallelism from the inner loops of scientific computations. In this paper we show that significant amounts of coarse-grain parallelism exist in the outer program loops, even in general-purpose programs. This coarse-grain parallelism can be exploited efficiently on CMPs without additional hardware support.

The Computer Journal | 2010

Accelerating Multiple Sequence Alignment with the Cell BE Processor

Hans Vandierendonck; Sean Rul; Koen De Bosschere

The Cell Broadband Engine (BE) Architecture is a new heterogeneous multi-core architecture targeted at compute-intensive workloads. The architecture of the Cell BE has several features that are unique in high-performance general-purpose processors, most notably the extensive support for vectorization, scratch pad memories and explicit programming of direct memory accesses (DMAs) and mailbox communication. While these features strongly increase programming complexity, it is generally claimed that significant speedups can be obtained by using Cell BE processors. This paper presents our experiences with using the Cell BE architecture to accelerate Clustal W, a bio-informatics program for multiple sequence alignment. We report on how we apply the unique features of the Cell BE to Clustal W and how important each is in obtaining high performance. By making extensive use of vectorization and by parallelizing the application across all cores, we demonstrate a speedup of 24.4 times when using 16 synergistic processor units on a QS21 Cell Blade compared to single-thread execution on the power processing unit. As the Cell BE exploits a large number of slim cores, our highly optimized implementation is just 3.8 times faster than a 3-thread version running on an Intel Core2 Duo, as the latter processor exploits a small number of fat cores.

Computing Frontiers | 2009

Towards automatic program partitioning

Sean Rul; Hans Vandierendonck; Koen De Bosschere

There is a trend towards using accelerators to increase performance and energy efficiency of general-purpose processors. Adoption of accelerators, however, depends on the availability of tools to facilitate programming these devices. In this paper, we present techniques for automatically partitioning programs for execution on accelerators. We call the off-loaded code regions sub-algorithms, which are parts of the program that are loosely connected to the remainder of the program. We present three heuristics for automatically identifying sub-algorithms based on control flow and data flow properties. Analysis of SPECint and MiBench benchmarks shows that on average 12 sub-algorithms are identified (up to 54), covering the full execution time for 27 out of 30 benchmarks. We show that these sub-algorithms are suitable for off-loading to accelerators by manually implementing sub-algorithms for 2 SPECint benchmarks on the Cell processor.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2010