Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Herbert Jordan is active.

Publication


Featured research published by Herbert Jordan.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

A multi-objective auto-tuning framework for parallel codes

Herbert Jordan; Peter Thoman; Juan José Durillo; Simone Pellegrini; Philipp Gschwandtner; Thomas Fahringer; Hans Moritsch

In this paper we introduce a multi-objective autotuning framework comprising compiler and runtime components. Focusing on individual code regions, our compiler uses a novel search technique to compute a set of optimal solutions, which are encoded into a multi-versioned executable. This enables the runtime system to choose specifically tuned code versions when dynamically adjusting to changing circumstances. We demonstrate our method by tuning loop tiling in cache-sensitive parallel programs, optimizing for both runtime and efficiency. Our static optimizer finds solutions matching or surpassing those determined by exhaustively sampling the search space on a regular grid, while using less than 4% of the computational effort on average. Additionally, we show that parallelism-aware multi-versioning approaches like our own gain a performance improvement of up to 70% over solutions tuned for only one specific number of threads.
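The multi-versioned executable described above amounts to a Pareto set of code variants from which the runtime picks one as circumstances change. A minimal sketch of that selection step, with hypothetical variant names and profiled (runtime, energy) values not taken from the paper:

```python
# Sketch: runtime selection among multi-versioned code variants, each
# profiled for two objectives. Variant names and numbers are illustrative.

def pareto_front(variants):
    """Keep only variants not dominated in both objectives by another variant."""
    return [v for v in variants
            if not any(o["runtime_s"] <= v["runtime_s"]
                       and o["energy_j"] <= v["energy_j"]
                       and o != v
                       for o in variants)]

def pick_variant(variants, w_time=0.5, w_energy=0.5):
    """Choose the Pareto-optimal variant minimizing a weighted objective sum."""
    return min(pareto_front(variants),
               key=lambda v: w_time * v["runtime_s"] + w_energy * v["energy_j"])

variants = [
    {"name": "tile_8",  "runtime_s": 1.0, "energy_j": 50.0},
    {"name": "tile_32", "runtime_s": 0.7, "energy_j": 65.0},
    {"name": "tile_64", "runtime_s": 0.9, "energy_j": 70.0},  # dominated by tile_32
]
```

Shifting the weights models the "changing circumstances" the runtime adjusts to: a pure-speed weighting picks `tile_32`, while a balanced weighting picks the cheaper `tile_8`.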


International Conference on Parallel Architectures and Compilation Techniques | 2013

INSPIRE: the insieme parallel intermediate representation

Herbert Jordan; Simone Pellegrini; Peter Thoman; Klaus Kofler; Thomas Fahringer

Programming standards such as OpenMP, OpenCL, and MPI are frequently regarded as programming languages for developing parallel applications on their respective kinds of architecture. Nevertheless, compilers treat them like ordinary APIs utilized by an otherwise sequential host language. Their parallel control flow remains hidden within opaque runtime library calls embedded in a sequential intermediate representation that lacks any concept of parallelism. Consequently, the tuning and coordination of parallelism is clearly beyond the scope of conventional optimizing compilers and hence left to the programmer or the runtime system. The main objective of the Insieme compiler is to overcome this limitation by utilizing INSPIRE, a unified, parallel, high-level intermediate representation. Instead of mapping parallel constructs and APIs to external routines, their behavior is modeled explicitly using a unified and fixed set of parallel language constructs. Making the parallel control flow accessible to the compiler lays the foundation for the development of reusable, static and dynamic analyses and transformations bridging the gap between a variety of parallel paradigms. Within this paper we describe the structure of INSPIRE and elaborate on the considerations that influenced its design. Furthermore, we demonstrate its expressiveness by illustrating the encoding of a variety of parallel language constructs, and we evaluate its ability to preserve performance-relevant aspects of input codes.
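The core idea, modeling parallelism explicitly rather than hiding it behind opaque library calls, can be caricatured in a few lines. The node and function names below are illustrative only; they are not INSPIRE's actual constructs:

```python
# Sketch: an opaque runtime call versus an explicit parallel IR node.
# Names are hypothetical, not INSPIRE's real constructs.

class Call:
    """Opaque form: the compiler sees only a call into a runtime library."""
    def __init__(self, callee, args):
        self.callee, self.args = callee, args

class Parallel:
    """Explicit form: parallelism is a first-class IR construct."""
    def __init__(self, num_threads, body):
        self.num_threads, self.body = num_threads, body

def spawned_threads(node):
    """A trivial analysis that works on the explicit IR, but would require
    modeling the runtime library's semantics for the opaque call."""
    if isinstance(node, Parallel):
        return node.num_threads
    return None  # unknown for an opaque call

opaque   = Call("GOMP_parallel_start", ["work_fn", 4])
explicit = Parallel(4, body=["work_fn"])
```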


International Workshop on OpenMP | 2012

Automatic OpenMP loop scheduling: a combined compiler and runtime approach

Peter Thoman; Herbert Jordan; Simone Pellegrini; Thomas Fahringer

The scheduling of parallel loops in OpenMP has been a research topic for over a decade. While many methods have been proposed, most focus on adapting the loop schedule purely at runtime, and without regard for the overall system state. We present a fully automatic loop scheduling policy that can adapt both to the characteristics of the input program and to the current runtime behaviour of the system, including external load. Using state-of-the-art polyhedral compiler analysis, we generate effort estimation functions that are then used by the runtime system to derive the optimal loop schedule for a given loop, work group size, iteration range, and system state. We demonstrate performance improvements of up to 82% compared to default scheduling in an unloaded scenario, and up to 471% in a scenario with external load. We further show that even in the worst case, the results achieved by our automated system stay within 3% of the performance of a manually tuned strategy.
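A decision of this shape, an effort-estimation function plus system state driving the schedule choice, might look roughly as follows. All names, thresholds, and the chunk-size formula are hypothetical simplifications, not the paper's actual policy:

```python
# Sketch: deriving a loop schedule from a compiler-generated per-iteration
# effort estimate and the current system state. Thresholds are illustrative.

def derive_schedule(effort, iterations, workers, external_load):
    """effort(i) estimates the cost of iteration i. Uniform effort on an
    unloaded system favors static chunking; uneven effort or external
    load favors small dynamic chunks for load balance."""
    costs = [effort(i) for i in range(iterations)]
    mean = sum(costs) / len(costs)
    uniform = (max(costs) - min(costs)) < 0.1 * mean
    if uniform and external_load == 0:
        return ("static", iterations // workers)
    return ("dynamic", max(1, iterations // (workers * 8)))
```

For example, a flat effort function on an idle 8-worker system yields large static chunks, while a triangular (iteration-dependent) effort function or any external load switches to small dynamic chunks.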


Computer Aided Verification | 2016

Soufflé: On Synthesis of Program Analyzers

Herbert Jordan; Bernhard Scholz; Pavle Subotić

Soufflé is an open-source programming framework that performs static program analysis expressed in Datalog on very large code bases, including points-to analysis on OpenJDK7 (1.4M program variables, 350K objects, 160K methods) in under a minute. Soufflé is being successfully used for Java security analyses at Oracle Labs due to (1) its high performance, (2) its support for rapid program analysis development, and (3) its customizability. Soufflé incorporates the highly flexible Datalog-based program analysis paradigm while exhibiting performance results that are on par with manually developed state-of-the-art tools. In this tool paper, we introduce Soufflé's architecture and usage, and demonstrate its applicability to large-scale code analysis using the OpenJDK7 library as a use case.


International Conference on Parallel Processing | 2013

Adaptive granularity control in task parallel programs using multiversioning

Peter Thoman; Herbert Jordan; Thomas Fahringer

Task parallelism is a programming technique that has been shown to be applicable in a wide variety of problem domains. A central parameter that needs to be controlled to ensure efficient execution of task-parallel programs is the granularity of tasks. When they are too coarse-grained, scalability and load balance suffer, while very fine-grained tasks introduce execution overheads. We present a combined compiler and runtime approach that enables automatic granularity control. Starting from recursive, task parallel programs, our compiler generates multiple versions of each task, increasing granularity by task unrolling and subsequent removal of superfluous synchronization primitives. A runtime system then selects among these task versions of varying granularity by tracking task demand. Benchmarking on a set of task parallel programs using a work-stealing scheduler demonstrates that our approach is generally effective. For fine-grained tasks, we can achieve reductions in execution time exceeding a factor of 6, compared to state-of-the-art implementations.
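The runtime's version-selection step can be sketched as a simple lookup from current task demand to an unroll factor. The demand proxy (local queue length) and the thresholds below are hypothetical, not the paper's exact policy:

```python
# Sketch: picking a task version (unroll factor) from current task demand,
# approximated by the worker's local queue length. Thresholds illustrative.

def select_version(queue_length, unroll_factors=(1, 2, 4, 8)):
    """An empty queue signals starvation: spawn the finest-grained version
    to generate stealable parallelism. A well-filled queue signals ample
    demand: run a coarser, unrolled version to cut task overhead."""
    if queue_length == 0:
        return unroll_factors[0]
    idx = min(len(unroll_factors) - 1, queue_length)
    return unroll_factors[idx]
```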


Compiler Construction | 2016

On fast large-scale program analysis in Datalog

Bernhard Scholz; Herbert Jordan; Pavle Subotić; Till Westmann

Designing and crafting a static program analysis is challenging due to the complexity of the task at hand. Among the challenges are modelling the semantics of the input language, finding suitable abstractions for the analysis, and hand-writing efficient code for the analysis in a traditional imperative language such as C++. Hence, the development of static program analysis tools is costly in terms of development time and resources for real-world languages. To overcome, or at least alleviate, the costs of developing a static program analysis, Datalog has been proposed as a domain-specific language (DSL). With Datalog, a designer expresses a static program analysis in the form of a logical specification. While a domain-specific language approach aids the ease of development of program analyses, it is commonly accepted that such an approach has worse runtime performance than handcrafted static analysis tools. In this work, we introduce a new program synthesis methodology for Datalog specifications to produce highly efficient monolithic C++ analyzers. The synthesis technique requires the re-interpretation of semi-naive evaluation as a scaffolding for translation using partial evaluation. To achieve high performance, we employ staged-compilation techniques and specialize the underlying relational data structures for a given Datalog specification. Experimentation on benchmarks for large-scale program analysis validates the superior performance of our approach over available Datalog tools and demonstrates our competitiveness with state-of-the-art handcrafted tools.
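Semi-naive evaluation, the scheme the paper uses as scaffolding for its translation, can be illustrated on the classic transitive-closure program. The sketch below is a generic Python rendering for exposition; Soufflé instead synthesizes specialized C++ with tailored relational data structures:

```python
# Sketch: semi-naive evaluation of the Datalog program
#   path(x, y) :- edge(x, y).
#   path(x, z) :- path(x, y), edge(y, z).
# Each iteration joins only the tuples derived in the previous
# iteration (delta) against edge, instead of re-joining all of path.

def transitive_closure(edge):
    path = set(edge)    # path(x, y) :- edge(x, y).
    delta = set(edge)   # tuples that are new as of the last iteration
    while delta:
        new = {(x, z) for (x, y) in delta for (y2, z) in edge if y == y2}
        delta = new - path   # keep only genuinely new facts
        path |= delta
    return path

edges = {(1, 2), (2, 3), (3, 4)}
```

Restricting each join to the delta is what makes the evaluation avoid rederiving known facts; the synthesized C++ follows the same iteration structure with specialized indices.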


Computing Frontiers | 2010

Automatic tuning of MPI runtime parameter settings by using machine learning

Simone Pellegrini; Thomas Fahringer; Herbert Jordan; Hans Moritsch

MPI implementations provide several hundred runtime parameters that can be tuned for performance improvement. The ideal parameter setting depends not only on the target multiprocessor architecture but also on the application, its problem size, and its communicator size. This paper presents ATune, an automatic performance tuning tool that uses machine learning techniques to determine the program-specific optimal settings for a subset of Open MPI's runtime parameters. ATune learns the behaviour of a target system by means of a training phase in which several MPI benchmarks and MPI applications are run on a target architecture for varying problem and communicator sizes. For new input programs, only one run is required for ATune to deliver a prediction of the optimal runtime parameter values. Experiments based on the NAS Parallel Benchmarks, performed on a cluster of SMP machines, demonstrate the effectiveness of ATune. For these experiments, ATune derives MPI runtime parameter settings that are on average within 4% of the maximum performance achievable on the target system, resulting in a performance gain of up to 18% with respect to the default parameter setting.
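In the spirit of that learn-then-predict workflow, a deliberately simplified sketch: predict a parameter setting for a new program from the most similar training run. The feature space, the nearest-neighbor choice of model, and the parameter name are all assumptions for illustration, not ATune's actual design:

```python
# Sketch: nearest-neighbor prediction of runtime-parameter settings from
# training runs. Features, model, and parameter names are hypothetical.

def predict_parameters(training_runs, problem_size, comm_size):
    """Return the parameter setting of the closest training run in
    (problem size, communicator size) space."""
    def distance(run):
        return ((run["problem_size"] - problem_size) ** 2
                + (run["comm_size"] - comm_size) ** 2) ** 0.5
    return min(training_runs, key=distance)["parameters"]

training_runs = [
    {"problem_size": 64,  "comm_size": 8,  "parameters": {"eager_limit": 4096}},
    {"problem_size": 256, "comm_size": 32, "parameters": {"eager_limit": 65536}},
]
```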


Concurrency and Computation: Practice and Experience | 2014

Compiler multiversioning for automatic task granularity control

Peter Thoman; Herbert Jordan; Thomas Fahringer

Task parallelism is a programming technique that has been shown to be applicable in a wide variety of problem domains. A central parameter that needs to be controlled to ensure efficient execution of task-parallel programs is the granularity of tasks. When they are too coarse-grained, scalability and load balance suffer, while very fine-grained tasks introduce execution overheads. We present a combined compiler and runtime approach that enables automatic granularity control. Starting from recursive, task parallel programs, our compiler generates multiple versions of each task, increasing granularity by task unrolling. Subsequently, we apply a parallelism-aware optimizing transformation to remove superfluous task synchronization primitives in all generated versions. A runtime system then selects among these task versions of varying granularity by locally tracking task demand. Benchmarking on a set of task parallel programs using a work-stealing scheduler demonstrates that our approach is generally effective. For fine-grained tasks, we can achieve reductions in execution time exceeding a factor of 6, compared with state-of-the-art implementations. Additionally, we evaluate the impact of two crucial algorithmic parameters, the number of generated code versions and the task queue length, on the performance of our method.


Computing Frontiers | 2010

Dynamic load management for MMOGs in distributed environments

Herbert Jordan; Radu Prodan; Vlad Nae; Thomas Fahringer

To support thousands of concurrent players in virtual worlds simulated by contemporary Massively Multiplayer Online Games, most implementations employ static game world partitioning for distributing the load among multiple game server instances. Further, the resources that manage the resulting subregions are statically allocated, independent of the actual game load. As a result, due to the high variability of user demand, this approach leads to low resource utilization, causing much higher provisioning costs than necessary. In addition, the number of players supported by a region is limited by the maximum load that can be handled by a single server instance. In this paper we propose a novel game load management technique divided into two layers, a global and a local one, capable of dynamically adjusting the amount of allocated resources to the present user demand. The global layer assigns the responsibility of serving particular game regions to data centers using a peer-to-peer infrastructure, while the local layer within individual facilities maintains the necessary server instances for the assigned obligations. We devise two generic heuristics based on the well-known bin-packing problem to achieve the ultimate goal of maximizing resource utilization on both levels while maintaining user-level Quality of Service (QoS). We evaluate the performance of our proposed solution using simulation-based experiments, which demonstrate a potential cost reduction of up to 60% in maintaining MMOG sessions while maintaining QoS in 99% of the cases.
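A classic heuristic for this kind of bin-packing formulation is first-fit decreasing, sketched below for packing region loads onto fixed-capacity server instances. This is a generic textbook heuristic for illustration, not necessarily either of the two heuristics the paper devises:

```python
# Sketch: first-fit-decreasing packing of game-region loads onto server
# instances of fixed capacity. Loads and capacity are illustrative units.

def pack_regions(loads, capacity):
    """Assign each region load (largest first) to the first server with
    enough remaining capacity, opening a new server when none fits.
    Returns the number of servers used."""
    servers = []  # remaining capacity of each open server
    for load in sorted(loads, reverse=True):
        for i, remaining in enumerate(servers):
            if load <= remaining:
                servers[i] -= load
                break
        else:
            servers.append(capacity - load)
    return len(servers)
```

Fewer servers for the same total load is exactly the utilization gain the paper targets: resources track demand instead of being provisioned statically per region.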


The Journal of Supercomputing | 2018

A taxonomy of task-based parallel programming technologies for high-performance computing

Peter Thoman; Kiril Dichev; Thomas Heller; Roman Iakymchuk; Xavier Aguilar; Khalid Hasanov; Philipp Gschwandtner; Pierre Lemarinier; Stefano Markidis; Herbert Jordan; Thomas Fahringer; Kostas Katrinis; Erwin Laure; Dimitrios S. Nikolopoulos

Task-based programming models for shared memory—such as Cilk Plus and OpenMP 3—are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Collaboration


Dive into Herbert Jordan's collaborations.

Top Co-Authors

Peter Thoman
University of Innsbruck

Pavle Subotić
University College London

Kiril Dichev
Queen's University Belfast

Thomas Heller
University of Erlangen-Nuremberg