Publication


Featured research published by Jan Christian Meyer.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Improving Energy Efficiency through Parallelization and Vectorization on Intel Core i5 and i7 Processors

Juan M. Cebrian; Lasse Natvig; Jan Christian Meyer

Driven by the utilization wall and the Dark Silicon effect, energy efficiency has become a key research area in microprocessor design. Vectorization, parallelization, specialization, and heterogeneity are the key design points for dealing with the utilization wall. Heterogeneous architectures are enhanced with architectural optimizations, such as vectorization, to further increase the energy efficiency of the processor, reducing the number of instructions that go through the pipeline and making better use of the memory hierarchy. AMD® Fusion™ or the Intel Core i5 and i7 are commercial examples of this new generation of microprocessors. Still, there is a question to be answered: how can software developers maximize the energy efficiency of these architectures? In this paper, we evaluate the energy efficiency of different processors from the Intel Core i5 and i7 family, using selected benchmarks from the PARSEC suite with variable core counts and vectorization techniques to quantify energy efficiency under the Thermal Design Power (TDP). Results show that software developers should prioritize vectorization over parallelization whenever possible, as it is much better in terms of energy efficiency. When using vectorization and parallelization simultaneously, the scalability of the application can be reduced drastically, and different development strategies may be required to maximize resource utilization and thereby increase energy efficiency. This is especially true in the server market, where we can find more than one processor per board. Finally, when comparing on-chip and “at the wall” energy savings, we see variations from 5 to 20%, depending on the benchmark and system. This high variability shows the need to develop a more detailed model to predict system power based on on-chip power information.


International Parallel and Distributed Processing Symposium | 2009

A super-efficient adaptable bit-reversal algorithm for multithreaded architectures

Anne C. Elster; Jan Christian Meyer

Fast bit-reversal algorithms have been of strong interest for many decades, especially after Cooley and Tukey introduced their FFT implementation in 1965. Many recent algorithms, including FFTW, try to avoid the bit-reversal altogether by using in-place algorithms within their FFTs. We therefore motivate our work by showing that for FFTs of up to 65,536 points, a minimally tuned Cooley-Tukey FFT in C using our bit-reversal algorithm performs comparably to or better than the default FFTW algorithm. In this paper, we present an extremely fast linear bit-reversal adapted for modern multithreaded architectures. Our bit-reversal algorithm takes advantage of recursive calls combined with the fact that it only generates pairs of indices for which the corresponding elements need to be exchanged, thereby avoiding any explicit tests. In addition, we have implemented an adaptive approach which explores the trade-off between compile-time and run-time workload. By generating look-up tables at compile time, our algorithm becomes even faster at run time. Our results also show that by using more than one thread on tightly coupled architectures, further speed-up can be achieved.
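The pair-generation idea behind a bit-reversal permutation can be sketched in Python. This is an illustrative version only, not the paper's tuned recursive, test-free implementation: the function names are ours, and a plain loop with an explicit comparison stands in for the recursion and compile-time look-up tables.

```python
def bit_reverse(i, bits):
    # Reverse the lowest `bits` bits of index i (e.g. 001 -> 100).
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def swap_pairs(n):
    # Generate only the (i, j) index pairs whose elements must be
    # exchanged; emitting each pair once (i < j) skips fixed points.
    bits = n.bit_length() - 1  # n is assumed to be a power of two
    pairs = []
    for i in range(n):
        j = bit_reverse(i, bits)
        if i < j:
            pairs.append((i, j))
    return pairs
```

For an 8-point FFT this yields the two exchanges `(1, 4)` and `(3, 6)`; indices 0, 2, 5, and 7 are fixed points of the permutation.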


International Conference on Information and Communication Technology | 2012

Case studies of multi-core energy efficiency in task based programs

Hallgeir Lien; Lasse Natvig; Abdullah Al Hasib; Jan Christian Meyer

In this paper, we present three performance and energy case studies of benchmark applications in the OmpSs environment for task-based programming. Different parallel and vectorized implementations are evaluated on an Intel® Core™ i7-2600 quad-core processor. Using FLOPS/W derived from chip MSR registers, we find AVX code to be clearly the most energy efficient in general. The peak on-chip GFLOPS/W rates are: Black-Scholes (BS) 0.89, FFTW 1.38, and Matrix Multiply (MM) 1.97. Experiments cover variable degrees of thread parallelism and different OmpSs task pool scheduling policies. We find that maximum energy efficiency for small and medium-sized problems is obtained by limiting the number of parallel threads. Comparison of AVX variants with non-vectorized code shows ≈6–7× (BS) and ≈3–5× (FFTW) improvements in on-chip energy efficiency, depending on the problem size and degree of multithreading.
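The FLOPS/W metric used above is plain arithmetic: floating-point operations performed divided by energy consumed (power times time). A minimal sketch, where the helper name and the example inputs are ours, for illustration:

```python
def gflops_per_watt(flop_count, elapsed_s, avg_power_w):
    # FLOPS/W equals FLOPs per joule: divide the work done by the
    # energy consumed (average power x elapsed time), in giga-FLOPs.
    energy_j = avg_power_w * elapsed_s
    return flop_count / energy_j / 1e9

# Hypothetical run: 10^12 FLOPs in 10 s at an average of 50 W.
print(gflops_per_watt(1e12, 10.0, 50.0))  # -> 2.0 GFLOPS/W
```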


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010

Performance modeling of heterogeneous systems

Jan Christian Meyer; Anne C. Elster

Predicting how well applications may run on modern systems is becoming increasingly challenging. It is no longer sufficient to look at the number of floating-point operations and communication costs; one also needs to model the underlying systems and how their topology, heterogeneity, system loads, etc., may impact performance. This work focuses on developing a practical model for heterogeneous computing by looking at the older BSP model, which attempts to model communication costs on homogeneous systems, and examines how its library implementations can be extended to include a run-time system that may be useful for heterogeneous systems. Our extensions of BSPlib with MPI and GASNet mechanisms at the communication layer should provide useful tools for evaluating how applications may run on heterogeneous systems.
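The BSP model referenced above prices a superstep as local work plus communication plus a barrier: T = w + g·h + l, where w is the maximum local computation, h the largest number of words any processor sends or receives, g the per-word communication gap, and l the synchronization latency. A minimal sketch of that standard cost formula (the function name and example values are ours, not from the paper):

```python
def bsp_superstep_cost(w, h, g, l):
    # Classic BSP superstep cost: local computation w, an h-relation
    # charged at g time units per word, plus barrier latency l.
    return w + g * h + l

# Hypothetical superstep: 100 units of work, a 10-word h-relation,
# g = 2 and l = 5 on some machine -> total cost 125.
print(bsp_superstep_cost(100, 10, 2, 5))  # -> 125
```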


International Parallel and Distributed Processing Symposium | 2014

A Study of Energy and Locality Effects Using Space-Filling Curves

Nico Reissmann; Jan Christian Meyer; Magnus Jahre

The cost of energy is becoming an increasingly important driver for the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality by applying Hilbert and Morton space-filling curves to dense matrix-matrix multiplication. The advantage of these curves is that they exhibit an inherent tiling effect without requiring specific architecture tuning. By accessing the matrices in the order determined by the space-filling curves, we can trade computation for locality. The index computation overhead of the Morton curve is found to be balanced against its locality and energy efficiency, while the overhead of the Hilbert curve outweighs its improvements on our test system.
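The Morton (Z-order) curve mentioned above can be computed by interleaving the bits of the row and column indices; sorting accesses by this index traverses the matrix in Z-shaped tiles. This sketch is illustrative (the function name is ours), not the paper's implementation:

```python
def morton_index(row, col, bits):
    # Interleave the bits of (row, col): the resulting linear order
    # visits the matrix in Z-shaped tiles, an inherent blocking that
    # needs no architecture-specific tile-size tuning.
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)      # col bits -> even positions
        z |= ((row >> b) & 1) << (2 * b + 1)  # row bits -> odd positions
    return z

# The top-left 2x2 block is visited first, in Z order:
# (0,0)->0, (0,1)->1, (1,0)->2, (1,1)->3
```

The Hilbert curve improves locality further but requires rotating the sub-quadrants at each recursion level, which is the extra index-computation overhead the study weighs against the Morton curve.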


ACM Transactions on Architecture and Code Optimization | 2015

Perfect Reconstructability of Control Flow from Demand Dependence Graphs

Helge Bahmann; Nico Reissmann; Magnus Jahre; Jan Christian Meyer

Demand-based dependence graphs (DDGs), such as the (Regionalized) Value State Dependence Graph ((R)VSDG), are intermediate representations (IRs) well suited for a wide range of program transformations. They explicitly model the flow of data and state, and only implicitly represent a restricted form of control flow. These features make DDGs especially suitable for automatic parallelization and vectorization, but cannot be leveraged by practical compilers without efficient construction and destruction algorithms. Construction algorithms remodel the arbitrarily complex control flow of a procedure to make it amenable to DDG representation, whereas destruction algorithms reestablish control flow for generating efficient object code. Existing literature presents solutions to both problems, but these impose structural constraints on the generatable control flow, and omit qualitative evaluation. The key contribution of this article is to show that there is no intrinsic structural limitation in the control flow directly extractable from RVSDGs. This fundamental result originates from an interpretation of loop repetition and decision predicates as computed continuations, leading to the introduction of the predicate continuation normal form. We provide an algorithm for constructing RVSDGs in predicate continuation form, and propose a novel destruction algorithm for RVSDGs in this form. Our destruction algorithm can generate arbitrarily complex control flow; we show this by proving that the original CFG an RVSDG was derived from can, apart from overspecific detail, be reconstructed perfectly. Additionally, we prove termination and correctness of these algorithms. Furthermore, we empirically evaluate the performance, the representational overhead at compile time, and the reduction in branch instructions compared to existing solutions. In contrast to previous work, our algorithms impose no additional overhead on the control flow of the produced object code. 
To our knowledge, this is the first scheme that allows the original control flow of a procedure to be recovered from a DDG representation.


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Optimized Barriers for Heterogeneous Systems Using MPI

Jan Christian Meyer; Anne C. Elster

The heterogeneous communication characteristics of clustered SMP systems create great potential for optimizations which favor physical locality. This paper describes a novel technique for automating such optimizations, applied to barrier operations. Portability poses a challenge when optimizing for locality, as costs are bound to variations in platform topology. This challenge is addressed through representing both platform structure and barrier algorithms as input data, and altering the algorithm based on benchmark results which can be easily obtained from a given platform. Our resulting optimization technique is empirically tested on two modern clusters, up to eight dual quad-core nodes on one, and up to ten dual hex-core nodes on another. Included test results show that the method captures performance advantages on both systems without any explicit customization, and produces specialized barriers of superior performance to a topology-neutral implementation.


Parallel Computing | 2006

A load balancing strategy for computations on large, read-only data sets

Jan Christian Meyer; Anne C. Elster

As data repositories grow larger, it becomes increasingly difficult to transmit a large volume of data and handle several simultaneous data requests. One solution is to use a cluster of workstations for data storage. The challenge, however, is to balance the system load, since these requests may appear and change continuously. In this paper, a new method for load balancing requests on such large data sets is developed. The motivation for our method is systems where large geological data sets are rendered in real-time by a homogeneous computational cluster. The goal is to expand this system to accommodate multiple simultaneous clients. Our method assumes that the large input sets may be examined in advance, and uses simple, continuous functions to approximate the discrete costs associated with each data element. Finally, we show that partitioning a data set using our method involves very little overhead.
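The partitioning idea above, approximating per-element costs in advance and splitting the data into chunks of roughly equal total cost, can be sketched with a single scan over the cost estimates. This is an illustrative greedy version under our own assumptions, not the paper's continuous-function method:

```python
def partition_by_cost(costs, k):
    # Split a read-only data set into k contiguous chunks of roughly
    # equal total cost, scanning the precomputed per-element costs once.
    target = sum(costs) / k
    bounds, acc = [], 0.0
    for i, c in enumerate(costs):
        acc += c
        if acc >= target and len(bounds) < k - 1:
            bounds.append(i + 1)  # exclusive end index of this chunk
            acc = 0.0
    return bounds

# One expensive element followed by cheap ones, two workers: the
# first chunk holds the expensive element alone.
print(partition_by_cost([3, 1, 1, 1], 2))  # -> [1]
```

Because the costs are computed from the input set ahead of time, repartitioning for a different number of clients needs only this cheap scan, which matches the paper's claim of very little partitioning overhead.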


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Energy-Efficient Sparse Matrix Autotuning with CSX -- A Trade-off Study

Jan Christian Meyer; Juan M. Cebrian; Lasse Natvig; Vasileios Karakasis; Dimitris Siakavaras; Konstantinos Nikas

In this paper, we apply a method for extracting a running power estimate of applications from hardware performance counters, producing power/time curves which can be integrated over particular intervals to estimate the energy consumption of individual application stages. We use this method to instrument executions of a conjugate gradient solver, to examine the energy and performance impacts of applying the Compressed Sparse eXtended (CSX) and classic Compressed Sparse Row (CSR) matrix compression methods to sparse linear systems from different application areas. The CSX format requires a preprocessing stage which identifies and exploits a range of matrix substructures, incurring a one-time cost which can facilitate more effective sparse matrix-vector multiplication (SpMV). As this numerical kernel is the primary performance bottleneck of conjugate gradient solvers, we take the approach of isolating the energy cost of preprocessing from a short sample of application iterations, obtaining measurements which inform the choice of which compression scheme is more appropriate to the input data. We examine the impact that variable degrees of parallelism, processor clock frequency, and Hyper-Threading have on this trade-off. Our results include comparisons of empirically obtained results from all combinations of up to 8 threads on 4 hyper-threaded cores, 3 clock frequencies, and 5 sample application matrices. We assess program-hardware interactions with respect to structural properties of the data and hardware architectural features, and evaluate the approach with respect to integrating the energy instrumentation with present automatic performance tuning. Results show that our method is sufficiently precise to identify non-trivial trade-offs in the parameter space, and may become suitable for a run-time automatic tuning scheme by applying a faster preprocessing mode of CSX.
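The CSR baseline mentioned above stores a sparse matrix as three arrays: the nonzero values, their column indices, and per-row offsets into both. A minimal sketch of the SpMV kernel over that layout (function name and example data are ours, for illustration; CSX adds substructure detection on top of this basic scheme):

```python
def csr_spmv(values, col_idx, row_ptr, x):
    # y = A @ x with A in Compressed Sparse Row form: row i's nonzeros
    # live in values[row_ptr[i]:row_ptr[i+1]], with matching column
    # indices in col_idx, so only stored entries are multiplied.
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x2 diagonal matrix [[1, 0], [0, 2]] times [3, 4]:
print(csr_spmv([1.0, 2.0], [0, 1], [0, 1, 2], [3.0, 4.0]))  # -> [3.0, 8.0]
```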


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Automatic Run-time Parallelization and Transformation of I/O

Thorvald Natvig; Anne C. Elster; Jan Christian Meyer

As the size of computational clusters grows, one can expect that I/O will consume an increasing portion of wall-clock time as the problem and node sizes are scaled up, unless parallel I/O is introduced. Unfortunately, using parallel I/O is non-trivial, so few applications developed by individual researchers enjoy its benefits. In this paper, we describe our novel method for analyzing I/O and communication operations at run-time. When nodes perform I/O or communication operations, our technique protects the memory associated with the requests from the application. Subsequent operations are analyzed for overlap between communication and I/O operations. When found, the I/O operation is automatically transformed, by our injected library, from an individual operation to a collective and shared MPI I/O operation. This allows users to benefit from parallel file systems without redesigning or recompiling their applications, and we demonstrate speedup for common usage patterns.

Collaboration


Dive into Jan Christian Meyer's collaborations.

Top Co-Authors

Lasse Natvig, Norwegian University of Science and Technology
Anne C. Elster, Norwegian University of Science and Technology
Benjamin Andreassen Bjørnseth, Norwegian University of Science and Technology
Juan M. Cebrian, Norwegian University of Science and Technology
Magnus Jahre, Norwegian University of Science and Technology
Nico Reissmann, Norwegian University of Science and Technology
Abdullah Al Hasib, Norwegian University of Science and Technology
Hallgeir Lien, Norwegian University of Science and Technology