Publication


Featured research published by Jan Lucas.


International Symposium on Performance Analysis of Systems and Software | 2013

How a single chip causes massive power bills GPUSimPow: A GPGPU power simulator

Jan Lucas; Sohan Lal; Michael Andersch; Mauricio Alvarez-Mesa; Ben H. H. Juurlink

Modern GPUs are true powerhouses in every sense of the word: while they offer general-purpose (GPGPU) compute performance an order of magnitude higher than that of conventional CPUs, they have also been rapidly approaching the infamous “power wall”, as a single chip sometimes consumes more than 300 W. Thus, the design space of GPGPU microarchitecture has been extended by another dimension: power. While GPU researchers have previously relied on cycle-accurate simulators for estimating performance during design cycles, there are no simulation tools that include power as well. To mitigate this issue, we introduce the GPUSimPow power estimation framework for GPGPUs, consisting of both analytical and empirical models for regular and irregular hardware components. To validate this framework, we build a custom measurement setup to obtain power numbers from real graphics cards. An evaluation on a set of well-known benchmarks reveals an average relative error between simulated and measured power of 11.7% for the GT240 and 10.8% for the GTX580. The simulator has been made available to the public [1].
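For context, the accuracy figures quoted above are average relative errors between simulated and measured power. The following minimal sketch shows how such a number can be computed over a benchmark suite; the benchmark names and wattages are purely illustrative placeholders, not measurements from the paper.

    # Sketch: average relative error between simulated and measured power.
    # Benchmark names and Watt values are illustrative placeholders only.

    measured = {"bench_a": 61.0, "bench_b": 48.5, "bench_c": 72.3}   # Watts on real hardware
    simulated = {"bench_a": 55.2, "bench_b": 53.0, "bench_c": 78.9}  # Watts from the simulator

    def avg_relative_error(measured, simulated):
        errors = [abs(simulated[b] - measured[b]) / measured[b] for b in measured]
        return sum(errors) / len(errors)

    print(f"average relative error: {avg_relative_error(measured, simulated):.1%}")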


Signal Processing Systems | 2013

Parallel HEVC Decoding on Multi- and Many-core Architectures

Chi Ching Chi; Mauricio Alvarez-Mesa; Jan Lucas; Ben H. H. Juurlink; Thomas Schierl

The Joint Collaborative Team on Video Coding is developing a new standard named High Efficiency Video Coding (HEVC) that aims to reduce the bitrate of H.264/AVC by another 50%. In order to fulfill the computational demands of the new standard, in particular for high resolutions and at low power budgets, exploiting parallelism is no longer an option but a requirement. Therefore, HEVC includes several coding tools that allow each picture to be divided into several partitions that can be processed in parallel, without degrading quality or bitrate. In this paper we adapt one of these approaches, Wavefront Parallel Processing (WPP), and show how it can be implemented on multi- and many-core processors. Our approach, named Overlapped Wavefront (OWF), processes several partitions as well as several pictures in parallel. This has the advantage that the amount of (thread-level) parallelism stays constant during execution. In addition, performance and power results are provided for three platforms: a server Intel CPU with 8 cores, a laptop Intel CPU with 4 cores, and a TILE-Gx36 from Tilera with 36 cores. The results show that our parallel HEVC decoder is capable of achieving an average frame rate of 116 fps for 4K resolution on a standard multicore CPU. The results also demonstrate that exploiting more parallelism by increasing the number of cores can substantially improve energy efficiency measured in Joules per frame.
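As a rough illustration of the wavefront idea (not the paper's OWF implementation): in HEVC WPP, a coding tree unit (CTU) can be processed once its left and top-right neighbors are done, which lets successive CTU rows proceed in parallel with a two-CTU horizontal offset. The sketch below computes the earliest parallel "time step" of each CTU under those dependencies; the grid size is an arbitrary example.

    # Sketch: earliest decode step of each CTU under HEVC WPP-style dependencies.
    # CTU (r, c) waits for its left (r, c-1), top (r-1, c), and top-right (r-1, c+1)
    # neighbors. Grid dimensions are arbitrary illustration values.

    ROWS, COLS = 4, 8

    def wavefront_schedule(rows, cols):
        step = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                left = step[r][c - 1] if c > 0 else -1
                top = step[r - 1][c] if r > 0 else -1
                top_right = step[r - 1][c + 1] if r > 0 and c + 1 < cols else -1
                step[r][c] = max(left, top, top_right) + 1
        return step

    for row in wavefront_schedule(ROWS, COLS):
        print(row)   # each row starts two steps after the row above it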


Modeling, Analysis and Simulation of Computer and Telecommunication Systems | 2016

ALUPower: Data Dependent Power Consumption in GPUs

Jan Lucas; Ben H. H. Juurlink

Existing architectural power models for GPUs count activities such as executing floating-point or integer instructions, but do not consider the data values processed. While data-value-dependent power consumption can often be neglected when performing architectural simulations of high-performance out-of-order (OoO) CPUs, we show that this approach is invalid for estimating the power consumption of GPUs. The throughput processing approach of GPUs reduces the amount of control logic and shifts the area and power budget towards functional units and register files. This makes accurate estimation of the power consumption of functional units even more crucial than in OoO CPUs. Using measurements from actual GPUs, we show that the processed data values influence the energy consumption of GPUs significantly. For example, the power consumption of one kernel varies between 155 W and 257 W depending on the processed values. Existing architectural simulators are not able to model the influence of the data values on power consumption. RTL and gate-level simulators usually consider data values in their power estimates but require detailed modeling of the employed units and are extremely slow. We first describe how the power consumption of GPU functional units can be measured and characterized using microbenchmarks. Then measurement results are presented and several opportunities for energy reduction by software developers or compilers are described. Finally, we demonstrate a simple and fast power macro model that estimates the power consumption of functional units and provides a significant improvement in accuracy compared to previously used constant energy-per-instruction models.
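The paper's macro model itself is not reproduced here. As a hedged illustration of the general idea that energy per operation depends on the processed data, the sketch below estimates per-instruction energy as a base cost plus a term proportional to the operand bit toggles relative to the previously processed values; the coefficients are made-up placeholders.

    # Sketch: a data-dependent energy estimate for a functional unit.
    # E = E_BASE + ALPHA * (bit toggles in the operands vs. the previous operands).
    # E_BASE and ALPHA are illustrative placeholders, not values from the paper.

    E_BASE = 5.0   # picojoules per operation (placeholder)
    ALPHA = 0.2    # picojoules per toggled operand bit (placeholder)

    def toggles(prev, cur, width=32):
        return bin((prev ^ cur) & ((1 << width) - 1)).count("1")

    def op_energy(prev_ops, cur_ops):
        t = sum(toggles(p, c) for p, c in zip(prev_ops, cur_ops))
        return E_BASE + ALPHA * t

    # All-zero operands toggle nothing; random-looking operands toggle many bits.
    print(op_energy((0x00000000, 0x00000000), (0x00000000, 0x00000000)))  # low energy
    print(op_energy((0x00000000, 0x00000000), (0xDEADBEEF, 0x12345678)))  # higher energy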


International Symposium on Performance Analysis of Systems and Software | 2015

On latency in GPU throughput microarchitectures

Michael Andersch; Jan Lucas; Mauricio Alvarez-Mesa; Ben H. H. Juurlink

Modern GPUs provide massive processing power (arithmetic throughput) as well as memory throughput. Presently, while it appears to be well understood how performance can be improved by increasing throughput, it is less clear what the effects of microarchitectural latencies are on the performance of throughput-oriented GPU architectures. In fact, little is publicly known about the values, behavior, and performance impact of microarchitectural latency components in modern GPUs. This work attempts to fill that gap by analyzing both the idle (static) and loaded (dynamic) latency behavior of GPU microarchitectural components. Our results show that GPUs are not as effective at latency hiding as commonly thought, and based on that we argue that latency should be a GPU design consideration alongside throughput.
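One way to see why loaded latency matters for a throughput architecture: by Little's law, the concurrency needed to sustain a given throughput grows linearly with latency, so latency can only be hidden while enough independent warps or memory requests are in flight. A back-of-the-envelope sketch follows; the bandwidth, latency, and request-size numbers are illustrative, not measurements from the paper.

    # Sketch: Little's law, concurrency = throughput * latency.
    # How many outstanding memory requests are needed to saturate bandwidth?
    # Bandwidth, latency, and request size below are illustrative placeholders.

    bandwidth_gbps = 192.0   # GB/s of DRAM bandwidth (placeholder)
    latency_ns = 400.0       # loaded memory latency in nanoseconds (placeholder)
    request_bytes = 128      # bytes per memory transaction (placeholder)

    bytes_in_flight = bandwidth_gbps * latency_ns        # GB/s * ns = bytes
    requests_in_flight = bytes_in_flight / request_bytes

    print(f"bytes in flight needed:    {bytes_in_flight:.0f}")
    print(f"requests in flight needed: {requests_in_flight:.0f}")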


ACM Transactions on Architecture and Code Optimization | 2015

Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency

Jan Lucas; Michael Andersch; Mauricio Alvarez-Mesa; Ben H. H. Juurlink

Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it had not been evaluated. We present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal and spatial SIMT, named Spatiotemporal SIMT (STSIMT). Simulations show that TSIMT alone results in a performance reduction, but a combination of scalarization and STSIMT yields a mean performance improvement of 19.6% and improves the energy-delay product by 26.2% compared to SIMT.
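A rough sketch of the distinction (not the paper's microarchitecture): spatial SIMT executes all lanes of a warp in the same cycle on parallel ALUs, temporal SIMT streams the lanes of a warp through a single ALU over successive cycles, and scalarization detects that all active lanes would compute the same value and issues the operation only once. The warp width and instruction stream below are illustrative placeholders.

    # Sketch: issue counts for spatial SIMT, temporal SIMT, and temporal SIMT with
    # scalarization. Each "instruction" is (per-lane operand values, active mask).

    WARP = 8

    instrs = [
        ([3] * WARP,        [True] * WARP),                                     # uniform values -> scalarizable
        (list(range(WARP)), [True] * WARP),                                     # divergent values
        ([7] * WARP,        [True, True, False, False, True, True, False, False]),
    ]

    spatial_issues = len(instrs)                                # one wide issue per instruction
    temporal_issues = sum(sum(mask) for _, mask in instrs)      # one active lane per cycle
    scalarized_issues = sum(
        1 if len({v for v, m in zip(vals, mask) if m}) == 1 else sum(mask)
        for vals, mask in instrs
    )

    print("spatial SIMT issues:          ", spatial_issues)
    print("temporal SIMT issues:         ", temporal_issues)
    print("temporal SIMT + scalarization:", scalarized_issues)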


International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2014

GPGPU workload characteristics and performance analysis

Sohan Lal; Jan Lucas; Michael Andersch; Mauricio Alvarez-Mesa; Ahmed Elhossini; Ben H. H. Juurlink

GPUs are much more power-efficient than CPUs, but due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. To sustain and continue high-performance computing growth, new architectural and application techniques are required to create power-efficient computing systems. To find such techniques, however, it is necessary to study power consumption at a detailed level and understand the bottlenecks that cause low performance. Therefore, in this paper, we study GPU power consumption at the component level and investigate the bottlenecks that cause low performance and low energy efficiency. We divide the low-performance kernels into low-occupancy and full-occupancy categories. For the low-occupancy category, we study whether increasing occupancy helps to increase performance and energy efficiency. For the full-occupancy category, we investigate whether these kernels are limited by memory bandwidth, coalescing efficiency, or SIMD utilization.
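As a hedged sketch of this kind of categorization (the metric names, thresholds, and kernel data below are hypothetical, not the paper's methodology), a kernel can be bucketed first by occupancy and then, for full-occupancy kernels, by whichever resource appears to limit it:

    # Sketch: bucket kernels by occupancy, then by a likely bottleneck.
    # Metric names, thresholds, and the example kernels are hypothetical placeholders.

    def classify(k):
        if k["occupancy"] < 0.5:
            return "low occupancy"
        if k["dram_bw_util"] > 0.8:
            return "full occupancy, memory bandwidth bound"
        if k["coalescing_eff"] < 0.5:
            return "full occupancy, poor coalescing"
        if k["simd_util"] < 0.5:
            return "full occupancy, low SIMD utilization"
        return "full occupancy, no dominant bottleneck"

    kernels = {
        "kernel_a": {"occupancy": 0.25, "dram_bw_util": 0.30, "coalescing_eff": 0.90, "simd_util": 0.95},
        "kernel_b": {"occupancy": 0.95, "dram_bw_util": 0.85, "coalescing_eff": 0.80, "simd_util": 0.90},
        "kernel_c": {"occupancy": 0.90, "dram_bw_util": 0.40, "coalescing_eff": 0.35, "simd_util": 0.85},
    }

    for name, metrics in kernels.items():
        print(name, "->", classify(metrics))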


Software and Compilers for Embedded Systems | 2017

The LPGPU2 Project: Low-Power Parallel Computing on GPUs: Extended Abstract

Ben H. H. Juurlink; Jan Lucas; Nadjib Mammeri; Martyn Bliss; Georgios Keramidas; Chrysa Kokkala; Andrew Richards

The LPGPU2 project is a 30-month Innovation Action project funded by the European Union. Its overall goal is to develop an analysis and visualization framework that enables GPU application developers to improve the performance and power consumption of their applications. To achieve this overall goal, several key objectives need to be met. First, several applications (use cases) need to be developed for or ported to low-power GPUs. Thereafter, these applications need to be optimized using the tooling framework. In addition, power measurement devices and power models need to be developed that are 10x more accurate than the state of the art. The project consortium actively promotes open vendor-neutral standards via the Khronos Group. This paper briefly reports on the achievements made in the first half of the project, and focuses on the progress made in applications; in power measurement, estimation, and modelling; and in the analysis and visualization tool suite.


Conference on Design and Architectures for Signal and Image Processing | 2017

Enabling GPU software developers to optimize their applications — The LPGPU2 approach

Ben H. H. Juurlink; Jan Lucas; Nadjib Mammeri; Georgios Keramidas; Katerina Pontzolkova; Ignacio Aransay; Chrysa Kokkala; Martyn Bliss; Andrew Richards

Low-power GPUs have become ubiquitous; they can be found in domains ranging from wearable and mobile computing to automotive systems. With this ubiquity has come a wider range of applications exploiting low-power GPUs, placing ever increasing demands on the expected performance and power efficiency of the devices. The LPGPU2 project is an EU-funded, 30-month Innovation Action that aims to develop an analysis and visualization framework that enables GPU application developers to improve the performance and power consumption of their applications. To this end, the project follows a holistic approach. First, several applications (use cases) are being developed for or ported to low-power GPUs. These applications will be optimized using the tooling framework in the last phase of the project. In addition, power measurement devices and power models are devised that are 10× more accurate than the state of the art. The ultimate goal of the project is to promote open vendor-neutral standards via the Khronos Group. This paper briefly reports on the achievements made in the first phase of the project (up to month 18) and focuses on the progress made in applications; in power measurement, estimation, and modelling; and in the analysis and visualization tool suite.


Archive | 2013

How a single chip causes massive power bills

Jan Lucas; Sohan Lal; Michael Andersch; Mauricio Alvarez-Mesa; Ben H. H. Juurlink


Design, Automation, and Test in Europe | 2018

Optimal DC/AC data bus inversion coding

Jan Lucas; Sohan Lal; Ben H. H. Juurlink
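No abstract is listed for this entry. As background on the technique named in the title (and not the paper's optimal coding scheme), classic data bus inversion transmits either a data word or its complement together with one DBI flag bit: DC-DBI inverts when more than half of the bits are ones, reducing the number of driven ones, while AC-DBI inverts when more than half of the bits would toggle relative to the previous transfer, reducing switching activity. A minimal sketch:

    # Sketch: classic DC and AC data bus inversion for an 8-bit bus.
    # This illustrates conventional DBI, not the optimal coding proposed in the paper.

    WIDTH = 8
    MASK = (1 << WIDTH) - 1

    def popcount(x):
        return bin(x).count("1")

    def dbi_dc(word):
        """Invert if more than half the bits are ones (minimizes driven ones)."""
        if popcount(word) > WIDTH // 2:
            return (~word) & MASK, 1
        return word, 0

    def dbi_ac(word, prev_on_bus):
        """Invert if more than half the bits would toggle (minimizes transitions)."""
        if popcount(word ^ prev_on_bus) > WIDTH // 2:
            return (~word) & MASK, 1
        return word, 0

    print(dbi_dc(0b11110111))              # mostly ones -> inverted, flag set
    print(dbi_ac(0b11110000, 0b00001111))  # many toggles vs. previous -> inverted, flag set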

Collaboration


Dive into Jan Lucas's collaborations.

Top Co-Authors

Ben H. H. Juurlink (Technical University of Berlin)

Mauricio Alvarez-Mesa (Technical University of Berlin)

Sohan Lal (Technical University of Berlin)

Michael Andersch (Technical University of Berlin)

Ahmed Elhossini (Technical University of Berlin)

Nadjib Mammeri (Technical University of Berlin)

Chi Ching Chi (Technical University of Berlin)