Milosz Ciznicki
Poznań University of Technology
Publications
Featured research published by Milosz Ciznicki.
Concurrency and Computation: Practice and Experience | 2015
Krzysztof Rojek; Milosz Ciznicki; Bogdan Rosa; Michal Kulczewski; Krzysztof Kurowski; Zbigniew P. Piotrowski; Lukasz Szustak; Damian Karol Wójcik; Roman Wyrzykowski
The goal of this study is to adapt the multiscale fluid solver EULerian or LAGrangian framework (EULAG) to future graphics processing unit (GPU) platforms. The EULAG model has a proven record of successful applications, and excellent efficiency and scalability on conventional supercomputer architectures. Currently, the model is being implemented as the new dynamical core of the COSMO weather prediction framework. Within this study, two main modules of EULAG, namely the multidimensional positive definite advection transport algorithm (MPDATA) and the variational elliptic pressure solver based on the Generalized Conjugate Residual (GCR) scheme, are analyzed and optimized. In this paper, a method is proposed which ensures a comprehensive analysis of resource consumption, including registers and shared and global memories. This method allows us to identify bottlenecks of the algorithm, including data transfers between host and global memory, between global and shared memories, as well as GPU occupancy. We put emphasis on providing a fixed memory access pattern, padding, and the organization of computation in the MPDATA algorithm. The testing and validation of the new GPU implementation have been carried out by modeling decaying turbulence of a homogeneous incompressible fluid in a triply-periodic cube. Simulations performed using the standard version of EULAG and its new GPU implementation give similar solutions. Preliminary results show a promising increase in computational efficiency.
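The paper does not reproduce its kernels here, but the access-pattern idea can be illustrated. The following is a minimal CUDA sketch, not the authors' code, of a padded shared-memory tile with a fixed, coalesced access pattern, applied to a simplified 1-D donor-cell (upwind) flux update of the kind MPDATA performs along each direction; all names (advect_x, TILE, PAD) are hypothetical, and u is assumed to hold nx+1 face Courant numbers (velocity times dt/dx).

```cuda
#include <cuda_runtime.h>

#define TILE 256  // threads per block; one cell per thread
#define PAD  1    // halo cells on each side of the shared-memory tile

__global__ void advect_x(const float* __restrict__ psi,
                         const float* __restrict__ u,   // nx+1 face values
                         float* __restrict__ psi_out, int nx)
{
    __shared__ float s[TILE + 2 * PAD];       // padded tile
    int g = blockIdx.x * TILE + threadIdx.x;  // global cell index
    int l = threadIdx.x + PAD;                // local (padded) index

    if (g < nx) s[l] = psi[g];                // coalesced main load
    if (threadIdx.x < PAD) {                  // edge threads fetch the halos
        s[l - PAD]  = (g >= PAD) ? psi[g - PAD] : psi[0];
        int r = g + TILE;
        s[l + TILE] = (r < nx) ? psi[r] : psi[nx - 1];
    }
    __syncthreads();

    if (g > 0 && g < nx - 1) {
        // donor-cell fluxes through the left and right cell faces
        float fl = fmaxf(u[g],     0.f) * s[l - 1] + fminf(u[g],     0.f) * s[l];
        float fr = fmaxf(u[g + 1], 0.f) * s[l]     + fminf(u[g + 1], 0.f) * s[l + 1];
        psi_out[g] = s[l] - (fr - fl);
    }
}
```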
Scientific Programming | 2013
Marek Blazewicz; Ian Hinder; David M. Koppelman; Steven R. Brandt; Milosz Ciznicki; Michal Kierzynka; Frank Löffler; Jian Tao
Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high-performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the efficient use of large-scale CPU/GPU systems for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
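As an illustration of the kind of code such a framework emits, here is a hedged sketch, not actual Chemora output, of a fourth-order centered finite-difference first derivative of the sort its discretization module would generate; the names (deriv4_x, inv12dx) are invented for this example.

```cuda
// d/dx f ≈ (-f[i+2] + 8 f[i+1] - 8 f[i-1] + f[i-2]) / (12 dx)
__global__ void deriv4_x(const double* __restrict__ f,
                         double* __restrict__ df,
                         int nx, double inv12dx)   // inv12dx = 1 / (12 dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 2 && i < nx - 2)                      // skip the 2-cell boundary
        df[i] = (-f[i + 2] + 8.0 * f[i + 1]
                 - 8.0 * f[i - 1] + f[i - 2]) * inv12dx;
}
```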
Journal of Applied Remote Sensing | 2012
Milosz Ciznicki; Krzysztof Kurowski; Antonio Plaza
Hyperspectral image compression has received considerable interest in recent years due to the enormous data volumes collected by imaging spectrometers for Earth Observation. JPEG2000 is an important technique for data compression, which has been successfully used in the context of hyperspectral image compression, in both lossless and lossy fashion. Due to the increasing spatial, spectral, and temporal resolution of remotely sensed hyperspectral data sets, fast (on-board) compression of hyperspectral data is becoming an important and challenging objective, with the potential to reduce the limitations in the downlink connection between the Earth Observation platform and the receiving ground stations on Earth. For this purpose, implementations of hyperspectral image compression algorithms on specialized hardware devices are currently being investigated. We have developed an implementation of the JPEG2000 compression standard on commodity graphics processing units (GPUs). These hardware accelerators are characterized by their low cost and weight and can bridge the gap toward on-board processing of remotely sensed hyperspectral data. Specifically, we develop GPU implementations of the lossless and lossy modes of JPEG2000. For the lossy mode, we investigate the utility of the compressed hyperspectral images at different compression ratios, using spectral unmixing, a standard technique for hyperspectral data exploitation. Our study reveals that GPUs represent a source of computational power that is both accessible and applicable to obtaining compression results in valid response times in information extraction applications from remotely sensed hyperspectral imagery.
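The lossless mode rests on the reversible CDF 5/3 integer wavelet filter, which is a natural fit for one-row-per-block GPU parallelism. Below is a minimal sketch of its two lifting steps, assuming an even row length that fits in shared memory; dwt53_row is a hypothetical name, and deinterleaving the result into subbands is omitted.

```cuda
__global__ void dwt53_row(int* img, int width, int pitch)
{
    extern __shared__ int x[];       // one full row, interleaved samples
    int y = blockIdx.x;              // one block per image row
    for (int i = threadIdx.x; i < width; i += blockDim.x)
        x[i] = img[y * pitch + i];
    __syncthreads();

    int n = width / 2;
    // predict: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2)
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int l = x[2 * i];
        int r = (2 * i + 2 < width) ? x[2 * i + 2] : x[2 * i];  // symmetric ext.
        x[2 * i + 1] -= (l + r) >> 1;
    }
    __syncthreads();
    // update: s[i] = x[2i] + floor((d[i-1] + d[i] + 2) / 4)
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int dl = (i > 0) ? x[2 * i - 1] : x[1];                 // symmetric ext.
        x[2 * i] += (dl + x[2 * i + 1] + 2) >> 2;
    }
    __syncthreads();
    for (int i = threadIdx.x; i < width; i += blockDim.x)
        img[y * pitch + i] = x[i];   // write back (still interleaved)
}
```

Because every step adds an integer floored term, the inverse transform can subtract it exactly, which is what makes this mode lossless.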
International Conference on Parallel Processing | 2013
Milosz Ciznicki; Michal Kulczewski; Krzysztof Kurowski; Pawel Gepner
The recent advent of novel multi- and many-core architectures forces application programmers to deal with hardware-specific implementation details and to be familiar with software optimisation techniques to benefit from new high-performance computing machines. Extra care must be taken for communication-intensive algorithms, which may be a bottleneck for the forthcoming era of exascale computing. This paper presents a performance evaluation of the preliminary adaptation of the EULAG code to hybrid MPI+OpenMP parallelisation schemes. Various techniques are discussed, and the results guide us toward efficient algorithms and methods for scaling the communication-intensive elliptic solver with its preconditioner, with support for GPU architectures to follow.
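The scheme pairs MPI halo exchange between subdomains with OpenMP threading inside each of them, overlapping communication with interior work. A minimal sketch of that pattern in host-side C++, using a 1-D Jacobi-style relaxation as a stand-in for the solver; relax_step and the array layout are illustrative, not the EULAG code.

```cuda
#include <mpi.h>
#include <omp.h>
#include <vector>

// u has n interior cells plus one halo cell at each end.
void relax_step(std::vector<double>& u, std::vector<double>& un,
                int rank, int nprocs)
{
    int n = (int)u.size() - 2;
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request req[4];  // non-blocking halo exchange with both neighbours
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    // overlap: OpenMP threads update the deep interior while halos travel
    #pragma omp parallel for
    for (int i = 2; i <= n - 1; ++i)
        un[i] = 0.5 * (u[i - 1] + u[i + 1]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    un[1] = 0.5 * (u[0] + u[2]);             // the cells that needed the halos
    un[n] = 0.5 * (u[n - 1] + u[n + 1]);
}
```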
Parallel Processing and Applied Mathematics | 2011
Milosz Ciznicki; Michal Kierzynka; Krzysztof Kurowski; Bogdan Ludwiczak; Krystyna Napierala; Jarosław Palczyński
Algorithms for isosurface extraction have become crucial in the petroleum industry, medicine and many other fields in recent years. Today's market demands call for methods that not only construct accurate 3D models but also handle the problem efficiently. Recently, a few highly optimized approaches taking advantage of modern graphics processing units (GPUs) have been published in the literature. However, despite their satisfactory speed, they may all be unsuitable in real-life applications due to limits on the maximum domain size they can process. In this paper we present a novel approach to surface extraction that combines the Marching Tetrahedra algorithm with the idea of Histogram Pyramids. Our GPU-based application can process CT and MRI scan data. Thanks to domain decomposition, the only limiting factor on the size of the input instance is the amount of memory needed to store the resulting model. The solution is also immensely fast, achieving up to a 107-fold speedup compared to a serial CPU code. Moreover, support for multiple GPUs makes it highly scalable. The tool enables the user to visualize the generated model and modify it interactively.
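The Histogram Pyramid is what makes the output compaction efficient: the base level records how many triangles each tetrahedron will emit, each upper level sums pairs of the level below, and a top-down traversal maps every output triangle back to its source cell without atomic operations. A hedged sketch of both pieces, with hypothetical names (build_level, traverse):

```cuda
// Build one pyramid level: a 2:1 reduction of the level below.
__global__ void build_level(const unsigned* lower, unsigned* upper, int n_upper)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_upper)
        upper[i] = lower[2 * i] + lower[2 * i + 1];
}

// Map output item k to its base-level element. levels[0] is the base;
// levels[n_levels - 1] is the single-element root holding the total.
__device__ int traverse(const unsigned* const* levels, int n_levels, unsigned k)
{
    int idx = 0;
    for (int lvl = n_levels - 2; lvl >= 0; --lvl) {
        idx *= 2;                               // descend to the left child
        unsigned left = levels[lvl][idx];
        if (k >= left) { k -= left; ++idx; }    // or to the right child
    }
    return idx;  // k is now the local offset within that element
}
```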
IEEE International Conference on High Performance Computing Data and Analytics | 2011
Milosz Ciznicki; Krzysztof Kurowski; Antonio Plaza
Hyperspectral image compression has received considerable interest in recent years due to the enormous data volumes collected by imaging spectrometers for Earth Observation. JPEG2000 is an important technique for data compression which has been successfully used in the context of hyperspectral image compression, in both lossless and lossy fashion. Due to the increasing spatial, spectral and temporal resolution of remotely sensed hyperspectral data sets, fast (on-board) compression of hyperspectral data is becoming a very important and challenging objective, with the potential to reduce the limitations in the downlink connection between the Earth Observation platform and the receiving ground stations on Earth. For this purpose, implementations of hyperspectral image compression algorithms on specialized hardware devices are currently being investigated. In this paper, we develop an implementation of the JPEG2000 compression standard on commodity graphics processing units (GPUs). These hardware accelerators are characterized by their low cost and weight, and can bridge the gap towards on-board processing of remotely sensed hyperspectral data. Specifically, we develop GPU implementations of the lossless and lossy modes of JPEG2000. For the lossy mode, we investigate the utility of the compressed hyperspectral images at different compression ratios, using spectral unmixing, a standard technique for hyperspectral data exploitation. In all cases, we investigate the speedups that can be gained by using the GPU implementations relative to the serial implementations. Our study reveals that GPUs represent a source of computational power that is both accessible and applicable to obtaining compression results in valid response times in information extraction applications from remotely sensed hyperspectral imagery.
Journal of Computational Science | 2014
Milosz Ciznicki; Michal Kierzynka; Krzysztof Kurowski; Pawel Gepner
The use of graphics hardware for non-graphics applications has become popular among many scientific programmers and researchers, as it has shown a higher rate of theoretical performance growth than CPUs in recent years. However, performance gains may easily be lost in the context of a specific parallel application due to various hardware and software factors. JPEG 2000 is a complex standard for data compression and coding that provides many advanced capabilities demanded by more specialized applications. There are several JPEG 2000 implementations that utilize emerging parallel architectures with built-in support for parallelism at different levels. Unfortunately, many available implementations are optimized only for a certain parallel architecture, or do not take advantage of recent capabilities provided by modern hardware and low-level APIs. Thus, the main aim of this paper is to present a comprehensive real-world performance analysis of JPEG 2000. The standard consists of a chain of data- and compute-intensive tasks that can be treated as good examples of software benchmarks for modern parallel hardware architectures. In this paper, we compare the performance achieved by various JPEG 2000 implementations executed on selected architectures for different data sets, in order to identify possible bottlenecks. We also discuss best practices and advice for parallel software development, to help users evaluate in advance and select appropriate solutions for accelerating their applications.
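Such a comparison rests on careful stage-by-stage timing. The following is a minimal sketch of the kind of measurement harness involved, using CUDA events; benchmark_stage and run_stage are hypothetical placeholders for any stage of the compression chain.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

void benchmark_stage(void (*run_stage)(), size_t bytes, int reps)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    run_stage();                        // warm-up: JIT, caches, clocks
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        run_stage();
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);           // wait for the GPU to finish
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.3f ms/rep, %.2f GB/s\n", ms / reps,
           (double)bytes * reps / (ms * 1e-3) / 1e9);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
}
```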
International Conference on Conceptual Structures | 2012
Milosz Ciznicki; Michal Kierzynka; Krzysztof Kurowski; Pawel Gepner
The use of graphics hardware for non-graphics applications has become popular among many scientific programmers and researchers, as it has shown a higher rate of theoretical performance growth than CPUs in recent years. However, performance gains may easily be lost in the context of a specific parallel application due to various hardware and software factors. Consequently, software benchmarks and performance testing are still the best techniques for comparing the efficiency of emerging parallel architectures with built-in support for parallelism at different levels. Unfortunately, many available benchmarks are relatively simple application kernels, have been optimized only for a certain parallel architecture, or do not take advantage of recent capabilities provided by modern hardware and low-level APIs. Thus, the main aim of this paper is to present a comprehensive real-world performance analysis of selected applications following the complex standard for data compression and coding, JPEG 2000. The standard consists of a chain of data- and compute-intensive tasks that can be treated as good examples of software benchmarks for modern parallel hardware architectures. In this paper, we compare the performance achieved by our standard-based benchmarks executed on selected architectures for different data sets, in order to identify possible bottlenecks. We also discuss best practices and advice for parallel software development, to help users evaluate in advance and select appropriate solutions for accelerating their applications.
International Conference on Parallel Processing | 2011
Marek Blazewicz; Milosz Ciznicki; Krzysztof Kurowski; Paweł Lichocki
The Discrete Wavelet Transform (DWT) has gained momentum in signal processing and image compression over the last decade, bringing the concept up to the level of the new image coding standard JPEG2000. Thanks to the many added values of the DWT, in particular its inherent multi-resolution nature, wavelet coding schemes are suitable for various applications where scalability and tolerable degradation are relevant. Moreover, as we demonstrate in this paper, it can be used as a perfect benchmarking procedure for more sophisticated data compression and multimedia applications using General Purpose Graphical Processor Units (GPGPUs). Thus, in this paper we present and compare experiments performed on reference implementations of the DWT on the Cell Broadband Engine Architecture (Cell B.E.) and nVidia Graphical Processing Units (GPUs). The results clearly show that although both the GPU and the Cell B.E. are considered representatives of the same class of hybrid architecture devices, they differ greatly in the programming style and optimization techniques that need to be taken into account during development. In order to show the speedup, the parallel algorithm has been compared to a sequential computation performed on the x86 architecture.
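The multi-resolution nature mentioned above comes from recursing the transform on the low-pass quadrant. A hedged sketch of the host-side driver loop, reusing the hypothetical dwt53_row kernel from the JPEG2000 entry above and assuming a column-wise counterpart dwt53_col plus a deinterleaved subband layout (neither shown):

```cuda
// One 2-D DWT level = horizontal pass + vertical pass; then recurse
// on the LL (low-low) quadrant only, halving the working area.
void dwt_2d(int* d_img, int width, int height, int pitch, int levels)
{
    for (int l = 0; l < levels; ++l) {
        dwt53_row<<<height, 256, width  * sizeof(int)>>>(d_img, width,  pitch);
        dwt53_col<<<width,  256, height * sizeof(int)>>>(d_img, height, pitch);
        width  /= 2;   // the LL quadrant becomes the next level's input
        height /= 2;
    }
}
```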
PPAM (2) | 2016
Milosz Ciznicki; Michal Kulczewski; Krzysztof Kurowski
The recent advent of novel multi- and many-core architectures forces application programmers to deal with hardware-specific implementation details and to be familiar with software optimization techniques to benefit from new high-performance computing machines. Extra care must be taken for communication-intensive algorithms, which may be a bottleneck for the forthcoming era of exascale computing. This paper presents a high-level stencil framework implemented for the EULAG model that efficiently utilizes heterogeneous clusters. Only the efficient use of both CPUs and GPUs, combined with a flexible data decomposition method, can deliver maximum performance and scale the communication-intensive elliptic solver with its preconditioner.
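The flexible decomposition amounts to splitting each subdomain between the GPU and the CPU by a tunable fraction, so that both finish a sweep at roughly the same time. A minimal sketch under that assumption; hybrid_sweep, the one-point smoother, and gpu_frac (which a real framework would auto-tune from measured rates) are all hypothetical, and the halo exchange at the CPU/GPU interface is omitted.

```cuda
#include <cuda_runtime.h>
#include <omp.h>

// Hypothetical one-point smoother standing in for one solver sweep.
// Both device and host buffers hold the full domain of n cells.
__global__ void stencil_kernel(const float* in, float* out, int n_gpu)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n_gpu)                   // GPU share: cells [1, n_gpu)
        out[i] = 0.5f * (in[i - 1] + in[i + 1]);
}

void hybrid_sweep(const float* d_in, float* d_out,   // device copies
                  const float* h_in, float* h_out,   // host copies
                  int n, float gpu_frac)
{
    int n_gpu = (int)(n * gpu_frac);          // cells assigned to the GPU
    // the kernel launch returns immediately ...
    stencil_kernel<<<(n_gpu + 255) / 256, 256>>>(d_in, d_out, n_gpu);
    // ... so OpenMP threads process the CPU share concurrently
    #pragma omp parallel for
    for (int i = n_gpu; i < n - 1; ++i)
        h_out[i] = 0.5f * (h_in[i - 1] + h_in[i + 1]);
    cudaDeviceSynchronize();                  // both shares complete here
}
```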