Publication


Featured research published by Kamil Rocki.


cluster computing and the grid | 2012

Accelerating 2-opt and 3-opt Local Search Using GPU in the Travelling Salesman Problem

Kamil Rocki; Reiji Suda

In this paper we present high-performance GPU implementations of the 2-opt and 3-opt local search algorithms used to solve the Traveling Salesman Problem. The main idea behind these moves is to take a route that crosses over itself and reorder it so that it does not; this is a very important local search technique. GPU usage greatly decreases the time needed to find the best edges to swap in a route. Our results show that at least 90% of the time during Iterated Local Search is spent on the local search itself. We tested on 13 TSPLIB problem instances with sizes ranging from 100 to 4461 cities. By using our GPU algorithm, the time needed to find optimal swaps can be decreased approximately 3 to 26 times compared to parallel CPU code running on 32 cores. Additionally, we point out the memory bandwidth limitation of current parallel architectures: on multi-core systems, re-computing data is usually faster than reading it from memory, and we propose this approach as a solution.
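The 2-opt move the abstract describes (reversing a tour segment so the route no longer crosses itself) can be sketched in plain Python. This is an illustrative scalar version only, with hypothetical function names; the paper's actual contribution is the massively parallel GPU evaluation of candidate swaps:

```python
import math

def two_opt_swap(tour, i, k):
    """Reverse the segment tour[i:k+1], which removes a crossing
    between edges (i-1, i) and (k, k+1)."""
    return tour[:i] + tour[i:k + 1][::-1] + tour[k + 1:]

def tour_length(tour, dist):
    """Total length of the closed tour under distance matrix `dist`."""
    return sum(dist[tour[j]][tour[(j + 1) % len(tour)]]
               for j in range(len(tour)))

def two_opt(tour, dist):
    """Repeatedly apply improving 2-opt moves until none remains."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for k in range(i + 1, len(tour)):
                candidate = two_opt_swap(tour, i, k)
                if tour_length(candidate, dist) < tour_length(tour, dist):
                    tour, improved = candidate, True
    return tour
```

On the GPU, the inner double loop over (i, k) pairs is what gets evaluated in parallel, since each candidate swap can be scored independently.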


ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013

High Performance GPU Accelerated Local Optimization in TSP

Kamil Rocki; Reiji Suda

This paper presents a high-performance GPU-accelerated implementation of the 2-opt local search algorithm for the Traveling Salesman Problem (TSP). GPU usage significantly decreases the execution time needed for tour optimization; however, it also requires a complicated and well-tuned implementation. As the problem size grows, the time spent on local optimization comparing the graph edges grows significantly. According to our results on instances from the TSPLIB library, the time needed to perform a simple local search operation can be decreased approximately 5 to 45 times compared to a corresponding parallel CPU implementation using 6 cores. The code has been implemented in both OpenCL and CUDA and tested on AMD and NVIDIA devices. The experimental studies show that the optimization algorithm using GPU local search converges up to 300 times faster on average than the sequential CPU version, depending on the problem size. The main contributions of this paper are a problem division scheme exploiting data locality, which makes it possible to solve arbitrarily large problem instances on a GPU, and the parallel implementation of the algorithm itself.


parallel processing and applied mathematics | 2009

Parallel minimax tree searching on GPU

Kamil Rocki; Reiji Suda

The paper describes results for a minimax tree search algorithm implemented on the CUDA platform. The problem concerns move-choice strategy in the game of Reversi. The parallelization scheme and performance aspects are discussed, focusing mainly on the warp divergence problem and data transfer size. Moreover, a method of minimizing warp divergence and performance degradation is described. The paper contains results of tests performed on multiple CPUs and GPUs. Additionally, it discusses a parallel αβ (alpha-beta) pruning implementation.
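The serial baseline of the algorithm discussed, minimax with alpha-beta pruning, can be sketched as follows. The `moves`, `apply_move`, and `evaluate` callbacks are hypothetical placeholders for game-specific Reversi logic; the paper's contribution concerns parallelizing this search on CUDA, not this scalar form:

```python
def alphabeta(state, depth, alpha, beta, maximizing, moves, apply_move, evaluate):
    """Minimax with alpha-beta pruning over an abstract game.
    `moves(state)` lists legal moves, `apply_move(state, m)` returns the
    successor state, `evaluate(state)` scores a leaf from the
    maximizing player's point of view."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for m in legal:
            value = max(value, alphabeta(apply_move(state, m), depth - 1,
                                         alpha, beta, False,
                                         moves, apply_move, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:   # beta cut-off: opponent will avoid this branch
                break
        return value
    value = float("inf")
    for m in legal:
        value = min(value, alphabeta(apply_move(state, m), depth - 1,
                                     alpha, beta, True,
                                     moves, apply_move, evaluate))
        beta = min(beta, value)
        if alpha >= beta:       # alpha cut-off
            break
    return value
```

The cut-offs are exactly what makes GPU parallelization awkward: sibling subtrees share pruning bounds, so independent threads either duplicate work or must exchange bounds.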


irregular applications: architectures and algorithms | 2013

Register level sort algorithm on multi-core SIMD processors

Tian Xiaochen; Kamil Rocki; Reiji Suda

State-of-the-art hardware increasingly utilizes SIMD parallelism, where multiple processing elements execute the same instruction on multiple data points simultaneously. However, irregular and data-intensive algorithms are not well suited to such architectures. Due to their importance, it is crucial to obtain efficient implementations. One example of such a task is sorting, a fundamental problem in computer science. In this paper we analyze distinct memory access models and propose two methods that employ highly efficient bitonic merge sort using SIMD instructions as a register-level sort. We achieve nearly a 270x speedup (525M integers/s) on a 4M-integer set using a Xeon Phi coprocessor, where SIMD-level parallelism accelerates the algorithm more than 3 times. Our method can be applied to any device supporting similar SIMD instructions.
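The bitonic merge sort underlying the method can be sketched in scalar Python. Every compare-exchange in the network is data-independent, which is the property that lets it map onto SIMD registers; this sketch omits the SIMD mapping itself and assumes a power-of-two input length:

```python
def bitonic_sort(a, ascending=True):
    """Recursive bitonic sort; len(a) must be a power of two.
    Sorts one half up and the other down to form a bitonic sequence,
    then merges it."""
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    return _bitonic_merge(first + second, ascending)

def _bitonic_merge(a, ascending):
    """Merge a bitonic sequence into sorted order. The loop of
    compare-exchanges below is the part a SIMD register can do in one
    vectorized min/max step."""
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    a = list(a)
    for i in range(half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (_bitonic_merge(a[:half], ascending) +
            _bitonic_merge(a[half:], ascending))
```

Because the comparison pattern is fixed in advance and independent of the data, the same network runs branch-free on vector lanes, which is why bitonic sort suits SIMD better than data-dependent sorts like quicksort.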


acm symposium on applied computing | 2014

The future of accelerator programming: abstraction, performance or can we have both?

Kamil Rocki; Martin Burtscher; Reiji Suda

In a perfect world, code would only be written once and would run on different devices with high efficiency. A programmer's time would primarily be spent thinking about algorithms and data structures, not implementing them. To a degree, that used to be the case in the era of frequency scaling on a single core. However, due to power limitations, parallel programming has become necessary to obtain performance gains. But parallel architectures differ substantially from each other, often require specialized knowledge, and typically necessitate reimplementation and fine-tuning of application code. These slow tasks frequently result in situations where most of the time is spent reimplementing old code rather than writing new code. The goal of our research is to find new programming techniques that increase productivity, maintain high performance, and provide abstraction to free the programmer from these unnecessary and time-consuming tasks. However, such techniques usually come at the cost of substantial performance degradation. This paper investigates current approaches to portable accelerator programming, seeking to answer whether they make it possible to combine high efficiency with sufficient algorithmic abstraction. It discusses OpenCL as a potential solution and presents three approaches to writing portable code: GPU-centric, CPU-centric, and combined. By applying the three approaches to a real-world program, we show that it is at least sometimes possible to run exactly the same code on many different devices with minimal performance degradation using parameterization. The main contributions of this paper are an extensive review of the current state of the art regarding the stated problem and our original approach of addressing it with a generalized excessive-parallelism approach.


international conference on high performance computing and simulation | 2012

Accelerating 2-opt and 3-opt local search using GPU in the travelling salesman problem

Kamil Rocki; Reiji Suda



acm symposium on parallel algorithms and architectures | 2012

Brief announcement: a GPU accelerated iterated local search TSP solver

Kamil Rocki; Reiji Suda

In this paper we present high-performance GPU implementations of the 2-opt and 3-opt local search algorithms used to solve the Traveling Salesman Problem. This type of local search optimization is a very effective and fast method for small problem instances; however, the time spent comparing the graph edges grows significantly with problem size. These local searches are usually part of global search algorithms such as Iterated Local Search (ILS). Our results showed that at least 90% of the time during a single ILS run is spent on the local search itself. We therefore utilized the GPU to parallelize the local search, which greatly improved the overall speed of the algorithm. Our results show that the GPU-accelerated algorithm finds the optimal swaps approximately 3 to 26 times faster than parallel CPU code using 32 cores, operating at over 1.5 TFLOPS on a single GeForce GTX 680 GPU. Preliminary experimental studies show that the optimization algorithm using GPU local search converges 10 to 50 times faster on average than the sequential CPU version, depending on the problem size.


australasian joint conference on artificial intelligence | 2011

Parallel monte carlo tree search scalability discussion

Kamil Rocki; Reiji Suda

In this paper we discuss which factors affect the scalability of the parallel Monte Carlo Tree Search (MCTS) algorithm. We ran the algorithm on CPUs and GPUs for the game of Reversi and the SameGame puzzle on the TSUBAME supercomputer. We show that the most likely cause of the scaling bottleneck is the problem size, and therefore that MCTS is a weak-scaling algorithm. We focus not on the relative scaling compared to a single-threaded MCTS, but rather on the absolute scaling of the parallel MCTS algorithm.
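At the core of each MCTS worker is the UCT child-selection rule, sketched below under the common assumption of a wins/visits statistic per child (the paper itself studies scaling behavior rather than this rule, so this is background, not its method):

```python
import math

def uct_select(children, total_visits, c=1.4142):
    """Return the index of the child maximizing the UCT score:
    mean reward plus an exploration bonus that shrinks as the child
    accumulates visits. `children` is a list of (wins, visits) pairs
    (an illustrative layout, not the paper's data structure)."""
    def score(wins, visits):
        if visits == 0:
            return float("inf")   # always try unvisited children first
        return wins / visits + c * math.sqrt(math.log(total_visits) / visits)
    return max(range(len(children)), key=lambda i: score(*children[i]))
```

The exploration constant `c` balances exploiting high-mean children against revisiting rarely tried ones; with many parallel workers descending the same tree, contention on these statistics is one source of the scaling limits the paper examines.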


arXiv: Computer Vision and Pattern Recognition | 2017

Utilization of Deep Reinforcement Learning for Saccadic-Based Object Visual Search

Tomasz Kornuta; Kamil Rocki

The paper focuses on the problem of learning saccades that enable visual object search. The developed system combines reinforcement learning with a neural network that learns to predict the possible outcomes of its actions. We validated the solution in three types of environments consisting of (pseudo-)randomly generated matrices of digits. The experimental verification is followed by a discussion of the elements required by systems mimicking fovea movement and of possible further research directions.


field-programmable custom computing machines | 2013

High-Throughput and Low-Cost Hardware Accelerator for Privacy Preserving Publishing

Kamil Rocki; Martin Burtscher; Reiji Suda


Collaboration

Top co-authors of Kamil Rocki:

Tomasz Kornuta (Warsaw University of Technology)
Liyu Xia (University of Chicago)