Nikola Puzovic | Researchain

Publication

Featured research published by Nikola Puzovic.

symposium on computer architecture and high performance computing | 2007

DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems

Roberto Giorgi; Zdravko Popovic; Nikola Puzovic

One way to exploit Thread Level Parallelism (TLP) is to use architectures that implement novel multithreaded execution models, such as Scheduled Data-Flow (SDF). This model promises an elegant, decoupled and non-blocking execution of threads. Here we extend that model for use in future scalable CMP systems, where wire delay forces the design to be partitioned. In this paper we describe our approach and experiment with different distributed schedulers and different numbers of clusters and processors per cluster to show the good scalability of our architecture. We present initial results on system scalability and performance, and suggest design choices to improve the scalability of the basic design.

international conference / workshop on embedded computer systems: architectures, modeling and simulation | 2009

Implementing Fine/Medium Grained TLP Support in a Many-Core Architecture

Roberto Giorgi; Zdravko Popovic; Nikola Puzovic

We believe that future many-core architectures should support a simple and scalable way to execute the many threads that are generated by parallel programs. A good candidate for efficient and scalable thread execution is the DTA (Decoupled Threaded Architecture), which is designed to exploit fine/medium grained Thread Level Parallelism (TLP) by using a hardware scheduling unit and relying on existing simple cores. In this paper, we present an initial implementation of the DTA concept in a many-core architecture, where it interacts with other architectural components designed from scratch in order to address the problem of scalability. We present initial results that show the scalability of the solution; they were obtained using a many-core simulator written in SARCSim (a variant of UNISIM) with DTA support.

international parallel and distributed processing symposium | 2009

Exploiting DMA to enable non-blocking execution in Decoupled Threaded Architecture

Roberto Giorgi; Zdravko Popovic; Nikola Puzovic

DTA (Decoupled Threaded Architecture) is designed to exploit fine/medium grained Thread Level Parallelism (TLP) by using a distributed hardware scheduling unit and relying on existing simple cores (in-order pipelines, no branch predictors, no ROBs).

international conference on embedded computer systems architectures modeling and simulation | 2013

Parallelizing general histogram application for CUDA architectures

Ugljesa Milic; Isaac Gelado; Nikola Puzovic; Alex Ramirez; Milo Tomasevic

Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, providing an efficient and scalable way to parallelize it can be challenging. This holds especially for platforms that contain one or several massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, use of global memory and similar. In this paper we compare two approaches for implementing general purpose histogramming on GPUs. The first algorithm is based on private copies of bin counters stored in shared memory for each block of threads. The second one uses the Thrust library to sort the input elements and then to search for upper bounds according to bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of input elements. We also implement overlapping of data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but can only support a limited number of bins. On the other hand, the sort-search strategy has about 50% higher speedup than privatization when we use characters as input, and can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of input parameters.
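
The two strategies named in this abstract can be illustrated with a minimal sequential sketch in Python. It is not the authors' CUDA or Thrust code: the function names, the `block_size` parameter, and the assumption that every input value lies in the range [lo, hi] are hypothetical choices made only for illustration. Each "block" is simulated by a slice of the input, and the per-block list stands in for counters kept in GPU shared memory.

```python
from bisect import bisect_left


def histogram_privatized(data, num_bins, lo, hi, block_size=4):
    """Privatization sketch: each block counts into its own copy, then merges."""
    width = (hi - lo) / num_bins
    global_bins = [0] * num_bins
    for start in range(0, len(data), block_size):
        # Hypothetical stand-in for per-block counters in shared memory.
        private = [0] * num_bins
        for x in data[start:start + block_size]:
            # Values equal to hi are clamped into the last bin.
            b = min(int((x - lo) / width), num_bins - 1)
            private[b] += 1
        # Merge the private copy into the global histogram.
        for b, count in enumerate(private):
            global_bins[b] += count
    return global_bins


def histogram_sort_search(data, num_bins, lo, hi):
    """Sort-search sketch: sort, locate each bin's upper edge, take differences."""
    width = (hi - lo) / num_bins
    ordered = sorted(data)
    cumulative = []
    for b in range(num_bins - 1):
        upper_edge = lo + (b + 1) * width
        # Number of elements strictly below this bin's upper edge.
        cumulative.append(bisect_left(ordered, upper_edge))
    # The last bin is closed on the right, so it takes everything that remains.
    cumulative.append(len(ordered))
    # Adjacent differences of the cumulative counts give per-bin counts.
    return [c - p for c, p in zip(cumulative, [0] + cumulative[:-1])]
```

Both functions return the same counts for the same input, which makes the trade-off in the abstract easier to see: the privatized version needs one counter array per block, so its memory grows with the number of bins, while the sort-search version pays for a sort but keeps no per-block state.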

international conference on cluster computing | 2010

A multi-pronged approach to benchmark characterization

Nikola Puzovic; Sally A. McKee; Revital Eres; Ayal Zaks; Paolo Gai; Stephan Wong; Roberto Giorgi

Understanding the behavior of current and future workloads is key for designers of future computer systems. If target workload characteristics are available, computer designers can use this information to optimize the system. This can lead to a chicken-and-egg problem: how does one characterize application behavior for an architecture that is a moving target and for which sophisticated modeling tools do not yet exist? We present a multi-pronged approach to benchmark characterization early in the design cycle. We collect statistics from multiple sources and combine them to create a comprehensive view of application behavior. We assume a fixed part of the system (service core) and a “to-be-designed” part that will gradually be developed under the measurements taken on the fixed part. Data are collected from measurements taken on existing hardware and statistics are obtained via emulation tools. These are supplemented with statistics extracted from traces and ILP information generated by the compiler. Although the motivation for this work is the classification of workloads for an embedded, reconfigurable, parallel architecture, the methodology can easily be adapted to other platforms.

international parallel and distributed processing symposium | 2013

Programmable and Scalable Reductions on Clusters

Jan Ciesko; Javier Bueno; Nikola Puzovic; Alex Ramirez; Rosa M. Badia; Jesús Labarta

Reductions matter and they are here to stay. Wide adoption of parallel processing hardware in a broad range of computer applications has encouraged recent research efforts on their efficient parallelization. Furthermore, trends towards high productivity languages in mainstream computing increase the demand for efficient programming support. In this paper we present a new approach to parallel reductions for distributed memory systems that provides both scalability and programmability. Using OmpSs, a task-based parallel programming model, the developer can express scalable reductions through a single pragma annotation. This pragma annotation is applicable to tasks as well as to work-sharing constructs (with implicit tasking) and instructs the compiler to generate the required runtime calls. The supporting runtime handles data and task distribution, parallel execution and data reduction. Scalability is achieved through a software cache that maximizes local and temporal data reuse and allows overlapped computation and communication. Results confirm scalability for up to 32 12-core cluster nodes.
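
As a rough illustration of what a distributed reduction computes, the following Python sketch shows a generic two-level scheme: each simulated node reduces its own chunk to a partial result, and the partials are then combined. This is an assumption made for illustration only; it does not use OmpSs, and the abstract does not state that the runtime works this way. The name `cluster_reduce` and its parameters are hypothetical.

```python
from functools import reduce
import operator


def cluster_reduce(data, num_nodes, op=operator.add, identity=0):
    """Two-level reduction sketch: local partials per node, then a final combine."""
    # Ceiling division so that every element is assigned to some node.
    chunk = (len(data) + num_nodes - 1) // num_nodes
    partials = []
    for node in range(num_nodes):
        # Hypothetical data distribution: a contiguous slice per node.
        local = data[node * chunk:(node + 1) * chunk]
        partials.append(reduce(op, local, identity))
    # Combine the per-node partial results into the final value.
    return reduce(op, partials, identity)
```

The scheme only gives the same answer as a sequential reduction when `op` is associative and `identity` is its neutral element, which is why the identity must be passed alongside the operator.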

complex, intelligent and software intensive systems | 2009

Introducing Hardware TLP Support in the Cell Processor

Roberto Giorgi; Zdravko Popovic; Nikola Puzovic

The focus of our study is the support for fine/medium grained Thread Level Parallelism (TLP) by using a hardware scheduling unit and relying on existing simple cores. Simple cores are grouped into clusters in order to provide a scalable solution. As a proof of concept, we use an implementation based on the Cell Broadband Engine (CBE). Cell is a multiprocessor on a chip developed by Sony, Toshiba and IBM that contains one general purpose core and eight coprocessor elements that accelerate multimedia and vector processing. The aim of this paper is to present a possible implementation of DTA (Decoupled Threaded Architecture) that is based on the Cell processor, while keeping the scalability of the original DTA.

digital systems design | 2008

Analyzing Scalability of Deblocking Filter of H.264 via TLP Exploitation in a New Many-Core Architecture

Roberto Giorgi; Zdravko Popovic; Nikola Puzovic; Arnaldo Azevedo; Ben H. H. Juurlink

In this paper we present the results of parallelizing the Deblocking Filter (DF) of the H.264 video codec on the Decoupled Threaded Architecture (DTA). We parallelized the code to exploit all available thread level parallelism and to make it suitable for DTA. Experimental results show that a significant speedup can be achieved and that DTA can efficiently exploit the available parallelism. We also show a comparison with a parallelized version of DF for the Cell architecture.

international conference on industrial informatics | 2011

Early results from ERA — Embedded Reconfigurable Architectures

Stephan Wong; Anthony Brandon; Fakhar Anjam; Roel Seedorf; Roberto Giorgi; Zhibin Yu; Nikola Puzovic; Sally A. McKee; Magnus Själander; Luigi Carro; Georgios Keramidas

Archive | 2011

ERA – Embedded Reconfigurable Architectures

Stephan Wong; Luigi Carro; Mateus B. Rutzig; Debora Matos; Roberto Giorgi; Nikola Puzovic; Stefanos Kaxiras; Marcelo Cintra; Giuseppe Desoli; Paolo Gai; Sally A. McKee; Ayal Zaks
