Petar Radojković
Barcelona Supercomputing Center
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Petar Radojković.
high performance embedded architectures and compilers | 2012
Petar Radojković; Sylvain Girbal; Arnaud Grasset; Eduardo Quiñones; Sami Yehia; Francisco J. Cazorla
Commercial Off-The-Shelf (COTS) processors are now commonly used in real-time embedded systems. The characteristics of these processors fulfill system requirements in terms of time-to-market, low cost, and high performance-per-watt ratio. However, multithreaded (MT) processors are still not widely used in real-time systems because the timing analysis is too complex. In MT processors, simultaneously-running tasks share and compete for processor resources, so the timing analysis has to estimate the possible impact that the inter-task interferences have on the execution time of the applications. In this paper, we propose a method that quantifies the slowdown that simultaneously-running tasks may experience due to collision in shared processor resources. To that end, we designed benchmarks that stress specific processor resources and we used them to (1) estimate the upper limit of a slowdown that simultaneously-running tasks may experience because of collision in different shared processor resources, and (2) quantify the sensitivity of time-critical applications to collision in these resources. We used the presented method to determine if a given MT processor is a good candidate for systems with timing requirements. We also present a case study in which the method is used to analyze three multithreaded architectures exhibiting different configurations of resource sharing. Finally, we show that measuring the slowdown that real applications experience when simultaneously-running with resource-stressing benchmarks is an important step in measurement-based timing analysis. This information is a base for incremental verification of MT COTS architectures.
architectural support for programming languages and operating systems | 2012
Petar Radojković; Vladimir Cakarevic; Miquel Moreto; Javier Verdú; Alex Pajuelo; Francisco J. Cazorla; Mario Nemirovsky; Mateo Valero
The introduction of massively multithreaded (MMT) processors, comprised of a large number of cores with many shared resources, has made task scheduling, in particular task to hardware thread assignment, one of the most promising ways to improve system performance. However, finding an optimal task assignment for a workload running on MMT processors is an NP-complete problem. Due to the fact that the performance of the best possible task assignment is unknown, the room for improvement of current task-assignment algorithms cannot be determined. This is a major problem for the industry because it could lead to: (1)~A waste of resources if excessive effort is devoted to improving a task assignment algorithm that already provides a performance that is close to the optimal one, or (2)~significant performance loss if insufficient effort is devoted to improving poorly-performing task assignment algorithms. In this paper, we present a method based on Extreme Value Theory that allows the prediction of the performance of the optimal task assignment in MMT processors. We further show that executing a sample of several hundred or several thousand random task assignments is enough to obtain, with very high confidence, an assignment with a performance that is close to the optimal one. We validate our method with an industrial case study for a set of multithreaded network applications running on an UltraSPARC~T2 processor.
international parallel and distributed processing symposium | 2014
Ivan Grasso; Petar Radojković; Nikola Rajovic; Isaac Gelado; Alex Ramirez
A lot of effort from academia and industry has been invested in exploring the suitability of low-power embedded technologies for HPC. Although state-of-the-art embedded systems-on-chip (SoCs) inherently contain GPUs that could be used for HPC, their performance and energy capabilities have never been evaluated. Two reasons contribute to the above. Primarily, embedded GPUs until now, have not supported 64-bit floating point arithmetic - a requirement for HPC. Secondly, embedded GPUs did not provide support for parallel programming languages such as OpenCL and CUDA. However, the situation is changing, and the latest GPUs integrated in embedded SoCs do support 64-bit floating point precision and parallel programming models. In this paper, we analyze performance and energy advantages of embedded GPUs for HPC. In particular, we analyze ARM Mali-T604 GPU - the first embedded GPUs with OpenCL Full Profile support. We identify, implement and evaluate software optimization techniques for efficient utilization of the ARM Mali GPU Compute Architecture. Our results show that, HPC benchmarks running on the ARM Mali-T604 GPU integrated into Exynos 5250 SoC, on average, achieve speed-up of 8.7X over a single Cortex-A15 core, while consuming only 32% of the energy. Overall results show that embedded GPUs have performance and energy qualities that make them candidates for future HPC systems.
international symposium on microarchitecture | 2009
Vladimir Cakarevic; Petar Radojković; Javier Verdú; Alex Pajuelo; Francisco J. Cazorla; Mario Nemirovsky; Mateo Valero
Thread level parallelism (TLP) has become a popular trend to improve processor performance, overcoming the limitations of extracting instruction level parallelism. Each TLP paradigm, such as Simultaneous Multithreading or Chip-Multiprocessors, provides different benefits, which has motivated processor vendors to combine several TLP paradigms in each chip design. Even if most of these combined-TLP designs are homogeneous, they present different levels of hardware resource sharing, which introduces complexities on the operating system scheduling and load balancing. Commonly, processor designs provide two levels of resource sharing: Inter-core in which only the highest levels of the cache hierarchy are shared, and Intra-core in which most of the hardware resources of the core are shared . Recently, Sun Microsystems has released the UltraSPARC T2, a processor with three levels of hardware resource sharing: InterCore, IntraCore, and IntraPipe. In this work, we provide the first characterization of a three-level resource sharing processor, the UltraSPARC T2, and we show how multi-level resource sharing affects the operating system design. We further identify the most critical hardware resources in the T2 and the characteristics of applications that are not sensitive to resource sharing. Finally, we present a case study in which we run a real multithreaded network application, showing that a resource sharing aware scheduler can improve the system throughput up to 55%.
IEEE Transactions on Parallel and Distributed Systems | 2013
Petar Radojković; Vladimir Cakarevic; Javier Verdú; Alex Pajuelo; Francisco J. Cazorla; Mario Nemirovsky; Mateo Valero
The introduction of multithreaded processors comprised of a large number of cores with many shared resources makes thread scheduling, and in particular optimal assignment of running threads to processor hardware contexts to become one of the most promising ways to improve the system performance. However, finding optimal thread assignments for workloads running in state-of-the-art multicore/multithreaded processors is an NP-complete problem. In this paper, we propose BlackBox scheduler, a systematic method for thread assignment of multithreaded network applications running on multicore/multithreaded processors. The method requires minimum information about the target processor architecture and no data about the hardware requirements of the applications under study. The proposed method is evaluated with an industrial case study for a set of multithreaded network applications running on the UltraSPARC T2 processor. In most of the experiments, the proposed thread assignment method detected the best actual thread assignment in the evaluation sample. The method improved the system performance from 5 to 48 percent with respect to load balancing algorithms used in state-of-the-art OSs, and up to 60 percent with respect to a naive thread assignment.
Proceedings of the 2015 International Symposium on Memory Systems | 2015
Milan Radulovic; Darko Zivanovic; Daniel Ruiz; Bronis R. de Supinski; Sally A. McKee; Petar Radojković; Eduard Ayguadé
First defined two decades ago, the memory wall remains a fundamental limitation to system performance. Recent innovations in 3D-stacking technology enable DRAM devices with much higher bandwidths than traditional DIMMs. The first such products will soon hit the market, and some of the publicity claims that they will break through the memory wall. Here we summarize our analysis and expectations of how such 3D-stacked DRAMs will affect the memory wall for a set of representative HPC applications. We conclude that although 3D-stacked DRAM is a major technological innovation, it cannot eliminate the memory wall.
symposium on computer architecture and high performance computing | 2008
Petar Radojković; Vladimir Cakarevic; Javier Verdú; Alejandro Pajuelo; Roberto Gioiosa; Francisco J. Cazorla; Mario Nemirovsky; Mateo Valero
Numerous studies have shown that operating system (OS)noise is one of the reasons for significant performance degradation in clustered architectures. Although many studies examine the OS noise for high performance computing (HPC),especially in multi-processor/core systems, most of them focus on 2- or 4-core systems. In this paper, we analyze the major sources of OS noise on a massive multithreading processor, the Sun UltraSPARC T1,running Linux and Solaris. Since a real system is too complex to analyze, we compare those results with a low-overhead runtime environment: the Netra Data Plane Software Suite (Netra DPS). Our results show that the overhead introduced by the OStimer interrupt in Linux and Solaris depends on the particular core and hardware context in which the application is running. This overhead is up to 30% when the application is executed on the same hardware context of the timer interrupt handler and up to 10% when the application and the timer interrupt handler run on different contexts but on the same core.We detect no overhead when the benchmark and the timer interrupt handler run on different cores of the processor.
acm sigplan symposium on principles and practice of parallel programming | 2010
Petar Radojković; Vladimir Cakarevic; Javier Verdú; Alex Pajuelo; Francisco J. Cazorla; Mario Nemirovsky; Mateo Valero
In processors with several levels of hardware resource sharing,like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (workload), each application/thread must be assigned to one of the hardware contexts(strands). We call this last scheduling step the Thread to Strand Binding or TSB. In this paper, we show that the TSB impact on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing. We propose a resource-sharing aware TSB algorithm (TSBSched) that significantly facilitates the problem of thread to strand binding for software-pipelined applications, representative of multithreaded network applications. Our systematic approach encapsulates both, the characteristics of multithreaded processors under the study and the structure of the software pipelined applications. Once calibrated for a given processor architecture, our proposal does not require hardware knowledge on the side of the programmer, nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers.
international conference on program comprehension | 2015
Milan Pavlovic; Milan Radulovic; Alex Ramírez; Petar Radojković
Characterization of high-performance computing applications often has to be done without access to the source code. Computer architects, therefore, have a narrowed choice of instrumentation tools. Moreover, potentially large amount of collected data can prohibit creating a full time stamped event trace and analyzing it post-mortem. This paper describes Limpio -- a Light weight MPI instrumentation framework, that allows dynamic instrumentation of user-selected MPI calls, and customization of data gathering, analysis and visualization.
international symposium on microarchitecture | 2012
Petar Radojković; Paul M. Carpenter; Miquel Moreto; Alex Ramirez; Francisco J. Cazorla
One of the greatest challenges in computer architecture is how to write efficient, portable, and correct software for multi-core processors. A promising approach is to expose more parallelism to the compiler, through the use of domain-specific languages. The compiler can then perform complex transformations that the programmer would otherwise have had to do. Many important applications related to audio and video encoding, software radio and signal processing have regular behavior that can be represented using a stream programming language. When written in such a language, a portable stream program can be automatically mapped by the stream compiler onto multicore hardware. One of the most difficult tasks of the stream compiler is partitioning the stream program into software threads. The choice of partition significantly affects performance, but finding the optimal partition is an NP-complete problem. This paper presents a method, based on Extreme Value theory (EVT), that statistically estimates the performance of the optimal partition. Knowing the optimal performance improves the evaluation of any partitioning algorithm, and it is the most important piece of information when deciding whether an existing algorithm should be enhanced. We use the method to evaluate a recently-published partitioning algorithm based on a heuristic. We further analyze how the statistical method is affected by the choice of sampling method, and we recommend how sampling should be done. Finally, since a heuristic-based algorithm may not always be available, the user may try to find a good partition by picking the best from a random sample. We analyze whether this approach is likely to find a good partition. To the best of our knowledge, this study is the first application of EVT to a graph partitioning problem.