Milo Tomasevic
University of Belgrade
Publications
Featured research published by Milo Tomasevic.
international symposium on microarchitecture | 1994
Milo Tomasevic; Veljko Milutinovic
Improving performance and scalability in shared-memory multiprocessors requires an appropriate solution to the well-known cache coherence problem. Hardware schemes, which are highly convenient because they are transparent to software, offer fully dynamic solutions with the ability to achieve high performance. In Part 1 of this two-part series, we discussed the principles of the two major groups of hardware protocols and summarized their relevant representatives. Here, we also briefly consider the coherence problem in multilevel cache hierarchies and in large-scale shared-memory multiprocessors.
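To make the write-invalidate idea behind snoopy hardware protocols concrete, here is a minimal sketch of a generic MSI protocol for a single one-word block on a shared bus. It is a textbook illustration under simplifying assumptions (one block, atomic bus transactions, hypothetical `Bus` and `Cache` classes), not any specific protocol from the survey.

```python
from enum import Enum

class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"

class Bus:
    """Shared bus snooped by every cache (single one-word block, simplified)."""
    def __init__(self):
        self.caches = []
        self.memory = 0  # main-memory copy of the block

    def broadcast(self, origin, op):
        # Every other cache observes the transaction on the bus.
        for cache in self.caches:
            if cache is not origin:
                cache.snoop(op)

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.state = State.INVALID
        self.value = None
        bus.caches.append(self)

    def read(self):
        if self.state is State.INVALID:
            # Read miss: a MODIFIED copy elsewhere is written back first.
            self.bus.broadcast(self, "BusRd")
            self.value = self.bus.memory
            self.state = State.SHARED
        return self.value

    def write(self, value):
        if self.state is not State.MODIFIED:
            # Gain exclusive ownership by invalidating all other copies.
            self.bus.broadcast(self, "BusRdX")
            self.state = State.MODIFIED
        self.value = value

    def snoop(self, op):
        if self.state is State.MODIFIED:
            self.bus.memory = self.value  # write back the dirty block
        if op == "BusRdX":
            self.state = State.INVALID
        elif op == "BusRd" and self.state is State.MODIFIED:
            self.state = State.SHARED
```

For example, after `a.write(5)` a second cache's `b.read()` returns 5 and leaves both copies SHARED; a subsequent `b.write(7)` invalidates `a`'s copy.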
hawaii international conference on system sciences | 1995
Jelica Protic; Milo Tomasevic; Veljko Milutinovic
Distributed shared memory (DSM) systems have recently attracted considerable research effort, since they combine the advantages of two different computer classes: shared-memory multiprocessors and distributed systems. The most important advantage is the use of the shared-memory programming paradigm on physically distributed memories. One possible classification taxonomy, which includes two basic criteria and a number of related characteristics, is proposed and described. According to the implementation level of the DSM mechanism, one of the basic classification criteria, systems are organized into three groups: hardware, software, and hybrid DSM implementations. The paper also presents an almost exhaustive survey of existing solutions in a uniform manner, presenting their DSM mechanisms and the issues of importance for various DSM systems and approaches.
hawaii international conference on system sciences | 1993
Milo Tomasevic; Veljko Milutinovic
Presents a comprehensive survey of software solutions for maintaining cache consistency in shared-memory multiprocessor systems. Our basic motivation for this work was the lack of a widely known, acceptably systematic, and flexible classification in this research field. We have proposed a classification based on a set of ten carefully selected criteria that we considered most relevant. Existing solutions are described and decomposed on the basis of this classification. Different solutions correspond to various points in an abstract multidimensional criterion space. Such a generalized approach makes it possible to notice points corresponding to nonexistent but potentially useful solutions and to select them for exploration.
Microprocessors and Microsystems | 1996
Milo Tomasevic; Veljko Milutinovic
Snoopy protocols are widely used for preserving cache coherence in shared-bus, shared-memory multiprocessors. In this work, an attempt was made to improve their performance for parallel applications in which a sequential sharing pattern prevails. The scheme introduced here, the WIP protocol, tries to achieve better data utilization than existing write-invalidate protocols by applying the principle of partial, word-based invalidation. The complete coherence mechanism is described in this paper. Both analytical and simulation methodologies were used to evaluate the features of the proposed solution and to compare it with the Berkeley and Dragon protocols. The comparative evaluation is performed for a large variety of application- and system-oriented parameters, and the results are presented and discussed. Finally, the hardware complexity of implementing WIP in a cache memory unit is compared with that of the two other protocols.
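The principle of partial, word-based invalidation can be illustrated with a simplified per-word valid-bit model. This is a hypothetical sketch (the names `WordValidBlock` and `count_reader_misses`, the 4-word block, and the refetch policy are all assumptions), not the actual WIP state machine; it only contrasts word-level and whole-block invalidation for a reader that already holds the block.

```python
BLOCK_WORDS = 4  # assumed block size, in words

class WordValidBlock:
    """A cached block with one valid bit per word (illustration only)."""
    def __init__(self):
        self.valid = [True] * BLOCK_WORDS

def count_reader_misses(written_words, read_words, word_based):
    """Count misses for a reader that held a fully valid copy of the block
    before another processor wrote `written_words` into it."""
    block = WordValidBlock()
    for w in written_words:
        if word_based:
            block.valid[w] = False                   # invalidate only the written word
        else:
            block.valid = [False] * BLOCK_WORDS      # invalidate the whole block
    misses = 0
    for r in read_words:
        if not block.valid[r]:
            misses += 1
            if word_based:
                block.valid[r] = True                # assumed: refetch just the stale word
            else:
                block.valid = [True] * BLOCK_WORDS   # refetch the whole block
    return misses
```

If another processor writes only word 0 and the reader then touches words 1 to 3, word-based invalidation causes no misses, while whole-block invalidation causes one miss on data that never changed.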
IEEE Parallel & Distributed Technology: Systems & Applications | 1996
Aleksandra Grujić; Milo Tomasevic; Veljko Milutinovic
Distributed shared memory (DSM) combines the advantages of shared-memory multiprocessors and distributed computer systems. Evaluations of four experimental or commercial approaches to hardware DSM show its potential for large-scale, high-performance multiprocessor systems. Such an analysis helps in developing guidelines and practical recommendations to further improve existing systems. The four approaches are Dash (Directory Architecture for Shared Memory), SCI (Scalable Coherent Interface), DDM (Data Diffusion Machine) and KSR1 (Kendall Square Research-1).
Advances in Computers | 2017
V. Blagojević; Dragan Bojic; Miroslav Bojovic; Milos Cvetanovic; J. Đorđević; Đ. Đurđević; B. Furlan; S. Gajin; Z. Jovanović; D. Milićev; Veljko Milutinovic; B. Nikolić; J. Protić; M. Punt; Z. Radivojević; Ž. Stanisavljević; Sasa Stojanovic; I. Tartalja; Milo Tomasevic; P. Vuletić
This article represents an effort to help PhD students in computer science and engineering generate good original ideas for their PhD research. Our effort is motivated by the fact that most PhD programs nowadays include several courses as well as a research component that should result in journal publications and the PhD thesis, all within a time frame of 3–6 years. To help PhD students in computing disciplines focus on generating ideas and finding an appropriate subject for their PhD research, we have analyzed some state-of-the-art inventions in the area of computing, as well as the PhD thesis research of faculty members of our department, and have come up with a proposal of 10 methods that could be used to derive new ideas from the existing body of knowledge in the research field. This systematic approach provides guidance for PhD students, with the aim of improving their efficiency and reducing the dropout rate, especially in the area of computing.
international conference on embedded computer systems architectures modeling and simulation | 2013
Ugljesa Milic; Isaac Gelado; Nikola Puzovic; Alex Ramirez; Milo Tomasevic
Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, providing an efficient and scalable way to parallelize it can be challenging. This holds especially for platforms that contain one or several massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, the use of global memory, and the like. In this paper we compare two approaches for implementing general-purpose histogramming on GPUs. The first algorithm is based on private copies of bin counters stored in shared memory for each block of threads. The second uses the Thrust library to sort the input elements and then search for upper bounds according to bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of input elements. We also implement overlapping of data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but it can only support a limited number of bins. On the other hand, the sort-search strategy achieves about 50% higher speedup than privatization when characters are used as input, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of input parameters.
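The two strategies can be contrasted with a sequential CPU sketch. It assumes integer inputs in `[0, value_range)` with equal-width bins that divide the range evenly; the function names and the `num_blocks` parameter are hypothetical, and Python's `bisect` stands in for Thrust's sort and upper-bound search on the GPU.

```python
from bisect import bisect_left

def bin_width(value_range, num_bins):
    # Assumption: integer inputs in [0, value_range), evenly divisible into bins.
    assert value_range % num_bins == 0
    return value_range // num_bins

def hist_privatized(data, num_bins, value_range, num_blocks=4):
    """Each 'block' fills its own private counters, which are then merged
    (mirrors per-thread-block shared-memory copies on a GPU)."""
    w = bin_width(value_range, num_bins)
    private = [[0] * num_bins for _ in range(num_blocks)]
    chunk = -(-len(data) // num_blocks)  # ceiling division
    for b in range(num_blocks):
        for x in data[b * chunk:(b + 1) * chunk]:
            private[b][x // w] += 1
    # Merge step: sum the private copies bin by bin.
    return [sum(counts) for counts in zip(*private)]

def hist_sort_search(data, num_bins, value_range):
    """Sort the input, then locate each bin's upper bound by binary search."""
    w = bin_width(value_range, num_bins)
    s = sorted(data)
    # upper[k] = number of elements < (k + 1) * w, i.e. the end of bin k.
    upper = [bisect_left(s, (k + 1) * w) for k in range(num_bins)]
    return [upper[0]] + [upper[k] - upper[k - 1] for k in range(1, num_bins)]
```

The trade-off reported in the paper shows up in the structure: the privatized version needs `num_blocks * num_bins` counters (on a GPU, bounded by shared memory), while sort-search needs only `num_bins` boundary searches regardless of how many bins there are.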
hawaii international conference on system sciences | 1995
Milan M. Jovanovic; Milo Tomasevic; Veljko Milutinovic
The Reflective Memory/Memory Channel (RM/MC) system is a modular bus-based architecture that belongs to the class of distributed shared memory systems. It is characterized by an update consistency mechanism for shared data and efficient block transfers over the bus. This work has two main goals. First, an extensive simulation analysis was carried out, using a functional RM/MC simulator driven by a convenient and flexible synthetic workload model, to evaluate different design and implementation decisions and variants of the RM/MC concept over a wide range of values of the relevant application-, architecture-, and technology-related parameters. In this way, an optimal set of parameter values was found. Second, this paper presents an improvement to the basic concept, intended to enhance the real-time response of the system. The proposed idea combines compile-time and run-time actions to reduce the latency of short messages. A set of experiments is performed to evaluate the efficiency of the proposed enhancement, and the most important results are presented and discussed.
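The update consistency mechanism can be sketched as follows: every node keeps a local replica of the shared region, reads are purely local, and each write is propagated over the bus to update (rather than invalidate) the other replicas. The classes `ReflectiveBus` and `Node` and the transfer counter are hypothetical illustrations, not the RM/MC hardware interface.

```python
class ReflectiveBus:
    """Broadcast medium that updates every other node's replica."""
    def __init__(self):
        self.nodes = []
        self.transfers = 0  # number of bus transactions issued

    def propagate(self, sender, addr, values):
        self.transfers += 1  # one transaction, even for a multi-word block
        for node in self.nodes:
            if node is not sender:
                # Update, not invalidate: remote replicas receive the new data.
                node.memory[addr:addr + len(values)] = values

class Node:
    def __init__(self, bus, size):
        self.memory = [0] * size  # local replica of the shared region
        self.bus = bus
        bus.nodes.append(self)

    def write(self, addr, value):
        self.memory[addr] = value  # the local write completes immediately
        self.bus.propagate(self, addr, [value])

    def write_block(self, addr, values):
        # Block transfer: several consecutive words in one bus transaction.
        self.memory[addr:addr + len(values)] = list(values)
        self.bus.propagate(self, addr, list(values))

    def read(self, addr):
        return self.memory[addr]  # reads never touch the bus
```

Because reads are always served locally, the cost of sharing is paid on writes; grouping consecutive words with `write_block` amortizes that cost, which is the motivation for efficient block transfers.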
telecommunications forum | 2011
Marko Misic; Milo Tomasevic
This paper presents a short survey and performance analysis of parallel sorting algorithms on graphics processing units. Implementations of three representative sorting algorithms (Quicksort, Merge sort, and Radix sort) were evaluated on the CUDA platform, which is used to execute programs on NVIDIA graphics processing units. The algorithms were carefully tested and evaluated using an automated test environment with different datasets, especially those important for particular applications. Finally, the results of this analysis are briefly discussed.
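Of the three algorithms, Radix sort has the structure that maps most directly onto data-parallel hardware: each pass is a histogram, a prefix sum, and a stable scatter. The sketch below is a sequential least-significant-digit version written to expose those three steps; it assumes non-negative integer keys below `2**key_bits`, and the parameter names are illustrative, not taken from the evaluated implementations.

```python
def radix_sort(keys, bits_per_pass=4, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    for shift in range(0, key_bits, bits_per_pass):
        # 1. Histogram of the current digit.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix sum gives each digit's starting offset.
        offsets, total = [0] * radix, 0
        for d in range(radix):
            offsets[d], total = total, total + counts[d]
        # 3. Stable scatter into the output array.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

On a GPU, each of the three steps is typically parallelized separately (per-block histograms, a parallel scan, then a scatter), which is why Radix sort tends to behave quite differently from the comparison-based Quicksort and Merge sort.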
engineering of computer based systems | 2009
Marija Punt; Jovan Djordjevic; Milo Tomasevic
This paper presents an approach to designing a simulation environment for the on-line monitoring of a fault-tolerant flight control computer. The simulation environment is designed to evaluate an improved on-line monitoring technique for processors with a built-in cache. The technique assumes that a monitor checks on-line whether program execution conforms to the control flow graph created for the program off-line by a preprocessor. Besides the target processor and the monitor, the simulation environment also includes carefully chosen benchmark programs, fault injection modules, and the preprocessor.
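The core check performed by such a monitor can be sketched as follows: the off-line preprocessor yields a control flow graph (here a hypothetical dictionary mapping each basic block to its legal successors), and the monitor verifies each observed transition against it. This is only the basic idea, under assumed names and data formats; the paper's improved technique for cache-equipped processors is not reproduced here.

```python
def check_trace(cfg, entry, trace):
    """Return the index of the first illegal block in `trace`,
    or None if the whole trace follows edges of `cfg`.

    cfg:   dict mapping a basic-block id to the set of its legal successors
    entry: id of the program's entry block
    trace: sequence of basic-block ids observed at run time
    """
    if not trace:
        return None
    if trace[0] != entry:
        return 0  # execution did not start at the entry block
    for i in range(1, len(trace)):
        # A transition absent from the CFG signals a control-flow error.
        if trace[i] not in cfg.get(trace[i - 1], ()):
            return i
    return None
```

For a diamond-shaped graph `B0 -> {B1, B2} -> B3`, the trace `B0, B1, B3` is accepted, while an injected fault that causes a direct jump `B0 -> B3` is reported at index 1.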