Tiago A. O. Alves

Federal University of Rio de Janeiro

Publications

Featured research published by Tiago A. O. Alves.

International Journal of High Performance Systems Architecture | 2011

Trebuchet: exploring TLP with dataflow virtualisation

Tiago A. O. Alves; Leandro A. J. Marzulo; Felipe M. G. França; Vítor Santos Costa

Parallel programming has become mandatory to fully exploit the potential of multi-core CPUs. The dataflow model provides a natural way to exploit parallelism. However, specifying dependences and control using fine-grained instructions in dataflow programs can be complex and present unwanted overheads. To address this issue, we have designed TALM: a coarse-grained dataflow execution model to be used on top of widespread architectures. We implemented TALM as the Trebuchet virtual machine for multi-cores. The programmer identifies code blocks that can run in parallel and connects them to form a dataflow graph, which provides the benefits of parallel dataflow execution on a Von Neumann machine with small programming effort. We parallelised a set of seven applications using our approach and compared them with OpenMP implementations. Results show that Trebuchet can be competitive with state-of-the-art technology, while providing the benefits of dataflow execution.
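To make the dataflow firing rule concrete, here is a minimal Python sketch of coarse-grained dataflow execution: each node wraps an ordinary function (playing the role of a code block), and a node fires only once all of its input operands have arrived. The names `Node`, `connect` and `run` are hypothetical and do not reflect Trebuchet's actual interface or instruction set. The sketch also runs ready nodes one at a time, whereas a runtime such as Trebuchet would dispatch them to multiple cores.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple


class Node:
    """A coarse-grained task (hypothetical stand-in for a TALM block)."""

    def __init__(self, name: str, func: Callable, n_inputs: int):
        self.name = name
        self.func = func
        self.n_inputs = n_inputs
        self.dests: List[Tuple["Node", int]] = []  # (consumer, input port)


def connect(src: Node, dst: Node, port: int) -> None:
    """Add a dataflow edge: src's result becomes operand `port` of dst."""
    src.dests.append((dst, port))


def run(sources: Dict[Node, object]) -> Dict[str, object]:
    """Execute the graph sequentially, honouring the dataflow firing rule."""
    waiting: Dict[Node, Dict[int, object]] = {}
    results: Dict[str, object] = {}
    # Each source node receives its single initial operand up front.
    ready = deque((node, [value]) for node, value in sources.items())
    while ready:
        node, args = ready.popleft()
        value = node.func(*args)
        results[node.name] = value
        for dst, port in node.dests:
            slots = waiting.setdefault(dst, {})
            slots[port] = value
            # Firing rule: dst is ready once every input port holds an operand.
            if len(slots) == dst.n_inputs:
                ready.append((dst, [slots[p] for p in range(dst.n_inputs)]))
                del waiting[dst]
    return results


# Usage: two independent blocks feed a third one.
double = Node("double", lambda x: 2 * x, 1)
square = Node("square", lambda x: x * x, 1)
add = Node("add", lambda a, b: a + b, 2)
connect(double, add, 0)
connect(square, add, 1)
out = run({double: 3, square: 4})
print(out["add"])  # 22
```

Note that `double` and `square` have no dependency on each other, so a parallel runtime could execute them simultaneously; only `add` has to wait.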

Parallel Computing | 2014

Couillard: Parallel programming via coarse-grained Data-flow Compilation

Leandro A. J. Marzulo; Tiago A. O. Alves; Felipe M. G. França; Vítor Santos Costa

Data-flow is a natural approach to parallelism. However, describing dependencies and control between fine-grained data-flow tasks can be complex and present unwanted overheads. TALM (TALM is an Architecture and Language for Multi-threading) introduces a user-defined coarse-grained parallel data-flow model, where programmers identify code blocks, called super-instructions, to be run in parallel and connect them in a data-flow graph. TALM has been implemented as a hybrid Von Neumann/data-flow execution system: the Trebuchet. We have observed that TALM’s usefulness largely depends on how programmers specify and connect super-instructions. Thus, we present Couillard, a full compiler that creates, based on an annotated C program, a data-flow graph and C code corresponding to each super-instruction. We show that our toolchain allows one to benefit from data-flow execution and explore sophisticated parallel programming techniques with small effort. To evaluate our system, we have executed a set of real applications on a large multi-core machine. Comparison with popular parallel programming methods shows competitive speedups, while providing an easier parallel programming approach. More specifically, for an application that follows the wavefront method, running with big inputs, Trebuchet achieved up to 4.7% speedup over the novel flow-graph approach of Intel® TBB and up to 44% over OpenMP.
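The wavefront method mentioned above can be illustrated independently of Couillard's compiler. The sketch below assumes the usual wavefront dependence, where cell (i, j) of a grid depends on (i-1, j) and (i, j-1). Under that assumption, all cells on one anti-diagonal are independent, so each anti-diagonal could be issued as a batch of parallel tasks. The functions `wavefronts` and `fill` are hypothetical illustrations, not Couillard annotations or generated code.

```python
from typing import List, Tuple


def wavefronts(rows: int, cols: int) -> List[List[Tuple[int, int]]]:
    """Group grid cells into anti-diagonals (waves) d = i + j.

    Cell (i, j) depends only on (i - 1, j) and (i, j - 1), both of which lie
    in wave d - 1, so the cells inside one wave are mutually independent.
    """
    waves = []
    for d in range(rows + cols - 1):
        lo, hi = max(0, d - cols + 1), min(rows, d + 1)
        waves.append([(i, d - i) for i in range(lo, hi)])
    return waves


def fill(rows: int, cols: int) -> List[List[int]]:
    """Example kernel: count monotone lattice paths, computed wave by wave."""
    grid = [[0] * cols for _ in range(rows)]
    for wave in wavefronts(rows, cols):
        for i, j in wave:  # iterations of this inner loop could run in parallel
            if i == 0 or j == 0:
                grid[i][j] = 1
            else:
                grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid


print(wavefronts(2, 3))  # [[(0, 0)], [(0, 1), (1, 0)], [(0, 2), (1, 1)], [(1, 2)]]
print(fill(3, 3)[2][2])  # 6
```

Parallelism grows towards the middle anti-diagonals and shrinks at the corners, which is one reason larger inputs tend to favour this pattern.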

Symposium on Computer Architecture and High Performance Computing | 2014

A Minimalistic Dataflow Programming Library for Python

Tiago A. O. Alves; Brunno F. Goldstein; Felipe M. G. França; Leandro A. J. Marzulo

Current work on parallel programming models is trending towards the dataflow paradigm. Previous work on that topic has shown that dataflow programming is indeed a natural way to exploit parallelism in programs. However, there is still a gap in ease of programming between the high-level languages adopted by the scientific community and the languages and tools available for dataflow programming. In this paper we present Sucuri: a minimalistic Python library that provides dataflow programming with reasonably simple syntax. To parallelize applications using our library, the programmer only needs to identify functions in their code that are good candidates for parallelization and instantiate a dataflow graph where each node is associated with one of these functions and the edges between nodes describe data dependencies between functions. We then implement two benchmarks that represent important parallel programming patterns using our library and execute them on a cluster of multicores. Experimental results are promising, indicating that our library can be an interesting first option for parallelization.
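The programming model described above (functions as nodes, edges as data dependencies) can be sketched in a few lines of standard-library Python. The `DataflowGraph` class and its `add_node`, `add_edge` and `run` methods are hypothetical and are not Sucuri's API. For brevity, the sketch uses a thread pool on a single machine; CPU-bound pure-Python functions will not speed up under CPython's global interpreter lock, and Sucuri itself targets clusters of multicores.

```python
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Callable, Dict, List


class DataflowGraph:
    """Hypothetical graph where each node runs one Python function."""

    def __init__(self) -> None:
        self.funcs: Dict[str, Callable] = {}
        self.preds: Dict[str, List[str]] = {}

    def add_node(self, name: str, func: Callable) -> None:
        self.funcs[name] = func
        self.preds[name] = []

    def add_edge(self, src: str, dst: str) -> None:
        """Data dependency: src's result becomes dst's next positional argument."""
        self.preds[dst].append(src)

    def run(self, workers: int = 4) -> Dict[str, object]:
        results: Dict[str, object] = {}
        pending = set(self.funcs)
        running: Dict[Future, str] = {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while pending or running:
                # Submit every node whose producers have all finished.
                ready = [n for n in pending
                         if all(p in results for p in self.preds[n])]
                for name in ready:
                    args = [results[p] for p in self.preds[name]]
                    running[pool.submit(self.funcs[name], *args)] = name
                    pending.discard(name)
                if not running:
                    raise ValueError("unsatisfiable dependencies (cycle?)")
                done, _ = wait(running, return_when=FIRST_COMPLETED)
                for fut in done:
                    results[running.pop(fut)] = fut.result()
        return results


# Usage: mul depends on a and b, which are independent of each other.
g = DataflowGraph()
g.add_node("a", lambda: 2)
g.add_node("b", lambda: 5)
g.add_node("mul", lambda x, y: x * y)
g.add_edge("a", "mul")  # first argument of mul
g.add_edge("b", "mul")  # second argument of mul
res = g.run()
print(res["mul"])  # 10
```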

Symposium on Computer Architecture and High Performance Computing | 2010

TALM: A Hybrid Execution Model with Distributed Speculation Support

Leandro A. J. Marzulo; Tiago A. O. Alves; Felipe M. G. França; Vítor Santos Costa

Parallel programming has become mandatory to fully exploit the potential of modern CPUs. The data-flow model provides a natural way to exploit parallelism. However, traditional data-flow programming is not trivial: specifying dependencies and control using fine-grained tasks (such as instructions) can be complex and present unwanted overheads. To address this issue, we have built a coarse-grained data-flow model with speculative execution support to be used on top of widespread architectures, implemented as a hybrid Von Neumann/data-flow execution system. We argue that speculative execution fits naturally with the data-flow model. Using speculative execution frees the programmer to consider only the main dependencies, while still allowing correct data-flow execution of coarse-grained tasks. Moreover, our speculation mechanism does not demand centralised control, which is a key feature for upcoming many-core systems, where scalability has become an important concern. An initial study on an artificial bank server application suggests that there is a wide range of scenarios where speculation can be very effective.
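Speculative execution of this kind is commonly realised with optimistic concurrency: a task runs on a snapshot of the data it reads, and its result is committed only if nothing it read has changed in the meantime; otherwise it is re-executed. The sketch below applies that generic technique to a toy bank transfer. It is an assumption-laden illustration, not TALM's mechanism. In particular, it validates commits with a single lock, which is exactly the kind of centralised control TALM's distributed speculation avoids. `VersionedStore` and `transfer` are hypothetical names.

```python
import threading
from typing import Callable, Dict, List


class VersionedStore:
    """Hypothetical shared state where every key carries a version number."""

    def __init__(self, data: Dict[str, int]):
        self.data = dict(data)
        self.version = {k: 0 for k in data}
        self.lock = threading.Lock()  # centralised here for simplicity only

    def run_speculative(self, keys: List[str],
                        task: Callable[[Dict[str, int]], Dict[str, int]]) -> int:
        """Run task optimistically; return how many attempts were needed."""
        attempts = 0
        while True:
            attempts += 1
            with self.lock:
                snapshot = {k: self.data[k] for k in keys}
                seen = {k: self.version[k] for k in keys}
            updates = task(snapshot)  # speculative work, lock not held
            if not set(updates) <= set(keys):
                raise ValueError("a task may only write keys it declared")
            with self.lock:
                # Validate: commit only if no key we read changed meanwhile.
                if all(self.version[k] == v for k, v in seen.items()):
                    for k, value in updates.items():
                        self.data[k] = value
                        self.version[k] += 1
                    return attempts
            # Misspeculation detected: discard the result and re-execute.


def transfer(amount: int) -> Callable[[Dict[str, int]], Dict[str, int]]:
    def task(s: Dict[str, int]) -> Dict[str, int]:
        return {"alice": s["alice"] - amount, "bob": s["bob"] + amount}
    return task


store = VersionedStore({"alice": 100, "bob": 50})
store.run_speculative(["alice", "bob"], transfer(30))
threads = [threading.Thread(target=store.run_speculative,
                            args=(["alice", "bob"], transfer(1)))
           for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.data)  # {'alice': 50, 'bob': 100}
```

Conflicting transfers may be re-executed several times, but each one commits exactly once, so the final balances are deterministic.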

Symposium on Computer Architecture and High Performance Computing | 2016

Task Scheduling in Sucuri Dataflow Library

Rafael J. N. Silva; Brunno F. Goldstein; Leandro Santiago; Alexandre C. Sena; Leandro A. J. Marzulo; Tiago A. O. Alves; Felipe M. G. França

Sucuri is a minimalistic Python library that provides dataflow programming through a reasonably simple syntax. It allows transparent execution on computer clusters and natural exploitation of parallelism. In Sucuri, programmers instantiate a dataflow graph, where each node is assigned to a function and edges represent data dependencies between nodes. The original implementation of Sucuri adopts a centralized scheduler, which incurs high communication overheads, especially in clusters with a large number of machines. In this paper we modify Sucuri so that each machine in a cluster has its own scheduler. Before execution, the dataflow graph is partitioned so that nodes can be distributed among the machines of the cluster. At runtime, idle workers grab tasks from a ready queue in their local scheduler. Experimental results confirm that the solution can reduce communication overheads, improving performance in larger clusters.
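The partition-then-schedule-locally scheme described above can be sketched as follows. The greedy partitioner is a hypothetical heuristic chosen for illustration, since the abstract does not say which partitioning method is used: it visits nodes in topological order and places each one on the machine that already holds most of its producers, under a balanced capacity limit. Edges that cross machines stand for the network messages a distributed runtime would need, and each machine starts with its own ready queue of nodes that have no producers.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Edge = Tuple[str, str]


def partition(order: List[str], edges: List[Edge], machines: int) -> Dict[str, int]:
    """Greedy placement; `order` must be a topological order of the graph."""
    preds: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    capacity = -(-len(order) // machines)  # ceiling division keeps load balanced
    load = [0] * machines
    where: Dict[str, int] = {}
    for node in order:
        affinity = [0] * machines
        for p in preds[node]:
            affinity[where[p]] += 1  # producers already placed on each machine
        open_machines = [m for m in range(machines) if load[m] < capacity]
        best = max(open_machines, key=lambda m: (affinity[m], -load[m]))
        where[node] = best
        load[best] += 1
    return where


def cut_edges(where: Dict[str, int], edges: List[Edge]) -> int:
    """Dependencies between machines: each one implies a network message."""
    return sum(1 for src, dst in edges if where[src] != where[dst])


def initial_ready_queues(where: Dict[str, int], edges: List[Edge],
                         machines: int) -> List[Deque[str]]:
    """One local ready queue per machine, seeded with nodes lacking producers."""
    consumers = {dst for _, dst in edges}
    queues: List[Deque[str]] = [deque() for _ in range(machines)]
    for node, m in where.items():
        if node not in consumers:
            queues[m].append(node)
    return queues


# Usage: two independent three-stage chains on two machines.
edges = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("b2", "b3")]
order = ["a1", "b1", "a2", "b2", "a3", "b3"]
where = partition(order, edges, machines=2)
print(cut_edges(where, edges))  # 0
print([list(q) for q in initial_ready_queues(where, edges, 2)])  # [['a1'], ['b1']]
```

Here each chain lands on its own machine, so no dependency crosses the network and each local scheduler can proceed without consulting a central one.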

International On-Line Testing Symposium | 2014

Online error detection and recovery in dataflow execution

Tiago A. O. Alves; Sandip Kundu; Leandro A. J. Marzulo; Felipe M. G. França

The processor industry is well on its way towards manycore processors that comprise a large number of simple cores. The shift towards multi- and manycores calls for new programming paradigms suitable for exploiting the inherent parallelism in applications. Dataflow execution has been shown to be a good option for programming in such environments. It is well known that as CMOS technology continues to scale, circuits become more prone to transient and permanent hardware faults. In this paper we present a novel mechanism for error detection and recovery that focuses on transient errors in dataflow execution. Due to the inherently parallel nature of dataflow, our solution is completely distributed and synchronizes only cores that have data dependencies between them, as opposed to prior work on error recovery, which generally relies on global synchronization of the system. We evaluate the proposed solution via a software implementation on top of a dataflow runtime. Experimental results show that error detection overhead is highly related to the pressure on the memory bus. In memory-bound applications, performance is found to deteriorate, while for the other benchmarks the observed overhead is less than 23%. We found no comparable previous work against which to contrast these results.
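The abstract does not describe how errors are detected, so the sketch below uses a generic, assumed technique rather than the paper's mechanism: execute a task twice, compare the two results, and re-execute on a mismatch, forwarding only a validated result. Because the check is local to one task, only that task's consumers wait on it, which mirrors the dependency-local synchronization the abstract describes. `checked_execute` and `FlakyAdder` are hypothetical, and the approach assumes tasks are deterministic and free of side effects.

```python
from typing import Any, Callable, Tuple


def checked_execute(func: Callable, args: Tuple, max_retries: int = 3) -> Any:
    """Duplicate execution with comparison; retry on mismatch.

    A mismatch is treated as a transient fault. Only the validated value is
    returned, so only the consumers of this task wait for the check.
    """
    for _ in range(max_retries):
        first = func(*args)
        second = func(*args)
        if first == second:
            return first
    raise RuntimeError("results keep disagreeing: possible permanent fault")


class FlakyAdder:
    """Test double: the second call suffers a one-off bit flip."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, a: int, b: int) -> int:
        self.calls += 1
        result = a + b
        return result ^ 1 if self.calls == 2 else result


flaky = FlakyAdder()
print(checked_execute(flaky, (2, 3)), flaky.calls)  # 5 4
```

The injected fault makes the first comparison fail, so the task is re-executed once and the correct value is forwarded after four calls.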

Symposium on Computer Architecture and High Performance Computing | 2014

Stack-Tagged Dataflow

Leandro Santiago; Leandro A. J. Marzulo; Brunno F. Goldstein; Tiago A. O. Alves; Felipe M. G. França

Dynamic Dataflow allows simultaneous execution of instructions in different iterations of a loop, boosting parallelism exploitation. In this model, operands are tagged with their associated instance number, which is incremented as they go through the loop. Instruction execution is triggered when all input operands with the same tag become available. However, this traditional tagging mechanism often requires generating several control instructions to manipulate tags and guarantee correct matching. To address this problem, this work presents Stack-Tagged Dataflow, a tagging mechanism that uses stacks of tags to reduce control overheads in dataflow while isolating the contexts of different nested loops. Push instructions at the beginning of a loop create a new context, while Pop instructions at the end of the loop restore the original context tags. Experimental results show that Stack-Tagged Dataflow is a viable solution, suggesting that a hybrid compiling approach can be used to enable Stack-Tags only when needed.
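The push, increment and pop steps can be modelled with tags represented as tuples, where the last element is the iteration counter of the innermost loop. This is a hypothetical Python model of the idea, not the paper's instruction set or tag encoding. `MatchingStore` simply shows that operands match only when both the instruction and the full tag stack agree, which keeps iterations of different nested loops apart.

```python
from typing import Dict, List, Optional, Tuple

Tag = Tuple[int, ...]


def push(tag: Tag) -> Tag:
    """Loop entry: open a new context whose iteration counter starts at 0."""
    return tag + (0,)


def next_iteration(tag: Tag) -> Tag:
    """Loop back-edge: only the innermost counter changes."""
    return tag[:-1] + (tag[-1] + 1,)


def pop(tag: Tag) -> Tag:
    """Loop exit: restore the enclosing context."""
    return tag[:-1]


class MatchingStore:
    """Holds operands until all inputs with the same (instruction, tag) arrive."""

    def __init__(self, arity: Dict[str, int]):
        self.arity = arity
        self.waiting: Dict[Tuple[str, Tag], Dict[int, object]] = {}

    def deliver(self, instr: str, tag: Tag, port: int,
                value: object) -> Optional[List[object]]:
        slots = self.waiting.setdefault((instr, tag), {})
        slots[port] = value
        if len(slots) < self.arity[instr]:
            return None
        del self.waiting[(instr, tag)]
        return [slots[p] for p in range(self.arity[instr])]  # fire


# Usage: tags for an inner loop nested in an outer loop.
outer = push(())                       # (0,)
inner = next_iteration(push(outer))    # (0, 1): outer iteration 0, inner 1
later = push(next_iteration(outer))    # (1, 0): outer iteration 1, inner 0
store = MatchingStore({"add": 2})
print(store.deliver("add", inner, 0, 10))  # None (waiting for port 1)
print(store.deliver("add", later, 1, 20))  # None (different context)
print(store.deliver("add", inner, 1, 5))   # [10, 5]
print(pop(inner))                          # (0,)
```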

2012 Third Workshop on Applications for Multi-Core Architecture | 2012

Scheduling Cyclic Task Graphs with SCC-Map

Alexandre Sardinha; Tiago A. O. Alves; Leandro A. J. Marzulo; Felipe M. G. França; Valmir Carneiro Barbosa; Vítor Santos Costa

The dataflow execution model has been shown to be a good way of exploiting TLP, making parallel programming easier. In this model, tasks must be mapped to processing elements (PEs) considering the trade-off between communication and parallelism. Previous work on scheduling dependency graphs has mostly focused on directed acyclic graphs, which are not suitable for dataflow (loops in the code become cycles in the graph). Thus, we present SCC-Map: a novel static mapping algorithm that considers the importance of cycles during the mapping process. To validate our approach, we ran a set of benchmarks on our dataflow simulator, varying the communication latency, the number of PEs in the system and the placement algorithm. Our results show that the benchmark programs run significantly faster when mapped with SCC-Map. Moreover, we observed that SCC-Map is more effective than the other mapping algorithms when communication latency is higher.
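Since loops become cycles, the strongly connected components (SCCs) of the task graph identify the parts of a program that form cycles. The sketch below finds them with Tarjan's algorithm and then applies a deliberately naive placement: each SCC is kept whole on one PE, largest first, onto the least-loaded PE. The placement rule is a hypothetical illustration only; it is not the SCC-Map heuristic, which the abstract does not spell out, and keeping a whole cycle on one PE trades parallelism for lower communication.

```python
from typing import Dict, List


def strongly_connected_components(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Tarjan's algorithm (recursive, so suited to modest graph sizes)."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack = set()
    sccs: List[List[str]] = []

    def visit(v: str) -> None:
        index[v] = low[v] = len(index)  # discovery order
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def map_sccs(sccs: List[List[str]], num_pes: int) -> Dict[str, int]:
    """Naive placement: each SCC stays on one PE; largest SCCs go first to
    the least-loaded PE."""
    load = [0] * num_pes
    placement: Dict[str, int] = {}
    for comp in sorted(sccs, key=len, reverse=True):
        pe = min(range(num_pes), key=lambda q: load[q])
        for v in comp:
            placement[v] = pe
        load[pe] += len(comp)
    return placement


graph = {
    "init": ["cond"],
    "cond": ["body", "exit"],
    "body": ["inc"],
    "inc": ["cond"],  # back-edge: the loop forms a cycle
}
sccs = strongly_connected_components(graph)
placement = map_sccs(sccs, num_pes=2)
print(sorted(sorted(c) for c in sccs))  # [['body', 'cond', 'inc'], ['exit'], ['init']]
print(placement["cond"] == placement["body"] == placement["inc"])  # True
```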

Symposium on Computer Architecture and High Performance Computing | 2015

Exploiting Parallelism in Linear Algebra Kernels through Dataflow Execution

Brunno F. Goldstein; Felipe M. G. França; Leandro A. J. Marzulo; Tiago A. O. Alves

Linear Algebra Kernels play an important role in many petroleum reservoir simulators, which are extensively used by the industry. The growth in problem size, especially in pre-salt exploration, has increased the execution time of those kernels, thus requiring parallel programming to improve performance and make the simulation viable. On the other hand, exploiting parallelism in systems with an ever-increasing number of cores may be an arduous task, as the programmer has to manage threads and deal with synchronization issues. Current work on parallel programming models shows that Dataflow Execution exploits parallelism in a natural way, allowing the programmer to focus solely on describing dependencies between portions of code. This work consists of implementing parallel Linear Algebra Kernels using the Dataflow model. The Trebuchet Dataflow Virtual Machine and the Sucuri Dataflow Library were used to evaluate the solutions with real inputs from reservoir simulators. Results have been compared with OpenMP and the Intel Math Kernel Library and show that coarser-grained tasks are needed to hide the overheads of dataflow runtime environments. Therefore, level 2 and 3 linear algebra operations, such as Vector-Matrix and Matrix-Matrix products, presented the most promising results.
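The finding that coarser tasks hide runtime overheads can be illustrated with a blocked matrix product, where each task computes one `bs` x `bs` block of the result, so a larger `bs` means fewer, coarser tasks. This is a hypothetical standard-library sketch, not the Trebuchet or Sucuri implementations evaluated in the paper. Because tasks write disjoint blocks of `C`, they need no synchronization, but pure-Python arithmetic in threads will not actually run in parallel under CPython's global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

Matrix = List[List[float]]


def block_task(A: Matrix, B: Matrix, C: Matrix, i0: int, j0: int, bs: int) -> None:
    """One coarse-grained task: fill the block of C starting at (i0, j0)."""
    inner = len(B)
    for i in range(i0, min(i0 + bs, len(A))):
        for j in range(j0, min(j0 + bs, len(B[0]))):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(inner))


def blocked_matmul(A: Matrix, B: Matrix, bs: int = 64, workers: int = 4) -> Matrix:
    """C = A x B, split into independent block tasks (larger bs = coarser tasks)."""
    n, p = len(A), len(B[0])
    C: Matrix = [[0.0] * p for _ in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tasks = [pool.submit(block_task, A, B, C, i0, j0, bs)
                 for i0 in range(0, n, bs) for j0 in range(0, p, bs)]
        for t in tasks:
            t.result()  # re-raise any exception from a task
    return C


print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], bs=1))  # [[19, 22], [43, 50]]
```

With `bs=1` every output element is its own task, the finest (and most overhead-prone) granularity; raising `bs` reduces the number of tasks the runtime must create and schedule.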

Symposium on Computer Architecture and High Performance Computing | 2015

Graph Templates for Dataflow Programming

Alexandre C. Sena; Eduardo S. Vaz; Felipe M. G. França; Leandro A. J. Marzulo; Tiago A. O. Alves

Current work on parallel programming models is trending towards the dataflow paradigm, which naturally exploits parallelism in programs. The Sucuri Python Library provides basic features for the creation and execution of dataflow graphs in parallel environments. However, there is still a gap between dataflow programming and traditional parallel programming. In this paper we aim to narrow that gap by introducing a set of templates for Sucuri that represent some of the most important parallel programming patterns. Through these templates, programmers can implement applications that use patterns such as fork/join, pipeline and wavefront just by instantiating and connecting sub-graph objects. Evaluation showed that the use of templates makes programming easier, while allowing a significant reduction in lines of code compared to manually creating the dataflow graph.
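A template in this sense is a function that builds and wires a whole sub-graph in one call. The sketch below shows a fork/join template over a tiny graph type; `Graph`, `add`, `evaluate` and `fork_join` are hypothetical names, not the API of Sucuri or its templates. The evaluator also runs nodes sequentially, where a dataflow runtime would run the forked branches in parallel.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


class Graph:
    """Tiny hypothetical dataflow graph: node name -> (function, producers)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Tuple[Callable, List[str]]] = {}

    def add(self, name: str, func: Callable, preds: Sequence[str] = ()) -> str:
        self.nodes[name] = (func, list(preds))
        return name

    def evaluate(self, name: str, cache: Optional[Dict[str, object]] = None) -> object:
        """Demand-driven sequential evaluation; each node runs at most once."""
        cache = {} if cache is None else cache
        if name not in cache:
            func, preds = self.nodes[name]
            cache[name] = func(*[self.evaluate(p, cache) for p in preds])
        return cache[name]


def fork_join(g: Graph, prefix: str, source: str,
              workers: Sequence[Callable], join: Callable) -> str:
    """Template: fan `source` out to independent workers, then combine them."""
    branches = [g.add(f"{prefix}.fork{i}", w, [source])
                for i, w in enumerate(workers)]
    return g.add(f"{prefix}.join", join, branches)


# Usage: one template call replaces wiring the fork and join nodes by hand.
g = Graph()
src = g.add("src", lambda: list(range(10)))
halves = [lambda xs, k=k: sum(xs[k::2]) for k in range(2)]
total = fork_join(g, "sum", src, halves, lambda *parts: sum(parts))
print(g.evaluate(total), len(g.nodes))  # 45 4
```

The caller writes a single `fork_join` call regardless of how many workers are passed, which is where the reduction in lines of code comes from.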
