Marcelo Cintra
Intel
Publications
Featured research published by Marcelo Cintra.
International Symposium on Computer Architecture | 2000
Marcelo Cintra; Jose F. Martinez; Josep Torrellas
Speculative parallelization aggressively executes, in parallel, codes that cannot be fully parallelized by the compiler. Past proposals for hardware schemes have mostly focused on single-chip multiprocessors (CMPs), whose effectiveness is necessarily limited by their small size. Very few schemes have attempted this technique in the context of scalable shared-memory systems. In this paper, we present and evaluate a new hardware scheme for scalable speculative parallelization. This design needs relatively simple hardware and is efficiently integrated into a cache-coherent NUMA system. We have designed the scheme in a hierarchical manner that largely abstracts away the internals of the node. We effectively utilize a speculative CMP as the building block for our scheme. Simulations show that the proposed architecture delivers good speedups at a modest hardware cost. For a set of important non-analyzable scientific loops, we report average speedups of 4.2 for 16 processors. We show that our applications require support for per-word speculative state; without it, performance suffers greatly.
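The importance of per-word state can be seen with a small sketch: with per-line tracking, two threads that touch different words of the same memory line appear to conflict (a false violation), while per-word bits disambiguate them. The C++ illustration below is a simplification of that idea only; the bitmask layout and line/word sizes are assumptions, not the paper's hardware design.

```cpp
#include <cstdint>
#include <iostream>

// Illustrative model of speculative access tracking for one memory line.
// WORDS_PER_LINE assumes 64-byte lines with 4-byte words.
constexpr int WORDS_PER_LINE = 16;

struct LineState {
    uint16_t read_bits  = 0;  // one speculative-read bit per word
    uint16_t write_bits = 0;  // one speculative-write bit per word
};

// Per-word check: a later thread's read conflicts with an earlier
// thread's write only if they touch the same word.
bool conflict_per_word(const LineState& earlier, int word_read) {
    return earlier.write_bits & (1u << word_read);
}

// Per-line check: any write to the line by an earlier thread flags a
// violation, even when the threads touch disjoint words.
bool conflict_per_line(const LineState& earlier, int /*word_read*/) {
    return earlier.write_bits != 0;
}

int main() {
    LineState earlier;
    earlier.write_bits = 1u << 3;  // earlier thread wrote word 3
    int later_read = 7;            // later thread reads word 7

    std::cout << "per-word violation: "
              << conflict_per_word(earlier, later_read) << '\n';  // 0
    std::cout << "per-line violation: "
              << conflict_per_line(earlier, later_read) << '\n';  // 1
}
```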
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2003
Marcelo Cintra; Diego R. Llanos
With speculative parallelization, code sections that cannot be fully analyzed by the compiler are aggressively executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and memory system. Software schemes require no extra hardware but can be inefficient. This paper proposes a new software-only speculative parallelization scheme. The scheme is developed after a systematic evaluation of the design options available and is shown to be efficient and robust and to outperform previously proposed schemes. The novelty and performance advantage of the scheme stem from the use of carefully tuned data structures, synchronization policies, and scheduling mechanisms. Experimental results show that our scheme has small overheads and, for applications with few or no data dependence violations, realizes on average 71% of the speedup of a manually parallelized version of the code, outperforming two recently proposed software-only speculative parallelization schemes. For applications with many data dependence violations, our performance monitoring and switching mechanisms can effectively curb the performance degradation.
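As a rough illustration of the commit-time checking such software schemes perform, the sketch below simulates two speculative chunks sequentially: each chunk buffers its writes and logs its reads, and a chunk must be squashed when a logically earlier chunk wrote a location it read. The data structures are deliberately naive stand-ins, not the paper's carefully tuned ones.

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <vector>

// Sequential illustration of commit-time dependence checking in a
// software speculative parallelization scheme. Each chunk buffers its
// writes and logs the locations it read; chunks commit in order.
struct Chunk {
    std::set<std::size_t> read_set;              // locations read
    std::map<std::size_t, double> write_buffer;  // buffered writes
};

// A chunk is squashed if a logically earlier, still-uncommitted chunk
// wrote a location this chunk has read (a cross-thread RAW violation).
bool must_squash(const Chunk& later, const std::vector<Chunk>& earlier) {
    for (const Chunk& e : earlier)
        for (const auto& [loc, val] : e.write_buffer)
            if (later.read_set.count(loc)) return true;
    return false;
}

int main() {
    std::vector<double> data(10, 1.0);

    Chunk c0, c1;
    c0.write_buffer[5] = 2.0;  // chunk 0 speculatively writes data[5]
    c1.read_set.insert(5);     // chunk 1 speculatively reads data[5]

    // Chunk 0 is the earliest speculative chunk: commit unconditionally.
    for (const auto& [loc, val] : c0.write_buffer) data[loc] = val;

    // Chunk 1 read a location chunk 0 wrote, so it must re-execute.
    std::cout << "squash chunk 1: " << must_squash(c1, {c0}) << '\n';  // 1
}
```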
High-Performance Computer Architecture | 2002
Marcelo Cintra; Josep Torrellas
With speculative thread-level parallelization, codes that cannot be fully compiler-analyzed are aggressively executed in parallel. If the hardware detects a cross-thread dependence violation, it squashes offending threads and resumes execution. Unfortunately, frequent squashing cripples performance. This paper proposes a new framework of hardware mechanisms to eliminate most squashes due to data dependences in multiprocessors. The framework works by learning and predicting violations, and applying delayed disambiguation, value prediction, and stall-and-release. The framework is suited for directory-based multiprocessors that track memory accesses at the system level with the coarse granularity of memory lines. Simulations of a 16-processor machine show that the framework is very effective. By adding our framework to a speculative CC-NUMA with 64-byte memory lines, we speed up applications by an average of 4.3 times. Moreover, the resulting system is even 23% faster than a machine that tracks memory accesses at the fine granularity of words, a sophisticated system that is not compatible with mainstream cache coherence protocols.
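A minimal sketch of the learning-and-predicting idea, assuming a simple squash counter per load PC: once a load has caused enough squashes, it is predicted to violate again and is stalled until the producer releases the value, rather than being allowed to trigger yet another squash. The table structure and threshold below are illustrative, not the paper's mechanism.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Illustrative violation predictor: loads whose PC has caused squashes
// before are predicted to violate again; such loads are stalled until
// the producing thread releases the value instead of being squashed.
class ViolationPredictor {
    std::unordered_map<uint64_t, int> squash_count_;  // load PC -> squashes
    static constexpr int kThreshold = 2;              // assumed threshold
public:
    void record_squash(uint64_t pc) { ++squash_count_[pc]; }
    bool predict_violation(uint64_t pc) const {
        auto it = squash_count_.find(pc);
        return it != squash_count_.end() && it->second >= kThreshold;
    }
};

int main() {
    ViolationPredictor vp;
    const uint64_t pc = 0x400a10;  // hypothetical load instruction address

    vp.record_squash(pc);
    vp.record_squash(pc);  // second squash crosses the threshold

    if (vp.predict_violation(pc))
        std::cout << "stall this load until the producer releases\n";
}
```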
International Conference on Data Engineering | 2010
Konstantinos Krikellas; Stratis D. Viglas; Marcelo Cintra
We present the application of customized code generation to database query evaluation. The idea is to use a collection of highly efficient code templates and dynamically instantiate them to create query- and hardware-specific source code. The source code is compiled and dynamically linked to the database server for processing. Code generation diminishes the bloat of higher-level programming abstractions necessary for implementing generic, interpreted, SQL query engines. At the same time, the generated code is customized for the hardware it will run on. We term this approach holistic query evaluation. We present the design and development of a prototype system called HIQUE, the Holistic Integrated Query Engine, which incorporates our proposals. We undertake a detailed experimental study of the system's performance. The results show that HIQUE satisfies its design objectives, while its efficiency surpasses that of both well-established and currently emerging query processing techniques.
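The code below sketches the template-instantiation step under assumed placeholder names: query-specific fragments (a column type and a predicate) are spliced into a scan template to produce specialized source code, which a HIQUE-style system would then compile and dynamically link into the server. The template text is illustrative, not taken from HIQUE.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Illustrative template instantiation in the spirit of holistic query
// evaluation: splice query-specific fragments into a scan template,
// producing specialized source code with no interpretation overhead.
// The template text and placeholder names are assumptions.
std::string instantiate_scan(const std::string& col_type,
                             const std::string& predicate) {
    std::string tmpl =
        "extern \"C\" long scan(const {T}* col, long n) {\n"
        "    long matches = 0;\n"
        "    for (long i = 0; i < n; ++i)\n"
        "        if ({PRED}) ++matches;\n"
        "    return matches;\n"
        "}\n";
    auto fill = [&](const std::string& key, const std::string& val) {
        for (std::size_t p; (p = tmpl.find(key)) != std::string::npos; )
            tmpl.replace(p, key.size(), val);
    };
    fill("{T}", col_type);
    fill("{PRED}", predicate);
    return tmpl;
}

int main() {
    // Specialize the template for "SELECT COUNT(*) ... WHERE col > 42".
    std::ofstream("scan_q1.cpp") << instantiate_scan("int", "col[i] > 42");
    // Next steps (elided): compile scan_q1.cpp into a shared object,
    // dlopen() it from the server, and call scan() on the column data.
    std::cout << "generated scan_q1.cpp\n";
}
```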
IEEE Transactions on Parallel and Distributed Systems | 2005
Marcelo Cintra; Diego R. Llanos
With speculative parallelization, code sections that cannot be fully analyzed by the compiler are optimistically executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and/or memory system. Software schemes require no changes to the hardware of existing shared-memory systems, but can suffer from significant overheads involved in speculative execution. In fact, the performance of software schemes is highly dependent on application characteristics, the design and implementation of the scheme, and the system configuration and size. This paper explores the design space of a recently proposed software speculative parallelization scheme. In the process, we gain insight into the most beneficial features of software schemes for speculative parallelization, as well as the most influential application characteristics. For instance, experimental results show that, contrary to intuition, checking for data dependence violations on every speculative store, as opposed to at commit time, leads to little performance degradation in the worst case and to significantly better performance with large configurations. Also, scheduling policies based on windows can perform very close to fully dynamic policies with a fraction of the memory overhead. Finally, experimental results show consistent speedups in the execution of loops that cannot be parallelized at compile time, both with and without RAW data dependences, for 4 to 32 processors.
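A window-based scheduling policy can be sketched in a few lines: at most W consecutive chunks are in flight at any time, so the speculative state that must be kept (version copies, access logs) is bounded by W rather than by the loop size, and the window slides forward as the oldest chunk commits. The window size below is an arbitrary assumption.

```cpp
#include <iostream>

// Illustrative sliding-window scheduler for speculative chunks: at most
// W consecutive chunks execute concurrently, bounding the speculative
// state to W slots. W = 4 is an assumed value, not one from the paper.
int main() {
    constexpr int W = 4;           // window size
    constexpr int num_chunks = 10;

    int oldest_uncommitted = 0;    // window base (non-speculative chunk)
    int next_to_start = 0;

    while (oldest_uncommitted < num_chunks) {
        // Start every chunk that fits inside the current window.
        while (next_to_start < num_chunks &&
               next_to_start < oldest_uncommitted + W) {
            std::cout << "start chunk " << next_to_start << '\n';
            ++next_to_start;
        }
        // Committing the oldest chunk slides the window forward,
        // freeing its slot of speculative state for reuse.
        std::cout << "commit chunk " << oldest_uncommitted << '\n';
        ++oldest_uncommitted;
    }
}
```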
High-Performance Computer Architecture | 2008
Christian Fensch; Marcelo Cintra
The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. This paper proposes a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware-maintained cache coherence. The proposed mechanism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. It allows only some controlled migration and replication of data and provides a sufficient degree of flexibility in the mapping through an extra level of indirection between virtual pages and physical tiles. We evaluate the proposed tiled CMP architecture on the Splash-2 scientific benchmarks and ALPBench multimedia benchmarks against one with private caches and a distributed directory cache coherence mechanism. Experimental results show that the performance degradation is as little as 0% in the best case and 16% on average compared to the cache-coherent architecture across all benchmarks for 16 and 32 processors.
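The extra level of indirection can be sketched as a per-page home-tile table maintained by the OS: every access to a page is serviced by its home tile's cache slice, locally or through a remote cache access, so a line is only ever cached in one place and no coherence protocol is needed. The structure and page size below are assumptions for illustration.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Illustrative page-to-tile indirection: the OS assigns each virtual
// page a home tile, and every access to the page is serviced by that
// tile's cache slice. Because a line lives in exactly one cache, no
// coherence protocol is required.
constexpr uint64_t PAGE_SHIFT = 12;  // assumed 4 KiB pages

struct PageMap {
    std::unordered_map<uint64_t, int> home_tile;  // vpage -> tile id
    int tile_of(uint64_t vaddr) const {
        return home_tile.at(vaddr >> PAGE_SHIFT);
    }
};

int main() {
    PageMap pm;
    uint64_t vaddr = 0x1200;                // falls in virtual page 1
    pm.home_tile[vaddr >> PAGE_SHIFT] = 3;  // OS places this page on tile 3

    int requester = 0;                      // tile issuing the access
    int home = pm.tile_of(vaddr);
    if (home == requester)
        std::cout << "service access in the local cache slice\n";
    else
        std::cout << "remote cache access to tile " << home << '\n';
}
```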
International Conference on Parallel Architectures and Compilation Techniques | 2007
Salman Khan; Polychronis Xekalakis; John Cavazos; Marcelo Cintra
The vast number of transistors available through modern fabrication technology gives architects an unprecedented amount of freedom in chip-multiprocessor (CMP) designs. However, such freedom translates into a design space that is impossible to explore fully, or even to any significant fraction, through detailed simulation. In this paper we propose to address this problem using predictive modeling, a well-known machine learning technique. More specifically, we build models that, given only a minute fraction of the design space, are able to accurately predict the behavior of the remaining designs orders of magnitude faster than simulating them. In contrast to previous work, our models can predict performance metrics not only for unseen CMP configurations for a given application, but also for unseen configurations of a new application that was not in the set of applications used to build the model, given only a very small number of results for this new application. We perform extensive experiments to show the efficacy of the technique for exploring the design space of CMPs running parallel applications. The technique is used to predict both energy-delay and execution time. Choosing both explicitly parallel applications and applications that are parallelized using the thread-level speculation (TLS) approach, we evaluate performance on a CMP design space with about 95 million points using 18 benchmarks with up to 1000 training points each. For predicting the energy-delay metric, prediction errors for unseen configurations of the same application range from 2.4% to 4.6% and for configurations of new applications from 3.1% to 4.9%.
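As a stand-in for the paper's trained models, the sketch below shows the sample-then-predict workflow with a much simpler distance-weighted nearest-neighbor estimator: a minute fraction of configurations is simulated in detail, and the metric for unseen configurations is estimated from those samples. The configuration encoding and numbers are invented for illustration.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Stand-in for predictive design-space modeling: train on a small
// sample of simulated (configuration -> metric) points, then estimate
// unseen configurations with distance-weighted nearest neighbors. The
// paper uses trained regression models; this simpler predictor only
// illustrates the sample-then-predict workflow.
struct Sample {
    std::vector<double> config;  // e.g. {cores, cache_kb, freq_ghz}
    double metric;               // e.g. measured energy-delay
};

double predict(const std::vector<Sample>& train,
               const std::vector<double>& cfg) {
    double wsum = 0.0, sum = 0.0;
    for (const Sample& s : train) {
        double d2 = 1e-9;  // avoid division by zero on exact matches
        for (std::size_t i = 0; i < cfg.size(); ++i)
            d2 += (s.config[i] - cfg[i]) * (s.config[i] - cfg[i]);
        double w = 1.0 / d2;
        wsum += w;
        sum += w * s.metric;
    }
    return sum / wsum;
}

int main() {
    // A minute fraction of the space is simulated in detail...
    std::vector<Sample> train = {
        {{4, 512, 2.0}, 1.8}, {{8, 1024, 2.0}, 1.2}, {{16, 2048, 2.5}, 0.9},
    };
    // ...and the remaining designs are estimated instead of simulated.
    std::cout << "predicted metric: "
              << predict(train, {8, 2048, 2.5}) << '\n';
}
```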
Very Large Data Bases | 2015
Andreas Chatzistergiou; Marcelo Cintra; Stratis D. Viglas
Recent non-volatile memory (NVM) technologies, such as PCM, STT-MRAM and ReRAM, can act as both main memory and storage. This has led to research into NVM programming models, where persistent data structures remain in memory and are accessed directly through CPU loads and stores. Existing mechanisms for transactional updates are not appropriate in such a setting as they are optimized for block-based storage. We present REWIND, a user-mode library approach to managing transactional updates directly from user code written in an imperative general-purpose language. REWIND relies on a custom persistent in-memory data structure for the log that supports recoverable operations on itself. The scheme also employs a combination of non-temporal updates, persistent memory fences, and lightweight logging. Experimental results on synthetic transactional workloads and TPC-C show the overhead of REWIND compared to its non-recoverable equivalent to be within factors of only 1.5 and 1.39, respectively. Moreover, REWIND outperforms state-of-the-art approaches for data structure recoverability as well as general purpose and NVM-aware DBMS-based recovery schemes by up to two orders of magnitude.
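REWIND's actual log layout is not reproduced here, but the ordering discipline such persistent in-memory logging relies on can be sketched with x86 cache-line flushes and store fences: make the record durable first, then publish it by updating a durable tail pointer. The structure below is a simplified assumption; on real NVM hardware, further steps to reach the persistence domain may apply.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // _mm_clflush, _mm_sfence (x86)

// Sketch of the flush-and-fence ordering a persistent in-memory log
// depends on: a record only becomes recoverable once it is durable AND
// the published tail covers it. Layout and names are illustrative, not
// REWIND's actual log structure.
constexpr std::size_t LOG_BYTES = 4096;

struct PersistentLog {
    alignas(64) char buf[LOG_BYTES];
    std::atomic<uint64_t> tail{0};  // bytes of durable, published log
};

void flush_range(const void* p, std::size_t len) {
    auto addr = reinterpret_cast<uintptr_t>(p);
    for (uintptr_t a = addr & ~uintptr_t{63}; a < addr + len; a += 64)
        _mm_clflush(reinterpret_cast<const void*>(a));
}

void log_append(PersistentLog& log, const void* rec, std::size_t len) {
    uint64_t off = log.tail.load(std::memory_order_relaxed);

    // 1. Write the record body and force it out of the cache.
    std::memcpy(log.buf + off, rec, len);
    flush_range(log.buf + off, len);
    _mm_sfence();  // record must be durable before it is published

    // 2. Publish: bump the tail, then make the new tail durable too.
    log.tail.store(off + len, std::memory_order_release);
    flush_range(&log.tail, sizeof(log.tail));
    _mm_sfence();
}

int main() {
    static PersistentLog log;
    const char rec[] = "begin tx 42";
    log_append(log, rec, sizeof(rec));
}
```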
International Conference on Supercomputing | 2009
Polychronis Xekalakis; Nikolas Ioannou; Marcelo Cintra
With the current trend toward multicore architectures, improved execution performance can no longer be obtained via traditional single-thread instruction level parallelism (ILP), but, instead, via multithreaded execution. Generating thread-parallel programs is hard, and thread-level speculation (TLS) has been suggested as an execution model that can speculatively exploit thread-level parallelism (TLP) even when thread independence cannot be guaranteed by the programmer/compiler. Alternatively, the helper threads (HT) execution model has been proposed, where subordinate threads are executed in parallel with a main thread in order to improve the execution efficiency (i.e., ILP) of the latter. Yet another execution model, runahead execution (RA), has also been proposed, where subordinate versions of the main thread are dynamically created especially to cope with long-latency operations, again with the aim of improving the execution efficiency of the main thread. Each one of these multithreaded execution models works best for different applications and application phases. In this paper we combine these three models into a single execution model and single hardware infrastructure such that the system can dynamically adapt to find the most appropriate multithreaded execution model. More specifically, TLS is favored whenever successful parallel execution of instructions in multiple threads (i.e., TLP) is possible, and the system can seamlessly transition at run-time to the other models otherwise. In order to understand the tradeoffs involved, we also develop a performance model that allows one to quantitatively attribute overall performance gains to either TLP or ILP in such a combined multithreaded execution model. Experimental results show that our unified execution model achieves speedups of up to 41.2%, with an average of 10.2%, over an existing state-of-the-art TLS system and speedups of up to 35.2%, with an average of 18.3%, over a flavor of runahead execution for a subset of the SPEC2000 Int benchmark suite.
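Very roughly, the run-time choice among the three models could be pictured as a selector that compares recent TLP benefit (committed speculative work) against ILP benefit from subordinate threads, as in the hedged sketch below; the signals and thresholds are invented for illustration and do not reflect the paper's hardware policy.

```cpp
#include <iostream>

// Rough illustration of run-time selection among TLS, helper-thread,
// and runahead execution: periodically compare how much recent benefit
// came from thread-level parallelism (committed speculative work)
// versus from improved single-thread ILP (e.g. prefetch hits supplied
// by subordinate threads). Signals and thresholds are assumptions.
enum class Model { TLS, HelperThreads, Runahead };

Model choose_model(double commit_rate,    // fraction of TLS work committed
                   double prefetch_gain)  // ILP benefit from subordinates
{
    if (commit_rate > 0.5) return Model::TLS;  // TLP is paying off
    if (prefetch_gain > 0.2) return Model::HelperThreads;
    return Model::Runahead;  // fall back to hiding long-latency misses
}

int main() {
    // Squash-prone phase: little committed TLS work, useful prefetching.
    Model m = choose_model(0.1, 0.4);
    std::cout << "selected model: " << static_cast<int>(m) << '\n';  // 1
}
```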
International Conference on Parallel Architectures and Compilation Techniques | 2011
Nikolas Ioannou; Michael Kauschke; Matthias Gries; Marcelo Cintra
To improve energy efficiency, processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consumption within a performance window, taking into consideration not only the local information for cores within frequency domains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach provides significant improvement of the Energy Delay Product (EDP) of as much as 27.2%, and 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.
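The client side of such a scheme might look like the sketch below: phases are keyed by call site, and re-entering a phase that was classified as communication-bound triggers a request for a lower frequency on the core's voltage/frequency domain. The controller interface, frequency levels, and classification are all assumptions, not the SCC framework's API.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Sketch of application-driven phase prediction feeding a DVFS
// controller: phases are keyed by call site; once a phase is known to
// be communication-bound, re-entering it triggers a request for a
// lower frequency on the core's domain.
struct PhaseTable {
    std::unordered_map<uint64_t, bool> comm_bound;  // site -> classification
};

void request_frequency(int domain, int mhz) {  // hypothetical controller RPC
    std::cout << "domain " << domain << " -> " << mhz << " MHz\n";
}

void enter_phase(PhaseTable& pt, uint64_t site, int domain) {
    auto it = pt.comm_bound.find(site);
    if (it != pt.comm_bound.end() && it->second)
        request_frequency(domain, 533);  // slow down: cores mostly wait
    else
        request_frequency(domain, 800);  // compute phase: full speed
}

int main() {
    PhaseTable pt;
    pt.comm_bound[0xBEEF] = true;  // learned: this site starts a comm phase
    enter_phase(pt, 0xBEEF, /*domain=*/2);
    enter_phase(pt, 0xCAFE, /*domain=*/2);  // unseen phase: stay fast
}
```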