Marco Elver
University of Edinburgh
Publications
Featured research published by Marco Elver.
High-Performance Computer Architecture | 2014
Marco Elver; Vijay Nagarajan
Traditional directory coherence protocols are designed for the strictest consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that support relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually this comes at the cost of scalability (in terms of per-core storage), which poses a problem with the increasing number of cores in today's CMPs, most of which are no longer sequentially consistent. Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and the existing parallel programs written for these architectures, we propose TSO-CC, a cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers; instead, it relies on self-invalidation and detection of potential acquires using timestamps to satisfy the TSO memory consistency model lazily. Our results show that TSO-CC achieves average performance comparable to a MESI directory protocol, while TSO-CC's storage overhead per cache line scales logarithmically with increasing core count.
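As a rough illustration of the lazy mechanism described above, the following Python sketch models an L1 cache that tags lines with the writing core's timestamp and self-invalidates shared data when a newer timestamp signals a potential acquire. The class structure, the bounded-reuse counter, and all names are assumptions made for illustration, not TSO-CC's actual implementation.

```python
# Minimal, illustrative sketch of timestamp-driven lazy self-invalidation.
# Policies and parameters (e.g. max_shared_accesses) are invented here.

class L1Cache:
    def __init__(self, max_shared_accesses=4):
        self.lines = {}             # addr -> (data, writer, ts, accesses)
        self.last_seen_ts = {}      # writer -> newest timestamp observed
        self.max_shared_accesses = max_shared_accesses

    def fill_from_l2(self, addr, data, writer, ts):
        """Install a line fetched from the shared L2 (which tracks writer/ts)."""
        # A timestamp newer than anything previously seen from this writer
        # signals a potential acquire: conservatively self-invalidate all
        # shared lines so later reads re-fetch up-to-date data.
        if ts > self.last_seen_ts.get(writer, -1):
            self.last_seen_ts[writer] = ts
            self.lines.clear()
        self.lines[addr] = (data, writer, ts, 0)

    def read(self, addr):
        """Return cached data, or None to signal a miss (re-fetch from L2)."""
        entry = self.lines.get(addr)
        if entry is None:
            return None
        data, writer, ts, accesses = entry
        if accesses + 1 >= self.max_shared_accesses:
            # Bounded reuse of shared data without tracking sharers:
            # periodically self-invalidate so stale values cannot persist.
            del self.lines[addr]
        else:
            self.lines[addr] = (data, writer, ts, accesses + 1)
        return data
```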
Frontiers in Neuroinformatics | 2013
Jean-Luc Stevens; Marco Elver; James A. Bednar
Lancet is a new, simulator-independent Python utility for succinctly specifying, launching, and collating results from large batches of interrelated computationally demanding program runs. This paper demonstrates how to combine Lancet with IPython Notebook to provide a flexible, lightweight, and agile workflow for fully reproducible scientific research. This informal and pragmatic approach uses IPython Notebook to capture the steps in a scientific computation as it is gradually automated and made ready for publication, without mandating the use of any separate application that can constrain scientific exploration and innovation. The resulting notebook concisely records each step involved in even very complex computational processes that led to a particular figure or numerical result, allowing the complete chain of events to be replicated automatically. Lancet was originally designed to help solve problems in computational neuroscience, such as analyzing the sensitivity of a complex simulation to various parameters, or collecting the results from multiple runs with different random starting points. However, because it is never possible to know in advance what tools might be required in future tasks, Lancet has been designed to be completely general, supporting any type of program as long as it can be launched as a process and can return output in the form of files. For instance, Lancet is also heavily used by one of the authors in a separate research group for launching batches of microprocessor simulations. This general design will allow Lancet to continue supporting a given research project even as the underlying approaches and tools change.
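The sketch below mimics the kind of declarative batch specification and launching the paper describes, using only the Python standard library. It does not use Lancet's actual API; the command-line convention, directory layout, and simulator binary name are assumptions made purely for illustration.

```python
# Illustrative workflow sketch (not Lancet's API): specify a parameter space,
# launch one process per parameter set, and record each run's metadata.
import itertools, json, pathlib, subprocess

def parameter_space(**axes):
    """Cartesian product of named parameter axes -> list of parameter dicts."""
    names = list(axes)
    return [dict(zip(names, values))
            for values in itertools.product(*(axes[n] for n in names))]

def launch(command, runs, outdir="results"):
    """Run one process per parameter set, keeping its parameters alongside it."""
    out = pathlib.Path(outdir)
    out.mkdir(exist_ok=True)
    for i, params in enumerate(runs):
        run_dir = out / f"run_{i:04d}"
        run_dir.mkdir(exist_ok=True)
        (run_dir / "params.json").write_text(json.dumps(params))
        args = [command] + [f"--{k}={v}" for k, v in params.items()]
        subprocess.run(args, cwd=run_dir, check=True)

# Example: 3 x 2 = 6 interrelated runs of a hypothetical simulator binary.
runs = parameter_space(learning_rate=[0.1, 0.2, 0.3], seed=[1, 2])
# launch("./my_simulator", runs)   # uncomment with a real executable
```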
International Conference on Parallel Architectures and Compilation Techniques | 2015
Marco Elver; Vijay Nagarajan
The recent convergence towards programming-language-based memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. In effect, such protocols do not require sharer tracking, which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86-64) cannot readily make use of lazy coherence protocols without either changing the architecture's consistency model to (a variant of) RC at the expense of backwards compatibility, or adapting the protocol to satisfy the stricter consistency model, thereby failing to benefit from synchronization information. We show an approach for the x86-64 architecture which is a compromise between the two. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backwards compatibility with legacy codes and older microarchitectures. Second, we propose RC3, a scalable hardware cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers, and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC-optimized programs. RC3's storage overhead per cache line scales logarithmically with increasing core count, and reduces on-chip coherence storage overheads by 45% compared to a related approach specifically targeting TSO.
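For intuition, here is a hand-wavy sketch of acquire-driven self-invalidation in which per-L1 timestamps let a core skip invalidations for releases it has already synchronized with. The class and method names are invented, and this is not RC3's protocol state machine.

```python
# Illustrative sketch: per-L1 timestamps make repeated acquires of the same
# release cheap by skipping redundant self-invalidations.

class LazyL1:
    def __init__(self, core_id, num_cores):
        self.core_id = core_id
        self.ts = 0                           # this L1's logical timestamp
        self.synced_upto = [0] * num_cores    # newest ts already synchronized with
        self.shared_lines = {}                # addr -> data (sharers untracked)

    def release(self):
        """A release (conveyed via the proposed ISA hint) bumps our timestamp."""
        self.ts += 1
        return self.core_id, self.ts

    def acquire(self, releaser_id, releaser_ts):
        """Self-invalidate only if this release has not been seen before."""
        if releaser_ts > self.synced_upto[releaser_id]:
            self.synced_upto[releaser_id] = releaser_ts
            self.shared_lines.clear()   # lazy coherence: drop possibly-stale data
        # Otherwise the self-invalidation is redundant and skipped.

# Core 1 acquires after core 0's release; repeating the same acquire is free.
c0, c1 = LazyL1(0, 2), LazyL1(1, 2)
c1.shared_lines = {"X": 1}
rel = c0.release()
c1.acquire(*rel)          # clears c1's shared lines
c1.acquire(*rel)          # redundant: no further invalidation
```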
High-Performance Computer Architecture | 2016
Vijay Nagarajan; Marco Elver
The memory consistency model (MCM), which formally specifies the behaviour of the memory system, is used by programmers to reason about parallel programs. It is imperative that hardware adheres to the promised MCM. For this reason, hardware designs must be verified against the specified MCM. One common way to do this is via executing tests, where specific threads of instruction sequences are generated and their executions are checked for adherence to the MCM. It would be extremely beneficial to execute such tests under simulation, i.e. when the functional design implementation of the hardware is being prototyped. Most prior verification methodologies, however, target post-silicon environments, which when applied under simulation would be too slow. We propose McVerSi, a test generation framework for fast MCM verification of a full-system design implementation under simulation. Our primary contribution is a Genetic Programming (GP) based approach to MCM test generation, which relies on a novel crossover function that prioritizes memory operations contributing to non-determinism, thereby increasing the probability of uncovering MCM bugs. To guide tests towards exercising as much logic as possible, the simulator's reported coverage is used as the fitness function. Furthermore, we increase test throughput by making the test workload simulation-aware. We evaluate our proposed framework using the gem5 cycle-accurate simulator in full-system mode with Ruby. We discover 2 new bugs due to the faulty interaction of the pipeline and the cache coherence protocol. Crucially, these bugs would not have been discovered through individual verification of the pipeline or the coherence protocol. We study 11 bugs in total. Our GP-based test generation approach finds all bugs consistently, therefore providing much higher guarantees compared to alternative approaches (pseudo-random test generation and litmus tests).
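The simplified genetic-programming loop below shows the shape of this approach: a fitness function selects tests, and crossover is biased towards memory operations likely to contribute non-determinism. The weighting heuristic and the toy fitness proxy are stand-ins, not McVerSi's implementation; a real fitness function would run each test under the simulator and return its reported coverage.

```python
# Sketch of coverage-guided GP test generation for memory-consistency testing.
import random

def random_op(addrs=("A", "B", "C"), rng=random):
    return (rng.choice(["LD", "ST"]), rng.choice(addrs))

def random_test(length=8):
    return [random_op() for _ in range(length)]

def nondet_weight(op, test):
    """Ops racing on shared addresses contribute more non-determinism."""
    kind, addr = op
    contenders = sum(1 for _, a in test if a == addr)
    return contenders + (2 if kind == "ST" else 1)

def crossover(parent_a, parent_b):
    """Bias the child towards operations that are likely to expose races."""
    child = []
    for op_a, op_b in zip(parent_a, parent_b):
        w_a, w_b = nondet_weight(op_a, parent_a), nondet_weight(op_b, parent_b)
        child.append(random.choices([op_a, op_b], weights=[w_a, w_b])[0])
    return child

def evolve(fitness, generations=50, pop_size=20):
    population = [random_test() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = [crossover(random.choice(survivors), random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness proxy rewarding address contention; a real one would report
# simulator coverage for the executed test.
best = evolve(lambda t: sum(nondet_weight(op, t) for op in t))
```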
International Symposium on Microarchitecture | 2016
Cheng-Chieh Huang; Rakesh Kumar; Marco Elver; Boris Grot; Vijayanand Nagarajan
Massive datasets prevalent in scale-out, enterprise, and high-performance computing are driving a trend toward ever-larger memory capacities per node. To satisfy the memory demands and maximize performance per unit cost, today's commodity HPC and server nodes tend to feature multi-socket shared-memory NUMA organizations. An important problem in these designs is the high latency of accessing memory on a remote socket, which results in degraded performance in workloads with large shared data working sets. This work shows that emerging DRAM caches can help mitigate the NUMA bottleneck by filtering up to 98% of remote memory accesses. To be effective, these DRAM caches must be private to each socket to allow caching of remote memory, which comes with the challenge of ensuring coherence across multiple sockets and GBs of DRAM cache capacity. Moreover, the high access latency of DRAM caches, combined with high inter-socket communication latencies, can make hits to remote DRAM caches slower than main memory accesses. These features challenge existing coherence protocols optimized for on-chip caches with fast hits and modest storage capacity. Our solution to these challenges relies on two insights. First, keeping DRAM caches clean avoids the need to ever access a remote DRAM cache on a read. Second, a non-inclusive on-chip directory that avoids tracking blocks in the DRAM cache enables a lightweight protocol for guaranteeing coherence without the staggering directory costs. Our design, called Clean Coherent DRAM Caches (C3D), leverages these insights to improve performance by 6.4-50.7% in a quad-socket system versus a baseline without DRAM caches.
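A toy read-path sketch of the first insight: with clean DRAM caches, a miss is always served by home memory and never probes a remote socket's DRAM cache. The latency values and function names here are invented placeholders, not figures from the paper.

```python
# Illustrative read path under the clean-DRAM-cache assumption.
LOCAL_HIT, LOCAL_MEM, REMOTE_MEM = 50, 100, 200   # placeholder latencies (ns)

def read(addr, local_dram_cache, home_socket, this_socket):
    if addr in local_dram_cache:
        return ("local DRAM cache hit", LOCAL_HIT)
    # Clean caches guarantee that memory holds up-to-date data, so the miss
    # goes to the home memory and skips any (slow) remote DRAM cache probe.
    latency = LOCAL_MEM if home_socket == this_socket else REMOTE_MEM
    local_dram_cache.add(addr)   # fill locally to filter future remote accesses
    return ("home memory", latency)

cache = set()
print(read(0x40, cache, home_socket=1, this_socket=0))   # miss -> remote memory
print(read(0x40, cache, home_socket=1, this_socket=0))   # now filtered locally
```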
Programming Language Design and Implementation | 2013
Bharghava Rajaram; Vijay Nagarajan; Susmit Sarkar; Marco Elver
Read-Modify-Write (RMW) instructions are widely used as the building blocks of a variety of higher-level synchronization constructs, including locks, barriers, and lock-free data structures. Unfortunately, they are expensive in architectures such as x86 and SPARC which enforce (variants of) Total Store Order (TSO). A key reason is that RMWs in these architectures are ordered like a memory barrier, incurring the cost of a write-buffer drain in the critical path. Such strong ordering semantics are dictated by the requirements of the strict atomicity definition (type-1) that existing TSO RMWs use. Programmers often do not need such strong semantics. Besides, weakening the atomicity definition of TSO RMWs would also weaken their ordering -- thereby leading to more efficient hardware implementations. In this paper we argue for TSO RMWs to use weaker atomicity definitions -- we consider two weaker definitions, type-2 and type-3, each with a different degree of ordering relaxation. We formally specify how such weaker RMWs would be ordered, and show that type-2 RMWs, in particular, can seamlessly replace existing type-1 RMWs in common synchronization idioms -- except in situations where a type-1 RMW is used as a memory barrier. Recent work has shown that the new C/C++11 concurrency model can be realized by generating conventional (type-1) RMWs for C/C++11 SC-atomic-writes and/or SC-atomic-reads. We formally prove that this is equally valid using the proposed type-2 RMWs; type-3 RMWs, on the other hand, could be used for SC-atomic-reads (and optionally SC-atomic-writes). We further propose efficient microarchitectural implementations for type-2 (type-3) RMWs -- simulation results show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%), which translates into an overall performance improvement of up to 9.0% (9.2%) on a set of parallel programs, including those from the SPLASH-2, PARSEC, and STAMP benchmarks.
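To make the cost argument concrete, the toy store-buffer model below contrasts a conventional RMW, which must drain the write buffer like a memory barrier, with a relaxed variant that does not. It is only a conceptual sketch: the paper's precise atomicity definitions and the ordering subtleties they entail (coherence, forwarding rules, multi-core interactions) are deliberately elided.

```python
# Toy single-core TSO store-buffer model illustrating why a conventional
# (type-1-style) RMW is expensive and what a relaxed variant avoids.

class Core:
    def __init__(self):
        self.memory = {}
        self.store_buffer = []                 # FIFO of pending (addr, value)

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def drain(self):
        while self.store_buffer:
            addr, value = self.store_buffer.pop(0)
            self.memory[addr] = value

    def rmw_strict(self, addr, update):
        self.drain()                           # acts like a memory barrier
        old = self.memory.get(addr, 0)
        self.memory[addr] = update(old)
        return old

    def rmw_relaxed(self, addr, update):
        # Relaxed sketch: read the newest value (forwarded from the store
        # buffer if present) and buffer the updated value, without waiting
        # for earlier stores to other addresses to drain.
        buffered = [v for a, v in self.store_buffer if a == addr]
        old = buffered[-1] if buffered else self.memory.get(addr, 0)
        self.store(addr, update(old))
        return old

core = Core()
core.store("flag", 1)                          # still sitting in the store buffer
core.rmw_relaxed("count", lambda v: v + 1)     # no drain needed
core.rmw_strict("count", lambda v: v + 1)      # drains "flag" first
```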
Formal Methods in Computer-Aided Design | 2017
Christopher J. Banks; Marco Elver; Ruth Hoffmann; Susmit Sarkar; Paul B. Jackson; Vijay Nagarajan
In this paper, we verify a modern lazy cache coherence protocol, TSO-CC, against the memory consistency model it was designed for, TSO. We achieve this by first showing a weak simulation relation between TSO-CC (with a fixed number of processors) and a novel finite-state operational model which exhibits the laziness of TSO-CC and satisfies TSO. We then extend this by an existing parameterisation technique, allowing verification for an unbounded number of processors. The approach is carried out entirely within a model checker; no external tool is needed, and very little in-depth knowledge of formal verification methods is required of the verifier.
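For intuition only, the snippet below checks a (strong) simulation relation between two tiny labelled transition systems. The paper establishes a weak simulation inside a model checker, so this is just a conceptual sketch of the underlying relation; the example systems and names are invented.

```python
# Conceptual check of a simulation relation between two finite transition
# systems, each given as state -> list of (label, successor).
def simulates(spec, impl, relation):
    """Whenever (s_impl, s_spec) is related and impl can step with label a,
    spec must be able to match that label into a still-related state."""
    for s_impl, s_spec in relation:
        for label, s_next in impl.get(s_impl, []):
            matched = any((s_next, t_next) in relation
                          for l, t_next in spec.get(s_spec, []) if l == label)
            if not matched:
                return False
    return True

# Hypothetical two-state systems: impl refines spec on label "wr".
impl = {"i0": [("wr", "i1")], "i1": []}
spec = {"s0": [("wr", "s1"), ("rd", "s0")], "s1": []}
print(simulates(spec, impl, {("i0", "s0"), ("i1", "s1")}))   # True
```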
BMC Neuroscience | 2013
Jean-Luc Stevens; Marco Elver; James A. Bednar
Lancet [1] is a new, simulator-independent utility for succinctly specifying, launching, and collating results from large batches of interrelated simulations. Neural simulations require significant time and computational resources, particularly when exploring the large parameter spaces involved. Simulators rarely provide specific, comprehensive support for launching and collecting results across batch runs, and so the process of going from idea to publishable results typically involves an ad-hoc set of manual practices and/or one-off shell scripts. This informal process can be difficult to replicate later, because information about each of the processing steps is lost over time. Here we demonstrate how Lancet can be used together with IPython Notebook [2] to provide a fully automated and fully reproducible workflow for neural simulations and similar batch-computing tasks. This workflow covers specifying what simulations are to be launched, storing metadata about each simulation run, collating the resulting output files, analyzing the results, and generating publication-quality figures that can be traced directly back to the original simulation and analysis code. This approach scales to hundreds of parallel jobs launched and simulation results spread across thousands of files, allowing users to focus on the scientific component of their work instead of writing repetitive boilerplate code. Lancet is most useful with batch schedulers such as Oracle Grid Engine or other computing clusters, but also works well on single workstations. Users are given a small set of composable primitives that can succinctly specify large parameter spaces, from which individual jobs are generated. The declared simulation can then be reviewed in detail, avoiding mistakes before valuable time and computational resources are expended. All Lancet components are designed as self-contained, declarative objects that constitute the elements of a small DSL (domain-specific language). Once all the simulations are complete and the necessary files have been generated, Lancet collates the results for further analysis. To complete the workflow, the results can then be imported into an IPython Notebook, where they can be visualized interactively, with immediate feedback and a record of the analysis steps for reproducibility. This workflow allows you to assess your results for each simulation or compare results between different simulations. The generated data can be viewed in manageable chunks, without needing to directly manipulate files on either the local or remote filesystem. As the parameters associated with each simulation are automatically recorded and tracked, all the relevant parameters are available for each file viewed. You can then process your data, saving it back out to separate files or to a database backend (HDF5 format using PyTables is currently supported [3]), while maintaining all the relevant metadata. The core of Lancet is written in pure Python (Python 2 and 3 are supported), offering a general framework that integrates easily with external tools and simulators while keeping track of all parameters used, ensuring a reproducible workflow. The fundamental design is entirely independent of the tools that are invoked, making Lancet a flexible and general tool for anyone who needs to run and analyze the data generated by hundreds of time-consuming simulations.
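As a companion to the earlier batch-launching sketch, the snippet below illustrates the collation step: gathering each run's recorded parameters and output files into a single table ready for analysis in a notebook. The directory layout and file names are assumptions carried over from that sketch, not Lancet's own conventions or API.

```python
# Illustrative collation step: build one summary table from per-run metadata.
import csv, json, pathlib

def collate(results_dir="results", table_path="summary.csv"):
    rows = []
    for run_dir in sorted(pathlib.Path(results_dir).glob("run_*")):
        params = json.loads((run_dir / "params.json").read_text())
        outputs = [str(p) for p in run_dir.glob("*.out")]
        rows.append({**params, "run": run_dir.name, "outputs": ";".join(outputs)})
    if rows:
        with open(table_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=sorted(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    return rows

# Each row now carries the run's parameters alongside its output files, so
# results loaded into a notebook stay traceable to the runs that produced them.
summary = collate()
```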
Design, Automation, and Test in Europe | 2018
Marco Elver; Christopher J. Banks; Paul B. Jackson; Vijay Nagarajan
Archive | 2015
J.L Stevens; Chris B; Marco Elver; Tobias Fischer; James A. Bednar; Philipp Rudiger; Paul Ivanov