Justin E. Gottschlich
Intel
Publications
Featured research published by Justin E. Gottschlich.
international conference on parallel architectures and compilation techniques | 2014
Irina Calciu; Justin E. Gottschlich; Tatiana Shpeisman; Gilles Pokam; Maurice Herlihy
The Intel Haswell processor includes restricted transactional memory (RTM), which is the first commodity-based hardware transactional memory (HTM) to become publicly available. However, like other real HTMs, such as IBM's Blue Gene/Q, Haswell's RTM is best-effort, meaning it provides no transactional forward progress guarantees. Because of this, a software fallback system must be used in conjunction with Haswell's RTM to ensure transactional programs execute to completion. To complicate matters, Haswell does not provide escape actions. Without escape actions, non-transactional instructions cannot be executed within the context of a hardware transaction, thereby restricting the ways in which a software fallback can interact with the HTM. As such, the challenge of creating a scalable hybrid TM (HyTM) that uses Haswell's RTM and a software TM (STM) fallback is exacerbated. In this paper, we present Invyswell, a novel HyTM that exploits the benefits and manages the limitations of Haswell's RTM. After describing Invyswell's design, we show that it outperforms NOrec, a state-of-the-art STM, by 35%, Hybrid NOrec, NOrec's hybrid implementation, by 18%, and Haswell's hardware-only lock elision by 25% across all STAMP benchmarks.
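To make the best-effort nature of RTM concrete, the following is a minimal sketch of the generic retry-then-fall-back-to-a-lock idiom such systems build on; it is not Invyswell's algorithm, and the names (g_fallback, run_atomically) are illustrative assumptions.

```cpp
#include <immintrin.h>   // _xbegin/_xend/_xabort; compile with -mrtm on RTM hardware
#include <atomic>

// Sketch of the classic best-effort RTM pattern: retry in hardware a few
// times, then fall back to a global software lock. Not Invyswell itself.
static std::atomic<bool> g_fallback{false};

template <typename F>
void run_atomically(F&& critical_section, int max_retries = 3) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Subscribe to the fallback lock: abort if another thread holds
            // it, so hardware transactions serialize with the fallback path.
            if (g_fallback.load(std::memory_order_relaxed))
                _xabort(0xff);
            critical_section();
            _xend();
            return;
        }
        // Transaction aborted; 'status' encodes the hardware abort reason.
    }
    // Software fallback: run non-speculatively under the global lock.
    while (g_fallback.exchange(true, std::memory_order_acquire)) { /* spin */ }
    critical_section();
    g_fallback.store(false, std::memory_order_release);
}
```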
international symposium on computer architecture | 2013
Gilles Pokam; Klaus Danne; Cristiano Pereira; Rolf Kassa; Tim Kranich; Shiliang Hu; Justin E. Gottschlich; Nima Honarmand; Nathan Dautenhahn; Samuel T. King; Josep Torrellas
There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality. This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel. This paper's focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR.
international symposium on microarchitecture | 2011
Gilles Pokam; Cristiano Pereira; Shiliang Hu; Ali-Reza Adl-Tabatabai; Justin E. Gottschlich; Jungwoo Ha; Youfeng Wu
Shared memory multiprocessors are difficult to program because of the non-deterministic ways in which the memory operations from different threads interleave. To address this issue, many hardware-based memory race recorders have been proposed that efficiently log an ordering of the shared memory interleavings between threads for deterministic replay. These approaches are challenging to integrate into current processors because they change the cache subsystem or the coherence protocol, and they mostly support a sequentially consistent memory model. In this paper, we describe CoreRacer, a chunk-based memory race recorder architecture for multicore x86 TSO processors. CoreRacer does not modify the cache subsystem and yet it still integrates into the x86 TSO memory model. We show that by leveraging a specific x86 feature, the invariant timestamp, CoreRacer maintains ordering among chunks without piggybacking on cache coherence messages. We provide a detailed implementation and evaluation of CoreRacer on a cycle-accurate x86 simulator. We show that its integration cost into x86 is minimal and its overhead has negligible effect on performance.
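As a rough illustration of what a chunk-based race recorder logs, here is a sketch of a per-chunk log record; the field names and layout are assumptions for illustration, not CoreRacer's actual format.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: one log record per "chunk" (a run of instructions a
// core executed without an observed conflict), stamped with an invariant
// timestamp so chunks from different cores can be totally ordered at replay.
struct ChunkLogEntry {
    uint64_t invariant_timestamp;  // globally comparable timestamp (e.g. invariant TSC)
    uint32_t core_id;              // core that executed the chunk
    uint32_t instruction_count;    // chunk size in retired instructions
};

// A replayer sorts entries by timestamp and re-executes each core's chunks
// in that global order to reproduce the recorded interleaving.
using ChunkLog = std::vector<ChunkLogEntry>;
```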
international conference on parallel architectures and compilation techniques | 2012
Justin E. Gottschlich; Maurice Herlihy; Gilles Pokam; Jeremy G. Siek
This paper presents TMProf, a transactional memory (TM) profiler, based on three visualization principles. These principles are (i) the precise graphical representation of transaction interactions including cross-correlated information and source code, (ii) visualized soft real-time playback of concurrently executing transactions, and (iii) dynamic visualizations of multiple executions. We describe how these principles break new ground and create new challenges for TM profilers. We discuss our experience using TMProf with InvalSTM, a state-of-the-art software TM, and show how TMProf's feedback led to the design of two new contention managers (CMs). We demonstrate the performance benefits of these CMs, which generally led to improved performance as the amount of work and the number of threads per benchmark increase. Our experimental results show that iBalanced, one of our newly designed CMs, can increase transaction throughput by nearly 10× over iFair, InvalSTM's previously best-performing CM.
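For readers unfamiliar with contention managers, the sketch below shows the role a CM plays when two transactions conflict; the interface and heuristic are illustrative assumptions, not InvalSTM's iBalanced.

```cpp
#include <cstdint>

// Illustrative contention-manager interface: on a conflict between two
// transactions, decide which one aborts. Not InvalSTM's iBalanced.
struct TxStats {
    uint64_t work_done;    // e.g. reads + writes performed so far
    uint32_t num_retries;  // times this transaction has already aborted
};

enum class Decision { AbortSelf, AbortOther };

// A balance-style heuristic: keep the transaction that has invested more
// work, weighting transactions that have already retried several times.
Decision resolve_conflict(const TxStats& self, const TxStats& other) {
    const uint64_t self_score  = self.work_done  + 100ull * self.num_retries;
    const uint64_t other_score = other.work_done + 100ull * other.num_retries;
    return self_score >= other_score ? Decision::AbortOther : Decision::AbortSelf;
}
```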
international conference on parallel architectures and compilation techniques | 2013
Justin E. Gottschlich; Gilles Pokam; Cristiano Pereira; Youfeng Wu
To reduce the complexity of debugging multithreaded programs, researchers have developed many techniques that automatically detect bugs that arise from shared memory errors. These techniques can identify a wide range of bugs, but it can be challenging for a programmer to use them to reproduce a specific bug of interest, because they were designed not for individual bug reproduction but for an exploratory search for possible bugs. To address this concern we present concurrent predicates (CPs) and concurrent predicate expressions (CPEs), which allow programmers to single out a specific bug by specifying the schedule and program state that must be satisfied for the bug to be reproduced. We present the recipes, that is, the mechanical processes, we use to reproduce data races, atomicity violations, and deadlocks with CP and CPE. We then show how these recipes apply to the diagnosis and reproduction of 13 handcrafted bugs, five real-world application bugs from RADBench, and three previously unresolved bugs from TBoost.STM, which now includes the fixes we generated using CP and CPE.
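The toy class below conveys the flavor of forcing a specific interleaving by blocking on a predicate over program state; it is an illustration only and not the CP/CPE API from the paper.

```cpp
#include <condition_variable>
#include <mutex>

// Toy illustration: one thread blocks until another thread has established
// the program state needed to expose the bug, pinning down the schedule.
// This class is not the paper's CP/CPE mechanism.
class ConcurrentPredicate {
public:
    void signal() {                       // call where the buggy state is produced
        { std::lock_guard<std::mutex> g(m_); satisfied_ = true; }
        cv_.notify_all();
    }
    void wait() {                         // call where the bug requires that state
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return satisfied_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool satisfied_ = false;
};
```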
languages and compilers for parallel computing | 2011
Justin E. Gottschlich; Jaewoong Chung
Transactional memory (TM) is a promising alternative to mutual exclusion. In spite of this, it may be unrealistic for TM programs to be devoid of locks due to their abundant use in legacy software systems. Consequently, for TMs to be practical they may need to manage the interaction of transactions and locks when they access the same shared-memory. This paper presents two algorithms, one coarse-grained and one fine-grained, that improve the state-of-the-art performance for TMs that support the concurrent execution of locks and transactions. We also discuss the programming language constructs that are necessary to implement such algorithms and present analyses that compare and contrast our approach with prior work. Our analyses demonstrate that, (i) in general, our proposed coarse- and fine-grained algorithms improve program concurrency but (ii) an algorithm’s concurrent throughput potential does not always lead to realized performance gains.
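One plausible coarse-grained scheme for making transactions aware of lock activity is sketched below, purely to make the design space concrete; it is not claimed to be either of the paper's algorithms, and all names are invented for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// A plausible coarse-grained scheme, not the paper's algorithms: locks bump
// global counters, and software transactions consult them so they never
// commit concurrently with lock-protected critical sections.
static std::atomic<uint64_t> g_lock_acquires{0};  // total acquisitions ever
static std::atomic<uint64_t> g_locks_held{0};     // locks currently held

class TxAwareLock {
public:
    void lock() {
        m_.lock();
        g_lock_acquires.fetch_add(1, std::memory_order_acq_rel);
        g_locks_held.fetch_add(1, std::memory_order_acq_rel);
    }
    void unlock() {
        g_locks_held.fetch_sub(1, std::memory_order_acq_rel);
        m_.unlock();
    }
private:
    std::mutex m_;
};

// Transaction-side checks (pseudo-usage inside an STM):
//   at begin : abort unless g_locks_held == 0; remember s = g_lock_acquires
//   at commit: abort unless g_lock_acquires == s
```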
international workshop on openmp | 2014
Michael Wong; Eduard Ayguadé; Justin E. Gottschlich; Victor Luchangco; Bronis R. de Supinski; Barna L. Bihari
The OpenMP specification lacks a composable shared memory concurrency mechanism: the current OpenMP concurrency mechanisms, such as OMP critical, locks, or atomics, do not support composition. In this paper, we motivate the need for transactional memory (TM) in OpenMP. The chief reason is to support composition of realistic programs, but we also consider whether TM is easier to program than locks, the use case for TM, and whether a software-only TM can outperform traditional locking through a survey of recent publications. This paper advances upon previous proposals of OpenMP TM by introducing a new construct specifically to handle irrevocable actions, which is also composable. It also proposes a pure atomic transaction construct as well as the concept of transaction safety. Further, we examine how our proposed construct integrates with current OpenMP constructs.
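The composition problem with omp critical can be seen in a few lines; the account-transfer example below is illustrative and not taken from the paper.

```cpp
#include <omp.h>

// Each helper is atomic on its own, but transfer() is not atomic as a whole,
// and wrapping its body in another critical(accounts) region is disallowed
// (same-name nesting) and deadlocks in practice. This is the composition gap
// a transaction construct would close. Names here are illustrative.
static double balance_a = 100.0, balance_b = 0.0;

void withdraw_a(double amt) {
    #pragma omp critical(accounts)
    balance_a -= amt;
}

void deposit_b(double amt) {
    #pragma omp critical(accounts)
    balance_b += amt;
}

void transfer(double amt) {   // another thread may observe the intermediate state
    withdraw_a(amt);
    deposit_b(amt);
}
```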
international conference on parallel architectures and compilation techniques | 2015
Yujie Liu; Justin E. Gottschlich; Gilles Pokam; Michael F. Spear
The availability of commercial hardware transactional memory (TM) systems has not yet been met with a rise in the number of large-scale programs that use memory transactions explicitly. A significant impediment to the use of TM is the lack of tool support, specifically profilers that can identify and explain performance anomalies. In this paper, we introduce an end-to-end system that enables low-overhead performance profiling of large-scale transactional programs. We present algorithms and an implementation for Intel's Haswell processors. With our system, it is possible to record a transactional program's execution with minimal overhead, and then replay it within a custom profiling tool to identify causes of contention and aborts, down to the granularity of individual memory accesses. Evaluation shows that our algorithms have low overhead, and our tools enable programmers to effectively explain performance anomalies.
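The raw signal such a profiler starts from is the RTM abort status word returned by _xbegin(); the sketch below merely decodes it using the standard intrinsics header and is not the paper's record-and-replay tool.

```cpp
#include <immintrin.h>   // RTM intrinsics and _XABORT_* status bits
#include <cstdio>

// Decodes why a hardware transaction aborted, given the status value
// returned by _xbegin(). A profiler like the one described aggregates this
// kind of information per transaction; this sketch is not that tool.
void log_abort_reason(unsigned status) {
    if (status & _XABORT_CONFLICT) std::printf("abort: data conflict\n");
    if (status & _XABORT_CAPACITY) std::printf("abort: read/write set overflow\n");
    if (status & _XABORT_EXPLICIT)
        std::printf("abort: explicit _xabort, code=%u\n", _XABORT_CODE(status));
    if (status & _XABORT_RETRY)    std::printf("abort: hardware suggests retrying\n");
    if (status & _XABORT_NESTED)   std::printf("abort: occurred in a nested transaction\n");
}
```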
arXiv: Artificial Intelligence | 2018
Justin E. Gottschlich; Armando Solar-Lezama; Nesime Tatbul; Michael Carbin; Martin C. Rinard; Regina Barzilay; Saman P. Amarasinghe; Joshua B. Tenenbaum; Tim Mattson
In this position paper, we describe our vision of the future of machine programming through a categorical examination of three pillars of research. Those pillars are: (i) intention, (ii) invention, and (iii) adaptation. Intention emphasizes advancements in the human-to-computer and computer-to-machine-learning interfaces. Invention emphasizes the creation or refinement of algorithms or core hardware and software building blocks through machine learning (ML). Adaptation emphasizes advances in the use of ML-based constructs to autonomously evolve software.
Archive | 2012
Tim Kranich; Gilles Pokam; Justin E. Gottschlich; Klaus Danne; Rolf Kassa; Shiliang Hu; Cristiano Pereira