Is this you? Create Your Porfile

Marek Olszewski

Massachusetts Institute of Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Marek Olszewski is active.

Explore More

Publication

Featured researches published by Marek Olszewski.

architectural support for programming languages and operating systems | 2009

Kendo: efficient deterministic multithreading in software

Marek Olszewski; Jason Ansel; Saman P. Amarasinghe

Although chip-multiprocessors have become the industry standard, developing parallel applications that target them remains a daunting task. Non-determinism, inherent in threaded applications, causes significant challenges for parallel programmers by hindering their ability to create parallel applications with repeatable results. As a consequence, parallel applications are significantly harder to debug, test, and maintain than sequential programs. This paper introduces Kendo: a new software-only system that provides deterministic multithreading of parallel applications. Kendo enforces a deterministic interleaving of lock acquisitions and specially declared non-protected reads through a novel dynamically load-balanced deterministic scheduling algorithm. The algorithm tracks the progress of each thread using performance counters to construct a deterministic logical time that is used to compute an interleaving of shared data accesses that is both deterministic and provides good load balancing. Kendo can run on todays commodity hardware while incurring only a modest performance cost. Experimental results on the SPLASH-2 applications yield a geometric mean overhead of only 16% when running on 4 processors. This low overhead makes it possible to benefit from Kendo even after an application is deployed. Programmers can start using Kendo today to program parallel applications that are easier to develop, debug, and test.

programming language design and implementation | 2009

PetaBricks: a language and compiler for algorithmic choice

Jason Ansel; Cy P. Chan; Yee Lok Wong; Marek Olszewski; Qin Zhao; Alan Edelman; Saman P. Amarasinghe

It is often impossible to obtain a one-size-fits-all solution for high performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to different architectures, problem sizes, data, and available system resources. In some cases, completely different algorithms may provide the best performance. Current compiler and programming language techniques are able to change some of these parameters, but today there is no simple way for the programmer to express or the compiler to choose different algorithms to handle different parts of the data. Existing solutions normally can handle only coarse-grained, library level selections or hand coded cutoffs between base cases and recursive cases. We present PetaBricks, a new implicitly parallel language and compiler where having multiple implementations of multiple algorithms to solve a problem is the natural way of programming. We make algorithmic choice a first class construct of the language. Choices are provided in a way that also allows our compiler to tune at a finer granularity. The PetaBricks compiler autotunes programs by making both fine-grained as well as algorithmic choices. Choices also include different automatic parallelization techniques, data distributions, algorithmic parameters, transformations, and blocking. Additionally, we introduce novel techniques to autotune algorithms for different convergence criteria. When choosing between various direct and iterative methods, the PetaBricks compiler is able to tune a program in such a way that delivers near-optimal efficiency for any desired level of accuracy. The compiler has the flexibility of utilizing different convergence criteria for the various components within a single algorithm, providing the user with accuracy choice alongside algorithmic choice.

international conference on parallel architectures and compilation techniques | 2007

JudoSTM: A Dynamic Binary-Rewriting Approach to Software Transactional Memory

Marek Olszewski; J. Cutler; J.G. Steffan

With the advent of chip-multiprocessors, we are faced with the challenge of parallelizing performance-critical software. Transactional memory (TM) has emerged as a promising programming model allowing programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Many implementations of hardware, software, and hybrid support for TM have been proposed; of these, software-only implementations (STMs) are especially compelling since they can be used with current commodity hardware. However, in addition to higher overheads, many existing STM systems are limited to either managed languages or intrusive APIs. Furthermore, transactions in STMs cannot normally contain calls to unobservable code such as shared libraries or system calls. In this paper we present JudoSTM, a novel dynamic binary-rewriting approach to implementing STM that supports C and C++ code. Furthermore, by using value-based conflict detection, JudoSTM additionally supports the transactional execution of both (i) irreversible system calls and (ii) library functions that may contain locks. We significantly lower overhead through several novel optimizations that improve the quality of rewritten code and reduce the cost of conflict detection and buffering. We show that our approach performs comparably to Rochesters RSTM library-based implementation- demonstrating that a dynamic binary-rewriting approach to implementing STM is an interesting alternative.

symposium on code generation and optimization | 2011

Language and compiler support for auto-tuning variable-accuracy algorithms

Jason Ansel; Yee Lok Wong; Cy P. Chan; Marek Olszewski; Alan Edelman; Saman P. Amarasinghe

Approximating ideal program outputs is a common technique for solving computationally difficult problems, for adhering to processing or timing constraints, and for performance optimization in situations where perfect precision is not necessary. To this end, programmers often use approximation algorithms, iterative methods, data resampling, and other heuristics. However, programming such variable accuracy algorithms presents difficult challenges since the optimal algorithms and parameters may change with different accuracy requirements and usage environments. This problem is further compounded when multiple variable accuracy algorithms are nested together due to the complex way that accuracy requirements can propagate across algorithms and because of the size of the set of allowable compositions. As a result, programmers often deal with this issue in an ad-hoc manner that can sometimes violate sound programming practices such as maintaining library abstractions. In this paper, we propose language extensions that expose trade-offs between time and accuracy to the compiler. The compiler performs fully automatic compile-time and installtime autotuning and analyses in order to construct optimized algorithms to achieve any given target accuracy. We present novel compiler techniques and a structured genetic tuning algorithm to search the space of candidate algorithms and accuracies in the presence of recursion and sub-calls to other variable accuracy code. These techniques benefit both the library writer, by providing an easy way to describe and search the parameter and algorithmic choice space, and the library user, by allowing high level specification of accuracy requirements which are then met automatically without the need for the user to understand any algorithm-specific parameters. Additionally, we present a new suite of benchmarks, written in our language, to examine the efficacy of our techniques. Our experimental results show that by relaxing accuracy requirements, we can easily obtain performance improvements ranging from 1.1× to orders of magnitude of speedup.

acm symposium on parallel algorithms and architectures | 2009

Scalable reader-writer locks

Yossi Lev; Victor Luchangco; Marek Olszewski

We present three new reader-writer lock algorithms that scale under high read-only contention. Many previous reader-writer locks suffer significant degradation when many readers attempt to acquire the lock concurrently, even though they are all allowed to hold the lock at the same time. In contrast, our locks scale almost perfectly when there is only read contention on a 4-chip system with a total of 256 hardware threads. Two of the algorithms extend the MCS queue mutex to provide reader-writer synchronization with low overhead, and can be used when busy-waiting synchronization is appropriate. The third algorithm is an improvement on a production-quality reader-writer lock used in the SolarisTM kernel, which provides robust priority and flexible fairness guarantees. A key tool we developed to implement our reader-writer locks is the closable scalable nonzero indicator (C-SNZI), a variation on the SNZI object. C-SNZI objects allow us to significantly reduce the contention among reads when many readers try to acquire the lock concurrently, but keeps the acquisition overhead small in the absence of read contention. We present an algorithm for C-SNZI that achieves this goal, and show how it can be used by each of our lock algorithms to provide scalable reader-writer locks with different fairness guarantees.

acm symposium on parallel algorithms and architectures | 2010

Simplifying concurrent algorithms by exploiting hardware transactional memory

David Dice; Yossi Lev; Virendra J. Marathe; Mark S. Moir; Daniel S. Nussbaum; Marek Olszewski

We explore the potential of hardware transactional memory (HTM) to improve concurrent algorithms. We illustrate a number of use cases in which HTM enables significantly simpler code to achieve similar or better performance than existing algorithms for conventional architectures. We use Suns prototype multicore chip, code-named Rock, to experiment with these algorithms, and discuss ways in which its limitations prevent better results, or would prevent production use of algorithms even if they are successful. Our use cases include concurrent data structures such as double ended queues, work stealing queues and scalable non-zero indicators, as well as a scalable malloc implementation and a simulated annealing application. We believe that our paper makes a compelling case that HTM has substantial potential to make effective concurrent programming easier, and that we have made valuable contributions in guiding designers of future HTM features to exploit this potential.

european conference on computer systems | 2007

JIT instrumentation: a novel approach to dynamically instrument operating systems

Marek Olszewski; Keir Mierle; Adam Czajkowski; Angela Demke Brown

As modern operating systems become more complex, understanding their inner workings is increasingly difficult. Dynamic kernel instrumentation is a well established method of obtaining insight into the workings of an OS, with applications including debugging, profiling and monitoring, and security auditing. To date, all dynamic instrumentation systems for operating systems follow the probe-based instrumentation paradigm. While efficient on fixed-length instruction set architectures, probes are extremely expensive on variable-length ISAs such as the popular Intel x86 and AMD x86-64. We propose using just-in-time (JIT) instrumentation to overcome this problem. While common in user space, JIT instrumentation has not until now been attempted in kernel space. In this work, we show the feasibility and desirability of kernel-based JIT instrumentation for operating systems with our novel prototype, implemented as a Linux kernel module. The prototype is fully SMP capable. We evaluate our prototype against the popular Kprobes Linux instrumentation tool. Our prototype outperforms Kprobes, at both micro and macro levels, by orders of magnitude when applying medium- and fine-grained instrumentation.

compilers, architecture, and synthesis for embedded systems | 2012

Siblingrivalry: online autotuning through local competitions

Jason Ansel; Maciej Pacula; Yee Lok Wong; Cy P. Chan; Marek Olszewski; Una-May O'Reilly; Saman P. Amarasinghe

Modern high performance libraries, such as ATLAS and FFTW, and programming languages, such as PetaBricks, have shown that autotuning computer programs can lead to significant speedups. However, autotuning can be burdensome to the deployment of a program, since the tuning process can take a long time and should be re-run whenever the program, microarchitecture, execution environment, or tool chain changes. Failure to re-autotune programs often leads to widespread use of sub-optimal algorithms. With the growth of cloud computing, where computations can run in environments with unknown load and migrate between different (possibly unknown) microarchitectures, the need for online autotuning has become increasingly important. We present SiblingRivalry, a new model for always-on online autotuning that allows parallel programs to continuously adapt and optimize themselves to their environment. In our system, requests are processed by dividing the available cores in half, and processing two identical requests in parallel on each half. Half of the cores are devoted to a known safe program configuration, while the other half are used for an experimental program configuration chosen by our self-adapting evolutionary algorithm. When the faster configuration completes, its results are returned, and the slower configuration is terminated. Over time, this constant experimentation allows programs to adapt to changing dynamic environments and often outperform the original algorithm that uses the entire system.

architectural support for programming languages and operating systems | 2012

Aikido: accelerating shared data dynamic analyses

Marek Olszewski; Qin Zhao; David Koh; Jason Ansel; Saman P. Amarasinghe

Despite a burgeoning demand for parallel programs, the tools available to developers working on shared-memory multicore processors have lagged behind. One reason for this is the lack of hardware support for inspecting the complex behavior of these parallel programs. Inter-thread communication, which must be instrumented for many types of analyses, may occur with any memory operation. To detect such thread communication in software, many existing tools require the instrumentation of all memory operations, which leads to significant performance overheads. To reduce this overhead, some existing tools resort to random sampling of memory operations, which introduces false negatives. Unfortunately, neither of these approaches provide the speed and accuracy programmers have traditionally expected from their tools. In this work, we present Aikido, a new system and framework that enables the development of efficient and transparent analyses that operate on shared data. Aikido uses a hybrid of existing hardware features and dynamic binary rewriting to detect thread communication with low overhead. Aikido runs a custom hypervisor below the operating system, which exposes per-thread hardware protection mechanisms not available in any widely used operating system. This hybrid approach allows us to benefit from the low cost of detecting memory accesses with hardware, while maintaining the word-level accuracy of a software-only approach. To evaluate our framework, we have implemented an Aikido-enabled vector clock race detector. Our results show that the Aikido enabled race-detector outperforms existing techniques that provide similar accuracy by up to 6.0x, and 76% on average, on the PARSEC benchmark suite.

Archive | 2009