Karin Strauss | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Karin Strauss is active.

Explore More

Publication

Featured researches published by Karin Strauss.

international symposium on computer architecture | 2010

Use ECP, not ECC, for hard failures in resistive memories

Stuart E. Schechter; Gabriel H. Loh; Karin Strauss; Doug Burger

As leakage and other charge storage limitations begin to impair the scalability of DRAM, non-volatile resistive memories are being developed as a potential replacement. Unfortunately, current error correction techniques are poorly suited to this emerging class of memory technologies. Unlike DRAM, PCM and other resistive memories have wear lifetimes, measured in writes, that are sufficiently short to make cell failures common during a systems lifetime. However, resistive memories are much less susceptible to transient faults than DRAM. The Hamming-based ECC codes used in DRAM are designed to handle transient faults with no effective lifetime limits, but ECC codes applied to resistive memories would wear out faster than the cells they are designed to repair. This paper evaluates Error-Correcting Pointers (ECP), a new approach to error correction optimized for memories in which errors are the result of permanent cell failures that occur, and are immediately detectable, at write time. ECP corrects errors by permanently encoding the locations of failed cells into a table and assigning cells to replace them. ECP provides longer lifetimes than previously proposed solutions with equivalent overhead. Whats more, as the level of variance in cell lifetimes increases -- a likely consequence of further scalaing -- ECPs margin of improvement over existing schemes increases.

acm sigplan symposium on principles and practice of parallel programming | 2006

POSH: a TLS compiler that exploits program structure

Wei Liu; James Tuck; Luis Ceze; Wonsun Ahn; Karin Strauss; Jose Renau; Josep Torrellas

As multi-core architectures with Thread-Level Speculation (TLS) are becoming better understood, it is important to focus on TLS compilation. TLS compilers are interesting in that, while they do not need to fully prove the independence of concurrent tasks, they make choices of where and when to generate speculative tasks that are crucial to overall TLS performance.This paper presents POSH, a new, fully automated TLS compiler built on top of gcc. POSH is based on two design decisions. First, to partition the code into tasks, it leverages the code structures created by the programmer, namely subroutines and loops. Second, it uses a simple profiling pass to discard ineffective tasks. With the code generated by POSH, a simulated TLS chip multiprocessor with 4 superscalar cores delivers an average speedup of 1.30 for the SPECint 2000 applications. Moreover, an estimated 26% of this speedup is a result of the implicit data prefetching provided by squashed tasks.

international symposium on microarchitecture | 2013

Approximate storage in solid-state memories

Adrian Sampson; Jacob Nelson; Karin Strauss; Luis Ceze

Memories today expose an all-or-nothing correctness model that incurs significant costs in performance, energy, area, and design complexity. But not all applications need high-precision storage for all of their data structures all of the time. This paper proposes mechanisms that enable applications to store data approximately and shows that doing so can improve the performance, lifetime, or density of solid-state memories. We propose two mechanisms. The first allows errors in multi-level cells by reducing the number of programming pulses used to write them. The second mechanism mitigates wear-out failures and extends memory endurance by mapping approximate data onto blocks that have exhausted their hardware error correction resources. Simulations show that reduced-precision writes in multi-level phase-change memory cells can be 1.7x faster on average and using failed blocks can improve array lifetime by 23% on average with quality loss under 10%.

international conference on supercomputing | 2005

Tasking with out-of-order spawn in TLS chip multiprocessors: microarchitecture and compilation

Jose Renau; James Tuck; Wei Liu; Luis Ceze; Karin Strauss; Josep Torrellas

Chip Multiprocessors (CMPs) are flexible, high-frequency platforms on which to support Thread-Level Speculation (TLS). However, for TLS to deliver on its promise, CMPs must exploit multiple sources of speculative task-level parallelism, including any nesting levels of both subroutines and loop iterations. Unfortunately, these environments are hard to support in decentralized CMP hardware: since tasks are spawned out-of-order and unpredictably, maintaining key TLS basics such as task ordering and efficient resource allocation is challenging.While the concept of out-of-order spawning is not new, this paper is the first to propose a set of microarchitectural mechanisms that, altogether, fundamentally enable fast TLS with out-of-order spawn in a CMP. Moreover, we develop a fully-automated TLS compiler for aggressive out-of-order spawn. With our mechanisms, a TLS CMP with four 4-issue cores achieves an average speedup of 1.30 for full SPECint 2000 applications; the corresponding speedup for in-order only spawn is 1.04. Overall, our mechanisms unlock the potential of TLS for the toughest applications.

international symposium on computer architecture | 2010

Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races

Brandon Lucia; Luis Ceze; Karin Strauss; Shaz Qadeer; Hans-Juergen Boehm

We argue in this paper that concurrency errors should be treated as exceptions, i.e., have fail-stop behavior and precise semantics. We propose an exception model based on conflict of synchronization free regions, which precisely detects a broad class of data-races. We show that our exceptions provide enough guarantees to simplify high-level programming language semantics and debugging, but are significantly cheaper to enforce than traditional data-race detection. To make the performance cost of enforcement negligible, we propose architecture support for accurately detecting and precisely delivering these exceptions. We evaluate the suitability of our model as well as the behavior of our architectural mechanisms using the PARSEC benchmark suite and commercial applications. Our results show that the exception model largely reflects how programmers are already writing code and that the main memory, traffic and performance overheads of the enforcement mechanisms we propose are very low.

international symposium on microarchitecture | 2007

Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

Karin Strauss; Xiaowei Shen; Josep Torrellas

Snoopy cache coherence can be implemented in any physical network topology by embedding a logical unidirectional ring in the network. Control messages are forwarded using the ring, while other messages can use any path. While the resulting coherence protocols are inexpensive to implement, they enable many ways of overlapping multiple transactions that access the same line-making it hard to reason about correctness. Moreover, snoop requests are required to traverse the ring, therefore lengthening coherence transaction latencies. In this paper, we address these problems and make two main contributions. First, we introduce the ordering invariant, which ensures the correct serialization of colliding transactions in embedded-ring protocols. Second, based on this invariant, we remove the requirement that snoop requests traverse the ring. Instead, they are delivered using any network path, as long as snoop responses - which are typically off the critical path - use the logical ring. This approach substantially reduces coherence transaction latency. We call the resulting protocol Uncorq. We show that, on a 64-node chip multiprocessor (CMP), Uncorq improves the performance, on average, by 23% for SPLASH-2 applications and by 10% for commercial applications. With an additional simple prefetching optimization, the performance improvement is, on average, 26% for SPLASH-2 applications and 18% for commercial applications.

high-performance computer architecture | 2002

Evaluation of a multithreaded architecture for cellular computing

Călin Caşcaval; José G. Castaños; Luis Ceze; Monty M. Denneau; Manish Gupta; Derek Lieber; José E. Moreira; Karin Strauss; Henry S. Warren

Cyclops is a new architecture for high-performance parallel computers that is being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP (symmetric multiprocessor) system with multiple threads of execution, embedded memory and integrated communications hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper, we describe the Cyclops architecture and evaluate two of its new hardware features: a memory hierarchy with a flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40 GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high-performance systems.

international symposium on microarchitecture | 2011

Preventing PCM banks from seizing too much power

Andrew W. Hay; Karin Strauss; Timothy Sherwood; Gabriel H. Loh; Doug Burger

Widespread adoption of Phase Change Memory (PCM) requires solutions to several problems recently addressed in the literature, including limited endurance, increased write latencies, and system-level changes required to exploit non-volatility. One important difference between PCM and DRAM that has received less attention is the increased need for write power management. Writing to a PCM cell requires high current density over hundreds of nanoseconds, and hard limits on the number of simultaneous writes must be enforced to ensure correct operation, limiting write throughput and therefore overall performance. Because several wear reduction schemes only write those bits that need to be written, the amount of power required to write a cache line back to memory under such a system is now variable, which creates opportunity to reduce write power. This paper proposes policies that monitor the bits that have actually been changed over time, as opposed to simply those lines that are dirty. These polices can more effectively allocate power across the system to improve write concurrency. This method for allocating power across the memory subsystem is built on the idea of “power tokens,” a transferable, but time-specific, allocation of power. The results show that with a storage overhead of 4.3% in the last-level cache, a power-aware memory system can improve the performance of multiprogrammed workloads by up to 84%.

architectural support for programming languages and operating systems | 2012

PocketWeb: instant web browsing for mobile devices

Dimitrios Lymberopoulos; Oriana Riva; Karin Strauss; Akshay Mittal; Alexandros Ntoulas

The high network latencies and limited battery life of mobile phones can make mobile web browsing a frustrating experience. In prior work, we proposed trading memory capacity for lower web access latency and a more convenient data transfer schedule from an energy perspective by prefetching slowly-changing data (search queries and results) nightly, when the phone is charging. However, most web content is intrinsically much more dynamic and may be updated multiple times a day, thus eliminating the effectiveness of periodic updates. This paper addresses the challenge of prefetching dynamic web content in a timely fashion, giving the user an instant web browsing experience but without aggravating the battery lifetime issue. We start by analyzing the web access traces of 8,000 users, and observe that mobile web browsing exhibits a strong spatiotemporal signature, which is different for every user. We propose to use a machine learning approach based on stochastic gradient boosting techniques to efficiently model this signature on a per user basis. The machine learning model is capable of accurately predicting future web accesses and prefetching the content in a timely manner. Our experimental evaluation with 48,000 models trained on real user datasets shows that we can accurately prefetch 60% of the URLs for about 80-90% of the users within 2 minutes before the request. The system prototype we built not only provides more than 80% lower web access time for more than 80% of the users, but it also achieves the same or lower radio energy dissipation by more than 50% for the majority of mobile users.

international symposium on computer architecture | 2006

Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors

Karin Strauss; Xiaowei Shen; Josep Torrellas

A simple and low-cost approach to supporting snoopy cache coherence is to logically embed a unidirectional ring in the network of a multiprocessor, and use it to transfer snoop messages. Other messages can use any link in the network. While this scheme works for any network topology, a naive implementation may result in long response times or in many snoop messages and snoop operations. To address this problem, this paper proposes Flexible Snooping algorithms, a family of adaptive forwarding and filtering snooping algorithms. In these algorithms, a node receiving a snoop request may either forward it to another node and then perform the snoop, or snoop and then forward it, or simply forward it without snooping. The resulting design space offers trade-offs in number of snoop operations and messages, response time, and energy consumption. Our analysis using SPLASH-2, SPECjbb, and SPECweb workloads finds several snooping algorithms that are more costeffective than current ones. Specifically, our choice for a highperformance snooping algorithm is faster than the currently fastest algorithm while consuming 9-17% less energy; our choice for an energy-efficient algorithm is only 3-6% slower than the previous one while consuming 36-42% less energy.

Explore More