Alex Solomatnikov
Stanford University
Publications
Featured research published by Alex Solomatnikov.
international symposium on computer architecture | 2007
Jacob Leverich; Hideho Arakida; Alex Solomatnikov; Amin Firoozshahian; Mark Horowitz; Christos Kozyrakis
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
IEEE Micro | 2010
Ofer Shacham; Omid Azizi; Megan Wachs; Wajahat Qadeer; Zain Asgar; Kyle Kelley; John P. Stevenson; Stephen Richardson; Mark Horowitz; Benjamin C. Lee; Alex Solomatnikov; Amin Firoozshahian
Because of technology scaling, power dissipation is today's major performance limiter. Moreover, the traditional way to achieve power efficiency, application-specific designs, is prohibitively expensive. These power and cost issues necessitate rethinking digital design. To reduce design costs, we need to stop building chip instances, and start making chip generators instead. Domain-specific chip generators are templates that codify designer knowledge and design trade-offs to create different application-optimized chips.
international symposium on microarchitecture | 2008
Ofer Shacham; Megan Wachs; Alex Solomatnikov; Amin Firoozshahian; Stephen Richardson; Mark Horowitz
Verification of chip multiprocessor memory systems remains challenging. While formal methods have been used to validate protocols, simulation is still the dominant method used to validate memory system implementations. Having a memory scoreboard, a high-level model of the memory, greatly aids simulation-based validation, but accurate scoreboards are complex to create since often they depend not only on the memory and consistency model but also on its specific implementation. This paper describes a methodology of using a relaxed scoreboard, which greatly reduces the complexity of creating these memory models. The relaxed scoreboard tracks the operations of the system to maintain a set of values that could possibly be valid for each memory location. By allowing multiple possible values, the model used in the scoreboard is only loosely coupled with the specific design. This decouples the construction of the checker from the implementation, allows the checker to be used early in the design and built up incrementally, and greatly reduces the scoreboard design effort. We demonstrate the use of the relaxed scoreboard in verifying RTL implementations of two different memory models, Transactional Coherency and Consistency (TCC) and Relaxed Consistency, for up to 32 processors. The resulting checker has a performance slowdown of 19% for checking Relaxed Consistency, and less than 30% for TCC, allowing it to be used in all simulation runs.
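The core idea of the relaxed scoreboard, tracking a set of possibly valid values per memory location rather than one golden value, can be sketched in a few lines. This is a minimal illustrative model, not the paper's actual checker interface; all names here are hypothetical.

```python
from collections import defaultdict

class RelaxedScoreboard:
    """Toy relaxed scoreboard: each address maps to the SET of values
    that a load could legally observe, rather than a single golden
    value. Illustrative sketch, not the paper's implementation."""

    def __init__(self):
        self.possible = defaultdict(set)  # addr -> candidate values

    def observe_store(self, addr, value):
        # A store adds a candidate; until the memory model resolves
        # ordering, older candidates may still be legally visible.
        self.possible[addr].add(value)

    def commit_store(self, addr, value):
        # Once a store is known to be globally ordered, earlier
        # candidates for that address can be pruned.
        self.possible[addr] = {value}

    def check_load(self, addr, value):
        # A load passes iff it returned one of the candidates.
        return value in self.possible[addr]
```

Two racing stores to the same address leave both values in the candidate set, so a load returning either passes; once one store commits, only its value remains legal. This looseness is what decouples the checker from any specific implementation's ordering decisions.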
architectural support for programming languages and operating systems | 2012
David R. Cheriton; Amin Firoozshahian; Alex Solomatnikov; John P. Stevenson; Omid Azizi
Programming language and operating system support for efficient concurrency-safe access to shared data is a key concern for the effective use of multi-core processors. Most research has focused on the software model of multiple threads accessing this data within a single shared address space. However, many real applications are actually structured as multiple separate processes for fault isolation and simplified synchronization. In this paper, we describe the HICAMP architecture and its innovative memory system, which supports efficient concurrency safe access to structured shared data without incurring the overhead of inter-process communication. The HICAMP architecture also provides support for programming language and OS structures such as threads, iterators, read-only access and atomic update. In addition to demonstrating that HICAMP is beneficial for multi-process structured applications, our evaluation shows that the same mechanisms provide substantial benefits for other areas, including sparse matrix computations and virtualization.
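HICAMP's memory is immutable and content-addressable: a line is identified by its content, so identical lines are stored once and can be shared safely across processes. A software toy of that deduplication idea, with hypothetical names and no claim to match the actual architecture, might look like:

```python
class DedupLineStore:
    """Toy content-addressable line store in the spirit of HICAMP:
    lines are immutable and identified by content, so two identical
    lines occupy storage exactly once. Illustrative sketch only."""

    def __init__(self):
        self.ids = {}     # line content -> canonical line id
        self.lines = []   # line id -> content

    def intern(self, line):
        # Return the canonical id for this content, allocating it
        # only the first time the content is seen.
        content = tuple(line)  # freeze the line's content
        if content not in self.ids:
            self.ids[content] = len(self.lines)
            self.lines.append(content)
        return self.ids[content]

    def read(self, line_id):
        # Lines are immutable, so a shared read needs no locking.
        return self.lines[line_id]
```

Because lines are immutable, readers never race with writers on shared content, which is one intuition behind the paper's claim of concurrency-safe access without inter-process communication overhead.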
ACM Transactions on Architecture and Code Optimization | 2008
Jacob Leverich; Hideho Arakida; Alex Solomatnikov; Amin Firoozshahian; Mark Horowitz; Christoforos E. Kozyrakis
There are two competing models for the on-chip memory in Chip Multiprocessor (CMP) systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications on systems with up to 16 cores, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and nonallocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
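The abstract's point that streaming at the programming-model level pays off even on cache-based hardware comes down to restructuring a computation into blocks small enough to stay resident on chip while they are reused. A hypothetical sketch of that blocked, stream-style structure (names and block size are illustrative, not from the paper):

```python
def blocked_scale_sum(data, scale, block=1024):
    """Stream-style processing on a cache-based machine: process one
    block at a time so each block can stay cache-resident while it
    is reused, instead of streaming over the whole array per pass.
    Illustrative sketch; parameters are made up."""
    total = 0.0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]    # "gather" one block
        scaled = [x * scale for x in chunk]  # compute while hot
        total += sum(scaled)                 # reuse before eviction
    return total
```

On a cache-based machine the hardware still guarantees correctness if a block is sized wrong or a stray irregular access slips in, which is the paper's argument for why stream programming is easier there than on explicitly managed memory.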
international symposium on microarchitecture | 2009
Alex Solomatnikov; Amin Firoozshahian; Ofer Shacham; Zain Asgar; Megan Wachs; Wajahat Qadeer; Stephen Richardson; Mark Horowitz
Building hardware prototypes for computer architecture research is challenging. Unfortunately, development of the required software tools (compilers, debuggers, runtime) is even more challenging, which means these systems rarely run real applications. To overcome this issue, when developing our prototype platform, we used the Tensilica processor generator to produce a customized processor and corresponding software tools and libraries. While this base processor was very different from the streamlined custom processor we initially imagined, it allowed us to focus on our main objective, the design of a reconfigurable CMP memory system, and to successfully tape out an 8-core CMP chip with only a small group of designers. One person was able to handle processor configuration and hardware generation, support of a complete software tool chain, as well as developing the custom runtime software to support three different programming models. Having a sophisticated software tool chain not only allowed us to run more applications on our machine, but also pointed out once again the need to use optimized code to get an accurate evaluation of architectural features.
international conference on supercomputing | 2012
John P. Stevenson; Amin Firoozshahian; Alex Solomatnikov; Mark Horowitz; David R. Cheriton
Sparse matrix-vector multiply (SpMV) is a critical task in the inner loop of modern iterative linear system solvers and exhibits very little data reuse. This low reuse means that its performance is bounded by main-memory bandwidth. Moreover, the random patterns of indirection make it difficult to achieve this bound. We present sparse matrix storage formats based on deduplicated memory. These formats reduce memory traffic during SpMV and thus show significantly improved performance bounds: 90x better in the best case. Additionally, we introduce a matrix format that inherently exploits any amount of matrix symmetry and is at the same time fully compatible with non-symmetric matrix code. Because of this, our method can concurrently operate on a symmetric matrix without complicated work partitioning schemes and without any thread synchronization or locking. This approach takes advantage of growing processor caches, but incurs an instruction count overhead. It is feasible to overcome this issue by using specialized hardware, as shown by the recently proposed Hierarchical Immutable Content-Addressable Memory Processor (HICAMP) architecture.
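One way deduplication can cut SpMV memory traffic is when a matrix's nonzeros take only a few distinct values: the nonzeros can store small indices into a unique-value table instead of full-width values. The sketch below is a toy CSR-with-value-table format invented for illustration; it is not the paper's actual storage layout.

```python
def dedup_csr(rows):
    """Build a toy CSR representation with a deduplicated value
    table: each nonzero stores an index into a table of unique
    values rather than the value itself, shrinking memory traffic
    when values repeat. Illustrative format, not the paper's."""
    uniq, table = {}, []
    indptr, indices, val_ids = [0], [], []
    for row in rows:                     # row: list of (col, value)
        for col, v in row:
            if v not in uniq:            # first time seeing value v
                uniq[v] = len(table)
                table.append(v)
            indices.append(col)
            val_ids.append(uniq[v])
        indptr.append(len(indices))
    return indptr, indices, val_ids, table

def spmv(indptr, indices, val_ids, table, x):
    """Compute y = A @ x, reading values through the dedup table."""
    y = []
    for r in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[r], indptr[r + 1]):
            acc += table[val_ids[k]] * x[indices[k]]
        y.append(acc)
    return y
```

The extra indirection through the table is exactly the instruction-count overhead the abstract mentions; the paper's argument is that hardware like HICAMP can hide that cost.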
international symposium on computer architecture | 2010
Rehan Hameed; Wajahat Qadeer; Megan Wachs; Omid Azizi; Alex Solomatnikov; Benjamin C. Lee; Stephen Richardson; Christos Kozyrakis; Mark Horowitz
international symposium on computer architecture | 2009
Amin Firoozshahian; Alex Solomatnikov; Ofer Shacham; Zain Asgar; Stephen Richardson; Christos Kozyrakis; Mark Horowitz
Communications of The ACM | 2011
Rehan Hameed; Wajahat Qadeer; Megan Wachs; Omid Azizi; Alex Solomatnikov; Benjamin C. Lee; Stephen Richardson; Christos Kozyrakis; Mark Horowitz