Is this you? Create Your Porfile

Krste Asanovic

University of California, Berkeley

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Krste Asanovic is active.

Explore More

Publication

Featured researches published by Krste Asanovic.

high-performance computer architecture | 2005

Unbounded transactional memory

C.S. Ananian; Krste Asanovic; Bradley C. Kuszmaul; Charles E. Leiserson; Sean Lie

Hardware transactional memory should support unbounded transactions: transactions of arbitrary size and duration. We describe a hardware implementation of unbounded transactional memory, called UTM, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory. We performed a cycle-accurate simulation of a simplified architecture, called LTM. LTM is based on UTM but is easier to implement, because it does not change the memory subsystem outside of the processor. LTM allows nearly unbounded transactions, whose footprint is limited only by physical memory size and whose duration by the length of a timeslice. We assess UTM and LTM through microbenchmarking and by automatically converting the SPECjvm98 Java benchmarks and the Linux 2.4.19 kernel to use transactions instead of locks. We use both cycle-accurate simulation and instrumentation to understand benchmark behavior. Our studies show that the common case is small transactions that commit, even when contention is high, but that some applications contain very large transactions. For example, although 99.9% of transactions in the Linux study touch 54 cache lines or fewer, some transactions touch over 8000 cache lines. Our studies also indicate that hardware support is required, because some applications spend over half their time in critical regions. Finally, they suggest that hardware support for transactions can make Java programs run faster than when run using locks and can increase the concurrency of the Linux kernel by as much as a factor of 4 with no additional programming work.

parallel computing | 2009

A view of the parallel computing landscape

Krste Asanovic; Rastislav Bodik; James Demmel; Tony M. Keaveny; Kurt Keutzer; John Kubiatowicz; Nelson Morgan; David A. Patterson; Koushik Sen; John Wawrzynek; David Wessel; Katherine A. Yelick

Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers.

international symposium on computer architecture | 2005

Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors

Michael Zhang; Krste Asanovic

In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile; 2) all slices are treated as a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, as each tile creates local copies of any line it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs long hit latencies when L2 data is on a remote tile. We present a new cache management policy, victim replication, which combines the advantages of private and shared schemes. Victim replication is a variant of the shared scheme which attempts to keep copies of local primary cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of both single-threaded and multi-threaded benchmarks running on an 8-processor tiled CMP. We show that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks, providing better overall performance than either private or shared schemes.

international symposium on low power electronics and design | 2003

Reducing power density through activity migration

Seongmoo Heo; Kenneth C. Barr; Krste Asanovic

Power dissipation is unevenly distributed in modern microprocessors leading to localized hot spots with significantly greater die temperature than surrounding cooler regions. Excessive junction temperature reduces reliability and can lead to catastrophic failure. We examine the use of activity migration which reduces peak junction temperature by moving computation between multiple replicated units. Using a thermal model that includes the temperature dependence of leakage power, we show that sustainable power dissipation can be increased by nearly a factor of two for a given junction temperature limit. Alternatively, peak die temperature can be reduced by 12.4°C at the same clock frequency. The model predicts that migration intervals of around 20--200 are required to achieve the maximum sustainable power increase. We evaluate several different forms of replication and migration policy control.

architectural support for programming languages and operating systems | 2002

Mondrian memory protection

Emmett Witchel; Josh Cates; Krste Asanovic

Mondrian memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier page-based systems, MMP allows arbitrary permissions control at the granularity of individual words. We use a compressed permissions table to reduce space overheads and employ two levels of permissions caching to reduce run-time overheads. The protection tables in our implementation add less than 9% overhead to the memory space used by the application. Accessing the protection tables adds than 8% additional memory references to the accesses made by the application. Although it can be layered on top of demand-paged virtual memory, MMP is also well-suited to embedded systems with a single physical address space. We extend MMP to support segment translation which allows a memory segment to appear at another location in the address space. We use this translation to implement zero-copy networking underneath the standard read system call interface, where packet payload fragments are connected together by the translation system to avoid data copying. This saves 52% of the memory references used by a traditional copying network stack.

Nature | 2015

Single-chip microprocessor that communicates directly using light

Chen Sun; Mark T. Wade; Yunsup Lee; Jason S. Orcutt; Luca Alloatti; Michael Georgas; Andrew Waterman; Jeffrey M. Shainline; Rimas Avizienis; Sen Lin; Benjamin R. Moss; Rajesh Kumar; Fabio Pavanello; Amir H. Atabaki; Henry Cook; Albert J. Ou; Jonathan Leu; Yu-Hsin Chen; Krste Asanovic; Rajeev J. Ram; Miloš A. Popović; Vladimir Stojanovic

Data transport across short electrical wires is limited by both bandwidth and power density, which creates a performance bottleneck for semiconductor microchips in modern computer systems—from mobile phones to large-scale data centres. These limitations can be overcome by using optical communications based on chip-scale electronic–photonic systems enabled by silicon-based nanophotonic devices8. However, combining electronics and photonics on the same chip has proved challenging, owing to microchip manufacturing conflicts between electronics and photonics. Consequently, current electronic–photonic chips are limited to niche manufacturing processes and include only a few optical devices alongside simple circuits. Here we report an electronic–photonic system on a single chip integrating over 70 million transistors and 850 photonic components that work together to provide logic, memory, and interconnect functions. This system is a realization of a microprocessor that uses on-chip photonic devices to directly communicate with other chips using light. To integrate electronics and photonics at the scale of a microprocessor chip, we adopt a ‘zero-change’ approach to the integration of photonics. Instead of developing a custom process to enable the fabrication of photonics, which would complicate or eliminate the possibility of integration with state-of-the-art transistors at large scale and at high yield, we design optical devices using a standard microelectronics foundry process that is used for modern microprocessors. This demonstration could represent the beginning of an era of chip-scale electronic–photonic systems with the potential to transform computing system architectures, enabling more powerful computers, from network infrastructure to data centres and supercomputers.

international conference on mobile systems, applications, and services | 2003

Energy aware lossless data compression

Kenneth C. Barr; Krste Asanovic

Wireless transmission of a bit can require over 1000 times more energy than a single 32-bit computation. It would therefore seem desirable to perform significant computation to reduce the number of bits transmitted. If the energy required to compress data is less than the energy required to send it, there is a net energy savings and consequently, a longer battery life for portable computers. This paper reports on the energy of lossless data compressors as measured on a StrongARM SA-110 system. We show that with several typical compression tools, there is a net energy increase when compression is applied before transmission. Reasons for this increase are explained, and hardware-aware programming optimizations are demonstrated. When applied to Unix compress, these optimizations improve energy efficiency by 51%. We also explore the fact that, for many usage models, compression and decompression need not be performed by the same algorithm. By choosing the lowest-energy compressor and decompressor on the test platform, rather than using default levels of compression, overall energy to send compressible web data can be reduced 31%. Energy to send harder-to-compress English text can be reduced 57%. Compared with a system using a single optimized application for both compression and decompression, the asymmetric scheme saves 11% or 12% of the total energy depending on the dataset.

high performance interconnects | 2008

Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics

Christopher Batten; Ajay Joshi; Jason S. Orcutt; Anatoly Khilo; Benjamin Moss; Charles W. Holzwarth; Miloš A. Popović; Hanqing Li; Henry I. Smith; Judy L. Hoyt; Franz X. Kärtner; Rajeev J. Ram; Vladimir Stojanovic; Krste Asanovic

We present a new monolithic silicon photonics technology suited for integration with standard bulk CMOS processes, which reduces costs and improves opto-electrical coupling compared to previous approaches. Our technology supports dense wavelength-division multiplexing with dozens of wavelengths per waveguide. Simulation and experimental results reveal an order of magnitude better energy-efficiency than electrical links in the same technology generation. Exploiting key features of our photonics technology, we have developed a processor-memory network architecture for future manycore systems based on an opto-electrical global crossbar. We illustrate the advantages of the proposed network architecture using analytical models and simulations with synthetic traffic patterns. For a power-constrained system with 256 cores connected to 16 DRAM modules using an opto-electrical crossbar, aggregate network throughput can be improved by ap8-10times compared to an optimized purely electrical network.

ACM Transactions on Computer Systems | 2006

Energy-aware lossless data compression

Kenneth C. Barr; Krste Asanovic

Wireless transmission of a single bit can require over 1000 times more energy than a single computation. It can therefore be beneficial to perform additional computation to reduce the number of bits transmitted. If the energy required to compress data is less than the energy required to send it, there is a net energy savings and an increase in battery life for portable computers. This article presents a study of the energy savings possible by losslessly compressing data prior to transmission. A variety of algorithms were measured on a StrongARM SA-110 processor. This work demonstrates that, with several typical compression algorithms, there is a actually a net energy increase when compression is applied before transmission. Reasons for this increase are explained and suggestions are made to avoid it. One such energy-aware suggestion is asymmetric compression, the use of one compression algorithm on the transmit side and a different algorithm for the receive path. By choosing the lowest-energy compressor and decompressor on the test platform, overall energy to send and receive data can be reduced by 11% compared with a well-chosen symmetric pair, or up to 57% over the default symmetric zlib scheme.

IEEE Computer | 1997

Scalable processors in the billion-transistor era: IRAM

Christoforos E. Kozyrakis; Stylianos Perissakis; David A. Patterson; Thomas E. Anderson; Krste Asanovic; Neal Cardwell; Richard Fromm; Jason Golbus; Benjamin Gribstad; Kimberly Keeton; Randi Thomas; Noah Treuhaft; Katherine A. Yelick

Members of the University of California, Berkeley, argue that the memory system will be the greatest inhibitor of performance gains in future architectures. Thus, they propose the intelligent RAM or IRAM. This approach greatly increases the on-chip memory capacity by using DRAM technology instead of much less dense SRAM memory cells. The resultant on-chip memory capacity coupled with the high bandwidths available on chip should allow cost-effective vector processors to reach performance levels much higher than those of traditional architectures. Although vector processors require explicit compilation, the authors claim that vector compilation technology is mature (having been used for decades in supercomputers), and furthermore, that future workloads will contain more heavily vectorizable components.

Explore More