Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Mikko H. Lipasti is active.

Publication


Featured research published by Mikko H. Lipasti.


architectural support for programming languages and operating systems | 1996

Value locality and load value prediction

Mikko H. Lipasti; Christopher Wilkerson; John Paul Shen

Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describe how to effectively capture and exploit it in order to perform load value prediction. Temporal and spatial locality are attributes of storage locations, and describe the future likelihood of references to those locations or their close neighbors. In a similar vein, value locality describes the likelihood of the recurrence of a previously-seen value within a storage location. Modern processors already exploit value locality in a very restricted sense through the use of control speculation (i.e. branch prediction), which seeks to predict the future value of a single condition bit based on previously-seen values. Our work extends this to predict entire 32- and 64-bit register values based on previously-seen values. We find that, just as condition bits are fairly predictable on a per-static-branch basis, full register values being loaded from memory are frequently predictable as well. Furthermore, we show that simple microarchitectural enhancements to two modern microprocessor implementations (based on the PowerPC 620 and Alpha 21164) that enable load value prediction can effectively exploit value locality to collapse true dependencies, reduce average memory latency and bandwidth requirements, and provide measurable performance gains.
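The per-static-load prediction the abstract describes can be illustrated with a minimal last-value predictor sketch. This is our own illustrative model, not the paper's implementation: a table indexed by the load instruction's PC remembers the last value that load returned, and a saturating confidence counter gates whether a prediction is made.

```python
# Illustrative last-value load predictor (a sketch, not the paper's design):
# a table indexed by load PC remembers the last value each static load
# returned, with a saturating confidence counter that gates prediction.

class LoadValuePredictor:
    def __init__(self, max_conf=3, threshold=2):
        self.table = {}              # pc -> [last_value, confidence]
        self.max_conf = max_conf
        self.threshold = threshold

    def predict(self, pc):
        """Return the predicted value, or None if confidence is too low."""
        entry = self.table.get(pc)
        if entry and entry[1] >= self.threshold:
            return entry[0]
        return None

    def update(self, pc, actual_value):
        """Train the table with the value the load actually returned."""
        entry = self.table.setdefault(pc, [actual_value, 0])
        if entry[0] == actual_value:
            entry[1] = min(entry[1] + 1, self.max_conf)   # reinforce
        else:
            entry[0], entry[1] = actual_value, 0          # retrain
```

A load that keeps returning the same value becomes predictable after a few repetitions, while a changing value resets confidence so the core falls back to the normal memory access.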


international symposium on microarchitecture | 1996

Exceeding the dataflow limit via value prediction

Mikko H. Lipasti; John Paul Shen

For decades, the serialization constraints imposed by true data dependences have been regarded as an absolute limit, the dataflow limit, on the parallel execution of serial programs. This paper proposes a new technique, value prediction, for exceeding that limit by allowing data dependent instructions to issue and execute in parallel without violating program semantics. This technique is built on the concept of value locality, which describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Value prediction consists of predicting entire 32- and 64-bit register values based on previously-seen values. We find that such register values being written by machine instructions are frequently predictable. Furthermore, we show that simple microarchitectural enhancements to a modern microprocessor implementation based on the PowerPC 620 that enable value prediction can effectively exploit value locality to collapse true dependences, reduce average result latency and provide performance gains of 4.5%-23% (depending on machine model) by exceeding the dataflow limit.


international symposium on microarchitecture | 1995

SPAID: software prefetching in pointer- and call-intensive environments

Mikko H. Lipasti; William Jon Schmidt; Steven R. Kunkel; Robert Ralph Roediger

Lack of object code compatibility in VLIW architectures is a severe limit to their adoption as a general-purpose computing paradigm. Previous approaches include hardware and software techniques, both of which have drawbacks. Hardware techniques add to the complexity of the architecture, whereas software techniques require multiple executables. This paper presents a technique called dynamic rescheduling that applies software techniques dynamically, using intervention by the operating system. Results are presented to demonstrate the viability of the technique using the Illinois IMPACT compiler and the TINKER architectural framework.


high performance computer architecture | 2001

An architectural evaluation of Java TPC-W

Harold W. Cain; Ravi Rajwar; Morris Marden; Mikko H. Lipasti

The use of the Java programming language for implementing server-side application logic is increasing in popularity, yet very little is known about the architectural requirements of this emerging commercial workload. We present a detailed characterization of the Transaction Processing Council's TPC-W web benchmark, implemented in Java. The TPC-W benchmark is designed to exercise the web server and transaction processing system of a typical e-commerce web site. We have implemented TPC-W as a collection of Java servlets, and present an architectural study detailing the memory system and branch predictor behavior of the workload. We also evaluate the effectiveness of a coarse-grained multithreaded processor at increasing system throughput using TPC-W and other commercial workloads. We measure system throughput improvements from 8% to 41% for a two-context processor, and 12% to 60% for a four-context uniprocessor, over a single-threaded uniprocessor, despite decreased branch prediction accuracy and cache hit rates.


international symposium on microarchitecture | 2009

SCARAB: a single cycle adaptive routing and bufferless network

Mitchell Hayenga; Natalie D. Enright Jerger; Mikko H. Lipasti

As technology scaling drives the number of processor cores upward, current on-chip routers consume substantial portions of chip area and power budgets. Since existing research has greatly reduced router latency overheads and capitalized on available on-chip bandwidth, power constraints dominate interconnection network design. Recent research has proposed bufferless routers as a means to alleviate these constraints, but to date all designs exhibit poor operational frequency, throughput, or latency. In this paper, we propose an efficient bufferless router which lowers average packet latency by 17.6% and dynamic energy by 18.3% over existing bufferless on-chip network designs. In order to maintain the energy and area benefit of bufferless routers while delivering ultra-low latencies, our router utilizes an opportunistic processor-side buffering technique and an energy-efficient circuit-switched network for delivering negative acknowledgments for dropped packets.
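The core constraint of bufferless routing can be sketched in a few lines. This is an illustrative model, not the SCARAB microarchitecture: with no buffers, flits contending for the same output port in a cycle cannot wait, so one winner advances and the losers are dropped with a negative acknowledgment (NACK) telling the source to retransmit from its processor-side buffer.

```python
# Illustrative single-cycle bufferless arbitration (a sketch, not SCARAB):
# flits contending for the same output port cannot be buffered, so one
# winner advances and the rest are dropped with a NACK to the source.

def route_cycle(flits):
    """flits: list of (source, desired_output_port) requests for one cycle.
    Returns (forwarded, nacked): winners advance, losers must retransmit."""
    forwarded, nacked = [], []
    claimed_ports = set()
    for source, port in flits:           # arrival order breaks ties here
        if port not in claimed_ports:
            claimed_ports.add(port)
            forwarded.append((source, port))
        else:
            nacked.append(source)        # dropped; NACK sent to the source
    return forwarded, nacked
```

In the paper's design the NACKs travel on a dedicated circuit-switched network, which keeps the retransmission signal cheap in energy.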


international symposium on computer architecture | 2009

Achieving predictable performance through better memory controller placement in many-core CMPs

Dennis Abts; Natalie D. Enright Jerger; John Kim; Dan Gibson; Mikko H. Lipasti

In the near term, Moore's law will continue to provide an increasing number of transistors and therefore an increasing number of on-chip cores. Limited pin bandwidth prevents the integration of a large number of memory controllers on-chip. With many cores, and few memory controllers, where to locate the memory controllers in the on-chip interconnection fabric becomes an important and as yet unexplored question. In this paper we show how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric and lower the variance in reference latency. This in turn provides predictable performance for memory-intensive applications regardless of the processing core on which a thread is scheduled. We explore the design space of on-chip fabrics to find optimal memory controller placement relative to different topologies (i.e. mesh and torus), routing algorithms, and workloads.
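The intuition behind the placement question can be made concrete with a toy hop-count model. The placements and metric below are our own illustration, not the paper's configurations: on an 8x8 mesh with XY routing, compare the variance in distance to the nearest memory controller when the controllers are clustered on one edge versus spread across the die.

```python
# Toy model (illustrative placements, not the paper's): variance in hop
# count to the nearest memory controller on an 8x8 mesh with XY routing.

def hops(a, b):
    # Manhattan distance = hop count under dimension-order (XY) routing
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def latency_variance(mcs, n=8):
    """Variance of each core's distance to its nearest memory controller."""
    dists = [min(hops((x, y), mc) for mc in mcs)
             for x in range(n) for y in range(n)]
    mean = sum(dists) / len(dists)
    return sum((d - mean) ** 2 for d in dists) / len(dists)

row_placement = [(0, y) for y in range(0, 8, 2)]       # clustered on one edge
diamond_placement = [(1, 3), (3, 1), (4, 6), (6, 4)]   # spread across the die
```

Spreading the controllers lowers the variance in distance, which is the intuition behind the predictable latency the abstract claims: no thread is far from every controller regardless of where it is scheduled.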


international symposium on computer architecture | 2000

On the value locality of store instructions

Kevin M. Lepak; Mikko H. Lipasti

Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously-seen program values, has been studied enthusiastically in the recently published literature. Much of the energy has focused on refining the initial efforts at predicting load instruction outcomes, with the balance of the effort examining the value locality of either all register-writing instructions, or a focused subset of them. Surprisingly, there has been very little published characterization of or effort to exploit the value locality of data words stored to memory by computer programs. This paper presents such a characterization, proposes both memory-centric (based on message passing) and producer-centric (based on program structure) prediction mechanisms for stored data values, introduces the concept of silent stores and new definitions of multiprocessor false sharing based on these observations, and suggests new techniques for aligning cache coherence protocols and microarchitectural store handling techniques to exploit the value locality of stores. We find that realistic implementations of these techniques can significantly reduce multiprocessor data bus traffic and are more effective at reducing address bus traffic than the addition of an Exclusive state to an MSI coherence protocol. We also show that squashing of silent stores can provide uniprocessor speedups greater than the addition of store-to-load forwarding.
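The silent-store idea the abstract introduces is simple enough to sketch directly. This is a minimal illustration in our own terms, not the paper's mechanism: before a store commits, compare the value being written against the value already in memory, and if they match, squash the store so it never dirties the line or generates coherence traffic.

```python
# Minimal sketch of silent-store squashing (illustrative, not the paper's
# exact microarchitecture): a store whose value matches what memory already
# holds is "silent" and can be dropped at commit.

def commit_store(memory, stats, addr, value):
    """memory: dict of addr -> value. Returns True if the store was squashed."""
    if memory.get(addr) == value:
        stats["squashed"] += 1       # silent: no state change, no traffic
        return True
    memory[addr] = value
    stats["performed"] += 1
    return False
```

In a multiprocessor, a squashed store never needs exclusive ownership of the line, which is where the data and address bus traffic reductions come from.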


IEEE Computer Architecture Letters | 2007

Circuit-Switched Coherence

Natalie D. Enright Jerger; Mikko H. Lipasti; Li-Shiuan Peh

Circuit-switched networks can significantly lower the communication latency between processor cores, when compared to packet-switched networks, since once circuits are set up, communication latency approaches pure interconnect delay. However, if circuits are not frequently reused, the long set up time and poorer interconnect utilization can hurt overall performance. To combat this problem, we propose a hybrid router design which intermingles packet-switched flits with circuit-switched flits. Additionally, we co-design a prediction-based coherence protocol that leverages the existence of circuits to optimize pair-wise sharing between cores. The protocol allows pair-wise sharers to communicate directly with each other via circuits and drives up circuit reuse. Circuit-switched coherence provides overall system performance improvements of up to 17% with an average improvement of 10% and reduces network latency by up to 30%.


international symposium on microarchitecture | 2009

Light speed arbitration and flow control for nanophotonic interconnects

Dana Vantrease; Nathan L. Binkert; Robert Schreiber; Mikko H. Lipasti

By providing high bandwidth chip-wide communication at low latency and low power, on-chip optics can improve many-core performance dramatically. Optical channels that connect many nodes and allow for single cycle cache-line transmissions will require fast, high bandwidth arbitration. We exploit CMOS nanophotonic devices to create arbiters that meet the demands of on-chip optical interconnects. We accomplish this by exploiting a unique property of optical devices that allows arbitration to scale with latency bounded by the time of flight of light through a silicon waveguide that passes all requesters. We explore two classes of distributed token-based arbitration, channel based and slot based, and tailor them to optics. Channel based protocols allocate an entire waveguide to one requester at a time, whereas slot based protocols allocate fixed sized slots in the waveguide. Simple optical protocols suffer from a fixed prioritization of users and can starve those with low priority; we correct this with new schemes that vary the priorities dynamically to ensure fairness. On a 64-node optical interconnect under uniform random single-cycle traffic, our fair slot protocol achieves 74% channel utilization, while our fair channel protocol achieves 45%. Ours are the first arbitration protocols that exploit optics to simultaneously achieve low latency, high utilization, and fairness.
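The fairness problem the abstract describes, and the rotating-priority fix, can be sketched abstractly. The structure below is our own illustration, not the paper's optical protocol: with a fixed priority order, a persistent low-priority requester starves; rotating the highest priority past each winner guarantees every persistent requester is eventually served.

```python
# Illustrative token arbitration (a sketch, not the optical protocol itself):
# fixed priority starves low-priority nodes; rotating priority restores
# fairness by making each winner the lowest priority for the next round.

def arbitrate(requests, n_nodes, rotating=True):
    """requests: per-cycle list of sets of requesting node ids.
    Returns the list of winners, one per cycle (None if no requests)."""
    winners = []
    start = 0
    for cycle_reqs in requests:
        # scan nodes in priority order, beginning at `start`
        order = [(start + i) % n_nodes for i in range(n_nodes)]
        winner = next((n for n in order if n in cycle_reqs), None)
        winners.append(winner)
        if rotating and winner is not None:
            start = (winner + 1) % n_nodes   # winner drops to lowest priority
    return winners
```

Under the fixed scheme node 0 wins every cycle it requests; under the rotating scheme contending nodes alternate, which is the fairness property the paper's dynamic-priority schemes provide at light speed in the waveguide.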


international symposium on computer architecture | 2004

Memory Ordering: A Value-Based Approach

Harold W. Cain; Mikko H. Lipasti

Conventional out-of-order processors employ a multi-ported, fully-associative load queue to guarantee correct memory reference order both within a single thread of execution and across threads in a multiprocessor system. As improvements in process technology and pipelining lead to higher clock frequencies, scaling this complex structure to accommodate a larger number of in-flight loads becomes difficult if not impossible. Furthermore, each access to this complex structure consumes excessive amounts of energy. In this paper, we solve the associative load queue scalability problem by completely eliminating the associative load queue. Instead, data dependences and memory consistency constraints are enforced by simply re-executing load instructions in program order prior to retirement. Using heuristics to filter the set of loads that must be re-executed, we show that our replay-based mechanism enables a simple, scalable, and energy-efficient FIFO load queue design with no associative lookup functionality, while sacrificing only a negligible amount of performance and cache bandwidth.
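The value-based check at the heart of the paper can be sketched in miniature. This is an illustrative model, not the paper's pipeline: each load is re-executed in program order just before retirement, and if the replayed value differs from the one the out-of-order core actually used, the load and everything younger is squashed.

```python
# Minimal sketch of value-based memory ordering verification (illustrative):
# replay each load in program order before retirement and compare against
# the value the out-of-order core consumed; a mismatch forces a squash.

def verify_loads(retired_loads, memory):
    """retired_loads: program-ordered list of (addr, value_used_by_core).
    memory: dict of addr -> current value, as seen by the in-order replay.
    Returns the index of the first mismatching load, or None if all agree."""
    for i, (addr, value_used) in enumerate(retired_loads):
        if memory[addr] != value_used:   # replay disagrees: squash from here
            return i
    return None
```

The paper's heuristics then filter out loads that provably cannot mismatch, so most loads retire without ever being replayed and the cache bandwidth cost stays negligible.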

Collaboration


Dive into Mikko H. Lipasti's collaborations.

Top Co-Authors

Atif Hashmi
University of Wisconsin-Madison

Kevin M. Lepak
University of Wisconsin-Madison

David J. Palframan
University of Wisconsin-Madison

James E. Smith
University of Wisconsin-Madison

Erika Gunadi
University of Wisconsin-Madison

Andrew Nere
University of Wisconsin-Madison