Publication


Featured research published by Konrad K. Lai.


International Symposium on Computer Architecture | 2005

Virtualizing Transactional Memory

Ravi Rajwar; Maurice Herlihy; Konrad K. Lai

Writing concurrent programs is difficult because of the complexity of ensuring proper synchronization. Conventional lock-based synchronization suffers from well-known limitations, so researchers have considered nonblocking transactions as an alternative. Recent hardware proposals have demonstrated how transactions can achieve high performance while avoiding the limitations of lock-based mechanisms. However, current hardware proposals require programmers to be aware of platform-specific resource limitations such as buffer sizes and scheduling quanta, as well as events such as page faults and process migrations. If the transactional model is to gain wide acceptance, hardware support for transactions must be virtualized to hide these limitations, in much the same way that virtual memory shields the programmer from platform-specific limitations of physical memory. This paper proposes Virtual Transactional Memory (VTM), a user-transparent system that shields the programmer from various platform-specific resource limitations. VTM maintains the performance advantage of hardware transactions, incurs low overhead in time, and has modest costs in hardware support. While many system-level challenges remain, VTM takes a step toward making transactional models more widely acceptable.
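The transactional programming model the paper builds on can be sketched in software. The following is a minimal, illustrative sketch, not VTM itself (whose overflow handling and buffering live in hardware and system software); the TVar and atomic names are invented here:

```python
import threading

class TVar:
    """A transactional variable with a version counter for conflict detection."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()  # serializes commit validation only

def atomic(txn_body):
    """Run txn_body(read, write) repeatedly until it commits without conflict."""
    while True:
        read_set = {}    # TVar -> version observed
        write_set = {}   # TVar -> buffered new value
        def read(tv):
            if tv in write_set:              # read-your-own-writes
                return write_set[tv]
            read_set[tv] = tv.version
            return tv.value
        def write(tv, v):
            write_set[tv] = v
        result = txn_body(read, write)
        with _commit_lock:
            # validate: nothing we read was committed over in the meantime
            if all(tv.version == ver for tv, ver in read_set.items()):
                for tv, v in write_set.items():
                    tv.value = v
                    tv.version += 1
                return result
        # conflict detected: another transaction won; re-execute the body

# Example: move 30 between two accounts atomically, with no explicit locking.
a, b = TVar(100), TVar(0)
def transfer(read, write):
    write(a, read(a) - 30)
    write(b, read(b) + 30)
atomic(transfer)
```

The point of the sketch is the programming model: the transfer body names no locks and no hardware limits, which is exactly the property VTM aims to preserve when hardware resources overflow.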


International Symposium on Computer Architecture | 2005

The Impact of Performance Asymmetry in Emerging Multicore Architectures

Saisanthosh Balakrishnan; Ravi Rajwar; Michael D. Upton; Konrad K. Lai

Performance asymmetry in multicore architectures arises when individual cores have different performance. Building such multicore processors is desirable because many simple cores together provide high parallel performance while a few complex cores ensure high serial performance. However, application developers typically assume computational cores provide equal performance, and performance asymmetry breaks this assumption. This paper is concerned with the behavior of commercial applications running on performance asymmetric systems. We present the first study investigating the impact of performance asymmetry on a wide range of commercial applications using a hardware prototype. We quantify the impact of asymmetry on an application's performance variance when run multiple times, and the impact on the application's scalability. Performance asymmetry adversely affects the behavior of many workloads. We study ways to eliminate these effects. In addition to asymmetry-aware operating system kernels, the application itself often needs to be aware of performance asymmetry for stable and scalable performance.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Performance evaluation of Intel® transactional synchronization extensions for high-performance computing

Richard M. Yoo; Christopher J. Hughes; Konrad K. Lai; Ravi Rajwar

Intel has recently introduced Intel® Transactional Synchronization Extensions (Intel® TSX) in the Intel 4th Generation Core™ Processors. With Intel TSX, a processor can dynamically determine whether threads need to serialize through lock-protected critical sections. In this paper, we evaluate the first hardware implementation of Intel TSX using a set of high-performance computing (HPC) workloads, and demonstrate that applying Intel TSX to these workloads can provide significant performance improvements. On a set of real-world HPC workloads, applying Intel TSX provides an average speedup of 1.41x. When applied to a parallel user-level TCP/IP stack, Intel TSX provides 1.31x average bandwidth improvement on network intensive applications. We also demonstrate the ease with which we were able to apply Intel TSX to the various workloads.
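The lock-elision pattern that Intel TSX enables can be shown as a control-flow skeleton. This is an emulated sketch, not real RTM code: speculative_begin stands in for hardware transaction begin (on real hardware this would be the XBEGIN/XEND instructions, and conflicts would abort and roll back automatically):

```python
import threading

fallback_lock = threading.Lock()

def speculative_begin():
    """Stand-in for hardware transaction begin.

    Speculation must not proceed if another thread already holds the
    fallback lock, otherwise elided and non-elided paths could race."""
    return not fallback_lock.locked()

def elided_critical_section(body):
    """Try the critical section without the lock; fall back to the lock."""
    if speculative_begin():
        body()               # runs with no lock acquired; real hardware
                             # would detect data conflicts and roll back
        return "elided"
    with fallback_lock:      # conservative path: take the real lock
        body()
    return "locked"

counter = {"n": 0}
path = elided_critical_section(lambda: counter.__setitem__("n", counter["n"] + 1))
```

The payoff described in the abstract comes from the "elided" path: threads that never actually conflict run the critical section concurrently instead of serializing on the lock.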


Proceedings of the IEEE | 2001

Coming challenges in microarchitecture and architecture

Ronny Ronen; Avi Mendelson; Konrad K. Lai; Shih-Lien Lu; Fred J. Pollack; John Paul Shen

In the past several decades, the world of computers and especially that of microprocessors has witnessed phenomenal advances. Computers have exhibited ever-increasing performance and decreasing costs, making them more affordable and, in turn, accelerating additional software and hardware development that fueled this process even more. The technology that enabled this exponential growth is a combination of advancements in process technology, microarchitecture, architecture, and design and development tools. While the pace of this progress has been quite impressive over the last two decades, it has become harder and harder to keep up this pace. New process technology requires more expensive megafabs, and new performance levels require larger die, higher power consumption, and enormous design and validation effort. Furthermore, as CMOS technology continues to advance, microprocessor design is exposed to a new set of challenges. In the near future, microarchitecture has to consider and explicitly manage the limits of semiconductor technology, such as wire delays, power dissipation, and soft errors. In this paper, we describe the role of microarchitecture in the computer world, present the challenges ahead of us, and highlight areas where microarchitecture can help address these challenges.


International Conference on Supercomputing | 2002

Bloom filtering cache misses for accurate data speculation and prefetching

Jih-Kwon Peir; Shih-Chang Lai; Shih-Lien Lu; Jared Stark; Konrad K. Lai

A processor must know a load instruction's latency to schedule the load's dependent instructions at the correct time. Unfortunately, modern processors do not know this latency until well after the dependent instructions should have been scheduled to avoid pipeline bubbles between themselves and the load. One solution to this problem is to predict the load's latency, by predicting whether the load will hit or miss in the data cache. Existing cache hit/miss predictors, however, can only correctly predict about 50% of cache misses. This paper introduces a new hit/miss predictor that uses a Bloom Filter to identify cache misses early in the pipeline. This early identification of cache misses allows the processor to more accurately schedule instructions that are dependent on loads and to more precisely prefetch data into the cache. Simulations using a modified SimpleScalar model show that the proposed Bloom Filter is nearly perfect, with a prediction accuracy greater than 99% for the SPECint2000 benchmarks. IPC (Instructions Per Cycle) performance improved by 19% over a processor that delayed the scheduling of instructions dependent on a load until the load latency was known, and by 6% and 7% over a processor that always predicted a load would hit the cache and a processor with a counter-based hit/miss predictor, respectively. This IPC reaches 99.7% of the IPC of a processor with perfect scheduling.
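The core idea can be sketched with a counting-Bloom-style filter over cache-line addresses. Filter size, line size, and the single hash function below are illustrative choices, not the paper's configuration:

```python
FILTER_BITS = 1024
LINE_SHIFT = 6   # assume 64-byte cache lines

class CacheMissFilter:
    """Counting-Bloom-style early hit/miss predictor (illustrative sizes)."""
    def __init__(self):
        self.counts = [0] * FILTER_BITS   # counters, so evictions can decrement

    def _index(self, addr):
        return (addr >> LINE_SHIFT) % FILTER_BITS

    def on_fill(self, addr):
        """A line was brought into the cache."""
        self.counts[self._index(addr)] += 1

    def on_evict(self, addr):
        """A line left the cache."""
        self.counts[self._index(addr)] -= 1

    def predict_hit(self, addr):
        # A zero count means the line is certainly absent, so a miss can be
        # predicted with confidence; a nonzero count predicts a hit and is
        # only rarely wrong, due to address aliasing in the filter.
        return self.counts[self._index(addr)] > 0

f = CacheMissFilter()
f.on_fill(0x1000)
```

The asymmetry is what makes the predictor accurate: the filter can never miss a true cache miss (no false negatives), and aliasing only occasionally turns a miss into a mispredicted hit.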


International Symposium on Computer Architecture | 2005

Scalable Load and Store Processing in Latency Tolerant Processors

Amit Gandhi; Haitham Akkary; Ravi Rajwar; S. T. Srinivasan; Konrad K. Lai

Memory latency tolerant architectures achieve high performance by supporting thousands of in-flight instructions without scaling cycle-critical processor resources. We present new load-store processing algorithms for latency tolerant architectures. We augment primary load and store queues with secondary buffers. The secondary load buffer is a set-associative structure, similar to a cache. The secondary store queue, the store redo log (SRL), is a first-in first-out (FIFO) structure recording the program order of all stores completed in parallel with a miss, and has no CAM or search functions. Instead of the secondary store queue, a cache provides temporary forwarding. The SRL enforces memory ordering by ensuring memory updates occur in program order once the miss data arrives from memory. The new algorithms remove fundamental sources of power and area inefficiency in load and store processing by eliminating the CAM and search functions in the secondary load and store buffers, and still achieve competitive performance compared to hierarchical designs.
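The SRL's key property, a plain FIFO with no associative search that restores memory ordering by replaying stores in program order, can be sketched as follows (the class name and the dict-as-memory are invented for illustration):

```python
from collections import deque

class StoreRedoLog:
    """FIFO of completed stores, drained in program order after a miss returns."""
    def __init__(self, memory):
        self.memory = memory          # dict: address -> value
        self.log = deque()            # (address, value) pairs in program order

    def record(self, addr, value):
        # No CAM, no search: recording a store is a simple FIFO append.
        self.log.append((addr, value))

    def drain(self):
        # Replay in program order once the miss data has arrived; a later
        # store to the same address naturally overwrites an earlier one.
        while self.log:
            addr, value = self.log.popleft()
            self.memory[addr] = value

mem = {0x10: 0}
srl = StoreRedoLog(mem)
srl.record(0x10, 1)
srl.record(0x20, 5)
srl.record(0x10, 2)   # program-order-later store to the same address
srl.drain()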


Architectural Support for Programming Languages and Operating Systems | 1982

Supporting Ada memory management in the iAPX-432

Fred J. Pollack; George W. Cox; Dan W. Hammerstrom; Kevin C. Kahn; Konrad K. Lai; Justin R. Rattner

In this paper, we describe how the memory management mechanisms of the Intel iAPX-432 are used to implement the visibility rules of Ada. At any point in the execution of an Ada® program on the 432, the program has a protected address space that corresponds exactly to the program's accessibility at the corresponding point in the program's source. This close match of architecture and language did not occur because the 432 was designed to execute Ada—it was not. Rather, both Ada and the 432 are the result of very similar design goals. To illustrate this point, we compare, in their support for Ada, the memory management mechanisms of the 432 to those of traditional computers. The most notable differences occur in heap-space management and multitasking. With respect to the former, we describe a degree of hardware/software cooperation that is not typical of other systems. In the latter area, we show how Ada's view of sharing is the same as the 432's, but differs totally from the sharing permitted by traditional systems. A description of these differences provides some insight into the problems of implementing an Ada compiler for a traditional architecture.


International Symposium on Computer Architecture | 2004

Prophet/Critic Hybrid Branch Prediction

Ayose Falcón; Jared Stark; Alex Ramirez; Konrad K. Lai; Mateo Valero

This paper introduces the prophet/critic hybrid conditional branch predictor, which has two component predictors that play the role of either prophet or critic. The prophet is a conventional predictor that uses branch history to predict the direction of the current branch. Further accesses of the prophet yield predictions for the branches following the current one. Predictions for the current branch and the ones that follow are collectively known as the branch's future. They are actually a prophecy, or predicted branch future. The critic uses both the branch's history and future to give a critique of the prophet's prediction of the current branch. The critique, either agree or disagree, is used to generate the final prediction for the branch. Our results show an 8K + 8K byte prophet/critic hybrid has 39% fewer mispredicts than a 16K byte 2Bc-gskew predictor, a predictor similar to that of the proposed Compaq Alpha EV8 processor, across a wide range of applications. The distance between pipeline flushes due to mispredicts increases from one flush per 418 micro-operations (uops) to one per 680 uops. For gcc, the percentage of mispredicted branches drops from 3.11% to 1.23%. On a machine based on the Intel® Pentium® 4 processor, this improves UPC (uops per cycle) by 7.8% (18% for gcc) and reduces the number of uops fetched (along both correct and incorrect paths) by 8.6%.
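The agree/disagree combination can be shown as a toy model. Counter widths, table sizes, and the XOR indexing below are invented for illustration; the paper's prophet and critic are far larger and are driven by real branch histories:

```python
class ProphetCritic:
    """Toy prophet/critic combiner: the critic's disagree vote flips the
    prophet's prediction (illustrative structure, not the paper's design)."""
    def __init__(self, bits=10):
        self.size = 1 << bits
        self.prophet = [1] * self.size   # 2-bit counters, weakly not-taken
        self.critic = [3] * self.size    # 2-bit counters, biased to "agree"

    def predict(self, pc, history, future):
        taken = self.prophet[(pc ^ history) % self.size] >= 2
        # The critic is indexed with the history *and* the predicted future
        # bits, which is what distinguishes it from a second plain predictor.
        agree = self.critic[(pc ^ history ^ future) % self.size] >= 2
        return taken if agree else not taken

    def update(self, pc, history, future, actual):
        i = (pc ^ history) % self.size
        prophet_was_right = (self.prophet[i] >= 2) == actual
        self.prophet[i] = min(3, self.prophet[i] + 1) if actual else max(0, self.prophet[i] - 1)
        j = (pc ^ history ^ future) % self.size
        # Train the critic toward "agree" when the prophet was right,
        # toward "disagree" when it was wrong.
        self.critic[j] = min(3, self.critic[j] + 1) if prophet_was_right else max(0, self.critic[j] - 1)

bp = ProphetCritic()
for _ in range(4):                       # a branch that is always taken
    bp.update(0x40, 0b1011, 0b01, True)
```

After a few updates on an always-taken branch, both the prophet counter and the critic's agree counter saturate, and the combined predictor settles on taken.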


High-Performance Computer Architecture | 2014

Improving in-memory database index performance with Intel® Transactional Synchronization Extensions

Tomas Karnagel; Roman Dementiev; Ravi Rajwar; Konrad K. Lai; Thomas Legler; Benjamin Schlegel; Wolfgang Lehner

The increasing number of cores every generation poses challenges for high-performance in-memory database systems. While these systems use sophisticated high-level algorithms to partition a query or run multiple queries in parallel, they also utilize low-level synchronization mechanisms to synchronize access to internal database data structures. Developers often spend significant development and verification effort to improve concurrency in the presence of such synchronization. The Intel® Transactional Synchronization Extensions (Intel® TSX) in the 4th Generation Core™ Processors enable hardware to dynamically determine whether threads actually need to synchronize even in the presence of conservatively used synchronization. This paper evaluates the effectiveness of such hardware support in a commercial database. We focus on two index implementations: a B+Tree Index and the Delta Storage Index used in the SAP HANA® database system. We demonstrate that such support can improve the performance of database data structures such as index trees, and that it presents a compelling opportunity for the development of simpler, scalable, and easy-to-verify algorithms.


Symposium on Operating Systems Principles | 1981

A unified model and implementation for interprocess communication in a multiprocessor environment

George W. Cox; William M. Corwin; Konrad K. Lai; Fred J. Pollack

This paper describes interprocess communication and process dispatching on the Intel 432. The primary assets of the facility are its generality and its usefulness in a wide range of applications. The conceptual model, supporting mechanisms, available interfaces, current implementations, and absolute and comparative performance are described. The Intel 432 is an object-based multiprocessor. There are two processor types: General Data Processors (GDPs) and Interface Processors (IPs). These processors provide several operating system functions in hardware by defining and using a number of processor-recognized objects and high-level instructions. In particular, they use several types of processor-recognized objects to provide a unified structure for both interprocess communication and process dispatching. One of the prime motivations for providing this level of hardware support is to improve efficiency of these facilities over similar facilities implemented in software. With greater efficiency, they become more practically useful [Stonebraker 81]. The unification allows these traditionally separate facilities to be described by a single conceptual model and implemented by a single set of mechanisms. The 432 model is based on using objects to play roles. The roles are those of requests and servers. In general, a request is a petition for some service and a server is an agent that performs the requested service. Various types of objects are used to represent role-players. The role played by an object may change over time. The type and state of an object determines what role it is playing at any given instant. For any particular class of request, based upon type and state, there is typically a corresponding class of expected server. The request/server model may be applied to a number of common communication situations. 
In the full paper, several situations are discussed: one-way requestor to server, two-way requestor to server to requestor, nondistinguished requestors, resource source selectivity, nondistinguished servers, and mutual exclusion. While the model embodies most of the essential aspects of the 432's interprocess communication and process dispatching facilities, it leaves a great many practical questions unanswered. The full paper describes our solutions to those problems which often stand between an apparently good model and a successful implementation, namely: binding, queue structure, queuing disciplines, blocking, vectoring, dispatching mixes, and hardware/software cooperation. With an understanding of the mechanisms employed, the paper then reviews the instruction interface to and potential uses of the port mechanism. This instruction interface is provided by seven instructions: SEND, RECEIVE, CONDITIONAL SEND, CONDITIONAL RECEIVE, SURROGATE SEND, SURROGATE RECEIVE, and DELAY. The implementations of the port mechanism are then discussed. The port mechanism is implemented in microcode on both the GDP and IP. Although the microarchitectures differ, in both cases the implementation requires between 600 and 800 lines of vertically encoded 16-bit microinstructions. The corresponding execution times are roughly comparable, with the IP about 20% slower even though most of its microinstructions are twice as slow. Both implementations resulted from the hand translation of the Ada-based algorithms that describe these operations. Finally, the paper characterizes the performance of the 432 port mechanisms and contrasts its performance to other implementations of similar mechanisms. Three recently implemented mechanisms were chosen: one implemented completely in software (i.e., the Exchange mechanism of RMX/86 [Intel 80]) and two implemented in a combination of hardware and software (i.e., the StarOS and Medusa mechanisms of Cm* [Jones 80]).
To make the comparison as fair as possible, times for each system are normalized to account for differences in their underlying hardware. The normalization factor is called a “tick” (similar to [Lampson 80]). In the full paper, the absolute and normalized performance of these implementations is examined in six different cases: conditional send time, conditional receive time, minimum message transit time, send plus minimum dispatching latency time, non-blocking send time, and blocking receive time. These performance comparisons show a 3 to 7x normalized performance advantage over the software-implemented RMX/86. They show similar normalized performance to the Cm* implementations.
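The request/server port model described above can be sketched in a few lines: a port queues either messages (requests) or waiting servers, unifying communication with dispatching. Only SEND, RECEIVE, and CONDITIONAL RECEIVE are sketched; blocking semantics, surrogates, and DELAY are omitted, and the callback-based server is an illustrative simplification:

```python
from collections import deque

class Port:
    """Toy 432-style port: holds either queued requests or queued servers."""
    def __init__(self):
        self.messages = deque()   # requests awaiting a server
        self.waiters = deque()    # servers awaiting a request

    def send(self, msg):
        if self.waiters:
            # A server is already waiting: dispatch the request directly.
            self.waiters.popleft()(msg)
        else:
            self.messages.append(msg)

    def receive(self, server):
        if self.messages:
            server(self.messages.popleft())
        else:
            # No request yet: the server itself queues at the port.
            self.waiters.append(server)

    def conditional_receive(self):
        # Non-blocking RECEIVE: a message, or None if the port is empty.
        return self.messages.popleft() if self.messages else None

port = Port()
got = []
port.receive(got.append)      # server arrives first and waits at the port
port.send("request-1")        # dispatched straight to the waiting server
port.send("request-2")        # no server waiting: the request queues
```

The symmetry is the point: depending on arrival order, the same port queues either the requests or the servers, which is the unification of communication and dispatching the paper describes.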
