
Publications


Featured research published by Chi-Keung Luk.


Architectural Support for Programming Languages and Operating Systems | 1996

Compiler-based prefetching for recursive data structures

Chi-Keung Luk; Todd C. Mowry

Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, its potential in pointer-based applications has remained largely unexplored. This paper investigates compiler-based prefetching for pointer-based applications---in particular, those containing recursive data structures. We identify the fundamental problem in prefetching pointer-based data structures and propose a guideline for devising successful prefetching schemes. Based on this guideline, we design three prefetching schemes, we automate the most widely applicable scheme (greedy prefetching) in an optimizing research compiler, and we evaluate the performance of all three schemes on a modern superscalar processor similar to the MIPS R10000. Our results demonstrate that compiler-inserted prefetching can significantly improve the execution speed of pointer-based codes---as much as 45% for the applications we study. In addition, the more sophisticated algorithms (which we currently perform by hand, but which might be implemented in future compilers) can improve performance by as much as twofold. Compared with the only other compiler-based pointer prefetching scheme in the literature, our algorithms offer substantially better performance by avoiding unnecessary overhead and hiding more latency.
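The greedy scheme described above can be approximated in ordinary C: as soon as a node is available, prefetch the nodes reachable through its pointers, so their fetch overlaps with the work done on the current node. A minimal sketch using the GCC/Clang `__builtin_prefetch` built-in (the `node` type and function name are illustrative, not from the paper):

```c
#include <stddef.h>

/* A recursive data structure: a singly linked list. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* Sum the list, greedily prefetching each node's successor as soon as
 * the current node is available, so the memory fetch overlaps with the
 * work on this node.  __builtin_prefetch is a GCC/Clang built-in; on a
 * cold cache it hides part of the pointer-chasing latency. */
long sum_with_greedy_prefetch(node *head) {
    long total = 0;
    for (node *p = head; p != NULL; p = p->next) {
        if (p->next != NULL)
            __builtin_prefetch(p->next);  /* greedy: prefetch the child now */
        total += p->value;                /* useful work overlaps the fetch */
    }
    return total;
}
```

The greedy scheme needs no analysis of how far ahead to prefetch, which is what makes it the most widely automatable of the three schemes the paper evaluates.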


International Symposium on Computer Architecture | 2001

Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

Chi-Keung Luk

Hard-to-predict data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution—essentially a combined act of speculative address generation and prefetching—to accelerate the main thread. In this paper, we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need to shorten programs for pre-execution, and no special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.
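The core idea can be illustrated with ordinary threads: a helper thread runs the same address-generating loop as the main computation, touching the data so it is warm in the cache by the time the main thread arrives. This is a software-only sketch using POSIX threads; the array names and sizes are assumptions, and a real SMT implementation would run the pre-execution code on an idle hardware context rather than an OS thread:

```c
#include <pthread.h>
#include <stddef.h>

enum { N = 1 << 16 };
static int index_arr[N];   /* irregular, data-dependent access pattern */
static double data[N];

/* Pre-execution helper: executes the same address-generating code as
 * the main loop, but only touches the data to warm the cache.  Its
 * result is discarded -- no integration of pre-execution results. */
static void *pre_execute(void *arg) {
    (void)arg;
    volatile double sink = 0;
    for (size_t i = 0; i < N; i++)
        sink += data[index_arr[i]];  /* speculative address gen + prefetch */
    return NULL;
}

double compute(void) {
    pthread_t helper;
    pthread_create(&helper, NULL, pre_execute, NULL); /* spawn run-ahead thread */
    double total = 0;
    for (size_t i = 0; i < N; i++)
        total += data[index_arr[i]]; /* main thread: the accurate computation */
    pthread_join(helper, NULL);
    return total;
}
```

Because the helper executes real program code, it predicts addresses that no pattern-based hardware prefetcher could; the only requirement is an idle thread context to run it on.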


International Symposium on Microarchitecture | 1997

Predicting data cache misses in non-numeric applications through correlation profiling

Todd C. Mowry; Chi-Keung Luk

To maximize the benefit and minimize the overhead of software-based latency tolerance techniques, we must apply them precisely to the set of dynamic references that suffer cache misses. Unfortunately, the information provided by the state-of-the-art cache miss profiling technique (summary profiling) is inadequate for references with intermediate miss ratios: it results in either failing to hide latency or else inserting unnecessary overhead. To overcome this problem, we propose and evaluate a new technique, correlation profiling, which improves predictability by correlating the caching behavior with the associated dynamic context. Our experimental results demonstrate that roughly half of the 22 non-numeric applications we study can potentially enjoy significant reductions in memory stall time by exploiting at least one of the three forms of correlation profiling we consider.
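A toy illustration of the difference between summary and correlation profiling: instead of one aggregate miss ratio per static reference, keep separate hit/miss counters per dynamic context (e.g., per call chain, for control-flow correlation). The record layout, context encoding, and threshold below are illustrative assumptions, not the paper's implementation:

```c
/* Per-reference profile, split by dynamic context.  Summary profiling
 * would collapse all contexts into a single pair of counters and lose
 * the fact that some contexts always miss while others always hit. */
enum { MAX_CTX = 8 };
struct ref_profile {
    unsigned hits[MAX_CTX];
    unsigned misses[MAX_CTX];
};

/* Record one dynamic cache outcome for this reference in this context. */
void record(struct ref_profile *p, int ctx, int was_miss) {
    if (was_miss) p->misses[ctx]++;
    else          p->hits[ctx]++;
}

/* A context is worth prefetching only if its own miss ratio exceeds
 * the threshold; contexts that mostly hit are left alone, avoiding
 * the unnecessary overhead that summary profiling would insert. */
int should_prefetch(const struct ref_profile *p, int ctx, double threshold) {
    unsigned total = p->hits[ctx] + p->misses[ctx];
    return total > 0 && (double)p->misses[ctx] / total > threshold;
}
```

A reference with a 50% summary miss ratio might split into one context that misses 90% of the time and another that almost always hits; correlation profiling recovers exactly that distinction.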


International Symposium on Microarchitecture | 1998

Cooperative prefetching: compiler and hardware support for effective instruction prefetching in modern processors

Chi-Keung Luk; Todd C. Mowry

Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors since they fail to issue prefetches early enough (particularly for non-sequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of non-sequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach results in speedups ranging from 9.4% to 18.5% (13.3% on average) over the original execution time on an out-of-order superscalar processor, which is more than double the average speedup of the best existing schemes (6.5%). This is accomplished by hiding an average of 71% of the original instruction stall time, compared with only 36% for the best existing schemes. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, that the compiler can limit the code expansion to less than 10% on average, and that our scheme is robust with respect to variations in miss latency and bandwidth.
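The compiler-inserted half of the scheme can be sketched in C: before a non-sequential control transfer, issue a prefetch of the upcoming target so its code is resident when the transfer happens. Note that `__builtin_prefetch` is a data-cache hint on most targets; here it merely stands in for the instruction-prefetch instruction the paper's compiler emits, and the operation table is a made-up example:

```c
typedef long (*op_fn)(long x, long i);

/* Example operations dispatched through the table (illustrative). */
static long add_op(long x, long i) { return x + i; }
static long mul_op(long x, long i) { return x * (i + 1); }

/* Walk a table of indirect calls, prefetching the *next* target one
 * iteration before it is invoked -- mimicking the compiler algorithm
 * that prefetches control-transfer targets far enough in advance.
 * (On real hardware an i-cache prefetch instruction would be used;
 * __builtin_prefetch is the closest portable stand-in.) */
long apply(op_fn *ops, int n, long x) {
    for (int i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch((const void *)ops[i + 1]); /* next target early */
        x = ops[i](x, i);
    }
    return x;
}
```

The hardware's filtered sequential prefetcher covers straight-line code; inserted prefetches like the one above cover exactly the non-sequential transfers the hardware cannot anticipate, which is why the paper finds the two components complementary.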


IEEE Transactions on Computers | 1999

Automatic compiler-inserted prefetching for pointer-based applications

Chi-Keung Luk; Todd C. Mowry

As the disparity between processor and memory speeds continues to grow, memory latency is becoming an increasingly important performance bottleneck. While software-controlled prefetching is an attractive technique for tolerating this latency, its success has been limited thus far to array-based numeric codes. In this paper, we expand the scope of automatic compiler-inserted prefetching to also include the recursive data structures commonly found in pointer-based applications. We propose three compiler-based prefetching schemes, and automate the most widely applicable scheme (greedy prefetching) in an optimizing research compiler. Our experimental results demonstrate that compiler-inserted prefetching can offer significant performance gains on both uniprocessors and large-scale shared-memory multiprocessors.


International Symposium on Computer Architecture | 1999

Memory forwarding: enabling aggressive layout optimizations by guaranteeing the safety of data relocation

Chi-Keung Luk; Todd C. Mowry

By optimizing data layout at run-time, we can potentially enhance the performance of caches by actively creating spatial locality, facilitating prefetching, and avoiding cache conflicts and false sharing. Unfortunately, it is extremely difficult to guarantee that such optimizations are safe in practice on today's machines, since accurately updating all pointers to an object requires perfect alias information, which is well beyond the scope of the compiler for languages such as C. To overcome this limitation, we propose a technique called memory forwarding which effectively adds a new layer of indirection within the memory system whenever necessary to guarantee that data relocation is always safe. Because actual forwarding rarely occurs (it exists as a safety net), the mechanism can be implemented as an exception in modern superscalar processors. Our experimental results demonstrate that the aggressive layout optimizations enabled by memory forwarding can result in significant speedups---more than twofold in some cases---by reducing the number of cache misses, improving the effectiveness of prefetching, and conserving memory bandwidth.
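The forwarding idea can be mimicked in software: every object carries a forwarding slot, and a stale pointer is chased to the object's current location on each access, so relocation is safe even when not all pointers can be updated. In the paper this check is performed by hardware as a rarely taken exception; the struct layout and function names below are purely illustrative:

```c
#include <stddef.h>

/* An object with a forwarding slot.  While the object lives at its
 * original address, forward is NULL; after relocation, the old copy
 * points at the new one, so stale pointers still resolve safely. */
typedef struct obj {
    struct obj *forward;   /* NULL, or the object's new location */
    int payload;
} obj;

/* Chase forwarding pointers to the object's current location.
 * In hardware this check costs nothing in the common (NULL) case. */
obj *resolve(obj *p) {
    while (p->forward != NULL)
        p = p->forward;
    return p;
}

/* Relocate an object (e.g., to create spatial locality), leaving a
 * forwarding pointer behind so every old pointer remains valid. */
obj *relocate(obj *old, obj *fresh) {
    fresh->forward = NULL;
    fresh->payload = old->payload;
    old->forward   = fresh;
    return fresh;
}

int read_payload(obj *p) { return resolve(p)->payload; }
```

Because forwarding is only a safety net, the common-case access pays nothing; the layout optimizer is then free to relocate objects aggressively without perfect alias information.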


ACM Transactions on Computer Systems | 2001

Architectural and Compiler Support for Effective Instruction Prefetching: A Cooperative Approach.

Chi-Keung Luk; Todd C. Mowry

Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors, since they fail to issue prefetches early enough (particularly for nonsequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of nonsequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction-prefetch instructions to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach hides 50% or more of the latency remaining with the best previous techniques, while at the same time reducing the number of useless prefetches by a factor of six. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, and that the compiler can limit the code expansion to only 9% on average. In addition, we show that the performance of our technique can be further increased by using profiling information to help reduce cache conflicts and unnecessary prefetches. From an architectural perspective, these performance advantages are sustained over a range of common miss latencies and bandwidths. Finally, our technique is cost effective as well, since it delivers performance comparable to (or even better than) that of larger caches, but requires a much smaller hardware budget.


Computer Languages | 1995

I+: A multiparadigm language for object-oriented declarative programming

Kam-Wing Ng; Chi-Keung Luk

This paper presents a multiparadigm language I+ which is an integration of the three major programming paradigms: object-oriented, logic and functional. I+ has an object-oriented framework in which the notions of classes, objects, methods, inheritance and message passing are supported. Methods may be specified as clauses or functions, thus the two declarative paradigms are incorporated at the method level of the object-oriented paradigm. In addition, two levels of parallelism may be exploited in I+ programming. Therefore I+ is a multiparadigm language for object-oriented declarative programming as well as parallel programming.


Microprocessing and Microprogramming | 1995

A survey of languages integrating functional, object-oriented and logic programming

Kam-Wing Ng; Chi-Keung Luk

Functional, object-oriented and logic programming are widely regarded as the three most dominant programming paradigms nowadays. For the past decade, many attempts have been made to integrate these three paradigms into a single language. This paper is a survey of this new breed of multiparadigm languages. First we give a succinct introduction to the three paradigms. Then we discuss a variety of approaches to the integration of the three paradigms through an overview of some of the existing multiparadigm languages. All possible combinations of the three paradigms, namely logic + object-oriented, functional + logic, functional + object-oriented, and object-oriented + logic + functional, are considered separately. For the purpose of classification, we have proposed a design space of programming languages called the FOOL-space.


IEEE Transactions on Computers | 2000

Understanding why correlation profiling improves the predictability of data cache misses in nonnumeric applications

Todd C. Mowry; Chi-Keung Luk

Latency-tolerance techniques offer the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. However, to fully exploit the benefit of these techniques, one must be careful to apply them only to the dynamic references that are likely to suffer cache misses; otherwise the runtime overheads can potentially offset any gains. In this paper, we focus on isolating dynamic miss instances in nonnumeric applications, which is a difficult but important problem. Although compilers cannot statically analyze data locality in nonnumeric applications, one viable approach is to use profiling information to measure the actual miss behavior. Unfortunately, the state-of-the-art in cache miss profiling (which we call summary profiling) is inadequate for references with intermediate miss ratios: it either misses opportunities to hide latency, or else inserts overhead that is unnecessary. To overcome this problem, we propose and evaluate a new profiling technique that helps predict which dynamic instances of a static memory reference will hit or miss in the cache: correlation profiling. Our experimental results demonstrate that roughly half of the 21 nonnumeric applications we study can potentially enjoy significant reductions in memory stall time by exploiting at least one of the three forms of correlation profiling we consider: control-flow correlation, self correlation, and global correlation. In addition, our detailed case studies illustrate that self correlation succeeds because a given reference's cache outcomes often contain repeated patterns, and control-flow correlation succeeds because cache outcomes are often call-chain dependent. Finally, we suggest a number of ways to exploit correlation profiling in practice and demonstrate that software prefetching can achieve better performance on a modern superscalar processor when directed by correlation profiling rather than summary profiling information.

Collaboration


Dive into Chi-Keung Luk's collaborations.

Top Co-Authors

Todd C. Mowry
Carnegie Mellon University

Kam-Wing Ng
The Chinese University of Hong Kong