John Wai Cheong Fu
Intel
Publication
Featured research published by John Wai Cheong Fu.
international symposium on microarchitecture | 1992
John Wai Cheong Fu; Bob Janssens
The execution of numerically intensive programs presents a challenge to memory system designers. Numerical program execution can be accelerated by pipelined arithmetic units, but to be effective, these must be supported by high-speed memory access. A cache memory is a well-known hardware mechanism used to reduce the average memory access latency. Numerical programs, however, often have poor cache performance.
hawaii international conference on system sciences | 1994
John Wai Cheong Fu
Stride directed prefetching has been proposed to improve the cache performance of numerical programs executing on a vector processor. This paper shows how this approach can be extended to a scalar processor by using a simple hardware mechanism, called a stride prediction table (SPT), to calculate the stride distances of array accesses made from within the loop body of a program. The results using selected programs from the PERFECT and SPEC benchmarks show that stride directed prefetching on a scalar processor can significantly reduce the cache miss rate of particular programs and that an SPT needs only a small number of entries to be effective.
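The stride prediction table idea in the abstract above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the direct-mapped organization, the indexing by load-instruction address, the entry contents, and the names `StridePredictionTable`, `access`, `pc`, and `addr` are all hypothetical choices made for illustration.

```python
class StridePredictionTable:
    """Toy model of an SPT (assumed direct-mapped, keyed by instruction address)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        # Each slot holds (instruction_address, last_data_address) or None.
        self.slots = [None] * num_entries

    def access(self, pc, addr):
        """Record a load at `pc` touching `addr`; return a prefetch address or None."""
        index = pc % self.num_entries
        entry = self.slots[index]
        self.slots[index] = (pc, addr)  # always remember the latest access
        if entry is None or entry[0] != pc:
            return None  # no history for this instruction, so no stride yet
        stride = addr - entry[1]
        if stride == 0:
            return None  # same address again: nothing new to prefetch
        return addr + stride  # predict the next element of the array walk


# A loop whose load at pc=0x40 walks an array in 32-byte steps.
spt = StridePredictionTable(num_entries=4)
predictions = [spt.access(pc=0x40, addr=a) for a in (1000, 1032, 1064, 1096)]
print(predictions)  # [None, 1064, 1096, 1128]
```

The first access only records history; every later access yields the address one stride ahead, which a prefetcher could request before the loop reaches it. Two instructions that map to the same slot evict each other in this model, which is one reason the number of entries matters.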
international conference on parallel processing | 1993
John Wai Cheong Fu
Trace driven simulation is a well-known method for evaluating computer architecture options and is the technique of choice in most published cache and memory studies. Ideally, a trace should contain all the necessary events generated by a program. However, this is usually impractical for all but the most trivial of programs because of trace storage and simulation time costs. As computer systems increase in performance and complexity, there is a growing need to use larger and more realistic programs for trace driven simulation. This has led to a growing interest in applying sampling techniques to reduce trace driven simulation costs. This paper reports on some experiments in trace sampling and discusses a prediction method for resolving cold-start or fill references when simulating with a sampled trace. The paper shows how a small sampled trace can capture the characteristics of a much larger trace, and cache simulation results are presented using these sampled traces and the prediction method.
international parallel processing symposium | 1994
Jeff Baxter; John Wai Cheong Fu; Balkrishna Ramkumar
High-speed architectures usually employ some form of parallelism or concurrency. Parallel or concurrent execution of a program not only increases the rate at which references are issued to the memory system but also changes the behavior of these references, relative to its serial-scalar execution. This paper reports the variations in program memory reference behavior when automatically transformed by a compiler and executed on parallel and vector architectures. Using traces of the PERFECT benchmark set, executed on an Alliant FX/80 in single scalar processor, single vector processor, scalar multiprocessor, and vector multiprocessor modes, measurements are reported for issue rates, reference locality, and data sharing.
hawaii international conference on system sciences | 1994
John Wai Cheong Fu; A.L.N. Reddy
Addresses the issue of resource management in parallel systems. Two new hybrid algorithms for general resource management in distributed memory computers are presented. T-hybrid is a decoupled algorithm that combines a static template allocation scheme with a low-cost local demand-driven dynamic algorithm, while C-hybrid is a coupled algorithm that combines a simple static allocation scheme with the same dynamic algorithm as T-hybrid. A set of test programs is scheduled using these hybrid algorithms and compared with scheduling using pure static and dynamic schemes when executed on an Ncube2 system under the Chare Kernel. The results show that the two new hybrid strategies provide faster execution times than all the pure dynamic and static algorithms investigated, and that the simpler C-hybrid algorithm resulted in an execution time about 6% faster than T-hybrid and 22% faster than the fastest non-hybrid scheme.
Archive | 2001
Len Schultz; Nhon Quach; Dean Mulla; Jim Hays; John Wai Cheong Fu
Advances in technology and computer design are resulting in impressive increases in raw processor power. Currently, new processor implementations are showing almost a doubling in clock frequency. Moreover, with each new generation, processor designers are incorporating more advanced architectural techniques, such as instruction level parallelism, into these implementations. Memory technology also continues to improve, but as always, memory performance still trails processor requirements. As raw processor performance increases, the memory access latency, be it to main memory or to disk memory, becomes more significant to overall system performance. When the data access rate of a memory subsystem does not meet the data request rate of the processor, system performance is less than desired. With today's high-speed processors, meeting this request rate is more and more difficult to achieve. The rapid development of integration technology has resulted in a significant trend in computer system design: almost all computer systems being designed are based on single-chip processor implementations, i.e., the microprocessor. This recent shift in the design of computer systems is expected to continue, and future computer systems will be classified not by processor architecture but by the structure and cost of the interconnect, I/O, and memory subsystems.
Archive | 2004
Nhon Quach; John Wai Cheong Fu; Sunny Huang; Jeen Miin; Dean Mulla
Archive | 1997
Stefan Rusu; John Wai Cheong Fu; Simon M. Tam
Archive | 1998
John Wai Cheong Fu; Dean Mulla; Gregory S. Mathews; Stuart E. Sailer
Archive | 1997
John Wai Cheong Fu; Muthurajan Jayakumar