Truman Joe
Stanford University
Publications
Featured research published by Truman Joe.
IEEE Transactions on Parallel and Distributed Systems | 1993
Daniel E. Lenoski; James P. Laudon; Truman Joe; David Nakahira; Luis Stevens; Anoop Gupta; John L. Hennessy
The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. The hardware overhead of directory-based cache coherence in a 48-processor prototype is examined. The data show that the overhead is only about 10-15%, which appears to be a small cost for the ease of programming offered by coherent caches and the potential for higher performance. The performance of the system is discussed, and the speedups obtained by a variety of parallel applications running on the prototype are shown. Using a sophisticated hardware performance monitor, the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup are characterized. The optimizations incorporated in the DASH protocol are evaluated in terms of their effectiveness on parallel applications and on atomic tests that stress the memory system.
international symposium on computer architecture | 1992
Daniel E. Lenoski; James P. Laudon; Truman Joe; David Nakahira; Luis Stevens; Anoop Gupta; John L. Hennessy
The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design trade-offs, prototypes are essential to ensure that no critical details are overlooked. A prototype provides convincing evidence of the feasibility of the design, allows one to accurately estimate both the hardware and complexity costs of various features, and provides a platform for studying real workloads. A 16-processor prototype of the DASH multiprocessor has been operational for the last six months. In this paper, the hardware overhead of directory-based cache coherence in the prototype is examined. We also discuss the performance of the system, and the speedups obtained by parallel applications running on the prototype. Using a sophisticated hardware performance monitor, we characterize the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup.
international symposium on computer architecture | 1994
Truman Joe; John L. Hennessy
Cache only memory architectures (COMA) have an inherent memory overhead due to the organization of main memory as a large cache called an attraction memory. This overhead consists of memory left unallocated for performance reasons as well as additional physical memory required due to the cache organization of memory. In this work, we examine the effect of data reshuffling and data replication on the memory overhead. Data reshuffling occurs when space needs to be allocated to store a remote memory line in the local memory. Data that is reshuffled is sent between memories via replacement messages. A simple mathematical model predicts the frequency of data reshuffling as a function of the attraction memory parameters. Simulation data shows that the frequency of data reshuffling is sensitive to the allocation policy and associativity of the memory but is relatively unaffected by the block size chosen. The simulation data also shows that data replication in the attraction memory is important for good performance, but most gains can be achieved through replication in the processor caches.
conference on high performance computing (supercomputing) | 1993
Jaswinder Pal Singh; Truman Joe; Anoop Gupta; John L. Hennessy
Two interesting variants of large-scale shared-address-space parallel architectures are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). Both have distributed main memory and use directory-based cache coherence. While both architectures migrate and replicate data at the cache level automatically under hardware control, COMA machines do this at the main memory level as well. The authors compare the parallel performance of a recent realization of each type of architecture, the Stanford DASH multiprocessor (CC-NUMA) and the Kendall Square Research KSR-1 (COMA). Using a suite of important computational kernels and complete scientific applications, they examine performance differences resulting both from the CC-NUMA/COMA nature of the machines and from specific differences in system implementation.
Archive | 1993
Anoop Gupta; Truman Joe
Archive | 1995
Truman Joe
Archive | 1993
Jaswinder Pal Singh; Truman Joe; John Hennessy; Anoop Gupta
Archive | 1993
Daniel E. Lenoski; James P. Laudon; Truman Joe; David Nakahira; Luis Stevens; Anoop Gupta; John Hennessy
Multiprocessor performance measurement and evaluation | 1995
Daniel E. Lenoski; James P. Laudon; Truman Joe; David Nakahira; Luis Stevens; Anoop Gupta; John L. Hennessy
Multiprocessor performance measurement and evaluation | 1995
Per Stenström; Truman Joe; Anoop Gupta