Jamison D. Collins
University of California, San Diego
Publications
Featured research published by Jamison D. Collins.
international symposium on computer architecture | 2001
Jamison D. Collins; Hong Wang; Dean M. Tullsen; Christopher J. Hughes; Yong-Fong Lee; Daniel M. Lavery; John Paul Shen
This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.
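A minimal Python sketch of the idea, for illustration only: a trigger in the main thread spawns a speculative helper that runs a small precomputation slice (p-slice) ahead of the main thread and prefetches the addresses it computes. The Cache and p_slice names, the strided access stream, and the fixed lookahead are assumptions of this sketch, not the paper's Itanium/SMT simulation.

```python
# Sketch of Speculative Precomputation: a helper "thread" runs a p-slice
# ahead of the main thread and prefetches the addresses it computes.
class Cache:
    def __init__(self):
        self.lines = set()
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines.add(addr)      # fill on miss

    def prefetch(self, addr):
        self.lines.add(addr)          # speculative fill, no hit/miss accounting


def p_slice(start, step, count, cache):
    """Backward slice of the delinquent load: just the address computation."""
    addr = start
    for _ in range(count):
        cache.prefetch(addr)
        addr += step                  # future address the main thread will touch


def main_thread(start, step, n, cache, lookahead=8):
    addr = start
    for i in range(n):
        # Trigger instruction: periodically spawn a speculative thread that
        # runs the p-slice `lookahead` iterations ahead of the main thread.
        if i % lookahead == 0:
            p_slice(addr + step * lookahead, step, lookahead, cache)
        cache.access(addr)            # the delinquent load itself
        addr += step


cache = Cache()
main_thread(start=0, step=64, n=1024, cache=cache)
print(f"hits={cache.hits} misses={cache.misses}")
```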
international symposium on microarchitecture | 2001
Jamison D. Collins; Dean M. Tullsen; Hong Wang; John Paul Shen
A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, that have been extracted directly from the program they are trying to accelerate. Proposed techniques typically require manual user intervention to extract and optimize instruction sequences. This paper proposes Dynamic Speculative Precomputation, which performs all necessary instruction analysis, extraction, and optimization through the use of back-end instruction analysis hardware, located off the processor's critical path. For a set of memory-limited benchmarks an average speedup of 14% is achieved when constructing simple p-slices, and this gain grows to 33% when making use of aggressive optimizations.
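As a rough illustration of the slice-construction step, the sketch below walks a window of retired instructions backward from a delinquent load, keeping only the instructions that feed the load's address registers. The Instr record and the register names are hypothetical; the paper performs this analysis in back-end hardware, not software.

```python
# Toy backward-slice extraction: collect only the producers of the
# delinquent load's address registers from a retired-instruction window.
from collections import namedtuple

Instr = namedtuple("Instr", "dest srcs text")

def extract_p_slice(retired_window, delinquent_load):
    """Return the backward slice of the load's address computation."""
    live = set(delinquent_load.srcs)      # registers the load address needs
    p_slice = [delinquent_load]
    for instr in reversed(retired_window):
        if instr.dest in live:
            p_slice.append(instr)
            live.discard(instr.dest)
            live.update(instr.srcs)       # now need this instruction's inputs
    p_slice.reverse()
    return p_slice

# Example window: a pointer-chasing loop body (register names illustrative).
window = [
    Instr("r2", ("r1",), "ld   r2 <- 0[r1]     # p = p->next"),
    Instr("r3", ("r2",), "add  r3 <- r2, 16"),
]
load = Instr("r4", ("r3",), "ld   r4 <- 0[r3]     # delinquent load")
for instr in extract_p_slice(window, load):
    print(instr.text)
```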
international symposium on microarchitecture | 2002
Jamison D. Collins; Suleyman Sair; Brad Calder; Dean M. Tullsen
Data prefetching effectively reduces the negative effects of long load latencies on the performance of modern processors. Hardware prefetchers employ hardware structures to predict future memory addresses based on previous patterns. Thread-based prefetchers use portions of the actual program code to determine future load addresses for prefetching. This paper proposes the use of a pointer cache, which tracks pointer transitions, to aid prefetching. The pointer cache provides, for a given pointer's effective address, the base address of the object pointed to by the pointer. We examine using the pointer cache in a wide-issue superscalar processor as a value predictor and to aid prefetching when a chain of pointers is being traversed. When a load misses in the L1 cache, but hits in the pointer cache, the first two cache blocks of the pointed-to object are prefetched. In addition, the load's dependencies are broken by using the pointer cache hit as a value prediction. We also examine using the pointer cache to allow speculative precomputation to run farther ahead of the main thread of execution than in prior studies. Previously proposed thread-based prefetchers are limited in how far they can run ahead of the main thread when traversing a chain of recurrent dependent loads. When combined with the pointer cache, a speculative thread can make better progress ahead of the main thread, rapidly traversing data structures in the face of cache misses caused by pointer transitions.
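The sketch below illustrates the pointer-cache lookup path under assumed names (PointerCache, on_l1_load_miss, BLOCK): on an L1 miss that hits in the pointer cache, the first two blocks of the pointed-to object are prefetched and the base address is returned as a value prediction.

```python
# Illustrative pointer cache: maps a pointer's effective address to the base
# address of the object it points to. Names and block size are assumptions.
BLOCK = 64  # assumed cache-block size in bytes

class PointerCache:
    def __init__(self):
        self.table = {}                 # pointer address -> pointed-to base

    def update(self, ptr_addr, target_base):
        # Trained on stores of pointer values (pointer transitions).
        self.table[ptr_addr] = target_base

    def lookup(self, ptr_addr):
        return self.table.get(ptr_addr)


def on_l1_load_miss(ptr_addr, pointer_cache, prefetch):
    """On an L1 miss, consult the pointer cache."""
    base = pointer_cache.lookup(ptr_addr)
    if base is None:
        return None                     # no prediction available
    prefetch(base)                      # first two blocks of the object
    prefetch(base + BLOCK)
    # Return the base address as a value prediction so dependent
    # instructions (the next pointer dereference) can issue early.
    return base


pc = PointerCache()
pc.update(0x1000, 0x8000)               # pointer at 0x1000 targets object at 0x8000
prefetched = []
pred = on_l1_load_miss(0x1000, pc, prefetched.append)
print(hex(pred), [hex(a) for a in prefetched])
```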
international parallel and distributed processing symposium | 2004
Jamison D. Collins; Dean M. Tullsen
Clustering is an architectural technique that allows the design of wide superscalar processors without sacrificing cycle time, but at the cost of longer communication latencies. Simultaneous multithreading architectures effectively tolerate instruction latency, but put even more pressure on timing-critical processor resources. We show that the synergistic combination of the two techniques minimizes the IPC impact of the clustered architecture, and even permits more aggressive clustering of the processor than is possible with a single-threaded processor. Additionally, we show that multithreading enables effective instruction steering policies unavailable to a single-threaded clustered architecture. We explore the impact of aggressively clustering four complex processor structures on a simultaneous multithreading processor: (1) instruction window wakeup and functional unit bypass logic, (2) register renaming logic, (3) the fetch unit, and (4) the integer register file.
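One plausible thread-aware steering policy, sketched for illustration only and not necessarily a policy evaluated in the paper: keep each thread's instructions in a home cluster while it has capacity, and spill to the least-loaded cluster otherwise.

```python
# Hypothetical thread-aware steering for a clustered SMT core: dependence
# chains of a thread stay in its home cluster when possible, avoiding
# inter-cluster bypass latency; otherwise fall back to the least-loaded cluster.
NUM_CLUSTERS = 4

def steer(thread_id, cluster_load, capacity=32):
    """Pick a cluster for the next instruction of `thread_id`."""
    home = thread_id % NUM_CLUSTERS          # thread's preferred cluster
    if cluster_load[home] < capacity:
        return home
    # Home cluster full: accept the extra communication latency and
    # steer to the least-loaded cluster.
    return min(range(NUM_CLUSTERS), key=cluster_load.__getitem__)

load = [0] * NUM_CLUSTERS
for tid in (0, 1, 2, 3, 0, 1):
    load[steer(tid, load)] += 1
print(load)
```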
international symposium on microarchitecture | 1999
Jamison D. Collins; Dean M. Tullsen
This paper describes the Miss Classification Table, a simple mechanism that enables the processor or memory controller to identify each cache miss as either a conflict miss or a capacity (non-conflict) miss. The miss classification table works by storing part of the tag of the most recently evicted line of a cache set. If the next miss to that cache set has a matching tag, it is identified as a conflict miss. This technique correctly identifies 87% of misses in the worst case. Several applications of this information are demonstrated, including improvements to victim caching, next-line prefetching, cache exclusion, and a pseudo-associative cache. This paper also presents the Adaptive Miss Buffer (AMB), which combines several of these techniques, targeting each miss with the most appropriate optimization, all within a single small miss buffer. The AMB's combination of techniques achieves 16% better performance than any single technique alone.
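A minimal sketch of the mechanism, with an assumed partial-tag width: each set remembers part of the tag of its last victim, and a subsequent miss whose tag matches the stored partial tag is classified as a conflict miss.

```python
# Sketch of a Miss Classification Table (MCT); the 8-bit partial tag is an
# assumed parameter, not a value from the paper.
PARTIAL_TAG_BITS = 8

class MissClassificationTable:
    def __init__(self, num_sets):
        self.evicted = [None] * num_sets      # partial tag of last victim per set

    @staticmethod
    def _partial(tag):
        return tag & ((1 << PARTIAL_TAG_BITS) - 1)

    def record_eviction(self, set_index, victim_tag):
        self.evicted[set_index] = self._partial(victim_tag)

    def classify_miss(self, set_index, miss_tag):
        if self.evicted[set_index] == self._partial(miss_tag):
            return "conflict"                 # the same line bounced out recently
        return "capacity"                     # non-conflict miss


mct = MissClassificationTable(num_sets=256)
mct.record_eviction(set_index=5, victim_tag=0xABCD)
print(mct.classify_miss(5, 0xABCD))           # conflict
print(mct.classify_miss(5, 0x1234))           # capacity
```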
international symposium on microarchitecture | 2004
Jamison D. Collins; Dean M. Tullsen; Hong Wang
This paper presents a novel microarchitecture technique for accurately predicting control flow reconvergence dynamically. A reconvergence point is the earliest dynamic instruction in the program where we can expect program paths to reconverge regardless of the outcome or target of the current branch. Thus, even if the immediate control flow after a branch is uncertain, execution following the reconvergence point is certain. This paper proposes a novel hardware reconvergence predictor which is both implementable and accurate, with a 4KB predictor achieving more than 95% accuracy for SPEC INT, and larger implementations achieving greater than 99% accuracy. The information provided from reconvergence prediction can increase the effectiveness of a range of previously proposed performance optimizations, including speculative multithreading, control independence, and squash reuse. This paper also demonstrates a new technique that takes advantage of the dynamic reconvergence prediction information in order to predict a wrong path excursion ahead of branch resolution. On average, 34% of wrong path fetches are eliminated.
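As a conceptual illustration only (the abstract does not specify the predictor's internals), the toy predictor below indexes a table by branch PC and records the first PC observed on both the taken and not-taken paths as the reconvergence point.

```python
# Toy table-based reconvergence predictor; a conceptual sketch, not the
# paper's hardware design.
class ReconvergencePredictor:
    def __init__(self):
        self.table = {}                    # branch PC -> predicted reconvergence PC

    def predict(self, branch_pc):
        return self.table.get(branch_pc)   # None means no prediction yet

    def train(self, branch_pc, taken_path, not_taken_path):
        # The reconvergence point is the earliest PC reached on both paths.
        taken_set = set(taken_path)
        for pc in not_taken_path:
            if pc in taken_set:
                self.table[branch_pc] = pc
                return


rp = ReconvergencePredictor()
rp.train(branch_pc=0x40,
         taken_path=[0x44, 0x48, 0x60, 0x64],
         not_taken_path=[0x50, 0x54, 0x60, 0x64])
print(hex(rp.predict(0x40)))               # 0x60: where both paths meet
```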
ACM Transactions on Computer Systems | 2001
Jamison D. Collins; Dean M. Tullsen
This paper describes the miss classification table, a simple mechanism that enables the processor or memory controller to identify each cache miss as either a conflict miss or a capacity (non-conflict) miss. The miss classification table works by storing part of the tag of the most recently evicted line of a cache set. If the next miss to that cache set has a matching tag, it is identified as a conflict miss. This technique correctly identifies 88% of misses. Several applications of this information are demonstrated, including improvements to victim caching, next-line prefetching, cache exclusion, and a pseudo-associative cache. This paper also presents the adaptive miss buffer (AMB), which combines several of these techniques, targeting each miss with the most appropriate optimization, all within a single small miss buffer. The AMB's combination of techniques achieves 16% better performance than any single technique alone.
custom integrated circuits conference | 1999
Yiorgos Makris; Jamison D. Collins; Alex Orailoglu; Praveen Vishakantaiah
We discuss a methodology for analyzing the testability of large hierarchical RTL designs, based upon the existence of module reachability paths, suitable for automatically deriving globally applicable test from locally generated vectors. Such reachability paths utilize module transparency behavior, as captured by the introduced channel transparency definition. Lack of transparency and unreachable module I/Os pinpoint testability bottlenecks apt for efficient DFT modifications. Application of this methodology on example designs results in significant fault coverage improvement and test generation speedup, as compared to complete design gate-level ATPG.
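A conceptual sketch of reachability-path construction: if the design is viewed as a graph whose edges are transparent module channels, finding a path from the primary inputs to a module under test is a graph search. The channel encoding and module names below are illustrative, not the paper's model.

```python
# BFS over a hypothetical module graph whose edges are transparent channels;
# a missing path flags a testability bottleneck (candidate for DFT).
from collections import deque

def find_reachability_path(transparent_channels, start, target):
    """Return a chain of modules that can carry test data, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in transparent_channels.get(path[-1], ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Example design: primary inputs reach the module under test through a
# transparent mux, ALU, and register (names are illustrative).
channels = {"PI": ["mux"], "mux": ["alu"], "alu": ["reg"], "reg": ["MUT"]}
print(find_reachability_path(channels, "PI", "MUT"))
```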
Journal of Electronic Testing | 2002
Yiorgos Makris; Jamison D. Collins; Alex Orailoglu
Hierarchical approaches address the complexity of test generation through symbolic reachability paths that provide access to the I/Os of each module in a design. However, while transparency behavior suitable for symbolic design traversal can be utilized for constructing reachability paths for datapath modules, control modules do not exhibit transparency. Therefore, incorporating such modules in reachability path construction requires exhaustive search algorithms or expensive DFT hardware. In this paper, we discuss a fast hierarchical test path construction method for circuits with DFT-free controller-datapath interface. A transparency-based RT-Level hierarchical test generation scheme is devised for the datapath, wherein locally generated vectors are translated into global design test. Additionally, the controller is examined through the introduced concept of influence tables, which are used to generate valid control state sequences for testing each module through hierarchical test paths. Fault coverage and vector count levels thus attained match closely those of traditional test generation methods, while sharply reducing the corresponding computational cost and test generation time.
asian test symposium | 2000
Yiorgos Makris; Jamison D. Collins; Alex Orailoglu
We discuss a hierarchical test generation method for DFT-free controller-datapath pairs. A transparency-based scheme is devised for the datapath, wherein locally generated vectors are translated into global design test. The controller is examined through influence tables, used to generate valid control state sequences for testing each module through hierarchical test paths. Fault coverage levels and vector counts thus attained match closely those of traditional test generation methodologies, while sharply reducing the corresponding computational cost.