Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Collin McCurdy is active.

Publication


Featured research published by Collin McCurdy.


Architectural Support for Programming Languages and Operating Systems | 2010

The Scalable Heterogeneous Computing (SHOC) benchmark suite

Anthony Danalis; Gabriel Marin; Collin McCurdy; Jeremy S. Meredith; Philip C. Roth; Kyle Spafford; Vinod Tipparaju; Jeffrey S. Vetter

Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
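
A purely illustrative sketch (not SHOC source code) of the flavor of the suite's lowest, microbenchmark tier: time a simple streaming kernel and report achieved bandwidth. SHOC itself targets GPUs and multi-core CPUs through OpenCL and CUDA; the array size and repetition count below are arbitrary choices.

```c
/* Illustrative microbenchmark sketch (not SHOC code): time a streaming
 * copy and report achieved bandwidth, in the spirit of SHOC's lowest,
 * device-level tier.  Sizes are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (1L << 25)   /* 32M doubles, 256 MB per array (arbitrary) */
#define REPS 10

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        for (long i = 0; i < N; i++) b[i] = a[i];     /* copy kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gb  = (double)REPS * 2.0 * N * sizeof(double) / 1e9;  /* read + write */
    printf("copy bandwidth: %.2f GB/s (b[0]=%g)\n", gb / sec, b[0]);
    free(a); free(b);
    return 0;
}
```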


International Symposium on Performance Analysis of Systems and Software | 2010

Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms

Collin McCurdy; Jeffrey S. Vetter

Until recently, most high-end scientific applications have been immune to performance problems caused by Non-Uniform Memory Access (NUMA). However, current trends in microprocessor design are pushing NUMA to smaller and smaller scales. This paper examines the current state of NUMA and makes several contributions. First, we summarize the performance problems that NUMA can present for multi-threaded applications and describe methods of addressing them. Second, we demonstrate that NUMA can indeed be a significant problem for scientific applications, showing that it can mean the difference between an application scaling perfectly and failing to scale at all. Third, we describe, in increasing order of usefulness, three methods of using hardware performance counters to aid in finding NUMA-related problems. Finally, we introduce Memphis, a data-centric toolset that uses Instruction Based Sampling to help pinpoint problematic memory accesses, and demonstrate how we used it to improve the performance of several production-level codes - HYCOM, XGC1 and CAM - by 13%, 23% and 24% respectively.
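
One of the standard remedies alluded to above can be sketched in a few lines. The example below is not part of Memphis; it relies on Linux's first-touch page placement by initializing data with the same OpenMP decomposition as the compute loop, so pages land on the NUMA node of the threads that use them. The array size and schedule are arbitrary.

```c
/* Illustrative sketch (not Memphis itself): the classic "first touch"
 * remedy for NUMA placement problems that tools like Memphis help locate.
 * Pages are placed on the node of the thread that first writes them, so
 * initializing with the same static schedule as the compute loop keeps
 * most accesses node-local.  Compile with -fopenmp. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* ~67M doubles per array (arbitrary) */

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    if (!x || !y) return 1;

    /* First touch in parallel, same schedule as the compute loop below. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Compute loop: threads mostly touch memory on their own node. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++) sum += x[i] * y[i];

    printf("dot = %f (threads: %d)\n", sum, omp_get_max_threads());
    free(x); free(y);
    return 0;
}
```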


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Early evaluation of IBM BlueGene/P

Sadaf R. Alam; Richard Frederick Barrett; Michael H Bast; Mark R. Fahey; Jeffery A. Kuehn; Collin McCurdy; James H. Rogers; Philip C. Roth; Ramanan Sankaran; Jeffrey S. Vetter; Patrick H. Worley; Weikuan Yu

BlueGene/P (BG/P) is the second generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz with a double precision, dual pipe floating point unit per core. These chips are connected with multiple interconnection networks including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications, and as compared to other major large scale supercomputers in use today. Our investigation confirms that BG/P has good scalability with an expected lower performance per processor when compared to the Cray XT4's Opteron. We also find that BG/P uses very low power per floating point operation for certain kernels, yet it has less of a power advantage when considering science driven metrics for mission applications.
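
As a rough worked example (not from the paper), the per-node peak implied by the figures above can be computed if one assumes the dual-pipe FPU sustains two fused multiply-adds, i.e. four flops, per cycle per core.

```c
/* Back-of-the-envelope peak for one BG/P node from the abstract's figures,
 * ASSUMING the dual-pipe FPU issues two fused multiply-adds per cycle
 * (4 flops/cycle per core); the result matches the commonly quoted
 * 13.6 Gflop/s per node. */
#include <stdio.h>

int main(void) {
    const double clock_ghz       = 0.850;  /* 850 MHz PowerPC 450 cores */
    const int    cores_per_node  = 4;
    const int    flops_per_cycle = 4;      /* 2 FMAs per cycle (assumption) */
    printf("peak per node: %.1f Gflop/s\n",
           clock_ghz * cores_per_node * flops_per_cycle);
    return 0;
}
```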


International Symposium on Performance Analysis of Systems and Software | 2008

Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors

Collin McCurdy; Alan L. Cox; Jeffrey S. Vetter

The floating-point portion of the SPEC CPU suite and the HPC Challenge suite are widely recognized and utilized as benchmarks that represent scientific application behavior. In this work we show that while these benchmark suites may be representative of the cache behavior of production scientific applications, they do not accurately represent the TLB behavior of these applications. Furthermore, we demonstrate that the difference can have a significant impact on performance. In the first part of the paper we present results from implementation-independent trace-based simulations, which demonstrate that, across a range of page sizes, the benchmarks exhibit TLB behavior significantly different from that of a representative set of production applications. In the second part we validate these results on the AMD Opteron implementation of the x86 architecture, showing that false conclusions about choice of page size, drawn from benchmark performance, can result in performance degradations of up to nearly 50% for the production applications we investigated.
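
A small piece of arithmetic helps explain why the choice of page size matters so much: TLB reach is roughly the number of TLB entries times the page size. The entry count below is a hypothetical figure for illustration, not a measurement of the Opteron parts studied in the paper.

```c
/* Illustrative arithmetic only: page size dominates TLB reach.
 * The 64-entry figure is hypothetical. */
#include <stdio.h>

int main(void) {
    const long entries  = 64;                 /* hypothetical dTLB entries */
    const long small_pg = 4L * 1024;          /* 4 KB pages */
    const long large_pg = 2L * 1024 * 1024;   /* 2 MB pages */
    printf("reach with 4 KB pages: %ld KB\n", entries * small_pg / 1024);
    printf("reach with 2 MB pages: %ld MB\n", entries * large_pg / (1024 * 1024));
    /* 256 KB vs 128 MB covered without a TLB miss: an application whose
     * working set falls between the two can look fine with one page size
     * and thrash the TLB with the other, which is the kind of divergence
     * between benchmarks and production codes the paper measures. */
    return 0;
}
```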


Operating Systems Review | 2006

Kernel-level single system image for petascale computing

Hong Ong; Jeffrey S. Vetter; R. Scott Studham; Collin McCurdy; Bruce Walker; Alan L. Cox

Scientific computing users typically prefer UNIX or UNIX-like operating systems as their runtime for managing software and hardware resources. These UNIX-like systems were originally designed for a single processor as well as for a broad range of programming and usage models. Although UNIX-like systems have been successfully modified to work in SMP or NUMA configurations, their internal structures have remained largely unchanged over the years. As we move toward the era of petascale computing, these UNIX-like systems are no longer suitable. For instance, the relative cost of supporting generic usage and system services will increase by an order of magnitude and thus affect overall system performance; there are insufficient system services to globally manage parallelism, processes, and resources; and users may not see the petascale system as a single powerful machine but rather as a set of multiple independent servers. A single system image (SSI) operating system is essential for efficiently managing parallelism, resources, and processes, as well as for providing parallel-processing transparency on a system possibly equipped with hundreds of thousands of processors. However, the success of a petascale SSI operating system goes beyond technical challenges. In particular, it must look very much like normal UNIX, run unmodified software, scale incrementally, and come with built-in high-availability support. This position paper focuses on these issues and discusses the development of a petascale SSI based on an existing kernel-level SSI system, OpenSSI.


International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2013

Quantifying Architectural Requirements of Contemporary Extreme-Scale Scientific Applications

Jeffrey S. Vetter; Seyong Lee; Dong Li; Gabriel Marin; Collin McCurdy; Jeremy S. Meredith; Philip C. Roth; Kyle Spafford

As detailed in recent reports, HPC architectures will continue to change over the next decade in an effort to improve energy efficiency, reliability, and performance. At this time of significant disruption, it is critically important to understand specific application requirements, so that these architectural changes can include features that satisfy the requirements of contemporary extreme-scale scientific applications. To address this need, we have developed a methodology supported by a toolkit that allows us to investigate detailed computation, memory, and communication behaviors of applications at varying levels of resolution. Using this methodology, we performed a broad-based, detailed characterization of 12 contemporary scalable scientific applications and benchmarks. Our analysis reveals numerous behaviors that sometimes contradict conventional wisdom about scientific applications. For example, the results reveal that only one of our applications executes more floating-point instructions than other types of instructions. In another example, we found that communication topologies are very regular, even for applications that, at first glance, should be highly irregular. These observations emphasize the necessity of measurement-driven analysis of real applications, and help prioritize features that should be included in future architectures.
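
A minimal sketch of the measurement-driven style the paper advocates, using PAPI preset counters to compare floating-point and total retired instructions for a small kernel. This is not the authors' toolkit; PAPI_FP_INS is unavailable on some processors, and the kernel here is only a stand-in.

```c
/* Hedged sketch: instruction-mix measurement with PAPI preset events.
 * Link with -lpapi.  Not the toolkit described in the paper. */
#include <stdio.h>
#include <papi.h>

int main(void) {
    int evset = PAPI_NULL;
    long long counts[2];
    double a[1024], s = 0.0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    if (PAPI_create_eventset(&evset) != PAPI_OK) return 1;
    if (PAPI_add_event(evset, PAPI_TOT_INS) != PAPI_OK) return 1;
    if (PAPI_add_event(evset, PAPI_FP_INS) != PAPI_OK) return 1;

    for (int i = 0; i < 1024; i++) a[i] = i * 0.5;

    PAPI_start(evset);
    for (int i = 0; i < 1024; i++) s += a[i] * a[i];   /* kernel of interest */
    PAPI_stop(evset, counts);

    printf("total: %lld, FP: %lld (%.1f%% FP), s=%g\n",
           counts[0], counts[1], 100.0 * counts[1] / counts[0], s);
    return 0;
}
```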


International Conference on Supercomputing | 2013

Diagnosis and optimization of application prefetching performance

Gabriel Marin; Collin McCurdy; Jeffrey S. Vetter

Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, hardware prefetchers can track only a limited number of data streams due to finite hardware resources. In this paper, we introduce the term streaming concurrency to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm for understanding the streaming concurrency at any point in an application, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. Next, we try to understand the causes behind poor prefetching performance. We identify four prefetch-unfriendly conditions and show how to classify an application's memory references based on these conditions. We evaluate our analysis using the SPEC CPU2006 benchmark suite. We select two benchmarks with unfavorable access patterns and transform them to improve their prefetching effectiveness. Results show that making applications more prefetcher friendly can yield meaningful performance gains.
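
The notion of streaming concurrency can be made concrete with a small experiment that is unrelated to the paper's simulator: read the same data as NSTREAMS interleaved sequential streams and vary NSTREAMS. Once the count exceeds the number of streams the hardware prefetcher can track, time per element typically rises even though total traffic is unchanged. The sizes below are arbitrary.

```c
/* Hedged sketch of "streaming concurrency": the data is read as NSTREAMS
 * interleaved sequential streams.  Vary NSTREAMS and compare timings. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N        (1L << 21)   /* elements per stream (arbitrary) */
#define NSTREAMS 32           /* vary this and watch the timing */

static double sweep(double **s, int nstreams) {
    double sum = 0.0;
    for (long i = 0; i < N; i++)            /* round-robin across streams */
        for (int k = 0; k < nstreams; k++)
            sum += s[k][i];
    return sum;
}

int main(void) {
    double *streams[NSTREAMS];
    for (int k = 0; k < NSTREAMS; k++) {
        streams[k] = malloc(N * sizeof(double));
        if (!streams[k]) return 1;
        for (long i = 0; i < N; i++) streams[k][i] = k + i * 1e-9;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double sum = sweep(streams, NSTREAMS);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d concurrent streams: %.3f s (sum=%g)\n", NSTREAMS, sec, sum);
    for (int k = 0; k < NSTREAMS; k++) free(streams[k]);
    return 0;
}
```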


International Parallel and Distributed Processing Symposium | 2012

Efficient Quality Threshold Clustering for Parallel Architectures

Anthony Danalis; Collin McCurdy; Jeffrey S. Vetter

Quality Threshold Clustering (QTC) is an algorithm for partitioning data in fields, such as biology, where clustering of large data sets can aid scientific discovery. Unlike other clustering algorithms, QTC does not require knowing the number of clusters a priori; however, its perceived need for high computing power often makes it an unattractive choice. This paper presents a thorough study of QTC. We analyze the worst-case complexity of the algorithm and discuss methods to reduce it by trading memory for computation. We also demonstrate how the expected running time of QTC is affected by the structure of the input data. We describe how QTC can be parallelized, and discuss implementation details of our thread-parallel, GPU, and distributed-memory implementations of the algorithm. We demonstrate the efficiency of our implementations through experimental data. We show how data sets with tens of thousands of elements can be clustered in a matter of minutes on a modern GPU, and in seconds on a small-scale cluster of multi-core CPUs or multiple GPUs. Finally, we discuss how user-selected parameters, as well as algorithmic and implementation choices, affect performance.
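
To make the algorithm's structure concrete, here is a simplified sequential sketch of the QTC idea on 2-D points: grow a candidate cluster from every unassigned seed, keep the largest, and repeat. For brevity this sketch adds the first point that keeps the cluster diameter under the threshold, whereas QTC proper adds the point that minimizes the resulting diameter; none of the paper's contributions (complexity analysis, thread/GPU/distributed-memory parallelizations) are represented.

```c
/* Simplified sequential sketch of the QTC idea (2-D points, Euclidean
 * distance).  Not the paper's implementation. */
#include <stdio.h>
#include <math.h>

#define N 8
#define THRESHOLD 2.0      /* maximum allowed cluster diameter */

static const double px[N] = {0.0, 0.5, 1.0, 5.0, 5.5, 6.0, 9.0, 9.2};
static const double py[N] = {0.0, 0.2, 0.1, 5.0, 5.1, 4.9, 0.0, 0.3};
static int assigned[N];

static double dist(int a, int b) {
    return hypot(px[a] - px[b], py[a] - py[b]);
}

/* Greedily grow a candidate cluster seeded at s; record members in out[]. */
static int candidate(int s, int *out) {
    int size = 0;
    out[size++] = s;
    for (;;) {
        int best = -1;
        for (int j = 0; j < N; j++) {
            if (assigned[j]) continue;
            int in = 0, ok = 1;
            for (int k = 0; k < size; k++) {
                if (out[k] == j) in = 1;
                if (dist(j, out[k]) > THRESHOLD) ok = 0;
            }
            if (!in && ok) { best = j; break; }   /* simplification: first fit */
        }
        if (best < 0) return size;
        out[size++] = best;
    }
}

int main(void) {
    int remaining = N, members[N], best_members[N];
    while (remaining > 0) {
        int best_size = 0;
        for (int s = 0; s < N; s++) {            /* try every unassigned seed */
            if (assigned[s]) continue;
            int sz = candidate(s, members);
            if (sz > best_size) {
                best_size = sz;
                for (int k = 0; k < sz; k++) best_members[k] = members[k];
            }
        }
        printf("cluster:");
        for (int k = 0; k < best_size; k++) {    /* keep the largest candidate */
            printf(" %d", best_members[k]);
            assigned[best_members[k]] = 1;
        }
        printf("\n");
        remaining -= best_size;
    }
    return 0;
}
```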


International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2013

Characterizing the Impact of Prefetching on Scientific Application Performance

Collin McCurdy; Gabriel Marin; Jeffrey S. Vetter

In order to better understand the impact of hardware and software data prefetching on scientific application performance, this paper introduces two analysis techniques, one micro-architecture-centric and the other application-centric. We use these techniques to analyze representative full-scale production applications from five important Exascale target areas. We find that despite a great diversity in prefetching effectiveness across and even within applications, there is a strong correlation between regions where prefetching is most needed, due to high levels of memory traffic, and where it is most effective. We also observe that the application-centric analysis can explain many of the differences in prefetching effectiveness observed across the studied applications.


International Parallel and Distributed Processing Symposium | 2012

Identifying Opportunities for Byte-Addressable Non-Volatile Memory in Extreme-Scale Scientific Applications

Dong Li; Jeffrey S. Vetter; Gabriel Marin; Collin McCurdy; Cristian Cira; Zhuo Liu; Weikuan Yu

Collaboration


Dive into Collin McCurdy's collaborations.

Top Co-Authors

Jeffrey S. Vetter (Oak Ridge National Laboratory)
Gabriel Marin (Oak Ridge National Laboratory)
Jeremy S. Meredith (Oak Ridge National Laboratory)
Patrick H. Worley (Oak Ridge National Laboratory)
Philip C. Roth (Oak Ridge National Laboratory)
Dong Li (Oak Ridge National Laboratory)
Kyle Spafford (Oak Ridge National Laboratory)
Vinod Tipparaju (Oak Ridge National Laboratory)