Martin Burtscher
Texas State University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Martin Burtscher.
IEEE Design & Test of Computers | 2005
Christianto C. Liu; Ilya Ganusov; Martin Burtscher; Sandip Tiwari
Microprocessor performance has been improving at roughly 60% per year. Memory access times, however, have improved by less than 10% per year. The resulting gap between logic and memory performance has forced microprocessor designs toward complex and power-hungry architectures that support out-of-order and speculative execution. Moreover, processors have been designed with increasingly large cache hierarchies to hide main memory latency. This article examines how 3D IC technology can improve interactions between the processor and memory. Our work examines the performance of a single-core, single-threaded processor under representative work loads. We have shown that reducing memory latency by bringing main memory on chip gives us near-perfect performance. Three-dimensional IC technology can provide the much needed bandwidth without the cost, design complexity, and power issues associated with a large number of off-chip pins. The principal challenge remains the demonstration of a highly manufacturable 3D IC technology with high yield and low cost.
ieee international symposium on workload characterization | 2012
Martin Burtscher; Rupesh Nasre; Keshav Pingali
GPUs have been used to accelerate many regular applications and, more recently, irregular applications in which the control flow and memory access patterns are data-dependent and statically unpredictable. This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures. For a suite of 13 benchmarks, we find that (i) irregularity at the warp level varies widely, (ii) control-flow irregularity and memory-access irregularity are largely independent of each other, and (iii) most kernels, including regular ones, exhibit some irregularity. A programs irregularity can change between different inputs, systems, and arithmetic precision but generally stays in a specific region of the irregularity space. Whereas some highly tuned implementations of irregular algorithms exhibit little irregularity, trading off extra irregularity for better locality or less work can improve overall performance.
acm sigplan symposium on principles and practice of parallel programming | 2009
Milind Kulkarni; Martin Burtscher; Rajasekhar Inkulu; Keshav Pingali; Calin Cascaval
Irregular programs are programs organized around pointer-based data structures such as trees and graphs. Recent investigations by the Galois project have shown that many irregular programs have a generalized form of data-parallelism called amorphous data-parallelism. However, in many programs, amorphous data-parallelism cannot be uncovered using static techniques, and its exploitation requires runtime strategies such as optimistic parallel execution. This raises a natural question: how much amorphous data-parallelism actually exists in irregular programs? In this paper, we describe the design and implementation of a tool called ParaMeter that produces parallelism profiles for irregular programs. Parallelism profiles are an abstract measure of the amount of amorphous data-parallelism at different points in the execution of an algorithm, independent of implementation-dependent details such as the number of cores, cache sizes, load-balancing, etc. ParaMeter can also generate constrained parallelism profiles for a fixed number of cores. We show parallelism profiles for seven irregular applications, and explain how these profiles provide insight into the behavior of these applications.
IEEE Transactions on Computers | 2009
Martin Burtscher; Paruj Ratanaworabhan
Many scientific programs exchange large quantities of double-precision data between processing nodes and with mass storage devices. Data compression can reduce the number of bytes that need to be transferred and stored. However, data compression is only likely to be employed in high-end computing environments if it does not impede the throughput. This paper describes and evaluates FPC, a fast lossless compression algorithm for linear streams of 64-bit floating-point data. FPC works well on hard-to-compress scientific data sets and meets the throughput demands of high-performance systems. A comparison with five lossless compression schemes, BZIP2, DFCM, FSD, GZIP, and PLMI, on 4 architectures and 13 data sets shows that FPC compresses and decompresses one to two orders of magnitude faster than the other algorithms at the same geometric-mean compression ratio. Moreover, FPC provides a guaranteed throughput as long as the prediction tables fit into the L1 data cache. For example, on a 1.6-GHz Itanium 2 server, the throughput is 670 Mbytes/s regardless of what data are being compressed.
data compression conference | 2006
Paruj Ratanaworabhan; Jian Ke; Martin Burtscher
In scientific computing environments, large amounts of floating-point data often need to be transferred between computers as well as to and from storage devices. Compression can reduce the number of bits that need to be transferred and stored. However, the run-time overhead due to compression may be undesirable in high-performance settings where short communication latencies and high bandwidths are essential. This paper describes and evaluates a new compression algorithm that is tailored to such environments. It typically compresses numeric floating-point values better and faster than other algorithms do. On our data sets, it achieves compression ratios between 1.2 and 4.2 as well as compression and decompression throughputs between 2.8 and 5.9 million 64-bit double-precision numbers per second on a 3 GHz Pentium 4 machine.
GPU Computing Gems Emerald Edition | 2011
Martin Burtscher; Keshav Pingali
Publisher Summary This chapter describes the first CUDA implementation of the classical Barnes Hut n-body algorithm that runs entirely on the GPU. The Barnes Hut force-calculation algorithm is widely used in n-body simulations such as modeling the motion of galaxies. It hierarchically decomposes the space around the bodies into successively smaller boxes, called cells, and computes summary information for the bodies contained in each cell, allowing the algorithm to quickly approximate the forces (e.g., gravitational, electric, or magnetic) that the n bodies induce upon each other. The Barnes Hut algorithm is challenging to implement efficiently in CUDA because it repeatedly builds and traverses an irregular tree-based data structure, it performs a lot of pointer-chasing memory operations, and it is typically expressed recursively. GPU-specific operations such as thread-voting functions are exploited to greatly improve performance and make use of fence instructions to implement lightweight synchronization without atomic operations. The main conclusion of the work is that GPUs can be used to accelerate irregular codes, not just regular codes. However, a great deal of programming effort is required to achieve good performance. The biggest performance win, though, came from turning some of the unique architectural features of GPUs, which are often regarded as performance hurdles for irregular codes, into assets. Future direction indicates workings on writing high-performing CUDA implementations for other irregular algorithms.
ieee international conference on high performance computing data and analytics | 2010
Martin Burtscher; Byoung-Do Kim; Jeffrey R. Diamond; John D. McCalpin; Lars Koesterke; James C. Browne
HPC systems are notorious for operating at a small fraction of their peak performance, and the ongoing migration to multi-core and multi-socket compute nodes further complicates performance optimization. The readily available performance evaluation tools require considerable effort to learn and utilize. Hence, most HPC application writers do not use them. As remedy, we have developed PerfExpert, a tool that combines a simple user interface with a sophisticated analysis engine to detect probable core, socket, and node-level performance bottlenecks in each important procedure and loop of an application. For each bottle-neck, PerfExpert provides a concise performance assessment and suggests steps that can be taken by the programmer to improve performance. These steps include compiler switches and optimization strategies with code examples. We have applied PerfExpert to several HPC production codes on the Ranger supercomputer. In all cases, it correctly identified the critical code sections and provided accurate assessments of their performance.
acm sigplan symposium on principles and practice of parallel programming | 2012
Mario Méndez-Lojo; Martin Burtscher; Keshav Pingali
Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components. In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores. Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.
measurement and modeling of computer systems | 2004
Martin Burtscher
Trace files are widely used in research and academia to study the behavior of programs. They are simple to process and guarantee repeatability. Unfortunately, they tend to be very large. This paper describes vpc3, a fundamentally new approach to compressing program traces. Vpc3 employs value predictors to bring out and amplify patterns in the traces so that conventional compressors can compress them more effectively. In fact, our approach not only results in much higher compression rates but also provides faster compression and decompression. For example, compared to bzip2, vpc3s geometric mean compression rate on SPECcpu2000 store address traces is 18.4 times higher, compression is ten times faster, and decompression is three times faster.
acm sigplan symposium on principles and practice of parallel programming | 2009
Paruj Ratanaworabhan; Martin Burtscher; Darko Kirovski; Benjamin G. Zorn; Rahul Nagpal; Karthik Pattabiraman
This paper introduces ToleRace, a runtime system that allows programs to detect and even tolerate asymmetric data races. Asymmetric races are race conditions where one thread correctly acquires and releases a lock for a shared variable while another thread improperly accesses the same variable. ToleRace provides approximate isolation in the critical sections of lock-based parallel programs by creating a local copy of each shared variable when entering a critical section, operating on the local copies, and propagating the appropriate copies upon leaving the critical section. We start by characterizing all possible interleavings that can cause races and precisely describe the effect of ToleRace in each case. Then, we study the theoretical aspects of an oracle that knows exactly what type of interleaving has occurred. Finally, we present two software implementations of ToleRace and evaluate them on multithreaded applications from the SPLASH2 and PARSEC suites. Our implementation on top of a dynamic instrumentation tool, which works directly on executables and requires no source code modifications, incurs an overhead of a factor of two on average. Manually adding ToleRace to the source code of these applications results in an average overhead of 6.4 percent.