Michael Stumm
University of Toronto
Publications
Featured research published by Michael Stumm.
European Conference on Computer Systems | 2007
David K. Tam; Reza Azimi; Michael Stumm
The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared-memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations and, as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses. In this paper we describe the design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processors. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way POWER5 SMP-CMP-SMT multiprocessor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.
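As a rough illustration of the kind of clustering this enables, the sketch below folds each thread's PMU-sampled cache-line addresses into a bit-vector signature and measures pairwise overlap; threads with strongly overlapping signatures are candidates for placement on the same chip. This is a sketch under assumed details (signature size, hash, thread traces are all hypothetical), not the authors' implementation, which detects sharing from sampled remote-cache-access addresses rather than a synthetic trace.

```c
#include <stdint.h>
#include <stdio.h>

#define SIG_BITS  256                 /* signature width (hypothetical) */
#define SIG_WORDS (SIG_BITS / 64)

typedef struct { uint64_t bits[SIG_WORDS]; } sig_t;

/* Record one sampled cache-line address in a thread's signature. */
static void sig_add(sig_t *s, uint64_t addr) {
    unsigned h = (unsigned)((addr >> 6) * 2654435761u) % SIG_BITS;  /* hash line */
    s->bits[h / 64] |= 1ULL << (h % 64);
}

/* Jaccard-style similarity: lines touched by both / lines touched by either. */
static double sig_sim(const sig_t *a, const sig_t *b) {
    int inter = 0, uni = 0;
    for (int i = 0; i < SIG_WORDS; i++) {
        inter += __builtin_popcountll(a->bits[i] & b->bits[i]);
        uni   += __builtin_popcountll(a->bits[i] | b->bits[i]);
    }
    return uni ? (double)inter / uni : 0.0;
}

int main(void) {
    sig_t t0 = {{0}}, t1 = {{0}}, t2 = {{0}};
    for (uint64_t l = 0; l < 64; l++) {
        sig_add(&t0, l << 6);              /* t0 and t1 share lines 16..63 */
        sig_add(&t1, (l + 16) << 6);
        sig_add(&t2, (l + 1000) << 6);     /* t2 touches disjoint lines */
    }
    printf("sim(t0,t1)=%.2f  sim(t0,t2)=%.2f\n",
           sig_sim(&t0, &t1), sig_sim(&t0, &t2));
    return 0;
}
```

A scheduler would feed similarities like these into its migration decisions, co-locating thread pairs above some threshold.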
Architectural Support for Programming Languages and Operating Systems | 2009
David K. Tam; Reza Azimi; Livio Soares; Michael Stumm
Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software, and consequently their use in online optimizations has been limited. To address this, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.
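MRCs for LRU caches are classically derived from reuse (stack) distances via Mattson's stack algorithm. The sketch below builds an MRC from a toy trace of cache-line accesses; it illustrates the underlying algorithm only and is not RapidMRC itself, which obtains its access trace from the PMU rather than from instrumentation.

```c
#include <stdio.h>

#define MAX_LINES 4096            /* distinct cache lines tracked (toy bound) */

/* Mattson's stack algorithm: keep an LRU stack of cache lines; the stack
 * distance of an access is its current depth, and any LRU cache holding at
 * least that many lines would hit. hist[0] counts cold (compulsory) misses,
 * hist[d] counts reuses at stack distance d. */
static long lru[MAX_LINES];
static int  depth = 0;
static long hist[MAX_LINES + 1];

static void access_line(long line) {
    int d;
    for (d = 0; d < depth && lru[d] != line; d++)
        ;
    if (d < depth) {
        hist[d + 1]++;                    /* hit at stack distance d+1 */
    } else {
        hist[0]++;                        /* first touch: cold miss */
        if (depth < MAX_LINES) depth++;
        d = depth - 1;
    }
    for (int i = d; i > 0; i--)           /* move line to MRU position */
        lru[i] = lru[i - 1];
    lru[0] = line;
}

int main(void) {
    /* Toy trace: a loop over 8 hot lines plus one streaming line per pass. */
    long trace[200];
    int n = 0;
    for (int r = 0; r < 20; r++) {
        for (long l = 0; l < 8; l++) trace[n++] = l;
        trace[n++] = 100 + r;             /* touched once, never reused */
    }
    for (int i = 0; i < n; i++) access_line(trace[i]);

    /* MRC point for cache size c = fraction of accesses with distance > c. */
    for (int c = 1; c <= 16; c *= 2) {
        long misses = hist[0];
        for (int d = c + 1; d <= MAX_LINES; d++) misses += hist[d];
        printf("cache=%2d lines  miss rate=%.2f\n", c, (double)misses / n);
    }
    return 0;
}
```

The output shows the characteristic cliff: the miss rate stays near 1.0 until the cache covers the hot working set, then drops to just the streaming misses.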
Custom Integrated Circuits Conference | 1992
D.G. Elliott; W.M. Snelgrove; Michael Stumm
Computational RAM (CoRAM) is conventional RAM with SIMD processors added to the sense amplifiers. These bit-serial, externally programmed processors add only a small amount of area to the chip, and in a 32-Mbyte memory they have an aggregate performance of 13 billion 32-bit operations per second. The chips are extendible and completely software programmable. In this paper we describe (1) the CoRAM architecture, (2) a working 8-Kbit prototype, (3) a full-scale CoRAM designed in a 4-Mbit DRAM process, and (4) CoRAM applications.
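To make "bit-serial SIMD at the sense amplifiers" concrete, here is a small software model of the style of computation (an illustration only, not the CoRAM circuit): 64 one-bit processing elements, modeled as the lanes of a 64-bit word, add two 8-bit operands one bit-plane per cycle, each lane keeping a one-bit carry.

```c
#include <stdint.h>
#include <stdio.h>

#define BITS 8

int main(void) {
    /* a_plane[b] holds bit b of operand A for all 64 lanes (the DRAM-row
     * layout a processor-per-column design operates on). */
    uint64_t a_plane[BITS], b_plane[BITS], s_plane[BITS];
    uint8_t a[64], b[64];

    for (int i = 0; i < 64; i++) { a[i] = (uint8_t)(i * 3); b[i] = (uint8_t)(200 - i); }

    /* Transpose operands into bit-planes. */
    for (int bit = 0; bit < BITS; bit++) {
        a_plane[bit] = b_plane[bit] = 0;
        for (int i = 0; i < 64; i++) {
            a_plane[bit] |= (uint64_t)((a[i] >> bit) & 1) << i;
            b_plane[bit] |= (uint64_t)((b[i] >> bit) & 1) << i;
        }
    }

    /* One cycle per bit: a full adder evaluated in all 64 lanes at once. */
    uint64_t carry = 0;
    for (int bit = 0; bit < BITS; bit++) {
        uint64_t x = a_plane[bit], y = b_plane[bit];
        s_plane[bit] = x ^ y ^ carry;
        carry = (x & y) | (carry & (x ^ y));
    }

    /* Check one lane against ordinary arithmetic (mod 256). */
    int lane = 10;
    uint8_t s = 0;
    for (int bit = 0; bit < BITS; bit++)
        s |= (uint8_t)(((s_plane[bit] >> lane) & 1) << bit);
    printf("lane %d: %u + %u = %u (expect %u)\n",
           lane, a[lane], b[lane], s, (uint8_t)(a[lane] + b[lane]));
    return 0;
}
```

The point of the architecture is that the "word" here is an entire memory row, so a single bit-serial pass touches thousands of lanes at once.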
IEEE Transactions on Parallel and Distributed Systems | 1992
Songnian Zhou; Michael Stumm; Kai Li; David Wortman
The design, implementation, and performance of heterogeneous distributed shared memory (HDSM) are studied. A prototype HDSM system that integrates very different types of hosts has been developed, and a number of applications of this system are reported. Experience shows that despite a number of difficulties in data conversion, HDSM is implementable with minimal loss in functional and performance transparency when compared to homogeneous DSM systems.
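One of the data-conversion difficulties alluded to above can be illustrated simply: a shared page migrating between hosts of opposite byte order must be converted word by word. The sketch below assumes the page is known to hold 32-bit integers; discovering what types a page actually holds is precisely the hard part in a real HDSM.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Byte-swap a DSM page of 32-bit words for a host of opposite endianness.
 * Assumes (hypothetically) that type information says the page is int32s. */
static void swap_page_int32(uint32_t *page, size_t nwords) {
    for (size_t i = 0; i < nwords; i++) {
        uint32_t v = page[i];
        page[i] = (v >> 24) | ((v >> 8) & 0x0000FF00u)
                | ((v << 8) & 0x00FF0000u) | (v << 24);
    }
}

int main(void) {
    uint32_t page[4] = { 0x11223344u, 0xDEADBEEFu, 1u, 0xFFFFFFFFu };
    swap_page_int32(page, 4);
    printf("%08X %08X %08X %08X\n",
           (unsigned)page[0], (unsigned)page[1],
           (unsigned)page[2], (unsigned)page[3]);
    return 0;
}
```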
IEEE Computer | 1991
Zvonko G. Vranesic; Michael Stumm; David Lewis; Ron White
The architecture of the Hector multiprocessor, which exploits current microprocessor technology to produce a machine with a good cost/performance tradeoff, is described. A key design feature of Hector is its interconnection backplane, which can accommodate future technology because it uses simple hardware with short critical paths in logic circuits and short lines in the interconnection network. The system is reliable and flexible and can be realized at a relatively low cost. The hierarchical structure results in a fast backplane and a bandwidth that increases linearly with the number of processors. Hector scales efficiently to larger sizes and faster processors.
International Conference on Parallel Processing | 1993
Hui Li; Sudarsan Tandri; Michael Stumm; Kenneth C. Sevcik
An important issue in the parallel execution of loops is how to partition and schedule the loops onto the available processors. While most existing dynamic scheduling algorithms manage load imbalance well, they fail to take locality into account and therefore perform poorly on parallel systems with non-uniform memory access times.
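A common remedy in this line of work is affinity-style scheduling: iterations are pre-partitioned onto the processors whose local memory holds their data, and a processor steals remote chunks only once its own partition is exhausted. The sketch below is a single-threaded simulation of that policy under made-up parameters, not the algorithm proposed in the paper.

```c
#include <stdio.h>

#define P     4                  /* processors (hypothetical) */
#define N     1000               /* loop iterations */
#define CHUNK 16

static int next_[P], end_[P];    /* remaining [next_, end_) per partition */

/* Take one chunk for processor p; returns the partition it came from,
 * or -1 when no work remains anywhere. */
static int grab(int p, int *lo, int *hi) {
    if (next_[p] < end_[p]) {                         /* local chunk first */
        *lo = next_[p];
        *hi = next_[p] = (next_[p] + CHUNK < end_[p]) ? next_[p] + CHUNK
                                                      : end_[p];
        return p;
    }
    int victim = -1, most = 0;                        /* steal from most-loaded */
    for (int q = 0; q < P; q++)
        if (end_[q] - next_[q] > most) { most = end_[q] - next_[q]; victim = q; }
    return (victim < 0) ? -1 : grab(victim, lo, hi);
}

int main(void) {
    for (int p = 0; p < P; p++) { next_[p] = p * N / P; end_[p] = (p + 1) * N / P; }
    int local[P] = {0}, remote[P] = {0}, lo, hi, done = 0;
    while (!done) {
        done = 1;
        for (int p = 0; p < P; p++) {
            int rounds = (p == 0) ? 3 : 1;            /* cpu0 runs 3x faster */
            for (int r = 0; r < rounds; r++) {
                int home = grab(p, &lo, &hi);
                if (home < 0) continue;
                done = 0;
                if (home == p) local[p]++; else remote[p]++;
            }
        }
    }
    for (int p = 0; p < P; p++)
        printf("cpu%d: %d local chunks, %d stolen\n", p, local[p], remote[p]);
    return 0;
}
```

The fast processor finishes its local partition and then steals, so load stays balanced while the slower processors keep working almost entirely out of local memory.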
International Symposium on Microarchitecture | 2008
Livio Soares; David K. Tam; Michael Stumm
It is well recognized that LRU cache-line replacement can be ineffective for applications with large working sets or non-localized memory access patterns. Specifically, in last-level processor caches, LRU can cause cache pollution by inserting non-reusable elements into the cache while evicting reusable ones. The work presented in this paper addresses last-level cache pollution through a dynamic operating system mechanism, called ROCS, requiring no change to underlying hardware and no change to applications. ROCS employs hardware performance counters on a commodity processor to characterize application cache behavior at run-time. Using this online profiling, cache-unfriendly pages are dynamically mapped to a pollute buffer in the cache, eliminating competition between reusable and non-reusable cache lines. The operating system implements the pollute buffer through a page-coloring based technique, by dedicating a small slice of the last-level cache to store non-reusable pages. Measurements show that ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3 GHz PowerPC 970FX, improves performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and 16% on average.
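The page-coloring mechanism behind the pollute buffer is easy to sketch: in a physically indexed cache, the set-index bits above the page offset partition physical pages into "colors," so reserving one color yields a small, isolated cache slice. The geometry below is illustrative, not the 970FX's actual L2 parameters.

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE (1u << 20)                 /* 1 MB L2 (assumed) */
#define WAYS       8
#define LINE       128
#define PAGE       4096

#define COLORS (CACHE_SIZE / WAYS / PAGE)     /* 32 colors here */

/* The color of a physical page: which slice of cache sets it maps to. */
static unsigned page_color(uint64_t phys) {
    return (unsigned)((phys / PAGE) % COLORS);
}

int main(void) {
    const unsigned POLLUTE_COLOR = COLORS - 1;
    /* An OS would back pages flagged as cache-unfriendly only with frames
     * of POLLUTE_COLOR, confining their lines to 1/32 of the cache. */
    for (uint64_t pfn = 28; pfn < 36; pfn++) {
        uint64_t addr = pfn * PAGE;
        printf("frame %#llx -> color %2u%s\n",
               (unsigned long long)addr, page_color(addr),
               page_color(addr) == POLLUTE_COLOR ? "  [pollute buffer]" : "");
    }
    return 0;
}
```

Because coloring is purely a physical-frame allocation policy, the mechanism needs no hardware support, which is the point of the paper's software-only approach.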
International Conference on Supercomputing | 2005
Reza Azimi; Michael Stumm; Robert W. Wisniewski
Hardware performance counters (HPCs) are increasingly being used to analyze performance and identify the causes of performance bottlenecks. However, HPCs are difficult to use for several reasons. Microprocessors do not provide enough counters to simultaneously monitor the many different types of events needed to form an overall understanding of performance. Moreover, HPCs primarily count low-level microarchitectural events from which it is difficult to extract the high-level insight required for identifying causes of performance problems. We describe two techniques that help overcome these difficulties, allowing HPCs to be used in dynamic real-time optimizers. First, statistical sampling is used to dynamically multiplex HPCs and make a larger set of logical HPCs available. Using real programs, we show experimentally that it is possible through this sampling to obtain counts of hardware events that are statistically similar (within 15%) to complete non-sampled counts, thus allowing us to provide a much larger set of logical HPCs. Second, we observe that stall cycles are a primary source of inefficiency, and hence they should be major targets for software optimization. Based on this observation, we build a simple model in real-time that speculatively associates each stall cycle with the processor component that likely caused the stall. The information needed to produce this model is obtained using our HPC multiplexing facility to monitor a large number of hardware components simultaneously. Our analysis shows that even in an out-of-order superscalar microprocessor such a speculative approach yields a fairly accurate model, with run-time overhead for collection and computation of under 2%. These results demonstrate that we can effectively analyze the online performance of application and system code running at full speed. The stall analysis shows where performance is being lost on a given processor.
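The multiplexing idea fits in a few lines: M logical events rotate through K physical counters, and each event's observed count is scaled up by the inverse of its residency fraction. The sketch below illustrates the idea only (names and parameters are hypothetical, not the authors' kernel code); with the constant event rates used here the recovery is exact, while real, time-varying rates make it an estimate, which is where the paper's ~15% figure comes from.

```c
#include <stdio.h>

#define M      8        /* logical events wanted */
#define K      2        /* physical counters available (assumed) */
#define ROUNDS 1000     /* scheduling intervals */

int main(void) {
    double rate[M], counted[M] = {0};
    long resident[M] = {0};
    for (int e = 0; e < M; e++) rate[e] = 100.0 * (e + 1);

    for (long t = 0; t < ROUNDS; t++) {
        for (int c = 0; c < K; c++) {
            int e = (int)((t * K + c) % M);    /* round-robin assignment */
            counted[e] += rate[e];             /* events seen this interval */
            resident[e]++;                     /* intervals spent on a counter */
        }
    }
    for (int e = 0; e < M; e++) {
        /* Scale by total intervals / resident intervals. */
        double est = counted[e] * ROUNDS / resident[e];
        printf("event %d: estimated %8.0f  true %8.0f\n",
               e, est, rate[e] * ROUNDS);
    }
    return 0;
}
```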
The Journal of Supercomputing | 1995
Ronald C. Unrau; Orran Krieger; Benjamin Gamsa; Michael Stumm
We introduce the concept of hierarchical clustering as a way to structure shared-memory multiprocessor operating systems for scalability. The concept is based on clustering and hierarchical system design. Hierarchical clustering leads to a modular system, composed of easy-to-design and efficient building blocks. The resulting structure is scalable because it 1) maximizes locality, which is key to good performance in NUMA (non-uniform memory access) systems, and 2) provides for concurrency that increases linearly with the number of processors. At the same time, there is tight coupling within a cluster, so the system performs well for local interactions that are expected to constitute the common case. A clustered system can easily be adapted to different hardware configurations and architectures by changing the size of the clusters. We show how this structuring technique is applied to the design of a microkernel-based operating system called Hurricane. This prototype system is the first complete and running implementation of its kind and demonstrates the feasibility of a hierarchically clustered system. We present performance results based on the prototype, demonstrating the characteristics and behavior of a clustered system. In particular, we show how clustering trades off the efficiencies of tight coupling for the advantages of replication, increased locality, and decreased lock contention.
Operating Systems Review | 2009
Reza Azimi; David K. Tam; Livio Soares; Michael Stumm
Multicore processors have hardware characteristics that differ from those of previous-generation single-core systems and traditional SMP (symmetric multiprocessing) multiprocessors. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27% improvement in IPC (instructions-per-cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level improvement.
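Case study (1) reduces to a small search once per-application MRCs are in hand: pick the split of the shared cache that minimizes combined misses. The curves below are fabricated for illustration; the paper derives real ones online (see RapidMRC above).

```c
#include <stdio.h>

#define UNITS 16   /* cache allocation granularity, e.g. page colors */

int main(void) {
    /* missA[u] = app A's miss cost with u cache units (fabricated curves). */
    double missA[UNITS + 1], missB[UNITS + 1];
    for (int u = 0; u <= UNITS; u++) {
        missA[u] = 1000.0 / (u + 1);          /* cache-sensitive app */
        missB[u] = 400.0 - 5.0 * u;           /* mostly streaming app */
    }
    /* Exhaustive search over splits: A gets u units, B gets the rest. */
    int best = 0; double bestTotal = 1e18;
    for (int u = 0; u <= UNITS; u++) {
        double total = missA[u] + missB[UNITS - u];
        if (total < bestTotal) { bestTotal = total; best = u; }
    }
    printf("give app A %d/%d units, app B %d/%d (total misses %.1f)\n",
           best, UNITS, UNITS - best, UNITS, bestTotal);
    return 0;
}
```

The search correctly gives most of the cache to the cache-sensitive application, since the streaming application gains little from extra capacity.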