Livio Soares
University of Toronto
Publication
Featured research published by Livio Soares.
Architectural Support for Programming Languages and Operating Systems | 2009
David K. Tam; Reza Azimi; Livio Soares; Michael Stumm
Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software, and consequently their usage for online optimizations has been limited. To address this problem, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), followed by 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.
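At its core, RapidMRC applies classical Mattson stack-distance analysis to a trace of L2 access addresses sampled by the performance monitoring unit. The following is a minimal sketch of that analysis in C, assuming a simple array-based LRU stack; the constants and names are illustrative, not the paper's implementation.

```c
/* Sketch: Mattson-style LRU stack-distance profiling over a sampled
 * address trace. An access's stack distance is the number of distinct
 * cache lines touched since it was last accessed; the histogram of
 * distances yields the MRC for any cache size. Constants and names
 * are illustrative. */
#include <string.h>

#define LINE_SHIFT 7        /* assume 128-byte cache lines */
#define MAX_DEPTH  8192     /* track stack distances up to this depth */

static unsigned long stack[MAX_DEPTH];      /* LRU stack of line addresses */
static unsigned long depth;
static unsigned long hist[MAX_DEPTH + 1];   /* hist[MAX_DEPTH]: deeper/cold misses */

/* Record one sampled access: find its stack distance, bump the
 * histogram, and move the line to the top of the stack. */
static void record_access(unsigned long addr)
{
    unsigned long line = addr >> LINE_SHIFT;

    for (unsigned long i = 0; i < depth; i++) {
        if (stack[i] == line) {             /* hit at stack distance i */
            hist[i]++;
            memmove(&stack[1], &stack[0], i * sizeof(stack[0]));
            stack[0] = line;
            return;
        }
    }
    hist[MAX_DEPTH]++;                      /* first touch, or deeper than tracked */
    if (depth < MAX_DEPTH)
        depth++;
    memmove(&stack[1], &stack[0], (depth - 1) * sizeof(stack[0]));
    stack[0] = line;
}

/* Miss rate for a cache of n_lines lines: the fraction of accesses
 * whose stack distance is >= n_lines. Evaluating this for each
 * candidate size traces out the MRC. */
static double miss_rate(unsigned long n_lines, unsigned long total_accesses)
{
    unsigned long misses = hist[MAX_DEPTH];
    for (unsigned long i = n_lines; i < MAX_DEPTH; i++)
        misses += hist[i];
    return (double)misses / total_accesses;
}
```

The linear scan per access is exactly why full software MRC tracking has been considered too expensive for online use; RapidMRC bounds the cost by running the analysis only over a short sampled trace.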
International Symposium on Microarchitecture | 2008
Livio Soares; David K. Tam; Michael Stumm
It is well recognized that LRU cache-line replacement can be ineffective for applications with large working sets or non-localized memory access patterns. Specifically, in last-level processor caches, LRU can cause cache pollution by inserting non-reusable elements into the cache while evicting reusable ones. The work presented in this paper addresses last-level cache pollution through a dynamic operating system mechanism, called ROCS, requiring no change to underlying hardware and no change to applications. ROCS employs hardware performance counters on a commodity processor to characterize application cache behavior at run-time. Using this online profiling, cache-unfriendly pages are dynamically mapped to a pollute buffer in the cache, eliminating competition between reusable and non-reusable cache lines. The operating system implements the pollute buffer through a page-coloring based technique, by dedicating a small slice of the last-level cache to store non-reusable pages. Measurements show that ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3 GHz PowerPC 970FX, improves performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and 16% on average.
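The pollute buffer relies on page coloring: the physical-address bits that select a cache set and lie above the page offset define a page's color, and all pages of one color compete for the same slice of the cache. Below is a minimal sketch of the arithmetic, with an assumed cache geometry (not the 970FX's actual parameters).

```c
/* Sketch: page-coloring arithmetic behind a pollute buffer. A physical
 * page's color is given by the cache set-index bits above the page
 * offset; every line of a color-c page lands in the same 1/NUM_COLORS
 * slice of the cache. Geometry below is assumed, not the 970FX's. */
#include <stdint.h>

#define PAGE_SHIFT 12                           /* 4 KB pages */
#define CACHE_SIZE (1024 * 1024)                /* assume 1 MB last-level cache */
#define WAYS       8
/* Colors = cache size per way, measured in pages. */
#define NUM_COLORS (CACHE_SIZE / (WAYS * (1 << PAGE_SHIFT)))   /* = 32 here */

static inline unsigned page_color(uint64_t phys_addr)
{
    return (phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1);
}

/* Dedicate one color as the pollute buffer: when online profiling
 * flags a page as cache-unfriendly, the OS remaps it to a free
 * physical frame of this color, so its lines can only evict each
 * other rather than reusable lines. */
#define POLLUTE_COLOR 0

static inline int frame_in_pollute_buffer(uint64_t phys_addr)
{
    return page_color(phys_addr) == POLLUTE_COLOR;
}
```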
Operating Systems Review | 2009
Reza Azimi; David K. Tam; Livio Soares; Michael Stumm
Multicore processors have hardware characteristics that differ from those of previous-generation single-core systems and traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27% improvement in IPC (instructions per cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level performance improvement.
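For case study (1), a standard way to turn per-application MRCs into partition sizes is greedy marginal-gain allocation: repeatedly give one cache unit to whichever application's MRC predicts the largest miss reduction from it. The sketch below illustrates this idea; it is not necessarily the paper's exact algorithm, and all names are illustrative.

```c
/* Sketch: greedy marginal-gain cache partitioning from online MRCs.
 * mrc[a][s] = predicted misses for application a when given s cache
 * units (colors or ways). Names and sizes are illustrative. */
#define APPS  2
#define UNITS 16    /* cache divided into 16 allocatable units */

void size_partitions(const double mrc[APPS][UNITS + 1], int alloc[APPS])
{
    for (int a = 0; a < APPS; a++)
        alloc[a] = 0;

    /* Hand out units one at a time, each to the application whose MRC
     * predicts the largest miss reduction from one more unit. */
    for (int remaining = UNITS; remaining > 0; remaining--) {
        int best = 0;
        double best_gain = -1.0;
        for (int a = 0; a < APPS; a++) {
            if (alloc[a] == UNITS)
                continue;
            double gain = mrc[a][alloc[a]] - mrc[a][alloc[a] + 1];
            if (gain > best_gain) {
                best_gain = gain;
                best = a;
            }
        }
        alloc[best]++;
    }
}
```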
ACM Transactions on Computer Systems | 2007
Jonathan Appavoo; Dilma Da Silva; Orran Krieger; Marc A. Auslander; Michal Ostrowski; Bryan S. Rosenburg; Amos Waterland; Robert W. Wisniewski; Jimi Xenidis; Michael Stumm; Livio Soares
Designing and implementing system software so that it scales well on shared-memory multiprocessors (SMMPs) has proven to be surprisingly challenging. To improve scalability, most designers to date have focused on concurrency by iteratively eliminating the need for locks and reducing lock contention. However, our experience indicates that locality is just as important, if not more so, and that focusing on locality ultimately leads to a more scalable system. In this paper, we describe a methodology and a framework for constructing system software structured for locality, exploiting techniques similar to those used in distributed systems. Specifically, we found two techniques to be effective in improving scalability of SMMP operating systems: (i) an object-oriented structure that minimizes sharing by providing a natural mapping from independent requests to independent code paths and data structures, and (ii) the selective partitioning, distribution, and replication of object implementations in order to improve locality. We describe concrete examples of distributed objects and our experience implementing them. We demonstrate that the distributed implementations improve the scalability of operating-system-intensive parallel workloads.
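The textbook example of such a distributed object is a counter with one representative per processor: increments touch only local, unshared state, and only reads aggregate across representatives. The sketch below captures that structure under simplifying assumptions (static per-CPU array, fixed CPU count); the real framework reaches representatives through per-processor translation tables, which is omitted here.

```c
/* Sketch: a distributed (per-processor) counter object. The hot path
 * updates only a local, cache-line-padded representative, so it shares
 * nothing with other CPUs; reads walk all representatives. */
#include <stdatomic.h>

#define NCPUS     64
#define CACHELINE 128

struct counter_rep {
    atomic_long value;
    char pad[CACHELINE - sizeof(atomic_long)];   /* avoid false sharing */
};

static _Alignas(CACHELINE) struct counter_rep reps[NCPUS];

/* Hot path: strictly local, so throughput scales with processor count. */
static inline void counter_inc(int cpu)
{
    atomic_fetch_add_explicit(&reps[cpu].value, 1, memory_order_relaxed);
}

/* Cold path: aggregate across all representatives. */
static long counter_read(void)
{
    long sum = 0;
    for (int i = 0; i < NCPUS; i++)
        sum += atomic_load_explicit(&reps[i].value, memory_order_relaxed);
    return sum;
}
```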
International Symposium on Memory Management | 2007
Reza Azimi; Livio Soares; Michael Stumm; Thomas Walsh; Angela Demke Brown
Traditionally, operating systems use a coarse approximation of memory accesses to implement memory management algorithms by monitoring page faults or scanning page table entries. With finer-grained memory access information, however, the operating system can manage memory much more effectively. Previous work has proposed the use of a software mechanism based on virtual page protection and soft faults to track page accesses at finer granularity. In this paper, we show that while this approach is effective for some applications, for many others it results in an unacceptably high overhead. We propose simple Page Access Tracking Hardware (PATH) to provide accurate page access information to the operating system. The suggested hardware support is generic and can be used by various memory management algorithms. We show how the information generated by PATH can be used to implement (i) adaptive page replacement policies, (ii) smart process memory allocation to improve performance or to provide isolation and better process prioritization, and (iii) effective prefetching of virtual memory pages when applications have non-trivial memory access patterns. Our simulation results show that these algorithms can dramatically improve performance (up to 500%) with PATH-provided information, especially when the system is under memory pressure. We show that the software overhead of processing PATH information is less than 6% across the applications we examined (less than 3% in all but two applications), which is at least an order of magnitude lower than the overhead of the software-based approaches.
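As a concrete illustration of use (i), fine-grained access records make an exact LRU ordering of resident pages cheap to maintain in software. The sketch below drains a PATH-style log into an LRU list; the hardware interface is modeled loosely on the paper's description, and all names and sizes are hypothetical.

```c
/* Sketch: draining a PATH-style log of page accesses into an exact LRU
 * ordering of resident pages. Assumes the hardware appends accessed
 * page numbers to a memory-resident buffer and raises an interrupt
 * when it nears full; names, sizes, and the pre-linked LRU list are
 * hypothetical. */
#include <stdint.h>
#include <stddef.h>

#define LOG_ENTRIES 4096
#define NPAGES      (1u << 20)

static uint64_t access_log[LOG_ENTRIES];    /* filled by the hardware */

/* Doubly linked LRU list over page numbers; assumes every resident
 * page (numbered < NPAGES) is already linked in. head = most recent. */
static struct { uint32_t prev, next; } lru[NPAGES];
static uint32_t lru_head, lru_tail;

static void lru_touch(uint32_t page)
{
    if (page == lru_head)
        return;
    /* Unlink from current position. */
    lru[lru[page].prev].next = lru[page].next;
    if (page == lru_tail)
        lru_tail = lru[page].prev;
    else
        lru[lru[page].next].prev = lru[page].prev;
    /* Relink at the head. */
    lru[page].next = lru_head;
    lru[lru_head].prev = page;
    lru_head = page;
}

/* Called from the log-threshold interrupt; this drain is the software
 * overhead the paper measures at under 6%. */
static void path_drain(size_t n_records)
{
    for (size_t i = 0; i < n_records; i++)
        lru_touch((uint32_t)access_log[i]);
}

/* A replacement policy evicts the tail (least recently used) page. */
static uint32_t choose_victim(void) { return lru_tail; }
```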
European Conference on Parallel Processing | 2007
Robert W. Wisniewski; Reza Azimi; Mathieu Desnoyers; Maged M. Michael; José E. Moreira; Doron Shiloach; Livio Soares
Clusters of loosely connected machines are becoming an important model for commercial computing. The cost/performance ratio makes these scale-out solutions an attractive platform for a class of computational needs. The work we describe in this paper focuses on understanding performance when using a scale-out environment to run commercial workloads. We describe the novel scale-out environment we configured and the workload we ran on it. We explain the unique performance challenges faced in such an environment and the tools we applied and improved for this environment to address the challenges. We present data from the tools that proved useful in optimizing performance on our system. We discuss the lessons we learned applying and modifying existing tools to a commercial scale-out environment, and offer insights into making future performance tools effective in this environment.
Operating Systems Design and Implementation | 2010
Livio Soares; Michael Stumm
USENIX Annual Technical Conference | 2011
Livio Soares; Michael Stumm
Hot Topics in Operating Systems | 2011
Jeffrey C. Mogul; Andrew Baumann; Timothy Roscoe; Livio Soares
Archive | 2014
Livio Soares; Michael Stumm