Michael LeBeane
University of Texas at Austin
Publications
Featured research published by Michael LeBeane.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2015
Michael LeBeane; Shuang Song; Reena Panda; Jee Ho Ryoo; Lizy Kurian John
Large-scale graph analytics is an important class of problem in the modern data center. However, while data centers are trending towards a large number of heterogeneous processing nodes, graph analytics frameworks still operate under the assumption of uniform compute resources. In this paper, we develop heterogeneity-aware data ingress strategies for graph analytics workloads using the popular PowerGraph framework. We illustrate how simple estimates of relative node computational throughput can guide heterogeneity-aware data partitioning algorithms to provide balanced graph cutting decisions. Our work enhances five online data ingress strategies from a variety of sources to optimize application execution for throughput differences in heterogeneous data centers. The proposed partitioning algorithms improve the runtime of several popular machine learning and data mining applications by as much as 65% and on average by 32% compared to the default, balanced partitioning approaches.
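The abstract describes the idea at a high level only. As a rough, hypothetical sketch (not the authors' PowerGraph implementation), the snippet below assigns edges to cluster nodes in proportion to an estimated relative throughput instead of splitting them evenly; the node names and throughput weights are made up for illustration.

```python
# Illustrative sketch: assign edges to cluster nodes in proportion to their
# estimated relative throughput, instead of assuming uniform compute resources.

def heterogeneity_aware_assign(edges, relative_throughput):
    """edges: list of (src, dst); relative_throughput: {node_id: weight}."""
    total = sum(relative_throughput.values())
    # Target share of edges for each node, scaled by its throughput estimate.
    capacity = {n: w / total * len(edges) for n, w in relative_throughput.items()}
    load = {n: 0 for n in relative_throughput}
    assignment = {}
    for edge in edges:
        # Greedily pick the node with the most remaining weighted headroom.
        node = max(load, key=lambda n: capacity[n] - load[n])
        assignment[edge] = node
        load[node] += 1
    return assignment

if __name__ == "__main__":
    edges = [(i, (i * 7 + 3) % 100) for i in range(1000)]
    # e.g., "gpu0" is estimated to be twice as fast as the two CPU nodes.
    assignment = heterogeneity_aware_assign(edges, {"gpu0": 2.0, "cpu0": 1.0, "cpu1": 1.0})
    counts = {}
    for n in assignment.values():
        counts[n] = counts.get(n, 0) + 1
    print(counts)  # roughly 500 / 250 / 250 edges per node
```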
Symposium on Computer Architecture and High Performance Computing | 2015
Reena Panda; Christopher Erb; Michael LeBeane; Jee Ho Ryoo; Lizy Kurian John
The big data revolution has created an unprecedented demand for intelligent data management solutions on a large scale. While data management has traditionally been used as a synonym for relational data processing, in recent years a new group popularly known as NoSQL databases has emerged as a competitive alternative. There is a pressing need to gain greater understanding of the characteristics of modern databases to architect targeted computers. In this paper, we investigate four popular NoSQL/SQL-style databases and evaluate their hardware performance on modern computer systems. Based on data collected from real hardware, we evaluate how efficiently modern databases utilize the underlying systems and make several recommendations to improve their performance efficiency. We observe that the performance of modern databases is severely limited by poor cache/memory performance. Nonetheless, we demonstrate that dynamic execution techniques are still effective in hiding a significant fraction of the stalls, thereby improving performance. We further show that NoSQL databases suffer from greater performance inefficiencies than their SQL counterparts. SQL databases outperform NoSQL databases for most operations and are beaten by NoSQL databases only in a few cases. NoSQL databases provide a promising competitive alternative to SQL-style databases; however, they have yet to be optimized to fully reach the performance of contemporary SQL systems. We also show that significant diversity exists among different database implementations, and big-data benchmark designers can leverage our analysis to incorporate representative workloads that encapsulate the full spectrum of data-serving applications. In this paper, we also compare data-serving applications with other popular benchmarks such as SPEC CPU2006 and SPECjbb2005.
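To make the counter-based characterization concrete, here is a minimal sketch of the kind of analysis described (not the authors' toolchain): estimating how much of a database's execution is stalled on the cache/memory hierarchy from raw hardware event counts. The event names and miss penalties are placeholders, not measured values.

```python
# Illustrative stall-breakdown sketch; penalty values are assumptions.
MISS_PENALTY_CYCLES = {"l2_miss": 12, "llc_miss": 40, "dram_access": 200}

def stall_breakdown(counters, total_cycles):
    """counters: {event_name: count}; returns fraction of cycles per stall source."""
    breakdown = {}
    for event, penalty in MISS_PENALTY_CYCLES.items():
        breakdown[event] = counters.get(event, 0) * penalty / total_cycles
    breakdown["other/compute"] = max(0.0, 1.0 - sum(breakdown.values()))
    return breakdown

if __name__ == "__main__":
    sample = {"l2_miss": 4.0e8, "llc_miss": 1.5e8, "dram_access": 2.0e7}
    for source, share in stall_breakdown(sample, total_cycles=2.0e10).items():
        print(f"{source:>14}: {share:6.1%}")
```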
international conference on parallel processing | 2016
Shuang Song; Meng Li; Xinnian Zheng; Michael LeBeane; Jee Ho Ryoo; Reena Panda; Andreas Gerstlauer; Lizy Kurian John
Big data decision-making techniques take advantage of large-scale data to extract important insights from them. One of the most important classes of such techniques falls in the domain of graph applications, where data segments and their inherent relationships are represented as vertices and edges. Efficiently processing large-scale graphs involves many subtle tradeoffs and is still regarded as an open-ended problem. Furthermore, as modern data centers move towards increased heterogeneity, the traditional assumption of homogeneous environments in current graph processing frameworks is no longer valid. Prior work estimates the graph processing power of heterogeneous machines by simply reading hardware configurations, which leads to suboptimal load balancing. In this paper, we propose a profiling methodology leveraging synthetic graphs for capturing a node's computational capability and guiding graph partitioning in heterogeneous environments with minimal overhead. We show that by sampling the execution of applications on synthetic graphs following a power-law distribution, the computing capabilities of heterogeneous clusters can be captured accurately (<10% error). Our proxy-guided graph processing system results in a maximum speedup of 1.84x and 1.45x over a default system and prior work, respectively. On average, it achieves 17.9% performance improvement and 14.6% energy reduction as compared to prior heterogeneity-aware work.
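The following is a conceptual sketch of proxy-guided profiling under stated assumptions (it is not the paper's framework): generate a small synthetic graph with a power-law degree distribution, time a representative kernel on each node, and turn the inverse runtimes into relative-capability weights for partitioning.

```python
import time
import numpy as np

def synthetic_power_law_graph(n_vertices, exponent=2.0, seed=0):
    # Zipf-distributed out-degrees give the power-law structure used for proxying.
    rng = np.random.default_rng(seed)
    degrees = np.minimum(rng.zipf(exponent, n_vertices), n_vertices - 1)
    return [(v, int(d)) for v in range(n_vertices)
            for d in rng.integers(0, n_vertices, degrees[v])]

def profile_kernel(edges, n_vertices, iterations=5):
    """Time a PageRank-style sweep as a stand-in for the real workload."""
    rank = np.full(n_vertices, 1.0 / n_vertices)
    src = np.array([e[0] for e in edges])
    dst = np.array([e[1] for e in edges])
    start = time.perf_counter()
    for _ in range(iterations):
        contrib = np.zeros(n_vertices)
        np.add.at(contrib, dst, rank[src])
        rank = 0.15 / n_vertices + 0.85 * contrib
    return time.perf_counter() - start

if __name__ == "__main__":
    edges = synthetic_power_law_graph(10_000)
    # In a cluster, each node would run profile_kernel locally; the inverse
    # runtimes then become the throughput weights fed to the partitioner.
    t = profile_kernel(edges, 10_000)
    print(f"proxy kernel time: {t:.3f}s  ->  capability weight ~ {1.0 / t:.2f}")
```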
Symposium on Computer Architecture and High Performance Computing | 2015
Michael LeBeane; Jee Ho Ryoo; Reena Panda; Lizy Kurian John
Extensive research has focused on estimating power to guide advances in power management schemes and to mitigate thermal hot spots and voltage noise. However, simulated power models are slow and struggle with deep software stacks, while direct measurements are typically coarse-grained. This paper introduces Watt Watcher, a multicore power measurement framework that offers fine-grained functional unit breakdowns. Watt Watcher operates by passing event counts and a hardware descriptor file into configurable back-end power models based on McPAT. Researchers and vendors can add other processors to our tool by mapping to the Watt Watcher interface. We show that Watt Watcher, when calibrated, has a MAPE (mean absolute percentage error) of 2.67% aggregated over all benchmarks when compared to measured power consumption on SPEC CPU 2006 and multithreaded PARSEC benchmarks across three different machines of various form factors and manufacturing processes. We present two use cases showing how Watt Watcher can derive insights that are difficult to obtain through other measurement infrastructures. Additionally, we illustrate how Watt Watcher can be used to provide insights into challenging big data and cloud workloads on a server CPU. Through the use of Watt Watcher, it is possible to obtain a detailed power breakdown on real hardware without vendor proprietary models or hardware instrumentation.
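A minimal sketch of a counter-driven power model in the same spirit appears below. The energy coefficients and event names are invented for illustration; the real framework feeds event counts into McPAT-based back-end models rather than fixed per-event energies.

```python
# Calibrated energy per event, in nanojoules (placeholder values).
ENERGY_NJ = {"fp_ops": 0.9, "int_ops": 0.4, "l1_accesses": 0.6,
             "l2_accesses": 2.5, "dram_accesses": 20.0}
STATIC_WATTS = 12.0  # leakage + uncore, assumed constant here

def estimate_power(event_counts, interval_s):
    """Per-functional-unit power breakdown (watts) for one sample interval."""
    dynamic = {ev: ENERGY_NJ[ev] * count * 1e-9 / interval_s
               for ev, count in event_counts.items() if ev in ENERGY_NJ}
    return {"static": STATIC_WATTS, **dynamic,
            "total": STATIC_WATTS + sum(dynamic.values())}

if __name__ == "__main__":
    sample = {"fp_ops": 3e9, "int_ops": 8e9, "l1_accesses": 5e9,
              "l2_accesses": 4e8, "dram_accesses": 6e7}
    for unit, watts in estimate_power(sample, interval_s=1.0).items():
        print(f"{unit:>13}: {watts:7.2f} W")
```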
IEEE International Conference on High Performance Computing, Data, and Analytics | 2017
Michael LeBeane; Khaled Hamidouche; Brad Benton; Mauricio Breternitz; Steven K. Reinhardt; Lizy Kurian John
GPUs are widespread across clusters of compute nodes due to their attractive performance for data parallel codes. However, communicating between GPUs across the cluster is cumbersome when compared to CPU networking implementations. A number of recent works have enabled GPUs to more naturally access the network, but suffer from performance problems, require hidden CPU helper threads, or restrict communications to kernel boundaries. In this paper, we propose GPU Triggered Networking, a novel, GPU-centric networking approach which leverages the best of CPUs and GPUs. In this model, CPUs create and stage network messages and GPUs trigger the network interface when data is ready to send. GPU Triggered Networking decouples these two operations, thereby removing the CPU from the critical path. We illustrate how this approach can provide up to 25% speedup compared to standard GPU networking across microbenchmarks, a Jacobi stencil, an important MPI collective operation, and machine-learning workloads.
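The CPU-stage / GPU-trigger split can be pictured with the rough model below. It is plain Python standing in for real NIC command queues and GPU kernels; names such as `Descriptor` and the doorbell queue are illustrative only, not the paper's interfaces.

```python
import threading, queue, time

class Descriptor:
    """A network command staged by the CPU ahead of time."""
    def __init__(self, dest, buffer_id, nbytes):
        self.dest, self.buffer_id, self.nbytes = dest, buffer_id, nbytes

doorbell = queue.Queue()          # stands in for a triggered-op queue on the NIC

def cpu_thread(descriptors):
    # The CPU creates and stages messages up front, then gets out of the way.
    for d in descriptors:
        doorbell.put(("staged", d))

def gpu_kernel(num_chunks):
    # When a chunk of results is ready, the GPU rings the doorbell directly,
    # so the CPU is not on the critical path of the send.
    for chunk in range(num_chunks):
        time.sleep(0.01)          # pretend compute
        doorbell.put(("trigger", chunk))

def nic_thread(num_chunks):
    staged, pending, sent = {}, [], 0
    while sent < num_chunks:
        kind, payload = doorbell.get()
        if kind == "staged":
            staged[payload.buffer_id] = payload
        else:                      # a GPU trigger releases the matching send
            pending.append(payload)
        while pending and pending[0] in staged:
            d = staged.pop(pending.pop(0))
            print(f"send {d.nbytes}B of buffer {d.buffer_id} to node {d.dest}")
            sent += 1

if __name__ == "__main__":
    descs = [Descriptor(dest=1, buffer_id=i, nbytes=4096) for i in range(4)]
    threads = [threading.Thread(target=cpu_thread, args=(descs,)),
               threading.Thread(target=gpu_kernel, args=(4,)),
               threading.Thread(target=nic_thread, args=(4,))]
    for t in threads: t.start()
    for t in threads: t.join()
```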
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2016
Reena Panda; Xinnian Zheng; Shuang Song; Jee Ho Ryoo; Michael LeBeane; Andreas Gerstlauer; Lizy Kurian John
Fast and efficient design space exploration is a critical requirement for designing computer systems; however, the growing complexity of hardware/software systems and the significantly long run-times of detailed simulators often make it challenging. Machine learning (ML) models have been proposed as popular alternatives that enable fast exploratory studies. The accuracy of any ML model depends heavily on the representativeness of the applications used for training the predictive models. While prior studies have used standard benchmarks or hand-tuned micro-benchmarks to train their predictive models, in this paper, we argue that this is often sub-optimal because of their limited coverage of the program state-space and their inability to be representative of the larger suite of real-world applications. In order to overcome challenges in creating representative training sets, we propose Genesys, an automatic workload generation methodology and framework, which builds upon key low-level application characteristics and enables systematic generation of applications covering a broad range of the program behavior state-space without increasing the training time. We demonstrate that the automatically generated training sets improve upon the state-space coverage provided by applications from popular benchmarking suites like SPEC CPU2006, MiBench, MediaBench, and TPC-H by over 11x and improve the accuracy of two machine learning based power and performance prediction systems by over 2.5x and 3.6x, respectively.
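An illustrative sketch of the general idea (not the Genesys framework itself) is shown below: sample low-level program characteristics over broad ranges and emit parameter sets from which synthetic training kernels could be generated. The characteristic names and ranges are assumptions for demonstration.

```python
import numpy as np

CHARACTERISTICS = {
    "memory_footprint_kb": (64, 262144),      # sampled log-uniformly
    "branch_taken_rate":   (0.05, 0.95),
    "ilp_window":          (1, 16),
    "load_to_alu_ratio":   (0.1, 2.0),
}

def sample_workloads(n, seed=0):
    rng = np.random.default_rng(seed)
    workloads = []
    for _ in range(n):
        w = {}
        for name, (lo, hi) in CHARACTERISTICS.items():
            if name == "memory_footprint_kb":
                w[name] = float(np.exp(rng.uniform(np.log(lo), np.log(hi))))
            else:
                w[name] = float(rng.uniform(lo, hi))
        workloads.append(w)
    return workloads

if __name__ == "__main__":
    # Each sampled point would drive generation of one synthetic benchmark,
    # spreading training data across the program-behavior state space.
    for w in sample_workloads(3):
        print(w)
```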
International Conference on Parallel Architectures and Compilation Techniques | 2018
Michael LeBeane; Khaled Hamidouche; Brad Benton; Mauricio Breternitz; Steven K. Reinhardt; Lizy Kurian John
The current state of the art in GPU networking advocates a host-centric model that reduces performance and increases code complexity. Recently, researchers have explored several techniques for networking within a GPU kernel itself. These approaches, however, suffer from high latency, waste energy on the host, and are not scalable with larger/more GPUs on a node. In this work, we introduce Command Processor Networking (ComP-Net), which leverages the availability of scalar cores integrated on the GPU itself to provide high-performance intra-kernel networking. ComP-Net enables efficient synchronization between the Command Processors and Compute Units on the GPU through a line locking scheme implemented in the GPU's shared last-level cache. We illustrate that ComP-Net can improve application performance by up to 20% and provide up to a 50% reduction in energy consumption versus competing networking techniques across a Jacobi stencil, an allreduce collective, and machine learning applications.
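A rough conceptual model of the command-processor/compute-unit split follows (pure Python, not the GPU runtime): compute units enqueue network requests into a shared queue and keep computing, while a scalar command-processor thread drains the queue and performs the sends. The plain lock here merely stands in for the cache-line locking scheme in the paper.

```python
import threading
from collections import deque

work_queue = deque()
queue_lock = threading.Lock()
done = threading.Event()

def compute_unit(cu_id, n_msgs):
    # Compute units hand off network work instead of blocking on the host.
    for i in range(n_msgs):
        payload = f"cu{cu_id}-msg{i}"
        with queue_lock:                  # stand-in for LLC line locking
            work_queue.append(payload)

def command_processor():
    # The scalar core polls the shared queue and issues the network operations.
    while not (done.is_set() and not work_queue):
        with queue_lock:
            item = work_queue.popleft() if work_queue else None
        if item is not None:
            print(f"command processor sends: {item}")

if __name__ == "__main__":
    cp = threading.Thread(target=command_processor)
    cus = [threading.Thread(target=compute_unit, args=(i, 3)) for i in range(4)]
    cp.start()
    for t in cus: t.start()
    for t in cus: t.join()
    done.set()
    cp.join()
```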
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Michael LeBeane; Brandon Potter; Abhisek Pan; Alexandru Dutu; Vinay Agarwala; Wonchan Lee; Deepak Majeti; Bibek Ghimire; Eric Van Tassell; Samuel Wasmundt; Brad Benton; Mauricio Breternitz; Michael L. Chu; Mithuna Thottethodi; Lizy Kurian John; Steven K. Reinhardt
Accelerators have emerged as an important component of modern cloud, datacenter, and HPC computing environments. However, launching tasks on remote accelerators across a network remains unwieldy, forcing programmers to send data in large chunks to amortize the transfer and launch overhead. By combining advances in intra-node accelerator unification with one-sided Remote Direct Memory Access (RDMA) communication primitives, it is possible to efficiently implement lightweight tasking across distributed-memory systems. This paper introduces Extended Task Queuing (XTQ), an RDMA-based active messaging mechanism for accelerators in distributed-memory systems. XTQ's direct NIC-to-accelerator communication decreases inter-node GPU task launch latency by 10–15% for small-to-medium sized messages and ameliorates CPU message servicing overheads. These benefits are shown in the context of MPI accumulate, reduce, and allreduce operations with up to 64 nodes. Finally, we illustrate how XTQ can improve the performance of popular deep learning workloads implemented in the Computational Network Toolkit (CNTK).
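The sketch below is a schematic illustration of an RDMA-style active message that carries everything a remote accelerator needs to launch a task, so the receiving NIC can append it to the device's task queue without waking the host CPU. The field names and layout are hypothetical, not the XTQ wire format.

```python
import struct
from collections import deque

# kernel_id, grid_size, arg_buffer_offset, arg_bytes (assumed fields)
TASK_DESCRIPTOR = struct.Struct("<IIQI")

def build_active_message(kernel_id, grid_size, arg_offset, args: bytes):
    header = TASK_DESCRIPTOR.pack(kernel_id, grid_size, arg_offset, len(args))
    return header + args

class RemoteAcceleratorQueue:
    """Models the device-visible task queue the NIC writes into directly."""
    def __init__(self):
        self.tasks = deque()

    def on_rdma_receive(self, message: bytes):
        kernel_id, grid, offset, nbytes = TASK_DESCRIPTOR.unpack_from(message)
        args = message[TASK_DESCRIPTOR.size:TASK_DESCRIPTOR.size + nbytes]
        # No host interrupt or helper thread: the descriptor goes straight
        # onto the accelerator's queue.
        self.tasks.append((kernel_id, grid, offset, args))

if __name__ == "__main__":
    q = RemoteAcceleratorQueue()
    msg = build_active_message(kernel_id=7, grid_size=1024,
                               arg_offset=0x1000, args=b"\x01\x02\x03\x04")
    q.on_rdma_receive(msg)
    print(q.tasks)
```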
International Conference on Parallel Processing | 2015
Jee Ho Ryoo; Saddam Jamil Quirem; Michael LeBeane; Reena Panda; Shuang Song; Lizy Kurian John
Recently, GPGPUs have positioned themselves in the mainstream processor arena with their potential to perform a massive number of jobs in parallel. At the same time, many GPGPU benchmark suites have been proposed to evaluate the performance of GPGPUs. Both academia and industry have been introducing new sets of benchmarks each year, while some already published benchmarks have been updated periodically. However, some benchmark suites contain benchmarks that are duplicates of each other or use the same underlying algorithm. This results in an excess of workloads in the same performance spectrum. In this paper, we provide a methodology to obtain a set of new GPGPU benchmarks that are located in the unexplored region of the performance spectrum. Our proposal uses statistical methods to understand the performance spectrum coverage and uniqueness of existing benchmark suites. We then show techniques to identify areas that are not explored by existing benchmarks by visually showing the performance spectrum coverage. Finding unique key metrics for future benchmarks to broaden their performance spectrum coverage is also explored using hierarchical clustering and ranking by Hotelling's T2 method. Finally, key metrics are categorized into GPGPU performance related components to show how future benchmarks can stress each of the categorized metrics to distinguish themselves in the performance spectrum. Our methodology can serve as a performance spectrum oriented guidebook for designing future GPGPU benchmarks.
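Below is a sketch of this kind of statistical redundancy analysis on synthetic data (the metrics and benchmarks are invented, not the paper's): normalize per-benchmark metric vectors, cluster them hierarchically to find near-duplicates, and rank benchmarks by a Hotelling's T^2 score to highlight the most unusual ones.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# rows = benchmarks, columns = microarchitecture-independent metrics
metrics = rng.normal(size=(12, 6))
metrics[3] = metrics[2] + rng.normal(scale=0.01, size=6)   # a near-duplicate pair

# z-score normalization so no single metric dominates the distance computation
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)

# Hierarchical clustering: benchmarks in the same cluster cover similar
# regions of the performance spectrum.
clusters = fcluster(linkage(z, method="ward"), t=4, criterion="maxclust")
print("cluster assignment:", clusters)

# Hotelling's T^2 per benchmark: distance from the suite's centroid scaled by
# the metric covariance; high scores mark benchmarks with unique behavior.
cov_inv = np.linalg.pinv(np.cov(z, rowvar=False))
t2 = np.einsum("ij,jk,ik->i", z, cov_inv, z)
print("most unique benchmarks:", np.argsort(t2)[::-1][:3])
```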
IEEE International Symposium on Workload Characterization | 2014
Jee Ho Ryoo; Michael LeBeane; Muhammad Iqbal; Lizy Kurian John
With massive amounts of information on the web, cloud applications are rapidly emerging as one of the mainstream domains in modern computing, yet very little is known about their behavior. To our knowledge, this paper presents the first detailed study of control flow behavior in cloud workloads. We characterize the branch predictability behavior of cloud and big data benchmarks, and compare it against that of widely known CPU workloads based on profiling and simulation. Our in-depth branch analysis of these workloads presents striking differences in terms of a higher prevalence of indirect branches, larger offsets in branch targets, an abundance of multi-target branches, and low BTB hit rates. We identify performance bottlenecks involving branch predictability and provide suggestions that can be incorporated in future datacenter-oriented processor designs. We apply Principal Component Analysis and clustering techniques to understand the similarity/dissimilarity between cloud and CPU workloads.
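As a toy model of one measurement in such a study (a simple direct-mapped BTB simulated over a synthetic branch trace; the trace shape and table size are assumptions, not the paper's data), the sketch below shows how multi-target indirect branches pull the BTB hit rate down in a way single-target branches do not.

```python
import random

def simulate_btb(trace, entries=1024):
    """Direct-mapped BTB: predict the last-seen target for each indexed entry."""
    btb = {}                                # index -> (tag, predicted target)
    hits = 0
    for pc, target in trace:
        index, tag = pc % entries, pc // entries
        if btb.get(index) == (tag, target):
            hits += 1
        btb[index] = (tag, target)
    return hits / len(trace)

def synthetic_trace(n, n_branches, targets_per_branch):
    """Random branch PCs; each dynamic instance picks one of k possible targets."""
    rng = random.Random(0)
    pcs = [rng.randrange(1 << 20) for _ in range(n_branches)]
    trace = []
    for _ in range(n):
        pc = rng.choice(pcs)
        trace.append((pc, pc + rng.randrange(1, targets_per_branch + 1)))
    return trace

if __name__ == "__main__":
    print("single-target branches :", simulate_btb(synthetic_trace(100_000, 500, 1)))
    print("multi-target (indirect):", simulate_btb(synthetic_trace(100_000, 500, 16)))
```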