Reena Panda
University of Texas at Austin
Publications
Featured research published by Reena Panda.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2015
Michael LeBeane; Shuang Song; Reena Panda; Jee Ho Ryoo; Lizy Kurian John
Large-scale graph analytics is an important class of problem in the modern data center. However, while data centers are trending towards large numbers of heterogeneous processing nodes, graph analytics frameworks still operate under the assumption of uniform compute resources. In this paper, we develop heterogeneity-aware data ingress strategies for graph analytics workloads using the popular PowerGraph framework. We illustrate how simple estimates of relative node computational throughput can guide heterogeneity-aware data partitioning algorithms to produce balanced graph-cutting decisions. Our work enhances five online data ingress strategies from a variety of sources to optimize application execution for throughput differences in heterogeneous data centers. The proposed partitioning algorithms improve the runtime of several popular machine learning and data mining applications by as much as 65%, and by 32% on average, compared to the default balanced partitioning approaches.
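The core ingress idea lends itself to a compact sketch: give each compute node a target load share proportional to its estimated relative throughput, then place edges greedily on the node that is least loaded relative to that share. The minimal illustration below assumes simple per-node throughput estimates and a plain greedy placement, not PowerGraph's actual ingress heuristics.

```python
# Illustrative heterogeneity-aware greedy edge placement (not the
# paper's exact PowerGraph ingress code). Faster nodes get a larger
# target share of edges, so the cut stays balanced in execution time
# rather than in raw edge counts.

def partition_edges(edges, throughputs):
    """edges: list of (src, dst); throughputs: relative node speeds."""
    total = sum(throughputs)
    shares = [t / total for t in throughputs]     # target load fraction
    loads = [0] * len(throughputs)
    assignment = {}
    for edge in edges:
        # Choose the node with the smallest load relative to its share.
        node = min(range(len(loads)), key=lambda i: loads[i] / shares[i])
        assignment[edge] = node
        loads[node] += 1
    return assignment

# Example: node 1 is estimated to be twice as fast as nodes 0 and 2.
edges = [(a, b) for a in range(20) for b in range(a + 1, 20)]
assignment = partition_edges(edges, throughputs=[1.0, 2.0, 1.0])
```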
International Performance Computing and Communications Conference | 2014
Reena Panda; Lizy Kurian John
The performance of modern computer systems depends greatly on the wide range of workloads that run on them. Computer designers and researchers therefore need a representative set of workloads, covering the different classes of real-world applications, for processor design-space evaluation studies. While many benchmark suites are available, a few common ones such as the SPEC CPU2006 benchmarks are widely used, whether for ease of setup or because of simulation time constraints. However, because popular benchmarks such as SPEC CPU2006 do not capture the characteristics of the wide variety of emerging real-world applications, using them as the basis for performance evaluation may lead to suboptimal designs or misleading results. In this paper, we characterize the behavior of data analytics workloads, an important class of emerging applications, and perform a systematic similarity analysis with the popular SPEC CPU2006 and SPECjbb2013 benchmark suites. To characterize the workloads, we use hardware performance counter based measurements and a variety of extracted microarchitecture-independent workload characteristics. We then apply statistical data analysis techniques, namely principal component analysis and clustering, to analyze the similarity and dissimilarity among these classes of applications. We demonstrate the inherent differences between the characteristics of the different application classes and show how to arrive at meaningful benchmark subsets, enabling faster and more accurate targeted early hardware performance evaluation.
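The analysis pipeline described above maps naturally onto standard statistical tooling. The sketch below shows the shape of that flow with scikit-learn, using a random placeholder feature matrix in place of the paper's measured hardware-counter and microarchitecture-independent characteristics; the component and cluster counts are likewise illustrative.

```python
# Sketch of the similarity-analysis flow: normalize per-workload
# characteristics, reduce dimensionality with PCA, cluster, and pick
# one representative benchmark per cluster to form a smaller subset.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((30, 12))   # 30 workloads x 12 characteristics (stub)

X = StandardScaler().fit_transform(features)      # zero mean, unit variance
pcs = PCA(n_components=4).fit_transform(X)        # keep top components
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pcs)

# One representative per cluster yields a smaller, still-diverse subset.
subset = {c: int(np.where(labels == c)[0][0]) for c in set(labels)}
print(subset)
```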
Symposium on Computer Architecture and High Performance Computing | 2015
Reena Panda; Christopher Erb; Michael LeBeane; Jee Ho Ryoo; Lizy Kurian John
The big data revolution has created an unprecedented demand for intelligent data management solutions at large scale. While data management has traditionally been a synonym for relational data processing, in recent years a new group of systems, popularly known as NoSQL databases, has emerged as a competitive alternative. There is a pressing need to better understand the characteristics of modern databases in order to architect targeted computer systems. In this paper, we investigate four popular NoSQL/SQL-style databases and evaluate their hardware performance on modern computer systems. Based on data collected from real hardware, we evaluate how efficiently modern databases utilize the underlying systems and make several recommendations to improve their performance efficiency. We observe that the performance of modern databases is severely limited by poor cache/memory performance. Nonetheless, we demonstrate that dynamic execution techniques are still effective at hiding a significant fraction of the stalls, thereby improving performance. We further show that NoSQL databases suffer from greater performance inefficiencies than their SQL counterparts: SQL databases outperform NoSQL databases for most operations and are beaten by NoSQL databases in only a few cases. NoSQL databases provide a promising competitive alternative to SQL-style databases, but they have yet to be optimized to fully reach the performance of contemporary SQL systems. We also show that significant diversity exists among different database implementations; big-data benchmark designers can leverage our analysis to incorporate representative workloads that encapsulate the full spectrum of data-serving applications. Finally, we compare data-serving applications with other popular benchmarks such as SPEC CPU2006 and SPECjbb2005.
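As a rough illustration of the kind of counter-based analysis involved, the fraction of cycles lost to memory stalls can be estimated directly from raw counter readings. The counter names and the simple additive model below are assumptions for illustration, not the paper's measurement methodology.

```python
# Toy stall-fraction estimate from hardware counter readings
# (hypothetical counter names and values).

counters = {
    "cycles": 12_500_000_000,
    "instructions": 9_000_000_000,
    "stall_cycles_mem": 6_200_000_000,   # cycles stalled on cache/memory
}

ipc = counters["instructions"] / counters["cycles"]
mem_stall_frac = counters["stall_cycles_mem"] / counters["cycles"]
print(f"IPC = {ipc:.2f}, memory-stall fraction = {mem_stall_frac:.0%}")
```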
IEEE Computer Architecture Letters | 2012
Reena Panda; Paul V. Gratz; Daniel A. Jiménez
Computer architecture is beset by two opposing trends. Technology scaling and deep pipelining have led to high memory access latencies; meanwhile, power and energy considerations have revived interest in traditional in-order processors. In-order processors, unlike their superscalar counterparts, do not allow execution to continue around data cache misses; they therefore suffer a greater performance penalty in light of today's high memory access latencies. Memory prefetching is an established technique for reducing the incidence of cache misses and improving performance. In this paper, we introduce B-Fetch, a new data prefetching technique that combines branch-prediction-based deep-path lookahead speculation with effective address speculation to efficiently improve performance in in-order processors. Our results show that B-Fetch improves performance by 38.8% on the SPEC CPU2006 benchmarks, beating a current state-of-the-art prefetcher design at roughly one-third the hardware overhead.
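The mechanism can be pictured as follows: walk the branch predictor's predicted path a few branches ahead of the in-order pipeline, speculate the effective addresses of loads on that path from current register values, and issue prefetches for them. This Python sketch is a behavioral simplification under assumed interfaces (`predict_branch`, `loads_on_path`), not B-Fetch's hardware design.

```python
# Simplified branch-directed lookahead prefetching: follow the predicted
# path `depth` branches deep and speculate load addresses along it.

def lookahead_prefetch(pc, predict_branch, loads_on_path, reg_file, depth=3):
    """predict_branch(pc) -> next pc; loads_on_path(pc) -> [(base_reg, offset)]."""
    prefetches = []
    for _ in range(depth):
        pc = predict_branch(pc)              # speculate down the predicted path
        for base_reg, offset in loads_on_path(pc):
            # Effective address speculation from the current register value.
            prefetches.append(reg_file[base_reg] + offset)
    return prefetches

# Toy driver with stand-in prediction and decode functions.
regs = {"r1": 0x1000}
addrs = lookahead_prefetch(
    pc=0x400,
    predict_branch=lambda pc: pc + 0x40,
    loads_on_path=lambda pc: [("r1", pc & 0xFF)],
    reg_file=regs,
)
print([hex(a) for a in addrs])
```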
International Symposium on Performance Analysis of Systems and Software | 2017
Reena Panda; Xinnian Zheng; Lizy Kurian John
With the increasing memory footprints and working-set sizes of emerging workloads, system designers need to evaluate new memory hierarchies with large last-level caches (LLCs), DRAM caches, large DRAMs, etc., to optimize performance gains. This requires a deep understanding of the memory access behavior of the target workloads, and it is important to have accurate mechanisms for generating address streams to study memory access behavior at and beyond the LLC. Prior memory trace generation proposals such as WEST and STM utilize LRU stack distance to capture temporal locality in data access streams; STM additionally captures spatial locality by modeling stride-based access patterns. However, a key drawback of these prior models is the large amount of metadata they must store to capture locality. In this paper, we propose an efficient, lightweight methodology to generate accurate traces for modeling address Streams for LLC And Beyond (SLAB). SLAB leverages the key insight that memory access patterns can be efficiently characterized by combining locality and reuse statistics captured from both the instruction and address streams. Compared to prior studies, which capture patterns based solely on data addresses, using the additional instruction-localized information significantly reduces the space complexity. For programs where dominant instruction-localized patterns do not exist, SLAB exploits multi-granularity data reuse distances. We evaluate SLAB using the SPEC CPU2006 and CloudSuite benchmarks. With metadata sizes of less than 7% of the original LLC traces, SLAB demonstrates over 91% accuracy in replicating original application behavior across ∼9000 different cache, prefetcher, and memory configurations.
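The two statistics SLAB combines, instruction-localized strides and reuse distances, are easy to picture in miniature. The toy profiler below records a stride histogram per load PC and an LRU stack-distance histogram per cache line; it illustrates the idea only and says nothing about SLAB's actual metadata encoding.

```python
# Toy locality profiler: per-PC stride histogram (instruction-localized
# patterns) plus an LRU stack-distance histogram (temporal reuse).

from collections import defaultdict

class LocalityProfile:
    def __init__(self, line_size=64):
        self.line_size = line_size
        self.last_addr = {}                        # last address per load PC
        self.strides = defaultdict(lambda: defaultdict(int))
        self.stack = []                            # LRU stack of cache lines
        self.reuse = defaultdict(int)              # stack distance -> count

    def observe(self, pc, addr):
        if pc in self.last_addr:                   # instruction-localized stride
            self.strides[pc][addr - self.last_addr[pc]] += 1
        self.last_addr[pc] = addr
        line = addr // self.line_size
        d = self.stack.index(line) if line in self.stack else -1  # -1 = cold miss
        self.reuse[d] += 1
        if d >= 0:
            self.stack.pop(d)
        self.stack.insert(0, line)

prof = LocalityProfile()
for i in range(8):
    prof.observe(pc=0x100, addr=0x1000 + 64 * i)   # strided load
print(dict(prof.strides[0x100]))                   # {64: 7}
```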
International Symposium on Performance Analysis of Systems and Software | 2017
Reena Panda; Lizy Kurian John
Early design-space evaluation of computer systems is usually performed using performance models (e.g., detailed simulators, RTL-based models, etc.). However, it is very challenging (often impossible) to run many emerging applications on detailed performance models owing to their complex software stacks and long run times. To overcome these challenges, we propose a proxy generation methodology, PerfProx, which generates miniature proxy benchmarks that are representative of the performance of real-world applications and yet converge to results quickly without needing complex software-stack support. Past proxy generation research utilizes detailed microarchitecture-independent metrics derived from detailed simulators, which are often difficult to generate for many emerging applications. PerfProx enables fast and efficient proxy generation using performance metrics derived primarily from hardware performance counters. We evaluate the proxy generation framework on three modern databases (Cassandra, MongoDB, and MySQL) running data-serving and data-analytics applications. The proxy benchmarks mimic the performance (IPC) of the original applications with ∼94% accuracy while significantly reducing the instruction count.
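One way to picture counter-driven proxy generation: reduce a counter-derived profile, such as the dynamic instruction mix, to a short synthetic loop body that preserves the mix while shrinking the dynamic instruction count by orders of magnitude. The mix values and the crude generator below are illustrative assumptions; PerfProx's actual generator models many more characteristics (branch behavior, memory access patterns, and so on).

```python
# Toy proxy skeleton generator: preserve a counter-derived instruction
# mix in a 100-instruction loop body (mix values are placeholders).

mix = {"load": 0.30, "store": 0.10, "branch": 0.15, "alu": 0.45}
proxy_len = 100

body = []
for kind, frac in mix.items():
    body += [kind] * round(frac * proxy_len)

print(f"{len(body)}-instruction loop body:", body[:8], "...")
```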
International Conference on Supercomputing | 2016
Reena Panda; Yasuko Eckert; Nuwan Jayasena; Onur Kayiran; Michael W. Boyer; Lizy Kurian John
Near-memory processing or processing-in-memory (PIM) has recently regained interest as a viable solution to the challenges imposed by the memory wall, a trend fueled mainly by the emergence of 3D-stacked memories. GPUs are touted as strong candidates for in-memory processors due to their superior bandwidth-utilization capabilities. Although placing a GPU core beneath memory exposes it to unprecedented memory bandwidth, we demonstrate in this paper that significant opportunities still exist to improve the performance of the simpler, in-memory GPU processors (GPU-PIM) by improving their memory performance. We therefore propose three lightweight, practical memory-side prefetchers for GPU-PIM systems. The proposed prefetchers exploit the patterns in individual memory accesses and the synergy in wavefront-localized memory streams, combined with a better understanding of the memory-system state, to prefetch from DRAM row buffers into on-chip prefetch buffers, achieving over 75% prefetcher accuracy and a 40% improvement in row-buffer locality. To maximize the utilization of prefetched data and minimize thrashing, the prefetchers also use a novel prefetch-buffer management policy based on a unique dead-row prediction mechanism, together with an eviction-based prefetch-trigger policy to control their aggressiveness. The proposed prefetchers improve performance by up to 60% (9% on average) over the baseline, while achieving over 33% of the benefit of a perfect L2 using less than 5.6 KB of additional hardware. They also outperform OWL, the state-of-the-art memory-side prefetcher, by more than 20%.
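A stripped-down model of the memory-side idea: when a DRAM row is activated for a demand access, pull a few subsequent blocks of that open row into a small on-chip prefetch buffer, since they can be read without extra activations. The buffer management below is plain LRU; the paper's dead-row prediction and eviction-based triggering are not modeled here.

```python
# Toy memory-side row-buffer prefetcher with a small LRU prefetch buffer.

from collections import OrderedDict

class RowBufferPrefetcher:
    def __init__(self, row_blocks=32, buf_entries=64):
        self.row_blocks = row_blocks
        self.buf_entries = buf_entries
        self.buf = OrderedDict()                   # (row, block) -> data

    def on_row_activate(self, row, demanded_block, degree=4):
        # Prefetch the next `degree` blocks of the open row past the demand.
        for i in range(1, degree + 1):
            block = (row, (demanded_block + i) % self.row_blocks)
            self.buf[block] = f"data@{block}"
            self.buf.move_to_end(block)
            if len(self.buf) > self.buf_entries:   # evict least recently used
                self.buf.popitem(last=False)

    def lookup(self, row, block):
        return self.buf.get((row, block))          # hit -> data, miss -> None

pf = RowBufferPrefetcher()
pf.on_row_activate(row=7, demanded_block=0)
print(pf.lookup(7, 1) is not None)                 # True: covered by prefetch
```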
International Symposium on Performance Analysis of Systems and Software | 2017
Jiajun Wang; Reena Panda; Lizy Kurian John
Cloud computing is gaining popularity due to its ability to provide infrastructure, platform, and software services to clients on a global scale. Using cloud services, clients reduce the cost and complexity of buying and managing the underlying hardware and software layers. Popular services like web search, data analytics, and data mining typically work with big data sets that do not fit into top-level caches; the performance efficiency of last-level caches and off-chip memory thus becomes a crucial determinant of cloud application performance. In this paper, we use CloudSuite as an example and study how prefetching schemes affect cloud workloads. We conduct a detailed analysis of address patterns to explore the correlation between prefetching performance and intrinsic workload characteristics, focusing particularly on the behavior of memory accesses at the last-level cache and beyond. We observe that cloud workloads in general do not have dominant strides, and state-of-the-art prefetching schemes improve performance only for some cloud applications, such as web search. Our analysis shows that cloud workloads with long temporal reuse patterns are often negatively impacted by prefetching, especially when their working set is larger than the cache.
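The "no dominant stride" observation can be quantified by measuring what fraction of successive address deltas in an LLC-level stream match the single most common delta. The synthetic traces below stand in for real workload traces; a stride prefetcher has little to work with when this coverage is low.

```python
# Dominant-stride coverage of an address stream (toy traces).

import random
from collections import Counter

def dominant_stride_coverage(addrs):
    strides = Counter(b - a for a, b in zip(addrs, addrs[1:]))
    if not strides:
        return 0.0
    _, count = strides.most_common(1)[0]
    return count / (len(addrs) - 1)

streamy = [i * 64 for i in range(1000)]                    # perfectly strided
rng = random.Random(0)
irregular = [rng.randrange(1 << 20) for _ in range(1000)]  # no dominant stride
print(dominant_stride_coverage(streamy))                   # -> 1.0
print(dominant_stride_coverage(irregular))                 # -> near 0.0
```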
International Conference on Parallel Architectures and Compilation Techniques | 2016
Jee Ho Ryoo; Mitesh R. Meswani; Reena Panda; Lizy Kurian John
With current DRAM technology reaching its limits, emerging heterogeneous memory systems have become an attractive way to keep memory performance scaling. This paper argues for using a small, fast memory close to the processor as part of a flat address space in which the memory system is composed of two or more memory types. OS-transparent management of such memory has been proposed in prior works such as CAMEO and Part of Memory (PoM). Data migration in these schemes is handled either at coarse granularity with high bandwidth overheads (as in PoM) or at fine granularity with a low hit rate (as in CAMEO), and prior work restricts address mapping to congruence groups to simplify the mapping: at any time, only one page (block) from a congruence group is resident in the fast memory. In this paper, we present a flat address space organization called SILC-FM that uses large granularity but allows subblocks from two pages to coexist in an interleaved fashion in fast memory. Data movement is done at subblock granularity, avoiding the fetching of useless subblocks and consuming less bandwidth than migrating an entire large block. SILC-FM captures more spatial-locality hits than CAMEO and PoM thanks to its page-level operation and block interleaving, respectively. The interleaved subblock placement improves performance by 55% on average over a static placement scheme without data migration. We also selectively lock hot blocks to prevent them from being involved in hardware swapping operations. Additional features, namely locking, associativity, and bandwidth balancing, improve performance by 11%, 8%, and 8% respectively, for a total performance improvement of 82% over the no-migration static placement scheme. Compared to the best state-of-the-art scheme, SILC-FM achieves a 36% performance improvement with 13% energy savings.
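The key bookkeeping is per-subblock ownership within each fast-memory block, so subblocks of two pages can interleave and only demanded subblocks migrate. The sketch below shows that lookup/install structure under assumed sizes; SILC-FM's swapping, associativity, and bandwidth-balancing machinery is omitted.

```python
# Toy subblock-interleaved fast-memory block: an ownership map records
# which page currently holds each subblock slot.

SUBBLOCKS_PER_BLOCK = 32

class FastMemoryBlock:
    def __init__(self):
        self.owner = [None] * SUBBLOCKS_PER_BLOCK  # page owning each subblock
        self.locked = False                        # hot blocks can be pinned

    def lookup(self, page, subblock):
        return self.owner[subblock] == page        # hit in fast memory?

    def install(self, page, subblock):
        # Only the demanded subblock migrates, not the whole large block,
        # so useless subblocks are never fetched.
        if not self.locked:
            self.owner[subblock] = page

blk = FastMemoryBlock()
blk.install(page=12, subblock=3)
blk.install(page=45, subblock=4)                   # two pages interleaved
print(blk.lookup(12, 3), blk.lookup(45, 4))        # True True
```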
Design Automation Conference | 2017
Reena Panda; Xinnian Zheng; Jiajun Wang; Andreas Gerstlauer; Lizy Kurian John
Recent research has shown that modern GPU performance is often limited by the memory system. Optimizing memory hierarchy performance requires GPU designers to draw design insights from the cache and memory behavior of end-user applications. Unfortunately, it is often difficult to get access to end-user workloads due to the confidential or proprietary nature of the software and data. Furthermore, early design-space exploration of cache and memory systems is often limited by either the slow speed of detailed simulation techniques or the limited scope of state-of-the-art cache analytical models. To enable efficient GPU memory-system exploration, we present a novel methodology and framework that statistically models GPU memory access stream locality. The proposed G-MAP (GPU Memory Access Proxy) framework models the regularity in the code-localized memory access patterns of GPGPU applications and the parallelism in the GPU execution model to create miniaturized memory proxies. We evaluate G-MAP using 18 GPGPU benchmarks and show that G-MAP proxies can replicate the cache/memory performance of the original applications with over 90% accuracy across more than 5000 different L1/L2 cache, prefetcher, and memory configurations.
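In outline, a statistical memory proxy replays captured per-access-site statistics to emit a miniaturized synthetic address stream. The stub profile and generator below are assumptions for illustration; G-MAP additionally models GPU wavefront-level parallelism and richer pattern classes.

```python
# Toy proxy address-stream generator: replay per-PC dominant strides
# captured from an original workload (profile values are stubs).

import random

profile = {                       # pc -> (start address, dominant stride)
    0x100: (0x10000, 64),
    0x108: (0x80000, 4),
}

def synthesize(profile, length=16, seed=0):
    rng = random.Random(seed)
    cur = {pc: start for pc, (start, _) in profile.items()}
    stream = []
    for _ in range(length):
        pc = rng.choice(list(profile))             # pick an access site
        stream.append((pc, cur[pc]))
        cur[pc] += profile[pc][1]                  # advance by its stride
    return stream

print(synthesize(profile, length=6))
```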