Qingda Lu | Researchain

Publication

Featured research published by Qingda Lu.

high-performance computer architecture | 2008

Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems

Jiang Lin; Qingda Lu; Xiaoning Ding; Zhao Zhang; Xiaodong Zhang; P. Sadayappan

Cache partitioning and sharing is critical to the effective utilization of multicore processors. However, almost all existing studies have been evaluated by simulation that often has several limitations, such as excessive simulation time, absence of OS activities and proneness to simulation inaccuracy. To address these issues, we have taken an efficient software approach to supporting both static and dynamic cache partitioning in OS through memory address mapping. We have comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS). Our software approach makes it possible to run the SPEC CPU2006 benchmark suite to completion. Besides confirming important conclusions from previous work, we are able to gain several insights from whole-program executions, which are infeasible from simulation. For example, giving up some cache space in one program to help another one may improve the performance of both programs for certain workloads due to reduced contention for memory bandwidth. Our evaluation of previously proposed fairness metrics is also significantly different from a simulation-based study. The contributions of this study are threefold. (1) To the best of our knowledge, this is a highly comprehensive execution- and measurement-based study on multicore cache partitioning. This paper not only confirms important conclusions from simulation-based studies, but also provides new insights into dynamic behaviors and interaction effects. (2) Our approach provides a unique and efficient option for evaluating multicore cache partitioning. The implemented software layer can be used as a tool in multicore performance evaluation and hardware design. (3) The proposed schemes can be further refined for OS kernels to improve performance.
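The abstract does not spell out how the memory address mapping works. One common way for an OS to partition a physically indexed cache is page coloring, and the following is a minimal sketch of that idea only, not the paper's implementation. The cache geometry (4 MiB, 16-way, 4 KiB pages) and the function names are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096               # assumed page size in bytes
CACHE_SIZE = 4 * 1024 * 1024   # assumed shared cache size in bytes
ASSOCIATIVITY = 16             # assumed number of ways


def num_colors(cache_size=CACHE_SIZE, assoc=ASSOCIATIVITY, page_size=PAGE_SIZE):
    """Number of page colors: pages that fit in one cache way."""
    way_size = cache_size // assoc
    return way_size // page_size


def page_color(phys_addr, colors, page_size=PAGE_SIZE):
    """Color of the physical page containing phys_addr."""
    return (phys_addr // page_size) % colors


def assign_colors(colors, demand):
    """Give each program a disjoint, contiguous range of colors.

    demand maps a (hypothetical) program name to the number of colors it gets.
    """
    if sum(demand.values()) > colors:
        raise ValueError("requested more colors than the cache provides")
    assignment, next_color = {}, 0
    for name, count in demand.items():
        assignment[name] = list(range(next_color, next_color + count))
        next_color += count
    return assignment


colors = num_colors()
plan = assign_colors(colors, {"prog_a": 48, "prog_b": 16})
```

Under this scheme, a kernel that only hands `prog_a` physical pages whose color is in `plan["prog_a"]` confines that program to the corresponding share of the cache sets.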

Proceedings of the IEEE | 2005

Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models

Gerald Baumgartner; Alexander A. Auer; David E. Bernholdt; Alina Bibireata; Venkatesh Choppella; Daniel Cociorva; Xiaoyang Gao; Robert J. Harrison; So Hirata; Sriram Krishnamoorthy; Sandhya Krishnan; Chi-Chung Lam; Qingda Lu; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan; Alexander Sibiryakov

This paper provides an overview of a program synthesis system for a class of quantum chemistry computations. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. The input to the system is a high-level specification of the computation, from which the system can synthesize high-performance parallel code tailored to the characteristics of the target architecture. Several components of the synthesis system are described, focusing on performance optimization issues that they address.

Molecular Physics | 2006

Automatic code generation for many-body electronic structure methods: the tensor contraction engine

Alexander A. Auer; Gerald Baumgartner; David E. Bernholdt; Alina Bibireata; Venkatesh Choppella; Daniel Cociorva; Xiaoyang Gao; Robert J. Harrison; Sriram Krishnamoorthy; Sandhya Krishnan; Chi-Chung Lam; Qingda Lu; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan; Alexander Sibiryakov

As both electronic structure methods and the computers on which they are run become increasingly complex, the task of producing robust, reliable, high-performance implementations of methods at a rapid pace becomes increasingly daunting. In this paper we present an overview of the Tensor Contraction Engine (TCE), a unique effort to address issues of both productivity and performance through automatic code generation. The TCE is designed to take equations for many-body methods in a convenient high-level form and acts like an optimizing compiler, producing an implementation tuned to the target computer system and even to the specific chemical problem of interest. We provide examples to illustrate the TCE approach, including the ability to target different parallel programming models, and the effects of particular optimizations.

international conference on parallel architectures and compilation techniques | 2009

Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors

Qingda Lu; Christophe Alias; Uday Bondhugula; Thomas Henretty; Sriram Krishnamoorthy; J. Ramanujam; Atanas Rountev; P. Sadayappan; Yongjian Chen; Haibo Lin; Tin-Fook Ngai

With increasing numbers of cores, future CMPs (Chip Multi-Processors) are likely to have a tiled architecture with a portion of shared L2 cache on each tile and a bank-interleaved distribution of the address space. Although such an organization is effective for avoiding access hot-spots, it can cause a significant number of non-local L2 accesses for many commonly occurring regular data access patterns. In this paper we develop a compile-time framework for data locality optimization via data layout transformation. Using a polyhedral model, the program's localizability is determined by analysis of its index set and array reference functions, followed by non-canonical data layout transformation to reduce non-local accesses for localizable computations. Simulation-based results on a 16-core 2D tiled CMP demonstrate the effectiveness of the approach. The developed program transformation technique is also useful in several other data layout transformation contexts.

international conference on parallel architectures and compilation techniques | 2009

Soft-OLP: Improving Hardware Cache Performance through Software-Controlled Object-Level Partitioning

Qingda Lu; Jiang Lin; Xiaoning Ding; Zhao Zhang; Xiaodong Zhang; P. Sadayappan

Performance degradation of memory-intensive programs caused by the LRU policy's inability to handle weak-locality data accesses in the last level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU's critical path, where only simple management mechanisms, such as LRU, can be used, precluding some sophisticated hardware mechanisms to address the problem. Second, the commonly used shared cache structure of multi-core processors has made this critical path even more performance-sensitive due to intensive inter-thread contention for shared cache resources. Researchers have recently made efforts to address the problem with the LRU policy by partitioning the cache using hardware or OS facilities guided by run-time locality information. Such approaches often rely on special hardware support or lack enough accuracy. In contrast, for a large class of programs, the locality information can be accurately predicted if access patterns are recognized through small training runs at the data object level. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we are able to determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among data objects with a heuristic algorithm, in order to reduce cache misses through segregation of contending objects. The object-level cache partitioning framework has been implemented with a modified Linux kernel, and tested on a commodity multi-core processor. Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.
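To make the notion of a per-object reuse distance histogram concrete, here is a minimal sketch. It assumes a full, unsampled trace of `(object_name, address)` pairs and uses a naive LRU stack; the paper's sampling scheme and interference histograms are not reproduced, and the function name and trace format are hypothetical.

```python
from collections import Counter


def reuse_distance_histograms(trace):
    """Build one reuse-distance histogram per data object.

    trace: iterable of (object_name, address) pairs.
    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address. First-time
    accesses are counted under the key None.
    """
    stack = []    # LRU stack of addresses, most recently used last
    hists = {}
    for obj, addr in trace:
        hist = hists.setdefault(obj, Counter())
        if addr in stack:
            pos = stack.index(addr)
            distance = len(stack) - 1 - pos
            stack.pop(pos)
        else:
            distance = None
        hist[distance] += 1
        stack.append(addr)
    return hists


trace = [("A", 1), ("A", 2), ("B", 10), ("A", 1), ("B", 10), ("A", 2)]
hists = reuse_distance_histograms(trace)
```

An object whose histogram mass sits at distances larger than the cache capacity gains little from extra cache space, which is the kind of signal a partitioning heuristic could use.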

very large data bases | 2009

MCC-DB: minimizing cache conflicts in multi-core processors for databases

Rubao Lee; Xiaoning Ding; Feng Chen; Qingda Lu; Xiaodong Zhang

In a typical commercial multi-core processor, the last level cache (LLC) is shared by two or more cores. Existing studies have shown that the shared LLC is beneficial to concurrent query processes with commonly shared data sets. However, the shared LLC can also be a performance bottleneck to concurrent queries, each of which has private data structures, such as a hash table for the widely used hash join operator, causing serious cache conflicts. We show that cache conflicts on multi-core processors can significantly degrade overall database performance. In this paper, we propose a hybrid system method called MCC-DB for accelerating executions of warehouse-style queries, which relies on the DBMS knowledge of data access patterns to minimize LLC conflicts in multi-core systems through an enhanced OS facility of cache partitioning. MCC-DB consists of three components: (1) a cache-aware query optimizer carefully selects query plans in order to balance the numbers of cache-sensitive and cache-insensitive plans; (2) a query execution scheduler makes decisions to co-run queries with an objective of minimizing LLC conflicts; and (3) an enhanced OS kernel facility partitions the shared LLC according to each query's cache capacity need and locality strength. We have implemented MCC-DB by patching the three components in PostgreSQL and Linux kernel. Our intensive measurements on an Intel multi-core system with warehouse-style queries show that MCC-DB can reduce query execution times by up to 33%.

ieee international conference on high performance computing data and analytics | 2009

Enabling software management for multicore caches with a lightweight hardware support

Jiang Lin; Qingda Lu; Xiaoning Ding; Zhao Zhang; Xiaodong Zhang; P. Sadayappan

The management of shared caches in multicore processors is a critical and challenging task. Many hardware and OS-based methods have been proposed. However, they are hard to adopt in practice due to their non-trivial overheads, high complexities, and/or limited abilities to handle increasingly complicated scenarios of cache contention caused by many-cores. In order to turn cache partitioning methods into reality in the management of multicore processors, we propose to provide an affordable and lightweight hardware support to coordinate with OS-based cache management policies. The proposed methods are scalable to many-cores and perform comparably with other proposed hardware solutions, but have much lower overheads, and can therefore be easily adopted in commodity processors. Having conducted extensive experiments with 37 multi-programming workloads, we show the effectiveness and scalability of the proposed methods. For example, on 8-core systems, one of our proposed policies improves performance over LRU-based hardware cache management by 14.5% on average.

international conference on parallel architectures and compilation techniques | 2006

Combining analytical and empirical approaches in tuning matrix transposition

Qingda Lu; Sriram Krishnamoorthy; P. Sadayappan

Matrix transposition is an important kernel used in many applications. Even though its optimization has been the subject of many studies, an optimization procedure that targets the characteristics of current processor architectures has not been developed. In this paper, we develop an integrated optimization framework that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors. A judicious combination of analytical and empirical approaches is used to determine the most appropriate optimizations. The absence of problem information until execution time is handled by generating multiple versions of the code — the best version is chosen at runtime, with assistance from minimal-overhead inspectors. The approach highlights aspects of empirical optimization that are important for similar computations with little temporal reuse. Experimental results on PowerPC G5 and Intel Pentium 4 demonstrate the effectiveness of the developed framework.
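As an illustration of the tiling idea mentioned in the abstract, the sketch below transposes a row-major matrix one tile at a time so that reads and writes stay within a small block. It is a plain-Python sketch under assumptions: the tile size is a fixed hypothetical parameter, whereas the paper selects optimizations analytically and empirically, and vectorization, alignment handling and runtime version selection are not shown.

```python
def transpose_tiled(a, rows, cols, tile=4):
    """Transpose a flat row-major rows x cols matrix, tile by tile.

    Returns a new flat row-major list holding the cols x rows transpose.
    """
    out = [0] * (rows * cols)
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            # Work on one tile; min() handles edges that are not tile-aligned.
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    out[j * rows + i] = a[i * cols + j]
    return out


matrix = list(range(6))                      # [[0, 1, 2], [3, 4, 5]]
transposed = transpose_tiled(matrix, 2, 3, tile=2)
```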

international conference on parallel processing | 2004

Applying MPI derived datatypes to the NAS benchmarks: A case study

Qingda Lu; Jiesheng Wu; Dhabaleswar K. Panda; P. Sadayappan

MPI derived datatypes are a powerful method to define arbitrary collections of non-contiguous data in memory and to enable non-contiguous data communication in a single MPI function call. In this paper, we employ MPI datatypes in four NAS benchmarks (MG, LU, BT, and SP) to transfer non-contiguous data. Comprehensive performance evaluation was carried out on two clusters: an Itanium-2 Myrinet cluster and a Xeon InfiniBand cluster. Performance results show that using datatypes can achieve performance comparable to manual packing/unpacking in the original benchmarks, though the MPI implementations that were studied also perform internal packing and unpacking on noncontiguous datatype communication. In some cases, better performance can be achieved because of the reduced costs to transfer non-contiguous data. This is because some optimizations in the MPI packing/unpacking implementations can be easily overlooked in manual packing and unpacking by users. Our case study demonstrates that MPI datatypes simplify the implementation of non-contiguous communication and lead to application code with portable performance. We expect that with further improvement of datatype processing and datatype communication such as [10, 24], datatypes can outperform the conventional methods of noncontiguous data communication. Our modified NAS benchmarks can be used to evaluate datatype processing and datatype communication in MPI implementations.

Journal of Physical Chemistry A | 2009

Performance Optimization of Tensor Contraction Expressions for Many Body Methods in Quantum Chemistry

Albert Hartono; Qingda Lu; Thomas Henretty; Sriram Krishnamoorthy; Huaijian Zhang; Gerald Baumgartner; David E. Bernholdt; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan

Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. This paper addresses two complementary aspects of performance optimization of such tensor contraction expressions. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions. The identification of common subexpressions among a set of tensor contraction expressions can result in a reduction of the total number of operations required to evaluate the tensor contractions. The first part of the paper describes an effective algorithm for operation minimization with common subexpression identification and demonstrates its effectiveness on tensor contraction expressions for coupled cluster equations. The second part of the paper highlights the importance of data layout transformation in the optimization of tensor contraction computations on modern processors. A number of considerations, such as minimization of cache misses and utilization of multimedia vector instructions, are discussed. A library for efficient index permutation of multidimensional tensors is described, and experimental performance data is provided that demonstrates its effectiveness.
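The effect of associativity on operation count can be seen in a small worked example. The sketch below assumes a simple cost model in which one binary contraction costs the product of the extents of all indices it involves; the index names, extents and helper functions are hypothetical and this is not the paper's algorithm.

```python
from math import prod


def contraction_cost(idx_x, idx_y, extents):
    """Multiply-add count of one binary contraction under the assumed model."""
    return prod(extents[i] for i in set(idx_x) | set(idx_y))


def result_indices(idx_x, idx_y, keep):
    """Indices of the intermediate: those still needed after the contraction."""
    return tuple(i for i in dict.fromkeys(idx_x + idx_y) if i in keep)


# S[a,b] = sum over c,d of A[a,c] * B[c,d] * C[d,b]
extents = {"a": 10, "c": 100, "d": 5, "b": 50}
A, B, C = ("a", "c"), ("c", "d"), ("d", "b")

# Order 1: (A * B) * C
ab = result_indices(A, B, keep={"a", "d"})
cost_left = contraction_cost(A, B, extents) + contraction_cost(ab, C, extents)

# Order 2: A * (B * C)
bc = result_indices(B, C, keep={"c", "b"})
cost_right = contraction_cost(B, C, extents) + contraction_cost(A, bc, extents)
```

With these assumed extents the first order needs 7,500 multiply-adds and the second 75,000, so the choice of parenthesization alone changes the work by a factor of ten.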