Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Fengguang Song is active.

Publication


Featured research published by Fengguang Song.


ieee international conference on high performance computing data and analytics | 2009

Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems

Fengguang Song; Asim YarKhan; Jack J. Dongarra

This paper presents a dynamic task scheduling approach to executing dense linear algebra algorithms on multicore systems (either shared-memory or distributed-memory). We use a task-based library to replace existing linear algebra subroutines such as PBLAS, transparently providing the same interface and computational functionality as the ScaLAPACK library. Linear algebra programs are written with the task-based library and executed by a dynamic runtime system. We mainly focus our runtime system design on the metric of performance scalability. We propose a distributed algorithm to solve data dependences without process cooperation. We have implemented the runtime system and applied it to three linear algebra algorithms: Cholesky, LU, and QR factorizations. Our experiments on both shared-memory machines (16, 32 cores) and distributed-memory machines (1024 cores) demonstrate that our runtime system is able to achieve good scalability. Furthermore, we provide an analytical study that shows why the tiled algorithms are scalable and derives their expected execution time.
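
As a rough illustration of the data-driven scheduling idea in this abstract, the sketch below expresses tile Cholesky as tasks with read/write sets and fires each task once its inputs are ready. The scheduler is a single-process stand-in for the paper's distributed runtime, not its actual implementation.

```python
# A minimal sketch of data-driven task scheduling: tasks declare which tiles
# they read and write, the runtime derives dependencies from last-writer
# information, and fires a task as soon as its predecessors finish. Kernel
# names follow the usual tile Cholesky (POTRF/TRSM/SYRK/GEMM); the scheduling
# loop itself is illustrative, not the paper's runtime.
from collections import defaultdict, deque

def tile_cholesky_tasks(p):
    """Yield (name, write_tile, read_tiles) for a p x p tile Cholesky."""
    for k in range(p):
        yield (f"POTRF({k})", (k, k), [(k, k)])
        for i in range(k + 1, p):
            yield (f"TRSM({i},{k})", (i, k), [(k, k), (i, k)])
        for i in range(k + 1, p):
            yield (f"SYRK({i})", (i, i), [(i, k), (i, i)])
            for j in range(k + 1, i):
                yield (f"GEMM({i},{j})", (i, j), [(i, k), (j, k), (i, j)])

def schedule(tasks):
    tasks = list(tasks)
    succ, indeg = defaultdict(list), [0] * len(tasks)
    last_writer = {}                      # tile -> index of its last writer
    for t, (_, wtile, rtiles) in enumerate(tasks):
        for tile in rtiles:               # depend on the tile's last writer
            if tile in last_writer:
                succ[last_writer[tile]].append(t)
                indeg[t] += 1
        last_writer[wtile] = t
    ready = deque(t for t in range(len(tasks)) if indeg[t] == 0)
    order = []
    while ready:                          # fire any task whose inputs are ready
        t = ready.popleft()
        order.append(tasks[t][0])
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

print(schedule(tile_cholesky_tasks(3)))
```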


international conference on supercomputing | 2012

Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems

Fengguang Song; Stanimire Tomov; Jack J. Dongarra

We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multicore and multi-GPU systems to support dense matrix computations efficiently. The main idea is that we treat a heterogeneous system as a distributed-memory machine, and use a heterogeneous multi-level block cyclic distribution method to allocate data to the host and multiple GPUs to minimize communication. We design heterogeneous algorithms with hybrid tiles to accommodate the processor heterogeneity, and introduce an auto-tuning method to determine the hybrid tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our approach is designed for achieving four objectives: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our experiments on a compute node (with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs), as well as on up to 100 compute nodes on the Keeneland system, demonstrate great scalability, good load balancing, and efficiency of our approach.
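
The sketch below illustrates the heterogeneous block cyclic idea from this abstract: columns are carved into hybrid tiles, a wide tile per GPU and narrower tiles for CPU cores, and dealt out cyclically so faster devices receive proportionally more data. The tile widths and device counts are invented parameters, not values from the paper.

```python
# Illustrative hybrid block cyclic distribution: one wide tile per GPU and a
# narrow tile per CPU core, assigned round-robin across the matrix columns.
def hybrid_block_cyclic(n_cols, gpu_tile, cpu_tile, n_gpus, n_cpu_cores):
    """Return a list of (device, col_start, col_end) assignments."""
    devices = [f"gpu{g}" for g in range(n_gpus)] + \
              [f"cpu{c}" for c in range(n_cpu_cores)]
    widths = [gpu_tile] * n_gpus + [cpu_tile] * n_cpu_cores
    plan, col, d = [], 0, 0
    while col < n_cols:                   # deal hybrid tiles out cyclically
        w = min(widths[d % len(widths)], n_cols - col)
        plan.append((devices[d % len(devices)], col, col + w))
        col += w
        d += 1
    return plan

for owner, lo, hi in hybrid_block_cyclic(24, gpu_tile=6, cpu_tile=2,
                                         n_gpus=2, n_cpu_cores=4):
    print(f"{owner}: columns [{lo}, {hi})")
```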


international conference on parallel processing | 2004

An algebra for cross-experiment performance analysis

Fengguang Song; Felix Wolf; Nikhil Bhatia; Jack J. Dongarra; Shirley Moore

Performance tuning of parallel applications usually involves multiple experiments to compare the effects of different optimization strategies. This article describes an algebra that can be used to compare, integrate, and summarize performance data from multiple sources. The algebra consists of a data model to represent the data in a platform-independent fashion plus arithmetic operations to merge, subtract, and average the data from different experiments. A distinctive feature of this approach is its closure property, which allows all instances of the data model to be processed and viewed in the same way, regardless of whether they represent original or derived data, and permits arbitrary, straightforward composition of operations.
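
A small sketch of that closure property: every experiment below is an instance of one data model (here simplified to a flat mapping from a (metric, call path, process) tuple to a value), and merge, diff, and mean each return another instance of the same model, so derived data composes exactly like original data. The flat representation is an assumption for illustration, not the paper's actual data model.

```python
# Toy cross-experiment algebra with the closure property: every operation
# maps experiments (dicts keyed by (metric, call path, process)) to another
# experiment of the same shape.
def merge(*experiments):
    """Union of cells; overlapping cells are summed."""
    out = {}
    for exp in experiments:
        for key, val in exp.items():
            out[key] = out.get(key, 0.0) + val
    return out

def diff(a, b):
    """a minus b, cell by cell (missing cells count as zero)."""
    keys = a.keys() | b.keys()
    return {k: a.get(k, 0.0) - b.get(k, 0.0) for k in keys}

def mean(*experiments):
    total = merge(*experiments)
    return {k: v / len(experiments) for k, v in total.items()}

before = {("time", "main/solve", 0): 12.0, ("time", "main/io", 0): 3.0}
after  = {("time", "main/solve", 0):  7.5, ("time", "main/io", 0): 3.1}
saved = diff(before, after)          # closure: the result is again an experiment
print(saved[("time", "main/solve", 0)])   # 4.5
```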


acm symposium on parallel algorithms and architectures | 2012

A scalable framework for heterogeneous GPU-based clusters

Fengguang Song; Jack J. Dongarra

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance. However, there is little parallel software available that can utilize all CPU cores and all GPUs on a heterogeneous system efficiently. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), such that communication eventually becomes the bottleneck of the entire system. To overcome the bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework attains high performance on distributed-memory clusters without GPUs and on shared-memory multi-GPU systems.
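
The "without coordination" claim rests on determinism: if every node applies the same owner function to a task's output data, all nodes agree on who runs which task with no messages exchanged. The sketch below uses a common 2-D block cyclic owner function as an assumed convention; it is not necessarily the paper's exact protocol.

```python
# Hedged sketch of coordination-free task assignment: each rank filters the
# same global task list with the same deterministic owner function, so the
# partition is consistent without any communication.
def owner(tile_row, tile_col, proc_rows, proc_cols):
    """2-D block cyclic home node of a tile (an assumed convention)."""
    return (tile_row % proc_rows) * proc_cols + (tile_col % proc_cols)

def my_tasks(rank, tasks, proc_rows, proc_cols):
    """Each rank decides independently which tasks it owns."""
    return [t for t in tasks
            if owner(t["out"][0], t["out"][1], proc_rows, proc_cols) == rank]

tasks = [{"name": f"GEMM({i},{j})", "out": (i, j)}
         for i in range(4) for j in range(4)]
for rank in range(4):                 # a 2 x 2 process grid
    mine = [t["name"] for t in my_tasks(rank, tasks, 2, 2)]
    print(rank, mine)
```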


international conference on parallel processing | 2007

L2 Cache Modeling for Scientific Applications on Chip Multi-Processors

Fengguang Song; Shirley Moore; Jack J. Dongarra

It is critical to provide high performance for scientific applications running on chip multi-processors (CMP). A CMP architecture often comprises a shared L2 cache and lower-level storage. The shared L2 cache can reduce the number of cache misses if the data are accessed in common by several threads, but it can also lead to performance degradation due to resource contention. Sometimes running threads on all cores can cause severe contention and greatly increase the number of cache misses. To investigate how the performance of a thread varies when running it concurrently with other threads on the remaining cores, we develop an analytical model to predict the number of misses on the shared L2 cache. In particular, we apply the model to thread-parallel numerical programs. We assume that all the threads compute homogeneous tasks and share a fully associative L2 cache. We use circular sequence profiling and stack processing techniques to analyze the L2 cache trace and predict the number of compulsory cache misses, capacity cache misses on shared data, and capacity cache misses on private data, respectively. Our method is able to predict the L2 cache performance for threads that have a global shared address space. For scientific applications, threads often have overlapping memory footprints. We use a cycle-accurate simulator to validate the model with three scientific programs: dense matrix multiplication, blocked dense matrix multiplication, and sparse matrix-vector product. The average relative errors for the three experiments are 8.01%, 1.85%, and 2.41%, respectively.
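
The stack processing technique mentioned here is textbook reuse-distance analysis: for a fully associative LRU cache, an access hits exactly when the number of distinct lines touched since the previous access to the same line is below the cache capacity. The sketch below classifies a trace into hits, compulsory misses, and capacity misses; it omits the paper's per-thread shared/private split.

```python
# Reuse-distance (LRU stack) classification for a fully associative cache:
# first touches are compulsory misses; reuses at distance >= capacity are
# capacity misses; everything else hits.
def classify(trace, capacity):
    """trace: iterable of cache-line ids. Returns (hits, compulsory, capacity)."""
    stack = []                            # full LRU stack, most recent at end
    seen = set()
    hits = compulsory = capacity_misses = 0
    for line in trace:
        if line not in seen:
            compulsory += 1
            seen.add(line)
        else:
            pos = stack.index(line)
            dist = len(stack) - pos - 1   # distinct lines since last access
            if dist < capacity:
                hits += 1
            else:
                capacity_misses += 1
            stack.pop(pos)
        stack.append(line)
    return hits, compulsory, capacity_misses

print(classify(["a", "b", "c", "a", "d", "a"], capacity=2))   # (1, 4, 1)
```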


high performance distributed computing | 2007

Feedback-directed thread scheduling with memory considerations

Fengguang Song; Shirley Moore; Jack J. Dongarra

This paper describes a novel approach to generating an optimized schedule for running threads on distributed shared memory (DSM) systems. The approach relies upon a binary instrumentation tool to automatically acquire the memory sharing relationships between user-level threads by analyzing their memory traces. We introduce the concept of an Affinity Graph to model these relationships. Expensive I/O for large trace files is completely eliminated by using an online graph creation scheme. We apply hierarchical graph partitioning and thread reordering to the affinity graph to determine an optimal thread schedule. We have performed experiments on an SGI Altix system. The experimental results show that our approach is able to reduce the total execution time by 10% to 38% for a variety of applications by maximizing data reuse within a single processor, minimizing data sharing between processors, and achieving a good load balance.
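
To make the affinity graph concrete: threads are vertices, and an edge weight counts the memory blocks two threads both touch; co-scheduling heavily connected threads maximizes reuse and minimizes cross-processor sharing. The toy traces and the greedy pairing below are simplifications for illustration (the paper derives traces from binary instrumentation and uses hierarchical partitioning).

```python
# Build a weighted affinity graph from per-thread footprints and greedily
# co-schedule the most strongly connected thread pairs.
from itertools import combinations

def affinity_graph(traces):
    """traces: {thread_id: set of touched memory blocks} -> edge weights."""
    return {(a, b): len(traces[a] & traces[b])
            for a, b in combinations(sorted(traces), 2)}

def greedy_pairs(edges):
    """Repeatedly co-schedule the two unplaced threads sharing the most."""
    placed, groups = set(), []
    for (a, b), w in sorted(edges.items(), key=lambda e: -e[1]):
        if a not in placed and b not in placed and w > 0:
            groups.append((a, b))
            placed.update((a, b))
    return groups

traces = {0: {1, 2, 3}, 1: {3, 4}, 2: {1, 2, 9}, 3: {4, 5}}
print(greedy_pairs(affinity_graph(traces)))   # [(0, 2), (1, 3)]
```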


ieee international conference on high performance computing data and analytics | 2010

Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems

Fengguang Song; Hatem Ltaief; Bilel Hadri; Jack J. Dongarra

As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms combined with communication-avoiding techniques for the QR factorization presents a high degree of parallelism, where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto standard ScaLAPACK library by up to 4 times for tall and skinny matrices, and exhibits good scalability up to 3,072 cores.
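
The communication-avoiding structure exploited here can be shown compactly: QR-factor a tall and skinny matrix in independent row blocks, then merge the small R factors pairwise up a reduction tree, so inter-node communication is logarithmic in the number of blocks. The sketch below is a plain TSQR illustration using NumPy that returns only the final R; it is not the paper's tile runtime.

```python
# TSQR tree reduction: local QRs on row blocks, then pairwise merges of the
# small R factors until one R remains.
import numpy as np

def tsqr_r(blocks):
    """blocks: list of (m_i x n) arrays with m_i >= n. Returns the n x n R."""
    rs = [np.linalg.qr(b, mode="r") for b in blocks]   # local QRs
    while len(rs) > 1:                                 # pairwise merge tree
        nxt = []
        for i in range(0, len(rs) - 1, 2):
            nxt.append(np.linalg.qr(np.vstack([rs[i], rs[i + 1]]), mode="r"))
        if len(rs) % 2:                                # odd block carried up
            nxt.append(rs[-1])
        rs = nxt
    return rs[0]

rng = np.random.default_rng(0)
a = rng.standard_normal((4096, 8))
blocks = np.split(a, 8)                                # 8 row blocks of 512
r_tree = tsqr_r(blocks)
r_ref = np.linalg.qr(a, mode="r")
# R is unique only up to row signs, so compare magnitudes
print(np.allclose(np.abs(r_tree), np.abs(r_ref)))      # True
```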


University of Tennessee Computer Science Technical Report UT-CS-11-668 (also LAWN 250) | 2011

Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

Fengguang Song; Stanimire Tomov; Jack J. Dongarra

We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two different tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and efficiency of our approach.

INTRODUCTION

As the performance of both multicore CPUs and GPUs continues to scale at a Moore's law rate, it is becoming pervasive to use heterogeneous multicore and multi-GPU architectures to attain the highest performance possible from a single compute node. Before making parallel programs run efficiently on a distributed-memory system, it is critical to achieve high performance on a single node first. However, the heterogeneity of the multi-core and multi-GPU architecture has introduced new challenges to algorithm design and system software.

Over the last few years, our colleagues at the University of Tennessee have developed the PLASMA library [2] to solve linear algebra problems on multicore architectures. In parallel with PLASMA, we have also developed another library called MAGMA [27] to solve linear algebra problems on GPUs. While PLASMA and MAGMA aim to provide the same routines as LAPACK [4], one is used for multicore CPUs, and the other for a single core with an attached GPU, respectively. Our goal is to utilize all cores and all GPUs efficiently on a single multicore and multi-GPU system to support matrix computations.

[Figure 1: An example of a heterogeneous multi-core and multi-GPU system. The host system is connected to four GPUs via two PCI Express connections. The host system and the GPUs have separate memory spaces.]

Figure 1 shows the architecture of a heterogeneous multicore and multi-GPU system we are considering. The multicore host system is connected to four GPUs via two PCI Express connections, and each pair of GPUs shares a GPU switch. To design new software on this type of heterogeneous architecture, we must consider the following special features: (1) the host and the GPUs have different memory spaces, and an explicit memory copy is required to transfer data between the host and a GPU; (2) the system is also different from a distributed-memory machine, since each GPU is actually controlled by a thread running on the host (more like pthreads on a shared-memory machine); (3) there is processor heterogeneity between CPUs and GPUs; (4) GPUs are optimized for throughput and expect a larger input size than CPUs, which are optimized for latency [24]; (5) as the performance gap between a GPU and its PCI-Express interconnection to the host becomes larger, the network eventually becomes the bottleneck of the entire system. In this work, we take into account all these factors and strive to meet the following objectives in order to obtain high performance: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. We propose to design new heterogeneous algorithms and to use a simple but practical static data distribution to achieve the objectives simultaneously.

This paper describes heterogeneous rectangular tile algorithms with hybrid tile sizes, heterogeneous 1-D block cyclic data distribution, a new runtime system, and an auto-tuning method to determine the hybrid tile sizes. The rectangular tile algorithms build upon the previous tile algorithms, which divide a matrix into square tiles and exhibit a high degree of parallelism and minimized synchronizations [13, 14].
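
A hedged sketch of the auto-tuning step described above: enumerate candidate (GPU tile, CPU tile) pairs and keep the pair whose predicted makespan is smallest and whose per-device finishing times are closest, i.e. best load balance. The linear cost model (seconds per column of work on each device class) is an invented stand-in for the measured kernel benchmarks a real tuner would use.

```python
# Exhaustive search over hybrid tile-size candidates under a toy linear cost
# model; picks the candidate minimizing (makespan, imbalance).
def tune(n_cols, gpu_tiles, cpu_tiles, n_gpus, n_cores,
         gpu_cost=0.2, cpu_cost=1.0):
    best = None
    for bg in gpu_tiles:
        for bc in cpu_tiles:
            cycle = n_gpus * bg + n_cores * bc      # columns per round
            rounds = n_cols / cycle
            t_gpu = rounds * bg * gpu_cost          # predicted device times
            t_cpu = rounds * bc * cpu_cost
            makespan = max(t_gpu, t_cpu)
            imbalance = abs(t_gpu - t_cpu) / makespan
            if best is None or (makespan, imbalance) < best[:2]:
                best = (makespan, imbalance, bg, bc)
    return best

print(tune(3840, gpu_tiles=[256, 512, 768], cpu_tiles=[64, 128, 192],
           n_gpus=3, n_cores=12))
```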


international conference on parallel processing | 2005

Automatic experimental analysis of communication patterns in virtual topologies

Nikhil Bhatia; Fengguang Song; Felix Wolf; Jack J. Dongarra; Bernd Mohr; Shirley Moore

Automatic pattern search in event traces is a powerful method to identify performance problems in parallel applications. We demonstrate that knowledge about the virtual topology, which defines logical adjacency relationships between processes, can be exploited to explain the occurrence of inefficiency patterns in terms of the parallelization strategy used in an application. We show correlations between higher-level events related to a parallel wavefront scheme and wait states identified by our pattern analysis. In addition, we visually expose relationships between pattern occurrences and the topological characteristics of the affected processes.
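
As a toy version of the kind of pattern search this work builds on, the sketch below scans matching send/receive events and flags "late sender" wait states, where the receiver arrives before the corresponding send and blocks. The event fields are a simplification; real tools such as KOJAK operate on full MPI traces and, as the paper shows, can then correlate these waits with the processes' positions in the virtual topology.

```python
# Detect "late sender" wait states in a simplified event trace: a receive
# posted before its matching send means the receiver blocked for the gap.
def late_senders(events):
    """events: list of dicts with kind, time, src, dst, tag."""
    sends = {(e["src"], e["dst"], e["tag"]): e["time"]
             for e in events if e["kind"] == "send"}
    waits = []
    for e in events:
        if e["kind"] == "recv":
            t_send = sends[(e["src"], e["dst"], e["tag"])]
            if e["time"] < t_send:              # receiver arrived first
                waits.append((e["dst"], t_send - e["time"]))
    return waits                                # (blocked process, wait time)

trace = [
    {"kind": "recv", "time": 1.0, "src": 0, "dst": 1, "tag": 7},
    {"kind": "send", "time": 4.0, "src": 0, "dst": 1, "tag": 7},
]
print(late_senders(trace))                      # [(1, 3.0)]
```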


international workshop on openmp | 2005

Performance instrumentation and compiler optimizations for MPI/OpenMP applications

Oscar R. Hernandez; Fengguang Song; Barbara M. Chapman; Jack J. Dongarra; Bernd Mohr; Shirley Moore; Felix Wolf

This article describes how the integration of the OpenUH OpenMP compiler with the KOJAK performance analysis tool can assist developers of OpenMP and hybrid codes in optimizing their applications with as little user intervention as possible. In particular, we (i) describe how the compiler's ability to automatically instrument user code down to the flow-graph level can help locate performance problems and (ii) outline how the performance feedback provided by KOJAK will direct the compiler's optimization decisions in the future. To demonstrate our methodology, we present experimental results showing how the reasons for the performance slowdown of the ASPCG benchmark could be identified.
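
To make the instrumentation idea concrete, the sketch below wraps a code region so that it reports enter/exit events with timestamps to a collector, which an analysis tool can then process. The decorator mechanism is a Python stand-in for OpenUH's compile-time instrumentation hooks, not its actual interface.

```python
# Minimal region-level performance instrumentation: every instrumented
# function appends enter/exit events to a global event buffer.
import time
from functools import wraps

EVENTS = []

def instrument(region):
    def wrap(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            EVENTS.append(("enter", region, time.perf_counter()))
            try:
                return fn(*args, **kwargs)
            finally:
                EVENTS.append(("exit", region, time.perf_counter()))
        return inner
    return wrap

@instrument("solver_loop")
def solve(n):
    return sum(i * i for i in range(n))

solve(100000)
for kind, region, t in EVENTS:
    print(f"{kind:5s} {region} @ {t:.6f}")
```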

Collaboration


Dive into Fengguang Song's collaborations.

Top Co-Authors

Shirley Moore (University of Texas at El Paso)
Lan Lin (Ball State University)
Zizhong Chen (University of California)
Felix Wolf (Technische Universität Darmstadt)
Bernd Mohr (Forschungszentrum Jülich)
Asim YarKhan (University of Tennessee)