Publication


Featured research published by Eddy Z. Zhang.


Architectural Support for Programming Languages and Operating Systems | 2011

On-the-fly elimination of dynamic irregularities for GPU computing

Eddy Z. Zhang; Yunlian Jiang; Ziyu Guo; Kai Tian; Xipeng Shen

The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed, but it remains an open question how to achieve those gains through software approaches on modern GPUs. This paper presents a systematic exploration to tackle dynamic irregularities in both control flows and memory references. It reveals some properties of dynamic irregularities in both control flows and memory references, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, without jeopardizing the basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07x and 2.5x for a variety of applications.
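
To make the two kinds of irregularity concrete, here is a small, hypothetical CUDA kernel (an illustration only; the kernel and all names are invented, not taken from G-Streamline):

    // Hypothetical kernel showing the two irregularities the paper targets.
    __global__ void irregular_kernel(const float *x, const int *idx, float *y, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float v = x[idx[tid]];   // memory irregularity: idx[tid] may be scattered,
                                 // so loads within a warp are not coalesced
        if (v > 0.0f)            // control-flow irregularity: a data-dependent branch
            y[tid] = v * v;      // on which threads in a warp may diverge
        else
            y[tid] = 0.0f;
    }

Data reordering rewrites x (or idx) so that consecutive threads read consecutive addresses; job swapping re-pairs elements with threads so that threads in the same warp tend to take the same branch.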


International Parallel and Distributed Processing Symposium | 2009

A cross-input adaptive framework for GPU program optimizations

Yixun Liu; Eddy Z. Zhang; Xipeng Shen

Recent years have seen a trend in using graphics processing units (GPUs) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of GPUs has evidently brought factors of speedup to many numerical applications. However, the development of a high-quality GPU application is challenging, due to the large optimization space and the complex, unpredictable effects of optimizations on GPU program performance. Recently, several studies have attempted to use empirical search to help the optimization. Although those studies have shown promising results, one important factor in the optimization, program inputs, has remained unexplored. In this work, we initiate the exploration in this new dimension. By conducting a series of measurements, we find that the ability to adapt to program inputs is important for some applications to achieve their best performance on GPUs. In light of the findings, we develop an input-adaptive optimization framework, named G-ADAPT, to address the influence by constructing cross-input predictive models that automatically predict the (near-)optimal configurations for an arbitrary input to a GPU program. The results demonstrate the promise of the framework in serving as a tool to alleviate the productivity bottleneck in GPU programming.
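
A minimal sketch of the input-adaptive idea; the structure, names, and threshold rule below are invented for illustration, whereas G-ADAPT's actual models are learned from cross-input profiling:

    #include <cstddef>

    struct KernelConfig { int block_size; int unroll; };

    // Stand-in for a learned predictive model (e.g., a regression tree) that maps
    // input features to a (near-)optimal kernel configuration.
    KernelConfig predict_config(std::size_t input_elems, float sparsity)
    {
        if (input_elems < (1u << 20)) return {128, 2};
        if (sparsity > 0.5f)          return {256, 1};
        return {256, 4};
    }

    // Usage (hypothetical kernel): pick the configuration per input, then launch:
    //   KernelConfig c = predict_config(n, estimated_sparsity);
    //   my_kernel<<<(n + c.block_size - 1) / c.block_size, c.block_size>>>(...);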


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?

Eddy Z. Zhang; Yunlian Jiang; Xipeng Shen

Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on the program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to a 36% performance increase when the threads are placed on cores appropriately.
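
The cache-sharing-aware placement mentioned at the end can be sketched with Linux thread affinity; which cores actually share an L2/L3 cache is machine-specific, so the core IDs suggested in the comments are only assumptions:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one core. Threads that share a working set can be
    // pinned to cores assumed to share a cache (e.g., cores 0 and 1), while
    // independent threads go to cores in a different cache domain (e.g., 2 and 3).
    static void pin_to_core(int core_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }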


International Conference on Supercomputing | 2010

Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping

Eddy Z. Zhang; Yunlian Jiang; Ziyu Guo; Xipeng Shen

Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphics processing units) have quickly emerged as an influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is subject to the presence of conditional statements in a GPU application. On a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradation, impairing the adoption of GPUs for many applications that contain non-trivial branches or certain types of loops. This paper presents a systematic investigation into the employment of runtime thread-data remapping for solving that problem. It introduces an abstract form of GPU applications, based on which it describes the use of reference redirection and data layout transformation for remapping data and threads to minimize thread divergences. It discusses the major challenges for practical deployment of the remapping techniques, most notably the conflict between the large remapping overhead and the need for the remapping to happen on the fly because of the dependence of thread divergences on runtime values. It offers a solution to the challenge by proposing a CPU-GPU pipelining scheme and a label-assign-move (LAM) algorithm to virtually hide all the remapping overhead. Finally, it reports significant performance improvement produced by the remapping for a set of GPU applications, demonstrating the potential of the techniques for streamlining GPU applications on the fly.
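
A minimal sketch of the CPU-GPU pipelining idea using a CUDA stream; remap_kernel and compute_remapping_on_cpu are hypothetical placeholders, and this is not the paper's LAM implementation:

    // Hypothetical placeholders: a kernel that consumes the remapped index array,
    // and the CPU-side routine that builds the remapping for a chunk.
    __global__ void remap_kernel(float *data, const int *idx, int n);
    void compute_remapping_on_cpu(int *h_idx_chunk, int chunk_elems);

    void pipelined_remap(float *d_data, int *d_idx, int **h_idx_chunks,
                         int num_chunks, int chunk_elems)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        dim3 block(256), grid((chunk_elems + 255) / 256);

        for (int i = 0; i < num_chunks; ++i) {
            // For real asynchronous copies the host chunks should sit in pinned memory.
            cudaMemcpyAsync(d_idx, h_idx_chunks[i], chunk_elems * sizeof(int),
                            cudaMemcpyHostToDevice, stream);
            remap_kernel<<<grid, block, 0, stream>>>(d_data, d_idx, chunk_elems);
            if (i + 1 < num_chunks)
                compute_remapping_on_cpu(h_idx_chunks[i + 1], chunk_elems); // CPU works on the
            cudaStreamSynchronize(stream);                                  // next chunk while
        }                                                                   // the GPU runs chunk i
        cudaStreamDestroy(stream);
    }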


Compiler Construction | 2010

Is reuse distance applicable to data locality analysis on chip multiprocessors?

Yunlian Jiang; Eddy Z. Zhang; Kai Tian; Xipeng Shen

On Chip Multiprocessors (CMP), it is common that multiple cores share certain levels of cache. The sharing increases contention in cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served a variety of locality analyses, program transformations, and performance predictions. However, previous studies have concentrated on sequential programs running on unicore processors. On CMP, accesses by different threads (or jobs) interact in the shared cache. How reuse distance applies to the new architecture remains an open question: in particular, how the interactions in shared cache affect the collection and application of reuse distance, and how reuse-distance-based locality analysis should adapt to such architecture changes. This paper presents our explorations towards answering those questions. It first introduces the concept of concurrent reuse distance, a direct extension of the traditional concept of reuse distance with data references by all co-running threads (or jobs) considered. It then discusses the properties of concurrent reuse distance, revealing the special challenges facing the collection and application of concurrent reuse distance on CMP platforms. Finally, it presents solutions to those challenges for a class of multithreading applications. The solutions center on a probabilistic model that connects concurrent reuse distance with the data locality of each individual thread. Experiments demonstrate the effectiveness of the proposed techniques in facilitating the uses of concurrent reuse distance for CMP computing.
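
A minimal sketch of computing reuse distances over a single interleaved access trace of all co-running threads, which is the "concurrent" view introduced above; naive O(n^2) counting, for illustration only:

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Reuse distance of an access = number of distinct addresses touched since the
    // previous access to the same address (-1 stands for "infinite", i.e., a first access).
    std::vector<long> reuse_distances(const std::vector<uint64_t> &trace)
    {
        std::unordered_map<uint64_t, std::size_t> last_pos;  // address -> index of last access
        std::vector<long> dist(trace.size());
        for (std::size_t i = 0; i < trace.size(); ++i) {
            auto it = last_pos.find(trace[i]);
            if (it == last_pos.end()) {
                dist[i] = -1;
            } else {
                std::unordered_set<uint64_t> seen;           // distinct addresses in between
                for (std::size_t j = it->second + 1; j < i; ++j)
                    seen.insert(trace[j]);
                dist[i] = static_cast<long>(seen.size());
            }
            last_pos[trace[i]] = i;
        }
        return dist;
    }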


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2013

Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Bo Wu; Zhijia Zhao; Eddy Z. Zhang; Yunlian Jiang; Xipeng Shen

The performance of Graphics Processing Units (GPUs) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all previous studies have employed simple, heuristic methods to determine the new data layouts to create. As a result, they either do not provide any performance guarantee or are effective in only some limited scenarios. This paper contributes a fundamental study of the problem. It systematically analyzes the inherent complexity of the problem in various settings and, for the first time, proves that the problem is NP-complete. It then points out the limitations of existing techniques and reveals that, in practice, the essence of designing an appropriate data reorganization algorithm can be reduced to a tradeoff among space, time, and complexity. Based on that insight, it develops two new data reorganization algorithms to overcome the limitations of previous methods. Experiments show that a combination of the new algorithms and a previous algorithm can circumvent the inherent complexity in finding optimal data layouts, making it feasible to minimize non-coalesced memory accesses for a variety of irregular applications and settings that are beyond the reach of existing techniques.
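
A minimal sketch of one possible reorganization strategy, materializing the data in the threads' access order; it is for illustration only and is not one of the paper's algorithms (the extra copy is exactly the kind of space/time cost the paper's tradeoff analysis reasons about):

    // Host-side pass: thread i will afterwards read element i directly, so the
    // scattered loads data[idx[tid]] become consecutive, coalescible loads.
    void reorder_for_coalescing(const float *data, const int *idx,
                                float *data_reordered, int n)
    {
        for (int i = 0; i < n; ++i)
            data_reordered[i] = data[idx[i]];
    }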


Quantitative Evaluation of Systems | 2008

KPC-Toolbox: Simple Yet Effective Trace Fitting Using Markovian Arrival Processes

Giuliano Casale; Eddy Z. Zhang; Evgenia Smirni

We present the KPC-Toolbox, a collection of MATLAB scripts for fitting workload traces into Markovian arrival processes (MAPs) in an automatic way. We first present a detailed sensitivity analysis that builds intuition on which trace descriptors are most important for queueing. This sensitivity analysis stresses the importance of matching higher-order correlations (i.e., joint moments) of the process inter-arrival times rather than higher-order moments of the distribution, and provides guidance on the relative importance of different descriptors for queueing. Given that the MAP parameterization space can be very large, we focus on first determining the order of the smallest MAP that can fit the trace well, using the Bayesian information criterion (BIC) to determine the best order-accuracy tradeoff. Having determined the order of the target MAP, the KPC-Toolbox automatically derives a MAP that accurately captures the most essential features of the trace. Extensive experimentation illustrates the effectiveness of the KPC-Toolbox in fitting traces that are well documented in the literature as very challenging to fit, showing that the KPC-Toolbox provides a simple and powerful solution for accurately fitting trace data into MAPs.
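
As background (standard MAP notation, not formulas quoted from the paper): for a MAP specified by matrices (D_0, D_1), the phase process embedded at arrival instants has transition matrix P = (-D_0)^{-1} D_1 with stationary vector \pi_e, and the descriptors discussed above can be written as

    E[X^k]     = k! \, \pi_e (-D_0)^{-k} \mathbf{1}                        (moments of inter-arrival times)
    E[X_0 X_j] = \pi_e (-D_0)^{-1} P^{j} (-D_0)^{-1} \mathbf{1},  j >= 1   (joint moments, which determine the autocorrelations)

BIC-based order selection then trades the fitting likelihood against the number of free parameters implied by the MAP order.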


Performance Evaluation | 2010

Trace data characterization and fitting for Markov modeling

Giuliano Casale; Eddy Z. Zhang; Evgenia Smirni

We propose a trace fitting algorithm for Markovian Arrival Processes (MAPs) that can capture statistics of any order of interarrival times between measured events. By studying real traffic and workload traces often used in performance evaluation studies, we show that matching higher order statistical properties, in addition to first and second order descriptors, results in increased queueing prediction accuracy with respect to algorithms that only match the mean, the coefficient of variation, and the autocorrelations of the trace. This result supports the approach of modeling traces by the interarrival time process instead of the counting process that is more frequently used in the literature. We proceed by first characterizing the general properties of MAPs using a spectral approach. Based on this result, we show how different MAPs can be combined together using Kronecker products to define a larger MAP with predefined properties of interarrival times. We then devise an algorithm that is based on this Kronecker composition and can accurately fit data traces. This MAP fitting algorithm uses nonlinear optimization that can be customized to fit an arbitrary number of moments and to meet the desired cost-accuracy tradeoff. Numerical results of the fitting algorithm on real data, such as the Bellcore Aug89 trace and a Seagate disk drive trace, indicate that the proposed fitting technique achieves increased prediction accuracy with respect to other state-of-the-art fitting methods.
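
A piece of standard linear algebra that the Kronecker composition builds on (stated here as background, not as a formula from the paper) is the mixed-product property

    (A \otimes B)(C \otimes D) = (AC) \otimes (BD),

which, roughly, is what allows inter-arrival-time descriptors of the composed MAP to factor into products of descriptors of the smaller component MAPs, so that each component can be fit to a subset of the target statistics.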


Conference on Object-Oriented Programming Systems, Languages, and Applications | 2010

An input-centric paradigm for program dynamic optimizations

Kai Tian; Yunlian Jiang; Eddy Z. Zhang; Xipeng Shen

Accurately predicting program behaviors (e.g., locality, dependency, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited program inputs, a deciding factor for program behaviors. Triggered by the strong and predictive correlations between program inputs and behaviors that recent studies have uncovered, this work proposes to include program inputs in the focus of program behavior analysis, cultivating a new paradigm named input-centric program behavior analysis. This new approach consists of three components, forming a three-layer pyramid. At the base is program input characterization, a component for resolving the complexity in raw program inputs and extracting important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program's execution as soon as the execution starts. The top layer of the pyramid is input-centric adaptation, which capitalizes on the novel opportunities that the first two components create to facilitate proactive adaptation for program optimizations. By centering on program inputs, the new approach resolves a proactivity-adaptivity dilemma inherent in previous techniques. Its benefits are demonstrated through proactive dynamic optimizations and version selection, yielding significant performance improvement on a set of Java and C programs.
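
A minimal sketch of the three-layer pyramid; every name below is invented for illustration and none of it is taken from the paper:

    #include <cstddef>

    struct InputFeatures { std::size_t size; double density; };

    InputFeatures characterize(const char *input_path);   // layer 1: input characterization
    int  predict_best_version(const InputFeatures &f);    // layer 2: input-behavior model
    void run_baseline(const char *input_path);            // two precompiled program versions
    void run_optimized(const char *input_path);

    void run(const char *input_path)
    {
        InputFeatures f = characterize(input_path);       // cheap, done as the run starts
        if (predict_best_version(f) == 1)                  // layer 3: proactive adaptation,
            run_optimized(input_path);                     // before the bulk of the execution
        else
            run_baseline(input_path);
    }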


Symposium on Code Generation and Optimization | 2010

Exploiting statistical correlations for proactive prediction of program behaviors

Yunlian Jiang; Eddy Z. Zhang; Kai Tian; Feng Mao; Malcom Gethers; Xipeng Shen; Yaoqing Gao

This paper presents a finding and a technique on program behavior prediction. The finding is that surprisingly strong statistical correlations exist among the behaviors of different program components (e.g., loops) and among different types of program-level behaviors (e.g., loop trip-counts versus data values). Furthermore, the correlations can be beneficially exploited: they help resolve the proactivity-adaptivity dilemma faced by existing program behavior predictions, making it possible to combine the strengths of both approaches, namely the large scope and earliness of offline profiling-based predictions and the cross-input adaptivity of runtime sampling-based predictions. The main technique contributed by this paper centers on a new concept, seminal behaviors. Enlightened by the existence of strong correlations among program behaviors, we propose a regression-based framework to automatically identify a small set of behaviors that can lead to accurate prediction of other behaviors in a program. We call these seminal behaviors. By applying statistical learning techniques, the framework constructs predictive models that map from seminal behaviors to other behaviors, enabling proactive and cross-input adaptive prediction of program behaviors. The prediction helps a commercial compiler, the IBM XL C compiler, generate code that runs up to 45% faster (5%-13% on average), demonstrating the large potential of correlation-based techniques for program optimizations.
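
A minimal sketch of the kind of correlation model involved: an ordinary least-squares line fit offline from profiles, mapping a seminal behavior observed early in a run (e.g., an early loop's trip count) to another behavior needed for optimization; the actual framework selects seminal behaviors and builds its models automatically:

    #include <cstddef>

    struct LinearModel { double a, b; };   // y ~= a * x + b

    // Fit by ordinary least squares over n profiled runs.
    LinearModel fit(const double *x, const double *y, std::size_t n)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (std::size_t i = 0; i < n; ++i) {
            sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        return { a, (sy - a * sx) / n };
    }

    // At runtime: observe the seminal behavior, then predict the target behavior.
    double predict(const LinearModel &m, double seminal) { return m.a * seminal + m.b; }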

Collaboration


Dive into Eddy Z. Zhang's collaborations.

Top Co-Authors

Xipeng Shen (North Carolina State University)
Shuaiwen Leon Song (Association for Computing Machinery)
Yixun Liu (National Institutes of Health)