Publications


Featured research published by Yuanrui Zhang.


Programming Language Design and Implementation | 2012

A compiler framework for extracting superword level parallelism

Jun Liu; Yuanrui Zhang; Ohyoung Jang; Wei Ding; Mahmut T. Kandemir

SIMD (single-instruction, multiple-data) instruction set extensions are quite common today in both high-performance and embedded microprocessors, and enable the exploitation of a specific type of data parallelism called SLP (superword level parallelism). While prior research shows that significant performance gains are possible when SLP is exploited, placing SIMD instructions in application code manually can be very difficult and error prone. In this paper, we propose a novel automated compiler framework for improving superword level parallelism exploitation. The key part of our framework consists of two stages: superword statement generation and data layout optimization. The first stage is our main contribution and has two phases, statement grouping and statement scheduling, whose primary goals are to increase SIMD parallelism and, more importantly, capture more superword reuses among the superword statements through global data access and reuse pattern analysis. Further, as a complementary optimization, our data layout optimization organizes data in memory space such that the cost of memory operations for SLP is minimized. The results from our compiler implementation and tests on two systems indicate performance improvements as high as 15.2% over a state-of-the-art SLP optimization algorithm.
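
To make the SLP idea concrete, below is a minimal sketch in C with x86 SSE intrinsics of the kind of transformation an SLP pass performs on a group of isomorphic statements; the paper's grouping and scheduling heuristics are not reproduced here.

```c
/* Four isomorphic scalar statements form a candidate superword group;
 * an SLP pass replaces them with one 128-bit SIMD statement. */
#include <stdio.h>
#include <xmmintrin.h>   /* x86 SSE intrinsics (assumed target) */

void add4_scalar(float *a, const float *b, const float *c) {
    a[0] = b[0] + c[0];  /* isomorphic statements: same operation on */
    a[1] = b[1] + c[1];  /* adjacent data -- packable into a single  */
    a[2] = b[2] + c[2];  /* superword statement                      */
    a[3] = b[3] + c[3];
}

void add4_slp(float *a, const float *b, const float *c) {
    __m128 vb = _mm_loadu_ps(b);           /* pack b[0..3] */
    __m128 vc = _mm_loadu_ps(c);           /* pack c[0..3] */
    _mm_storeu_ps(a, _mm_add_ps(vb, vc));  /* one SIMD add and store */
}

int main(void) {
    float b[4] = {1, 2, 3, 4}, c[4] = {10, 20, 30, 40}, a[4];
    add4_slp(a, b, c);
    printf("%.0f %.0f %.0f %.0f\n", a[0], a[1], a[2], a[3]);
    return 0;
}
```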


International Symposium on Microarchitecture | 2009

Optimizing shared cache behavior of chip multiprocessors

Mahmut T. Kandemir; Sai Prashanth Muralidhara; Sri Hari Krishna Narayanan; Yuanrui Zhang; Ozcan Ozturk

One of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single-processor-centric data locality optimization schemes may not work well in the CMP case, as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler-directed code restructuring scheme for enhancing the locality of shared data in CMPs. The proposed scheme targets the last-level shared cache that exists in many commercial CMPs and has two components, namely allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps to reduce reuse distances for the shared data and improves on-chip cache performance. We evaluated our approach using the Splash-2 and Parsec applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme reduces inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme.
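
As a rough illustration of the allocation/scheduling idea, and not the paper's actual algorithm, the OpenMP sketch below hands out small consecutive chunks of iterations round-robin, so cores sweep neighboring data at about the same time rather than starting far apart in memory; the chunk size and loop body are illustrative choices.

```c
/* A sketch of co-scheduling cores over shared data: with a small static
 * chunk, chunk k goes to core k mod P, so cores advance through memory
 * nearly in lock-step and co-used blocks stay resident in the shared
 * last-level cache. CHUNK and the loop are illustrative. */
#include <omp.h>

#define N     (1 << 20)
#define CHUNK 1024              /* iterations covering one data block */

void scale(double *x, double k) {
    #pragma omp parallel for schedule(static, CHUNK)
    for (int i = 0; i < N; i++)
        x[i] *= k;
}
```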


Signal Processing Systems | 2011

Accurate Area, Time and Power Models for FPGA-Based Implementations

Lanping Deng; Kanwaldeep Sobti; Yuanrui Zhang; Chaitali Chakrabarti

This paper presents accurate area, time, and power estimation models for implementations using FPGAs from the Xilinx Virtex-2Pro family (Deng et al. 2008). These models are designed to facilitate efficient design space exploration in an automated algorithm-architecture codesign framework. Detailed models for estimating the number of slices, block RAMs, and 18×18-bit multipliers for fixed-point and floating-point IP cores have been developed. These models are also used to develop power models that consider the effects of logic power, signal power, clock power, and I/O power. Timing models have been developed to predict the latency of the fixed-point and floating-point IP cores. In all cases, the model coefficients have been derived using curve fitting or regression analysis. The modeling error is quite small for single IP cores; the error of the area estimate, for instance, is on average 0.95%. The error for fairly large examples, such as a floating-point implementation of an 8-point FFT, is also quite small: 1.87% for the estimated number of slices and 3.48% for the estimated power consumption. The proposed models have also been integrated into a hardware-software partitioning tool to facilitate design space exploration under area and time constraints.
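
The sketch below shows the general form such an estimation model takes: a closed-form function whose coefficients come from offline curve fitting. The quadratic shape and the coefficient values here are placeholders for illustration, not the paper's fitted models.

```c
/* Hypothetical fitted area model: predicted slice count for a w-bit
 * fixed-point multiplier, slices(w) = a*w^2 + b*w + c. The coefficients
 * are illustrative placeholders standing in for regression results. */
#include <stdio.h>

static double slices_model(int w) {
    const double a = 0.45, b = 2.1, c = 12.0;   /* placeholder fit */
    return a * w * w + b * w + c;
}

int main(void) {
    for (int w = 8; w <= 32; w += 8)
        printf("width %2d -> ~%.0f slices (illustrative)\n",
               w, slices_model(w));
    return 0;
}
```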


International Conference on Parallel Architectures and Compilation Techniques | 2011

Optimizing Data Layouts for Parallel Computation on Multicores

Yuanrui Zhang; Wei Ding; Jun Liu; Mahmut T. Kandemir

The emergence of multicore platforms offers several opportunities for boosting application performance. These opportunities, which include parallelism and data locality benefits, require strong support from compilers as well as operating systems. Current compiler research targeting multicores mostly focuses on code restructuring and mapping. In this work, we explore automatic data layout transformation targeting multithreaded applications running on multicores. Our transformation considers both the data access patterns exhibited by different threads of a multithreaded application and the on-chip cache topology of the target multicore architecture. It automatically determines a customized memory layout for each target array to minimize potential cache conflicts across threads. Our experiments show that our optimization brings significant benefits over state-of-the-art data locality optimization strategies when tested using 30 benchmark programs on an Intel multicore machine. The results also indicate that this strategy scales to larger core counts and performs better as data set sizes increase.
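
One mechanical form such a per-array layout transformation can take is sketched below: each thread's partition is padded out to a whole number of cache lines so partitions of different threads never share a line. The paper derives layouts from access patterns and cache topology; the line size, thread count, and partition size here are assumptions.

```c
/* Pad each thread's partition of an array up to a cache-line boundary,
 * so no cache line is shared by two threads' partitions. */
#include <stdlib.h>

#define LINE 64                /* cache-line size in bytes (assumed) */
#define NTHREADS 4
#define PER_THREAD 1023        /* doubles owned by each thread (illustrative) */

/* Round each partition up to a whole number of cache lines. */
#define PADDED \
    ((PER_THREAD * sizeof(double) + LINE - 1) / LINE * LINE / sizeof(double))

double *alloc_padded(void) {
    /* One allocation; thread t uses elements
     * [t*PADDED, t*PADDED + PER_THREAD), leaving the padding untouched. */
    return aligned_alloc(LINE, NTHREADS * PADDED * sizeof(double));
}
```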


Symposium on Code Generation and Optimization | 2011

On-chip cache hierarchy-aware tile scheduling for multicore machines

Jun Liu; Yuanrui Zhang; Wei Ding; Mahmut T. Kandemir

Iteration space tiling and scheduling are important techniques for optimizing loops, which constitute a large fraction of execution time in the computation kernels of both scientific codes and embedded applications. While tiling has been studied extensively in the context of both uniprocessor and multiprocessor platforms, prior research has paid less attention to tile scheduling, especially when targeting multicore machines with deep on-chip cache hierarchies. In this paper, we propose a cache hierarchy-aware tile scheduling algorithm for multicore machines, with the purpose of maximizing both horizontal and vertical data reuse in on-chip caches and balancing workloads across different cores. This scheduling algorithm is one of the key components in a source-to-source translation tool that we developed for automatic loop parallelization and multithreaded code generation from sequential codes. To the best of our knowledge, this is the first effort to develop a fully automated tile scheduling strategy customized for the on-chip cache topologies of multicore machines. The experimental results collected by executing twelve application programs on three commercial Intel machines (Nehalem, Dunnington, and Harpertown) reveal that our cache-aware tile scheduling brings about a 27.9% reduction in cache misses and, on average, a 13.5% improvement in execution time over an alternative method tested.
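
The sketch below illustrates only the placement aspect of hierarchy-aware tile scheduling, under an assumed topology where cores 0-1 share one L2 cache and cores 2-3 share another: consecutive tile pairs land on a cache-sharing core pair, so tiles that reuse each other's boundary data sit under one L2. The paper's scheduler additionally orders tiles and balances load, which is not modeled here.

```c
#define N 4096
#define T 512                  /* tile size; N/T tiles in total */
#define NCORES 4               /* assumed: cores 0-1 and 2-3 each share an L2 */

/* Tiles (2k, 2k+1) map to an L2-sharing core pair: tiles 0,1 -> cores
 * 0,1; tiles 2,3 -> cores 2,3; then round-robin. */
static int tile_to_core(int tile) {
    return tile % NCORES;
}

/* Per-core work: each core sweeps the tiles assigned to it. */
void sweep_tiles(float a[N], int core) {
    for (int tile = 0; tile * T < N; tile++) {
        if (tile_to_core(tile) != core) continue;
        for (int i = tile * T; i < (tile + 1) * T; i++)
            a[i] = 2.0f * a[i];
    }
}
```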


Measurement and Modeling of Computer Systems | 2011

Studying inter-core data reuse in multicores

Yuanrui Zhang; Mahmut T. Kandemir; Taylan Yemliha

Most existing research on emerging multicore machines focuses on parallelism extraction and architectural-level optimizations. While these optimizations are critical, complementary approaches such as data locality enhancement can also bring significant benefits. Most previous data locality optimization techniques have been proposed and evaluated in the context of single-core architectures. While one can expect these optimizations to be useful for multicore machines as well, multicores present further opportunities due to the shared on-chip caches that most of them accommodate. To optimize data locality for multicore machines, however, the first step is to understand the data reuse characteristics of multithreaded applications and the potential benefits shared caches can bring. Motivated by these observations, we make the following contributions in this paper. First, we give a definition of inter-core data reuse and quantify it on multicores using a set of ten multithreaded application programs. Second, we show that neither the on-chip cache hierarchies of current multicore architectures nor state-of-the-art (single-core-centric) code/data optimizations exploit the available inter-core data reuse in multithreaded applications. Third, we demonstrate that exploiting all available inter-core reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that optimizing for inter-core reuse aggressively, without considering the impact of doing so on intra-core reuse, can actually perform worse than optimizing for intra-core reuse alone. Finally, we present a novel, compiler-based data locality optimization strategy for multicores that carefully balances inter-core and intra-core reuse optimizations to maximize the benefits that can be extracted from shared caches. Our experiments with this strategy reveal that it is very effective in optimizing data locality on multicores.
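
A minimal sketch of the metric being studied: given an interleaved trace of (core, cache-block) accesses, classify each reuse by whether the previous access to that block came from a different core (inter-core reuse) or the same core (intra-core reuse). The trace format and block granularity are assumptions for illustration, not the paper's measurement infrastructure.

```c
#include <stdio.h>

#define NBLOCKS 8

int main(void) {
    /* Toy access trace: pairs of (core id, cache-block id). */
    int trace[][2] = {{0,3},{1,3},{0,5},{0,5},{1,5},{0,3}};
    int n = sizeof trace / sizeof trace[0];

    int last_core[NBLOCKS];          /* last core to touch each block */
    for (int b = 0; b < NBLOCKS; b++) last_core[b] = -1;

    int inter = 0, intra = 0;
    for (int i = 0; i < n; i++) {
        int core = trace[i][0], blk = trace[i][1];
        if (last_core[blk] == core)      intra++;  /* same-core reuse  */
        else if (last_core[blk] != -1)   inter++;  /* cross-core reuse */
        last_core[blk] = core;
    }
    printf("inter-core reuses: %d, intra-core reuses: %d\n", inter, intra);
    return 0;
}
```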


Design, Automation, and Test in Europe | 2010

A special-purpose compiler for look-up table and code generation for function evaluation

Yuanrui Zhang; Lanping Deng; Praveen Yedlapalli; Sai Prashanth Muralidhara; Hui Zhao; Mahmut T. Kandemir; Chaitali Chakrabarti; Nikos P. Pitsianis; Xiaobai Sun

Elementary functions are extensively used in computer graphics, signal and image processing, and communication systems. This paper presents a special-purpose compiler that automatically generates customized look-up tables and implementations for elementary functions under user-given constraints. The generated implementations include C/C++ code that can be used directly by applications running on multicores, as well as MATLAB-like code that can be translated directly into a hardware module on FPGA platforms. The experimental results show that our solutions for function evaluation bring significant performance improvements to applications on multicores, as well as significant resource savings to designs on FPGAs.
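
Below is a minimal sketch of the style of implementation such a compiler emits: a uniformly sampled look-up table plus linear interpolation. The function, table size, and input range are illustrative choices; in the paper these would be derived from the user's accuracy and resource constraints.

```c
/* Table-based evaluation of sin(x) on [0, 2*pi): a sampled table plus
 * linear interpolation between adjacent entries. */
#include <math.h>
#include <stdio.h>

#define TBL_BITS 10
#define TBL_SIZE (1 << TBL_BITS)

static float tbl[TBL_SIZE + 1];      /* +1 entry so tbl[i+1] is always valid */
static const float TWO_PI = 6.28318530718f;

void build_table(void) {
    for (int i = 0; i <= TBL_SIZE; i++)
        tbl[i] = sinf(TWO_PI * i / TBL_SIZE);
}

float sin_lut(float x) {             /* x must lie in [0, 2*pi) */
    float t = x / TWO_PI * TBL_SIZE; /* position in table units  */
    int   i = (int)t;                /* table index              */
    float f = t - i;                 /* fraction for interpolation */
    return tbl[i] + f * (tbl[i + 1] - tbl[i]);
}

int main(void) {
    build_table();
    printf("sin_lut(1.0) = %f, sinf(1.0) = %f\n", sin_lut(1.0f), sinf(1.0f));
    return 0;
}
```

A larger table (or higher-order interpolation) buys accuracy at the cost of memory, which is exactly the trade-off a table-generation compiler navigates automatically.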


Symposium on Code Generation and Optimization | 2013

Locality-aware mapping and scheduling for multicores

Wei Ding; Mahmut T. Kandemir; Praveen Yedlapalli; Yuanrui Zhang; Jithendra Srinivas

This paper presents a cache hierarchy-aware code mapping and scheduling strategy for multicore architectures. Our mapping strategy determines a loop iteration-to-core mapping by taking into account application data access patterns and the on-chip cache hierarchy. It employs a novel concept called "core vectors" to obtain a mapping matrix that exploits data reuse at different layers of the cache hierarchy based on reuse distances, with the goal of maximizing data locality at each level while minimizing data dependences across the cores. Our scheduling strategy, on the other hand, determines a schedule for the iterations assigned to each core, with the goal of reducing data reuse distances across the cores for dependence-free loop nests. Our experimental evaluation shows that the proposed mapping scheme significantly reduces miss rates at all levels of cache, as well as application execution time, and that when it is supported by scheduling, the reductions in cache miss rates and execution time become much larger.
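
The sketch below shows only the mechanical shape of applying a linear iteration-to-core mapping matrix; the paper's core-vector construction, which derives the matrix from reuse distances and the cache hierarchy, is not reproduced here, and the matrix values are placeholders.

```c
#define P 4                    /* number of cores (assumed) */

/* Hypothetical mapping row m = (m0, m1): iteration (i, j) runs on core
 * (m0*i + m1*j) mod P. A locality-aware compiler would choose m so that
 * iterations with short reuse distances land on cache-sharing cores. */
static int map_to_core(int i, int j, const int m[2]) {
    int v = m[0] * i + m[1] * j;
    return ((v % P) + P) % P;  /* non-negative modulo */
}
```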


International Symposium on Microarchitecture | 2011

A data layout optimization framework for NUCA-based multicores

Yuanrui Zhang; Wei Ding; Mahmut T. Kandemir; Jun Liu; Ohyoung Jang

Future multicore architectures are likely to include a large number of cores connected by an on-chip network with non-uniform cache access (NUCA). In such architectures, whether a data request is satisfied from a local cache or a remote cache can make an important difference. To exploit this NUCA property, prior research explored both architectural enhancements and compiler-based code optimization strategies. In this work, we take an alternate view and explore data layout optimizations to improve the locality of data accesses in a NUCA-based system. Our proposed approach includes three steps: array tiling, computation-to-core mapping, and layout customization. The first of these tries to identify the affinity between data and computation, taking into account parallelization information, with the goal of minimizing remote accesses. The second step maps computations (and their associated data) to cores with the goal of minimizing the average distance to data, and the last step further customizes the memory layout, taking into account the data placement policy adopted by the underlying architecture. We evaluated the success of this three-step approach in enhancing on-chip cache behavior using all application programs from the SPECOMP suite on a full-system simulator. Our results show that the proposed approach improves average data access latency and execution time by 24.7% and 18.4%, respectively, in the case of static NUCA, and by 18.1% and 12.7%, respectively, in the case of dynamic NUCA.
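
As a small sketch of the array tiling step, the function below linearizes a 2-D array tile by tile, so each tile occupies contiguous memory and can, in principle, be placed in the NUCA bank local to the core that computes on it. The array and tile sizes are illustrative, and the bank placement policy itself is architecture-specific and not modeled.

```c
#define N  1024
#define TB 64                  /* tile (block) edge, illustrative */

/* Offset of element (i, j) in a tiled row-major layout: tiles are laid
 * out row-major, and elements within each tile are row-major too. */
static long tiled_offset(int i, int j) {
    long ti = i / TB, tj = j / TB;            /* tile coordinates   */
    long tiles_per_row = N / TB;
    long tile_id = ti * tiles_per_row + tj;   /* which tile         */
    long within  = (long)(i % TB) * TB + (j % TB);
    return tile_id * (long)(TB * TB) + within;
}
```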


European Conference on Parallel Processing | 2010

Scalable parallelization strategies to accelerate NuFFT data translation on multicores

Yuanrui Zhang; Jun Liu; Emre Kultursay; Mahmut T. Kandemir; Nikos P. Pitsianis; Xiaobai Sun

The non-uniform FFT (NuFFT) has been widely used in many applications. In this paper, we propose two new scalable parallelization strategies to accelerate the data translation step of the NuFFT on multicore machines. Both schemes employ geometric tiling and binning to exploit data locality, and use recursive partitioning and scheduling with dynamic task allocation to achieve load balancing. The experimental results collected from a commercial multicore machine show that, with the help of our parallelization strategies, the data translation step is no longer the bottleneck in the NuFFT computation, even for large data sets and arbitrary input sample distributions.
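
A minimal sketch of the geometric binning that precedes parallel data translation: nonuniform 1-D sample coordinates in [0, 1) are bucketed into bins so that each bin, and the grid region it influences, can be handed to one task. The bin count and output layout are illustrative choices, not the paper's exact scheme.

```c
#define NBINS 64               /* number of geometric bins (illustrative) */

/* Fill bin_of[k] with the bin of sample k and counts[b] with the
 * population of bin b; a parallel driver can then assign bins (or
 * groups of adjacent bins) to tasks. */
void bin_samples(const double *x, int n, int *bin_of, int *counts) {
    for (int b = 0; b < NBINS; b++) counts[b] = 0;
    for (int k = 0; k < n; k++) {
        int b = (int)(x[k] * NBINS);    /* x[k] assumed in [0, 1) */
        if (b >= NBINS) b = NBINS - 1;  /* guard the right edge   */
        bin_of[k] = b;
        counts[b]++;
    }
}
```

Since bin populations are highly uneven for clustered sample distributions, the counts computed here are what a recursive partitioner would use to balance work across cores.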

Collaboration


Yuanrui Zhang's top co-authors:

Mahmut T. Kandemir, Pennsylvania State University
Jun Liu, Pennsylvania State University
Wei Ding, Pennsylvania State University
Emre Kultursay, Pennsylvania State University
Seung Woo Son, Argonne National Laboratory
Ohyoung Jang, Pennsylvania State University