
Publication


Featured research published by Wanling Gao.


High-Performance Computer Architecture | 2014

BigDataBench: A big data benchmark suite from internet services

Lei Wang; Jianfeng Zhan; Chunjie Luo; Yuqing Zhu; Qiang Yang; Yongqiang He; Wanling Gao; Zhen Jia; Yingjie Shi; Shujie Zhang; Chen Zheng; Gang Lu; Kent Zhan; Xiaona Li; Bizhu Qiu

As the architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure of benchmarking and evaluating these systems rises. However, the complexity, diversity, frequently changing workloads, and rapid evolution of big data systems raise great challenges for big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diverse data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence are not qualified to serve the purposes mentioned above. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite, BigDataBench, not only covers broad application scenarios, but also includes diverse and representative data sets. Currently, we choose 19 big data benchmarks along the dimensions of application scenarios, operations/algorithms, data types, data sources, software stacks, and application types, which are comprehensive for fairly measuring and evaluating big data systems and architecture. BigDataBench is publicly available from the project home page http://prof.ict.ac.cn/BigDataBench. Also, we comprehensively characterize the 19 big data workloads included in BigDataBench with varying data inputs.
On a typical state-of-practice processor, the Intel Xeon E5645, we make the following observations. First, in comparison with traditional benchmarks, including PARSEC, HPCC, and SPEC CPU, big data applications have very low operation intensity, defined as the ratio of the total number of instructions to the total number of bytes of memory accessed. Second, the volume of the data input has a non-negligible impact on micro-architecture characteristics, which may pose challenges for simulation-based big data architecture research. Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the number of L1 instruction cache (L1I) misses per 1000 instructions (MPKI for short) of the big data applications is higher than in the traditional benchmarks; we also find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.
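The two metrics in this abstract are simple ratios over hardware-counter values. The sketch below illustrates them with hypothetical counter readings (not measurements from the paper), showing the low-operation-intensity, high-L1I-MPKI profile the authors describe for big data workloads.

```python
# Illustrative computation of operation intensity and MPKI from
# hypothetical hardware-counter values (not the paper's data).

def operation_intensity(total_instructions, memory_bytes):
    """Operation intensity: total instructions per byte of memory traffic."""
    return total_instructions / memory_bytes

def mpki(cache_misses, total_instructions):
    """Misses per 1000 instructions (MPKI)."""
    return cache_misses * 1000 / total_instructions

# Hypothetical counters for a big data workload vs. a compute-bound one.
big_data = {"insts": 5.0e11, "mem_bytes": 2.0e12, "l1i_misses": 1.5e10}
compute  = {"insts": 8.0e11, "mem_bytes": 4.0e11, "l1i_misses": 2.0e9}

for name, c in [("big data", big_data), ("compute-bound", compute)]:
    print(name,
          round(operation_intensity(c["insts"], c["mem_bytes"]), 3),
          round(mpki(c["l1i_misses"], c["insts"]), 2))
```

With these made-up numbers the big data workload shows 0.25 instructions per byte and 30 L1I MPKI, versus 2.0 and 2.5 for the compute-bound one, matching the qualitative trend reported above.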


arXiv: Databases | 2013

BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking

Zijian Ming; Chunjie Luo; Wanling Gao; Rui Han; Qiang Yang; Lei Wang; Jianfeng Zhan

Data generation is a key issue in big data benchmarking that aims to generate application-specific data sets to meet the 4 V requirements of big data. Specifically, big data generators need to generate scalable data (Volume) of different types (Variety) under controllable generation rates (Velocity) while keeping the important characteristics of raw data (Veracity). This gives rise to various new challenges about how we design generators efficiently and successfully. To date, most existing techniques can only generate limited types of data and support specific big data systems such as Hadoop. Hence we develop a tool, called Big Data Generator Suite (BDGS), to efficiently generate scalable big data while employing data models derived from real data to preserve data veracity. The effectiveness of BDGS is demonstrated by developing six data generators covering three representative data types (structured, semi-structured and unstructured) and three data sources (text, graph, and table data).
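The core BDGS workflow described above is: derive a model from real seed data, then generate data at any scale while preserving the seed's characteristics. The toy sketch below conveys only that workflow with a unigram word-frequency model; the model type, function names, and seed text are all illustrative assumptions, not BDGS code.

```python
# Toy "model from real data, then scale" sketch (not the actual BDGS
# algorithm): fit word frequencies from a seed text, then generate an
# arbitrarily large synthetic text with the same distribution.
import random
from collections import Counter

def fit_model(seed_text):
    """Learn a unigram frequency model from a small sample of real data."""
    counts = Counter(seed_text.split())
    total = sum(counts.values())
    return [(word, c / total) for word, c in counts.items()]

def generate(model, n_words, seed=42):
    """Generate n_words tokens: Volume is controlled by n_words,
    Veracity by reusing the seed data's word frequencies."""
    rng = random.Random(seed)
    words = [w for w, _ in model]
    weights = [p for _, p in model]
    return rng.choices(words, weights=weights, k=n_words)

model = fit_model("big data big benchmark data data")
synthetic = generate(model, 10000)   # scaled up, same distribution
```

The real generators model structured, semi-structured, and unstructured data; this sketch only shows why separating "fit" from "generate" lets the output scale without losing veracity.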


arXiv: Performance | 2012

The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems

Zhen Jia; Runlin Zhou; Chunge Zhu; Lei Wang; Wanling Gao; Yingjie Shi; Jianfeng Zhan; Lixin Zhang

We now live in an era of big data, and big data applications are becoming more and more pervasive. How to benchmark data center computer systems running big data applications (in short, big data systems) is a hot topic. In this paper, we focus on measuring the performance impacts of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, we find two major results through experiments: first, data scale has a significant impact on the performance of big data systems, so big data benchmarks must provide scalable volumes of data sets; second, even though all four applications use simple algorithms, their performance trends differ as the data scale increases, and hence benchmarking big data systems must consider not only the variety of data sets but also the variety of applications.


Symposium on Code Generation and Optimization | 2018

CVR: efficient vectorization of SpMV on x86 processors

Biwei Xie; Jianfeng Zhan; Xu Liu; Wanling Gao; Zhen Jia; Xiwen He; Lixin Zhang

Sparse Matrix-Vector Multiplication (SpMV) is an important computation kernel widely used in HPC and data centers. The irregularity of SpMV is a well-known challenge that limits SpMV's parallelism with vectorization operations. Existing work achieves limited locality and vectorization efficiency with large preprocessing overheads. To address this issue, we present the Compressed Vectorization-oriented sparse Row (CVR), a novel SpMV representation targeting efficient vectorization. CVR simultaneously processes multiple rows of the input matrix to increase cache efficiency and separates them into multiple SIMD lanes so as to take advantage of the vector processing units in modern processors. Our method is insensitive to the sparsity and irregularity of SpMV, and is thus able to deal with various scale-free and HPC matrices. We implement and evaluate CVR on an Intel Knights Landing processor and compare it with five state-of-the-art approaches using 58 scale-free and HPC sparse matrices. Experimental results show that CVR achieves speedups of up to 1.70× (1.33× on average) and up to 1.57× (1.10× on average) over the best existing approaches for scale-free and HPC sparse matrices, respectively. Moreover, CVR typically incurs the lowest preprocessing overhead among the state-of-the-art approaches.
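CVR's format is intricate, so rather than guess at it, the sketch below shows the standard CSR (Compressed Sparse Row) SpMV kernel that vectorization-oriented formats like CVR aim to accelerate. The indirect gather `x[col_idx[j]]` is the irregularity the abstract refers to; CVR reorganizes work across rows and SIMD lanes so that accesses become regular.

```python
# Baseline CSR SpMV (not CVR): y = A @ x for A stored as
# (values, col_idx, row_ptr). The gather through col_idx is the
# irregular access pattern that defeats straightforward vectorization.

def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]   # gather: irregular access
        y[i] = acc
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
values  = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0])  # [3.0, 3.0, 9.0]
```

Because consecutive rows can have wildly different lengths, a naive per-row vectorization of this loop leaves SIMD lanes idle; packing nonzeros from several rows into the lanes at once is the idea CVR pursues.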


IEEE Transactions on Parallel and Distributed Systems | 2017

Understanding Big Data Analytics Workloads on Modern Processors

Zhen Jia; Jianfeng Zhan; Lei Wang; Chunjie Luo; Wanling Gao; Yi Jin; Rui Han; Lixin Zhang

Big data analytics workloads are significant in modern data centers, and it is increasingly important to characterize representative workloads and understand their behaviors so as to improve the performance of data center computer systems. In this paper, we embark on a comprehensive study to understand the impacts and performance implications of big data analytics workloads on systems equipped with modern superscalar out-of-order processors. After investigating the three most important application domains in Internet services in terms of page views and daily visitors, we choose 11 representative data analytics workloads and characterize their micro-architectural behaviors using hardware performance counters. Our study reveals that big data analytics workloads share many inherent characteristics that place them in a class different from both traditional workloads and scale-out services. To further understand the characteristics of big data analytics workloads, we perform correlation analysis to identify the key factors that affect cycles per instruction (CPI). We also reveal that the increasing complexity of big data software stacks puts higher pressure on modern processor pipelines.
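The correlation analysis mentioned above boils down to computing CPI per workload and correlating it against candidate micro-architectural factors. The sketch below illustrates this with hypothetical counter values (not the paper's measurements), using a hand-rolled Pearson coefficient.

```python
# Illustrative CPI-vs-factor correlation analysis with hypothetical
# per-workload counters: (cycles, instructions, L2 MPKI).
import math

def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

workloads = [(9.0e11, 5.0e11, 12.0), (7.2e11, 6.0e11, 6.5),
             (1.1e12, 5.5e11, 15.0), (6.0e11, 6.0e11, 4.0)]
cpis  = [cpi(c, i) for c, i, _ in workloads]
mpkis = [m for _, _, m in workloads]
r = pearson(mpkis, cpis)   # a large positive r flags L2 misses as a key factor
```

In practice such counters come from tools like `perf`, and the analysis is repeated over many candidate factors to rank which ones move CPI the most.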


International Conference on Parallel Architectures and Compilation Techniques | 2018

Data motifs: a lens towards fully understanding big data and AI workloads

Wanling Gao; Jianfeng Zhan; Lei Wang; Chunjie Luo; Daoyi Zheng; Fei Tang; Biwei Xie; Chen Zheng; Xu Wen; Xiwen He; Hainan Ye; Rui Ren

The complexity and diversity of big data and AI workloads make understanding them difficult and challenging. This paper proposes a new approach to modelling and characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs. Each class of unit of computation captures the common requirements while being reasonably divorced from individual implementations, and hence we call it a data motif. For the first time, among a wide variety of big data and AI workloads, we identify eight data motifs that take up most of the run time of those workloads: Matrix, Sampling, Logic, Transform, Set, Graph, Sort, and Statistic. We implement the eight data motifs on different software stacks as the micro-benchmarks of an open-source big data and AI benchmark suite --- BigDataBench 4.0 (publicly available from http://prof.ict.ac.cn/BigDataBench), and perform a comprehensive characterization of those data motifs from the perspectives of data sizes, types, sources, and patterns as a lens towards fully understanding big data and AI workloads. We believe the eight data motifs are promising abstractions and tools not only for big data and AI benchmarking, but also for domain-specific hardware and software co-design.
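The "pipeline of data motifs" view can be made concrete with a toy example: a workload is a sequence of motif stages applied to data. The sketch below chains two of the eight motifs named above (Sort and Statistic); it is a hypothetical rendering of the abstraction, not BigDataBench 4.0 code.

```python
# Toy rendering of a workload as a pipeline of data motifs.

def motif_sort(data):
    """Sort motif: ordering a data set."""
    return sorted(data)

def motif_statistic(data):
    """Statistic motif: a summarizing computation (here, the median
    of an already-sorted data set)."""
    n = len(data)
    return data[n // 2] if n % 2 else (data[n // 2 - 1] + data[n // 2]) / 2

def run_pipeline(data, motifs):
    """Run a workload expressed as a sequence of motif stages, each
    consuming the previous stage's output."""
    for stage in motifs:
        data = stage(data)
    return data

# A "median" workload decomposes into Sort followed by Statistic.
median = run_pipeline([5, 1, 4, 2, 3], [motif_sort, motif_statistic])
```

The benchmarking payoff of this decomposition is that each motif can be characterized (and optimized in hardware) once, independently of the many workloads that reuse it.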


International Parallel and Distributed Processing Symposium | 2017

BigDataBench-S: An Open-Source Scientific Big Data Benchmark Suite

Xinhui Tian; Shaopeng Dai; Zhihui Du; Wanling Gao; Rui Ren; Yaodong Cheng; Zhifei Zhang; Zhen Jia; Peijian Wang; Jianfeng Zhan

Data generated from modern scientific instrumentation have grown to an unprecedented scale. Moreover, the data formats and computational behaviors of scientific big data workloads are much more complex than those in Internet services. These two facts pose a serious challenge to scientific data management and analytics. Among many concerns, the first is how to build a comprehensive and representative scientific big data benchmark suite. Previous benchmark efforts either focus on Internet areas (e.g., BigDataBench) or pay attention to a specific area (e.g., GeneBase). This paper presents our preliminary work on building a comprehensive scientific big data benchmark suite---BigDataBench-S. We also use BigDataBench-S to evaluate several general-purpose big data management systems designed for Internet services applications. Our evaluation shows that these systems cannot achieve the expected performance for many scientific workloads, especially complex matrix computation, due to the lack of appropriate mechanisms and policies for data storage, query optimization, and support of distributed matrix computation.


IEEE Transactions on Big Data | 2017

Understanding Processors Design Decisions for Data Analytics in Homogeneous Data Centers

Zhen Jia; Wanling Gao; Yingjie Shi; Sally A. McKee; Jianfeng Zhan; Lei Wang; Lixin Zhang

Our global economy increasingly depends on our ability to gather, analyze, link, and compare very large data sets. Keeping up with such big data poses challenges in terms of both computational performance and energy efficiency, and motivates different approaches to exploring data center systems and architectures. To better understand processor design decisions in the context of data analytics in data centers, we conduct comprehensive evaluations using representative data analytics workloads on representative conventional multi-core and many-core processors. After a comprehensive analysis of performance, power, energy efficiency, and performance-cost efficiency, we make the following observation: in contrast to the conventional wisdom of using wimpy many-core processors to improve energy efficiency, brawny multi-core processors with SMT (simultaneous multithreading) and dynamic overclocking technologies outperform their counterparts in terms of not only execution time but also energy efficiency for most of the data analytics workloads in our experiments.
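The counterintuitive result above follows from simple arithmetic: energy is power times runtime, so a higher-power processor can still win on energy if it finishes enough faster. The sketch below illustrates this with made-up power and runtime figures, not the paper's measurements.

```python
# Why a "brawny" chip can beat a "wimpy" chip on energy: E = P * t.
# All numbers below are hypothetical.

def energy_joules(avg_power_watts, runtime_s):
    """Total energy consumed by a run."""
    return avg_power_watts * runtime_s

brawny = {"avg_power_watts": 95.0, "runtime_s": 100.0}  # multi-core with SMT
wimpy  = {"avg_power_watts": 30.0, "runtime_s": 400.0}  # low-power many-core

e_brawny = energy_joules(**brawny)   # 9500.0 J
e_wimpy  = energy_joules(**wimpy)    # 12000.0 J
brawny_wins = e_brawny < e_wimpy     # faster AND more energy-efficient
```

Here the brawny processor draws about 3x the power but runs 4x faster, so it consumes less total energy; when the speedup is smaller than the power ratio, the wimpy design wins instead.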


arXiv: Performance | 2015

Benchmarking Big Data Systems: State-of-the-Art and Future Directions

Rui Han; Zhen Jia; Wanling Gao; Xinhui Tian; Lei Wang


arXiv: Databases | 2015

Identifying Dwarfs Workloads in Big Data Analytics

Wanling Gao; Chunjie Luo; Jianfeng Zhan; Hainan Ye; Xiwen He; Lei Wang; Yuqing Zhu; Xinhui Tian

Collaboration


Dive into Wanling Gao's collaboration.

Top Co-Authors

Jianfeng Zhan, Chinese Academy of Sciences
Lei Wang, Chinese Academy of Sciences
Chunjie Luo, Chinese Academy of Sciences
Zhen Jia, Chinese Academy of Sciences
Rui Ren, Chinese Academy of Sciences
Xiwen He, Chinese Academy of Sciences
Chen Zheng, Chinese Academy of Sciences
Daoyi Zheng, Chinese Academy of Sciences
Gang Lu, Chinese Academy of Sciences
Lixin Zhang, Chinese Academy of Sciences