Data Motifs: A Lens Towards Fully Understanding Big Data and AI Workloads
Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Fei Tang, Biwei Xie, Chen Zheng, Xu Wen, Xiwen He, Hainan Ye, Rui Ren
Wanling Gao
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
[email protected]

Jianfeng Zhan∗
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
[email protected]

Lei Wang
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Chunjie Luo
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Daoyi Zheng
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Fei Tang
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Biwei Xie
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Chen Zheng
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Xu Wen
University of Chinese Academy of Sciences
[email protected]

Xiwen He
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Hainan Ye
Beijing Academy of Frontier Sciences and Technology
[email protected]

Rui Ren
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]
ABSTRACT
The complexity and diversity of big data and AI workloads make understanding them difficult and challenging. This paper proposes a new approach to modelling and characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs. Each class of unit of computation captures the common requirements while being reasonably divorced from individual implementations, and hence we call it a data motif. For the first time, among a wide variety of big data and AI workloads, we identify eight data motifs that take up most of the run time of those workloads: Matrix, Sampling, Logic, Transform, Set, Graph, Sort and Statistics. We implement the eight data motifs on different software stacks as the micro benchmarks of an open-source big data and AI benchmark suite, BigDataBench 4.0 (publicly available from http://prof.ict.ac.cn/BigDataBench), and perform comprehensive characterization of those data motifs from the perspective of data sizes, types, sources, and patterns as a lens towards fully understanding big data and AI workloads. We believe the eight data motifs are promising abstractions and tools for not only big data and AI benchmarking, but also domain-specific hardware and software co-design.

∗ Jianfeng Zhan is the corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

PACT '18, November 1–4, 2018, Limassol, Cyprus
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5986-3/18/11...$15.00
https://doi.org/10.1145/3243176.3243190
CCS CONCEPTS
• Theory of computation → Models of computation; • Computing methodologies → Symbolic and algebraic manipulation; • Computer systems organization → Architectures.

KEYWORDS
Data Motif; Big Data; AI; Workload Characterization
ACM Reference Format:
Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Fei Tang, Biwei Xie, Chen Zheng, Xu Wen, Xiwen He, Hainan Ye, and Rui Ren. 2018. Data Motifs: A Lens Towards Fully Understanding Big Data and AI Workloads. In International Conference on Parallel Architectures and Compilation Techniques (PACT '18), November 1–4, 2018, Limassol, Cyprus. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3243176.3243190
1 INTRODUCTION

The complexity and diversity of big data and AI workloads make understanding them difficult and challenging. First, modern big data and AI workloads expand and change very fast, and it is impossible to create a new benchmark or proxy for every possible workload. Second, several fundamental changes, i.e., the end of Dennard scaling, the ending of Moore's Law, and Amdahl's Law and its implications for ending the "easy" multicore era, indicate that the only hardware-centric path left is domain-specific architectures [24]. To achieve higher efficiency, we need to tailor the architecture to the characteristics of a domain of applications [24]. However, the first step is to understand big data and AI workloads. Third, whether early in the architecture design process or later in system evaluation, it is time-consuming to run a comprehensive benchmark suite. The complex software stacks of modern workloads aggravate this issue. Modern big data and AI benchmark suites [18, 41] are too huge to run on simulators, which challenges time-constrained simulation and can even make it impossible. Fourth, overly complex workloads raise challenges for both the reproducibility and the interpretability of performance data when benchmarking systems.

Identifying abstractions of time-consuming units of computation is an important step toward fully understanding complex workloads. Much previous work [6, 10–12, 37] has illustrated the importance of abstracting workloads in the corresponding domains. TPC-C [10] is a successful benchmark built on the basis of frequently-appearing operations in the OLTP domain. HPCC [33] adopts a similar method to design a benchmark suite for high performance computing. The National Research Council proposes seven major tasks in massive data analysis [14], but they are macroscopic definitions of problems from the perspective of mathematics. Unfortunately, to the best of our knowledge, none of the previous work has identified time-consuming classes of units of computation in big data and AI workloads.

Also, identifying abstractions of time-consuming units of computation is an important step toward domain-specific hardware and software co-design. Straightforwardly, we can tailor the architecture to the characteristics of one application, several applications, or even a domain of applications [24]. The past has witnessed the success of neural network processors for machine learning [9, 28], GPUs for graphics and virtual reality [35], and programmable network switches and interfaces [24]. Moreover, if we can identify abstractions of time-consuming units of computation in big data and AI workloads and design domain-specific hardware and software systems for them, our target will be much more general-purpose. Meanwhile, optimizing the most time-consuming units of computation, rather than many algorithms case by case on different hardware or software systems, will be much more efficient.

In this paper, we propose a new approach to modelling and characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs, each of which captures the common requirements while being reasonably divorced from individual implementations [6]. We call this abstraction a data motif. Significantly different from traditional kernels, a data motif's behaviors are affected by the sizes, patterns, types, and sources of its data inputs; moreover, it reflects not only computation and memory access patterns, but also disk and network I/O patterns.

After thoroughly analyzing a majority of workloads in five typical big data application domains (search engine, social network, e-commerce, multimedia and bioinformatics), we identify eight data motifs that take up most of the run time: Matrix, Sampling, Logic, Transform, Set, Graph, Sort and
Statistics. We found that combinations of one or more data motifs with different weights in terms of runtime can describe most of the big data and AI workloads we investigated [19]. Considering various data inputs (text, sequence, graph, matrix and image data) with different data types and distributions, we implement the eight data motifs on different software stacks, including Hadoop [1], Spark [46], TensorFlow [5] and POSIX threads (Pthread) [8]. For big data, the implemented data motifs include sort (Sort), wordcount (Statistics), grep (Set), MD5 hash (Logic), matrix multiplication (Matrix), random sampling (Sampling), graph traversal (Graph) and FFT transformation (Transform). For AI, we implement 2-dimensional convolution (Transform), max pooling (Sampling), average pooling (Sampling), ReLU activation (Logic), sigmoid activation (Matrix), tanh activation (Matrix), fully connected (Matrix), and element-wise multiplication (Matrix), which are frequently-used computations in neural network modelling. We release the implemented data motifs as the micro benchmarks of an open-source big data and AI benchmark suite, BigDataBench. In the rest of the paper, we use "big data motifs" to denote the motif implementations for big data, and "AI motifs" to denote the motif implementations for AI.

Just like relational algebra in databases, the data motifs are promising fundamental concepts and tools for benchmarking, designing, measuring, and optimizing big data and AI systems. Based on the data motifs, we build the fourth version of BigDataBench [20], including micro benchmarks, each of which is a data motif; component benchmarks, each of which is a combination of several data motifs; and end-to-end application benchmarks, each of which is a combination of component benchmarks. We also build proxy benchmarks [19] for big data and AI workloads, which achieve a speedup of up to 1000 times in terms of runtime with a micro-architectural data accuracy of more than 90%. In this paper, as a first step, we call attention to performing comprehensive characterization of those data motifs from the perspective of data sizes, types, sources, and patterns as a lens towards fully understanding big data and AI workloads. On a typical state-of-practice processor, the Intel Xeon E5-2620 V3, we comprehensively characterize all data motif implementations and identify their bottlenecks.

Our contributions are five-fold:
• We identify eight data motifs through profiling a wide variety of big data and AI workloads.
• We provide diverse data motif implementations on the software stacks of Hadoop, Spark, TensorFlow, and Pthread.
• From the system and micro-architecture perspectives, we comprehensively characterize the behaviors of the data motifs and identify their bottlenecks. We find that these data motifs cover a wide variety of the performance space, from the perspectives of system and micro-architecture behaviors.
Moreover, the behavior of each motif is influenced not only by its algorithm, but also largely by the type, source, size, and pattern of its input data.
• From the system aspect, we find that some AI motifs, like convolution and fully-connected, are CPU-intensive, while other AI motifs, such as ReLU and Sigmoid used as activation layers, are not. Further, the AI motifs put little pressure on disk I/O, since they load a batch (e.g. 128 images) from disk every iteration.
• From the micro-architecture aspect, we find that these motifs show various computation and memory access patterns, exploiting different parallelism degrees of ILP and MLP. With the data size expanding, the percentage of frontend bound decreases while the backend bound increases.

The rest of the paper is organized as follows. Section 2 illustrates the motivation of identifying data motifs. Section 3 introduces the data motif identification methodology. Section 4 performs system and micro-architecture evaluations on the data motif implementations. In Section 5, we report the impact of data on the data motifs' behaviors from the perspectives of data size, data pattern, data type and data source. Section 6 introduces the related work. Finally, we draw a conclusion in Section 7.

Figure 1: The Computation Dependency Graph and Run Time Breakdown of the SIFT Workload. (The figure breaks SIFT into five stages and their units of computation: 1) builds Gaussian pyramid (13.16%): matrix multiplication, transform, downsample; 2) builds DoG pyramid (4.17%): matrix subtraction, matrix inversion; 3) finds keypoints (26.01%): sort, matrix inversion; 4) computes scale, orientation & descriptors (53.11%): statistics; 5) sort (0.53%): sort.)
2 MOTIVATION

We take two examples to explain why we should call attention to performing comprehensive characterization of those data motifs.
SIFT [32] is a typical workload for feature extraction, widely used to detect local features of input images. Fig. 1 shows the computation dependency graph and run time breakdown of the SIFT workload. In total, SIFT involves five data motifs. Gaussian filters G(x, y, σ) with different space scale factors σ are used to generate a group of image scale spaces, through convolution with the input image. The image pyramid downsamples these image scale spaces. A DoG image means a difference-of-Gaussian image, which is produced by matrix subtraction of adjacent image scale spaces in the image pyramid. After that, every point in one DoG scale space is sorted against the eight adjacent points in the same scale space and the points in the two adjacent scale spaces, to find the keypoints in the image. Through profiling, we find that computing descriptors, finding keypoints and building the Gaussian pyramid are the three main time-consuming parts of the SIFT workload. Furthermore, we analyze those three parts and find they consist of several classes of units of computation, namely Matrix, Sampling, Transform, Sort and Statistics, summing up to 83.23% of the total SIFT run time.
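The matrix-subtraction step that produces the DoG pyramid can be sketched in a few lines. This is a minimal pure-Python illustration with toy 2×2 scale spaces, not the SIFT implementation itself; real pipelines operate on full Gaussian-blurred images.

```python
def dog_pyramid(gaussian_levels):
    """Build a difference-of-Gaussian (DoG) pyramid by matrix
    subtraction of adjacent image scale spaces (a Matrix-motif step).

    gaussian_levels: list of same-shaped 2-D lists, ordered by
    increasing scale factor.
    """
    dog = []
    for g1, g2 in zip(gaussian_levels, gaussian_levels[1:]):
        # Elementwise subtraction of adjacent scale spaces.
        dog.append([[b - a for a, b in zip(row1, row2)]
                    for row1, row2 in zip(g1, g2)])
    return dog

# Toy 2x2 "scale spaces": each DoG level is an elementwise difference.
levels = [[[1, 2], [3, 4]], [[2, 2], [3, 5]], [[4, 4], [4, 4]]]
print(dog_pyramid(levels))  # → [[[1, 0], [0, 1]], [[2, 2], [1, -1]]]
```

Three Gaussian levels yield two DoG levels, mirroring how adjacent scale spaces are paired in the pyramid.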
Figure 2: The Computation Dependency Graph and Run Time Breakdown of One Iteration of the TensorFlow AlexNet Workload. (The figure decomposes one iteration into units of computation: 1) convolution (36.91%): Conv2d; 2) sampling (13.45%): max pooling, dropout; 3) matrix multiply (48.87%): fully connected; 4) basic statistics (0.76%): normalization.)
AlexNet [30] is a representative and widely-used convolutional neural network in deep learning. In total, it has eight layers: five convolutional layers and three fully connected layers. We profile one iteration of the AlexNet workload (implemented with TensorFlow) using the TensorBoard toolkit. Fig. 2 presents its computation dependency graph and run time breakdown. For each operator, we report its run time and its percentage of the total run time, such as 6.57 ms and 1.35% for the first convolution operator. We find that each iteration involves Transform (conv2d), Sampling (max pooling, dropout), Statistics (normalization), and Matrix (fully connected). Among them, matrix and transform computations occupy a large proportion: 48.87% and 36.91%, respectively.

Through the above analysis, we make the following observation. Though big data and AI workloads are very complex and fast-changing, we can consider them as a pipeline of one or more fundamental classes of units of computation performed on different initial or intermediate data inputs. Those classes of units of computation, which we call data motifs, occupy most of the run time of the workloads, so we should pay more attention to them. In the next section, we investigate more extensive big data and AI workloads, and elaborate the design of data motifs.
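Max pooling, which the analysis above classifies under the Sampling motif, keeps one representative value per window. A minimal pure-Python sketch (not the BigDataBench or TensorFlow implementation) for a 2×2 window with stride 2:

```python
def max_pool_2x2(x):
    """Max pooling with a 2x2 window and stride 2: a Sampling-motif
    computation, since it selects a subset value from each window.

    x: 2-D list with even dimensions.
    """
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]]
print(max_pool_2x2(feature_map))  # → [[4, 2], [2, 8]]
```

Average pooling differs only in replacing `max` with a mean, which (as the evaluation later notes) introduces divide operations.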
3 DATA MOTIF IDENTIFICATION METHODOLOGY

Data motifs are frequently-appearing classes of units of computation handling different data inputs. In this section, we illustrate how to identify data motifs from big data and AI workloads, and describe our data motif implementations.
Fig. 3 overviews the methodology of motif identification. We first single out a broad spectrum of big data and AI workloads by investigating five typical application domains (search engine, social network, e-commerce, multimedia, and bioinformatics) and representative algorithms in four processing techniques (machine learning, data mining, computer vision and natural language processing). Then we conduct algorithmic analysis and profiling analysis on these workloads. We profile each workload to analyze its computation dependency graph and run time breakdown, and to find and correlate the hotspot functions with the corresponding code segments. Combining this with the algorithmic analysis, we decompose the workload into a pipeline of units of computation, focusing on the input and intermediate data as well. We then summarize the frequently-appearing and time-consuming units as data motifs. We repeat this procedure on forty workloads across a broad spectrum to guarantee the representativeness of our data motifs.

According to the units-of-computation pipelines and run time breakdowns, we finalize eight big data and AI motifs, which are the essential computations that take up most of the run time. Table 1 shows the importance of the eight data motifs in a majority of big data and AI workloads. Note that previous work [23] identified four basic units of computation in online services (get, put, post, delete); we do not include those four in our motif set.

Table 1: The Importance of the Eight Data Motifs in Big Data and AI Workloads.

| Category | Application Domain | Workload | Units of Computation |
| Deep Learning | Image Recognition, Speech Recognition | Convolutional neural network (CNN) | Matrix, Sampling, Transform |
| | | Deep belief network (DBN) | Matrix, Sampling |
| Graph Mining | Search Engine, Community Detection | PageRank | Matrix, Graph, Sort |
| | | BFS, Connected component (CC) | Graph |
| Dimension Reduction | Image Processing, Text Processing | Principal components analysis (PCA) | Matrix |
| | | Latent dirichlet allocation (LDA) | Statistics, Sampling |
| Recommendation, Association Rules Mining | Electronic Commerce | Apriori | Statistics, Set |
| | | FP-Growth | Graph, Set, Statistics |
| | | Collaborative filtering (CF) | Graph, Matrix |
| Classification | Image Recognition, Speech Recognition, Text Recognition | Support vector machine (SVM) | Matrix |
| | | K-nearest neighbors (KNN) | Matrix, Sort, Statistics |
| | | Naive Bayes | Statistics |
| | | Random forest | Graph, Statistics |
| | | Decision tree (C4.5/CART/ID3) | Graph, Statistics |
| Clustering | Data Mining | K-means | Matrix, Sort |
| Feature Preprocess | Image Processing, Signal Processing, Text Processing | Image segmentation (GrabCut) | Matrix, Graph |
| | | Scale-invariant feature transform (SIFT) | Matrix, Transform, Sampling, Sort, Statistics |
| | | Image Transform | Matrix, Transform |
| | | Term frequency-inverse document frequency (TF-IDF) | Statistics |
| Sequence Tagging | Bioinformatics, Language Processing | Hidden Markov Model (HMM) | Matrix |
| | | Conditional random fields (CRF) | Matrix, Sampling |
| Indexing | Search Engine | Inverted index, Forward index | Statistics, Logic, Set, Sort |
| Encoding/Decoding | Multimedia Processing, Security, Cryptography, Digital Signature | MPEG-2 | Matrix, Transform |
| | | Encryption | Matrix, Logic |
| | | SimHash, MinHash | Set, Logic |
| | | Locality-sensitive hashing (LSH) | Set, Logic |
| Data Warehouse | Business Intelligence | Project, Filter, OrderBy, Union | Set, Sort |

Figure 3: Identifying Data Motifs. (The figure depicts the identification flow: big data and AI workloads are decomposed through algorithmic analysis, yielding the pipeline of units of computation and the input and intermediate data, and through profiling analysis, yielding the run time breakdown and computation graph; both feed the data motifs, i.e., frequently-appearing units of computation and their data inputs by type, source, size, and pattern.)
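The profiling step above, correlating hotspot functions to units of computation and ranking them by run time, can be sketched as a small aggregation. The profile entries and motif labels below are illustrative, not the paper's measurements:

```python
def motif_breakdown(profile):
    """Aggregate per-hotspot run times into per-motif run-time shares.

    profile: list of (function_name, seconds, motif_label) tuples,
    e.g. from a profiler plus manual inspection of the code segments.
    Returns {motif: fraction of total run time}, sorted descending.
    """
    total = sum(sec for _, sec, _ in profile)
    shares = {}
    for _, sec, motif in profile:
        shares[motif] = shares.get(motif, 0.0) + sec
    return dict(sorted(((m, s / total) for m, s in shares.items()),
                       key=lambda kv: -kv[1]))

# Hypothetical SIFT-like profile (function names and times made up).
prof = [("gaussian_blur", 40.0, "Transform"),
        ("mat_mul", 25.0, "Matrix"),
        ("find_keypoints", 20.0, "Sort"),
        ("descriptors", 15.0, "Statistics")]
print(motif_breakdown(prof))
# → {'Transform': 0.4, 'Matrix': 0.25, 'Sort': 0.2, 'Statistics': 0.15}
```

Repeating this across the forty workloads yields the per-workload motif weights summarized in Table 1.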
In this subsection, we summarize the eight data motifs that frequently appear in big data and AI workloads.

Matrix. In big data and AI workloads, many problems involve matrix computations, such as vector-vector, matrix-vector and matrix-matrix operations.

Sampling. Sampling plays an essential role in big data and AI processing; it selects a subset of samples according to a certain statistical population. It can be used to obtain an approximate solution when a problem cannot be solved by deterministic methods.

Logic. We name computations performing bit manipulation, such as hashing, data compression and encryption, logic computations.

Transform. The transform computations here mean the conversion from the original domain (such as time) to another domain (such as frequency). Common transform computations include the discrete Fourier transform (DFT), the discrete cosine transform (DCT) and the wavelet transform.

Set. In mathematics, a set means a collection of distinct objects. Likewise, the concept of a set is widely used in computer science and is the foundation of relational algebra [34]. In addition, similarity analysis of two data sets involves set computations, such as Jaccard similarity. Furthermore, fuzzy sets and rough sets play very important roles in computer science.

Graph. A lot of applications involve graphs, with nodes representing entities and edges representing dependencies. Graph computation is notorious for its irregular memory access patterns.

Sort. Sort is widely used in many areas. Jim Gray considered sort to be the core of modern databases [6], which shows its fundamentality.

Statistics. Statistic computations are used to obtain summary information through statistical computations, such as counting and probability statistics.
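As a concrete instance of the Set motif, the Jaccard similarity mentioned above reduces to intersection and union sizes. A minimal sketch:

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|: a Set-motif computation
    used for similarity analysis of two data sets."""
    a, b = set(a), set(b)
    if not (a | b):
        # Two empty sets are conventionally treated as identical.
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard(["spark", "hadoop", "flink"],
              ["spark", "hadoop", "storm"]))  # → 0.5
```

The same intersection/union primitives underlie SimHash- and MinHash-style similarity estimation listed in Table 1.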
Data motifs are the fundamental components of big data and AI workloads, which makes them of great significance for evaluation, considering the complexity and diversity of those workloads. We provide the data motif implementations for big data and AI separately, according to their computation specialties. For the big data motifs, we provide Hadoop [1], Spark [46], and Pthread [8] implementations. These data motifs include sort, wordcount, grep, MD5 hash, matrix multiplication, random sampling, graph traversal and FFT transformation. For the AI motifs, we provide TensorFlow [5] and Pthread implementations, including 2-dimensional convolution, max pooling, average pooling, ReLU activation, sigmoid activation, tanh activation, fully connected (matmul), and element-wise multiplication. We consider the impact of data input from the perspectives of type, source, size, and pattern. Data type covers structured, un-structured, and semi-structured data. Data source indicates the data storage format, including text, sequence, graph, matrix, and image data. Data pattern includes the data distribution, data sparsity, etc. As for data size, we provide big data generators for text, sequence, graph and matrix data to fulfill different size requirements.
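Wordcount, listed above as the Statistics micro benchmark, can be sketched on a single node in plain Python; the released motifs are the Hadoop, Spark and Pthread implementations, not this sketch:

```python
from collections import Counter

def wordcount(lines):
    """Count word occurrences: the Statistics motif behind the
    wordcount micro benchmark (single-node sketch, not MapReduce)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

text = ["big data and ai", "big data motifs"]
print(wordcount(text).most_common(2))  # → [('big', 2), ('data', 2)]
```

In the Hadoop version, the per-line counting is the map phase and the summation per word is the reduce phase; the unit of computation is the same.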
4 EVALUATION

In this section, we evaluate the data motifs with various software stacks from the perspectives of both system and architecture behaviors.
We deploy a three-node cluster with one master node and two slave nodes, connected by a 1 Gb Ethernet network. Each node is equipped with two Intel Xeon E5-2620 V3 (Haswell) processors, and each processor has six physical out-of-order cores. The memory of each node is 64 GB. The operating system, software stacks and GCC versions are as follows: CentOS 7.2 (with kernel 4.1.13); JDK 1.8.0_65; Hadoop 2.7.1; Spark 1.5.2; TensorFlow 1.0; GCC 4.8.5. The data motifs implemented with Pthread are compiled with the "-O2" option for optimization. The hardware and software details are listed in Table 2. Since Pthread is a multi-thread programming model, we evaluate both the TensorFlow and Pthread implementations of the AI motifs on one node for an apples-to-apples comparison.

Table 2: Configuration Details of Xeon E5-2620 V3
CPU Type: Intel ® Xeon E5-2620 V3
CPU Cores: 12 cores @ 2.40 GHz
L1 DCache: 12 × 32 KB
L1 ICache: 12 × 32 KB
L2 Cache: 12 × 256 KB
L3 Cache: 15 MB
Memory: 64 GB, DDR4
Disk: SATA @ 7200 RPM
Ethernet: 1 Gb
Hyper-Threading: Disabled
We evaluate the eight big data motifs implemented with Hadoop and Spark, and the eight AI motifs implemented with TensorFlow and Pthread. Note that we use the optimal configurations for each software stack, according to the cluster scale and memory size. The data configuration and selected metrics are listed as follows.

Data Configuration. To evaluate the impact of data input comprehensively, we evaluate the data motifs with three data sizes: Small, Medium, and Large. We choose the Large data size according to the memory capacity of the cluster so as to fully utilize the memory resources, and the other two are chosen for comparison. For the graph motif, the Small, Medium, and Large inputs are 2-, 2- and 2-vertex graphs, respectively. For the matrix motif, we use 100, 1K and 10K two-dimensional matrix data with the same distribution and sparsity. For the transform motif, we use 16384, 32768 and 65536 two-dimensional matrix data. For the other big data motifs, we use 1, 10 and 100 GB wikipedia text data, respectively. For the AI motifs, we use three configurations in terms of input tensor sizes and channels: (224*224, 64), (112*112, 128) and (56*56, 256). The first value indicates the dimension of the input tensor, the second value indicates the channels, and all of them use 128 as the batch size. We choose these three configurations because they are widely used in neural network models [39]. Note that the dimension of the input tensor is 224 for the Large configuration, 112 for Medium and 56 for Small. For the Pthread-version AI motifs, we use 1K, 10K and 100K images from ImageNet [15]. In the following subsections, we characterize the system and micro-architectural behaviors of the data motifs with the Large data size. In Section 5, we analyze the impact of data input on these characteristics with all data sizes.
System and Micro-architecture Metrics. We characterize the system and micro-architectural behaviors [40] of the data motifs, which are significant for design and optimization [36]. For the system evaluation, we report CPU utilization, I/O wait, disk I/O bandwidth, and network I/O bandwidth. The system metrics are collected through the proc file system.

For the micro-architectural evaluation, we use the Top-Down analysis method [44], which categorizes the pipeline slots into four categories: retiring, bad speculation, frontend bound and backend bound. Among them, retiring represents useful work, meaning the issued micro-operations (uops) eventually get retired. Bad speculation represents the pipeline being blocked due to incorrect speculation. Frontend bound represents stalls due to the frontend, which undersupplies uops to the backend. Backend bound represents stalls due to the backend, which lacks the required resources for new uops [4]. We use Perf [3], a Linux profiling tool, to collect the hardware events, referring to the Intel Developer's Manual [22] and pmu-tools [4].

Figure 4: CPU Utilization and I/O Wait of Data Motifs.
Figure 5: I/O Behaviors of Data Motifs.
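The level-1 Top-Down breakdown described above can be computed from a handful of counters. This sketch follows the general shape of the method of [44] as popularized by pmu-tools, assuming a 4-wide issue machine; the counter values below are illustrative, not measured data, and Intel's exact event recipe varies by microarchitecture:

```python
def topdown_level1(slots_width, cycles, uops_issued, uops_retired_slots,
                   recovery_cycles, fe_undelivered):
    """Level-1 Top-Down breakdown as fractions of pipeline slots.

    slots_width: issue width (4 on Haswell), so slots = width * cycles.
    fe_undelivered: slots the frontend failed to deliver uops for.
    recovery_cycles: cycles spent recovering from mis-speculation.
    """
    slots = slots_width * cycles
    frontend = fe_undelivered / slots
    bad_spec = (uops_issued - uops_retired_slots
                + slots_width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever slots remain are stalled waiting on backend resources.
    backend = 1.0 - frontend - bad_spec - retiring
    return {"frontend": frontend, "bad_speculation": bad_spec,
            "retiring": retiring, "backend": backend}

# Illustrative counter values (not measurements from the paper).
print(topdown_level1(4, 1000, 2600, 2400, 25, 400))
```

The four fractions sum to one by construction, which is what makes the breakdown a complete accounting of pipeline slots.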
Fig. 4 presents the CPU utilization and I/O wait of all data motifs. We find that the Hadoop motifs have higher CPU utilization than the Spark motifs, and suffer from less I/O wait than the Spark motifs do. Particularly, the Hadoop motifs take 80 percent of the CPU time. The I/O waits of the AI motifs are much lower than those of the big data motifs. For deep neural networks, even though the total input data is large, the input layer loads one batch from disk per iteration, so the data loaded from disk by the input layer occupies a very small proportion compared to the intermediate data, and thus introduces few disk I/O requests. The Pthread motifs have lower CPU utilization and I/O wait in general, because they have fewer memory allocation and relocation operations than their counterparts on other stacks. Moreover, their data loading time overlaps the processing time since the computation is simple, except that Pthread Matmul has almost 100% CPU utilization because of its high computation complexity and CPU-intensive characteristics. TensorFlow motifs such as AvgPool, Conv, Matmul, Maxpool, and Multiply take most of the CPU time, because these five motifs are CPU-intensive. Nevertheless, we also find that the other AI motifs, such as Relu, Sigmoid, and Tanh, are not that CPU-intensive.

Fig. 5 presents the network bandwidth and disk I/O bandwidth. For the AI motifs, most operations (e.g. matmul, relu, pooling, activation) are executed in the hidden layers, and the intermediate states of the hidden layers are stored in memory. That is to say, the hidden layers consume most of the computation and memory storage resources, while the disk I/O for the input layer is relatively minor. Our evaluation confirms this observation. Meanwhile, as mentioned in Section 4.1, we evaluate both the TensorFlow and Pthread implementations of the AI motifs on one node for an apples-to-apples comparison, so we do not report the I/O behaviors of the AI motifs. We find that for all big data motifs, the Spark stack has much larger network I/O pressure than the Hadoop stack, because the Spark stack has more data shuffles and thus needs to transfer data between nodes frequently. Five of the eight Spark implementations have smaller disk I/O pressure than their Hadoop counterparts, because Spark targets in-memory computing. The exceptions, Spark Matmul, Spark MD5 and Spark WordCount, have larger disk I/O pressure than their Hadoop counterparts: their disk I/O read sector numbers are nearly equal, while their write sector numbers are much larger.
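The CPU-utilization and I/O-wait metrics above are collected through the proc file system. A sketch of deriving them from two samples of the aggregate `cpu` line in /proc/stat (field layout per the proc(5) man page: user, nice, system, idle, iowait, ...); the sample lines below are illustrative:

```python
def cpu_iowait_share(stat_line_a, stat_line_b):
    """Compute CPU utilization and I/O-wait share between two samples
    of the aggregate 'cpu' line in /proc/stat.

    CPU utilization is taken as the non-idle, non-iowait share of the
    elapsed jiffies; I/O wait is the iowait share.
    """
    a = [int(x) for x in stat_line_a.split()[1:]]
    b = [int(x) for x in stat_line_b.split()[1:]]
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta)
    idle, iowait = delta[3], delta[4]  # idle is field 4, iowait field 5
    return {"cpu_util": 1 - (idle + iowait) / total,
            "io_wait": iowait / total}

# Two illustrative samples taken some interval apart.
s1 = "cpu 100 0 50 800 50 0 0 0 0 0"
s2 = "cpu 300 0 150 900 150 0 0 0 0 0"
print(cpu_iowait_share(s1, s2))
```

On a live Linux system the two lines would come from reading /proc/stat twice with a sleep in between.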
To better understand the data motifs, we analyze their performance and micro-architectural characteristics.

Figure 6: Execution Performance of Data Motifs.

Execution Performance. The execution performance indicates the overall running efficiency of the workloads [29]. We use instruction level parallelism (ILP) and memory level parallelism (MLP) to reflect the execution performance. ILP measures the number of instructions that can be executed simultaneously; here we use retired instructions per cycle (IPC) to measure it. MLP indicates the parallelism degree at which memory accesses can be generated and executed [21]; it is computed by dividing L1D_PEND_MISS.PENDING by L1D_PEND_MISS.PENDING_CYCLES [4]. Fig. 6 presents the ILP and MLP of all data motifs. We find that these motifs cover a wide range of ILP and MLP behaviors, reflecting distinct computation and memory access patterns. For example, TensorFlow Multiply does element-wise multiplications and has high MLP (5.27) but extremely low ILP (0.15). This is because its computation is simple and has few data dependencies, so it generates many concurrent data loads and thus incurs a large number of data cache misses. Also, max pooling and average pooling have high MLP. The MLP of average pooling is lower than that of max pooling, because the average computation involves many divide operations, and thus suffers from more stalls due to the delay of the divider unit. The software stack changes a workload's computation and memory access patterns, which was also found in previous work [25]. For example, both Hadoop FFT and Spark FFT are based on the Cooley-Tukey algorithm [13], while they have different parallelism degrees; Spark FFT is more memory-intensive and has higher MLP.
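The two metrics above are simple counter ratios; a sketch with illustrative counter values (not measured data), using the IPC definition and the MLP formula given in the text:

```python
def ipc_and_mlp(instructions, cycles,
                l1d_pend_miss_pending, l1d_pend_miss_pending_cycles):
    """ILP measured as retired instructions per cycle (IPC), and MLP
    as L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES, i.e. the
    average number of outstanding L1D misses while any miss is pending.
    """
    return {"ipc": instructions / cycles,
            "mlp": l1d_pend_miss_pending / l1d_pend_miss_pending_cycles}

# Illustrative counter values, e.g. as collected with perf.
print(ipc_and_mlp(3_000_000, 2_000_000, 500_000, 100_000))
# → {'ipc': 1.5, 'mlp': 5.0}
```

A low-IPC, high-MLP result (like TensorFlow Multiply's 0.15 / 5.27) corresponds to many concurrent outstanding misses with few instructions completing per cycle.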
The Uppermost Level Breakdown
Fig. 7 shows the uppermost level breakdown of all the data motifs we evaluated. These motifs have different pipeline bottlenecks. Hadoop motifs suffer notable stalls from frontend bound and bad speculation. Moreover, Hadoop motifs show nearly identical bottlenecks, indicating that the Hadoop stack shapes workload behavior more strongly than other stacks such as Spark and TensorFlow. Spark motifs, which mainly compute in memory, suffer a higher percentage of backend bound than their Hadoop counterparts. Spark Grep, Sample, and Sort suffer more frontend bound, and their percentages of backend bound are smaller than those of the others. The AI data motifs face different bottlenecks on TensorFlow and on Pthreads. Conv and Matmul have the highest IPC (about 2.2) and retiring percentages (about 50% on TensorFlow). Max pooling, average pooling, and multiply have extremely low retiring percentages, as illustrated above. In contrast, activation operations like ReLU, sigmoid, and tanh suffer more frontend bound than backend bound. For the AI data motifs implemented with Pthreads, the main bottleneck is backend bound; they suffer few frontend and bad speculation stalls.

Figure 7: The Uppermost Level Breakdown of Data Motifs.
Figure 8: The Frontend Breakdown of Data Motifs.
Figure 9: The Frontend Latency Breakdown of Data Motifs.
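The uppermost level breakdown follows the Top-Down method [44], which apportions pipeline slots (four per cycle on the measured cores) among the four categories. A minimal sketch of the level-1 arithmetic, assuming Haswell-style event names and invented counter values:

```python
# Sketch of the uppermost-level Top-Down breakdown (Yasin's method [44]),
# computed from pipeline-slot counters. Event names follow Intel's Haswell
# PMU; the counter values here are invented for illustration.

SLOTS_PER_CYCLE = 4  # issue width of the modeled core

def top_down_level1(cycles, idq_uops_not_delivered, uops_issued,
                    uops_retired_slots, recovery_cycles):
    slots = SLOTS_PER_CYCLE * cycles
    frontend_bound = idq_uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + SLOTS_PER_CYCLE * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"frontend": frontend_bound, "bad_spec": bad_speculation,
            "retiring": retiring, "backend": backend_bound}

breakdown = top_down_level1(
    cycles=1_000_000_000,
    idq_uops_not_delivered=400_000_000,   # IDQ_UOPS_NOT_DELIVERED.CORE
    uops_issued=2_200_000_000,            # UOPS_ISSUED.ANY
    uops_retired_slots=2_000_000_000,     # UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles=25_000_000,           # INT_MISC.RECOVERY_CYCLES
)
for category, fraction in breakdown.items():
    print(f"{category}: {fraction:.1%}")
```

The four fractions sum to one by construction; the frontend and backend categories are refined further at the next levels, as in Figs. 8 through 10.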
Frontend Bound
Frontend bound can be split into frontend latency bound and frontend bandwidth bound. Latency bound means the frontend delivers no uops to the backend, while bandwidth bound means the frontend delivers fewer uops than the theoretical maximum. Fig. 8 presents the frontend breakdown of the data motifs. For almost all motifs that suffer severe frontend bound, latency bound is the main cause of the frontend stalls.

Figure 10: The Backend Bound Breakdown of Data Motifs.

We further investigate the causes of frontend latency bound and frontend bandwidth bound, respectively. Generally, frontend latency bound arises from six causes: icache misses, itlb misses, branch resteers, DSB (Decoded Stream Buffer) switches, LCP (Length Changing Prefix), and MS (microcode sequencer) switches. Icache misses and itlb misses are instruction cache misses and instruction TLB misses. Branch resteers measures the delays in fetching the correct instructions, such as those caused by branch mispredictions. LCP measures the stalls when decoding instructions with a length-changing prefix. Uops come from three places: the decoded uops cache (DSB), the legacy decode pipeline (MITE), and the microcode sequencer (MS). DSB switches record the stalls caused by switching from the DSB to the MITE, and MS switches measure the penalty of switching to the MS unit. Frontend bandwidth bound has two main causes: inefficiency of the MITE pipeline and inefficient utilization of the DSB cache. Additionally, LSD represents the stalls due to waiting for uops from the loop stream detector [2]. Fig. 9 lists the latency and bandwidth bound breakdown of all data motifs. For almost all data motifs, branch resteers is a main reason for the high percentage of frontend bound, except Spark Matmul and ReLU, Sigmoid, and Tanh on TensorFlow. For these three activation functions, nearly 60% of the frontend bound is due to instruction cache misses.
On average, the big data motifs implemented with Hadoop and Spark suffer more icache misses than the AI data motifs. Moreover, MS switches are another significant contributor to frontend latency bound: big data and AI systems use many complex instructions that cannot be handled by the default decoders and must instead be decoded by the MS unit, which incurs performance penalties.

Backend Bound
Fig. 10 presents the backend bound breakdown of the data motifs, split into backend memory bound and backend core bound. Backend memory bound is mainly caused by data movement delays across the memory hierarchy. Backend core bound is mainly caused by a lack of hardware resources (e.g., the divider unit) or by port under-utilization due to instruction dependencies and execution unit overloading. More than half of these data motifs suffer more backend memory bound than core bound. However, for each software stack there is at least one data motif whose core bound percentage equals or exceeds its memory bound percentage, such as Hadoop WordCount, Spark MD5, TensorFlow Conv, and Pthread AvgPool.

Figure 11: The Backend Core Bound Breakdown of Data Motifs.

Fig. 11 shows the core bound breakdown. TensorFlow AvgPool and Hadoop WordCount suffer significantly from the long latency of the divider unit, while Spark MD5 and TensorFlow Conv, which have the highest percentages of backend core bound, mainly suffer stalls from port under-utilization. As for backend memory bound, DRAM memory bound is much more severe than level 1, 2, and 3 cache bound for almost all big data and AI motifs, indicating that the memory wall [42] still exists and needs to be optimized.
In this section, we evaluate the impact of data input on system and micro-architecture behaviors from the perspectives of size, source, type, and pattern. For the type and pattern evaluations, we use Sort and FFT as examples, respectively.
Based on all sixty system and micro-architecture metrics we evaluated in Section 4, we conduct a coarse-grained similarity analysis using PCA (principal component analysis) [27] and hierarchical clustering [26] on three data size configurations. Fig. 12 presents the linkage distance of all data motifs, which indicates the similarity of their system and micro-architecture behaviors; the smaller the linkage distance, the more similar the behaviors. We find that data motifs with small data sizes are more likely to be clustered together: a small data size does not fully utilize the system and hardware resources, so such runs tend to reflect similar behaviors. However, a motif that is computation intensive and has high computational complexity is clustered with its small-data-size runs even for large data sets. For example, FFT runs with all three data size configurations cluster together for both the Hadoop and Spark versions. The AI motifs implemented with TensorFlow are also greatly affected by input data size, yet they exhibit behaviors distinct from the big data motifs implemented with Hadoop and Spark, with a minimum linkage distance of 6.71.
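The similarity analysis above can be sketched as follows: standardize the metrics, reduce with PCA, then measure distances in the reduced space (the quantity hierarchical clustering uses when merging the nearest clusters). The profile matrix, run count, and component count here are random stand-ins, not the paper's data or configuration.

```python
# Sketch of the PCA + clustering similarity analysis over per-run metric
# profiles. All numbers are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
profiles = rng.normal(size=(12, 60))    # 12 motif runs x 60 metrics

# Standardize each metric to zero mean and unit variance.
z = (profiles - profiles.mean(axis=0)) / profiles.std(axis=0)

# PCA via SVD; keep the leading k principal components.
u, s, vt = np.linalg.svd(z, full_matrices=False)
k = 5
reduced = z @ vt[:k].T                  # projection onto top-k components

# Pairwise Euclidean distances; single-linkage clustering would repeatedly
# merge the pair of clusters separated by the smallest such distance.
diff = reduced[:, None, :] - reduced[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
closest = np.unravel_index(np.argmin(dist + np.eye(12) * 1e9), dist.shape)
print("most similar pair of runs:", closest)
```

Runs whose reduced-space distance is small end up merged early in the dendrogram, which is exactly what a small linkage distance in Fig. 12 reflects.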
Impact of Data Size on I/O Behaviors
We evaluate the impact of data size on I/O behaviors using the fully distributed Hadoop and Spark motif implementations. Using the I/O bandwidth of the Small data size as the baseline, we normalize the I/O bandwidth of the Medium and Large data sizes, as illustrated in Fig. 13. The bold black horizontal line in Fig. 13 marks equality with the small input; a value above the line means larger I/O bandwidth than with the small input. We do not report data for the AI motifs here because neural network modelling involves little disk I/O, as illustrated in Subsection 4.3. For almost all data motifs, I/O behavior is sensitive to data size. When the data size is large enough, the whole data set cannot be held in memory, so data must be swapped in and out frequently, putting great pressure on disk I/O. Moreover, modern big data and AI systems run in a distributed manner, with data stored on a distributed file system such as HDFS [38], so data shuffling or data imbalance generates a large amount of network I/O.

Figure 12: Linkage Distance of Data Motifs.
Figure 13: Impact of Data Size on I/O Behaviors.

Impact of Data Size on Pipeline Efficiency
We further measure the impact of data size on pipeline efficiency. As shown in Fig. 14, as the data size increases, the percentage of frontend bound decreases while the percentage of backend bound increases. For example, Spark Matmul with the large input size shows nearly 20% less frontend bound and more than 30% more backend bound. As the data size grows, the caches and even main memory can no longer hold all of the data, incurring many data cache misses and thus large penalties in the memory hierarchy.
Impact of Data Pattern
Data pattern and data distribution impact workload performance significantly [43, 45]. To evaluate the impact of data pattern on the motifs, we run the FFT motif on two different patterns, dense matrices and sparse matrices, as an example. Matrix sparsity indicates the ratio of zero values among all matrix elements. With different sparsity, the data access patterns vary and thus reflect different behaviors. We use two 16384 × ... matrices.

Figure 14: Impact of Data Size on Pipeline Efficiency.
Figure 15: Impact of Data Pattern on Data Motifs. (a) System Behavior with Different Patterns. (b) Micro-architecture Behavior with Different Patterns.

Impact of Data Type and Source
Data types and sources are of great significance for read and write efficiency [17], given their storage formats and targeted scenarios, such as support for splittable files and compression levels. To evaluate the impact of data type and source on system and micro-architecture behaviors, we use two different data types for the Sort motif with the same data size of 10 GB: unstructured Wikipedia text data and semi-structured sequence data. The Wikipedia text file is laid out in lines, with each line recording the content of one article. Sequence files are flat files consisting of key-value pairs stored in binary format. Fig. 16 shows the impact of data type on the motifs from the system (Fig. 16(a)) and micro-architecture (Fig. 16(b)) aspects. From the system aspect, the difference between the text type and the sequence type ranges from 1.12 times to 7.29 times. With the text data type, CPU utilization is lower than with the sequence data, indicating that the sequence data achieves better performance. Moreover, both Hadoop Sort and Spark Sort suffer more major page faults, which further degrades execution performance because of page loads from disk. Note that Fig. 16 reports the number of major page faults per second; the total number during a run is about 100 to 200.
Figure 16: Impact of Data Type and Source on Data Motifs. (a) System Behavior with Different Types. (b) Micro-architecture Behavior with Different Types.

Even with the same data size, the network I/O and disk I/O bandwidths of the two types still differ greatly; the sequence format has larger I/O bandwidth requirements than the text format. From the micro-architecture aspect (Fig. 16(b)), Sort shows different percentages of pipeline bottlenecks for different data types. With the text format, the backend bound bottleneck, especially backend memory bound, is more severe, indicating that more cycles are wasted waiting for data from the caches or memory.
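The sparsity notion used in the data-pattern experiment can be made concrete with a small sketch. The matrix size and sparsity value below are illustrative only, much smaller than the paper's inputs.

```python
# Sketch: building a matrix with a target sparsity (ratio of zero elements),
# as used in the data-pattern experiment. Size and sparsity are illustrative.
import numpy as np

def make_matrix(n, sparsity, rng):
    """Dense n x n array in which roughly `sparsity` of the entries are zero."""
    m = rng.normal(size=(n, n))
    mask = rng.random((n, n)) < sparsity   # True -> zero this element
    m[mask] = 0.0
    return m

rng = np.random.default_rng(42)
m = make_matrix(256, 0.9, rng)
measured = (m == 0).mean()
print(f"measured sparsity: {measured:.3f}")   # close to 0.90
```

Varying the `sparsity` argument varies the fraction of zero elements, and thus the data access pattern the motif sees.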
Our big data and AI motifs are inspired by previous successful abstractions in other application scenarios. The set concept in relational algebra [11] abstracted five primitive and fundamental operators, setting off a wave of relational database research; the set abstraction is the basis of relational algebra and the theoretical foundation of databases. Phil Colella [12] identified seven motifs of numerical methods that he thought would be important for the following decade. Building on that work, a multidisciplinary group of Berkeley researchers proposed 13 motifs as high-level abstractions of parallel computing, capturing the computation and communication patterns of a great mass of applications [6]. The National Research Council proposed seven major tasks in massive data analysis [14], which they called giants. These seven giants are macroscopic definitions of problems in massive data analysis from the perspective of mathematics, while our eight classes of motifs are the main time-consuming units of computation in big data and AI workloads.
Application kernels [7, 16] also aim to scale down the run time of real applications while preserving their main characteristics. Consisting of the major functions of an application, a kernel tries to cover the application's bottleneck. However, kernels are still insufficient for understanding the complex and diverse big data and AI workloads [7, 31]. Moreover, kernels mainly focus on CPU and memory behaviors and pay little attention to I/O, which is also important for many real applications, especially in an era of data explosion.
In this paper, we answer what the abstractions of time-consuming units of computation in big data and AI workloads are. We identify eight data motifs among a wide variety of big data and AI workloads: Matrix, Sampling, Logic, Transform, Set, Graph, Sort, and Statistic computations. We find that combinations of one or more data motifs with different run-time weights can describe most of the big data and AI workloads we investigated [19]. We implement the data motifs for big data and AI separately, with the big data motif implementations using Hadoop, Spark, and Pthreads, and the AI data motif implementations using TensorFlow and Pthreads, considering the impact of data type, data source, data size, and data pattern. We release them as the micro benchmarks of an open-source big data and AI benchmark suite, BigDataBench, publicly available from http://prof.ict.ac.cn/BigDataBench. From the system and micro-architecture perspectives, we comprehensively characterize the behaviors of the data motifs and identify their bottlenecks. Further, we measure the impact of data type, data source, data pattern, and data size on their behaviors. We find that these data motifs cover a wide performance space in terms of both system and micro-architecture behaviors. Moreover, the behavior of each data motif is influenced not only by its algorithm but also, to a large extent, by the type, source, size, and pattern of its input data. We believe our work is an important step toward not only big data and AI benchmarking, but also domain-specific hardware and software co-design.
ACKNOWLEDGMENTS
This work is supported by the National Key Research and Development Plan of China (Grant No. 2016YFB1000600 and 2016YFB1000601). The authors are very grateful to the anonymous reviewers for their insightful feedback and to Dr. Zhen Jia for his valuable suggestions.
REFERENCES
[1] 2018. Hadoop. http://hadoop.apache.org/.
[2] 2018. LSD. https://software.intel.com/en-us/vtune-amplifier-help-front-end-bandwidth-lsd.
[3] 2018. Perf tool. https://perf.wiki.kernel.org/index.php/Main_Page.
[4] 2018. PMU Tools. https://github.com/andikleen/pmu-tools.
[5] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, Vol. 16. 265–283.
[6] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine Yelick. 2006. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley.
[7] David H. Bailey, Eric Barszcz, John T. Barton, David S. Browning, Robert L. Carter, Leonardo Dagum, Rod A. Fatoohi, Paul O. Frederickson, Thomas A. Lasinski, Rob S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS parallel benchmarks. The International Journal of Supercomputing Applications.
[8] Lawrence Livermore National Laboratory. POSIX threads tutorial. https://computing.llnl.gov/tutorials/pthreads/.
[9] ACM SIGPLAN Notices 49, 4 (2014), 269–284.
[10] Yanpei Chen, Francois Raab, and Randy Katz. 2014. From TPC-C to big data benchmarks: A functional workload model. In Specifying Big Data Benchmarks. Springer, 28–43.
[11] Edgar F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (1970), 377–387.
[12] Phillip Colella. 2004. Defining software requirements for scientific computing. (2004).
[13] James W. Cooley and John W. Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 90 (1965), 297–301.
[14] National Research Council. 2013. Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC.
[15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009). IEEE, 248–255.
[16] Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet. 2003. The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience 15, 9 (2003), 803–820.
[17] Lieven Eeckhout, Hans Vandierendonck, and Koen De Bosschere. 2003. Quantifying the impact of input data sets on program behavior and its applications. Journal of Instruction-Level Parallelism 5, 1 (2003), 1–33.
[18] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the Clouds: A Study of Emerging Workloads on Modern Hardware. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[19] Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Zhen Jia, Daoyi Zheng, Chen Zheng, Xiwen He, Hainan Ye, Haibin Wang, and Rui Ren. 2018. Data Motif-based Proxy Benchmarks for Big Data and AI Workloads. In IEEE International Symposium on Workload Characterization (IISWC).
[20] Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Xu Wen, Rui Ren, Chen Zheng, Hainan Ye, Jiahui Dai, Zheng Cao, et al. 2018. BigDataBench: A Scalable and Unified Big Data and AI Benchmark Suite. Under review at IEEE Transactions on Parallel and Distributed Systems.
[21] Andrew Glew. 1998. MLP yes! ILP no! ASPLOS Wild and Crazy Idea Session '98.
[22] Intel. 2011. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide.
[23] Internet of Things (IOT), 2010. IEEE, 1–8.
[24] John Hennessy and David Patterson. 2018. A New Golden Age for Computer Architecture: Domain-specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development. (2018).
[25] Zhen Jia, Jianfeng Zhan, Lei Wang, Rui Han, Sally A. McKee, Qiang Yang, Chunjie Luo, and Jingwei Li. 2014. Characterizing and subsetting big data workloads. In IEEE International Symposium on Workload Characterization (IISWC).
[26] Stephen C. Johnson. 1967. Hierarchical clustering schemes. Psychometrika 32, 3 (1967), 241–254.
[27] Ian T. Jolliffe. 1986. Principal component analysis and factor analysis. In Principal Component Analysis. Springer, 115–128.
[28] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 1–12.
[29] Gwangsun Kim, Jiyun Jeong, John Kim, and Mark Stephenson. 2016. Automatically exploiting implicit pipeline parallelism from multiple dependent kernels for GPUs. In 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 339–350.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[31] David J. Lilja. 2005. Measuring Computer Performance: A Practitioner's Guide. Cambridge University Press.
[32] David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
[33] Piotr R. Luszczek, David H. Bailey, Jack J. Dongarra, Jeremy Kepner, Robert F. Lucas, Rolf Rabenseifner, and Daisuke Takahashi. 2006. The HPC Challenge (HPCC) benchmark suite. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. 213.
[34] David Maier. 1983. The Theory of Relational Databases. Vol. 11. Computer Science Press, Rockville.
[35] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. 2008. GPU computing. Proc. IEEE 96, 5 (2008), 879–899.
[36] Heather Quinn, William H. Robinson, Paolo Rech, Miguel Aguirre, Arno Barnard, Marco Desogus, Luis Entrena, Mario Garcia-Valderas, Steven M. Guertin, David Kaeli, et al. 2015. Using benchmarks for radiation testing of microprocessors and FPGAs. IEEE Transactions on Nuclear Science 62, 6 (2015), 2547–2554.
[37] Mehul Shah, Parthasarathy Ranganathan, Jichuan Chang, Niraj Tolia, David Roberts, and Trevor Mudge. 2010. Data dwarfs: Motivating a coverage set for future large data center workloads. In Proc. Workshop on Architectural Concerns in Large Datacenters.
[38] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 1–10.
[39] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[40] Sam Van den Steen, Stijn Eyerman, Sander De Pestel, Moncef Mechri, Trevor E. Carlson, David Black-Schaffer, Erik Hagersten, and Lieven Eeckhout. 2016. Analytical processor performance and power modeling using micro-architecture independent characteristics. IEEE Trans. Comput. 65, 12 (2016), 3537–3551.
[41] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. 2014. BigDataBench: A big data benchmark suite from internet services. In IEEE International Symposium on High Performance Computer Architecture (HPCA).
[42] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20–24.
[43] Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, and Lixin Zhang. 2018. CVR: Efficient Vectorization of SpMV on X86 Processors. (2018).
[44] Ahmad Yasin. 2014. A top-down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 35–44.
[45] Buse Yilmaz, Bariş Aktemur, María J. Garzarán, Sam Kamin, and Furkan Kiraç. 2016. Autotuning runtime specialization for sparse matrix-vector multiplication. ACM Transactions on Architecture and Code Optimization (TACO) 13, 1 (2016), 5.
[46] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. 10–10.
A ARTIFACT APPENDIX

A.1 Abstract
The artifact contains our big data and AI motif implementations on the Hadoop, Spark, Pthreads, and TensorFlow stacks. It supports the characterization results in Sections 4 and 5 of our PACT 2018 paper, Data Motifs: A Lens Towards Fully Understanding Big Data and AI Workloads. To validate the results, deploy the experiment environment and profile the benchmarks.
A.2 Artifact check-list (meta-information)
• Program: Data motif implementations
• Compilation: GCC 4.8.5; Python 2.7.5; Java 1.8.0_65
• Data set: generated by BigDataBench
• Run-time environment: CentOS 7.2, Linux kernel 4.1.13 with the Perf tool
• Hardware: a processor supporting Top-Down analysis (Sandy Bridge series or later) and the performance events corresponding to the processor
• Run-time state: Hyper-Threading disabled
• Execution: root user, or a user that can execute sudo without a password
• Output: the system and micro-architecture profiling results
• Experiment: deploy the data motifs and corresponding software stacks; run the benchmarks; profile using perf; output the results
• Workflow frameworks used? No
• Publicly available? Yes
A.3 Description
A.3.1 How delivered.
The data motifs are the micro benchmarks of BigDataBench 4.0, an open-source big data and AI benchmark suite. Download link: http://prof.ict.ac.cn/bdb_uploads/bdb_4/pact2018.tar.gz. All related files are under the "pact2018" directory; please refer to the README for a detailed description. Note that to obtain accurate performance data, make sure no other motif is running before starting a motif. The running scripts we provide are tailored to our cluster environment (e.g., node IPs/hostnames and port numbers); to use them in your own cluster, modify the scripts to suit your environment.
A.3.2 Hardware dependencies.
The data motifs can run on any processors that can deploy the Hadoop, Spark, TensorFlow, and Pthreads stacks. However, for Top-Down analysis, due to performance counter limitations, we suggest Intel Xeon processors of the Sandy Bridge series or later. Users also need to find the performance counters corresponding to their specific processor. We provide profiling scripts for the Xeon E5-2620 v3 (Haswell) processor.
A.3.3 Software dependencies.
JDK 1.8.0_65; Hadoop 2.7.1; Spark 1.5.2; TensorFlow 1.0; GCC 4.8.5.
A.3.4 Data sets.
We provide data generators for text, sequence, graph, and matrix data. The data generation method is described in the README file and the BigDataBench user manual. The generation parameters used in our paper for the graph motif are 22 (Small), 24 (Medium), and 26 (Large), respectively.
A.4 Installation
Users need to install Hadoop, Spark, GCC, and TensorFlow. Installation details can be found in the User Manual of BigDataBench. We provide a Makefile for the Pthreads motifs. For all data motifs, we provide running scripts in our package.
A.5 Experiment workflow
Before profiling the system and micro-architecture metrics of a motif, users should make sure that no other motif or workload is running.
A.5.1 Data generation.
We provide text, graph, matrix, and sequence data generators under the data-generator directory. To generate the large, medium, and small data sets used in our paper, we provide a script, "data-generator.sh". Make sure Hadoop is running, because the script uploads the generated data to HDFS. The script running commands:
Graph data generation:
Matrix data generation:
Text data generation:
Sequence data generation:
Sequence data are transformed from the wiki text data, so users should generate the text data first and put it on HDFS (for example, the "wiki-10G" data).
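The actual BigDataBench generator commands are part of the package and are not reproduced here. As a rough illustrative stand-in (not the real generator), a text-data generator mirroring the one-article-per-line wiki layout could look like:

```python
# Illustrative stand-in for a text-data generator. Writes n_lines lines of
# pseudo-random words, one "article" per line. The function name, vocabulary,
# and output path are all hypothetical.
import random

def generate_text(path, n_lines, words_per_line=50, seed=0):
    rnd = random.Random(seed)
    vocab = ["data", "motif", "sort", "graph", "matrix", "sample",
             "transform", "set", "logic", "statistic"]
    with open(path, "w") as f:
        for _ in range(n_lines):
            line = " ".join(rnd.choice(vocab) for _ in range(words_per_line))
            f.write(line + "\n")

generate_text("wiki-sample.txt", n_lines=1000)
```

The real generators additionally upload the output to HDFS, which this sketch omits.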
A.5.2 Run the workloads.
We provide running scripts for all workloads.During the running process, the profiling scripts are started to sample thesystem and architecture metrics.
For Hadoop motifs:
1) Enter the pact2018 directory.
2) Start Hadoop:
For Spark motifs:
1) Enter the pact2018 directory.
2) Start Spark:
For TensorFlow motifs:
1) Enter the pact2018 directory.
2) Choose one TensorFlow motif:
For Pthread motifs:
1) Enter the pact2018 directory.
2) Choose one Pthread motif:
A.5.3 Process the metric data and plot the figures.
We provide processing scripts and figure-plotting scripts to generate the figures used in the paper. Note that the sampling results are saved under the "result" directory when a test finishes.
1) Compute the performance data and save them in an Excel file.
2) Plot the figures. The parameter "result_new.xls" is the Excel file generated by the first step. After running the command, several PNG files are generated. In addition, "pact-AE.txt" is generated for linkage distance analysis.
3) Compute the linkage distance.

A.6 Evaluation and expected result
To evaluate the system and micro-architecture performance of the data motifs, users need to run the motifs and profile them. The motifs should show characteristics similar to the figures in Sections 4 and 5. Our profiling scripts sample the performance data every second over the whole motif runtime, so the performance data may vary slightly from run to run.
A.7 Experiment customization
Users can run these data motifs for different benchmarking purposes, e.g., software stack comparison or characterization of different system and architecture aspects. The data motifs can also be deployed on different processors and at different cluster scales.
A.8 Notes