AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite
Wanling Gao, Fei Tang, Jianfeng Zhan, Chuanxin Lan, Chunjie Luo, Lei Wang, Jiahui Dai, Zheng Cao, Xiongwang Xiong, Zihan Jiang, Tianshu Hao, Fanda Fan, Xu Wen, Fan Zhang, Yunyou Huang, Jianan Chen, Mengjia Du, Rui Ren, Chen Zheng, Daoyi Zheng, Haoning Tang, Kunlin Zhan, Biao Wang, Defei Kong, Minghe Yu, Chongkang Tan, Huan Li, Xinhui Tian, Yatao Li, Gang Lu, Junchao Shao, Zhenyu Wang, Xiaoyu Wang, Hainan Ye
Author contributions: The abstract and Section 1 (Introduction) were contributed by Jianfeng Zhan. Subsequent sections were contributed by Jianfeng Zhan, Lei Wang, Wanling Gao, and Fei Tang; by Jianfeng Zhan; and by Chunjie Luo, Fei Tang, Zihan Jiang, Wanling Gao, Jianfeng Zhan, and seventeen industry partners. The section on the component benchmarks was contributed by Wanling Gao, Chunjie Luo, Xiongwang Xiong, Fei Tang, Zihan Jiang, Tianshu Hao, Fanda Fan, Xu Wen, Fan Zhang, Yunyou Huang, Jianan Chen, and Mengjia Du. The section on the micro benchmarks was contributed by Wanling Gao and Daoyi Zheng. Further sections were contributed by Wanling Gao, Fei Tang, Lei Wang, and Jianfeng Zhan; by Fei Tang, Wanling Gao, Lei Wang, and Jianfeng Zhan; by Jianfeng Zhan, Wanling Gao, Fei Tang, Lei Wang, and Chuanxin Lan; and by Jianfeng Zhan, Wanling Gao, and Lei Wang. Rui Ren and Chen Zheng provided testbed support.

Technical Report No. BenchCouncil-AIBench-2020
February 17, 2020
Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences ({gaowanling, tangfei, wanglei2011, zhanjianfeng, lanchuanxin}@ict.ac.cn); BenchCouncil (International Open Benchmarking Council); Beijing Academy of Frontier Sciences and Technology ({daijiahui, yehainan}@mail.bafst.com); University of Chinese Academy of Sciences; Xinxiu (SciCom); Alibaba; Baidu; Tencent; NetEase; ByteDance; Zhihu; Lenovo; Paypal; Moqi; Microsoft Research Asia; Huawei; JD.com; CloudTa; Intellifusion.
Abstract
Domain-specific software and hardware co-design is encouraging, as it is much easier to achieve efficiency for fewer tasks. Agile domain-specific benchmarking speeds up the process, as it provides not only relevant design inputs but also relevant metrics and tools. Unfortunately, modern workloads like big data, AI, and Internet services dwarf traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges.

This paper proposes an agile domain-specific benchmarking methodology. Together with seventeen industry partners, we identify ten important end-to-end application scenarios, among which sixteen representative AI tasks are distilled as the AI component benchmarks. We propose the permutations of essential AI and non-AI component benchmarks as end-to-end benchmarks; an end-to-end benchmark is a distillation of the essential attributes of an industry-scale application. We design and implement a highly extensible, configurable, and flexible benchmark framework, on the basis of which we propose the guideline for building end-to-end benchmarks and present the first end-to-end Internet service AI benchmark.

The preliminary evaluation shows the value of our benchmark suite, AIBench, against MLPerf and TailBench for hardware and software designers, micro-architectural researchers, and code developers. The specifications, source code, testbed, and results are publicly available from the web site.

∗ Jianfeng Zhan is the corresponding author.

1 Introduction
As it is much easier to achieve more efficient algorithms, systems, and architectures for fewer tasks, domain-specific software and hardware co-design is widely explored. For example, each of the Internet service giants like Facebook, Google, and Alibaba focuses on a specific application domain, i.e., search engine, social networks, and E-commerce, respectively, and they are active co-design practitioners. The ongoing AI accelerator boom is another witness to this trend. As AI advancement has brought breakthroughs in processing images, video, speech, and audio [42], Internet service providers pervasively perform software and hardware AI co-design to augment their services [49, 32, 10, 39, 55]. This trend is also witnessed by big data advancement: there are hundreds of single-purpose solutions in the form of NoSQL or NewSQL systems and hardware accelerators.

Agile domain-specific benchmarking speeds up software and hardware co-design. Unfortunately, modern workloads dwarf traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges. For example, traditional desktop workloads, e.g., data compression [9] and image manipulation [9], are about one hundred thousand lines of code and run on a single node. Web server workloads [5] are hundreds of thousands of lines of code and run on a small-scale cluster, i.e., dozens of nodes. For modern workloads, however, the runtime environment stacks (e.g., Spark [8], TensorFlow [10]) alone are more than millions of lines of code, and these workloads often run on a large-scale cluster, i.e., tens of thousands of nodes [16]. Moreover, modern Internet services adopt a microservice-based architecture, which is often distributed across different datacenters and consists of a diversity of AI and non-AI modules with very long and complex execution paths. Worst of all, the real-world data sets, workloads, and even AI models are hidden within the giant Internet service providers' datacenters [32, 14], which further exacerbates the benchmarking challenges.

On one hand, hardware and software designers should consider the overall system's effects. Using micro (interchangeable with kernel in this paper) or component benchmarks alone can lead to incorrect conclusions. For example, as discussed in Section 6.2.1, we find that, in terms of the mere execution path, the end-to-end tail latency deteriorates by even hundreds of times compared to a single AI component's tail latency, which cannot be predicted by a state-of-the-art statistical model [24]. Hereby, end-to-end indicates the overall critical path. It may refer to the end-to-end (tail) latency of an online service, or even cover offline AI training when an AI model for online services is updated in a real-time manner, as discussed in Section 6.2.2.

On the other hand, it is usually difficult to justify porting a full-scale end-to-end application to a new computer system or architecture simply to obtain a benchmark number [29, 15]. For hardware designers, an end-to-end application is too huge to run on simulators. In addition, evaluating a full-scale end-to-end application raises difficulties in the reproducibility and interpretability of performance data [28], and may lead to error-prone conclusions.
After gaining full knowledge of the overall critical information, micro and component benchmarks are still a necessary part of the evaluation. Put another way, we believe a domain-specific benchmark suite should have three integrated parts. End-to-end benchmarks let software and hardware designers learn about the overall system behavior. Each end-to-end benchmark is a distillation of the essential attributes of an industry-scale application, and hence reduces the side effects of the latter's huge code size, extreme deployment scale, and complex execution paths. Measuring the achieved performance and quality targets for representative AI tasks, the component benchmarks provide diverse computation and memory access patterns for micro-architectural researchers. The micro benchmarks are provided so that code developers can drill down to hotspot functions for performance optimization.

This paper proposes an agile domain-specific benchmarking methodology, as shown in Fig. 1. Without losing generality, we apply it in characterizing the AI and Internet service application domains. First, in cooperation with seventeen industry partners, we investigate their domain-specific benchmarking requirements and extract ten important end-to-end application scenarios. Instead of using real-world applications, we propose the permutations of essential AI and non-AI tasks as end-to-end benchmarks. Second, we identify sixteen representative AI tasks as the AI component benchmarks, with both performance and quality targets. After profiling the sixteen AI component benchmarks, we identify and implement fourteen frequently-appearing units of computation as the micro benchmarks.

Figure 1: The Agile Domain-specific Benchmarking Methodology.

Third, we present a highly extensible, configurable, and flexible benchmark framework, allowing researchers to create end-to-end applications by using different components commonly found in major application domains. On the basis of the framework, we propose guidelines on how to build end-to-end benchmarks, and design and implement the first end-to-end Internet service AI benchmark: E-commerce search intelligence.

The evaluation on a hybrid cluster consisting of 16 CPU nodes and 4 GPU nodes shows the value of AIBench against MLPerf and TailBench. We gain many insights for hardware and software designers, micro-architectural researchers, and code developers. Several important observations are as follows. (1) In serving the same request, different AI components incur significantly different latencies; the end-to-end tail latency deteriorates by dozens or even hundreds of times with respect to a single AI component, which cannot be predicted by a state-of-the-art statistical model [24]. (2) Internet service architects must perform a tradeoff among service quality, model complexity, and model accuracy. (3) AI models are updated in a real-time manner in many end-to-end application scenarios, so offline training should be included in end-to-end benchmarking. (4) As they demonstrate distinct computation and memory patterns, diverse AI tasks should be included in the AI component benchmarks. (5) Drilling down to hotspot functions is helpful for code optimization.

The rest of this paper is organized as follows. Section 2 explains the motivation. Section 3 summarizes the methodology. Section 4 describes how to characterize the AI and Internet service application domains. Section 5 illustrates how to build an end-to-end benchmark. Section 6 performs the evaluation. Section 7 summarizes the related work. Section 8 draws the conclusion.
2 Motivation

Modern Internet services process millions of user queries daily; thus the tail latency is of paramount importance in terms of user experience [24]. However, a microservice-based architecture contains various AI and non-AI modules, and consequently forms long and complex execution paths. Existing AI benchmarking efforts mostly provide a few micro or component benchmarks, and thus fail to model the critical paths and the permutation of primary components of an industry-scale application.
The end-to-end tail latency deteriorates by up to 100X compared to a single component's tail latency.
The end-to-end tail latency indicates the overall performance of the entire execution path, while the component tail latency only reports the performance of a single module. Our experiments in Section 6.2.1 show that the end-to-end tail latency deteriorates by dozens or even hundreds of times compared to a single component's tail latency: for one AI component, recommendation, the difference is 13X, while for image classification, the difference reaches up to 296X.

Debugging the performance of a single component benchmark alone does not touch the full execution path, and fails to provide bottleneck information among the primary modules within a critical path. Considering the 90th percentile latency, we find that, among the four AI-related components, the recommendation component occupies 72% of the execution time, while the image classification component only occupies 1.1%. This indicates that benchmarking a single AI component alone, without the overall critical path, does not make sense.
One may argue that, after profiling many components' tail latency, a statistical model can predict the end-to-end tail latency. Our answer is no. In Section 6.2.1, we use a state-of-the-art queuing model [24] to evaluate the end-to-end application's latency and tail latency. Through the experimental evaluations, we find that the gap between the actual average latency and the theoretical one is 3.4 times, while the gap between the actual 99th percentile latency and the theoretical one is 8.1 times. Furthermore, the state-of-the-art queuing model [24] for tail latency takes the system as a whole, and is not suited for an end-to-end application that needs to characterize the permutations of several or dozens of components.
As witnessed by many of our industry partners, when an AI model is used for an online service, it has to be updated in a real-time manner. For example, one E-commerce giant demands that its AI models be updated every hour; each updated model brings an improvement of about 3% in click-through rate, worth millions in profit. In Section 6.2.2, the evaluation shows that offline training should be included in end-to-end benchmarking, for performing tradeoffs among the model update interval, training overhead, and accuracy improvement.
3 The Methodology

As modern AI and Internet service workloads are not only diverse but also fast changing and expanding, the traditional benchmarking methodology that creates a new benchmark or proxy for every possible workload is prohibitively costly and even impossible [29]. Hence, an agile domain-specific benchmarking methodology is extremely essential. Fig. 1 summarizes our methodology.

Step One. We investigate the domain-specific benchmarking requirements with our industry partners. The input of this step is a candidate list of industry-scale applications. Just copying the real-world applications is impossible for two reasons. First, our partners treat their real-world workloads, data sets, and models as confidential. Second, the massive code size, extreme deployment scale, and complex execution paths make it infeasible. So the purpose of this step is to understand the essential components of each application and the permutation of those components.

Step Two. On the basis of the output from Step One, this step distills representative AI and non-AI tasks. Different from a traditional task, each AI task, like image classification, has both performance and quality targets [45]. Generally, an AI component specification defines a task in a high-level language [64], only algorithmically, in a paper-and-pencil approach [15]. We implement each task as a component benchmark. Each benchmark also provides a reference AI model, evaluation metrics, and a state-of-the-art quality target [45].

Step Three. According to the output of Step Two, we profile the full component benchmarks and drill down to the frequently-appearing and time-consuming units of computation. We implement those units of computation as micro benchmarks. Micro benchmarks are easily portable to new architectures and systems, and are beneficial for fine-grained profiling and tuning.

Step Four. According to the outputs of Steps One and Two, we design and implement a reusing benchmark framework, including the AI and non-AI component library, data input, online inference, offline training, and deployment tool modules.

Step Five. On the basis of the benchmark framework, we build end-to-end benchmarks. Each end-to-end benchmark models the permutation of several or tens of essential AI and non-AI components, reflecting the complex interactions among different modules and depicting the overall system's performance. In addition, we propose domain-specific evaluation metrics.
4 Characterizing the AI and Internet Service Application Domains

We first give a summary of the seventeen industry partners' benchmarking requirements, then identify the representative AI tasks (component benchmarks and micro benchmarks). Finally, we propose the reusing benchmark framework.
4.1 Domain-specific Benchmarking Requirements

Collaborating with seventeen industry partners, whose domains include search engine, E-commerce, social network, news feed, video, etc., we extract the essential end-to-end application scenarios from their products or services. The real-world applications are complex, so we only distill the permutations of the primary AI and non-AI tasks. Table 1 summarizes the list of end-to-end application scenarios.

For example, the first scenario in Table 1, E-commerce search intelligence, is extracted from an E-commerce giant. A user is classified into different groups to provide personalized services. The results are ranked according to the relations between the queries and the products, and the ranking is adjusted by learning from the historical query and hit logs. The recommended products are also returned with the search results to the user. We distill this industry-scale application into several AI tasks, like classification, learning to rank, and recommendation, and non-AI tasks, like query parsing, database operation, and indexing. Section 5.1 describes how to implement this benchmark on the basis of the reusing framework described in Section 4.

In general, end-to-end benchmarks concern the overall system's effects, including quality-ensured response latency, tail latency, and latency-bounded throughput. An example of quality-ensured performance is that a quality (e.g., accuracy) deviation from the target is within 2%. Different application scenarios have domain-specific evaluation metrics. For example, several scenarios require that the AI models be updated in a real-time manner.
4.2 The AI Component Benchmarks

To cover a wide spectrum of AI tasks, we thoroughly analyze the end-to-end application scenarios shown in Table 1. In total, we identify sixteen representative AI tasks. We implement each AI task on TensorFlow [10] and PyTorch [7] as an AI component benchmark. Table 2 summarizes the sixteen component benchmarks in AIBench.
Classification.
This task is to extract different thematic classes within the input data, like an image or a text file. It is a typical task in Internet services and other application domains, and is widely used in multiple scenarios, like category prediction and spam detection.
Image Generation.
This task aims to provide an unsupervised learning problem to mimic the distribution of data and generate images. A typical scenario is image resolution enhancement, which generates high-resolution images.
Table 1: Domain-specific Benchmarking Requirements.

| End-to-end Application Scenario | Involved AI Tasks | Involved Non-AI Tasks | Data | Metrics | Model Update Frequency |
|---|---|---|---|---|---|
| E-commerce search intelligence | Classification; Learning to rank; Recommendation | Query parsing; Database operation; Indexing | User data; Product data; Query data | Precision, Recall, Latency | High |
| Language and dialogue translation | Text-to-Text translation; Speech recognition | Query parsing | Text; Speech | Accuracy, Latency | Low |
| Content-based image retrieval | Object detection; Classification; Spatial transformer; Image-to-Text | Query parsing; Indexing; Sort | Image | Precision, Recall, Latency | High |
| Web searching | Text summarization; Learning to rank; Recommendation | Query parsing; Indexing; Crawler; Sort; Hash | Product data; Query data | Precision, Recall, Latency | High |
| Facial authentication and payment | Face embedding; 3D face recognition | Encryption | Face image | Accuracy, Latency | Low |
| News feed | Recommendation | Database operation; Sort; Basic statistics; Filter | Text | Precision, Recall | High |
| Photo translation | Classification; Spatial transformer; Text-to-Text translation | Query parsing | Image; Text | Accuracy, BLEU, Latency | Low |
| Live streaming | Image generation; Image-to-Image | Video codec; Video capture | Image | Latency | Low |
| Video services | Image compression; Video prediction | Video codec | Video | Accuracy, Latency | Low |
| Online gaming | 3D object reconstruction; Image generation; Image-to-Image | Rendering | Image | Latency | Low |

Text-to-Text Translation.
This task is to translate a text from one language to another, which is the most important field of computational linguistics. It can be used to translate a search query and to translate dialogue.
Image-to-Text.
This task is to generate the description of an image automatically. It can be used to generate image captions or to recognize optical characters.
Image-to-Image.
This task is to convert an image from one representation to another. It can be used to synthesize images with different facial ages and to simulate virtual makeup.
Speech Recognition.
This task is to recognize and translate a spoken language into text. It is beneficial for voice search and voice dialogue translation.
Face Embedding.
This task is to transform a facial image into a vector in an embedding space. The typical scenarios are facial similarity analysis and face recognition.
3D Face Recognition.
This task is to recognize 3D facial information from multiple images taken from different angles. It mainly focuses on three-dimensional images, and is beneficial to the facial similarity and facial authentication scenarios.
Object Detection.
This task is to detect the objects within an image. The typical scenarios include vertical search and video object detection.
Recommendation.
This task is to provide recommendations. It is widely used for advertisement recommendation, community recommendation, and product recommendation.
Video Prediction.
This task is to predict future video frames by predicting the transformations of previous frames. The typical scenarios are video compression and video encoding, for efficient video storage and transmission.
Image Compression.
This task is to compress images and reduce redundancy [57]. The task is important for Internet services in terms of reducing data storage overhead and improving data transmission efficiency.
3D Object Reconstruction.
This task is to predict and reconstruct 3D objects [62]. The typical scenarios are map search, light field rendering, virtual reality, and online gaming.
Text Summarization.
This task is to generate a text summary, which is important for search result preview, headline generation, and keyword discovery.
Spatial Transformer.
This task is to perform spatial transformations [36]. A typical scenario is space-invariant image retrieval, so that an image can be retrieved even if it is extremely stretched.
Learning to Rank.
This task is to learn the attributes of searched content and rank the scores of the results, which is the key to a search engine service.

The AI tasks concern both performance and quality targets. The primary metrics include the samples processed per second, the wall-clock time to train a model to a target quality (time-to-quality) [20], the wall-clock time to train for a specified number of epochs, quality-ensured throughput, and the energy consumed to train a model to a target quality (energy-to-quality) [20].
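To make the time-to-quality metric concrete, the following is a minimal, framework-agnostic sketch; the train_one_epoch and evaluate callbacks are placeholders we assume, not part of the AIBench harness itself.

```python
# A framework-agnostic sketch of the time-to-quality metric; train_one_epoch
# and evaluate are hypothetical callbacks, not the AIBench harness API.
import time

def time_to_quality(train_one_epoch, evaluate, target, max_epochs=1000):
    """Wall-clock seconds until the model first reaches the target quality."""
    start = time.time()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() >= target:        # e.g., a state-of-the-art accuracy
            return time.time() - start
    raise RuntimeError("target quality not reached")
```

Energy-to-quality can be measured analogously, integrating power draw over the same interval.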
4.3 The Micro Benchmarks

After profiling the sixteen component benchmarks, we identify fourteen frequently-appearing units of computation: Convolution, Fully connected, Relu, Sigmoid, Tanh, MaxPooling, AvgPooling, CosineNorm, BatchNorm, Dropout, Element-wise multiply, Softmax, Data arrangement, and Memcpy. We implement them as a set of micro benchmarks using TensorFlow [10] and Pthreads.
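As an illustration of how such a unit of computation can be timed in isolation, below is a minimal sketch of a convolution micro benchmark in TensorFlow 2; the shapes, strides, and iteration count are illustrative assumptions, not the AIBench settings.

```python
# A minimal sketch of a convolution micro benchmark in TensorFlow 2; the
# tensor shapes and iteration count are illustrative, not AIBench's.
import time
import tensorflow as tf

def conv_microbenchmark(batch=32, size=224, cin=3, cout=64, k=7, iters=100):
    x = tf.random.normal([batch, size, size, cin])
    w = tf.random.normal([k, k, cin, cout])

    @tf.function
    def step(inp):
        return tf.nn.conv2d(inp, w, strides=2, padding="SAME")

    step(x)  # warm-up: trace the graph and initialize the kernels
    start = time.time()
    for _ in range(iters):
        y = step(x)
    y.numpy()  # force execution to finish before stopping the clock
    return batch * iters / (time.time() - start)

print(f"convolution throughput: {conv_microbenchmark():.1f} samples/s")
```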
4.4 The Reusing Benchmark Framework

As shown in Fig. 2, the framework provides loosely coupled modules that can be easily configured. Currently, the AIBench framework includes the data input, offline training, online inference, non-AI library, and deployment tool modules. On the basis of the AIBench framework, we can easily compose an end-to-end benchmark.

The data input module is responsible for feeding data into the other modules. It collects representative real-world data sets, which come not only from authoritative public websites but also from our industry partners, after anonymization.
Table 2: Component Benchmarks in AIBench.

| No. | Component Benchmark | Algorithm | Data Set |
|---|---|---|---|
| DC-AI-C1 | Image classification | ResNet50 [33] | ImageNet [25], Cifar [41] |
| DC-AI-C2 | Image generation | WassersteinGAN [13] | LSUN [63] |
| DC-AI-C3 | Text-to-Text translation | Transformer [58] | WMT English-German [1] |
| DC-AI-C4 | Image-to-Text | Neural Image Caption Model [60] | Microsoft COCO [44] |
| DC-AI-C5 | Image-to-Image | CycleGAN [66] | Cityscapes [21] |
| DC-AI-C6 | Speech recognition | DeepSpeech2 [12] | Librispeech [51] |
| DC-AI-C7 | Face embedding | Facenet [54] | LFW [35], VGGFace2 [17] |
| DC-AI-C8 | 3D face recognition | 3D face models [59] | 77,715 samples from 253 face IDs |
| DC-AI-C9 | Object detection | Faster R-CNN [52] | Microsoft COCO [44] |
| DC-AI-C10 | Recommendation | Neural collaborative filtering [34] | MovieLens [31] |
| DC-AI-C11 | Video prediction | Motion-focused predictive models [27] | Robot pushing data set [27] |
| DC-AI-C12 | Image compression | Recurrent neural network [57] | ImageNet [25] |
| DC-AI-C13 | 3D object reconstruction | Convolutional encoder-decoder network [62] | ShapeNet data set [18] |
| DC-AI-C14 | Text summarization | Sequence-to-sequence model [48] | Gigaword data set [53] |
| DC-AI-C15 | Spatial transformer | Spatial transformer networks [36] | MNIST [43] |
| DC-AI-C16 | Learning to rank | Ranking distillation [56] | Gowalla [19] |

The data schema is designed to maintain the real-world data characteristics, so as to alleviate confidentiality issues. Based on the data schema, a series of data generators are further provided to support large-scale data generation, like user or product information. To cover a wide spectrum of data characteristics, we take diverse data types (e.g., structured, semi-structured, and unstructured) and different data sources (e.g., table, graph, text, image, audio, and video) into account. Our framework integrates various open-source data storage systems, and supports large-scale data generation and deployment [47].

The offline training and online inference modules are provided to build an end-to-end benchmark. First, the offline training module chooses one or more component benchmarks, through specifying the required benchmark ID, input data, and execution parameters like batch size. Then the offline training module trains a model and provides the trained model to the online inference module. The online inference module loads the trained model onto the serving system, i.e., TensorFlow Serving. The non-AI library module provides the non-AI computations and database accesses, including query parsing, database operations, indexing, sort, crawler, hash, encryption, basic statistics, filter, video codec, video capture, and rendering. For a complex end-to-end application, the online inference, non-AI library, and offline training modules together constitute the overall critical path.

To be easily deployed on a large-scale cluster, the framework provides deployment tools that contain two automated deployment templates, using Ansible and Kubernetes, respectively. The Ansible templates support scalable deployment on physical or virtual machines, while the Kubernetes templates are used to deploy on a container cluster. A configuration file needs to be specified for installation and deployment, including the module parameters, like the chosen benchmark ID, input data, and execution parameters, and the cluster parameters, like node, memory, and network information. Through the deployment tools, a user does not need to know how to install and run each individual module.
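For illustration, such a module- and cluster-related configuration might look like the following sketch; every field name is a hypothetical stand-in, not the released file format.

```python
# A hypothetical installation/deployment configuration; the field names are
# our assumptions for illustration, not the released file format.
config = {
    "module": {
        "benchmark_id": "DC-AI-C10",   # recommendation component
        "input_data": "/data/movielens",
        "batch_size": 256,
        "non_ai_library": ["query_parsing", "indexing"],
    },
    "cluster": {
        "nodes": [f"node-{i:02d}" for i in range(1, 17)],
        "memory_gb": 32,
        "network": "1GbE",
    },
    "deploy_template": "ansible",      # or "kubernetes"
}
```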
5 Building End-to-end Benchmarks

In this section, we illustrate how to build an end-to-end benchmark, and later discuss the guidelines.
Figure 2: The Reusing Framework: the data input module (schema and generators covering structured, semi-structured, and unstructured data from table, graph, text, image, audio, and video sources), the offline training and online inference (AI-as-a-Service) modules, the non-AI library, and the automated deployment tool (Ansible and Kubernetes templates).
5.1 The E-commerce Search Intelligence Benchmark

On the basis of the reusing framework, we implement the first end-to-end AI application benchmark: E-commerce search intelligence (in short, E-commerce). This benchmark models the complete use case of a realistic E-commerce search intelligence system, covering both text searching and image searching scenarios.

The E-commerce benchmark consists of four subsystems: online server, offline analyzer, query generator, and data storage, as shown in Fig. 3. Among them, online server receives the query requests and performs personalized searching and recommendation, integrating AI inference.

Offline analyzer chooses the appropriate AI component benchmarks and performs a training stage to generate learning models. Also, offline analyzer is responsible for building data indexes to accelerate data access.

Query generator simulates concurrent users and sends query requests to online server based on a specific configuration. Note that a query item provides either text or an image, to reflect the different search habits of users. The configuration designates parameters like concurrency, query arrival rate, distribution, user think time, and the ratio of text items to image items. The configurations simulate different query characteristics and satisfy multiple generation strategies. We implement our query generator based on JMeter [37].

The data storage module stores all kinds of data. The user database saves all the attributes of user information. The product database holds all the attributes of product information. The logs record the complete query histories. The text data contain the product description texts and the user comments. The image and video data depict the appearance and usage of products vividly. The audio data store the voice search data and voice chat data. Overall, the data storage covers various data types, including structured, unstructured, and semi-structured data, and diverse data sources, including table, text, image, audio, and video.

To support scalable deployment on clusters of different scales, each module is scalable and can be deployed on multiple nodes. Also, a series of data generators are provided to generate E-commerce data at different scales, through setting several parameters, e.g., the number of products and product attribute fields, and the number of users and user attribute fields.
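A plain-Python sketch of the query generator's arrival process is shown below (the actual implementation uses JMeter); the rate and payload format are illustrative assumptions.

```python
# A sketch of Poisson query arrivals mixing text and image queries 90/10;
# exponentially distributed inter-arrival times yield a Poisson process.
import random
import time

def generate_queries(arrival_rate=20.0, n_queries=1000, image_ratio=0.1):
    for i in range(n_queries):
        time.sleep(random.expovariate(arrival_rate))  # think time interval
        kind = "image" if random.random() < image_ratio else "text"
        yield {"type": kind, "payload": f"query-{i}"}
```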
Figure 3: AIBench Implementation: the E-commerce search intelligence benchmark on the AIBench framework, with a query generator (text and image), online server (search planer, recommender, searcher deployed on three clusters of high, medium, and low popularity, and ranker), offline analyzer (AI offline trainer with ten component benchmarks, job scheduler with batch and streaming-like processing, and indexer), and data storage (user info, product info, and product, attribute, and user indexes).
Online server provides personalized searching and recommendations. Online server consists of fourmodules, including search planer, recommender, searcher, and ranker.
Search planer is the entrance of online server. It is responsible for receiving the query requests from query generator, sending the requests to the other modules, and collecting the returned results. We use the Spring Boot framework [61] to implement search planer.
Recommender analyzes the query item and provides personalized recommendations, according to the user information obtained from the user database. It first conducts query spelling correction and query rewriting; it then predicts the category of the query item based on one of two classification models: FastText [38] for text classification when the query item is text data, and ResNet50 [33] for image classification when the query item is an image. Using a deep neural network proposed by Alibaba [49], recommender then conducts an inference process with the offline-trained model to provide personalized recommendations. It returns two vectors: the probability vector of the predicted categories, and the user preference score vector over product attributes, such as the user's preference for brand, color, etc. We use TensorFlow Serving [50] to provide the text classification, image classification, and online recommendation services.

To guarantee scalability and service efficiency, searcher follows an industry-scale architecture.
Searcher is deployed on several different clusters; three clusters are the default configuration. The clusters hold the inverted indexes of product information in memory to guarantee high concurrency and low latency. According to the click-through rate and purchase rate, the products are divided into three popularity categories, high, medium, and low, whose proportions of data volume are 15%, 50%, and 50%, respectively; note that the high popularity category is a subset of the medium popularity category. The indexes of products with different popularity are stored in different clusters. Given a search request, searcher queries these three clusters one by one until reaching a specified amount of results. Generally, the cluster that holds low popularity products is rarely searched in a realistic scenario, so searcher adopts a different deployment strategy for each category: the cluster for high popularity contains more nodes and more backups to guarantee searching efficiency, while the cluster for low popularity deploys the smallest numbers of nodes and backups. We use Elasticsearch [30] to set up and manage the searchers deployed on the three clusters.
Ranker uses the weights returned by recommender as initial weights, and ranks the scores of products through a personalized L2R neural network [49]. Ranker uses TensorFlow Serving [50] to implement product ranking.
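Both recommender and ranker expose their models through TensorFlow Serving, whose REST API accepts a JSON list of instances at /v1/models/&lt;name&gt;:predict. The client-side sketch below is for illustration only; the host, model name, and feature layout are hypothetical.

```python
# A client-side sketch of querying a model behind TensorFlow Serving's REST
# API; the host, model name, and feature layout are hypothetical.
import requests

def rank_products(feature_vectors):
    payload = {"instances": feature_vectors}   # one vector per candidate
    resp = requests.post(
        "http://ranker:8501/v1/models/l2r:predict",
        json=payload, timeout=1.0)
    resp.raise_for_status()
    return resp.json()["predictions"]          # one score per candidate
```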
Offline analyzer is responsible for training models and building indexes to improve the online serving performance. It consists of three modules: AI offline trainer, job scheduler, and indexer.

AI offline trainer trains models using the data stored in the database. Offline trainer digests the features of the product data, e.g., text, image, audio, and video. To power the efficiency of online server, offline trainer chooses ten AI algorithms (component benchmarks) from the AIBench framework: classification for category prediction, recommendation for personalized recommendation, learning to rank for result scoring and ranking, image-to-text for image captioning, image-to-image and image generation for image resolution enhancement, face embedding for face detection within an image, spatial transformer for image rotating and resizing, object detection for detecting objects in video data, and speech recognition for audio data recognition.

Job scheduler provides two kinds of training mechanisms: batch processing and streaming-like processing. In a realistic scenario, some models need to be updated frequently. For example, when a user searches for an item and clicks one product shown on the first page, the application will immediately train a new model based on the product that the user just clicked, and show new recommendations on the second page. Our benchmark implementation considers this situation, and adopts a streaming-like approach that updates the models every several seconds; for batch processing, trainer updates the models every several hours.

Indexer builds indexes for product information. It provides three kinds of indexes: inverted indexes with a few fields of products for searching, forward indexes with a few fields for ranking, and forward indexes with a majority of fields for summary generation.
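A conceptual sketch of the streaming-like update loop is given below; click_stream, fit_incremental, and serving.reload are hypothetical interfaces standing in for whatever the deployed system provides.

```python
# A conceptual sketch of the streaming-like update mechanism; click_stream,
# fit_incremental, and serving.reload are hypothetical interfaces.
import time

def streaming_update_loop(model, click_stream, serving, interval_s=10):
    """Retrain on freshly clicked products every few seconds and hot-swap
    the served model, mimicking streaming-like processing."""
    while True:
        batch = click_stream.drain()         # clicks since the last update
        if batch:
            model.fit_incremental(batch)     # incremental training step
            serving.reload(model.export())   # push the updated model online
        time.sleep(interval_s)
```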
5.2 The Guideline for Building End-to-end Benchmarks

We are implementing the other end-to-end benchmarks listed in Table 1. There are some guidelines, summarized below (a composition sketch follows the list):

(1) Determine the essential AI and non-AI component benchmarks.
(2) For each component benchmark, find the valid input data and configure the data input module.
(3) Determine a valid permutation of the AI and non-AI components.
(4) Specify the module-related configurations, i.e., benchmark ID, input data, execution parameters, and non-AI library, and the cluster-related configurations, i.e., node, memory, and network information.
(5) Specify the deployment strategy and write the scripts for the automated deployment tool.
(6) Train the AI models of the selected AI component benchmarks using the offline training module, and transfer the trained models to the online inference module.
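The sketch below walks through these six steps against a hypothetical framework API; none of the function names are the released AIBench interface.

```python
# Hypothetical composition sketch following the six guideline steps; the
# aibench.* API shown here is an assumption, not the released interface.
import aibench  # hypothetical package

def build_end_to_end_benchmark():
    # Steps (1)-(2): essential components with their valid input data.
    components = [
        aibench.component("DC-AI-C1", data="ImageNet"),    # classification
        aibench.component("DC-AI-C10", data="MovieLens"),  # recommendation
        aibench.non_ai("query_parsing"),
        aibench.non_ai("indexing"),
    ]
    # Step (3): a valid permutation of the AI and non-AI components.
    pipeline = aibench.permute(components)
    # Steps (4)-(5): module- and cluster-related configuration, deployment.
    aibench.deploy(pipeline, nodes=16, memory_gb=32, template="ansible")
    # Step (6): offline training feeds trained models to online inference.
    models = aibench.offline_train(pipeline, batch_size=256)
    aibench.online_serve(pipeline, models)
```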
6 Evaluation

This section summarizes our evaluation using the AIBench end-to-end, component, and micro benchmarks. In Section 6.2, we explain why end-to-end benchmarking is necessary for both online server and offline trainer, and gain several insights that cannot be obtained using MLPerf [4] or TailBench [40]. In Section 6.3, we characterize the diverse and distinct computation and memory patterns of the sixteen AI tasks, emphasizing the necessity of including diverse AI tasks for benchmarking, which is also ignored by MLPerf [4]. In Section 6.4, we drill down to the hotspot functions and analyze their execution stalls.

Figure 4: Latency of Online Server.
6.1 Experimental Setup

We perform experiments on a cluster with 16 CPU nodes and 4 GPU nodes. All the nodes are connected with a 1 Gb Ethernet network. Each CPU node is equipped with two Xeon E5645 processors and 32 GB of memory; each processor contains six physical out-of-order cores, with Hyper-Threading disabled. The OS of each node is Linux CentOS 6.9 with Linux kernel version 3.11.10. The software versions are JDK 1.8.0, Python 3.6.8, and GCC 5.4. We perform offline training on four Nvidia Titan XP GPUs; each Titan XP has 3840 CUDA cores and 12 GB of memory.
We use the network time protocol (NTP) [46] to synchronize the cluster-wide clocks. We use the Perf profiling tool [23] to collect CPU micro-architectural data through the hardware performance monitoring counters (PMCs). For GPU profiling, we use the Nvidia profiling toolkit, nvprof [6], to track the running performance of the GPU. To profile quality-ensured performance, we first adjust the parameters, e.g., batch size, to achieve the state-of-the-art quality target of the model on a given dataset, and then sample 1,000 epochs using the same parameter settings. For the GAN-based models, whose accuracy is hard to measure, we set the parameters according to the referenced papers and reproduce the results. We run each benchmark three times and report the average numbers.
6.2 Why End-to-end Benchmarking Is Necessary

This subsection demonstrates why end-to-end benchmarking is necessary for both online services (Section 6.2.1) and offline training (Section 6.2.2).
6.2.1 Online Server Evaluation

We deploy online server on the 16-node CPU cluster. Online server contains one query generator node (JMeter 5.1.1), one search planer node (Spring Boot 2.1.3), two recommender nodes (TensorFlow Serving 1.14.0), nine searcher nodes (Elasticsearch 6.5.2), one ranker node (TensorFlow Serving 1.14.0), and two nodes for data storage (Neo4j 3.5.8 for the user database, Elasticsearch 6.5.2 for the product database). The product database contains a hundred thousand products with 32 attribute fields. Query generator simulates 1000 users with a 30-second warm-up time. The users send query requests continuously, separated by think time intervals that follow a Poisson distribution. Note that the proportions of text queries and image queries are 90% and 10%, respectively. In total, we collect the performance numbers until 20,000 query requests have finished. We train each AI task to achieve the quality target of the referenced paper.

Latency is an important metric to evaluate service quality. Fig. 4(a) shows the end-to-end latency of online server. We find that the average, 90th percentile, and 99th percentile latencies of the entire execution path of the current implementation are 215.5, 843, and 1419 milliseconds, respectively. (With respect to the real numbers of our industry partner, these numbers are quite high; they have taken many measures to decrease the overall latency.)

We further perform a latency breakdown of each module to identify the critical paths, covering the recommender, searcher, search planer, and ranker modules, as shown in Fig. 4(b). The latency of search planer is negligible, so we do not report it in Fig. 4(b). We find that recommender occupies the largest proportion of the latency: 48, 60, and 317 milliseconds for the average, 90th percentile, and 99th percentile latency, respectively. In comparison, the latencies of searcher and ranker are both within 5 milliseconds. Although recommender and ranker both contain AI-related components, they incur significantly different latencies.

Furthermore, Fig. 4(c) drills the latency breakdown of the recommender module down to the component level, covering query parsing, user DB access, image classifier, text classifier, and recommendation. We find that user DB access (a non-AI component) and recommendation (an AI component) are the top two components impacting the latency. In particular, the average latency of the recommendation component takes up 60% of the average latency of the recommender module, and occupies 13% of the total end-to-end latency of the online server subsystem. The 99th percentile latency of the recommendation component is 289 milliseconds, while the numbers for the recommender module and the whole subsystem are 317 and 1419 milliseconds, respectively. The reasons that the end-to-end tail latency deteriorates by dozens or even hundreds of times with respect to a single component are: 1) a single component may not be in the critical path; 2) even when an AI component like recommendation is in the critical path, there exist cascading interaction effects with the other AI and non-AI components.

We also analyze the execution time ratio of the AI components vs. the non-AI components in online server. Excluding the data preprocessing and communication latency, the time spent on the AI components and the non-AI components is 38 and 17 milliseconds of the average latency, respectively, which indicates that the AI components are an essential part of the critical path of an industry-scale end-to-end benchmark like the E-commerce benchmark.
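For reference, the reported latency statistics can be reproduced from a per-request latency log with a few lines of Python; the log file name is a hypothetical placeholder.

```python
# Compute average and percentile latencies from a per-request latency log;
# "request_latencies.txt" (one millisecond value per line) is hypothetical.
import numpy as np

latencies_ms = np.loadtxt("request_latencies.txt")
print(f"average: {latencies_ms.mean():.1f} ms")
for p in (90, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.0f} ms")
```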
Can a Statistical Model Predict the End-to-end Tail Latency? As an end-to-end benchmark is complex to use in a hardware or software evaluation, an intuitive question is whether a statistical model can predict the end-to-end tail latency. The answer is no.

The state-of-the-art work [24] uses the M/M/1 and M/M/K queuing models to calculate the p-th percentile latency. We repeat their work and choose the M/M/1 model to predict the latency, as we only deploy one instance of online server. In the M/M/1 model, the p-th percentile latency $T_p$ and the average latency $T_m$ are

$T_p = \frac{-\ln(1-p)}{\mu - \lambda}, \qquad T_m = \frac{1}{\mu - \lambda},$

where $\mu$ is the service rate, which is assumed to follow an exponential distribution, and $\lambda$ is the arrival rate, which is assumed to follow a Poisson distribution.

We measure $\mu$ to be 20 requests per second through the experiments. Then we set $\lambda$ to 1.0 requests per second (10 simulated users), 9.1 requests per second (100 simulated users), and 16.7 requests per second (200 simulated users), respectively. For these settings, the theoretical average latencies are 53 ms, 91 ms, and 303 ms, while the actual numbers are 123 ms, 459 ms, and 852 ms, respectively; the average gap is 3.4 times. The theoretical 99th percentile latencies are 242 ms, 422 ms, and 1394 ms, while the actual numbers are 953 ms, 5008 ms, and 11980 ms, respectively; the average gap is 8.1 times.

The main reason for this huge gap is as follows. The execution of an end-to-end benchmark is complex and uncertain, and the service rate does not actually follow an exponential distribution, so the M/M/1 model is far from the realistic situation. A more general model (such as the G/G/1 model) is difficult to use for calculating the tail latency. Furthermore, characterizing the permutations of several or dozens of components in an end-to-end benchmark would require an even more sophisticated analytical model, such as a queuing network model, which makes calculating the tail latency infeasible.
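The theoretical numbers above follow directly from these formulas; the short sketch below reproduces them from the measured service rate and the three arrival rates.

```python
# Reproduce the theoretical M/M/1 latencies reported above; mu and the
# arrival rates come from the measurements described in the text.
import math

def mm1_latency(mu, lam, p):
    """p-th percentile and mean sojourn time of an M/M/1 queue, in seconds."""
    t_p = -math.log(1.0 - p) / (mu - lam)
    t_mean = 1.0 / (mu - lam)
    return t_p, t_mean

mu = 20.0                      # measured service rate, requests per second
for lam in (1.0, 9.1, 16.7):   # arrival rates for 10, 100, and 200 users
    t99, t_mean = mm1_latency(mu, lam, 0.99)
    print(f"lambda={lam}: mean={t_mean * 1000:.0f} ms, "
          f"p99={t99 * 1000:.0f} ms")
```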
Tradeoff among Service Quality, Model Accuracy, and Model Complexity.
The online inference module needs to load the trained model and conduct a forward computation to obtain the result. Usually, increasing the depth of a neural network model may improve the model accuracy, but it leads to a larger model size and longer inference time. For comparison, we replace ResNet50 with ResNet152 in image classifier. The model accuracy improves by 1.5%, while the end-to-end 99th percentile latency deteriorates by 9.7X. Hence, Internet service architects must perform a tradeoff among service quality, model complexity, and model accuracy.
6.2.2 Offline Training Evaluation

Updating AI models in a real-time manner is a significant domain-specific concern in many scenarios. We evaluate the real-time model update efficiency using offline training. We deploy offline trainer on four Titan XP GPUs.

We adopt an incremental learning method to update the models for online inference, and explore the relationship among the model update interval, the training time overhead, and the accuracy improvement. Our experiments show that, compared with the original training time and accuracy, 35% additional training time brings a 1.9% accuracy improvement for image classifier, and 10% additional training time brings a 0.3% accuracy improvement for ranker.

Thus, offline training is an integral part of end-to-end benchmarking. It not only facilitates measuring the model update efficiency, but also provides guidance on how to choose an optimal update interval to balance the tradeoff between training overhead and accuracy improvement.
6.3 Component Benchmark Characterization

We characterize the distinct computation and memory patterns of the diverse AI tasks, emphasizing the necessity of including diverse AI tasks for benchmarking.

We characterize the sixteen component benchmarks of AIBench. The AIBench component benchmarks are deployed on the Titan XP GPUs, and we focus on single-GPU performance. The CUDA and Nvidia driver versions are 10.0 and 410.78, respectively. We evaluate the PyTorch implementations, version 1.1.0. The data set for each benchmark is as follows: ImageNet (137 GB) for image classification and image compression; LSUN (42.8 GB) for image generation; VGGFace2 (36 GB) for face embedding; Microsoft COCO (13 GB) for Image-to-Text and object detection; MNIST (9.5 MB) for spatial transformer; Cityscapes (267 MB) for Image-to-Image; MovieLens (190 MB) for recommendation; Librispeech (59.3 GB) for speech recognition; Gowalla (107 MB) for learning to rank; WMT English-German (1.2 MB) for Text-to-Text translation; the robot pushing data set (137 GB) for video prediction; the ShapeNet data set (6.8 GB) for 3D object reconstruction; the Gigaword data set (277 MB) for text summarization; and 3D face data (37 GB) for 3D face recognition.

A GPU contains multiple streaming multiprocessors (SM), each of which has a certain number of CUDA cores, memory registers, memory caches, warp schedulers, etc. To characterize the AIBench component benchmarks from the perspective of computation and memory access patterns, we choose five micro-architectural metrics: achieved occupancy, ipc efficiency, gld efficiency, gst efficiency, and dram utilization. Achieved occupancy is the ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor [6]. Ipc efficiency is the ratio of the executed instructions per cycle to the theoretical peak [6]. Gld efficiency is the ratio of the requested global memory load throughput to the required global memory load throughput [6]. Gst efficiency is the ratio of the requested global memory store throughput to the required global memory store throughput [6]. Dram utilization is the utilization level of the device memory relative to the peak utilization [6].

Figure 5: Computation and Memory Patterns of AIBench Components (1: achieved occupancy; 2: ipc efficiency; 3: gld efficiency; 4: gst efficiency; 5: dram utilization).

Fig. 5 presents the computation and memory characteristics of the sixteen AI benchmarks. We find that they have distinct computation and memory patterns, not only under different scenarios, e.g., processing text, image, audio, and video, but also under different tasks of the same scenario, e.g., image classification and image generation. Thus, diverse AI tasks reflecting different computation and memory access patterns should be included in AI benchmarks. Achieving a state-of-the-art quality target for each AI task incurs heavy training overhead; however, that does not justify including only a few benchmarks [64].
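These five metrics can be collected with nvprof's --metrics option; a minimal sketch follows, where the benchmark script name is a hypothetical placeholder.

```python
# Collect the five Fig. 5 metrics with nvprof (metric names per the nvprof
# documentation); "image_classification.py" is a hypothetical script name.
import subprocess

metrics = ["achieved_occupancy", "ipc", "gld_efficiency",
           "gst_efficiency", "dram_utilization"]
subprocess.run(
    ["nvprof", "--metrics", ",".join(metrics),
     "python", "image_classification.py"],
    check=True,
)
```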
6.4 Hotspot Function Analysis

Following the experiments in Section 6.3, we drill down to the hotspot functions and analyze their runtime breakdown and execution stalls for code optimization. The overall execution performance of the component benchmarks varies in terms of IPC, which measures the instructions executed per cycle. From Fig. 5, we find that the IPC efficiency ranges from 0.25 (learning to rank) to 0.77 (Text-to-Text translation). Some benchmarks, like learning to rank, have extremely low IPC compared to the other benchmarks. To discover the factors that greatly impact performance, we first conduct a runtime breakdown analysis and decompose the benchmarks into hotspot kernels or functions; then we examine GPU execution efficiency in terms of the percentages of different stalls.
We use nvprof to trace the runtime breakdown and find the hotspot functions that occupy more than 80% of the total runtime. Since each run involves dozens of function calls, we single out the functions that occupy large proportions of the runtime and classify them into several categories of kernels according to their computation logic. We find that the most time-consuming functions among all component benchmarks have much in common; they fall into eight categories of kernels, which are a subset of the AIBench micro benchmarks: data arrangement, convolution, general matrix multiply (gemm), batch normalization, element-wise operation, relu activation (relu activation is an element-wise operation; we use a separate category considering its large proportion and diverse CUDA functions), pooling, and memory copy, spanning from computation kernels to memory access kernels. Note that each kernel category contains a bunch of functions that solve similar issues; for example, the gemm kernel includes single- and double-precision floating-point general matrix multiply.

Figure 6: Runtime Breakdown of AIBench Components.

Fig. 6 shows the runtime breakdown of the sixteen component benchmarks, using the average number over all involved functions within each micro benchmark; the remaining 20% of functions are not considered in this figure. Further, for each micro benchmark, we summarize the typical functions that occupy a large proportion of the runtime among the component benchmarks, as shown in Table 3. From Fig. 6, we find that learning to rank spends much of its time on data arrangement operations; the corresponding function call is maxwell_scudnn_128x32_stridedB_splitK_interior_nn, with an IPC of 0.98. This is the reason why learning to rank has the lowest IPC efficiency among the component benchmarks. We believe that the eight micro benchmarks and their corresponding functions are the optimization points, not only for CUDA library optimizations but also for micro-architectural optimizations.

Table 3: Typical Hotspot Functions of Each Micro Benchmark.

| Micro Benchmark | Function Names |
|---|---|
| Data arrangement | maxwell_scudnn_128x128_stridedB_splitK_interior_nn; maxwell_scudnn_128x32_stridedB_splitK_interior_nn; maxwell_scudnn_128x128_stridedB_interior_nn |
| Convolution | maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt; wgrad_alg0_engine; fft2d_r2c_32x32 |
| GEMM | maxwell_sgemm_128x64_nt; maxwell_sgemm_128x64_nn; sgemm_32x32x32_NN_vec |
| BatchNorm | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW; cudnn::detail::bn_bw_1C11_kernel_new; batch_norm_backward_kernel; at::native::batch_norm_backward_kernel |
| Relu | maxwell_scudnn_128x128_relu_small_nn; maxwell_scudnn_128x128_relu_interior_nn; maxwell_scudnn_128x32_relu_interior_nn |
| Element-wise | element-wise add kernel; element-wise threshold kernel; element-wise mul kernel |
| Pooling | MaxPoolBackward; AvePoolForward |
| Memcpy | CUDA memcpy HtoD; CUDA memcpy DtoD |

Focusing on the above eight most time-consuming micro benchmarks, we evaluate the following stalls of these kernels. Instruction fetch stall (Inst_fetch) is the percentage of stalls because the next assembly instruction has not yet been fetched. Execution dependency stall (Exe_depend) is the percentage of stalls because an input required by the instruction is not yet available. Memory dependency stall (Mem_depend) is the percentage of stalls because a memory operation cannot be performed due to the required resources not being available or being fully utilized. Texture stall (Texture) is the percentage of stalls because of under-utilization of the texture sub-system. Synchronization stall (Sync) is the percentage of stalls due to a syncthreads call. Constant memory dependency stall (Const_mem_depend) is the percentage of stalls because of an immediate constant cache miss. Pipe busy stall (Pipe_busy) is the percentage of stalls because a compute operation cannot be performed since the compute pipeline is busy. Memory throttle stall (Mem_throttle) is the percentage of stalls due to large numbers of pending memory operations [6].

The breakdown of the eight stalls of the hotspot functions is shown in Fig. 7. The top two GPU execution stalls are memory dependency stalls and execution dependency stalls. For example, for the element-wise benchmark, the memory dependency stalls occupy a very large proportion, 70%, resulting in a low IPC of about 0.86 on average. Memory dependency stalls may occur due to high cache misses, which leave the load/store resources unavailable; possible optimization strategies include optimizing data alignment, data locality, and data access patterns. Execution dependency stalls may occur due to low instruction-level parallelism, and exploiting ILP may alleviate partial execution dependency stalls to a certain degree.

Figure 7: Stall Breakdown of the Hotspot Functions.
7 Related Work

State-of-the-art and state-of-the-practice AI and Internet service benchmarks only provide a few micro or component benchmarks, as shown in Table 4; none of them distill the representative and essential AI and non-AI components, and especially the permutations of different AI and non-AI components, in characterizing industry-scale AI and Internet service applications.

MLPerf [3] is an ML benchmark suite targeting six AI tasks: image classification, object detection, speech recognition, translation, recommendation, and reinforcement learning. It provides both light-weight and heavy-weight implementations. In total, it includes seven benchmarks for training and five benchmarks for inference. The MLPerf training benchmark [45] proposes a series of benchmarking rules to eliminate the side effects of the stochastic nature of AI.
Table 4: AI Benchmark Comparison (✓ = provided, × = not provided; component benchmarks are listed as train/infer).

| | AIBench | MLPerf | Fathom | DeepBench | DNNMark | DAWNBench | TBD |
|---|---|---|---|---|---|---|---|
| Extensible, modular benchmark framework | ✓ | × | × | × | ✓ | × | × |
| End-to-end benchmark: online module | ✓ | × | × | × | × | × | × |
| End-to-end benchmark: offline module | ✓ | × | × | × | × | × | × |
| Image classification | ✓/✓ | ✓/✓ | ✓/✓ | ×/× | ×/× | ✓/✓ | ✓/× |
| Image generation | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ✓/× |
| Text-to-Text translation | ✓/✓ | ✓/✓ | ✓/✓ | ×/× | ×/× | ×/× | ✓/× |
| Image-to-Text | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| Image-to-Image | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| Speech recognition | ✓/✓ | ✓/✓ | ✓/✓ | ×/× | ×/× | ×/× | ✓/✓ |
| Face embedding | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| 3D face recognition | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| Object detection | ✓/✓ | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ✓/× |
| Recommendation | ✓/✓ | ✓/× | ×/× | ×/× | ×/× | ×/× | ✓/× |
| Video prediction | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| Image compression | ✓/✓ | ×/× | ✓/✓ | ×/× | ×/× | ×/× | ×/× |
| 3D object reconstruction | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| Text summarization | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| Spatial transformer | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| Learning to rank | ✓/✓ | ×/× | ×/× | ×/× | ×/× | ×/× | ×/× |
| Games | ×/× | ✓/× | ✓/✓ | ×/× | ×/× | ×/× | ✓/× |
| Memory network | ×/× | ×/× | ✓/✓ | ×/× | ×/× | ×/× | ×/× |
| Question answering | ×/× | ×/× | ×/× | ×/× | ×/× | ✓/✓ | ×/× |
| Micro: Convolution | ✓ | × | × | ✓ | ✓ | × | × |
| Micro: Fully connected | ✓ | × | × | ✓ | ✓ | × | × |
| Micro: Element-wise op | ✓ | × | × | × | × | × | × |
| Micro: Pooling | ✓ | × | × | × | ✓ | × | × |
| Micro: Normalization | ✓ | × | × | × | ✓ | × | × |
| Micro: Dropout | ✓ | × | × | × | ✓ | × | × |
| Micro: Softmax | ✓ | × | × | × | ✓ | × | × |
| Micro: Memory access | ✓ | × | × | × | × | × | × |
| Micro: AllReduce | × | × | × | ✓ | × | × | × |
| Text data sets | 3 | 1 | 2 | N/A | N/A | 1 | 1 |
| Image data sets | 8 | 2 | 2 | N/A | N/A | 2 | 4 |
| 3D data sets | 2 | 0 | 0 | N/A | N/A | 0 | 0 |
| Audio data sets | 1 | 0 | 1 | N/A | N/A | 0 | 2 |
| Video data sets | 1 | 0 | 1 | N/A | N/A | 0 | 0 |
| Software stacks | 3 | 2 | 1 | 1 | 1 | 2 | 4 |
Conclusion

This paper proposes an agile domain-specific benchmarking methodology that speeds up software and hardware co-design. Together with seventeen industry partners, we identify ten end-to-end application scenarios and distill sixteen representative AI tasks and fourteen time-consuming units of computation. We propose the permutations of the essential AI and non-AI tasks as end-to-end benchmarks to characterize industry-scale applications. We design and implement a reusable framework to facilitate agile end-to-end benchmark building, and we build the first end-to-end benchmark to model E-commerce search intelligence. Our evaluation shows that the end-to-end benchmark, integrating both online services and offline training, provides overall system performance for hardware and software designers; the component benchmarks reflect diverse computation and memory access patterns, which are essential for micro-architectural researchers; and the micro benchmarks represent hotspot functions, which are beneficial for code optimization.
References

[10] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.

[11] R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, "Fathom: Reference workloads for modern deep learning methods," in 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2016, pp. 1–10.

[12] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A. Hannun, T. Han, L. V. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and Z. Zhu, "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.

[13] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.

[14] G. Ayers, J. H. Ahn, C. Kozyrakis, and P. Ranganathan, "Memory hierarchy for web search," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 643–656.

[15] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS parallel benchmarks," The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63–73, 1991.

[16] L. A. Barroso and U. Hölzle, "The datacenter as a computer: An introduction to the design of warehouse-scale machines," Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1–108, 2009.

[17] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A dataset for recognising faces across pose and age," in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 67–74.

[18] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, "ShapeNet: An information-rich 3D model repository," arXiv preprint arXiv:1512.03012, 2015.

[19] E. Cho, S. A. Myers, and J. Leskovec, "Friendship and mobility: User movement in location-based social networks," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 1082–1090.

[20] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, "DAWNBench: An end-to-end deep learning benchmark and competition," Training, vol. 100, no. 101, p. 102, 2017.

[21] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.

[22] A. Dakkak, C. Li, J. Xiong, and W.-m. Hwu, "Frustrated with replicating claims of a shared model? A solution," arXiv preprint arXiv:1811.09737, 2019.

[23] A. C. De Melo, "The new Linux 'perf' tools," in Slides from Linux Kongress, vol. 18, 2010.

[24] C. Delimitrou and C. Kozyrakis, "Amdahl's law for tail latency," Communications of the ACM, vol. 61, no. 8, pp. 65–72, 2018.

[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.

[26] S. Dong and D. Kaeli, "DNNMark: A deep neural network benchmark suite for GPUs," in Proceedings of the General Purpose GPUs. ACM, 2017, pp. 63–72.

[27] C. Finn, I. Goodfellow, and S. Levine, "Unsupervised learning for physical interaction through video prediction," in Advances in Neural Information Processing Systems, 2016, pp. 64–72.

[28] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, F. Tang, B. Xie, C. Zheng, X. Wen, X. He, H. Ye, and R. Ren, "Data motifs: A lens towards fully understanding big data and AI workloads," in Parallel Architectures and Compilation Techniques (PACT), 2018 27th International Conference on, 2018.

[29] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, X. Wen, R. Ren, C. Zheng, X. He, H. Ye, H. Tang, Z. Cao, S. Zhang, and J. Dai, "BigDataBench: A scalable and unified big data and AI benchmark suite," arXiv preprint arXiv:1802.08254, 2018.

[30] C. Gormley and Z. Tong, Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O'Reilly Media, Inc., 2015.

[31] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, p. 19, 2016.

[32] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang, "Applied machine learning at Facebook: A datacenter infrastructure perspective," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 620–629.

[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[34] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 173–182.

[35] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," in Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.

[36] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.

[37] Apache JMeter, http://jmeter.apache.org/, accessed April 25, 2017.

[38] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, "FastText.zip: Compressing text classification models," arXiv preprint arXiv:1612.03651, 2016.

[39] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017, pp. 1–12.

[40] H. Kasture and D. Sanchez, "TailBench: A benchmark suite and evaluation methodology for latency-critical applications," in 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2016, pp. 1–10.

[41] A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," online: http://www.cs.toronto.edu/kriz/cifar.html, vol. 55, 2014.

[42] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[43] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database," AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, p. 18, 2010.

[44] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[45] P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, G. Ma, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V. J. Reddi, T. Robie, T. St. John, C.-J. Wu, L. Xu, C. Young, and M. Zaharia, "MLPerf training benchmark," arXiv preprint arXiv:1910.01500, 2019.

[46] D. L. Mills, "Network Time Protocol (NTP)," Network, 1985.

[47] Z. Ming, C. Luo, W. Gao, R. Han, Q. Yang, L. Wang, and J. Zhan, "BDGS: A scalable big data generator suite in big data benchmarking," arXiv preprint arXiv:1401.5465, 2014.

[48] R. Nallapati, B. Zhou, C. Gulcehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," arXiv preprint arXiv:1602.06023, 2016.

[49] Y. Ni, D. Ou, S. Liu, X. Li, W. Ou, A. Zeng, and L. Si, "Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 596–605.

[50] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, "TensorFlow-Serving: Flexible, high-performance ML serving," arXiv preprint arXiv:1712.06139, 2017.

[51] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[52] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.

[53] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for sentence summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.

[54] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.

[55] B. Smith and G. Linden, "Two decades of recommender systems at Amazon.com," IEEE Internet Computing, vol. 21, no. 3, pp. 12–18, 2017.

[56] J. Tang and K. Wang, "Ranking distillation: Learning compact ranking models with high performance for recommender system," in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.

[57] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.

[58] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[59] R.-L. Vieriu, S. Tulyakov, S. Semeniuta, E. Sangineto, and N. Sebe, "Facial expression recognition under a wide range of head poses," in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1. IEEE, 2015, pp. 1–7.

[60] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 652–663, 2017.

[61] P. Webb, D. Syer, J. Long, S. Nicoll, R. Winch, A. Wilkinson, M. Overdijk, C. Dupuis, and S. Deleuze, "Spring Boot reference guide," Part IV. Spring Boot Features, vol. 24, 2013.

[62] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, "Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision," in Advances in Neural Information Processing Systems, 2016, pp. 1696–1704.

[63] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, "LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop," arXiv preprint arXiv:1506.03365, 2015.

[64] J. Zhan, L. Wang, W. Gao, and R. Ren, "BenchCouncil's view on benchmarking AI and other emerging workloads," Technical Report, 2019.

[65] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhimenko, "TBD: Benchmarking and analyzing deep neural network training," arXiv preprint arXiv:1803.06905, 2018.

[66] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.