BenchCouncil's View on Benchmarking AI and Other Emerging Workloads
Jianfeng Zhan* (BenchCouncil Steering Committee Chair), Lei Wang (BenchCouncil Big Data and CPU Tracks Executive Committee Co-chair), Wanling Gao (BenchCouncil Datacenter AI Track Executive Committee Co-chair), and Rui Ren

BenchCouncil (International Open Benchmark Council)

Technical Report No. BenchCouncil-BCVIEW-2019, Nov 12, 2019

* Jianfeng Zhan is the corresponding author.
Abstract
This paper outlines BenchCouncil's view on the challenges, rules, and vision of benchmarking modern workloads like Big Data, AI or machine learning, and Internet Services. We summarize the challenges of benchmarking modern workloads as FIDSS (Fragmented, Isolated, Dynamic, Service-based, and Stochastic), and propose the PRDAERS benchmarking rules: the benchmarks should be specified in a paper-and-pencil manner, relevant, diverse, containing different levels of abstractions, specifying the evaluation metrics and methodology, repeatable, and scalable. We believe that proposing simple but elegant abstractions that achieve both efficiency and general purpose is the ultimate goal of benchmarking, though it may not be a pressing one. In the light of this vision, we briefly discuss BenchCouncil's related projects.
Our society increasingly relies upon an information infrastructure that consists of massive IoT and edge devices, extreme-scale datacenters, and high-performance computing systems. These systems collaborate with each other to handle big data, leverage AI techniques, and ultimately provide Internet services to huge numbers of end users with guaranteed quality of service. From the perspective of workload characterization, emerging workloads like Big Data, AI, and Internet Services raise serious FIDSS (Fragmented, Isolated, Dynamic, Service-based, and Stochastic) challenges, which are significantly different from the traditional workloads characterized by the SPEC CPU (desktop workloads) [3], TPC-C [4], TPC-W (traditional web services) [5], and HPL (high-performance computing) [1] benchmarks.

The first challenge is fragmentation. There are huge numbers of fragmented application scenarios, a marked departure from the past, yet there is a lack of simple but elegant abstractions that achieve both efficiency and general purpose. For databases, for example, relational algebra demonstrates this generalized ability: any complex query can be written using five primitives, namely select, project, product, union, and difference [13]. In the big data era, however, hundreds or even thousands of ad-hoc solutions have been proposed to handle different application scenarios, most of which are termed NoSQL or NewSQL. The same observation holds true for AI: tens or even hundreds of organizations are developing AI training or inference chips to tackle the challenges of their respective application scenarios [30]. Though domain-specific software and hardware co-design is promising [23], the lack of simple but unified abstractions has two side effects. On one hand, it is challenging to amortize the cost of building an ad-hoc solution. On the other hand, single-purpose designs are a structural obstacle to resource sharing. Proposing simple but elegant abstractions that achieve both efficiency and general purpose is our ultimate goal of workload modeling, benchmarking, and characterization, though it may not be a pressing one.
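To make the generality of those five primitives concrete, the following is a minimal, illustrative Python sketch (our own, not a BenchCouncil artifact) that models a relation as a set of tuples and expresses even a join purely in terms of select, project, and product; the tiny relations and column positions are invented for the example.

```python
# A minimal, illustrative sketch of Codd's five relational-algebra primitives,
# modeling a relation as a set of equal-length tuples.

def select(rows, predicate):
    """Selection: keep only the tuples that satisfy the predicate."""
    return {r for r in rows if predicate(r)}

def project(rows, indices):
    """Projection: keep only the columns at the given positions."""
    return {tuple(r[i] for i in indices) for r in rows}

def product(r1, r2):
    """Cartesian product: every pairing of tuples from the two relations."""
    return {a + b for a in r1 for b in r2}

def union(r1, r2):
    """Union: tuples appearing in either relation (same schema assumed)."""
    return r1 | r2

def difference(r1, r2):
    """Difference: tuples in r1 but not in r2 (same schema assumed)."""
    return r1 - r2

# Even a join needs nothing beyond these primitives: product, then select,
# then project. The relations below are invented for illustration.
employees = {(1, "Ada"), (2, "Bob")}   # (emp_id, name)
salaries = {(1, 100), (2, 90)}         # (emp_id, salary)
joined = select(product(employees, salaries), lambda r: r[0] == r[2])
print(project(joined, [1, 3]))         # {('Ada', 100), ('Bob', 90)}
```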
After revisiting previous successful benchmarks, we propose the PRDAERS benchmarking rules as follows.

First, the common requirements should be specified only algorithmically, in a paper-and-pencil approach. This rule was first proposed by the NAS parallel benchmarks [8] and is well practiced in the database community. Interestingly, it is often overlooked by the architecture and system communities. Following this rule, the benchmark specification should be proposed first and reasonably divorced from individual implementations. In general, the benchmark specification should define a problem domain in a high-level language.

Second, the benchmark should be relevant [20]. On one hand, the benchmark should be domain-specific [27] and distinguish between different contexts like IoT, edge, datacenter, and HPC. Within each context, each benchmark should provide application scenarios abstracted from details, semantics-preserving data sets, and even quality targets (for AI tasks) that represent real-world deployments [30]. On the other hand, the benchmark should be simplified: it is not a copy of a real-world application but a distillation of the essential attributes of a workload [27]. Generally, a real-world application is not portable across different systems and architectures.

Third, the diversity and representativeness of a widely accepted benchmark suite are of paramount importance. On one hand, this is a tradition witnessed in the past; for example, SPEC CPU 2017 contains 43 benchmarks, and other examples include PARSEC 3.0 (30 benchmarks) and TPC-DS (99 queries). On the other hand, modern workloads manifest much higher complexity. For example, Google's datacenter workloads show significant diversity in workload behavior, with no single "silver-bullet" application to optimize for [26]. Modern AI models vary wildly, and a small accuracy change (e.g., a few percent) can drastically change the computational requirements (e.g., 5-10x) [30, 11]. For modern deep learning workloads, one may argue that running an entire training session is costly, so only a few benchmarks should be included to reduce the cost. However, we believe the cost of execution time cannot justify including only a few benchmarks; the execution time of other benchmarks (like HPC, or SPEC CPU on simulators) is also prohibitively costly. So, for workload characterization, diverse workloads should be included to exhibit the range of behavior of the target applications, or else the suite will oversimplify the typical environment [27]. For performance ranking (benchmarketing), on the other hand, it may be reasonable to choose a few representative benchmarks to reduce the cost, just as the HPC Top500 ranking only reports HPL, HPCG, and Graph500 (three benchmarks out of 20+ representative HPC benchmarks like HPCC and NPB).

Fourth, the benchmarks should contain different levels of abstractions, and usually a combination of micro, component, and end-to-end application benchmarks is preferred. From an architectural perspective, porting a full-scale application to a new architecture at an early stage is difficult and even impossible [9], while using micro or component benchmarks alone is insufficient to discover the time breakdown of different modules and locate the bottleneck within a realistic application scenario at a later stage [9].
Hence, a realistic benchmark suite should have the ability to run not only collectively, as a whole end-to-end application, to discover the time breakdown of different modules, but also individually, as micro or component benchmarks, for fine-tuning hot-spot functions or kernels [16].

Fifth, the benchmark should specify the evaluation metrics and methodology. The performance number should be simple, linear, orthogonal, and monotonic [27]. Meanwhile, it should be domain-relevant. For example, the time-to-quality (i.e., time to state-of-the-art accuracy) metric is relevant to the AI domain, because some optimizations immediately improve throughput while adversely affecting the quality of the final model, which can only be observed by running an entire training session [29].

Sixth, the benchmark should be repeatable, reliable, and reproducible [7]. Since many modern deep learning workloads are intrinsically approximate and stochastic, allowing multiple different but equally valid solutions [29], this raises serious challenges for AI benchmarking.

Finally, the benchmark should be scalable [20]: benchmark users should be able to scale up the problem size, so that the benchmark is applicable to both small and large systems. However, this is not trivial for modern deep learning workloads, as accommodating system scale may even require changing hyperparameters, which can affect the amount of computation needed to reach a particular quality target [29].
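To make the fifth and sixth rules concrete, below is a minimal Python sketch, entirely our own illustration rather than the harness of AIBench or MLPerf, of measuring time-to-quality over repeated training sessions; the callables (make_model, train_one_epoch, evaluate), the epoch budget, and the number of runs are assumptions for the example.

```python
# A hedged sketch of the time-to-quality metric: run an entire training session
# until a target quality is reached, and repeat the session several times
# because training is stochastic and equally valid runs may differ.
import statistics
import time

def time_to_quality(make_model, train_one_epoch, evaluate, target, max_epochs=90):
    """Build a fresh model, train epoch by epoch, and return the wall-clock
    seconds needed to first reach the target quality (e.g., top-1 accuracy)."""
    model = make_model()                     # fresh weights for every session
    start = time.time()
    for _ in range(max_epochs):
        train_one_epoch(model)
        if evaluate(model) >= target:
            return time.time() - start
    raise RuntimeError("target quality not reached within the epoch budget")

def repeated_measurement(make_model, train_one_epoch, evaluate, target, runs=5):
    """Report the median and spread over several full sessions rather than a
    single number, since any single run may be unrepresentative."""
    times = [time_to_quality(make_model, train_one_epoch, evaluate, target)
             for _ in range(runs)]
    return statistics.median(times), statistics.stdev(times)
```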
Domain-specific software-hardware co-design is promising, as a single-purpose solution can achieve higher energy efficiency than a general-purpose one. However, we believe the ultimate goal is to propose simple but elegant abstractions that achieve both efficiency and general purpose for big data, AI, and Internet services.

To cope with the fragmented application scenarios, BenchCouncil currently sets up (datacenter) Big Data, datacenter AI, HPC AI, AIoT, and Edge AI benchmarking tracks, and has released the BigDataBench [19] and AIBench [16, 15] benchmark suites for datacenter big data and AI, Edge AIBench [22] for edge AI, AIoT Bench [28] for IoT AI, and HPC AI500 [25] for HPC AI. These benchmarks are evolving rapidly. We release the source code, pre-trained models, and container-based deployments on the BenchCouncil web site (http://benchcouncil.org/testbed/index.html).

On the other side, we propose an innovative approach to modeling and characterizing the emerging workloads. We consider each big data, AI, and Internet service workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs, each of which we call a data motif [18]. After thoroughly analyzing a majority of workloads in five typical big data application domains (search engine, social network, e-commerce, multimedia, and bio-informatics), we identify eight data motifs that take up most of the run time: Matrix, Sampling, Logic, Transform, Set, Graph, Sort, and Statistic [18]. We found that combinations of one or more data motifs with different runtime weights can describe most of the big data and AI workloads we investigated [18, 17]. In conclusion, the data motifs are promising as simple but elegant abstractions that achieve both efficiency and general purpose.
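As a rough illustration of this decomposition, the sketch below (our own, with invented weights rather than measured profiles) treats a workload's profile as normalized per-motif run-time weights over the eight motifs.

```python
# A hedged sketch of the data-motif abstraction described above: a workload's
# run time is decomposed into the eight motifs, each with a weight. The numbers
# below are invented for illustration, not measured profiles.
MOTIFS = ("Matrix", "Sampling", "Logic", "Transform",
          "Set", "Graph", "Sort", "Statistic")

def motif_profile(runtime_by_motif):
    """Normalize per-motif run times (seconds) into weights that sum to 1."""
    total = sum(runtime_by_motif.get(m, 0.0) for m in MOTIFS)
    return {m: runtime_by_motif.get(m, 0.0) / total for m in MOTIFS}

# Hypothetical decomposition of a CNN-style training workload:
cnn_profile = motif_profile({
    "Matrix": 62.0,      # GEMM-dominated convolution and fully connected layers
    "Transform": 14.0,   # data-layout changes and FFT-like transforms
    "Sampling": 9.0,     # mini-batch sampling and pooling
    "Statistic": 8.0,    # normalization and loss statistics
    "Logic": 7.0,        # elementwise logic and control flow
})
print({m: round(w, 2) for m, w in cnn_profile.items() if w > 0})
```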
On the basis of the data motif methodology, we are proposing a new benchmark suite, named BENCHCPU [6], to characterize emerging workloads, including Big Data, AI, and Internet Services. The goal of BENCHCPU is to abstract ISA (Instruction Set Architecture) independent workload characterizations from emerging workloads. BENCHCPU will be portable across edge, IoT, and datacenter processor architectures. Furthermore, we are working on an open-source chip project, named EChip. On the basis of the data motif approach, the goal of EChip is to design an open-source, general-purpose ISA for emerging workloads, a marked departure from single-purpose accelerators. The ISA of EChip is composed of a general-purpose instruction set and a domain-specific instruction set. The general-purpose instruction set is a modular basic instruction set that always remains unchanged (a minimum implementation subset) and is compatible with Linux ecosystems. The domain-specific instruction set is an extension for domain-specific customization, proposed for emerging workloads like big data, AI, and Internet services.

Benchmarking principles define the rules and guidelines on what criteria need to be considered for a good benchmark [7, 31], while a benchmarking methodology specifies systematic strategies and processes on how to construct a benchmark. Jim Gray summarizes four key criteria that define a good benchmark: relevant, portable, scalable, and simple [21]. Other benchmark consortia also propose their own principles and methodologies.

The TPC benchmarks are a series of domain-specific benchmarks targeting transaction processing (TP) and database (DB) performance. They hold that a domain-specific benchmark should satisfy three criteria [27, 24]: (1) no single metric can measure the application performance of all domains; (2) the more general the benchmark, the less useful it is for anything in particular; (3) a benchmark is a distillation of the essential attributes of a workload. They adopt a benchmark methodology built on the concepts of "functions of abstraction" and "functional workload model", which abstract the compute units that appear frequently with repetitions or similarities.

The SPEC benchmarks are a set of benchmarks for the newest generations of computing systems [2], guided by six implicit design principles: application-oriented; portable; repeatable and reliable benchmarking results; consistent and fair across different users or different systems; diverse workloads that can run independently; and reporting a unit of measurement, e.g., throughput.

PARSEC (Princeton Application Repository for Shared-Memory Computers) is a benchmark suite for chip multiprocessors. The PARSEC benchmarks are constructed following five requirements [12]: (1) the benchmarks should be multi-threaded; (2) emerging workloads should be considered; (3) the benchmarks should cover diverse applications and a variety of platforms, and accommodate different usage models; (4) the benchmarks should use state-of-the-art algorithms and data structures; and (5) the benchmarks should be designed to support research.
This paper outlines BenchCouncil's view on the challenges, rules, and vision of benchmarking modern workloads. We summarize the challenges of benchmarking modern workloads as Fragmented, Isolated, Dynamic, Service-based, and Stochastic. After revisiting previous successful benchmarks, we propose the PRDAERS benchmarking rules: the benchmarks should be specified in a paper-and-pencil manner, relevant, diverse, containing different levels of abstractions, specifying the evaluation metrics and methodology, repeatable, and scalable. We believe that proposing simple but elegant abstractions that achieve both efficiency and general purpose is the ultimate goal of benchmarking, though it may not be a pressing one. In the light of this vision, we briefly discuss BenchCouncil's related projects, including the BigDataBench, AIBench, HPC AI500, Edge AIBench, AIoT Bench, BENCHCPU, and EChip projects.

References
[7] Linked Data Benchmark Council (LDBC), Project No 317548, European Community's Seventh Framework Programme FP7, 2012-2014.
[8] D. H. Bailey, "NAS parallel benchmarks," Encyclopedia of Parallel Computing, pp. 1254-1259, 2011.
[9] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS parallel benchmarks," The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63-73, 1991.
[10] L. A. Barroso and U. Hölzle, "The datacenter as a computer: An introduction to the design of warehouse-scale machines," Synthesis Lectures on Computer Architecture, vol. 8, no. 3, 2009.
[11] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, "Benchmark analysis of representative deep neural network architectures," IEEE Access, vol. 6, pp. 64270-64277, 2018.
[12] C. Bienia and K. Li, Benchmarking Modern Multiprocessors. Princeton University, USA, 2011.
[13] E. F. Codd, "A relational model of data for large shared data banks," Communications of the ACM, vol. 13, no. 6, pp. 377-387, 1970.
[14] J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74-80, 2013.
[15] W. Gao, C. Luo, L. Wang, X. Xiong, J. Chen, T. Hao, Z. Jiang, F. Fan, M. Du, Y. Huang, F. Zhang, X. Wen, C. Zheng, X. He, and J. Dai, "AIBench: Towards scalable and comprehensive datacenter AI benchmarking," in International Symposium on Benchmarking, Measuring and Optimization (Bench), 2018.
[16] W. Gao, F. Tang, L. Wang, J. Zhan, C. Lan, C. Luo, Y. Huang, C. Zheng, J. Dai, Z. Cao, H. Tang, K. Zhan, B. Wang, D. Kong, T. Wu, M. Yu, C. Tan, H. Li, X. Tian, Y. Li, G. Lu, J. Shao, Z. Wang, X. Wang, and H. Ye, "AIBench: An industry standard Internet service AI benchmark suite," https://arxiv.org/abs/1908.08998, 2019.
[17] W. Gao, J. Zhan, L. Wang, C. Luo, Z. Jia, D. Zheng, C. Zheng, H. Ye, H. Wang, and R. Ren, "Data motif-based proxy benchmarks for big data and AI workloads," in IEEE International Symposium on Workload Characterization (IISWC), 2018.
[18] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, F. Tang, B. Xie, C. Zheng, X. Wen, X. He, H. Ye, and R. Ren, "Data motifs: A lens towards fully understanding big data and AI workloads," in The 27th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2018.
[19] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, X. Wen, R. Ren, C. Zheng, X. He, H. Ye, H. Tang, Z. Cao, S. Zhang, and J. Dai, "BigDataBench: A scalable and unified big data and AI benchmark suite," arXiv preprint arXiv:1802.08254, 2018.
[20] J. Gray, Benchmark Handbook: For Database and Transaction Processing Systems. Morgan Kaufmann Publishers Inc., 1992.
[21] J. Gray, "Database and transaction processing performance handbook," 1993.
[22] T. Hao, Y. Huang, X. Wen, W. Gao, F. Zhang, C. Zheng, L. Wang, H. Ye, K. Hwang, Z. Ren, and J. Zhan, "Edge AIBench: Towards comprehensive end-to-end edge computing benchmarking," 2018.
[23] J. Hennessy and D. Patterson, "A new golden age for computer architecture: Domain-specific hardware/software co-design, enhanced security, open instruction sets, and agile chip development," 2018.
[24] K. Huppler, "The art of building a good benchmark," in Technology Conference on Performance Evaluation and Benchmarking. Springer, 2009, pp. 18-30.
[25] Z. Jiang, W. Gao, L. Wang, X. Xiong, Y. Zhang, X. Wen, C. Luo, H. Ye, X. Lu, Y. Zhang, S. Feng, K. Li, W. Xu, and J. Zhan, "HPC AI500: A benchmark suite for HPC AI systems," 2018.
[26] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, "Profiling a warehouse-scale computer," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2015, pp. 158-169.
[27] C. Levine, "TPC benchmarks," in SIGMOD International Conference on Management of Data - Industrial Session, 1997.
[28] C. Luo, F. Zhang, C. Huang, X. Xiong, J. Chen, L. Wang, W. Gao, H. Ye, T. Wu, R. Zhou, and J. Zhan, "AIoT Bench: Towards comprehensive benchmarking mobile and embedded device intelligence," in BenchCouncil International Symposium on Benchmarking, Measuring and Optimizing (Bench18), 2018.
[29] P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf et al., "MLPerf training benchmark," arXiv preprint arXiv:1910.01500, 2019.
[30] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou, "MLPerf inference benchmark," https://arxiv.org/abs/1911.02549, 2019.
[31] L. Zhao, W. Gao, and Y. Jin, "Revisiting benchmarking principles and methodologies for big data benchmarking," in