HPC AI500: The Methodology, Tools, Roofline Performance Models, and Metrics for Benchmarking HPC AI Systems
Zihan Jiang, Lei Wang, Xingwang Xiong, Wanling Gao, Chunjie Luo, Fei Tang, Chuanxin Lan, Hongxiao Li, Jianfeng Zhan
Authors' Contributions: Section 1 is contributed by Jianfeng Zhan and Zihan Jiang. Section 2 is contributed by Jianfeng Zhan, Zihan Jiang, and Fei Tang. Section 3 is contributed by Jianfeng Zhan. Section 4 is contributed by Xingwang Xiong, Zihan Jiang, Lei Wang, Wanling Gao, and Jianfeng Zhan. Section 5 is contributed by Zihan Jiang, Lei Wang, Chunjie Luo, Wanling Gao, Jianfeng Zhan, and Hongxiao Li. Section 6 is contributed by Lei Wang, Zihan Jiang, Wanling Gao, and Jianfeng Zhan. Section 7 is contributed by Zihan Jiang, Xingwang Xiong, Lei Wang, Wanling Gao, Chuanxin Lan, and Jianfeng Zhan. Section 8 is contributed by Zihan Jiang, Lei Wang, Wanling Gao, and Jianfeng Zhan. Section 9 is contributed by Jianfeng Zhan.

Technical Report No. BenchCouncil-HPCAI500-2020-1, June 30, 2020
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; BenchCouncil (International Open Benchmarking Council); University of Chinese Academy of Sciences
{jiangzihan, wanglei2011, xingwangxiong, gaowanling, luochunjie, lanchuanxin, tangfei, lihongxiao, zhanjianfeng}@ict.ac.cn

June 30, 2020

* Jianfeng Zhan is the corresponding author.
Abstract

Recent years have witnessed a trend of applying large-scale distributed deep learning algorithms in both business and scientific computing areas, whose goal is to speed up the training time to achieve state-of-the-art quality. The HPC community shows great interest in building HPC AI systems dedicated to running those workloads, and HPC AI benchmarks accelerate that process. Unfortunately, benchmarking HPC AI systems at scale raises serious challenges: none of the previous HPC AI benchmarks achieves the goal of being equivalent, relevant, representative, affordable, and repeatable.

This paper presents a comprehensive methodology, tools, Roofline performance models, and innovative metrics for benchmarking, optimizing, and ranking HPC AI systems, which we call HPC AI500 V2.0. We abstract an HPC AI system into nine independent layers, and present explicit benchmarking rules and procedures to assure the equivalence of each layer, repeatability, and replicability. On the basis of AIBench, by far the most comprehensive AI benchmark suite, we present and build two HPC AI benchmarks from both business and scientific computing: Image Classification and Extreme Weather Analytics, achieving both representativeness and affordability. To rank the performance and energy efficiency of HPC AI systems, we propose Valid FLOPS and Valid FLOPS per watt, which impose a penalty on failing to achieve the target quality. We propose using convolution and GEMM, the two most intensively used kernel functions of AIBench, to measure the upper bound performance of HPC AI systems, and present HPC AI Roofline models for guiding performance optimizations. The evaluations show that our methodology, benchmarks, performance models, and metrics can measure, optimize, and rank HPC AI systems in a scalable, simple, and affordable way. The specification, source code, datasets, and benchmarking data are publicly available.

1 Introduction

The huge success of AlexNet [1] in the ImageNet [2] competition marks the booming success of deep learning (DL) in a wide range of commercial application areas. Many commercial fields, like image recognition and natural language processing, achieve unprecedented accuracy, even outperforming common human capability. Though it is much more challenging to obtain high-quality labeled scientific data sets, there is an increasing trend of applying DL in scientific computing areas [3–6].

Figure 1: ImageNet/ResNet-50 training is one well-known showcase for optimizing HPC AI systems. It reports the performance in terms of a ternary tuple (achieved quality, PFLOPS, time-to-quality in minutes). The reported systems performance varies wildly, from (74.6%, 1.6, 28) to (75.1%, 36.8, 1.2). Table 1 summarizes the utilized optimization approaches. As no equivalent benchmarking rule is stated, we cannot objectively derive the performance edge of one system against the others.

With massive training data available, recent years have witnessed a trend of applying distributed DL algorithms at scale in both commercial and scientific computing areas. Motivated by these emerging HPC AI workloads, the HPC community shows great interest in building HPC AI systems to reduce time-to-quality: the training time to achieve a convergent quality. For example, the Summit system [7] is built to tackle huge AI challenges. Benchmarks accelerate this process [8, 9], as they provide not only design inputs but also evaluation and optimization metrics and methodology [10, 11].
However, there are several challenges in benchmarking HPC AI systems.

First, it is nontrivial to prove the equivalence of two AI benchmark implementations on different systems, or even on the same system at different scales. Equivalence quantifies how comparable two benchmark implementations are on different systems or on the same system at different scales. There are complex interactions among hardware and software systems, which are further aggravated by the complexity of AI algorithms. Even for the same AI algorithm, there are many hyper-parameters that significantly impact learning dynamics [9]. ImageNet/ResNet-50 (Image Classification) training is one well-known showcase for optimizing HPC AI systems. Table 1 summarizes the state-of-the-art and state-of-the-practice optimization approaches in ImageNet/ResNet-50 training. Unfortunately, without equivalent benchmarking rules explicitly stated, we cannot objectively derive the performance edge of one system against the others from Fig. 1.

The second challenge inherits from the conflict of two classical benchmarking methodologies with the emphasis on different requirements. On one hand, as no single benchmark or metric can measure the performance of computer systems on all applications [12], being relevant, representative, and diverse is of paramount importance [10]. On the other hand, TOP500 [13] establishes the de facto supercomputer benchmark standard in terms of three defining characteristics: scalable, simple, and affordable.
Figure 2: The equivalence perspective of the HPC AI500 V2.0 methodology. We abstract an HPC AI system into nine independent layers: Layer 1 hardware (e.g., CPU, network); Layer 2 OS; Layer 3 communication libraries (e.g., Horovod); Layer 4 AI accelerators and their libraries (e.g., GPU, CUDA, NCCL); Layer 5 AI framework (e.g., TensorFlow); Layer 6 programming model; Layer 7 workload (algorithm); Layer 8 hyper-parameters (learning rate policies, batchsize setting, and other hyper-parameter settings); Layer 9 problem domain (datasets, target quality, epochs). We put each layer under test while keeping the other layers intact. We also provide three high levels of benchmarking, hardware, system, and free, which put the related layers together under test while keeping the other layers intact, with only the changes allowed by the benchmarking rules.
In the AI domain, there are massive AI tasks and models with different performance metrics. For example, AIBench [10, 11, 14, 15], by far the most comprehensive and representative AI benchmark suite, contains seventeen AI tasks. It is not affordable to implement so many benchmarks and further perform benchmarking at scale. So what are the criteria for choosing the benchmarks that can fairly and objectively measure HPC AI systems?

Third, a benchmark mandates being repeatable, while the nature of AI is stochastic, allowing multiple different but equally valid solutions [9]. The uncertainty of HPC AI is manifested by run-to-run variation in terms of epochs-to-quality and by the effect of scaling training on time-to-quality [9, 16]. For the first time, Tang et al. [10] quantify the variations of the seventeen AI benchmarks of AIBench. They found that the run-to-run variations range from 0% to 38.46% in terms of the ratio of the standard deviation to the mean of the training epochs needed to achieve a convergent quality.

None of the previous HPC AI benchmarks achieves the goal of being equivalent, relevant, representative, affordable, and repeatable. They either are not representative of, or are even irrelevant to, HPC AI workloads in terms of kernel functions [17, 18], or overlook the differences of HPC AI workloads between scientific and business computing [9], or fail to specify fair and equivalent benchmarking rules across different HPC AI systems [9]. Moreover, they fail to propose simple and AI domain-specific metrics to score and rank HPC AI systems.

A micro benchmark like HPL-AI [18], which only contains LU decomposition, is affordable for performing a fair comparison of competing systems by isolating hardware and software from statistical optimizations [9]. However, we found it is irrelevant to most AI workloads, as shown in Section 3.2. Moreover, the traditional micro or kernel benchmarking methodology, widely used in the HPC communities, can lead to misleading conclusions: the mixed precision optimizations indeed improve the FLOPS of a micro benchmark like convolution, while significantly impacting the time-to-quality of an AI task like Image Classification, as discussed in Section 3.2.

This paper presents HPC AI500 V2.0: a comprehensive HPC AI benchmarking methodology, tools, performance models, and metrics. As shown in Fig. 2, we abstract an HPC AI system into nine independent layers. To perform fair benchmarking across different systems or the same system at different scales, we present explicit benchmarking rules to assure the equivalence of each layer, and the repeatability and replicability of our benchmarks. We put each layer under test while keeping the other layers intact. We also propose three high levels of benchmarking: hardware, system, and free (Fig. 2), which put the related layers under test while keeping the other layers intact unless otherwise stated.

On the basis of AIBench, we present two benchmarks to measure HPC AI systems: Image Classification with state-of-the-art quality on the ImageNet dataset (business computing), and Extreme Weather Analytics (EWA) with state-of-the-art quality on the EWA dataset (scientific computing). These two benchmarks represent two clusters, covering thirteen AI benchmarks from AIBench, from the perspectives of computing areas (business vs. scientific computing), diversity of model complexity (from 0.03 million to 68.39 million model parameters), computational cost (from 0.09 MFLOPs to 157.80 GFLOPs for a single forward computation), and convergence rate (from 6 epochs to 304 epochs).
Moreover, our decision also takes into account their repeatability, and whether these benchmarks have widely accepted metrics or not.

To rank HPC AI systems, we propose two metrics, named Valid FLOPS and Valid FLOPS per watt, to emphasize the vital importance of achieving state-of-the-art quality, and an auxiliary metric, time-to-quality.

We propose using convolution and GEMM (GEneral Matrix to Matrix Multiplication), the two most intensively used kernel functions of AIBench, to measure the upper bound performance of HPC AI systems, and present the corresponding single-node and distributed HPC AI Roofline models for guiding performance optimizations.

The evaluations show our benchmarks can fairly measure HPC AI systems in a scalable, simple, and affordable way. Our Roofline models are helpful for system optimizations. Our metrics can be used to rank HPC AI systems in a simple and visual manner.
3 The Challenges of HPC AI Benchmarking

The challenges of HPC AI benchmarking inherit from the complexity of benchmarking scalable hardware and software systems, and are further exacerbated by the uncertainty of AI algorithms.
For the same AI algorithm, there are many hyper-parameters that significantly impact learning dynamics [9]. Even for the same system at different scales, the interactions among system size, minibatch size, and learning dynamics have a significant impact on time-to-quality and on computation overhead in terms of FLOPS [9, 19, 26]. So, for the same AI task, it is non-trivial to prove the equivalence of two benchmark implementations on different systems or even on the same system at different scales.

ImageNet/ResNet-50 training is one widely-used showcase for optimizing HPC AI systems. Fig. 1 shows the systems performance varies wildly: the performance gap in terms of FLOPS is 50x. Accordingly, Table 1 summarizes the state-of-the-art and state-of-the-practice work on ImageNet training at scale. In addition to the system-level optimizations (e.g., more efficient communication topologies), some algorithm-level optimizations involve changing model architectures (e.g., optimizations on batch normalization) or learning rate policies, i.e., LARS [26]. As there are prohibitively complex interactions among hardware systems, software systems, and algorithms, previous work fails to clearly state the equivalent rules of each hardware or software layer for benchmarking HPC AI systems.
The second challenge inherits from the conflict of two classical benchmarking methodologies with the emphasis on different requirements.

Table 1: The summary of the utilized optimization approaches in ImageNet/ResNet-50 training. The optimization approaches of each system are inconsistent or inequivalent. Please note that only the optimization items in italics are allowed to change under the HPC AI500 benchmarking rules (defined in Section 6).
| System | Parallel Mode | Communication | Precision | Data Staging | Learning Rate Policy | Data Augmentation | Model Architecture | Others |
|---|---|---|---|---|---|---|---|---|
| Facebook [19] | Data parallelism | Recursive halving and doubling; ring all-reduce | N/A | N/A | Linear scaling and warmup [20] | N/A | N/A | Momentum correction; data shuffling based on the workers |
| Intel [21] | Data parallelism | Intel MLSL [22] | N/A | N/A | Linear scaling and warmup; final collapse | N/A | N/A | Collapsed ensembles; dynamically change weight decay |
| IBM [23] | Data parallelism | Topology aware | N/A | N/A | Linear scaling and warmup [20] | N/A | N/A | Momentum correction; data shuffling based on the workers |
| Berkeley [24] | Data parallelism | Intel MLSL [25] | N/A | N/A | Linear scaling and warmup [20]; LARS [26] | N/A | N/A | N/A |
| Preferred Networks [27] | Data parallelism | Ring all-reduce; communication compression | N/A | N/A | Linear scaling, RMSprop warmup, and slow-start | N/A | Batch normalization without moving averages | N/A |
| Sony [28] | Data parallelism | 2D-Torus all-reduce; communication compression; communication tensor fusion | Mixed precision training: FP16 & FP32 | N/A | Linear scaling and warmup [20]; LARS [26] | Adding, scaling, rotations, etc. | Batch normalization without moving averages | N/A |
| Tencent [29] | Data parallelism | Hierarchical all-reduce; communication compression; communication tensor fusion | Mixed precision training: FP16 & FP32 | Efficient input pipeline | Linear scaling and warmup [20]; LARS [26] | N/A | Batch normalization: eliminating weight decay | N/A |
| Google [30] | Data parallelism | 2D-Mesh all-reduce | Mixed precision training: BFLOAT16 [31] & FP32 | Efficient input pipeline | Linear scaling and warmup [20]; LARS [26] | Fused JPEG decode and cropping | Distributed batch normalization | N/A |
| Fujitsu [32] | Data parallelism | Communication tensor fusion; optimal scheduling by grouping layers; calculating the norms of layers in parallel | Mixed precision training: FP16 & FP32 | N/A | Linear scaling and warmup [20]; LARS [26] | N/A | N/A | Label smoothing [33] |
On one hand, SPEC CPU [34], PARSEC [35], and the TPC benchmarks, like TPC-DS [36], witness the paramount importance [10] of being representative and diverse, as no single benchmark or metric can measure the performance of computer systems on all applications [12].

On the other hand, TOP500 [13] defines three distinctive characteristics of the de facto supercomputer benchmark standard: affordable, simple, and scalable. Affordable has two implications: first, the benchmark is easy to port to a new system or architecture; second, the benchmarking cost is affordable for measuring a system at scale. Simple indicates that the metric is not only linear, orthogonal, and monotonic [13], but also easily interpretable and understandable. Scalable means the benchmark can be used to measure different scales of systems, and the problem size can be scaled up and down.

In the AI domain, there are massive AI tasks and models with different performance metrics. For example, AIBench [10] contains seventeen representative AI tasks, including Image Classification, Object Detection, Learning to Rank, Image Generation, Text-to-Text Translation, Image-to-Text, Image-to-Image Translation, Speech Recognition, Face Embedding, 3D Face Recognition, Recommendation, Video Prediction, Image Compression, 3D Object Reconstruction, Text Summarization, Spatial Transformer, and Neural Architecture Search. For HPC AI benchmarking, it is not affordable to implement so many benchmarks and further perform benchmarking at scale.

The traditional micro or kernel benchmarking methodology, which is widely used in the HPC communities, can lead to misleading conclusions, as the mixed precision optimizations indeed improve the FLOPS of a micro benchmark like convolution, while significantly impacting the time-to-quality of an AI task like Image Classification. Fig. 4 shows that the mixed precision implementation increases the FLOPS of both micro and component benchmarks, while incurring an accuracy drop as the system scale increases.

Last but not least, the relevancy [37] of a benchmark indicates that it must measure the peak performance and price/performance of systems when performing typical operations within that problem domain. A micro benchmark like HPL-AI [18], which only contains LU decomposition, is affordable for performing a fair comparison of competing systems by isolating hardware and software from statistical optimizations [9]. However, we found it is irrelevant to most AI workloads in AIBench. As shown in Fig. 3, the dominant kernel functions are convolution and matrix multiplication.
Figure 3: The kernel function breakdown of the seventeen representative AI workloads from AIBench [10], indicating that LU factorization is irrelevant.
Figure 4: With respect to the FP32 implementation, the mixed precision one speeds up the FLOPS of two micro benchmarks, Conv and GEMM, and a component benchmark, ResNet-50, by 2x (left), while incurring an accuracy drop of ResNet-50 as the system scale increases (right): 0.12% at 1 node but about 1% at 8 nodes.

3.3 Repeatability

Repeatability [38, 39] refers to the variation in repeat measurements of different runs of the same benchmark implementation, by the same team, on the same system under identical configurations. Table 2 shows the run-to-run variations of the seventeen benchmarks from AIBench, varying from 0% to 38.46%. As shown in Fig. 5, the variation of 3D Face Recognition is as high as 38.46%. There are diverse reasons for the uncertainty of different benchmarks. NAS (Neural Architecture Search) constructs the network architecture by randomly sampling building blocks (e.g., convolution) from a predefined search space. In addition, the complex design itself, which involves many hyper-parameters, makes AutoML hard to evaluate [40].
Table 2: The run-to-run variations of the seventeen AI benchmarks of AIBench [10].

| No. | Component Benchmark | Variation | Repeat Times |
|---|---|---|---|
| DC-AI-C1 | Image Classification | 1.12% | 5 |
| DC-AI-C2 | Image Generation | Not available | N/A |
| DC-AI-C3 | Text-to-Text Translation | 9.38% | 6 |
| DC-AI-C4 | Image-to-Text | 23.53% | 5 |
| DC-AI-C5 | Image-to-Image | Not available | N/A |
| DC-AI-C6 | Speech Recognition | 12.08% | 4 |
| DC-AI-C7 | Face Embedding | 5.73% | 8 |
| DC-AI-C8 | 3D Face Recognition | 38.46% | 4 |
| DC-AI-C9 | Object Detection | 0% | 10 |
| DC-AI-C10 | Recommendation | 9.95% | 5 |
| DC-AI-C11 | Video Prediction | 11.83% | 4 |
| DC-AI-C12 | Image Compression | 22.49% | 4 |
| DC-AI-C13 | 3D Object Reconstruction | 16.07% | 4 |
| DC-AI-C14 | Text Summarization | 24.72% | 5 |
| DC-AI-C15 | Spatial Transformer | 7.29% | 4 |
| DC-AI-C16 | Learning to Rank | 1.90% | 4 |
| DC-AI-C17 | Neural Architecture Search | 6.15% | 6 |
Figure 5: The worst unrepeatable benchmark from AIBench is 3D Face Recognition. Its run-to-run variation is as high as 38.46%. The variation is defined as the ratio of the standard deviation to the mean of the training epochs to the achieved quality [10].
Without the equivalent benchmarking rules being explicitly stated, ImageNet/ResNet-50 training is not qualified for ranking the performance and energy efficiency of HPC AI systems.
4 Benchmarking Methodology

This section presents our methodology for achieving the goal of being equivalent, relevant, representative, affordable, and repeatable.
4.1 Equivalence

To perform fair benchmarking across different systems or the same system at different scales, we propose two approaches to assure equivalence.

First, as shown in Fig. 2, we abstract the system under test into nine independent layers, and put each layer under test while keeping the other layers intact unless otherwise stated. Layer 1 is the hardware, including CPUs and networks. Layers 2 and 3 are the related system software: the operating system (Layer 2) and the communication libraries (Layer 3). Layer 4 is the AI accelerators, e.g., GPUs, and their libraries, e.g., CUDA and cuDNN. Layer 5 is the AI framework, such as TensorFlow [41] and PyTorch [42]. Layer 6 is the programming model, including the parallel mode (data parallelism or model parallelism) and synchronous or asynchronous training. Layer 7 is the workloads used in the HPC AI500 V2.0 benchmark. Layer 8 is the hyper-parameter policies and settings. Layer 9 is the problem domain, including datasets, target quality, and epochs.

Second, for the sake of simplicity, we propose three high levels of benchmarking and put several related layers together under test.

(1) The hardware level. This level is for benchmarking HPC AI hardware systems and their related system software (Layers 1, 2, 3, and 4). In this context, the other layers should be kept intact unless otherwise stated in the benchmarking rules. The benchmark users should compile the source code of the benchmark implementation, provided by the benchmark committee, on their hardware directly with the allowed changes. Luo et al. [43] show that the same model on different frameworks achieves different accuracy. So, in addition to the same data set and AI model, we mandate that the benchmark implementations also use the same AI framework. The benchmark users can change the hardware, OS, compiler settings, and communication libraries. For the other layers, the benchmark users can only change the parallel mode in Layer 6 or tune the learning rate policies and batchsize settings in Layer 8. It is the benchmark committee's duty to assure the equivalence of Layers 6, 7, 8, and 9 across different benchmark implementations upon the users' requests.

(2) The system level. Because of the portability cost, some benchmark users may opt for one specific AI framework without supporting the others, so specifying a fixed framework has a limited purpose. So, at the system level, we put the hardware system and the AI framework under test (Layers 1, 2, 3, 4, and 5). We mandate that the benchmark implementations use the same data set and AI model. In addition to the changes allowed at the hardware level, the users are allowed to re-implement the algorithms on a different or even customized AI framework (Layer 5). The other layers should be kept intact unless otherwise stated in the benchmarking rules. The benchmark committee or an independent group needs to double-check the equivalence of Layers 6, 7, 8, and 9 between the two benchmark implementations.

(3) The free level. At this level, the specification of an AI task is stated in a paper-and-pencil manner, separate from its specific implementation. That is to say, the same data set, target quality, and training epochs are defined in Layer 9, while the other layers are open for optimizations. The emphasis is advancing the state-of-the-art of software and hardware co-design, so the benchmark users can change any layer from Layer 1 to Layer 8 while keeping Layer 9 intact. Meanwhile, the benchmark users are encouraged to disclose the details.
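To make the three levels concrete, the following minimal sketch (a hypothetical helper of ours, not part of the HPC AI500 release) encodes which layers are under test at each level; the layer indices follow Fig. 2.

```python
# Which of the nine layers (Fig. 2) are under test at each benchmarking level.
# Note: at the hardware level, limited changes to Layers 6 and 8 are also
# allowed (parallel mode, learning rate policy, batchsize); see Section 4.1.
BENCHMARKING_LEVELS = {
    "hardware": {"under_test": [1, 2, 3, 4], "kept_intact": [5, 6, 7, 8, 9]},
    "system":   {"under_test": [1, 2, 3, 4, 5], "kept_intact": [6, 7, 8, 9]},
    "free":     {"under_test": [1, 2, 3, 4, 5, 6, 7, 8], "kept_intact": [9]},
}

def is_change_allowed(level: str, layer: int) -> bool:
    """True if the given layer may be freely changed at the given level."""
    return layer in BENCHMARKING_LEVELS[level]["under_test"]
```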
4.2 Benchmark Selection

We investigate and compare the state-of-the-art and state-of-the-practice AI benchmark suites, including MLPerf [9], AIBench [10], Deep500 [44], and HPC AI500 V1.0 [45]. We present the detailed analysis in Section 9. Fortunately, we found that the methodology of AIBench and its subset combines the merits of the two methodologies discussed in Section 3.

On one hand, AIBench [10] is by far the most representative and comprehensive AI benchmark suite. It contains seventeen representative AI tasks. These workloads are diverse in terms of model complexity, computational cost, convergence rate, computation and memory access patterns, hotspot functions, and other micro-architectural characteristics.

On the other hand, for affordability, AIBench carefully selects a minimum subset from the seventeen AI tasks from the perspectives of model complexity, computational cost, convergence rate, run-to-run variation, and having widely accepted evaluation metrics or not. As shown in Fig. 6, the AIBench subset includes three AI tasks: Image Classification, Object Detection, and Learning to Rank.
Figure 6: The three-task subset of AIBench with respect to the full seventeen benchmarks [10]. The clustering is based on the patterns of computation and memory access of the seventeen AIBench component benchmarks, described by the five metrics listed in Table 3. For visualization, the five-dimensional data are reduced to two dimensions by the t-SNE approach [46].
Table 3: The metrics used by the t-SNE clustering approach [10].

| Metric | Meaning |
|---|---|
| achieved occupancy | The ratio of the average active warps per active cycle to the maximum number of warps provided by a multiprocessor |
| ipc efficiency | The ratio of the executed instructions per cycle to the theoretical number |
| gld efficiency | The ratio of the requested global memory load throughput to the required global memory load throughput |
| gst efficiency | The ratio of the requested global memory store throughput to the required global memory store throughput |
| dram utilization | The utilization level of the device memory relative to the peak utilization |
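As an illustration of this clustering step, the following sketch projects five-dimensional per-benchmark profiles into two dimensions with scikit-learn's t-SNE; the profile values here are placeholders, not the measured data from [10].

```python
import numpy as np
from sklearn.manifold import TSNE

# One row per AIBench component benchmark, one column per metric in Table 3:
# achieved occupancy, ipc efficiency, gld efficiency, gst efficiency,
# dram utilization.
profiles = np.random.rand(17, 5)  # placeholder values, not measured profiles

# Project the five-dimensional profiles to two dimensions for visualization.
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(profiles)
print(embedding.shape)  # (17, 2)
```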
Tang et al. [10] systematically quantify the run-to-run variation of the seventeen AI tasks of AIBench in terms of the ratio of the standard deviation to the mean of the training epochs needed to achieve a convergent quality. The variations of Image Classification, Object Detection, and Learning to Rank are 1.12%, 0%, and 1.90%, respectively; they are the most repeatable benchmarks, which is the other reason for including them in the subset. So we choose the AIBench subset as the HPC AI500 V2.0 candidate benchmarks for implementing scalable HPC AI benchmark tools.

4.3 Repeatability and Replicability
In line with the experimental sciences discussed in [47], we propose benchmarking procedures for assuring repeatability and replicability [48]. We adopt definitions similar to those of the Association for Computing Machinery [39]. Different from reproducibility, which requires changes, repeatability and replicability avoid changes [47].

Repeatability (same team): The benchmarking is performed on the same HPC AI system, using the same benchmark implementation under the same configurations, following the same benchmarking procedures, over multiple trials [47]. The team should submit the raw data of all trials, including the average numbers in addition to their variation. The variation is measured in terms of the ratio of the standard deviation to the mean of the numbers of all trials. To mitigate the influence of the stochastic nature of the AI algorithm, each benchmark should mandate a minimum number of valid runs, and the number of trials should be no less than that minimum.

Replicability (different team) [39]: Replicability refers to another team verifying the benchmarking results on the same HPC AI system, using the same benchmark implementation under the same configurations, following the same benchmarking procedures, over multiple trials. For replicability, the benchmark committee or an independent group needs to verify the numbers on the same system and report the raw data of all trials, including the average numbers in addition to their variation.
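For concreteness, the variation reported above can be computed as follows; a minimal sketch with illustrative epoch counts, not the submission tool itself.

```python
import statistics

def run_to_run_variation(epochs_to_quality):
    """Variation = standard deviation / mean of the epochs needed to reach
    the convergent quality across repeated runs (Section 4.3)."""
    return statistics.stdev(epochs_to_quality) / statistics.mean(epochs_to_quality)

# Illustrative example: five runs of a benchmark reaching the target quality
# after slightly different numbers of epochs.
print(f"{run_to_run_variation([88, 90, 92, 91, 89]):.2%}")
```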
5 The HPC AI500 V2.0 Benchmarks

In this section, we first illustrate how we choose the workloads according to our benchmarking methodology (Section 4). Then we present the datasets, AI models, and reference implementations of HPC AI500. Finally, we introduce the metrics.
With respect to other AI benchmarks, HPC AI benchmarking has two unique differences. First, the challenges of HPC AI benchmarking inherit from the complexity of benchmarking scalable hardware and software systems at scale, i.e., tens of thousands of nodes, significantly different from benchmarking IoT [43] or datacenter [11] systems. On this point, we need to consider the cost of benchmarking at scale. Second, HPC AI domains cover both commercial and high performance scientific computing. Currently, business applications are pervasive. Because of the difficulty of recruiting qualified scientists to label scientific data, the applications in scientific computing lag behind but are promising. In general, scientific data are often more complex than the MNIST or ImageNet data: the shape of scientific data can be 2D images or higher-dimensional structures with hundreds of channels, while popular commercial image data like ImageNet often consist of only three RGB channels [45]. So we should include scientific data in the HPC AI benchmarks. According to our benchmarking methodology discussed in Section 4, we choose the AIBench subset as the HPC AI500 candidate benchmarks for implementing scalable HPC AI benchmark tools.
As the broad HPC AI applications cover both scientific [5–7, 49, 50] and commercial fields [27–30], we choose the most representative workloads and data sets from these two fields.
EWA is one of the pioneering works that use deep learning algorithms to replace the rules predefined by human experts and achieve excellent results [5]. Most important of all, the goal of EWA is to identify various extreme weather patterns (e.g., tropical depression), which is essentially object detection, one of the three benchmarks of the AIBench subset. In 2018, a deep learning based EWA implementation [7] won the Gordon Bell Prize, making it the first AI application to win this award.
Image Classification is widely used in many applications in commercial fields and is a fundamental task in AI research. With the development of large-scale deep learning, Image Classification has become a well-known showcase for optimizing HPC AI systems [27–30], as summarized in Table 1. Image Classification is also one of the three benchmarks of the AIBench subset. We exclude Learning to Rank because it has the lowest computational complexity in terms of FLOPs, only 0.08 MFLOPs for a single forward computation. According to [10], Image Classification and Object Detection are more complex by one or two orders of magnitude, respectively.
Given the stochastic nature of AI, we need to ensure repeatability by choosing relatively stable workloads among the various AI tasks. According to the randomness analysis of AIBench [10], the two most repeatable AI benchmarks are Object Detection and Image Classification, whose variations are 0% and 1.12%, respectively. So they satisfy the property of a good benchmark: being repeatable.
For comprehensive evaluation, the workloads we choose have distinct scaling characteristics. We use the scaling ratio to depict the difficulty of scaling a workload from a single node to multiple nodes. As shown in Table 5, the scaling ratios of EWA and Image Classification are 16.85 and 117.76, respectively, reflecting very different scaling characteristics.
When ranking HPC AI systems, we consider not only the performance but also the achieved quality. Different AI tasks have different levels of stringency in their quality requirements, and our benchmark decision also considers this factor. Of our two benchmarks, EWA has a much more stringent quality requirement than Image Classification.
5.2.1 Extreme Weather Analytics (EWA)

Dataset. The EWA dataset [50] is made up of 26 years of climate data. The data of every year is available as one HDF5 file. Each HDF5 file contains two data sets: images and boxes. The images data set has 1460 example images (4 per day, 365 days per year) with 16 channels. Each channel is 768 * 1152, corresponding to one measurement per 25 square km on earth. The boxes data set records the coordinates of four kinds of extreme weather events in the corresponding images: tropical depression, tropical cyclone, extratropical cyclone, and atmospheric river.
Model. Faster-RCNN targets real-time object detection [51]. As one of the latest models in the RCNN family [52, 53], it deprecates the selective search used in previous RCNN versions. Instead, Faster-RCNN proposes a dedicated convolutional neural network, named the region proposal network (RPN), to achieve nearly cost-free region proposals. With this design, object detection is much faster. As a result, Faster-RCNN won 1st-place entries in ILSVRC'15 (ImageNet Large Scale Visual Recognition Competition).
Quality. The target quality is mAP@[IoU=0.5] = 0.35, which is our best training result. mAP means the mean average precision, a dedicated metric for object detection. IoU means the intersection over union, used to measure how much the predicted boundary overlaps with the ground truth.

5.2.2 Image Classification

Dataset. ImageNet [2] is a large visual database designed for use in visual object recognition research. More than 14 million images have been hand-annotated according to the WordNet hierarchy. Both the original images and bounding boxes are provided. The data size is more than 100 GB.
Model. ResNet is a milestone in Image Classification [54], marking the ability of AI to identify images beyond human performance in a particular domain. The spirit of ResNet is its success in reducing the negative impact of the degradation problem: in a very deep neural network, the gradient gradually vanishes during back-propagation, leading to poor performance. With ResNet, it is possible to build a deeper convolutional neural network and achieve higher accuracy. Researchers successfully built a ResNet with 152 layers, and this ultra-deep model won all the awards in ILSVRC'15.
Quality. The target quality is Top-1 Accuracy = 0.763.

Table 4: The summary of the image data sets of the HPC AI500 V2.0 benchmarks.

| Dataset | Channels | Resolution | Size |
|---|---|---|---|
| The extreme weather dataset [50] | 16 | 768*1152 | 558 GB |
| ImageNet dataset [2] | 3 | 256*256 | 137 GB |
Table 5: The scaling ratio of the HPC AI500 V2.0 workloads.

| Workload | Comm (Parameters/Step) | Comp (GFLOPs/Step) | Comp/Comm (GFLOPs/Parameters) |
|---|---|---|---|
| EWA | 41 million | 691 | 16.85 |
| Image Classification | 25 million | 2944 | 117.76 |
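The Comp/Comm column is simply the per-step computation divided by the per-step communication volume; a quick check of Table 5 (our own sketch, not benchmark code):

```python
def scaling_ratio(gflops_per_step, params_per_step_millions):
    """Comp/Comm ratio from Table 5: GFLOPs per step divided by the
    parameters (in millions) communicated per step."""
    return gflops_per_step / params_per_step_millions

print(scaling_ratio(691, 41))    # EWA: ~16.85
print(scaling_ratio(2944, 25))   # Image Classification: 117.76
```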
The reference implementations of the HPC AI500 V2.0 benchmark are summarized in Table 6. At present, we provide implementations using TensorFlow [41], a popular deep learning framework in the HPC community [55]. For communication, we adopt Horovod [56] instead of the default gRPC protocol in TensorFlow, which does not scale to large clusters [57] due to the limitations of its master-slave architecture and socket-based communication. Horovod is a library originally designed for scalable distributed deep learning using TensorFlow. It implements all-reduce operations using ring-based algorithms [58] and other highly efficient communication algorithms that are widely used in the traditional HPC community.
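To illustrate this setup, here is a minimal sketch of synchronous data-parallel training with Horovod on TensorFlow 1.x, in the spirit of the reference implementations; the model, optimizer, and learning rate are placeholders, not the benchmark code itself.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU

# Pin each process to its local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = build_model()  # hypothetical placeholder for the benchmark model

# Scale the learning rate by the number of workers (linear scaling rule),
# and wrap the optimizer so gradients are averaged via ring all-reduce.
opt = tf.train.MomentumOptimizer(0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```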
5.4 Metrics

We propose two metrics, called Valid FLOPS (in short, VFLOPS) and Valid FLOPS per watt (in short, VFLOPS per watt), to quantify the valid performance and energy efficiency, considering both the system throughput and the model quality. The goal of these two metrics is to impose a penalty on failing to achieve a target quality. VFLOPS and VFLOPS per watt are calculated according to the following formulas.
VFLOPS = FLOPS × penalty_coefficient    (1)

The penalty coefficient is used to penalize or reward the FLOPS if the achieved quality is lower or higher than the target quality. It is defined as follows:

penalty_coefficient = (achieved_quality / target_quality)^n    (2)

Here, achieved_quality is the actual model quality achieved in the evaluation, and target_quality is the state-of-the-art model quality predefined in our benchmarks (Table 6). The value of n is a positive integer that defines the sensitivity to the model quality: the larger n is, the greater the penalty for a quality drop. As EWA has a much more stringent quality requirement than Image Classification, we set n to 10 for EWA and 5 for Image Classification by default. We propose VFLOPS per watt to evaluate energy efficiency.
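A small sketch of the metric computation (Eqs. 1 and 2); the numbers in the example are illustrative, not measured results.

```python
def vflops(flops, achieved_quality, target_quality, n):
    """Valid FLOPS (Eqs. 1 and 2): FLOPS scaled by the quality penalty."""
    penalty_coefficient = (achieved_quality / target_quality) ** n
    return flops * penalty_coefficient

def vflops_per_watt(flops, achieved_quality, target_quality, n, avg_power_watts):
    """Valid FLOPS per watt: the energy-efficiency counterpart of VFLOPS."""
    return vflops(flops, achieved_quality, target_quality, n) / avg_power_watts

# Image Classification (n = 5): a system reaching 0.755 instead of the
# 0.763 target at 30 PFLOPS is scored below its raw throughput.
print(vflops(30e15, 0.755, 0.763, n=5) / 1e15)  # ~28.4 "valid" PFLOPS
```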
Table 6: The HPC AI500 V2.0 benchmark suite. Comm Lib refers to the communication libraries; AI Acc Lib refers to the AI accelerator libraries.

| Problem Domain | Model | Dataset | Target Quality | AI Framework | Comm Lib | AI Acc Lib | Epochs |
|---|---|---|---|---|---|---|---|
| EWA | Faster-RCNN [51] | EWA [50] | mAP@[IoU=0.5]=0.35 | TensorFlow | Horovod | CUDA, cuDNN, NCCL | 50 |
| Image Classification | ResNet-50 v1.5 [54] | ImageNet [2] | Top-1 Accuracy=0.763 | TensorFlow | Horovod | CUDA, cuDNN, NCCL | 90 |
6 Benchmarking Rules

For the fairness and equivalence of benchmarking different HPC AI systems, a series of clear and unambiguous benchmarking rules is mandatory. Our fundamental benchmarking rule is that we put each independent layer (shown in Fig. 2) under test while keeping the other layers intact. Furthermore, for the hardware-level and system-level benchmarking presented in Section 4, we give a detailed description from the perspective of each layer. Finally, we introduce the benchmarking procedures.
6.1 The Hardware-Level Benchmarking Rules

Based on our nine-layer model (Fig. 2), we specify the rules of each layer from top to bottom.

6.1.1 Problem Domain Layer

• The dataset and target quality must be in accordance with the specification of the HPC AI500 V2.0 benchmark discussed in Section 5.
• The number of training epochs should be the same as in the reference implementation to guarantee an equivalent computational cost, namely 90 epochs for ImageNet and 50 epochs for EWA. Note that an epoch is an iteration over the entire data set, while a step refers to one update of the model parameters. The number of epochs is based on our experimental observation, and it should be updated in the future, as should the target qualities.
6.1.2 Hyper-parameters Setting Layer

The rules of the hyper-parameters setting layer include three parts: the batchsize setting, the learning rate policies, and the other hyper-parameter settings.

Batchsize Setting. The batchsize of a training step is allowed to change, to fully utilize the computing capability of the system.
Learning Rate Policies. Previous work shows that increasing the batchsize leads to a fall in model quality [59]. In this context, many learning rate policies have been proposed [19, 20, 26, 60]. With state-of-the-art learning rate policies, we can increase the training batchsize to fully utilize the hardware resources while preserving the model quality. As each learning rate policy has its limitation in terms of the maximum supported batchsize, our rules allow benchmark users to propose new learning rate policies to fully utilize the hardware resources. Meanwhile, we provide a default learning rate policy.
The default learning rate policy: The default learning rate policy of HPC AI500 is a linear scaling rule plus a warm-up rule, described as follows (see the sketch after Fig. 7):

• The linear scaling rule: multiply the base learning rate η by k when the batchsize is multiplied by k. The goal of the linear scaling rule is to make SGD updates similar in both distributed and single-worker training [19].
• The warm-up rule: gradually increase the learning rate from a small value until it equals η × k. After warm-up, the learning rate follows the original learning rate schedule (e.g., cosine decay). The warm-up rule is needed because the linear scaling rule alone breaks down when the weights of the neural network are changing rapidly in the early training stage [19].

Fig. 7a shows the learning rate curve after applying the linear scaling and warm-up rules. We also perform a series of experiments to show the effect of this policy on model quality. As shown in Fig. 7b, the linear scaling and warm-up rules improve the top-1 accuracy of Image Classification from 61.48% to 76.34% when the batchsize is 8192.
Figure 7: (a) The learning rate curve and (b) its effect on accuracy with the linear scaling and warm-up rules. The benchmark is Image Classification and the system scale is 64 GPUs. The experiment configuration is consistent with that of Table 9. The batchsize is 8192.
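The following sketch shows one way to implement the default policy; the warm-up length and the post-warm-up cosine decay are illustrative choices, not values mandated by the rules.

```python
import math

def default_lr(step, base_lr, k, warmup_steps, total_steps):
    """Linear scaling plus warm-up (Section 6.1.2), followed by cosine decay."""
    target_lr = base_lr * k  # linear scaling rule: scale by the batchsize factor k
    if step < warmup_steps:
        # Warm-up rule: ramp linearly from near zero up to base_lr * k.
        return target_lr * (step + 1) / warmup_steps
    # After warm-up, follow the original schedule; cosine decay is one example.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * target_lr * (1.0 + math.cos(math.pi * progress))
```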
Other learning rate policies: Besides the linear scaling and warm-up scheme, state-of-the-art learning rate policies (e.g., LARS [26] and LAMB [60]) are allowed. For newly proposed ones, benchmark users should open-source their methods.
Other Hyper-parameters Settings. Except for the batchsize and the learning rate policy, the other hyper-parameters, such as weight decay and momentum, must be the same as in the reference implementation.

6.1.3 Workload Layer
The AI algorithms in the workload must be the same as in the reference implementation.

6.1.4 Programming Model Layer

• Data parallelism and model parallelism are both allowed as long as mathematical equivalence is preserved.
• Synchronous Stochastic Gradient Descent (SGD) must be used in training, since asynchronous SGD may (a) introduce randomness, (b) destroy the mathematical equivalence, and (c) decrease the accuracy.
6.1.5 AI Framework Layer

The AI framework must be the same as in the reference implementation.
6.1.6 Communication Layer

In synchronous communication, the workers in the cluster must wait until all the workers have finished before proceeding to the next iteration. We allow different communication policies in a synchronous mode.
Table 7: Some common communication topologies for all-reduce.

| Topology | Applications |
|---|---|
| Butterfly | OpenMPI [61] |
| Double binary tree | NCCL [62] |
| Ring | Baidu DeepSpeech [63], Horovod [56] |
| Hierarchical ring | Horovod |

• Based on AllReduce. Table 7 shows the common topologies used in AllReduce implementations. The benchmark users are allowed to utilize these existing topologies or to propose new ones according to the configuration of their systems. For example, researchers from Lawrence Berkeley National Laboratory achieved exascale FLOPS by customizing a communication topology of AllReduce on Summit [7].
• Based on MapReduce. The communication topology is determined by the implementation of MapReduce. The distributed training of Spark MLlib, SystemML, and REEF is based on MapReduce. Users are allowed to implement customized MapReduce on their systems.
• Based on a parameter server. It is mandatory that only the synchronous mode be used for the parameter server, although it also supports asynchronous training.

6.1.7 AI Accelerator Layer

• Benchmark users can choose the AI accelerator library to achieve the best performance out of the system.
• Single-precision floating point (FP32), half precision (FP16, BFLOAT16 [31]), and quantization (INT8, INT4) are allowed.

6.1.8 OS Layer

• Benchmark users can adjust OS configurations (such as the CPU-affinity setting) to achieve the best performance out of the system.
• Benchmark users can choose the '-O2' compiler optimization option when compiling the benchmarks and the runtime environment software.
6.1.9 Hardware Layer

Benchmark users can adjust hardware configurations (such as the hyper-threading setting and the memory-prefetching setting) to achieve the best performance out of the system.
6.2 The System-Level Benchmarking Rules

As discussed in Section 4.1, at the system level, we put the hardware system and the AI framework under test. Therefore, in addition to the rules defined at the hardware level, benchmark users are allowed to re-implement the benchmark using a different or even customized AI framework at the AI framework layer.
6.3 Benchmarking Procedures

Benchmark users need to download the source code of the benchmarks from the BenchCouncil web site.

• Timing rules: timing starts when the workload reads the first batch of training data and ends when the target number of epochs is reached.
• Runs: according to the variations of EWA and Image Classification in Table 2, the minimum numbers of runs are 5 and 10, respectively, to reduce run-to-run variation. For reporting, we drop the runs with the highest and lowest results, then calculate the arithmetic mean of the remaining results, as sketched below.
• Benchmarking scores: 1) time-to-quality is the training time to the achieved quality; 2) FLOPS refers to the single-precision floating point operations (or equivalent operations) per second, where the equivalent operations include but are not limited to FP16, BFLOAT16, INT8, and INT4; 3) VFLOPS and VFLOPS per watt follow the definitions in Section 5.4.
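A sketch of the reporting rule above (drop the extreme runs, average the rest); the run values are illustrative, not measured data.

```python
def reported_score(run_results):
    """Drop the highest and lowest runs, then report the arithmetic mean
    of the remaining results (Section 6.3)."""
    trimmed = sorted(run_results)[1:-1]
    return sum(trimmed) / len(trimmed)

# Five illustrative time-to-quality results (minutes) from five valid runs.
print(reported_score([28.1, 27.5, 29.0, 27.9, 28.4]))  # mean of the middle three
```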
The reported results should include the following parts:

• The description of the system under test, including but not limited to: 1) detailed descriptions of the parameters of the CPUs and AI accelerators in a single node; 2) detailed descriptions of the parameters of the intra-node connections in a single node; 3) detailed descriptions of the parameters of the OS in a single node; 4) detailed descriptions of the parameters of the runtime environment software in a single node; 5) detailed descriptions of the parameters of the inter-node connections in the system; 6) detailed descriptions of the parameters of the runtime environment software in the system.
• Benchmark configurations, including but not limited to: 1) all hyper-parameter settings; 2) detailed descriptions of the communication.
• Benchmarking scores, including time-to-quality, FLOPS, FLOPS per watt, VFLOPS, and VFLOPS per watt in all runs. These metrics should be submitted with the output log of the benchmark.
• The source code, relevant documents, and running scripts should be uploaded to BenchHub, the official code repository managed by BenchCouncil.

The BenchCouncil community is responsible for checking the replicability of the reported results and reviewing the code.
A lot of previous work [27–30] focuses on accelerating Image Classification/ResNet-50 training. These efforts reduce the training time from hours to minutes. In this section, we take Image Classification as an example to explain why equivalent benchmarking rules matter for fairly ranking HPC AI systems.

Batch normalization is a common and effective method to improve model generalization [64]. The trainable parameters of batch normalization, γ and β, are used to restore the representation ability of the network. Jia et al. [29] propose eliminating the weight decay on the γ and β of the batch normalization layers, which is a significant algorithmic innovation in their work. We re-implement this algorithm-level optimization in accordance with [29]. Further, we use VFLOPS as the metric to quantify the performance gap.

The benchmarking results are shown in Table 8. The accuracy gain and the corresponding VFLOPS ratio are reported against the version without removing the weight decay. We find that as the system scale becomes larger, this optimization has a greater impact on the achieved quality. The accuracy gain is 0.45% at the scales of 16 and 32 GPUs, and then jumps to 1.38% at the scale of 64 GPUs, which is a notable improvement. We calculate the VFLOPS ratio according to the formula discussed in Section 5.4 for each system scale. At the system scale of 64 GPUs, the VFLOPS ratio is as high as 1.10, which is essentially the gain contributed solely by the algorithmic innovation.

Consider the following case: we perform a comparison between two HPC AI systems using the same benchmark. One benchmark user leverages this algorithmic innovation, while the other does not. If we do not exclude this case in the benchmarking rules, the benchmarking results will be unfair. That is the reason why we mandate that the other hyper-parameter settings in Layer 8 be kept intact, as shown in Fig. 2. One may question why our rules allow changing the learning rate policies in Layer 8 (Fig. 2). As discussed in Section 6, this is because, to fully utilize the hardware resources, the users have to change the learning rate policies.
Table 8: The impact of removing the weight decay on the batch normalization (BN) layers at different system scales. The benchmark is Image Classification and the accuracy is measured by top-1 accuracy. The VFLOPS ratio refers to the ratio of the VFLOPS after the optimization to the one without the optimization.

| System Scale | Batchsize | Accuracy Gain | VFLOPS Ratio |
|---|---|---|---|
| 16 GPUs | 2048 | +0.45% | 1.03 |
| 32 GPUs | 4096 | +0.45% | 1.03 |
| 64 GPUs | 8192 | +1.38% | 1.10 |

7 The HPC AI Roofline Performance Model
Given a specific HPC AI system, the theoretical peak performance number can be calculated according to the hardware configurations. However, the theoretical peak is hard to achieve. Hence, we need a performance model to help achieve the upper bound performance of an HPC AI system. The original Roofline model [65] is an upper bound performance model based on FLOPS and operation intensity (OI): the total number of floating point instructions divided by the total number of bytes of memory accesses. With the aid of a Roofline model, we can decide whether a workload is memory-bound or compute-bound. Moreover, potential optimization strategies can be recommended according to the different ceilings of the Roofline model. To date, there is no such performance model available for HPC AI systems. In this section, we first analyze the distinctive characteristics of an HPC AI system, and then propose an HPC AI Roofline model.
An HPC AI system is a distributed system consisting of multiple nodes, each of which is heterogeneous and equipped with multiple CPUs and AI accelerators, as shown in Fig. 8. The CPUs of each node are responsible for scheduling tasks and communicating with other nodes. The AI accelerators are responsible for the AI calculations. Each AI accelerator loads or stores data from its memory units through memory channels, and all AI accelerators of a node are connected with a specific high-speed network (e.g., NVLink for GPUs). The distributed nodes are interconnected by a general high-speed network (e.g., high-speed Ethernet). Hence, the communications include both inter-node and intra-node ones. Our analysis in Section 8.4 reveals that communication efficiency is one of the dominant factors that impact performance.
Figure 8: The architecture of an HPC AI system.
When proposing the HPC AI Roofline models, we consider the distinctive characteristics of HPC AI systems and the huge impact of communication efficiency on their performance. Significantly different from the original Roofline model [65], which emphasizes the impact of computation (FLOPS) and memory access (OI) on the overall performance, our HPC AI Roofline model emphasizes the impact of communication and computation. We propose an innovative metric, named communication operation intensity (in short, COI), to replace OI. COI is defined as the total number of floating point instructions divided by the total number of communication bytes.

Considering the different modes of inter-node communication (general high-speed network) and intra-node communication (specific high-speed network), our HPC AI Roofline model is a combination of a single-node model and a distributed model.

We use FLOPS as the metric to depict the upper bound performance. Unlike the original Roofline model [65], which uses double-precision floating point operations per second, we use single-precision floating point operations, or equivalent operations such as mixed-precision floating point operations, per second. This is because double-precision floating point operations are rarely required for deep learning workloads, while single-precision or mixed-precision floating point operations are prevalent.

Intentionally, we do not choose VFLOPS as the performance metric. This is because the purpose of the Roofline model is to decide the performance bound of a workload and guide its system-level and hardware-level optimizations. Instead, VFLOPS is a composite metric reflecting both performance and accuracy to rank HPC AI systems.
The single-node HPC AI Roofline model is formulated as follows:

FLOPS_Attained = min(FLOPS_Peak, ComBand_Peak × COI)    (3)

ComBand_Peak is the theoretical peak communication bandwidth of a single-node HPC AI system, which is the bandwidth of the interconnections among the AI accelerators. FLOPS_Peak is the theoretical peak FLOPS of a single-node HPC AI system, which is the aggregate theoretical peak FLOPS of all AI accelerators. The communication operation intensity COI is obtained by COI = FLOPs / CT, where CT is short for the communication traffic: the total number of communication bytes among the AI accelerators. To more accurately reflect the performance bottleneck of a given workload, different ceilings are added to help locate the bottlenecks and provide potential optimization recommendations.

We use CONV (convolution) and GEMM (GEneral Matrix to Matrix Multiplication) to measure the upper bound performance of the system. On one hand, they are the two most frequently appearing kernel functions of the seventeen benchmarks of AIBench; on the other hand, their computing patterns, i.e., fusable multiply and add calculations, allow them to make more efficient use of the accelerators.

FLOPS_Attained is the performance that a workload can attain, and the attained performance bound of a given workload under the ceilings is formulated as follows:

FLOPS_Attained = min(FLOPS_Ceiling, ComBand_Ceiling × COI)    (4)
Ceiling ∗ COI ) (4) For the distributed model, we propose using COI (communication operation intensity) and FLOPS todepict the upper bound performance. The model is formulated as follows.
FLOPS
Attained = Min ( FLOPS
Peak , ComBand
Peak ∗ COI ) (5)The ComBand
Peak is the theoretical peak communication bandwidth of the distributed system, i.e., thetheoretical bandwidth of the high speed Ethernet.
FLOPS
Peak is the theoretical peak FLOPS of thedistributed system, which is the aggregate theoretical FLOPS of all AI accelerators in the distributedsystem. The communication operation intensity–
COI is obtained by
COI = FLOPs / CT , where thecommunication traffic– CT is the total byte number of communications among all AI accelerators in thedistributed system. To more accurately reflect the performance bottleneck of a given workload, we addseveral ceilings, and the attained performance bound of a given workload is formulated as follows. FLOPS
Attained = Min ( FLOPS
Ceiling , ComBand
Ceiling ∗ COI ) (6)19 a) The Single-Node version. (b)
The Distributed version.
Figure 9:
The HPC-AI Roofline Model.
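The following sketch evaluates the attained-performance bound of Eqs. 3 to 6; the per-step FLOPs and communication traffic in the example are illustrative values in the spirit of Table 5 (assuming FP32 gradients of 4 bytes per parameter), not measured numbers.

```python
def roofline_bound(flops_peak, comband_peak, flops_per_step, comm_bytes_per_step):
    """Attained performance bound (Eqs. 3/5); also works for ceilings (Eqs. 4/6)
    by passing FLOPS_Ceiling and ComBand_Ceiling instead of the peaks."""
    coi = flops_per_step / comm_bytes_per_step  # communication operation intensity
    return min(flops_peak, comband_peak * coi)

# Single-node example: 8 V100s, 1040 TFLOPS mixed-precision peak, 300 GB/s NVLink.
# Per-step values mimic Image Classification in Table 5: 2944 GFLOPs and
# 25 million FP32 gradients (~100 MB) exchanged per step.
bound = roofline_bound(flops_peak=1040e12, comband_peak=300e9,
                       flops_per_step=2944e9, comm_bytes_per_step=25e6 * 4)
print(f"{bound / 1e12:.0f} TFLOPS")  # compute-bound in this example
```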
We perform a case study of our HPC AI Roofline models on an experimental system. The system consists of eight nodes, each of which is equipped with one Intel(R) Xeon(R) Platinum 8268 CPU and eight NVIDIA Tesla V100 GPUs. Each GPU in a node has 32 GB of HBM memory and is connected by NVIDIA NVLink, a high-speed GPU interconnect with a theoretical peak bi-directional bandwidth of 300 GB/s. The nodes are connected with an Ethernet network with a bandwidth of 10 Gb/s. Each node has 1.5 TB of system memory and an 8 TB NVMe SSD disk.
As shown in Fig. 9a, the y-axis is the performance in terms of floating point operations per second, while the x-axis is the communication operation intensity: the floating point operations divided by the total number of communication bytes. In Fig. 9a, the peak computation rate forms the 'flat' part, while the communication bandwidth forms the 'slanted' part. So, if the communication operation intensity is low, the workload is communication-bound, falling under the slanted part of the roofline. With sufficient communication operation intensity, the workload is compute-bound.

We add four computation ceilings: mixed-precision GEMM (the performance of the mixed-precision floating point implementation of GEMM), single-precision GEMM, mixed-precision CONV, and single-precision CONV. The single-precision setting is commonly used in the AI domain, while mixed precision is one of the optimization features of some advanced AI accelerators.

The best case for eight GPUs is that communication and computation totally overlap, and the memory bandwidth becomes the bottleneck. We therefore add one communication ceiling: the memory bandwidth. In Fig. 9a, the theoretical peak mixed-precision FLOPS, the mixed-precision GEMM ceiling, the mixed-precision CONV ceiling, the single-precision GEMM ceiling, and the single-precision CONV ceiling are 1040 TFLOPS, 636 TFLOPS, 176 TFLOPS, 115 TFLOPS, and 112 TFLOPS, respectively. Note that the gap between the theoretical peak number and the actual one exists because the performance of CONV and GEMM is affected by the dimensions and sparsity of the input data, the NCHW format, and the output channels. Additionally, the convolution kernel also greatly impacts the performance of CONV. Different input sizes of CONV and GEMM lead to different performance numbers. The NVLink ceiling is the theoretical peak bandwidth of the communications among GPUs, 300 GB/s, and the memory bandwidth ceiling is the theoretical peak bandwidth of the memory, 1134 GB/s.
Our system consists of eight nodes. All the GPUs in the same node are connected by NVIDIA NVLink, and the nodes are connected with a 10 Gb/s Ethernet network. In Fig. 9b, the peak computation rate forms the 'flat' part, while the communication bandwidth (the Ethernet networking bandwidth) forms the 'slanted' part. The theoretical peak FLOPS of the system is 8320 TFLOPS, and the communication ceiling is 1.2 GB/s. We add four computation ceilings: mixed-precision GEMM, single-precision GEMM, mixed-precision CONV, and single-precision CONV; their values are 5091, 920, 2376, and 976 TFLOPS, respectively. The best case for the HPC AI system is that all communications stay within the nodes, so we add one communication ceiling, the NVLink bandwidth, which is 300 GB/s.

Table 9: Hardware configuration details.

System configurations:
  Num of nodes: 8
  GPUs per node: 8
  Total num of GPUs: 64
  Peak theoretical performance (FP32): 960 TFLOPS
  Peak theoretical performance (Mixed): 7680 TFLOPS
  Interconnection: Ethernet, 10 Gb/s

Single-node configurations:
  CPU type: Intel(R) Xeon(R) Platinum 8268 CPU
  Memory: 1.5 TB, DDR4
  Disk: 8 TB, NVMe SSD
  GPU type: NVIDIA Tesla V100
  GPU memory: 32 GB, HBM
  Intraconnection: NVLink
8 Evaluation

In this section, we introduce the experimental configurations in Section 8.1 and present how we measure FLOPs in Section 8.2. Then, we perform an in-depth performance analysis of a single node in Section 8.3 and of multiple nodes in Section 8.4, respectively. Finally, we demonstrate how to use our roofline models to guide the optimizations of HPC AI systems in Section 8.5.
8.1 Experimental Configurations

Our experiments are conducted on an HPC AI system consisting of eight nodes, each of which is equipped with one Intel(R) Xeon(R) Platinum 8268 CPU and eight NVIDIA Tesla V100 GPUs. Each GPU in the same node has 32 GB HBM memory and is connected by NVIDIA NVLink, a high-speed GPU interconnect whose theoretical peak bi-directional bandwidth is 300 GB/s. The nodes are connected with a 10 Gb/s Ethernet network. Each node has 1.5 TB of system memory and an 8 TB NVMe SSD disk.

Each NVIDIA Tesla V100 GPU is based on the NVIDIA Volta architecture, which is equipped with 640 Tensor Cores to accelerate GEMM and convolution operations. Each Tensor Core performs 64 floating-point fused multiply-add (FMA) operations per clock, delivering up to 125 TFLOPS of theoretical peak performance. When performing mixed-precision training with Tensor Cores, we use FP16 for calculation and FP32 for accumulation [18].

We use TensorFlow v1.14, compiled with the CUDA v10.1 and cuDNN v7.6.2 backend. We use Horovod v0.16.4 for synchronous distributed training, compiled with OpenMPI v3.1.4 and NCCL v2.4.8. NCCL is short for the NVIDIA Collective Communications Library, a closed-source library of topology-aware multi-GPU collective communication primitives.
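For context, the following is a minimal sketch of how synchronous data-parallel training is typically wired up on this software stack (Horovod with the TensorFlow 1.x API); it is illustrative rather than the benchmark's actual code:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one training process (rank) per GPU

# Pin each rank to its local GPU; this config would be passed to the session.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of ranks (linear scaling rule),
# then wrap the optimizer so gradients are averaged with NCCL allreduce.
opt = tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variables from rank 0 so all ranks start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```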
8.2 FLOPs Measurement

The source-code-level measurement of FLOPs is difficult for a complex AI model implemented with a complex AI framework. Mainstream frameworks like TensorFlow and PyTorch adopt computational graphs and map them to specific computing engines, e.g., GPUs and cuDNN. This process invokes numerous kernels, each of which contributes a portion of the FLOPs. Hence, we would need to figure out the implementation of each invoked kernel to obtain the FLOPs of an entire AI model. Unfortunately, the source code is not publicly available, as the NVIDIA libraries like CUDA and cuDNN are not open source.

We instead use NVProf [66], a performance analysis tool for NVIDIA GPUs, to measure the FLOPs in our experiments. NVProf can collect profiling data from hardware performance counters, but it has a huge overhead, slowing down execution by more than a hundred times; profiling the whole training session of a deep learning model is thus prohibitively costly. Previous work [67, 68] has found that each iteration of model training has the same computation logic and that the number of iterations has little impact on micro-architectural behaviors. So, for efficiency, we sample a part of the training set and calculate the FLOPs. As the EWA and ImageNet datasets contain 13.14k and 1280k images, respectively, we sample 500 images from EWA and 12800 images from ImageNet. The throughput is calculated according to the following equation: Throughput = N × R × C, where N is the number of images processed by each training process per second, R is the total number of ranks (the number of training processes), and C is the FLOPs per image.

Table 10: The FLOPs per image.

Dataset                Image Sample Size    Total FLOPs       FLOPs Per Image
EWA                    500                  345.66 TFLOPs     691 GFLOPs
Image Classification   12800                2877.06 TFLOPs    23 GFLOPs
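As a worked example of the throughput equation (the helper name is ours; the numbers come from Tables 10 and 11):

```python
def throughput_tflops(n_images_per_sec, r_ranks, c_gflops_per_image):
    # Throughput = N * R * C, converted from GFLOPS to TFLOPS.
    return n_images_per_sec * r_ranks * c_gflops_per_image / 1e3

# EWA on one node: 8 ranks processing 46 / 8 images/s each at
# 691 GFLOPs per image reproduces the ~31 TFLOPS reported in Table 11.
print(throughput_tflops(46 / 8, 8, 691))  # -> ~31.8
```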
Table 11: The performance summary of a single node.

Workloads              Model            Precision   GFLOPs (Per Image)   Throughput (Images/s)   Attainable Performance (TFLOPS)   Achieved Performance Ratio (%)
Image Classification   ResNet-50 V1.5   FP32        23                   2624                    58                                48
Image Classification   ResNet-50 V1.5   Mixed       23                   5734                    126                               105
EWA                    Faster R-CNN     FP32        691                  46                      31                                26

The attainable performance refers to the performance obtained in the testing. The achieved performance ratio refers to the ratio of the attainable performance to the theoretical peak performance (FP32). Mixed refers to FP32 & FP16 mixed precision.
Figure 10: The details of the single-node performance analytics of Image Classification (FP32 and Mixed) and EWA (FP32). We classify the kernels invoked on the GPU into eight categories (convolution, GEMM, batch normalization, element-wise, pooling, memcpy, NCCL Allreduce, and data arrangement) and use three metrics to depict their characteristics: the proportion of time, instructions per cycle (IPC), and DRAM utilization. The GPU utilization during the overall training session is also recorded. An asterisk (*) indicates that the number is negligible, i.e., less than 0.001%.

8.3 Single-node Evaluation

In this subsection, we first report the execution efficiency on a single node, and then perform communication and computation analytics to uncover the factors that significantly impact the performance. We use the HPC AI500 V2.0 benchmarks.
Based on the methodology described in Section 8.2, we report the performance efficiency of the two benchmarks, Image Classification and EWA, on a single node. We evaluate both the FP32 and mixed-precision implementations; the latter uses Tensor Cores to accelerate the training session. As the memory footprint required by the mixed-precision implementation is nearly half that of FP32, we double the batch size in each training step for mixed precision without breaking the benchmarking rules defined in Section 6. Table 11 shows the performance efficiency of the two benchmarks. The achieved performance ratio is the ratio of the attainable performance to the theoretical peak performance of the FP32 implementation. In our experiments, the theoretical peak number is 120 TFLOPS, which is the theoretical peak single-precision (FP32) performance of one GPU (15 TFLOPS) multiplied by 8, the number of NVIDIA Tesla V100 SXM2 GPUs. From Table 11, we find that the performance efficiency of EWA is extremely low compared with that of Image Classification. We further characterize their computation and communication characteristics to uncover the factors.
Figure 11: The timeline of Horovod communication. The timeline is divided into a negotiation phase (NEGOTIATE_ALLREDUCE) and a processing phase (ALLREDUCE); the processing phase comprises six steps: Wait_for_data, Wait_for_other_data, Queuing, Memcpy_in, Nccl_allreduce, and Memcpy_out.
We first perform communication analytics, using a timeline analysis tool [69] to record all activities of the Horovod communication, since its synchronous distributed manner may significantly affect the performance. As shown in Fig. 11, the communication timeline of Horovod is divided into two phases: negotiation and processing. In the negotiation phase, all training processes send a signal to the first process to confirm that they are ready for the subsequent tensor reduction. In the processing phase, the tensor reduction is performed. Specifically, the processing phase is further divided into six steps. Steps 1 (Wait for data) and 2 (Wait for other data) wait for the data produced by GPU computing, which is the input to the allreduce operations. Step 3 (Queuing) happens only when the previous allreduce has not finished. Step 4 (Memcpy in) copies data into the fusion buffer. Step 5 (NCCL Allreduce) is the core part that executes the allreduce operation across all the training processes. Step 6 (Memcpy out) moves the data out of the fusion buffer.

We profile the average wall clock time of all steps and compare EWA against Image Classification. We find that the long negotiation phase is one main factor leading to the inefficient communication of EWA. As shown in Table 12, the average negotiation allreduce of EWA accounts for 28.5% of the total duration of Horovod communication, 2.5 times that of Image Classification. The root cause is the side effect of the centralized scheduling strategy of the Horovod negotiation. As mentioned before, the first process during the negotiation acts as a centralized scheduler that avoids deadlock by reordering all the allreduce operations across processes: it receives the messages from all processes and sends back the correct tensor list to be reduced. EWA needs to execute the allreduce operation more than one hundred times and has about 41 million gradients in total to be reduced during each training step, and thus spends too much time on the first process. Another factor is the sub-optimal overlap between computation and communication. According to Table 12, the total duration of Wait for data and Wait for other data in EWA and Image Classification is 4.6 ms and 112.4 ms, respectively, while the duration of NCCL Allreduce is 66.2 ms and 4.15 ms, respectively. These numbers indicate that EWA has a worse overlap between computation and communication than Image Classification. Besides, Queuing in EWA is up to about 65.8 ms, showing that the NCCL Allreduce operation has to wait for a long duration. In contrast, the duration of Queuing and Wait for data of Image Classification is 0.043 ms and 85.4 ms, respectively, indicating that Image Classification has a better overlap between communication and computation than EWA.

Table 12: The time breakdown of the Horovod communication.

Phase         Step                    EWA         Image Classification
Negotiation   Negotiation Allreduce   54.837 ms   22.836 ms
Processing    Wait for data           1.746 ms    85.418 ms
Processing    Wait for other data     2.961 ms    27.036 ms
Processing    Queuing                 65.863 ms   0.043 ms
Processing    Memcpy in               0.108 ms    1.256 ms
Processing    NCCL Allreduce          66.228 ms   4.153 ms
Processing    Memcpy out              0.197 ms    0.993 ms

In addition to the communication analytics, we also conduct computation analytics through a thorough profiling of GPU activities using NVProf [66]. Fig. 10 shows the results. There are thousands of CUDA kernel invocations during each training step. For simplicity, we classify all the kernel functions into eight categories, each representing a kind of operation: convolution, GEMM, batch normalization, element-wise, pooling, memcpy, NCCL Allreduce, and data arrangement. For EWA, we find that NCCL Allreduce (35.97%) and memcpy operations occupy 50.62% of the time in total, leading to poor performance. For Image Classification, the most time-consuming kernel is convolution, occupying 35.02% and 22.18% of the time in the FP32 and mixed-precision implementations, respectively.

We also notice that the overhead of data arrangement occupies 15.61% in the mixed-precision implementation, but less than 0.0001% in the FP32 implementation. The huge overhead in the mixed-precision implementation is incurred by converting between the different data layouts of the TensorFlow and CUDA kernels. The data layout of the TensorFlow kernels is a quadruple tuple (batch size, channels, height of data sample, width of data sample), abbreviated as NCHW, while the data layout of the CUDA kernels is a quadruple tuple (batch size, height of data sample, width of data sample, channels), abbreviated as NHWC. This inconsistency incurs a huge overhead. It explains why the speedup of the mixed-precision version of Image Classification is only 2.16x, much smaller than the results published by Nvidia [70], which claims that mixed-precision training can bring up to an 8x speedup on the Tesla V100 GPU.
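For reference, the bucketing of profiled kernels into these categories can be done by matching kernel names; the sketch below is our own illustration (the substring rules are assumptions, since real cuDNN/cuBLAS kernel names vary across versions), not the authors' tool:

```python
# Illustrative mapping from kernel-name substrings to the eight categories.
KERNEL_CATEGORIES = {
    "Convolution": ("conv", "winograd"),
    "GEMM": ("gemm",),
    "BatchNormalization": ("batchnorm", "bn_"),
    "ElementWise": ("elementwise", "eltwise"),
    "Pooling": ("pool",),
    "Memcpy": ("memcpy",),
    "NCCL Allreduce": ("allreduce", "nccl"),
    "Data Arrangement": ("transpose", "nchw", "nhwc"),
}

def categorize(kernel_name):
    """Assign a profiled kernel name to one of the eight categories."""
    name = kernel_name.lower()
    for category, keywords in KERNEL_CATEGORIES.items():
        if any(keyword in name for keyword in keywords):
            return category
    return "Other"
```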
8.4 Multiple-Node Evaluation

We perform several scaling experiments on the distributed system described in Section 8.1. Both the EWA and Image Classification experiments are scaled out from 8 GPUs to 64 GPUs, taking the 8-GPU experiments (a single node) as the baseline. Our communication topology is the double binary tree [62], which is implemented in NCCL 2.4. We report the performance numbers of these experiments and perform further analysis using the HPC AI roofline models proposed in Section 8.5. The scaling results are shown in Fig. 12.

Figure 12: The scaling experiments of EWA and Image Classification. (a) Image Classification (FP32). (b) Image Classification (Mixed). (c) EWA (FP32). (d) Image Classification (FP32+Compression). (e) Image Classification (Mixed+Compression). (f) EWA (FP32+Compression).
For the FP32 implementation of Image Classification, the parallel efficiency is 0.91, 0.85, and 0.71 on 16, 32, and 64 GPUs, respectively. For the mixed-precision implementation, the parallel efficiency is slightly lower: 0.89, 0.82, and 0.67, respectively. There is a notable loss of parallel efficiency when the system scale reaches 64 GPUs. We also notice that communication compression does not bring any performance improvement when the system scale is 32 GPUs or less; however, at 64 GPUs it contributes substantially. For the FP32 version, the performance improves from 345 to 414 TFLOPS; for the mixed version, from 718 to 939 TFLOPS. According to our HPC AI Roofline model shown in Fig. 9b, there is a performance-bound shift when the system scale changes from 32 to 64 GPUs. Specifically, when the system scale is less than or equal to 32 GPUs, Image Classification's communication ceiling is dominated by NVLink's bandwidth, and the workload is computation-bound; hence, communication compression cannot improve the performance. When the system increases to 64 GPUs, the communication ceiling is dominated by Ethernet's bandwidth, and the workload becomes communication-bound, which is why communication compression works. The highest performance of Image Classification that we achieve is 939 TFLOPS, through both mixed-precision optimization and communication compression, as shown in Fig. 12e.
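Parallel efficiency here is measured against the 8-GPU baseline; a minimal sketch of the computation (the helper is ours, and the example values are illustrative):

```python
def parallel_efficiency(attained_tflops, n_gpus, baseline_tflops, baseline_gpus=8):
    # Ratio of the measured performance to perfectly linear scaling
    # of the single-node baseline.
    return attained_tflops / (baseline_tflops * n_gpus / baseline_gpus)

# Illustrative: a 64-GPU run attaining 4x the 8-GPU baseline performance
# has a parallel efficiency of 4 / (64 / 8) = 0.5.
print(parallel_efficiency(4 * 58, 64, 58))  # -> 0.5
```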
For the FP32 implementation of EWA, the parallel efficiency is 0.50, 0.37, and 0.36 at the system scale of 16, 32, and 64 GPUs, respectively. According to the Roofline model shown in Fig. 9b, the bottleneck is always the communication bandwidth; therefore, communication compression achieves good results. With communication compression, the performance gain persists as the scale increases from 8 to 16, 32, and 64 GPUs, with speedups of 1.2, 1.4, 1.6, and 1.5, respectively. The highest performance of EWA achieved through communication compression is 109 TFLOPS.

Figure 13: The distinct communication bandwidth consumption of the FP32 implementations of EWA and Image Classification.
For EWA and Image Classification, we find that their different parallel efficiencies are due to distinct communication bandwidth consumption. As shown in Fig. 13, we measure the communication bandwidth consumption of the FP32 implementations of EWA and Image Classification. EWA consumes much more communication bandwidth than Image Classification. In contrast, the performance of Image Classification largely depends on the computation efficiency, especially when the scale is less than or equal to 32 GPUs. In conclusion, 10 Gb/s Ethernet cannot satisfy the communication requirement of EWA, and hence results in poor parallel efficiency.
The metric of VFLOPS emphasizes both performance and quality. Fig. 14 shows the rankings of HPC AI systems at different scales with mixed-precision or FP32 implementations. The highest performance is 642 TVFLOPS, achieved through the mixed-precision optimization at the scale of 64 GPUs. Meanwhile, an auxiliary metric, time-to-quality, is also reported. Generally, our metric is simple and intuitive.
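For reference, a minimal sketch of the VFLOPS computation; the power-law form of the quality penalty and the default exponent shown here are our assumptions for illustration, and the exact definition is the one given in the paper's metrics section:

```python
def vflops(flops_tflops, achieved_quality, target_quality, n=10):
    """Valid FLOPS sketch: measured FLOPS scaled by a penalty for missing
    the target quality (assumed form: (achieved / target) ** n)."""
    penalty = (achieved_quality / target_quality) ** n
    return flops_tflops * penalty
```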
Figure 14: The VFLOPS rankings of HPC AI systems using Image Classification.

8.5 The Case Study of Using HPC-AI Roofline Models
This section presents a case study on how to use our proposed HPC AI Roofline models to identify the bottlenecks and guide optimizations.
We apply the proposed roofline models to the 16-GPU HPC AI system. The theoretical peak number is calculated according to the hardware configurations shown in Table 9. We use the roofline model to identify potential bottlenecks of EWA and Image Classification. From Fig. 15, we have the following observations: EWA is bounded by the communication bandwidth, as it falls in the slanted part of the roof, while Image Classification is bounded by the computation, as it falls in the flat part.

Figure 15: The roofline model at the system scale of 16 GPUs. The ceilings shown are the peak FLOPS, the communication bandwidth, the NVLink bandwidth, and the CONV and GEMM ceilings in single and mixed precision. The blue point represents EWA, and the red point represents Image Classification.
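A small helper in the spirit of Equation (6) makes this bottleneck classification mechanical (the function and threshold logic are ours):

```python
def classify_bound(coi, flops_ceiling_tflops, comband_ceiling_gbps):
    # Communication-bound if the bandwidth term binds at this COI
    # (slanted part of the roof); compute-bound otherwise (flat part).
    communication_bound_tflops = comband_ceiling_gbps * coi / 1e3
    if communication_bound_tflops < flops_ceiling_tflops:
        return "communication-bound"
    return "compute-bound"
```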
We adopt two optimization strategies: communication compression and mixed precision optimization.
Communication compression.
In order to optimize the communication, we perform communication compression, which compresses tensors into FP16 for communication and decodes them back into FP32 for computation. This optimization halves the amount of communication in each training step, which is equivalent to doubling the communication bandwidth. As the amount of computation remains the same, the COI of EWA and Image Classification also doubles. As shown in Fig. 15, our results show that the performance of EWA increases from 25.99 to 36.97 TFLOPS after communication compression. On the other hand, the performance of Image Classification is not improved, because it is computation-bound, even though its COI indeed increases.
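On this software stack, FP16 gradient compression of this kind can be enabled through Horovod's built-in compression option; a minimal sketch (illustrative, not necessarily the benchmark's exact implementation):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
# Tensors are cast to FP16 before the NCCL allreduce and back to FP32
# afterwards, halving the bytes on the wire and thus doubling the COI.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)
```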
Mixed precision training.
In order to improve the performance of Image Classification, we adopt the mixed-precision optimization, which makes use of Tensor Cores to perform arithmetic in FP16, achieving a higher number of computation operations per second. As shown in Fig. 15, the rightmost red point represents the mixed-precision training; it brings about a 2.16x speedup. Moreover, the COI is also improved. This is because mixed-precision training requires a lower memory footprint, so we double the batch size, and the larger batch size leads to a higher COI (a higher amount of computation per step). In the near future, we will try the mixed-precision optimization for EWA, too.
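For reference, TensorFlow 1.14 exposes automatic mixed precision for Volta Tensor Cores as an optimizer/graph rewrite; the sketch below shows one way to enable it and is not necessarily how the benchmark implements it:

```python
import tensorflow as tf

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
# Rewrites eligible ops to FP16 and adds automatic loss scaling, so the
# Tensor Cores do the FP16 math while FP32 master weights are kept.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
```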
9 Related Work

We summarize the related work in chronological order (according to the publication dates of the cited papers or publicly available technical reports) from the perspectives of HPC benchmarking, AI benchmarking, and HPC AI benchmarking.
9.1 HPC Benchmarking

HPL (1994) [71] is the famous HPC benchmark used for the Top500 [13] ranking. HPL is short for High Performance Linpack, which is designed to solve dense linear equations. For the Top500 ranking, users are allowed to optimize the MPI [61] and BLAS [72] libraries to achieve the best performance. Since solving the Linpack problem is very regular, HPC systems can achieve very high performance on it; hence, the performance of HPL can be regarded as the upper bound performance of the target HPC system. HPL is open source and publicly available.

NPB (1994) [73] is the NAS Parallel Benchmark suite, whose workloads are derived from computational fluid dynamics (CFD) applications, a typical class of traditional HPC applications. Based on a pencil-and-paper specification, NPB 1.0 consists of five kernels and three pseudo-applications, and the latest NPB 3.4.1 includes 12 workloads. NPB is open source and publicly available.

HPCC (2005) [74] is the HPC Challenge benchmark suite, which includes seven different workloads. HPCC covers the spectrum of spatial and temporal locality of HPC workloads, so the HPCC benchmarks are designed for measuring a range of memory access patterns of an HPC system. HPCC is open source and publicly available from https://icl.utk.edu/hpcc/.

Graph500 (2010) [75] is designed for data-intensive supercomputer applications. The workloads of Graph500 are search and shortest-path programs on weighted undirected graphs, and they exhibit very low spatial and temporal locality. Its metric is not FLOPS but TEPS (traversed edges per second). Graph500 is open source and publicly available from https://graph500.org/.

HPCG (2013) [76] is another benchmark for the Top500 ranking. HPCG is short for High Performance Conjugate Gradients. Its computational and data access patterns are closer to those of real HPC applications. As a kernel workload extracted from traditional HPC workloads, the HPCG benchmark is intended as a complement to the High Performance Linpack (HPL) benchmark, and the FLOPS of HPCG is far lower than that of HPL on the same platform. HPCG is open source and publicly available from https://github.com/hpcg-benchmark/hpcg.

9.2 AI Benchmarking

BenchNN (2012) [77] uses neural network algorithms to re-implement the well-known PARSEC benchmark [35]. Its main purpose is to illustrate the potential application scope of neural network algorithms. The models adopted in BenchNN are simple shallow neural networks, such as the multi-layer perceptron, and thus they cannot reflect the state of the art. BenchNN is not open source so far.

DeepBench (2016) [78] is a micro-benchmark suite that aims to benchmark basic operations in deep neural networks, such as convolution and dense matrix multiplication. The methodology of DeepBench is to reflect the characteristics of these operations using different input sizes. Since only the operator level is covered, DeepBench cannot provide full-model-level evaluation. DeepBench is open source and publicly available from https://github.com/baidu-research/DeepBench.

Both Fathom (2016) [79] and TBD (2018) [80] consist of representative AI workloads covering a broad range of application domains. Their evaluations focus only on throughput while ignoring model quality. Fathom is open source and publicly available from https://github.com/rdadolf/fathom. TBD is open source and publicly available from https://github.com/tbd-ai/tbd-suite.

DawnBench (2017) [81] aims at end-to-end deep learning benchmarking; it first proposed time-to-accuracy as the main metric, which requires training a model to state-of-the-art accuracy. It has two workloads: image classification and question answering. The limitation of DawnBench is that it ignores equivalent benchmarking rules. DawnBench is open source and publicly available from https://github.com/stanford-futuredata/dawn-bench-entries.

The BenchCouncil AI benchmark suites (2018) present a series of AI benchmarking work, including AIBench [10, 11, 14, 15] for datacenter AI benchmarking, AIoTBench [82] for mobile and embedded device intelligence benchmarking, Edge AIBench [83] for edge computing benchmarking, and the previous version of HPC AI500 [45]. The BenchCouncil AI benchmarks are by far the most comprehensive AI benchmark suites, covering datacenter, IoT, edge, and HPC. For example, AIBench adopts a scenario-distilling benchmarking methodology for the first time, which considers scenario benchmarks, component benchmarks, and micro benchmarks as three indispensable parts of a benchmark suite. This methodology bridges a huge gap from real-world application deployments to simulator-based architecture research, and balances the subtly different requirements of earlier-stage benchmarking (portability and affordability for new architectures) and later-stage benchmarking (representativeness and comprehensiveness) [11]. The BenchCouncil AI benchmark suites are open source and publicly available.

BenchIP (2018) [84] focuses on benchmarking intelligent processors. It contains two sets of benchmarks: micro-benchmarks and macro-benchmarks. The micro-benchmarks consist of single-layer networks that are used for system optimizations; the macro-benchmarks consist of various neural networks that are used to offer realistic benchmarking. BenchIP also ignores equivalent benchmarking rules, and it only focuses on throughput. BenchIP is not open source so far.

MLPerf (2019) [9] includes seven benchmarks for training and five benchmarks for inference. The MLPerf training benchmark proposes a series of benchmarking rules to eliminate the side effect of the stochastic nature of AI. Nevertheless, the MLPerf rules cannot be used to assure the equivalency, repeatability, and replicability of HPC AI benchmarking, as they lack specific parallelism and communication rules. MLPerf is open source and publicly available from https://github.com/mlperf.

9.3 HPC AI Benchmarking

HPC AI500 (V1.0) (2018) [45] is the first HPC AI benchmark suite based on real-world scientific datasets, covering three representative HPC AI applications: high energy physics, cosmology, and extreme weather analytics. HPC AI500 (V1.0) is open source and publicly available.

The HPL-AI benchmark (2019) [17] is designed for 32-bit and even lower floating-point-precision AI computing. Using the solver formulation of the decades-old HPL benchmarking framework, HPL-AI strives to unite traditional HPC and state-of-the-art AI. The HPL-AI algorithm is a combination of low-precision (state-of-the-art AI precision) LU factorization and subsequent iterative refinement that brings the solution back to 64-bit accuracy (traditional HPC precision). However, the LU factorization operation is irrelevant to most AI workloads. As a micro-benchmark, HPL-AI is more suitable for evaluating the upper bound performance of an HPC AI system. The HPL-AI benchmark is open source and publicly available from https://icl.bitbucket.io/hpl-ai/.

Deep500 (2019) [85] is a reproducible, customizable benchmarking infrastructure for high-performance deep learning. It has four levels of abstraction to provide a full-stack evaluation. However, its reference implementation uses commonly used open-source datasets and simple deep learning models, and hence cannot reflect real-world HPC AI workloads. Moreover, it fails to propose rules to assure the equivalency, repeatability, and replicability of HPC AI benchmarking. Deep500 is open source and publicly available from https://github.com/deep500/deep500.

AAH (2020) [86] uses AutoML [87] to benchmark HPC AI systems. AutoML is highly compute-intensive and extensible, which fits the requirements of benchmarking HPC systems. However, as a complicated AI workload, AutoML involves many hyper-parameters, which usually makes it hard to evaluate [40]. Moreover, the variance of its essential workload, Neural Architecture Search, is as high as 6.15%, according to the evaluation in [10].
Some specific AI workloads also play an important role in evaluating HPC AI systems. ImageNet/ResNet-50 is a well-known showcase for optimizing HPC AI systems, motivating a series of studies on learning rate scheduling algorithms and efficient communication strategies [19, 23, 24, 26, 27, 29, 32].

The researchers at Facebook (2017) [19] formally propose the linear scaling rule and the warmup scheme for the first time, and summarize several pitfalls in large-scale deep learning. They finish the training in 60 minutes with a top-1 accuracy of 76.3%.

The researchers at Berkeley (2017) [26] first propose LARS (Layer-wise Adaptive Rate Scaling), a novel learning rate policy. By utilizing this policy, they successfully scale the batch size of ResNet-50 to 32K and reduce the training time to 20 minutes.

Preferred Networks (2017) [27], IBM (2017) [23], Tencent (2018) [29], Sony (2018) [28], Google (2018) [30], and Fujitsu (2019) [32] all focus on highly efficient communication strategies (scaling to larger HPC systems) and other system-level optimizations (e.g., mixed-precision training). Their learning rate policies and other algorithm-level optimizations follow the work from Facebook [19] and Berkeley [26]. These works have reduced the training time from hours to minutes. So far, the fastest training time is 74.7 seconds, from Fujitsu (2019) [32].
10 Conclusion
This paper proposes a comprehensive HPC AI benchmarking methodology that achieves the goal of being equivalent, relevant, representative, affordable, and repeatable. Following this methodology, we present open-source benchmarks and Roofline performance models for benchmarking and optimizing HPC AI systems. We propose two innovative metrics, Valid FLOPS and Valid FLOPS per watt, to rank the performance and energy efficiency of HPC AI systems. The evaluations show that our methodology, benchmarks, performance models, and metrics can measure, optimize, and rank HPC AI systems in a scalable, simple, and affordable way. The specification, source code, and benchmarking data are publicly available.
11 Acknowledgments
We thank the PengCheng Laboratory for hardware support. We also thank Shaomeng Cao, Xuhui Shao, Yongheng Liu, Changsong Liu, and Jingfei Qiu for technical support in using those systems.

References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.
[3] .
[4] S. Ravanbakhsh, J. B. Oliva, S. Fromenteau, L. Price, S. Ho, J. G. Schneider, and B. Póczos, "Estimating cosmological parameters from the dark matter distribution," in ICML, pp. 2407–2416, 2016.
[5] Y. Liu, E. Racah, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel, M. Wehner, W. Collins, et al., "Application of deep convolutional neural networks for detecting extreme weather in climate datasets," arXiv preprint arXiv:1605.01156, 2016.
[6] A. Mathuriya, D. Bard, P. Mendygral, L. Meadows, J. Arnemann, L. Shao, S. He, T. Kärnä, D. Moise, S. J. Pennycook, et al., "Cosmoflow: Using deep learning to learn the universe at scale," in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 819–829, IEEE, 2018.
[7] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica, et al., "Exascale deep learning for climate analytics," in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 649–660, IEEE, 2018.
[8] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[9] P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, et al., "Mlperf training benchmark," arXiv preprint arXiv:1910.01500, 2019.
[10] F. Tang, W. Gao, J. Zhan, C. Lan, X. Wen, L. Wang, C. Luo, J. Dai, Z. Cao, X. Xiong, et al., "Aibench: An industry standard ai benchmark suite from internet services," arXiv preprint arXiv:2004.14690, 2020.
[11] W. Gao, F. Tang, J. Zhan, X. Wen, L. Wang, Z. Cao, C. Lan, C. Luo, and Z. Jiang, "Aibench: Scenario-distilling ai benchmarking," arXiv preprint arXiv:2005.03459, 2020.
[12] J. Gray, "Database and transaction processing performance handbook," 1993.
[13] J. J. Dongarra, H. W. Meuer, E. Strohmaier, et al., "Top500 supercomputer sites," Supercomputer, vol. 13, pp. 89–111, 1997.
[14] W. Gao, C. Luo, L. Wang, X. Xiong, J. Chen, T. Hao, Z. Jiang, F. Fan, M. Du, Y. Huang, et al., "Aibench: towards scalable and comprehensive datacenter ai benchmarking," in International Symposium on Benchmarking, Measuring and Optimization, pp. 3–9, Springer, 2018.
[15] W. Gao, F. Tang, L. Wang, J. Zhan, C. Lan, C. Luo, Y. Huang, C. Zheng, J. Dai, Z. Cao, et al., "Aibench: an industry standard internet service ai benchmark suite," arXiv preprint arXiv:1908.08998, 2019.
[16] J. Zhan, L. Wang, W. Gao, and R. Ren, "Benchcouncil's view on benchmarking ai and other emerging workloads," arXiv preprint arXiv:1912.00572, 2019.
[17] https://icl.bitbucket.io/hpl-ai/ .
[18] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., "Mixed precision training," arXiv preprint arXiv:1710.03740, 2017.
[19] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch sgd: Training imagenet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[20] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," arXiv preprint arXiv:1404.5997, 2014.
[21] V. Codreanu, D. Podareanu, and V. Saletore, "Scale out for large minibatch sgd: Residual network training on imagenet-1k with improved accuracy and reduced time to train," arXiv preprint arXiv:1711.04291, 2017.
[22] S. Sridharan, K. Vaidyanathan, D. Kalamkar, D. Das, M. E. Smorkalov, M. Shiryaev, D. Mudigere, N. Mellempudi, S. Avancha, B. Kaul, et al., "On scale-out deep learning training for cloud and hpc," arXiv preprint arXiv:1801.08030, 2018.
[23] M. Cho, U. Finkler, S. Kumar, D. Kung, V. Saxena, and D. Sreedhar, "Powerai ddl," arXiv preprint arXiv:1708.02188, 2017.
[24] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, "Imagenet training in minutes," in Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, (New York, NY, USA), Association for Computing Machinery, 2018.
[25] .
[26] Y. You, Z. Zhang, J. Demmel, K. Keutzer, and C.-J. Hsieh, "Imagenet training in 24 minutes," arXiv preprint arXiv:1709.05011, 2017.
[27] T. Akiba, S. Suzuki, and K. Fukuda, "Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes," arXiv preprint arXiv:1711.04325, 2017.
[28] Y. Tanaka and Y. Kageyama, "Imagenet/resnet-50 training in 224 seconds," 2018.
[29] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, et al., "Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes," arXiv preprint arXiv:1807.11205, 2018.
[30] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, "Image classification at supercomputer scale," arXiv preprint arXiv:1811.06992, 2018.
[31] https://cloud.google.com/tpu/docs/bfloat16 .
[32] M. Yamazaki, A. Kasagi, A. Tabuchi, T. Honda, M. Miwa, N. Fukumoto, T. Tabaru, A. Ike, and K. Nakashima, "Yet another accelerated sgd: Resnet-50 training on imagenet in 74.7 seconds," arXiv preprint arXiv:1903.12650, 2019.
[33] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
[34] .
[35] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The parsec benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 72–81, 2008.
[36] .
[37] J. Gray, "The benchmark handbook for database and transaction systems," Morgan Kaufmann, San Mateo, 1993.
[38] J. Bartlett and C. Frost, "Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables," Ultrasound in Obstetrics and Gynecology: The Official Journal of the International Society of Ultrasound in Obstetrics and Gynecology, vol. 31, no. 4, pp. 466–475, 2008.
[39] .
[40] A. Yang, P. M. Esperança, and F. M. Carlucci, "Nas evaluation is frustratingly hard," arXiv preprint arXiv:1912.12522, 2019.
[41] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
[42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, pp. 8026–8037, 2019.
[43] C. Luo, X. He, J. Zhan, L. Wang, W. Gao, and J. Dai, "Comparison and benchmarking of ai models and frameworks on mobile devices," arXiv preprint arXiv:2005.05085, 2020.
[44] W. Bhimji, S. A. Farrell, T. Kurth, M. Paganini, E. Racah, et al., "Deep neural networks for physics analysis on low-level whole-detector data at the lhc," in Journal of Physics: Conference Series, vol. 1085, p. 042034, IOP Publishing, 2018.
[45] Z. Jiang, W. Gao, L. Wang, X. Xiong, Y. Zhang, X. Wen, C. Luo, H. Ye, X. Lu, Y. Zhang, et al., "Hpc ai500: a benchmark suite for hpc ai systems," in International Symposium on Benchmarking, Measuring and Optimization, pp. 10–22, Springer, 2018.
[46] https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html .
[47] C. Drummond, "Replicability is not reproducibility: nor is it good science," 2009.
[48] H. E. Plesser, "Reproducibility vs. replicability: a brief history of a confused terminology," Frontiers in Neuroinformatics, vol. 11, p. 76, 2018.
[49] T. Kurth, J. Zhang, N. Satish, E. Racah, I. Mitliagkas, M. M. A. Patwary, T. Malas, N. Sundaram, W. Bhimji, M. Smorkalov, et al., "Deep learning at 15pf: supervised and semi-supervised classification for scientific data," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11, 2017.
[50] E. Racah, C. Beckham, T. Maharaj, S. E. Kahou, M. Prabhat, and C. Pal, "Extremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events," in Advances in Neural Information Processing Systems, pp. 3402–3413, 2017.
[51] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91–99, 2015.
[52] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[53] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2014.
[54] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[55] .
[56] A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in tensorflow," arXiv preprint arXiv:1802.05799, 2018.
[57] A. Mathuriya, T. Kurth, V. Rane, M. Mustafa, L. Shao, D. Bard, V. W. Lee, et al., "Scaling grpc tensorflow on 512 nodes of cori supercomputer," arXiv preprint arXiv:1712.09388, 2017.
[58] http://research.baidu.com/bringing-hpc-techniques-deep-learning .
[59] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, "Don't decay the learning rate, increase the batch size," arXiv preprint arXiv:1711.00489, 2017.
[60] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, "Large batch optimization for deep learning: Training bert in 76 minutes," arXiv preprint arXiv:1904.00962, 2019.
[61] .
[62] https://developer.nvidia.com/nccl .
[63] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., "Deep speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[64] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[65] S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[66] https://docs.nvidia.com/cuda/profiler-users-guide/index.html .
[67] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, X. Wen, R. Ren, C. Zheng, X. He, H. Ye, et al., "Bigdatabench: A scalable and unified big data and ai benchmark suite," arXiv preprint arXiv:1802.08254, 2018.
[68] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhimenko, "Tbd: Benchmarking and analyzing deep neural network training," arXiv preprint arXiv:1803.06905, 2018.
[69] https://horovod.readthedocs.io/en/latest/timeline.html .
[70] .
[71] J. J. Dongarra, P. Luszczek, and A. Petitet, "The linpack benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.
[72] .
[73] .
[74] L. Humphrey, B. Guilfoos, H. B. Smith, A. Warnock, J. Unpingco, B. H. Elton, and A. Chalker, "Evaluating parallel extensions to high level languages using the hpc challenge benchmarks," pp. 410–415, 2009.
[75] K. Ueno and T. Suzumura, "Highly scalable graph search for the graph500 benchmark," pp. 149–160, 2012.
[76] J. Dongarra, M. A. Heroux, and P. Luszczek, "High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems," The International Journal of High Performance Computing Applications, vol. 30, no. 1, pp. 3–10, 2016.
[77] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, "Benchnn: On the broad potential application scope of hardware neural network accelerators," in IEEE International Symposium on Workload Characterization (IISWC), pp. 36–45, IEEE, 2012.
[78] https://github.com/baidu-research/DeepBench/ .
[79] R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, "Fathom: Reference workloads for modern deep learning methods," in IEEE International Symposium on Workload Characterization (IISWC), pp. 1–10, IEEE, 2016.
[80] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Jayarajan, A. Phanishayee, B. Schroeder, and G. Pekhimenko, "Benchmarking and analyzing deep neural network training," in IEEE International Symposium on Workload Characterization (IISWC), pp. 88–100, IEEE, 2018.
[81] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, "Dawnbench: An end-to-end deep learning benchmark and competition," Training, vol. 100, no. 101, p. 102, 2017.
[82] C. Luo, F. Zhang, C. Huang, X. Xiong, J. Chen, L. Wang, W. Gao, H. Ye, T. Wu, R. Zhou, et al., "Aiot bench: towards comprehensive benchmarking mobile and embedded device intelligence," in International Symposium on Benchmarking, Measuring and Optimization, pp. 31–35, Springer, 2018.
[83] T. Hao, Y. Huang, X. Wen, W. Gao, F. Zhang, C. Zheng, L. Wang, H. Ye, K. Hwang, Z. Ren, et al., "Edge aibench: towards comprehensive end-to-end edge computing benchmarking," in International Symposium on Benchmarking, Measuring and Optimization, pp. 23–30, Springer, 2018.
[84] J.-H. Tao, Z.-D. Du, Q. Guo, H.-Y. Lan, L. Zhang, S.-Y. Zhou, L.-J. Xu, C. Liu, H.-F. Liu, S. Tang, et al., "Benchip: Benchmarking intelligence processors," Journal of Computer Science and Technology, vol. 33, no. 1, pp. 1–23, 2018.
[85] T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler, "A modular benchmarking infrastructure for high-performance and reproducible deep learning," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 66–77, IEEE, 2019.
[86] Z. Ren, Y. Liu, T. Shi, L. Xie, Y. Zhou, H. Chen, H. Fu, Y. Ouyang, J. Zhai, Y. Zhang, Y. Zhang, and W. Chen, "Aah: Automated machine learning as an ai-hpc benchmark," Technical Report of Pengcheng Lab and Tsinghua University, 2020.
[87] H. Jin, Q. Song, and X. Hu, "Auto-keras: An efficient neural architecture search system," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.