HPC AI500: The Methodology, Tools, Roofline Performance Models, and Metrics for Benchmarking HPC AI Systems
Zihan Jiang, Lei Wang, Xingwang Xiong, Wanling Gao, Chunjie Luo, Fei Tang, Chuanxin Lan, Hongxiao Li, Jianfeng Zhan
Authors' Contributions: Section 1 is contributed by Jianfeng Zhan and Zihan Jiang. Section 2 is contributed by Jianfeng Zhan, Zihan Jiang, and Fei Tang. Section 3 is contributed by Jianfeng Zhan. Section 4 is contributed by Xingwang Xiong, Zihan Jiang, Lei Wang, Wanling Gao, and Jianfeng Zhan. Section 5 is contributed by Zihan Jiang, Lei Wang, Chunjie Luo, Wanling Gao, Jianfeng Zhan, and Hongxiao Li. Section 6 is contributed by Lei Wang, Zihan Jiang, Wanling Gao, and Jianfeng Zhan. Section 7 is contributed by Zihan Jiang, Xingwang Xiong, Lei Wang, Wanling Gao, Chuanxin Lan, and Jianfeng Zhan. Section 8 is contributed by Zihan Jiang, Lei Wang, Wanling Gao, and Jianfeng Zhan. Section 9 is contributed by Jianfeng Zhan.

Technical Report No. BenchCouncil-HPCAI500-2020-1, June 30, 2020
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; BenchCouncil (International Open Benchmarking Council); University of Chinese Academy of Sciences
{jiangzihan, wanglei2011, xingwangxiong, gaowanling, luochunjie, lanchuanxin, tangfei, lihongxiao, zhanjianfeng}@ict.ac.cn

June 30, 2020

* Jianfeng Zhan is the corresponding author.
Abstract

Recent years have witnessed a trend of applying large-scale distributed deep learning algorithms in both business and scientific computing areas, whose goal is to speed up the training time to achieve state-of-the-art quality. The HPC community shows great interest in building HPC AI systems dedicated to running those workloads, and HPC AI benchmarks accelerate that process. Unfortunately, benchmarking HPC AI systems at scale raises serious challenges: none of the previous HPC AI benchmarks achieves the goal of being equivalent, relevant, representative, affordable, and repeatable.

This paper presents a comprehensive methodology, tools, Roofline performance models, and innovative metrics for benchmarking, optimizing, and ranking HPC AI systems, which we call HPC AI500 V2.0. We abstract an HPC AI system into nine independent layers, and present explicit benchmarking rules and procedures to assure the equivalence of each layer, repeatability, and replicability. On the basis of AIBench, by far the most comprehensive AI benchmark suite, we present and build two HPC AI benchmarks from both business and scientific computing: Image Classification and Extreme Weather Analytics, achieving both representativeness and affordability. To rank the performance and energy efficiency of HPC AI systems, we propose Valid FLOPS and Valid FLOPS per watt, which impose a penalty on failing to achieve the target quality. We propose using convolution and GEMM, the two most intensively used kernel functions of AIBench, to measure the upper bound performance of HPC AI systems, and present HPC AI Roofline models for guiding performance optimizations. The evaluations show that our methodology, benchmarks, performance models, and metrics can measure, optimize, and rank HPC AI systems in a scalable, simple, and affordable way. The specification, source code, datasets, and benchmarking data are publicly available.

1 Introduction

The huge success of AlexNet [1] in the ImageNet [2] competition marks the booming success of deep learning (DL) in a wide range of commercial application areas. Many commercial fields, like image recognition and natural language processing, achieve unprecedented accuracy, even outperforming common human capability. Though it is much more challenging to obtain high-quality labeled scientific data sets, there is an increasing trend of applying DL in scientific computing areas [3–6].

Figure 1: ImageNet/ResNet-50 training is one well-known showcase for optimizing HPC AI systems. It reports the performance in terms of a ternary tuple (achieved quality, PFLOPS, time-to-quality in minutes). The reported systems performance varies wildly, from (74.6%, 1.6, 28) to (75.1%, 36.8, 1.2). Table 1 summarizes the utilized optimization approaches. As no equivalent benchmarking rule is stated, we cannot objectively derive the performance edge of one system against the others.

With massive training data available, recent years have witnessed a trend of applying distributed DL algorithms at scale in both commercial and scientific computing areas. Motivated by these emerging HPC AI workloads, the HPC community shows great interest in building HPC AI systems to reduce time-to-quality: the training time to achieve a convergent quality. For example, the Summit system [7] is built to tackle huge AI challenges. Benchmarks accelerate this process [8, 9], as they provide not only design inputs but also evaluation and optimization metrics and methodology [10, 11].
However, there are several challenges in benchmarking HPC AI systems.

First, it is nontrivial to prove the equivalence of two AI benchmark implementations on different systems, or even on the same system at different scales. Equivalence quantifies how comparable two benchmark implementations are on different systems or on the same system at different scales. There are complex interactions among hardware and software systems, which are further aggravated by the complexity of AI algorithms. Even for the same AI algorithm, there are many hyper-parameters that significantly impact learning dynamics [9]. ImageNet/ResNet-50 (Image Classification) training is one well-known showcase for optimizing HPC AI systems. Table 1 summarizes the state-of-the-art and state-of-the-practice optimization approaches in ImageNet/ResNet-50 training. Unfortunately, without equivalent benchmarking rules explicitly stated, we cannot objectively derive the performance edge of one system against the others from Fig. 1.

The second challenge inherits from the conflict of two classical benchmarking methodologies with the emphasis on different requirements. On one hand, as no single benchmark or metric can measure the performance of computer systems on all applications [12], being relevant, representative, and diverse is of paramount importance [10]. On the other hand, TOP500 [13] establishes the de facto supercomputer benchmark standard in terms of three defining characteristics: scalable, simple, and affordable.
Figure 2: The equivalence perspective of the HPC AI500 V2.0 methodology. We abstract an HPC AI system into nine independent layers: Layer 1 hardware (e.g., CPU, network); Layer 2 OS; Layer 3 communication libraries (e.g., Horovod); Layer 4 AI accelerators and their libraries (e.g., GPU, CUDA, NCCL); Layer 5 AI framework (e.g., TensorFlow); Layer 6 programming model; Layer 7 workload (algorithm); Layer 8 hyper-parameters (learning rate policies, batchsize setting, and other hyper-parameter settings); Layer 9 problem domain (datasets, target quality, epochs). We put each layer under test while keeping the other layers intact. We also provide three high levels of benchmarking, hardware, system, and free, which put the related layers together under test while keeping the other layers intact, with only the changes allowed by the benchmarking rules.
In the AI domain, there are massive AI tasks and models with different performance metrics. For example, AIBench [10, 11, 14, 15], by far the most comprehensive and representative AI benchmark suite, contains seventeen AI tasks. It is not affordable to implement so many benchmarks and further perform benchmarking at scale. So what are the criteria for choosing the benchmarks that can fairly and objectively measure HPC AI systems?

Third, a benchmark mandates being repeatable, while the nature of AI is stochastic, allowing multiple different but equally valid solutions [9]. The uncertainty of HPC AI is manifested by run-to-run variation in terms of epochs-to-quality and by the effect of scaling training on time-to-quality [9, 16]. For the first time, Tang et al. [10] quantify the variations of the seventeen AI benchmarks of AIBench. They found that the run-to-run variations range from 0% to 38.46% in terms of the ratio of the standard deviation to the mean of the training epochs needed to achieve a convergent quality.

None of the previous HPC AI benchmarks achieves the goal of being equivalent, relevant, representative, affordable, and repeatable. They either are not representative of, or are even irrelevant to, HPC AI workloads in terms of kernel functions [17, 18], or overlook the differences of HPC AI workloads between scientific and business computing [9], or fail to specify fair and equivalent benchmarking rules across different HPC AI systems [9]. Moreover, they fail to propose simple and AI domain-specific metrics to score and rank HPC AI systems.

A micro benchmark like HPL-AI [18], which only contains LU decomposition, is affordable for performing a fair comparison of competing systems by isolating hardware and software from statistical optimizations [9]. However, we found it is irrelevant to most AI workloads, as shown in Section 3.2. Moreover, the traditional micro or kernel benchmarking methodology, widely used in the HPC communities, can lead to misleading conclusions: the mixed precision optimizations indeed improve the FLOPS of a micro benchmark like convolution, while significantly impacting the time-to-quality of an AI task like Image Classification, as discussed in Section 3.2.

This paper presents HPC AI500 V2.0: a comprehensive HPC AI benchmarking methodology, tools, performance models, and metrics. As shown in Fig. 2, we abstract an HPC AI system into nine independent layers. To perform fair benchmarking across different systems or the same system at different scales, we present explicit benchmarking rules to assure the equivalence of each layer, and the repeatability and replicability of our benchmarks. We put each layer under test while keeping the other layers intact. We also propose three high levels of benchmarking: hardware, system, and free (Fig. 2), which put the related layers under test while keeping the other layers intact unless otherwise stated.

On the basis of AIBench, we present two benchmarks to measure HPC AI systems: Image Classification with state-of-the-art quality on the ImageNet dataset (business computing), and Extreme Weather Analytics (EWA) with state-of-the-art quality on the EWA dataset (scientific computing). These two benchmarks represent two clusters, covering thirteen AI benchmarks from AIBench, from the perspectives of computing areas (business vs. scientific computing), diversity of model complexity (from 0.03 million to 68.39 million model parameters), computational cost (from 0.09 MFLOPs to 157.80 GFLOPs for a single forward computation), and convergence rate (from 6 epochs to 304 epochs).
Moreover, our decision also takes into account their repeatability, and whether these benchmarks have widely accepted metrics or not.

To rank HPC AI systems, we propose two metrics, named Valid FLOPS and Valid FLOPS per watt, to emphasize the vital importance of achieving state-of-the-art quality, and an auxiliary metric, time-to-quality.

We propose using convolution and GEMM (GEneral Matrix to Matrix Multiplication), the two most intensively used kernel functions of AIBench, to measure the upper bound performance of HPC AI systems, and present the corresponding single-node and distributed HPC AI Roofline models for guiding performance optimizations.

The evaluations show our benchmarks can fairly measure HPC AI systems in a scalable, simple, and affordable way. Our Roofline models are helpful for system optimizations. Our metrics can be used to rank HPC AI systems in a simple and visual manner.
3 The Challenges of HPC AI Benchmarking

The challenges of HPC AI benchmarking inherit from the complexity of benchmarking scalable hardware and software systems, and are further exacerbated by the uncertainty of AI algorithms.
For the same AI algorithm, there are many hyper-parameters that significantly impact learning dynamics [9]. Even for the same system at different scales, the interactions among system size, minibatch size, and learning dynamics have a significant impact on time-to-quality and on computation overhead in terms of FLOPS [9, 19, 26]. So, for the same AI task, it is non-trivial to prove the equivalence of two benchmark implementations on different systems or even on the same system at different scales.

ImageNet/ResNet-50 training is one widely-used showcase for optimizing HPC AI systems. Fig. 1 shows the systems performance varies wildly: the performance gap in terms of FLOPS is 50x. Accordingly, Table 1 summarizes the state-of-the-art and state-of-the-practice work on ImageNet training at scale. In addition to the system-level optimizations (e.g., more efficient communication topologies), some algorithm-level optimizations involve changing model architectures (e.g., optimizations on batch normalization) or learning rate policies, i.e., LARS [26]. As there are prohibitively complex interactions among hardware systems, software systems, and algorithms, previous work fails to clearly state the equivalent rules of each hardware or software layer for benchmarking HPC AI systems.
The second challenge inherits from the conflict of two classical benchmarking methodologies with the emphasis on different requirements.

Table 1: The summary of the utilized optimization approaches in ImageNet/ResNet-50 training. The optimization approaches of each system are inconsistent or inequivalent. Please note that only the optimization items in italics are allowed to change under the HPC AI500 benchmarking rules (defined in Section 6).
| System | Parallel Mode | Communication | Precision | Data Staging | Learning Rate Policy | Data Augmentation | Model Architecture | Others |
|---|---|---|---|---|---|---|---|---|
| Facebook [19] | Data parallelism | Recursive halving and doubling; ring all-reduce | N/A | N/A | Linear scaling and warmup [20] | N/A | N/A | Momentum correction; data shuffling based on the workers |
| Intel [21] | Data parallelism | Intel MLSL [22] | N/A | N/A | Linear scaling and warmup; final collapse | N/A | N/A | Collapsed ensembles; dynamically change weight decay |
| IBM [23] | Data parallelism | Topology aware | N/A | N/A | Linear scaling and warmup [20] | N/A | N/A | Momentum correction; data shuffling based on the workers |
| Berkeley [24] | Data parallelism | Intel MLSL [25] | N/A | N/A | Linear scaling and warmup [20]; LARS [26] | N/A | N/A | N/A |
| Preferred Networks [27] | Data parallelism | Ring all-reduce; communication compression | N/A | N/A | Linear scaling, RMSprop warmup, and slow-start | N/A | Batch normalization without moving averages | N/A |
| Sony [28] | Data parallelism | 2D-Torus all-reduce; communication compression; communication tensor fusion | Mixed precision training: FP16 & FP32 | N/A | Linear scaling and warmup [20]; LARS [26] | Adding, scaling, rotations, etc. | Batch normalization without moving averages | N/A |
| Tencent [29] | Data parallelism | Hierarchical all-reduce; communication compression; communication tensor fusion | Mixed precision training: FP16 & FP32 | Efficient input pipeline | Linear scaling and warmup [20]; LARS [26] | N/A | Batch normalization: eliminating weight decay | N/A |
| Google [30] | Data parallelism | 2D-Mesh all-reduce | Mixed precision training: BFLOAT16 [31] & FP32 | Efficient input pipeline | Linear scaling and warmup [20]; LARS [26] | Fused JPEG decode and cropping | Distributed batch normalization | N/A |
| Fujitsu [32] | Data parallelism | Communication tensor fusion; optimal scheduling by grouping layers; calculating the norms of layers in parallel | Mixed precision training: FP16 & FP32 | N/A | Linear scaling and warmup [20]; LARS [26] | N/A | N/A | Label smoothing [33] |
On one hand, SPEC CPU [34], PARSEC [35], and the TPC benchmarks, like TPC-DS [36], witness the paramount importance [10] of being representative and diverse, as no single benchmark or metric can measure the performance of computer systems on all applications [12].

On the other hand, TOP500 [13] defines three distinctive characteristics of the de facto supercomputer benchmark standard: affordable, simple, and scalable. Affordable has two implications: first, the benchmark is easy to port to a new system or architecture; second, the benchmarking cost is affordable for measuring a system at scale. Simple indicates that the metric is not only linear, orthogonal, and monotonic [13], but also easily interpretable and understandable. Scalable means the benchmark can be used to measure different scales of systems, and the problem size can be scaled up and down.

In the AI domain, there are massive AI tasks and models with different performance metrics. For example, AIBench [10] contains seventeen representative AI tasks, including Image Classification, Object Detection, Learning to Rank, Image Generation, Text-to-Text Translation, Image-to-Text, Image-to-Image Translation, Speech Recognition, Face Embedding, 3D Face Recognition, Recommendation, Video Prediction, Image Compression, 3D Object Reconstruction, Text Summarization, Spatial Transformer, and Neural Architecture Search. For HPC AI benchmarking, it is not affordable to implement so many benchmarks and further perform benchmarking at scale.

The traditional micro or kernel benchmarking methodology, which is widely used in the HPC communities, can lead to misleading conclusions, as the mixed precision optimizations indeed improve the FLOPS of a micro benchmark like convolution, while significantly impacting the time-to-quality of an AI task like Image Classification. Fig. 4 shows that the mixed precision implementation increases the FLOPS of both micro and component benchmarks, while incurring an accuracy drop as the system scale increases.

Last but not least, the relevancy [37] of a benchmark indicates that it must measure the peak performance and price/performance of systems when performing typical operations within that problem domain. A micro benchmark like HPL-AI [18], which only contains LU decomposition, is affordable for performing a fair comparison of competing systems by isolating hardware and software from statistical optimizations [9]. However, we found it is irrelevant to most AI workloads in AIBench. As shown in Fig. 3, the dominant kernel functions are convolution and matrix multiplication.
Figure 3: The kernel function breakdown of the seventeen representative AI workloads from AIBench [10], indicating that LU factorization is irrelevant.
Figure 4: With respect to the FP32 implementation, the mixed precision one speeds up the FLOPS of two micro benchmarks, Conv and GEMM, and a component benchmark, ResNet-50, by 2x (left), while incurring an accuracy drop of ResNet-50 as the system scale increases (right): 0.12% at 1 node but about 1% at 8 nodes.

3.3 Repeatability

Repeatability [38, 39] refers to the variation in repeat measurements of different runs of the same benchmark implementation, by the same team, on the same system under identical configurations. Table 2 shows the run-to-run variations of the seventeen benchmarks from AIBench, varying from 0% to 38.46%. As shown in Fig. 5, the variation of 3D Face Recognition is as high as 38.46%. There are diverse reasons for the uncertainty of different benchmarks. NAS (Neural Architecture Search) constructs the network architecture by randomly sampling building blocks (e.g., convolution) from a predefined search space. In addition, the complex design itself, which involves many hyper-parameters, makes AutoML hard to evaluate [40].
Table 2: The run-to-run variations of the seventeen AI benchmarks of AIBench [10].

| No. | Component Benchmark | Variation | Repeat Times |
|---|---|---|---|
| DC-AI-C1 | Image Classification | 1.12% | 5 |
| DC-AI-C2 | Image Generation | Not available | N/A |
| DC-AI-C3 | Text-to-Text Translation | 9.38% | 6 |
| DC-AI-C4 | Image-to-Text | 23.53% | 5 |
| DC-AI-C5 | Image-to-Image | Not available | N/A |
| DC-AI-C6 | Speech Recognition | 12.08% | 4 |
| DC-AI-C7 | Face Embedding | 5.73% | 8 |
| DC-AI-C8 | 3D Face Recognition | 38.46% | 4 |
| DC-AI-C9 | Object Detection | 0% | 10 |
| DC-AI-C10 | Recommendation | 9.95% | 5 |
| DC-AI-C11 | Video Prediction | 11.83% | 4 |
| DC-AI-C12 | Image Compression | 22.49% | 4 |
| DC-AI-C13 | 3D Object Reconstruction | 16.07% | 4 |
| DC-AI-C14 | Text Summarization | 24.72% | 5 |
| DC-AI-C15 | Spatial Transformer | 7.29% | 4 |
| DC-AI-C16 | Learning to Rank | 1.90% | 4 |
| DC-AI-C17 | Neural Architecture Search | 6.15% | 6 |
Figure 5: The worst unrepeatable benchmark from AIBench is 3D Face Recognition. Its run-to-run variation is as high as 38.46%. The variation is defined as the ratio of the standard deviation to the mean of the training epochs to the achieved quality [10].
Without the equivalent benchmarking rules being explicitly stated, ImageNet/ResNet-50 training is not qualified for ranking the performance and energy efficiency of HPC AI systems.
4 Benchmarking Methodology

This section presents our methodology for achieving the goal of being equivalent, relevant, representative, affordable, and repeatable.
4.1 Equivalence

To perform fair benchmarking across different systems or the same system at different scales, we propose two approaches to assure equivalence.

First, as shown in Fig. 2, we abstract the system under test into nine independent layers, and put each layer under test while keeping the other layers intact unless otherwise stated. Layer 1 is the hardware, including CPUs and networks. Layers 2 and 3 are the related system software: the operating system (Layer 2) and the communication libraries (Layer 3). Layer 4 is the AI accelerators, e.g., GPUs, and their libraries, e.g., CUDA and cuDNN. Layer 5 is the AI framework, such as TensorFlow [41] and PyTorch [42]. Layer 6 is the programming model, including the parallel mode (data parallelism or model parallelism) and synchronous or asynchronous training. Layer 7 is the workloads used in the HPC AI500 V2.0 benchmark. Layer 8 is the hyper-parameter policies and settings. Layer 9 is the problem domain, including datasets, target quality, and epochs.

Second, for the sake of simplicity, we propose three high levels of benchmarking and put several related layers together under test.

(1) The hardware level. This level is for benchmarking HPC AI hardware systems and their related system software (Layers 1, 2, 3, and 4). In this context, the other layers should be kept intact unless otherwise stated in the benchmarking rules. The benchmark users should compile the source code of the benchmark implementation, provided by the benchmark committee, on their hardware directly with the allowed changes. Luo et al. [43] show that the same model on different frameworks achieves different accuracy. So, in addition to the same data set and AI model, we mandate that the benchmark implementations also use the same AI framework. The benchmark users can change the hardware, OS, compiler settings, and communication libraries. For the other layers, the benchmark users can only change the parallel mode in Layer 6 or tune the learning rate policies and batchsize settings in Layer 8. It is the benchmark committee's duty to assure the equivalence of Layers 6, 7, 8, and 9 across different benchmark implementations upon the users' requests.

(2) The system level. Because of the portability cost, some benchmark users may opt for one specific AI framework without supporting the others, so specifying a fixed framework has a limited purpose. So, at the system level, we put the hardware system and the AI framework under test (Layers 1, 2, 3, 4, and 5). We mandate that the benchmark implementations use the same data set and AI model. In addition to the changes allowed at the hardware level, the users are allowed to re-implement the algorithms on a different or even customized AI framework (Layer 5). The other layers should be kept intact unless otherwise stated in the benchmarking rules. The benchmark committee or an independent group needs to double-check the equivalence of Layers 6, 7, 8, and 9 between the two benchmark implementations.

(3) The free level. At this level, the specification of an AI task is stated in a paper-and-pencil manner, separate from its specific implementation. That is to say, the same data set, target quality, and training epochs are defined in Layer 9, while the other layers are open for optimizations. The emphasis is advancing the state-of-the-art of software and hardware co-design, so the benchmark users can change any layer from Layer 1 to Layer 8 while keeping Layer 9 intact. Meanwhile, the benchmark users are encouraged to disclose the details.
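To make the three levels concrete, the following minimal sketch (a hypothetical helper of ours, not part of the HPC AI500 release) encodes which layers are under test at each level; the layer indices follow Fig. 2.

```python
# Which of the nine layers (Fig. 2) are under test at each benchmarking level.
# Note: at the hardware level, limited changes to Layers 6 and 8 are also
# allowed (parallel mode, learning rate policy, batchsize); see Section 4.1.
BENCHMARKING_LEVELS = {
    "hardware": {"under_test": [1, 2, 3, 4], "kept_intact": [5, 6, 7, 8, 9]},
    "system":   {"under_test": [1, 2, 3, 4, 5], "kept_intact": [6, 7, 8, 9]},
    "free":     {"under_test": [1, 2, 3, 4, 5, 6, 7, 8], "kept_intact": [9]},
}

def is_change_allowed(level: str, layer: int) -> bool:
    """True if the given layer may be freely changed at the given level."""
    return layer in BENCHMARKING_LEVELS[level]["under_test"]
```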
4.2 Benchmark Selection

We investigate and compare the state-of-the-art and state-of-the-practice AI benchmark suites, including MLPerf [9], AIBench [10], Deep500 [44], and HPC AI500 V1.0 [45]. We present the detailed analysis in Section 9. Fortunately, we found that the methodology of AIBench and its subset combines the merits of the two methodologies discussed in Section 3.

On one hand, AIBench [10] is by far the most representative and comprehensive AI benchmark suite. It contains seventeen representative AI tasks. These workloads are diverse in terms of model complexity, computational cost, convergence rate, computation and memory access patterns, hotspot functions, and other micro-architectural characteristics.

On the other hand, for affordability, AIBench carefully selects a minimum subset from the seventeen AI tasks from the perspectives of model complexity, computational cost, convergence rate, run-to-run variation, and having widely accepted evaluation metrics or not. As shown in Fig. 6, the AIBench subset includes three AI tasks: Image Classification, Object Detection, and Learning to Rank.
Figure 6: The three-task subset of AIBench with respect to the full seventeen benchmarks [10]. The clustering is based on the patterns of computation and memory access of the seventeen AIBench component benchmarks, described by the five metrics listed in Table 3. For visualization, the five-dimensional data are reduced to two dimensions by the t-SNE approach [46].
Table 3: The metrics used by the t-SNE clustering approach [10].

| Metric | Meaning |
|---|---|
| achieved occupancy | The ratio of the average active warps per active cycle to the maximum number of warps provided by a multiprocessor |
| ipc efficiency | The ratio of the executed instructions per cycle to the theoretical number |
| gld efficiency | The ratio of the requested global memory load throughput to the required global memory load throughput |
| gst efficiency | The ratio of the requested global memory store throughput to the required global memory store throughput |
| dram utilization | The utilization level of the device memory relative to the peak utilization |
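As an illustration of this clustering step, the following sketch projects five-dimensional per-benchmark profiles into two dimensions with scikit-learn's t-SNE; the profile values here are placeholders, not the measured data from [10].

```python
import numpy as np
from sklearn.manifold import TSNE

# One row per AIBench component benchmark, one column per metric in Table 3:
# achieved occupancy, ipc efficiency, gld efficiency, gst efficiency,
# dram utilization.
profiles = np.random.rand(17, 5)  # placeholder values, not measured profiles

# Project the five-dimensional profiles to two dimensions for visualization.
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(profiles)
print(embedding.shape)  # (17, 2)
```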
Tang et al. [10] systematically quantify the run-to-run variation of the seventeen AI tasks of AIBench in terms of the ratio of the standard deviation to the mean of the training epochs needed to achieve a convergent quality. The variations of Image Classification, Object Detection, and Learning to Rank are 1.12%, 0%, and 1.90%, respectively; they are the most repeatable benchmarks, which is the other reason for including them in the subset. So we choose the AIBench subset as the HPC AI500 V2.0 candidate benchmarks for implementing scalable HPC AI benchmark tools.

4.3 Repeatability and Replicability
In line with the experimental sciences discussed in [47], we propose benchmarking procedures for assuring repeatability and replicability [48]. We adopt definitions similar to those of the Association for Computing Machinery [39]. Different from reproducibility, which requires changes, repeatability and replicability avoid changes [47].

Repeatability (same team): The benchmarking is performed on the same HPC AI system, using the same benchmark implementation under the same configurations, following the same benchmarking procedures, over multiple trials [47]. The team should submit the raw data of all trials, including the average numbers in addition to their variation. The variation is measured in terms of the ratio of the standard deviation to the mean of the numbers of all trials. To mitigate the influence of the stochastic nature of the AI algorithm, each benchmark should mandate a minimum number of valid runs, and the number of trials should be no less than that minimum.

Replicability (different team) [39]: Replicability refers to another team verifying the benchmarking results on the same HPC AI system, using the same benchmark implementation under the same configurations, following the same benchmarking procedures, over multiple trials. For replicability, the benchmark committee or an independent group needs to verify the numbers on the same system and report the raw data of all trials, including the average numbers in addition to their variation.
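For concreteness, the variation reported above can be computed as follows; a minimal sketch with illustrative epoch counts, not the submission tool itself.

```python
import statistics

def run_to_run_variation(epochs_to_quality):
    """Variation = standard deviation / mean of the epochs needed to reach
    the convergent quality across repeated runs (Section 4.3)."""
    return statistics.stdev(epochs_to_quality) / statistics.mean(epochs_to_quality)

# Illustrative example: five runs of a benchmark reaching the target quality
# after slightly different numbers of epochs.
print(f"{run_to_run_variation([88, 90, 92, 91, 89]):.2%}")
```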
5 The HPC AI500 V2.0 Benchmarks

In this section, we first illustrate how we choose the workloads according to our benchmarking methodology (Section 4). Then we present the datasets, AI models, and reference implementations of HPC AI500. Finally, we introduce the metrics.
With respect to other AI benchmarks, HPC AI benchmarking has two unique differences. First, the challenges of HPC AI benchmarking inherit from the complexity of benchmarking scalable hardware and software systems at scale, i.e., tens of thousands of nodes, significantly different from benchmarking IoT [43] or datacenter [11] systems. On this point, we need to consider the cost of benchmarking at scale. Second, HPC AI domains cover both commercial and high performance scientific computing. Currently, business applications are pervasive. Because of the difficulty of recruiting qualified scientists to label scientific data, the applications in scientific computing lag behind but are promising. In general, scientific data are often more complex than the MNIST or ImageNet data: the shape of scientific data can be 2D images or higher-dimensional structures with hundreds of channels, while popular commercial image data like ImageNet often consist of only three RGB channels [45]. So we should include scientific data in the HPC AI benchmarks. According to our benchmarking methodology discussed in Section 4, we choose the AIBench subset as the HPC AI500 candidate benchmarks for implementing scalable HPC AI benchmark tools.
As the broad HPC AI applications cover both scientific [5–7, 49, 50] and commercial fields [27–30], we choose the most representative workloads and data sets from these two fields.
EWA is one of the pioneering works that use deep learning algorithms to replace the rules predefined by human experts and achieve excellent results [5]. Most important of all, the goal of EWA is to identify various extreme weather patterns (e.g., tropical depression), which is essentially object detection, one of the three benchmarks of the AIBench subset. In 2018, a deep learning based EWA implementation [7] won the Gordon Bell Prize, making it the first AI application to win this award.
Image Classification is widely used in many applications in commercial fields and is a fundamental task in AI research. With the development of large-scale deep learning, Image Classification has become a well-known showcase for optimizing HPC AI systems [27–30], as summarized in Table 1. Image Classification is also one of the three benchmarks of the AIBench subset. We exclude Learning to Rank because it has the lowest computational complexity in terms of FLOPs, only 0.08 MFLOPs for a single forward computation. According to [10], Image Classification and Object Detection are more complex by one or two orders of magnitude, respectively.
Given the stochastic nature of AI, we need to ensure repeatability by choosing relatively stable workloads among the various AI tasks. According to the randomness analysis of AIBench [10], the two most repeatable AI benchmarks are Object Detection and Image Classification, whose variations are 0% and 1.12%, respectively. So they satisfy the property of a good benchmark: being repeatable.
For comprehensive evaluation, the workloads we choose have distinct scaling characteristics. We use the scaling ratio to depict the difficulty of scaling a workload from a single node to multiple nodes. As shown in Table 5, the scaling ratios of EWA and Image Classification are 16.85 and 117.76, respectively, reflecting very different scaling characteristics.
When ranking HPC AI systems, we consider not only the performance but also the achieved quality. Different AI tasks have different levels of stringency in their quality requirements, and our benchmark decision also considers this factor. Of our two benchmarks, EWA has a much more stringent quality requirement than Image Classification.
5.2.1 Extreme Weather Analytics (EWA)

Dataset. The EWA dataset [50] is made up of 26 years of climate data. The data of every year is available as one HDF5 file. Each HDF5 file contains two data sets: images and boxes. The images data set has 1460 example images (4 per day, 365 days per year) with 16 channels. Each channel is 768 * 1152, corresponding to one measurement per 25 square km on earth. The boxes data set records the coordinates of four kinds of extreme weather events in the corresponding images: tropical depression, tropical cyclone, extratropical cyclone, and atmospheric river.
Model. Faster-RCNN targets real-time object detection [51]. As one of the latest models in the RCNN family [52, 53], it deprecates the selective search used in previous RCNN versions. Instead, Faster-RCNN proposes a dedicated convolutional neural network, named the region proposal network (RPN), to achieve nearly cost-free region proposals. With this design, object detection is much faster. As a result, Faster-RCNN won 1st-place entries in ILSVRC'15 (ImageNet Large Scale Visual Recognition Competition).
Quality. The target quality is mAP@[IoU=0.5] = 0.35, which is our best training result. mAP means the mean average precision, a dedicated metric for object detection. IoU means the intersection over union, used to measure how much the predicted boundary overlaps with the ground truth.

5.2.2 Image Classification

Dataset. ImageNet [2] is a large visual database designed for use in visual object recognition research. More than 14 million images have been hand-annotated according to the WordNet hierarchy. Both the original images and bounding boxes are provided. The data size is more than 100 GB.
Model. ResNet is a milestone in Image Classification [54], marking the ability of AI to identify images beyond human performance in a particular domain. The spirit of ResNet is its success in reducing the negative impact of the degradation problem: in a very deep neural network, the gradient gradually vanishes during back-propagation, leading to poor performance. With ResNet, it is possible to build a deeper convolutional neural network and achieve higher accuracy. Researchers successfully built a ResNet with 152 layers, and this ultra-deep model won all the awards in ILSVRC'15.
Quality. The target quality is Top-1 Accuracy = 0.763.

Table 4: The summary of the image data sets of the HPC AI500 V2.0 benchmarks.

| Dataset | Channels | Resolution | Size |
|---|---|---|---|
| The extreme weather dataset [50] | 16 | 768*1152 | 558 GB |
| ImageNet dataset [2] | 3 | 256*256 | 137 GB |
Table 5: The scaling ratio of the HPC AI500 V2.0 workloads.

| Workload | Comm (Parameters/Step) | Comp (GFLOPs/Step) | Comp/Comm (GFLOPs/Parameters) |
|---|---|---|---|
| EWA | 41 million | 691 | 16.85 |
| Image Classification | 25 million | 2944 | 117.76 |
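The Comp/Comm column is simply the per-step computation divided by the per-step communication volume; a quick check of Table 5 (our own sketch, not benchmark code):

```python
def scaling_ratio(gflops_per_step, params_per_step_millions):
    """Comp/Comm ratio from Table 5: GFLOPs per step divided by the
    parameters (in millions) communicated per step."""
    return gflops_per_step / params_per_step_millions

print(scaling_ratio(691, 41))    # EWA: ~16.85
print(scaling_ratio(2944, 25))   # Image Classification: 117.76
```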
The reference implementations of the HPC AI500 V2.0 benchmark are summarized in Table 6. At present, we provide implementations using TensorFlow [41], a popular deep learning framework in the HPC community [55]. For communication, we adopt Horovod [56] instead of the default gRPC protocol in TensorFlow, which does not scale to large clusters [57] due to the limitations of its master-slave architecture and socket-based communication. Horovod is a library originally designed for scalable distributed deep learning using TensorFlow. It implements all-reduce operations using ring-based algorithms [58] and other highly efficient communication algorithms that are widely used in the traditional HPC community.
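To illustrate this setup, here is a minimal sketch of synchronous data-parallel training with Horovod on TensorFlow 1.x, in the spirit of the reference implementations; the model, optimizer, and learning rate are placeholders, not the benchmark code itself.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU

# Pin each process to its local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = build_model()  # hypothetical placeholder for the benchmark model

# Scale the learning rate by the number of workers (linear scaling rule),
# and wrap the optimizer so gradients are averaged via ring all-reduce.
opt = tf.train.MomentumOptimizer(0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```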
5.4 Metrics

We propose two metrics, called Valid FLOPS (in short, VFLOPS) and Valid FLOPS per watt (in short, VFLOPS per watt), to quantify the valid performance and energy efficiency, considering both the system throughput and the model quality. The goal of these two metrics is to impose a penalty on failing to achieve a target quality. VFLOPS and VFLOPS per watt are calculated according to the following formulas.
VFLOPS = FLOPS × penalty_coefficient    (1)

The penalty coefficient is used to penalize or reward the FLOPS if the achieved quality is lower or higher than the target quality. It is defined as follows:

penalty_coefficient = (achieved_quality / target_quality)^n    (2)

Here, achieved_quality is the actual model quality achieved in the evaluation, and target_quality is the state-of-the-art model quality predefined in our benchmarks (Table 6). The value of n is a positive integer that defines the sensitivity to the model quality: the larger n is, the greater the penalty for a quality drop. As EWA has a much more stringent quality requirement than Image Classification, we set n to 10 for EWA and 5 for Image Classification by default. We propose VFLOPS per watt to evaluate energy efficiency.
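A small sketch of the metric computation (Eqs. 1 and 2); the numbers in the example are illustrative, not measured results.

```python
def vflops(flops, achieved_quality, target_quality, n):
    """Valid FLOPS (Eqs. 1 and 2): FLOPS scaled by the quality penalty."""
    penalty_coefficient = (achieved_quality / target_quality) ** n
    return flops * penalty_coefficient

def vflops_per_watt(flops, achieved_quality, target_quality, n, avg_power_watts):
    """Valid FLOPS per watt: the energy-efficiency counterpart of VFLOPS."""
    return vflops(flops, achieved_quality, target_quality, n) / avg_power_watts

# Image Classification (n = 5): a system reaching 0.755 instead of the
# 0.763 target at 30 PFLOPS is scored below its raw throughput.
print(vflops(30e15, 0.755, 0.763, n=5) / 1e15)  # ~28.4 "valid" PFLOPS
```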
Table 6: The HPC AI500 V2.0 benchmark suite. Comm Lib refers to the communication libraries; AI Acc Lib refers to the AI accelerator libraries.

| Problem Domain | Model | Dataset | Target Quality | AI Framework | Comm Lib | AI Acc Lib | Epochs |
|---|---|---|---|---|---|---|---|
| EWA | Faster-RCNN [51] | EWA [50] | mAP@[IoU=0.5]=0.35 | TensorFlow | Horovod | CUDA, cuDNN, NCCL | 50 |
| Image Classification | ResNet-50 v1.5 [54] | ImageNet [2] | Top-1 Accuracy=0.763 | TensorFlow | Horovod | CUDA, cuDNN, NCCL | 90 |
6 Benchmarking Rules

For the fairness and equivalence of benchmarking different HPC AI systems, a series of clear and unambiguous benchmarking rules is mandatory. Our fundamental benchmarking rule is that we put each independent layer (shown in Fig. 2) under test while keeping the other layers intact. Furthermore, for the hardware-level and system-level benchmarking presented in Section 4, we give a detailed description from the perspective of each layer. Finally, we introduce the benchmarking procedures.
6.1 The Hardware-Level Benchmarking Rules

Based on our nine-layer model (Fig. 2), we specify the rules of each layer from top to bottom.

6.1.1 Problem Domain Layer

• The dataset and target quality must be in accordance with the specification of the HPC AI500 V2.0 benchmark discussed in Section 5.
• The number of training epochs should be the same as in the reference implementation to guarantee an equivalent computational cost, namely 90 epochs for ImageNet and 50 epochs for EWA. Note that an epoch is an iteration over the entire data set, while a step refers to one update of the model parameters. The number of epochs is based on our experimental observation, and it should be updated in the future, as should the target qualities.
6.1.2 Hyper-parameters Setting Layer

The rules of the hyper-parameters setting layer include three parts: the batchsize setting, the learning rate policies, and the other hyper-parameter settings.

Batchsize Setting. The batchsize of a training step is allowed to change, to fully utilize the computing capability of the system.
Learning Rate Policies. Previous work shows that increasing the batchsize leads to a fall in model quality [59]. In this context, many learning rate policies have been proposed [19, 20, 26, 60]. With state-of-the-art learning rate policies, we can increase the training batchsize to fully utilize the hardware resources while preserving the model quality. As each learning rate policy has its limitation in terms of the maximum supported batchsize, our rules allow benchmark users to propose new learning rate policies to fully utilize the hardware resources. Meanwhile, we provide a default learning rate policy.
The default learning rate policy: The default learning rate policy of HPC AI500 is a linear scaling rule plus a warm-up rule, described as follows (see the sketch after Fig. 7):

• The linear scaling rule: multiply the base learning rate η by k when the batchsize is multiplied by k. The goal of the linear scaling rule is to make SGD updates similar in both distributed and single-worker training [19].
• The warm-up rule: gradually increase the learning rate from a small value until it equals η × k. After warm-up, the learning rate follows the original learning rate schedule (e.g., cosine decay). The warm-up rule is needed because the linear scaling rule alone breaks down when the weights of the neural network are changing rapidly in the early training stage [19].

Fig. 7a shows the learning rate curve after applying the linear scaling and warm-up rules. We also perform a series of experiments to show the effect of this policy on model quality. As shown in Fig. 7b, the linear scaling and warm-up rules improve the top-1 accuracy of Image Classification from 61.48% to 76.34% when the batchsize is 8192.
Figure 7: (a) The learning rate curve and (b) its effect on accuracy with the linear scaling and warm-up rules. The benchmark is Image Classification and the system scale is 64 GPUs. The experiment configuration is consistent with that of Table 9. The batchsize is 8192.
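The following sketch shows one way to implement the default policy; the warm-up length and the post-warm-up cosine decay are illustrative choices, not values mandated by the rules.

```python
import math

def default_lr(step, base_lr, k, warmup_steps, total_steps):
    """Linear scaling plus warm-up (Section 6.1.2), followed by cosine decay."""
    target_lr = base_lr * k  # linear scaling rule: scale by the batchsize factor k
    if step < warmup_steps:
        # Warm-up rule: ramp linearly from near zero up to base_lr * k.
        return target_lr * (step + 1) / warmup_steps
    # After warm-up, follow the original schedule; cosine decay is one example.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * target_lr * (1.0 + math.cos(math.pi * progress))
```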
Other learning rate policies: Besides the linear scaling and warm-up scheme, state-of-the-art learning rate policies (e.g., LARS [26] and LAMB [60]) are allowed. For newly proposed ones, benchmark users should open-source their methods.
Other Hyper-parameters Settings. Except for the batchsize and the learning rate policy, the other hyper-parameters, such as weight decay and momentum, must be the same as in the reference implementation.

6.1.3 Workload Layer
The AI algorithms in the workload must be the same as in the reference implementation.

6.1.4 Programming Model Layer

• Data parallelism and model parallelism are both allowed as long as mathematical equivalence is preserved.
• Synchronous Stochastic Gradient Descent (SGD) must be used in training, since asynchronous SGD may (a) introduce randomness, (b) destroy the mathematical equivalence, and (c) decrease the accuracy.
6.1.5 AI Framework Layer

The AI framework must be the same as in the reference implementation.
6.1.6 Communication Layer

In synchronous communication, the workers in the cluster must wait until all the workers have finished before proceeding to the next iteration. We allow different communication policies in a synchronous mode.
Table 7: Some common communication topologies for all-reduce.

| Topology | Applications |
|---|---|
| Butterfly | OpenMPI [61] |
| Double binary tree | NCCL [62] |
| Ring | Baidu DeepSpeech [63], Horovod [56] |
| Hierarchical ring | Horovod |

• Based on AllReduce. Table 7 shows the common topologies used in AllReduce implementations. The benchmark users are allowed to utilize these existing topologies or to propose new ones according to the configuration of their systems. For example, researchers from Lawrence Berkeley National Laboratory achieved exascale FLOPS by customizing a communication topology of AllReduce on Summit [7].
• Based on MapReduce. The communication topology is determined by the implementation of MapReduce. The distributed training of Spark MLlib, SystemML, and REEF is based on MapReduce. Users are allowed to implement customized MapReduce on their systems.
• Based on a parameter server. It is mandatory that only the synchronous mode be used for the parameter server, although it also supports asynchronous training.

6.1.7 AI Accelerator Layer

• Benchmark users can choose the AI accelerator library to achieve the best performance out of the system.
• Single-precision floating point (FP32), half precision (FP16, BFLOAT16 [31]), and quantization (INT8, INT4) are allowed.

6.1.8 OS Layer

• Benchmark users can adjust OS configurations (such as the CPU-affinity setting) to achieve the best performance out of the system.
• Benchmark users can choose the '-O2' compiler optimization option when compiling the benchmarks and the runtime environment software.
6.1.9 Hardware Layer

Benchmark users can adjust hardware configurations (such as the hyper-threading setting and the memory-prefetching setting) to achieve the best performance out of the system.
6.2 The System-Level Benchmarking Rules

As discussed in Section 4.1, at the system level, we put the hardware system and the AI framework under test. Therefore, in addition to the rules defined at the hardware level, benchmark users are allowed to re-implement the benchmark using a different or even customized AI framework at the AI framework layer.
6.3 Benchmarking Procedures

Benchmark users need to download the source code of the benchmarks from the BenchCouncil web site.

• Timing rules: timing starts when the workload reads the first batch of training data and ends when the target number of epochs is reached.
• Runs: according to the variations of EWA and Image Classification in Table 2, the minimum numbers of runs are 5 and 10, respectively, to reduce run-to-run variation. For reporting, we drop the runs with the highest and lowest results, then calculate the arithmetic mean of the remaining results, as sketched below.
• Benchmarking scores: 1) time-to-quality is the training time to the achieved quality; 2) FLOPS refers to the single-precision floating point operations (or equivalent operations) per second, where the equivalent operations include but are not limited to FP16, BFLOAT16, INT8, and INT4; 3) VFLOPS and VFLOPS per watt follow the definitions in Section 5.4.
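A sketch of the reporting rule above (drop the extreme runs, average the rest); the run values are illustrative, not measured data.

```python
def reported_score(run_results):
    """Drop the highest and lowest runs, then report the arithmetic mean
    of the remaining results (Section 6.3)."""
    trimmed = sorted(run_results)[1:-1]
    return sum(trimmed) / len(trimmed)

# Five illustrative time-to-quality results (minutes) from five valid runs.
print(reported_score([28.1, 27.5, 29.0, 27.9, 28.4]))  # mean of the middle three
```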
The reported results should include the following parts:

• The description of the system under test, including but not limited to: 1) detailed descriptions of the parameters of the CPUs and AI accelerators in a single node; 2) detailed descriptions of the parameters of the intra-node connections in a single node; 3) detailed descriptions of the parameters of the OS in a single node; 4) detailed descriptions of the parameters of the runtime environment software in a single node; 5) detailed descriptions of the parameters of the inter-node connections in the system; 6) detailed descriptions of the parameters of the runtime environment software in the system.
• Benchmark configurations, including but not limited to: 1) all hyper-parameter settings; 2) detailed descriptions of the communication.
• Benchmarking scores, including time-to-quality, FLOPS, FLOPS per watt, VFLOPS, and VFLOPS per watt in all runs. These metrics should be submitted with the output log of the benchmark.
• The source code, relevant documents, and running scripts should be uploaded to BenchHub, the official code repository managed by BenchCouncil.

The BenchCouncil community is responsible for checking the replicability of the reported results and reviewing the code.
A lot of previous work [27–30] focuses on accelerating Image Classification/ResNet-50 training. These efforts reduce the training time from hours to minutes. In this section, we take Image Classification as an example to explain why equivalent benchmarking rules matter for fairly ranking HPC AI systems.

Batch normalization is a common and effective method to improve model generalization [64]. The trainable parameters of batch normalization, γ and β, are used to restore the representation ability of the network. Jia et al. [29] propose eliminating the weight decay on the γ and β of the batch normalization layers, which is a significant algorithmic innovation in their work. We re-implement this algorithm-level optimization in accordance with [29]. Further, we use VFLOPS as the metric to quantify the performance gap.

The benchmarking results are shown in Table 8. The accuracy gain and the corresponding VFLOPS ratio are reported against the version without removing the weight decay. We find that as the system scale becomes larger, this optimization has a greater impact on the achieved quality. The accuracy gain is 0.45% at the scales of 16 and 32 GPUs, and then jumps to 1.38% at the scale of 64 GPUs, which is a notable improvement. We calculate the VFLOPS ratio according to the formula discussed in Section 5.4 for each system scale. At the system scale of 64 GPUs, the VFLOPS ratio is as high as 1.10, which is essentially the gain contributed solely by the algorithmic innovation.

Consider the following case: we perform a comparison between two HPC AI systems using the same benchmark. One benchmark user leverages this algorithmic innovation, while the other does not. If we do not exclude this case in the benchmarking rules, the benchmarking results will be unfair. That is the reason why we mandate that the other hyper-parameter settings in Layer 8 be kept intact, as shown in Fig. 2. One may question why our rules allow changing the learning rate policies in Layer 8 (Fig. 2). As discussed in Section 6, this is because, to fully utilize the hardware resources, the users have to change the learning rate policies.
Table 8: The impact of removing the weight decay on the batch normalization (BN) layers at different system scales. The benchmark is Image Classification and the accuracy is measured by top-1 accuracy. The VFLOPS ratio refers to the ratio of the VFLOPS after the optimization to the one without the optimization.

| System Scale | Batchsize | Accuracy Gain | VFLOPS Ratio |
|---|---|---|---|
| 16 GPUs | 2048 | +0.45% | 1.03 |
| 32 GPUs | 4096 | +0.45% | 1.03 |
| 64 GPUs | 8192 | +1.38% | 1.10 |

7 The HPC AI Roofline Performance Model
Given a specific HPC AI system, the theoretical peak performance number can be calculated according to the hardware configurations. However, the theoretical peak is hard to achieve. Hence, we need a performance model to help achieve the upper bound performance of an HPC AI system. The original Roofline model [65] is an upper bound performance model based on FLOPS and operation intensity (OI): the total number of floating point instructions divided by the total number of bytes of memory accesses. With the aid of a Roofline model, we can decide whether a workload is memory-bound or compute-bound. Moreover, potential optimization strategies can be recommended according to the different ceilings of the Roofline model. To date, there is no such performance model available for HPC AI systems. In this section, we first analyze the distinctive characteristics of an HPC AI system, and then propose an HPC AI Roofline model.
An HPC AI system is a distributed system consisting of multiple nodes, each of which is heterogeneous and equipped with multiple CPUs and AI accelerators, as shown in Fig. 8. The CPUs of each node are responsible for scheduling tasks and communicating with other nodes. The AI accelerators are responsible for the AI calculations. Each AI accelerator loads or stores data from its memory units through memory channels, and all AI accelerators of a node are connected with a specific high-speed network (e.g., NVLink for GPUs). The distributed nodes are interconnected by a general high-speed network (e.g., high-speed Ethernet). Hence, the communications include both inter-node and intra-node ones. Our analysis in Section 8.4 reveals that communication efficiency is one of the dominant factors that impact performance.
Figure 8: The architecture of an HPC AI system.
When proposing the HPC AI Roofline models, we consider the distinctive characteristics of HPC AI systems and the huge impact of communication efficiency on their performance. Significantly different from the original Roofline model [65], which emphasizes the impact of computation (FLOPS) and memory access (OI) on the overall performance, our HPC AI Roofline model emphasizes the impact of communication and computation. We propose an innovative metric, named communication operation intensity (in short, COI), to replace OI. COI is defined as the total number of floating point instructions divided by the total number of communication bytes.

Considering the different modes of inter-node communication (general high-speed network) and intra-node communication (specific high-speed network), our HPC AI Roofline model is a combination of a single-node model and a distributed model.

We use FLOPS as the metric to depict the upper bound performance. Unlike the original Roofline model [65], which uses double-precision floating point operations per second, we use single-precision floating point operations, or equivalent operations such as mixed-precision floating point operations, per second. This is because double-precision floating point operations are rarely required for deep learning workloads, while single-precision or mixed-precision floating point operations are prevalent.

Intentionally, we do not choose VFLOPS as the performance metric. This is because the purpose of the Roofline model is to decide the performance bound of a workload and guide its system-level and hardware-level optimizations. Instead, VFLOPS is a composite metric reflecting both performance and accuracy to rank HPC AI systems.
The single-node HPC AI Roofline model is formulated as follows:

FLOPS_Attained = min(FLOPS_Peak, ComBand_Peak × COI)    (3)

ComBand_Peak is the theoretical peak communication bandwidth of a single-node HPC AI system, which is the bandwidth of the interconnections among the AI accelerators. FLOPS_Peak is the theoretical peak FLOPS of a single-node HPC AI system, which is the aggregate theoretical peak FLOPS of all AI accelerators. The communication operation intensity COI is obtained by COI = FLOPs / CT, where CT is short for the communication traffic: the total number of communication bytes among the AI accelerators. To more accurately reflect the performance bottleneck of a given workload, different ceilings are added to help locate the bottlenecks and provide potential optimization recommendations.

We use CONV (convolution) and GEMM (GEneral Matrix to Matrix Multiplication) to measure the upper bound performance of the system. On one hand, they are the two most frequently appearing kernel functions of the seventeen benchmarks of AIBench; on the other hand, their computing patterns, i.e., fusable multiply and add calculations, allow them to make more efficient use of the accelerators.

FLOPS_Attained is the performance that a workload can attain, and the attained performance bound of a given workload under the ceilings is formulated as follows:

FLOPS_Attained = min(FLOPS_Ceiling, ComBand_Ceiling × COI)    (4)
Ceiling ∗ COI ) (4) For the distributed model, we propose using COI (communication operation intensity) and FLOPS todepict the upper bound performance. The model is formulated as follows.
FLOPS
Attained = Min ( FLOPS
Peak , ComBand
Peak ∗ COI ) (5)The ComBand
Peak is the theoretical peak communication bandwidth of the distributed system, i.e., thetheoretical bandwidth of the high speed Ethernet.
FLOPS
Peak is the theoretical peak FLOPS of thedistributed system, which is the aggregate theoretical FLOPS of all AI accelerators in the distributedsystem. The communication operation intensity–
COI is obtained by
COI = FLOPs / CT , where thecommunication traffic– CT is the total byte number of communications among all AI accelerators in thedistributed system. To more accurately reflect the performance bottleneck of a given workload, we addseveral ceilings, and the attained performance bound of a given workload is formulated as follows. FLOPS
Attained = Min ( FLOPS
Ceiling , ComBand
Ceiling ∗ COI ) (6)19 a) The Single-Node version. (b)
The Distributed version.
Figure 9:
The HPC-AI Roofline Model.
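The following sketch evaluates the attained-performance bound of Eqs. 3 to 6; the per-step FLOPs and communication traffic in the example are illustrative values in the spirit of Table 5 (assuming FP32 gradients of 4 bytes per parameter), not measured numbers.

```python
def roofline_bound(flops_peak, comband_peak, flops_per_step, comm_bytes_per_step):
    """Attained performance bound (Eqs. 3/5); also works for ceilings (Eqs. 4/6)
    by passing FLOPS_Ceiling and ComBand_Ceiling instead of the peaks."""
    coi = flops_per_step / comm_bytes_per_step  # communication operation intensity
    return min(flops_peak, comband_peak * coi)

# Single-node example: 8 V100s, 1040 TFLOPS mixed-precision peak, 300 GB/s NVLink.
# Per-step values mimic Image Classification in Table 5: 2944 GFLOPs and
# 25 million FP32 gradients (~100 MB) exchanged per step.
bound = roofline_bound(flops_peak=1040e12, comband_peak=300e9,
                       flops_per_step=2944e9, comm_bytes_per_step=25e6 * 4)
print(f"{bound / 1e12:.0f} TFLOPS")  # compute-bound in this example
```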
We perform a case study of our HPC AI Roofline models on an experimental system. The system consists of eight nodes, each of which is equipped with one Intel(R) Xeon(R) Platinum 8268 CPU and eight NVIDIA Tesla V100 GPUs. Each GPU in a node has 32 GB of HBM memory and is connected by NVIDIA NVLink, a high-speed GPU interconnect with a theoretical peak bi-directional bandwidth of 300 GB/s. The nodes are connected with an Ethernet network with a bandwidth of 10 Gb/s. Each node has 1.5 TB of system memory and an 8 TB NVMe SSD disk.
As shown in Fig. 9a, the y-axis is the performance in terms of floating point operations per second, while the x-axis is the communication operation intensity: the floating point operations divided by the total number of communication bytes. In Fig. 9a, the peak computation rate forms the 'flat' part, while the communication bandwidth forms the 'slanted' part. So, if the communication operation intensity is low, the workload is communication-bound, falling under the slanted part of the roofline. With sufficient communication operation intensity, the workload is compute-bound.

We add four computation ceilings: mixed-precision GEMM (the performance of the mixed-precision floating point implementation of GEMM), single-precision GEMM, mixed-precision CONV, and single-precision CONV. The single-precision setting is commonly used in the AI domain, while mixed precision is one of the optimization features of some advanced AI accelerators.

The best case for eight GPUs is that communication and computation totally overlap, and the memory bandwidth becomes the bottleneck. We therefore add one communication ceiling: the memory bandwidth. In Fig. 9a, the theoretical peak mixed-precision FLOPS, the mixed-precision GEMM ceiling, the mixed-precision CONV ceiling, the single-precision GEMM ceiling, and the single-precision CONV ceiling are 1040 TFLOPS, 636 TFLOPS, 176 TFLOPS, 115 TFLOPS, and 112 TFLOPS, respectively. Note that the gap between the theoretical peak number and the actual one exists because the performance of CONV and GEMM is affected by the dimensions and sparsity of the input data, the NCHW format, and the output channels. Additionally, the convolution kernel also greatly impacts the performance of CONV. Different input sizes of CONV and GEMM lead to different performance numbers. The NVLink ceiling is the theoretical peak bandwidth of the communications among GPUs, 300 GB/s, and the memory bandwidth ceiling is the theoretical peak bandwidth of the memory, 1134 GB/s.
Our system consists of eight nodes. All the GPUs in the same node are connected by NVIDIA NVLink, and the nodes are connected with a 10 Gb/s Ethernet network. In Fig. 9b, the peak computation rate forms the 'flat' part, while the communication bandwidth (the Ethernet networking bandwidth) forms the 'slanted' part. The theoretical peak FLOPS of the system is 8320 TFLOPS, and the communication ceiling is 1.2 GB/s. We add four computation ceilings: mixed-precision GEMM, single-precision GEMM, mixed-precision CONV, and single-precision CONV; their values are 5091, 920, 2376, and 976 TFLOPS, respectively. The best case for the HPC AI system is that all communications stay within the nodes, so we add one communication ceiling, the NVLink bandwidth, which is 300 GB/s.

Table 9: Hardware configuration details.

System configurations:
  Num of nodes: 8
  GPUs per node: 8
  Total num of GPUs: 64
  Peak theoretical performance (FP32): 960 TFLOPS
  Peak theoretical performance (Mixed): 7680 TFLOPS
  Interconnection: Ethernet, 10 Gb/s

Single-node configurations:
  CPU type: Intel(R) Xeon(R) Platinum 8268 CPU
  Memory: 1.5 TB, DDR4
  Disk: 8 TB, NVMe SSD
  GPU type: NVIDIA Tesla V100
  GPU memory: 32 GB, HBM
  Intraconnection: NVLink
8 Evaluation

In this section, we introduce the experimental configurations in Section 8.1 and present how we measure FLOPs in Section 8.2. Then, we perform an in-depth performance analysis of a single node in Section 8.3 and of multiple nodes in Section 8.4, respectively. Finally, we demonstrate how to use our roofline models to guide the optimizations of HPC AI systems in Section 8.5.
8.1 Experimental Configurations

Our experiments are conducted on an HPC AI system consisting of eight nodes, each of which is equipped with one Intel(R) Xeon(R) Platinum 8268 CPU and eight NVIDIA Tesla V100 GPUs. Each GPU in the same node has 32 GB HBM memory and is connected by NVIDIA NVLink, a high-speed GPU interconnect whose theoretical peak bi-directional bandwidth is 300 GB/s. The nodes are connected with a 10 Gb/s Ethernet network. Each node has 1.5 TB of system memory and an 8 TB NVMe SSD disk.

Each NVIDIA Tesla V100 GPU is based on the NVIDIA Volta architecture, which is equipped with 640 Tensor Cores to accelerate GEMM and convolution operations. Each Tensor Core performs 64 floating-point fused multiply-add (FMA) operations per clock, delivering up to 125 TFLOPS of theoretical peak performance. When performing mixed-precision training with Tensor Cores, we use FP16 for calculation and FP32 for accumulation [18].

We use TensorFlow v1.14, compiled with the CUDA v10.1 and cuDNN v7.6.2 backend. We use Horovod v0.16.4 for synchronous distributed training, compiled with OpenMPI v3.1.4 and NCCL v2.4.8. NCCL is short for the NVIDIA Collective Communications Library, a closed-source library of topology-aware multi-GPU collective communication primitives.
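For context, the following is a minimal sketch of how synchronous data-parallel training is typically wired up on this software stack (Horovod with the TensorFlow 1.x API); it is illustrative rather than the benchmark's actual code:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one training process (rank) per GPU

# Pin each rank to its local GPU; this config would be passed to the session.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of ranks (linear scaling rule),
# then wrap the optimizer so gradients are averaged with NCCL allreduce.
opt = tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variables from rank 0 so all ranks start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```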
8.2 FLOPs Measurement

The source-code-level measurement of FLOPs is difficult for a complex AI model implemented with a complex AI framework. Mainstream frameworks like TensorFlow and PyTorch adopt computational graphs and map them to specific computing engines, e.g., GPUs and cuDNN. This process invokes numerous kernels, each of which contributes a portion of the FLOPs. Hence, we would need to figure out the implementation of each invoked kernel to obtain the FLOPs of an entire AI model. Unfortunately, the source code is not publicly available, as the NVIDIA libraries like CUDA and cuDNN are not open source.

We instead use NVProf [66], a performance analysis tool for NVIDIA GPUs, to measure the FLOPs in our experiments. NVProf can collect profiling data from hardware performance counters, but it has a huge overhead, slowing down execution by more than a hundred times; profiling the whole training session of a deep learning model is thus prohibitively costly. Previous work [67, 68] has found that each iteration of model training has the same computation logic and that the number of iterations has little impact on micro-architectural behaviors. So, for efficiency, we sample a part of the training set and calculate the FLOPs. As the EWA and ImageNet datasets contain 13.14k and 1280k images, respectively, we sample 500 images from EWA and 12800 images from ImageNet. The throughput is calculated according to the following equation: Throughput = N × R × C, where N is the number of images processed by each training process per second, R is the total number of ranks (the number of training processes), and C is the FLOPs per image.

Table 10: The FLOPs per image.

Dataset                Image Sample Size    Total FLOPs       FLOPs Per Image
EWA                    500                  345.66 TFLOPs     691 GFLOPs
Image Classification   12800                2877.06 TFLOPs    23 GFLOPs
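As a worked example of the throughput equation (the helper name is ours; the numbers come from Tables 10 and 11):

```python
def throughput_tflops(n_images_per_sec, r_ranks, c_gflops_per_image):
    # Throughput = N * R * C, converted from GFLOPS to TFLOPS.
    return n_images_per_sec * r_ranks * c_gflops_per_image / 1e3

# EWA on one node: 8 ranks processing 46 / 8 images/s each at
# 691 GFLOPs per image reproduces the ~31 TFLOPS reported in Table 11.
print(throughput_tflops(46 / 8, 8, 691))  # -> ~31.8
```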
Table 11: The performance summary of a single node.

Workloads              Model            Precision   GFLOPs (Per Image)   Throughput (Images/s)   Attainable Performance (TFLOPS)   Achieved Performance Ratio (%)
Image Classification   ResNet-50 V1.5   FP32        23                   2624                    58                                48
Image Classification   ResNet-50 V1.5   Mixed       23                   5734                    126                               105
EWA                    Faster R-CNN     FP32        691                  46                      31                                26

The attainable performance refers to the performance obtained in the testing. The achieved performance ratio refers to the ratio of the attainable performance to the theoretical peak performance (FP32). Mixed refers to FP32 & FP16 mixed precision.
Figure 10: The details of the single-node performance analytics of Image Classification (FP32 and Mixed) and EWA (FP32). We classify the kernels invoked on the GPU into eight categories (convolution, GEMM, batch normalization, element-wise, pooling, memcpy, NCCL Allreduce, and data arrangement) and use three metrics to depict their characteristics: the proportion of time, instructions per cycle (IPC), and DRAM utilization. The GPU utilization during the overall training session is also recorded. An asterisk (*) indicates that the number is negligible, i.e., less than 0.001%.

8.3 Single-node Evaluation

In this subsection, we first report the execution efficiency on a single node, and then perform communication and computation analytics to uncover the factors that significantly impact the performance. We use the HPC AI500 V2.0 benchmarks.
Based on the methodology described in Section 8.2, we report the performance efficiency of the two benchmarks, Image Classification and EWA, on a single node. We evaluate both the FP32 and mixed-precision implementations; the latter uses Tensor Cores to accelerate the training session. As the memory footprint required by the mixed-precision implementation is nearly half that of FP32, we double the batch size in each training step for mixed precision without breaking the benchmarking rules defined in Section 6. Table 11 shows the performance efficiency of the two benchmarks. The achieved performance ratio is the ratio of the attainable performance to the theoretical peak performance of the FP32 implementation. In our experiments, the theoretical peak number is 120 TFLOPS, which is the theoretical peak single-precision (FP32) performance of one GPU (15 TFLOPS) multiplied by 8, the number of NVIDIA Tesla V100 SXM2 GPUs. From Table 11, we find that the performance efficiency of EWA is extremely low compared with that of Image Classification. We further characterize their computation and communication characteristics to uncover the factors.
Figure 11: The timeline of Horovod communication. The timeline is divided into a negotiation phase (NEGOTIATE_ALLREDUCE) and a processing phase (ALLREDUCE); the processing phase comprises six steps: Wait_for_data, Wait_for_other_data, Queuing, Memcpy_in, Nccl_allreduce, and Memcpy_out.
We first perform communication analytics, using a timeline analysis tool [69] to record all activities of the Horovod communication, since its synchronous distributed manner may significantly affect the performance. As shown in Fig. 11, the communication timeline of Horovod is divided into two phases: negotiation and processing. In the negotiation phase, all training processes send a signal to the first process to confirm that they are ready for the subsequent tensor reduction. In the processing phase, the tensor reduction is performed. Specifically, the processing phase is further divided into six steps. Steps 1 (Wait for data) and 2 (Wait for other data) wait for the data produced by GPU computing, which is the input to the allreduce operations. Step 3 (Queuing) happens only when the previous allreduce has not finished. Step 4 (Memcpy in) copies data into the fusion buffer. Step 5 (NCCL Allreduce) is the core part that executes the allreduce operation across all the training processes. Step 6 (Memcpy out) moves the data out of the fusion buffer.

We profile the average wall clock time of all steps and compare EWA against Image Classification. We find that the long negotiation phase is one main factor leading to the inefficient communication of EWA. As shown in Table 12, the average negotiation allreduce of EWA accounts for 28.5% of the total duration of Horovod communication, 2.5 times that of Image Classification. The root cause is the side effect of the centralized scheduling strategy of the Horovod negotiation. As mentioned before, the first process during the negotiation acts as a centralized scheduler that avoids deadlock by reordering all the allreduce operations across processes: it receives the messages from all processes and sends back the correct tensor list to be reduced. EWA needs to execute the allreduce operation more than one hundred times and has about 41 million gradients in total to be reduced during each training step, and thus spends too much time on the first process. Another factor is the sub-optimal overlap between computation and communication. According to Table 12, the total duration of Wait for data and Wait for other data in EWA and Image Classification is 4.6 ms and 112.4 ms, respectively, while the duration of NCCL Allreduce is 66.2 ms and 4.15 ms, respectively. These numbers indicate that EWA has a worse overlap between computation and communication than Image Classification. Besides, Queuing in EWA is up to about 65.8 ms, showing that the NCCL Allreduce operation has to wait for a long duration. In contrast, the duration of Queuing and Wait for data of Image Classification is 0.043 ms and 85.4 ms, respectively, indicating that Image Classification has a better overlap between communication and computation than EWA.

Table 12: The time breakdown of the Horovod communication.

Phase         Step                    EWA         Image Classification
Negotiation   Negotiation Allreduce   54.837 ms   22.836 ms
Processing    Wait for data           1.746 ms    85.418 ms
Processing    Wait for other data     2.961 ms    27.036 ms
Processing    Queuing                 65.863 ms   0.043 ms
Processing    Memcpy in               0.108 ms    1.256 ms
Processing    NCCL Allreduce          66.228 ms   4.153 ms
Processing    Memcpy out              0.197 ms    0.993 ms

In addition to the communication analytics, we also conduct computation analytics through a thorough profiling of GPU activities using NVProf [66]. Fig. 10 shows the results. There are thousands of CUDA kernel invocations during each training step. For simplicity, we classify all the kernel functions into eight categories, each representing a kind of operation: convolution, GEMM, batch normalization, element-wise, pooling, memcpy, NCCL Allreduce, and data arrangement. For EWA, we find that NCCL Allreduce (35.97%) and memcpy operations occupy 50.62% of the time in total, leading to poor performance. For Image Classification, the most time-consuming kernel is convolution, occupying 35.02% and 22.18% of the time in the FP32 and mixed-precision implementations, respectively.

We also notice that the overhead of data arrangement occupies 15.61% in the mixed-precision implementation, but less than 0.0001% in the FP32 implementation. The huge overhead in the mixed-precision implementation is incurred by converting between the different data layouts of the TensorFlow and CUDA kernels. The data layout of the TensorFlow kernels is a quadruple tuple (batch size, channels, height of data sample, width of data sample), abbreviated as NCHW, while the data layout of the CUDA kernels is a quadruple tuple (batch size, height of data sample, width of data sample, channels), abbreviated as NHWC. This inconsistency incurs a huge overhead. It explains why the speedup of the mixed-precision version of Image Classification is only 2.16x, much smaller than the results published by Nvidia [70], which claims that mixed-precision training can bring up to an 8x speedup on the Tesla V100 GPU.
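For reference, the bucketing of profiled kernels into these categories can be done by matching kernel names; the sketch below is our own illustration (the substring rules are assumptions, since real cuDNN/cuBLAS kernel names vary across versions), not the authors' tool:

```python
# Illustrative mapping from kernel-name substrings to the eight categories.
KERNEL_CATEGORIES = {
    "Convolution": ("conv", "winograd"),
    "GEMM": ("gemm",),
    "BatchNormalization": ("batchnorm", "bn_"),
    "ElementWise": ("elementwise", "eltwise"),
    "Pooling": ("pool",),
    "Memcpy": ("memcpy",),
    "NCCL Allreduce": ("allreduce", "nccl"),
    "Data Arrangement": ("transpose", "nchw", "nhwc"),
}

def categorize(kernel_name):
    """Assign a profiled kernel name to one of the eight categories."""
    name = kernel_name.lower()
    for category, keywords in KERNEL_CATEGORIES.items():
        if any(keyword in name for keyword in keywords):
            return category
    return "Other"
```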
8.4 Multiple-Node Evaluation

We perform several scaling experiments on the distributed system described in Section 8.1. Both the EWA and Image Classification experiments are scaled out from 8 GPUs to 64 GPUs, taking the 8-GPU experiments (a single node) as the baseline. Our communication topology is the double binary tree [62], which is implemented in NCCL 2.4. We report the performance numbers of these experiments and perform further analysis using the HPC AI roofline models proposed in Section 8.5. The scaling results are shown in Fig. 12.

Figure 12: The scaling experiments of EWA and Image Classification. (a) Image Classification (FP32). (b) Image Classification (Mixed). (c) EWA (FP32). (d) Image Classification (FP32+Compression). (e) Image Classification (Mixed+Compression). (f) EWA (FP32+Compression).
For the FP32 implementation of Image Classification, the parallel efficiency is 0.91, 0.85, and 0.71 on 16, 32, and 64 GPUs, respectively. For the mixed-precision implementation, the parallel efficiency is slightly lower: 0.89, 0.82, and 0.67, respectively. There is a notable loss of parallel efficiency when the system scale reaches 64 GPUs. We also notice that communication compression does not bring any performance improvement when the system scale is 32 GPUs or less; however, at 64 GPUs it contributes substantially. For the FP32 version, the performance improves from 345 to 414 TFLOPS; for the mixed version, from 718 to 939 TFLOPS. According to our HPC AI Roofline model shown in Fig. 9b, there is a performance-bound shift when the system scale changes from 32 to 64 GPUs. Specifically, when the system scale is less than or equal to 32 GPUs, Image Classification's communication ceiling is dominated by NVLink's bandwidth, and the workload is computation-bound; hence, communication compression cannot improve the performance. When the system increases to 64 GPUs, the communication ceiling is dominated by Ethernet's bandwidth, and the workload becomes communication-bound, which is why communication compression works. The highest performance of Image Classification that we achieve is 939 TFLOPS, through both mixed-precision optimization and communication compression, as shown in Fig. 12e.
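Parallel efficiency here is measured against the 8-GPU baseline; a minimal sketch of the computation (the helper is ours, and the example values are illustrative):

```python
def parallel_efficiency(attained_tflops, n_gpus, baseline_tflops, baseline_gpus=8):
    # Ratio of the measured performance to perfectly linear scaling
    # of the single-node baseline.
    return attained_tflops / (baseline_tflops * n_gpus / baseline_gpus)

# Illustrative: a 64-GPU run attaining 4x the 8-GPU baseline performance
# has a parallel efficiency of 4 / (64 / 8) = 0.5.
print(parallel_efficiency(4 * 58, 64, 58))  # -> 0.5
```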
For the FP32 implementation of EWA, the parallel efficiency is 0.50, 0.37, and 0.36 at the system scale of 16, 32, and 64 GPUs, respectively. According to the Roofline model shown in Fig. 9b, the bottleneck is always the communication bandwidth; therefore, communication compression achieves good results. With communication compression, the performance gain persists as the scale increases from 8 to 16, 32, and 64 GPUs, with speedups of 1.2, 1.4, 1.6, and 1.5, respectively. The highest performance of EWA achieved through communication compression is 109 TFLOPS.

Figure 13: The distinct communication bandwidth consumption of the FP32 implementations of EWA and Image Classification.
For EWA and Image Classification, we find that their different parallel efficiencies are due to distinct communication bandwidth consumption. As shown in Fig. 13, we measure the communication bandwidth consumption of the FP32 implementations of EWA and Image Classification. EWA consumes much more communication bandwidth than Image Classification. In contrast, the performance of Image Classification largely depends on the computation efficiency, especially when the scale is less than or equal to 32 GPUs. In conclusion, 10 Gb/s Ethernet cannot satisfy the communication requirement of EWA, and hence results in poor parallel efficiency.
The metric of VFLOPS emphasizes both performance and quality. Fig. 14 shows the rankings of HPC AI systems at different scales with mixed-precision or FP32 implementations. The highest performance is 642 TVFLOPS, achieved through the mixed-precision optimization at the scale of 64 GPUs. Meanwhile, an auxiliary metric, time-to-quality, is also reported. Generally, our metric is simple and intuitive.
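For reference, a minimal sketch of the VFLOPS computation; the power-law form of the quality penalty and the default exponent shown here are our assumptions for illustration, and the exact definition is the one given in the paper's metrics section:

```python
def vflops(flops_tflops, achieved_quality, target_quality, n=10):
    """Valid FLOPS sketch: measured FLOPS scaled by a penalty for missing
    the target quality (assumed form: (achieved / target) ** n)."""
    penalty = (achieved_quality / target_quality) ** n
    return flops_tflops * penalty
```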
Figure 14: The VFLOPS rankings of HPC AI systems using Image Classification.

8.5 The Case Study of Using HPC-AI Roofline Models
This section presents a case study on how to use our proposed HPC AI Roofline models to identify the bottlenecks and guide optimizations.
We apply the proposed roofline models to the 16-GPU HPC AI system. The theoretical peak number is calculated according to the hardware configurations shown in Table 9. We use the roofline model to identify potential bottlenecks of EWA and Image Classification. From Fig. 15, we have the following observations: EWA is bounded by the communication bandwidth, as it falls in the slanted part of the roof, while Image Classification is bounded by the computation, as it falls in the flat part.

Figure 15: The roofline model at the system scale of 16 GPUs. The ceilings shown are the peak FLOPS, the communication bandwidth, the NVLink bandwidth, and the CONV and GEMM ceilings in single and mixed precision. The blue point represents EWA, and the red point represents Image Classification.
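A small helper in the spirit of Equation (6) makes this bottleneck classification mechanical (the function and threshold logic are ours):

```python
def classify_bound(coi, flops_ceiling_tflops, comband_ceiling_gbps):
    # Communication-bound if the bandwidth term binds at this COI
    # (slanted part of the roof); compute-bound otherwise (flat part).
    communication_bound_tflops = comband_ceiling_gbps * coi / 1e3
    if communication_bound_tflops < flops_ceiling_tflops:
        return "communication-bound"
    return "compute-bound"
```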
We adopt two optimization strategies: communication compression and mixed precision optimization.
Communication compression.
In order to optimize the communication, we perform communication compression, which compresses tensors into FP16 for communication and decodes them back into FP32 for computation. This optimization halves the amount of communication in each training step, which is equivalent to doubling the communication bandwidth. As the amount of computation remains the same, the COI of EWA and Image Classification also doubles. As shown in Fig. 15, our results show that the performance of EWA increases from 25.99 to 36.97 TFLOPS after communication compression. On the other hand, the performance of Image Classification is not improved, because it is computation-bound, even though its COI indeed increases.
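On this software stack, FP16 gradient compression of this kind can be enabled through Horovod's built-in compression option; a minimal sketch (illustrative, not necessarily the benchmark's exact implementation):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
# Tensors are cast to FP16 before the NCCL allreduce and back to FP32
# afterwards, halving the bytes on the wire and thus doubling the COI.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)
```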
Mixed precision training.
In order to improve the performance of Image Classification, we adopt the mixed-precision optimization, which makes use of Tensor Cores to perform arithmetic in FP16, achieving a higher number of computation operations per second. As shown in Fig. 15, the rightmost red point represents the mixed-precision training; it brings about a 2.16x speedup. Moreover, the COI is also improved. This is because mixed-precision training requires a lower memory footprint, so we double the batch size, and the larger batch size leads to a higher COI (a higher amount of computation per step). In the near future, we will try the mixed-precision optimization for EWA, too.
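For reference, TensorFlow 1.14 exposes automatic mixed precision for Volta Tensor Cores as an optimizer/graph rewrite; the sketch below shows one way to enable it and is not necessarily how the benchmark implements it:

```python
import tensorflow as tf

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
# Rewrites eligible ops to FP16 and adds automatic loss scaling, so the
# Tensor Cores do the FP16 math while FP32 master weights are kept.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
```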
9 Related Work

We summarize the related work in chronological order (according to the publication dates of the cited papers or publicly available technical reports) from the perspectives of HPC benchmarking, AI benchmarking, and HPC AI benchmarking.
9.1 HPC Benchmarking

HPL (1994) [71] is the famous HPC benchmark used for the Top500 [13] ranking. HPL is short for High Performance Linpack, which is designed to solve dense linear equations. For the Top500 ranking, users are allowed to optimize the MPI [61] and BLAS [72] libraries to achieve the best performance. Since solving the Linpack problem is very regular, HPC systems can achieve very high performance on it; hence, the performance of HPL can be regarded as the upper bound performance of the target HPC system. HPL is open source and publicly available.

NPB (1994) [73] is the NAS Parallel Benchmark suite, whose workloads are derived from computational fluid dynamics (CFD) applications, a typical class of traditional HPC applications. Based on a pencil-and-paper specification, NPB 1.0 consists of five kernels and three pseudo-applications, and the latest NPB 3.4.1 includes 12 workloads. NPB is open source and publicly available.

HPCC (2005) [74] is the HPC Challenge benchmark suite, which includes seven different workloads. HPCC covers the spectrum of spatial and temporal locality of HPC workloads, so the HPCC benchmarks are designed for measuring a range of memory access patterns of an HPC system. HPCC is open source and publicly available from https://icl.utk.edu/hpcc/.

Graph500 (2010) [75] is designed for data-intensive supercomputer applications. The workloads of Graph500 are search and shortest-path programs on weighted undirected graphs, and they exhibit very low spatial and temporal locality. Its metric is not FLOPS but TEPS (traversed edges per second). Graph500 is open source and publicly available from https://graph500.org/.

HPCG (2013) [76] is another benchmark for the Top500 ranking. HPCG is short for High Performance Conjugate Gradients. Its computational and data access patterns are closer to those of real HPC applications. As a kernel workload extracted from traditional HPC workloads, the HPCG benchmark is intended as a complement to the High Performance Linpack (HPL) benchmark, and the FLOPS of HPCG is far lower than that of HPL on the same platform. HPCG is open source and publicly available from https://github.com/hpcg-benchmark/hpcg.

9.2 AI Benchmarking

BenchNN (2012) [77] uses neural network algorithms to re-implement the well-known PARSEC benchmark [35]. Its main purpose is to illustrate the potential application scope of neural network algorithms. The models adopted in BenchNN are simple shallow neural networks, such as the multi-layer perceptron, and thus they cannot reflect the state of the art. BenchNN is not open source so far.

DeepBench (2016) [78] is a micro-benchmark suite that aims to benchmark basic operations in deep neural networks, such as convolution and dense matrix multiplication. The methodology of DeepBench is to reflect the characteristics of these operations using different input sizes. Since only the operator level is covered, DeepBench cannot provide full-model-level evaluation. DeepBench is open source and publicly available from https://github.com/baidu-research/DeepBench.

Both Fathom (2016) [79] and TBD (2018) [80] consist of representative AI workloads covering a broad range of application domains. Their evaluations focus only on throughput while ignoring model quality. Fathom is open source and publicly available from https://github.com/rdadolf/fathom. TBD is open source and publicly available from https://github.com/tbd-ai/tbd-suite.

DawnBench (2017) [81] aims at end-to-end deep learning benchmarking; it first proposed time-to-accuracy as the main metric, which requires training a model to state-of-the-art accuracy. It has two workloads: image classification and question answering. The limitation of DawnBench is that it ignores equivalent benchmarking rules. DawnBench is open source and publicly available from https://github.com/stanford-futuredata/dawn-bench-entries.

The BenchCouncil AI benchmark suites (2018) present a series of AI benchmarking work, including AIBench [10, 11, 14, 15] for datacenter AI benchmarking, AIoTBench [82] for mobile and embedded device intelligence benchmarking, Edge AIBench [83] for edge computing benchmarking, and the previous version of HPC AI500 [45]. The BenchCouncil AI benchmarks are by far the most comprehensive AI benchmark suites, covering datacenter, IoT, edge, and HPC. For example, AIBench adopts a scenario-distilling benchmarking methodology for the first time, which considers scenario benchmarks, component benchmarks, and micro benchmarks as three indispensable parts of a benchmark suite. This methodology bridges a huge gap from real-world application deployments to simulator-based architecture research, and balances the subtly different requirements of earlier-stage benchmarking (portability and affordability for new architectures) and later-stage benchmarking (representativeness and comprehensiveness) [11]. The BenchCouncil AI benchmark suites are open source and publicly available.

BenchIP (2018) [84] focuses on benchmarking intelligent processors. It contains two sets of benchmarks: micro-benchmarks and macro-benchmarks. The micro-benchmarks consist of single-layer networks that are used for system optimizations; the macro-benchmarks consist of various neural networks that are used to offer realistic benchmarking. BenchIP also ignores equivalent benchmarking rules, and it only focuses on throughput. BenchIP is not open source so far.

MLPerf (2019) [9] includes seven benchmarks for training and five benchmarks for inference. The MLPerf training benchmark proposes a series of benchmarking rules to eliminate the side effect of the stochastic nature of AI. Nevertheless, the MLPerf rules cannot be used to assure the equivalency, repeatability, and replicability of HPC AI benchmarking, as they lack specific parallelism and communication rules. MLPerf is open source and publicly available from https://github.com/mlperf.

9.3 HPC AI Benchmarking

HPC AI500 (V1.0) (2018) [45] is the first HPC AI benchmark suite based on real-world scientific datasets, covering three representative HPC AI applications: high energy physics, cosmology, and extreme weather analytics. HPC AI500 (V1.0) is open source and publicly available.

The HPL-AI benchmark (2019) [17] is designed for 32-bit and even lower floating-point-precision AI computing. Using the solver formulation of the decades-old HPL benchmarking framework, HPL-AI strives to unite traditional HPC and state-of-the-art AI. The HPL-AI algorithm is a combination of low-precision (state-of-the-art AI precision) LU factorization and subsequent iterative refinement that brings the solution back to 64-bit accuracy (traditional HPC precision). However, the LU factorization operation is irrelevant to most AI workloads. As a micro-benchmark, HPL-AI is more suitable for evaluating the upper bound performance of an HPC AI system. The HPL-AI benchmark is open source and publicly available from https://icl.bitbucket.io/hpl-ai/.

Deep500 (2019) [85] is a reproducible, customizable benchmarking infrastructure for high-performance deep learning. It has four levels of abstraction to provide a full-stack evaluation. However, its reference implementation uses commonly used open-source datasets and simple deep learning models, and hence cannot reflect real-world HPC AI workloads. Moreover, it fails to propose rules to assure the equivalency, repeatability, and replicability of HPC AI benchmarking. Deep500 is open source and publicly available from https://github.com/deep500/deep500.

AAH (2020) [86] uses AutoML [87] to benchmark HPC AI systems. AutoML is highly compute-intensive and extensible, which fits the requirements of benchmarking HPC systems. However, as a complicated AI workload, AutoML involves many hyper-parameters, which usually makes it hard to evaluate [40]. Moreover, the variance of its essential workload, Neural Architecture Search, is as high as 6.15%, according to the evaluation in [10].
Some specific AI workloads also play an important role in evaluating HPC AI systems. ImageNet/ResNet-50 is a well-known showcase for optimizing HPC AI systems, motivating a series of studies on learning rate scheduling algorithms and efficient communication strategies [19, 23, 24, 26, 27, 29, 32].

The researchers at Facebook (2017) [19] formally propose the linear scaling rule and the warmup scheme for the first time, and summarize several pitfalls in large-scale deep learning. They finish the training in 60 minutes with a top-1 accuracy of 76.3%.

The researchers at Berkeley (2017) [26] first propose LARS (Layer-wise Adaptive Rate Scaling), a novel learning rate policy. By utilizing this policy, they successfully scale the batch size of ResNet-50 to 32K and reduce the training time to 20 minutes.

Preferred Networks (2017) [27], IBM (2017) [23], Tencent (2018) [29], Sony (2018) [28], Google (2018) [30], and Fujitsu (2019) [32] all focus on highly efficient communication strategies (scaling to larger HPC systems) and other system-level optimizations (e.g., mixed-precision training). Their learning rate policies and other algorithm-level optimizations follow the work from Facebook [19] and Berkeley [26]. These works have reduced the training time from hours to minutes. So far, the fastest training time is 74.7 seconds, from Fujitsu (2019) [32].
10 Conclusion
This paper proposes a comprehensive HPC AI benchmarking methodology that achieves the goal of being equivalent, relevant, representative, affordable, and repeatable. Following this methodology, we present open-source benchmarks and Roofline performance models for benchmarking and optimizing HPC AI systems. We propose two innovative metrics, Valid FLOPS and Valid FLOPS per watt, to rank the performance and energy efficiency of HPC AI systems. The evaluations show that our methodology, benchmarks, performance models, and metrics can measure, optimize, and rank HPC AI systems in a scalable, simple, and affordable way. The specification, source code, and benchmarking data are publicly available.
11 Acknowledgments
We thank the PengCheng Laboratory for hardware support. We also thank Shaomeng Cao, Xuhui Shao, Yongheng Liu, Changsong Liu, and Jingfei Qiu for technical support in using those systems.

References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.
[3] .
[4] S. Ravanbakhsh, J. B. Oliva, S. Fromenteau, L. Price, S. Ho, J. G. Schneider, and B. Póczos, "Estimating cosmological parameters from the dark matter distribution," in ICML, pp. 2407–2416, 2016.
[5] Y. Liu, E. Racah, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel, M. Wehner, W. Collins, et al., "Application of deep convolutional neural networks for detecting extreme weather in climate datasets," arXiv preprint arXiv:1605.01156, 2016.
[6] A. Mathuriya, D. Bard, P. Mendygral, L. Meadows, J. Arnemann, L. Shao, S. He, T. Kärnä, D. Moise, S. J. Pennycook, et al., "Cosmoflow: Using deep learning to learn the universe at scale," in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 819–829, IEEE, 2018.
[7] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica, et al., "Exascale deep learning for climate analytics," in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 649–660, IEEE, 2018.
[8] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[9] P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, et al., "Mlperf training benchmark," arXiv preprint arXiv:1910.01500, 2019.
[10] F. Tang, W. Gao, J. Zhan, C. Lan, X. Wen, L. Wang, C. Luo, J. Dai, Z. Cao, X. Xiong, et al., "Aibench: An industry standard ai benchmark suite from internet services," arXiv preprint arXiv:2004.14690, 2020.
[11] W. Gao, F. Tang, J. Zhan, X. Wen, L. Wang, Z. Cao, C. Lan, C. Luo, and Z. Jiang, "Aibench: Scenario-distilling ai benchmarking," arXiv preprint arXiv:2005.03459, 2020.
[12] J. Gray, "Database and transaction processing performance handbook," 1993.
[13] J. J. Dongarra, H. W. Meuer, E. Strohmaier, et al., "Top500 supercomputer sites," Supercomputer, vol. 13, pp. 89–111, 1997.
[14] W. Gao, C. Luo, L. Wang, X. Xiong, J. Chen, T. Hao, Z. Jiang, F. Fan, M. Du, Y. Huang, et al., "Aibench: towards scalable and comprehensive datacenter ai benchmarking," in International Symposium on Benchmarking, Measuring and Optimization, pp. 3–9, Springer, 2018.
[15] W. Gao, F. Tang, L. Wang, J. Zhan, C. Lan, C. Luo, Y. Huang, C. Zheng, J. Dai, Z. Cao, et al., "Aibench: an industry standard internet service ai benchmark suite," arXiv preprint arXiv:1908.08998, 2019.
[16] J. Zhan, L. Wang, W. Gao, and R. Ren, "Benchcouncil's view on benchmarking ai and other emerging workloads," arXiv preprint arXiv:1912.00572, 2019.
[17] https://icl.bitbucket.io/hpl-ai/ .
[18] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., "Mixed precision training," arXiv preprint arXiv:1710.03740, 2017.
[19] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch sgd: Training imagenet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[20] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," arXiv preprint arXiv:1404.5997, 2014.
[21] V. Codreanu, D. Podareanu, and V. Saletore, "Scale out for large minibatch sgd: Residual network training on imagenet-1k with improved accuracy and reduced time to train," arXiv preprint arXiv:1711.04291, 2017.
[22] S. Sridharan, K. Vaidyanathan, D. Kalamkar, D. Das, M. E. Smorkalov, M. Shiryaev, D. Mudigere, N. Mellempudi, S. Avancha, B. Kaul, et al., "On scale-out deep learning training for cloud and hpc," arXiv preprint arXiv:1801.08030, 2018.
[23] M. Cho, U. Finkler, S. Kumar, D. Kung, V. Saxena, and D. Sreedhar, "Powerai ddl," arXiv preprint arXiv:1708.02188, 2017.
[24] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, "Imagenet training in minutes," in Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, (New York, NY, USA), Association for Computing Machinery, 2018.
[25] .
[26] Y. You, Z. Zhang, J. Demmel, K. Keutzer, and C.-J. Hsieh, "Imagenet training in 24 minutes," arXiv preprint arXiv:1709.05011, 2017.
[27] T. Akiba, S. Suzuki, and K. Fukuda, "Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes," arXiv preprint arXiv:1711.04325, 2017.
[28] Y. Tanaka and Y. Kageyama, "Imagenet/resnet-50 training in 224 seconds," 2018.
[29] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, et al., "Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes," arXiv preprint arXiv:1807.11205, 2018.
[30] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, "Image classification at supercomputer scale," arXiv preprint arXiv:1811.06992, 2018.
[31] https://cloud.google.com/tpu/docs/bfloat16 .
[32] M. Yamazaki, A. Kasagi, A. Tabuchi, T. Honda, M. Miwa, N. Fukumoto, T. Tabaru, A. Ike, and K. Nakashima, "Yet another accelerated sgd: Resnet-50 training on imagenet in 74.7 seconds," arXiv preprint arXiv:1903.12650, 2019.
[33] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
[34] .
[35] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The parsec benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 72–81, 2008.
[36] .
[37] J. Gray, "The benchmark handbook for database and transaction systems," Morgan Kaufmann, San Mateo, 1993.
[38] J. Bartlett and C. Frost, "Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables," Ultrasound in Obstetrics and Gynecology: The Official Journal of the International Society of Ultrasound in Obstetrics and Gynecology, vol. 31, no. 4, pp. 466–475, 2008.
[39] .
[40] A. Yang, P. M. Esperança, and F. M. Carlucci, "Nas evaluation is frustratingly hard," arXiv preprint arXiv:1912.12522, 2019.
[41] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
[42] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, pp. 8026–8037, 2019.
[43] C. Luo, X. He, J. Zhan, L. Wang, W. Gao, and J. Dai, "Comparison and benchmarking of ai models and frameworks on mobile devices," arXiv preprint arXiv:2005.05085, 2020.
[44] W. Bhimji, S. A. Farrell, T. Kurth, M. Paganini, E. Racah, et al., "Deep neural networks for physics analysis on low-level whole-detector data at the lhc," in Journal of Physics: Conference Series, vol. 1085, p. 042034, IOP Publishing, 2018.
[45] Z. Jiang, W. Gao, L. Wang, X. Xiong, Y. Zhang, X. Wen, C. Luo, H. Ye, X. Lu, Y. Zhang, et al., "Hpc ai500: a benchmark suite for hpc ai systems," in International Symposium on Benchmarking, Measuring and Optimization, pp. 10–22, Springer, 2018.
[46] https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html .
[47] C. Drummond, "Replicability is not reproducibility: nor is it good science," 2009.
[48] H. E. Plesser, "Reproducibility vs. replicability: a brief history of a confused terminology," Frontiers in Neuroinformatics, vol. 11, p. 76, 2018.
[49] T. Kurth, J. Zhang, N. Satish, E. Racah, I. Mitliagkas, M. M. A. Patwary, T. Malas, N. Sundaram, W. Bhimji, M. Smorkalov, et al., "Deep learning at 15pf: supervised and semi-supervised classification for scientific data," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11, 2017.
[50] E. Racah, C. Beckham, T. Maharaj, S. E. Kahou, M. Prabhat, and C. Pal, "Extremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events," in Advances in Neural Information Processing Systems, pp. 3402–3413, 2017.
[51] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91–99, 2015.
[52] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[53] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2014.
[54] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[55] .
[56] A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in tensorflow," arXiv preprint arXiv:1802.05799, 2018.
[57] A. Mathuriya, T. Kurth, V. Rane, M. Mustafa, L. Shao, D. Bard, V. W. Lee, et al., "Scaling grpc tensorflow on 512 nodes of cori supercomputer," arXiv preprint arXiv:1712.09388, 2017.
[58] http://research.baidu.com/bringing-hpc-techniques-deep-learning .
[59] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, "Don't decay the learning rate, increase the batch size," arXiv preprint arXiv:1711.00489, 2017.
[60] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, "Large batch optimization for deep learning: Training bert in 76 minutes," arXiv preprint arXiv:1904.00962, 2019.
[61] .
[62] https://developer.nvidia.com/nccl .
[63] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., "Deep speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[64] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[65] S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[66] https://docs.nvidia.com/cuda/profiler-users-guide/index.html .
[67] W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, X. Wen, R. Ren, C. Zheng, X. He, H. Ye, et al., "Bigdatabench: A scalable and unified big data and ai benchmark suite," arXiv preprint arXiv:1802.08254, 2018.
[68] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhimenko, "Tbd: Benchmarking and analyzing deep neural network training," arXiv preprint arXiv:1803.06905, 2018.
[69] https://horovod.readthedocs.io/en/latest/timeline.html .
[70] .
[71] J. J. Dongarra, P. Luszczek, and A. Petitet, "The linpack benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.
[72] .
[73] .
[74] L. Humphrey, B. Guilfoos, H. B. Smith, A. Warnock, J. Unpingco, B. H. Elton, and A. Chalker, "Evaluating parallel extensions to high level languages using the hpc challenge benchmarks," pp. 410–415, 2009.
[75] K. Ueno and T. Suzumura, "Highly scalable graph search for the graph500 benchmark," pp. 149–160, 2012.
[76] J. Dongarra, M. A. Heroux, and P. Luszczek, "High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems," The International Journal of High Performance Computing Applications, vol. 30, no. 1, pp. 3–10, 2016.
[77] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, "Benchnn: On the broad potential application scope of hardware neural network accelerators," in IEEE International Symposium on Workload Characterization (IISWC), pp. 36–45, IEEE, 2012.
[78] https://github.com/baidu-research/DeepBench/ .
[79] R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, "Fathom: Reference workloads for modern deep learning methods," in IEEE International Symposium on Workload Characterization (IISWC), pp. 1–10, IEEE, 2016.
[80] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Jayarajan, A. Phanishayee, B. Schroeder, and G. Pekhimenko, "Benchmarking and analyzing deep neural network training," in IEEE International Symposium on Workload Characterization (IISWC), pp. 88–100, IEEE, 2018.
[81] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, "Dawnbench: An end-to-end deep learning benchmark and competition," Training, vol. 100, no. 101, p. 102, 2017.
[82] C. Luo, F. Zhang, C. Huang, X. Xiong, J. Chen, L. Wang, W. Gao, H. Ye, T. Wu, R. Zhou, et al., "Aiot bench: towards comprehensive benchmarking mobile and embedded device intelligence," in International Symposium on Benchmarking, Measuring and Optimization, pp. 31–35, Springer, 2018.
[83] T. Hao, Y. Huang, X. Wen, W. Gao, F. Zhang, C. Zheng, L. Wang, H. Ye, K. Hwang, Z. Ren, et al., "Edge aibench: towards comprehensive end-to-end edge computing benchmarking," in International Symposium on Benchmarking, Measuring and Optimization, pp. 23–30, Springer, 2018.
[84] J.-H. Tao, Z.-D. Du, Q. Guo, H.-Y. Lan, L. Zhang, S.-Y. Zhou, L.-J. Xu, C. Liu, H.-F. Liu, S. Tang, et al., "Benchip: Benchmarking intelligence processors," Journal of Computer Science and Technology, vol. 33, no. 1, pp. 1–23, 2018.
[85] T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler, "A modular benchmarking infrastructure for high-performance and reproducible deep learning," in IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 66–77, IEEE, 2019.
[86] Z. Ren, Y. Liu, T. Shi, L. Xie, Y. Zhou, H. Chen, H. Fu, Y. Ouyang, J. Zhai, Y. Zhang, Y. Zhang, and W. Chen, "Aah: Automated machine learning as an ai-hpc benchmark," Technical Report of Pengcheng Lab and Tsinghua University, 2020.
[87] H. Jin, Q. Song, and X. Hu, "Auto-keras: An efficient neural architecture search system," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.